Mainstream methods such as Reinforcement Learning from Human Feedback (RLHF) face challenges in Preference Alignment (PA) for Large Language Models (LLMs).
Collecting high-quality datasets of positive preference examples is costly, and training on them is computationally intensive, which limits the use of these methods in low-resource scenarios.
LLM unlearning offers a promising alternative by directly removing the influence of negative examples.
A framework called Unlearning to Align (U2A) is proposed, which optimizes the selection and unlearning of negative examples to improve PA performance.
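To make the core idea concrete, the sketch below illustrates unlearning a negative preference example by gradient ascent on the language-modeling loss, i.e., lowering the model's likelihood of a dispreferred response. This is a minimal illustration of the general unlearning idea, not the U2A algorithm itself; the model name, learning rate, example text, and ascent-only objective are assumptions chosen for brevity.

```python
# Minimal sketch: unlearning a negative preference example via gradient ascent
# on the LM loss (assumed setup; not the U2A selection/weighting scheme).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder model for illustration
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

negative_examples = [
    "An example of a dispreferred (negative) response to be unlearned.",
]

model.train()
for text in negative_examples:
    batch = tokenizer(text, return_tensors="pt")
    outputs = model(**batch, labels=batch["input_ids"])
    # Negate the loss so the update *increases* loss on the negative example,
    # pushing down its likelihood (gradient ascent = unlearning its influence).
    loss = -outputs.loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

In practice, an approach in this vein would also need to select which negative examples to unlearn and to regularize the update (for example, against a reference model) to avoid degrading general capabilities; U2A's specific mechanism for this is described in the paper rather than in this sketch.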