
Towards Bridging the Reward-Generation Gap in Direct Alignment Algorithms

  • Direct Alignment Algorithms (DAAs) such as Direct Preference Optimization (DPO) and Simple Preference Optimization (SimPO) have become efficient alternatives to Reinforcement Learning from Human Feedback (RLHF) for aligning large language models with human preferences.
  • DAAs face a limitation known as the 'reward-generation gap': a mismatch between the objective optimized during training and the model's generation performance at inference time.
  • One contributor to this gap is the discrepancy between how much prefix tokens matter for LLM generation and how much weight they receive in the implicit reward functions of DAAs.
  • To address this gap, the paper introduces Prefix-Oriented Equal-length Training (POET), which truncates both the preferred and the dispreferred response to the length of the shorter one (see the sketch after this list).
  • With responses of equal length, the DAA objective is constrained to converge across all token positions, so training pays more attention to prefix tokens than standard DAAs do.
  • Experiments with DPO and SimPO, two representative DAAs, show that POET improves their performance, with gains of up to 15.6 points on AlpacaEval 2 and overall improvements on downstream tasks.
  • The study underscores the importance of closing the gap between reward optimization and generation performance in DAAs.
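
To make the idea concrete, here is a minimal sketch of POET-style equal-length truncation applied before a standard DPO loss. It assumes a Hugging Face-style causal language model whose forward pass returns `.logits`; the function names (`poet_truncate`, `sequence_logprob`, `dpo_loss_with_poet`) and the `beta` value are illustrative assumptions, not the paper's released code.

```python
# Sketch of the POET idea described above: before computing a DAA loss
# (here DPO), truncate both the preferred (chosen) and dispreferred
# (rejected) responses to the shorter one's length, so the objective is
# computed over the same token positions and prefix tokens dominate.
import torch
import torch.nn.functional as F


def poet_truncate(chosen_ids: torch.Tensor, rejected_ids: torch.Tensor):
    """Truncate both response token sequences to the shorter one's length."""
    min_len = min(chosen_ids.size(-1), rejected_ids.size(-1))
    return chosen_ids[..., :min_len], rejected_ids[..., :min_len]


def sequence_logprob(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Sum of per-token log-probabilities of `labels` under `logits`."""
    logps = torch.log_softmax(logits, dim=-1)
    token_logps = torch.gather(logps, -1, labels.unsqueeze(-1)).squeeze(-1)
    return token_logps.sum(dim=-1)


def dpo_loss_with_poet(policy, reference, prompt_ids, chosen_ids, rejected_ids, beta=0.1):
    # POET step: equal-length truncation of the two responses.
    chosen_ids, rejected_ids = poet_truncate(chosen_ids, rejected_ids)

    def response_logp(model, response_ids):
        input_ids = torch.cat([prompt_ids, response_ids], dim=-1)
        # Logits at positions prompt_len-1 .. end-1 predict the response tokens.
        logits = model(input_ids).logits[..., prompt_ids.size(-1) - 1:-1, :]
        return sequence_logprob(logits, response_ids)

    # Standard DPO implicit-reward margin, now computed over equal-length responses.
    pi_chosen = response_logp(policy, chosen_ids)
    pi_rejected = response_logp(policy, rejected_ids)
    with torch.no_grad():
        ref_chosen = response_logp(reference, chosen_ids)
        ref_rejected = response_logp(reference, rejected_ids)

    margin = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    return -F.logsigmoid(beta * margin).mean()
```

In standard DAA training the longer response contributes extra suffix tokens to the loss; truncating both responses to the same length removes that imbalance, so gradients concentrate on the shared prefix positions that matter most during generation.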
