Direct Alignment Algorithms (DAAs) such as Direct Preference Optimization (DPO) have emerged as alternatives to Reinforcement Learning from Human Feedback (RLHF) for aligning large language models with human values.
However, these methods are prone to over-optimization: the model drifts away from the reference policy, and performance degrades as training progresses.
To address this over-optimization issue in offline DAAs, a new approach called Importance-Sampling DAAs (IS-DAAs) is introduced, which multiplies the DAA objective with an importance ratio that accounts for the reference policy distribution.
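The core modification can be illustrated with a short sketch. The snippet below shows a DPO-style loss reweighted by an importance ratio between the policy and the reference distribution; the function name `is_dpo_loss`, the exact form of the ratio, the clipping, and detaching the weight from the gradient are illustrative assumptions rather than details taken from the paper.

```python
# Minimal sketch of an importance-weighted DPO loss (IS-DAA style).
# The ratio form, clipping, and detach are assumed details, not a
# reference implementation of the paper's method.
import torch
import torch.nn.functional as F

def is_dpo_loss(policy_chosen_logps, policy_rejected_logps,
                ref_chosen_logps, ref_rejected_logps,
                beta=0.1, clip=10.0):
    """DPO loss reweighted by an importance ratio w.r.t. the reference policy.

    All *_logps are per-example log-probabilities (summed over response
    tokens), shape (batch,).
    """
    # Standard DPO implicit-reward margin and per-example loss.
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    dpo_loss = -F.logsigmoid(beta * (chosen_logratio - rejected_logratio))

    # Importance ratio pi_theta / pi_ref on the preference pair, detached so
    # it rescales rather than redirects the gradient, and clipped to keep its
    # variance under control (both assumed details).
    log_ratio = (chosen_logratio + rejected_logratio).detach()
    is_weight = torch.clamp(log_ratio.exp(), max=clip)

    return (is_weight * dpo_loss).mean()
```

Clipping the detached ratio is one common way to limit the variance of importance weights, which is most relevant in the low-regularization regime where over-optimization is strongest.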
Experiments show that IS-DAAs effectively mitigate over-optimization, especially when the regularization strength is low, and outperform other methods designed to address this problem.