Direct Alignment Algorithms (DAAs) such as Direct Preference Optimization (DPO) have emerged as alternatives to Reinforcement Learning from Human Feedback (RLHF) for aligning large language models with human values.
However, these methods are prone to over-optimization: the model drifts away from the reference policy, and performance degrades as training progresses.
To address this over-optimization issue in offline DAAs, a new approach called Importance-Sampling DAAs (IS-DAAs) is introduced, which multiplies the DAA objective with an importance ratio that accounts for the reference policy distribution.
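The core modification can be illustrated with a short sketch. The snippet below shows a DPO-style loss reweighted by an importance ratio between the policy and the reference distribution; the function name `is_dpo_loss`, the exact form of the ratio, the clipping, and detaching the weight from the gradient are illustrative assumptions rather than details taken from the paper.

```python
# Minimal sketch of an importance-weighted DPO loss (IS-DAA style).
# The ratio form, clipping, and detach are assumed details, not a
# reference implementation of the paper's method.
import torch
import torch.nn.functional as F

def is_dpo_loss(policy_chosen_logps, policy_rejected_logps,
                ref_chosen_logps, ref_rejected_logps,
                beta=0.1, clip=10.0):
    """DPO loss reweighted by an importance ratio w.r.t. the reference policy.

    All *_logps are per-example log-probabilities (summed over response
    tokens), shape (batch,).
    """
    # Standard DPO implicit-reward margin and per-example loss.
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    dpo_loss = -F.logsigmoid(beta * (chosen_logratio - rejected_logratio))

    # Importance ratio pi_theta / pi_ref on the preference pair, detached so
    # it rescales rather than redirects the gradient, and clipped to keep its
    # variance under control (both assumed details).
    log_ratio = (chosen_logratio + rejected_logratio).detach()
    is_weight = torch.clamp(log_ratio.exp(), max=clip)

    return (is_weight * dpo_loss).mean()
```

Clipping the detached ratio is one common way to limit the variance of importance weights, which is most relevant in the low-regularization regime where over-optimization is strongest.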
Experiments show that IS-DAAs effectively mitigate over-optimization, especially when the regularization strength is low, and outperform other methods designed to address this problem.