Source: Arxiv

Mitigating Reward Over-optimization in Direct Alignment Algorithms with Importance Sampling

  • Direct Alignment Algorithms (DAAs) such as Direct Preference Optimization (DPO) are increasingly used as alternatives to Reinforcement Learning from Human Feedback (RLHF) for aligning large language models with human values.
  • These methods are prone to reward over-optimization: during training the model drifts away from the reference policy and performance degrades.
  • A new approach, Importance-Sampling DAAs (IS-DAAs), addresses over-optimization in offline DAAs by multiplying the objective with an importance ratio based on the reference policy distribution (a minimal illustrative sketch follows this list).
  • Experiments show that IS-DAAs effectively mitigate over-optimization, particularly at low regularization strength, outperforming other methods that target this problem.
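The summary describes the mechanism only at a high level. The snippet below is a minimal sketch of what an importance-weighted, DPO-style objective could look like; the specific importance ratio (here a hypothetical, detached, clipped ratio of reference to policy likelihoods) and the weighting scheme are assumptions for illustration and may differ from the ratio actually used by IS-DAAs.

```python
# Illustrative sketch of an importance-weighted DPO-style loss.
# Assumptions (not from the article): the importance ratio is
# pi_ref(y|x) / pi_theta(y|x) over the chosen and rejected responses,
# detached so it only reweights the loss, and clipped for stability.
import torch
import torch.nn.functional as F

def importance_weighted_dpo_loss(policy_logps_chosen, policy_logps_rejected,
                                 ref_logps_chosen, ref_logps_rejected,
                                 beta=0.1, clip=10.0):
    """Pairwise DPO-style loss reweighted by a reference-policy importance ratio.

    All inputs are per-example summed log-probabilities of the responses.
    """
    # Standard DPO implicit-reward margins relative to the reference policy.
    chosen_margin = policy_logps_chosen - ref_logps_chosen
    rejected_margin = policy_logps_rejected - ref_logps_rejected
    dpo_loss = -F.logsigmoid(beta * (chosen_margin - rejected_margin))

    # Hypothetical importance ratio tying the objective to the reference
    # distribution: exp(log pi_ref - log pi_theta), detached and clipped.
    log_ratio = (ref_logps_chosen + ref_logps_rejected
                 - policy_logps_chosen.detach() - policy_logps_rejected.detach())
    weight = torch.exp(log_ratio).clamp(max=clip)

    return (weight * dpo_loss).mean()
```

Detaching the ratio keeps gradients flowing only through the DPO term while down-weighting examples the policy has already pushed far from the reference policy, which is one plausible way to discourage the drift described above.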
