Image Credit: arXiv

Explicit Preference Optimization: No Need for an Implicit Reward Model

  • Large language models are often fine-tuned to align their responses with human preferences using reinforcement learning from human feedback (RLHF).
  • Direct preference optimization (DPO) and related methods eliminate the separate reward-model training step by inducing an implicit reward through a reparameterization trick (a minimal sketch of this construction follows the list).
  • While DPO-based objectives have shown success, the reparameterization can lead to sub-optimal regularization and counter-intuitive interpolation behaviors.
  • A new explicit preference optimization framework called EXPO has been introduced, requiring no reparameterization and offering regularization factors that avoid pitfalls of DPO variants.
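
To make the reparameterization point concrete, the sketch below shows the standard DPO objective, in which each response's implicit reward is the scaled log-ratio between the fine-tuned policy and the reference model. It is a minimal illustration only: the function name, tensor arguments, and the beta value are assumptions, and it depicts the DPO baseline the bullets describe, not the EXPO objective introduced in the paper.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Standard DPO loss over a batch of preference pairs.

    The implicit reward of a response is beta * (log pi_theta - log pi_ref);
    the loss is the negative log-sigmoid of the reward margin between the
    preferred (chosen) and dispreferred (rejected) responses.
    """
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)        # implicit reward of preferred response
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)  # implicit reward of dispreferred response
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```

Per the summary above, EXPO drops this reparameterized construction and instead works with explicit regularization factors; its exact objective is defined in the paper.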
