Large language models are often fine-tuned to align their responses with human preferences using reinforcement learning from human feedback (RLHF).
Direct preference optimization (DPO) and related methods eliminate the need for a separate reward-model training step by inducing an implicit reward through a reparameterization trick.
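As a brief recap of the standard DPO derivation (the notation below, with $\pi_\theta$ for the policy, $\pi_{\mathrm{ref}}$ for the reference model, and $\beta$ for the KL coefficient, is the conventional one and is assumed rather than taken from the surrounding text):

```latex
% Standard DPO reparameterization (recap; notation assumed, not from this document):
% the implicit reward is expressed through the policy itself,
\[
  r_\theta(x, y) \;=\; \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)},
\]
% which, substituted into the Bradley--Terry preference model for a preferred
% response $y_w$ and a rejected response $y_l$, yields the DPO objective
\[
  \mathcal{L}_{\mathrm{DPO}}(\theta)
  \;=\;
  -\,\mathbb{E}_{(x,\, y_w,\, y_l)}\!\left[
    \log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
      \;-\;
      \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
    \right)
  \right].
\]
```

This substitution is what removes the explicit reward-model training step, and it is the reparameterization whose side effects are discussed next.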
While DPO-based objectives have shown success, they can suffer from sub-optimal regularization and counter-intuitive interpolation behaviors that are traceable to the underlying reparameterization.
EXPO, a new explicit preference optimization framework, has been introduced to address these issues: it requires no reparameterization and provides regularization factors that avoid the pitfalls of DPO variants.