Large language models are often fine-tuned to align their responses with human preferences using reinforcement learning from human feedback (RLHF).
Direct preference optimization (DPO) and related methods eliminate the need for a separate reward-model training step by inducing an implicit reward through a reparameterization trick.
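As a brief recap of the standard DPO derivation (the notation below, with $\pi_\theta$ for the policy, $\pi_{\mathrm{ref}}$ for the reference model, and $\beta$ for the KL coefficient, is the conventional one and is assumed rather than taken from the surrounding text):

```latex
% Standard DPO reparameterization (recap; notation assumed, not from this document):
% the implicit reward is expressed through the policy itself,
\[
  r_\theta(x, y) \;=\; \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)},
\]
% which, substituted into the Bradley--Terry preference model for a preferred
% response $y_w$ and a rejected response $y_l$, yields the DPO objective
\[
  \mathcal{L}_{\mathrm{DPO}}(\theta)
  \;=\;
  -\,\mathbb{E}_{(x,\, y_w,\, y_l)}\!\left[
    \log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
      \;-\;
      \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
    \right)
  \right].
\]
```

This substitution is what removes the explicit reward-model training step, and it is the reparameterization whose side effects are discussed next.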
While DPO-based objectives have shown success, they can suffer from sub-optimal regularization and counter-intuitive interpolation behaviors that are traceable to the underlying reparameterization.
EXPO, a new explicit preference optimization framework, has been introduced to address these issues: it requires no reparameterization and provides regularization factors that avoid the pitfalls of DPO variants.