Source: arXiv

MPPO: Multi Pair-wise Preference Optimization for LLMs with Arbitrary Negative Samples

  • MPPO is a new preference-optimization algorithm for large language models (LLMs) that supports an arbitrary number of negative samples.
  • Existing methods such as DPO and KTO depend on abundant preference data and require a reference model.
  • MPPO fits the reward function using the average likelihood of model responses, making fuller use of the available preference data.
  • Across multiple benchmarks, MPPO outperforms methods such as DPO and ORPO.
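The core ideas above can be sketched in code: score each response by its average token log-likelihood (serving as an implicit reward, with no reference model), then apply a pairwise logistic loss between the preferred response and every negative. This is a minimal illustration under those assumptions, not the paper's exact formulation; all function names here are hypothetical.

```python
import torch
import torch.nn.functional as F

def avg_log_likelihood(logits, labels, mask):
    # Average per-token log-probability of the observed labels over the
    # response tokens (mask = 1 on response tokens, 0 elsewhere).
    logp = torch.log_softmax(logits, dim=-1)
    token_logp = logp.gather(-1, labels.unsqueeze(-1)).squeeze(-1)
    return (token_logp * mask).sum(-1) / mask.sum(-1)

def mppo_style_loss(pos_reward, neg_rewards):
    # pos_reward: (batch,) implicit reward of the preferred response.
    # neg_rewards: (batch, num_neg) rewards of an arbitrary number of
    # negatives. Logistic loss on each (positive, negative) pair,
    # averaged over all pairs.
    diffs = pos_reward.unsqueeze(-1) - neg_rewards
    return -F.logsigmoid(diffs).mean()
```

Because the reward is a per-response scalar, the same loss works whether each prompt has one negative or many, which is the "arbitrary negative samples" property highlighted above.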
