MPPO is a new algorithm for preference optimization in large language models (LLMs) that accommodates an arbitrary number of negative samples. Existing methods such as DPO and KTO rely heavily on abundant preference data and require a reference model. MPPO instead leverages the average likelihood of model responses to fit the reward function, maximizing the utilization of preference data. Experimental results show that MPPO outperforms methods such as DPO and ORPO across various benchmarks.
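To make the reward formulation concrete, here is a minimal PyTorch sketch of the idea, under two assumptions: the implicit reward is the length-averaged log-likelihood of a response under the policy (no reference model), and each preferred response is scored against multiple negatives with a logistic pairwise loss. Function names (`avg_log_likelihood`, `mppo_pairwise_loss`), the `beta` scale, and the tensor layout are illustrative, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def avg_log_likelihood(logits, labels, response_mask):
    """Length-averaged log-likelihood of a response, used as the implicit reward.

    logits:        (batch, seq_len, vocab) policy outputs, assumed already aligned
                   with `labels` (i.e., shifted for next-token prediction)
    labels:        (batch, seq_len) target token ids
    response_mask: (batch, seq_len) 1.0 on response tokens, 0.0 on prompt/padding
    """
    log_probs = torch.log_softmax(logits, dim=-1)
    token_logps = torch.gather(log_probs, 2, labels.unsqueeze(-1)).squeeze(-1)
    # Average over response tokens only, so reward is length-normalized.
    return (token_logps * response_mask).sum(-1) / response_mask.sum(-1).clamp(min=1)

def mppo_pairwise_loss(chosen_reward, rejected_rewards, beta=1.0):
    """Pairwise preference loss: the chosen response is compared against every
    negative sample, and the per-pair logistic losses are averaged.

    chosen_reward:    (batch,) average log-likelihood of the preferred response
    rejected_rewards: (batch, num_neg) average log-likelihoods of the negatives
    """
    margins = beta * (chosen_reward.unsqueeze(-1) - rejected_rewards)
    return -F.logsigmoid(margins).mean()
```

Because the reward is computed from the policy's own (length-normalized) likelihoods, no frozen reference model is needed, and a single preferred response can be reused against however many negatives are available for the same prompt.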