Optimizing policies based on human preferences is crucial for aligning language models with human intent.
This work proposes a framework for robust policy optimization under noisy preferences by viewing reward modeling as a classification problem.
The framework leverages symmetric losses, which are known to be robust to label noise in classification, yielding the Symmetric Preference Optimization (SymPO) method.
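As a minimal illustration (not the paper's exact formulation), a loss $\ell$ is called symmetric if $\ell(z) + \ell(-z)$ is constant, a condition the noise-robust classification literature shows confers tolerance to symmetric label noise; applying such a loss to the reward margin of a preference pair gives a classification-style reward-modeling objective. The reward model $r_\theta$ and pair notation $(x, y_w, y_l)$ below are assumed for illustration only.
% Illustrative sketch only; notation (r_\theta, x, y_w, y_l) is assumed, not taken from the abstract.
\[
  \ell(z) + \ell(-z) = C \;\; \text{for all } z \in \mathbb{R}
  \qquad \text{e.g. } \ell(z) = \tfrac{1}{2}\bigl(1 - \tanh(z)\bigr), \; C = 1,
\]
\[
  \min_{\theta} \; \mathbb{E}_{(x,\, y_w,\, y_l)}\!\left[ \ell\bigl(r_\theta(x, y_w) - r_\theta(x, y_l)\bigr) \right].
\]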
Experiments on synthetic and real-world tasks demonstrate that SymPO successfully optimizes policies even under noisy preference labels.