Optimizing policies based on human preferences is crucial for aligning language models with human intent.
This work proposes a framework for robust policy optimization under noisy preferences by viewing reward modeling as a classification problem.
The framework leverages symmetric losses, which are known to be robust to label noise in classification, yielding the Symmetric Preference Optimization (SymPO) method.
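As a minimal illustration (not the paper's exact formulation), a loss $\ell$ is called symmetric if $\ell(z) + \ell(-z)$ is constant, a condition the noise-robust classification literature shows confers tolerance to symmetric label noise; applying such a loss to the reward margin of a preference pair gives a classification-style reward-modeling objective. The reward model $r_\theta$ and pair notation $(x, y_w, y_l)$ below are assumed for illustration only.
% Illustrative sketch only; notation (r_\theta, x, y_w, y_l) is assumed, not taken from the abstract.
\[
  \ell(z) + \ell(-z) = C \;\; \text{for all } z \in \mathbb{R}
  \qquad \text{e.g. } \ell(z) = \tfrac{1}{2}\bigl(1 - \tanh(z)\bigr), \; C = 1,
\]
\[
  \min_{\theta} \; \mathbb{E}_{(x,\, y_w,\, y_l)}\!\left[ \ell\bigl(r_\theta(x, y_w) - r_\theta(x, y_l)\bigr) \right].
\]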
Experiments on synthetic and real-world tasks demonstrate that SymPO successfully optimizes policies even under noisy preference labels.