Modeling human preferences is essential for aligning foundation models with human values.
Traditional reward modeling approaches, such as the Bradley-Terry (BT) model, are limited in expressing complex preference structures, particularly intransitive (cyclic) preferences.
This study introduces preference embedding, an approach that embeds responses into a latent space to capture intricate preference structures efficiently, with query complexity linear in the number of responses.
Building on preference scores, General Preference Optimization (GPO) is proposed to generalize reward-based reinforcement learning from human feedback (RLHF).
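To make the idea concrete, the minimal NumPy sketch below shows how a preference embedding could score response pairs, assuming the score is a skew-symmetric bilinear form over response embeddings; the helpers (`skew_operator`, `preference_score`, `preference_prob`) and the random embeddings are hypothetical illustrations, not the released implementation.

```python
import numpy as np

def skew_operator(dim: int) -> np.ndarray:
    """Block-diagonal skew-symmetric operator R (a 90-degree rotation per 2-D block)."""
    assert dim % 2 == 0, "embedding dimension must be even"
    R = np.zeros((dim, dim))
    for i in range(0, dim, 2):
        R[i, i + 1] = -1.0
        R[i + 1, i] = 1.0
    return R

def preference_score(v_i: np.ndarray, v_j: np.ndarray, R: np.ndarray) -> float:
    """Score s(y_i > y_j) as a skew-symmetric bilinear form, so s(i, j) = -s(j, i)."""
    return float(v_i @ R @ v_j)

def preference_prob(v_i: np.ndarray, v_j: np.ndarray, R: np.ndarray) -> float:
    """P(y_i > y_j) via a sigmoid over the preference score."""
    return 1.0 / (1.0 + np.exp(-preference_score(v_i, v_j, R)))

# Linear query complexity: embed each of N responses once, then any of the
# N*(N-1)/2 pairwise scores is a cheap inner product between stored embeddings.
rng = np.random.default_rng(0)
R = skew_operator(4)
embeddings = rng.normal(size=(3, 4))            # hypothetical embeddings of 3 responses
embeddings /= np.linalg.norm(embeddings, axis=1, keepdims=True)
for i in range(3):
    for j in range(i + 1, 3):
        print(i, j, preference_prob(embeddings[i], embeddings[j], R))
```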
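The exact GPO objective is not reproduced here; the PyTorch sketch below shows one plausible shape such a preference-score-based objective could take, assuming (in the spirit of IPO-style losses) that the policy's implicit reward margin is regressed onto a preference score produced by the preference model. The function name `gpo_style_loss` and all inputs are illustrative placeholders.

```python
import torch

def gpo_style_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l,
                   pref_score, beta: float = 0.1) -> torch.Tensor:
    """Sketch of a GPO-style objective (an assumption, not the paper's exact form):
    regress the policy's implicit reward margin onto the preference score
    s(y_w > y_l | x) given by a general preference model."""
    margin = (policy_logp_w - ref_logp_w) - (policy_logp_l - ref_logp_l)
    return ((beta * margin - pref_score) ** 2).mean()

# Hypothetical inputs: summed log-probabilities of chosen (w) and rejected (l)
# responses under the policy and a frozen reference model, plus preference scores.
policy_logp_w = torch.randn(8, requires_grad=True)
policy_logp_l = torch.randn(8, requires_grad=True)
ref_logp_w, ref_logp_l = torch.randn(8), torch.randn(8)
pref_score = torch.rand(8)                      # s(y_w > y_l | x) from the preference model

loss = gpo_style_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, pref_score)
loss.backward()
```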
Experimental results demonstrate that the General Preference embedding Model (GPM) consistently outperforms the BT reward model on the RewardBench benchmark and effectively models cyclic preferences.
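To illustrate why a scalar reward cannot represent such cycles while a preference embedding can, the toy NumPy example below encodes a hypothetical cycle A > B > C > A using 2-D embeddings placed 120 degrees apart on a circle; the labels and angles are invented for the demonstration.

```python
import numpy as np

# A cyclic preference A > B > C > A cannot be expressed by a single scalar
# (Bradley-Terry) reward, but 2-D preference embeddings can capture it.
R = np.array([[0.0, -1.0],
              [1.0,  0.0]])                     # skew-symmetric 90-degree rotation

def score(v_i, v_j):
    return v_i @ R @ v_j                        # s(i > j) = -s(j > i)

angles = {"A": 0.0, "B": -120.0, "C": -240.0}   # hypothetical embeddings on a circle
emb = {k: np.array([np.cos(np.radians(a)), np.sin(np.radians(a))])
       for k, a in angles.items()}

for winner, loser in [("A", "B"), ("B", "C"), ("C", "A")]:
    print(f"s({winner} > {loser}) = {score(emb[winner], emb[loser]):+.3f}")  # all positive
```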
Evaluations on downstream tasks such as AlpacaEval 2.0, after post-training language models with GPO and the general preference model, show performance gains over BT-based counterparts.
These findings suggest the method has potential to improve on existing reward-based models in aligning foundation models with diverse human values.
The code for this model is available at https://github.com/general-preference/general-preference-model.