Image Credit: Arxiv

Probabilistic Uncertain Reward Model: A Natural Generalization of Bradley-Terry Reward Model

  • Reinforcement Learning from Human Feedback (RLHF) has emerged as a critical technique for training large language models.
  • The Probabilistic Uncertain Reward Model (PURM) is proposed as a natural generalization of the classical Bradley-Terry reward model.
  • PURM learns reward distributions directly from preference data and quantifies per-sample uncertainty.
  • Experiments demonstrate that PURM significantly delays reward hacking and improves final reward performance.

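The generalization the bullets describe can be illustrated with a small sketch. Under a Gaussian-reward reading of PURM (an assumption for illustration, not the paper's exact formulation), each response gets a reward distribution N(mu, var) rather than a point score, and the Bradley-Terry preference probability is recovered when the variances are zero. The function names and the probit-style rescaling below are illustrative, not taken from the paper.

```python
import math

def bt_prob(r_a: float, r_b: float) -> float:
    """Classical Bradley-Terry: P(a preferred over b) from point rewards."""
    return 1.0 / (1.0 + math.exp(-(r_a - r_b)))

def purm_prob(mu_a: float, var_a: float, mu_b: float, var_b: float) -> float:
    """Sketch of a distributional preference probability.

    Rewards are modeled as Gaussians, so the reward difference is
    N(mu_a - mu_b, var_a + var_b).  The standard probit-style
    approximation integrates the sigmoid over that Gaussian by
    rescaling its mean (illustrative choice, not the paper's formula).
    """
    scale = math.sqrt(1.0 + math.pi * (var_a + var_b) / 8.0)
    return 1.0 / (1.0 + math.exp(-(mu_a - mu_b) / scale))
```

With zero variance `purm_prob` reduces exactly to `bt_prob`, which is the sense in which a distributional reward model generalizes Bradley-Terry; higher per-sample variance pulls the preference probability toward 0.5, giving the model a way to express uncertainty on ambiguous comparisons.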