Reinforcement Learning from Human Feedback (RLHF) has emerged as a critical technique for training large language models. The Probabilistic Uncertain Reward Model (PURM) is proposed as a natural generalization of the classical Bradley-Terry reward model: it learns reward distributions directly from preference data and quantifies per-sample uncertainty. Experiments demonstrate that PURM significantly delays the onset of reward hacking and improves the final reward achieved.
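To make the contrast concrete, the following is a minimal sketch, assuming (as an illustration, not the paper's exact parameterization) that PURM-style models represent each reward as a Gaussian N(mu, sigma^2). The Bradley-Terry model scores a preference from two scalar rewards; a distributional variant can instead compute the probability that one Gaussian reward exceeds the other, with the per-sample sigma serving as an uncertainty estimate. All function names here are hypothetical.

```python
import math

def bt_preference_prob(r_a: float, r_b: float) -> float:
    """Classical Bradley-Terry: P(a preferred over b) from scalar rewards,
    via the logistic sigmoid of the reward difference."""
    return 1.0 / (1.0 + math.exp(-(r_a - r_b)))

def gaussian_preference_prob(mu_a: float, sigma_a: float,
                             mu_b: float, sigma_b: float) -> float:
    """Distributional variant: with r_a ~ N(mu_a, sigma_a^2) and
    r_b ~ N(mu_b, sigma_b^2) independent,
    P(r_a > r_b) = Phi((mu_a - mu_b) / sqrt(sigma_a^2 + sigma_b^2)),
    where Phi is the standard normal CDF."""
    z = (mu_a - mu_b) / math.sqrt(sigma_a ** 2 + sigma_b ** 2)
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
```

Note the qualitative behavior this captures: as the per-sample uncertainties grow, the preference probability is pulled toward 0.5, so an uncertainty-aware training loop can discount comparisons the model is unsure about.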