Image Credit: Arxiv

Probabilistic Uncertain Reward Model: A Natural Generalization of Bradley-Terry Reward Model

  • Reinforcement Learning from Human Feedback (RLHF) has emerged as a critical technique for training large language models.
  • The Probabilistic Uncertain Reward Model (PURM) is proposed as a natural generalization of the classical Bradley-Terry reward model.
  • PURM learns reward distributions directly from preference data and quantifies per-sample uncertainty.
  • Experiments demonstrate that PURM significantly delays reward hacking and improves final reward performance.

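The generalization the bullets describe can be illustrated with a small sketch. Under a Gaussian-reward reading of PURM (an assumption for illustration, not the paper's exact formulation), each response gets a reward distribution N(mu, var) rather than a point score, and the Bradley-Terry preference probability is recovered when the variances are zero. The function names and the probit-style rescaling below are illustrative, not taken from the paper.

```python
import math

def bt_prob(r_a: float, r_b: float) -> float:
    """Classical Bradley-Terry: P(a preferred over b) from point rewards."""
    return 1.0 / (1.0 + math.exp(-(r_a - r_b)))

def purm_prob(mu_a: float, var_a: float, mu_b: float, var_b: float) -> float:
    """Sketch of a distributional preference probability.

    Rewards are modeled as Gaussians, so the reward difference is
    N(mu_a - mu_b, var_a + var_b).  The standard probit-style
    approximation integrates the sigmoid over that Gaussian by
    rescaling its mean (illustrative choice, not the paper's formula).
    """
    scale = math.sqrt(1.0 + math.pi * (var_a + var_b) / 8.0)
    return 1.0 / (1.0 + math.exp(-(mu_a - mu_b) / scale))
```

With zero variance `purm_prob` reduces exactly to `bt_prob`, which is the sense in which a distributional reward model generalizes Bradley-Terry; higher per-sample variance pulls the preference probability toward 0.5, giving the model a way to express uncertainty on ambiguous comparisons.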