Source: arXiv

Intra-Trajectory Consistency for Reward Modeling

  • Reward models play a crucial role in enhancing large language models, especially in reinforcement learning from human feedback (RLHF) and inference-time verification.
  • Current reward modeling methods learn outcome rewards primarily from overall response-level scores, which limits generalization to unseen responses.
  • This paper proposes a new approach that uses next-token generation probabilities to establish intra-trajectory consistency across the response trajectory.
  • Enforcing this consistency lets fine-grained signals propagate across the intermediate processes (prefixes) of a response, aiding reward learning.
  • An intra-trajectory consistency regularization is developed so that adjacent processes linked by higher next-token generation probabilities receive more consistent rewards (see the sketch after this list).
  • Applying the proposed regularization to an advanced outcome reward model improves its performance on RewardBench.
  • The reward model trained with the new regularization yields better DPO-aligned policies and stronger best-of-N (BoN) inference-time verification results (a minimal BoN sketch also follows below).
  • The code for the proposed approach is available at https://github.com/chaoyang101/ICRM.
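To make the regularization concrete, below is a minimal PyTorch sketch of the intra-trajectory consistency idea: adjacent per-step rewards are pulled together in proportion to the next-token generation probability that links them. The function name, tensor shapes, and exact weighting scheme are illustrative assumptions, not the authors' implementation; the linked ICRM repository contains the actual code.

    import torch

    def intra_trajectory_consistency_loss(step_rewards: torch.Tensor,
                                          next_token_probs: torch.Tensor) -> torch.Tensor:
        # step_rewards:     (T,)   reward assigned to each prefix ("process") of one response
        # next_token_probs: (T-1,) generation probability of the token joining step t to t+1
        reward_gaps = (step_rewards[1:] - step_rewards[:-1]).pow(2)
        # Transitions the model generated confidently are pushed toward more
        # consistent rewards; low-probability transitions are penalized less.
        return (next_token_probs * reward_gaps).mean()

    # Toy usage with made-up numbers; in training this term would be added
    # to the usual outcome-reward loss with a weighting coefficient.
    rewards = torch.tensor([0.20, 0.25, 0.90, 0.85])
    probs = torch.tensor([0.95, 0.10, 0.90])
    reg = intra_trajectory_consistency_loss(rewards, probs)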

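Best-of-N (BoN) verification at inference time simply scores N sampled candidate responses with the trained reward model and returns the highest-scoring one. A hedged sketch, assuming a hypothetical reward_model callable that maps (prompt, response) to a scalar reward:

    import torch

    @torch.no_grad()
    def best_of_n(prompt: str, candidates: list[str], reward_model) -> str:
        # Score every candidate response; any scoring interface that
        # returns a scalar reward per (prompt, response) pair works here.
        scores = torch.tensor([reward_model(prompt, c) for c in candidates])
        # Return the candidate the reward model ranks highest.
        return candidates[scores.argmax().item()]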