Reward models play a crucial role in enhancing large language models, particularly in reinforcement learning from human feedback (RLHF) and inference-time verification.
Current reward modeling methods primarily rely on overall response scores to learn outcome rewards, which limits generalization to unseen responses.
This paper proposes a new approach that leverages generation probabilities to establish intra-trajectory consistency along the response trajectory, so that the coarse response-level signal propagates across fine-grained processes (intermediate segments of the response) and thereby aids reward learning.
Specifically, an intra-trajectory consistency regularization is developed that enforces more consistent rewards between adjacent processes when the next-token generation probability linking them is higher.
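As a rough illustration of how such a regularizer might look (the exact formulation is given in the paper; the function name, tensor shapes, and the quadratic penalty below are assumptions), adjacent per-step rewards can be pulled together in proportion to the next-token generation probability:

```python
import torch

def intra_trajectory_consistency(step_rewards: torch.Tensor,
                                 next_token_probs: torch.Tensor) -> torch.Tensor:
    """Hypothetical sketch: penalize reward gaps between adjacent processes,
    weighted more heavily where the next-token generation probability is high.

    step_rewards:     (T,)  per-process reward-model scores along one response
    next_token_probs: (T-1,) generation probability linking each pair of
                      adjacent processes (names are illustrative, not the repo's API)
    """
    diffs = step_rewards[1:] - step_rewards[:-1]   # adjacent-process reward gaps
    return (next_token_probs * diffs.pow(2)).mean()

# In training, this term would be added to the usual outcome-reward objective
# (e.g., a Bradley-Terry preference loss) with a weighting coefficient.
```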
The proposed regularization is applied to an advanced outcome reward model, leading to improved performance on RewardBench.
Beyond RewardBench, the reward model trained with the proposed regularization yields better DPO-aligned policies and achieves stronger best-of-N (BoN) inference-time verification results.
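For context, BoN verification simply scores several sampled responses with the reward model and keeps the highest-rated one; the sketch below is a generic illustration, and `generate_candidates` and `reward_model` are placeholders rather than the repository's actual API:

```python
def best_of_n(prompt: str, generate_candidates, reward_model, n: int = 16) -> str:
    """Generic best-of-N selection: sample n responses and return the one
    the reward model scores highest (placeholder interfaces)."""
    candidates = generate_candidates(prompt, n)              # n sampled responses
    scores = [reward_model(prompt, c) for c in candidates]   # scalar reward per response
    best = max(range(len(candidates)), key=lambda i: scores[i])
    return candidates[best]
```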
The code for the proposed approach is available at https://github.com/chaoyang101/ICRM.