Reward models play a crucial role in enhancing large language models, particularly in reinforcement learning from human feedback (RLHF) and inference-time verification.
Current reward modeling methods primarily rely on overall response scores to learn outcome rewards, which limits generalization to unseen responses.
This paper proposes a new approach that leverages generation probabilities to establish intra-trajectory consistency along the response trajectory, so that the coarse response-level signal propagates across fine-grained processes (intermediate segments of the response) and thereby aids reward learning.
Specifically, an intra-trajectory consistency regularization is developed that enforces more consistent rewards between adjacent processes when the next-token generation probability linking them is higher.
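As a rough illustration of how such a regularizer might look (the exact formulation is given in the paper; the function name, tensor shapes, and the quadratic penalty below are assumptions), adjacent per-step rewards can be pulled together in proportion to the next-token generation probability:

```python
import torch

def intra_trajectory_consistency(step_rewards: torch.Tensor,
                                 next_token_probs: torch.Tensor) -> torch.Tensor:
    """Hypothetical sketch: penalize reward gaps between adjacent processes,
    weighted more heavily where the next-token generation probability is high.

    step_rewards:     (T,)  per-process reward-model scores along one response
    next_token_probs: (T-1,) generation probability linking each pair of
                      adjacent processes (names are illustrative, not the repo's API)
    """
    diffs = step_rewards[1:] - step_rewards[:-1]   # adjacent-process reward gaps
    return (next_token_probs * diffs.pow(2)).mean()

# In training, this term would be added to the usual outcome-reward objective
# (e.g., a Bradley-Terry preference loss) with a weighting coefficient.
```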
The proposed regularization is applied to an advanced outcome reward model, leading to improved performance on RewardBench.
Beyond RewardBench, the reward model trained with the proposed regularization yields better DPO-aligned policies and achieves stronger best-of-N (BoN) inference-time verification results.
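For context, BoN verification simply scores several sampled responses with the reward model and keeps the highest-rated one; the sketch below is a generic illustration, and `generate_candidates` and `reward_model` are placeholders rather than the repository's actual API:

```python
def best_of_n(prompt: str, generate_candidates, reward_model, n: int = 16) -> str:
    """Generic best-of-N selection: sample n responses and return the one
    the reward model scores highest (placeholder interfaces)."""
    candidates = generate_candidates(prompt, n)              # n sampled responses
    scores = [reward_model(prompt, c) for c in candidates]   # scalar reward per response
    best = max(range(len(candidates)), key=lambda i: scores[i])
    return candidates[best]
```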
The code for the proposed approach is available at https://github.com/chaoyang101/ICRM.