Reinforcement learning from human feedback (RLHF) is effective for aligning large language models with human preferences.
A central challenge in RLHF is constructing accurate reward signals: the commonly used Bradley-Terry reward models (BT RMs) are sensitive to the size of the preference data and remain vulnerable to reward hacking.
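For context, a BT RM is typically trained by maximizing the likelihood that the preferred response receives a higher scalar reward than the rejected one. The snippet below is a minimal PyTorch sketch of the standard Bradley-Terry pairwise loss; the function name and toy reward values are illustrative, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def bradley_terry_loss(chosen_rewards: torch.Tensor,
                       rejected_rewards: torch.Tensor) -> torch.Tensor:
    """Negative log-likelihood of the Bradley-Terry preference model.

    The model assumes P(chosen > rejected) = sigmoid(r_chosen - r_rejected),
    so the loss is -log sigmoid(r_chosen - r_rejected), averaged over the batch.
    """
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy usage: scalar rewards for a batch of (chosen, rejected) response pairs.
chosen = torch.tensor([1.2, 0.3, 2.0])
rejected = torch.tensor([0.5, 0.7, 1.1])
print(bradley_terry_loss(chosen, rejected).item())
```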
Think-RM is introduced as a training framework that enables long-horizon reasoning in generative reward models (GenRMs) by modeling an internal thinking process.
Combined with a novel pairwise RLHF pipeline, Think-RM achieves state-of-the-art results on RM-Bench and yields better end-policy performance than traditional BT-RM-based approaches.
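To make the GenRM-with-thinking idea concrete, the sketch below shows one generic way a generative judge might reason before emitting a pairwise verdict. The prompt template, function names, and verdict format are hypothetical illustrations under stated assumptions, not the paper's actual method.

```python
from typing import Callable

# Hypothetical prompt format (not from the paper): the judge reasons inside
# <think>...</think> before committing to a pairwise verdict, instead of
# emitting a scalar score as a BT RM would.
PAIRWISE_TEMPLATE = """You are judging two candidate responses to a prompt.

Prompt: {prompt}
Response A: {response_a}
Response B: {response_b}

Think step by step inside <think>...</think>, then output exactly one line:
"Verdict: A" or "Verdict: B".
"""

def pairwise_judgment(generate: Callable[[str], str], prompt: str,
                      response_a: str, response_b: str) -> str:
    """Query a GenRM-style judge; `generate` is any LLM text-completion function."""
    output = generate(PAIRWISE_TEMPLATE.format(
        prompt=prompt, response_a=response_a, response_b=response_b))
    # Keep only the final verdict; the <think> block is the model's internal reasoning.
    return "A" if "Verdict: A" in output else "B"
```

In a pairwise RLHF pipeline of this kind, the verdict (rather than a scalar reward difference) would supply the preference signal used to update the policy.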