techminis
A naukri.com initiative
Image Credit: Arxiv

Inference-Time Scaling for Generalist Reward Modeling

  • Reinforcement learning (RL) has been widely adopted in post-training large language models (LLMs) at scale.
  • The paper investigates how to improve reward modeling (RM) for general queries by spending more inference-time compute.
  • It proposes Self-Principled Critique Tuning (SPCT) to foster scalable reward-generation behaviors in generative reward models (GRMs).
  • The study shows that SPCT improves both the quality and the inference-time scalability of GRMs, outperforming existing methods and models.
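The idea of scaling reward modeling with inference compute can be illustrated with a minimal sketch: sample several independent judgments for the same (query, response) pair and aggregate them by majority vote. Here `mock_grm` is a hypothetical stand-in for a generative reward model (which, per the summary, would generate principles and a critique before scoring); voting is one simple aggregation scheme, not necessarily the paper's exact method.

```python
from collections import Counter

def mock_grm(query: str, response: str, seed: int) -> int:
    """Hypothetical stand-in for a generative reward model (GRM).
    A real GRM would generate principles and a critique, then emit a
    score; here we derive a deterministic pseudo-score instead."""
    h = sum(map(ord, query + response)) + 3 * seed
    return h % 10 + 1  # discrete score in 1..10

def scaled_reward(query: str, response: str, k: int = 8):
    """Inference-time scaling: draw k independent judgments of the same
    response and aggregate them by majority vote over scores."""
    scores = [mock_grm(query, response, seed) for seed in range(k)]
    winner, _ = Counter(scores).most_common(1)[0]
    return winner, scores

reward, samples = scaled_reward("What is 2+2?", "4", k=8)
print(reward, samples)
```

Increasing `k` spends more inference compute per reward, which is the scaling axis the paper studies.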


