techminis
A naukri.com initiative
Image Credit: Arxiv

Inference-Time Scaling for Generalist Reward Modeling

  • Reinforcement learning (RL) has been widely adopted in post-training large language models (LLMs) at scale.
  • The paper investigates how to improve reward modeling (RM) for general queries by spending more inference-time compute.
  • It proposes Self-Principled Critique Tuning (SPCT) to foster scalable reward-generation behaviors in generative reward models (GRMs).
  • The study shows that SPCT improves both the quality and the inference-time scalability of GRMs, outperforming existing methods and models.
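The idea of scaling reward modeling with inference compute can be illustrated with a minimal sketch: sample several independent judgments for the same (query, response) pair and aggregate them by majority vote. Here `mock_grm` is a hypothetical stand-in for a generative reward model (which, per the summary, would generate principles and a critique before scoring); voting is one simple aggregation scheme, not necessarily the paper's exact method.

```python
from collections import Counter

def mock_grm(query: str, response: str, seed: int) -> int:
    """Hypothetical stand-in for a generative reward model (GRM).
    A real GRM would generate principles and a critique, then emit a
    score; here we derive a deterministic pseudo-score instead."""
    h = sum(map(ord, query + response)) + 3 * seed
    return h % 10 + 1  # discrete score in 1..10

def scaled_reward(query: str, response: str, k: int = 8):
    """Inference-time scaling: draw k independent judgments of the same
    response and aggregate them by majority vote over scores."""
    scores = [mock_grm(query, response, seed) for seed in range(k)]
    winner, _ = Counter(scores).most_common(1)[0]
    return winner, scores

reward, samples = scaled_reward("What is 2+2?", "4", k=8)
print(reward, samples)
```

Increasing `k` spends more inference compute per reward, which is the scaling axis the paper studies.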


