Inference-time computation provides an important axis for scaling language model performance.
Naively scaling compute through techniques like Best-of-$N$ sampling can cause performance degradation due to reward hacking.
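To make the failure mode concrete, here is a minimal sketch of Best-of-$N$ sampling against a proxy reward. The `toy_policy` and `toy_reward` functions are illustrative stand-ins (assumptions, not from the paper); the proxy reward simply prefers longer text, so scaling $N$ increasingly selects for length rather than quality, a toy instance of reward hacking.

```python
import random

def best_of_n(prompt, n, policy, reward):
    """Best-of-N sampling: draw n candidates from the policy and
    return the one scoring highest under the (proxy) reward."""
    candidates = [policy(prompt) for _ in range(n)]
    return max(candidates, key=reward)

# Hypothetical stand-ins for a language-model policy and a learned
# reward model (assumptions for illustration only).
def toy_policy(prompt):
    return prompt + " " + random.choice(["a", "bb", "ccc", "dddd"])

def toy_reward(text):
    # A hackable proxy: rewards length, not quality.
    return len(text)

random.seed(0)
selected = best_of_n("answer:", 8, toy_policy, toy_reward)
print(selected)
```

As $N$ grows, the selected sample's proxy reward rises monotonically, but true quality need not, which is the degradation the analysis above formalizes.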
Theoretical analysis of inference-time alignment algorithms reveals that the pre-trained policy's coverage governs both achievable performance and how performance scales with compute.
The proposed $\texttt{InferenceTimePessimism}$ algorithm mitigates reward hacking, achieving optimal performance while remaining scaling-monotonic as compute increases.