Inference-time computation provides an important axis for scaling language model performance.
Naively scaling compute through techniques like Best-of-$N$ sampling can cause performance degradation due to reward hacking.
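To make the failure mode concrete, here is a minimal sketch of Best-of-$N$ sampling against a proxy reward. The `toy_policy` and `toy_reward` functions are illustrative stand-ins (assumptions, not from the paper); the proxy reward simply prefers longer text, so scaling $N$ increasingly selects for length rather than quality, a toy instance of reward hacking.

```python
import random

def best_of_n(prompt, n, policy, reward):
    """Best-of-N sampling: draw n candidates from the policy and
    return the one scoring highest under the (proxy) reward."""
    candidates = [policy(prompt) for _ in range(n)]
    return max(candidates, key=reward)

# Hypothetical stand-ins for a language-model policy and a learned
# reward model (assumptions for illustration only).
def toy_policy(prompt):
    return prompt + " " + random.choice(["a", "bb", "ccc", "dddd"])

def toy_reward(text):
    # A hackable proxy: rewards length, not quality.
    return len(text)

random.seed(0)
selected = best_of_n("answer:", 8, toy_policy, toy_reward)
print(selected)
```

As $N$ grows, the selected sample's proxy reward rises monotonically, but true quality need not, which is the degradation the analysis above formalizes.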
Theoretical analysis of inference-time alignment algorithms reveals that the pre-trained policy's coverage governs both achievable performance and how performance scales with compute.
The proposed $\texttt{InferenceTimePessimism}$ algorithm mitigates reward hacking, achieving optimal performance while remaining scaling-monotonic as compute increases.