Scaling test-time compute has emerged as a key strategy for enhancing the reasoning capabilities of large language models (LLMs), particularly in tasks like mathematical problem-solving.
Recent advancements in Generative Reward Models (GenRM) reframe verification as a next-token prediction task, enabling inference-time scaling along a new axis.
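Casting verification as next-token prediction typically means the verifier model is prompted with a problem and a candidate solution, and the solution's score is the probability the verifier assigns to a "Yes" token (versus "No"). A minimal sketch of that scoring rule, assuming we already have the verifier's logits for the two tokens (the function name and logit values here are illustrative, not from the paper):

```python
import math

def genrm_score(yes_logit: float, no_logit: float) -> float:
    """Score a candidate solution as P("Yes") under the verifier's
    next-token distribution restricted to {"Yes", "No"}.

    Computed as a numerically stable two-way softmax of the logits.
    """
    m = max(yes_logit, no_logit)  # subtract the max before exp for stability
    e_yes = math.exp(yes_logit - m)
    e_no = math.exp(no_logit - m)
    return e_yes / (e_yes + e_no)

# Toy logits (made up for illustration): a confident "Yes" scores near 1,
# and equal logits score exactly 0.5.
confident = genrm_score(yes_logit=4.0, no_logit=-2.0)
uncertain = genrm_score(yes_logit=0.0, no_logit=0.0)
```

Because each verification is itself an LLM forward pass, scoring many candidates this way consumes inference compute that could instead have generated more solutions, which is the trade-off the evaluation quantifies.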
However, our evaluation shows that Self-Consistency (SC) is more compute-efficient than GenRM at most practical inference budgets, across diverse models and datasets.
This work provides practical guidance for optimizing test-time scaling by balancing compute between solution generation and verification.
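For reference, Self-Consistency spends the entire budget on generation and selects the final answer by majority vote over sampled solutions, with no verifier pass at all. A minimal sketch (the sampled answers below are illustrative stand-ins for a model's outputs):

```python
from collections import Counter

def self_consistency(answers: list[str]) -> str:
    """Majority vote over the final answers of independently sampled
    reasoning chains: the most frequent answer wins."""
    counts = Counter(answers)
    answer, _count = counts.most_common(1)[0]
    return answer

# Toy usage: five sampled chains, three of which agree on "42".
chosen = self_consistency(["42", "41", "42", "7", "42"])  # → "42"
```

Under a fixed compute budget, every candidate that GenRM verifies costs forward passes that SC would have spent sampling additional chains, which is why SC dominates at the budgets most practitioners use.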