Scaling test-time compute has emerged as a key strategy for enhancing the reasoning capabilities of large language models (LLMs), particularly in tasks like mathematical problem-solving.
Recent advancements in Generative Reward Models (GenRM) reframe verification as a next-token prediction task, enabling inference-time scaling along a new axis.
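Casting verification as next-token prediction typically means the verifier model is prompted with a problem and a candidate solution, and the solution's score is the probability the verifier assigns to a "Yes" token (versus "No"). A minimal sketch of that scoring rule, assuming we already have the verifier's logits for the two tokens (the function name and logit values here are illustrative, not from the paper):

```python
import math

def genrm_score(yes_logit: float, no_logit: float) -> float:
    """Score a candidate solution as P("Yes") under the verifier's
    next-token distribution restricted to {"Yes", "No"}.

    Computed as a numerically stable two-way softmax of the logits.
    """
    m = max(yes_logit, no_logit)  # subtract the max before exp for stability
    e_yes = math.exp(yes_logit - m)
    e_no = math.exp(no_logit - m)
    return e_yes / (e_yes + e_no)

# Toy logits (made up for illustration): a confident "Yes" scores near 1,
# and equal logits score exactly 0.5.
confident = genrm_score(yes_logit=4.0, no_logit=-2.0)
uncertain = genrm_score(yes_logit=0.0, no_logit=0.0)
```

Because each verification is itself an LLM forward pass, scoring many candidates this way consumes inference compute that could instead have generated more solutions, which is the trade-off the evaluation quantifies.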
However, our evaluation shows that Self-Consistency (SC) is more compute-efficient than GenRM at most practical inference budgets, across diverse models and datasets.
This work provides practical guidance for optimizing test-time scaling by balancing compute between solution generation and verification.
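For reference, Self-Consistency spends the entire budget on generation and selects the final answer by majority vote over sampled solutions, with no verifier pass at all. A minimal sketch (the sampled answers below are illustrative stand-ins for a model's outputs):

```python
from collections import Counter

def self_consistency(answers: list[str]) -> str:
    """Majority vote over the final answers of independently sampled
    reasoning chains: the most frequent answer wins."""
    counts = Counter(answers)
    answer, _count = counts.most_common(1)[0]
    return answer

# Toy usage: five sampled chains, three of which agree on "42".
chosen = self_consistency(["42", "41", "42", "7", "42"])  # → "42"
```

Under a fixed compute budget, every candidate that GenRM verifies costs forward passes that SC would have spent sampling additional chains, which is why SC dominates at the budgets most practitioners use.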