Reasoning has become a central focus for language models, yet progress in this area often lacks methodological rigor and robust evaluation practices.
Current mathematical reasoning benchmarks are highly sensitive to subtle implementation choices, and because these sources of variance are rarely reported, comparisons between methods are often unclear.
A standardized evaluation framework with clear best practices and reporting standards is proposed to address these issues.
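A minimal sketch of the kind of protocol such a framework might prescribe is shown below; the decoding parameters, seed count, and `evaluate_fn` interface are illustrative assumptions rather than the actual implementation. The point is simply that accuracy is reported as a mean and standard deviation over multiple seeds under one pinned decoding configuration, rather than as a single run:

```python
import statistics

# Hypothetical pinned decoding configuration: every compared method is run
# under the same sampling settings, so decoding choices cannot masquerade
# as modeling improvements.
DECODING_CONFIG = {"temperature": 0.0, "top_p": 1.0, "max_new_tokens": 4096}
SEEDS = [0, 1, 2, 3, 4]


def report_accuracy(evaluate_fn, benchmark):
    """Run one evaluation per seed and report mean and standard deviation.

    `evaluate_fn(benchmark, seed, decoding_config)` is a placeholder for the
    actual evaluation harness and must return an accuracy in [0, 1].
    """
    accuracies = [evaluate_fn(benchmark, seed, DECODING_CONFIG) for seed in SEEDS]
    mean = statistics.mean(accuracies)
    std = statistics.stdev(accuracies)
    print(f"{benchmark}: {mean:.3f} ± {std:.3f} over {len(SEEDS)} seeds")
    return mean, std
```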
Reassessing recent methods under this framework reveals that reinforcement learning approaches yield only modest improvements and are prone to overfitting, whereas supervised fine-tuning methods generalize more robustly.
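As a sketch of how such a claim can be checked (with illustrative numbers and a deliberately crude criterion, not the paper's procedure), a reported gain can be compared against the seed-to-seed variance of both runs:

```python
import statistics

def improvement_exceeds_noise(baseline_accs, method_accs):
    """Return True only if the mean gain is larger than the combined
    seed-to-seed standard deviation of the two runs (a crude yardstick;
    a bootstrap or t-test would be a stricter check)."""
    gain = statistics.mean(method_accs) - statistics.mean(baseline_accs)
    noise = statistics.stdev(baseline_accs) + statistics.stdev(method_accs)
    return gain > noise

# Illustrative numbers only: a "gain" of well under one point that sits
# inside the seed-level variance should not be reported as an improvement.
baseline = [0.412, 0.398, 0.405, 0.420, 0.401]
rl_tuned = [0.418, 0.409, 0.402, 0.425, 0.411]
print(improvement_exceeds_noise(baseline, rl_tuned))  # False for these values
```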