Researchers introduce Athena-PRM, a multimodal process reward model (PRM) that assigns a reward score to each step of a solution to a complex reasoning problem.
Building high-performance PRMs conventionally requires step-level annotation of reasoning steps, which is costly in both time and money.
To generate high-quality process-labeled data efficiently, Athena-PRM uses prediction consistency between weak and strong completers as the criterion for identifying reliable step labels.
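The consistency idea can be illustrated with a minimal sketch. It assumes that each completer labels a step by Monte Carlo estimation (a step counts as correct if any sampled continuation from it reaches the reference answer) and that a label is kept only when both completers agree. The `Completer` interface and the function names are hypothetical and are not taken from the paper.

```python
from typing import Callable, List, Optional

# Hypothetical interface: given a solution prefix (list of steps), a completer
# returns the final answers of several sampled continuations.
Completer = Callable[[List[str]], List[str]]


def mc_label(prefix: List[str], completer: Completer, gold: str) -> int:
    """Hard Monte Carlo label: 1 if any sampled continuation reaches the gold answer."""
    return int(any(answer == gold for answer in completer(prefix)))


def consistent_step_labels(
    steps: List[str], weak: Completer, strong: Completer, gold: str
) -> List[Optional[int]]:
    """Label every step with both completers; keep a label only when they agree.

    Steps on which the two completers disagree get None and would be dropped
    from the training data (an assumption of this sketch).
    """
    labels: List[Optional[int]] = []
    for end in range(1, len(steps) + 1):
        prefix = steps[:end]
        weak_label = mc_label(prefix, weak, gold)
        strong_label = mc_label(prefix, strong, gold)
        labels.append(weak_label if weak_label == strong_label else None)
    return labels
```

In this sketch a disagreement is treated as a sign that the label is unreliable, so the step contributes no training signal.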
With just 5,000 samples, Athena-PRM is effective across a range of scenarios and benchmarks.
Two strategies are developed to further boost PRM performance: initializing the PRM from an outcome reward model (ORM) and up-sampling negative data.
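Up-sampling negative data can be sketched as simple duplication. The sketch assumes that a "negative" sample is one containing at least one step labeled incorrect (0) and that up-sampling means repeating such samples a fixed number of times; the `labels` key and the `factor` parameter are hypothetical, and the actual ratio is not given here.

```python
from typing import Dict, List


def upsample_negatives(samples: List[Dict], factor: int = 2) -> List[Dict]:
    """Repeat every sample that contains an incorrect step `factor` times.

    Each sample is assumed to carry a "labels" list with one 0/1 entry per step.
    Samples with only correct steps are kept once.
    """
    upsampled: List[Dict] = []
    for sample in samples:
        has_negative = any(label == 0 for label in sample["labels"])
        upsampled.extend([sample] * (factor if has_negative else 1))
    return upsampled
```

Duplicating negatives is one way to offset a training set in which correct steps heavily outnumber incorrect ones.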
The approach is validated in three scenarios: verification for test-time scaling, direct evaluation of reasoning-step correctness, and reward-ranked fine-tuning.
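In the verification scenario, a PRM is typically used to pick the best of several sampled solutions. The sketch below assumes per-step scores in [0, 1] and aggregates them with the minimum, which is one common choice rather than a detail confirmed by this summary; the function names are hypothetical.

```python
from typing import List, Sequence


def solution_score(step_scores: Sequence[float]) -> float:
    """Aggregate per-step PRM scores into one value (minimum: an assumption)."""
    return min(step_scores)


def best_of_n(candidates: List[str], step_scores: List[List[float]]) -> str:
    """Return the candidate solution with the highest aggregated PRM score."""
    best_index = max(
        range(len(candidates)),
        key=lambda i: solution_score(step_scores[i]),
    )
    return candidates[best_index]
```

With the minimum as the aggregate, a single low-scoring step is enough to rank a solution below candidates whose steps are all moderately good.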
Athena-PRM consistently achieves superior performance across benchmarks and scenarios: with Qwen2.5-VL-7B as the policy model, it improves test-time scaling results by 10.2 points on WeMath and 7.1 points on MathVista.
Athena-PRM also sets a new state of the art on VisualProcessBench, outperforming the previous SoTA by 3.9 F1 points, which demonstrates its ability to assess the correctness of individual reasoning steps accurately.
Athena-7B, trained with reward-ranked fine-tuning using Athena-PRM as the reward model, outperforms its baseline by a significant margin on five benchmarks.
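The data-selection step of reward-ranked fine-tuning can be sketched as follows. It assumes the generic recipe of sampling several responses per prompt, scoring each with the reward model, and keeping the top-scored one as a fine-tuning target; any additional filtering used for Athena-7B is not described here, and the function name is hypothetical.

```python
from typing import Dict, List, Tuple


def select_for_finetuning(
    scored: Dict[str, List[Tuple[str, float]]]
) -> List[Tuple[str, str]]:
    """For each prompt, keep the response with the highest reward.

    `scored` maps a prompt to (response, reward) pairs; prompts without any
    sampled response are skipped.
    """
    return [
        (prompt, max(responses, key=lambda pair: pair[1])[0])
        for prompt, responses in scored.items()
        if responses
    ]
```

The selected (prompt, response) pairs would then serve as ordinary supervised fine-tuning data for the policy model.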