Process-supervised reward models (PRMs) offer fine-grained, step-wise feedback on model responses, aiding in selecting effective reasoning paths for complex tasks.
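To make that selection mechanism concrete, the sketch below shows how step-wise PRM scores could drive best-of-N selection over candidate reasoning paths; the `score_step` callable is a hypothetical stand-in for a real PRM, not an interface described in the paper.

```python
from typing import Callable, List

def select_best_path(
    candidates: List[List[str]],                      # each candidate is a list of reasoning steps
    score_step: Callable[[str, List[str]], float],    # hypothetical PRM: step + prior steps -> score
) -> List[str]:
    """Pick the candidate whose steps the PRM rates highest on average."""
    def path_score(steps: List[str]) -> float:
        # Score each step conditioned on the steps that precede it.
        scores = [score_step(step, steps[:i]) for i, step in enumerate(steps)]
        return sum(scores) / len(scores) if scores else 0.0

    return max(candidates, key=path_score)
```

Averaging the per-step scores is just one possible aggregation; taking the minimum step score is another common choice when a single flawed step should disqualify a path.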
Existing reward benchmarks focus primarily on text-based models, with only some designed specifically for PRMs; in the vision-language domain, evaluations generally assess broad model capabilities rather than reward quality.
Researchers from UC Santa Cruz, UT Dallas, and Amazon Research benchmarked VLLMs as outcome reward models (ORMs) and PRMs across multiple tasks, finding that neither approach consistently outperforms the other on every task.
VLLMs prove increasingly effective across various tasks, particularly when applied to test-time scaling; within those evaluations, ORMs generally outperform PRMs, and a hybrid ORM-PRM approach proves optimal.
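As an illustration of what such a hybrid could look like, the sketch below blends an outcome-level ORM score with averaged step-level PRM scores when reranking candidates; the equal-weight mixing and the `orm_score` / `prm_step_scores` callables are assumptions for illustration, not the authors' implementation.

```python
from typing import Callable, List

def hybrid_rerank(
    candidates: List[List[str]],
    orm_score: Callable[[List[str]], float],              # hypothetical ORM: scores the full response
    prm_step_scores: Callable[[List[str]], List[float]],  # hypothetical PRM: scores each step
    alpha: float = 0.5,                                    # assumed ORM/PRM mixing weight
) -> List[str]:
    """Return the candidate with the highest blended ORM/PRM score."""
    def combined(steps: List[str]) -> float:
        step_scores = prm_step_scores(steps)
        prm_avg = sum(step_scores) / len(step_scores) if step_scores else 0.0
        return alpha * orm_score(steps) + (1.0 - alpha) * prm_avg

    return max(candidates, key=combined)
```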