xVerify is an efficient answer verifier tailored to evaluating reasoning-model responses to objective questions, addressing two core challenges: extracting the final answer from long reasoning traces and deciding whether it is equivalent to the reference answer despite differences in format.
The evaluation task is formalized as a 4-tuple (Q, R, A_ref, E), with the emphasis on extracting a candidate answer from the response and comparing it for equivalence against the reference answer.
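A minimal sketch of this framing is given below. The extraction heuristics and the numeric/string equivalence check are illustrative assumptions for a toy verifier, not the paper's method (xVerify itself is a fine-tuned judge model rather than a rule set):

```python
import re

def verify(question: str, response: str, reference_answer: str) -> bool:
    """Toy verifier over the (Q, R, A_ref, E) framing: extract a candidate
    final answer from the response R, then check equivalence with A_ref."""
    # Naive extraction: text after an explicit final-answer cue, else the last
    # number in the response, else the last non-empty line.
    cue = re.search(r"(?:final answer is|the answer is)\s*:?\s*([^\n.]+)",
                    response, re.IGNORECASE)
    if cue:
        candidate = cue.group(1).strip()
    else:
        numbers = re.findall(r"-?\d+(?:\.\d+)?", response)
        if numbers:
            candidate = numbers[-1]
        else:
            lines = [ln for ln in response.strip().splitlines() if ln.strip()]
            candidate = lines[-1].strip() if lines else ""

    # Naive equivalence: numeric comparison when both sides parse as numbers,
    # otherwise a normalized string match.
    def normalize(s: str) -> str:
        return re.sub(r"[\s$,]", "", s).lower()

    try:
        return abs(float(normalize(candidate)) - float(normalize(reference_answer))) < 1e-6
    except ValueError:
        return normalize(candidate) == normalize(reference_answer)


# Example: the equivalence check treats "144" and "144.0" as the same answer.
print(verify("What is 12 * 12?",
             "12 * 12 = 144, so the final answer is 144.",
             "144.0"))  # True
```

Brittle heuristics of this kind are exactly what rule-based frameworks rely on, which is why the paper argues for a trained verifier instead.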
The researchers constructed the VAR dataset, which collects diverse LLM responses from 19 models across 24 evaluation datasets, spanning multiple question types and prompting strategies and carrying high-quality correctness annotations.
The 14 xVerify models trained on the VAR dataset achieved superior performance across multiple question types, demonstrating strong generalization and efficiency compared to existing methods.
xVerify outperformed rule-based evaluation frameworks and judge models in both accuracy and cost-effectiveness; even the smallest model, at 0.5B parameters, reached high accuracy while remaining computationally cheap.
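For illustration, a compact verifier of this kind could be invoked with a standard Hugging Face Transformers workflow as sketched below; the checkpoint name, prompt template, and "Correct"/"Incorrect" output convention are assumptions made for the sketch, not the released xVerify interface:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical checkpoint name; the actual released models and their exact
# prompt template may differ.
MODEL_ID = "example-org/xverify-0.5b"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)

# The verifier sees the question, the full model response, and the reference
# answer, and is asked for a binary judgment.
prompt = (
    "Question: What is 12 * 12?\n"
    "Model response: 12 * 12 = 144, so the final answer is 144.\n"
    "Reference answer: 144\n"
    "Is the model response correct? Answer 'Correct' or 'Incorrect':"
)

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=5)
# Decode only the newly generated tokens (the judgment).
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:],
                       skip_special_tokens=True))
```

A 0.5B-parameter judge of this kind can be run on modest hardware, which is what makes large-scale evaluation with it inexpensive.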
Strong generalization to unseen datasets and models was also observed, reinforcing the effectiveness of the targeted training and the quality of the VAR dataset.
The study highlights the importance of specialized evaluation tools like xVerify for accurately assessing reasoning-model outputs as responses grow longer and more complex, setting a precedent for tailored verifiers in complex LLM evaluation tasks.
By combining innovative data collection, annotation methods, and targeted training, xVerify has emerged as a robust verifier surpassing rule-based frameworks and general-purpose judge models.
The findings suggest that even small models can excel at specialized tasks when trained on high-quality datasets, offering computational efficiency and cost-effectiveness for large-scale evaluations.
The work's contributions lie in the creation of the VAR dataset, the development of the xVerify model family, and the demonstration of its superiority in accuracy, generalization ability, computational efficiency, and cost-effectiveness.