The shift towards training large language models with reinforcement learning on verifiable rewards has driven substantial gains in code and mathematical reasoning.
This methodology, however, is limited to tasks whose answers admit rule-based verification, and it does not extend readily to real-world domains such as chemistry, healthcare, engineering, law, biology, business, and economics.
A verifier-free method, named VeriFree, is proposed to extend training to these general reasoning domains: it bypasses answer verification entirely and instead directly maximizes the probability of generating the reference answer.
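The core idea can be illustrated with a minimal sketch. All names below are hypothetical, not from the VeriFree codebase: instead of passing a sampled answer through a rule-based verifier, each reasoning trace is scored by the probability the model assigns to the reference-answer tokens that follow it.

```python
import math

def sequence_log_prob(token_probs):
    """Sum of log-probabilities the model assigns to each reference-answer token."""
    return sum(math.log(p) for p in token_probs)

def verifree_reward(token_probs):
    """Verifier-free score: probability of generating the full reference answer."""
    return math.exp(sequence_log_prob(token_probs))

# Assumed per-token probabilities P(answer_t | question, trace, answer_<t)
# under two candidate reasoning traces. Trace A makes the reference answer
# more likely than trace B, so it receives the larger reward.
probs_trace_a = [0.9, 0.8, 0.85]
probs_trace_b = [0.3, 0.4, 0.2]

assert verifree_reward(probs_trace_a) > verifree_reward(probs_trace_b)
```

Because the score is a probability rather than a binary verifier decision, it provides a dense training signal even in domains where no rule-based checker exists.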
Compared with verifier-based methods, VeriFree offers significant practical benefits, including reduced compute requirements, while performing strongly on evaluations across a range of benchmarks.