RLPR (Reinforcement Learning with Reference Probability Reward) is proposed to address a key limitation of RLVR: because RLVR relies on domain-specific verifiers to compute rewards, it is largely restricted to math and code tasks and does not extend readily to general domains that lack verifiers.
RLPR instead uses the model's own probability of generating the reference answer as the reward signal, eliminating the need for external verifiers. This self-rewarding setup removes the need for manual reward engineering and improves scalability across domains.
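To make the reward concrete, here is a minimal sketch assuming the reward is the mean per-token probability the policy assigns to the reference answer, conditioned on the prompt and the model's own generated response. The function name, the exact conditioning, and the Hugging Face-style model/tokenizer interface are illustrative assumptions, not the paper's verbatim formulation.

```python
import torch
import torch.nn.functional as F

def probability_reward(model, tokenizer, prompt: str,
                       response: str, reference: str) -> float:
    """Hypothetical sketch: score a rollout by the policy's own
    probability of the reference answer, given prompt + response."""
    context_ids = tokenizer(prompt + response, return_tensors="pt").input_ids
    reference_ids = tokenizer(reference, return_tensors="pt").input_ids
    input_ids = torch.cat([context_ids, reference_ids], dim=1)

    with torch.no_grad():
        logits = model(input_ids).logits  # (1, seq_len, vocab_size)

    # The logit at position t predicts token t+1, so shift by one to
    # pick out the predictions for each reference token.
    ref_len = reference_ids.shape[1]
    ref_logits = logits[0, -ref_len - 1:-1, :]
    log_probs = F.log_softmax(ref_logits, dim=-1)
    token_log_probs = log_probs.gather(
        1, reference_ids[0].unsqueeze(1)).squeeze(1)

    # Mean per-token probability is used here rather than the sequence
    # product, which collapses toward zero for long references.
    return token_log_probs.exp().mean().item()
```

Because the reward comes from the policy's own likelihoods, the same scoring code applies to any domain with a textual reference answer, with no verifier in the loop.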
The RLPR framework rests on two main technical contributions: a debiasing method that turns the raw (and noisy) reference probability into a higher-quality reward, and a standard-deviation-based filtering strategy that stabilizes training.
Experiments on seven benchmarks across three model families (Qwen, Llama, Gemma) show that RLPR consistently improves general-domain reasoning, outperforming both verifier-free baselines and methods that rely on dedicated verifier models.
The code for RLPR is publicly available.