<ul><li>SRPO is presented as a cross-domain implementation of large-scale reinforcement learning (RL) on Large Language Models (LLMs).</li><li>Recent reasoning models, such as OpenAI's o1 and DeepSeek's R1, demonstrate the potential of RL to enhance the reasoning capabilities of LLMs.</li><li>SRPO surpasses DeepSeek-R1-Zero-32B on the AIME24 and LiveCodeBench benchmarks using the same base model (Qwen2.5-32B), without prior Supervised Fine-Tuning (SFT).</li><li>SRPO introduces two techniques: a two-stage cross-domain training paradigm, which develops both mathematical reasoning and coding proficiency, and History Resampling (HR), which addresses ineffective samples.</li></ul>

SRPO: A Cross-Domain Implementation of Large-Scale Reinforcement Learning on LLM
