SRPO (two-Staged history-Resampling Policy Optimization) is presented as a cross-domain implementation of large-scale reinforcement learning (RL) on Large Language Models (LLMs).
Recent advances in reasoning models, such as OpenAI's o1 and DeepSeek's R1, demonstrate the potential of RL in enhancing the reasoning capabilities of LLMs.
Using the same base model (Qwen2.5-32B) and no prior Supervised Fine-Tuning (SFT), SRPO surpasses the performance of DeepSeek-R1-Zero-32B on the AIME24 and LiveCodeBench benchmarks.
SRPO introduces two key techniques: a two-stage cross-domain training paradigm, which balances the development of mathematical reasoning and coding proficiency, and History Resampling (HR), which filters out ineffective training samples whose rollouts yield no useful gradient signal (sketched below).
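To make HR concrete, here is a minimal Python sketch, assuming a GRPO-style setup in which each prompt is rolled out several times per epoch and advantages are computed relative to the group's mean reward: a group whose rewards are all identical has zero advantage and so contributes no gradient. The function name `history_resample`, the `reward_history` mapping, and the exact retention rule are illustrative assumptions, not the paper's implementation.

```python
import random

def history_resample(dataset, reward_history, group_size=8):
    """Illustrative History Resampling (HR) filter (not the paper's code).

    Keep a prompt for the next epoch if it has no recorded rollouts yet,
    or if its most recent group of rollout rewards is mixed. When every
    rollout in a group receives the same reward (all correct or all
    incorrect), the group-relative advantage is zero, so the sample is
    "ineffective" and is dropped.
    """
    kept = []
    for prompt in dataset:
        recent = reward_history.get(prompt, [])[-group_size:]
        if not recent or len(set(recent)) > 1:
            kept.append(prompt)
    random.shuffle(kept)  # avoid ordering bias in the rebuilt epoch
    return kept

# Example: "p2" earned identical rewards on every rollout and is dropped;
# "p3" has no recorded history yet, so it stays in the pool.
history = {
    "p1": [0, 1, 0, 1, 1, 0, 1, 0],  # mixed outcomes -> informative
    "p2": [1, 1, 1, 1, 1, 1, 1, 1],  # uniform rewards -> zero advantage
}
print(history_resample(["p1", "p2", "p3"], history))
```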