Source: arXiv

SRPO: A Cross-Domain Implementation of Large-Scale Reinforcement Learning on LLM

  • SRPO is presented as a cross-domain implementation of large-scale reinforcement learning on Large Language Models (LLMs).
  • Recent advances in reasoning models, such as OpenAI's o1 and DeepSeek's R1, demonstrate the potential of RL in enhancing the reasoning capabilities of LLMs.
  • SRPO surpasses the performance of DeepSeek-R1-Zero-32B on the AIME24 and LiveCodeBench benchmarks using the same base model (Qwen2.5-32B) without prior Supervised Fine-Tuning (SFT).
  • SRPO introduces two techniques: a two-stage cross-domain training paradigm, which targets the development of both mathematical reasoning and coding proficiency, and History Resampling (HR), which addresses ineffective training samples.
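
The summary does not say how History Resampling decides that a sample is ineffective, so the following is only a minimal sketch under stated assumptions, not the authors' implementation. It assumes a group-based RL setup in which several responses are sampled per prompt and each response's advantage is its reward normalized within that group; under that assumption, a prompt whose responses all earn the same reward yields zero advantage and so no learning signal. The sketch further assumes that the prompts to drop are those whose recorded responses were all fully correct. The names `group_advantages`, `resample_prompts`, `history`, and `max_reward` are hypothetical.

```python
from statistics import mean, pstdev


def group_advantages(rewards):
    """Normalize rewards within one prompt's group of sampled responses.

    Returns all zeros when every reward is identical, which is the case
    this sketch treats as an "ineffective" sample (an assumption).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    if sigma == 0:
        return [0.0] * len(rewards)
    return [(r - mu) / sigma for r in rewards]


def resample_prompts(history, max_reward=1.0):
    """Return the prompts to keep for the next epoch.

    `history` maps a prompt id to the rewards its responses received in
    the previous epoch. Assumption: a prompt is dropped only when every
    recorded response reached `max_reward` (it is already solved).
    """
    return [
        prompt
        for prompt, rewards in history.items()
        if not all(r >= max_reward for r in rewards)
    ]


# Hypothetical reward history from one epoch.
history = {
    "p1": [1.0, 1.0, 1.0],  # always correct: dropped in this sketch
    "p2": [0.0, 1.0, 0.0],  # mixed outcomes: kept
    "p3": [0.0, 0.0, 0.0],  # always wrong: kept, may become solvable later
}
kept = resample_prompts(history)
```

Whether prompts that are always answered incorrectly should be kept, as they are here, is a design choice this sketch makes on its own; the summary gives no detail on that point.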
