techminis

A naukri.com initiative

Image Credit: Arxiv

Spectral Policy Optimization: Coloring your Incorrect Reasoning in GRPO

  • Reinforcement learning has proven effective at enhancing reasoning in large language models, with Group Relative Policy Optimization (GRPO) a widely used method, valued for its memory efficiency and for its role in training DeepSeek-R1.
  • However, GRPO stalls when every sampled response in a group is incorrect: in such an 'all-negative-sample' group, the group-relative advantages collapse to zero, so the policy receives no update and learning progress halts.
  • This paper introduces a framework that uses AI feedback to bring response diversity into all-negative-sample groups in GRPO, supported by a theoretical analysis showing how this diversification improves learning dynamics.
  • Empirical validation demonstrates improved performance across various model sizes in both offline and online learning settings, highlighting the benefits of learning from all-negative-sample groups and extending recent insights in this area.
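The all-negative-sample failure mode described above can be sketched in a few lines. This is a minimal illustration of GRPO-style group-relative advantages, not the paper's implementation; it assumes a binary 0/1 correctness reward and a hypothetical helper `grpo_advantages`:

```python
# Minimal sketch of GRPO's group-relative advantage normalization:
# each response's reward is standardized against its group's statistics.
from statistics import mean, pstdev

def grpo_advantages(rewards, eps=1e-8):
    """A_i = (r_i - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Mixed group: correct (1.0) and incorrect (0.0) responses
# receive nonzero advantages, so the policy gets a learning signal.
print(grpo_advantages([1.0, 0.0, 0.0, 1.0]))

# All-negative-sample group: every response is wrong, every advantage
# is exactly zero, and the policy gradient vanishes -- the failure
# mode this paper targets by diversifying responses via AI feedback.
print(grpo_advantages([0.0, 0.0, 0.0, 0.0]))  # [0.0, 0.0, 0.0, 0.0]
```

With uniform rewards the numerator `r_i - mu` is zero for every sample, which is why no amount of extra sampling from such a group moves the policy.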

Read Full Article
