Reinforcement learning has proven effective at enhancing the reasoning capabilities of large language models, with Group Relative Policy Optimization (GRPO) a widely used method, valued for its memory efficiency and for its role in training DeepSeek-R1.
However, GRPO stalls when all sampled responses in a group are incorrect, a so-called 'all-negative-sample' group: every response receives the same reward, so the group-relative advantages vanish and the policy is not updated, hindering learning progress.
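To see why, recall that GRPO normalizes each reward against the statistics of its group (with \(\epsilon\) a small constant commonly added for numerical stability in implementations):
\[
\hat{A}_i = \frac{r_i - \operatorname{mean}(r_1,\dots,r_G)}{\operatorname{std}(r_1,\dots,r_G) + \epsilon},
\qquad
r_1 = \dots = r_G = 0 \;\Rightarrow\; \hat{A}_i = 0 \ \text{for all } i,
\]
so every gradient contribution from such a group is scaled by zero and its samples supply no learning signal.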
This paper introduces a framework that uses AI feedback to introduce response diversity into all-negative-sample groups in GRPO, supported by a theoretical analysis showing how this diversification improves learning dynamics.
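As a purely illustrative calculation (not necessarily the construction used in the paper), suppose diversification turns one of the \(G\) responses in an all-negative group into a response with reward 1 while the others keep reward 0; the group-relative advantages then become nonzero and a learning signal reappears:
\[
(r_1, r_2, \dots, r_G) = (1, 0, \dots, 0) \;\Rightarrow\; \hat{A}_1 = \sqrt{G-1}, \qquad \hat{A}_j = -\frac{1}{\sqrt{G-1}} \ \text{for } j \neq 1,
\]
(ignoring \(\epsilon\) and using the population standard deviation; exact values depend on the normalization convention), so the policy is pushed toward the diversified response and away from the incorrect ones.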
Empirical results validate the approach across various model sizes in both offline and online learning settings, highlighting the benefits of learning from all-negative-sample groups and extending recent insights in this area.