
Source: arXiv

Unearthing Gems from Stones: Policy Optimization with Negative Sample Augmentation for LLM Reasoning

  • Reasoning language models have recently shifted toward long chain-of-thought (CoT) patterns, which has drawn attention to using a fixed training dataset more efficiently.
  • Existing methods overlook the value of negative responses, which can still contain useful elements such as self-reflection and error correction. The proposed BCPG-NSA framework aims to use these learning signals to improve policy optimization.
  • BCPG-NSA has three stages: sample segmentation, step correctness assessment by LLM and PRM (process reward model) judges, and policy optimization with negative sample augmentation. It outperforms baselines on math and coding reasoning benchmarks, with improved sample efficiency and scalability.
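
The three stages can be illustrated with a toy sketch. This is not the paper's implementation: the blank-line segmentation rule, the requirement that all judges agree, the weight values, and the names `segment_steps` and `step_weights` are all assumptions made for illustration, and the judges here are simple stand-in functions, not a real LLM or PRM.

```python
from typing import Callable, List

# A judge takes one reasoning step and returns True if it looks correct.
# In practice this would wrap an LLM or PRM call; here it is a stand-in.
Judge = Callable[[str], bool]


def segment_steps(response: str) -> List[str]:
    """Stage 1 (assumed heuristic): split a response into steps on blank lines."""
    return [s.strip() for s in response.split("\n\n") if s.strip()]


def step_weights(
    steps: List[str],
    judges: List[Judge],
    response_is_correct: bool,
    correct_step_weight: float = 0.0,   # hypothetical: no penalty for good steps
    wrong_step_weight: float = -1.0,    # hypothetical: full penalty for bad steps
) -> List[float]:
    """Stages 2 and 3 (assumed scheme): per-step weights for a policy update.

    Steps of a correct response all get weight +1. In a negative response,
    a step that every judge accepts gets `correct_step_weight`, so it is not
    punished along with the genuinely wrong steps.
    """
    if response_is_correct:
        return [1.0] * len(steps)
    weights = []
    for step in steps:
        accepted = all(judge(step) for judge in judges)
        weights.append(correct_step_weight if accepted else wrong_step_weight)
    return weights


if __name__ == "__main__":
    # Toy negative response: the middle step is a valid self-correction.
    response = "2 + 2 = 5 [WRONG]\n\nWait, recheck: 2 + 2 = 4\n\nSo the answer is 5 [WRONG]"
    toy_judges: List[Judge] = [
        lambda s: "[WRONG]" not in s,   # stand-in for an LLM judge
        lambda s: len(s) > 0,           # stand-in for a PRM judge
    ]
    steps = segment_steps(response)
    print(step_weights(steps, toy_judges, response_is_correct=False))
```

In a real training loop these weights would scale the per-token loss of each step, so that a valid self-correction inside a wrong answer is not penalized together with the wrong steps around it.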
