Recent advances in reasoning language models have shifted toward long chain-of-thought (CoT) patterns, making it increasingly important to extract as much learning signal as possible from a fixed training dataset.
Existing methods, however, overlook the value hidden in negative responses, which often contain useful elements such as self-reflection and error-correction steps. The proposed BCPG-NSA framework aims to mine these learning signals for more effective policy optimization.
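To make this concrete, here is a toy Python sketch (hypothetical data and values, not taken from the paper) contrasting outcome-level credit assignment, which penalizes every step of a failed response uniformly, with step-level credit that preserves the signal from a correct self-correction step:

```python
# Hypothetical failed response: the final answer is wrong, but the middle
# step performs a genuinely correct self-correction.
failed_response_steps = [
    ("Compute 12 * 13 = 146.", False),                  # arithmetic slip
    ("Wait, that is wrong: 12 * 13 = 156.", True),      # valuable self-correction
    ("So the final answer is 146.", False),             # reverts to the stale value
]

outcome_reward = -1.0  # the whole trajectory is judged incorrect

# Outcome-level credit: every step inherits the same negative signal.
uniform_weights = [outcome_reward for _ in failed_response_steps]

# Step-level credit: correct steps inside the negative sample keep a
# non-negative learning signal instead of being penalized.
step_weights = [0.0 if correct else outcome_reward
                for _, correct in failed_response_steps]

print(uniform_weights)  # [-1.0, -1.0, -1.0]
print(step_weights)     # [-1.0, 0.0, -1.0]
```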
BCPG-NSA proceeds in three stages: sample segmentation, consensus-based step correctness assessment combining LLM and PRM judgers, and policy optimization with negative sample augmentation. It outperforms baselines on math and coding reasoning benchmarks, demonstrating improved sample efficiency and scalability.
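The following is a minimal Python sketch of these three stages. The function names, judger interfaces, consensus rule, and the `nsa_coef` weighting are illustrative assumptions rather than the paper's exact formulation:

```python
from typing import Callable, List

def segment_into_steps(response: str) -> List[str]:
    """Stage 1 (sketch): split a sampled response into reasoning steps.
    A simple delimiter heuristic stands in for the actual segmentation."""
    return [s.strip() for s in response.split("\n\n") if s.strip()]

def consensus_correct(step: str,
                      llm_judge: Callable[[str], bool],
                      prm_judge: Callable[[str], bool]) -> bool:
    """Stage 2 (sketch): mark a step correct only when the LLM judger and
    the PRM judger agree; the exact consensus rule is an assumption."""
    return llm_judge(step) and prm_judge(step)

def step_weights(steps: List[str],
                 is_negative_sample: bool,
                 llm_judge: Callable[[str], bool],
                 prm_judge: Callable[[str], bool],
                 nsa_coef: float = 0.5) -> List[float]:
    """Stage 3 (sketch): per-step weights for policy optimization.
    In a negative sample, steps judged correct by consensus receive a
    softened positive signal (scaled by a hypothetical `nsa_coef`)
    instead of the full penalty; the actual objective may differ."""
    weights = []
    for step in steps:
        if not is_negative_sample:
            weights.append(1.0)   # positive samples reinforced as usual
        elif consensus_correct(step, llm_judge, prm_judge):
            weights.append(nsa_coef)  # salvage correct steps
        else:
            weights.append(-1.0)  # penalize genuinely wrong steps
    return weights

# Example usage with stub judgers (assumptions for illustration only):
steps = segment_into_steps(
    "Try x = 2.\n\nCheck: 2^2 = 4, not 9; revise to x = 3.\n\nAnswer: 2.")
w = step_weights(steps, is_negative_sample=True,
                 llm_judge=lambda s: "revise" in s,
                 prm_judge=lambda s: "revise" in s)
print(w)  # [-1.0, 0.5, -1.0]
```

The design choice sketched here is that agreement between the two judgers gates which steps of a negative sample are rescued from the penalty, so that only steps both judgers endorse contribute a positive learning signal.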