Reinforcement learning (RL) is being used to enhance the reasoning capabilities of large language models (LLMs) in scenarios where supervised fine-tuning falls short.
Group Relative Policy Optimization (GRPO) is an RL-based post-training method that stands out by eliminating the dependency on a value model, which simplifies training compared with traditional approaches such as Proximal Policy Optimization (PPO).
Existing group relative advantage estimation methods face training inefficiencies, especially when the estimated advantage approaches zero. To overcome this, a new RL algorithm called Advantage-Augmented Policy Optimization (AAPO) is proposed.
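To make the failure mode concrete, recall the standard group relative advantage estimator used by GRPO, where $r_1, \dots, r_G$ are the rewards of the $G$ responses sampled for a single prompt:
\[
\hat{A}_i = \frac{r_i - \operatorname{mean}(\{r_1, \dots, r_G\})}{\operatorname{std}(\{r_1, \dots, r_G\})}, \qquad i = 1, \dots, G.
\]
When every response in a group receives (nearly) the same reward, e.g., all answers are correct or all are incorrect, every $\hat{A}_i$ is close to zero, so the group contributes almost no gradient signal and the sampled rollouts are effectively wasted.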
AAPO optimizes the cross-entropy loss using advantages enhanced through a momentum-based estimation scheme, effectively addressing the inefficiencies of group relative advantage estimation. Experimental results on several mathematical reasoning benchmarks demonstrate the superior performance of AAPO.
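As a rough sketch of the general idea only, the snippet below combines a GRPO-style group relative advantage with a simple momentum (exponential moving average) reward baseline and uses the result to weight a cross-entropy (negative log-likelihood) loss over sampled responses. The function names, the `beta` coefficient, and the EMA baseline are illustrative assumptions, not the exact formulation from the paper.

```python
import torch

def augmented_advantages(rewards, running_mean, beta=0.9, eps=1e-4):
    """Group relative advantages augmented with a momentum-based term (illustrative).

    rewards:      (G,) scalar rewards for the G responses sampled for one prompt
    running_mean: EMA estimate of past rewards -- an assumed baseline, not from the paper
    beta:         momentum coefficient -- assumed hyperparameter
    """
    # GRPO-style group relative advantage: z-score within the sampled group.
    group_adv = (rewards - rewards.mean()) / (rewards.std() + eps)
    # Momentum-based augmentation: keep a gradient signal even when all rewards
    # in the group are similar and the group advantage collapses toward zero.
    running_mean = beta * running_mean + (1.0 - beta) * rewards.mean()
    augmented = group_adv + (rewards - running_mean)
    return augmented, running_mean

def advantage_weighted_nll_loss(logprobs, rewards, running_mean, beta=0.9):
    """Cross-entropy (NLL) of each sampled response, weighted by its augmented
    advantage; no value model is involved.

    logprobs: (G,) summed token log-probabilities of each sampled response
    """
    adv, running_mean = augmented_advantages(rewards, running_mean, beta)
    loss = -(adv.detach() * logprobs).mean()
    return loss, running_mean

# Toy usage: G = 4 sampled responses for one prompt.
rewards = torch.tensor([1.0, 0.0, 1.0, 1.0])
logprobs = torch.tensor([-12.3, -15.1, -11.8, -13.0], requires_grad=True)
loss, running_mean = advantage_weighted_nll_loss(logprobs, rewards, running_mean=0.5)
loss.backward()
```

The point the sketch tries to convey is that the advantage weight replaces a learned value baseline entirely, and the momentum term keeps that weight informative even when the within-group reward spread collapses.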