Reinforcement learning (RL) is being used to enhance the reasoning capabilities of large language models (LLMs) in scenarios where supervised fine-tuning falls short.
Group Relative Policy Optimization (GRPO) is an RL-based post-training method that stands out by eliminating the dependency on a value model, which simplifies training compared with traditional approaches such as Proximal Policy Optimization (PPO).
Existing group relative advantage estimation methods face training inefficiencies, especially when the estimated advantage approaches zero. To overcome this, a new RL algorithm called Advantage-Augmented Policy Optimization (AAPO) is proposed.
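To make the failure mode concrete, recall the standard group relative advantage estimator used by GRPO, where $r_1, \dots, r_G$ are the rewards of the $G$ responses sampled for a single prompt:
\[
\hat{A}_i = \frac{r_i - \operatorname{mean}(\{r_1, \dots, r_G\})}{\operatorname{std}(\{r_1, \dots, r_G\})}, \qquad i = 1, \dots, G.
\]
When every response in a group receives (nearly) the same reward, e.g., all answers are correct or all are incorrect, every $\hat{A}_i$ is close to zero, so the group contributes almost no gradient signal and the sampled rollouts are effectively wasted.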
AAPO optimizes the cross-entropy loss using advantages enhanced through a momentum-based estimation scheme, effectively addressing the inefficiencies of group relative advantage estimation. Experimental results on several mathematical reasoning benchmarks demonstrate the superior performance of AAPO.
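As a rough sketch of the general idea only, the snippet below combines a GRPO-style group relative advantage with a simple momentum (exponential moving average) reward baseline and uses the result to weight a cross-entropy (negative log-likelihood) loss over sampled responses. The function names, the `beta` coefficient, and the EMA baseline are illustrative assumptions, not the exact formulation from the paper.

```python
import torch

def augmented_advantages(rewards, running_mean, beta=0.9, eps=1e-4):
    """Group relative advantages augmented with a momentum-based term (illustrative).

    rewards:      (G,) scalar rewards for the G responses sampled for one prompt
    running_mean: EMA estimate of past rewards -- an assumed baseline, not from the paper
    beta:         momentum coefficient -- assumed hyperparameter
    """
    # GRPO-style group relative advantage: z-score within the sampled group.
    group_adv = (rewards - rewards.mean()) / (rewards.std() + eps)
    # Momentum-based augmentation: keep a gradient signal even when all rewards
    # in the group are similar and the group advantage collapses toward zero.
    running_mean = beta * running_mean + (1.0 - beta) * rewards.mean()
    augmented = group_adv + (rewards - running_mean)
    return augmented, running_mean

def advantage_weighted_nll_loss(logprobs, rewards, running_mean, beta=0.9):
    """Cross-entropy (NLL) of each sampled response, weighted by its augmented
    advantage; no value model is involved.

    logprobs: (G,) summed token log-probabilities of each sampled response
    """
    adv, running_mean = augmented_advantages(rewards, running_mean, beta)
    loss = -(adv.detach() * logprobs).mean()
    return loss, running_mean

# Toy usage: G = 4 sampled responses for one prompt.
rewards = torch.tensor([1.0, 0.0, 1.0, 1.0])
logprobs = torch.tensor([-12.3, -15.1, -11.8, -13.0], requires_grad=True)
loss, running_mean = advantage_weighted_nll_loss(logprobs, rewards, running_mean=0.5)
loss.backward()
```

The point the sketch tries to convey is that the advantage weight replaces a learned value baseline entirely, and the momentum term keeps that weight informative even when the within-group reward spread collapses.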