Source: arXiv

AAPO: Enhance the Reasoning Capabilities of LLMs with Advantage Momentum

  • Reinforcement learning (RL) is being used to enhance the reasoning capabilities of large language models (LLMs) in scenarios where supervised fine-tuning falls short.
  • Group Relative Policy Optimization (GRPO) is an RL-based post-training method notable for removing the dependency on a value model, which simplifies training compared with traditional approaches such as Proximal Policy Optimization (PPO).
  • Existing group relative advantage estimation methods face training inefficiencies, especially when the estimated advantage approaches zero. To overcome this, a new RL algorithm called Advantage-Augmented Policy Optimization (AAPO) is proposed.
  • AAPO optimizes the cross-entropy loss using advantages enhanced through a momentum-based estimation scheme, which addresses the inefficiencies of group relative advantage estimation. In experiments on several mathematical reasoning benchmarks, the authors report superior performance for AAPO.
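
To make the zero-advantage problem concrete, here is a minimal Python sketch of group relative advantage estimation as it is commonly described for GRPO: each sampled response's reward is standardized against the other responses in its group. The function name, the `eps` stabilizer, and the use of the population standard deviation are assumptions for illustration. The sketch does not implement AAPO's momentum-based scheme, because the summary above gives no formula for it.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize each reward against its own group (GRPO-style).

    `eps` is an assumed stabilizer that avoids division by zero; using the
    population standard deviation is also an assumption of this sketch.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# A group with mixed outcomes gives non-zero advantages, so there is a
# learning signal for the policy update.
mixed = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# A group where every response earns the same reward gives all-zero
# advantages, so these samples contribute no gradient.
uniform = group_relative_advantages([1.0, 1.0, 1.0, 1.0])

print(mixed)
print(uniform)
```

When every response in a group is equally right or equally wrong, each advantage collapses to zero and the group is effectively wasted for training. This is the inefficiency the summary says AAPO targets.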
