Arxiv

2d

read

233

img
dot

Image Credit: Arxiv

Accelerating RL for LLM Reasoning with Optimal Advantage Regression

  • Reinforcement learning (RL) is being used to fine-tune large language models (LLMs) to enhance reasoning abilities.
  • A new two-stage policy optimization framework called $A^*$-PO is introduced to efficiently train LLMs for reasoning tasks.
  • The $A^*$-PO framework approximates the optimal advantage function and eliminates the need for costly online value estimation.
  • $A^*$-PO achieves competitive performance on mathematical reasoning benchmarks, reduces training time by up to 2$\times$, and decreases peak memory usage by over 30% compared to existing methods.
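
The summary does not spell out the algorithm, so the following is only a rough sketch of how a two-stage scheme of this kind could look. It assumes the standard KL-regularized RL setting, in which the optimal value of a prompt is $V^*(x) = \beta \log \mathbb{E}_{y \sim \pi_{\mathrm{ref}}}[\exp(r(x,y)/\beta)]$ and the optimal advantage is $A^*(x,y) = r(x,y) - V^*(x)$. The function names, the coefficient `beta`, and the squared-error loss are illustrative assumptions, not details taken from the article.

```python
import math

def estimate_optimal_value(ref_rewards, beta):
    """Stage 1 (offline, assumed): estimate V*(x) from rewards of responses
    sampled from a fixed reference policy for a single prompt.

    Computes beta * log(mean(exp(r / beta))) with a max-shift for stability.
    """
    scaled = [r / beta for r in ref_rewards]
    m = max(scaled)
    mean_exp = sum(math.exp(s - m) for s in scaled) / len(scaled)
    return beta * (m + math.log(mean_exp))

def advantage_regression_loss(logp_policy, logp_ref, reward, v_star, beta):
    """Stage 2 (online, assumed): regress the policy's implicit advantage,
    beta * log(pi / pi_ref), onto the target r - V*(x).

    V*(x) comes from stage 1, so no value network is trained online.
    """
    implicit_advantage = beta * (logp_policy - logp_ref)
    target_advantage = reward - v_star
    return (implicit_advantage - target_advantage) ** 2

# Hypothetical usage: four reference samples for one prompt with 0/1 rewards.
v_star = estimate_optimal_value([0.0, 1.0, 0.0, 1.0], beta=0.1)
loss = advantage_regression_loss(
    logp_policy=-12.0, logp_ref=-12.5, reward=1.0, v_star=v_star, beta=0.1
)
```

In this sketch the per-prompt value estimate is computed once before training, which is one way the need for online value estimation could be avoided.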
