Reinforcement learning (RL) is being used to fine-tune large language models (LLMs) to enhance reasoning abilities.
A new two-stage policy optimization framework called $A^*$-PO is introduced to efficiently train LLMs for reasoning tasks.
The $A^*$-PO framework approximates the optimal advantage function and eliminates the need for costly online value estimation.
$A^*$-PO achieves competitive performance on mathematical reasoning benchmarks, reduces training time by up to 2$\times$, and decreases peak memory usage by over 30% compared to existing methods.
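To make the advantage-approximation idea more concrete, here is a minimal sketch rather than the paper's reference implementation. It assumes the optimal value $V^*(x)$ for each prompt has already been estimated offline (the first stage), and that the second stage regresses the KL-scaled policy log-ratio onto the resulting advantage estimate; the function name `a_star_po_loss`, the coefficient `beta`, and the exact loss form are illustrative assumptions.

```python
import torch


def a_star_po_loss(logp_policy, logp_ref, reward, v_star, beta=0.1):
    """Sketch of a least-squares surrogate using an offline V* estimate.

    logp_policy / logp_ref: sequence log-probabilities under the current
    and reference policies; reward: scalar outcome reward per prompt;
    v_star: per-prompt estimate of the optimal value V*(x) computed offline.
    """
    advantage = reward - v_star          # approximate optimal advantage
    log_ratio = logp_policy - logp_ref   # log pi(y|x) - log pi_ref(y|x)
    # Regress the scaled log-ratio onto the advantage estimate, so no
    # value network has to be trained online during policy updates.
    return ((beta * log_ratio - advantage) ** 2).mean()


# Toy usage with dummy tensors (no critic or online value model involved).
logp_policy = torch.tensor([-12.3, -15.1])
logp_ref = torch.tensor([-12.0, -14.8])
reward = torch.tensor([1.0, 0.0])
v_star = torch.tensor([0.7, 0.4])
loss = a_star_po_loss(logp_policy, logp_ref, reward, v_star)
```

Because the value estimate is fixed before policy training begins, each update only needs the policy and reference log-probabilities, which is consistent with the reported savings in training time and peak memory.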