Strong mathematical reasoning is a key capability for Large Language Models (LLMs).
A new paper introduces a practical training recipe that combines Supervised Fine-Tuning (SFT) with Reinforcement Learning (RL) to maximize both accuracy and efficiency.
The methodology first extends SFT for up to 10 epochs to raise accuracy, then applies RL from online inference via Group Relative Policy Optimization (GRPO) to improve token efficiency without compromising performance.
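The GRPO step mentioned above scores a group of sampled solutions per problem and normalizes each reward against the group's statistics. Below is a minimal sketch of that group-relative advantage computation; the reward values and the per-token efficiency penalty are hypothetical illustrations, not the paper's actual reward design.

```python
# Sketch of GRPO's group-relative advantage computation
# (illustrative only; the paper's reward shaping may differ).
from statistics import mean, pstdev

def grpo_advantages(rewards):
    """Normalize each sampled response's reward against its group's
    mean and standard deviation, as GRPO does per prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards) or 1.0  # guard against a zero-variance group
    return [(r - mu) / sigma for r in rewards]

# Hypothetical rewards for 4 sampled solutions to one problem:
# correct answers score near 1.0, with a small deduction for longer
# outputs to encourage token efficiency; wrong answers score 0.0.
rewards = [1.0, 0.9, 0.0, 0.0]
print(grpo_advantages(rewards))
```

Because advantages are centered within each group, shorter correct solutions earn higher advantages than verbose correct ones, which is how the RL phase can trim tokens without sacrificing accuracy.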
Experiments demonstrate the effectiveness of this approach: the resulting models achieve top-tier performance on benchmarks such as the AI Mathematical Olympiad, and the recipe offers a blueprint for developing advanced mathematical reasoning models.