Recent advances in large language models (LLMs) have been driven by reinforcement learning (RL)-style post-training, which improves reasoning by optimizing model outputs against reward signals.
A new framework, Self-Explanation Policy Optimization (ExPO), has been introduced to address the limitations of this paradigm in refining the model's knowledge and in enabling exploration beyond its current output distribution.
ExPO generates positive samples by conditioning generation on the ground-truth answer, which enables efficient exploration and guides the model toward better reasoning trajectories.
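To make the mechanism concrete, the following is a minimal Python sketch of the idea as described above, not the authors' implementation: a positive sample is obtained by prompting the policy with the ground-truth answer to elicit a self-explanation, and that explanation is then used as a training trajectory for the ordinary question prompt. The `PolicyModel` class, its methods, and the prompt wording are hypothetical placeholders for a real LLM training stack.

```python
# Minimal sketch of the conditioning-on-ground-truth idea (not the authors' code).
# PolicyModel and its methods are hypothetical stand-ins for an actual LLM stack.

from dataclasses import dataclass
from typing import List


@dataclass
class Example:
    question: str
    answer: str  # ground-truth final answer


class PolicyModel:
    """Placeholder for the policy LLM; real code would wrap an actual model."""

    def generate(self, prompt: str) -> str:
        # Hypothetical sampling call; returns a reasoning trajectory as text.
        return f"[sampled reasoning for: {prompt[:40]}...]"

    def maximize_likelihood(self, prompt: str, completion: str) -> None:
        # Hypothetical gradient step that raises log p(completion | prompt).
        pass


def self_explanation_prompt(ex: Example) -> str:
    # Condition on the ground-truth answer to elicit an explanation of *why*
    # it is correct, rather than asking the model to solve the problem cold.
    return (
        f"Question: {ex.question}\n"
        f"The correct answer is: {ex.answer}\n"
        f"Explain step by step why this answer is correct."
    )


def expo_step(policy: PolicyModel, batch: List[Example]) -> None:
    for ex in batch:
        # 1) Generate a positive sample by conditioning on the ground truth.
        explanation = policy.generate(self_explanation_prompt(ex))

        # 2) Use the explanation as a positive reasoning trajectory for the
        #    ordinary question prompt, pulling the policy toward trajectories
        #    it is unlikely to sample on its own.
        question_prompt = f"Question: {ex.question}\nThink step by step."
        target = f"{explanation}\nAnswer: {ex.answer}"
        policy.maximize_likelihood(question_prompt, target)


if __name__ == "__main__":
    expo_step(PolicyModel(), [Example("What is 17 * 24?", "408")])
```

The design point this sketch illustrates is that the positive sample comes from the policy's own answer-conditioned generation rather than from rejection sampling or external expert demonstrations, so useful training signal is available even on problems the model cannot yet solve unaided.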
Experiments demonstrate that ExPO outperforms expert-demonstration-based methods in challenging settings, improving both learning efficiency and final performance on reasoning benchmarks.