Recent advances in large language models (LLMs) have been driven by reinforcement learning (RL)-style post-training, which improves reasoning by optimizing model outputs against reward signals.
A new framework, Self-Explanation Policy Optimization (ExPO), has been introduced to address the limitations of this paradigm in refining the model's knowledge and in enabling exploration beyond its current output distribution.
ExPO generates positive samples by conditioning generation on the ground-truth answer, which enables efficient exploration and guides the model toward better reasoning trajectories.
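To make the mechanism concrete, the following is a minimal Python sketch of the idea as described above, not the authors' implementation: a positive sample is obtained by prompting the policy with the ground-truth answer to elicit a self-explanation, and that explanation is then used as a training trajectory for the ordinary question prompt. The `PolicyModel` class, its methods, and the prompt wording are hypothetical placeholders for a real LLM training stack.

```python
# Minimal sketch of the conditioning-on-ground-truth idea (not the authors' code).
# PolicyModel and its methods are hypothetical stand-ins for an actual LLM stack.

from dataclasses import dataclass
from typing import List


@dataclass
class Example:
    question: str
    answer: str  # ground-truth final answer


class PolicyModel:
    """Placeholder for the policy LLM; real code would wrap an actual model."""

    def generate(self, prompt: str) -> str:
        # Hypothetical sampling call; returns a reasoning trajectory as text.
        return f"[sampled reasoning for: {prompt[:40]}...]"

    def maximize_likelihood(self, prompt: str, completion: str) -> None:
        # Hypothetical gradient step that raises log p(completion | prompt).
        pass


def self_explanation_prompt(ex: Example) -> str:
    # Condition on the ground-truth answer to elicit an explanation of *why*
    # it is correct, rather than asking the model to solve the problem cold.
    return (
        f"Question: {ex.question}\n"
        f"The correct answer is: {ex.answer}\n"
        f"Explain step by step why this answer is correct."
    )


def expo_step(policy: PolicyModel, batch: List[Example]) -> None:
    for ex in batch:
        # 1) Generate a positive sample by conditioning on the ground truth.
        explanation = policy.generate(self_explanation_prompt(ex))

        # 2) Use the explanation as a positive reasoning trajectory for the
        #    ordinary question prompt, pulling the policy toward trajectories
        #    it is unlikely to sample on its own.
        question_prompt = f"Question: {ex.question}\nThink step by step."
        target = f"{explanation}\nAnswer: {ex.answer}"
        policy.maximize_likelihood(question_prompt, target)


if __name__ == "__main__":
    expo_step(PolicyModel(), [Example("What is 17 * 24?", "408")])
```

The design point this sketch illustrates is that the positive sample comes from the policy's own answer-conditioned generation rather than from rejection sampling or external expert demonstrations, so useful training signal is available even on problems the model cannot yet solve unaided.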
Experiments demonstrate that ExPO outperforms expert-demonstration-based methods in challenging settings, improving both learning efficiency and final performance on reasoning benchmarks.