Recent studies have explored self-training methods for improving the reasoning capabilities of Large Language Models (LLMs) using pseudo-labels generated by the models themselves.
Existing confidence-based self-training fine-tunes LLMs to prefer reasoning paths that lead to high-confidence answers, where confidence is estimated by majority voting over the final answers of sampled reasoning paths.
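As a minimal illustration of this answer-level confidence, the sketch below computes the vote share of the most frequent final answer across sampled reasoning paths; the function and variable names are illustrative choices, not taken from any specific implementation.

```python
from collections import Counter

def answer_confidence(sampled_answers: list[str]) -> tuple[str, float]:
    """Estimate answer-level confidence by majority voting.

    `sampled_answers` holds the final answers extracted from independently
    sampled reasoning paths for the same question; the confidence of the
    majority answer is its vote share.
    """
    counts = Counter(sampled_answers)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(sampled_answers)

# Example: 5 sampled reasoning paths, 4 of which end in "42".
print(answer_confidence(["42", "42", "17", "42", "42"]))  # ("42", 0.8)
```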
The proposed self-training method, CORE-PO, instead uses reasoning-level confidence to identify high-quality reasoning paths, improving accuracy on various benchmarks.
CORE-PO fine-tunes LLMs through policy optimization to prefer high-confidence reasoning paths, outperforming existing self-training methods.
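To make this concrete, the sketch below shows one plausible way to aggregate per-step confidence scores into a reasoning-level score and pair the most- and least-confident sampled paths as preference data for a policy optimizer such as DPO; the aggregation rule, pairing scheme, and all names are illustrative assumptions, not the paper's exact procedure.

```python
from dataclasses import dataclass

@dataclass
class ReasoningPath:
    text: str                       # chain of thought plus final answer
    step_confidences: list[float]   # one confidence score per reasoning step

def path_confidence(path: ReasoningPath) -> float:
    # Aggregate per-step confidences into a single reasoning-level score;
    # a simple mean is assumed here for illustration.
    return sum(path.step_confidences) / len(path.step_confidences)

def build_preference_pair(paths: list[ReasoningPath]) -> tuple[str, str] | None:
    # Rank sampled paths for one question by reasoning-level confidence and
    # return (preferred, dispreferred) = (most confident, least confident),
    # suitable as training input for a preference-based policy optimizer.
    if len(paths) < 2:
        return None
    ranked = sorted(paths, key=path_confidence, reverse=True)
    return ranked[0].text, ranked[-1].text
```

Other instantiations of "prefer high-confidence reasoning paths" (e.g., reward-weighted fine-tuning or on-policy RL with the confidence score as reward) follow the same pattern of turning reasoning-level confidence into a training signal.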