Recent studies have explored self-training methods for improving the reasoning capabilities of Large Language Models (LLMs) using pseudo-labels generated by the models themselves.
Existing confidence-based self-training fine-tunes LLMs to prefer reasoning paths that lead to high-confidence answers, where confidence is estimated by majority voting over the final answers of sampled reasoning paths.
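As a minimal illustration of this answer-level confidence, the sketch below computes the vote share of the most frequent final answer across sampled reasoning paths; the function and variable names are illustrative choices, not taken from any specific implementation.

```python
from collections import Counter

def answer_confidence(sampled_answers: list[str]) -> tuple[str, float]:
    """Estimate answer-level confidence by majority voting.

    `sampled_answers` holds the final answers extracted from independently
    sampled reasoning paths for the same question; the confidence of the
    majority answer is its vote share.
    """
    counts = Counter(sampled_answers)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(sampled_answers)

# Example: 5 sampled reasoning paths, 4 of which end in "42".
print(answer_confidence(["42", "42", "17", "42", "42"]))  # ("42", 0.8)
```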
The proposed self-training method, CORE-PO, instead uses reasoning-level confidence to identify high-quality reasoning paths, improving accuracy on various benchmarks.
CORE-PO fine-tunes LLMs through policy optimization to prefer high-confidence reasoning paths, outperforming existing self-training methods.
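To make this concrete, the sketch below shows one plausible way to aggregate per-step confidence scores into a reasoning-level score and pair the most- and least-confident sampled paths as preference data for a policy optimizer such as DPO; the aggregation rule, pairing scheme, and all names are illustrative assumptions, not the paper's exact procedure.

```python
from dataclasses import dataclass

@dataclass
class ReasoningPath:
    text: str                       # chain of thought plus final answer
    step_confidences: list[float]   # one confidence score per reasoning step

def path_confidence(path: ReasoningPath) -> float:
    # Aggregate per-step confidences into a single reasoning-level score;
    # a simple mean is assumed here for illustration.
    return sum(path.step_confidences) / len(path.step_confidences)

def build_preference_pair(paths: list[ReasoningPath]) -> tuple[str, str] | None:
    # Rank sampled paths for one question by reasoning-level confidence and
    # return (preferred, dispreferred) = (most confident, least confident),
    # suitable as training input for a preference-based policy optimizer.
    if len(paths) < 2:
        return None
    ranked = sorted(paths, key=path_confidence, reverse=True)
    return ranked[0].text, ranked[-1].text
```

Other instantiations of "prefer high-confidence reasoning paths" (e.g., reward-weighted fine-tuning or on-policy RL with the confidence score as reward) follow the same pattern of turning reasoning-level confidence into a training signal.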