Researchers introduce the LPPO framework to enhance the reasoning capabilities of large language models through progressive optimization.
The framework leverages a small set of high-quality expert demonstrations through two mechanisms: prefix-guided sampling and learning-progress weighting.
Prefix-guided sampling augments the training data with partial solution prefixes drawn from expert demonstrations, so the policy continues solutions from states it could not yet reach on its own.
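A minimal sketch of the idea, assuming a `generate_fn(prompt) -> str` sampler standing in for the policy and step-segmented demonstrations; the names `generate_fn` and `expert_steps` are illustrative, not the paper's API:

```python
import random

def prefix_guided_sample(generate_fn, problem, expert_steps, rng=random):
    """Build a rollout that starts from a partial expert-solution prefix.

    `generate_fn(prompt) -> str` stands in for the policy's sampler and
    `expert_steps` is a non-empty list of solution steps from one expert
    demonstration; both are illustrative assumptions, not the paper's API.
    """
    # Reveal a random-length prefix (possibly empty, never the full solution,
    # so plain on-policy sampling remains part of the mix and the policy
    # always generates the remainder itself).
    k = rng.randint(0, len(expert_steps) - 1)
    prefix = "".join(step + "\n" for step in expert_steps[:k])
    completion = generate_fn(problem + "\n" + prefix)
    return prefix, completion

# Toy usage with a dummy sampler standing in for the LLM.
steps = ["Let x be the unknown.", "Then 2x + 3 = 11.", "So x = 4."]
prefix, completion = prefix_guided_sample(
    lambda prompt: "<model continuation>", "Solve 2x + 3 = 11.", steps
)
print(repr(prefix), completion)
```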
Learning-progress weighting adjusts each sample's influence on training according to how quickly the model is improving on it, which the authors report yields faster convergence and stronger performance on mathematical-reasoning benchmarks.
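One plausible way to realize such weighting is to use the change in per-problem success rate between recent training windows as the progress signal; the exact measure below is an assumption for illustration, not the paper's formula:

```python
import numpy as np

def learning_progress_weights(success_histories, window=4):
    """Compute per-sample training weights from learning progress.

    `success_histories` has shape (num_samples, num_epochs) and holds
    per-epoch success rates. Progress is the absolute change in success
    rate between the two most recent windows; weights are normalized so
    the overall loss scale is unchanged. The progress measure is an
    illustrative assumption.
    """
    h = np.asarray(success_histories, dtype=float)
    recent = h[:, -window:].mean(axis=1)
    earlier = h[:, -2 * window:-window].mean(axis=1)
    progress = np.abs(recent - earlier)    # learning-progress signal
    weights = progress + 1e-3              # keep every sample in play
    return weights * (len(weights) / weights.sum())

# Example: three problems with different learning dynamics.
histories = [
    [0.0, 0.0, 0.0, 0.0, 0.1, 0.3, 0.5, 0.7],  # improving fast -> high weight
    [0.9, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9],  # mastered       -> low weight
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],  # stagnant       -> low weight
]
print(learning_progress_weights(histories))
```

Up-weighting samples where performance is actively changing concentrates gradient signal on problems at the frontier of the model's ability, while mastered or stagnant samples contribute less.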