Researchers have introduced Segment Policy Optimization (SPO) to enhance the reasoning capabilities of large language models through reinforcement learning.
SPO offers more precise credit assignment than trajectory-level methods while requiring fewer estimation points than token-level methods, enabling accurate, Monte Carlo-based advantage estimation without a critic model.
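To make the critic-free, Monte Carlo flavor of this estimation concrete, here is a minimal Python sketch. The names `rollout_fn` and `segment_prefixes`, and the difference-of-boundary-values formula, are illustrative assumptions rather than the authors' exact procedure: each segment boundary's value is approximated by averaging the terminal rewards of sampled completions, and a segment's advantage is the change in value across it.

```python
from typing import Callable, List

def mc_segment_values(
    rollout_fn: Callable[[str], float],   # samples a completion from the policy and returns its reward
    segment_prefixes: List[str],          # prompt + response prefix up to each segment boundary
    num_rollouts: int = 8,
) -> List[float]:
    """Estimate the value at each segment boundary by Monte Carlo:
    average the terminal rewards of rollouts started from that prefix,
    with no learned critic involved."""
    values = []
    for prefix in segment_prefixes:
        returns = [rollout_fn(prefix) for _ in range(num_rollouts)]
        values.append(sum(returns) / len(returns))
    return values

def segment_advantages(values: List[float]) -> List[float]:
    """Advantage of segment k = estimated value after the segment minus value before it."""
    return [values[k + 1] - values[k] for k in range(len(values) - 1)]
```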
SPO comprises three components, each with novel strategies: flexible segment partition, accurate segment advantage estimation, and policy optimization using segment advantages, the last of which includes a probability-mask strategy.
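A minimal sketch of how segment advantages and a token-level mask might enter a policy-gradient loss is given below. The function and tensor names are hypothetical, and the simple 0/1 mask is a stand-in for the paper's probability-mask strategy rather than a faithful reproduction of it.

```python
import torch

def spo_style_loss(
    logprobs: torch.Tensor,     # (T,) log-probs of sampled tokens under the current policy
    segment_ids: torch.Tensor,  # (T,) index of the segment each token belongs to
    segment_adv: torch.Tensor,  # (S,) one advantage estimate per segment
    prob_mask: torch.Tensor,    # (T,) 1.0 keeps a token in the update, 0.0 masks it out
) -> torch.Tensor:
    """Policy-gradient loss that broadcasts each segment's advantage to its tokens
    and applies a mask so that only selected tokens receive gradient."""
    token_adv = segment_adv[segment_ids]        # broadcast segment advantage to its tokens
    masked = prob_mask * token_adv * logprobs   # masked REINFORCE-style objective
    return -masked.sum() / prob_mask.sum().clamp(min=1.0)
```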
SPO has been instantiated for both short chain-of-thought (CoT) and long CoT scenarios, achieving significant accuracy improvements over existing methods on benchmarks such as GSM8K and MATH500.