Process Reinforcement Learning (PRL) has shown potential in enhancing the reasoning abilities of Large Language Models (LLMs).
This work proposes Self-Guided Process Reward Optimization (SPRO), a novel framework that enables process-aware RL through two key innovations.
SPRO outperforms vanilla GRPO, achieving higher training efficiency and larger test-accuracy gains without incurring additional computational overhead.
Experimental results further show that SPRO maintains stable policy entropy, shortens response lengths, and prevents reward hacking, demonstrating its suitability for industrial deployment.
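To make the contrast with the vanilla GRPO baseline concrete, the sketch below illustrates the difference between an outcome-level, group-normalized advantage (as in GRPO) and a generic step-wise, process-level advantage computed over per-step rewards. This is only a minimal illustration under simplifying assumptions: the function names, the per-step reward inputs, and the rule of excluding shorter responses from later-step statistics are hypothetical and are not taken from the SPRO paper.

```python
import numpy as np

def grpo_outcome_advantage(rewards: np.ndarray) -> np.ndarray:
    """Vanilla GRPO-style advantage: one scalar reward per sampled response,
    normalized within the group of responses sharing the same prompt."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

def stepwise_process_advantage(step_rewards: list) -> list:
    """Hypothetical process-aware variant: each response carries a per-step
    reward sequence; the advantage at step t is the response's cumulative
    process reward minus the group mean at that step, where only responses
    long enough to reach step t contribute to the group statistic."""
    cum = [np.cumsum(r) for r in step_rewards]  # cumulative process reward per response
    advs = []
    for c in cum:
        adv = np.empty(len(c))
        for t in range(len(c)):
            group_t = np.array([g[t] for g in cum if len(g) > t])  # peers reaching step t
            adv[t] = c[t] - group_t.mean()
        advs.append(adv)
    return advs

# Four sampled responses to the same prompt, with illustrative per-step rewards.
steps = [np.array([0.1, 0.3, 0.5]), np.array([0.0, 0.2]),
         np.array([0.2, 0.2, 0.4]), np.array([0.1])]
print(grpo_outcome_advantage(np.array([s.sum() for s in steps])))  # one advantage per response
print(stepwise_process_advantage(steps))                           # one advantage per step
```

The point of the comparison is that the outcome-level signal assigns a single advantage to an entire response, whereas a process-level signal can credit or penalize individual reasoning steps within the same shared-prompt sampling group.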