Process Reinforcement Learning (PRL) has shown potential in enhancing the reasoning abilities of Large Language Models (LLMs).
This work proposes Self-Guided Process Reward Optimization (SPRO), a novel framework that enables process-aware RL through two key innovations.
SPRO outperforms vanilla GRPO, achieving higher training efficiency and larger test-accuracy gains without incurring additional computational overhead.
Experimental results further show that SPRO maintains stable policy entropy, shortens response lengths, and prevents reward hacking, demonstrating its suitability for industrial deployment.
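To make the contrast with the vanilla GRPO baseline concrete, the sketch below illustrates the difference between an outcome-level, group-normalized advantage (as in GRPO) and a generic step-wise, process-level advantage computed over per-step rewards. This is only a minimal illustration under simplifying assumptions: the function names, the per-step reward inputs, and the rule of excluding shorter responses from later-step statistics are hypothetical and are not taken from the SPRO paper.

```python
import numpy as np

def grpo_outcome_advantage(rewards: np.ndarray) -> np.ndarray:
    """Vanilla GRPO-style advantage: one scalar reward per sampled response,
    normalized within the group of responses sharing the same prompt."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

def stepwise_process_advantage(step_rewards: list) -> list:
    """Hypothetical process-aware variant: each response carries a per-step
    reward sequence; the advantage at step t is the response's cumulative
    process reward minus the group mean at that step, where only responses
    long enough to reach step t contribute to the group statistic."""
    cum = [np.cumsum(r) for r in step_rewards]  # cumulative process reward per response
    advs = []
    for c in cum:
        adv = np.empty(len(c))
        for t in range(len(c)):
            group_t = np.array([g[t] for g in cum if len(g) > t])  # peers reaching step t
            adv[t] = c[t] - group_t.mean()
        advs.append(adv)
    return advs

# Four sampled responses to the same prompt, with illustrative per-step rewards.
steps = [np.array([0.1, 0.3, 0.5]), np.array([0.0, 0.2]),
         np.array([0.2, 0.2, 0.4]), np.array([0.1])]
print(grpo_outcome_advantage(np.array([s.sum() for s in steps])))  # one advantage per response
print(stepwise_process_advantage(steps))                           # one advantage per step
```

The point of the comparison is that the outcome-level signal assigns a single advantage to an entire response, whereas a process-level signal can credit or penalize individual reasoning steps within the same shared-prompt sampling group.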