A reward baseline is crucial for reducing variance in reinforcement learning algorithms, especially in language modeling.
Group Relative Policy Optimization (GRPO) computes advantages in language modeling by subtracting the mean reward of all outputs in a group from each output's reward.
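A minimal sketch of this group-mean baseline, with assumed function and parameter names (not taken from the paper); some GRPO implementations additionally normalize by the group's standard deviation, shown here as an option:

```python
import numpy as np

def grpo_advantages(group_rewards, normalize=True, eps=1e-8):
    """Advantages for one group of sampled outputs for the same prompt.

    group_rewards: 1-D array of scalar rewards, one per sampled output.
    """
    rewards = np.asarray(group_rewards, dtype=np.float64)
    baseline = rewards.mean()           # group-mean reward baseline
    advantages = rewards - baseline     # center each reward around the group mean
    if normalize:
        advantages = advantages / (rewards.std() + eps)
    return advantages

# Example: four sampled completions for one prompt
print(grpo_advantages([0.2, 0.9, 0.5, 0.4]))
```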
A new method, Kalman Filter Enhanced Group Relative Policy Optimization (KRPO), incorporates lightweight Kalman filtering to dynamically estimate the reward mean and variance, improving advantage estimation in noisy reward environments.
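A rough sketch of the idea, under assumed variable names and noise parameters (the paper's exact filter formulation may differ): a scalar Kalman filter tracks a latent reward mean across groups, and advantages are computed against that filtered estimate rather than the raw group mean.

```python
import numpy as np

class ScalarKalmanBaseline:
    """Tracks a latent reward mean with a one-dimensional Kalman filter."""

    def __init__(self, process_var=1e-2, obs_var=1.0):
        self.mean = 0.0                 # filtered estimate of the reward mean
        self.var = 1.0                  # uncertainty of that estimate
        self.process_var = process_var  # assumed drift of the true mean between groups
        self.obs_var = obs_var          # assumed noise of the observed group mean

    def update(self, observed_group_mean):
        # Predict: the latent mean may drift between updates.
        self.var += self.process_var
        # Update: blend the prediction with the newly observed group mean.
        gain = self.var / (self.var + self.obs_var)
        self.mean += gain * (observed_group_mean - self.mean)
        self.var *= (1.0 - gain)
        return self.mean

def krpo_style_advantages(group_rewards, kalman_baseline, eps=1e-8):
    # Hypothetical helper: subtract the filtered baseline instead of the raw group mean.
    rewards = np.asarray(group_rewards, dtype=np.float64)
    baseline = kalman_baseline.update(rewards.mean())
    return (rewards - baseline) / (rewards.std() + eps)

# Example usage across a few noisy reward groups for successive prompts
kf = ScalarKalmanBaseline()
for group in ([0.1, 0.4, 0.3], [0.8, 0.2, 0.5], [0.6, 0.9, 0.7]):
    print(krpo_style_advantages(group, kf))
```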
KRPO improves the stability and performance of GRPO without additional learned parameters, providing a simple yet effective way to optimize policies under dynamic reward signals.