techminis

A naukri.com initiative

Source: Arxiv

Kalman Filter Enhanced GRPO for Reinforcement Learning-Based Language Model Reasoning

  • A reward baseline is crucial for reducing variance in reinforcement learning algorithms, especially in language modeling.
  • Group Relative Policy Optimization (GRPO) computes the advantage of each sampled output by subtracting the mean reward of all outputs in its group.
  • Kalman Filter Enhanced Group Relative Policy Optimization (KRPO) replaces the raw group mean with a lightweight Kalman filter that dynamically estimates the reward mean and variance, improving advantage estimation in noisy reward environments.
  • KRPO improves the stability and performance of GRPO without adding any learned parameters, offering a simple yet effective way to optimize policies under dynamic reward signals.
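The contrast above can be sketched in a few lines: GRPO normalizes each reward against its group's statistics, while a KRPO-style baseline comes from a scalar Kalman filter tracking the reward mean. This is a minimal illustrative sketch, not the paper's implementation; the class and parameter names (`ScalarKalmanBaseline`, `process_var`, `obs_var`) are hypothetical, and the paper's exact update rule may differ.

```python
import numpy as np

def grpo_advantages(rewards):
    # GRPO: advantage = (reward - group mean) / group std
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

class ScalarKalmanBaseline:
    """Illustrative 1-D Kalman filter tracking the reward mean.

    Names and defaults are assumptions for this sketch, not the
    paper's exact formulation.
    """
    def __init__(self, process_var=1e-2, obs_var=1.0):
        self.mean = 0.0                 # estimate of the reward mean
        self.var = 1.0                  # uncertainty of that estimate
        self.process_var = process_var  # drift of the true mean per step
        self.obs_var = obs_var          # noise of a single observed reward

    def update(self, observed_reward):
        # Predict: uncertainty grows because the true mean may drift.
        self.var += self.process_var
        # Correct: standard scalar Kalman gain and state update.
        gain = self.var / (self.var + self.obs_var)
        self.mean += gain * (observed_reward - self.mean)
        self.var *= (1.0 - gain)
        return self.mean

def krpo_advantages(rewards, kf):
    # KRPO-style: baseline taken from the filter's running estimate
    # instead of the raw group mean.
    r = np.asarray(rewards, dtype=float)
    for x in r:
        kf.update(float(x))
    return (r - kf.mean) / (np.sqrt(kf.var + kf.obs_var) + 1e-8)
```

Because the filter carries state across groups, its baseline reacts smoothly when the reward distribution shifts, which is the "dynamic reward signal" setting the bullets describe.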
