The Perils of Optimizing Learned Reward Functions: Low Training Error Does Not Guarantee Low Regret

  • Reward learning in reinforcement learning aims to address the challenge of accurately specifying reward functions for a given task.
  • A learned reward model can have low error on the training data distribution yet still induce a policy with high regret, a failure termed an error-regret mismatch, caused mainly by distributional shift during policy optimization.
  • The study mathematically demonstrates that while a sufficiently low expected test error of the reward model guarantees low worst-case regret, for any fixed expected test error there exist realistic data distributions under which an error-regret mismatch can still occur.
  • Similar issues persist even under the policy regularization used in methods like RLHF, highlighting the need for better ways to learn reward models and to assess their quality (a toy sketch of the mismatch follows this list).
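
As a rough illustration of the error-regret mismatch described above, here is a minimal sketch using a hypothetical two-armed bandit. All numbers (true rewards, data coverage, learned reward values, and the regularization temperature) are made up for illustration and are not taken from the paper; the point is only that a reward model can look accurate under the data distribution while the policy that optimizes it performs poorly.

```python
import numpy as np

# Toy two-armed bandit (hypothetical numbers, not the paper's construction).
true_reward    = np.array([1.0, 0.0])    # arm 0 is genuinely better
data_dist      = np.array([0.99, 0.01])  # training data rarely covers arm 1
learned_reward = np.array([1.0, 2.0])    # accurate on arm 0, badly wrong on arm 1

# Expected test error under the data distribution is small ...
test_error = np.sum(data_dist * np.abs(learned_reward - true_reward))
print(f"expected test error: {test_error:.2f}")  # 0.02

# ... yet greedily optimizing the learned reward shifts all probability mass
# onto the poorly covered arm, so the resulting policy has maximal regret.
greedy_arm = int(np.argmax(learned_reward))
print(f"regret (unregularized): {true_reward.max() - true_reward[greedy_arm]:.2f}")  # 1.00

# A KL-regularized policy toward a uniform reference (RLHF-style) only dilutes
# the problem: unless regularization is very strong, most of the mass still
# lands on the bad arm.
beta = 0.5                                # hypothetical regularization temperature
ref_policy = np.array([0.5, 0.5])
kl_policy = ref_policy * np.exp(learned_reward / beta)
kl_policy /= kl_policy.sum()
regret_kl = true_reward.max() - float(np.dot(kl_policy, true_reward))
print(f"regret (KL-regularized, beta={beta}): {regret_kl:.2f}")  # ~0.88
```

In this toy example, raising beta pulls the policy back toward the reference, so regret can only be reduced to roughly that of the reference policy itself.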
