Reward learning in reinforcement learning addresses the challenge of accurately specifying a reward function for a given task.
A learned reward model can have low error on the data distribution yet still induce a policy with high regret, a phenomenon termed an error-regret mismatch, which arises mainly from distributional shift during policy optimization.
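In hedged notation (the symbols below are illustrative and may not match the study's), the two quantities at play can be written as:

```latex
% Illustrative definitions; R is the true reward, \hat{R} the learned reward model,
% D the evaluation (data) distribution, and J_R(\pi) the return of policy \pi under R.
\[
  \operatorname{err}_{D}(\hat{R})
    = \mathbb{E}_{(s,a)\sim D}\!\left[\,\bigl|\hat{R}(s,a) - R(s,a)\bigr|\,\right],
  \qquad
  \operatorname{Reg}_{R}(\hat{\pi})
    = \frac{\max_{\pi} J_{R}(\pi) - J_{R}(\hat{\pi})}{\max_{\pi} J_{R}(\pi) - \min_{\pi} J_{R}(\pi)},
\]
```

where \(\hat{\pi}\) is a policy optimized against \(\hat{R}\); an error-regret mismatch means \(\operatorname{err}_{D}(\hat{R})\) is small while \(\operatorname{Reg}_{R}(\hat{\pi})\) is large.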
The study proves that while a sufficiently low expected test error of the reward model does guarantee low worst-case regret, for any fixed expected test error there exist realistic data distributions under which an error-regret mismatch still occurs.
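As a minimal numerical sketch of this phenomenon (the rewards, data distribution, and one-step setup below are made-up assumptions, not the study's construction):

```python
import numpy as np

# Toy one-step setting with three actions (illustrative numbers only).
true_reward    = np.array([1.0, 0.8, 0.0])   # action 2 is actually worthless
learned_reward = np.array([1.0, 0.8, 2.0])   # reward model errs only on action 2

# Data distribution used to evaluate the reward model: action 2 is rarely sampled.
data_dist = np.array([0.50, 0.49, 0.01])

# Expected test error of the reward model under the data distribution (low: 0.02).
expected_error = float(np.sum(data_dist * np.abs(learned_reward - true_reward)))

# Unregularized optimization against the learned reward picks exactly the action
# on which the model is wrong -- the distributional shift behind the mismatch.
greedy_action = int(np.argmax(learned_reward))

# Normalized regret: fraction of the achievable true return that is lost.
regret = (true_reward.max() - true_reward[greedy_action]) / (
    true_reward.max() - true_reward.min())

print(f"expected test error: {expected_error:.3f}")  # 0.020
print(f"regret of greedy policy: {regret:.3f}")      # 1.000 (worst possible)
```

The model is accurate wherever the data distribution puts mass, yet optimization concentrates on the one action it gets wrong.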
Similar issues persist even under policy regularization techniques such as those used in RLHF, highlighting the need for better methods to learn reward models and to accurately assess their quality.
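A sketch of the regularized case, under the same illustrative setup: a KL penalty toward the data distribution stands in for RLHF-style regularization, and the closed-form softmax solution and all numbers are assumptions rather than the study's construction.

```python
import numpy as np

# Same toy setting as above.
true_reward    = np.array([1.0, 0.8, 0.0])
learned_reward = np.array([1.0, 0.8, 2.0])
pi_ref         = np.array([0.50, 0.49, 0.01])   # reference policy = data distribution

def kl_regularized_policy(rhat: np.ndarray, ref: np.ndarray, beta: float) -> np.ndarray:
    """Closed-form maximizer of E_pi[rhat] - beta * KL(pi || ref)."""
    weights = ref * np.exp(rhat / beta)
    return weights / weights.sum()

for beta in (0.1, 0.5):
    pi = kl_regularized_policy(learned_reward, pi_ref, beta)
    true_return = float(pi @ true_reward)
    regret = (true_reward.max() - true_return) / (true_reward.max() - true_reward.min())
    print(f"beta={beta}: mass on mis-rewarded action={pi[2]:.3f}, regret={regret:.3f}")
# beta=0.1: ~0.997 of the mass moves to the mis-rewarded action, regret ~0.998
# beta=0.5: regret drops to ~0.155, mainly because the policy stays near the reference
```

In this toy case, only strong regularization keeps regret low, and it does so by largely forgoing optimization against the learned reward.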