Reward learning in reinforcement learning addresses the challenge of accurately specifying a reward function for a given task.
A learned reward model can have low error on the data distribution yet still induce a policy with high regret, a phenomenon termed an error-regret mismatch, which arises mainly from distributional shift during policy optimization.
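In hedged notation (the symbols below are illustrative and may not match the study's), the two quantities at play can be written as:

```latex
% Illustrative definitions; R is the true reward, \hat{R} the learned reward model,
% D the evaluation (data) distribution, and J_R(\pi) the return of policy \pi under R.
\[
  \operatorname{err}_{D}(\hat{R})
    = \mathbb{E}_{(s,a)\sim D}\!\left[\,\bigl|\hat{R}(s,a) - R(s,a)\bigr|\,\right],
  \qquad
  \operatorname{Reg}_{R}(\hat{\pi})
    = \frac{\max_{\pi} J_{R}(\pi) - J_{R}(\hat{\pi})}{\max_{\pi} J_{R}(\pi) - \min_{\pi} J_{R}(\pi)},
\]
```

where \(\hat{\pi}\) is a policy optimized against \(\hat{R}\); an error-regret mismatch means \(\operatorname{err}_{D}(\hat{R})\) is small while \(\operatorname{Reg}_{R}(\hat{\pi})\) is large.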
The study proves that while a sufficiently low expected test error of the reward model does guarantee low worst-case regret, for any fixed expected test error there exist realistic data distributions under which an error-regret mismatch still occurs.
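As a minimal numerical sketch of this phenomenon (the rewards, data distribution, and one-step setup below are made-up assumptions, not the study's construction):

```python
import numpy as np

# Toy one-step setting with three actions (illustrative numbers only).
true_reward    = np.array([1.0, 0.8, 0.0])   # action 2 is actually worthless
learned_reward = np.array([1.0, 0.8, 2.0])   # reward model errs only on action 2

# Data distribution used to evaluate the reward model: action 2 is rarely sampled.
data_dist = np.array([0.50, 0.49, 0.01])

# Expected test error of the reward model under the data distribution (low: 0.02).
expected_error = float(np.sum(data_dist * np.abs(learned_reward - true_reward)))

# Unregularized optimization against the learned reward picks exactly the action
# on which the model is wrong -- the distributional shift behind the mismatch.
greedy_action = int(np.argmax(learned_reward))

# Normalized regret: fraction of the achievable true return that is lost.
regret = (true_reward.max() - true_reward[greedy_action]) / (
    true_reward.max() - true_reward.min())

print(f"expected test error: {expected_error:.3f}")  # 0.020
print(f"regret of greedy policy: {regret:.3f}")      # 1.000 (worst possible)
```

The model is accurate wherever the data distribution puts mass, yet optimization concentrates on the one action it gets wrong.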
Similar issues persist even under policy regularization techniques such as those used in RLHF, highlighting the need for better methods to learn reward models and to accurately assess their quality.
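A sketch of the regularized case, under the same illustrative setup: a KL penalty toward the data distribution stands in for RLHF-style regularization, and the closed-form softmax solution and all numbers are assumptions rather than the study's construction.

```python
import numpy as np

# Same toy setting as above.
true_reward    = np.array([1.0, 0.8, 0.0])
learned_reward = np.array([1.0, 0.8, 2.0])
pi_ref         = np.array([0.50, 0.49, 0.01])   # reference policy = data distribution

def kl_regularized_policy(rhat: np.ndarray, ref: np.ndarray, beta: float) -> np.ndarray:
    """Closed-form maximizer of E_pi[rhat] - beta * KL(pi || ref)."""
    weights = ref * np.exp(rhat / beta)
    return weights / weights.sum()

for beta in (0.1, 0.5):
    pi = kl_regularized_policy(learned_reward, pi_ref, beta)
    true_return = float(pi @ true_reward)
    regret = (true_reward.max() - true_return) / (true_reward.max() - true_reward.min())
    print(f"beta={beta}: mass on mis-rewarded action={pi[2]:.3f}, regret={regret:.3f}")
# beta=0.1: ~0.997 of the mass moves to the mis-rewarded action, regret ~0.998
# beta=0.5: regret drops to ~0.155, mainly because the policy stays near the reference
```

In this toy case, only strong regularization keeps regret low, and it does so by largely forgoing optimization against the learned reward.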