Image Credit: Arxiv
Learning Explainable Dense Reward Shapes via Bayesian Optimization

  • Current reinforcement learning from human feedback (RLHF) pipelines for large language model (LLM) alignment typically assign scalar rewards to sequences, using the final token as a surrogate indicator for the quality of the entire sequence.
  • This work proposes a reward-shaping function that leverages explainability methods such as SHAP and LIME to estimate per-token rewards from the reward model (a simplified attribution sketch follows this list).
  • The study uses a bilevel optimization framework that combines Bayesian Optimization with policy training to handle noise in the token-level reward estimates (see the outer-loop sketch after this list).
  • Experiments demonstrate that better-balanced token-level reward attribution improves performance and speeds convergence to the optimal policy.

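To make the per-token reward idea concrete, here is a minimal sketch of attributing a sequence-level scalar reward to individual tokens. It uses leave-one-out occlusion as a simplified stand-in for the SHAP/LIME estimators the paper uses; `toy_reward`, the `mask_token`, and the normalization scheme are illustrative assumptions, not the paper's implementation.

```python
from typing import Callable, List


def token_attributions(
    tokens: List[str],
    reward_model: Callable[[List[str]], float],
    mask_token: str = "[MASK]",
) -> List[float]:
    """Estimate per-token rewards by occluding one token at a time.

    Each token's raw contribution is the drop in the sequence reward when
    that token is masked out. The drops are then rescaled so the dense
    per-token rewards sum to the original scalar reward, keeping the
    shaped reward consistent with the reward model's sequence score.
    """
    base = reward_model(tokens)
    drops = []
    for i in range(len(tokens)):
        masked = tokens[:i] + [mask_token] + tokens[i + 1:]
        drops.append(base - reward_model(masked))
    total = sum(drops)
    if abs(total) < 1e-8:  # flat reward: fall back to a uniform split
        return [base / len(tokens)] * len(tokens)
    return [base * d / total for d in drops]


# Hypothetical toy reward model standing in for a learned RLHF reward model.
def toy_reward(tokens: List[str]) -> float:
    score = {"please": 1.0, "thanks": 1.0, "now": -0.5}
    return sum(score.get(t, 0.0) for t in tokens)


if __name__ == "__main__":
    seq = ["send", "it", "now", "please"]
    print(token_attributions(seq, toy_reward))
    # -> [0.0, 0.0, -0.5, 1.0]; the values sum to toy_reward(seq) == 0.5
```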
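And a hedged sketch of the bilevel loop: an outer Bayesian-optimization search (here via scikit-optimize's `gp_minimize`) proposes a reward-shaping weight, and each evaluation runs a short inner policy-training phase. `train_policy_and_evaluate` and the single mixing weight `alpha` are hypothetical stand-ins for the paper's inner RLHF training and its full shaping parameterization.

```python
# Requires: pip install scikit-optimize
from skopt import gp_minimize
from skopt.space import Real


def train_policy_and_evaluate(alpha: float) -> float:
    """Hypothetical inner loop: train a policy briefly with dense rewards
    mixed as r_t = alpha * token_reward_t + (1 - alpha) * terminal_reward,
    then return a validation score (higher is better). Faked here with a
    smooth function peaking near alpha = 0.7 so the sketch runs standalone.
    """
    return -(alpha - 0.7) ** 2


def objective(params):
    (alpha,) = params
    return -train_policy_and_evaluate(alpha)  # gp_minimize minimizes


result = gp_minimize(
    objective,
    dimensions=[Real(0.0, 1.0, name="alpha")],  # shaping mixture weight
    n_calls=15,  # each call pays for one (short) inner policy training
    random_state=0,
)
print("best alpha:", result.x[0], "best score:", -result.fun)
```

Because every outer-loop evaluation requires an inner training run, the Bayesian surrogate's sample efficiency is the point: it keeps the number of costly policy-training calls small while still handling the noise in the token-level reward estimates.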
Read Full Article
