Current reinforcement learning from human feedback (RLHF) pipelines for large language model (LLM) alignment typically assign scalar rewards to sequences, using the final token as a surrogate indicator for the quality of the entire sequence.
This work proposes a reward-shaping function that leverages explainability methods such as SHAP and LIME to estimate per-token rewards from the reward model.
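As a rough illustration of this idea, the sketch below attributes a reward model's sequence-level score to individual tokens with the SHAP library. This is a minimal sketch, not the paper's actual pipeline: the reward-model checkpoint name and the `sequence_reward` wrapper are placeholders chosen for the example.

```python
# Hypothetical sketch: distributing a sequence-level reward over tokens with SHAP.
# The checkpoint name below is a placeholder for any scalar-output reward model.
import torch
import shap
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_NAME = "OpenAssistant/reward-model-deberta-v3-base"  # placeholder reward model

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
reward_model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)
reward_model.eval()


def sequence_reward(texts):
    """Return one scalar reward per input string (the usual sequence-level RLHF reward)."""
    batch = tokenizer(list(texts), return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        scores = reward_model(**batch).logits[:, 0]
    return scores.numpy()


# SHAP's text masker perturbs tokens and fits additive attributions, so the
# sequence-level reward is decomposed into per-token contributions.
explainer = shap.Explainer(sequence_reward, shap.maskers.Text(tokenizer))
explanation = explainer(["The assistant's answer was concise and correct."])

tokens = explanation.data[0]
token_rewards = explanation.values[0]  # per-token contribution to the reward
for tok, r in zip(tokens, token_rewards):
    print(f"{tok!r}: {r:+.4f}")
```

These per-token attributions are what the proposed reward-shaping function would feed back into policy training in place of a single terminal reward.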
To handle the noise in these token-level reward estimates, the study uses a bilevel optimization framework that integrates Bayesian optimization with policy training.
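The sketch below shows one way such a bilevel loop could be organized, under the assumption that the Bayesian optimizer tunes a shaping coefficient in the outer level while policy training runs in the inner level. Here `gp_minimize` from scikit-optimize stands in for the Bayesian optimizer, and `train_policy_and_evaluate` is a hypothetical placeholder (simulated below) for the actual RLHF inner loop, e.g. PPO.

```python
# Hypothetical sketch of a bilevel loop: Bayesian optimization (outer level)
# over a shaping coefficient `alpha`, policy training (inner level) with the
# shaped reward. scikit-optimize's gp_minimize is used as a stand-in optimizer.
import numpy as np
from skopt import gp_minimize
from skopt.space import Real


def shaped_reward(seq_reward, token_rewards, alpha):
    """Illustrative shaping: mix the sequence-level reward with noisy
    token-level attributions (not necessarily the paper's exact form).
    alpha = 0 recovers the usual terminal reward; alpha = 1 relies fully
    on the per-token estimates (e.g., from SHAP/LIME)."""
    per_token = alpha * np.asarray(token_rewards, dtype=float)
    per_token[-1] += (1.0 - alpha) * seq_reward
    return per_token


def train_policy_and_evaluate(alpha, seed=0):
    """Placeholder for the inner loop: train a policy with shaped_reward and
    return a validation score (higher is better). Simulated with a toy curve
    plus noise so the example runs without an actual RL setup."""
    rng = np.random.default_rng(seed)
    return 1.0 - (alpha - 0.4) ** 2 + 0.05 * rng.normal()


# Outer level: Bayesian optimization over the shaping coefficient.
# gp_minimize minimizes, so the validation score is negated.
result = gp_minimize(
    func=lambda params: -train_policy_and_evaluate(params[0]),
    dimensions=[Real(0.0, 1.0, name="alpha")],
    n_calls=15,
    random_state=0,
)
print("best alpha:", result.x[0], "best validation score:", -result.fun)
```

In a real pipeline the inner loop would apply `shaped_reward` at every rollout step, and the outer loop would treat the resulting validation reward as the (noisy) objective for the Bayesian optimizer.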
Experiments demonstrate that a better balance of token-level reward attribution yields improved performance and faster convergence to the optimal policy.