LASeR (Learning to Adaptively Select Rewards) addresses the challenge of using multiple reward models efficiently when training large language models (LLMs).
It frames reward model selection as a multi-armed bandit problem, iteratively training the LLM with the reward model judged most suitable for each instance.
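To make the idea concrete, the snippet below is a minimal sketch of such a bandit-style selection loop, assuming a simple UCB1 strategy; LASeR's actual formulation and reward signal differ in detail, and the hooks `train_step` and `evaluate_gain` are hypothetical placeholders for the surrounding training code.

```python
import math

class UCB1RewardSelector:
    """Pick among candidate reward models with a UCB1 bandit.

    Each 'arm' is one reward model; the bandit feedback is a scalar
    in [0, 1] describing how useful that model's scores were for the
    latest LLM update (e.g., a validation gain).
    """

    def __init__(self, num_reward_models: int):
        self.counts = [0] * num_reward_models    # times each model was chosen
        self.values = [0.0] * num_reward_models  # running mean feedback per model

    def select(self) -> int:
        # Try every reward model once before applying the UCB rule.
        for arm, count in enumerate(self.counts):
            if count == 0:
                return arm
        total = sum(self.counts)
        # UCB1: exploit high average feedback, explore rarely used models.
        scores = [
            self.values[arm] + math.sqrt(2 * math.log(total) / self.counts[arm])
            for arm in range(len(self.counts))
        ]
        return max(range(len(scores)), key=scores.__getitem__)

    def update(self, arm: int, feedback: float) -> None:
        # Incrementally update the mean feedback for the chosen model.
        self.counts[arm] += 1
        self.values[arm] += (feedback - self.values[arm]) / self.counts[arm]


def training_loop(batches, reward_models, train_step, evaluate_gain):
    # Illustrative only: train_step and evaluate_gain stand in for the
    # optimizer and evaluation hooks of an actual LLM training pipeline.
    selector = UCB1RewardSelector(len(reward_models))
    for batch in batches:
        arm = selector.select()                # choose a reward model for this batch
        train_step(batch, reward_models[arm])  # update the LLM using its scores
        feedback = evaluate_gain(batch)        # scalar feedback in [0, 1]
        selector.update(arm, feedback)
```

The design choice is that the bandit learns online which reward model to trust as training progresses, instead of averaging all models' scores for every instance.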
LASeR improved LLM training on commonsense reasoning, math reasoning, and open-ended instruction-following tasks, yielding higher accuracy and faster training than using an ensemble of reward models.
The study also reported gains in training efficiency and improved performance on long-context generation tasks.