LASeR (Learning to Adaptively Select Rewards) addresses the challenge of using multiple reward models efficiently when training large language models (LLMs).
It frames reward model selection as a multi-armed bandit problem, iteratively training the LLM with the reward model judged most suitable for each instance.
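To make the idea concrete, the snippet below is a minimal sketch of such a bandit-style selection loop, assuming a simple UCB1 strategy; LASeR's actual formulation and reward signal differ in detail, and the hooks `train_step` and `evaluate_gain` are hypothetical placeholders for the surrounding training code.

```python
import math

class UCB1RewardSelector:
    """Pick among candidate reward models with a UCB1 bandit.

    Each 'arm' is one reward model; the bandit feedback is a scalar
    in [0, 1] describing how useful that model's scores were for the
    latest LLM update (e.g., a validation gain).
    """

    def __init__(self, num_reward_models: int):
        self.counts = [0] * num_reward_models    # times each model was chosen
        self.values = [0.0] * num_reward_models  # running mean feedback per model

    def select(self) -> int:
        # Try every reward model once before applying the UCB rule.
        for arm, count in enumerate(self.counts):
            if count == 0:
                return arm
        total = sum(self.counts)
        # UCB1: exploit high average feedback, explore rarely used models.
        scores = [
            self.values[arm] + math.sqrt(2 * math.log(total) / self.counts[arm])
            for arm in range(len(self.counts))
        ]
        return max(range(len(scores)), key=scores.__getitem__)

    def update(self, arm: int, feedback: float) -> None:
        # Incrementally update the mean feedback for the chosen model.
        self.counts[arm] += 1
        self.values[arm] += (feedback - self.values[arm]) / self.counts[arm]


def training_loop(batches, reward_models, train_step, evaluate_gain):
    # Illustrative only: train_step and evaluate_gain stand in for the
    # optimizer and evaluation hooks of an actual LLM training pipeline.
    selector = UCB1RewardSelector(len(reward_models))
    for batch in batches:
        arm = selector.select()                # choose a reward model for this batch
        train_step(batch, reward_models[arm])  # update the LLM using its scores
        feedback = evaluate_gain(batch)        # scalar feedback in [0, 1]
        selector.update(arm, feedback)
```

The design choice is that the bandit learns online which reward model to trust as training progresses, instead of averaging all models' scores for every instance.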
LASeR improved LLM training on commonsense reasoning, math reasoning, and open-ended instruction-following tasks, yielding higher accuracy and faster training than using an ensemble of reward models.
The study also reported gains in training efficiency and improved performance on long-context generation tasks.