Training reasoning language models (LMs) with reinforcement learning (RL) for one-hot correctness inherently relies on the LM's ability to explore and solve tasks on its own.
A new framework introduces Reinforcement-Learned Teachers (RLTs), which sidestep RL's exploration challenge: rather than being rewarded for solving problems from scratch, teachers are rewarded for how effective their outputs are for downstream distillation into student models.
RLTs are prompted with both a problem's question and its ground-truth solution, and are trained to produce detailed explanations tailored to student models, as sketched below.
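A minimal sketch of this setup in Python, assuming a HuggingFace-style student model: the prompt templates, the placeholder student name, and the exact reward formula are illustrative only, not the released implementation. The sketch scores the teacher's explanation by how likely the student finds the ground-truth solution after reading it; the paper's full reward also accounts for how natural the explanation itself is to the student, which is omitted here.

```python
# Hypothetical sketch of the RLT prompting and distillation-based reward.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Unlike standard RL reasoning setups, the teacher sees the solution too
# (illustrative prompt format, not the paper's exact template).
def teacher_prompt(question: str, solution: str) -> str:
    return (
        f"Question: {question}\n"
        f"Solution: {solution}\n"
        "Explain step by step how to arrive at this solution:"
    )

def distillation_reward(
    question: str,
    solution: str,
    explanation: str,
    student_name: str = "Qwen/Qwen2.5-1.5B-Instruct",  # placeholder student
) -> float:
    tok = AutoTokenizer.from_pretrained(student_name)
    student = AutoModelForCausalLM.from_pretrained(student_name)
    student.eval()

    # The student sees the question plus the teacher's explanation; we then
    # measure how likely it finds the ground-truth solution tokens.
    context = f"Question: {question}\nExplanation: {explanation}\nSolution: "
    ctx_ids = tok(context, return_tensors="pt").input_ids
    sol_ids = tok(solution, return_tensors="pt",
                  add_special_tokens=False).input_ids
    input_ids = torch.cat([ctx_ids, sol_ids], dim=1)

    with torch.no_grad():
        logits = student(input_ids).logits

    # Log-probability of each solution token given all preceding tokens.
    sol_logits = logits[0, ctx_ids.shape[1] - 1 : -1]
    log_probs = torch.log_softmax(sol_logits, dim=-1)
    token_lp = log_probs.gather(1, sol_ids[0].unsqueeze(1)).squeeze(1)

    # Dense reward: mean solution log-likelihood under the student.
    return token_lp.mean().item()
```

Because the reward is dense and conditioned on the known solution, the teacher never has to discover correct answers through exploration; it only has to learn which explanations make the solution easy for the student to reproduce.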
In practice, 7B RLTs yield students with higher downstream performance than existing distillation pipelines and can be applied effectively to out-of-distribution tasks, improving the efficiency of the RL reasoning framework.