Training reasoning language models (LMs) with reinforcement learning (RL) for one-hot correctness inherently relies on the LM's ability to explore and solve tasks on its own.
A new framework introduces Reinforcement-Learned Teachers (RLTs), which sidestep RL's exploration challenge: rather than being rewarded for solving problems from scratch, teachers are rewarded for how effective their outputs are for downstream distillation into student models.
RLTs are prompted with both a problem's question and its ground-truth solution, and are trained to produce detailed explanations tailored to student models, as sketched below.
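A minimal sketch of this setup in Python, assuming a HuggingFace-style student model: the prompt templates, the placeholder student name, and the exact reward formula are illustrative only, not the released implementation. The sketch scores the teacher's explanation by how likely the student finds the ground-truth solution after reading it; the paper's full reward also accounts for how natural the explanation itself is to the student, which is omitted here.

```python
# Hypothetical sketch of the RLT prompting and distillation-based reward.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Unlike standard RL reasoning setups, the teacher sees the solution too
# (illustrative prompt format, not the paper's exact template).
def teacher_prompt(question: str, solution: str) -> str:
    return (
        f"Question: {question}\n"
        f"Solution: {solution}\n"
        "Explain step by step how to arrive at this solution:"
    )

def distillation_reward(
    question: str,
    solution: str,
    explanation: str,
    student_name: str = "Qwen/Qwen2.5-1.5B-Instruct",  # placeholder student
) -> float:
    tok = AutoTokenizer.from_pretrained(student_name)
    student = AutoModelForCausalLM.from_pretrained(student_name)
    student.eval()

    # The student sees the question plus the teacher's explanation; we then
    # measure how likely it finds the ground-truth solution tokens.
    context = f"Question: {question}\nExplanation: {explanation}\nSolution: "
    ctx_ids = tok(context, return_tensors="pt").input_ids
    sol_ids = tok(solution, return_tensors="pt",
                  add_special_tokens=False).input_ids
    input_ids = torch.cat([ctx_ids, sol_ids], dim=1)

    with torch.no_grad():
        logits = student(input_ids).logits

    # Log-probability of each solution token given all preceding tokens.
    sol_logits = logits[0, ctx_ids.shape[1] - 1 : -1]
    log_probs = torch.log_softmax(sol_logits, dim=-1)
    token_lp = log_probs.gather(1, sol_ids[0].unsqueeze(1)).squeeze(1)

    # Dense reward: mean solution log-likelihood under the student.
    return token_lp.mean().item()
```

Because the reward is dense and conditioned on the known solution, the teacher never has to discover correct answers through exploration; it only has to learn which explanations make the solution easy for the student to reproduce.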
In practice, 7B RLTs yield students with higher downstream performance than existing distillation pipelines and can be applied effectively to out-of-distribution tasks, improving the efficiency of the RL reasoning framework.