Recent advances in model distillation show that reasoning traces generated by advanced reasoning models can efficiently transfer reasoning abilities to smaller student models.
A new framework called Reinforcement Distillation (REDI) is proposed to leverage both positive and negative reasoning traces to maximize LLM reasoning performance.
REDI is a two-stage process: Stage 1 learns from positive traces via Supervised Fine-Tuning (SFT), and Stage 2 refines the resulting model on both positive and negative traces using the REDI objective.
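To make the two-stage recipe concrete, the PyTorch sketch below shows one way the losses could be wired up. The exact form of the REDI objective is not given here, so the positive-minus-weighted-negative formulation and the `alpha` weight are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of the two-stage recipe, assuming the Stage 2 objective is a
# positive log-likelihood term minus a down-weighted negative log-likelihood
# penalty (the weight `alpha` is a hypothetical hyperparameter).
import torch
import torch.nn.functional as F


def token_logprob(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Mean per-token log-probability of `labels` (B, T) under `logits` (B, T, V)."""
    logp = F.log_softmax(logits, dim=-1)
    return logp.gather(-1, labels.unsqueeze(-1)).squeeze(-1).mean()


def stage1_sft_loss(pos_logits: torch.Tensor, pos_labels: torch.Tensor) -> torch.Tensor:
    """Stage 1: standard SFT, i.e. maximize likelihood of positive traces only."""
    return -token_logprob(pos_logits, pos_labels)


def stage2_redi_loss(pos_logits, pos_labels, neg_logits, neg_labels, alpha: float = 0.8):
    """Stage 2 (assumed form): reward positive traces and penalize negative
    traces, with the penalty scaled by `alpha` to temper the negative signal."""
    pos_lp = token_logprob(pos_logits, pos_labels)
    neg_lp = token_logprob(neg_logits, neg_labels)
    return -(pos_lp - alpha * neg_lp)
```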
Empirical evaluations demonstrate that REDI outperforms established preference-optimization methods such as DPO and SimPO in this distillation setting, establishing a new state of the art for offline post-training of 1.5B models with openly available data.
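For reference, the baseline DPO objective that REDI is compared against is shown below in its standard published form; note that it relies on a frozen reference model, unlike the reference-free Stage 2 sketch above. This is the generic DPO loss, not code from the REDI paper.

```python
# Standard DPO loss over sequence-level log-probs of the chosen (positive) and
# rejected (negative) traces under the policy and a frozen reference model.
import torch.nn.functional as F


def dpo_loss(policy_pos_lp, policy_neg_lp, ref_pos_lp, ref_neg_lp, beta: float = 0.1):
    """-log sigmoid(beta * [(pi_pos - ref_pos) - (pi_neg - ref_neg)]), averaged."""
    margin = (policy_pos_lp - ref_pos_lp) - (policy_neg_lp - ref_neg_lp)
    return -F.logsigmoid(beta * margin).mean()
```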