Recent advances in model distillation show that reasoning traces generated by advanced reasoning models can efficiently transfer reasoning abilities to smaller student models.
A new framework called Reinforcement Distillation (REDI) is proposed to leverage both positive and negative reasoning traces to maximize LLM reasoning performance.
REDI is a two-stage process: Stage 1 learns from positive traces via Supervised Fine-Tuning (SFT), and Stage 2 refines the resulting model on both positive and negative traces using the REDI objective.
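To make the two-stage recipe concrete, the PyTorch sketch below shows one way the losses could be wired up. The exact form of the REDI objective is not given here, so the positive-minus-weighted-negative formulation and the `alpha` weight are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of the two-stage recipe, assuming the Stage 2 objective is a
# positive log-likelihood term minus a down-weighted negative log-likelihood
# penalty (the weight `alpha` is a hypothetical hyperparameter).
import torch
import torch.nn.functional as F


def token_logprob(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Mean per-token log-probability of `labels` (B, T) under `logits` (B, T, V)."""
    logp = F.log_softmax(logits, dim=-1)
    return logp.gather(-1, labels.unsqueeze(-1)).squeeze(-1).mean()


def stage1_sft_loss(pos_logits: torch.Tensor, pos_labels: torch.Tensor) -> torch.Tensor:
    """Stage 1: standard SFT, i.e. maximize likelihood of positive traces only."""
    return -token_logprob(pos_logits, pos_labels)


def stage2_redi_loss(pos_logits, pos_labels, neg_logits, neg_labels, alpha: float = 0.8):
    """Stage 2 (assumed form): reward positive traces and penalize negative
    traces, with the penalty scaled by `alpha` to temper the negative signal."""
    pos_lp = token_logprob(pos_logits, pos_labels)
    neg_lp = token_logprob(neg_logits, neg_labels)
    return -(pos_lp - alpha * neg_lp)
```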
Empirical evaluations demonstrate that REDI outperforms established preference-optimization methods such as DPO and SimPO in this distillation setting, establishing a new state of the art for offline post-training of 1.5B models with openly available data.
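For reference, the baseline DPO objective that REDI is compared against is shown below in its standard published form; note that it relies on a frozen reference model, unlike the reference-free Stage 2 sketch above. This is the generic DPO loss, not code from the REDI paper.

```python
# Standard DPO loss over sequence-level log-probs of the chosen (positive) and
# rejected (negative) traces under the policy and a frozen reference model.
import torch.nn.functional as F


def dpo_loss(policy_pos_lp, policy_neg_lp, ref_pos_lp, ref_neg_lp, beta: float = 0.1):
    """-log sigmoid(beta * [(pi_pos - ref_pos) - (pi_neg - ref_neg)]), averaged."""
    margin = (policy_pos_lp - ref_pos_lp) - (policy_neg_lp - ref_neg_lp)
    return -F.logsigmoid(beta * margin).mean()
```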