Source: Arxiv
Harnessing Negative Signals: Reinforcement Distillation from Teacher Data for LLM Reasoning

  • Recent advances in model distillation show that data from advanced reasoning models can efficiently transfer reasoning abilities to smaller student models.
  • A new framework, Reinforcement Distillation (REDI), is proposed to leverage both positive and negative reasoning traces to maximize LLM reasoning performance.
  • REDI is a two-stage process: Stage 1 learns from positive traces via Supervised Fine-Tuning (SFT), and Stage 2 refines the model on both positive and negative traces through the REDI objective (see the sketch after this list).
  • Empirical evaluations show that REDI outperforms established preference-optimization methods such as DPO and SimPO in the distillation setting, establishing a new state of the art for 1.5B models post-trained offline on openly available data.

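The summary does not spell out the exact form of the Stage 2 REDI objective, so the following is only a minimal PyTorch sketch of the general idea it describes: increase the likelihood of positive teacher traces while penalizing negative ones. The weighting coefficient `alpha`, the helper names, and the assumption of a Hugging Face-style causal LM (with prompt tokens masked to -100 in `labels`) are all illustrative, not the paper's actual formulation.

```python
import torch
import torch.nn.functional as F

def sequence_logprob(model, input_ids, labels):
    """Sum of per-token log-probs over response tokens.

    Assumes a causal LM whose forward pass returns `.logits` of shape
    (batch, seq_len, vocab); positions with label -100 are ignored.
    """
    logits = model(input_ids).logits[:, :-1, :]   # predict token t+1 from t
    targets = labels[:, 1:]
    mask = targets != -100                        # response tokens only
    logp = F.log_softmax(logits, dim=-1)
    token_logp = logp.gather(-1, targets.clamp(min=0).unsqueeze(-1)).squeeze(-1)
    return (token_logp * mask).sum(dim=-1)

def redi_style_stage2_loss(model, pos_batch, neg_batch, alpha=0.1):
    """Sketch of a Stage-2 objective combining positive and negative traces.

    Rewards positive reasoning traces (maximize their log-likelihood) and
    penalizes negative ones. `alpha` is a hypothetical down-weighting of the
    negative term so it does not dominate training; the paper's true REDI
    objective may differ in form and weighting.
    """
    logp_pos = sequence_logprob(model, pos_batch["input_ids"], pos_batch["labels"])
    logp_neg = sequence_logprob(model, neg_batch["input_ids"], neg_batch["labels"])
    return (-logp_pos + alpha * logp_neg).mean()
```

In this sketch, Stage 1 would be ordinary SFT on the positive traces alone (the `-logp_pos` term by itself); Stage 2 then continues training with the combined loss above on both trace types.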
Read Full Article
