Towards Data Science

How to Train LLMs to “Think” (o1 & DeepSeek-R1)

  • The article distills lessons from o1 and DeepSeek-R1, focusing on how increased test-time compute affects model performance.
  • o1 demonstrated that generating more tokens at inference time yields better responses, pointing to a new test-time scaling law for LLMs.
  • o1 introduced 'thinking' tokens during post-training that demarcate the model's chain of thought (CoT), giving human-readable insight into how the model reasons.
  • DeepSeek-R1, released in January 2025, explored how reasoning can be induced in LLMs through reinforcement learning (RL).
  • The release comprises two models: DeepSeek-R1-Zero, trained with RL alone, and DeepSeek-R1, trained with a combination of supervised fine-tuning (SFT) and RL.
  • R1-Zero developed emergent reasoning abilities through RL alone, discovering CoT generation and test-time compute scaling on its own.
  • R1-Zero's RL recipe combines a simple prompt template, a dual-component reward (answer accuracy plus output formatting), and Group Relative Policy Optimization (GRPO) for stable training; minimal sketches of these pieces follow this list.
  • DeepSeek-R1 was then trained with a multi-step pipeline that interleaves SFT and RL to strengthen reasoning while keeping outputs readable (outlined after this list).
  • The article highlights the interpretability issues of R1-Zero's raw outputs and the training steps taken to address them.
  • After several rounds of SFT and RL, including RL driven by human-preference rewards, DeepSeek-R1 performs strongly on reasoning tasks.
  • Together, o1 and DeepSeek-R1 showcase how far RL can push LLM reasoning and point to promising research directions for models that learn to reason with minimal human supervision.
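
As a rough illustration of the R1-Zero-style setup described above (thinking tokens, a fixed prompt template, and a dual-component reward), here is a minimal Python sketch. The <think>/<answer> tag names, the template wording, the exact-match accuracy check, and the equal weighting of the two reward components are assumptions made for illustration, not the published implementation.

```python
import re

# Assumed R1-Zero-style template: the model is asked to reason inside
# <think> tags and to give its final result inside <answer> tags.
PROMPT_TEMPLATE = (
    "A conversation between User and Assistant. The Assistant first thinks "
    "through the problem inside <think>...</think>, then gives the final "
    "answer inside <answer>...</answer>.\n"
    "User: {question}\nAssistant:"
)

def format_reward(completion: str) -> float:
    """1.0 if the completion contains well-formed <think> and <answer> blocks."""
    ok = re.search(r"<think>.*?</think>\s*<answer>.*?</answer>", completion, re.DOTALL)
    return 1.0 if ok else 0.0

def accuracy_reward(completion: str, reference: str) -> float:
    """1.0 if the extracted answer matches the reference (exact match is a simplification)."""
    m = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    if m is None:
        return 0.0
    return 1.0 if m.group(1).strip() == reference.strip() else 0.0

def total_reward(completion: str, reference: str) -> float:
    # Dual-component reward: correctness plus adherence to the thinking format.
    return accuracy_reward(completion, reference) + format_reward(completion)

# Example usage
completion = "<think>2 + 2 = 4</think> <answer>4</answer>"
print(total_reward(completion, "4"))  # -> 2.0
```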

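GRPO, mentioned in the RL bullet above, scores each sampled completion relative to the other completions drawn for the same prompt instead of relying on a learned value function. The sketch below shows that group-relative advantage plus a PPO-style clipped surrogate loss; the group size, the sequence-level (rather than token-level) ratios, the clipping threshold, and the omission of the KL penalty toward a reference policy are simplifying assumptions, not the exact DeepSeek recipe.

```python
import numpy as np

def group_relative_advantages(rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Normalize each completion's reward against its group (one group per prompt).

    rewards: shape (group_size,), rewards for G completions sampled from the same
    prompt. GRPO uses these normalized scores as advantages instead of a critic.
    """
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def grpo_surrogate_loss(ratios: np.ndarray, advantages: np.ndarray, clip_eps: float = 0.2) -> float:
    """PPO-style clipped surrogate objective, averaged over the group.

    ratios: pi_new / pi_old for each completion (sequence-level here for brevity;
    the full objective also adds a KL penalty toward a reference policy, omitted).
    """
    unclipped = ratios * advantages
    clipped = np.clip(ratios, 1 - clip_eps, 1 + clip_eps) * advantages
    return float(-np.minimum(unclipped, clipped).mean())

# Example: four completions for one prompt, scored with the dual-component reward above.
rewards = np.array([2.0, 1.0, 0.0, 2.0])
adv = group_relative_advantages(rewards)
loss = grpo_surrogate_loss(np.array([1.05, 0.9, 1.2, 0.95]), adv)
print(adv, loss)
```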
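
Finally, the multi-step pipeline referenced in the later bullets (SFT, RL, and human feedback) can be read as an ordered sequence of training stages. The outline below is an assumed sketch of that ordering with toy run_sft/run_rl stand-ins that only record which stage was applied; the stage descriptions are a plausible reading of the summary, not the actual DeepSeek training code.

```python
# Assumed outline of a multi-stage SFT + RL pipeline like the one summarized above.
# run_sft / run_rl are toy stand-ins that only record which stages were applied.

def run_sft(model: dict, data: str) -> dict:
    model["stages"].append(f"SFT on {data}")
    return model

def run_rl(model: dict, data: str) -> dict:
    model["stages"].append(f"RL with {data}")
    return model

PIPELINE = [
    (run_sft, "curated long chain-of-thought examples (cold start)"),
    (run_rl,  "rule-based accuracy + format rewards (as sketched earlier)"),
    (run_sft, "filtered completions from the RL model plus general instruction data"),
    (run_rl,  "reward signals reflecting human preferences"),
]

def train(base_model: dict) -> dict:
    model = base_model
    for step, data in PIPELINE:
        model = step(model, data)
    return model

print(train({"stages": []})["stages"])
```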