How LLMs Work: Reinforcement Learning, RLHF, DeepSeek R1, OpenAI o1, AlphaGo


Towards Data Science


Reinforcement Learning (RL) is a critical part of the training pipeline for Large Language Models (LLMs) as it allows the model to learn from its own experience.
RL enables the model to explore different token sequences and receive feedback on which outputs are most useful, leading to better alignment with human intent over time.
LLM generation is stochastic: responses vary even for the same prompt because each token is sampled from a probability distribution, which lets the model explore different paths.
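
The sampling step described above can be sketched in a few lines. This is a minimal illustration, not how any particular model is implemented: the three-word vocabulary, the logit values, and the helper names `softmax` and `sample_token` are all made up for the example.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Turn raw scores into a probability distribution (numerically stable)."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_token(vocab, logits, temperature=1.0, rng=random):
    """Draw one token at random, weighted by its softmax probability."""
    probs = softmax(logits, temperature)
    return rng.choices(vocab, weights=probs, k=1)[0]

# Hypothetical toy vocabulary and scores, chosen only for illustration.
vocab = ["cat", "dog", "bird"]
logits = [2.0, 1.0, 0.1]

rng = random.Random(0)  # seeded so the run is repeatable
samples = [sample_token(vocab, logits, rng=rng) for _ in range(10)]
print(samples)  # the same "prompt" (same logits) yields varying tokens
```

Lower temperatures concentrate probability on the top-scoring token, while higher temperatures flatten the distribution and produce more varied outputs.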
Training LLMs with reinforcement learning lets them discover and refine strategies that go beyond human demonstrations, much as DeepMind's AlphaGo surpassed human-level play through self-play.
In RL, an agent takes actions in an environment, receives rewards as feedback, and gradually learns the strategy that maximizes its total reward over time.
Two key components of an RL setup are the policy, which determines the agent's strategy, and the value function, which estimates the long-term expected reward from a given state.
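
To make the agent, environment, reward, policy, and value function concrete, here is a small sketch using tabular Q-learning on a hypothetical five-state corridor. Q-learning is swapped in only because it is compact; it is not the algorithm used to train LLMs, and the environment, constants, and function names are assumptions made for the example.

```python
import random

N_STATES = 5          # states 0..4; state 4 is the terminal goal
ACTIONS = (-1, +1)    # step left or step right
GAMMA = 0.9           # discount factor for future rewards

def step(state, action):
    """Toy environment: move along the corridor, reward 1.0 on reaching the goal."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    done = next_state == N_STATES - 1
    reward = 1.0 if done else 0.0
    return next_state, reward, done

def train(episodes=500, alpha=0.5, epsilon=0.2, seed=0):
    """Learn action values q[state][action_index] by trial and error."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            if rng.random() < epsilon:          # explore: random action
                a = rng.randrange(2)
            else:                               # exploit: best known action
                best = max(q[state])
                a = rng.choice([i for i in range(2) if q[state][i] == best])
            next_state, reward, done = step(state, ACTIONS[a])
            # Target = immediate reward + discounted value of the next state.
            target = reward + (0.0 if done else GAMMA * max(q[next_state]))
            q[state][a] += alpha * (target - q[state][a])
            state = next_state
    return q

q = train()
# Policy: the greedy action in each non-terminal state.
policy = [ACTIONS[max(range(2), key=lambda i: q[s][i])] for s in range(N_STATES - 1)]
# Value function: expected discounted reward from each non-terminal state.
values = [max(q[s]) for s in range(N_STATES - 1)]
print(policy)
print([round(v, 3) for v in values])
```

The learned policy is the strategy (which way to step in each state), and the learned values grow as the agent gets closer to the goal, reflecting the discounted long-term reward.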
DeepSeek-R1-Zero and DeepSeek-R1 are open-source reasoning models that showcase RL algorithms such as Group Relative Policy Optimization (GRPO), an alternative to Proximal Policy Optimization (PPO).
GRPO addresses challenges PPO faces in reasoning tasks by sampling a group of responses to the same prompt and scoring each one relative to the others in that group, which steers the model toward higher-quality outputs over time.
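
The group-relative scoring idea can be sketched as follows. This covers only the advantage computation, not the full GRPO objective (the clipped policy ratio and KL penalty are omitted). The reward values, the `eps` term, and the choice of population standard deviation are assumptions for illustration.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Score each response relative to its group: (reward - mean) / std.

    `rewards` holds one scalar reward per response sampled for the same prompt.
    `eps` (an assumed safeguard) avoids division by zero when all rewards match.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical rewards for four sampled answers: 1.0 = correct, 0.0 = incorrect.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Responses that beat the group average receive a positive advantage and are reinforced, while below-average responses receive a negative one. If every response in a group earns the same reward, all advantages are zero and that group contributes no learning signal.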
DeepSeek-R1-Zero skipped supervised fine-tuning, allowing the model to explore chain-of-thought (CoT) reasoning directly, which led to improved complex reasoning capabilities.
RL training can lead to emergent properties like chain-of-thought reasoning and unexpected outcomes, as seen in DeepSeek-R1-Zero refining its reasoning autonomously.
Human feedback plays a crucial role in evaluating AI responses, especially in areas like summarization and creative writing, where there is no single 'correct' answer.
