techminis

A naukri.com initiative

Medium · 2w read
Image Credit: Medium

DeepSeek Explained 6: All you need to know about Reinforcement Learning in LLM training

  • Reinforcement Learning (RL) plays a crucial role in training Large Language Models (LLMs), as it aligns LLM-generated responses with human preferences through feedback.
  • RL involves trial-and-error learning with rewards that guide model behavior toward maximizing cumulative rewards over time.
  • RL is valuable when clear labels are unavailable, making it useful for tasks like training robots to walk.
  • Reinforcement Learning from Human Feedback (RLHF) involves learning a reward function from human feedback to guide model training.
  • RL algorithms are classified into three major categories: value-based, policy-based, and Actor-Critic RL.
  • Value-based RL updates value functions based on the Bellman Equation, policy-based RL optimizes policy networks, and Actor-Critic RL combines both approaches.
  • Trust Region Policy Optimization (TRPO) and Proximal Policy Optimization (PPO) are earlier policy-optimization algorithms that GRPO builds on.
  • GRPO (Group Relative Policy Optimization) addresses challenges in Actor-Critic RL, eliminating the need for a separate value network.
  • GRPO focuses on optimizing policy networks using grouped structures and relative reward estimations within each group.
  • By estimating advantages within each group, GRPO reduces training resource requirements and improves the stability of RL training.
  • GRPO's approach of utilizing grouped structures and relative rewards sets it apart from traditional Actor-Critic methods, making it a purely policy-based RL strategy.
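The value-based idea described above (updating value functions via the Bellman Equation) can be sketched with tabular Q-learning on a toy environment. Everything here, including the 1-D chain environment and all variable names, is an illustrative assumption, not something from the article:

```python
import numpy as np

# Toy 1-D chain: states 0..4, actions 0 (left) / 1 (right),
# reward 1 for reaching the rightmost state.
n_states, n_actions = 5, 2
Q = np.zeros((n_states, n_actions))   # action-value table
alpha, gamma = 0.5, 0.9               # learning rate, discount factor

def step(s, a):
    s2 = max(0, s - 1) if a == 0 else min(n_states - 1, s + 1)
    return s2, float(s2 == n_states - 1)

rng = np.random.default_rng(0)
for _ in range(500):                  # trial-and-error episodes
    s = 0
    for _ in range(20):
        a = int(rng.integers(n_actions))   # explore uniformly
        s2, r = step(s, a)
        # Bellman update: move Q(s,a) toward r + gamma * max_a' Q(s',a')
        Q[s, a] += alpha * (r + gamma * Q[s2].max() - Q[s, a])
        s = s2

greedy = Q.argmax(axis=1)  # greedy policy: move right in every state
```

After training, the greedy policy reads the learned values directly, which is the hallmark of value-based RL: the policy is implicit in the value function.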
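PPO, mentioned above as a predecessor of GRPO, is commonly characterized by its clipped surrogate objective. A minimal sketch of that loss, with made-up function and argument names (the article gives no code):

```python
import numpy as np

def ppo_clip_loss(log_probs, old_log_probs, advantages, eps=0.2):
    """Clipped surrogate objective (negated, for minimization)."""
    ratio = np.exp(log_probs - old_log_probs)          # pi_new / pi_old
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1 - eps, 1 + eps) * advantages
    # Pessimistic bound: take the elementwise minimum, then average
    return -np.minimum(unclipped, clipped).mean()
```

The clipping keeps the updated policy close to the data-collecting policy, which is the "trust region" idea from TRPO implemented with a much simpler objective.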
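The group-relative estimation that sets GRPO apart can be sketched in a few lines: sample a group of responses per prompt, score each one, and normalize rewards within the group so no separate value network is needed. The function name and reward values are illustrative assumptions:

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Advantage of each response relative to its own group:
    (reward - group mean) / group std, replacing a learned value baseline."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# e.g. correctness scores of 4 responses sampled for one prompt (made up)
adv = group_relative_advantages([0.1, 0.9, 0.5, 0.5])
```

Responses scoring above the group mean get positive advantages and are reinforced; those below are discouraged, so the group itself serves as the baseline that a critic network would otherwise provide.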
