Reinforcement Learning (RL) plays a crucial role in training Large Language Models (LLMs), as it aligns LLM-generated responses with human preferences through feedback.
RL involves trial-and-error learning with rewards that guide model behavior toward maximizing cumulative rewards over time.
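The "cumulative reward over time" that RL maximizes is usually a discounted sum of step rewards. A minimal sketch, assuming a standard discount factor gamma of 0.99 (a common default, not specified in the text):

```python
# Sketch: cumulative discounted return, the quantity RL seeks to maximize.
# gamma is the discount factor (0.99 is an assumed, typical value).
def discounted_return(rewards, gamma=0.99):
    total = 0.0
    # Accumulate from the last step backward: G_t = r_t + gamma * G_{t+1}
    for r in reversed(rewards):
        total = r + gamma * total
    return total

print(discounted_return([1.0, 0.0, 2.0]))  # 1 + 0.99*(0 + 0.99*2) = 2.9602
```

Rewards arriving later are weighted down geometrically, which is what lets an agent trade off immediate against future reward.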
RL is valuable when clear labels are unavailable, making it useful for tasks like training robots to walk.
Reinforcement Learning from Human Feedback (RLHF) involves learning a reward function from human feedback to guide model training.
RL algorithms are classified into three major categories: value-based, policy-based, and Actor-Critic RL.
Value-based RL updates value functions based on the Bellman Equation, policy-based RL optimizes policy networks, and Actor-Critic RL combines both approaches.
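A value-based update driven by the Bellman Equation can be sketched with tabular Q-learning. The toy 2-state, 2-action problem and the hyperparameter values below are illustrative assumptions, not part of the original text:

```python
# Minimal sketch of a value-based (Bellman-style) update: tabular Q-learning
# on a toy 2-state, 2-action problem. States, actions, alpha, and gamma
# are illustrative assumptions.
Q = {(s, a): 0.0 for s in (0, 1) for a in (0, 1)}
alpha, gamma = 0.1, 0.9  # learning rate and discount factor

def q_update(s, a, r, s_next):
    # Bellman target: immediate reward plus discounted best value of the next state
    target = r + gamma * max(Q[(s_next, b)] for b in (0, 1))
    # Move the current estimate a step toward the target
    Q[(s, a)] += alpha * (target - Q[(s, a)])

q_update(0, 1, 1.0, 1)  # one observed transition: state 0, action 1, reward 1.0
print(Q[(0, 1)])  # 0.1 after a single update from a zero-initialized table
```

Policy-based methods skip this value table entirely and adjust the policy's parameters directly; Actor-Critic methods keep both a policy (actor) and a value estimator (critic).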
Trust Region Policy Optimization (TRPO) and Proximal Policy Optimization (PPO) are earlier policy-gradient algorithms; both constrain how far each update can move the policy, and PPO in its standard Actor-Critic form relies on a separate learned value network as the critic.
GRPO (Group Relative Policy Optimization) addresses challenges in Actor-Critic RL, eliminating the need for a separate value network.
GRPO focuses on optimizing policy networks using grouped structures and relative reward estimations within each group.
By estimating advantages within each group, GRPO reduces the memory and compute needed for training and improves the stability of RL training.
GRPO's approach of utilizing grouped structures and relative rewards sets it apart from traditional Actor-Critic methods, making it a purely policy-based RL strategy.