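DeepSeek Math uses a rule-based reward system to train on competition-level problems such as IMO questions. The reward system combines an accuracy reward, which checks whether the final answer is correct, with a format reward, which checks whether the output follows the required structure. For optimization, DeepSeek Math adopted Group Relative Policy Optimization (GRPO), a variant of Proximal Policy Optimization (PPO). Both methods bound each policy update, which keeps the model from drifting far from the previously trained LLM and so guards against catastrophic forgetting.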
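
To make the pieces above concrete, here is a minimal Python sketch of a rule-based reward feeding a group-relative, clipped update. It is an illustration under stated assumptions, not the DeepSeek implementation: the `<think>` tag layout, the `\boxed{}` answer convention, the equal weighting of the two rewards, the clipping range `eps=0.2`, and all function names are hypothetical. The KL penalty against a reference model that GRPO also uses is left out for brevity.

```python
import math
import re
import statistics

# Assumed output format (hypothetical): reasoning inside <think>...</think>,
# then a final answer inside \boxed{...}. Nested braces are not handled.
FORMAT_RE = re.compile(r"^<think>.*?</think>.*\\boxed\{(.+?)\}\s*$", re.DOTALL)
ANSWER_RE = re.compile(r"\\boxed\{(.+?)\}")


def format_reward(completion: str) -> float:
    """1.0 if the completion follows the assumed layout, else 0.0."""
    return 1.0 if FORMAT_RE.match(completion.strip()) else 0.0


def accuracy_reward(completion: str, reference: str) -> float:
    """1.0 if the last boxed answer equals the reference string, else 0.0."""
    found = ANSWER_RE.findall(completion)
    return 1.0 if found and found[-1].strip() == reference.strip() else 0.0


def total_reward(completion: str, reference: str) -> float:
    """Sum of the two rule-based rewards (equal weights are an assumption)."""
    return accuracy_reward(completion, reference) + format_reward(completion)


def group_relative_advantages(rewards: list[float]) -> list[float]:
    """Normalize each reward by the mean and std of its own sampled group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0.0:  # every sample scored the same: no learning signal
        return [0.0] * len(rewards)
    return [(r - mean) / std for r in rewards]


def clipped_surrogate(logp_new: float, logp_old: float,
                      advantage: float, eps: float = 0.2) -> float:
    """PPO-style clipped objective for one sample (to be maximized)."""
    ratio = math.exp(logp_new - logp_old)
    clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
    return min(ratio * advantage, clipped * advantage)


if __name__ == "__main__":
    reference = "4"
    group = [
        "<think>2+2=4</think> The answer is \\boxed{4}",  # correct, well formed
        "\\boxed{4}",                                      # correct, no reasoning
        "<think>hmm</think> \\boxed{5}",                   # well formed, wrong
        "no idea",                                         # neither
    ]
    rewards = [total_reward(c, reference) for c in group]
    print(rewards)
    print(group_relative_advantages(rewards))
```

Because each advantage is measured against the other samples drawn for the same question, no separate value network is needed, and the clipping in `clipped_surrogate` is what caps how far a single update can move the policy.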