What makes DeepSeek Math solve IMO questions



Medium


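DeepSeek Math is trained with a rule-based reward system, which helps it solve IMO questions.
The reward system combines accuracy rewards, which check whether the final answer is correct, and format rewards, which check whether the response follows the expected structure.
For reinforcement learning, DeepSeek Math adopted Proximal Policy Optimization (PPO) and Group Relative Policy Optimization (GRPO).
Bounding each policy update keeps the model from catastrophically forgetting what the pretrained LLM has already learned.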
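
The points above can be illustrated with a minimal Python sketch. It is not DeepSeek's implementation: the `<think>` and `\boxed{}` output format, the reward weights (0.5 and 1.0), the clip range `eps=0.2`, and all function names are assumptions chosen for illustration. The sketch scores completions with a rule-based reward, turns a group of rewards into group-relative advantages in the style of GRPO, and applies the PPO-style clipped objective that bounds each policy update. The KL penalty against a reference policy that GRPO also uses is left out for brevity.

```python
import re
from statistics import mean, pstdev

# Assumed output format (hypothetical): reasoning inside <think>...</think>,
# followed by a final answer inside \boxed{...}.
FORMAT_RE = re.compile(
    r"^<think>.*?</think>\s*.*\\boxed\{([^{}]*)\}\s*$", re.DOTALL
)


def rule_based_reward(completion: str, reference: str) -> float:
    """Format reward plus accuracy reward (weights are illustrative)."""
    match = FORMAT_RE.match(completion.strip())
    format_reward = 0.5 if match else 0.0
    # Accuracy is only checked when the answer can be extracted from the format.
    is_correct = bool(match) and match.group(1).strip() == reference.strip()
    accuracy_reward = 1.0 if is_correct else 0.0
    return format_reward + accuracy_reward


def group_relative_advantages(rewards: list[float]) -> list[float]:
    """Normalize each reward against the mean and std of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    if sigma == 0:
        # All completions scored the same, so none is preferred.
        return [0.0] * len(rewards)
    return [(r - mu) / sigma for r in rewards]


def clipped_objective(ratio: float, advantage: float, eps: float = 0.2) -> float:
    """PPO-style clipped surrogate for one sample.

    `ratio` is new_policy_prob / old_policy_prob. Clipping it to
    [1 - eps, 1 + eps] bounds how far a single update can move the policy.
    """
    clipped_ratio = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)


# A group of sampled completions for one question whose reference answer is "4".
completions = [
    "<think>2 + 2 = 4</think> The answer is \\boxed{4}",
    "<think>2 + 2 = 5</think> The answer is \\boxed{5}",
    "4",
    "<think>two plus two</think> \\boxed{4}",
]
rewards = [rule_based_reward(c, "4") for c in completions]
advantages = group_relative_advantages(rewards)
```

In this sketch a completion that is both well formatted and correct ends up with a positive advantage, while an unformatted one ends up with a negative advantage, so no separate learned value model is needed to rank the samples within a group.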
