<ul><li>DeepSeek R1 is trained with Group Relative Policy Optimization (GRPO), a reinforcement learning method that samples a group of candidate outputs for each prompt, scores them, and reinforces the outputs that score above the group average.</li><li>Rather than relying on a single fixed reward function, DeepSeek R1 applies reward functions chosen to match the task and goals of the prompt.</li><li>Multiple reward functions can be combined to balance accuracy, speed, compliance, and safety.</li><li>The approach is intended to increase transparency, understanding, and trust in the AI's reasoning decisions.</li></ul>
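
To make the points above concrete, here is a minimal Python sketch of the group-relative scoring idea behind GRPO. It is an illustration under stated assumptions, not DeepSeek's training code: the two reward rules, their weights, and all function names are hypothetical, and the normalization (reward minus group mean, divided by group standard deviation) is the commonly described form of the GRPO advantage.

```python
from statistics import mean, pstdev
from typing import Callable, List, Tuple

# A reward function maps (model output, reference answer) to a score.
RewardFn = Callable[[str, str], float]


def accuracy_reward(output: str, reference: str) -> float:
    """Hypothetical rule: 1.0 if the output ends with the reference answer."""
    return 1.0 if output.strip().endswith(reference) else 0.0


def format_reward(output: str, reference: str) -> float:
    """Hypothetical rule: 1.0 if the reasoning is wrapped in <think> tags."""
    return 1.0 if "<think>" in output and "</think>" in output else 0.0


def combined_reward(
    output: str, reference: str, weighted_fns: List[Tuple[RewardFn, float]]
) -> float:
    """Weighted sum of several reward functions (weights are assumptions)."""
    return sum(weight * fn(output, reference) for fn, weight in weighted_fns)


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Score each output relative to its group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; eps avoids division by zero
    return [(r - mu) / (sigma + eps) for r in rewards]


# One prompt ("What is 2+2?"), a group of four sampled outputs.
group = ["<think>2+2</think> 4", "4", "<think>2+2</think> 5", "5"]
weighted_fns = [(accuracy_reward, 1.0), (format_reward, 0.5)]

rewards = [combined_reward(out, "4", weighted_fns) for out in group]
advantages = group_relative_advantages(rewards)

for out, r, a in zip(group, rewards, advantages):
    print(f"{out!r}: reward={r:.2f}, advantage={a:+.3f}")
```

Outputs with a positive advantage scored above the group average and would be reinforced; those with a negative advantage would be discouraged. Swapping entries in `weighted_fns` is one simple way to picture choosing different reward functions for different tasks.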

DeepSeek’s AI improves when rewarded
