ByteDance, Tsinghua University, and the University of Hong Kong have released DAPO (Decoupled Clip and Dynamic Sampling Policy Optimization), an open-source reinforcement learning system for large language models (LLMs).
DAPO aims to enhance the reasoning abilities of LLMs and promote reproducibility by openly sharing algorithmic details, training procedures, and datasets.
DAPO incorporates four core innovations: Clip-Higher, Dynamic Sampling, Token-level Policy Gradient Loss, and Overlong Reward Shaping.
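To illustrate one of these techniques: Clip-Higher decouples the lower and upper clipping bounds of the PPO-style surrogate objective, raising the upper bound so that low-probability tokens have more room to gain probability mass during training. The sketch below is illustrative, not the released implementation; the epsilon values are assumptions chosen to show the asymmetry.

```python
import numpy as np

def clip_higher_objective(ratio, advantage, eps_low=0.2, eps_high=0.28):
    """PPO-style surrogate with decoupled clip bounds (Clip-Higher sketch).

    ratio:      per-token importance ratio pi_theta / pi_theta_old
    advantage:  per-token advantage estimate
    eps_low / eps_high: illustrative values; making eps_high > eps_low
    lets upweighted tokens grow further before the ratio is clipped.
    """
    clipped = np.clip(ratio, 1 - eps_low, 1 + eps_high)
    # Standard pessimistic min over the unclipped and clipped terms
    return np.minimum(ratio * advantage, clipped * advantage)

# With a positive advantage, a ratio above 1 + eps_high is clipped at 1.28:
print(clip_higher_objective(np.array([1.5]), np.array([1.0])))  # -> [1.28]
```

A symmetric PPO clip (eps_low == eps_high) would cap the same token at 1.2; the higher upper bound is what the name refers to.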
Experimental results demonstrate DAPO's effectiveness: it achieves higher scores on the American Invitational Mathematics Examination (AIME) 2024 benchmark while using only half the training steps of prior methods.