ByteDance, Tsinghua University, and the University of Hong Kong have released DAPO (Decoupled Clip and Dynamic Sampling Policy Optimization), an open-source reinforcement learning system for large language models (LLMs).
DAPO aims to enhance the reasoning abilities of LLMs and promote reproducibility by openly sharing algorithmic details, training procedures, and datasets.
DAPO incorporates four core innovations: Clip-Higher, Dynamic Sampling, Token-level Policy Gradient Loss, and Overlong Reward Shaping.
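To illustrate one of these techniques: Clip-Higher decouples the lower and upper clipping bounds of the PPO-style surrogate objective, raising the upper bound so that low-probability tokens have more room to gain probability mass during training. The sketch below is illustrative, not the released implementation; the epsilon values are assumptions chosen to show the asymmetry.

```python
import numpy as np

def clip_higher_objective(ratio, advantage, eps_low=0.2, eps_high=0.28):
    """PPO-style surrogate with decoupled clip bounds (Clip-Higher sketch).

    ratio:      per-token importance ratio pi_theta / pi_theta_old
    advantage:  per-token advantage estimate
    eps_low / eps_high: illustrative values; making eps_high > eps_low
    lets upweighted tokens grow further before the ratio is clipped.
    """
    clipped = np.clip(ratio, 1 - eps_low, 1 + eps_high)
    # Standard pessimistic min over the unclipped and clipped terms
    return np.minimum(ratio * advantage, clipped * advantage)

# With a positive advantage, a ratio above 1 + eps_high is clipped at 1.28:
print(clip_higher_objective(np.array([1.5]), np.array([1.0])))  # -> [1.28]
```

A symmetric PPO clip (eps_low == eps_high) would cap the same token at 1.2; the higher upper bound is what the name refers to.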
Experimental results demonstrate DAPO's effectiveness: it achieves higher scores on the American Invitational Mathematics Examination (AIME) 2024 benchmark while using only half the training steps of prior methods.