Reinforcement learning (RL) has become an effective approach for fine-tuning large language models (LLMs) to enhance reasoning capabilities.
This paper introduces two techniques, difficulty-targeted online data selection and rollout replay, to improve data efficiency in LLM RL fine-tuning.
The data selection method introduces a notion of adaptive difficulty and prioritizes questions of moderate difficulty, which provide more informative learning signals, using an attention-based framework to estimate adaptive difficulty efficiently.
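To make the selection step concrete, the following is a minimal sketch of difficulty-targeted selection, not the paper's exact implementation: the target pass rate, the `estimate_pass_rate` callable (standing in for the attention-based difficulty estimator), and `select_batch` are all hypothetical names introduced here for illustration.

```python
# Hypothetical sketch of difficulty-targeted online data selection.
# TARGET_PASS_RATE, estimate_pass_rate, and select_batch are illustrative names.
from typing import Callable, List, Sequence

TARGET_PASS_RATE = 0.5  # "moderate difficulty": the policy solves roughly half of the rollouts


def select_batch(
    questions: Sequence[str],
    estimate_pass_rate: Callable[[str], float],
    batch_size: int,
) -> List[str]:
    """Pick the questions whose estimated pass rate is closest to the target.

    estimate_pass_rate stands in for an efficient difficulty estimator such as
    the paper's attention-based framework; any cheap predictor of the current
    policy's success rate fits this interface.
    """
    scored = sorted(
        questions,
        key=lambda q: abs(estimate_pass_rate(q) - TARGET_PASS_RATE),
    )
    return list(scored[:batch_size])


if __name__ == "__main__":
    # Dummy estimator that treats longer questions as harder, for demonstration only.
    pool = [f"question-{i}-" * (i % 5 + 1) for i in range(100)]
    dummy_estimator = lambda q: max(0.0, 1.0 - len(q) / 60.0)
    print(select_batch(pool, dummy_estimator, batch_size=8))
```

The intuition is that questions the policy almost always solves or almost always fails yield near-zero advantage within a rollout group, so ranking by distance from a moderate target pass rate concentrates compute on informative examples.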
Experiments across 6 LLM-dataset combinations demonstrate that the proposed method reduces RL fine-tuning time by 25% to 65% while matching the performance of the original GRPO algorithm.
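The abstract only names the second technique, rollout replay; a common reading, sketched below under that assumption, is a buffer that stores recent rollouts and reuses them in later updates so fewer new generations are needed per step. The `Rollout` record and `RolloutReplayBuffer` class are hypothetical, not the paper's API.

```python
# Hypothetical sketch of a rollout replay buffer; the reuse policy here
# (keep recent rollouts, sample them into later updates) is an assumption,
# not the paper's exact mechanism.
import random
from collections import deque
from dataclasses import dataclass
from typing import Deque, List


@dataclass
class Rollout:
    question: str
    response: str
    reward: float


class RolloutReplayBuffer:
    def __init__(self, capacity: int = 2048):
        # Oldest rollouts are evicted automatically once capacity is reached.
        self.buffer: Deque[Rollout] = deque(maxlen=capacity)

    def add(self, rollouts: List[Rollout]) -> None:
        """Store freshly generated rollouts for later reuse."""
        self.buffer.extend(rollouts)

    def sample(self, n: int) -> List[Rollout]:
        """Reuse stored rollouts so fewer new generations are needed per step."""
        n = min(n, len(self.buffer))
        return random.sample(list(self.buffer), n)
```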