Reinforcement learning (RL) has become an effective approach for fine-tuning large language models (LLMs) to enhance reasoning capabilities.
This paper introduces two techniques, difficulty-targeted online data selection and rollout replay, to improve data efficiency in LLM RL fine-tuning.
The data selection method introduces a notion of adaptive difficulty and prioritizes questions of moderate difficulty, which provide more informative learning signals, using an attention-based framework to estimate adaptive difficulty efficiently.
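To make the selection step concrete, the following is a minimal sketch of difficulty-targeted selection, not the paper's exact implementation: the target pass rate, the `estimate_pass_rate` callable (standing in for the attention-based difficulty estimator), and `select_batch` are all hypothetical names introduced here for illustration.

```python
# Hypothetical sketch of difficulty-targeted online data selection.
# TARGET_PASS_RATE, estimate_pass_rate, and select_batch are illustrative names.
from typing import Callable, List, Sequence

TARGET_PASS_RATE = 0.5  # "moderate difficulty": the policy solves roughly half of the rollouts


def select_batch(
    questions: Sequence[str],
    estimate_pass_rate: Callable[[str], float],
    batch_size: int,
) -> List[str]:
    """Pick the questions whose estimated pass rate is closest to the target.

    estimate_pass_rate stands in for an efficient difficulty estimator such as
    the paper's attention-based framework; any cheap predictor of the current
    policy's success rate fits this interface.
    """
    scored = sorted(
        questions,
        key=lambda q: abs(estimate_pass_rate(q) - TARGET_PASS_RATE),
    )
    return list(scored[:batch_size])


if __name__ == "__main__":
    # Dummy estimator that treats longer questions as harder, for demonstration only.
    pool = [f"question-{i}-" * (i % 5 + 1) for i in range(100)]
    dummy_estimator = lambda q: max(0.0, 1.0 - len(q) / 60.0)
    print(select_batch(pool, dummy_estimator, batch_size=8))
```

The intuition is that questions the policy almost always solves or almost always fails yield near-zero advantage within a rollout group, so ranking by distance from a moderate target pass rate concentrates compute on informative examples.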
Experiments across 6 LLM-dataset combinations demonstrate that the proposed method reduces RL fine-tuning time by 25% to 65% while matching the performance of the original GRPO algorithm.
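The abstract only names the second technique, rollout replay; a common reading, sketched below under that assumption, is a buffer that stores recent rollouts and reuses them in later updates so fewer new generations are needed per step. The `Rollout` record and `RolloutReplayBuffer` class are hypothetical, not the paper's API.

```python
# Hypothetical sketch of a rollout replay buffer; the reuse policy here
# (keep recent rollouts, sample them into later updates) is an assumption,
# not the paper's exact mechanism.
import random
from collections import deque
from dataclasses import dataclass
from typing import Deque, List


@dataclass
class Rollout:
    question: str
    response: str
    reward: float


class RolloutReplayBuffer:
    def __init__(self, capacity: int = 2048):
        # Oldest rollouts are evicted automatically once capacity is reached.
        self.buffer: Deque[Rollout] = deque(maxlen=capacity)

    def add(self, rollouts: List[Rollout]) -> None:
        """Store freshly generated rollouts for later reuse."""
        self.buffer.extend(rollouts)

    def sample(self, n: int) -> List[Rollout]:
        """Reuse stored rollouts so fewer new generations are needed per step."""
        n = min(n, len(self.buffer))
        return random.sample(list(self.buffer), n)
```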