Domain reweighting is an emerging research area aimed at adjusting the relative weights of different data sources to improve the effectiveness and efficiency of LLM pre-training.
Current practice determines competitive data mixtures in small-scale experiments and applies them directly at larger scales, but a mixture that wins at small scale may not retain its advantage as training scales up.
AutoScale is a two-stage, scale-aware data composition framework: it fits a parametric model that predicts loss as a function of data composition, uses it to find an approximately optimal allocation at smaller training budgets, and then extrapolates how that optimal composition shifts at larger budgets.
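To make the two stages concrete, here is a minimal sketch of the idea. It assumes an illustrative per-domain power-law loss model and a simple log-budget extrapolation of the optimal weights; the parametric form, budgets, and fitted values below are hypothetical stand-ins, not AutoScale's exact predictor.

```python
import numpy as np
from scipy.optimize import minimize

# Illustrative per-domain loss model: loss_i(n_i) = b_i + c_i * n_i**(-alpha_i),
# where n_i is the number of tokens drawn from domain i. The exact parametric
# form used by AutoScale may differ; this one is an assumption.
def domain_loss(n, b, c, alpha):
    return b + c * np.power(np.maximum(n, 1.0), -alpha)

def total_loss(weights, budget, params):
    # Predicted loss when `budget` tokens are split across domains by `weights`.
    return sum(domain_loss(w * budget, *p) for w, p in zip(weights, params))

def optimal_composition(budget, params, k):
    # Minimize predicted loss over the probability simplex (w >= 0, sum(w) = 1).
    constraints = {"type": "eq", "fun": lambda w: w.sum() - 1.0}
    bounds = [(0.0, 1.0)] * k
    w0 = np.full(k, 1.0 / k)
    res = minimize(total_loss, w0, args=(budget, params),
                   bounds=bounds, constraints=constraints)
    return res.x

# Stage 1: fit per-domain parameters from small-scale runs (synthetic values
# here), then find approximately optimal allocations at two small budgets.
params = [(1.8, 40.0, 0.30), (2.1, 25.0, 0.22), (1.5, 60.0, 0.35)]  # (b, c, alpha)
w_small = optimal_composition(1e8, params, k=3)
w_mid = optimal_composition(1e9, params, k=3)

# Stage 2: extrapolate how each optimal weight shifts with budget, assuming the
# trend is roughly linear in log-budget (an illustrative choice).
log_budgets = np.log10([1e8, 1e9])
slopes = (w_mid - w_small) / (log_budgets[1] - log_budgets[0])
w_target = w_small + slopes * (np.log10(1e11) - log_budgets[0])
w_target = np.clip(w_target, 0.0, None)
w_target /= w_target.sum()  # re-project onto the simplex
print("predicted composition at 1e11 tokens:", np.round(w_target, 3))
```

The key design point is that the small-scale optimization is cheap enough to run exactly, while the large-scale composition is never searched directly, only predicted from the trend in how optimal weights move as the budget grows.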
Empirical results show that AutoScale accelerates convergence and improves downstream performance, reducing perplexity 28% faster than baselines when pre-training GPT-2 Large.