Farseer introduces a refined scaling law for Large Language Models (LLMs) to improve predictive accuracy across scales.
Farseer addresses the persistent scaling gap between small-scale experiments and resource-intensive production systems.
By constructing a model loss surface, $L(N,D)$, Farseer fits empirical data significantly better than prior laws such as Chinchilla's law.
Relative to Chinchilla's law, the new scaling law reduces extrapolation error by 433%.
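To make the fit-and-extrapolate workflow concrete, the sketch below fits a parametric loss surface to small-scale runs and measures extrapolation error on held-out larger runs. It uses Chinchilla's functional form $L(N,D) = E + A N^{-\alpha} + B D^{-\beta}$ purely as a stand-in (Farseer's own parameterization is defined in the paper), and all run data here are synthetic for illustration.

```python
import numpy as np
from scipy.optimize import curve_fit

rng = np.random.default_rng(0)

# Chinchilla-style parametric loss surface L(N, D) = E + A*N^-alpha + B*D^-beta,
# used as a stand-in here; Farseer fits its own surface, but the workflow is the same.
def loss_surface(ND, E, A, alpha, B, beta):
    N, D = ND
    return E + A * N**(-alpha) + B * D**(-beta)

TRUE = (1.8, 380.0, 0.33, 400.0, 0.29)  # hypothetical "ground truth" parameters

# Synthetic small-scale runs used for fitting (parameter counts N, token counts D).
N_small = np.array([1e8, 2e8, 4e8, 8e8, 1.6e9, 3.2e9])
D_small = np.array([2e9, 4e9, 8e9, 1.6e10, 3.2e10, 6.4e10])
loss_small = loss_surface((N_small, D_small), *TRUE) + rng.normal(0, 0.01, N_small.size)

# Fit the surface on small-scale points only.
params, _ = curve_fit(loss_surface, (N_small, D_small), loss_small,
                      p0=[2.0, 300.0, 0.3, 300.0, 0.3], maxfev=20000)

# Held-out larger-scale runs to measure extrapolation error.
N_large = np.array([7e9, 1.3e10])
D_large = np.array([1.4e11, 2.6e11])
loss_large = loss_surface((N_large, D_large), *TRUE)

pred = loss_surface((N_large, D_large), *params)
print("fitted parameters:", np.round(params, 3))
print("relative extrapolation error:", np.abs(pred - loss_large) / loss_large)
```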
Farseer enables reliable comparison of competing training strategies across scales, so that large-scale performance can be predicted from smaller runs.
Its methodology supports confident extrapolation from small-scale ablation studies to performance predictions at much larger scales.
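The same fit-then-extrapolate loop underlies strategy comparison: fit one surface per training recipe on its own small-scale ablations, then compare the surfaces' predictions at the target scale. A minimal sketch, assuming the Chinchilla-style stand-in surface above and purely hypothetical fitted parameters for two recipes:

```python
# Hypothetical fitted surface parameters (E, A, alpha, B, beta) for two recipes,
# each obtained from its own set of small-scale ablation runs as shown above.
strategy_params = {
    "baseline recipe": (1.80, 380.0, 0.330, 400.0, 0.290),
    "modified recipe": (1.78, 360.0, 0.335, 395.0, 0.292),
}

N_target, D_target = 3e10, 6e11  # planned production-scale run (illustrative)

def predicted_loss(E, A, alpha, B, beta, N, D):
    return E + A * N**(-alpha) + B * D**(-beta)

for name, p in strategy_params.items():
    print(f"{name}: predicted loss at target scale = {predicted_loss(*p, N_target, D_target):.4f}")
```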
Farseer also provides new insights into optimal compute allocation for LLM training that better reflect the demands of modern training.
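One way such allocation insights are read off a fitted surface: fix a compute budget $C$, apply the common approximation $C \approx 6ND$, and search for the $(N, D)$ split that minimizes predicted loss. The numerical sketch below uses the hypothetical stand-in surface from above; applying the same procedure to Farseer's own fitted surface is what yields the paper's allocation conclusions.

```python
import numpy as np

# Hypothetical fitted surface parameters from the sketch above.
E, A, alpha, B, beta = 1.80, 380.0, 0.33, 400.0, 0.29

def loss(N, D):
    return E + A * N**(-alpha) + B * D**(-beta)

def compute_optimal_split(C, num=2000):
    """Sweep model sizes N on a log grid, set D = C / (6N), and pick the minimum loss."""
    N = np.logspace(8, 12, num)   # candidate parameter counts
    D = C / (6.0 * N)             # tokens implied by the C ~= 6*N*D approximation
    L = loss(N, D)
    i = int(np.argmin(L))
    return N[i], D[i], L[i]

for C in (1e21, 1e22, 1e23):      # FLOP budgets
    N_opt, D_opt, L_opt = compute_optimal_split(C)
    print(f"C={C:.0e} FLOPs -> N~{N_opt:.2e} params, D~{D_opt:.2e} tokens, "
          f"predicted loss {L_opt:.3f}")
```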
To validate Farseer, around 1,000 LLMs were trained across different scales and configurations, consuming approximately 3 million NVIDIA H100 GPU hours.
All models, data, results, and logs are open-sourced on GitHub to encourage further research and collaboration.