This work examines how scaling affects the training dynamics of language models, in particular the shape of the loss curve.
Early in training, language models undergo loss deceleration: an abrupt slowdown in the rate of loss improvement that gives the loss curve a piecewise linear shape in log-log space.
Scaling up the model mitigates this transition, both by lowering the loss at which deceleration occurs and by improving the rate of loss improvement after deceleration.
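As a rough illustration of what "piecewise linear in log-log space" means in practice (not the paper's fitting procedure), the sketch below grid-searches a breakpoint that best splits a loss curve into two log-log lines, recovering the deceleration step, the loss at that step, and the slopes before and after. The synthetic loss curve, the breakpoint search, and the helper `two_segment_sse` are assumptions made purely for illustration.

```python
import numpy as np

# Hypothetical synthetic loss curve: a fast power-law decay early on,
# then a much slower power-law after a deceleration point (illustration only).
steps = np.arange(100, 100_000, 100)
loss = np.where(steps < 5_000,
                8.0 * steps ** -0.30,
                8.0 * 5_000 ** -0.30 * (steps / 5_000) ** -0.05)
loss *= np.exp(np.random.default_rng(0).normal(0, 0.01, steps.shape))  # mild noise

x, y = np.log(steps), np.log(loss)

def two_segment_sse(x, y, k):
    """Sum of squared errors from fitting separate lines before/after index k."""
    sse = 0.0
    for xs, ys in ((x[:k], y[:k]), (x[k:], y[k:])):
        slope, intercept = np.polyfit(xs, ys, 1)
        sse += np.sum((ys - (slope * xs + intercept)) ** 2)
    return sse

# Grid-search the breakpoint that best explains the curve as two log-log lines.
candidates = range(10, len(x) - 10)
k_best = min(candidates, key=lambda k: two_segment_sse(x, y, k))

slope_pre, _ = np.polyfit(x[:k_best], y[:k_best], 1)
slope_post, _ = np.polyfit(x[k_best:], y[k_best:], 1)
print(f"deceleration at ~step {steps[k_best]}, loss ~{loss[k_best]:.3f}")
print(f"log-log slope before: {slope_pre:.3f}, after: {slope_post:.3f}")
```

Under this framing, scaling's effect corresponds to a lower loss at the recovered breakpoint and a steeper (more negative) post-breakpoint slope.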
Loss deceleration is attributed to a training dynamic termed zero-sum learning (ZSL), in which per-example gradients become systematically opposed, so that improvements on some examples are offset by degradations on others and overall progress stalls.
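To make the ZSL intuition concrete, here is a minimal sketch (not the paper's measurement) that computes per-example gradients for a toy PyTorch model and compares the norm of their mean to the mean of their norms; when per-example gradients largely oppose each other, this ratio is driven toward zero. The toy model, data, and cancellation ratio are assumptions for illustration only.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy setup standing in for a language model (assumption for illustration).
model = nn.Linear(16, 4)
loss_fn = nn.CrossEntropyLoss()
inputs = torch.randn(32, 16)
targets = torch.randint(0, 4, (32,))

def per_example_grad(x, y):
    """Flattened gradient of the loss on a single example."""
    model.zero_grad()
    loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0)).backward()
    return torch.cat([p.grad.flatten().clone() for p in model.parameters()])

grads = torch.stack([per_example_grad(x, y) for x, y in zip(inputs, targets)])

# If per-example gradients were aligned, the norm of their mean would be close
# to the mean of their norms; heavy opposition drives this ratio toward zero.
ratio = grads.mean(dim=0).norm() / grads.norm(dim=1).mean()
print(f"gradient alignment ratio: {ratio.item():.3f} (near 0 = zero-sum-like cancellation)")
```

Tracking such an alignment ratio over training would, under these assumptions, show opposition growing as deceleration sets in; the paper's own diagnostics may differ in detail.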