This work examines how scaling affects the training dynamics of language models, in particular the shape of the loss curve.
Early in training, language models undergo loss deceleration: an abrupt slowdown in the rate of loss improvement that gives the loss curve a piecewise linear shape in log-log space.
Scaling up the model mitigates this transition, both by lowering the loss at which deceleration occurs and by improving the rate of loss improvement after deceleration.
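As a rough illustration of what "piecewise linear in log-log space" means in practice (not the paper's fitting procedure), the sketch below grid-searches a breakpoint that best splits a loss curve into two log-log lines, recovering the deceleration step, the loss at that step, and the slopes before and after. The synthetic loss curve, the breakpoint search, and the helper `two_segment_sse` are assumptions made purely for illustration.

```python
import numpy as np

# Hypothetical synthetic loss curve: a fast power-law decay early on,
# then a much slower power-law after a deceleration point (illustration only).
steps = np.arange(100, 100_000, 100)
loss = np.where(steps < 5_000,
                8.0 * steps ** -0.30,
                8.0 * 5_000 ** -0.30 * (steps / 5_000) ** -0.05)
loss *= np.exp(np.random.default_rng(0).normal(0, 0.01, steps.shape))  # mild noise

x, y = np.log(steps), np.log(loss)

def two_segment_sse(x, y, k):
    """Sum of squared errors from fitting separate lines before/after index k."""
    sse = 0.0
    for xs, ys in ((x[:k], y[:k]), (x[k:], y[k:])):
        slope, intercept = np.polyfit(xs, ys, 1)
        sse += np.sum((ys - (slope * xs + intercept)) ** 2)
    return sse

# Grid-search the breakpoint that best explains the curve as two log-log lines.
candidates = range(10, len(x) - 10)
k_best = min(candidates, key=lambda k: two_segment_sse(x, y, k))

slope_pre, _ = np.polyfit(x[:k_best], y[:k_best], 1)
slope_post, _ = np.polyfit(x[k_best:], y[k_best:], 1)
print(f"deceleration at ~step {steps[k_best]}, loss ~{loss[k_best]:.3f}")
print(f"log-log slope before: {slope_pre:.3f}, after: {slope_post:.3f}")
```

Under this framing, scaling's effect corresponds to a lower loss at the recovered breakpoint and a steeper (more negative) post-breakpoint slope.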
Loss deceleration is attributed to a training dynamic termed zero-sum learning (ZSL), in which per-example gradients become systematically opposed, so that improvements on some examples are offset by degradations on others and overall progress stalls.
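To make the ZSL intuition concrete, here is a minimal sketch (not the paper's measurement) that computes per-example gradients for a toy PyTorch model and compares the norm of their mean to the mean of their norms; when per-example gradients largely oppose each other, this ratio is driven toward zero. The toy model, data, and cancellation ratio are assumptions for illustration only.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy setup standing in for a language model (assumption for illustration).
model = nn.Linear(16, 4)
loss_fn = nn.CrossEntropyLoss()
inputs = torch.randn(32, 16)
targets = torch.randint(0, 4, (32,))

def per_example_grad(x, y):
    """Flattened gradient of the loss on a single example."""
    model.zero_grad()
    loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0)).backward()
    return torch.cat([p.grad.flatten().clone() for p in model.parameters()])

grads = torch.stack([per_example_grad(x, y) for x, y in zip(inputs, targets)])

# If per-example gradients were aligned, the norm of their mean would be close
# to the mean of their norms; heavy opposition drives this ratio toward zero.
ratio = grads.mean(dim=0).norm() / grads.norm(dim=1).mean()
print(f"gradient alignment ratio: {ratio.item():.3f} (near 0 = zero-sum-like cancellation)")
```

Tracking such an alignment ratio over training would, under these assumptions, show opposition growing as deceleration sets in; the paper's own diagnostics may differ in detail.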