Training Transformers at large scale without technical tricks such as learning rate warmup, and with a lower learning rate, is challenging and has been attracting increasing attention.
The paper provides a theoretical analysis of training Transformer models and identifies a phenomenon termed 'spectral energy concentration', which leads to model crash.
To address this issue, a novel optimization strategy inspired by Weyl's Inequality is introduced to make weight updates smoother, preventing entropy collapse and model crash.
Experiments with ViT, Swin-Transformer, and GPT demonstrate the effectiveness of the optimization strategy in training Transformers without learning rate warmup.
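The paper's exact procedure is not reproduced here; as a rough illustration of the underlying idea, the sketch below bounds the spectral norm of each weight update relative to the weight's current top singular value, so that by Weyl's inequality (sigma_max(W + dW) <= sigma_max(W) + sigma_max(dW)) the largest singular value of the updated weight cannot jump abruptly. The function name `weyl_constrained_update`, the `ratio` threshold, and the power-iteration estimate are illustrative assumptions, not the paper's implementation.

```python
import torch


def weyl_constrained_update(weight: torch.Tensor,
                            update: torch.Tensor,
                            ratio: float = 0.02,
                            n_iter: int = 5) -> torch.Tensor:
    """Rescale `update` so its spectral norm stays below a fraction of the
    weight's spectral norm. By Weyl's inequality, bounding sigma_max(dW)
    limits how fast sigma_max(W) can grow, which is the kind of smooth
    weight updating the paper argues prevents spectral energy concentration.
    `ratio` and `n_iter` are illustrative hyperparameters, not paper values.
    """

    def spectral_norm(m: torch.Tensor) -> torch.Tensor:
        # Power iteration to estimate the largest singular value of a 2-D matrix.
        v = torch.randn(m.shape[1], device=m.device, dtype=m.dtype)
        for _ in range(n_iter):
            v = m.T @ (m @ v)
            v = v / (v.norm() + 1e-12)
        return (m @ v).norm()

    sigma_w = spectral_norm(weight)
    sigma_u = spectral_norm(update)
    limit = ratio * sigma_w
    if sigma_u > limit:
        # Shrink the update so its top singular value does not exceed the cap.
        update = update * (limit / (sigma_u + 1e-12))
    return update


# Illustrative usage inside a training step (names are placeholders):
#   step = -lr * param.grad
#   param.data += weyl_constrained_update(param.data, step)
```

In this sketch the constraint is applied per weight matrix after the optimizer computes its raw step; the intent is only to show how a Weyl-style bound on the update's spectral norm translates into code, not to reproduce the paper's algorithm or hyperparameters.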