Training Transformers at large scale without technical tricks such as learning rate warmup, and with a lower learning rate, is challenging and has been attracting increasing attention.
The paper provides a theoretical analysis of training Transformer models and identifies a phenomenon termed 'spectral energy concentration', which leads to model crash.
To address this issue, a novel optimization strategy inspired by Weyl's Inequality is introduced to make weight updates smoother, preventing entropy collapse and model crash.
Experiments with ViT, Swin-Transformer, and GPT demonstrate the effectiveness of the optimization strategy in training Transformers without learning rate warmup.
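The paper's exact procedure is not reproduced here; as a rough illustration of the underlying idea, the sketch below bounds the spectral norm of each weight update relative to the weight's current top singular value, so that by Weyl's inequality (sigma_max(W + dW) <= sigma_max(W) + sigma_max(dW)) the largest singular value of the updated weight cannot jump abruptly. The function name `weyl_constrained_update`, the `ratio` threshold, and the power-iteration estimate are illustrative assumptions, not the paper's implementation.

```python
import torch


def weyl_constrained_update(weight: torch.Tensor,
                            update: torch.Tensor,
                            ratio: float = 0.02,
                            n_iter: int = 5) -> torch.Tensor:
    """Rescale `update` so its spectral norm stays below a fraction of the
    weight's spectral norm. By Weyl's inequality, bounding sigma_max(dW)
    limits how fast sigma_max(W) can grow, which is the kind of smooth
    weight updating the paper argues prevents spectral energy concentration.
    `ratio` and `n_iter` are illustrative hyperparameters, not paper values.
    """

    def spectral_norm(m: torch.Tensor) -> torch.Tensor:
        # Power iteration to estimate the largest singular value of a 2-D matrix.
        v = torch.randn(m.shape[1], device=m.device, dtype=m.dtype)
        for _ in range(n_iter):
            v = m.T @ (m @ v)
            v = v / (v.norm() + 1e-12)
        return (m @ v).norm()

    sigma_w = spectral_norm(weight)
    sigma_u = spectral_norm(update)
    limit = ratio * sigma_w
    if sigma_u > limit:
        # Shrink the update so its top singular value does not exceed the cap.
        update = update * (limit / (sigma_u + 1e-12))
    return update


# Illustrative usage inside a training step (names are placeholders):
#   step = -lr * param.grad
#   param.data += weyl_constrained_update(param.data, step)
```

In this sketch the constraint is applied per weight matrix after the optimizer computes its raw step; the intent is only to show how a Weyl-style bound on the update's spectral norm translates into code, not to reproduce the paper's algorithm or hyperparameters.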