Source: arXiv

Taming Transformer Without Using Learning Rate Warmup

  • Training Transformers at large scale without technical tricks such as learning rate warmup, and with a lower learning rate, is challenging and has been attracting growing attention.
  • The paper provides a theoretical analysis of Transformer training and identifies a phenomenon termed 'spectral energy concentration' that leads to model crash.
  • To address the issue, a novel optimization strategy inspired by Weyl's Inequality is introduced to make weight updates smoother, preventing entropy collapse and model crash (see the sketch after this list).
  • Experiments conducted with ViT, Swin-Transformer, and GPT demonstrate the effectiveness of the optimization strategy in training Transformers without using learning rate warmup.
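The summary does not spell out the actual update rule. Weyl's Inequality bounds how much an additive perturbation can shift a matrix's singular values: sigma_max(W + ΔW) ≤ sigma_max(W) + sigma_max(ΔW). Below is a minimal sketch of one way such a bound could be used to keep weight updates smooth; the function name, the threshold tau, and the capping rule are illustrative assumptions, not the paper's published algorithm.

```python
import torch

def weyl_capped_update(weight: torch.Tensor, update: torch.Tensor,
                       tau: float = 0.01) -> torch.Tensor:
    """Illustrative sketch (not the paper's algorithm): cap the spectral norm
    of an additive update so that, by Weyl's Inequality, the largest singular
    value of the weight matrix grows by at most tau * sigma_max(weight)."""
    # Largest singular values of the current weight and of the proposed update.
    sigma_w = torch.linalg.matrix_norm(weight, ord=2)
    sigma_u = torch.linalg.matrix_norm(update, ord=2)
    # Allowed per-step growth of sigma_max, relative to the current weight.
    budget = tau * sigma_w
    if sigma_u > budget:
        # Rescale the update so its spectral norm stays within the budget,
        # which keeps spectral energy from concentrating too quickly.
        update = update * (budget / sigma_u)
    return weight + update
```

In practice a cap of this kind would be applied to each 2-D weight matrix inside the optimizer step, which is one way to slow the growth of the top singular values that the 'spectral energy concentration' analysis associates with model crash.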
