Transformer models have become essential in various scientific and engineering fields due to their exceptional performance.
Research has focused on understanding the convergence dynamics of Transformers, with particular attention to the roles of self-attention, feedforward networks, and residual connections.
The study shows that, under appropriate initialization, gradient descent attains a linear convergence rate whose speed is governed by the singular values of the attention layer's output matrix.
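As an illustrative sketch (the notation below is assumed for exposition, not taken verbatim from the study), a linear-rate guarantee of this kind typically bounds the loss by a geometric contraction:

```latex
% Hedged sketch of a linear-convergence bound of the kind described above.
% \mathcal{L}(\theta_t) is the training loss at step t, \eta the step size,
% c a problem-dependent constant, and \sigma_{\min}(W_O) the smallest
% singular value of the attention layer's output matrix W_O
% (all symbols here are assumptions for illustration).
\mathcal{L}(\theta_{t+1})
  \;\le\;
  \bigl(1 - c\,\eta\,\sigma_{\min}^{2}(W_O)\bigr)\,\mathcal{L}(\theta_{t}),
  \qquad 0 < c\,\eta\,\sigma_{\min}^{2}(W_O) < 1 .
```

Under a bound of this form, a larger minimum singular value of the output matrix translates directly into faster geometric decay of the loss.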
Residual connections improve optimization stability by mitigating the difficulties that arise from the low-rank structure imposed by the softmax operation.
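A minimal numerical sketch of this effect (an illustrative setup, not the paper's construction): when softmax attention saturates so that every query attends with the same weights, the attention matrix collapses toward rank one, while the identity branch contributed by the residual connection keeps the combined map full-rank.

```python
# Illustrative sketch (assumed setup): a saturated softmax attention matrix
# with identical rows is rank one; adding the residual (identity) branch
# restores a full-rank, well-conditioned map.
import numpy as np

rng = np.random.default_rng(0)
n = 8  # sequence length

# One shared logit vector for every query token, as happens when attention
# saturates; after softmax, all rows of A are identical, so rank(A) = 1.
logits = rng.normal(size=n)
row = np.exp(logits - logits.max())
row /= row.sum()
A = np.tile(row, (n, 1))

def numerical_rank(M, tol=1e-8):
    # Count singular values above a relative tolerance.
    s = np.linalg.svd(M, compute_uv=False)
    return int((s > tol * s[0]).sum())

print("rank(A)     =", numerical_rank(A))              # 1: softmax-induced collapse
print("rank(I + A) =", numerical_rank(np.eye(n) + A))  # n: residual path restores rank
```

Running this prints rank 1 for the attention matrix alone and full rank n once the identity is added, mirroring the stabilizing role attributed to residual connections.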