The paper analyzes the infinite-width limit of a single attention layer using the Tensor Programs framework.
Existing Gaussian-based theories fail to model attention layers accurately; this study instead identifies the distribution of the variables in an attention layer without relying on infinite-head approximations or tailored scalings.
The resulting limit law is non-Gaussian, owing to a hierarchical structure: it is Gaussian conditional on random similarity scores.
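Schematically, with notation introduced here purely for illustration (the symbols below are not taken verbatim from the paper), this hierarchical structure amounts to a Gaussian mixture:

\[
S \sim \mu, \qquad Z \mid S \sim \mathcal{N}\bigl(0,\, \Sigma(S)\bigr),
\]

where S collects the random similarity scores arising from the query-key inner products, \mu is their non-degenerate limit distribution, and \Sigma(S) is a conditional covariance depending on S. Marginally, Z is a mixture of Gaussians and hence non-Gaussian, even though it is Gaussian given S.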
Numerical experiments validate the theoretical predictions, showing that the theory accurately describes attention at finite width and with finitely many heads, with implications for deep Transformer architectures.
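The following is a minimal simulation sketch (not the paper's experimental setup) of how such an experiment might probe the hierarchical limit law. It assumes a single attention head with i.i.d. N(0, 1/d) weights on a fixed input sequence; the sizes d, T, and n_samples are illustrative choices. Queries, keys, and one value coordinate are sampled directly from the Gaussian laws they would have under that initialization, and the excess kurtosis of one output entry is measured: a Gaussian gives zero, while a Gaussian mixture gives a positive value.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes only (assumptions, not the paper's settings).
d, T, n_samples = 1024, 3, 20000

# Fixed input sequence; its Gram matrix determines all covariances below.
X = rng.standard_normal((T, d))
C = X @ X.T / d                 # input Gram matrix
L = np.linalg.cholesky(C)

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

outputs = np.empty(n_samples)
for i in range(n_samples):
    # Each column of Q = X @ Wq is N(0, C) when Wq has i.i.d. N(0, 1/d)
    # entries, so we sample Q, K, and one value column v directly.
    Q = L @ rng.standard_normal((T, d))
    K = L @ rng.standard_normal((T, d))
    v = L @ rng.standard_normal(T)
    scores = softmax(Q @ K.T / np.sqrt(d))  # random similarity scores
    outputs[i] = scores[0] @ v              # one entry of the layer output

# Zero excess kurtosis would indicate a Gaussian; a positive value
# reflects the hierarchical (Gaussian-mixture) limit law.
z = (outputs - outputs.mean()) / outputs.std()
print("excess kurtosis:", (z**4).mean() - 3.0)
```

Conditional on the scores (which depend only on the query and key weights), the output is a linear combination of Gaussian value entries and is therefore Gaussian; because the scores remain random in the width limit, the marginal distribution is a mixture, which the kurtosis statistic detects.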