The paper analyzes the infinite-width limit of a single attention layer using the Tensor Programs framework.
Existing Gaussian-based theories fail to model attention layers accurately; this study instead identifies the distribution of the variables in an attention layer without relying on infinite-head approximations or tailored scalings.
The resulting limit law is non-Gaussian, owing to a hierarchical structure: it is Gaussian conditional on random similarity scores.
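Schematically, with notation introduced here purely for illustration (the symbols below are not taken verbatim from the paper), this hierarchical structure amounts to a Gaussian mixture:

\[
S \sim \mu, \qquad Z \mid S \sim \mathcal{N}\bigl(0,\, \Sigma(S)\bigr),
\]

where S collects the random similarity scores arising from the query-key inner products, \mu is their non-degenerate limit distribution, and \Sigma(S) is a conditional covariance depending on S. Marginally, Z is a mixture of Gaussians and hence non-Gaussian, even though it is Gaussian given S.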
Numerical experiments validate the theoretical predictions, showing that the theory accurately describes attention at finite width and with finitely many heads, with implications for deep Transformer architectures.
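The following is a minimal simulation sketch (not the paper's experimental setup) of how such an experiment might probe the hierarchical limit law. It assumes a single attention head with i.i.d. N(0, 1/d) weights on a fixed input sequence; the sizes d, T, and n_samples are illustrative choices. Queries, keys, and one value coordinate are sampled directly from the Gaussian laws they would have under that initialization, and the excess kurtosis of one output entry is measured: a Gaussian gives zero, while a Gaussian mixture gives a positive value.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes only (assumptions, not the paper's settings).
d, T, n_samples = 1024, 3, 20000

# Fixed input sequence; its Gram matrix determines all covariances below.
X = rng.standard_normal((T, d))
C = X @ X.T / d                 # input Gram matrix
L = np.linalg.cholesky(C)

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

outputs = np.empty(n_samples)
for i in range(n_samples):
    # Each column of Q = X @ Wq is N(0, C) when Wq has i.i.d. N(0, 1/d)
    # entries, so we sample Q, K, and one value column v directly.
    Q = L @ rng.standard_normal((T, d))
    K = L @ rng.standard_normal((T, d))
    v = L @ rng.standard_normal(T)
    scores = softmax(Q @ K.T / np.sqrt(d))  # random similarity scores
    outputs[i] = scores[0] @ v              # one entry of the layer output

# Zero excess kurtosis would indicate a Gaussian; a positive value
# reflects the hierarchical (Gaussian-mixture) limit law.
z = (outputs - outputs.mean()) / outputs.std()
print("excess kurtosis:", (z**4).mean() - 3.0)
```

Conditional on the scores (which depend only on the query and key weights), the output is a linear combination of Gaussian value entries and is therefore Gaussian; because the scores remain random in the width limit, the marginal distribution is a mixture, which the kurtosis statistic detects.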