A new research paper explores the connection between Mixture of Experts (MoE) models and the self-attention mechanism, showing that each row of the self-attention output matrix can be expressed as a quadratic gating mixture of linear experts.
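Concretely (using standard single-head self-attention notation as an illustrative assumption, with input tokens $x_1, \dots, x_N \in \mathbb{R}^d$ and weight matrices $W_Q$, $W_K$, $W_V$), the $i$-th output row can be written as

$$
h_i \;=\; \sum_{j=1}^{N} \operatorname{softmax}_j\!\left(\frac{x_i^{\top} W_Q W_K^{\top} x_j}{\sqrt{d}}\right) W_V^{\top} x_j,
$$

where each softmax weight is quadratic in the token pair $(x_i, x_j)$ and each term $W_V^{\top} x_j$ plays the role of a linear expert.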
The study conducts a thorough convergence analysis of MoE models under different quadratic gating functions, showing that the quadratic monomial gate yields better sample efficiency for parameter estimation than the quadratic polynomial gate.
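For reference, one common way to formalize these two gates (the exact parameterization here is an assumption, not taken verbatim from the paper) is that the quadratic monomial gate scores each expert with a purely quadratic form, while the quadratic polynomial gate also includes linear and constant terms:

$$
g_i^{\mathrm{mono}}(x) \;\propto\; \exp\!\left(x^{\top} A_i x\right),
\qquad
g_i^{\mathrm{poly}}(x) \;\propto\; \exp\!\left(x^{\top} A_i x + b_i^{\top} x + c_i\right),
$$

so the monomial gate is the special case of the polynomial gate with $b_i = 0$ and $c_i = 0$.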
The analysis further shows that replacing linear experts with non-linear ones leads to faster parameter and expert estimation rates. Motivated by this, the paper proposes an 'active-attention' mechanism, obtained by applying a non-linear activation function to the value matrix in the self-attention formula.
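A minimal sketch of this idea in PyTorch is given below; the single-head layout and the choice of GELU as the activation are illustrative assumptions, not necessarily the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ActiveSelfAttention(nn.Module):
    """Single-head self-attention with a non-linear activation applied to the
    value projection, i.e. the experts in the attention mixture become
    non-linear functions of the input tokens."""

    def __init__(self, dim: int):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim, bias=False)
        self.k_proj = nn.Linear(dim, dim, bias=False)
        self.v_proj = nn.Linear(dim, dim, bias=False)
        self.scale = dim ** -0.5

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim)
        q = self.q_proj(x)
        k = self.k_proj(x)
        # Non-linear activation on the value projection ("active" values);
        # GELU is an assumed choice for illustration.
        v = F.gelu(self.v_proj(x))
        attn = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        return attn @ v


# Example usage:
layer = ActiveSelfAttention(dim=64)
out = layer(torch.randn(2, 16, 64))  # -> shape (2, 16, 64)
```

Replacing the activation with the identity recovers standard self-attention, which makes the comparison in the experiments direct.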
In extensive experiments on image classification, language modeling, and time series forecasting, the proposed active-attention mechanism is shown to outperform standard self-attention.