A new research paper explores the connection between Mixture of Experts (MoE) models and the self-attention mechanism, showing that each row of the self-attention output matrix can be expressed as a quadratic gating mixture of linear experts.
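Concretely (using standard single-head self-attention notation as an illustrative assumption, with input tokens $x_1, \dots, x_N \in \mathbb{R}^d$ and weight matrices $W_Q$, $W_K$, $W_V$), the $i$-th output row can be written as

$$
h_i \;=\; \sum_{j=1}^{N} \operatorname{softmax}_j\!\left(\frac{x_i^{\top} W_Q W_K^{\top} x_j}{\sqrt{d}}\right) W_V^{\top} x_j,
$$

where each softmax weight is quadratic in the token pair $(x_i, x_j)$ and each term $W_V^{\top} x_j$ plays the role of a linear expert.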
The study conducts a thorough convergence analysis of MoE models under different quadratic gating functions, showing that the quadratic monomial gate yields better sample efficiency for parameter estimation than the quadratic polynomial gate.
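For reference, one common way to formalize these two gates (the exact parameterization here is an assumption, not taken verbatim from the paper) is that the quadratic monomial gate scores each expert with a purely quadratic form, while the quadratic polynomial gate also includes linear and constant terms:

$$
g_i^{\mathrm{mono}}(x) \;\propto\; \exp\!\left(x^{\top} A_i x\right),
\qquad
g_i^{\mathrm{poly}}(x) \;\propto\; \exp\!\left(x^{\top} A_i x + b_i^{\top} x + c_i\right),
$$

so the monomial gate is the special case of the polynomial gate with $b_i = 0$ and $c_i = 0$.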
The analysis further shows that replacing linear experts with non-linear ones leads to faster parameter and expert estimation rates. Motivated by this, the paper proposes an 'active-attention' mechanism, obtained by applying a non-linear activation function to the value matrix in the self-attention formula.
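A minimal sketch of this idea in PyTorch is given below; the single-head layout and the choice of GELU as the activation are illustrative assumptions, not necessarily the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ActiveSelfAttention(nn.Module):
    """Single-head self-attention with a non-linear activation applied to the
    value projection, i.e. the experts in the attention mixture become
    non-linear functions of the input tokens."""

    def __init__(self, dim: int):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim, bias=False)
        self.k_proj = nn.Linear(dim, dim, bias=False)
        self.v_proj = nn.Linear(dim, dim, bias=False)
        self.scale = dim ** -0.5

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim)
        q = self.q_proj(x)
        k = self.k_proj(x)
        # Non-linear activation on the value projection ("active" values);
        # GELU is an assumed choice for illustration.
        v = F.gelu(self.v_proj(x))
        attn = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        return attn @ v


# Example usage:
layer = ActiveSelfAttention(dim=64)
out = layer(torch.randn(2, 16, 64))  # -> shape (2, 16, 64)
```

Replacing the activation with the identity recovers standard self-attention, which makes the comparison in the experiments direct.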
In extensive experiments on image classification, language modeling, and time series forecasting, the proposed active-attention mechanism is shown to outperform standard self-attention.