Image Credit: Arxiv

Quadratic Gating Mixture of Experts: Statistical Insights into Self-Attention

  • A new research paper explores the connection between Mixture of Experts (MoE) models and the self-attention mechanism, revealing that each row of a self-attention matrix can be expressed as a quadratic gating mixture of linear experts (see the decomposition sketched after this list).
  • The study conducts a thorough convergence analysis of MoE models using different quadratic gating functions, suggesting that the quadratic monomial gate enhances sample efficiency for parameter estimation compared to the quadratic polynomial gate.
  • The analysis shows that employing non-linear experts instead of linear ones leads to faster parameter and expert estimation rates. Building on this, the research proposes an 'active-attention' mechanism that applies a non-linear activation function to the value matrix in the self-attention formula (see the code sketch after this list).
  • Through extensive experiments in tasks such as image classification, language modeling, and time series forecasting, the proposed active-attention mechanism is shown to outperform standard self-attention.
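To make the first bullet concrete, here is how a row of standard scaled dot-product attention decomposes into a gate-plus-experts form. The notation below (row-vector tokens x_i, projections W_Q, W_K, W_V, key dimension d_k) is the usual Transformer convention and is a sketch of the reading described in the summary, not the paper's exact formulation:

$$
\mathrm{Attn}(X)_i \;=\; \sum_{j=1}^{n} \underbrace{\mathrm{softmax}_j\!\left(\frac{x_i W_Q W_K^\top x_j^\top}{\sqrt{d_k}}\right)}_{\text{quadratic gate in }(x_i,\,x_j)} \; \underbrace{x_j W_V}_{\text{linear expert}}
$$

The softmax weights depend on the inputs only through the quadratic form $x_i W_Q W_K^\top x_j^\top$, which is the "quadratic gating" referred to above, while each value projection $x_j W_V$ plays the role of a linear expert.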

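For the 'active-attention' idea in the third bullet, a minimal single-head sketch is given below. The summary only states that a non-linear activation is applied to the value matrix; the choice of GELU, the single-head setup, and the name `active_attention` are assumptions made here for illustration, not the paper's implementation:

```python
import torch
import torch.nn.functional as F

def active_attention(X, W_q, W_k, W_v, activation=F.gelu):
    """Scaled dot-product attention with a non-linearity on the value projection.

    X:          (n, d) token embeddings
    W_q, W_k, W_v: (d, d_k) projection matrices
    activation: non-linearity applied to the values
                (the summary does not name one; GELU is an assumption here)
    """
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / K.shape[-1] ** 0.5      # quadratic "gate" logits
    weights = torch.softmax(scores, dim=-1)    # softmax gating over tokens
    return weights @ activation(V)             # non-linear "experts" replace linear values

# Usage on random data
n, d, d_k = 8, 16, 16
X = torch.randn(n, d)
W_q, W_k, W_v = (torch.randn(d, d_k) for _ in range(3))
out = active_attention(X, W_q, W_k, W_v)       # (n, d_k)
```

Passing the identity function as `activation` recovers standard self-attention, which is the baseline the experiments in the last bullet compare against.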