Linear Attention (LA) is an influential framework that popularized kernel-based attention and established its connection to recurrent autoregressive models.
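The kernel-to-recurrence connection can be sketched as follows: replacing softmax with a feature map φ lets the causal attention output be accumulated with running sums, turning an O(N²) computation into an O(N) recurrence. This is a minimal illustration, not any specific published implementation; the choice of φ(x) = elu(x) + 1 is an assumption borrowed from common practice in the linear-attention literature.

```python
import numpy as np

def phi(x):
    # Positive feature map; elu(x) + 1 is one common choice (an assumption here)
    return np.where(x > 0, x + 1.0, np.exp(x))

def causal_linear_attention_parallel(Q, K, V):
    # O(N^2) reference: kernelized attention over the causal prefix,
    # normalized per query position
    Qf, Kf = phi(Q), phi(K)
    out = np.zeros_like(V)
    for i in range(Q.shape[0]):
        w = Qf[i] @ Kf[: i + 1].T            # unnormalized weights over prefix
        out[i] = (w @ V[: i + 1]) / w.sum()
    return out

def causal_linear_attention_recurrent(Q, K, V):
    # O(N) recurrence: carry running sums
    #   S = sum_j phi(k_j) v_j^T   and   z = sum_j phi(k_j)
    d, dv = Q.shape[1], V.shape[1]
    S = np.zeros((d, dv))
    z = np.zeros(d)
    out = np.zeros_like(V)
    for i in range(Q.shape[0]):
        S += np.outer(phi(K[i]), V[i])
        z += phi(K[i])
        q = phi(Q[i])
        out[i] = (q @ S) / (q @ z)
    return out

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((6, 4)) for _ in range(3))
# Both formulations produce the same output
print(np.allclose(causal_linear_attention_parallel(Q, K, V),
                  causal_linear_attention_recurrent(Q, K, V)))  # True
```

The recurrent form is what makes autoregressive decoding with linear attention constant-memory per step: the state (S, z) summarizes the entire prefix.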
LA has inspired numerous variants, including Random Feature Attention (RFA), Performer, TransNormer, cosFormer, and Linear Randomized Attention.
Efficient attention models beyond kernel-based attention have also been proposed.
Long-context models have become popular, but this work is among the first to demonstrate performance that improves as context length grows.