Sequence modeling with neural networks is currently dominated by transformer architectures built on softmax self-attention, but their inference cost grows with sequence length: the key-value cache expands linearly, and each new token must attend to all cached positions.
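To make this scaling concrete, here is a minimal sketch (plain NumPy, single head, no masking or positional-encoding details; the function names are illustrative, not from any particular library) of autoregressive decoding with softmax attention, where both the cache memory and the per-token attention cost grow with the number of tokens processed.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(q, K_cache, V_cache):
    """Softmax attention read-out for one new query against the full cache.
    At step t this costs O(t * d) compute, and the cache itself is O(t * d) memory."""
    scores = K_cache @ q / np.sqrt(q.shape[0])   # (t,)
    weights = softmax(scores)
    return weights @ V_cache                     # (d,)

# toy decoding loop: memory and per-token compute grow with sequence length
d, T = 16, 512
rng = np.random.default_rng(0)
K_cache = np.empty((0, d))
V_cache = np.empty((0, d))
for t in range(T):
    q, k, v = rng.normal(size=d), rng.normal(size=d), rng.normal(size=d)
    K_cache = np.vstack([K_cache, k])            # cache grows linearly with t
    V_cache = np.vstack([V_cache, v])
    y = attend(q, K_cache, V_cache)
```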
Recent work has introduced recurrent models such as DeltaNet, Mamba, and xLSTM, which achieve constant memory and constant per-token compute by linearizing the softmax operation and maintaining a fixed-size recurrent state instead of a growing cache.
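As a rough illustration of why such layers have constant inference cost, the sketch below (plain NumPy, single head, no gating or normalization; a simplified DeltaNet-style delta-rule update, with illustrative function names) keeps a fixed-size matrix state that is updated online, so memory does not grow with sequence length.

```python
import numpy as np

def delta_rule_step(S, k, v, beta=1.0):
    """One recurrent update of a DeltaNet-style linear-attention layer.

    S    : (d_k, d_v) fixed-size state matrix (the "fast weights")
    k, v : (d_k,), (d_v,) key and value of the current token
    beta : step size of the online (delta) learning rule
    """
    prediction = S.T @ k                          # what the current state predicts for this key
    S = S + beta * np.outer(k, v - prediction)    # delta-rule correction toward the true value
    return S

def read_out(S, q):
    """Query the state: the output depends only on S and q, not on past tokens."""
    return S.T @ q

# toy usage: state size is constant regardless of sequence length
d_k, d_v, T = 8, 8, 1000
S = np.zeros((d_k, d_v))
rng = np.random.default_rng(0)
for _ in range(T):
    k, v = rng.normal(size=d_k), rng.normal(size=d_v)
    S = delta_rule_step(S, k / np.linalg.norm(k), v, beta=0.5)
out = read_out(S, rng.normal(size=d_k))
```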
The recurrent dynamics of these models can be derived from an in-context regression objective that an online learning rule optimizes only approximately; the Mesa layer instead minimizes this objective to optimality at every time step, and has now been introduced for language modeling at the billion-parameter scale.
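A minimal sketch of what this optimal test-time training looks like, under simplifying assumptions (single head, plain NumPy, a ridge-regularized in-context regression solved with a few conjugate-gradient iterations per token; names such as mesa_read_out are illustrative rather than the paper's API):

```python
import numpy as np

def cg_solve(A, b, n_iters=10):
    """Few-step conjugate gradient for A x = b, with A symmetric positive definite."""
    x = np.zeros_like(b)
    r = b - A @ x
    p = r.copy()
    rs = r @ r
    for _ in range(n_iters):
        Ap = A @ p
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if rs_new < 1e-10:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

def mesa_step(H, G, k, v):
    """Accumulate fixed-size sufficient statistics of the in-context regression problem."""
    H = H + np.outer(k, k)   # (d_k, d_k) key covariance
    G = G + np.outer(k, v)   # (d_k, d_v) key-value cross term
    return H, G

def mesa_read_out(H, G, q, lam=1.0, n_iters=10):
    """Illustrative Mesa-style read-out: the exact minimizer of the ridge-regularized
    in-context loss, applied to the query q. The per-token solve uses conjugate
    gradient rather than an approximate online update."""
    d_k = H.shape[0]
    x = cg_solve(H + lam * np.eye(d_k), q, n_iters)   # (H + lam I)^{-1} q
    return G.T @ x                                    # optimal regression weights applied to q

# toy usage
d_k, d_v, T = 8, 8, 256
H, G = np.zeros((d_k, d_k)), np.zeros((d_k, d_v))
rng = np.random.default_rng(0)
for _ in range(T):
    k, v = rng.normal(size=d_k), rng.normal(size=d_v)
    H, G = mesa_step(H, G, k, v)
    y = mesa_read_out(H, G, k)   # predict the value for the current query (= key here)
```

The solve per token is where the extra flops mentioned below come from: the state stays constant in size, but reaching the optimum at every step costs more compute than a single online update.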
Optimal test-time training with the Mesa layer achieves lower language-modeling perplexity and higher downstream benchmark performance than these earlier recurrent models, although this gain comes at the cost of additional flops spent during inference.