Sequence modeling with neural networks is currently dominated by transformer architectures built on softmax self-attention, but their inference cost grows with sequence length: the key-value cache expands linearly, and each new token must attend to all cached positions.
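To make this scaling concrete, here is a minimal sketch (plain NumPy, single head, no masking or positional-encoding details; the function names are illustrative, not from any particular library) of autoregressive decoding with softmax attention, where both the cache memory and the per-token attention cost grow with the number of tokens processed.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(q, K_cache, V_cache):
    """Softmax attention read-out for one new query against the full cache.
    At step t this costs O(t * d) compute, and the cache itself is O(t * d) memory."""
    scores = K_cache @ q / np.sqrt(q.shape[0])   # (t,)
    weights = softmax(scores)
    return weights @ V_cache                     # (d,)

# toy decoding loop: memory and per-token compute grow with sequence length
d, T = 16, 512
rng = np.random.default_rng(0)
K_cache = np.empty((0, d))
V_cache = np.empty((0, d))
for t in range(T):
    q, k, v = rng.normal(size=d), rng.normal(size=d), rng.normal(size=d)
    K_cache = np.vstack([K_cache, k])            # cache grows linearly with t
    V_cache = np.vstack([V_cache, v])
    y = attend(q, K_cache, V_cache)
```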
Recent work has introduced recurrent models such as DeltaNet, Mamba, and xLSTM, which achieve constant memory and constant per-token compute by linearizing the softmax operation and maintaining a fixed-size recurrent state instead of a growing cache.
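As a rough illustration of why such layers have constant inference cost, the sketch below (plain NumPy, single head, no gating or normalization; a simplified DeltaNet-style delta-rule update, with illustrative function names) keeps a fixed-size matrix state that is updated online, so memory does not grow with sequence length.

```python
import numpy as np

def delta_rule_step(S, k, v, beta=1.0):
    """One recurrent update of a DeltaNet-style linear-attention layer.

    S    : (d_k, d_v) fixed-size state matrix (the "fast weights")
    k, v : (d_k,), (d_v,) key and value of the current token
    beta : step size of the online (delta) learning rule
    """
    prediction = S.T @ k                          # what the current state predicts for this key
    S = S + beta * np.outer(k, v - prediction)    # delta-rule correction toward the true value
    return S

def read_out(S, q):
    """Query the state: the output depends only on S and q, not on past tokens."""
    return S.T @ q

# toy usage: state size is constant regardless of sequence length
d_k, d_v, T = 8, 8, 1000
S = np.zeros((d_k, d_v))
rng = np.random.default_rng(0)
for _ in range(T):
    k, v = rng.normal(size=d_k), rng.normal(size=d_v)
    S = delta_rule_step(S, k / np.linalg.norm(k), v, beta=0.5)
out = read_out(S, rng.normal(size=d_k))
```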
The recurrent dynamics of these models can be derived from an in-context regression objective that an online learning rule optimizes only approximately; the Mesa layer instead minimizes this objective to optimality at every time step, and has now been introduced for language modeling at the billion-parameter scale.
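A minimal sketch of what this optimal test-time training looks like, under simplifying assumptions (single head, plain NumPy, a ridge-regularized in-context regression solved with a few conjugate-gradient iterations per token; names such as mesa_read_out are illustrative rather than the paper's API):

```python
import numpy as np

def cg_solve(A, b, n_iters=10):
    """Few-step conjugate gradient for A x = b, with A symmetric positive definite."""
    x = np.zeros_like(b)
    r = b - A @ x
    p = r.copy()
    rs = r @ r
    for _ in range(n_iters):
        Ap = A @ p
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if rs_new < 1e-10:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

def mesa_step(H, G, k, v):
    """Accumulate fixed-size sufficient statistics of the in-context regression problem."""
    H = H + np.outer(k, k)   # (d_k, d_k) key covariance
    G = G + np.outer(k, v)   # (d_k, d_v) key-value cross term
    return H, G

def mesa_read_out(H, G, q, lam=1.0, n_iters=10):
    """Illustrative Mesa-style read-out: the exact minimizer of the ridge-regularized
    in-context loss, applied to the query q. The per-token solve uses conjugate
    gradient rather than an approximate online update."""
    d_k = H.shape[0]
    x = cg_solve(H + lam * np.eye(d_k), q, n_iters)   # (H + lam I)^{-1} q
    return G.T @ x                                    # optimal regression weights applied to q

# toy usage
d_k, d_v, T = 8, 8, 256
H, G = np.zeros((d_k, d_k)), np.zeros((d_k, d_v))
rng = np.random.default_rng(0)
for _ in range(T):
    k, v = rng.normal(size=d_k), rng.normal(size=d_v)
    H, G = mesa_step(H, G, k, v)
    y = mesa_read_out(H, G, k)   # predict the value for the current query (= key here)
```

The solve per token is where the extra flops mentioned below come from: the state stays constant in size, but reaching the optimum at every step costs more compute than a single online update.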
Optimal test-time training with the Mesa layer achieves lower language-modeling perplexity and higher downstream benchmark performance than these earlier recurrent models, although this gain comes at the cost of additional flops spent during inference.