Long-range sequence processing poses a significant challenge for Transformers due to their quadratic complexity in the input length.
Mamba, an alternative to Transformers, demonstrates high performance, achieving Transformer-level capabilities while requiring fewer computational resources. However, Mamba's length-generalization capabilities are found to be relatively limited. DeciMamba, a context-extension method designed for Mamba, addresses this limitation by enabling the trained model to extrapolate well to longer context lengths without additional training.