The authors present a new technique called Selective State Space Models, which makes the SSM parameters input-dependent to improve modeling quality while preserving computational efficiency through a hardware-aware algorithm.
The authors provide an overview of structured state space models (SSMs), noting that a larger hidden state dimension makes them more expressive but slower, and that the recurrent mode is more flexible than the convolutional mode, while the latter is more efficient.
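To make the two computation modes concrete, here is a minimal JAX sketch of a toy LTI SSM (single channel, illustrative shapes and names; not the authors' implementation): the recurrent mode carries an N-dimensional hidden state per step, while the convolutional mode precomputes a kernel and parallelizes over the sequence.

```python
import jax.numpy as jnp
from jax import random

# Toy discretized LTI SSM:  h_t = A h_{t-1} + B x_t,  y_t = C h_t
# (single channel, hidden state of size N; names are illustrative).
N, L = 4, 8
A = jnp.diag(random.uniform(random.PRNGKey(0), (N,), minval=0.1, maxval=0.9))
B = random.normal(random.PRNGKey(1), (N,))
C = random.normal(random.PRNGKey(2), (N,))
x = random.normal(random.PRNGKey(3), (L,))

# Recurrent mode: constant memory per step, but inherently sequential.
h = jnp.zeros(N)
y_rec = []
for t in range(L):
    h = A @ h + B * x[t]
    y_rec.append(C @ h)
y_rec = jnp.stack(y_rec)

# Convolutional mode: kernel K_k = C A^k B, then a causal convolution,
# which parallelizes over the sequence (and is FFT-friendly for long L).
K = jnp.stack([C @ jnp.linalg.matrix_power(A, k) @ B for k in range(L)])
y_conv = jnp.stack([jnp.dot(K[: t + 1][::-1], x[: t + 1]) for t in range(L)])

assert jnp.allclose(y_rec, y_conv, atol=1e-5)  # both modes give the same output
```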
The authors propose to leverage properties of modern accelerators (GPUs) to materialize the state h only in the more efficient levels of the memory hierarchy (SRAM rather than HBM); in particular, the full state h is never written out to the slower HBM.
The selective scan layer is illustrated in Figure 1: it is a memory-efficient layer that uses an efficient parallel scan algorithm in place of sequential recurrence.
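The parallel-scan idea can be sketched on a diagonal linear recurrence with jax.lax.associative_scan; the shapes and names below are illustrative, and this is a sketch of the concept rather than the fused CUDA kernel the authors describe.

```python
import jax
import jax.numpy as jnp

# Linear recurrence  h_t = a_t * h_{t-1} + b_t  (diagonal A, h_{-1} = 0).
# The pairs (a_t, b_t) compose associatively, so all prefix states can be
# computed with a parallel scan: O(L) work, O(log L) depth.
def combine(left, right):
    a_l, b_l = left
    a_r, b_r = right
    return a_r * a_l, a_r * b_l + b_r

L, N = 8, 4
a = jax.random.uniform(jax.random.PRNGKey(0), (L, N), minval=0.1, maxval=0.9)
b = jax.random.normal(jax.random.PRNGKey(1), (L, N))

_, h_par = jax.lax.associative_scan(combine, (a, b))   # parallel over the sequence

# Reference: the same recurrence computed sequentially.
def step(h, ab):
    a_t, b_t = ab
    h = a_t * h + b_t
    return h, h
_, h_seq = jax.lax.scan(step, jnp.zeros(N), (a, b))

assert jnp.allclose(h_par, h_seq, atol=1e-5)
```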
The intermediate states, which are necessary for backpropagation, are not stored but recomputed in the backward pass when the inputs are loaded from HBM to SRAM.
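This recomputation corresponds to what frameworks call activation (gradient) checkpointing; below is a minimal sketch using jax.checkpoint on a toy scan layer, as an assumed framework-level analogue rather than the authors' fused kernel.

```python
import jax
import jax.numpy as jnp

# Toy scan layer whose intermediate states h_t would normally be stored for
# backpropagation. jax.checkpoint marks the function so that those
# intermediates are discarded and recomputed during the backward pass.
@jax.checkpoint
def scan_layer(a, b):
    def step(h, ab):
        a_t, b_t = ab
        h = a_t * h + b_t
        return h, h
    _, hs = jax.lax.scan(step, jnp.zeros(b.shape[-1]), (a, b))
    return hs.sum()          # scalar stand-in for a downstream loss

a = jnp.full((16, 4), 0.9)
b = jnp.ones((16, 4))
grads = jax.grad(scan_layer, argnums=(0, 1))(a, b)   # states recomputed in backward
```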
The authors state that the fused selective scan layer has the same memory requirements as an optimized Transformer implementation with FlashAttention.
The authors evaluate their implementation empirically on synthetic tasks as well as on language, DNA, and audio modeling and generation.
The selection mechanism is designed to overcome the limitations of LTI models; a hardware-aware algorithm is then needed to make the resulting selective SSMs efficient on modern hardware (GPUs) as well.
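To make the contrast with LTI models concrete, here is a hedged sketch of a selective parameterization in which the step size, B, and C become functions of the input; the projections, shapes, and the simplified discretization are illustrative assumptions, not the authors' exact formulation.

```python
import jax
import jax.numpy as jnp

# Sketch: in an LTI SSM the parameters (Delta, B, C) are fixed and shared
# across time steps; in a selective SSM they are produced from the input,
# so the recurrence varies per token. Shapes, projections, and the
# simplified discretization below are illustrative assumptions.
L, D, N = 8, 16, 4                                   # length, channels, state size
x = jax.random.normal(jax.random.PRNGKey(0), (L, D))

W_delta = 0.1 * jax.random.normal(jax.random.PRNGKey(1), (D, D))
W_B = 0.1 * jax.random.normal(jax.random.PRNGKey(2), (D, N))
W_C = 0.1 * jax.random.normal(jax.random.PRNGKey(3), (D, N))

delta = jax.nn.softplus(x @ W_delta)                 # (L, D) per-token step sizes
B = x @ W_B                                          # (L, N) per-token input matrix
C = x @ W_C                                          # (L, N) per-token output matrix

# Time-varying discretization (simplified): because (delta, B, C) depend on t,
# the global-convolution trick of LTI SSMs no longer applies, and the model
# must be run as a (parallel-scannable) recurrence instead.
A = -jnp.exp(jax.random.normal(jax.random.PRNGKey(4), (D, N)))   # fixed A
A_bar = jnp.exp(delta[..., None] * A)                # (L, D, N)
B_bar = delta[..., None] * B[:, None, :]             # (L, D, N)

h = jnp.zeros((D, N))
ys = []
for t in range(L):
    h = A_bar[t] * h + B_bar[t] * x[t][:, None]      # (D, N)
    ys.append(h @ C[t])                              # (D,)
y = jnp.stack(ys)                                    # (L, D)
```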
The authors recognize that a core limitation on the use of SSMs is computational efficiency, which is why prior derivatives were restricted to LTI (non-selective) models, most commonly computed as global convolutions, as in S4.
The authors rely on three classical techniques (kernel fusion, parallel scan, and recomputation) to address the sequential nature of recurrence and the large memory usage of SSMs.
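Parallel scan and recomputation are sketched above; as a rough, framework-level analogy for kernel fusion (not the authors' hand-written CUDA kernel), jax.jit lets XLA fuse a chain of elementwise operations so that intermediates need not be written to HBM.

```python
import jax
import jax.numpy as jnp

# Rough analogy for kernel fusion: without fusion, each elementwise op
# writes its intermediate result back to HBM; a fused kernel keeps the
# intermediates in registers/SRAM and writes only the final output.
# Here XLA fuses the chain under jit; the authors instead hand-write a
# fused scan kernel, so this is only an illustrative analogue.
def discretize_and_gate(x, delta, A):
    A_bar = jnp.exp(delta * A)   # intermediate 1 (illustrative names)
    u = jax.nn.silu(x)           # intermediate 2
    return A_bar * u             # only this result is consumed downstream

fused = jax.jit(discretize_and_gate)
x = jnp.ones((1024, 16))
delta = jnp.full((1024, 16), 0.1)
A = -jnp.ones((1024, 16))
out = fused(x, delta, A)         # intermediates need not be materialized in HBM
```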