State-space models (SSMs) and transformers are widely used in language modeling, but they are restricted to a lower computational complexity class than recurrent neural networks (RNNs), which limits their expressivity.
RNNs, in turn, cannot be parallelized during training, resulting in a trade-off between parallelization and expressivity.
A new approach proposes implicit SSMs that iterate a transformation until convergence to a fixed point, thereby implementing the non-linear state transitions of RNNs.
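As a rough, hypothetical sketch (not the paper's actual architecture), the idea can be pictured as solving for a token's state h that satisfies a non-linear update h = tanh(A h + B x) by plain iteration; the matrices, the tanh non-linearity, and the function name below are illustrative placeholders.

```python
import numpy as np

def implicit_state(x, h_prev, A, B, tol=1e-4, max_iters=100):
    """Find the fixed point h = tanh(A @ h + B @ x) by naive iteration.

    x, h_prev : current input and previous token's state (used as a warm start)
    A, B      : illustrative state-transition and input matrices
    """
    h = h_prev
    for _ in range(max_iters):
        h_new = np.tanh(A @ h + B @ x)       # non-linear state transition
        if np.linalg.norm(h_new - h) < tol:  # approximate fixed point reached
            return h_new
        h = h_new
    return h  # fall back to the last iterate if the tolerance was not met
```

In this picture, stopping after a single iteration resembles an ordinary explicit update, while iterating to convergence realizes the non-linear recurrence of an RNN.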
Approximate fixed-point convergence is found to be sufficient, enabling a scalable training curriculum that retains partial parallelization.
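One plausible form of such a curriculum, assumed here purely for illustration rather than taken from the paper, is to cap the number of fixed-point iterations early in training and raise the cap over time, so that most of training stays close to a cheap, parallelizable pass.

```python
def iteration_budget(step, total_steps, min_iters=1, max_iters=32):
    """Hypothetical schedule: fixed-point iterations allowed at a training step."""
    frac = min(step / max(total_steps, 1), 1.0)
    return round(min_iters + frac * (max_iters - min_iters))

# Early steps permit a single iteration (an explicit-style, parallel update);
# later steps allow deeper iteration toward the fixed point.
assert iteration_budget(0, 10_000) == 1
assert iteration_budget(10_000, 10_000) == 32
```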
The implicit SSMs exhibit superior state-tracking capabilities on regular languages compared to transformers and standard (explicit) SSMs.
Implicit SSMs are scaled to natural language reasoning tasks and to the pretraining of large-scale language models with up to 1.3B parameters on 207B tokens, the largest implicit model trained to date.
The implicit models outperform their explicit counterparts on standard benchmarks.
Code for the implicit language models is available on GitHub.