Recurrent models like state space models and linear attention have gained popularity due to their linear complexity in sequence length.
Although these models can in principle process sequences of arbitrary length, they often struggle to generalize beyond the context lengths seen during training, i.e., they fail to length generalize.
Analysis supports the unexplored states hypothesis: models fail to length generalize when training exposes them to only a limited subset of the attainable states, i.e., the states that would be reached if the recurrence were run on longer sequences.
Simple training interventions, such as initializing the recurrent state with Gaussian noise or with the final state of a different input sequence, have been shown to enable length generalization to significantly longer sequences, offering an efficient way to improve the long-context performance of recurrent models.
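The interventions above can be sketched on a toy linear recurrence. This is an illustrative sketch, not the paper's implementation: the recurrence, the function names, and the strategy labels (`"zeros"`, `"noise"`, `"state_passing"`) are assumptions chosen for clarity.

```python
import numpy as np

def run_linear_rnn(x, A, B, h0):
    """Run a simple linear recurrence h_t = A @ h_{t-1} + B @ x_t
    and return the stacked sequence of hidden states."""
    h = h0
    states = []
    for t in range(x.shape[0]):
        h = A @ h + B @ x[t]
        states.append(h)
    return np.stack(states)

def initial_state(strategy, d_state, rng, prev_final_state=None, noise_scale=1.0):
    """Choose the training-time initial state.

    'zeros' is the standard choice; 'noise' and 'state_passing' sketch the
    interventions described above (labels here are illustrative)."""
    if strategy == "zeros":
        return np.zeros(d_state)
    if strategy == "noise":
        # Initialize with Gaussian noise so training visits a wider
        # region of the state space than all-zeros initialization.
        return noise_scale * rng.standard_normal(d_state)
    if strategy == "state_passing":
        # Reuse the final state produced by a different input sequence,
        # exposing the model to states reachable on longer inputs.
        if prev_final_state is None:
            raise ValueError("state_passing requires prev_final_state")
        return prev_final_state
    raise ValueError(f"unknown strategy: {strategy}")

# Usage: a short sequence under each initialization strategy.
rng = np.random.default_rng(0)
d_state, d_in, seq_len = 4, 3, 8
A = 0.9 * np.eye(d_state)          # stable recurrence matrix
Bm = rng.standard_normal((d_state, d_in))
x = rng.standard_normal((seq_len, d_in))

h0_zeros = initial_state("zeros", d_state, rng)
h0_noise = initial_state("noise", d_state, rng)
states_a = run_linear_rnn(x, A, Bm, h0_zeros)
# Pass the final state of one sequence as the initial state of another.
h0_passed = initial_state("state_passing", d_state, rng,
                          prev_final_state=states_a[-1])
states_b = run_linear_rnn(x, A, Bm, h0_passed)
```

The only change relative to standard training is where `h0` comes from, which is why these interventions are cheap: the recurrence itself is untouched.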