A recent trend in large language models is the development of recurrent, sub-quadratic architectures that make long-context processing more efficient.
Experiments show that, despite being trained on extended contexts, current models of this kind underutilize long inputs because of the constraints of their fixed-size recurrent memory.
A chunk-based inference method that focuses computation on the relevant parts of the input improves performance significantly across a variety of long-context tasks (a simplified sketch appears below).
This approach not only mitigates recurrent-memory failures but also achieves state-of-the-art results on challenging benchmarks, raising the question of whether recurrent models genuinely exploit long-range dependencies.
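
To make the idea concrete, below is a minimal, hypothetical sketch of chunk-based inference with a HuggingFace-style causal LM. The chunk size, the negative-log-likelihood relevance score, and the prompt layout are illustrative assumptions rather than the exact procedure described above; the checkpoint named in the usage comment is only an example.

```python
# Illustrative sketch of chunk-based inference for a recurrent LM.
# Chunk size, scoring rule, and prompt layout are assumptions, not the paper's recipe.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


def chunked_generate(model, tokenizer, context: str, query: str,
                     chunk_tokens: int = 2048, max_new_tokens: int = 64) -> str:
    """Split a long context into chunks, score each chunk against the query,
    and decode only from the most relevant chunk, keeping the recurrent
    state within the length regime the model was trained on."""
    ctx_ids = tokenizer(context, return_tensors="pt").input_ids[0]
    query_ids = tokenizer(query, return_tensors="pt").input_ids[0]
    chunks = [ctx_ids[i:i + chunk_tokens] for i in range(0, len(ctx_ids), chunk_tokens)]

    best_chunk, best_score = None, float("inf")
    with torch.no_grad():
        for chunk in chunks:
            # Relevance proxy: mean negative log-likelihood of the query given
            # the chunk (lower = the chunk better "explains" the query).
            # This is one plausible criterion, assumed here for illustration.
            ids = torch.cat([chunk, query_ids]).unsqueeze(0)
            logits = model(ids).logits[0, :-1]          # predictions for positions 1..T-1
            targets = ids[0, 1:]                        # tokens they should predict
            nll = torch.nn.functional.cross_entropy(logits, targets, reduction="none")
            score = nll[-len(query_ids):].mean().item() # NLL restricted to query tokens
            if score < best_score:
                best_chunk, best_score = chunk, score

    # Decode only from the selected chunk plus the query.
    prompt = torch.cat([best_chunk, query_ids]).unsqueeze(0)
    out = model.generate(prompt, max_new_tokens=max_new_tokens)
    return tokenizer.decode(out[0, prompt.shape[1]:], skip_special_tokens=True)


# Usage (any causal LM checkpoint works for the sketch; a recurrent model,
# e.g. a Mamba-style checkpoint, is the intended target):
# tok = AutoTokenizer.from_pretrained("state-spaces/mamba-130m-hf")
# lm = AutoModelForCausalLM.from_pretrained("state-spaces/mamba-130m-hf")
# print(chunked_generate(lm, tok, long_document, "Question: ...\nAnswer:"))
```

Processing each chunk independently keeps the recurrent state within the capacity it was trained for, which is what mitigates the memory-overflow failures described above; only the selection step sees the full input, and only through per-chunk scores.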