Efficiently handling long contexts in transformer-based language models while maintaining low perplexity is an active area of research.
A new approach, CacheFormer, is proposed to tackle this problem: it divides a long context into small segments and attends over them in compressed form. When a segment receives high segment-level attention at the compressed level, CacheFormer retrieves that segment and its nearby segments in uncompressed form.
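As a rough illustration of this retrieval step, the following PyTorch sketch compresses each segment by mean pooling, computes segment-level attention against a query vector, and then fetches the most highly attended segments together with their neighbors in uncompressed form. The function name `segment_cache_retrieval`, the parameters `segment_len`, `top_k`, and `neighbors`, and the mean-pooling compressor are illustrative assumptions, not the paper's exact design.

```python
# Minimal sketch of compressed segment-level attention with uncompressed
# retrieval of highly attended segments and their neighbors.
# Hyperparameters and the pooling-based compressor are assumptions for
# illustration only, not CacheFormer's actual layers.
import torch
import torch.nn.functional as F

def segment_cache_retrieval(hidden, query, segment_len=32, top_k=4, neighbors=1):
    """hidden: (seq_len, d_model) long-context representations.
    query:  (d_model,) current query vector.
    Returns the uncompressed tokens of the top-k highly attended segments
    plus their neighboring segments, and the retrieved segment indices."""
    seq_len, d_model = hidden.shape
    n_seg = seq_len // segment_len
    segments = hidden[: n_seg * segment_len].view(n_seg, segment_len, d_model)

    # Compressed segment representations (one vector per segment).
    compressed = segments.mean(dim=1)                  # (n_seg, d_model)

    # Segment-level attention computed at the compressed level.
    scores = compressed @ query / d_model ** 0.5       # (n_seg,)
    attn = F.softmax(scores, dim=-1)

    # Segments that receive the highest attention.
    top = torch.topk(attn, k=min(top_k, n_seg)).indices

    # Expand each hit to include nearby segments as well.
    wanted = set()
    for idx in top.tolist():
        for off in range(-neighbors, neighbors + 1):
            j = idx + off
            if 0 <= j < n_seg:
                wanted.add(j)
    keep = sorted(wanted)

    # Retrieved segments in uncompressed form, ready for full attention.
    return segments[keep].reshape(-1, d_model), keep

if __name__ == "__main__":
    hidden = torch.randn(256, 64)   # toy long context
    query = torch.randn(64)
    retrieved, idx = segment_cache_retrieval(hidden, query)
    print(retrieved.shape, idx)
```

In this sketch, the uncompressed tokens returned by the function would feed a subsequent full-attention step, while the rest of the long context is only ever seen in compressed form.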
CacheFormer outperforms existing state-of-the-art architectures, achieving an average perplexity improvement of 8.5% over models of similar size.