Large Language Models (LLMs) face computational challenges due to the quadratic complexity of self-attention during the pre-filling phase.
Existing methods rely on dynamic pattern matching and block-sparse low-level implementations, but they fail to capture global contexts.
AnchorAttention is a dynamic sparse attention mechanism that efficiently identifies critical attention regions at a finer stripe granularity while adapting to global contextual information.
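To convey the general flavor of stripe-granular dynamic sparse attention, the sketch below selects, for each query block, the most relevant key stripes using cheap pooled proxy scores and computes exact attention only over those stripes. This is a simplified illustration, not the paper's AnchorAttention implementation: the function name `stripe_sparse_attention`, the mean-pooled proxy scoring, and all parameters are assumptions, and causal masking and batching are omitted for brevity.

```python
# Hypothetical sketch of stripe-granular sparse attention (not the paper's code).
import torch
import torch.nn.functional as F

def stripe_sparse_attention(q, k, v, stripe=16, top_k=8):
    """q, k, v: (seq_len, head_dim) tensors; seq_len assumed divisible by `stripe`."""
    n, d = q.shape
    n_stripes = n // stripe

    # 1. Cheap proxy scores: mean-pooled query blocks vs. mean-pooled key stripes.
    q_blocks = q.view(n_stripes, stripe, d).mean(dim=1)        # (n_stripes, d)
    k_pooled = k.view(n_stripes, stripe, d).mean(dim=1)        # (n_stripes, d)
    proxy = q_blocks @ k_pooled.T / d ** 0.5                   # (n_stripes, n_stripes)

    # 2. For each query block, keep only the top-k most relevant key stripes.
    top_idx = proxy.topk(top_k, dim=-1).indices                # (n_stripes, top_k)

    out = torch.empty_like(q)
    for i in range(n_stripes):
        # Gather the selected key/value stripes for this query block.
        cols = (top_idx[i, :, None] * stripe + torch.arange(stripe)).reshape(-1)
        k_sel, v_sel = k[cols], v[cols]                        # (top_k * stripe, d)
        q_blk = q[i * stripe:(i + 1) * stripe]                 # (stripe, d)
        # 3. Exact attention restricted to the selected stripes.
        attn = F.softmax(q_blk @ k_sel.T / d ** 0.5, dim=-1)
        out[i * stripe:(i + 1) * stripe] = attn @ v_sel
    return out

# Example usage: 1024 tokens with 64-dimensional heads.
q = k = v = torch.randn(1024, 64)
out = stripe_sparse_attention(q, k, v)
```

The stripe-level selection is what distinguishes this style of sparsity from purely block-sparse schemes: narrow key stripes can be chosen per query block, so attention mass concentrated in thin vertical regions is not forced into coarse blocks.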
AnchorAttention achieves higher sparsity rates and significantly reduces computation time, delivering a 1.44x speedup over previous state-of-the-art methods at a text length of 128K.