Large Language Models (LLMs) face computational challenges due to the quadratic complexity of self-attention during the pre-filling phase.
Existing methods rely on dynamic pattern matching and block-sparse low-level implementations, but they fail to capture global contexts.
AnchorAttention is a dynamic sparse attention mechanism that efficiently identifies critical attention regions at a finer stripe granularity while adapting to global contextual information.
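To convey the general flavor of stripe-granular dynamic sparse attention, the sketch below selects, for each query block, the most relevant key stripes using cheap pooled proxy scores and computes exact attention only over those stripes. This is a simplified illustration, not the paper's AnchorAttention implementation: the function name `stripe_sparse_attention`, the mean-pooled proxy scoring, and all parameters are assumptions, and causal masking and batching are omitted for brevity.

```python
# Hypothetical sketch of stripe-granular sparse attention (not the paper's code).
import torch
import torch.nn.functional as F

def stripe_sparse_attention(q, k, v, stripe=16, top_k=8):
    """q, k, v: (seq_len, head_dim) tensors; seq_len assumed divisible by `stripe`."""
    n, d = q.shape
    n_stripes = n // stripe

    # 1. Cheap proxy scores: mean-pooled query blocks vs. mean-pooled key stripes.
    q_blocks = q.view(n_stripes, stripe, d).mean(dim=1)        # (n_stripes, d)
    k_pooled = k.view(n_stripes, stripe, d).mean(dim=1)        # (n_stripes, d)
    proxy = q_blocks @ k_pooled.T / d ** 0.5                   # (n_stripes, n_stripes)

    # 2. For each query block, keep only the top-k most relevant key stripes.
    top_idx = proxy.topk(top_k, dim=-1).indices                # (n_stripes, top_k)

    out = torch.empty_like(q)
    for i in range(n_stripes):
        # Gather the selected key/value stripes for this query block.
        cols = (top_idx[i, :, None] * stripe + torch.arange(stripe)).reshape(-1)
        k_sel, v_sel = k[cols], v[cols]                        # (top_k * stripe, d)
        q_blk = q[i * stripe:(i + 1) * stripe]                 # (stripe, d)
        # 3. Exact attention restricted to the selected stripes.
        attn = F.softmax(q_blk @ k_sel.T / d ** 0.5, dim=-1)
        out[i * stripe:(i + 1) * stripe] = attn @ v_sel
    return out

# Example usage: 1024 tokens with 64-dimensional heads.
q = k = v = torch.randn(1024, 64)
out = stripe_sparse_attention(q, k, v)
```

The stripe-level selection is what distinguishes this style of sparsity from purely block-sparse schemes: narrow key stripes can be chosen per query block, so attention mass concentrated in thin vertical regions is not forced into coarse blocks.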
AnchorAttention achieves higher sparsity rates and significantly reduces computation time, delivering a 1.44x speedup over previous state-of-the-art methods at a text length of 128K.