LLM decoding with large batches and long contexts is bottlenecked by loading the key-value (KV) cache from high-bandwidth memory, which increases per-token latency.
We redesign attention to maximize hardware efficiency without sacrificing parallel scalability: Grouped-Tied Attention (GTA) ties and reuses key and value states across grouped heads, reducing memory transfers.
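The sketch below illustrates the tied, grouped KV-cache idea at a single decode step. It is illustrative only: the tensor shapes, the grouping scheme (borrowed from GQA), and the use of one cached tensor as both keys and values are assumptions for exposition, not the paper's exact formulation.

```python
# Minimal decode-step sketch of a tied, grouped KV cache (illustrative;
# shapes and the way keys are formed from the tied state are assumptions).
import torch
import torch.nn.functional as F

def tied_grouped_attention_step(q, kv_cache, num_groups):
    """One decode step.
    q:        (batch, num_q_heads, head_dim)            query for the new token
    kv_cache: (batch, num_groups, seq_len, head_dim)    single tensor reused as K and V
    """
    batch, num_q_heads, head_dim = q.shape
    heads_per_group = num_q_heads // num_groups
    # Broadcast each group's tied state to all query heads in that group.
    kv = kv_cache.repeat_interleave(heads_per_group, dim=1)        # (b, H, S, d)
    scores = torch.einsum("bhd,bhsd->bhs", q, kv) / head_dim**0.5  # tied state as keys
    probs = F.softmax(scores, dim=-1)
    out = torch.einsum("bhs,bhsd->bhd", probs, kv)                 # tied state as values
    return out
```

Because a single tensor per group is cached and read instead of separate key and value tensors, the per-token memory traffic in this sketch is roughly half that of a comparable GQA configuration.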
We also introduce Grouped Latent Attention (GLA), a parallel-friendly variant of latent attention that, paired with low-level optimizations, enables fast decoding while maintaining high model quality.
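The following sketch shows one way a grouped latent cache could look at decode time, under the assumption that a compressed latent is cached per token and split into a small number of groups so each tensor-parallel rank holds only its shard rather than a full replica; the latent dimension and per-group up-projections are hypothetical names for illustration.

```python
# Illustrative sketch of grouped latent attention at decode time (assumed
# shapes and projections; not the paper's implementation).
import torch
import torch.nn.functional as F

def grouped_latent_attention_step(q, latent_cache, W_k_up, W_v_up):
    """One decode step.
    q:              (batch, num_q_heads, head_dim)
    latent_cache:   (batch, num_groups, seq_len, latent_dim)   compressed KV per group
    W_k_up, W_v_up: (num_groups, latent_dim, head_dim)          per-group up-projections
    """
    batch, num_q_heads, head_dim = q.shape
    num_groups = latent_cache.shape[1]
    heads_per_group = num_q_heads // num_groups
    # Expand this rank's cached latents into keys and values for its groups only.
    k = torch.einsum("bgsl,gld->bgsd", latent_cache, W_k_up)
    v = torch.einsum("bgsl,gld->bgsd", latent_cache, W_v_up)
    k = k.repeat_interleave(heads_per_group, dim=1)              # (b, H, S, d)
    v = v.repeat_interleave(heads_per_group, dim=1)
    scores = torch.einsum("bhd,bhsd->bhs", q, k) / head_dim**0.5
    probs = F.softmax(scores, dim=-1)
    return torch.einsum("bhs,bhsd->bhd", probs, v)
```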
Experiments show that GTA matches Grouped-Query Attention (GQA) quality with half the KV cache, and that GLA matches Multi-head Latent Attention (MLA) quality while decoding up to 2x faster in some scenarios, reducing latency and increasing throughput in online serving benchmarks.