LLM decoding with large batches and long contexts is bottlenecked by loading the key-value (KV) cache from high-bandwidth memory, which increases per-token latency.
We redesign attention to maximize hardware efficiency without sacrificing parallel scalability: Grouped-Tied Attention (GTA) ties and reuses key and value states across grouped heads, reducing memory transfers.
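The sketch below illustrates the tied, grouped KV-cache idea at a single decode step. It is illustrative only: the tensor shapes, the grouping scheme (borrowed from GQA), and the use of one cached tensor as both keys and values are assumptions for exposition, not the paper's exact formulation.

```python
# Minimal decode-step sketch of a tied, grouped KV cache (illustrative;
# shapes and the way keys are formed from the tied state are assumptions).
import torch
import torch.nn.functional as F

def tied_grouped_attention_step(q, kv_cache, num_groups):
    """One decode step.
    q:        (batch, num_q_heads, head_dim)            query for the new token
    kv_cache: (batch, num_groups, seq_len, head_dim)    single tensor reused as K and V
    """
    batch, num_q_heads, head_dim = q.shape
    heads_per_group = num_q_heads // num_groups
    # Broadcast each group's tied state to all query heads in that group.
    kv = kv_cache.repeat_interleave(heads_per_group, dim=1)        # (b, H, S, d)
    scores = torch.einsum("bhd,bhsd->bhs", q, kv) / head_dim**0.5  # tied state as keys
    probs = F.softmax(scores, dim=-1)
    out = torch.einsum("bhs,bhsd->bhd", probs, kv)                 # tied state as values
    return out
```

Because a single tensor per group is cached and read instead of separate key and value tensors, the per-token memory traffic in this sketch is roughly half that of a comparable GQA configuration.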
We also introduce Grouped Latent Attention (GLA), a parallel-friendly variant of latent attention that, paired with low-level optimizations, enables fast decoding while maintaining high model quality.
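The following sketch shows one way a grouped latent cache could look at decode time, under the assumption that a compressed latent is cached per token and split into a small number of groups so each tensor-parallel rank holds only its shard rather than a full replica; the latent dimension and per-group up-projections are hypothetical names for illustration.

```python
# Illustrative sketch of grouped latent attention at decode time (assumed
# shapes and projections; not the paper's implementation).
import torch
import torch.nn.functional as F

def grouped_latent_attention_step(q, latent_cache, W_k_up, W_v_up):
    """One decode step.
    q:              (batch, num_q_heads, head_dim)
    latent_cache:   (batch, num_groups, seq_len, latent_dim)   compressed KV per group
    W_k_up, W_v_up: (num_groups, latent_dim, head_dim)          per-group up-projections
    """
    batch, num_q_heads, head_dim = q.shape
    num_groups = latent_cache.shape[1]
    heads_per_group = num_q_heads // num_groups
    # Expand this rank's cached latents into keys and values for its groups only.
    k = torch.einsum("bgsl,gld->bgsd", latent_cache, W_k_up)
    v = torch.einsum("bgsl,gld->bgsd", latent_cache, W_v_up)
    k = k.repeat_interleave(heads_per_group, dim=1)              # (b, H, S, d)
    v = v.repeat_interleave(heads_per_group, dim=1)
    scores = torch.einsum("bhd,bhsd->bhs", q, k) / head_dim**0.5
    probs = F.softmax(scores, dim=-1)
    return torch.einsum("bhs,bhsd->bhd", probs, v)
```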
Experiments show that GTA matches Grouped-Query Attention (GQA) quality with half the KV cache, and that GLA matches Multi-head Latent Attention (MLA) quality while decoding up to 2x faster in some scenarios, reducing latency and increasing throughput in online serving benchmarks.