Researchers at Princeton University published a technical paper titled 'Hardware-Efficient Attention for Fast Decoding.'
The paper introduces Grouped-Tied Attention (GTA), which ties the key and value states into a single shared cache entry, and Grouped Latent Attention (GLA), a parallel-friendly variant of latent attention, both designed to cut memory traffic during decoding.
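To make the tied-cache idea concrete, here is a minimal sketch of a GTA-style decode step (a hypothetical simplification: the function name, tensor shapes, and the literal reuse of one tensor as both keys and values are illustrative assumptions, not the paper's implementation):

```python
import torch

def gta_decode_step(q, tied_cache, num_kv_heads):
    """One decode step with a grouped-tied-attention-style cache.

    q:          (num_q_heads, head_dim)  query for the new token
    tied_cache: (seq_len, num_kv_heads, head_dim)  single tensor reused
                as BOTH keys and values, so it holds half the state of a
                separate-K/V grouped-query (GQA) cache
    """
    num_q_heads, head_dim = q.shape
    group = num_q_heads // num_kv_heads             # query heads per KV head
    k = tied_cache.repeat_interleave(group, dim=1)  # (seq, num_q_heads, d)
    v = k                                           # tied: values reuse keys
    scores = torch.einsum("hd,shd->hs", q, k) / head_dim ** 0.5
    attn = torch.softmax(scores, dim=-1)
    return torch.einsum("hs,shd->hd", attn, v)      # (num_q_heads, head_dim)

q = torch.randn(8, 64)            # 8 query heads of dimension 64
cache = torch.randn(128, 2, 64)   # 128 cached tokens, 2 tied KV heads
print(gta_decode_step(q, cache, num_kv_heads=2).shape)  # torch.Size([8, 64])
```

Because decoding is memory-bound, loading one tied tensor instead of separate key and value tensors roughly halves the bytes read per generated token, which is where the memory savings come from.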
Experiments show that GTA matches the quality of grouped-query attention (GQA) while roughly halving KV-cache memory, and that GLA matches multi-head latent attention (MLA) in quality while sharding more naturally across parallel workers.
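The parallel-scaling claim is easiest to see in code: in a latent-attention scheme each token caches a compressed latent instead of full keys and values, and GLA's grouping lets each group (e.g., each device) hold only its own slice of that latent. A minimal sketch follows, with all names, shapes, and the per-group up-projections being illustrative assumptions rather than the paper's exact formulation:

```python
import torch

def gla_decode_step(q, latent_cache, w_k, w_v):
    """One decode step with a grouped-latent-attention-style cache.

    q:            (num_groups, heads_per_group, head_dim)
    latent_cache: (num_groups, seq_len, latent_dim_per_group) - each
                  group keeps its own latent slice, so the cache can be
                  sharded across devices without replication
    w_k, w_v:     (num_groups, latent_dim_per_group, head_dim) per-group
                  up-projections from the latent to keys and values
    """
    k = torch.einsum("gsl,gld->gsd", latent_cache, w_k)   # decompress keys
    v = torch.einsum("gsl,gld->gsd", latent_cache, w_v)   # decompress values
    scores = torch.einsum("ghd,gsd->ghs", q, k) / q.shape[-1] ** 0.5
    attn = torch.softmax(scores, dim=-1)
    return torch.einsum("ghs,gsd->ghd", attn, v)          # (g, h, head_dim)

q = torch.randn(2, 4, 64)          # 2 groups x 4 heads x dim 64
cache = torch.randn(2, 128, 256)   # each group holds a 256-dim latent slice
w_k, w_v = torch.randn(2, 256, 64), torch.randn(2, 256, 64)
print(gla_decode_step(q, cache, w_k, w_v).shape)  # torch.Size([2, 4, 64])
```

Since no group needs another group's latent slice, the cache partitions cleanly across tensor-parallel workers, which is presumably the property the scalability results exploit.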
The GLA kernel outperforms FlashMLA in speculative decoding scenarios, leading to reduced latency and increased throughput in online serving benchmarks.