Unstructured sparsity significantly improves KV cache compression for Large Language Models (LLMs), allowing sparsity levels up to 70% without impacting accuracy.
Per-token magnitude-based pruning proves highly effective for both Key and Value caches under unstructured sparsity, surpassing prior structured pruning schemes.
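To make the idea concrete, here is a minimal PyTorch sketch of per-token magnitude-based pruning; the function name, tensor shapes, and the 70% sparsity level are illustrative assumptions, not the exact implementation described here.

```python
# Hypothetical sketch: per-token magnitude pruning of a KV cache tensor with
# shape (num_heads, seq_len, head_dim). Shapes and names are assumptions.
import torch

def prune_per_token(cache: torch.Tensor, sparsity: float = 0.7) -> torch.Tensor:
    """Zero out the smallest-magnitude elements of each token's vector."""
    head_dim = cache.shape[-1]
    keep = max(1, int(round(head_dim * (1.0 - sparsity))))  # elements kept per token
    # Per token (and head), keep the `keep` largest |values| along head_dim.
    topk_idx = cache.abs().topk(keep, dim=-1).indices
    mask = torch.zeros_like(cache).scatter_(-1, topk_idx, 1.0)
    return cache * mask

# Example: prune Key and Value caches independently at 70% sparsity.
k_cache = torch.randn(32, 1024, 128)
v_cache = torch.randn(32, 1024, 128)
k_pruned, v_pruned = prune_per_token(k_cache), prune_per_token(v_cache)
```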
The Key cache benefits from the preservation of its outlier elements, while the Value cache, despite its much more uniform magnitude distribution, still responds well to simple magnitude-based pruning.
A bitmap-based sparse format and a custom attention kernel allow the KV cache to be compressed with arbitrary sparsity patterns, significantly accelerating the memory-bound operations of decode-phase computation and enabling longer context lengths and higher tokens-per-second throughput.
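As a rough illustration of the bitmap-based layout (not the custom attention kernel itself), the NumPy sketch below packs a pruned vector into a 1-bit-per-element mask plus its compacted non-zero values; the function names and the byte arithmetic in the final comment are assumptions for illustration.

```python
# Illustrative bitmap packing for one pruned token vector: 1 bit per element
# marks the non-zeros, and only the non-zero values are stored densely.
import numpy as np

def pack_bitmap(vec: np.ndarray):
    """Pack a pruned 1-D vector into (bitmap bytes, compacted non-zero values)."""
    nonzero = vec != 0
    bitmap = np.packbits(nonzero)   # 8 elements per byte
    values = vec[nonzero]           # non-zero payload, stored contiguously
    return bitmap, values

def unpack_bitmap(bitmap: np.ndarray, values: np.ndarray, length: int) -> np.ndarray:
    """Reconstruct the dense vector from its (bitmap, values) representation."""
    mask = np.unpackbits(bitmap)[:length].astype(bool)
    out = np.zeros(length, dtype=values.dtype)
    out[mask] = values
    return out

# At 70% sparsity, a 128-element fp16 vector stores ~38 values (76 bytes) plus
# a 16-byte bitmap, roughly 92 bytes versus 256 bytes dense.
```

Because the non-zero positions are arbitrary, this layout can represent any per-token sparsity pattern; presumably the custom decode kernel consumes the bitmap and compacted values directly, which is what reduces the memory traffic of the attention computation.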