Unstructured sparsity significantly improves KV cache compression for Large Language Models (LLMs), allowing sparsity levels up to 70% without impacting accuracy.
Per-token magnitude-based pruning proves highly effective for both Key and Value caches under unstructured sparsity, surpassing prior structured pruning schemes.
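To make the idea concrete, here is a minimal PyTorch sketch of per-token magnitude-based pruning; the function name, tensor shapes, and the 70% sparsity level are illustrative assumptions, not the exact implementation described here.

```python
# Hypothetical sketch: per-token magnitude pruning of a KV cache tensor with
# shape (num_heads, seq_len, head_dim). Shapes and names are assumptions.
import torch

def prune_per_token(cache: torch.Tensor, sparsity: float = 0.7) -> torch.Tensor:
    """Zero out the smallest-magnitude elements of each token's vector."""
    head_dim = cache.shape[-1]
    keep = max(1, int(round(head_dim * (1.0 - sparsity))))  # elements kept per token
    # Per token (and head), keep the `keep` largest |values| along head_dim.
    topk_idx = cache.abs().topk(keep, dim=-1).indices
    mask = torch.zeros_like(cache).scatter_(-1, topk_idx, 1.0)
    return cache * mask

# Example: prune Key and Value caches independently at 70% sparsity.
k_cache = torch.randn(32, 1024, 128)
v_cache = torch.randn(32, 1024, 128)
k_pruned, v_pruned = prune_per_token(k_cache), prune_per_token(v_cache)
```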
The Key cache benefits from the preservation of its outlier elements, while the Value cache, despite its much more uniform magnitude distribution, still responds well to simple magnitude-based pruning.
A bitmap-based sparse format and a custom attention kernel allow the KV cache to be compressed with arbitrary sparsity patterns, significantly accelerating the memory-bound operations of decode-phase computation and enabling longer context lengths and higher tokens-per-second throughput.
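As a rough illustration of the bitmap-based layout (not the custom attention kernel itself), the NumPy sketch below packs a pruned vector into a 1-bit-per-element mask plus its compacted non-zero values; the function names and the byte arithmetic in the final comment are assumptions for illustration.

```python
# Illustrative bitmap packing for one pruned token vector: 1 bit per element
# marks the non-zeros, and only the non-zero values are stored densely.
import numpy as np

def pack_bitmap(vec: np.ndarray):
    """Pack a pruned 1-D vector into (bitmap bytes, compacted non-zero values)."""
    nonzero = vec != 0
    bitmap = np.packbits(nonzero)   # 8 elements per byte
    values = vec[nonzero]           # non-zero payload, stored contiguously
    return bitmap, values

def unpack_bitmap(bitmap: np.ndarray, values: np.ndarray, length: int) -> np.ndarray:
    """Reconstruct the dense vector from its (bitmap, values) representation."""
    mask = np.unpackbits(bitmap)[:length].astype(bool)
    out = np.zeros(length, dtype=values.dtype)
    out[mask] = values
    return out

# At 70% sparsity, a 128-element fp16 vector stores ~38 values (76 bytes) plus
# a 16-byte bitmap, roughly 92 bytes versus 256 bytes dense.
```

Because the non-zero positions are arbitrary, this layout can represent any per-token sparsity pattern; presumably the custom decode kernel consumes the bitmap and compacted values directly, which is what reduces the memory traffic of the attention computation.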