Source: arXiv

Mustafar: Promoting Unstructured Sparsity for KV Cache Pruning in LLM Inference

  • Unstructured sparsity significantly improves KV cache compression for Large Language Models (LLMs), allowing sparsity levels of up to 70% without impacting accuracy.
  • Per-token magnitude-based pruning proves highly effective for both the Key and Value caches under unstructured sparsity, surpassing prior structured pruning schemes (a minimal sketch follows this list).
  • The Key cache benefits from preserving its outlier elements, while the Value cache, despite its more uniform magnitude distribution, still responds well to simple magnitude-based pruning.
  • Using a bitmap-based sparse format and a custom attention kernel, the KV cache can be compressed under arbitrary sparsity patterns, significantly accelerating the memory-bound operations of decode computation and enabling longer context lengths and higher tokens-per-second throughput (see the second sketch below).
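
As a rough illustration of the per-token magnitude-based pruning described above, here is a minimal PyTorch sketch. The function name `prune_per_token`, the 70% sparsity target, and the (num_tokens, head_dim) cache layout are illustrative assumptions, not the paper's actual implementation.

```python
import torch

def prune_per_token(cache: torch.Tensor, sparsity: float) -> torch.Tensor:
    """Zero out the smallest-magnitude entries of each token's cache vector.

    cache:    (num_tokens, head_dim) slice of a Key or Value cache (assumed layout).
    sparsity: fraction of entries to drop per token, e.g. 0.7 for 70% sparsity.
    """
    head_dim = cache.shape[-1]
    keep = int(head_dim * (1.0 - sparsity))   # entries kept per token
    # Indices of the `keep` largest-magnitude entries in each row.
    _, keep_idx = cache.abs().topk(keep, dim=-1)
    mask = torch.zeros_like(cache, dtype=torch.bool)
    mask.scatter_(-1, keep_idx, True)
    return cache * mask                        # unstructured, per-token sparsity

# Example: prune a toy 4-token, 128-dim cache slice to ~70% sparsity.
kv = torch.randn(4, 128)
pruned = prune_per_token(kv, sparsity=0.7)
```

Because the top-k is taken independently per row, each token keeps its own largest-magnitude elements, which is what lets the Key cache retain its outliers.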

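The last bullet mentions a bitmap-based sparse format; the sketch below shows the general idea of packing a pruned tensor into an occupancy bitmap plus a dense array of surviving values. This is an assumption-laden toy: PyTorch boolean tensors store one byte per element, whereas a real bitmap format (and the paper's custom attention kernel) would use one bit per element and operate on the packed data directly.

```python
import torch

def pack_bitmap(pruned: torch.Tensor):
    """Pack a mostly-zero tensor into (bitmap, values).

    bitmap: occupancy mask, one flag per element (one bit in a real format).
    values: surviving nonzero entries in row-major order.
    """
    bitmap = pruned != 0
    values = pruned[bitmap]
    return bitmap, values

def unpack_bitmap(bitmap: torch.Tensor, values: torch.Tensor) -> torch.Tensor:
    """Rebuild the dense tensor; round-trips pack_bitmap exactly."""
    dense = torch.zeros(bitmap.shape, dtype=values.dtype, device=values.device)
    dense[bitmap] = values
    return dense
```

Because the bitmap places no constraint on which elements survive, the same format serves any sparsity pattern, which is what the summary means by compression under arbitrary sparsity patterns.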