Marktechpost


NVIDIA Researchers Introduce Dynamic Memory Sparsification (DMS) for 8× KV Cache Compression in Transformer LLMs

  • NVIDIA and University of Edinburgh researchers introduce Dynamic Memory Sparsification (DMS) to compress KV caches in large language models (LLMs) for improved inference-time efficiency.
  • KV caches in Transformer-based models grow with sequence length and model width (layers, heads, and head dimension), leading to significant memory consumption and slower inference; a back-of-the-envelope sizing sketch follows this list.
  • Existing KV cache optimization techniques have downsides: they either hurt accuracy or are computationally expensive to train.
  • Dynamic Memory Sparsification (DMS) compresses KV caches with minimal training overhead; its delayed-eviction scheme keeps tokens available for a short window after they are flagged for removal, preserving context information and model accuracy.
  • DMS makes eviction decisions differentiable during training using a Gumbel-sigmoid relaxation, so the model learns which tokens to evict while those tokens can still contribute their information to attention; a minimal sketch of such a gate appears after this list.
  • DMS requires no additional parameters per attention head, making it suitable for retrofitting existing models without architectural changes.
  • Empirical results show that DMS achieves 8× KV cache compression with only minimal retraining steps, freeing memory for longer or more parallel generations and thereby improving performance on reasoning tasks.
  • Benchmark results demonstrate DMS's superior performance on reasoning-heavy tasks like math, code generation, and science question answering.
  • DMS outperformed top baselines on KV cache read efficiency and peak memory usage, showing that it scales performance without increasing inference cost.
  • DMS also performs well in non-reasoning tasks, maintaining high performance at compression ratios up to 4×.
  • Dynamic Memory Sparsification (DMS) offers a practical and scalable way to improve the inference efficiency of Transformer-based LLMs, balancing compression, accuracy, and ease of integration.
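
To make the memory growth in the second bullet concrete, here is a back-of-the-envelope sizing sketch. The configuration below (32 layers, 32 heads, head dimension 128, fp16) is an illustrative 7B-class setup, not numbers from the article:

```python
# Rough KV cache sizing: memory scales with batch size, sequence length,
# and model width (layers x heads x head_dim); the factor of 2 accounts
# for storing both keys and values at every layer.
def kv_cache_bytes(batch, seq_len, n_layers, n_heads, head_dim, dtype_bytes=2):
    return 2 * batch * seq_len * n_layers * n_heads * head_dim * dtype_bytes

# Illustrative 7B-class configuration at batch 8 and 4,096 tokens in fp16:
full = kv_cache_bytes(batch=8, seq_len=4096, n_layers=32, n_heads=32, head_dim=128)
print(f"full KV cache:         {full / 2**30:.0f} GiB")      # 16 GiB
print(f"at 8x DMS compression: {full / 8 / 2**30:.0f} GiB")  # 2 GiB
```

Under a fixed memory budget, an 8× smaller cache leaves room for correspondingly longer sequences or more parallel samples, which is where the reasoning gains come from.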
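
And a minimal PyTorch sketch of the Gumbel-sigmoid mechanism from the fifth bullet. The per-token eviction logits, window size, and masking rule here are placeholder assumptions for illustration; the paper's exact parameter-free scoring rule is not reproduced:

```python
import torch

def gumbel_sigmoid(logits, tau=1.0, hard=True):
    # Relaxed Bernoulli sample: differentiable in `logits` via
    # reparameterization; `hard` applies a straight-through estimator so
    # the forward pass yields a binary keep (1) / evict (0) decision.
    u = torch.rand_like(logits).clamp(1e-6, 1 - 1e-6)
    noise = torch.log(u) - torch.log1p(-u)        # Logistic(0, 1) noise
    soft = torch.sigmoid((logits + noise) / tau)
    if hard:
        hard_dec = (soft > 0.5).float()
        return hard_dec + (soft - soft.detach())  # straight-through gradient
    return soft

# Toy training-time usage: attention scores for evicted keys are masked,
# but keys inside a recent sliding window are always kept (delayed
# eviction), so flagged tokens remain readable before removal.
scores = torch.randn(1, 8, 128, 128)              # (batch, heads, queries, keys)
logits = torch.randn(1, 8, 1, 128)                # hypothetical per-key logits
keep = gumbel_sigmoid(logits)                     # 1 = keep, 0 = evict
window = 16                                       # hypothetical eviction delay
recent = torch.arange(128) >= 128 - window        # most recent `window` keys
keep = torch.maximum(keep, recent.float())
masked = scores + torch.log(keep.clamp_min(1e-9)) # large negative where evicted
attn = masked.softmax(dim=-1)
```

Because the gate is differentiable, the eviction policy is trained jointly with the usual language-modeling loss rather than hand-tuned, which is what keeps the retraining overhead small.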
