NVIDIA and University of Edinburgh researchers introduce Dynamic Memory Sparsification (DMS) to compress KV caches in large language models (LLMs) for improved inference-time efficiency.
KV caches in Transformer-based models grow linearly with sequence length and with model width, leading to significant memory consumption and slower inference as generations get longer.
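To make that growth concrete, here is a back-of-the-envelope calculation of KV cache size. The model configuration (32 layers, 8 KV heads, head dimension 128, fp16) is an illustrative assumption for the sketch, not a figure from the paper.

```python
def kv_cache_bytes(seq_len, batch_size=1, n_layers=32, n_kv_heads=8,
                   head_dim=128, bytes_per_elem=2):
    """KV cache size: 2 tensors (K and V) per layer,
    each of shape [batch, kv_heads, seq_len, head_dim]."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch_size * bytes_per_elem

# Hypothetical 8B-class model in fp16; cache size grows linearly with length.
for seq_len in (8_192, 32_768, 131_072):
    print(f"{seq_len:>7} tokens -> {kv_cache_bytes(seq_len) / 2**30:.1f} GiB")
```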
Existing KV cache optimizations involve trade-offs: training-free eviction heuristics tend to hurt accuracy, while approaches that learn to compress the cache require expensive additional training.
Dynamic Memory Sparsification (DMS) compresses KV caches with minimal training overhead by using delayed eviction: tokens marked for removal remain available for a short sliding window before being discarded, which preserves context information and model accuracy.
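As a rough illustration of delayed eviction, the sketch below keeps tokens flagged for eviction attendable until they slide out of a window of recent positions; the window size and data layout are assumptions made for clarity, not the paper's implementation.

```python
class DelayedEvictionCache:
    """Sketch of delayed eviction: a token flagged for eviction is not dropped
    immediately; it stays attendable until it falls outside a window of the
    most recent `window` positions, and only then is removed from the cache."""

    def __init__(self, window: int = 128):
        self.window = window
        self.entries = []   # list of (position, key, value, evict_flag)
        self.next_pos = 0

    def append(self, key, value, evict_flag: bool):
        self.entries.append((self.next_pos, key, value, evict_flag))
        self.next_pos += 1
        cutoff = self.next_pos - self.window
        # Drop only tokens that were flagged AND have aged out of the window;
        # unflagged tokens are retained indefinitely.
        self.entries = [e for e in self.entries if not (e[3] and e[0] < cutoff)]

    def visible_kv(self):
        """Keys and values the current query may still attend to."""
        return [(k, v) for _, k, v, _ in self.entries]
```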
During training, DMS makes eviction decisions differentiable through a Gumbel-sigmoid-based mechanism, so the model learns which tokens to keep while tokens slated for eviction can still contribute their information before being dropped.
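Below is a minimal sketch of such a gate using a Gumbel-sigmoid (binary-concrete) relaxation with a straight-through estimator; the temperature and the way eviction logits are produced are illustrative assumptions rather than the paper's exact formulation.

```python
import torch

def gumbel_sigmoid(logits: torch.Tensor, temperature: float = 0.5,
                   hard: bool = True) -> torch.Tensor:
    """Differentiable relaxation of a binary keep/evict decision.

    Adds logistic noise to the logits, squashes with a temperature-scaled
    sigmoid, and optionally applies a straight-through estimator so the
    forward pass is a hard 0/1 decision while gradients flow through the
    soft relaxation.
    """
    u = torch.rand_like(logits).clamp(1e-6, 1 - 1e-6)
    noise = torch.log(u) - torch.log1p(-u)       # Logistic(0, 1) sample
    soft = torch.sigmoid((logits + noise) / temperature)
    if hard:
        hard_decision = (soft > 0.5).float()
        # Straight-through: forward uses hard 0/1, backward uses the soft gradient.
        return hard_decision + (soft - soft.detach())
    return soft

# Example: per-token eviction decisions for a sequence of 6 tokens.
eviction_logits = torch.randn(6, requires_grad=True)
decisions = gumbel_sigmoid(eviction_logits)      # 1.0 = evict, 0.0 = keep
decisions.sum().backward()                       # gradients reach the logits
print(decisions, eviction_logits.grad)
```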
DMS requires no additional parameters per attention head, making it suitable for retrofitting existing models without architectural changes.
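One way to realize a parameter-free eviction scorer, shown purely as an assumption here (the paper's exact reuse scheme may differ), is to repurpose an existing channel of each key vector as the eviction logit, so a pretrained checkpoint can be retrofitted without adding weights.

```python
import torch

def eviction_logits_from_keys(keys: torch.Tensor) -> torch.Tensor:
    """Illustrative, parameter-free eviction scorer.

    Assumption for this sketch: the eviction logit for each token is read off
    an existing channel of its key vector, so no new weights are added per
    attention head.

    keys:    [batch, heads, seq_len, head_dim]
    returns: [batch, heads, seq_len] eviction logits
    """
    return keys[..., 0]   # repurpose the first key channel as the keep/evict logit

# Usage with the Gumbel-sigmoid gate from the previous sketch:
#   logits = eviction_logits_from_keys(k)     # no extra parameters
#   decisions = gumbel_sigmoid(logits)        # 1.0 = evict, 0.0 = keep
```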
Empirical results show that DMS achieves 8× KV cache compression after only a small number of retraining steps, while improving model performance on reasoning tasks.
Benchmark results demonstrate DMS's superior performance on reasoning-heavy tasks like math, code generation, and science question answering.
At matched KV cache read budgets and peak memory usage, DMS outperformed strong baselines, showing that it scales performance without increasing inference costs.
DMS also holds up on non-reasoning tasks, maintaining accuracy at compression ratios of up to 4×.
Overall, Dynamic Memory Sparsification offers a practical and scalable way to improve the inference efficiency of Transformer-based LLMs, balancing compression, accuracy, and ease of integration.