A new method named KVmix is proposed for mixed-precision quantization of the Key-Value (KV) Cache to address the high memory demands of Large Language Model (LLM) inference.
KVmix utilizes gradient-based importance analysis to allocate layer-specific bit-widths, prioritizing important layers while aggressively quantizing less critical ones.
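A minimal sketch of this layer-wise allocation idea is given below, assuming per-layer importance scores derived from calibration-pass gradients; the names (kv_grad_norms, high_bit_fraction) and the top-k assignment rule are illustrative assumptions, not the paper's exact procedure.

```python
# Sketch: assign higher bit-widths to layers whose KV projections are
# more gradient-sensitive. Inputs/thresholds are hypothetical.
from typing import List

def allocate_bit_widths(kv_grad_norms: List[float],
                        high_bits: int = 8,
                        low_bits: int = 2,
                        high_bit_fraction: float = 0.25) -> List[int]:
    """Give `high_bits` to the most gradient-sensitive layers, `low_bits` to the rest."""
    num_layers = len(kv_grad_norms)
    num_high = max(1, round(high_bit_fraction * num_layers))
    # Rank layers by importance (larger gradient norm = more important).
    ranked = sorted(range(num_layers), key=lambda i: kv_grad_norms[i], reverse=True)
    bits = [low_bits] * num_layers
    for layer_idx in ranked[:num_high]:
        bits[layer_idx] = high_bits
    return bits

# Example: 8 layers with importance scores from a calibration pass.
print(allocate_bit_widths([0.9, 0.1, 0.4, 0.05, 0.7, 0.2, 0.3, 0.6]))
# -> [8, 2, 2, 2, 8, 2, 2, 8]
```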
It introduces a dynamic long-context optimization strategy to balance accuracy and efficiency by keeping full-precision KV pairs for recent pivotal tokens and compressing older ones.
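The recent-window idea can be illustrated with the following sketch, which keeps the last few tokens in full precision and fake-quantizes older ones with a simple per-token uniform quantizer; the window size, helper names, and quantizer choice are assumptions for illustration rather than the paper's exact scheme.

```python
# Sketch: full precision for recent tokens, low-bit quantization for older ones.
import torch

def quantize_dequantize(x: torch.Tensor, bits: int) -> torch.Tensor:
    """Fake-quantize `x` per token along the last dimension to `bits` bits."""
    qmax = 2 ** bits - 1
    x_min = x.amin(dim=-1, keepdim=True)
    x_max = x.amax(dim=-1, keepdim=True)
    scale = (x_max - x_min).clamp(min=1e-6) / qmax
    q = ((x - x_min) / scale).round().clamp(0, qmax)
    return q * scale + x_min

def compress_kv(keys: torch.Tensor, values: torch.Tensor,
                bits: int, recent_window: int = 128):
    """Keep the last `recent_window` tokens full precision; quantize older tokens."""
    seq_len = keys.shape[-2]
    if seq_len <= recent_window:
        return keys, values
    old_k, new_k = keys[..., :-recent_window, :], keys[..., -recent_window:, :]
    old_v, new_v = values[..., :-recent_window, :], values[..., -recent_window:, :]
    old_k = quantize_dequantize(old_k, bits)
    old_v = quantize_dequantize(old_v, bits)
    return (torch.cat([old_k, new_k], dim=-2),
            torch.cat([old_v, new_v], dim=-2))
```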
KVmix achieves near-lossless inference accuracy on LLMs such as Llama and Mistral while substantially compressing the KV Cache and increasing inference throughput.