A new method named KVmix is proposed for mixed-precision quantization of the Key-Value (KV) Cache to address the high memory demands of Large Language Model (LLM) inference.
KVmix utilizes gradient-based importance analysis to allocate layer-specific bit-widths, prioritizing important layers while aggressively quantizing less critical ones.
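A minimal sketch of this layer-wise allocation idea is given below, assuming per-layer importance scores derived from calibration-pass gradients; the names (kv_grad_norms, high_bit_fraction) and the top-k assignment rule are illustrative assumptions, not the paper's exact procedure.

```python
# Sketch: assign higher bit-widths to layers whose KV projections are
# more gradient-sensitive. Inputs/thresholds are hypothetical.
from typing import List

def allocate_bit_widths(kv_grad_norms: List[float],
                        high_bits: int = 8,
                        low_bits: int = 2,
                        high_bit_fraction: float = 0.25) -> List[int]:
    """Give `high_bits` to the most gradient-sensitive layers, `low_bits` to the rest."""
    num_layers = len(kv_grad_norms)
    num_high = max(1, round(high_bit_fraction * num_layers))
    # Rank layers by importance (larger gradient norm = more important).
    ranked = sorted(range(num_layers), key=lambda i: kv_grad_norms[i], reverse=True)
    bits = [low_bits] * num_layers
    for layer_idx in ranked[:num_high]:
        bits[layer_idx] = high_bits
    return bits

# Example: 8 layers with importance scores from a calibration pass.
print(allocate_bit_widths([0.9, 0.1, 0.4, 0.05, 0.7, 0.2, 0.3, 0.6]))
# -> [8, 2, 2, 2, 8, 2, 2, 8]
```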
It introduces a dynamic long-context optimization strategy to balance accuracy and efficiency by keeping full-precision KV pairs for recent pivotal tokens and compressing older ones.
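The recent-window idea can be illustrated with the following sketch, which keeps the last few tokens in full precision and fake-quantizes older ones with a simple per-token uniform quantizer; the window size, helper names, and quantizer choice are assumptions for illustration rather than the paper's exact scheme.

```python
# Sketch: full precision for recent tokens, low-bit quantization for older ones.
import torch

def quantize_dequantize(x: torch.Tensor, bits: int) -> torch.Tensor:
    """Fake-quantize `x` per token along the last dimension to `bits` bits."""
    qmax = 2 ** bits - 1
    x_min = x.amin(dim=-1, keepdim=True)
    x_max = x.amax(dim=-1, keepdim=True)
    scale = (x_max - x_min).clamp(min=1e-6) / qmax
    q = ((x - x_min) / scale).round().clamp(0, qmax)
    return q * scale + x_min

def compress_kv(keys: torch.Tensor, values: torch.Tensor,
                bits: int, recent_window: int = 128):
    """Keep the last `recent_window` tokens full precision; quantize older tokens."""
    seq_len = keys.shape[-2]
    if seq_len <= recent_window:
        return keys, values
    old_k, new_k = keys[..., :-recent_window, :], keys[..., -recent_window:, :]
    old_v, new_v = values[..., :-recent_window, :], values[..., -recent_window:, :]
    old_k = quantize_dequantize(old_k, bits)
    old_v = quantize_dequantize(old_v, bits)
    return (torch.cat([old_k, new_k], dim=-2),
            torch.cat([old_v, new_v], dim=-2))
```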
KVmix achieves near-lossless inference accuracy on LLMs such as Llama and Mistral while substantially compressing the KV Cache and increasing inference throughput.