The NQKV algorithm aims to reduce the memory consumed by the Key-Value (KV) cache during inference in Large Language Models (LLMs).
It quantizes the KV cache to low bit widths block by block, exploiting the observation that the elements within each block of the cache approximately follow a normal distribution.
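As a rough illustration of this idea, the sketch below performs block-wise quantization with a codebook placed at quantiles of a standard normal distribution and a per-block absmax scale. This is not the authors' implementation; the block size (64), bit width (4), and the specific codebook construction are assumptions made for the example.

```python
# Illustrative sketch of block-wise, normal-distribution-aware quantization.
# Block size, bit width, and codebook construction are assumptions, not NQKV's exact scheme.
import numpy as np
from scipy.stats import norm


def make_normal_codebook(bits: int = 4) -> np.ndarray:
    """Codebook of 2**bits levels placed at equally spaced quantiles of N(0, 1),
    rescaled so the levels span [-1, 1]."""
    n_levels = 2 ** bits
    probs = (np.arange(n_levels) + 0.5) / n_levels      # mid-quantile positions
    levels = norm.ppf(probs)                            # standard-normal quantiles
    return levels / np.abs(levels).max()                # normalize to [-1, 1]


def quantize_blocks(x: np.ndarray, codebook: np.ndarray, block_size: int = 64):
    """Quantize a flat tensor block by block: one absmax scale per block,
    plus a codebook index for every element."""
    x = x.reshape(-1, block_size)
    scales = np.abs(x).max(axis=1, keepdims=True) + 1e-12   # per-block scale
    normalized = x / scales                                  # values now lie in [-1, 1]
    idx = np.abs(normalized[..., None] - codebook).argmin(axis=-1)
    return idx.astype(np.uint8), scales


def dequantize_blocks(idx, scales, codebook, orig_shape):
    """Recover an approximation of the original tensor from indices and scales."""
    return (codebook[idx] * scales).reshape(orig_shape)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    kv = rng.normal(size=(4, 128)).astype(np.float32)    # stand-in for a KV cache slice
    cb = make_normal_codebook(bits=4)
    idx, scales = quantize_blocks(kv.ravel(), cb)
    kv_hat = dequantize_blocks(idx, scales, cb, kv.shape)
    print("max abs error:", np.abs(kv - kv_hat).max())
```

Because the codebook levels are denser near zero, where normally distributed values concentrate, this style of quantization tends to lose less information than uniform levels at the same bit width.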
NQKV allows the OPT model to run with a larger batch size or a longer context length, improving throughput by 9.3x without significant impact on model output quality.
By shrinking the memory footprint of the KV cache, NQKV addresses a key bottleneck in LLM inference and makes deployment more efficient.