The NQKV algorithm aims to reduce the memory consumed by the Key-Value (KV) cache during inference in Large Language Models (LLMs).
It quantizes the KV cache to low bit widths block by block, exploiting the observation that the elements within each block of the cache approximately follow a normal distribution.
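As a rough illustration of this idea, the sketch below performs block-wise quantization with a codebook placed at quantiles of a standard normal distribution and a per-block absmax scale. This is not the authors' implementation; the block size (64), bit width (4), and the specific codebook construction are assumptions made for the example.

```python
# Illustrative sketch of block-wise, normal-distribution-aware quantization.
# Block size, bit width, and codebook construction are assumptions, not NQKV's exact scheme.
import numpy as np
from scipy.stats import norm


def make_normal_codebook(bits: int = 4) -> np.ndarray:
    """Codebook of 2**bits levels placed at equally spaced quantiles of N(0, 1),
    rescaled so the levels span [-1, 1]."""
    n_levels = 2 ** bits
    probs = (np.arange(n_levels) + 0.5) / n_levels      # mid-quantile positions
    levels = norm.ppf(probs)                            # standard-normal quantiles
    return levels / np.abs(levels).max()                # normalize to [-1, 1]


def quantize_blocks(x: np.ndarray, codebook: np.ndarray, block_size: int = 64):
    """Quantize a flat tensor block by block: one absmax scale per block,
    plus a codebook index for every element."""
    x = x.reshape(-1, block_size)
    scales = np.abs(x).max(axis=1, keepdims=True) + 1e-12   # per-block scale
    normalized = x / scales                                  # values now lie in [-1, 1]
    idx = np.abs(normalized[..., None] - codebook).argmin(axis=-1)
    return idx.astype(np.uint8), scales


def dequantize_blocks(idx, scales, codebook, orig_shape):
    """Recover an approximation of the original tensor from indices and scales."""
    return (codebook[idx] * scales).reshape(orig_shape)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    kv = rng.normal(size=(4, 128)).astype(np.float32)    # stand-in for a KV cache slice
    cb = make_normal_codebook(bits=4)
    idx, scales = quantize_blocks(kv.ravel(), cb)
    kv_hat = dequantize_blocks(idx, scales, cb, kv.shape)
    print("max abs error:", np.abs(kv - kv_hat).max())
```

Because the codebook levels are denser near zero, where normally distributed values concentrate, this style of quantization tends to lose less information than uniform levels at the same bit width.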
NQKV allows the OPT model to run with a larger batch size or a longer context length, improving throughput by 9.3x without significant impact on model output quality.
By shrinking the memory footprint of the KV cache, NQKV addresses a key bottleneck in LLM inference and makes deployment more efficient.