A new method called PQCache is proposed to address the memory bottleneck in Large Language Model (LLM) inference.
PQCache employs Product Quantization (PQ) to manage the Key-Value Cache (KVCache) in LLMs, maintaining model quality while ensuring low serving latency.
PQCache applies PQ to tokens' keys during the prefilling phase and uses PQ codes and centroids to fetch key-value pairs during the autoregressive decoding phase.
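The idea described above can be sketched with a minimal Product Quantization example: keys are split into sub-vectors, each sub-space is quantized with k-means, and approximate query-key scores are then computed from per-subspace lookup tables over the centroids rather than the full keys. This is a simplified illustration, not PQCache's actual implementation; all function names, shapes, and hyperparameters (`m` sub-spaces, `k` centroids) are assumptions for the sketch.

```python
import numpy as np

def fit_pq(keys, m=4, k=16, iters=10, seed=0):
    """Fit PQ on cached keys: split each key into m sub-vectors and
    run a small k-means in each sub-space.  Returns the per-subspace
    centroids and each key's compact code (one centroid id per sub-space).
    Hypothetical helper for illustration only."""
    rng = np.random.default_rng(seed)
    n, d = keys.shape
    ds = d // m                              # sub-vector dimension
    subs = keys.reshape(n, m, ds)
    centroids = np.empty((m, k, ds))
    codes = np.empty((n, m), dtype=np.int64)
    for j in range(m):
        x = subs[:, j, :]
        c = x[rng.choice(n, k, replace=False)].copy()  # init from data points
        for _ in range(iters):
            # assign each sub-vector to its nearest centroid
            dist = ((x[:, None, :] - c[None, :, :]) ** 2).sum(-1)
            assign = dist.argmin(1)
            for ci in range(k):              # recompute centroid means
                pts = x[assign == ci]
                if len(pts):
                    c[ci] = pts.mean(0)
        codes[:, j] = assign
        centroids[j] = c
    return centroids, codes

def approx_scores(query, centroids, codes):
    """Approximate q·k for all cached keys using only PQ codes and
    centroids: precompute a (m, k) table of sub-space dot products,
    then sum table lookups per key (asymmetric distance computation)."""
    m, k, ds = centroids.shape
    q = query.reshape(m, ds)
    table = np.einsum('md,mkd->mk', q, centroids)   # dot with every centroid
    return table[np.arange(m), codes].sum(-1)       # gather + sum sub-spaces
```

During decoding, the approximate scores can be used to select the most relevant cached key-value pairs to fetch, without materializing the full keys in GPU memory.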
Extensive experiments show that PQCache improves both model effectiveness and efficiency, achieving a 4.60% score improvement over existing methods.