A new method called PQCache is proposed to address the memory bottleneck in Large Language Model (LLM) inference.
PQCache employs Product Quantization (PQ) to manage the Key-Value Cache (KVCache) in LLMs, maintaining model quality while ensuring low serving latency.
PQCache applies PQ to tokens' keys during the prefilling phase and uses PQ codes and centroids to fetch key-value pairs during the autoregressive decoding phase.
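The idea described above can be sketched with a minimal Product Quantization example: keys are split into sub-vectors, each sub-space is quantized with k-means, and approximate query-key scores are then computed from per-subspace lookup tables over the centroids rather than the full keys. This is a simplified illustration, not PQCache's actual implementation; all function names, shapes, and hyperparameters (`m` sub-spaces, `k` centroids) are assumptions for the sketch.

```python
import numpy as np

def fit_pq(keys, m=4, k=16, iters=10, seed=0):
    """Fit PQ on cached keys: split each key into m sub-vectors and
    run a small k-means in each sub-space.  Returns the per-subspace
    centroids and each key's compact code (one centroid id per sub-space).
    Hypothetical helper for illustration only."""
    rng = np.random.default_rng(seed)
    n, d = keys.shape
    ds = d // m                              # sub-vector dimension
    subs = keys.reshape(n, m, ds)
    centroids = np.empty((m, k, ds))
    codes = np.empty((n, m), dtype=np.int64)
    for j in range(m):
        x = subs[:, j, :]
        c = x[rng.choice(n, k, replace=False)].copy()  # init from data points
        for _ in range(iters):
            # assign each sub-vector to its nearest centroid
            dist = ((x[:, None, :] - c[None, :, :]) ** 2).sum(-1)
            assign = dist.argmin(1)
            for ci in range(k):              # recompute centroid means
                pts = x[assign == ci]
                if len(pts):
                    c[ci] = pts.mean(0)
        codes[:, j] = assign
        centroids[j] = c
    return centroids, codes

def approx_scores(query, centroids, codes):
    """Approximate q·k for all cached keys using only PQ codes and
    centroids: precompute a (m, k) table of sub-space dot products,
    then sum table lookups per key (asymmetric distance computation)."""
    m, k, ds = centroids.shape
    q = query.reshape(m, ds)
    table = np.einsum('md,mkd->mk', q, centroids)   # dot with every centroid
    return table[np.arange(m), codes].sum(-1)       # gather + sum sub-spaces
```

During decoding, the approximate scores can be used to select the most relevant cached key-value pairs to fetch, without materializing the full keys in GPU memory.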
Extensive experiments show that PQCache improves both model effectiveness and efficiency, achieving a 4.60% score improvement over existing methods.