Source: Arxiv

PQCache: Product Quantization-based KVCache for Long Context LLM Inference

  • PQCache is a new method proposed to address the memory bottleneck of the Key-Value Cache (KVCache) in Large Language Model (LLM) inference.
  • It employs Product Quantization (PQ) to manage the KVCache, preserving model quality while keeping serving latency low.
  • During the prefilling phase, PQCache applies PQ to the tokens' keys; during the autoregressive decoding phase, it uses the PQ codes and centroids to fetch the relevant key-value pairs.
  • Extensive experiments show that PQCache improves both model effectiveness and efficiency, achieving a 4.60% score improvement over existing methods.
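
To make the PQ idea concrete, here is a minimal sketch of product quantization for approximate attention-key retrieval, in the spirit of the bullets above: keys are split into subvectors, each subvector is quantized to a centroid index (the PQ code), and a query scores all cached tokens via small per-subspace lookup tables before fetching the full key-value pairs of only the top-scoring tokens. This is an illustrative assumption of how such a pipeline can look, not the paper's actual implementation; all function names and the toy k-means are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def train_pq(keys, n_sub=4, n_centroids=8, iters=10):
    """Split each key into n_sub subvectors and run a toy k-means per subspace."""
    n, d = keys.shape
    d_sub = d // n_sub
    codebooks = []
    for s in range(n_sub):
        sub = keys[:, s * d_sub:(s + 1) * d_sub]
        cent = sub[rng.choice(n, n_centroids, replace=False)].copy()
        for _ in range(iters):
            # assign each subvector to its nearest centroid, then recompute means
            dist = ((sub[:, None, :] - cent[None]) ** 2).sum(-1)
            assign = dist.argmin(1)
            for c in range(n_centroids):
                pts = sub[assign == c]
                if len(pts):
                    cent[c] = pts.mean(0)
        codebooks.append(cent)
    return codebooks

def encode(keys, codebooks):
    """Compress each key to one centroid index (PQ code) per subspace."""
    d_sub = keys.shape[1] // len(codebooks)
    codes = np.empty((keys.shape[0], len(codebooks)), dtype=np.int64)
    for s, cent in enumerate(codebooks):
        sub = keys[:, s * d_sub:(s + 1) * d_sub]
        codes[:, s] = ((sub[:, None, :] - cent[None]) ** 2).sum(-1).argmin(1)
    return codes

def approx_scores(query, codes, codebooks):
    """Approximate query·key for every cached token via per-subspace lookup tables."""
    d_sub = query.shape[0] // len(codebooks)
    scores = np.zeros(codes.shape[0])
    for s, cent in enumerate(codebooks):
        table = cent @ query[s * d_sub:(s + 1) * d_sub]  # dot product with each centroid
        scores += table[codes[:, s]]
    return scores

# Toy usage: 256 cached keys of dimension 32; pick the 16 tokens whose
# approximate attention scores are highest and fetch only their KV pairs.
keys = rng.standard_normal((256, 32))
query = rng.standard_normal(32)
codebooks = train_pq(keys)
codes = encode(keys, codebooks)
scores = approx_scores(query, codes, codebooks)
top_tokens = np.argsort(scores)[-16:]
```

The memory saving comes from storing one small integer per subspace instead of the full key vector; the full key-value pairs need to be loaded only for the selected `top_tokens`.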
