Autoregressive Transformers rely on Key-Value (KV) caching to accelerate inference, but the linear growth of the KV cache with context length leads to excessive memory consumption and bandwidth constraints.
The proposed MorphKV technique maintains a constant-sized KV cache while preserving accuracy by adaptively ranking tokens through correlation-aware selection.
MorphKV iteratively refines the KV cache via lightweight updates guided by the attention patterns of recent tokens, capturing inter-token correlations more accurately than static recency-based retention.
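To make the selection step concrete, the minimal sketch below shows one way such correlation-aware retention could work: the most recent tokens are always kept, and the remaining cache budget is filled with the older tokens that receive the most attention from those recent tokens. The function name `morphkv_select`, its signature, and the summed-attention scoring rule are illustrative assumptions, not MorphKV's exact algorithm.

```python
import numpy as np

def morphkv_select(attn_recent: np.ndarray, window: int, budget: int) -> np.ndarray:
    """Sketch of correlation-aware KV-cache selection (illustrative, not the paper's code).

    attn_recent: (R, T) attention weights of the R most recent tokens over all
                 T currently cached tokens (each row sums to 1).
    window:      number of most recent tokens that are always retained.
    budget:      constant total KV-cache capacity.
    Returns the sorted indices of the tokens to keep.
    """
    T = attn_recent.shape[1]
    recent = np.arange(max(0, T - window), T)      # recency window, always kept
    older = np.arange(0, max(0, T - window))
    # Rank older tokens by the total attention they receive from recent
    # tokens -- a proxy for inter-token correlation (assumed scoring rule).
    scores = attn_recent[:, older].sum(axis=0)
    k = max(0, budget - len(recent))               # remaining capacity
    top_older = older[np.argsort(scores)[::-1][:k]]
    return np.sort(np.concatenate([top_older, recent]))

# Example: a 32-token cache squeezed to a constant budget of 8 entries.
rng = np.random.default_rng(0)
attn = rng.random((4, 32))
attn /= attn.sum(axis=1, keepdims=True)            # normalize rows like softmax outputs
keep = morphkv_select(attn, window=4, budget=8)
print(keep)  # 8 indices: 4 top-correlated older tokens + the 4 most recent
```

Because the kept set is recomputed as new tokens arrive, the cache size stays fixed while its contents adapt to the evolving attention pattern.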
Evaluations show 52.9% memory savings and 18.2% higher accuracy compared to prior work, making MorphKV well suited for efficient real-world deployment.