Autoregressive Transformers rely on Key-Value (KV) caching to accelerate inference, but the cache grows linearly with sequence length, leading to excessive memory consumption.
MorphKV is an inference-time technique that maintains a constant-sized KV cache while preserving accuracy, balancing long-range dependencies against local coherence during text generation.
By adaptively ranking cached tokens through correlation-aware selection, MorphKV avoids early-token bias, retains high-fidelity context, and captures inter-token correlations more accurately than fixed eviction rules.
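The idea can be pictured with a minimal sketch, not the authors' implementation: always keep a small window of recent tokens for local coherence, rank the older cached tokens by how strongly the recent tokens attend to them as a proxy for inter-token correlation, and evict the lowest-ranked entries so the cache stays at a fixed budget. All function names, shapes, and parameters below are illustrative assumptions.

```python
import torch

def correlation_aware_prune(keys, values, attn_scores, window: int, budget: int):
    """Hedged sketch of constant-budget KV-cache pruning.

    keys, values : [seq_len, head_dim] cached tensors for one attention head
    attn_scores  : [seq_len, seq_len] attention weights (query x key)
    window       : number of most recent tokens always kept (assumed < budget)
    budget       : total number of cached tokens to retain
    """
    seq_len = keys.shape[0]
    if seq_len <= budget:
        # Cache still within budget; nothing to evict.
        return keys, values

    # Recent tokens are always retained to preserve local coherence.
    recent_idx = torch.arange(seq_len - window, seq_len)

    # Score each older token by how strongly the recent window attends to it,
    # a simple proxy for its correlation with the ongoing generation.
    older_scores = attn_scores[-window:, : seq_len - window].sum(dim=0)
    keep_older = torch.topk(older_scores, k=budget - window).indices.sort().values

    keep = torch.cat([keep_older, recent_idx])
    return keys[keep], values[keep]
```

Because the ranking is recomputed as generation proceeds, the retained set adapts to the current context rather than being biased toward whatever appeared first in the prompt.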
Evaluations report that MorphKV achieves 52.9% memory savings and 18.2% higher accuracy on average compared to prior methods, making it well suited to real-time applications such as content creation and code generation.