Autoregressive Transformers rely on Key-Value (KV) caching to accelerate inference, but the cache grows linearly with sequence length, leading to excessive memory consumption.
MorphKV is an inference-time technique that maintains a constant-sized KV cache while preserving accuracy, balancing long-range dependencies against local coherence during text generation.
By adaptively ranking cached tokens through correlation-aware selection, MorphKV avoids early-token bias, retains high-fidelity context, and captures inter-token correlations more accurately than fixed eviction rules.
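The idea can be pictured with a minimal sketch, not the authors' implementation: always keep a small window of recent tokens for local coherence, rank the older cached tokens by how strongly the recent tokens attend to them as a proxy for inter-token correlation, and evict the lowest-ranked entries so the cache stays at a fixed budget. All function names, shapes, and parameters below are illustrative assumptions.

```python
import torch

def correlation_aware_prune(keys, values, attn_scores, window: int, budget: int):
    """Hedged sketch of constant-budget KV-cache pruning.

    keys, values : [seq_len, head_dim] cached tensors for one attention head
    attn_scores  : [seq_len, seq_len] attention weights (query x key)
    window       : number of most recent tokens always kept (assumed < budget)
    budget       : total number of cached tokens to retain
    """
    seq_len = keys.shape[0]
    if seq_len <= budget:
        # Cache still within budget; nothing to evict.
        return keys, values

    # Recent tokens are always retained to preserve local coherence.
    recent_idx = torch.arange(seq_len - window, seq_len)

    # Score each older token by how strongly the recent window attends to it,
    # a simple proxy for its correlation with the ongoing generation.
    older_scores = attn_scores[-window:, : seq_len - window].sum(dim=0)
    keep_older = torch.topk(older_scores, k=budget - window).indices.sort().values

    keep = torch.cat([keep_older, recent_idx])
    return keys[keep], values[keep]
```

Because the ranking is recomputed as generation proceeds, the retained set adapts to the current context rather than being biased toward whatever appeared first in the prompt.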
Evaluations report that MorphKV achieves 52.9% memory savings and 18.2% higher accuracy on average compared to prior methods, making it well suited to real-time applications such as content creation and code generation.