Transformer-based large language models (LLMs) use the key-value (KV) cache to accelerate inference by storing the attention keys and values of past tokens, which consumes significant GPU memory.
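To give a sense of scale, the sketch below estimates the KV cache footprint for a hypothetical 7B-scale configuration (32 layers, 32 heads, head dimension 128, fp16); the numbers are illustrative assumptions, not figures from the paper.

```python
# Back-of-the-envelope KV cache size for a hypothetical 7B-scale model.
layers, heads, head_dim = 32, 32, 128
seq_len, batch, bytes_per_elem = 4096, 8, 2   # fp16

# Factor of 2 accounts for storing both keys and values.
kv_bytes = 2 * layers * heads * head_dim * seq_len * batch * bytes_per_elem
print(f"KV cache: {kv_bytes / 2**30:.1f} GiB")   # ~16.0 GiB for this config
```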
HashEvict is a pre-attention KV cache eviction strategy that uses locality-sensitive hashing (LSH) to compress the cache by quickly locating cached tokens that are cosine-dissimilar to the current query token.
To make retention decisions before attention is computed, HashEvict measures the Hamming distance between binarized Gaussian projections of the current token's query and of the cached tokens' keys, reducing computational cost.
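The following is a minimal sketch of this idea, not the authors' implementation: a shared random Gaussian projection hashes queries and keys into sign bits, and the Hamming distance between hashes serves as a proxy for cosine dissimilarity when choosing which cached tokens to drop. Function names such as `hamming_scores` and `evict_lowest`, and the parameter `hash_bits`, are illustrative assumptions.

```python
import torch

def make_projection(head_dim: int, hash_bits: int, seed: int = 0) -> torch.Tensor:
    """Random Gaussian projection shared by queries and keys."""
    gen = torch.Generator().manual_seed(seed)
    return torch.randn(head_dim, hash_bits, generator=gen)

def sign_hash(x: torch.Tensor, proj: torch.Tensor) -> torch.Tensor:
    """Binarize the Gaussian projection: 1 where x @ proj > 0, else 0."""
    return (x @ proj > 0).to(torch.uint8)

def hamming_scores(query: torch.Tensor, cached_keys: torch.Tensor,
                   proj: torch.Tensor) -> torch.Tensor:
    """Hamming distance between the query hash and each cached key hash.

    Larger distance roughly corresponds to lower cosine similarity, so
    high-distance tokens are candidates for eviction before attention.
    """
    q_hash = sign_hash(query, proj)        # (hash_bits,)
    k_hash = sign_hash(cached_keys, proj)  # (num_cached, hash_bits)
    return (q_hash ^ k_hash).sum(dim=-1)   # (num_cached,)

def evict_lowest(cached_keys: torch.Tensor, cached_values: torch.Tensor,
                 query: torch.Tensor, proj: torch.Tensor, budget: int):
    """Keep only the `budget` cached tokens closest in Hamming distance to the query."""
    dist = hamming_scores(query, cached_keys)
    keep = torch.topk(-dist, k=min(budget, dist.numel())).indices
    return cached_keys[keep], cached_values[keep]
```

In this sketch the hashes are cheap to compute and compare relative to full attention scores, which is what allows the retention decision to happen pre-attention.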
With HashEvict, the KV cache can be compressed by 30-70% while maintaining high performance on reasoning, multiple-choice, long-context retrieval, and summarization tasks.