Transformer-based large language models (LLMs) use the key-value (KV) cache to accelerate inference by storing the attention keys and values of past tokens, which consumes significant GPU memory.
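To give a sense of scale, the sketch below estimates the KV cache footprint for a hypothetical 7B-scale configuration (32 layers, 32 heads, head dimension 128, fp16); the numbers are illustrative assumptions, not figures from the paper.

```python
# Back-of-the-envelope KV cache size for a hypothetical 7B-scale model.
layers, heads, head_dim = 32, 32, 128
seq_len, batch, bytes_per_elem = 4096, 8, 2   # fp16

# Factor of 2 accounts for storing both keys and values.
kv_bytes = 2 * layers * heads * head_dim * seq_len * batch * bytes_per_elem
print(f"KV cache: {kv_bytes / 2**30:.1f} GiB")   # ~16.0 GiB for this config
```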
HashEvict is a pre-attention KV cache eviction strategy that uses locality-sensitive hashing (LSH) to compress the cache by quickly locating cached tokens that are cosine-dissimilar to the current query token.
To make retention decisions before attention is computed, HashEvict measures the Hamming distance between binarized Gaussian projections of the current token's query and of the cached tokens' keys, reducing computational cost.
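The following is a minimal sketch of this idea, not the authors' implementation: a shared random Gaussian projection hashes queries and keys into sign bits, and the Hamming distance between hashes serves as a proxy for cosine dissimilarity when choosing which cached tokens to drop. Function names such as `hamming_scores` and `evict_lowest`, and the parameter `hash_bits`, are illustrative assumptions.

```python
import torch

def make_projection(head_dim: int, hash_bits: int, seed: int = 0) -> torch.Tensor:
    """Random Gaussian projection shared by queries and keys."""
    gen = torch.Generator().manual_seed(seed)
    return torch.randn(head_dim, hash_bits, generator=gen)

def sign_hash(x: torch.Tensor, proj: torch.Tensor) -> torch.Tensor:
    """Binarize the Gaussian projection: 1 where x @ proj > 0, else 0."""
    return (x @ proj > 0).to(torch.uint8)

def hamming_scores(query: torch.Tensor, cached_keys: torch.Tensor,
                   proj: torch.Tensor) -> torch.Tensor:
    """Hamming distance between the query hash and each cached key hash.

    Larger distance roughly corresponds to lower cosine similarity, so
    high-distance tokens are candidates for eviction before attention.
    """
    q_hash = sign_hash(query, proj)        # (hash_bits,)
    k_hash = sign_hash(cached_keys, proj)  # (num_cached, hash_bits)
    return (q_hash ^ k_hash).sum(dim=-1)   # (num_cached,)

def evict_lowest(cached_keys: torch.Tensor, cached_values: torch.Tensor,
                 query: torch.Tensor, proj: torch.Tensor, budget: int):
    """Keep only the `budget` cached tokens closest in Hamming distance to the query."""
    dist = hamming_scores(query, cached_keys)
    keep = torch.topk(-dist, k=min(budget, dist.numel())).indices
    return cached_keys[keep], cached_values[keep]
```

In this sketch the hashes are cheap to compute and compare relative to full attention scores, which is what allows the retention decision to happen pre-attention.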
With HashEvict, the KV cache can be compressed by 30-70% while maintaining high performance on reasoning, multiple-choice, long-context retrieval, and summarization tasks.