HashEvict: A Pre-Attention KV Cache Eviction Strategy using Locality-Sensitive Hashing

  • Transformer-based large language models accelerate inference with a key-value (KV) cache that stores the key and value projections of past tokens, which consumes significant GPU memory.
  • HashEvict is introduced as a pre-attention KV cache eviction strategy that uses locality-sensitive hashing (LSH) to compress the cache by quickly locating cached tokens that are cosine dissimilar to the current query token.
  • HashEvict computes the Hamming distance between binarized Gaussian projections of the current token's query and the cached tokens' keys, making retention decisions before attention is computed and thereby reducing computational cost (see the sketch after this list).
  • With HashEvict, the KV cache can be compressed by 30%-70% while maintaining high performance on reasoning, multiple-choice, long-context retrieval, and summarization tasks.

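As a rough illustration of the mechanism described above, the sketch below scores cached keys against the current query using SimHash-style binary codes (sign of Gaussian projections) and evicts the token with the largest Hamming distance. This is a simplified, single-head sketch: the sizes (d_head, hash_bits, cache_len) and the helper lsh_code are illustrative assumptions, not the paper's reference implementation, and a real system would also protect recent or attention-sink tokens.

```python
# Minimal sketch of LSH-based pre-attention eviction scoring (assumptions noted above).
import numpy as np

rng = np.random.default_rng(0)

d_head = 64      # per-head embedding dimension (assumed)
hash_bits = 16   # number of random hyperplanes, i.e. bits per LSH code (assumed)
cache_len = 128  # number of keys currently in the KV cache (assumed)

# Shared Gaussian projection: each column defines one random hyperplane.
projection = rng.standard_normal((d_head, hash_bits))

def lsh_code(x: np.ndarray) -> np.ndarray:
    """Binarize a vector (or batch of vectors) via the sign of its Gaussian projections."""
    return (x @ projection) > 0

# Stand-in data for the cached keys and the current query.
cached_keys = rng.standard_normal((cache_len, d_head))
query = rng.standard_normal(d_head)

key_codes = lsh_code(cached_keys)   # (cache_len, hash_bits) booleans
query_code = lsh_code(query)        # (hash_bits,) booleans

# Hamming distance between the query code and each key code approximates
# angular (cosine) dissimilarity: more mismatched bits -> less similar token.
hamming = np.count_nonzero(key_codes != query_code, axis=1)

# Pre-attention eviction: drop the cached token most dissimilar to the query.
evict_idx = int(np.argmax(hamming))
print(f"evict cached token {evict_idx} (Hamming distance {hamming[evict_idx]})")
```

Because the codes are short bit vectors, the Hamming comparison is far cheaper than computing full attention scores over the cache, which is what allows the retention decision to be made before attention runs.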
