Key-Value cache (\texttt{KV cache}) compression has emerged as a promising technique for optimizing Large Language Model (LLM) serving.
This paper comprehensively reviews existing algorithmic designs and benchmark studies, identifying performance measurement aspects that current evaluations miss and that hinder practical adoption.
Representative \texttt{KV cache} compression methods are evaluated, uncovering issues that affect computational efficiency and end-to-end latency.
Tools are provided to aid future \texttt{KV cache} compression studies and to facilitate practical deployment in production.