SentenceKV is a novel sentence-level semantic KV caching approach for efficient LLM inference.
It addresses a limitation of traditional token-level caching methods, which ignore the semantic relationships between tokens.
By compressing each sentence's token representations into a concise semantic vector kept on the GPU, SentenceKV reduces memory overhead and improves computational efficiency.
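The core idea can be sketched as follows. This is a minimal illustration, not the paper's implementation: the choice of mean-pooling token keys into a per-sentence semantic vector, and of scoring sentences by dot product with the current query, are assumptions made here for concreteness.

```python
import numpy as np

def build_sentence_index(keys, sentence_ids):
    """Compress per-token key vectors into one semantic vector per sentence.

    keys: (num_tokens, d) array of key vectors.
    sentence_ids: (num_tokens,) array mapping each token to its sentence.
    Returns a (num_sentences, d) array of mean-pooled semantic vectors
    (mean-pooling is an assumption for this sketch).
    """
    num_sentences = int(sentence_ids.max()) + 1
    sem = np.zeros((num_sentences, keys.shape[1]))
    counts = np.zeros(num_sentences)
    for k, s in zip(keys, sentence_ids):
        sem[s] += k
        counts[s] += 1
    return sem / counts[:, None]

def select_sentences(sem_vectors, query, top_k):
    """Score each sentence's semantic vector against the current query
    and return the indices of the top-k most relevant sentences, whose
    full KV entries would then be fetched for attention."""
    scores = sem_vectors @ query
    return np.argsort(scores)[::-1][:top_k]

# Toy example: 5 tokens spread over 3 sentences in a 2-d key space.
keys = np.array([[1.0, 0.0], [1.0, 0.0],   # sentence 0
                 [0.0, 1.0], [0.0, 1.0],   # sentence 1
                 [1.0, 1.0]])              # sentence 2
sentence_ids = np.array([0, 0, 1, 1, 2])
sem = build_sentence_index(keys, sentence_ids)
query = np.array([0.1, 1.0])
picked = select_sentences(sem, query, top_k=1)
```

Only the small semantic-vector index participates in this selection step, which is why keeping it on the GPU is cheap relative to holding every token's KV entries there.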
Extensive evaluations show that SentenceKV outperforms existing methods in inference efficiency, memory usage, and model accuracy.