Apt-Serve is a scalable framework designed to enhance effective throughput in large language model (LLM) inference serving systems.
It addresses two bottlenecks in existing systems: the memory-intensive KV cache and rigid batch composition.
Apt-Serve pairs the KV cache with a memory-efficient hidden cache that stores reusable input hidden-state vectors, enabling larger batch sizes and greater request concurrency.
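To see why a hidden cache is lighter than a KV cache, note that a KV cache keeps a key and a value vector per layer for every token, while a single input hidden-state vector per token suffices for the hidden cache. The sketch below illustrates the per-token arithmetic using assumed Llama-2-7B-like dimensions (32 layers, hidden size 4096, fp16); these figures are illustrative, not Apt-Serve's measured numbers.

```python
# Illustrative per-token memory comparison: full KV cache vs. hidden cache.
# All model dimensions below are assumptions for the sake of the example.

num_layers = 32        # transformer layers (assumed)
hidden_size = 4096     # model hidden dimension (assumed)
bytes_per_elem = 2     # fp16 storage

# KV cache: one key vector and one value vector per layer, per token.
kv_bytes_per_token = 2 * num_layers * hidden_size * bytes_per_elem

# Hidden cache: a single input hidden-state vector per token.
hidden_bytes_per_token = hidden_size * bytes_per_elem

print(kv_bytes_per_token // 1024, "KiB per token (KV cache)")      # 512 KiB
print(hidden_bytes_per_token // 1024, "KiB per token (hidden cache)")  # 8 KiB
print("reduction factor:", kv_bytes_per_token // hidden_bytes_per_token)
```

Under these assumptions the hidden cache is 2 × num_layers times smaller per token, which is what allows far more requests to fit in a batch.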
Evaluations show that Apt-Serve achieves up to an 8.8x improvement in effective throughput over state-of-the-art inference serving systems.