Image Credit: arXiv

Apt-Serve: Adaptive Request Scheduling on Hybrid Cache for Scalable LLM Inference Serving

  • Apt-Serve is a scalable framework designed to improve effective throughput in large language model (LLM) inference serving systems.
  • It targets the bottleneck caused by the memory-intensive KV cache and the rigid batch composition of existing serving systems.
  • Apt-Serve combines the KV cache with a memory-efficient hidden cache that stores reusable input hidden-state vectors, enabling larger batch sizes and higher request concurrency.
  • In evaluations, Apt-Serve achieves up to an 8.8x improvement in effective throughput over state-of-the-art inference serving systems.
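The hybrid-cache idea above can be sketched as a toy admission policy: a scheduler greedily packs requests into a batch, falling back from the full KV cache to a cheaper hidden cache when the memory budget would otherwise be exceeded. This is a minimal illustrative sketch, not the paper's actual algorithm; the class names, cost constants, and greedy policy are all assumptions.

```python
from dataclasses import dataclass

@dataclass
class Request:
    req_id: int
    num_tokens: int

# Assumed relative per-token memory costs: the KV cache stores per-layer
# key/value tensors, while the hidden cache stores a single hidden-state
# vector per token, so it is much smaller per token.
KV_BYTES_PER_TOKEN = 16
HIDDEN_BYTES_PER_TOKEN = 2

def schedule_batch(requests, mem_budget):
    """Greedily admit requests under a memory budget, preferring the KV
    cache and falling back to the memory-efficient hidden cache."""
    batch, used = [], 0
    for r in requests:
        kv_cost = r.num_tokens * KV_BYTES_PER_TOKEN
        hid_cost = r.num_tokens * HIDDEN_BYTES_PER_TOKEN
        if used + kv_cost <= mem_budget:
            batch.append((r.req_id, "kv"))
            used += kv_cost
        elif used + hid_cost <= mem_budget:
            # Hidden cache: hidden states are reused, but keys/values must
            # be recomputed, trading extra compute for more concurrency.
            batch.append((r.req_id, "hidden"))
            used += hid_cost
    return batch, used
```

Under this toy cost model, a budget that fits only one full KV-cache request can still admit further requests via the hidden cache, which is the mechanism behind the larger batch sizes the summary describes.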

