Image Credit: arXiv

Apt-Serve: Adaptive Request Scheduling on Hybrid Cache for Scalable LLM Inference Serving

  • Apt-Serve is a scalable framework designed to improve effective throughput in large language model (LLM) inference serving systems.
  • It targets the bottleneck caused by the memory-intensive KV cache and the rigid batch composition of existing serving systems.
  • Apt-Serve combines the KV cache with a memory-efficient hidden cache that stores reusable input hidden-state vectors, enabling larger batch sizes and higher request concurrency.
  • In evaluations, Apt-Serve achieves up to an 8.8x improvement in effective throughput over state-of-the-art inference serving systems.
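The hybrid-cache idea above can be sketched as a toy admission policy: a scheduler greedily packs requests into a batch, falling back from the full KV cache to a cheaper hidden cache when the memory budget would otherwise be exceeded. This is a minimal illustrative sketch, not the paper's actual algorithm; the class names, cost constants, and greedy policy are all assumptions.

```python
from dataclasses import dataclass

@dataclass
class Request:
    req_id: int
    num_tokens: int

# Assumed relative per-token memory costs: the KV cache stores per-layer
# key/value tensors, while the hidden cache stores a single hidden-state
# vector per token, so it is much smaller per token.
KV_BYTES_PER_TOKEN = 16
HIDDEN_BYTES_PER_TOKEN = 2

def schedule_batch(requests, mem_budget):
    """Greedily admit requests under a memory budget, preferring the KV
    cache and falling back to the memory-efficient hidden cache."""
    batch, used = [], 0
    for r in requests:
        kv_cost = r.num_tokens * KV_BYTES_PER_TOKEN
        hid_cost = r.num_tokens * HIDDEN_BYTES_PER_TOKEN
        if used + kv_cost <= mem_budget:
            batch.append((r.req_id, "kv"))
            used += kv_cost
        elif used + hid_cost <= mem_budget:
            # Hidden cache: hidden states are reused, but keys/values must
            # be recomputed, trading extra compute for more concurrency.
            batch.append((r.req_id, "hidden"))
            used += hid_cost
    return batch, used
```

Under this toy cost model, a budget that fits only one full KV-cache request can still admit further requests via the hidden cache, which is the mechanism behind the larger batch sizes the summary describes.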

