Source: arXiv

Medha: Efficiently Serving Multi-Million Context Length LLM Inference Requests Without Approximations

  • Medha is a serving system that enables efficient long-context LLM inference without compromising latency for shorter requests or overall system efficiency.
  • Medha introduces three key innovations: adaptive chunking with slack-aware scheduling, Sequence Pipeline Parallelism (SPP), and KV Cache Parallelism (KVP); the first and third are sketched below.
  • Medha achieves unprecedented scale, supporting contexts of up to 10M tokens with production-grade latency.
  • In evaluation, Medha reduces median latency by up to 30x compared to state-of-the-art systems when serving a mix of short and long requests, while improving throughput by more than 5x.
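To make the scheduling idea concrete, here is a minimal Python sketch of slack-aware scheduling with adaptive chunking, in the spirit of Medha's first innovation. All names (Request, schedule_step, chunk_budget, tokens_per_s) are hypothetical illustrations, not Medha's actual API: the scheduler serves the request with the least slack and sizes each prefill chunk so that a multi-million-token prefill cannot starve short requests.

```python
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class Request:
    slack: float                          # deadline minus estimated remaining service time
    req_id: int = field(compare=False)
    remaining_tokens: int = field(compare=False)
    deadline: float = field(compare=False)

def schedule_step(queue, now, chunk_budget=512, tokens_per_s=1e4):
    """One scheduler iteration (hypothetical): pop the least-slack request,
    give it an adaptively sized prefill chunk, and requeue it if unfinished."""
    if not queue:
        return None
    req = heapq.heappop(queue)
    # Adaptive chunking: shrink the chunk when the next-tightest request has
    # little slack left, so long prefills yield the engine to urgent work.
    tightest = queue[0].slack if queue else float("inf")
    chunk = int(min(chunk_budget, req.remaining_tokens,
                    max(64, tightest * tokens_per_s)))
    req.remaining_tokens -= chunk
    if req.remaining_tokens > 0:
        # Recompute slack from the work left and put the request back.
        req.slack = req.deadline - now - req.remaining_tokens / tokens_per_s
        heapq.heappush(queue, req)
    return req.req_id, chunk

# Usage: a short, slack-starved request is served ahead of a 2M-token prefill.
q = []
heapq.heappush(q, Request(slack=0.5, req_id=1, remaining_tokens=8, deadline=1.0))
heapq.heappush(q, Request(slack=30.0, req_id=2, remaining_tokens=2_000_000, deadline=600.0))
print(schedule_step(q, now=0.0))  # -> (1, 8)
```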
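KV Cache Parallelism, the third innovation, spreads the KV cache across devices; exact attention can then be recovered by combining per-shard partial results with a log-sum-exp reduction, which is one way to serve long contexts "without approximations". A minimal NumPy sketch of that exact-combination step follows; the function names and shapes are illustrative assumptions, not Medha's implementation.

```python
import numpy as np

def partial_attention(q, k_shard, v_shard):
    """Attention over one KV-cache shard; returns the normalized partial
    output plus the log-sum-exp of its scores so shards combine exactly."""
    scores = q @ k_shard.T / np.sqrt(q.shape[-1])      # [nq, nk_shard]
    m = scores.max(axis=-1, keepdims=True)
    p = np.exp(scores - m)
    out = p @ v_shard
    lse = m.squeeze(-1) + np.log(p.sum(axis=-1))       # log sum exp(scores)
    return out / p.sum(axis=-1, keepdims=True), lse

def combine_shards(partials):
    """Merge per-shard results with a numerically stable softmax over the
    per-shard log-sum-exps; matches unsharded attention exactly."""
    outs, lses = zip(*partials)
    lses = np.stack(lses, axis=0)                      # [shards, nq]
    m = lses.max(axis=0)
    w = np.exp(lses - m)
    w = w / w.sum(axis=0)                              # per-shard weights
    return sum(wi[..., None] * oi for wi, oi in zip(w, outs))

# Usage: four KV shards reproduce full attention to machine precision.
rng = np.random.default_rng(0)
q = rng.standard_normal((4, 64))
k = rng.standard_normal((1024, 64))
v = rng.standard_normal((1024, 64))
shards = [partial_attention(q, k[i::4], v[i::4]) for i in range(4)]
full, _ = partial_attention(q, k, v)
assert np.allclose(combine_shards(shards), full)
```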
