Medha is a system that enables efficient long-context LLM inference without compromising the latency of shorter requests or overall system efficiency.
Medha introduces three key innovations: adaptive chunking with slack-aware scheduling, Sequence Pipeline Parallelism (SPP), and KV Cache Parallelism (KVP).
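To make the first of these concrete, the sketch below illustrates the general idea behind slack-aware adaptive chunking: a long prefill is processed in chunks whose size is shrunk when short requests queued behind it are close to their latency deadlines. The `Request` fields, `pick_chunk_size`, and the linear iteration-time model are illustrative assumptions for this sketch, not Medha's actual scheduler.

```python
import time
from dataclasses import dataclass
from typing import List


@dataclass
class Request:
    """A queued request with a latency deadline (illustrative fields)."""
    arrival: float
    deadline: float           # absolute time by which the next token is due
    remaining_prefill: int    # prompt tokens not yet processed


def iteration_time(chunk_tokens: int, base_ms: float = 5.0,
                   per_token_ms: float = 0.05) -> float:
    """Crude linear model of one batch iteration's latency (assumed, not measured)."""
    return (base_ms + per_token_ms * chunk_tokens) / 1000.0


def pick_chunk_size(long_req: Request, waiting: List[Request],
                    max_chunk: int = 8192, min_chunk: int = 128) -> int:
    """Shrink the long prefill's next chunk so the most urgent waiting
    request can still meet its deadline (slack-aware adaptive chunking)."""
    now = time.monotonic()
    # Smallest slack among requests queued behind the long prefill.
    slack = min((r.deadline - now for r in waiting), default=float("inf"))

    chunk = min(max_chunk, long_req.remaining_prefill)
    # Halve the chunk until one iteration fits inside the tightest slack.
    while chunk > min_chunk and iteration_time(chunk) > slack:
        chunk //= 2
    return max(chunk, min_chunk)
```

The point of the sketch is the trade-off it encodes: instead of running fixed-size prefill chunks, the scheduler spends some of the long request's prefill throughput to preserve the deadline slack of short requests sharing the same replica.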
Medha achieves unprecedented scale by supporting contexts up to 10M tokens with production-grade latency.
Evaluation shows that Medha reduces median latency by up to 30x compared to state-of-the-art systems when serving a mix of short and long requests, while improving throughput by upwards of 5x.