Medha is a system that enables efficient long-context LLM inference without compromising the latency of shorter requests or overall system efficiency.
Medha introduces three key innovations: adaptive chunking with slack-aware scheduling, Sequence Pipeline Parallelism (SPP), and KV Cache Parallelism (KVP).
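To make the first of these concrete, the sketch below illustrates the general idea behind slack-aware adaptive chunking: a long prefill is processed in chunks whose size is shrunk when short requests queued behind it are close to their latency deadlines. The `Request` fields, `pick_chunk_size`, and the linear iteration-time model are illustrative assumptions for this sketch, not Medha's actual scheduler.

```python
import time
from dataclasses import dataclass
from typing import List


@dataclass
class Request:
    """A queued request with a latency deadline (illustrative fields)."""
    arrival: float
    deadline: float           # absolute time by which the next token is due
    remaining_prefill: int    # prompt tokens not yet processed


def iteration_time(chunk_tokens: int, base_ms: float = 5.0,
                   per_token_ms: float = 0.05) -> float:
    """Crude linear model of one batch iteration's latency (assumed, not measured)."""
    return (base_ms + per_token_ms * chunk_tokens) / 1000.0


def pick_chunk_size(long_req: Request, waiting: List[Request],
                    max_chunk: int = 8192, min_chunk: int = 128) -> int:
    """Shrink the long prefill's next chunk so the most urgent waiting
    request can still meet its deadline (slack-aware adaptive chunking)."""
    now = time.monotonic()
    # Smallest slack among requests queued behind the long prefill.
    slack = min((r.deadline - now for r in waiting), default=float("inf"))

    chunk = min(max_chunk, long_req.remaining_prefill)
    # Halve the chunk until one iteration fits inside the tightest slack.
    while chunk > min_chunk and iteration_time(chunk) > slack:
        chunk //= 2
    return max(chunk, min_chunk)
```

The point of the sketch is the trade-off it encodes: instead of running fixed-size prefill chunks, the scheduler spends some of the long request's prefill throughput to preserve the deadline slack of short requests sharing the same replica.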
Medha achieves unprecedented scale by supporting contexts up to 10M tokens with production-grade latency.
Evaluation shows that Medha reduces median latency by up to 30x compared to state-of-the-art systems when serving a mix of short and long requests, while improving throughput by upwards of 5x.