Large Language Model (LLM) inference is a computationally intensive process that requires efficient scheduling to optimize latency and resource utilization.
Managing the Key-Value (KV) cache is a central challenge in LLM inference: caching keys and values avoids redundant computation across decoding steps, but the cache's memory footprint grows with batch size and sequence length, introducing tight memory constraints.
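For intuition, consider an illustrative calculation (the specific model configuration here is an assumption, not taken from the evaluation): in half precision, each cached token stores keys and values for every layer and attention head, roughly 2 × n_layers × n_heads × d_head × 2 bytes. For a 7B-parameter model with 32 layers, 32 attention heads, and head dimension 128, that is about 0.5 MB per token, so a single 2,048-token sequence can occupy roughly 1 GB of accelerator memory before any batching takes place.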
This work proposes a novel batching and scheduling algorithm that minimizes inference latency while effectively managing KV cache memory.
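To make the coupling between batching and memory concrete, the Python sketch below shows a generic memory-aware admission loop: queued requests join the running batch only while their worst-case KV-cache footprint fits a fixed budget. All names (Request, schedule_step, KV_BUDGET_BYTES) and the conservative reserve-ahead policy are assumptions for illustration; this is not the algorithm proposed here.

from collections import deque
from dataclasses import dataclass

@dataclass
class Request:
    req_id: int
    prompt_tokens: int       # tokens whose KV entries exist once the prompt is processed
    max_new_tokens: int      # worst-case decode length reserved at admission time

KV_BYTES_PER_TOKEN = 524_288       # ~0.5 MB per token, as in the 7B-scale example above
KV_BUDGET_BYTES = 24 * 1024**3     # hypothetical 24 GB budget for the KV cache

def schedule_step(waiting: deque, running: list, used_bytes: int):
    """Admit queued requests into the running batch while the worst-case
    KV-cache footprint of the batch stays within the memory budget."""
    while waiting:
        req = waiting[0]
        worst_case = (req.prompt_tokens + req.max_new_tokens) * KV_BYTES_PER_TOKEN
        if used_bytes + worst_case > KV_BUDGET_BYTES:
            break                    # head-of-line request does not fit yet
        waiting.popleft()
        running.append(req)
        used_bytes += worst_case
    return running, used_bytes

Reserving max_new_tokens up front is deliberately conservative; schedulers that admit requests more aggressively must instead be able to preempt or swap sequences out when the cache fills.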
Empirical evaluations on simulations driven by real-world LLM inference datasets show that the proposed algorithm outperforms benchmark algorithms, supporting more sustainable and cost-effective LLM deployment.