Large Language Model (LLM) inference is a computationally intensive process that requires efficient scheduling to optimize latency and resource utilization.
Managing the Key-Value (KV) cache is a central challenge in LLM inference: caching keys and values avoids redundant computation across decoding steps, but the cache's memory footprint grows with batch size and sequence length, introducing tight memory constraints.
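For intuition, consider an illustrative calculation (the specific model configuration here is an assumption, not taken from the evaluation): in half precision, each cached token stores keys and values for every layer and attention head, roughly 2 × n_layers × n_heads × d_head × 2 bytes. For a 7B-parameter model with 32 layers, 32 attention heads, and head dimension 128, that is about 0.5 MB per token, so a single 2,048-token sequence can occupy roughly 1 GB of accelerator memory before any batching takes place.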
This work proposes a novel batching and scheduling algorithm that minimizes inference latency while effectively managing KV cache memory.
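To make the coupling between batching and memory concrete, the Python sketch below shows a generic memory-aware admission loop: queued requests join the running batch only while their worst-case KV-cache footprint fits a fixed budget. All names (Request, schedule_step, KV_BUDGET_BYTES) and the conservative reserve-ahead policy are assumptions for illustration; this is not the algorithm proposed here.

from collections import deque
from dataclasses import dataclass

@dataclass
class Request:
    req_id: int
    prompt_tokens: int       # tokens whose KV entries exist once the prompt is processed
    max_new_tokens: int      # worst-case decode length reserved at admission time

KV_BYTES_PER_TOKEN = 524_288       # ~0.5 MB per token, as in the 7B-scale example above
KV_BUDGET_BYTES = 24 * 1024**3     # hypothetical 24 GB budget for the KV cache

def schedule_step(waiting: deque, running: list, used_bytes: int):
    """Admit queued requests into the running batch while the worst-case
    KV-cache footprint of the batch stays within the memory budget."""
    while waiting:
        req = waiting[0]
        worst_case = (req.prompt_tokens + req.max_new_tokens) * KV_BYTES_PER_TOKEN
        if used_bytes + worst_case > KV_BUDGET_BYTES:
            break                    # head-of-line request does not fit yet
        waiting.popleft()
        running.append(req)
        used_bytes += worst_case
    return running, used_bytes

Reserving max_new_tokens up front is deliberately conservative; schedulers that admit requests more aggressively must instead be able to preempt or swap sequences out when the cache fills.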
Empirical evaluations on simulations driven by real-world LLM inference datasets show that the proposed algorithm outperforms benchmark algorithms, supporting more sustainable and cost-effective LLM deployment.