Optimizing systems for efficient Large Language Model (LLM) inference and AI agent workloads is becoming critical as demand for them grows rapidly.
A new study bridges the gap between the queuing theory and LLM systems communities to establish queuing fundamentals for LLM inference.
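As a point of reference, one common textbook-style formalization of "throughput-optimal" (also called maximally stable) is sketched below; the notation is generic and not taken from the study itself, with λ standing for the request arrival rate, μ for the maximum service rate the hardware can sustain, and Q^π(t) for the queue length at time t under scheduling policy π.

```latex
% Generic notation (an assumption for illustration, not the study's):
% \lambda = arrival rate, \mu = sustainable service rate, Q^{\pi}(t) = queue length under policy \pi.
\[
  \pi \text{ is throughput-optimal (maximally stable)} \iff
  \forall\, \lambda < \mu:\quad
  \limsup_{t \to \infty} \frac{1}{t} \int_0^{t} \mathbb{E}\!\left[ Q^{\pi}(s) \right] ds < \infty .
\]
```

In words: a throughput-optimal policy keeps queues from growing without bound at every load the hardware could in principle handle.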
The study proves that 'work-conserving' scheduling algorithms, which never leave serving capacity idle while requests are waiting, can achieve maximum throughput for both individual inference requests and AI agent workloads.
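A minimal sketch of what a work-conserving scheduler loop can look like in a continuous-batching setting is shown below; the class and method names (`WorkConservingScheduler`, `step`, the per-step token budget) are illustrative assumptions, not the study's algorithm or any serving engine's actual API.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    rid: int
    remaining_tokens: int  # decode tokens still to generate


class WorkConservingScheduler:
    """Continuous batching sketch: at every step, top up the running batch
    from the waiting queue, so capacity is never idle while work is queued."""

    def __init__(self, max_batch_size: int):
        self.max_batch_size = max_batch_size
        self.waiting: deque = deque()
        self.running: list = []

    def submit(self, req: Request) -> None:
        self.waiting.append(req)

    def step(self) -> list:
        # Work-conserving admission: fill any free batch slots whenever
        # requests are waiting.
        while self.waiting and len(self.running) < self.max_batch_size:
            self.running.append(self.waiting.popleft())

        # Generate one token for every running request; retire finished ones.
        finished = []
        for req in self.running:
            req.remaining_tokens -= 1
            if req.remaining_tokens == 0:
                finished.append(req.rid)
        self.running = [r for r in self.running if r.remaining_tokens > 0]
        return finished


if __name__ == "__main__":
    sched = WorkConservingScheduler(max_batch_size=4)
    for i, n in enumerate([3, 1, 2, 5, 2]):
        sched.submit(Request(rid=i, remaining_tokens=n))
    done = []
    while sched.running or sched.waiting:
        done.extend(sched.step())
    print("completion order:", done)
```

The key property is in the admission loop: as long as any request is waiting and a batch slot is free, the slot is filled immediately rather than held back for a later batch.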
Evaluations of real-world systems show that Orca and Sarathi-serve are throughput-optimal, while FasterTransformer and vanilla vLLM are not maximally stable, meaning their queues can grow without bound even at loads the hardware could otherwise sustain, and should be used with caution.
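For contrast with the work-conserving sketch above, the following is a generic illustration of a run-to-completion (static) batching loop, which leaves slots idle once short requests in a batch finish; it is an assumed, simplified pattern for exposition only and does not reproduce FasterTransformer's or vLLM's actual scheduling logic.

```python
def run_to_completion_batches(requests, max_batch_size: int):
    """Static batching sketch: form a batch, run it until every member
    finishes, and only then admit new requests. Slots freed early stay idle,
    so the loop is not work-conserving and wastes capacity under load."""
    queue = list(requests)  # (request_id, decode_tokens) pairs
    total_steps = 0
    idle_slot_steps = 0
    while queue:
        batch, queue = queue[:max_batch_size], queue[max_batch_size:]
        batch_steps = max(tokens for _, tokens in batch)  # run for the longest member
        total_steps += batch_steps
        idle_slot_steps += sum(batch_steps - tokens for _, tokens in batch)
    return total_steps, idle_slot_steps


# Same illustrative workload as above: the idle slot-steps are capacity a
# work-conserving scheduler would have filled with waiting requests.
print(run_to_completion_batches([(0, 3), (1, 1), (2, 2), (3, 5), (4, 2)], max_batch_size=4))
```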