Optimizing systems for efficient Large Language Model (LLM) inference and AI agent workloads is becoming critical as demand for them grows rapidly.
A new study bridges the gap between the queuing theory and LLM systems communities to establish queuing fundamentals for LLM inference.
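As a point of reference, one common textbook-style formalization of "throughput-optimal" (also called maximally stable) is sketched below; the notation is generic and not taken from the study itself, with λ standing for the request arrival rate, μ for the maximum service rate the hardware can sustain, and Q^π(t) for the queue length at time t under scheduling policy π.

```latex
% Generic notation (an assumption for illustration, not the study's):
% \lambda = arrival rate, \mu = sustainable service rate, Q^{\pi}(t) = queue length under policy \pi.
\[
  \pi \text{ is throughput-optimal (maximally stable)} \iff
  \forall\, \lambda < \mu:\quad
  \limsup_{t \to \infty} \frac{1}{t} \int_0^{t} \mathbb{E}\!\left[ Q^{\pi}(s) \right] ds < \infty .
\]
```

In words: a throughput-optimal policy keeps queues from growing without bound at every load the hardware could in principle handle.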
The study proves that 'work-conserving' scheduling algorithms, which never leave serving capacity idle while requests are waiting, can achieve maximum throughput for both individual inference requests and AI agent workloads.
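A minimal sketch of what a work-conserving scheduler loop can look like in a continuous-batching setting is shown below; the class and method names (`WorkConservingScheduler`, `step`, the per-step token budget) are illustrative assumptions, not the study's algorithm or any serving engine's actual API.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    rid: int
    remaining_tokens: int  # decode tokens still to generate


class WorkConservingScheduler:
    """Continuous batching sketch: at every step, top up the running batch
    from the waiting queue, so capacity is never idle while work is queued."""

    def __init__(self, max_batch_size: int):
        self.max_batch_size = max_batch_size
        self.waiting: deque = deque()
        self.running: list = []

    def submit(self, req: Request) -> None:
        self.waiting.append(req)

    def step(self) -> list:
        # Work-conserving admission: fill any free batch slots whenever
        # requests are waiting.
        while self.waiting and len(self.running) < self.max_batch_size:
            self.running.append(self.waiting.popleft())

        # Generate one token for every running request; retire finished ones.
        finished = []
        for req in self.running:
            req.remaining_tokens -= 1
            if req.remaining_tokens == 0:
                finished.append(req.rid)
        self.running = [r for r in self.running if r.remaining_tokens > 0]
        return finished


if __name__ == "__main__":
    sched = WorkConservingScheduler(max_batch_size=4)
    for i, n in enumerate([3, 1, 2, 5, 2]):
        sched.submit(Request(rid=i, remaining_tokens=n))
    done = []
    while sched.running or sched.waiting:
        done.extend(sched.step())
    print("completion order:", done)
```

The key property is in the admission loop: as long as any request is waiting and a batch slot is free, the slot is filled immediately rather than held back for a later batch.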
Evaluations of real-world systems show that Orca and Sarathi-serve are throughput-optimal, while FasterTransformer and vanilla vLLM are not maximally stable, meaning their queues can grow without bound even at loads the hardware could otherwise sustain, and should be used with caution.
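For contrast with the work-conserving sketch above, the following is a generic illustration of a run-to-completion (static) batching loop, which leaves slots idle once short requests in a batch finish; it is an assumed, simplified pattern for exposition only and does not reproduce FasterTransformer's or vLLM's actual scheduling logic.

```python
def run_to_completion_batches(requests, max_batch_size: int):
    """Static batching sketch: form a batch, run it until every member
    finishes, and only then admit new requests. Slots freed early stay idle,
    so the loop is not work-conserving and wastes capacity under load."""
    queue = list(requests)  # (request_id, decode_tokens) pairs
    total_steps = 0
    idle_slot_steps = 0
    while queue:
        batch, queue = queue[:max_batch_size], queue[max_batch_size:]
        batch_steps = max(tokens for _, tokens in batch)  # run for the longest member
        total_steps += batch_steps
        idle_slot_steps += sum(batch_steps - tokens for _, tokens in batch)
    return total_steps, idle_slot_steps


# Same illustrative workload as above: the idle slot-steps are capacity a
# work-conserving scheduler would have filled with waiting requests.
print(run_to_completion_batches([(0, 3), (1, 1), (2, 2), (3, 5), (4, 2)], max_batch_size=4))
```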