The rapid rise of large language models (LLMs) in text streaming services has posed cost and Quality of Experience (QoE) challenges in meeting real-time interaction requirements.
DiSCo is a device-server cooperative scheduler that improves users' QoE by dynamically routing requests and migrating response generation between endpoints, subject to cost constraints.
The scheduler uses cost-aware scheduling to leverage both on-device and server-based LLM inference, reducing tail Time-To-First-Token (TTFT) by 11-52% and mean TTFT by 6-78% across various model-device configurations.
Through its migration mechanism, DiSCo also reduces serving costs by up to 84% while maintaining comparable QoE, as validated by evaluations on real-world workloads.
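
To make the idea of cost-aware routing and migration concrete, the following is a minimal illustrative sketch, not DiSCo's actual policy. The `Endpoint` type, the `route` and `should_migrate` functions, and all numeric values are hypothetical and chosen only for illustration. The sketch assumes that each endpoint exposes an estimated TTFT and a per-token cost, and that a request carries a cost budget.

```python
from dataclasses import dataclass


@dataclass
class Endpoint:
    # All fields are hypothetical illustration values, not measured data.
    name: str
    est_ttft_s: float      # estimated time to first token, in seconds
    cost_per_token: float  # abstract cost unit (e.g., money or energy)


def route(device: Endpoint, server: Endpoint,
          expected_tokens: int, budget: float) -> Endpoint:
    """Pick the endpoint with the lowest estimated TTFT that fits the budget.

    If neither endpoint fits, fall back to the cheaper one.
    """
    endpoints = (device, server)
    affordable = [e for e in endpoints
                  if e.cost_per_token * expected_tokens <= budget]
    if not affordable:
        return min(endpoints, key=lambda e: e.cost_per_token)
    return min(affordable, key=lambda e: e.est_ttft_s)


def should_migrate(current: Endpoint, other: Endpoint,
                   remaining_tokens: int, migration_overhead: float) -> bool:
    """Migrate only if finishing on the other endpoint, including the
    one-off migration overhead, is cheaper than staying put."""
    stay_cost = current.cost_per_token * remaining_tokens
    move_cost = other.cost_per_token * remaining_tokens + migration_overhead
    return move_cost < stay_cost


device = Endpoint("device", est_ttft_s=0.8, cost_per_token=0.0)
server = Endpoint("server", est_ttft_s=0.3, cost_per_token=0.002)

# With a generous budget the faster endpoint starts the response.
first = route(device, server, expected_tokens=200, budget=1.0)
# Once streaming has begun, the remaining tokens can be handed to the cheaper endpoint.
migrate = should_migrate(first, device, remaining_tokens=150, migration_overhead=0.05)
```

In this toy setup the request starts on the endpoint with the lower estimated TTFT and then moves to the cheaper endpoint once the first tokens are out, which mirrors the intuition that routing targets latency while migration targets cost. A real scheduler would need to handle uncertainty in TTFT estimates and the latency of the handoff itself, which this sketch ignores.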