<ul><li>The rapid rise of large language models (LLMs) in text streaming services has created cost and Quality of Experience (QoE) challenges in meeting real-time interaction requirements.</li><li>DiSCo is a device-server cooperative scheduler that improves users' QoE by dynamically routing requests and migrating response generation between endpoints under cost constraints.</li><li>Its cost-aware scheduling combines on-device and server-based LLM inference, reducing tail Time-To-First-Token (TTFT) by 11-52% and mean TTFT by 6-78% across various model-device configurations.</li><li>Its migration mechanism cuts serving costs by up to 84% while maintaining comparable QoE, as validated by evaluations on real-world workloads.</li></ul>
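
To make the routing and migration idea concrete, here is a minimal sketch of a cost-aware dispatch decision. It is not DiSCo's actual algorithm: the `Endpoint` fields, the `plan` function, the budget model, and all numbers are hypothetical, chosen only to illustrate starting a response on the endpoint with the lowest expected TTFT and handing generation over to the cheaper endpoint when the budget would otherwise be exceeded.

```python
from dataclasses import dataclass


@dataclass
class Endpoint:
    """Hypothetical description of one inference endpoint."""
    name: str
    ttft_s: float          # assumed expected time to first token, in seconds
    cost_per_token: float  # assumed cost per generated token (arbitrary units)


def plan(device: Endpoint, server: Endpoint, budget: float,
         expected_tokens: int) -> tuple[str, str]:
    """Return (starter, finisher): who streams first, who completes the response.

    Illustrative policy only:
      1. Start on the endpoint with the lowest expected TTFT.
      2. If finishing there fits the budget, stay on it.
      3. Otherwise migrate the rest of the generation to the cheapest endpoint.
    """
    endpoints = [device, server]
    fastest = min(endpoints, key=lambda e: e.ttft_s)
    cheapest = min(endpoints, key=lambda e: e.cost_per_token)

    full_cost_on_fastest = fastest.cost_per_token * expected_tokens
    if full_cost_on_fastest <= budget:
        return fastest.name, fastest.name
    return fastest.name, cheapest.name


# Example with made-up numbers: the server answers sooner but costs more.
device = Endpoint("device", ttft_s=0.8, cost_per_token=0.0)
server = Endpoint("server", ttft_s=0.3, cost_per_token=0.002)

generous = plan(device, server, budget=1.0, expected_tokens=200)
tight = plan(device, server, budget=0.1, expected_tokens=200)
```

With a generous budget the whole response stays on the faster endpoint; with a tight budget the response starts there and the remainder is migrated to the cheaper one. A real scheduler would also have to account for TTFT variance and the overhead of transferring generation state, which this sketch ignores.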

DiSCo: Device-Server Collaborative LLM-Based Text Streaming Services
