Efficient inference of large language models (LLMs) at the edge is challenging because of limited device memory and compute. Existing strategies, such as quantization and pruning, either sacrifice accuracy for efficiency or incur high costs.
SLED, a new approach, leverages speculative decoding to serve LLMs efficiently at the edge by orchestrating computation across heterogeneous edge devices.
With SLED, lightweight edge devices draft multiple candidate tokens locally using diverse draft models, while a shared edge server batches these drafts and verifies them with a single, more precise target model. This division of labor accommodates device heterogeneity and shrinks the server-side memory footprint, since the server need not host a separate target model per device.
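To make the draft-and-verify split concrete, the following is a minimal sketch of greedy speculative decoding under this device/server division of labor. The `draft_model` and `target_model` callables and all helper names are hypothetical illustrations, not SLED's actual API; the real system batches verification requests from many devices and may use a different acceptance rule.

```python
import torch

def draft_tokens(draft_model, prefix, k=4):
    # Edge-device side: autoregressively draft k candidate tokens
    # with a small local model (greedy decoding for simplicity).
    tokens = prefix.clone()
    for _ in range(k):
        logits = draft_model(tokens.unsqueeze(0))  # assumed shape [1, seq, vocab]
        next_token = logits[0, -1].argmax().view(1)
        tokens = torch.cat([tokens, next_token])
    return tokens[len(prefix):]  # just the k drafted tokens

def verify(target_model, prefix, draft):
    # Edge-server side: score the entire draft with the precise target
    # model in one forward pass, then accept the longest prefix of the
    # draft that the target itself would have generated.
    seq = torch.cat([prefix, draft])
    logits = target_model(seq.unsqueeze(0))[0]  # assumed shape [seq, vocab]
    # The target's prediction at position i scores the token at i + 1.
    preds = logits[len(prefix) - 1 : len(seq) - 1].argmax(-1)
    n = 0
    while n < len(draft) and preds[n] == draft[n]:
        n += 1
    if n < len(draft):
        bonus = preds[n : n + 1]             # target's correction for the first rejected token
    else:
        bonus = logits[-1].argmax().view(1)  # all accepted: one free extra token
    return torch.cat([draft[:n], bonus])
```

In SLED's setting, the verification step would run batched over drafts arriving from many heterogeneous devices, so the server keeps only the one target model resident while each device runs whatever draft model fits its hardware.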
Initial experiments with Jetson Orin Nano and Raspberry Pi 5 devices paired with an RTX 6000 edge server show reduced latency, improved energy efficiency, and support for a greater number of concurrent inference sessions, all without sacrificing model accuracy.