Efficient inference of large language models (LLMs) at the edge is challenging because of limited device memory and compute. Existing strategies, such as quantization and pruning, either sacrifice accuracy for efficiency or incur high costs.
SLED, a new approach, leverages speculative decoding to serve LLMs efficiently at the edge by orchestrating computation across heterogeneous edge devices.
With SLED, lightweight edge devices draft multiple candidate tokens locally using diverse draft models, while a shared edge server batches these drafts and verifies them with a single, more precise target model. This division of labor accommodates device heterogeneity and shrinks the server-side memory footprint, since the server need not host a separate target model per device.
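To make the draft-and-verify split concrete, the following is a minimal sketch of greedy speculative decoding under this device/server division of labor. The `draft_model` and `target_model` callables and all helper names are hypothetical illustrations, not SLED's actual API; the real system batches verification requests from many devices and may use a different acceptance rule.

```python
import torch

def draft_tokens(draft_model, prefix, k=4):
    # Edge-device side: autoregressively draft k candidate tokens
    # with a small local model (greedy decoding for simplicity).
    tokens = prefix.clone()
    for _ in range(k):
        logits = draft_model(tokens.unsqueeze(0))  # assumed shape [1, seq, vocab]
        next_token = logits[0, -1].argmax().view(1)
        tokens = torch.cat([tokens, next_token])
    return tokens[len(prefix):]  # just the k drafted tokens

def verify(target_model, prefix, draft):
    # Edge-server side: score the entire draft with the precise target
    # model in one forward pass, then accept the longest prefix of the
    # draft that the target itself would have generated.
    seq = torch.cat([prefix, draft])
    logits = target_model(seq.unsqueeze(0))[0]  # assumed shape [seq, vocab]
    # The target's prediction at position i scores the token at i + 1.
    preds = logits[len(prefix) - 1 : len(seq) - 1].argmax(-1)
    n = 0
    while n < len(draft) and preds[n] == draft[n]:
        n += 1
    if n < len(draft):
        bonus = preds[n : n + 1]             # target's correction for the first rejected token
    else:
        bonus = logits[-1].argmax().view(1)  # all accepted: one free extra token
    return torch.cat([draft[:n], bonus])
```

In SLED's setting, the verification step would run batched over drafts arriving from many heterogeneous devices, so the server keeps only the one target model resident while each device runs whatever draft model fits its hardware.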
Initial experiments with Jetson Orin Nano and Raspberry Pi 5 devices paired with an RTX 6000 edge server show reduced latency, improved energy efficiency, and support for a greater number of concurrent inference sessions, all without sacrificing model accuracy.