techminis

A naukri.com initiative

Image Credit: Arxiv

SLED: A Speculative LLM Decoding Framework for Efficient Edge Serving

  • Efficient inference of large language models at the edge is challenging due to device resource limitations.
  • Existing strategies such as quantization and pruning either trade accuracy for efficiency or incur high deployment costs.
  • A new approach called SLED leverages speculative decoding for efficient edge serving.
  • SLED orchestrates computation across heterogeneous devices for edge computing.
  • The method allows lightweight edge devices to draft multiple candidate tokens locally using diverse models.
  • A shared edge server efficiently batches and verifies tokens using a precise model.
  • SLED supports device heterogeneity and reduces server-side memory footprint by avoiding multiple target models.
  • Initial experiments with Jetson Orin Nano, Raspberry Pi 5, and an RTX 6000 edge server show reduced latency, improved energy efficiency, and support for more concurrent inference sessions.
  • The benefits are achieved without sacrificing model accuracy.
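The draft-and-verify loop the bullets describe can be sketched as follows. This is not the paper's implementation: the `draft_next` and `target_next` functions below are hypothetical stand-ins (simple deterministic toy rules) for a lightweight on-device draft model and the server's precise target model, and `k` is an assumed draft length. The key property of speculative decoding is preserved: the output matches what greedy decoding with the target model alone would produce, which is why accuracy is not sacrificed.

```python
def draft_next(prefix):
    # Hypothetical lightweight draft model running on the edge device.
    return (sum(prefix) + len(prefix)) % 5

def target_next(prefix):
    # Hypothetical precise target model hosted on the shared edge server.
    # Agrees with the draft model most of the time, but not always.
    if len(prefix) % 3:
        return (sum(prefix) + len(prefix)) % 5
    return (sum(prefix) + 1) % 5

def speculative_decode(prompt, steps, k=4):
    """Generate `steps` tokens after `prompt` via speculative decoding."""
    tokens = list(prompt)
    while len(tokens) - len(prompt) < steps:
        # 1. Edge device drafts k candidate tokens locally.
        draft = []
        for _ in range(k):
            draft.append(draft_next(tokens + draft))
        # 2. Server verifies the draft (conceptually one batched pass):
        #    accept the longest prefix the target model agrees with.
        accepted = []
        for t in draft:
            if target_next(tokens + accepted) == t:
                accepted.append(t)
            else:
                # On the first mismatch, substitute the target's own token,
                # so output always matches plain target-model decoding.
                accepted.append(target_next(tokens + accepted))
                break
        tokens.extend(accepted)
    return tokens[len(prompt):][:steps]
```

When the draft model agrees often, each server round trip commits several tokens instead of one, which is where the latency and energy savings come from; SLED additionally batches verification requests from many heterogeneous devices against a single target model.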
