Accelerating LLM Inference with Flexible N:M Sparsity via A Fully Digital Compute-in-Memory Accelerator

  • Pruning large language models (LLMs) with a single fixed N:M structured sparsity pattern (at most N non-zero weights in every group of M) significantly limits the expressivity of the sparse model, yielding suboptimal accuracy (a minimal sketch of N:M pruning follows this list).
  • FLOW, a flexible layer-wise outlier-density-aware N:M sparsity selection method, gives each layer greater representational freedom by accounting for both the presence and the distribution of weight outliers, improving accuracy by up to 36% over existing alternatives (a hedged sketch of the selection idea also follows below).
  • FlexCiM, a flexible, low-overhead digital compute-in-memory architecture, supports these diverse sparsity patterns by adaptively aggregating and disaggregating smaller sub-macros, delivering up to 1.75x lower inference latency and 1.5x lower energy consumption than existing sparse accelerators.
  • The code for the project is available at: https://github.com/FLOW-open-project/FLOW
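
As context for the first bullet, here is a minimal magnitude-based sketch of N:M pruning: within each contiguous group of M weights, the N largest-magnitude entries are kept and the rest are zeroed. The grouping along the flattened array is a simplifying assumption; real pipelines typically group along each layer's input-channel dimension.

```python
import numpy as np

def nm_prune(weights: np.ndarray, n: int, m: int) -> np.ndarray:
    """N:M structured sparsity: in each contiguous group of m weights,
    keep the n largest-magnitude entries and zero the rest.
    Assumes weights.size is divisible by m."""
    w = weights.reshape(-1, m)                        # blocks of M weights
    drop = np.argsort(np.abs(w), axis=1)[:, : m - n]  # smallest-magnitude slots
    mask = np.ones_like(w, dtype=bool)
    np.put_along_axis(mask, drop, False, axis=1)      # zero out the small entries
    return (w * mask).reshape(weights.shape)

# 2:4 sparsity (the fixed pattern supported by NVIDIA sparse tensor cores):
# every block of 4 weights keeps at most 2 non-zeros.
w = np.random.randn(4, 8).astype(np.float32)
w_24 = nm_prune(w, n=2, m=4)
assert (w_24.reshape(-1, 4) != 0).sum(axis=1).max() <= 2
```

With the pattern fixed globally at 2:4, a block whose two small slots happen to hold outliers must still drop them, which is the expressivity limit the paper targets.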

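The summary does not spell out FLOW's exact selection criterion, so the following is only a hypothetical sketch of the general idea: measure each layer's outlier density and assign layers with denser outliers a more flexible (N, M) pattern at the same sparsity budget. The `outlier_density` metric, the thresholds, and the candidate pattern set are all illustrative assumptions, not FLOW's actual algorithm.

```python
import numpy as np

# Candidate (N, M) patterns, all at the same 50% sparsity budget.
# Larger M gives the mask more freedom to keep scattered outliers.
CANDIDATES = [(1, 2), (2, 4), (4, 8)]

def outlier_density(w: np.ndarray, k: float = 3.0) -> float:
    """Fraction of weights more than k standard deviations from the mean
    (a simple stand-in for the paper's outlier metric)."""
    z = np.abs(w - w.mean()) / (w.std() + 1e-8)
    return float((z > k).mean())

def select_nm(w: np.ndarray) -> tuple[int, int]:
    """Hypothetical per-layer (N, M) choice: outlier-heavy layers get the
    most flexible pattern; outlier-light layers get the cheapest one.
    Thresholds here are arbitrary placeholders."""
    d = outlier_density(w)
    if d > 0.01:
        return (4, 8)
    if d > 0.001:
        return (2, 4)
    return (1, 2)

# Per-layer selection over a toy "model"
layers = {f"layer{i}": np.random.randn(64, 64) for i in range(3)}
plan = {name: select_nm(w) for name, w in layers.items()}
print(plan)  # e.g. {'layer0': (2, 4), 'layer1': (1, 2), ...}
```

A layer-wise plan like this is what a flexible accelerator such as FlexCiM would then have to execute, which is why its sub-macros can be aggregated and disaggregated to match whichever (N, M) each layer was assigned.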