Pruning large language models (LLMs) with a single fixed N:M structured-sparsity pattern limits the expressivity of the sparse model, yielding suboptimal performance.
The flexible layer-wise outlier-density-aware N:M sparsity (FLOW) selection method affords LLMs higher representational freedom by simultaneously accounting for both the presence and the distribution of outliers, improving accuracy by up to 36% over existing alternatives.
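To make the idea concrete, here is a minimal PyTorch sketch of outlier-aware layer-wise (N, M) selection: layers with a higher outlier score receive a denser pattern. The outlier heuristic (fraction of weights beyond mean + k·std), the two candidate patterns, and the helper names (`outlier_score`, `apply_nm_mask`, `select_patterns`) are illustrative assumptions, not FLOW's actual selection procedure.

```python
import torch

def outlier_score(weight: torch.Tensor, k: float = 3.0) -> float:
    """Fraction of entries whose magnitude exceeds mean + k*std of |W|:
    a simple proxy for both the presence and spread of outliers."""
    w = weight.abs()
    return (w > w.mean() + k * w.std()).float().mean().item()

def apply_nm_mask(weight: torch.Tensor, n: int, m: int) -> torch.Tensor:
    """Keep the n largest-magnitude weights in each contiguous group
    of m along the column dimension."""
    rows, cols = weight.shape
    assert cols % m == 0, "columns must tile into groups of m"
    groups = weight.abs().reshape(rows, cols // m, m)
    idx = groups.topk(n, dim=-1).indices
    mask = torch.zeros_like(groups).scatter_(-1, idx, 1.0)
    return weight * mask.reshape(rows, cols)

def select_patterns(weights, dense=(2, 4), sparse=(1, 4), dense_frac=0.5):
    """Assign the denser (N, M) pattern to the dense_frac of layers
    with the highest outlier scores; the rest get the sparser one."""
    order = sorted(weights, key=lambda name: outlier_score(weights[name]),
                   reverse=True)
    cutoff = int(len(order) * dense_frac)
    return {name: (dense if i < cutoff else sparse)
            for i, name in enumerate(order)}

# Usage: prune a toy set of layers with per-layer patterns.
layers = {f"layer{i}": torch.randn(8, 16) for i in range(4)}
patterns = select_patterns(layers)
pruned = {name: apply_nm_mask(w, *patterns[name])
          for name, w in layers.items()}
```

The key point the sketch captures is that N and M are chosen per layer from the outlier statistics rather than fixed globally, which is what a fixed-pattern N:M scheme cannot express.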
FlexCiM, a flexible, low-overhead digital compute-in-memory (DCiM) architecture, supports these diverse sparsity patterns by adaptively aggregating and disaggregating smaller sub-macros, achieving up to 1.75x lower inference latency and 1.5x lower energy consumption than existing sparse accelerators.
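The aggregation idea can be mimicked behaviorally in software: the NumPy sketch below models fixed-width sub-macros whose partial sums are combined in an adder stage to serve whatever group size M a layer's pattern requires. `SUB_MACRO_WIDTH` and the function names are assumptions for illustration; the real FlexCiM mechanism operates at the circuit level inside the DCiM macro.

```python
import numpy as np

SUB_MACRO_WIDTH = 4  # assumed width of one sub-macro's local dot product

def sub_macro_mac(w_group: np.ndarray, a_group: np.ndarray) -> float:
    """One sub-macro's multiply-accumulate over its local weight group."""
    return float(np.dot(w_group, a_group))

def flexible_group_mac(weights: np.ndarray, acts: np.ndarray, m: int) -> float:
    """Serve a sparsity group of size m by aggregating m // SUB_MACRO_WIDTH
    sub-macros and summing their partial results."""
    assert m % SUB_MACRO_WIDTH == 0, "m must tile onto whole sub-macros"
    partials = [sub_macro_mac(weights[s:s + SUB_MACRO_WIDTH],
                              acts[s:s + SUB_MACRO_WIDTH])
                for s in range(0, m, SUB_MACRO_WIDTH)]
    return sum(partials)

# Usage: the same sub-macros serve M = 4, 8, or 16 by (dis)aggregation.
w, a = np.random.randn(16), np.random.randn(16)
for m in (4, 8, 16):
    print(m, flexible_group_mac(w[:m], a[:m], m))
```

Because group sizes are served by regrouping existing sub-macros rather than by dedicated per-pattern hardware, the flexibility comes at low area and control overhead, which is what enables FLOW's layer-wise patterns to run efficiently.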
The code for the project is available at: https://github.com/FLOW-open-project/FLOW