Pruning large language models (LLMs) with a single fixed N:M structured-sparsity pattern limits the expressivity of the sparse model, yielding suboptimal performance.
The flexible layer-wise outlier-density-aware N:M sparsity (FLOW) selection method affords LLMs higher representational freedom by simultaneously accounting for both the presence and the distribution of outliers, improving accuracy by up to 36% over existing alternatives.
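To make the idea concrete, here is a minimal PyTorch sketch of outlier-aware layer-wise (N, M) selection: layers with a higher outlier score receive a denser pattern. The outlier heuristic (fraction of weights beyond mean + k·std), the two candidate patterns, and the helper names (`outlier_score`, `apply_nm_mask`, `select_patterns`) are illustrative assumptions, not FLOW's actual selection procedure.

```python
import torch

def outlier_score(weight: torch.Tensor, k: float = 3.0) -> float:
    """Fraction of entries whose magnitude exceeds mean + k*std of |W|:
    a simple proxy for both the presence and spread of outliers."""
    w = weight.abs()
    return (w > w.mean() + k * w.std()).float().mean().item()

def apply_nm_mask(weight: torch.Tensor, n: int, m: int) -> torch.Tensor:
    """Keep the n largest-magnitude weights in each contiguous group
    of m along the column dimension."""
    rows, cols = weight.shape
    assert cols % m == 0, "columns must tile into groups of m"
    groups = weight.abs().reshape(rows, cols // m, m)
    idx = groups.topk(n, dim=-1).indices
    mask = torch.zeros_like(groups).scatter_(-1, idx, 1.0)
    return weight * mask.reshape(rows, cols)

def select_patterns(weights, dense=(2, 4), sparse=(1, 4), dense_frac=0.5):
    """Assign the denser (N, M) pattern to the dense_frac of layers
    with the highest outlier scores; the rest get the sparser one."""
    order = sorted(weights, key=lambda name: outlier_score(weights[name]),
                   reverse=True)
    cutoff = int(len(order) * dense_frac)
    return {name: (dense if i < cutoff else sparse)
            for i, name in enumerate(order)}

# Usage: prune a toy set of layers with per-layer patterns.
layers = {f"layer{i}": torch.randn(8, 16) for i in range(4)}
patterns = select_patterns(layers)
pruned = {name: apply_nm_mask(w, *patterns[name])
          for name, w in layers.items()}
```

The key point the sketch captures is that N and M are chosen per layer from the outlier statistics rather than fixed globally, which is what a fixed-pattern N:M scheme cannot express.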
FlexCiM, a flexible, low-overhead digital compute-in-memory (DCiM) architecture, supports these diverse sparsity patterns by adaptively aggregating and disaggregating smaller sub-macros, achieving up to 1.75x lower inference latency and 1.5x lower energy consumption than existing sparse accelerators.
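The aggregation idea can be mimicked behaviorally in software: the NumPy sketch below models fixed-width sub-macros whose partial sums are combined in an adder stage to serve whatever group size M a layer's pattern requires. `SUB_MACRO_WIDTH` and the function names are assumptions for illustration; the real FlexCiM mechanism operates at the circuit level inside the DCiM macro.

```python
import numpy as np

SUB_MACRO_WIDTH = 4  # assumed width of one sub-macro's local dot product

def sub_macro_mac(w_group: np.ndarray, a_group: np.ndarray) -> float:
    """One sub-macro's multiply-accumulate over its local weight group."""
    return float(np.dot(w_group, a_group))

def flexible_group_mac(weights: np.ndarray, acts: np.ndarray, m: int) -> float:
    """Serve a sparsity group of size m by aggregating m // SUB_MACRO_WIDTH
    sub-macros and summing their partial results."""
    assert m % SUB_MACRO_WIDTH == 0, "m must tile onto whole sub-macros"
    partials = [sub_macro_mac(weights[s:s + SUB_MACRO_WIDTH],
                              acts[s:s + SUB_MACRO_WIDTH])
                for s in range(0, m, SUB_MACRO_WIDTH)]
    return sum(partials)

# Usage: the same sub-macros serve M = 4, 8, or 16 by (dis)aggregation.
w, a = np.random.randn(16), np.random.randn(16)
for m in (4, 8, 16):
    print(m, flexible_group_mac(w[:m], a[:m], m))
```

Because group sizes are served by regrouping existing sub-macros rather than by dedicated per-pattern hardware, the flexibility comes at low area and control overhead, which is what enables FLOW's layer-wise patterns to run efficiently.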
The code for the project is available at: https://github.com/FLOW-open-project/FLOW