Large Language Models (LLMs) deliver strong performance but are difficult to deploy in practice because of their sheer parameter count.
Efforts have been made to apply traditional network pruning techniques to LLMs to reduce their size while preserving performance.
A new pruning methodology called Outlier Weighed Layerwise sparsity (OWL) has been introduced, which assigns each layer a non-uniform sparsity ratio derived from the fraction of outlier weights in that layer, pruning less aggressively in layers where outliers are concentrated.
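To make the idea concrete, the following is a minimal sketch, not the paper's exact calibration procedure. It assumes Wanda-style outlier scores |W_ij| * ||X_j||_2, and the threshold multiplier `m`, the deviation bound `lam`, and both function names are hypothetical choices for illustration.

```python
import numpy as np

def layer_outlier_ratio(weight, act_norm, m=5.0):
    """Fraction of weights in one layer counted as outliers.

    Hedged sketch: a weight is treated as an outlier when its
    Wanda-style score |W_ij| * ||X_j||_2 exceeds `m` times the
    layer's mean score (`m` is an assumed hyperparameter).

    weight:   (out_features, in_features) weight matrix
    act_norm: (in_features,) L2 norms of the calibration activations
    """
    scores = np.abs(weight) * act_norm  # broadcast over input features
    return float((scores > m * scores.mean()).mean())

def owl_sparsities(outlier_ratios, target=0.7, lam=0.08):
    """Map per-layer outlier ratios to non-uniform sparsity ratios.

    Layers with more outliers receive lower sparsity. Ratios are
    shifted to average `target`, then bounded to
    [target - lam, target + lam]; clipping can move the mean
    slightly off target, which the paper's calibration avoids.
    """
    d = np.asarray(outlier_ratios, dtype=float)
    s = 1.0 - d                   # more outliers -> prune less
    s = s - s.mean() + target     # center at the global target sparsity
    return np.clip(s, target - lam, target + lam)
```

Under this scheme, a layer whose outlier ratio is above average ends up below the global target sparsity, so the weights most responsible for outlier activations are preferentially retained.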
Empirical evaluations show that OWL outperforms prior uniform-sparsity methods such as SparseGPT and Wanda, achieving substantially lower perplexity and faster end-to-end inference at high sparsity levels (e.g., 70%).