A new pruning method called Týr-the-Pruner has been proposed to enhance hardware-agnostic inference efficiency for large language models (LLMs).
Týr-the-Pruner is an end-to-end search-based global structural pruning framework that aims to determine the optimal sparsity distribution under a target overall sparsity ratio.
The framework constructs a supernet using local pruning and expectation error accumulation approaches, and employs an iterative prune-and-search strategy for efficient convergence.
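The idea of searching for a per-layer sparsity distribution under a global sparsity budget can be sketched in a toy form. The snippet below is a minimal illustration only, assuming uniformly sized layers and a made-up quadratic sensitivity-based error model; it does not reproduce the paper's supernet construction or expectation error accumulation, and all names (`estimate_error`, `search_sparsity_distribution`) are hypothetical.

```python
import itertools

def estimate_error(layer_sensitivity, sparsity):
    # Toy error model (an assumption, not the paper's): more sensitive
    # layers are penalized more heavily at higher sparsity.
    return layer_sensitivity * sparsity ** 2

def search_sparsity_distribution(sensitivities, target, levels):
    """Exhaustively search per-layer sparsity levels whose mean meets the
    target overall sparsity, minimizing the summed toy error estimate."""
    n = len(sensitivities)
    best, best_err = None, float("inf")
    for combo in itertools.product(levels, repeat=n):
        # Enforce the overall sparsity budget (uniform layer sizes assumed).
        if abs(sum(combo) / n - target) > 1e-9:
            continue
        err = sum(estimate_error(s, sp) for s, sp in zip(sensitivities, combo))
        if err < best_err:
            best, best_err = combo, err
    return best, best_err

# Three layers with different (hypothetical) sensitivities; 50% overall target.
dist, err = search_sparsity_distribution([1.0, 0.5, 2.0], 0.5, [0.3, 0.5, 0.7])
print(dist)  # the least-sensitive layer receives the highest sparsity
```

Even this toy search assigns the highest sparsity to the least-sensitive layer while keeping the average at the target, which is the intuition behind optimizing a non-uniform sparsity distribution instead of pruning every layer equally; the actual framework replaces the exhaustive loop with an iterative prune-and-search procedure over a supernet.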
Experimental results demonstrate that Týr-the-Pruner achieves state-of-the-art structural pruning performance, retaining 97% of the dense model's capability while removing 50% of Llama-3.1-70B's parameters.