Research on large language model compression has focused on methods like quantization, sparsification, and structured pruning to reduce computational costs.
EvoPress, a new approach, targets dynamic, non-uniform compression, which adjusts compression levels per block or per layer to minimize accuracy loss while meeting a global compression threshold.
EvoPress uses an evolutionary search framework to find well-performing compression profiles efficiently, challenging the assumption that per-layer compression errors in language models are independent and simply add up to the end-to-end error.
The EvoPress framework reports state-of-the-art results for dynamic compression of models such as Llama, Mistral, and Phi across several compression approaches: structural pruning, sparsity, and quantization with dynamic bitwidths.
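
To make the idea of searching for a per-layer compression profile under a global budget concrete, the following is a minimal Python sketch of an evolutionary search. It is an illustration under stated assumptions, not the EvoPress implementation: `toy_loss` is a hypothetical stand-in for evaluating a compressed model on calibration data, the integer "levels" are an assumed encoding of per-layer compression strength, and the names `level_switch` and `evolve` are chosen here for illustration. The mutation raises one layer's level and lowers another's, so every candidate keeps the same total compression.

```python
import random


def toy_loss(levels):
    # Hypothetical stand-in for measuring a compressed model's loss.
    # The neighbour interaction term makes the total error more than a
    # plain sum of independent per-layer errors.
    sensitivity = [1.0 + 0.5 * (i % 3) for i in range(len(levels))]
    per_layer = sum(s * l * l for s, l in zip(sensitivity, levels))
    interaction = sum(levels[i] * levels[i + 1] for i in range(len(levels) - 1))
    return per_layer + 0.5 * interaction


def level_switch(levels, max_level, rng):
    # Mutation that preserves the global budget: raise one layer's
    # compression level by one and lower a different layer's by one.
    child = list(levels)
    up = [i for i, l in enumerate(child) if l < max_level]
    down = [i for i, l in enumerate(child) if l > 0]
    if not up or not down:
        return child
    i = rng.choice(up)
    candidates = [j for j in down if j != i]
    if not candidates:
        return child
    j = rng.choice(candidates)
    child[i] += 1
    child[j] -= 1
    return child


def evolve(n_layers=12, target_total=24, max_level=4,
           generations=200, offspring=8, seed=0):
    rng = random.Random(seed)
    # Start from the (near-)uniform profile that meets the global budget.
    base, extra = divmod(target_total, n_layers)
    parent = [base + (1 if i < extra else 0) for i in range(n_layers)]
    parent_loss = toy_loss(parent)
    for _ in range(generations):
        # Generate several mutated offspring from the single parent.
        children = [level_switch(parent, max_level, rng) for _ in range(offspring)]
        best = min(children, key=toy_loss)
        best_loss = toy_loss(best)
        # Keep the best offspring only if it is no worse than the parent.
        if best_loss <= parent_loss:
            parent, parent_loss = best, best_loss
    return parent, parent_loss


if __name__ == "__main__":
    profile, loss = evolve()
    print("profile:", profile)
    print("total level:", sum(profile), "loss:", loss)
```

In this sketch the budget is enforced by construction rather than by a penalty term, so the search only ever compares profiles with the same overall compression. In a real setting, the loss evaluation would be the expensive step, which is why the number of offspring and generations matters.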