Image Credit: Arxiv

BAQ: Efficient Bit Allocation Quantization for Large Language Models

  • Post-training quantization is widely used to reduce the memory and computational costs of large language models.
  • The paper proposes a framework for allocating quantization bitwidths based on sensitivity metrics derived from a Hessian proxy (see the sketch after this list).
  • The proposed BAQ algorithm achieves a favorable trade-off between loss minimization and computational complexity for large language models.
  • Experimental results show that BAQ outperforms GPTQ, achieving up to 56 times lower perplexity at the same bitwidth.
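
This summary does not spell out BAQ's exact allocation rule, so the following is only a minimal sketch of sensitivity-driven bit allocation under an average-bitwidth budget. The error model `s_i * 2**(-2 * b_i)` and all names here are illustrative assumptions, not the paper's algorithm.

```python
import numpy as np

def allocate_bits(sensitivities, avg_bits=4, min_bits=2, max_bits=8):
    """Greedily distribute a bit budget across layers.

    Assumed error model (not BAQ's): each layer's quantization
    loss is s_i * 2**(-2 * b_i), i.e. roughly 4x lower per extra
    bit. Each round, the layer whose loss would drop the most
    receives one more bit, until the average-bitwidth budget
    is exhausted.
    """
    n = len(sensitivities)
    bits = np.full(n, min_bits)
    budget = avg_bits * n - bits.sum()  # extra bits to hand out

    def loss(s, b):
        return s * 2.0 ** (-2 * b)

    for _ in range(int(budget)):
        # Marginal loss reduction from granting one extra bit.
        gains = np.array([
            loss(s, b) - loss(s, b + 1) if b < max_bits else -np.inf
            for s, b in zip(sensitivities, bits)
        ])
        bits[np.argmax(gains)] += 1
    return bits

# Hypothetical per-layer sensitivities, e.g. diagonal entries of
# a Hessian proxy such as mean squared input activations.
s = np.array([5.0, 0.3, 1.2, 0.05])
print(allocate_bits(s, avg_bits=4))  # sensitive layers get more bits
```

The greedy loop is just one simple way to trade loss against bitwidth; BAQ derives its allocation from a Hessian-proxy sensitivity analysis, which this toy model only gestures at.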
