The rapid scaling of large language models (LLMs) makes them challenging to deploy on resource-constrained devices.
To address this, there is growing interest in extremely low-bit quantization, such as 2-bit quantization.
Prior work has shown that 2-bit LLMs can be Pareto-optimal over 4-bit models in the accuracy-latency trade-off, particularly for pre-trained LLMs.
However, these advances in 2-bit quantization have not been extended to instruction-tuned models.
To bridge this gap, Unified Progressive Quantization (UPQ) is proposed: a framework that combines block-wise post-training quantization (PTQ) with distillation-based quantization-aware training (Distill-QAT) to quantize instruction-tuned LLMs to 2 bits.
UPQ first quantizes FP16 instruction-tuned models to INT4 using block-wise PTQ, reducing the quantization error incurred when the models are further quantized to INT2.
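The block-wise PTQ stage can be pictured with a short PyTorch sketch; this is only an illustration under stated assumptions, not the paper's exact procedure. The symmetric per-row quantization scheme, the helper names `fake_quantize` and `blockwise_ptq`, and the assumption that each transformer block is a callable taking only hidden states are all introduced here for clarity.

```python
import torch

def fake_quantize(w: torch.Tensor, bits: int = 4) -> torch.Tensor:
    """Symmetric per-row uniform quantization followed by dequantization (assumed scheme)."""
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / qmax
    return torch.clamp(torch.round(w / scale), -qmax - 1, qmax) * scale

@torch.no_grad()
def blockwise_ptq(blocks, calib_hidden, bits=4):
    """Quantize one transformer block at a time; each block sees activations
    produced by the already-quantized preceding blocks (hypothetical helper)."""
    x = calib_hidden
    for block in blocks:
        fp_out = block(x)                           # FP16 reference output of this block
        for p in block.parameters():
            if p.dim() == 2:                        # weight matrices only
                p.copy_(fake_quantize(p, bits))
        q_out = block(x)
        recon_err = (q_out - fp_out).pow(2).mean()  # block-wise reconstruction error
        # a full PTQ method would refine scales / rounding here to minimize recon_err
        x = q_out                                   # propagate quantized activations onward
    return blocks
```

Proceeding block by block keeps the calibration problem small and lets later blocks compensate for errors introduced by earlier ones, which is what makes the intermediate INT4 stage useful before the more aggressive INT2 step.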
UPQ then applies Distill-QAT so that the INT2 instruction-tuned LLMs produce responses consistent with those of their original FP16 counterparts.
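A minimal sketch of what a distillation-based QAT objective of this kind might look like is given below; the 2-bit straight-through fake-quantizer and the forward-KL formulation are assumptions for illustration, not necessarily the paper's exact loss.

```python
import torch
import torch.nn.functional as F

class STEFakeQuant2Bit(torch.autograd.Function):
    """Fake-quantize weights to 2 bits in the forward pass; pass gradients
    straight through in the backward pass (straight-through estimator)."""
    @staticmethod
    def forward(ctx, w):
        qmax = 1                                    # 2-bit symmetric levels {-2, -1, 0, 1}
        scale = w.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / qmax
        return torch.clamp(torch.round(w / scale), -qmax - 1, qmax) * scale

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output                          # identity gradient through the quantizer

def distill_qat_loss(student_logits, teacher_logits, temperature=1.0):
    """KL divergence from the FP16 teacher's token distribution to the INT2 student's."""
    t = temperature
    log_p_student = F.log_softmax(student_logits / t, dim=-1)
    p_teacher = F.softmax(teacher_logits / t, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * (t * t)
```

Because the teacher is the model's own FP16 version, this objective only requires prompts, not labeled responses, which is consistent with training without proprietary post-training data.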
UPQ is shown to quantize open-source instruction-tuned LLMs to 2 bits without proprietary post-training data.
It achieves state-of-the-art performance on MMLU and IFEval, two benchmarks widely used to evaluate instruction-tuned LLMs.