Unifying Block-wise PTQ and Distillation-based QAT for Progressive Quantization toward 2-bit Instruction-Tuned LLMs

  • Large language models (LLMs) are difficult to deploy on resource-constrained devices because of their rapidly growing size.
  • This has driven interest in extremely low-bit quantization, such as 2-bit quantization, to address these deployment challenges.
  • Prior work has shown that 2-bit LLMs can be Pareto-optimal over 4-bit models in accuracy and latency, but mainly for pre-trained (base) LLMs.
  • Existing advances in 2-bit quantization have not been extended to instruction-tuned models.
  • To bridge this gap, Unified Progressive Quantization (UPQ) is proposed: a framework that combines block-wise post-training quantization (PTQ) with distillation-based quantization-aware training (Distill-QAT) for 2-bit quantization of instruction-tuned LLMs (a sketch of both stages follows this list).
  • UPQ first quantizes FP16 instruction-tuned models to INT4 using block-wise PTQ, reducing quantization error before quantizing further to INT2.
  • Distill-QAT is then applied so that the INT2 instruction-tuned LLM produces responses consistent with its original FP16 counterpart.
  • UPQ can quantize open-source instruction-tuned LLMs to 2-bit without relying on proprietary post-training data.
  • UPQ achieves state-of-the-art performance on MMLU and IFEval, benchmarks commonly used to evaluate instruction-tuned LLMs.
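
The following is a minimal PyTorch sketch of the two stages described above, not the paper's actual implementation. It uses a toy linear layer in place of a transformer block, and the names (fake_quantize, blockwise_ptq, distill_qat_loss) and hyperparameters are illustrative assumptions. Block-wise PTQ tunes weights so the quantized block reproduces its higher-precision output on calibration data; Distill-QAT then uses a KL-divergence loss to pull the INT2 student's output distribution toward the FP16 teacher's.

```python
# Hypothetical sketch of progressive quantization (FP16 -> INT4 -> INT2)
# followed by a distillation loss; names and details are assumptions.
import torch
import torch.nn.functional as F

def fake_quantize(w: torch.Tensor, bits: int) -> torch.Tensor:
    """Symmetric per-tensor uniform fake quantization (dequantized output)."""
    qmax = 2 ** (bits - 1) - 1                        # e.g. 7 for INT4, 1 for INT2
    scale = w.abs().max().clamp(min=1e-8) / qmax
    q = (w / scale).round().clamp(-qmax - 1, qmax)
    return q * scale

def blockwise_ptq(block_weight: torch.Tensor, calib_x: torch.Tensor,
                  bits: int, steps: int = 100, lr: float = 1e-3) -> torch.Tensor:
    """Stage 1 (block-wise PTQ): adjust the weights so the quantized block's
    output matches the higher-precision block's output on calibration data."""
    target = calib_x @ block_weight.t()               # reference block output
    w = block_weight.clone().requires_grad_(True)
    opt = torch.optim.Adam([w], lr=lr)
    for _ in range(steps):
        # Straight-through estimator: gradients bypass the rounding step.
        w_q = w + (fake_quantize(w, bits) - w).detach()
        loss = F.mse_loss(calib_x @ w_q.t(), target)  # block reconstruction error
        opt.zero_grad(); loss.backward(); opt.step()
    return w.detach()

def distill_qat_loss(student_logits: torch.Tensor,
                     teacher_logits: torch.Tensor) -> torch.Tensor:
    """Stage 2 objective (Distill-QAT, simplified): KL divergence pushing the
    INT2 student's output distribution toward the FP16 teacher's."""
    return F.kl_div(F.log_softmax(student_logits, dim=-1),
                    F.softmax(teacher_logits, dim=-1),
                    reduction="batchmean")

if __name__ == "__main__":
    torch.manual_seed(0)
    w_fp16 = torch.randn(64, 64)                      # toy "block" weight
    calib = torch.randn(256, 64)                      # calibration activations
    w_int4 = blockwise_ptq(w_fp16, calib, bits=4)     # FP16 -> INT4 via PTQ
    w_int2 = blockwise_ptq(w_int4, calib, bits=2)     # INT4 -> INT2 (progressive)
    # Distill-QAT would then fine-tune the INT2 model end-to-end with this loss:
    student = calib @ fake_quantize(w_int2, 2).t()
    teacher = calib @ w_fp16.t()
    print("distill loss:", distill_qat_loss(student, teacher).item())
```

The two-step FP16 to INT4 to INT2 path is the "progressive" part: each quantization hop is small, so the reconstruction and distillation objectives have an easier target than a direct FP16 to INT2 jump.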
