Large language models (LLMs) have become pivotal in artificial intelligence, but their deployment on edge devices is hindered by their substantial size.
Quantization is a widely used technique for reducing memory usage and inference time, but LLMs are difficult to quantize because their activations contain outliers that inflate the dynamic range a low-bit format must cover.
In this work, the authors propose a method that combines gradual binary search with Hadamard matrices to address the challenge of activation quantization in LLMs.
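To build intuition for the Hadamard component: multiplying activations by an orthonormal Hadamard matrix spreads outlier energy across channels, shrinking the range the quantizer must cover. The sketch below illustrates that effect in isolation; it is not the authors' implementation, and the function name `hadamard_rotate`, the toy dimensions, and the injected outlier channel are assumptions made for demonstration.

```python
import numpy as np
from scipy.linalg import hadamard

def hadamard_rotate(x):
    """Rotate activations with an orthonormal Hadamard matrix.

    The rotation mixes every input channel into every output channel,
    so a single outlier channel no longer dominates the dynamic range.
    (Illustrative sketch only, not the paper's implementation.)
    """
    d = x.shape[-1]                    # hidden dimension, assumed a power of two
    H = hadamard(d) / np.sqrt(d)       # orthonormal Hadamard matrix
    return x @ H                       # rotated activations, same shape

# Toy activations with one artificially injected outlier channel.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
x[:, 0] *= 50.0                        # outlier channel
print("max |x| before rotation:", np.abs(x).max())
print("max |x| after  rotation:", np.abs(hadamard_rotate(x)).max())
```

Because the rotation is orthonormal, it can be folded into adjacent weight matrices or inverted exactly after quantization, which is what makes this kind of transform attractive for activation quantization.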
The proposed method enables 3-bit quantization of weights, activations, and key-value (KV) caches, and improves model performance over state-of-the-art quantization methods.
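For reference, a minimal symmetric 3-bit quantizer looks roughly like the sketch below. The paper's actual scheme (per-group scales, the gradual binary search over clipping thresholds, KV-cache handling) is more involved; `quantize_3bit` and its per-tensor scaling are illustrative assumptions, not the authors' method.

```python
import numpy as np

def quantize_3bit(t):
    """Symmetric uniform 3-bit fake quantization (8 levels, here -4..3).

    Quantize-then-dequantize so the rounding error can be inspected
    directly. Illustrative sketch only.
    """
    qmax = 2 ** (3 - 1) - 1                    # 3, the largest positive code
    scale = np.abs(t).max() / qmax             # per-tensor scale (assumption)
    q = np.clip(np.round(t / scale), -qmax - 1, qmax)
    return q * scale                           # dequantized tensor

w = np.random.default_rng(1).normal(size=(4, 4))
print("max quantization error:", np.abs(w - quantize_3bit(w)).max())
```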