Large language models (LLMs) have become pivotal in artificial intelligence, but their deployment on edge devices is hindered by their substantial size.
Quantization is a widely used technique for reducing memory usage and inference time, but LLMs are difficult to quantize because their activations contain outliers that inflate the dynamic range a low-bit format must cover.
In this work, the authors propose a method that combines gradual binary search with Hadamard matrices to address the challenge of activation quantization in LLMs.
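To build intuition for the Hadamard component: multiplying activations by an orthonormal Hadamard matrix spreads outlier energy across channels, shrinking the range the quantizer must cover. The sketch below illustrates that effect in isolation; it is not the authors' implementation, and the function name `hadamard_rotate`, the toy dimensions, and the injected outlier channel are assumptions made for demonstration.

```python
import numpy as np
from scipy.linalg import hadamard

def hadamard_rotate(x):
    """Rotate activations with an orthonormal Hadamard matrix.

    The rotation mixes every input channel into every output channel,
    so a single outlier channel no longer dominates the dynamic range.
    (Illustrative sketch only, not the paper's implementation.)
    """
    d = x.shape[-1]                    # hidden dimension, assumed a power of two
    H = hadamard(d) / np.sqrt(d)       # orthonormal Hadamard matrix
    return x @ H                       # rotated activations, same shape

# Toy activations with one artificially injected outlier channel.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
x[:, 0] *= 50.0                        # outlier channel
print("max |x| before rotation:", np.abs(x).max())
print("max |x| after  rotation:", np.abs(hadamard_rotate(x)).max())
```

Because the rotation is orthonormal, it can be folded into adjacent weight matrices or inverted exactly after quantization, which is what makes this kind of transform attractive for activation quantization.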
The proposed method enables 3-bit quantization of weights, activations, and key-value (KV) caches, and improves model performance over state-of-the-art quantization methods.
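For reference, a minimal symmetric 3-bit quantizer looks roughly like the sketch below. The paper's actual scheme (per-group scales, the gradual binary search over clipping thresholds, KV-cache handling) is more involved; `quantize_3bit` and its per-tensor scaling are illustrative assumptions, not the authors' method.

```python
import numpy as np

def quantize_3bit(t):
    """Symmetric uniform 3-bit fake quantization (8 levels, here -4..3).

    Quantize-then-dequantize so the rounding error can be inspected
    directly. Illustrative sketch only.
    """
    qmax = 2 ** (3 - 1) - 1                    # 3, the largest positive code
    scale = np.abs(t).max() / qmax             # per-tensor scale (assumption)
    q = np.clip(np.round(t / scale), -qmax - 1, qmax)
    return q * scale                           # dequantized tensor

w = np.random.default_rng(1).normal(size=(4, 4))
print("max quantization error:", np.abs(w - quantize_3bit(w)).max())
```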