One approach to reducing the costs of large language models (LLMs) is to use quantized or sparse representations for training or deployment.
While post-training compression methods are popular, there is interest in obtaining more accurate compressed models by training directly over such representations, i.e., via Quantization-Aware Training (QAT).
A recent study suggested that models can be trained with QAT at 8-bit weights and activations while maintaining accuracy.
A new method called QuEST advances the state of the art by demonstrating optimality at 4 bits and stable convergence with weights and activations as low as 1 bit.
QuEST achieves this through two components: accurate and fast quantization of weights and activations via Hadamard normalization and MSE-optimal fitting, and a trust gradient estimator that explicitly minimizes the error between the noisy gradient computed over quantized states and the full-precision gradient (sketched below).
Experiments show that QuEST induces stable scaling laws across various precisions and can be extended to sparse representations.
GPU kernel support is provided to efficiently execute models produced by QuEST.
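To make the quantization step concrete, the following is a minimal sketch of how a Hadamard-normalized, MSE-fitted quantizer with a straight-through-style "trust" backward pass could look in PyTorch. All names (`hadamard_normalize`, `mse_optimal_quantize`, `QuantSTE`), the brute-force scale search, and the error-threshold trust mask are illustrative assumptions, not the paper's reference implementation.

```python
import torch
from scipy.linalg import hadamard


def hadamard_normalize(x: torch.Tensor) -> torch.Tensor:
    """Rotate the last dimension with an orthonormal Hadamard matrix so the
    entries become roughly Gaussian, which a uniform grid fits well."""
    d = x.shape[-1]                                  # must be a power of two
    H = torch.tensor(hadamard(d), dtype=x.dtype) / d ** 0.5
    return x @ H


def mse_optimal_quantize(x: torch.Tensor, bits: int = 4) -> torch.Tensor:
    """Uniform quantization with the clipping scale chosen by a brute-force
    search minimizing mean-squared error (a stand-in for an MSE-optimal fit)."""
    levels = 2 ** (bits - 1)
    best_q, best_err = x, float("inf")
    for mult in torch.linspace(0.5, 4.0, 32):        # candidate clip scales
        scale = mult * x.abs().mean() / levels + 1e-8
        q = (x / scale).round().clamp(-levels, levels - 1) * scale
        err = (q - x).pow(2).mean().item()
        if err < best_err:
            best_q, best_err = q, err
    return best_q


class QuantSTE(torch.autograd.Function):
    """Forward: quantize the Hadamard-rotated input. Backward: pass gradients
    only where the quantization error is small (a crude 'trust' mask, an
    assumed simplification) and rotate them back to the original coordinates."""

    @staticmethod
    def forward(ctx, x, bits):
        xh = hadamard_normalize(x)
        q = mse_optimal_quantize(xh, bits)
        ctx.save_for_backward((q - xh).abs())        # per-element quantization error
        return q

    @staticmethod
    def backward(ctx, grad_out):
        (err,) = ctx.saved_tensors
        trust = (err < err.mean() + err.std()).to(grad_out.dtype)  # assumed threshold
        d = grad_out.shape[-1]
        H = torch.tensor(hadamard(d), dtype=grad_out.dtype) / d ** 0.5
        return (grad_out * trust) @ H.T, None        # undo the rotation


x = torch.randn(8, 64, requires_grad=True)           # 64 is a power of two
y = QuantSTE.apply(x, 4)
y.sum().backward()
print(y.unique().numel(), x.grad.shape)              # few discrete levels, masked gradient
```

In a full training setup, this forward/backward pair would wrap the weight and activation tensors of each linear layer, with the Hadamard rotation absorbed into the surrounding matrix multiplications rather than materialized explicitly as done here for clarity.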