One approach to reducing the costs of large language models (LLMs) is to use quantized or sparse representations for training or deployment.
While post-training compression methods are popular, there is interest in obtaining more accurate compressed models by training directly over such representations, i.e., via Quantization-Aware Training (QAT).
A recent study suggested that models can be trained with QAT at 8-bit weights and activations while maintaining accuracy.
A new method called QuEST advances the state of the art by demonstrating optimality at 4 bits and stable convergence with weights and activations as low as 1 bit.
QuEST achieves this through two components: accurate and fast quantization of weights and activations via Hadamard normalization and MSE-optimal fitting, and a trust gradient estimator that explicitly minimizes the error between the noisy gradient computed over quantized states and the full-precision gradient (sketched below).
Experiments show that QuEST induces stable scaling laws across various precisions and can be extended to sparse representations.
GPU kernel support is provided to efficiently execute models produced by QuEST.
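To make the quantization step concrete, the following is a minimal sketch of how a Hadamard-normalized, MSE-fitted quantizer with a straight-through-style "trust" backward pass could look in PyTorch. All names (`hadamard_normalize`, `mse_optimal_quantize`, `QuantSTE`), the brute-force scale search, and the error-threshold trust mask are illustrative assumptions, not the paper's reference implementation.

```python
import torch
from scipy.linalg import hadamard


def hadamard_normalize(x: torch.Tensor) -> torch.Tensor:
    """Rotate the last dimension with an orthonormal Hadamard matrix so the
    entries become roughly Gaussian, which a uniform grid fits well."""
    d = x.shape[-1]                                  # must be a power of two
    H = torch.tensor(hadamard(d), dtype=x.dtype) / d ** 0.5
    return x @ H


def mse_optimal_quantize(x: torch.Tensor, bits: int = 4) -> torch.Tensor:
    """Uniform quantization with the clipping scale chosen by a brute-force
    search minimizing mean-squared error (a stand-in for an MSE-optimal fit)."""
    levels = 2 ** (bits - 1)
    best_q, best_err = x, float("inf")
    for mult in torch.linspace(0.5, 4.0, 32):        # candidate clip scales
        scale = mult * x.abs().mean() / levels + 1e-8
        q = (x / scale).round().clamp(-levels, levels - 1) * scale
        err = (q - x).pow(2).mean().item()
        if err < best_err:
            best_q, best_err = q, err
    return best_q


class QuantSTE(torch.autograd.Function):
    """Forward: quantize the Hadamard-rotated input. Backward: pass gradients
    only where the quantization error is small (a crude 'trust' mask, an
    assumed simplification) and rotate them back to the original coordinates."""

    @staticmethod
    def forward(ctx, x, bits):
        xh = hadamard_normalize(x)
        q = mse_optimal_quantize(xh, bits)
        ctx.save_for_backward((q - xh).abs())        # per-element quantization error
        return q

    @staticmethod
    def backward(ctx, grad_out):
        (err,) = ctx.saved_tensors
        trust = (err < err.mean() + err.std()).to(grad_out.dtype)  # assumed threshold
        d = grad_out.shape[-1]
        H = torch.tensor(hadamard(d), dtype=grad_out.dtype) / d ** 0.5
        return (grad_out * trust) @ H.T, None        # undo the rotation


x = torch.randn(8, 64, requires_grad=True)           # 64 is a power of two
y = QuantSTE.apply(x, 4)
y.sum().backward()
print(y.unique().numel(), x.grad.shape)              # few discrete levels, masked gradient
```

In a full training setup, this forward/backward pair would wrap the weight and activation tensors of each linear layer, with the Hadamard rotation absorbed into the surrounding matrix multiplications rather than materialized explicitly as done here for clarity.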