NestQuant is a new post-training quantization (PTQ) method for efficient deployment of large language models (LLMs), based on self-similar nested lattices.
Nested-lattice quantization has been identified as information-theoretically optimal for low-precision matrix multiplication, and NestQuant provides a practical, low-complexity instantiation based on the Gosset lattice.
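To make the Gosset-lattice building block concrete, here is a minimal sketch: it implements the classic nearest-point decoder for E8 (decode to D8 and to the shifted coset D8 + 1/2, keep the closer point) and wraps it into a simple nested-lattice encoder. This is a generic illustration under my own assumptions, not the paper's exact algorithm; the function names and the coarse-lattice scale `q` are illustrative.

```python
import numpy as np

def closest_point_D8(x):
    """Nearest point of D8 (integer vectors whose coordinates sum to an even number)."""
    f = np.round(x)
    if int(f.sum()) % 2 != 0:
        # Parity is odd: re-round the coordinate with the largest rounding error
        # in the "wrong" direction to restore even parity (Conway & Sloane).
        k = int(np.argmax(np.abs(x - f)))
        f[k] += 1.0 if x[k] >= f[k] else -1.0
    return f

def closest_point_E8(x):
    """Nearest point of the Gosset lattice E8 = D8 ∪ (D8 + 1/2)."""
    c0 = closest_point_D8(x)
    c1 = closest_point_D8(x - 0.5) + 0.5
    return c0 if np.sum((x - c0) ** 2) <= np.sum((x - c1) ** 2) else c1

def nested_e8_quantize(x, q=16):
    """Sketch of nested-lattice quantization: snap x to the fine lattice E8, then
    wrap the result modulo the coarse lattice q*E8 so the codebook is finite
    (about log2(q) bits per coordinate; q=16 gives roughly 4 bits)."""
    fine = closest_point_E8(x)
    coarse = q * closest_point_E8(fine / q)
    return fine - coarse  # representative inside the Voronoi region of q*E8

# Example: quantize one 8-dimensional block of weights or activations.
block = np.random.randn(8)
print(nested_e8_quantize(block, q=16))
```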
It is a drop-in quantizer for any matrix-multiplication step in an LLM, such as self-attention and MLP layers.
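To show what "drop-in" means in practice, here is a hypothetical wrapper that quantizes both operands of a linear layer before the ordinary matrix multiplication. The `quantize_blocks` helper uses plain round-to-nearest on 8-dimensional blocks purely as a stand-in for the nested Gosset-lattice quantizer sketched above; all names and the `scale` parameter are assumptions for illustration, not part of the NestQuant codebase.

```python
import numpy as np

def quantize_blocks(m, scale=0.1):
    """Stand-in block quantizer: reshape rows into 8-dim blocks and round to a grid.
    In NestQuant this per-block step would be the nested Gosset-lattice quantizer."""
    shape = m.shape
    blocks = m.reshape(-1, 8) / scale   # assumes the inner dimension is a multiple of 8
    deq = np.round(blocks) * scale      # quantize + dequantize in one step
    return deq.reshape(shape)

def quantized_matmul(activations, weights):
    """Drop-in replacement for activations @ weights.T with both operands quantized."""
    return quantize_blocks(activations) @ quantize_blocks(weights).T

# Usage: the same wrapper applies to any matmul in the model
# (attention projections, MLP layers, attention scores against the KV-cache, ...).
X = np.random.randn(4, 64)      # a batch of activations
W = np.random.randn(128, 64)    # a weight matrix
Y = quantized_matmul(X, W)
print(Y.shape)                  # (4, 128)
```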
NestQuant quantizes the weights, KV-cache, and activations of the Llama-3-8B model to 4 bits, achieving a perplexity of 6.6 on WikiText-2.
This corresponds to a more than 55% reduction in the perplexity gap to the unquantized model relative to state-of-the-art methods such as Meta's SpinQuant, OstQuant, and QuaRot.
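For concreteness, the perplexity-gap metric can be read as in the arithmetic below; the 6.6 figure comes from the summary above, while the unquantized and competing-method perplexities are illustrative placeholders, not numbers from the paper.

```python
# Perplexity gap = quantized perplexity minus unquantized perplexity.
ppl_fp = 6.1          # unquantized baseline (illustrative placeholder)
ppl_nestquant = 6.6   # NestQuant, 4-bit weights/KV-cache/activations on WikiText-2
ppl_baseline = 7.3    # a competing 4-bit method (illustrative placeholder)

gap_nestquant = ppl_nestquant - ppl_fp   # 0.5
gap_baseline = ppl_baseline - ppl_fp     # 1.2
reduction = 1 - gap_nestquant / gap_baseline
print(f"gap reduction: {reduction:.0%}")  # ~58%, i.e. "more than 55%"
```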
Tests on larger models (up to 70B parameters) and on various LLM evaluation benchmarks consistently show NestQuant's advantage.