NestQuant is a new post-training quantization (PTQ) method for efficient deployment of large language models (LLMs), based on self-similar nested lattices.
Nested-lattice quantization has been identified as information-theoretically optimal for low-precision matrix multiplication, and NestQuant provides a practical, low-complexity instantiation based on the Gosset lattice.
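To make the Gosset-lattice building block concrete, here is a minimal sketch: it implements the classic nearest-point decoder for E8 (decode to D8 and to the shifted coset D8 + 1/2, keep the closer point) and wraps it into a simple nested-lattice encoder. This is a generic illustration under my own assumptions, not the paper's exact algorithm; the function names and the coarse-lattice scale `q` are illustrative.

```python
import numpy as np

def closest_point_D8(x):
    """Nearest point of D8 (integer vectors whose coordinates sum to an even number)."""
    f = np.round(x)
    if int(f.sum()) % 2 != 0:
        # Parity is odd: re-round the coordinate with the largest rounding error
        # in the "wrong" direction to restore even parity (Conway & Sloane).
        k = int(np.argmax(np.abs(x - f)))
        f[k] += 1.0 if x[k] >= f[k] else -1.0
    return f

def closest_point_E8(x):
    """Nearest point of the Gosset lattice E8 = D8 ∪ (D8 + 1/2)."""
    c0 = closest_point_D8(x)
    c1 = closest_point_D8(x - 0.5) + 0.5
    return c0 if np.sum((x - c0) ** 2) <= np.sum((x - c1) ** 2) else c1

def nested_e8_quantize(x, q=16):
    """Sketch of nested-lattice quantization: snap x to the fine lattice E8, then
    wrap the result modulo the coarse lattice q*E8 so the codebook is finite
    (about log2(q) bits per coordinate; q=16 gives roughly 4 bits)."""
    fine = closest_point_E8(x)
    coarse = q * closest_point_E8(fine / q)
    return fine - coarse  # representative inside the Voronoi region of q*E8

# Example: quantize one 8-dimensional block of weights or activations.
block = np.random.randn(8)
print(nested_e8_quantize(block, q=16))
```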
It is a drop-in quantizer for any matrix-multiplication step in an LLM, such as self-attention and MLP layers.
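To show what "drop-in" means in practice, here is a hypothetical wrapper that quantizes both operands of a linear layer before the ordinary matrix multiplication. The `quantize_blocks` helper uses plain round-to-nearest on 8-dimensional blocks purely as a stand-in for the nested Gosset-lattice quantizer sketched above; all names and the `scale` parameter are assumptions for illustration, not part of the NestQuant codebase.

```python
import numpy as np

def quantize_blocks(m, scale=0.1):
    """Stand-in block quantizer: reshape rows into 8-dim blocks and round to a grid.
    In NestQuant this per-block step would be the nested Gosset-lattice quantizer."""
    shape = m.shape
    blocks = m.reshape(-1, 8) / scale   # assumes the inner dimension is a multiple of 8
    deq = np.round(blocks) * scale      # quantize + dequantize in one step
    return deq.reshape(shape)

def quantized_matmul(activations, weights):
    """Drop-in replacement for activations @ weights.T with both operands quantized."""
    return quantize_blocks(activations) @ quantize_blocks(weights).T

# Usage: the same wrapper applies to any matmul in the model
# (attention projections, MLP layers, attention scores against the KV-cache, ...).
X = np.random.randn(4, 64)      # a batch of activations
W = np.random.randn(128, 64)    # a weight matrix
Y = quantized_matmul(X, W)
print(Y.shape)                  # (4, 128)
```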
NestQuant quantizes the weights, KV-cache, and activations of the Llama-3-8B model to 4 bits, achieving a perplexity of 6.6 on WikiText-2.
This corresponds to a more than 55% reduction in the perplexity gap to the unquantized model relative to state-of-the-art methods such as Meta's SpinQuant, OstQuant, and QuaRot.
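For concreteness, the perplexity-gap metric can be read as in the arithmetic below; the 6.6 figure comes from the summary above, while the unquantized and competing-method perplexities are illustrative placeholders, not numbers from the paper.

```python
# Perplexity gap = quantized perplexity minus unquantized perplexity.
ppl_fp = 6.1          # unquantized baseline (illustrative placeholder)
ppl_nestquant = 6.6   # NestQuant, 4-bit weights/KV-cache/activations on WikiText-2
ppl_baseline = 7.3    # a competing 4-bit method (illustrative placeholder)

gap_nestquant = ppl_nestquant - ppl_fp   # 0.5
gap_baseline = ppl_baseline - ppl_fp     # 1.2
reduction = 1 - gap_nestquant / gap_baseline
print(f"gap reduction: {reduction:.0%}")  # ~58%, i.e. "more than 55%"
```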
Tests on larger models (up to 70B parameters) and on various LLM evaluation benchmarks consistently show NestQuant's advantage.