Quantization-aware training (QAT) reduces the numerical precision of large language models (LLMs) while preserving their performance, addressing the computational and memory costs these models incur.
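To make the mechanism concrete, below is a minimal sketch of the fake-quantization step commonly used in QAT, with a straight-through estimator so gradients flow through the rounding; the function name `fake_quantize`, the symmetric int4 scheme, and the group size of 128 are illustrative assumptions, not details taken from the paper.

```python
import torch

def fake_quantize(x: torch.Tensor, bits: int = 4, group_size: int = 128) -> torch.Tensor:
    """Simulate low-precision rounding in the forward pass while keeping
    full-precision gradients via the straight-through estimator (STE)."""
    orig_shape = x.shape
    # Split values into quantization groups (assumes numel is divisible by group_size).
    x = x.reshape(-1, group_size)
    qmax = 2 ** (bits - 1) - 1                                 # e.g. 7 for symmetric int4
    scale = x.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / qmax
    x_q = (x / scale).round().clamp(-qmax - 1, qmax) * scale   # quantize-dequantize
    # STE: forward pass sees the quantized values, backward treats rounding as identity.
    x_ste = x + (x_q - x).detach()
    return x_ste.reshape(orig_shape)
```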
A recent paper proposes a unified scaling law for QAT that models quantization error as a function of model size, training data volume, and quantization group size.
Across 268 QAT experiments, quantization error was shown to decrease with larger model size but to increase with more training tokens and with coarser quantization granularity.
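These trends are consistent with a power-law form in which error falls with model size and rises with data volume and group size; the expression below is an illustrative placeholder with unspecified positive constants, not the paper's fitted formula. Here N denotes model parameters, D training tokens, and G the quantization group size.

```latex
% Illustrative power-law form for the quantization error (not the paper's fitted law):
\delta(N, D, G) \;\approx\; k \cdot \frac{D^{\beta}\, G^{\gamma}}{N^{\alpha}},
\qquad k, \alpha, \beta, \gamma > 0
```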
At 4-bit precision, the primary QAT bottleneck was traced to the FC2 layer (the second feed-forward projection), where activation quantization error caused by outliers dominates, indicating that mitigating these outlier-induced errors is key to further improvement.
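The toy example below (not the paper's code) shows why outliers are so damaging: a single large activation stretches the per-group quantization scale, so every other value in that group is rounded far more coarsely and the group's error grows sharply. The group size of 128 and the outlier magnitude are arbitrary choices for illustration.

```python
import torch

def quant_error(x: torch.Tensor, bits: int = 4) -> float:
    """Mean squared error of symmetric round-to-nearest quantization of one group."""
    qmax = 2 ** (bits - 1) - 1
    scale = x.abs().max().clamp(min=1e-8) / qmax
    x_q = (x / scale).round().clamp(-qmax - 1, qmax) * scale
    return torch.mean((x - x_q) ** 2).item()

torch.manual_seed(0)
group = torch.randn(128)                     # toy stand-in for FC2 input activations
print(f"no outlier  : {quant_error(group):.5f}")

group_outlier = group.clone()
group_outlier[0] = 50.0                      # a single outlier stretches the scale
print(f"with outlier: {quant_error(group_outlier):.5f}")
# The outlier inflates the per-group scale, so the remaining activations are
# quantized with a much larger step size and the overall error rises.
```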