Quantization, a widely used technique for making AI models more efficient, has limits, and the industry may be fast approaching them. Quantization lowers the number of bits needed to represent a model's information. But researchers have found that quantized models perform worse than their original, unquantized versions when the original was trained over a long period on lots of data, which is bad news for AI firms training very large models. Scaling up models eventually yields diminishing returns, and data curation and filtering may matter more for efficacy than sheer volume.
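For context, here is what quantization does at the level of a single tensor: a minimal sketch of 8-bit affine quantization with a single per-tensor scale. It is illustrative only; real frameworks calibrate scales and zero points, often per channel.

```python
import numpy as np

# Minimal sketch: map fp32 values onto the int8 range [-127, 127]
# with one scale for the whole tensor, then map back (dequantize).

def quantize_int8(x: np.ndarray):
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale  # approximate fp32 recovery

weights = np.random.randn(4, 4).astype(np.float32)
q, scale = quantize_int8(weights)
error = np.abs(weights - dequantize_int8(q, scale)).max()
print(f"max round-trip error: {error:.4f}")  # small, but never zero
```

That round-trip error is the crux: it is usually tolerable, but it never disappears, and the research suggests it bites harder the more data the original model was trained on.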
Labs are reluctant to train models on smaller datasets, so the researchers suggest an alternative: training models in low precision from the start can make them more robust. The optimal balance has yet to be found, but low quantization precision causes a noticeable step down in quality unless the original model is extremely large in parameter count. In short, there are no shortcuts: bit precision matters, according to the researchers.
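What "training in low precision" looks like in practice varies. A common recipe today is mixed-precision training, where the forward and backward passes run in 16-bit while the optimizer keeps 32-bit master weights. Below is a minimal PyTorch sketch of that standard pattern; the model, data, and hyperparameters are placeholders, and this is not the researchers' specific setup.

```python
import torch

# Minimal mixed-precision training loop (requires a CUDA GPU).
# Forward/backward run in fp16 under autocast; GradScaler scales the
# loss so small fp16 gradients don't underflow to zero before the
# fp32 optimizer step.

model = torch.nn.Linear(512, 512).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler()

x = torch.randn(64, 512, device="cuda")
target = torch.randn(64, 512, device="cuda")

for step in range(100):
    optimizer.zero_grad(set_to_none=True)
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = torch.nn.functional.mse_loss(model(x), target)
    scaler.scale(loss).backward()  # backprop on the scaled loss
    scaler.step(optimizer)         # unscales gradients, then steps
    scaler.update()                # adapts the scale factor over time
```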
How well a quantized model performs depends on how the original was trained and on the precision of the data types involved. Most models today are trained in 16-bit "half precision" and then post-train quantized to 8-bit. Low precision is attractive because it cuts inference costs, but it has its limits.
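Here is what that post-training step can look like: a minimal sketch using PyTorch's dynamic quantization, which converts the weights of Linear layers to int8 after training. The toy model is a stand-in; production pipelines typically also calibrate activations (static quantization).

```python
import torch

# Post-training quantization sketch: take a trained model and convert
# its Linear layers' weights to int8 for cheaper inference.

model = torch.nn.Sequential(
    torch.nn.Linear(768, 3072),
    torch.nn.ReLU(),
    torch.nn.Linear(3072, 768),
).eval()  # quantization is applied to a trained model in eval mode

quantized = torch.ao.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 768)
print(quantized(x).shape)  # same interface, int8 weights underneath
```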
Contrary to popular belief, AI model inference is often more expensive in aggregate than model training. Google, for example, spent an estimated $191m training one of its Gemini models. But if the company used that model to generate 50-word answers to half of all Google Search queries, it would spend roughly $6bn a year.
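To see how the arithmetic gets from millions to billions, here is a back-of-envelope sketch. The query volume and per-answer cost below are assumptions chosen for illustration, not the inputs behind the published estimate.

```python
# Rough sanity check: one-time training cost vs. ongoing inference cost.
# Only the $191M training figure comes from the reported estimate;
# everything else is an assumed, illustrative number.

TRAINING_COST = 191e6      # reported estimate for one Gemini model, USD
QUERIES_PER_DAY = 8.5e9    # assumed Google Search volume
SHARE_ANSWERED = 0.5       # half of all queries, per the scenario
COST_PER_ANSWER = 0.004    # assumed cost of one 50-word answer, USD

annual_inference = QUERIES_PER_DAY * SHARE_ANSWERED * COST_PER_ANSWER * 365
print(f"annual inference: ${annual_inference / 1e9:.1f}B")  # ~= $6.2B
print(f"one-time training: ${TRAINING_COST / 1e6:.0f}M")
```

The point survives any reasonable choice of inputs: a per-query cost multiplied by billions of daily queries dwarfs a one-time training bill within months.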
Quantized models, with fewer bits representing their parameters, are less demanding both mathematically and computationally. But quantization may carry more trade-offs than previously assumed.
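The memory side of that trade-off is easy to quantify. A quick sketch for a hypothetical 7-billion-parameter model:

```python
# Bytes needed to store the same weights at different precisions.
# The parameter count is hypothetical, chosen for illustration.
n_params = 7_000_000_000

for dtype, bits in [("fp32", 32), ("fp16", 16), ("int8", 8)]:
    gigabytes = n_params * bits / 8 / 1e9
    print(f"{dtype}: {gigabytes:5.1f} GB")
# fp32: 28.0 GB, fp16: 14.0 GB, int8: 7.0 GB -- halving the bits
# halves the memory every forward pass has to move.
```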
The upshot is that the industry must move away from simply scaling up models and training on massive datasets; there are limitations that cannot be engineered around.
Looking ahead, architectures that deliberately aim to make low-precision training stable will be important, and low-precision training will remain useful in certain scenarios.
AI models are not fully understood, and shortcuts that work in many other kinds of computation do not necessarily work in AI.
Kumar acknowledges that his and his colleagues' study was conducted at a relatively small scale; they plan to test it with more models in the future. But he believes at least one insight will hold: there is no free lunch when it comes to reducing inference costs.
Instead, he expects effort to go into meticulous data curation and filtering, so that only the highest-quality data goes into smaller models. Kumar concludes that reducing bit precision is not a sustainable path to efficiency and has its limits.