Large Language Models (LLMs) achieve strong results on mathematical reasoning benchmarks such as GSM8K, MATH, and AIME.
Model quantization is widely used to reduce memory usage and inference time, yet it can degrade mathematical reasoning accuracy by up to 69.81%.
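For context, the snippet below is a minimal sketch of loading a model under 4-bit weight quantization with Hugging Face transformers and bitsandbytes; the model name and quantization settings are illustrative assumptions, since the text does not specify which methods or models were evaluated.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Illustrative model choice; not necessarily one of the models studied.
model_id = "Qwen/Qwen2.5-7B-Instruct"

# 4-bit NF4 weight-only quantization, one common low-bit configuration.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)
```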
To understand and categorize the errors that quantization introduces, a study was conducted across mainstream quantization methods and popular open-source models.
An automated data-curation pipeline has been developed to create a compact dataset that, when used to train a quantized model, can restore its reasoning accuracy within a few minutes on a single GPU.
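The text does not describe how the pipeline selects data, so the sketch below is only one plausible interpretation: compare full-precision and quantized outputs on reasoning problems and keep the cases where only the quantized model fails. The `Example` type, the `generate_answer` helper, and the substring answer check are hypothetical placeholders, not the pipeline's actual implementation.

```python
from dataclasses import dataclass

@dataclass
class Example:
    question: str
    reference_answer: str  # the known final answer, used for correctness checks

def generate_answer(model, tokenizer, question: str) -> str:
    """Hypothetical helper: greedy-decode one answer for a single question."""
    inputs = tokenizer(question, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=512, do_sample=False)
    # Strip the prompt tokens and return only the generated continuation.
    return tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)

def curate_recovery_set(full_model, quant_model, tokenizer, problems):
    """Keep problems the full-precision model solves but the quantized model gets wrong.

    The kept (question, full-precision solution) pairs form a compact dataset
    targeted at the errors introduced by quantization. A plain substring check
    stands in for proper answer extraction and matching.
    """
    curated = []
    for ex in problems:
        full_ans = generate_answer(full_model, tokenizer, ex.question)
        quant_ans = generate_answer(quant_model, tokenizer, ex.question)
        if ex.reference_answer in full_ans and ex.reference_answer not in quant_ans:
            curated.append({"prompt": ex.question, "completion": full_ans})
    return curated
```

The resulting pairs could then be used for a brief parameter-efficient fine-tune (for example, LoRA) of the quantized model, which is consistent with the claim that accuracy is restored within a few minutes on a single GPU.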