Model quantization converts model weights and activations from float32 to lower-precision formats such as float16 or int8.
Quantization to float16 is essentially a straightforward cast, since float16 covers a similar (if narrower) range of values; quantization to int8 is harder, because the wide range of float32 values must be mapped onto only 256 integer levels (for signed int8, -128 to 127).
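To make the contrast concrete, here is a minimal NumPy sketch; the min-max scale derivation and the signed [-128, 127] target range are illustrative assumptions, not a reference implementation:

```python
import numpy as np

x = np.array([0.1, -1.5, 3.2], dtype=np.float32)

# float16: a direct cast, trading precision (and some range) for size
x_fp16 = x.astype(np.float16)

# int8: values must first be rescaled so the observed float32 range
# [min, max] fits into the 256 representable int8 levels [-128, 127]
scale = (x.max() - x.min()) / 255.0
x_int8 = (np.round((x - x.min()) / scale) - 128).astype(np.int8)

print(x_fp16)  # roughly [ 0.1 -1.5  3.2 ]
print(x_int8)  # min maps to -128, max maps to 127
```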
Two main quantization schemes are used: the Affine Quantization Scheme, which pairs a scale with a non-zero zero-point and suits data that is not centered on zero (such as post-ReLU activations), and the Symmetric Quantization Scheme, which fixes the zero-point at zero and suits roughly zero-centered data (such as weights).
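The difference between the two schemes comes down to how the scale and zero-point are derived. A minimal sketch under the same illustrative assumptions as above (the function names are ours, not a library API):

```python
import numpy as np

def affine_params(x: np.ndarray, qmin: int = -128, qmax: int = 127):
    # Affine: scale plus a non-zero zero-point, so an asymmetric
    # float range [min, max] still uses all 256 levels.
    scale = (x.max() - x.min()) / (qmax - qmin)
    zero_point = int(np.round(qmin - x.min() / scale))
    return scale, zero_point

def symmetric_params(x: np.ndarray, qmax: int = 127):
    # Symmetric: zero-point fixed at 0; the range is [-max|x|, max|x|],
    # which wastes levels if the data is not centered on zero.
    return np.abs(x).max() / qmax, 0

def quantize(x, scale, zero_point, qmin=-128, qmax=127):
    return np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.int8)

def dequantize(q, scale, zero_point):
    return (q.astype(np.float32) - zero_point) * scale
```

Round-tripping a tensor through `quantize` and `dequantize` shows the quantization error each scheme introduces on a given data distribution.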
Different quantization methods serve different deployment needs: Dynamic Quantization converts weights to int8 ahead of time and quantizes activations on the fly during inference; Static Quantization quantizes both weights and activations ahead of time using a calibration dataset; and Quantization Aware Training simulates quantization during training so the model learns to compensate for the lost precision. Together these methods reduce model size, improve inference efficiency, and enable real-time AI on resource-limited devices.
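As one concrete example, PyTorch exposes Dynamic Quantization as a one-call transform; the sketch below applies it to a small stand-in model (the layer sizes are arbitrary):

```python
import torch
import torch.nn as nn

# A small float32 model standing in for a real network.
model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))

# Dynamic Quantization: Linear weights are converted to int8 ahead of
# time; activations are quantized on the fly at inference time.
quantized = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 128)
print(quantized(x).shape)  # same interface, int8 weights under the hood
```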