Precision scaling, using fewer bits to represent model parameters and related tensors during pre-training, is increasingly used to improve GPU efficiency in LLM pre-training without sacrificing accuracy.
NVIDIA's latest Blackwell GPUs support Microscaling (MX) formats, which combine narrow floating-point data types with per-block scaling factors to quantize tensors.
While MX formats promise improved numeric stability, careful usage is required to ensure that LLMs trained on large datasets still converge.
The study proposes an improved rounding mode for computing scaling factors, rounding up toward infinity, which enables successful MXFP8 pre-training of an 8B-parameter model on 15T tokens.
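A minimal NumPy sketch (not the paper's or NVIDIA's implementation) can illustrate the idea: each 32-element block shares a power-of-two scale, and rounding the scale exponent up toward infinity, rather than down, keeps the block's largest values inside the FP8 E4M3 range instead of saturating them. The block size of 32 and the 448 E4M3 maximum follow the MX format description; the function names and the simplified scale formula below are illustrative assumptions.

```python
import numpy as np

FP8_E4M3_MAX = 448.0   # largest finite E4M3 magnitude
BLOCK = 32             # MX block size: one shared scale per 32 elements

def block_scale(block, round_up=True):
    """Compute a power-of-two scale for one block of values.

    round_up=True rounds the scale exponent toward +infinity, so the block
    maximum still fits in E4M3 after division. round_up=False mimics a
    floor-based rounding, which can saturate the largest block values.
    """
    amax = np.max(np.abs(block))
    if amax == 0:
        return 1.0
    exp = np.log2(amax / FP8_E4M3_MAX)
    exp = np.ceil(exp) if round_up else np.floor(exp)
    return 2.0 ** exp

def quantize_block(block, round_up=True):
    """Divide by the block scale and clip to the E4M3 range.

    A real MXFP8 kernel would also round the quotient to the nearest E4M3
    code point; clipping alone is enough to show the saturation effect.
    """
    scale = block_scale(block, round_up)
    q = np.clip(block / scale, -FP8_E4M3_MAX, FP8_E4M3_MAX)
    return q, scale

rng = np.random.default_rng(0)
block = rng.normal(scale=500.0, size=BLOCK)  # values near the E4M3 limit
for mode in (False, True):
    q, s = quantize_block(block, round_up=mode)
    err = np.max(np.abs(q * s - block))
    print(f"round_up={mode}: scale=2^{int(np.log2(s))}, max reconstruction error={err:.1f}")
```

Running the sketch, the floor-based scale clips the largest values in the block (large reconstruction error), while the round-toward-infinity scale leaves them representable, which is the intuition behind the proposed rounding mode.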