Precision scaling, using fewer bits to represent model parameters and related tensors during pre-training, is increasingly used to improve GPU efficiency in LLM pre-training without sacrificing accuracy.
NVIDIA's latest Blackwell GPUs support Microscaling (MX) formats, which combine narrow floating-point data types with per-block scaling factors to quantize tensors.
While MX formats promise improved numeric stability, careful usage is required to ensure that LLMs trained on large datasets still converge.
The study proposes an improved rounding mode for computing scaling factors, rounding up toward infinity, which enables successful MXFP8 pre-training of an 8B-parameter model on 15T tokens.
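A minimal NumPy sketch (not the paper's or NVIDIA's implementation) can illustrate the idea: each 32-element block shares a power-of-two scale, and rounding the scale exponent up toward infinity, rather than down, keeps the block's largest values inside the FP8 E4M3 range instead of saturating them. The block size of 32 and the 448 E4M3 maximum follow the MX format description; the function names and the simplified scale formula below are illustrative assumptions.

```python
import numpy as np

FP8_E4M3_MAX = 448.0   # largest finite E4M3 magnitude
BLOCK = 32             # MX block size: one shared scale per 32 elements

def block_scale(block, round_up=True):
    """Compute a power-of-two scale for one block of values.

    round_up=True rounds the scale exponent toward +infinity, so the block
    maximum still fits in E4M3 after division. round_up=False mimics a
    floor-based rounding, which can saturate the largest block values.
    """
    amax = np.max(np.abs(block))
    if amax == 0:
        return 1.0
    exp = np.log2(amax / FP8_E4M3_MAX)
    exp = np.ceil(exp) if round_up else np.floor(exp)
    return 2.0 ** exp

def quantize_block(block, round_up=True):
    """Divide by the block scale and clip to the E4M3 range.

    A real MXFP8 kernel would also round the quotient to the nearest E4M3
    code point; clipping alone is enough to show the saturation effect.
    """
    scale = block_scale(block, round_up)
    q = np.clip(block / scale, -FP8_E4M3_MAX, FP8_E4M3_MAX)
    return q, scale

rng = np.random.default_rng(0)
block = rng.normal(scale=500.0, size=BLOCK)  # values near the E4M3 limit
for mode in (False, True):
    q, s = quantize_block(block, round_up=mode)
    err = np.max(np.abs(q * s - block))
    print(f"round_up={mode}: scale=2^{int(np.log2(s))}, max reconstruction error={err:.1f}")
```

Running the sketch, the floor-based scale clips the largest values in the block (large reconstruction error), while the round-toward-infinity scale leaves them representable, which is the intuition behind the proposed rounding mode.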