<ul data-eligibleForWebStory="true"><li>Distributed training is essential for scaling the training of large neural networks such as LLMs.</li><li>The complexity of distributed training programs makes them prone to silent bugs, which produce incorrect results without raising an explicit error.</li><li>Common debugging practices that rely on training metrics can be inefficient at detecting such bugs.</li><li>TTrace is designed to detect and localize silent bugs in distributed training effectively.</li><li>TTrace collects intermediate tensors from the distributed run and compares them against a single-device reference to detect bugs.</li><li>A novel mathematical analysis is proposed for comparing floating-point tensor values and for setting the thresholds used in bug detection.</li><li>Experimental results show that TTrace detects 11 existing bugs and 3 new bugs in Megatron-LM with minimal code changes.</li><li>TTrace is effective across various training recipes, including low-precision ones that use BF16 and FP8.</li></ul>
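The comparison idea in the list above can be illustrated with a minimal sketch. This is not TTrace's implementation: the function names `relative_error` and `first_divergence`, the tensor names, and the threshold (an ad hoc multiple of machine epsilon rather than the paper's analysis) are all assumptions made for illustration. The "distributed" run is simulated in NumPy by splitting a weight matrix into two column shards.

```python
import numpy as np

def relative_error(candidate, reference):
    """Norm-wise relative error between two arrays of the same shape."""
    denom = np.linalg.norm(reference)
    if denom == 0.0:
        return float(np.linalg.norm(candidate))
    return float(np.linalg.norm(candidate - reference) / denom)

def first_divergence(reference_trace, candidate_trace, threshold):
    """Return the name of the first traced tensor that differs, or None.

    Both traces are dicts mapping hypothetical tensor names to arrays,
    inserted in execution order (dicts preserve insertion order).
    """
    for name, ref in reference_trace.items():
        cand = candidate_trace[name]
        if cand.shape != ref.shape or relative_error(cand, ref) > threshold:
            return name
    return None

def trace(linear_out):
    """Record the intermediate tensors of a tiny linear + ReLU model."""
    return {"linear.out": linear_out, "relu.out": np.maximum(linear_out, 0.0)}

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))
w = rng.standard_normal((8, 6))

# Single-device reference run.
reference = trace(x @ w)

# Simulated tensor-parallel run: each "device" holds one column shard of w.
shards = np.split(w, 2, axis=1)
good = trace(np.concatenate([x @ s for s in shards], axis=1))

# Simulated silent bug: shard outputs are gathered in the wrong order.
# Shapes still match, so nothing crashes, but the values are wrong.
bad = trace(np.concatenate([x @ s for s in reversed(shards)], axis=1))

# Assumed threshold: a loose multiple of float64 machine epsilon.
threshold = 1e3 * np.finfo(np.float64).eps

print(first_divergence(reference, good, threshold))
print(first_divergence(reference, bad, threshold))
```

Because the traces are walked in execution order, the first tensor that exceeds the threshold points at the operation where the distributed run departs from the reference, which is the localization step. A lower-precision recipe would need a correspondingly looser threshold.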

TTrace: Lightweight Error Checking and Diagnosis for Distributed Training
