<ul data-eligibleForWebStory="true">
<li>Distributed training is crucial for scaling the training of large neural network models such as LLMs.</li>
<li>The complexity of distributed training programs makes them prone to silent bugs.</li>
<li>Common debugging practices that rely on training metrics may be inefficient at detecting such bugs.</li>
<li>TTrace is designed to detect and localize silent bugs in distributed training effectively.</li>
<li>TTrace collects intermediate tensors from the distributed run and compares them against a single-device reference to detect bugs.</li>
<li>A novel mathematical analysis is proposed for comparing floating-point values in tensors and setting thresholds for bug detection.</li>
<li>Experimental results show that TTrace detects 11 existing bugs and 3 new bugs in Megatron-LM with minimal code changes.</li>
<li>TTrace is effective across various training recipes, including low-precision scenarios with BF16 and FP8.</li>