Arxiv

TTrace: Lightweight Error Checking and Diagnosis for Distributed Training

  • Distributed training is essential for scaling large neural network models such as LLMs.
  • The complexity of distributed training programs makes them prone to silent bugs: bugs that raise no explicit error yet produce incorrect results.
  • The common debugging practice of watching training metrics can be inefficient at detecting such bugs.
  • TTrace is designed to detect and localize silent bugs in distributed training effectively.
  • TTrace collects intermediate tensors and compares them against a single-device reference to detect bugs.
  • A novel mathematical analysis guides how floating-point values in tensors are compared and how thresholds for bug detection are set.
  • Experimental results show TTrace detects 11 existing bugs and 3 new bugs in Megatron-LM with minimal code changes.
  • TTrace remains effective across various training recipes, including low-precision training with BF16 and FP8.
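
The tensor-comparison idea in the bullets above can be illustrated with a small sketch. This is not TTrace's code or its threshold analysis: the function names (`relative_error`, `first_divergence`), the trace layout, and the rule "threshold = a fixed multiple of machine epsilon" are all assumptions made for illustration. Tensors are flat Python lists so the example runs without any dependencies.

```python
import math

# Machine epsilon per dtype: the gap between 1.0 and the next representable value.
EPS = {"fp32": 2.0 ** -23, "bf16": 2.0 ** -7}


def relative_error(candidate, reference):
    """Return ||candidate - reference||_2 / ||reference||_2 for flat lists."""
    if len(candidate) != len(reference):
        raise ValueError("tensor shapes differ")
    diff = math.sqrt(sum((c - r) ** 2 for c, r in zip(candidate, reference)))
    norm = math.sqrt(sum(r * r for r in reference))
    # Fall back to the absolute error when the reference is all zeros.
    return diff / norm if norm > 0 else diff


def first_divergence(candidate_trace, reference_trace, dtype="bf16", slack=8.0):
    """Walk tensors in execution order and return the first one that diverges.

    reference_trace: list of (name, tensor) pairs from the single-device run.
    candidate_trace: dict mapping name -> tensor from the distributed run.
    slack: hypothetical safety factor; the threshold is slack * EPS[dtype].
    Returns (name, error) for the first tensor over the threshold, else None.
    """
    threshold = slack * EPS[dtype]
    for name, ref in reference_trace:
        err = relative_error(candidate_trace[name], ref)
        if err > threshold:
            return name, err
    return None


# Toy traces: "embed" differs only slightly, "attn" is clearly wrong.
reference = [
    ("embed", [1.0, 2.0, 3.0]),
    ("attn", [0.5, -0.5, 1.5]),
    ("mlp", [2.0, 0.0, -1.0]),
]
candidate = {
    "embed": [1.0, 2.001, 3.0],
    "attn": [0.5, -0.5, 1.0],
    "mlp": [2.0, 0.0, -1.0],
}

result_bf16 = first_divergence(candidate, reference, dtype="bf16")
result_fp32 = first_divergence(candidate, reference, dtype="fp32")
print(result_bf16)
print(result_fp32)
```

With the looser BF16 threshold, the small deviation in `embed` is treated as round-off and the first flagged tensor is `attn`; with the tighter FP32 threshold, `embed` itself is flagged. This shows why the threshold has to depend on the numeric precision in use, which is the problem the summary says TTrace's analysis addresses.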
