Arxiv

LUMION: Fast Fault Recovery for ML Jobs Using Programmable Optical Fabrics

  • LUMION introduces a reconfigurable optical fabric, addressing resource inefficiency in modern ML datacenters when accelerators fail.
  • Operators traditionally migrate affected ML jobs to new racks, which means reserving racks of idle accelerators for fault tolerance.
  • LUMION dynamically integrates spare accelerators into ongoing workloads as failures occur, maintaining consistent performance without costly migrations.
  • Experiments show that LUMION can swap failed GPUs for healthy ones and restart ML jobs within approximately 1 second, while achieving higher inter-GPU bandwidth and a nearly 2X improvement in fine-tuning throughput.
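
The swap-in-place idea in the points above can be illustrated with a toy model. This is only a sketch under stated assumptions, not LUMION's implementation: the `Rack` class, the `replace_failed` method, and the GPU names are hypothetical, and the optical reconfiguration itself is reduced to a list update that keeps the failed device's slot in the job.

```python
from dataclasses import dataclass, field


@dataclass
class Rack:
    """Hypothetical toy model: accelerators in a running job plus idle spares."""
    active: list[str]
    spares: list[str] = field(default_factory=list)

    def replace_failed(self, failed: str) -> str:
        """Swap a failed accelerator for a spare, reusing the same job slot."""
        if failed not in self.active:
            raise ValueError(f"{failed} is not part of the running job")
        if not self.spares:
            # Without a spare, the job would have to be migrated elsewhere.
            raise RuntimeError("no spare accelerators left")
        replacement = self.spares.pop(0)
        slot = self.active.index(failed)
        # Stand-in for rewiring the fabric: the spare takes over the slot.
        self.active[slot] = replacement
        return replacement


rack = Rack(active=["gpu0", "gpu1", "gpu2", "gpu3"], spares=["spare0"])
new_gpu = rack.replace_failed("gpu2")
print(new_gpu, rack.active)
```

In this sketch the job keeps its size and slot ordering after the failure, which is the property that lets it restart in place rather than move to a different rack.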
