<ul><li>LUMION is a reconfigurable optical fabric that addresses the resource inefficiency modern ML datacenters incur when accelerators fail.</li><li>Traditionally, operators migrate affected ML jobs to new racks, which forces datacenters to reserve racks of idle accelerators for fault tolerance.</li><li>Instead, LUMION dynamically integrates spare accelerators into ongoing workloads as failures occur, maintaining consistent performance without costly migrations.</li><li>Experiments show that LUMION can swap failed GPUs with healthy ones and restart ML jobs within approximately 1 second, while achieving higher inter-GPU bandwidth and a nearly 2X improvement in fine-tuning throughput.</li></ul>

LUMION: Fast Fault Recovery for ML Jobs Using Programmable Optical Fabrics
