LUMION introduces a reconfigurable optical fabric, addressing resource inefficiency in modern ML datacenters when accelerators fail.
Operators traditionally migrate affected ML jobs to new racks, leading to rack reservations of idle accelerators for fault tolerance.
LUMION dynamically integrates spare accelerators into ongoing workloads as failures occur, maintaining consistent performance without costly migrations.
Experiments show LUMION's ability to swap failed GPUs with healthy ones and restart ML jobs within approximately 1 second, achieving higher inter-GPU bandwidth and nearly 2X improvement in fine-tuning throughput.