Distributed machine learning workloads rely on the AllReduce collective to synchronize gradients or activations during training and inference.
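As a concrete illustration of that role, the sketch below shows how a data-parallel training loop might average gradients across ranks with an AllReduce, using PyTorch's torch.distributed API; the sync_gradients helper is an illustrative assumption, not part of StragglAR.

```python
import torch
import torch.distributed as dist

def sync_gradients(model: torch.nn.Module) -> None:
    """Average gradients across all ranks after the local backward pass.

    Each rank contributes its local gradients; the AllReduce sums the
    tensors on every rank, and dividing by world_size yields the mean.
    Assumes torch.distributed has already been initialized (e.g. via
    dist.init_process_group).
    """
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad.div_(world_size)
```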
A new algorithm called StragglAR has been proposed to accelerate distributed training and inference in the presence of persistent stragglers.
StragglAR exploits the delay caused by a persistent straggler: while the straggler lags, the remaining GPUs perform a ReduceScatter among themselves, and the AllReduce is completed once the straggler is ready, yielding a 2x theoretical speedup over popular AllReduce algorithms on large GPU clusters.
On an 8-GPU server, an implementation of StragglAR achieves a 22% speedup over state-of-the-art AllReduce algorithms.
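The core idea can be viewed as a decomposition of the reduction: while the straggler is late, the ready GPUs reduce-scatter their gradients among themselves, so once the straggler arrives only its contribution and the final gather remain. The NumPy simulation below sketches this data flow under stated assumptions; the function name and chunking scheme are illustrative, and it models only the reduction arithmetic, not StragglAR's actual communication schedule.

```python
import numpy as np

def straggler_aware_allreduce(grads, straggler):
    """Simulate the data flow of a straggler-aware AllReduce.

    `grads` is a list of per-rank gradient vectors and `straggler` is the
    rank that arrives late. This is a conceptual sketch, not StragglAR's
    communication schedule.
    """
    n = len(grads)
    ready = [r for r in range(n) if r != straggler]
    # Split the gradient into one contiguous chunk per ready rank.
    chunks = np.array_split(np.arange(grads[0].size), len(ready))

    # Phase 1 (overlapped with the straggler's delay): ReduceScatter among
    # the ready ranks -- each ready rank ends up owning the partial sum of
    # one chunk.
    owned = {r: sum(grads[q][idx] for q in ready)
             for r, idx in zip(ready, chunks)}

    # Phase 2 (after the straggler arrives): fold in the straggler's
    # contribution, then gather the chunks so every rank holds the full sum.
    for r, idx in zip(ready, chunks):
        owned[r] = owned[r] + grads[straggler][idx]
    reduced = np.concatenate([owned[r] for r in ready])
    return [reduced.copy() for _ in range(n)]

# Sanity check against a plain elementwise-sum AllReduce on 8 simulated GPUs.
rng = np.random.default_rng(0)
grads = [rng.standard_normal(16) for _ in range(8)]
result = straggler_aware_allreduce(grads, straggler=5)
assert np.allclose(result[0], np.sum(grads, axis=0))
```

The point of the simulation is that the work done in Phase 1 is hidden behind the straggler's delay, so only Phase 2 remains on the critical path once the straggler catches up.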