Distributed machine learning workloads rely on the AllReduce collective to synchronize gradients or activations during training and inference.
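As a concrete illustration of that role, the sketch below shows how a data-parallel training loop might average gradients across ranks with an AllReduce, using PyTorch's torch.distributed API; the sync_gradients helper is an illustrative assumption, not part of StragglAR.

```python
import torch
import torch.distributed as dist

def sync_gradients(model: torch.nn.Module) -> None:
    """Average gradients across all ranks after the local backward pass.

    Each rank contributes its local gradients; the AllReduce sums the
    tensors on every rank, and dividing by world_size yields the mean.
    Assumes torch.distributed has already been initialized (e.g. via
    dist.init_process_group).
    """
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad.div_(world_size)
```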
A new algorithm called StragglAR has been proposed to accelerate distributed training and inference in the presence of persistent stragglers.
StragglAR exploits the delay caused by a persistent straggler: while the straggler lags, the remaining GPUs perform a ReduceScatter among themselves, and the AllReduce is completed once the straggler is ready, yielding a 2x theoretical speedup over popular AllReduce algorithms on large GPU clusters.
On an 8-GPU server, an implementation of StragglAR achieves a 22% speedup over state-of-the-art AllReduce algorithms.
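The core idea can be viewed as a decomposition of the reduction: while the straggler is late, the ready GPUs reduce-scatter their gradients among themselves, so once the straggler arrives only its contribution and the final gather remain. The NumPy simulation below sketches this data flow under stated assumptions; the function name and chunking scheme are illustrative, and it models only the reduction arithmetic, not StragglAR's actual communication schedule.

```python
import numpy as np

def straggler_aware_allreduce(grads, straggler):
    """Simulate the data flow of a straggler-aware AllReduce.

    `grads` is a list of per-rank gradient vectors and `straggler` is the
    rank that arrives late. This is a conceptual sketch, not StragglAR's
    communication schedule.
    """
    n = len(grads)
    ready = [r for r in range(n) if r != straggler]
    # Split the gradient into one contiguous chunk per ready rank.
    chunks = np.array_split(np.arange(grads[0].size), len(ready))

    # Phase 1 (overlapped with the straggler's delay): ReduceScatter among
    # the ready ranks -- each ready rank ends up owning the partial sum of
    # one chunk.
    owned = {r: sum(grads[q][idx] for q in ready)
             for r, idx in zip(ready, chunks)}

    # Phase 2 (after the straggler arrives): fold in the straggler's
    # contribution, then gather the chunks so every rank holds the full sum.
    for r, idx in zip(ready, chunks):
        owned[r] = owned[r] + grads[straggler][idx]
    reduced = np.concatenate([owned[r] for r in ready])
    return [reduced.copy() for _ in range(n)]

# Sanity check against a plain elementwise-sum AllReduce on 8 simulated GPUs.
rng = np.random.default_rng(0)
grads = [rng.standard_normal(16) for _ in range(8)]
result = straggler_aware_allreduce(grads, straggler=5)
assert np.allclose(result[0], np.sum(grads, axis=0))
```

The point of the simulation is that the work done in Phase 1 is hidden behind the straggler's delay, so only Phase 2 remains on the critical path once the straggler catches up.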