Training large language models is typically done with optimization methods running on clusters of tens of thousands of accelerators.
Scaling up clusters can be costly, limiting the size of models that can be trained.
A new method, NoLoCo, is proposed to reduce communication requirements during training.
NoLoCo optimizes model weights without requiring collective communication or explicit synchronization of all model parameters.
Instead, it synchronizes weights implicitly through a variant of the Nesterov momentum optimizer that partially averages each worker's weights with those of one randomly selected peer.
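The pairing-and-averaging idea can be illustrated with a short sketch. This is a minimal illustration of the general pattern, gossip-style pairwise averaging combined with an outer Nesterov-style momentum step, and not the paper's actual algorithm; the update rule, the mixing coefficient `mix`, and names such as `outer_step` and `pair_with` are assumptions made here for illustration.

```python
# Minimal sketch (assumptions, not the paper's implementation): after several
# local optimizer steps, each worker mixes its weights with one randomly
# chosen peer and applies a Nesterov-style momentum update on the outer step.
import random
import torch

def pair_with(rank: int, world_size: int, rng: random.Random) -> int:
    """Pick a random peer rank different from our own (illustrative pairing rule)."""
    peer = rng.randrange(world_size - 1)
    return peer if peer < rank else peer + 1

def outer_step(model, peer_state, anchor_state, momentum_buf,
               outer_lr=0.7, outer_momentum=0.9, mix=0.5):
    """One outer synchronization step for a single worker (hypothetical hyperparameters).

    model        : this worker's model after its inner optimization steps
    peer_state   : state_dict received from the randomly selected peer
    anchor_state : weights stored at the previous outer step
    momentum_buf : per-parameter momentum buffers, persisted across outer steps
    """
    new_state = {}
    for name, w in model.state_dict().items():
        if not w.is_floating_point():
            # Leave integer buffers (e.g. step counters) untouched.
            new_state[name] = w
            continue

        # Pairwise (gossip) averaging with a single peer -- no all-reduce
        # over every worker is required.
        mixed = (1.0 - mix) * w + mix * peer_state[name]

        # Treat the drift from the anchor as an "outer gradient".
        delta = anchor_state[name] - mixed

        # Nesterov-style momentum applied to that outer gradient.
        buf = momentum_buf.setdefault(name, torch.zeros_like(w))
        buf.mul_(outer_momentum).add_(delta)
        new_state[name] = anchor_state[name] - outer_lr * (delta + outer_momentum * buf)

    model.load_state_dict(new_state)
    return new_state  # becomes the anchor for the next outer step
```

In a real run, `peer_state` would arrive over a point-to-point channel from the peer chosen by `pair_with`, rather than through a collective operation, which is what removes the global synchronization barrier.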
The proposed optimizer is supported by a theoretical convergence analysis and by empirical results from language model training.
Benchmarking shows that NoLoCo incurs less communication overhead than fully sharded data parallel training and DiLoCo.
Because synchronization avoids a global all-reduce, it completes faster over wide-area links such as the internet, reducing accelerator idle time.
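A toy straggler model (all numbers assumed) illustrates why avoiding a global collective reduces idle time: a synchronous all-reduce stalls every worker on the slowest participant, whereas a random pairwise exchange only slows the pair that contains the straggler.

```python
# Toy straggler model (illustrative numbers only): time spent waiting on one
# synchronization step when a single worker sits behind a slow link.
import random

random.seed(0)
world_size = 64
fast, slow = 1.0, 10.0                           # per-exchange times in seconds (assumed)
link_time = [fast] * (world_size - 1) + [slow]   # one straggler

# Collective all-reduce: every worker waits for the slowest link.
allreduce_wait = [max(link_time)] * world_size

# Random pairing: each worker exchanges weights with exactly one peer.
ranks = list(range(world_size))
random.shuffle(ranks)
pair_wait = [0.0] * world_size
for a, b in zip(ranks[0::2], ranks[1::2]):
    t = max(link_time[a], link_time[b])          # a pair moves at its slower member
    pair_wait[a] = pair_wait[b] = t

print(f"mean wait, all-reduce : {sum(allreduce_wait) / world_size:.1f} s")
print(f"mean wait, pairwise   : {sum(pair_wait) / world_size:.1f} s")
```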
Compared to DiLoCo, NoLoCo converges up to 4% faster across a range of model sizes and accelerator counts.