Training large language models is typically done with optimization methods running on clusters of tens of thousands of accelerators.
Scaling up clusters can be costly, limiting the size of models that can be trained.
A new method, NoLoCo, is proposed to reduce communication requirements during training.
NoLoCo optimizes model weights without requiring collective communication or explicit synchronization of all model parameters.
Instead, it synchronizes weights implicitly through a variant of the Nesterov momentum optimizer that partially averages each worker's weights with those of one randomly selected peer.
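The pairing-and-averaging idea can be illustrated with a short sketch. This is a minimal illustration of the general pattern, gossip-style pairwise averaging combined with an outer Nesterov-style momentum step, and not the paper's actual algorithm; the update rule, the mixing coefficient `mix`, and names such as `outer_step` and `pair_with` are assumptions made here for illustration.

```python
# Minimal sketch (assumptions, not the paper's implementation): after several
# local optimizer steps, each worker mixes its weights with one randomly
# chosen peer and applies a Nesterov-style momentum update on the outer step.
import random
import torch

def pair_with(rank: int, world_size: int, rng: random.Random) -> int:
    """Pick a random peer rank different from our own (illustrative pairing rule)."""
    peer = rng.randrange(world_size - 1)
    return peer if peer < rank else peer + 1

def outer_step(model, peer_state, anchor_state, momentum_buf,
               outer_lr=0.7, outer_momentum=0.9, mix=0.5):
    """One outer synchronization step for a single worker (hypothetical hyperparameters).

    model        : this worker's model after its inner optimization steps
    peer_state   : state_dict received from the randomly selected peer
    anchor_state : weights stored at the previous outer step
    momentum_buf : per-parameter momentum buffers, persisted across outer steps
    """
    new_state = {}
    for name, w in model.state_dict().items():
        if not w.is_floating_point():
            # Leave integer buffers (e.g. step counters) untouched.
            new_state[name] = w
            continue

        # Pairwise (gossip) averaging with a single peer -- no all-reduce
        # over every worker is required.
        mixed = (1.0 - mix) * w + mix * peer_state[name]

        # Treat the drift from the anchor as an "outer gradient".
        delta = anchor_state[name] - mixed

        # Nesterov-style momentum applied to that outer gradient.
        buf = momentum_buf.setdefault(name, torch.zeros_like(w))
        buf.mul_(outer_momentum).add_(delta)
        new_state[name] = anchor_state[name] - outer_lr * (delta + outer_momentum * buf)

    model.load_state_dict(new_state)
    return new_state  # becomes the anchor for the next outer step
```

In a real run, `peer_state` would arrive over a point-to-point channel from the peer chosen by `pair_with`, rather than through a collective operation, which is what removes the global synchronization barrier.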
The proposed optimizer is supported by a theoretical convergence analysis and by empirical results from language model training.
Benchmarking shows that NoLoCo incurs less communication overhead than fully sharded data parallel training and DiLoCo.
Because synchronization avoids a global all-reduce, it completes faster over wide-area links such as the internet, reducing accelerator idle time.
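A toy straggler model (all numbers assumed) illustrates why avoiding a global collective reduces idle time: a synchronous all-reduce stalls every worker on the slowest participant, whereas a random pairwise exchange only slows the pair that contains the straggler.

```python
# Toy straggler model (illustrative numbers only): time spent waiting on one
# synchronization step when a single worker sits behind a slow link.
import random

random.seed(0)
world_size = 64
fast, slow = 1.0, 10.0                           # per-exchange times in seconds (assumed)
link_time = [fast] * (world_size - 1) + [slow]   # one straggler

# Collective all-reduce: every worker waits for the slowest link.
allreduce_wait = [max(link_time)] * world_size

# Random pairing: each worker exchanges weights with exactly one peer.
ranks = list(range(world_size))
random.shuffle(ranks)
pair_wait = [0.0] * world_size
for a, b in zip(ranks[0::2], ranks[1::2]):
    t = max(link_time[a], link_time[b])          # a pair moves at its slower member
    pair_wait[a] = pair_wait[b] = t

print(f"mean wait, all-reduce : {sum(allreduce_wait) / world_size:.1f} s")
print(f"mean wait, pairwise   : {sum(pair_wait) / world_size:.1f} s")
```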
Compared to DiLoCo, NoLoCo converges up to 4% faster across a range of model sizes and accelerator counts.