Source: Arxiv

NoLoCo: No-all-reduce Low Communication Training Method for Large Models

  • Training large language models typically involves optimization methods on clusters with tens of thousands of accelerators.
  • Scaling up clusters can be costly, limiting the size of models that can be trained.
  • A new method, NoLoCo, is proposed to reduce communication requirements during training.
  • NoLoCo optimizes model weights without requiring collective communication or explicit synchronization of all model parameters.
  • It synchronizes model weights implicitly through a variant of the Nesterov momentum optimizer, partially averaging each worker's weights with those of a randomly selected peer (see the sketch after this list).
  • The proposed optimizer, NoLoCo, is supported by theoretical convergence analysis and empirical results from language model training.
  • Benchmarking shows that NoLoCo incurs less communication overhead than other training methods such as fully sharded data parallel training and DiLoCo.
  • NoLoCo achieves faster synchronization speeds over the internet, reducing accelerator idling time.
  • Compared to DiLoCo, NoLoCo demonstrates up to 4% faster convergence rates across various model sizes and accelerator counts.
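The core idea in the bullets above, implicit synchronization by partially averaging weights with one randomly chosen peer combined with a Nesterov-style momentum step, can be illustrated with a short sketch. The snippet below is an assumption-laden illustration, not NoLoCo's actual implementation: the function name `outer_step`, the mixing coefficient `mix`, the outer learning rate, and the use of PyTorch point-to-point sends and receives are all hypothetical choices made for the example.

```python
# Minimal sketch of a gossip-style, low-communication synchronization step.
# All names and hyperparameters here are illustrative assumptions, not the
# paper's implementation.
import torch
import torch.distributed as dist


def outer_step(model, momentum_buf, peer_rank, lr=0.7, beta=0.9, mix=0.5):
    """One synchronization step for a single worker.

    Rather than all-reducing weights across every rank, the worker exchanges
    parameters with exactly one randomly selected peer, blends them, and
    applies a Nesterov-style momentum update toward the blended weights.
    The peer pairing must be mutual: both workers in a pair call this with
    each other's rank, otherwise the point-to-point calls would block.

    momentum_buf is assumed to be a dict of zero-initialized tensors, e.g.
    {n: torch.zeros_like(p) for n, p in model.named_parameters()}.
    """
    with torch.no_grad():
        for name, p in model.named_parameters():
            peer_p = torch.empty_like(p)
            # Point-to-point exchange with the chosen peer; no collective
            # operation (all-reduce / all-gather) over the full group is used.
            req = dist.isend(p.data, dst=peer_rank)
            dist.recv(peer_p, src=peer_rank)
            req.wait()

            # Partial averaging: move part of the way toward the peer's weights.
            target = (1.0 - mix) * p.data + mix * peer_p

            # Nesterov-style momentum applied to the resulting update direction.
            buf = momentum_buf[name]
            delta = target - p.data
            buf.mul_(beta).add_(delta)
            p.data.add_(lr * (delta + beta * buf))
```

In this sketch each worker exchanges parameters with only a single peer per step, so no collective operation over all ranks is required; this is the property the summary contrasts with all-reduce-based approaches such as fully sharded data parallel training.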
