Researchers have introduced DiLoCoX, a low-communication framework for large-scale decentralized-cluster training of large language models.
DiLoCoX combines Pipeline Parallelism with a Dual Optimizer Policy, One-Step-Delay Overlap of communication and local training, and an Adaptive Gradient Compression Scheme to improve the scalability and speed of model pre-training.
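To make the dual-optimizer and one-step-delay ideas concrete, here is a minimal, illustrative sketch of a training loop in that spirit: each worker runs many local steps with an inner optimizer, exchanges pseudo-gradients asynchronously, and applies the previous round's averaged pseudo-gradient with an outer optimizer while the current round's communication is still in flight. All names, hyperparameters, and the loop structure below are assumptions for illustration, not DiLoCoX's actual API; pipeline parallelism and gradient compression are omitted for brevity.

```python
import torch
import torch.distributed as dist
import torch.nn.functional as F


def train(model, data_iter, inner_steps=100, sync_rounds=50):
    # Inner optimizer: ordinary local training on each worker.
    inner_opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
    # Outer optimizer: applies the globally averaged pseudo-gradients.
    outer_opt = torch.optim.SGD(model.parameters(), lr=0.7,
                                momentum=0.9, nesterov=True)

    snapshot = [p.detach().clone() for p in model.parameters()]
    pending = None  # (handles, deltas) from the previous round, still in flight

    for _ in range(sync_rounds):
        # 1) Local training: inner_steps of normal forward/backward/update.
        for _ in range(inner_steps):
            x, y = next(data_iter)
            F.cross_entropy(model(x), y).backward()
            inner_opt.step()
            inner_opt.zero_grad()

        # 2) Pseudo-gradient: parameter drift since the last snapshot.
        deltas = [s - p.detach() for s, p in zip(snapshot, model.parameters())]

        # 3) One-step delay: apply the *previous* round's averaged
        #    pseudo-gradient, whose all-reduce overlapped with step 1.
        if pending is not None:
            handles, prev_deltas = pending
            for h in handles:
                h.wait()
            world = dist.get_world_size()
            for p, d in zip(model.parameters(), prev_deltas):
                p.grad = d / world
            outer_opt.step()
            outer_opt.zero_grad()

        # 4) Launch this round's all-reduce asynchronously; it overlaps with
        #    the next round's local training. (A compressed payload would be
        #    sent here under an adaptive gradient compression scheme.)
        handles = [dist.all_reduce(d, async_op=True) for d in deltas]
        pending = (handles, deltas)
        snapshot = [p.detach().clone() for p in model.parameters()]
```

The key design point this sketch illustrates is that synchronization never blocks compute: the expensive exchange of pseudo-gradients runs in the background during the next block of local steps, at the cost of applying each global update one round late.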
The framework enables pre-training a 107B-parameter foundation model over a 1 Gbps network, achieving a 357x speedup over vanilla AllReduce with negligible impact on model convergence.
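A rough back-of-envelope calculation shows why low-communication techniques matter at this scale; the figures below are assumptions (bf16 parameters, an ideal 1 Gbps link, one full-model exchange per synchronization, ignoring AllReduce topology factors), not numbers from the paper.

```python
# Back-of-envelope: cost of naively synchronizing a 107B-parameter model
# over a 1 Gbps link (assumed bf16 precision, ideal bandwidth).
params = 107e9                               # 107B parameters
bytes_per_param = 2                          # bf16
payload_gb = params * bytes_per_param / 1e9  # ~214 GB per synchronization
link_gbps = 1                                # 1 Gbps network
seconds_per_sync = payload_gb * 8 / link_gbps
print(f"~{payload_gb:.0f} GB per sync, ~{seconds_per_sync / 60:.0f} min "
      "just to move one model copy")
```

Under these assumptions a single full-model exchange alone would take on the order of half an hour, which is why reducing both the frequency and the volume of communication is central to training over such links.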
This marks the first successful application of a decentralized training framework to models exceeding 100 billion parameters.