HALoS is a hierarchical asynchronous optimization framework designed for training large language models (LLMs) in geo-distributed environments.
It introduces local parameter servers (LPSs) within each region and a global parameter server (GPS) to minimize inter-region communication costs and improve training efficiency.
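To make the two-level flow concrete, here is a minimal single-process sketch of a hierarchical parameter-server update loop: workers in each region push gradients to their local server, which steps asynchronously and only occasionally ships its accumulated delta to the global server. All class names, the delta-merge rule, and the sync period are illustrative assumptions for this sketch, not HALoS's actual algorithm.

```python
import numpy as np

class LocalParameterServer:
    """Aggregates gradient updates from workers within one region (hypothetical sketch)."""
    def __init__(self, global_params, lr=0.1, sync_every=4):
        self.params = global_params.copy()
        self.base = global_params.copy()  # snapshot taken at the last global sync
        self.lr = lr
        self.sync_every = sync_every      # local steps between inter-region syncs
        self.steps = 0

    def apply_gradient(self, grad):
        # Asynchronous local step: no coordination with other regions.
        self.params -= self.lr * grad
        self.steps += 1
        return self.steps % self.sync_every == 0  # time to talk to the GPS?

    def delta(self):
        # Accumulated local progress since the last global sync.
        return self.params - self.base

class GlobalParameterServer:
    """Merges region deltas asynchronously (simple additive rule, assumed)."""
    def __init__(self, params):
        self.params = params.copy()

    def merge(self, lps):
        # Fold in the region's delta, then refresh that region's base copy.
        self.params += lps.delta()
        lps.base = self.params.copy()
        lps.params = self.params.copy()

# Toy run: two "regions" jointly minimize f(x) = ||x||^2 (gradient = 2x).
gps = GlobalParameterServer(np.array([4.0, -4.0]))
regions = [LocalParameterServer(gps.params) for _ in range(2)]

for step in range(32):
    lps = regions[step % 2]        # regions make progress independently
    grad = 2.0 * lps.params        # local gradient of ||x||^2
    if lps.apply_gradient(grad):   # periodic, cheap inter-region sync
        gps.merge(lps)

print(np.linalg.norm(gps.params))  # ends well below the starting norm of ~5.66
```

Note that each region's delta is computed against a possibly stale snapshot of the global parameters, so the merged iterate oscillates rather than decreasing monotonically; handling that staleness is precisely the kind of issue a real asynchronous scheme must address.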
In geo-distributed LLM training, HALoS converges up to 7.5x faster than synchronous baselines and up to 2.1x faster than existing asynchronous methods.
The framework maintains final model quality while reducing total training time, making it well suited to scalable, efficient training of large language models in heterogeneous, geo-distributed settings.