Dramatic increases in the capabilities of neural network models in recent years have been driven by scaling model size, training data, and the computational resources devoted to training.
Scaling these factors effectively in large-scale distributed training requires careful choices of hardware configuration and parallelization strategy.
An extensive empirical study of large-scale language model training workloads reveals that distributed communication strategies previously considered sub-optimal can become preferable beyond certain scales.
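To illustrate how the preferred strategy can change with scale, consider the standard $\alpha$-$\beta$ communication cost model; the sketch below assumes ring and tree-based all-reduce schedules with per-message latency $\alpha$, per-byte transfer time $\beta$, message size $n$, and $p$ accelerators, and is an expository assumption rather than the specific strategies or measurements of this study.
% Illustrative alpha-beta cost model; the schedules and terms below are
% assumptions for exposition, not quantities measured in this study.
\begin{align*}
  T_{\text{ring}}(n, p) &\approx 2(p-1)\,\alpha + 2\,\frac{p-1}{p}\,n\,\beta, \\
  T_{\text{tree}}(n, p) &\approx 2\log_2(p)\,\alpha + 2\,\frac{p-1}{p}\,n\,\beta.
\end{align*}
% The bandwidth terms match, but the latency term grows linearly in p for the
% ring schedule and only logarithmically for the tree-based schedule, so a
% schedule that wins at modest p can lose once p grows and latency dominates.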
Even with optimized hardware and parallelization strategies, scaling the total number of hardware accelerators for large model training yields diminishing returns, with poor marginal performance per additional unit of power or GPU-hour.
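A simple decomposition of step time into compute and communication, again an assumed sketch rather than this study's measurement methodology, shows why marginal returns shrink: with a fixed per-accelerator batch $b$, constant per-step compute time $t_{\text{comp}}$, and communication time $t_{\text{comm}}(p)$ that is non-decreasing in the accelerator count $p$,
% Throughput S(p) and scaling efficiency E(p); assumes t_comm(1) is negligible.
\begin{align*}
  S(p) &= \frac{p\,b}{t_{\text{comp}} + t_{\text{comm}}(p)}, &
  E(p) &= \frac{S(p)}{p\,S(1)} \approx \frac{t_{\text{comp}}}{t_{\text{comp}} + t_{\text{comm}}(p)}.
\end{align*}
% Because t_comm(p) grows with p (cf. the latency terms above), E(p) falls, and
% each additional accelerator contributes less throughput per GPU-hour or per watt.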