Scaling deep learning models for decentralized training is challenging due to communication bottlenecks, especially under model parallelism.
A new compression algorithm is proposed that compresses the tensors communicated in both the forward and backward passes, achieving up to 99% compression with negligible memory and compute overhead.
By confining activations and gradients to a predefined low-dimensional subspace, exploiting a recursive structure in transformer networks, the method allows them to be fully reconstructed in subsequent layers.
This approach improves communication efficiency by up to 100x, enabling billion-parameter-scale models to be trained on low-end GPUs connected by consumer-grade internet, while matching the convergence of centralized datacenter systems.
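
To make the subspace idea concrete, below is a minimal sketch (not the paper's implementation) in which activations are projected onto a fixed orthonormal basis before being sent and reconstructed on the receiving worker; the basis construction, dimensions, and the function names compress/decompress are assumptions for illustration. Gradients in the backward pass could be handled with the same projection, and reconstruction is exact only when the tensor already lies in the predefined subspace, which is what the recursive transformer structure is intended to ensure.

```python
# Illustrative sketch only: project activations onto a fixed low-dimensional
# subspace before communication, then reconstruct on the receiving worker.
# The random orthonormal basis and dimensions here are assumptions, not the
# paper's actual subspace construction.
import torch

hidden_dim, subspace_dim = 4096, 64  # sending 64 of 4096 values ~= 98.4% compression

# Predefined subspace: an orthonormal basis known to both ends of the link.
basis = torch.linalg.qr(torch.randn(hidden_dim, subspace_dim)).Q  # (hidden_dim, subspace_dim)

def compress(activations: torch.Tensor) -> torch.Tensor:
    """Project activations of shape (batch, hidden_dim) into the subspace."""
    return activations @ basis  # (batch, subspace_dim) -- this is what gets communicated

def decompress(coefficients: torch.Tensor) -> torch.Tensor:
    """Reconstruct activations on the receiving worker from subspace coefficients."""
    return coefficients @ basis.T  # (batch, hidden_dim)

x = torch.randn(8, hidden_dim)
x_hat = decompress(compress(x))  # exact only if x already lies in the subspace
```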