Scaling deep learning models for decentralized training is challenging due to communication bottlenecks, especially under model parallelism.
A new compression algorithm is proposed that compresses the tensors communicated in both the forward and backward passes, achieving up to 99% compression with negligible memory and compute overhead.
By confining activations and gradients to a predefined low-dimensional subspace, exploiting a recursive structure in transformer networks, the method allows them to be fully reconstructed in subsequent layers.
This approach improves communication efficiency by up to 100x, enabling billion-parameter-scale models to be trained on low-end GPUs connected by consumer-grade internet, while matching the convergence of centralized datacenter systems.
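
To make the subspace idea concrete, below is a minimal sketch (not the paper's implementation) in which activations are projected onto a fixed orthonormal basis before being sent and reconstructed on the receiving worker; the basis construction, dimensions, and the function names compress/decompress are assumptions for illustration. Gradients in the backward pass could be handled with the same projection, and reconstruction is exact only when the tensor already lies in the predefined subspace, which is what the recursive transformer structure is intended to ensure.

```python
# Illustrative sketch only: project activations onto a fixed low-dimensional
# subspace before communication, then reconstruct on the receiving worker.
# The random orthonormal basis and dimensions here are assumptions, not the
# paper's actual subspace construction.
import torch

hidden_dim, subspace_dim = 4096, 64  # sending 64 of 4096 values ~= 98.4% compression

# Predefined subspace: an orthonormal basis known to both ends of the link.
basis = torch.linalg.qr(torch.randn(hidden_dim, subspace_dim)).Q  # (hidden_dim, subspace_dim)

def compress(activations: torch.Tensor) -> torch.Tensor:
    """Project activations of shape (batch, hidden_dim) into the subspace."""
    return activations @ basis  # (batch, subspace_dim) -- this is what gets communicated

def decompress(coefficients: torch.Tensor) -> torch.Tensor:
    """Reconstruct activations on the receiving worker from subspace coefficients."""
    return coefficients @ basis.T  # (batch, hidden_dim)

x = torch.randn(8, hidden_dim)
x_hat = decompress(compress(x))  # exact only if x already lies in the subspace
```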