SWARM Parallelism: Training Large Models Can Be Surprisingly Communication-Efficient

January 27, 2023 · Declared Dead · 🏛 International Conference on Machine Learning

"No code URL or promise found in abstract"

Evidence collected by the PWNC Scanner

Authors: Max Ryabinin, Tim Dettmers, Michael Diskin, Alexander Borzunov
arXiv ID: 2301.11913
Category: cs.DC (Distributed Computing)
Cross-listed: cs.LG
Citations: 57
Venue: International Conference on Machine Learning
Last Checked: 2 months ago

Abstract

Many deep learning applications benefit from using large models with billions of parameters. Training these models is notoriously expensive due to the need for specialized HPC clusters. In this work, we consider alternative setups for training large models: using cheap "preemptible" instances or pooling existing resources from multiple regions. We analyze the performance of existing model-parallel algorithms in these conditions and find configurations where training larger models becomes less communication-intensive. Based on these findings, we propose SWARM parallelism, a model-parallel training algorithm designed for poorly connected, heterogeneous and unreliable devices. SWARM creates temporary randomized pipelines between nodes that are rebalanced in case of failure. We empirically validate our findings and compare SWARM parallelism with existing large-scale training approaches. Finally, we combine our insights with compression strategies to train a large Transformer language model with 1B shared parameters (approximately 13B before sharing) on preemptible T4 GPUs with less than 200Mb/s network.
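The abstract's description of "temporary randomized pipelines between nodes that are rebalanced in case of failure" can be illustrated with a toy sketch. This is not the authors' implementation: the function names (`build_pipeline`, `rebalance`, `fail_peer`), the uniform random peer choice, and the rebalancing rule based on peer counts are all assumptions made for illustration, and the abstract does not specify how routing or rebalancing is actually done.

```python
import random


def build_pipeline(stages, rng):
    """Form one temporary pipeline by picking a random live peer per stage.

    `stages` is a list of sets of peer names, one set per pipeline stage.
    Uniform random choice is an assumption made for this sketch.
    """
    return [rng.choice(sorted(peers)) for peers in stages]


def rebalance(stages):
    """Move peers from the fullest stage to the emptiest stage until the
    stage sizes differ by at most one (a hypothetical balancing rule)."""
    while True:
        largest = max(range(len(stages)), key=lambda i: len(stages[i]))
        smallest = min(range(len(stages)), key=lambda i: len(stages[i]))
        if len(stages[largest]) - len(stages[smallest]) <= 1:
            return
        peer = sorted(stages[largest])[0]  # deterministic pick for clarity
        stages[largest].remove(peer)
        stages[smallest].add(peer)


def fail_peer(stages, peer):
    """Drop a failed (e.g. preempted) peer and rebalance the remaining ones."""
    for peers in stages:
        peers.discard(peer)
    rebalance(stages)


if __name__ == "__main__":
    rng = random.Random(0)
    stages = [{"a", "b", "c"}, {"d"}, {"e", "f"}]
    print(build_pipeline(stages, rng))  # one peer per stage
    fail_peer(stages, "d")              # stage 1 loses its only peer
    print(stages)                       # a peer has been moved into stage 1
    print(build_pipeline(stages, rng))  # a pipeline can still be formed
```

In this sketch every microbatch would call `build_pipeline` again, so no fixed chain of peers exists, and losing a peer only shrinks one stage until `rebalance` moves another peer into it.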



📜 Similar Papers

In the same crypt — Distributed Computing

R.I.P. 👻 Ghosted

TensorFlow: A system for large-scale machine learning

Martín Abadi, Paul Barham, ... (+20 more)

cs.DC 🏛 OSDI 📚 19.3K cites 9 years ago

R.I.P. 👻 Ghosted

TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems

Martín Abadi, Ashish Agarwal, ... (+38 more)

cs.DC 🏛 arXiv 📚 11.6K cites 10 years ago

R.I.P. 👻 Ghosted

Hyperledger Fabric: A Distributed Operating System for Permissioned Blockchains

Elli Androulaki, Artem Barger, ... (+19 more)

cs.DC 🏛 European Conference on Computer Systems 📚 4.0K cites 8 years ago

R.I.P. 👻 Ghosted

Reproducing GW150914: the first observation of gravitational waves from a binary black hole merger

Duncan A. Brown, Karan Vahi, ... (+3 more)

cs.DC 🏛 Computing in science & engineering (Print) 📚 2.3K cites 5 years ago

R.I.P. 👻 Ghosted

MXNet: A Flexible and Efficient Machine Learning Library for Heterogeneous Distributed Systems

Tianqi Chen, Mu Li, ... (+8 more)

cs.DC 🏛 arXiv 📚 2.3K cites 10 years ago

R.I.P. 👻 Ghosted

Efficient Architecture-Aware Acceleration of BWA-MEM for Multicore Systems

Vasimuddin Md, Sanchit Misra, ... (+2 more)

cs.DC 🏛 IPDPS 📚 2.3K cites 6 years ago

Died the same way — 👻 Ghosted

R.I.P. 👻 Ghosted

Language Models are Few-Shot Learners

Tom B. Brown, Benjamin Mann, ... (+29 more)

cs.CL 🏛 NeurIPS 📚 54.2K cites 5 years ago

R.I.P. 👻 Ghosted

PyTorch: An Imperative Style, High-Performance Deep Learning Library

Adam Paszke, Sam Gross, ... (+19 more)

cs.LG 🏛 NeurIPS 📚 49.7K cites 6 years ago

R.I.P. 👻 Ghosted

XGBoost: A Scalable Tree Boosting System

Tianqi Chen, Carlos Guestrin

cs.LG 🏛 KDD 📚 49.2K cites 10 years ago

R.I.P. 👻 Ghosted

Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift

Sergey Ioffe, Christian Szegedy

cs.LG 🏛 ICML 📚 46.0K cites 11 years ago