On Biased Compression for Distributed Learning
February 27, 2020 · Declared Dead · Journal of Machine Learning Research
"No code URL or promise found in abstract"
Evidence collected by the PWNC Scanner
Authors
Aleksandr Beznosikov, Samuel Horváth, Peter Richtárik, Mher Safaryan
arXiv ID
2002.12410
Category
cs.LG: Machine Learning
Cross-listed
cs.DC,
math.OC,
stat.ML
Citations
223
Venue
Journal of Machine Learning Research
Last Checked
2 months ago
Abstract
In the last few years, various communication compression techniques have emerged as an indispensable tool helping to alleviate the communication bottleneck in distributed learning. However, despite the fact that biased compressors often show superior performance in practice when compared to the much more studied and understood unbiased compressors, very little is known about them. In this work we study three classes of biased compression operators, two of which are new, and their performance when applied to (stochastic) gradient descent and distributed (stochastic) gradient descent. We show for the first time that biased compressors can lead to linear convergence rates both in the single-node and distributed settings. We prove that the distributed compressed SGD method, employed with an error feedback mechanism, enjoys the ergodic rate $O\left(\delta L \exp\left[-\frac{\mu K}{\delta L}\right] + \frac{C + \delta D}{K\mu}\right)$, where $\delta \ge 1$ is a compression parameter which grows when more compression is applied, $L$ and $\mu$ are the smoothness and strong convexity constants, $C$ captures stochastic gradient noise ($C = 0$ if full gradients are computed on each node) and $D$ captures the variance of the gradients at the optimum ($D = 0$ for over-parameterized models). Further, via a theoretical study of several synthetic and empirical distributions of communicated gradients, we shed light on why and by how much biased compressors outperform their unbiased variants. Finally, we propose several new biased compressors with promising theoretical guarantees and practical performance.
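To make the error-feedback mechanism from the abstract concrete, here is a minimal NumPy sketch (not the authors' implementation; the function names and the toy problem are illustrative) of distributed SGD in which each node applies Top-k sparsification, a standard biased compressor, and keeps the discarded coordinates in a local error memory:

```python
import numpy as np

def top_k(x: np.ndarray, k: int) -> np.ndarray:
    """Top-k sparsification, a classic biased compressor: keep the k
    largest-magnitude coordinates of x and zero out the rest. It satisfies
    ||top_k(x) - x||^2 <= (1 - k/d) ||x||^2, so in the abstract's notation
    one can take the compression parameter delta = d/k."""
    out = np.zeros_like(x)
    idx = np.argpartition(np.abs(x), x.size - k)[-k:]  # k largest |x_i|
    out[idx] = x[idx]
    return out

def ef_compressed_sgd(grad_fns, x0, k, lr, num_steps):
    """Distributed SGD with biased compression and error feedback (EF).

    grad_fns -- one stochastic-gradient callable per node
    x0       -- initial iterate
    k, lr    -- sparsification level and step size
    Each node compresses its error-corrected update, transmits the sparse
    vector, and keeps the untransmitted remainder in local memory e_i."""
    x = x0.astype(float)
    errors = [np.zeros_like(x) for _ in grad_fns]  # local error memories e_i
    for _ in range(num_steps):
        messages = []
        for i, grad in enumerate(grad_fns):
            p = errors[i] + lr * grad(x)   # re-inject previously dropped mass
            c = top_k(p, k)                # biased compression of the update
            errors[i] = p - c              # remember the discarded part
            messages.append(c)
        x -= np.mean(messages, axis=0)     # server averages and takes the step
    return x

# Toy usage: two nodes minimizing a shared quadratic f(x) = ||x - b||^2 / 2
# from noisy gradients (all values here are illustrative).
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    b = rng.normal(size=50)
    nodes = [lambda x: (x - b) + 0.01 * rng.normal(size=50) for _ in range(2)]
    x_final = ef_compressed_sgd(nodes, np.zeros(50), k=5, lr=0.5, num_steps=500)
    print("distance to optimum:", np.linalg.norm(x_final - b))
```

The error memory is what tames the bias: coordinates dropped by `top_k` are not lost but re-injected into later messages, which is consistent with the abstract's rate, where compression only inflates the constants by the factor $\delta$ instead of introducing a persistent error floor.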
Community Contributions
Found the code? Know the venue? Think something is wrong? Let us know!
Similar Papers
In the same crypt – Machine Learning
XGBoost: A Scalable Tree Boosting System
Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift
Semi-Supervised Classification with Graph Convolutional Networks
Proximal Policy Optimization Algorithms
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
Died the same way – 👻 Ghosted
Language Models are Few-Shot Learners
You Only Look Once: Unified, Real-Time Object Detection
A Unified Approach to Interpreting Model Predictions