Distributed Optimization Based on Gradient-tracking Revisited: Enhancing Convergence Rate via Surrogation
May 07, 2019 · Declared Dead · arXiv.org
"No code URL or promise found in abstract"
Evidence collected by the PWNC Scanner
Authors
Ying Sun, Amir Daneshmand, Gesualdo Scutari
arXiv ID
1905.02637
Category
math.OC: Optimization & Control
Cross-listed
cs.DC
Citations
56
Venue
arXiv.org
Last Checked
2 months ago
Abstract
We study distributed multiagent optimization over (directed, time-varying) graphs. We consider the minimization of $F+G$ subject to convex constraints, where $F$ is the smooth, strongly convex sum of the agents' losses and $G$ is a nonsmooth convex function. We build on the SONATA algorithm: it employs surrogate objective functions in the agents' subproblems (thus going beyond linearization, as in proximal-gradient schemes), coupled with a perturbed (push-sum) consensus mechanism that aims to locally track the gradient of $F$. SONATA achieves precision $\varepsilon>0$ on the objective value in $\mathcal{O}(\kappa_g \log(1/\varepsilon))$ gradient computations at each node and $\tilde{\mathcal{O}}\big(\kappa_g (1-\rho)^{-1/2} \log(1/\varepsilon)\big)$ communication steps, where $\kappa_g$ is the condition number of $F$ and $\rho$ characterizes the connectivity of the network. This is the first linear rate result for distributed composite optimization; it also improves on existing (non-accelerated) schemes that minimize only $F$, whose rates depend on quantities much larger than $\kappa_g$ (e.g., the worst-case condition number among the agents). When considering in particular empirical risk minimization problems with statistically similar data across the agents, SONATA employing high-order surrogates achieves precision $\varepsilon>0$ in $\mathcal{O}\big((\beta/\mu) \log(1/\varepsilon)\big)$ iterations and $\tilde{\mathcal{O}}\big((\beta/\mu) (1-\rho)^{-1/2} \log(1/\varepsilon)\big)$ communication steps, where $\beta$ measures the degree of similarity of the agents' losses and $\mu$ is the strong convexity constant of $F$. Therefore, when $\beta/\mu< \kappa_g$, the use of high-order surrogates yields provably faster rates than what is achievable with first-order models; this is done without exchanging any Hessian matrices over the network.
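The abstract only sketches the mechanism, so below is a minimal, hedged illustration of the gradient-tracking idea SONATA builds on: plain gradient tracking with linearized (first-order) surrogates on an undirected ring graph with doubly stochastic weights. It is not the SONATA algorithm of the paper, which additionally handles the nonsmooth term $G$, convex constraints, and directed time-varying graphs via push-sum, and allows general (possibly higher-order) surrogates; every name and parameter here (the quadratic losses, `grad_f`, `W`, `alpha`) is an illustrative assumption.

```python
import numpy as np

# Hedged sketch (assumptions, not the paper's SONATA): plain gradient
# tracking with linearized surrogates for minimizing F(x) = sum_i f_i(x),
# where each f_i is a private strongly convex quadratic, over an
# undirected ring graph with doubly stochastic weights.

rng = np.random.default_rng(0)
n_agents, dim = 5, 3

# Agent i's private loss: f_i(x) = 0.5 * ||A_i x - b_i||^2.
A = [rng.standard_normal((10, dim)) for _ in range(n_agents)]
b = [rng.standard_normal(10) for _ in range(n_agents)]

def grad_f(i, x):
    """Gradient of agent i's local loss at x."""
    return A[i].T @ (A[i] @ x - b[i])

# Doubly stochastic mixing matrix for a ring: each agent averages itself
# with its two neighbours.
W = np.zeros((n_agents, n_agents))
for i in range(n_agents):
    for j in (i - 1, i, i + 1):
        W[i, j % n_agents] = 1.0 / 3.0

alpha = 0.01                               # step size (assumed small enough)
x = np.zeros((n_agents, dim))              # local iterates, one row per agent
grads = np.array([grad_f(i, x[i]) for i in range(n_agents)])
y = grads.copy()                           # gradient trackers

for _ in range(2000):
    # Local (linearized-surrogate) step using the tracked gradient,
    # followed by a consensus/mixing step with the neighbours.
    x_new = W @ (x - alpha * y)
    # Gradient-tracking update: mix the trackers and add the change in the
    # local gradient, so each y_i keeps estimating (1/n) * sum_j grad f_j.
    grads_new = np.array([grad_f(i, x_new[i]) for i in range(n_agents)])
    y = W @ y + grads_new - grads
    x, grads = x_new, grads_new

# Sanity check: all agents should (approximately) agree on the minimizer of F.
x_star = np.linalg.solve(sum(Ai.T @ Ai for Ai in A),
                         sum(Ai.T @ bi for Ai, bi in zip(A, b)))
print("max deviation from x*:", np.max(np.abs(x - x_star)))
```

With a doubly stochastic $W$ and a sufficiently small step size, each tracker $y_i$ estimates the network-average gradient of $F$, which is what lets rates depend on the global condition number $\kappa_g$ rather than the worst local one; SONATA replaces the linearized local step above with the solution of a surrogate subproblem and replaces plain averaging with a perturbed push-sum scheme for directed, time-varying graphs.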
Community Contributions
Found the code? Know the venue? Think something is wrong? Let us know!
Similar Papers
In the same crypt – Optimization & Control
Local SGD Converges Fast and Communicates Little
On Lazy Training in Differentiable Programming
A Review on Bilevel Optimization: From Classical to Evolutionary Approaches and Applications
Learned Primal-dual Reconstruction
On the Global Convergence of Gradient Descent for Over-parameterized Models using Optimal Transport
Died the same way – Ghosted
Language Models are Few-Shot Learners
PyTorch: An Imperative Style, High-Performance Deep Learning Library
XGBoost: A Scalable Tree Boosting System