Stochastic Optimization for Learning over Networks
Guanghui (George) Lan
School of Industrial and Systems Engineering, Georgia Institute of Technology
East Coast Optimization Meeting, April 4-5, 2019, Department of Mathematical Sciences
Outline: Background, Centralized SGD, Distributed SGD, GEM, RGEM, Decentralized SGD, Summary
Machine Learning
Given a set of observed data S = {(u_i, v_i)}_{i=1}^m, drawn from a certain unknown distribution D on U × V.
Goal: to describe the relation between the u_i and v_i for prediction.
Applications: predicting strokes and seizures, identifying heart failure, stopping credit card fraud, predicting machine failure, identifying spam, ...
Classic models:
- Lasso regression: min_x E[(⟨x, u⟩ − v)^2] + ρ‖x‖_1.
- Support vector machine: min_x E_{u,v}[max{0, v⟨x, u⟩}] + ρ‖x‖_2^2.
- Deep learning: min_x E_{u,v}[(F(u, x) − v)^2] + ρ‖x‖_1.
Machine learning and stochastic optimization
Generic stochastic optimization model: min_{x∈X} {f(x) := E_ξ[F(x, ξ)]}.
In ML, F is the regularized loss function and ξ = (u, v):
- F(x, ξ) = (⟨x, u⟩ − v)^2 + ρ‖x‖_1
- F(x, ξ) = max{0, v⟨x, u⟩} + ρ‖x‖_2^2
Computing the gradient of f is expensive or impossible.
Stochastic first-order methods: iterative methods that operate with the stochastic gradients (subgradients) of f.
Learning over networks
What if the data are distributed over a multi-agent network?
min_x f(x) := Σ_{i=1}^m f_i(x)  s.t. x ∈ X, X := ∩_{i=1}^m X_i,
where f_i(x) = E[F_i(x, ζ_i)] is given in the form of expectation.
- Optimization defined over a complex multi-agent network.
- Each agent has its own data (observations of ζ_i).
- Data usually are private: no sharing.
- Agents can share knowledge learned from the data.
- Communication among agents can be expensive.
- Data can be captured online.
Example: SVM over networks
Three agents: min_x (1/3)[f_1(x) + f_2(x) + f_3(x)], where
f_j(x) = (1/N_j) Σ_{i=1}^{N_j} max{0, v_i^j ⟨x, u_i^j⟩} + ρ‖x‖_2^2, j = 1, 2, 3.
Dataset for agent j: {(u_i^j, v_i^j)}_{i=1}^{N_j}, j = 1, 2, 3.
Each agent accesses only its own dataset.
Example: SVM over networks with streaming data
min_x (1/3)[f_1(x) + f_2(x) + f_3(x)], where f_j(x) = E[max{0, v^j ⟨x, u^j⟩}] + ρ‖x‖_2^2, j = 1, 2, 3.
- The dataset for agent j can be viewed as samples of the random vector (u^j, v^j), j = 1, 2, 3.
- The (u^j, v^j) can follow different distributions.
- Samples can be collected in an online fashion.
- Agents can possibly share solutions, but not the samples.
- Need to minimize the communication costs.
Key questions:
- # samples: sampling complexity
- # communication rounds: communication complexity
Network topology?
- Centralized stochastic gradient descent: sampling complexity
- Distributed SGD and federated learning: sampling complexity, communication complexity
- Decentralized SGD: how to communicate, sampling complexity, communication complexity
Stochastic (sub)gradients
The problem: min_{x∈X} {f(x) := E_ξ[F(x, ξ)]}.
Stochastic (sub)gradients: at iteration t, with x_t ∈ X as the input, we have access to a vector G(x_t, ξ_t), where {ξ_t}_{t≥1} are i.i.d. random variables such that
E[G(x_t, ξ_t)] ≡ g(x_t) ∈ ∂f(x_t), E[‖G(x, ξ)‖^2] ≤ M^2.
Examples:
- Regression with batch data: min_x f(x) = (1/N) Σ_{i=1}^N (⟨x, u_i⟩ − v_i)^2.
  Stochastic gradient: 2(⟨x, u_{i_t}⟩ − v_{i_t}) u_{i_t}.
- Regression with streaming data: min_x f(x) = E[(⟨x, u⟩ − v)^2].
  Stochastic gradient: 2(⟨x, u⟩ − v) u.
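These one-sample gradients are one line of code. A minimal sketch in NumPy (the function name is illustrative, not from the slides):

```python
import numpy as np

def regression_sgrad(x, u, v):
    # Unbiased stochastic gradient of f(x) = E[(<x,u> - v)^2]
    # based on one fresh sample (u, v): G(x, xi) = 2(<x,u> - v) u,
    # so that E[G(x, xi)] = grad f(x).
    return 2.0 * (u @ x - v) * u
```

For batch data, drawing the index i_t uniformly from {1, ..., N} and applying the same formula to (u_{i_t}, v_{i_t}) gives an unbiased gradient of the empirical average.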
Stochastic (sub)gradient descent
The algorithm: x_{t+1} = argmin_{x∈X} ‖x − (x_t − γ_t G_t)‖_2^2, t = 1, 2, ...

Theorem (Nemirovski, Juditsky, Lan and Shapiro 07 (09)). Let D_X ≥ max_{x_1,x_2∈X} ‖x_1 − x_2‖_2. If γ_t = √(D_X^2/(k M^2)), t = 1, ..., k, and x̄_k = Σ_{t=1}^k x_t / k, then
E[f(x̄_k) − f*] ≤ 2 M D_X / √k, ∀k ≥ 1.

Sampling complexity: # samples = # iterations = O(1) M^2 D_X^2 / ε^2,
to find a solution x̄ ∈ X such that E[f(x̄) − f*] ≤ ε.
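A minimal sketch of the averaged projected-SGD scheme from the theorem (NumPy; the toy objective in the usage note and the function names are illustrative assumptions):

```python
import numpy as np

def projected_sgd(sgrad, project, x0, gamma, k, rng):
    # x_{t+1} = Pi_X(x_t - gamma * G(x_t, xi_t)), returning the
    # averaged output bar{x}_k = (1/k) * sum_{t=1}^k x_t.
    x = np.asarray(x0, dtype=float).copy()
    avg = np.zeros_like(x)
    for _ in range(k):
        x = project(x - gamma * sgrad(x, rng))
        avg += x / k
    return avg
```

For instance, minimizing E[(x − ξ)^2] with ξ ~ N(1, 0.1^2) over X = [0, 2] drives the averaged iterate toward the minimizer 1.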
Recent developments
Accelerated SGD (Lan 08 (12))
- Stochastic version of Nesterov's accelerated gradient method
- A universally optimal method for smooth, nonsmooth and stochastic optimization
- The impact of Lipschitz constants vanishes for stochastic problems
- Popular in training deep neural networks (Sutskever, Martens, Dahl, Hinton 13)
Other developments:
- Adaptive stochastic subgradient (Duchi, Hazan, Singer 11)
- Nonconvex SGD and its acceleration (Ghadimi and Lan 12)
- Adaptive sample sizes (Byrd, Chin, Nocedal and Wu 12)
- SGD for finite-sum problems (Schmidt, Roux and Bach 13)
- Optimal incremental gradient methods (Lan and Zhou 15)
The distributed structure - star topology
Figure: A cloud-device based distributed learning system
- Data sets are distributed over individual agents (devices) in the network.
- All devices are connected to a parameter server (or central cloud), which controls the learning process and updates solutions.
- One example: federated learning.
Stochastic finite sum optimization
Consider the convex programming (CP) problem given by
min_{x∈X} Ψ(x) := (1/m) Σ_{i=1}^m f_i(x) + µ ω(x).
- X ⊆ R^n is a closed convex set.
- f_i : R^n → R, i = 1, ..., m, are smooth convex functions with Lipschitz constants L_i ≥ 0.
- f(x) := (1/m) Σ_{i=1}^m f_i(x) is smooth convex with Lipschitz constant L_f ≤ L := (1/m) Σ_{i=1}^m L_i.
- ω : X → R is a strongly convex function with modulus 1 w.r.t. an arbitrary norm ‖·‖.
- µ ≥ 0 is a given constant.
- f_i(x) = E[F_i(x, ξ_i)] can be represented by a stochastic oracle, providing stochastic (sub)gradients upon request.
Randomized incremental gradient (RIG) methods
Randomized incremental gradient (RIG) methods solve min_{x∈X} (1/m) Σ_{i=1}^m f_i(x) iteratively; at the k-th iteration:
1) Randomly select a component index i_k from {1, ..., m} (server).
2) Compute the gradient ∇f_{i_k}(x_k) of the selected component function (agents).
3) Set x_{k+1} = P_X(x_k − α_k ∇f_{i_k}(x_k)), where α_k is a positive step-size and P_X(·) denotes projection onto X (server).
- Potentially save the total number of gradient computations.
- Save communication costs in the distributed setting.
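The three steps above can be sketched in a few lines of NumPy. A hypothetical sketch, with the classical diminishing step-size α_k = 1/(k+1) as an assumption not specified on this slide:

```python
import numpy as np

def rig_step(x, grads, alpha, project, rng):
    # One RIG iteration: sample i_k uniformly from {1,...,m}, then
    # x_{k+1} = P_X(x_k - alpha_k * grad f_{i_k}(x_k)).
    i = rng.integers(len(grads))
    return project(x - alpha * grads[i](x))
```

With f_i(x) = (1/2)(x − c_i)^2 and α_k = 1/(k+1), the iterate tracks the running mean of the sampled c_i, approaching the minimizer of the average function.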
Existing RIG methods
- SAG/SAGA (Schmidt et al. 13; Defazio et al. 14) and SVRG (Johnson and Zhang 13) obtain an O((m + L/µ) log(1/ε)) rate of convergence, which is not optimal.
- RPDG (Lan and Zhou 15; precursors: Zhang and Xiao 14, Dang and Lan 14) requires an exact gradient evaluation at the initial point, and differentiability over R^n.
- The Catalyst scheme (Lin et al. 15) and Katyusha (Allen-Zhu 16) require re-evaluating exact gradients from time to time (synchronous delays).
- No existing studies on the stochastic finite-sum setting, where each f_i is represented by a stochastic oracle, or each agent only has access to noisy first-order information.
Road map
Goals:
- Fully distributed (no exact gradient evaluations)
- Direct acceleration with optimal communication costs
- Applicable to stochastic finite-sum problems, with optimal sampling complexity
Outline:
- Gradient Extrapolation Method (GEM)
- Interpretation of GEM
- Randomized Gradient Extrapolation Method (RGEM)
Prox-function and prox-mapping
We define the prox-function associated with ω as
P(x_0, x) := ω(x) − [ω(x_0) + ⟨ω′(x_0), x − x_0⟩],
and the prox-mapping associated with X and ω as
M_X(g, x_0, η) := argmin_{x∈X} {⟨g, x⟩ + µ ω(x) + η P(x_0, x)}.
- P(·, ·) is a generalization of Bregman's distance, since ω is not necessarily differentiable.
- P(·, ·) is strongly convex w.r.t. an arbitrary norm because of the strong convexity of ω.
- It is reasonable to assume the prox-mapping problem above is easy to solve.
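For the Euclidean choice ω(x) = (1/2)‖x‖^2, we get P(x_0, x) = (1/2)‖x − x_0‖^2 and the prox-mapping has a closed form over a box. A sketch under these assumptions (the function name and the box constraint are illustrative):

```python
import numpy as np

def prox_mapping_box(g, x0, mu, eta, lo, hi):
    # M_X(g, x0, eta) = argmin_{x in X} <g,x> + mu*omega(x) + eta*P(x0,x)
    # with omega(x) = 0.5*||x||^2 and X = [lo, hi]^n. The objective is a
    # separable quadratic; zeroing g + mu*x + eta*(x - x0) coordinate-wise
    # and clipping to the box gives the exact minimizer.
    return np.clip((eta * x0 - g) / (mu + eta), lo, hi)
```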
The algorithm: Gradient Extrapolation Method (GEM)
The problem: min_{x∈X} {f(x) + µω(x) := (1/m) Σ_{i=1}^m f_i(x) + µω(x)}

Gradient Extrapolation Method (GEM)
Initialization: x̄_0 = x_0 ∈ X and g_{−1} = g_0 = ∇f(x_0).  [exact gradient evaluation]
for t = 1, 2, ..., k do
  g̃_t = α_t(g_{t−1} − g_{t−2}) + g_{t−1}
  x_t = M_X(g̃_t, x_{t−1}, η_t)
  x̄_t = (x_t + τ_t x̄_{t−1}) / (1 + τ_t)
  g_t = ∇f(x̄_t)  [one exact gradient evaluation]
end for
Output: x̄_k.
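The GEM loop is only a few lines of code. A minimal sketch with constant parameters and a Euclidean prox-mapping (these simplifications are assumptions; the paper's parameter choices differ):

```python
import numpy as np

def gem(grad_f, prox, x0, alpha, eta, tau, k):
    # Gradient Extrapolation Method (sketch): extrapolate the gradient
    # sequence, take a prox step in x, average, then evaluate one
    # gradient at the averaged point.
    x = np.asarray(x0, dtype=float).copy()
    xbar = x.copy()
    g_prev2 = g_prev = grad_f(x)
    for _ in range(k):
        g_tilde = g_prev + alpha * (g_prev - g_prev2)   # gradient extrapolation
        x = prox(g_tilde, x, eta)                       # x_t = M_X(g_tilde, x_{t-1}, eta_t)
        xbar = (x + tau * xbar) / (1.0 + tau)           # averaging step
        g_prev2, g_prev = g_prev, grad_f(xbar)          # one gradient evaluation
    return xbar
```

With f(x) = (1/2)(x − 1)^2, an unconstrained Euclidean prox step prox(g, x, eta) = x − g/eta, and (alpha, eta, tau) = (1, 2, 1), the averaged iterate converges to the minimizer 1.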
Intuition: Game interpretation of GEM
A game iteratively performed by a primal player (x) and a dual player (g):
min_{x∈X} {f(x) + µ ω(x)} = min_{x∈X} max_{g∈G} {⟨x, g⟩ − J_f(g) + µ ω(x)}.
The primal player predicts the dual player's action based on historical information, and determines her/his own action by minimizing the predicted cost:
g̃_t = α_t(g_{t−1} − g_{t−2}) + g_{t−1},
x_t = argmin_{x∈X} {⟨g̃_t, x⟩ + µ ω(x) + η_t P(x_{t−1}, x)}.
The dual player determines her/his action g_t by maximizing the profit:
g_t = argmin_{g∈G} {−⟨x_t, g⟩ + J_f(g) + τ_t D_g(g_{t−1}, g)}
  ⇔ x̄_t = (x_t + τ_t x̄_{t−1}) / (1 + τ_t), g_t = ∇f(x̄_t).
GEM: the dual of Nesterov’s accelerated gradient method
GEM:
  g̃_t = α_t(g_{t−1} − g_{t−2}) + g_{t−1}
  x_t = M_X(g̃_t, x_{t−1}, η_t)
  g_t = M_G(−x_t, g_{t−1}, τ_t)
NEST:
  x̃_t = α_t(x_{t−1} − x_{t−2}) + x_{t−1}
  g_t = M_G(−x̃_t, g_{t−1}, τ_t)
  x_t = M_X(g_t, x_{t−1}, η_t)
Adding randomization...
The problem: min_{x∈X} {f(x) + µω(x) := (1/m) Σ_{i=1}^m f_i(x) + µω(x)}

GEM
Initialization: x̄_0 = x_0 ∈ X and g_{−1} = g_0 = ∇f(x_0).
for t = 1, 2, ..., k do
  g̃_t = α_t(g_{t−1} − g_{t−2}) + g_{t−1}
  x_t = M_X(g̃_t, x_{t−1}, η_t)
  x̄_t = (x_t + τ_t x̄_{t−1}) / (1 + τ_t)
  g_t = ∇f(x̄_t)
end for
Output: x̄_k.
The algorithm: Random Gradient Extrapolation Method (RGEM)
The problem: min_{x∈X} ψ(x) := (1/m) Σ_{i=1}^m f_i(x) + µω(x)

RGEM
Initialization: x_i^0 = x_0 ∈ X for all i, and y^{−1} = y^0 = 0.  [no exact gradient evaluation]
for t = 1, ..., k do
  Choose i_t uniformly from {1, ..., m}
  ỹ^t = y^{t−1} + α_t(y^{t−1} − y^{t−2})
  x^t = M_X((1/m) Σ_{i=1}^m ỹ_i^t, x^{t−1}, η_t)
  x_i^t = (1 + τ_t)^{−1}(x^t + τ_t x_i^{t−1}) for i = i_t, and x_i^t = x_i^{t−1} otherwise
  y_i^t = ∇f_i(x_i^t) for i = i_t  [one gradient evaluation], and y_i^t = y_i^{t−1} otherwise
end for
Output: for some θ_t > 0, set x̄^k := (Σ_{t=1}^k θ_t)^{−1} Σ_{t=1}^k θ_t x^t.
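A sketch of the RGEM loop with constant parameters and a Euclidean prox step (illustrative assumptions; the convergence theorems prescribe different parameter choices):

```python
import numpy as np

def rgem(grads, prox, x0, m, alpha, eta, tau, k, rng):
    # RGEM sketch: only the sampled agent i_t refreshes its entry of
    # the gradient table y; the server averages the extrapolated table.
    x = np.asarray(x0, dtype=float).copy()
    x_local = np.tile(x, (m, 1))      # per-agent points x_i^t
    y = np.zeros((m, x.size))         # gradient table, y^{-1} = y^0 = 0
    y_prev = y.copy()
    for _ in range(k):
        i = rng.integers(m)
        y_tilde = y + alpha * (y - y_prev)         # table extrapolation
        x = prox(y_tilde.mean(axis=0), x, eta)     # server prox step
        x_local[i] = (x + tau * x_local[i]) / (1.0 + tau)
        y_new = y.copy()
        y_new[i] = grads[i](x_local[i])            # one component gradient
        y_prev, y = y, y_new
    return x
```

With a single agent and f_1(x) = (1/2)(x − 1)^2, the iterates converge to 1 even though the gradient table is initialized at zero rather than at the true gradient.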
RGEM from the server and activated agent perspective
The server updates the iterate x^t and computes the output solution x̄^k (as sumx/sumθ). One agent is activated at a time; it updates its local variables x_i^t and uploads the change of its gradient, ∆y, to the server.
RGEM for deterministic finite-sum optimization
Theorem. Let x* be an optimal solution, L̂ = max_{i=1,...,m} L_i, and set
τ_t = 1/(m(1−α)) − 1, η_t = α µ/(1−α), α_t ≡ mα, with α = 1 − 1/(m + √(m^2 + 16mL̂/µ)).
Then
E[P(x^k, x*)] ≤ (2∆_{0,σ_0}/µ) α^k,
E[ψ(x̄^k) − ψ(x*)] ≤ 16 max{m, L̂/µ} ∆_{0,σ_0} α^{k/2},
where ∆_{0,σ_0} := µP(x_0, x*) + ψ(x_0) − ψ(x*) + σ_0^2/(mµ), and σ_0 satisfies (1/m) Σ_{i=1}^m ‖∇f_i(x_0)‖_*^2 ≤ σ_0^2.

To obtain a stochastic ε-solution, the number of gradient evaluations of the f_i / communication rounds is
O{(m + √(mL̂/µ)) log(1/ε)} (not improvable, Lan and Zhou 17).
The problem: min_{x∈X} ψ(x) := (1/m) Σ_{i=1}^m E_{ξ_i}[F_i(x, ξ_i)] + µω(x)

Assumption. At iteration t, for a given point x_i^t ∈ X, a stochastic first-order (SO) oracle outputs a vector G_i(x_i^t, ξ_i^t) such that
E_ξ[G_i(x_i^t, ξ_i^t)] = ∇f_i(x_i^t), i = 1, ..., m,
E_ξ[‖G_i(x_i^t, ξ_i^t) − ∇f_i(x_i^t)‖_*^2] ≤ σ^2, i = 1, ..., m.

RGEM for stochastic finite-sum optimization: the same as RGEM, except that the gradient update step is replaced by a mini-batch of stochastic gradients given by the SO:
y_i^t = (1/B_t) Σ_{j=1}^{B_t} G_i(x_i^t, ξ_{i,j}^t) for i = i_t, and y_i^t = y_i^{t−1} otherwise.
RGEM for stochastic finite-sum optimization
Theorem. Let τ_t, η_t and α_t be the same as before, and let B_t = ⌈k(1−α)^2 α^{−t}⌉, t = 1, ..., k. Then
E[P(x^k, x*)] ≤ (2∆_{0,σ_0,σ}/µ) α^k,
E[ψ(x̄^k) − ψ(x*)] ≤ 16 max{m, L̂/µ} ∆_{0,σ_0,σ} α^{k/2},
where ∆_{0,σ_0,σ} := µP(x_0, x*) + ψ(x_0) − ψ(x*) + (σ_0^2/m + 5σ^2)/µ.
- Communication rounds: O{(m + √(mL̂/µ)) log(1/ε)}.
- Stochastic gradient evaluations: Õ{∆_{0,σ_0,σ}/(µε) + m + √(mL̂/µ)}.
Note: the sampling complexity is asymptotically independent of m.
Advantages of RGEM for distributed learning
RGEM is an enhanced RIG method for distributed learning:
- Requires no exact gradient evaluations of f
- Involves communication only between the server and the activated agent at each iteration, and tolerates communication failures
- Possesses a directly accelerated algorithmic scheme
- Handles stochastic/online optimization: minimization of the generalization risk
- Optimal O{(m + √(mL̂/µ)) log(1/ε)} communication complexity
- Nearly optimal Õ{∆_{0,σ_0,σ}/(µε)} sampling complexity
Network topology?
Example: Policy evaluation for multi-agent reinforcement learning (Wai, Yang, Wang, Hong 18).
Decentralized optimization techniques
Most studies focus on deterministic optimization (e.g., Nedic and Ozdaglar 09; Shi, Ling, Wu, and Yin 15):
- O(1/ε) communication rounds and gradient computations.
- O(log(1/ε)) communication rounds for unconstrained smooth and strongly convex problems.
For stochastic optimization problems:
- Direct extensions of SGD-type methods (e.g., Duchi, Agarwal, and Wainwright 12).
- O(1/ε^2) communication rounds and stochastic (sub)gradient computations.
Question: is SGD still a good algorithm for decentralized stochastic optimization and machine learning?
How to handle decentralized structure?
Dual decomposition (explicit):
min_x F(x) := Σ_{i=1}^m f_i(x_i)
s.t. x_1 = x_2 = ... = x_m, x_i ∈ X_i, ∀i = 1, ..., m,
where x = [x_1^T, ..., x_m^T]^T.
Background: Laplacian L
Let N_i denote the set of neighbors of agent i (including i itself): N_i = {j ∈ V | (i, j) ∈ E} ∪ {i}. Then the Laplacian L ∈ R^{m×m} of a graph G = (V, E) is defined by
L_ij = |N_i| − 1 if i = j; L_ij = −1 if i ≠ j and (i, j) ∈ E; L_ij = 0 otherwise.
For example, for a path on three nodes:
L = [ 1 −1 0; −1 2 −1; 0 −1 1 ],  with L1 = 0 ("agreement subspace").
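The Laplacian and the agreement identity L1 = 0 are easy to check numerically. A small sketch (the function name is illustrative):

```python
import numpy as np

def graph_laplacian(m, edges):
    # Laplacian of an undirected graph on m nodes:
    # L_ii = degree(i), L_ij = -1 if (i, j) is an edge, 0 otherwise.
    L = np.zeros((m, m))
    for i, j in edges:
        L[i, i] += 1.0
        L[j, j] += 1.0
        L[i, j] -= 1.0
        L[j, i] -= 1.0
    return L
```

The path graph 0-1-2 reproduces the 3 × 3 matrix above, and L applied to the all-ones vector vanishes, so (for a connected graph) the consensus vectors x_1 = ... = x_m are exactly the solutions of Lx = 0.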
Problem Formulation
min_x F(x) := Σ_{i=1}^m f_i(x_i)  s.t. x_1 = ··· = x_m, x_i ∈ X_i, i = 1, ..., m
⇔ min_x F(x) := Σ_{i=1}^m f_i(x_i)  s.t. x_i = x_j ∀(i, j) ∈ E, x_i ∈ X_i, i = 1, ..., m  (if G is connected)
⇔ min_x F(x) := Σ_{i=1}^m f_i(x_i)  s.t. Lx = 0, x_i ∈ X_i, i = 1, ..., m.
Using the Laplacian, with L := L ⊗ I_d, the consistency constraints can be compactly rewritten, yielding the equivalent saddle-point form
min_{x∈X^m} F(x) + max_{y∈R^{md}} ⟨Lx, y⟩.
Decentralized Primal-Dual (DPD): Vector Form
min_{x∈X^m} F(x) + max_{y∈R^{md}} ⟨Lx, y⟩,  x := [x_1^T ··· x_m^T]^T, y := [y_1^T ··· y_m^T]^T.

Let x_0 = x_{−1} ∈ X^m, y_0 ∈ R^{md}, and {α_k}, {τ_k}, {η_k}, {θ_k} be given. For k = 1, ..., N, update z_k = (x_k, y_k):
  x̃_k = α_k(x_{k−1} − x_{k−2}) + x_{k−1}
  y_k = argmin_{y∈R^{md}} −⟨Lx̃_k, y⟩ + (τ_k/2)‖y − y_{k−1}‖^2
  x_k = argmin_{x∈X^m} ⟨Ly_k, x⟩ + F(x) + (η_k/2)‖x_{k−1} − x‖^2
Return z̄_N = (Σ_{k=1}^N θ_k)^{−1} Σ_{k=1}^N θ_k z_k.
DPD: Agent i’s point of view
Let x_0 = x_{−1} ∈ X^m, y_0 ∈ R^{md}, and {α_k}, {τ_k}, {η_k}, {θ_k} be given.
For k = 1, ..., N, update z_i^k = (x_i^k, y_i^k):
  x̃_i^k = α_k(x_i^{k−1} − x_i^{k−2}) + x_i^{k−1}
  v_i^k = Σ_{j∈N_i} L_ij x̃_j^k
  y_i^k = y_i^{k−1} + (1/τ_k) v_i^k
  w_i^k = Σ_{j∈N_i} L_ij y_j^k
  x_i^k = argmin_{x_i∈X_i} ⟨w_i^k, x_i⟩ + f_i(x_i) + (η_k/2)‖x_i^{k−1} − x_i‖^2
Return z̄_N = (Σ_{k=1}^N θ_k)^{−1} Σ_{k=1}^N θ_k z_k.
Decentralized Communication Sliding (DCS)
Q: Is the subproblem
  x_i^k = argmin_{x_i∈X_i} ⟨w_i^k, x_i⟩ + f_i(x_i) + (η_k/2)‖x_i^{k−1} − x_i‖^2
always easy to solve?
A: No. Solve it iteratively, using linearizations of f_i(x_i):

Let u_0 = û_0 = x_i^{k−1}, and {β_t}, {λ_t} be given. For t = 1, ..., T_k,
  h_{t−1} ∈ ∂f_i(u_{t−1})
  u_t = argmin_{u∈X_i} ⟨h_{t−1} + w_i^k, u⟩ + (η_k/2)‖x_i^{k−1} − u‖^2 + (η_k β_t/2)‖u_{t−1} − u‖^2
Return x_i^k = u_{T_k} and x̂_i^k = (Σ_{t=1}^{T_k} λ_t)^{−1} Σ_{t=1}^{T_k} λ_t u_t.

The same w_i^k is used throughout, so communication is skipped! There are two output points, x_i^k and x̂_i^k.
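Each inner step has a closed form once f_i is linearized, since the remaining objective is a separable quadratic. A sketch over a box X_i (the function names and the box constraint are illustrative assumptions):

```python
import numpy as np

def dcs_inner(subgrad_fi, w, x_prev, eta, betas, lambdas, T, lo, hi):
    # DCS inner loop (sketch): approximately solve
    #   argmin_{x in X_i} <w, x> + f_i(x) + (eta/2)*||x_prev - x||^2
    # by T linearized subgradient steps with no communication.
    # Each step minimizes <h + w, u> + (eta/2)*||x_prev - u||^2
    # + (eta*beta_t/2)*||u_{t-1} - u||^2 over the box [lo, hi]^n,
    # which reduces to a clipped weighted average.
    u = np.asarray(x_prev, dtype=float).copy()
    num = np.zeros_like(u)
    den = 0.0
    for t in range(T):
        h = subgrad_fi(u)
        u = np.clip((eta * x_prev + eta * betas[t] * u - h - w)
                    / (eta * (1.0 + betas[t])), lo, hi)
        num += lambdas[t] * u
        den += lambdas[t]
    return u, num / den   # (x_i^k, hat{x}_i^k)
```

With f_i = |·|, w = 0, x_prev = 2, η_k = 1, and the weights β_t = t/2, λ_t = t + 1 used in the convergence theorems, the exact subproblem solution is the soft-threshold value 1, and both outputs approach it.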
Decentralized Communication Sliding (DCS)
Let x_0 = x_{−1} ∈ X^m, y_0 ∈ R^{md}, and {α_k}, {τ_k}, {η_k}, {θ_k}, {T_k} be given.
For k = 1, ..., N, update z_i^k = (x̂_i^k, y_i^k):
  x̃_i^k = α_k(x̂_i^{k−1} − x_i^{k−2}) + x_i^{k−1}
  v_i^k = Σ_{j∈N_i} L_ij x̃_j^k
  y_i^k = argmin_{y_i∈R^d} −⟨v_i^k, y_i⟩ + (τ_k/2)‖y_i − y_i^{k−1}‖^2
  w_i^k = Σ_{j∈N_i} L_ij y_j^k
  (x_i^k, x̂_i^k) = inner loop, run for T_k iterations
Convergence of DCS
Theorem. Let x̂^N = (1/N) Σ_{k=1}^N x̂^k and x* be an optimal solution. If α_k = θ_k = 1, η_k = 2L, τ_k = L, and T_k = ⌈m M^2 N/(L^2 D̃)⌉ for some D̃ > 0, then
F(x̂^N) − F(x*) ≤ (L/N)[(3/2)‖x_0 − x*‖^2 + (1/2)‖y_0‖^2 + 4D̃],
‖Lx̂^N‖ ≤ (L/N)[3‖x_0 − x*‖^2 + 8D̃ + 4‖y* − y_0‖].
- O(1/ε) iterations for an ε-optimal and ε-feasible solution.
- # of required communications is also O(1/ε).
- # of subgradient evaluations is O(1/ε^2).
Decentralized stochastic optimization: f_i(x) := E_{ξ_i}[F_i(x; ξ_i)], where ξ_i models agent i's uncertainty and the distribution P(ξ_i) is not known. Only noisy first-order information G_i(·, ξ_i^t) is available:
E[G_i(u_t, ξ_i^t)] = f_i′(u_t) ∈ ∂f_i(u_t),
E[‖G_i(u_t, ξ_i^t) − f_i′(u_t)‖_*^2] ≤ σ^2.
The primal subproblem
  x_i^k = argmin_{x_i∈X_i} ⟨w_i^k, x_i⟩ + f_i(x_i) + (η_k/2)‖x_i^{k−1} − x_i‖^2
is solved with noisy subgradients:

Let u_0 = û_0 = x_i^{k−1}, and {β_t}, {λ_t} be given. For t = 1, ..., T_k,
  h_{t−1} = G_i(u_{t−1}, ξ_i^{t−1})
  u_t = argmin_{u∈X_i} ⟨h_{t−1} + w_i^k, u⟩ + (η_k/2)‖x_i^{k−1} − u‖^2 + (η_k β_t/2)‖u_{t−1} − u‖^2
Return x_i^k = u_{T_k} and x̂_i^k = (Σ_{t=1}^{T_k} λ_t)^{−1} Σ_{t=1}^{T_k} λ_t u_t.

Observations: the same w_i^k is used, so communication is skipped! There are two output points, x_i^k and x̂_i^k.
Convergence of SDCS
Theorem. Let x̂^N = (1/N) Σ_{k=1}^N x̂^k. If β_t = t/2, λ_t = t + 1, α_k = θ_k = 1, η_k = 2L, τ_k = L, and T_k = ⌈m(M^2 + σ^2)N/(L^2 D̃)⌉ for some D̃ > 0, then
E[F(x̂^N) − F(x*)] ≤ (L/N)[(3/2)‖x_0 − x*‖^2 + (1/2)‖y_0‖^2 + 4D̃],
E[‖Lx̂^N‖] ≤ (L/N)[3‖x_0 − x*‖^2 + 8D̃ + 4‖y* − y_0‖].
Conclusion:
- O(1/ε) iterations for an ε-optimal and ε-feasible solution.
- # of required communications is also O(1/ε).
- # of stochastic subgradient evaluations is O(1/ε^2).
Summary of convergence results
Table: Complexity for obtaining an ε-optimal and ε-feasible solution

Algorithm (problem type)  | # of communications       | # of subgradient evaluations
DCS (convex)              | O(L D_{X^m}^2 / ε)        | O(m M^2 D_{X^m}^2 / ε^2)
SDCS (convex)             | O(L D_{X^m}^2 / ε)        | O(m(M^2 + σ^2) D_{X^m}^2 / ε^2)
DCS (strongly convex)     | O(µ D_{X^m}^2 / ε)        | O(m M^2 / (µε))
SDCS (strongly convex)    | O(µ D_{X^m}^2 / ε)        | O(m(M^2 + σ^2) / (µε))
Comparisons with centralized SGD
Assumptions: in the worst case, M_f ≤ mM, µ_f ≥ mµ, D_X^2 / D_{X^m}^2 = O(1/m), and σ̃^2 ≤ mσ^2.

Table: # of stochastic subgradient evaluations

Problem type    | SDCS (individual agent)          | SGD
Convex          | O(m(M^2 + σ^2) D_{X^m}^2 / ε^2)  | O((M_f^2 + σ̃^2) D_X^2 / ε^2)
Strongly convex | O(m(M^2 + σ^2) / (µε))           | O((M_f^2 + σ̃^2) / (µ_f ε))

Conclusion: the sampling complexity is comparable to that of centralized SGD under reasonable assumptions, and hence not improvable in general.
Numerical example
Test problem: decentralized linear SVM model
min_x Σ_{i=1}^m E_{(u_i,v_i)}[max{0, 1 − v_i⟨x_i, u_i⟩}]  s.t. Lx = 0.
- Network structure: connected graph with 100 nodes.
- Data set: real data set "ijcnn1" from LIBSVM.
Figure: The underlying decentralized network
Comparing with distributed dual averaging
Conclusion: SDCS saves inter-node communication rounds while preserving the same order of sampling complexity.
Random gradient extrapolation for federated learning over networks:
- Optimal O((m + √(mL/µ)) log(1/ε)) communication complexity
- Nearly optimal O(1/ε) sampling complexity
Stochastic communication sliding for decentralized learning over networks:
- # of stochastic subgradient evaluations is comparable to centralized SGD.
- # of communication rounds is negligible in comparison with # of stochastic subgradient evaluations.
Thanks!
- G. Lan and Y. Zhou, "Random gradient extrapolation for distributed and stochastic optimization", SIAM Journal on Optimization, 28(4), 2753-2782, 2018.
- G. Lan, S. Lee and Y. Zhou, "Communication-efficient algorithms for decentralized and stochastic optimization", Mathematical Programming, to appear.