

SLIDE 1

Stochastic Optimization for Learning over Networks

Guanghui (George) Lan

School of Industrial and Systems Engineering Georgia Institute of Technology

East Coast Optimization Meeting 2019, April 4-5, 2019 Department of Mathematical Sciences George Mason University Fairfax, Virginia (USA)

SLIDE 2

Outline: Background · Centralized SGD · Distributed SGD · GEM · RGEM · Decentralized SGD · Summary

Machine Learning

Given a set of observed data S = {(u_i, v_i)}_{i=1}^m drawn from an unknown distribution D on U × V.

Goal: describe the relation between the u_i and v_i for prediction.

Applications: predicting strokes and seizures, identifying heart failure, stopping credit card fraud, predicting machine failure, identifying spam, ...

Classic models:
- Lasso regression: min_x E[(⟨x, u⟩ − v)^2] + ρ‖x‖_1.
- Support vector machine: min_x E_{u,v}[max{0, v⟨x, u⟩}] + ρ‖x‖_2^2.
- Deep learning: min_x E_{u,v}[(F(u, x) − v)^2] + ρ‖x‖_1.

SLIDE 3

Machine learning and stochastic optimization

Generic stochastic optimization model: min_{x∈X} { f(x) := E_ξ[F(x, ξ)] }.

In ML, F is the regularized loss function and ξ = (u, v):
- F(x, ξ) = (⟨x, u⟩ − v)^2 + ρ‖x‖_1
- F(x, ξ) = max{0, v⟨x, u⟩} + ρ‖x‖_2^2

Computing the exact gradient of f is expensive or impossible. Stochastic first-order methods: iterative methods that operate with stochastic gradients (subgradients) of f.

SLIDE 4

Learning over networks

What if the data are distributed over a multi-agent network?

min_x f(x) := Σ_{i=1}^m f_i(x)
s.t. x ∈ X, X := ∩_{i=1}^m X_i,

where f_i(x) = E[F_i(x, ζ_i)] is given in the form of an expectation.

- Optimization defined over a complex multi-agent network.
- Each agent has its own data (observations of ζ_i).
- Data usually are private - no sharing.
- Agents can share knowledge learned from the data.
- Communication among agents can be expensive.
- Data can be captured online.

SLIDE 5

Example: SVM over networks

Three agents: min_x (1/3)[f_1(x) + f_2(x) + f_3(x)], with

f_j(x) = (1/N_j) Σ_{i=1}^{N_j} max{0, v_i^j ⟨x, u_i^j⟩} + ρ‖x‖_2^2, j = 1, 2, 3.

- Dataset of agent 1: {(u_i^1, v_i^1)}_{i=1}^{N_1}.
- Dataset of agent 2: {(u_i^2, v_i^2)}_{i=1}^{N_2}.
- Dataset of agent 3: {(u_i^3, v_i^3)}_{i=1}^{N_3}.

Each agent accesses only its own dataset.
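A per-agent objective of this form is straightforward to evaluate. The sketch below uses the standard margin form max{0, 1 − v⟨x, u⟩} of the hinge loss (as in the numerical example later in the deck; the slide abbreviates the margin term):

```python
import numpy as np

def agent_objective(x, U, V, rho):
    """Regularized hinge loss of one agent:
    f_i(x) = (1/N_i) * sum_j max(0, 1 - v_j * <x, u_j>) + rho * ||x||_2^2,
    with U an (N_i, d) feature matrix and V labels in {-1, +1}."""
    hinge = np.maximum(0.0, 1.0 - V * (U @ x))
    return hinge.mean() + rho * np.dot(x, x)
```

For example, at x = 0 every margin term equals 1, so the objective is 1 plus zero regularization.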

SLIDE 6

Example: SVM over networks with streaming data

min_x (1/3)[f_1(x) + f_2(x) + f_3(x)],

where f_j(x) = E[max{0, v^j ⟨x, u^j⟩}] + ρ‖x‖_2^2, j = 1, 2, 3.

- The dataset of agent j can be viewed as samples of the random vector (u^j, v^j), j = 1, 2, 3.
- The (u^j, v^j) can follow different distributions.
- Samples can be collected in an online fashion.
- Agents can possibly share solutions, but not the samples.
- Need to minimize the communication cost.

Key questions:
- # samples - sampling complexity
- # communication rounds - communication complexity

SLIDE 7

Network topology?

SLIDE 8

- Centralized stochastic gradient descent
  - Sampling complexity
- Distributed SGD and federated learning
  - Sampling complexity
  - Communication complexity
- Decentralized SGD
  - How to communicate
  - Sampling complexity
  - Communication complexity

SLIDE 9

Stochastic (sub)gradients

The problem: min_{x∈X} { f(x) := E_ξ[F(x, ξ)] }.

Stochastic (sub)gradients: at iteration t, with x_t ∈ X being the input, we have access to a vector G(x_t, ξ_t), where {ξ_t}_{t≥1} are i.i.d. random variables such that

E[G(x_t, ξ_t)] ≡ g(x_t) ∈ ∂f(x_t),  E[‖G(x, ξ)‖^2] ≤ M^2.

Examples:
- Regression with batch data: min_x f(x) = (1/N) Σ_{i=1}^N (⟨x, u_i⟩ − v_i)^2.
  Stochastic gradient: 2(⟨x, u_{i_t}⟩ − v_{i_t}) u_{i_t}.
- Regression with streaming data: min_x f(x) = E[(⟨x, u⟩ − v)^2].
  Stochastic gradient: 2(⟨x, u⟩ − v) u.
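The streaming-regression stochastic gradient above can be written directly; a minimal sketch:

```python
import numpy as np

def stochastic_gradient(x, u, v):
    """Unbiased stochastic gradient of f(x) = E[(<x,u> - v)^2] at one sample (u, v):
    G(x, (u, v)) = 2 (<x, u> - v) u, so that E[G(x, .)] = grad f(x)."""
    return 2.0 * (x @ u - v) * u
```

At x = 0 and a sample (u, v), the gradient is simply −2v·u, which the sketch reproduces.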

SLIDE 10

Stochastic (sub)gradient descent

The algorithm: x_{t+1} = argmin_{x∈X} ‖x − (x_t − γ_t G_t)‖_2^2, t = 1, 2, . . .

Theorem (Nemirovski, Juditsky, Lan and Shapiro 07 (09)). Let D_X ≥ max_{x_1,x_2∈X} ‖x_1 − x_2‖_2. If γ_t = D_X/(M√k), t = 1, . . . , k, and x̄_k = Σ_{t=1}^k x_t / k, then

E[f(x̄_k) − f*] ≤ M D_X / (2√k), ∀k ≥ 1.

Sampling complexity: # samples = # iterations = O(1) · M^2 D_X^2 / ǫ^2, to find a solution x̄ ∈ X such that E[f(x̄) − f*] ≤ ǫ.
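The averaged scheme above can be sketched on a toy stochastic problem. This is an illustration, not the theorem's exact setup: the stepsize D_X/(M√k) is replaced by a plain 1/√k, and the toy instance (minimize E[(x − ξ)^2] over X = [−1, 1] with ξ ~ N(0.5, 1), minimizer x* = 0.5) is an assumption of the sketch:

```python
import numpy as np

def projected_sgd(grad_sample, project, x0, gamma, k):
    """Projected stochastic (sub)gradient descent with averaging:
    x_{t+1} = Proj_X(x_t - gamma * G(x_t, xi_t)); output xbar_k = (1/k) sum_t x_t."""
    x = x0.copy()
    running_sum = np.zeros_like(x0)
    for _ in range(k):
        x = project(x - gamma * grad_sample(x))
        running_sum += x
    return running_sum / k

# toy instance: min E[(x - xi)^2] over X = [-1, 1], xi ~ N(0.5, 1); x* = 0.5
rng = np.random.default_rng(1)
grad = lambda x: 2.0 * (x - (0.5 + rng.normal(size=x.shape)))
project = lambda x: np.clip(x, -1.0, 1.0)
k = 5000
xbar = projected_sgd(grad, project, np.zeros(1), gamma=1.0 / np.sqrt(k), k=k)
```

With k = 5000 samples the averaged iterate lands close to x* = 0.5, in line with the O(1/√k) bound.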

SLIDE 11

Recent developments

Accelerated SGD (Lan 08 (12)):
- Stochastic version of Nesterov's accelerated gradient method.
- A universally optimal method for smooth, nonsmooth and stochastic optimization.
- The impact of the Lipschitz constant vanishes for stochastic problems.
- Popular in training deep neural networks (Sutskever, Martens, Dahl, Hinton 13).

Further developments:
- Adaptive stochastic subgradient (Duchi, Hazan, Singer 11).
- Nonconvex SGD and its acceleration (Ghadimi and Lan 12).
- Adaptive sample sizes (Byrd, Chin, Nocedal and Wu 12).
- SGD for finite-sum problems (Schmidt, Roux and Bach 13).
- Optimal incremental gradient methods (Lan and Zhou 15).

SLIDE 12

The distributed structure - star topology

Figure: A cloud-device based distributed learning system

- Data sets are distributed over individual agents (devices) in the network.
- All devices are connected to a parameter server (or central cloud), which controls the learning process and updates solutions.
- One example: federated learning.

SLIDE 13

Stochastic finite sum optimization

Consider the convex programming (CP) problem

min_{x∈X} Ψ(x) := (1/m) Σ_{i=1}^m f_i(x) + µ ω(x).

- X ⊆ R^n is a closed convex set.
- f_i : R^n → R, i = 1, . . . , m, are smooth convex with Lipschitz constants L_i ≥ 0.
- f(x) := (1/m) Σ_{i=1}^m f_i(x) is smooth convex with Lipschitz constant L_f ≤ L = (1/m) Σ_{i=1}^m L_i.
- ω : X → R is a strongly convex function with modulus 1 w.r.t. an arbitrary norm ‖·‖.
- µ ≥ 0 is a given constant.
- f_i(x) = E[F_i(x, ξ_i)] can be represented by a stochastic oracle, providing stochastic (sub)gradients upon request.

SLIDE 14

Motivation

Randomized incremental gradient (RIG) methods

Randomized incremental gradient (RIG) methods solve min_{x∈X} (1/m) Σ_{i=1}^m f_i(x) iteratively; at the k-th iteration:

1) Randomly select a component index i_k from {1, . . . , m} (server).
2) Compute the gradient ∇f_{i_k}(x_k) of the component function (agents).
3) Set x_{k+1} = P_X(x_k − α_k ∇f_{i_k}(x_k)), where α_k is a positive stepsize and P_X(·) denotes projection onto X (server).

- Potentially saves the total number of gradient computations.
- Saves communication costs in the distributed setting.

SLIDE 15

Motivation

Existing RIG methods

- SAG/SAGA (Schmidt et al 13; Defazio et al 14) and SVRG (Johnson and Zhang 13) obtain an O((m + L/µ) log(1/ǫ)) rate of convergence - not optimal.
- RPDG (Lan and Zhou 15; precursors: Zhang and Xiao 14, Dang and Lan 14) requires an exact gradient evaluation at the initial point, and differentiability over R^n.
- The Catalyst scheme (Lin et al 15) and Katyusha (Allen-Zhu 16) require re-evaluating exact gradients from time to time - synchronous delays.
- No existing studies on stochastic finite-sum problems, where each f_i is represented by a stochastic oracle, i.e., each agent only has access to noisy first-order information.

SLIDE 16

Motivation

Road map

Goals:
- Fully distributed (no exact gradient evaluations).
- Direct acceleration with optimal communication costs.
- Applicable to stochastic finite-sum problems - optimal sampling complexity.

Outline:
- Gradient Extrapolation Method - GEM
- Interpretation of GEM
- Randomized Gradient Extrapolation Method - RGEM

SLIDE 17

Preliminaries

Prox-function and prox-mapping

We define a prox-function associated with ω as

P(x_0, x) := ω(x) − [ω(x_0) + ⟨ω′(x_0), x − x_0⟩],

and the prox-mapping associated with X and ω as

M_X(g, x_0, η) := argmin_{x∈X} { ⟨g, x⟩ + µ ω(x) + η P(x_0, x) }.

- P(·, ·) is a generalization of Bregman's distance, since ω is not necessarily differentiable.
- P(·, ·) is strongly convex w.r.t. an arbitrary norm because of the strong convexity of ω.
- It is reasonable to assume that the above prox-mapping problem is easy to solve.
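One concrete case where the prox-mapping is indeed easy: take µ = 0, X the probability simplex, and ω the entropy, so that P(x_0, x) = KL(x ‖ x_0) and the minimizer has a closed form (this specific instance is an illustration, not part of the slide):

```python
import numpy as np

def prox_mapping_simplex(g, x0, eta):
    """Prox-mapping M_X(g, x0, eta) with mu = 0, X the probability simplex and
    omega(x) = sum_j x_j log x_j, so P(x0, x) = KL(x || x0).
    Minimizing <g, x> + eta * KL(x || x0) over the simplex gives the closed form
    x_j proportional to x0_j * exp(-g_j / eta)."""
    z = x0 * np.exp(-g / eta)
    return z / z.sum()
```

A zero "gradient" leaves the point unchanged, and a large component of g pushes mass away from that coordinate.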

SLIDE 18

The algorithm - Gradient extrapolation method (GEM)

The problem: min_{x∈X} { f(x) + µ ω(x) := (1/m) Σ_{i=1}^m f_i(x) + µ ω(x) }.

Gradient Extrapolation Method (GEM)
Initialization: x̄_0 = x_0 ∈ X and g_{−1} = g_0 = ∇f(x_0).  [exact gradient evaluation]
for t = 1, 2, . . . , k do
  g̃_t = α_t(g_{t−1} − g_{t−2}) + g_{t−1}
  x_t = M_X(g̃_t, x_{t−1}, η_t)
  x̄_t = (x_t + τ_t x̄_{t−1}) / (1 + τ_t)
  g_t = ∇f(x̄_t)  [one exact gradient evaluation]
end for
Output: x̄_k.
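The loop above can be sketched in a few lines for the Euclidean, unconstrained case (µ = 0, X = R^n). The stepsize choices below (α_t = 1, τ_t = t/2, η_t = 4L/t) are illustrative assumptions, not the schedules from the analysis:

```python
import numpy as np

def gem(grad, x0, L, k):
    """Gradient Extrapolation Method, Euclidean prox, mu = 0 (a sketch;
    alpha_t = 1, tau_t = t/2, eta_t = 4L/t are illustrative stepsizes)."""
    x = x0.copy()
    xbar = x0.copy()
    g_prev2 = g_prev = grad(x0)               # exact gradient at the initial point
    for t in range(1, k + 1):
        tau, eta = t / 2.0, 4.0 * L / t
        g_tilde = 2.0 * g_prev - g_prev2      # alpha_t = 1: extrapolate the dual
        x = x - g_tilde / eta                 # Euclidean prox-mapping M_X (X = R^n)
        xbar = (x + tau * xbar) / (1.0 + tau) # averaging step
        g_prev2, g_prev = g_prev, grad(xbar)  # one gradient evaluation per iteration
    return xbar
```

On the toy objective f(x) = ‖x‖^2/2 (gradient x, L = 1) the averaged iterate x̄_k shrinks toward the minimizer 0.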

SLIDE 19

The algorithm - Gradient extrapolation method (GEM)

Intuition: game interpretation of GEM

A game iteratively played by a primal player (x) and a dual player (g):

min_{x∈X} { f(x) + µ ω(x) } = min_{x∈X} { max_{g∈G} { ⟨x, g⟩ − J_f(g) } + µ ω(x) }.

The primal player predicts the dual player's action based on historical information, and determines her/his action by minimizing the predicted cost:

g̃_t = α_t(g_{t−1} − g_{t−2}) + g_{t−1},
x_t = argmin_{x∈X} { ⟨g̃_t, x⟩ + µ ω(x) + η_t P(x_{t−1}, x) }.

The dual player determines her/his action g_t by maximizing the profit:

g_t = argmin_{g∈G} { ⟨−x_t, g⟩ + J_f(g) + τ_t D_g(g_{t−1}, g) }
    ⟺ x̄_t = (x_t + τ_t x̄_{t−1}) / (1 + τ_t), g_t = ∇f(x̄_t).

SLIDE 20

The algorithm - Gradient extrapolation method (GEM)

GEM: the dual of Nesterov's accelerated gradient method

GEM:
  g̃_t = α_t(g_{t−1} − g_{t−2}) + g_{t−1}
  x_t = M_X(g̃_t, x_{t−1}, η_t)
  g_t = M_G(−x_t, g_{t−1}, τ_t)

NEST:
  x̃_t = α_t(x_{t−1} − x_{t−2}) + x_{t−1}
  g_t = M_G(−x̃_t, g_{t−1}, τ_t)
  x_t = M_X(g_t, x_{t−1}, η_t)

SLIDE 21

RGEM

Adding randomization...

The problem: min_{x∈X} { f(x) + µ ω(x) := (1/m) Σ_{i=1}^m f_i(x) + µ ω(x) }.

GEM
Initialization: x̄_0 = x_0 ∈ X and g_{−1} = g_0 = ∇f(x_0).
for t = 1, 2, . . . , k do
  g̃_t = α_t(g_{t−1} − g_{t−2}) + g_{t−1}
  x_t = M_X(g̃_t, x_{t−1}, η_t)
  x̄_t = (x_t + τ_t x̄_{t−1}) / (1 + τ_t)
  g_t = ∇f(x̄_t)
end for
Output: x̄_k.

SLIDE 22

The algorithm - Random Gradient Extrapolation Method (RGEM)

The problem: min_{x∈X} ψ(x) := (1/m) Σ_{i=1}^m f_i(x) + µ ω(x).

RGEM
Initialization: x_i^0 = x_0 ∈ X, ∀i; y^{−1} = y^0 = 0.  [no exact gradient evaluation]
for t = 1, . . . , k do
  Choose i_t uniformly from {1, . . . , m}
  ỹ^t = y^{t−1} + α_t(y^{t−1} − y^{t−2})
  x^t = M_X((1/m) Σ_{i=1}^m ỹ_i^t, x^{t−1}, η_t)
  x_i^t = (1 + τ_t)^{−1}(x^t + τ_t x_i^{t−1}) if i = i_t; x_i^{t−1} otherwise
  y_i^t = ∇f_i(x_i^t) if i = i_t  [one gradient evaluation]; y_i^{t−1} otherwise
end for
Output: for some θ_t > 0, set x̄^k := (Σ_{t=1}^k θ_t)^{−1} Σ_{t=1}^k θ_t x^t.
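The scheme above can be sketched for the Euclidean setup ω(x) = ‖x‖^2/2, X = R^n, using the parameter choices stated later for the deterministic finite-sum case. Returning the last iterate x^k (rather than the weighted average x̄^k) is a simplification of this sketch, justified by the distance bound on x^k:

```python
import numpy as np

def rgem(grads, x0, mu, L, k, rng):
    """Random Gradient Extrapolation Method for min (1/m) sum_i f_i(x) + (mu/2)||x||^2,
    Euclidean prox, parameters from the deterministic finite-sum theorem."""
    m = len(grads)
    alpha = 1.0 - 1.0 / (m + np.sqrt(m * m + 16.0 * m * L / mu))
    tau = 1.0 / (m * (1.0 - alpha)) - 1.0
    eta = alpha / (1.0 - alpha) * mu
    alpha_t = m * alpha
    x = x0.copy()
    x_local = np.tile(x0, (m, 1))        # x^t_i kept by each agent
    y = np.zeros((m, x0.size))           # y^0_i = 0: no exact gradients needed
    y_prev = np.zeros((m, x0.size))
    for _ in range(k):
        i = rng.integers(m)                               # server activates one agent
        g = (y + alpha_t * (y - y_prev)).mean(axis=0)     # gradient extrapolation
        x = (eta * x - g) / (mu + eta)                    # prox-mapping on R^n
        x_local[i] = (x + tau * x_local[i]) / (1.0 + tau)
        y_prev = y.copy()
        y[i] = grads[i](x_local[i])                       # one gradient evaluation
    return x

# toy instance: f_i(x) = ||x - c_i||^2 / 2 with c = 0, 1, 2 and mu = 1; x* = 0.5
c = [np.array([0.0]), np.array([1.0]), np.array([2.0])]
grads = [lambda x, ci=ci: x - ci for ci in c]
x_out = rgem(grads, np.zeros(1), mu=1.0, L=1.0, k=800, rng=np.random.default_rng(0))
```

Since the iterates contract linearly in expectation, a few hundred single-component gradient evaluations already place x^k near the minimizer of the toy instance.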

SLIDE 23

The algorithm - Random Gradient Extrapolation Method (RGEM)

RGEM from the server and activated agent perspective

The server updates the iterates x^t and calculates the output solution x̄^k (as sum_x / sum_θ). One agent is activated at a time; it updates its local variables x_i^t and uploads the change of its gradient, ∆y, to the server.

SLIDE 24

Optimality of RGEM

RGEM for deterministic finite-sum optimization

Theorem. Let x* be an optimal solution, L̂ = max_{i=1,...,m} L_i, and set
τ_t = 1/(m(1−α)) − 1, η_t = α µ/(1−α), α_t ≡ mα, with α = 1 − 1/(m + √(m^2 + 16m L̂/µ)). Then

E[P(x^k, x*)] ≤ 2∆_{0,σ_0} α^k / µ,
E[ψ(x̄^k) − ψ(x*)] ≤ 16 max{m, L̂/µ} ∆_{0,σ_0} α^{k/2},

where ∆_{0,σ_0} := µ P(x^0, x*) + ψ(x^0) − ψ(x*) + σ_0^2/(mµ), and σ_0 satisfies (1/m) Σ_{i=1}^m ‖∇f_i(x^0)‖_*^2 ≤ σ_0^2.

To obtain a stochastic ǫ-solution, the number of gradient evaluations of the f_i / communication rounds is

O{ (m + √(m L̂/µ)) log(1/ǫ) }  (not improvable, Lan and Zhou 17).

SLIDE 25

Optimality of RGEM

The problem: min_{x∈X} ψ(x) := (1/m) Σ_{i=1}^m E_{ξ_i}[F_i(x, ξ_i)] + µ ω(x).

Assumption. At iteration t, for a given point x_i^t ∈ X, a stochastic first-order (SO) oracle outputs a vector G_i(x_i^t, ξ_i^t) such that

E_ξ[G_i(x_i^t, ξ_i^t)] = ∇f_i(x_i^t), i = 1, . . . , m,
E_ξ[‖G_i(x_i^t, ξ_i^t) − ∇f_i(x_i^t)‖_*^2] ≤ σ^2, i = 1, . . . , m.

RGEM for stochastic finite-sum optimization: the same as RGEM, except that the gradient update step is replaced by a mini-batch of stochastic gradients given by the SO:

y_i^t = (1/B_t) Σ_{j=1}^{B_t} G_i(x_i^t, ξ_{i,j}^t) if i = i_t; y_i^{t−1} otherwise.

SLIDE 26

Optimality of RGEM

RGEM for stochastic finite-sum optimization

Theorem. Let τ_t, η_t and α_t be the same as before and let B_t = ⌈k(1−α)^2 α^{−t}⌉, t = 1, . . . , k. Then

E[P(x^k, x*)] ≤ 2α^k ∆_{0,σ_0,σ} / µ,
E[ψ(x̄^k) − ψ(x*)] ≤ 16 max{m, L̂/µ} ∆_{0,σ_0,σ} α^{k/2},

where ∆_{0,σ_0,σ} := µ P(x^0, x*) + ψ(x^0) − ψ(x*) + (σ_0^2/m + 5σ^2)/µ.

- Communication rounds: O{ (m + √(mL/µ)) log(1/ǫ) }.
- Stochastic gradient evaluations: Õ{ ∆_{0,σ_0,σ}/(µǫ) + m + √(mL/µ) }.

Note: the sampling complexity is asymptotically independent of m.

SLIDE 27

Optimality of RGEM

Advantages of RGEM for distributed learning

RGEM is an enhanced RIG method for distributed learning:
- Requires no exact gradient evaluations of f.
- Involves communication only between the server and the activated agent at each iteration, and tolerates communication failures.
- Possesses a directly accelerated algorithmic scheme.
- Handles stochastic/online optimization - minimization of the generalization risk.
- Optimal O{ (m + √(mL/µ)) log(1/ǫ) } communication complexity.
- Nearly optimal Õ{ ∆_{0,σ_0,σ}/(µǫ) } sampling complexity.

SLIDE 28

Network topology?

Example: Policy evaluation for multi-agent reinforcement learning (Wai, Yang, Wang, Hong 18).

SLIDE 29

Decentralized optimization techniques

Most studies focus on deterministic optimization (e.g., Nedic and Ozdaglar 09; Shi, Ling, Wu, and Yin 15):
- O(1/ǫ) communication rounds and gradient computations.
- O(log(1/ǫ)) communication rounds for unconstrained smooth and strongly convex problems.

For stochastic optimization problems:
- Direct extensions of SGD-type methods (e.g., Duchi, Agarwal, and Wainwright 12).
- O(1/ǫ^2) communication rounds and stochastic (sub)gradient computations.

Question: is SGD still a good algorithm for decentralized stochastic optimization and machine learning?

SLIDE 30

How to handle decentralized structure?

Dual decomposition (explicit):

min_x F(x) := Σ_{i=1}^m f_i(x_i)
s.t. x_1 = x_2 = . . . = x_m, x_i ∈ X_i, ∀i = 1, . . . , m,

where x = [x_1^T, . . . , x_m^T].

SLIDE 31

Background: Laplacian L

Let N_i denote the set of neighbors of agent i: N_i = {j ∈ V | (i, j) ∈ E} ∪ {i}. Then the Laplacian L ∈ R^{m×m} of a graph G = (V, E) is defined by

L_ij = |N_i| − 1 if i = j;  −1 if i ≠ j and (i, j) ∈ E;  0 otherwise.

For example, for the 3-node path graph:

L = [  1  −1   0
      −1   2  −1
       0  −1   1 ]

L1 = 0: the "agreement subspace".
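Building the Laplacian from an edge list and checking the agreement-subspace property L1 = 0 takes only a few lines; a minimal sketch:

```python
import numpy as np

def laplacian(m, edges):
    """Graph Laplacian of an undirected graph on m nodes:
    L_ii = deg(i), L_ij = -1 for each edge (i, j), 0 otherwise."""
    L = np.zeros((m, m))
    for i, j in edges:
        L[i, i] += 1.0
        L[j, j] += 1.0
        L[i, j] -= 1.0
        L[j, i] -= 1.0
    return L

L = laplacian(3, [(0, 1), (1, 2)])   # the 3-agent path graph from the slide
```

The constant vector always lies in the null space, which is exactly why Lx = 0 encodes consensus on a connected graph.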

SLIDE 32

Problem Formulation

min_x F(x) := Σ_{i=1}^m f_i(x_i)  s.t. x_1 = · · · = x_m, x_i ∈ X_i, i = 1, . . . , m

⟺ min_x F(x) := Σ_{i=1}^m f_i(x_i)  s.t. x_i = x_j ∀(i, j) ∈ E, x_i ∈ X_i, i = 1, . . . , m

⟺ (if G is connected)  min_x F(x) := Σ_{i=1}^m f_i(x_i)  s.t. Lx = 0, x_i ∈ X_i, i = 1, . . . , m.

Using the Laplacian (with L := L ⊗ I_d), the consistency constraints can be rewritten compactly, giving the equivalent saddle point form

min_{x∈X^m} F(x) + max_{y∈R^{md}} ⟨Lx, y⟩.

SLIDE 33

Decentralized Primal-Dual (DPD): Vector Form

min_{x∈X^m} F(x) + max_{y∈R^{md}} ⟨Lx, y⟩,  x := [x_1^⊤ · · · x_m^⊤]^⊤,  y := [y_1^⊤ · · · y_m^⊤]^⊤.

Let x^0 = x^{−1} ∈ X^m, y^0 ∈ R^{md}, and {α_k}, {τ_k}, {η_k}, {θ_k} be given. For k = 1, . . . , N, update z^k = (x^k, y^k):

x̃^k = α_k(x^{k−1} − x^{k−2}) + x^{k−1}
y^k = argmin_{y∈R^{md}} ⟨−Lx̃^k, y⟩ + (τ_k/2)‖y − y^{k−1}‖^2
x^k = argmin_{x∈X^m} ⟨Ly^k, x⟩ + F(x) + (η_k/2)‖x^{k−1} − x‖^2

Return z̄^N = (Σ_{k=1}^N θ_k)^{−1} Σ_{k=1}^N θ_k z^k.
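The vector-form iteration can be sketched as follows, under simplifying assumptions: constant stepsizes, θ_k = 1 (plain averaging), and a caller-supplied helper `prox_primal` (a hypothetical name) that solves the primal subproblem. The toy consensus instance at the bottom, with quadratic f_i on the 3-agent path graph, is also an assumption of the sketch:

```python
import numpy as np

def dpd(Lap, prox_primal, x0, y0, N, alpha, tau, eta):
    """Decentralized primal-dual iteration, vector form (sketch with constant
    stepsizes and theta_k = 1). prox_primal(w, x_prev, eta) must solve
    argmin_x <w, x> + F(x) + (eta/2)||x_prev - x||^2."""
    x_prev2 = x_prev = x0.copy()
    y = y0.copy()
    x_sum = np.zeros_like(x0)
    for _ in range(N):
        x_tilde = alpha * (x_prev - x_prev2) + x_prev   # primal extrapolation
        y = y + (Lap @ x_tilde) / tau                   # dual step on <Lx, y>
        x = prox_primal(Lap @ y, x_prev, eta)           # primal prox step
        x_prev2, x_prev = x_prev, x
        x_sum += x
    return x_sum / N                                    # ergodic average

# toy consensus instance: f_i(x_i) = (x_i - c_i)^2 / 2 on the 3-agent path graph
Lap = np.array([[1.0, -1.0, 0.0], [-1.0, 2.0, -1.0], [0.0, -1.0, 1.0]])
c = np.array([0.0, 1.0, 2.0])
prox = lambda w, xp, eta: (c + eta * xp - w) / (1.0 + eta)   # closed-form argmin
xbar = dpd(Lap, prox, np.zeros(3), np.zeros(3), N=5000, alpha=1.0, tau=4.0, eta=4.0)
```

Here τη = 16 exceeds ‖L‖^2 = 9 for the path graph, a standard primal-dual stepsize condition, and the averaged iterates approach the consensus minimizer, the mean of the c_i.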

SLIDE 34

DPD: Agent i’s point of view

Let x^0 = x^{−1} ∈ X^m, y^0 ∈ R^{md}, and {α_k}, {τ_k}, {η_k}, {θ_k} be given. For k = 1, . . . , N, update z_i^k = (x_i^k, y_i^k):

x̃_i^k = α_k(x_i^{k−1} − x_i^{k−2}) + x_i^{k−1}
v_i^k = Σ_{j∈N_i} L_ij x̃_j^k
y_i^k = y_i^{k−1} + (1/τ_k) v_i^k
w_i^k = Σ_{j∈N_i} L_ij y_j^k
x_i^k = argmin_{x_i∈X_i} ⟨w_i^k, x_i⟩ + f_i(x_i) + (η_k/2)‖x_i^{k−1} − x_i‖^2

Return z̄^N = (Σ_{k=1}^N θ_k)^{−1} Σ_{k=1}^N θ_k z^k.

SLIDE 35

Decentralized Communication Sliding (DCS)

Q: Is the subproblem

x_i^k = argmin_{x_i∈X_i} ⟨w_i^k, x_i⟩ + f_i(x_i) + (η_k/2)‖x_i^{k−1} − x_i‖^2

always easy to solve? A: No - solve it iteratively using linearizations of f_i(x_i).

Let u^0 = û^0 = x_i^{k−1} and {β_t}, {λ_t} be given. For t = 1, . . . , T_k:

h^{t−1} ∈ ∂f_i(u^{t−1})
u^t = argmin_{u∈X_i} ⟨h^{t−1} + w_i^k, u⟩ + (η_k/2)‖x_i^{k−1} − u‖^2 + (η_k β_t/2)‖u^{t−1} − u‖^2

Return x_i^k = u^{T_k} and x̂_i^k = (Σ_{t=1}^{T_k} λ_t)^{−1} Σ_{t=1}^{T_k} λ_t u^t.

The same w_i^k is used throughout, so communication is skipped! There are two output points, x_i^k and x̂_i^k.
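When X_i = R^d (an assumption made here for simplicity), each inner step has a closed form, so the whole inner loop is a handful of vector operations. This sketch uses β_t = t/2 and λ_t = t + 1, the choices stated in the convergence theorem:

```python
import numpy as np

def slide_inner(subgrad, w, x_prev, eta, T):
    """Communication-sliding inner loop: approximately solves
    argmin_u <w, u> + f_i(u) + (eta/2)||x_prev - u||^2
    by T linearized prox steps, with beta_t = t/2 and lambda_t = t + 1.
    Returns (x_i^k, xhat_i^k): the last iterate and the weighted average."""
    u = x_prev.copy()
    u_hat = np.zeros_like(u)
    lam_sum = 0.0
    for t in range(1, T + 1):
        beta, lam = t / 2.0, t + 1.0
        h = subgrad(u)                      # (sub)gradient of f_i at u^{t-1}
        # closed form of argmin_u <h + w, u> + (eta/2)||x_prev - u||^2
        #                        + (eta*beta/2)||u^{t-1} - u||^2 on R^d
        u = (eta * x_prev + eta * beta * u - (h + w)) / (eta * (1.0 + beta))
        u_hat += lam * u
        lam_sum += lam
    return u, u_hat / lam_sum
```

On the toy subproblem f_i(u) = u^2/2 with w = 0, x_prev = 1, η = 1, the exact minimizer of u^2/2 + (1 − u)^2/2 is 0.5, and both outputs settle there without any further communication.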

SLIDE 36

Decentralized Communication Sliding (DCS)

Let x^0 = x^{−1} ∈ X^m, y^0 ∈ R^{md}, and {α_k}, {τ_k}, {η_k}, {θ_k}, {T_k} be given. For k = 1, . . . , N, update z_i^k = (x̂_i^k, y_i^k):

x̃_i^k = α_k(x̂_i^{k−1} − x_i^{k−2}) + x_i^{k−1}
v_i^k = Σ_{j∈N_i} L_ij x̃_j^k
y_i^k = argmin_{y_i∈R^d} ⟨−v_i^k, y_i⟩ + (τ_k/2)‖y_i − y_i^{k−1}‖^2
w_i^k = Σ_{j∈N_i} L_ij y_j^k
(x_i^k, x̂_i^k) = inner loop, run for T_k iterations

SLIDE 37

Convergence of DCS

Theorem. Let x̂^N = (1/N) Σ_{k=1}^N x̂^k and let x* be an optimal solution. If α_k = θ_k = 1, η_k = 2L, τ_k = L, and T_k = ⌈m(M^2 + σ^2)N / (L^2 D̃)⌉, then

F(x̂^N) − F(x*) ≤ (L/N) [ (3/2)‖x^0 − x*‖^2 + (1/2)‖y^0‖^2 + 4D̃ ],
‖Lx̂^N‖ ≤ (L/N) [ 3‖x^0 − x*‖^2 + 8D̃ + 4‖y* − y^0‖ ].

- O(1/ǫ) iterations for an ǫ-optimal and ǫ-feasible solution.
- # of required communications is also O(1/ǫ).
- # of subgradient evaluations is O(1/ǫ^2).

SLIDE 38

Problem formulation

Decentralized stochastic optimization: f_i(x) := E_{ξ_i}[F_i(x; ξ_i)], where ξ_i models agent i's uncertainty and the distribution P(ξ_i) is not known. Only noisy first-order information G_i(·, ξ_i^t) is available:

E[G_i(u^t, ξ_i^t)] = f_i′(u^t) ∈ ∂f_i(u^t),
E[‖G_i(u^t, ξ_i^t) − f_i′(u^t)‖_*^2] ≤ σ^2.

SLIDE 39

The algorithm

The primal subproblem

x_i^k = argmin_{x_i∈X_i} ⟨w_i^k, x_i⟩ + f_i(x_i) + (η_k/2)‖x_i^{k−1} − x_i‖^2

is solved with noisy subgradients. Let u^0 = û^0 = x_i^{k−1} and {β_t}, {λ_t} be given. For t = 1, . . . , T_k:

h^{t−1} = G_i(u^{t−1}, ξ_i^{t−1})
u^t = argmin_{u∈X_i} ⟨h^{t−1} + w_i^k, u⟩ + (η_k/2)‖x_i^{k−1} − u‖^2 + (η_k β_t/2)‖u^{t−1} − u‖^2

Return x_i^k = u^{T_k} and x̂_i^k = (Σ_{t=1}^{T_k} λ_t)^{−1} Σ_{t=1}^{T_k} λ_t u^t.

Observations: the same w_i^k is used, so communication is skipped! There are two output points, x_i^k and x̂_i^k.

SLIDE 40

Convergence results

Convergence of SDCS

Theorem. Let x̂^N = (1/N) Σ_{k=1}^N x̂^k. If β_t = t/2, λ_t = t + 1, α_k = θ_k = 1, η_k = 2L, τ_k = L, and T_k = ⌈m(M^2 + σ^2)N / (L^2 D̃)⌉ for some D̃ > 0, then

E[F(x̂^N) − F(x*)] ≤ (L/N) [ (3/2)‖x^0 − x*‖^2 + (1/2)‖y^0‖^2 + 4D̃ ],
E[‖Lx̂^N‖] ≤ (L/N) [ 3‖x^0 − x*‖^2 + 8D̃ + 4‖y* − y^0‖ ].

Conclusion:
- O(1/ǫ) iterations for an ǫ-optimal and ǫ-feasible solution.
- # of required communications is also O(1/ǫ).
- # of stochastic subgradient evaluations is O(1/ǫ^2).

SLIDE 41

Convergence results

Summary of convergence results

Table: Complexity for obtaining an ǫ-optimal and ǫ-feasible solution

Algorithm (problem type)   | # of communications       | # of subgradient evaluations
DCS: convex                | O(L D^2_{X^m} / ǫ)        | O(m M^2 D^2_{X^m} / ǫ^2)
SDCS: convex               | O(L D^2_{X^m} / ǫ)        | O(m (M^2 + σ^2) D^2_{X^m} / ǫ^2)
DCS: strongly convex       | O(√(µ D^2_{X^m} / ǫ))     | O(m M^2 / (µǫ))
SDCS: strongly convex      | O(√(µ D^2_{X^m} / ǫ))     | O(m (M^2 + σ^2) / (µǫ))
SLIDE 42

Comparisons with centralized SGD

Assumptions: in the worst case, M_f ≤ mM, µ_f ≥ mµ, D^2_X / D^2_{X^m} = O(1/m), and σ̃^2 ≤ mσ^2.

Table: # of stochastic subgradient evaluations

Problem type      | SDCS (individual agent)            | SGD
Convex            | O(m (M^2 + σ^2) D^2_{X^m} / ǫ^2)   | O((M_f^2 + σ̃^2) D^2_X / ǫ^2)
Strongly convex   | O(m (M^2 + σ^2) / (µǫ))            | O((M_f^2 + σ̃^2) / (µ_f ǫ))

Conclusion: the sampling complexity is comparable to that of centralized SGD under reasonable assumptions, and hence not improvable in general.

SLIDE 43

Numerical results

Numerical example

Test problem: decentralized linear SVM

min_x Σ_{i=1}^m E_{(u_i,v_i)}[max{0, 1 − v_i⟨x_i, u_i⟩}]
s.t. Lx = 0.

- Network structure: connected graph with 100 nodes.
- Data set: the real data set "ijcnn1" from LIBSVM.

Figure: The underlying decentralized network

SLIDE 44

Numerical results

Comparing with distributed dual averaging

Conclusion: SDCS saves inter-node communication rounds while preserving the same order of sampling complexity.

SLIDE 45

Summary

Random gradient extrapolation for federated learning over networks:
- Optimal O((m + √(mL/µ)) log(1/ǫ)) communication complexity.
- Nearly optimal O(1/ǫ) sampling complexity.

Stochastic communication sliding for decentralized learning over networks:
- # of stochastic subgradient evaluations is comparable to centralized SGD.
- # of communication rounds is negligible in comparison with # of stochastic subgradient evaluations.

SLIDE 46

Thanks!

- G. Lan and Y. Zhou, "Random gradient extrapolation for distributed and stochastic optimization", SIAM Journal on Optimization, 28(4), 2753-2782, 2018.
- G. Lan, S. Lee and Y. Zhou, "Communication-efficient algorithms for decentralized and stochastic optimization", Mathematical Programming, to appear.
