
Walkman: A Communication-Efficient Random-Walk Algorithm for Decentralized Optimization

Xianghui Mao⋄ Kun Yuan∗ Yubin Hu⋄ Yuantao Gu⋄ Ali H. Sayed† Wotao Yin‡

⋄Tsinghua EE ∗UCLA ECE †EPFL Engineering ‡UCLA Math

1 / 36


Outline

  • 1. Decentralized optimization
  • 2. The Walkman method
  • 3. Convergence
  • 4. Communication analysis
  • 5. Simulation results

2 / 36


Outline

  • 1. Decentralized optimization
  • 2. The Walkman method
  • 3. Convergence
  • 4. Communication analysis
  • 5. Simulation results

3 / 36


Decentralized optimization

  • Consider a decentralized optimization problem over a network (V, E):

      min_{x ∈ ℝ^p}  r(x) + (1/n) Σ_{i=1}^n f_i(x),   (1)

    where n is the number of nodes.
  • Node i has access to f_i(x). All nodes can access r(x).
  • Both f_i(x) and r(x) can be non-convex.

4 / 36
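For concreteness, problem (1) is easy to instantiate in code. A minimal sketch, assuming illustrative quadratic local losses f_i(x) = 0.5‖A_i x − b_i‖² and r = 0 (the sizes, seed, and data below are hypothetical, not from the slides):

```python
import numpy as np

# Illustrative instance of problem (1): node i holds a quadratic local loss
# f_i(x) = 0.5 * ||A_i x - b_i||^2, and the shared regularizer r is zero.
rng = np.random.default_rng(0)
n, p = 5, 3                                   # small sizes chosen for illustration
A = [rng.standard_normal((4, p)) for _ in range(n)]
b = [rng.standard_normal(4) for _ in range(n)]

def f(i, x):
    # local loss known only to node i
    return 0.5 * np.linalg.norm(A[i] @ x - b[i]) ** 2

def objective(x, r=lambda v: 0.0):
    # r(x) + (1/n) * sum_i f_i(x), exactly the form of (1)
    return r(x) + sum(f(i, x) for i in range(n)) / n
```

The point of the decentralized setting is that no single node can evaluate `objective` on its own: each term `f(i, x)` lives on a different node.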


Gossip-based approaches

Figure: Gossip-based communication

  • Each agent communicates with all, or a random subset of, its direct neighbors
  • Prior methods: DGD [1], diffusion [2], D-ADMM [3, 4], EXTRA [5], PG-EXTRA [6], DIGing [7], Exact diffusion [8], NIDS [9], ...
  • Convergence rates are comparable to standard centralized optimization.
  • Every agent communicates ⇒ per-iteration communication cost of O(n) to O(n²).

5 / 36


Random-walk approaches

Figure: A random walk (1, 6, 9, 1, 2, 6, 5, ...)

  • A walker moves x through the network and updates it; x^k is its value at iteration k.
  • Agent i, on receiving x, updates it using its local (sub)gradient ∇f_i.
  • O(1) communication per iteration.
  • Prior works [10–13] require decaying step-sizes, so they are slow.

6 / 36


Proposed method: Walkman

  • Walkman is a random-walk (RW) algorithm
  • Exact convergence with a fixed step-size; much faster than existing RW methods
  • Convergence guarantees established for non-convex and convex scenarios
  • More communication-efficient than state-of-the-art methods
  • Can escape from saddle points on tested non-convex problems

7 / 36


Walkman communication efficiency

  • Communication complexity of various algorithms for decentralized least squares:

      Algorithm            | Communication Complexity
      ---------------------|------------------------------------------
      Walkman (proposed)   | O( ln(1/ε) · n ln³(n) / (1 − λ₂(P))² )
      D-ADMM [14]          | O( ln(1/ε) · m / (1 − λ₂(P))^{1/2} )
      EXTRA [5]            | O( ln(1/ε) · m / (1 − λ₂(P)) )
      Exact diffusion [8]  | O( ln(1/ε) · m / (1 − λ₂(P)) )

  • Walkman is the most communication-efficient when

      λ₂(P) ≤ 1 − 1/m^{2/3},

    where λ₂(P) is a measure of network connectivity and m is the number of edges.

8 / 36


Outline

  • 1. Decentralized optimization
  • 2. The Walkman method
  • 3. Convergence
  • 4. Communication analysis
  • 5. Simulation results

9 / 36


Problem reformulation

  • Recall the problem

      min_{x ∈ ℝ^p}  r(x) + (1/n) Σ_{i=1}^n f_i(x)

  • Create local variables y_i and make them all equal to x.
  • Defining Y = col{y_1, y_2, …, y_n} ∈ ℝ^{np} and F(Y) = Σ_{i=1}^n f_i(y_i), we have

      min_{x, Y}  r(x) + (1/n) F(Y),   subject to  1 ⊗ x − Y = 0,   (2)

    where 1 = [1 1 … 1]^T ∈ ℝ^n and ⊗ is the Kronecker product.
  • The above two problems are equivalent.

10 / 36


Standard ADMM

  • The augmented Lagrangian function of problem (2) is

      L_β(x, Y; Z) = r(x) + (1/n) F(Y) + ⟨Z, 1 ⊗ x − Y⟩ + (β/2) ‖1 ⊗ x − Y‖²,

    where Z ∈ ℝ^{np} is the dual variable (Lagrange multiplier).
  • The standard ADMM to solve (2) is

      x̄^{k+1} = (1/n) Σ_{i=1}^n ( y_i^k − z_i^k/β ),
      x^{k+1} = prox_{(1/β)r}( x̄^{k+1} ),
      y_i^{k+1} = prox_{(1/β)f_i}( x^{k+1} + z_i^k/β ),   ∀ i ∈ V,
      z_i^{k+1} = z_i^k + β( x^{k+1} − y_i^{k+1} ),   ∀ i ∈ V.

  • Step 1 uses a reduce operation, implementable in a distributed 1-to-N setting but not in our decentralized setting.

11 / 36
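The four ADMM steps above can be sketched directly. A minimal sketch, assuming r = 0 (so the prox of r is the identity) and quadratic f_i as in the later least-squares experiment; the sizes, seed, and β are illustrative:

```python
import numpy as np

# Standard ADMM of problem (2), assuming r = 0 and quadratic
# f_i(y) = 0.5*||A_i y - b_i||^2, whose prox is a small linear solve.
rng = np.random.default_rng(1)
n, p, beta = 5, 3, 10.0
A = [rng.standard_normal((10, p)) for _ in range(n)]
b = [rng.standard_normal(10) for _ in range(n)]
Y = np.zeros((n, p))
Z = np.zeros((n, p))

def prox_f(i, v):
    # argmin_y f_i(y) + (beta/2) * ||y - v||^2
    return np.linalg.solve(A[i].T @ A[i] + beta * np.eye(p), A[i].T @ b[i] + beta * v)

for k in range(500):
    x = (Y - Z / beta).mean(axis=0)        # xbar^{k+1}; x^{k+1} = xbar since r = 0
    for i in range(n):                     # this loop is the reduce over all nodes
        Y[i] = prox_f(i, x + Z[i] / beta)  # y-update
        Z[i] = Z[i] + beta * (x - Y[i])    # dual update
```

The averaging step is exactly the reduce operation the slide points out: it touches every node on every iteration, which is what Walkman removes.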


Derive Walkman

  • Goal: update x with only one y_i at a time.
  • To decentralize the computation of x̄^{k+1}, we propose

      x̄^{k+1} = (1/n) Σ_{i=1}^n ( y_i^k − z_i^k/β ),   x^{k+1} = prox_{(1/β)r}( x̄^{k+1} ),   (4)

      y_i^{k+1} = prox_{(1/β)f_i}( x^{k+1} + z_i^k/β ) if i = i_k,   y_i^k otherwise,   (5)

      z_i^{k+1} = z_i^k + β( x^{k+1} − y_i^{k+1} ) if i = i_k,   z_i^k otherwise.   (6)

  • A walker will carry x̄ while visiting a sequence of nodes.
12 / 36

  • Recall: among y_1 − z_1/β, …, y_n − z_n/β, only y_{i_k} − z_{i_k}/β is updated.
  • The computation of x̄^{k+2} is therefore equivalent to

      x̄^{k+2} = x̄^{k+1} + (1/n)( y_{i_k}^{k+1} − z_{i_k}^{k+1}/β ) − (1/n)( y_{i_k}^k − z_{i_k}^k/β ),   (7)

    where x̄^{k+1} arrives from the neighbor and the remaining terms are local information, so the computation can be conducted locally. One Walkman iteration consists of steps (7), (4), (5) (or its linearized variant (9)), and (6).

13 / 36

  • It is expensive to solve the subproblem

      min_y  f_i(y) + (β/2) ‖ y − ( x^{k+1} + z_i^k/β ) ‖²   (8)

  • When (8) is not easy to solve, we can linearize it and update y_i cheaply:

      y_i^{k+1} = x^{k+1} + (1/β) z_i^k − (1/β) ∇f_i(y_i^k) if i = i_k,   y_i^k otherwise.   (9)

14 / 36
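Putting (7), (4), (5), (6) together gives a runnable sketch of one Walkman run. This is a minimal illustration, assuming r = 0, quadratic f_i, a 7-node cycle graph, and a uniform random walk; the data, seed, and iteration count are illustrative, with β set above the theory's 2σ∗_max + 2 threshold:

```python
import numpy as np

# Walkman sketch combining steps (7), (4), (5), (6), assuming r = 0 and
# quadratic f_i(y) = 0.5*||A_i y - b_i||^2; data, graph, and beta are illustrative.
rng = np.random.default_rng(2)
n, p = 7, 3                       # odd cycle -> the walk is irreducible and aperiodic
A = [0.4 * rng.standard_normal((10, p)) for _ in range(n)]
b = [rng.standard_normal(10) for _ in range(n)]
sigma_max = max(np.linalg.norm(Ai.T @ Ai, 2) for Ai in A)
beta = 2.0 * sigma_max + 3.0      # satisfies beta > 2*sigma*_max + 2 from the theory
neighbors = {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}

# prox_{(1/beta) f_i} is a fixed linear map per node; precompute it.
M = [np.linalg.inv(A[i].T @ A[i] + beta * np.eye(p)) for i in range(n)]
c = [A[i].T @ b[i] for i in range(n)]

Y = np.zeros((n, p))
Z = np.zeros((n, p))
xbar = (Y - Z / beta).mean(axis=0)            # computed once, then maintained via (7)
i_k = 0
for k in range(60000):
    x = xbar                                  # step (4): prox of r is the identity here
    old = (Y[i_k] - Z[i_k] / beta) / n        # node i_k's old contribution to xbar
    Y[i_k] = M[i_k] @ (c[i_k] + beta * x + Z[i_k])   # prox step (5), v = x + z/beta
    Z[i_k] += beta * (x - Y[i_k])                    # dual step (6)
    xbar += (Y[i_k] - Z[i_k] / beta) / n - old       # incremental average (7)
    i_k = int(rng.choice(neighbors[i_k]))            # walker moves to a random neighbor
```

Only node i_k computes anything at step k, and only x̄ travels over one edge, hence the O(1) communication per iteration claimed earlier.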


Walkman [15]

  • A walker carries x̄^k around the network
  • Each local variable y_i^k is expected to converge to x⋆
  • The node activation is Markovian: node i_{k+1} must be a neighbor of i_k.

15 / 36


Outline

  • 1. Decentralized optimization
  • 2. The Walkman method
  • 3. Convergence
  • 4. Communication analysis
  • 5. Simulation results

16 / 36


Assumptions

Assumption (A1: Random walk)

The random walk (i_k)_{k≥0}, i_k ∈ V, forms an irreducible, aperiodic Markov chain with transition probability matrix P and stationary distribution π. This guarantees that each agent is visited infinitely often.

Assumption (A2: Coerciveness)

The objective function r(x) + (1/n) Σ_{i=1}^n f_i(x) is bounded from below over ℝ^p and is coercive over ℝ^p, that is, r(x^k) + (1/n) Σ_{i=1}^n f_i(x^k) → ∞ for any sequence x^k ∈ ℝ^p with ‖x^k‖ → ∞. Hence a finite minimal function value exists, and bounded function values imply bounded sequences x^k.

17 / 36


Assumptions

Assumption (A3: fi smoothness)

Each fi(x) is L-Lipschitz differentiable

Assumption (A4: r is semiconvex)

Function r(x) is γ-semiconvex, that is, r(·) + (γ/2) ‖·‖² is convex.

18 / 36


Convergence property

Theorem

Under A1–A4, for β > max{γ, 2L + 2} (resp. β > max{γ, 2L² + L + 2}), any limit point (x∗, Y∗, Z∗) of the sequence (x^k, Y^k, Z^k) generated by Walkman with prox_{f_i} (resp. ∇f_i) satisfies x∗ = y_i∗, i = 1, …, n, where x∗ is a stationary point of (1) with probability 1, that is,

    Pr( 0 ∈ ∂r(x∗) + (1/n) Σ_{i=1}^n ∇f_i(x∗) ) = 1.

If the objective of (1) is convex, then x∗ is a minimizer.

Implication: Walkman almost surely converges to a stationary point.

19 / 36


Convergence rate

  • We examine the convergence rate for decentralized least squares:

      min_x  (1/(2n)) Σ_{i=1}^n ‖ A_i x − b_i ‖²,

    a special case of problem (1) with r = 0.
  • Node i possesses A_i and b_i.
  • We need the mixing time to characterize the convergence rate.

20 / 36


Mixing time

  • For δ > 0, the mixing time [16, Chapter 11] is defined as the smallest integer τ(δ) such that

      | [P^{τ(δ)}]_{ij} − π_j | ≤ δ π∗,   ∀ i, j ∈ V,   (10)

    where π∗ := min_{i ∈ V} π_i.
  • After τ(δ) steps, each agent j is visited with probability ≈ π_j.
  • Inequality (10) is guaranteed when

      τ(δ) := ⌈ ( 1/(1 − λ₂(P)) ) ln( √2 / (δ π∗) ) ⌉,   (11)

    where λ₂(P) := sup{ ‖f^T P‖ / ‖f‖ : f^T 1 = 0, f ∈ ℝ^n }.

21 / 36
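Definition (10) and bound (11) are easy to check numerically. A small sketch, assuming a lazy random walk on a 10-node cycle (the laziness is just one way to make the chain aperiodic, as A1 requires; the sizes and δ are illustrative):

```python
import numpy as np

# Check definition (10) and bound (11) for a lazy random walk on a cycle.
n = 10
P = np.zeros((n, n))
for i in range(n):
    P[i, i] = 0.5                 # lazy self-loop -> aperiodic chain
    P[i, (i - 1) % n] = 0.25
    P[i, (i + 1) % n] = 0.25

pi = np.full(n, 1.0 / n)          # stationary distribution: P is symmetric, doubly stochastic
lam2 = np.sort(np.abs(np.linalg.eigvalsh(P)))[-2]   # lambda_2(P) for symmetric P

delta = 0.1
tau = int(np.ceil(np.log(np.sqrt(2) / (delta * pi.min())) / (1.0 - lam2)))  # bound (11)

# (10): after tau steps, every row of P^tau is within delta * pi_min of pi.
Pt = np.linalg.matrix_power(P, tau)
assert np.abs(Pt - pi).max() <= delta * pi.min()
```

For a symmetric P the supremum in the slide's definition of λ₂(P) is just the second-largest eigenvalue magnitude, which is why `eigvalsh` suffices here.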


Convergence rate

Theorem

Under A1, for β > 2σ∗_max + 2 with σ∗_max := max_i σ_max(A_i^T A_i), we have linear convergence:

    E ‖Y^t − Y⋆‖² ≤ ( 1 + n(1 − δ) π∗ ν / (4 β² τ(δ)) )^{−⌊ t/τ(δ) ⌋} C₀,   ∀ t ≥ 0,

where ν = (n − 1)(β − σ∗_max)/n², and C₀ is a constant depending only on A₁, …, A_n, b₁, …, b_n, and β.

The quantity τ(δ) plays the role of the number of iterations in an epoch. Walkman solves the least-squares problem at a linear convergence rate.

22 / 36


Outline

  • 1. Decentralized optimization
  • 2. The Walkman method
  • 3. Convergence
  • 4. Communication analysis
  • 5. Simulation results

23 / 36


Communication complexity

  • For simplicity, assume P is a symmetric real matrix modeling the gossip matrix of an undirected graph.
  • Communication complexity of Walkman:

      O( [ ln(n/ε) / ln( 1 + (1 − δ)π∗/τ(δ) ) ] · τ(δ) ),

    where the first factor is the number of epochs and τ(δ) is the communication per epoch.
  • Substituting τ(δ) with (11):

      O( [ ln(n/ε) / ln( 1 + (1 − λ₂(P))/(2n ln(2n)) ) ] · ln(n)/(1 − λ₂(P)) ).

  • The number of edges m is not explicitly involved.

24 / 36


Communication comparison

      Algorithm            | Communication Complexity
      ---------------------|------------------------------------------
      Walkman (proposed)   | O( ln(1/ε) · n ln³(n) / (1 − λ₂(P))² )
      D-ADMM [14]          | O( ln(1/ε) · m / (1 − λ₂(P))^{1/2} )
      EXTRA [5]            | O( ln(1/ε) · m / (1 − λ₂(P)) )
      Exact diffusion [8]  | O( ln(1/ε) · m / (1 − λ₂(P)) )

  • Walkman is more communication-efficient when:

      λ₂(P) ≤ 1 − n^{2/3} [ln(n)]² / m^{2/3} ≈ 1 − (n/m)^{2/3},

    where the approximation holds for ln(n) ≪ n, with the ln(n) factors ignored.
  • When the graph is moderately well connected, Walkman is more communication-efficient.

25 / 36


Communication comparison on complete graph

  • Complete graph: m = O(n²), λ₂(P) = 0.

      Algorithm            | Communication Complexity
      ---------------------|--------------------------
      Walkman (proposed)   | O( ln(1/ε) · n ln³(n) )
      D-ADMM [14]          | O( ln(1/ε) · n² )
      EXTRA [5]            | O( ln(1/ε) · n² )
      Exact diffusion [8]  | O( ln(1/ε) · n² )

26 / 36


Communication comparison on random graph

  • Random graphs [17], G(n, p): an n-node graph in which each edge appears independently with probability p ∈ (0, 1). Set P_{ij} = 1/d_max if nodes i and j are connected and 0 otherwise, where d_max is the maximum degree, and P_{ii} = 1 − Σ_{j≠i} P_{ij}.
  • E[m] = p(n² − n)/2 = O(n²), and 1 − λ₂(P) = O(1).

      Algorithm            | Communication Complexity
      ---------------------|--------------------------
      Walkman (proposed)   | O( ln(1/ε) · n ln³(n) )
      D-ADMM [14]          | O( ln(1/ε) · n² )
      EXTRA [5]            | O( ln(1/ε) · n² )
      Exact diffusion [8]  | O( ln(1/ε) · n² )

27 / 36
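The max-degree weighting on this slide is straightforward to construct. A minimal sketch, assuming one illustrative G(20, 0.5) draw (n, p, and the seed are arbitrary choices, not from the slides):

```python
import numpy as np

# Build the gossip matrix of slide 27 for one G(n, p) draw:
# P_ij = 1/d_max on edges, 0 otherwise, and P_ii = 1 - sum_{j != i} P_ij.
rng = np.random.default_rng(3)
n, p_edge = 20, 0.5
upper = np.triu(rng.random((n, n)) < p_edge, 1)   # each edge appears w.p. p_edge
adj = upper | upper.T                              # symmetric adjacency, no self-loops

d_max = adj.sum(axis=1).max()
P = adj / d_max                                    # off-diagonal entries
np.fill_diagonal(P, 1.0 - P.sum(axis=1))           # remaining mass on the diagonal

lam2 = np.sort(np.abs(np.linalg.eigvalsh(P)))[-2]  # lambda_2(P) for symmetric P
gap = 1.0 - lam2                                   # spectral gap entering tau(delta)
```

The resulting P is symmetric and row-stochastic, so its stationary distribution is uniform and `gap` is the 1 − λ₂(P) quantity appearing in the complexity table.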


Communication comparison on cycle graph

  • Cycle graph: m = O(n), 1 − λ₂(P) = O(1/n²).

      Algorithm            | Communication Complexity
      ---------------------|--------------------------
      Walkman (proposed)   | O( ln(1/ε) · n⁵ ln³(n) )
      D-ADMM [14]          | O( ln(1/ε) · n² )
      EXTRA [5]            | O( ln(1/ε) · n³ )
      Exact diffusion [8]  | O( ln(1/ε) · n³ )

28 / 36


Outline

  • 1. Decentralized optimization
  • 2. The Walkman method
  • 3. Convergence
  • 4. Communication analysis
  • 5. Simulation results

29 / 36


Convex Problem: Least Squares

      min  (1/n) Σ_{i=1}^n (1/2) ‖ A_i x − b_i ‖²

Figure: ‖Y^k − Y∗‖²/n versus communication cost and running time, for Walkman (11b), Walkman (11b'), D-ADMM, EXTRA, Exact Diffusion, and RW Incremental with constant and decaying stepsizes.

  • n = 50 nodes are uniformly placed in a 30 × 30 square; communication radius r = 15.
  • A_i ∈ ℝ^{5×10}, x ∈ ℝ^{10}, and b_i := A_i x_0 + v_i.

30 / 36


Convex Problem: Sparsity Inducing Logistic Regression

      min_{x ∈ ℝ^p}  λ ‖x‖₁ + (1/(nb)) Σ_{i=1}^n Σ_{j=1}^b log( 1 + exp( −y_{ij} v_{ij}^T x ) )

Figure: ‖Y^k − Y∗‖²/n versus communication cost and running time, for Walkman (11b), Walkman (11b'), PG-EXTRA, and RW Incremental with constant and decaying stepsizes.

  • v_{ij} ∈ ℝ⁵ is the feature and y_{ij} ∈ {−1, +1} is the label; b = 10.
  • v_{ij} ∼ N(0, 1).

31 / 36


Nonconvex Problem: Nonnegative Principal Component Analysis

      min_{x : ‖x‖ ≤ 1, x_i ≥ 0 ∀ i ∈ {1, …, p}}  (1/n) Σ_{i=1}^n −x^T ( (1/b) Σ_{j=1}^b y_{ij} y_{ij}^T ) x

Figure: optimality gap and f(x^k) − f(x∗) versus communication cost, running time, and iterations, for Walkman (11b), Walkman (11b'), PG-EXTRA, and RW Incremental with constant and decaying stepsizes.

  • Walkman escapes from the saddle point and reaches a lower function value.

32 / 36


Summary

  • Walkman converges exactly to a stationary point with a fixed step-size
  • Walkman is more communication-efficient than state-of-the-art methods
  • Random-walk algorithms have great potential to save communication
  • Acknowledgements: NSF, NSFC
  • Report: https://arxiv.org/pdf/1804.06568.pdf

33 / 36


References:

[1] A. Nedić and A. Ozdaglar. Distributed subgradient methods for multi-agent optimization. IEEE Transactions on Automatic Control, 54(1):48–61, 2009.
[2] A. H. Sayed. Adaptive networks. Proceedings of the IEEE, 102(4):460–497, April 2014.
[3] G. Mateos, J. A. Bazerque, and G. B. Giannakis. Distributed sparse linear regression. IEEE Transactions on Signal Processing, 58(10):5262–5276, 2010.
[4] J. F. Mota, J. M. Xavier, P. M. Aguiar, and M. Püschel. D-ADMM: A communication-efficient distributed algorithm for separable optimization. IEEE Transactions on Signal Processing, 61(10):2718–2723, 2013.
[5] W. Shi, Q. Ling, G. Wu, and W. Yin. EXTRA: An exact first-order algorithm for decentralized consensus optimization. SIAM Journal on Optimization, 25(2):944–966, 2015.
[6] W. Shi, Q. Ling, G. Wu, and W. Yin. A proximal gradient algorithm for decentralized composite optimization. IEEE Transactions on Signal Processing, 63(22):6013–6023, 2015.
[7] A. Nedić, A. Olshevsky, and W. Shi. Achieving geometric convergence for distributed optimization over time-varying graphs. SIAM Journal on Optimization, 27(4):2597–2633, 2017.

34 / 36


[8] K. Yuan, B. Ying, X. Zhao, and A. H. Sayed. Exact diffusion for distributed optimization and learning – Part I: Algorithm development. To appear in IEEE Transactions on Signal Processing, 2018.
[9] Z. Li, W. Shi, and M. Yan. A decentralized proximal-gradient method with network independent step-sizes and separated convergence rates. arXiv:1704.07807, April 2017.
[10] D. P. Bertsekas. A new class of incremental gradient methods for least squares problems. SIAM Journal on Optimization, 7(4):913–926, 1997.
[11] S. S. Ram, A. Nedić, and V. V. Veeravalli. Incremental stochastic subgradient algorithms for convex optimization. SIAM Journal on Optimization, 20(2):691–717, 2009.
[12] C. G. Lopes and A. H. Sayed. Incremental adaptive strategies over distributed networks. IEEE Transactions on Signal Processing, 55(8):4064–4077, 2007.
[13] C. G. Lopes and A. H. Sayed. Randomized incremental protocols over adaptive networks. In Proc. ICASSP, pages 3514–3517, Dallas, TX, 2010.
[14] W. Shi, Q. Ling, K. Yuan, G. Wu, and W. Yin. On the linear convergence of the ADMM in decentralized consensus optimization. IEEE Transactions on Signal Processing, 62(7):1750–1761, 2014.
[15] X. Mao, K. Yuan, Y. Hu, Y. Gu, A. H. Sayed, and W. Yin. Walkman: A communication-efficient random-walk algorithm for decentralized optimization. arXiv preprint arXiv:1804.06568, 2018.

35 / 36


[16] D. A. Levin and Y. Peres. Markov Chains and Mixing Times, volume 107. Second edition, American Mathematical Society, 2017.
[17] E. N. Gilbert. Random graphs. The Annals of Mathematical Statistics, 30(4):1141–1144, 1959.

36 / 36