Walkman: A Communication-Efficient Random-Walk Algorithm for Decentralized Optimization

  1. Walkman: A Communication-Efficient Random-Walk Algorithm for Decentralized Optimization
     Xianghui Mao (Tsinghua EE), Kun Yuan (UCLA ECE), Yubin Hu (Tsinghua EE), Yuantao Gu (Tsinghua EE), Ali H. Sayed (EPFL Engineering), Wotao Yin (UCLA Math)

  2. Outline
     1. Decentralized optimization
     2. The Walkman method
     3. Convergence
     4. Communication analysis
     5. Simulation results

  4. Decentralized optimization
     • Consider a decentralized optimization problem over a network $(V, E)$:
       $$\min_{x \in \mathbb{R}^p} \; r(x) + \frac{1}{n} \sum_{i=1}^{n} f_i(x), \tag{1}$$
       where $n$ is the number of nodes.
     • Node $i$ has access to $f_i(x)$. All nodes can access $r(x)$.
     • Both $f_i(x)$ and $r(x)$ can be non-convex.

  5. Gossip-based approaches
     Figure: Gossip-based communication
     • Each agent communicates with all, or a random subset, of its direct neighbors.
     • Prior methods: DGD [1], diffusion [2], D-ADMM [3, 4], EXTRA [5], PG-EXTRA [6], DIGing [7], Exact diffusion [8], NIDS [9], ...
     • Convergence rates are comparable to those of standard centralized optimization.
     • Every agent communicates, so the per-iteration communication cost is between $O(n)$ and $O(n^2)$.

  6. Random-walk approaches
     Figure: A random walk (1, 6, 9, 1, 2, 6, 5, ...)
     • A walker moves $x$ through the network and updates it; $x^k$ is the $k$th value.
     • Agent $i$, on receiving $x$, updates it with its local (sub)gradient $\nabla f_i$.
     • $O(1)$ communication per iteration.
     • Prior works [10–13] require decaying step sizes, which makes them slow.

  7. Proposed method: Walkman
     • Walkman is a random-walk (RW) algorithm.
     • Exact convergence with a fixed step size; much faster than existing RW methods.
     • Convergence guarantees are established for both non-convex and convex problems.
     • More communication efficient than state-of-the-art methods.
     • Can escape from saddle points on the tested non-convex problems.

  8. Walkman communication efficiency
     • Communication complexity of various algorithms for decentralized least squares:

       Algorithm              Communication complexity
       Walkman (proposed)     $O\left( \ln\left(\frac{1}{\epsilon}\right) \cdot \frac{n \ln^3(n)}{(1 - \lambda_2(P))^2} \right)$
       D-ADMM [14]            $O\left( \ln\left(\frac{1}{\epsilon}\right) \cdot \frac{m}{(1 - \lambda_2(P))^{1/2}} \right)$
       EXTRA [5]              $O\left( \ln\left(\frac{1}{\epsilon}\right) \cdot \frac{m}{1 - \lambda_2(P)} \right)$
       Exact diffusion [8]    $O\left( \ln\left(\frac{1}{\epsilon}\right) \cdot \frac{m}{1 - \lambda_2(P)} \right)$

     • Walkman is the most communication efficient when $\lambda_2(P) \le 1 - \frac{1}{m^{2/3}}$, where $\lambda_2(P)$ is a measure of network connectivity and $m$ is the number of edges.
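
To get a feel for the orders of magnitude in the table on slide 8, the sketch below evaluates each bound with every hidden constant set to 1. That is an assumption: big-O notation hides those constants, so the numbers are only indicative. The function names and the complete-graph example (where a uniform $P$ gives $\lambda_2(P) = 0$) are illustrative choices, not part of the slides.

```python
import math

def walkman_order(n, lam2, eps):
    """Walkman bound from the table, constants dropped (assumption)."""
    return math.log(1 / eps) * n * math.log(n) ** 3 / (1 - lam2) ** 2

def dadmm_order(m, lam2, eps):
    """D-ADMM bound from the table, constants dropped (assumption)."""
    return math.log(1 / eps) * m / math.sqrt(1 - lam2)

def extra_order(m, lam2, eps):
    """EXTRA bound from the table; Exact diffusion has the same order."""
    return math.log(1 / eps) * m / (1 - lam2)

# Hypothetical example: complete graph on n nodes with uniform transitions.
n = 1000
m = n * (n - 1) // 2   # number of edges
lam2 = 0.0             # uniform P maps every f orthogonal to 1 to zero
eps = 1e-6
for name, value in [("Walkman", walkman_order(n, lam2, eps)),
                    ("D-ADMM", dadmm_order(m, lam2, eps)),
                    ("EXTRA", extra_order(m, lam2, eps))]:
    print(f"{name:8s} {value:.3e}")
```

On such a dense graph the edge count $m$ grows like $n^2$, which is what lets the $n \ln^3(n)$ term win for large $n$.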

  10. Problem reformulation
     • Recall the problem
       $$\underset{x \in \mathbb{R}^p}{\text{minimize}} \;\; r(x) + \frac{1}{n} \sum_{i=1}^{n} f_i(x).$$
     • Create local variables $y_i$ and make them all equal to $x$.
     • Defining $Y = \mathrm{col}\{y_1, y_2, \cdots, y_n\} \in \mathbb{R}^{np}$ and $F(Y) = \sum_{i=1}^{n} f_i(y_i)$, we have
       $$\underset{x,\, Y}{\text{minimize}} \;\; r(x) + \frac{1}{n} F(Y), \quad \text{subject to} \;\; \mathbf{1} \otimes x - Y = 0, \tag{2}$$
       where $\mathbf{1} = [1\ 1\ \dots\ 1]^T \in \mathbb{R}^n$ and $\otimes$ is the Kronecker product.
     • The above two problems are equivalent.

  11. Standard ADMM
     • The augmented Lagrangian function of problem (2) is
       $$L_\beta(x, Y; Z) = r(x) + \frac{1}{n} \left( F(Y) + \langle Z, \mathbf{1} \otimes x - Y \rangle + \frac{\beta}{2} \| \mathbf{1} \otimes x - Y \|^2 \right),$$
       where $Z \in \mathbb{R}^{np}$ is the dual variable (Lagrange multiplier).
     • The standard ADMM to solve (2) is
       $$\bar{x}^{k+1} = \frac{1}{n} \sum_{i=1}^{n} \left( y_i^k - \frac{z_i^k}{\beta} \right),$$
       $$x^{k+1} = \mathrm{prox}_{\frac{1}{\beta} r}(\bar{x}^{k+1}),$$
       $$y_i^{k+1} = \mathrm{prox}_{\frac{1}{\beta} f_i}\left( x^{k+1} + \frac{z_i^k}{\beta} \right), \quad \forall i \in V,$$
       $$z_i^{k+1} = z_i^k + \beta ( x^{k+1} - y_i^{k+1} ), \quad \forall i \in V.$$
     • Step 1 uses a reduce operation, which is implementable in a distributed 1-to-N setting but not in our decentralized setting.

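  12. Derive Walkman
     • Update $x$ with only one $y_i$ at a time.
     • To decentralize the computation of $\bar{x}^{k+1}$, we propose
       $$\bar{x}^{k+1} = \frac{1}{n} \sum_{i=1}^{n} \left( y_i^k - \frac{z_i^k}{\beta} \right),$$
       $$x^{k+1} = \mathrm{prox}_{\frac{1}{\beta} r}(\bar{x}^{k+1}), \tag{4}$$
       $$y_i^{k+1} = \begin{cases} \mathrm{prox}_{\frac{1}{\beta} f_i}\left( x^{k+1} + \frac{z_i^k}{\beta} \right), & i = i_k, \\ y_i^k, & \text{otherwise}, \end{cases} \tag{5}$$
       $$z_i^{k+1} = \begin{cases} z_i^k + \beta ( x^{k+1} - y_i^{k+1} ), & i = i_k, \\ z_i^k, & \text{otherwise}. \end{cases} \tag{6}$$
     • A walker will carry $\bar{x}$ while visiting a sequence of nodes.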
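
The updates (4), (5), (6) can be sketched in a few lines for scalar least squares, $f_i(x) = \frac{1}{2}(a_i x - b_i)^2$ with $r = 0$. This is a minimal illustration under stated assumptions, not the authors' implementation: the problem data, the ring graph with self-loops, the value $\beta = 5$, and the function name are all hypothetical, and $\bar{x}$ is recomputed as a full average purely for readability. For this $f_i$ the prox in (5) has the closed form $(a_i b_i + \beta v) / (a_i^2 + \beta)$.

```python
import random

def walkman_prox(a, b, neighbors, beta=5.0, iters=5000, seed=0):
    """Walkman with exact prox steps for f_i(x) = 0.5*(a[i]*x - b[i])**2, r = 0."""
    rng = random.Random(seed)
    n = len(a)
    y = [0.0] * n   # local copies y_i
    z = [0.0] * n   # dual variables z_i
    i = 0           # node currently visited by the walker
    x = 0.0
    for _ in range(iters):
        # x_bar^{k+1}: average of y_j - z_j / beta (full average, for clarity only)
        x_bar = sum(y[j] - z[j] / beta for j in range(n)) / n
        x = x_bar                                  # (4): prox of r = 0 is the identity
        v = x + z[i] / beta
        y[i] = (a[i] * b[i] + beta * v) / (a[i] ** 2 + beta)  # (5): closed-form prox
        z[i] = z[i] + beta * (x - y[i])            # (6): dual update at the active node
        i = rng.choice(neighbors[i])               # Markovian move to a neighbor
    return x, y

# Hypothetical data: 5 nodes on a ring, each node may also stay in place.
a = [1.0, 0.8, 1.2, 0.9, 1.1]
b = [1.0, 2.0, 0.5, 1.5, 3.0]
n = len(a)
neighbors = [[(i - 1) % n, i, (i + 1) % n] for i in range(n)]
x, y = walkman_prox(a, b, neighbors)
x_star = sum(ai * bi for ai, bi in zip(a, b)) / sum(ai ** 2 for ai in a)
print(x, x_star)
```

Only the visited node changes its $y_i$ and $z_i$ in each iteration, which is what makes a token-passing implementation possible.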

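  13. • Recall: among $\left\{ y_1 - \frac{z_1}{\beta}, \cdots, y_n - \frac{z_n}{\beta} \right\}$, only $y_{i_k} - \frac{z_{i_k}}{\beta}$ is updated.
     • The computation of $\bar{x}^{k+2}$ is equivalent to
       $$\bar{x}^{k+2} = \underbrace{\bar{x}^{k+1}}_{\text{from neighbor}} + \underbrace{\frac{1}{n} \left( y_{i_k}^{k+1} - \frac{z_{i_k}^{k+1}}{\beta} \right) - \frac{1}{n} \left( y_{i_k}^{k} - \frac{z_{i_k}^{k}}{\beta} \right)}_{\text{local information}}. \tag{7}$$
       Such computation can be conducted locally.
     • Equations involved: (7), (4), (5), (6), (9).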
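
A quick numerical check of (7): the sketch below lets the walker carry $\bar{x}$ and update it from the visited node's old and new values only, then compares the result with the full average at every step. As before, the scalar least-squares setting with $r = 0$, the data, and the function name are assumptions made for illustration.

```python
import random

def walkman_token(a, b, neighbors, beta=5.0, iters=200, seed=1):
    """Carry x_bar as a token, update it with (7), and track the gap to the full average."""
    rng = random.Random(seed)
    n = len(a)
    y = [0.0] * n
    z = [0.0] * n
    x_bar = 0.0      # consistent with y = z = 0
    i = 0
    max_gap = 0.0
    for _ in range(iters):
        x = x_bar                                   # (4) with r = 0
        old = y[i] - z[i] / beta                    # node i's old contribution
        v = x + z[i] / beta
        y[i] = (a[i] * b[i] + beta * v) / (a[i] ** 2 + beta)   # (5)
        z[i] = z[i] + beta * (x - y[i])             # (6)
        new = y[i] - z[i] / beta                    # node i's new contribution
        x_bar += (new - old) / n                    # (7): local correction of the token
        full = sum(y[j] - z[j] / beta for j in range(n)) / n
        max_gap = max(max_gap, abs(x_bar - full))
        i = rng.choice(neighbors[i])
    return x_bar, max_gap

a = [1.0, 0.8, 1.2, 0.9, 1.1]
b = [1.0, 2.0, 0.5, 1.5, 3.0]
n = len(a)
neighbors = [[(i - 1) % n, i, (i + 1) % n] for i in range(n)]
x_bar, max_gap = walkman_token(a, b, neighbors)
print(max_gap)   # differs from zero only by floating-point rounding
```

The full average appears here only as a check; a decentralized run would never compute it.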

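  14. • It can be expensive to solve the subproblem
       $$\underset{y}{\text{minimize}} \;\; f_i(y) + \frac{\beta}{2} \left\| y - \left( x^{k+1} + \frac{z_i^k}{\beta} \right) \right\|^2. \tag{8}$$
     • When (8) is not easy to solve, we can linearize (8) and update $y_i$ cheaply:
       $$y_i^{k+1} = \begin{cases} x^{k+1} + \frac{1}{\beta} z_i^k - \frac{1}{\beta} \nabla f_i(y_i^k), & i = i_k, \\ y_i^k, & \text{otherwise}. \end{cases} \tag{9}$$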
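
The linearized update (9), followed by the dual update (6), amounts to one gradient evaluation per visit. Below is a minimal sketch of a single visit; the helper name `linearized_step` and the example function are hypothetical.

```python
def linearized_step(x, y_i, z_i, grad_i, beta):
    """One visit to node i: y_i from (9), then z_i from (6)."""
    y_new = x + z_i / beta - grad_i(y_i) / beta   # (9): gradient step replaces the prox
    z_new = z_i + beta * (x - y_new)              # (6): dual update
    return y_new, z_new

# Hypothetical example: f_i(y) = 0.5 * (2*y - 1)**2, so grad f_i(y) = 2 * (2*y - 1).
grad = lambda y: 2.0 * (2.0 * y - 1.0)
y_new, z_new = linearized_step(x=1.0, y_i=0.0, z_i=0.5, grad_i=grad, beta=10.0)
print(y_new, z_new)
```

Substituting (9) into (6) gives $z_i^{k+1} = \nabla f_i(y_i^k)$, so the new dual variable is simply the gradient that was just evaluated.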

  15. Walkman [15]
     • A walker carries $\bar{x}^k$ around the network.
     • Each local variable $y_i^k$ is expected to converge to $x^\star$.
     • The node activation is Markovian: node $i_{k+1}$ must be a neighbor of $i_k$.

  17. Assumptions
     Assumption (A1: Random walk). The random walk $(i_k)_{k \ge 0}$, $i_k \in V$, forms an irreducible, aperiodic Markov chain with transition probability matrix $P$ and stationary distribution $\pi$.
     • This guarantees that each agent is visited infinitely many times.
     Assumption (A2: Coerciveness). The objective function $r(x) + \frac{1}{n} \sum_{i=1}^{n} f_i(x)$ is bounded from below over $\mathbb{R}^p$ and is coercive over $\mathbb{R}^p$, that is, $r(x^k) + \frac{1}{n} \sum_{i=1}^{n} f_i(x^k) \to \infty$ for any sequence $x^k \in \mathbb{R}^p$ with $\|x^k\| \to \infty$.
     • The minimal function value is finite.
     • The boundedness of $x^k$ implies the boundedness of the function value.

  18. Assumptions
     Assumption (A3: $f_i$ smoothness). Each $f_i(x)$ is $L$-Lipschitz differentiable.
     Assumption (A4: $r$ is semiconvex). The function $r(x)$ is $\gamma$-semiconvex, that is, $r(\cdot) + \frac{\gamma}{2} \| \cdot \|^2$ is convex.

  19. Convergence property
     Theorem. Under A1-A4, for $\beta > \max\{\gamma, 2L + 2\}$ (resp. $\beta > \max\{\gamma, 2L^2 + L + 2\}$), any limit point $(x^*, Y^*, Z^*)$ of the sequence $(x^k, Y^k, Z^k)$ generated by Walkman with $\mathrm{prox}_{f_i}$ (resp. $\nabla f_i$) satisfies $x^* = y_i^*$, $i = 1, \dots, n$, where $x^*$ is a stationary point of (1) with probability 1, that is,
       $$\Pr\left( 0 \in \partial r(x^*) + \frac{1}{n} \sum_{i=1}^{n} \nabla f_i(x^*) \right) = 1.$$
     If the objective of (1) is convex, then $x^*$ is a minimizer.
     Implication: Walkman almost surely converges to a stationary point.

  20. Convergence rate
     • We examine the convergence rate for decentralized least squares:
       $$\underset{x}{\text{minimize}} \;\; \frac{1}{2n} \sum_{i=1}^{n} \| A_i x - b_i \|^2.$$
       This is a special case of problem (1) with $r = 0$.
     • Node $i$ possesses $A_i$ and $b_i$.
     • We need the mixing time to characterize the convergence rate.

  21. Mixing time
     • For $\delta > 0$, the mixing time [16, Chapter 11] is defined as the smallest integer $\tau(\delta)$ such that
       $$\left| [P^{\tau(\delta)}]_{ij} - \pi_j \right| \le \delta \pi_*, \quad \forall i, j \in V, \tag{10}$$
       where $\pi_* := \min_{i \in V} \pi_i$.
     • After $\tau(\delta)$ steps, each agent $j$ will be visited with probability $\approx \pi_j$.
     • Inequality (10) is guaranteed when
       $$\tau(\delta) := \frac{1}{1 - \lambda_2(P)} \ln \frac{\sqrt{2}}{\delta \pi_*}, \tag{11}$$
       where $\lambda_2(P) := \sup \left\{ \| f^T P \| / \| f \| : f^T \mathbf{1} = 0, \; f \in \mathbb{R}^n \right\}$.
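
As a small sanity check of (10) and (11), the sketch below uses a lazy random walk on a complete graph, $P = s I + (1 - s) \frac{1}{n} \mathbf{1}\mathbf{1}^T$, for which $\pi$ is uniform and $\lambda_2(P) = s$. This chain, the helper names, and rounding (11) up to an integer are assumptions made for illustration.

```python
import math

def mixing_time_bound(lam2, delta, pi_min):
    """Right-hand side of (11), rounded up to an integer number of steps."""
    return math.ceil(math.log(math.sqrt(2) / (delta * pi_min)) / (1 - lam2))

def mat_mul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_pow(P, t):
    n = len(P)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(t):
        result = mat_mul(result, P)
    return result

def satisfies_mixing(P, pi, t, delta):
    """Check inequality (10) entrywise for the t-step transition matrix."""
    Pt = mat_pow(P, t)
    bound = delta * min(pi)
    n = len(P)
    return all(abs(Pt[i][j] - pi[j]) <= bound for i in range(n) for j in range(n))

def lazy_complete_graph(n, stay):
    """P = stay * I + (1 - stay) * (1/n) * ones; its lambda_2 equals `stay`."""
    off = (1 - stay) / n
    return [[stay + off if i == j else off for j in range(n)] for i in range(n)]

n, stay, delta = 4, 0.5, 0.5
P = lazy_complete_graph(n, stay)
pi = [1.0 / n] * n
tau = mixing_time_bound(stay, delta, min(pi))
print(tau, satisfies_mixing(P, pi, tau, delta))
```

The bound (11) is sufficient rather than tight: for this chain (10) already holds after fewer steps than the computed value.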

  22. Convergence rate
     Theorem. Under A1, for $\beta > 2 \sigma^*_{\max} + 2$ with $\sigma^*_{\max} := \max_i \sigma_{\max}(A_i^T A_i)$, we have linear convergence:
       $$\mathbb{E} \| Y^t - Y^\star \|^2 \le \left( 1 + \frac{n (1 - \delta) \pi_* \nu}{4 \beta^2 \tau(\delta)} \right)^{- \left\lfloor \frac{t}{\tau(\delta)} \right\rfloor} C_0, \quad \forall t \ge 0,$$
     where $\nu = \frac{(n - 1)(\beta - \sigma^*_{\max})}{n^2}$, and $C_0$ is a constant that depends only on $A_1, \cdots, A_n$, $b_1, \cdots, b_n$, and $\beta$.
     • The quantity $\tau(\delta)$ acts as the number of iterations in an epoch.
     • Walkman solves the least squares problem at a linear convergence rate.

  24. Communication complexity
     • For simplicity, assume $P$ is a symmetric real matrix, modeling a gossip matrix of an undirected graph.
     • Communication complexity of Walkman:
       $$O\Bigg( \underbrace{\ln\left(\frac{n}{\epsilon}\right) \Big/ \ln\left( 1 + \frac{(1 - \delta) \pi_*}{\tau(\delta)} \right)}_{\text{epoch numbers}} \cdot \underbrace{\tau(\delta)}_{\text{comm. per epoch}} \Bigg)$$
     • Substituting $\tau(\delta)$ from (11):
       $$O\Bigg( \underbrace{\ln\left(\frac{n}{\epsilon}\right) \Big/ \ln\left( 1 + \frac{1 - \lambda_2(P)}{2 n \ln(2n)} \right)}_{\text{epoch numbers}} \cdot \underbrace{\frac{\ln(n)}{1 - \lambda_2(P)}}_{\text{comm. per epoch}} \Bigg)$$
     • The number of edges $m$ does not appear explicitly.
