
Walkman: A Communication-Efficient Random-Walk Algorithm for Decentralized Optimization

Xianghui Mao⋄ Kun Yuan∗ Yubin Hu⋄ Yuantao Gu⋄ Ali H. Sayed† Wotao Yin‡

⋄Tsinghua EE ∗UCLA ECE †EPFL Engineering ‡UCLA Math

1 / 36


Outline

  • 1. Decentralized optimization
  • 2. The Walkman method
  • 3. Convergence
  • 4. Communication analysis
  • 5. Simulation results

2 / 36


Outline

  • 1. Decentralized optimization
  • 2. The Walkman method
  • 3. Convergence
  • 4. Communication analysis
  • 5. Simulation results

3 / 36


Decentralized optimization

  • Consider a decentralized optimization problem over a network (V, E):

      min_{x ∈ ℝ^p}  r(x) + (1/n) Σ_{i=1}^n f_i(x),   (1)

    where n is the number of nodes.
  • Node i has access to f_i(x). All nodes can access r(x).
  • Both f_i(x) and r(x) can be non-convex.

4 / 36
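For concreteness, problem (1) is easy to instantiate in code. A minimal sketch, assuming illustrative quadratic local losses f_i(x) = 0.5‖A_i x − b_i‖² and r = 0 (the sizes, seed, and data below are hypothetical, not from the slides):

```python
import numpy as np

# Illustrative instance of problem (1): node i holds a quadratic local loss
# f_i(x) = 0.5 * ||A_i x - b_i||^2, and the shared regularizer r is zero.
rng = np.random.default_rng(0)
n, p = 5, 3                                   # small sizes chosen for illustration
A = [rng.standard_normal((4, p)) for _ in range(n)]
b = [rng.standard_normal(4) for _ in range(n)]

def f(i, x):
    # local loss known only to node i
    return 0.5 * np.linalg.norm(A[i] @ x - b[i]) ** 2

def objective(x, r=lambda v: 0.0):
    # r(x) + (1/n) * sum_i f_i(x), exactly the form of (1)
    return r(x) + sum(f(i, x) for i in range(n)) / n
```

The point of the decentralized setting is that no single node can evaluate `objective` on its own: each term `f(i, x)` lives on a different node.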


Gossip-based approaches

Figure: Gossip-based communication

  • Each agent communicates with all, or a random subset of, its direct neighbors
  • Prior methods: DGD [1], diffusion [2], D-ADMM [3, 4], EXTRA [5], PG-EXTRA [6], DIGing [7], Exact diffusion [8], NIDS [9], ...
  • Convergence rates are comparable to standard centralized optimization.
  • Every agent communicates ⇒ per-iteration communication cost of O(n) to O(n²).

5 / 36


Random-walk approaches

Figure: A random walk (1, 6, 9, 1, 2, 6, 5, ...)

  • A walker moves x through the network and updates it; x^k is its value at iteration k.
  • Agent i, on receiving x, updates it using its local (sub)gradient ∇f_i.
  • O(1) communication per iteration.
  • Prior works [10–13] require decaying step-sizes, so they are slow.

6 / 36


Proposed method: Walkman

  • Walkman is a random-walk (RW) algorithm
  • Exact convergence with a fixed step-size; much faster than existing RW methods
  • Convergence guarantees established for non-convex and convex scenarios
  • More communication-efficient than state-of-the-art methods
  • Can escape from saddle points on tested non-convex problems

7 / 36


Walkman communication efficiency

  • Communication complexity of various algorithms for decentralized least squares:

      Algorithm            | Communication Complexity
      ---------------------|------------------------------------------
      Walkman (proposed)   | O( ln(1/ε) · n ln³(n) / (1 − λ₂(P))² )
      D-ADMM [14]          | O( ln(1/ε) · m / (1 − λ₂(P))^{1/2} )
      EXTRA [5]            | O( ln(1/ε) · m / (1 − λ₂(P)) )
      Exact diffusion [8]  | O( ln(1/ε) · m / (1 − λ₂(P)) )

  • Walkman is the most communication-efficient when

      λ₂(P) ≤ 1 − 1/m^{2/3},

    where λ₂(P) is a measure of network connectivity and m is the number of edges.

8 / 36


Outline

  • 1. Decentralized optimization
  • 2. The Walkman method
  • 3. Convergence
  • 4. Communication analysis
  • 5. Simulation results

9 / 36


Problem reformulation

  • Recall the problem

      min_{x ∈ ℝ^p}  r(x) + (1/n) Σ_{i=1}^n f_i(x)

  • Create local variables y_i and make them all equal to x.
  • Defining Y = col{y_1, y_2, …, y_n} ∈ ℝ^{np} and F(Y) = Σ_{i=1}^n f_i(y_i), we have

      min_{x, Y}  r(x) + (1/n) F(Y),   subject to  1 ⊗ x − Y = 0,   (2)

    where 1 = [1 1 … 1]^T ∈ ℝ^n and ⊗ is the Kronecker product.
  • The above two problems are equivalent.

10 / 36


Standard ADMM

  • The augmented Lagrangian function of problem (2) is

      L_β(x, Y; Z) = r(x) + (1/n) F(Y) + ⟨Z, 1 ⊗ x − Y⟩ + (β/2) ‖1 ⊗ x − Y‖²,

    where Z ∈ ℝ^{np} is the dual variable (Lagrange multiplier).
  • The standard ADMM to solve (2) is

      x̄^{k+1} = (1/n) Σ_{i=1}^n ( y_i^k − z_i^k/β ),
      x^{k+1} = prox_{(1/β)r}( x̄^{k+1} ),
      y_i^{k+1} = prox_{(1/β)f_i}( x^{k+1} + z_i^k/β ),   ∀ i ∈ V,
      z_i^{k+1} = z_i^k + β( x^{k+1} − y_i^{k+1} ),   ∀ i ∈ V.

  • Step 1 uses a reduce operation, implementable in a distributed 1-to-N setting but not in our decentralized setting.

11 / 36
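The four ADMM steps above can be sketched directly. A minimal sketch, assuming r = 0 (so the prox of r is the identity) and quadratic f_i as in the later least-squares experiment; the sizes, seed, and β are illustrative:

```python
import numpy as np

# Standard ADMM of problem (2), assuming r = 0 and quadratic
# f_i(y) = 0.5*||A_i y - b_i||^2, whose prox is a small linear solve.
rng = np.random.default_rng(1)
n, p, beta = 5, 3, 10.0
A = [rng.standard_normal((10, p)) for _ in range(n)]
b = [rng.standard_normal(10) for _ in range(n)]
Y = np.zeros((n, p))
Z = np.zeros((n, p))

def prox_f(i, v):
    # argmin_y f_i(y) + (beta/2) * ||y - v||^2
    return np.linalg.solve(A[i].T @ A[i] + beta * np.eye(p), A[i].T @ b[i] + beta * v)

for k in range(500):
    x = (Y - Z / beta).mean(axis=0)        # xbar^{k+1}; x^{k+1} = xbar since r = 0
    for i in range(n):                     # this loop is the reduce over all nodes
        Y[i] = prox_f(i, x + Z[i] / beta)  # y-update
        Z[i] = Z[i] + beta * (x - Y[i])    # dual update
```

The averaging step is exactly the reduce operation the slide points out: it touches every node on every iteration, which is what Walkman removes.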


Derive Walkman

  • Goal: update x with only one y_i at a time.
  • To decentralize the computation of x̄^{k+1}, we propose

      x̄^{k+1} = (1/n) Σ_{i=1}^n ( y_i^k − z_i^k/β ),   x^{k+1} = prox_{(1/β)r}( x̄^{k+1} ),   (4)

      y_i^{k+1} = prox_{(1/β)f_i}( x^{k+1} + z_i^k/β ) if i = i_k,   y_i^k otherwise,   (5)

      z_i^{k+1} = z_i^k + β( x^{k+1} − y_i^{k+1} ) if i = i_k,   z_i^k otherwise.   (6)

  • A walker will carry x̄ while visiting a sequence of nodes.
12 / 36

  • Recall: among y_1 − z_1/β, …, y_n − z_n/β, only y_{i_k} − z_{i_k}/β is updated.
  • The computation of x̄^{k+2} is therefore equivalent to

      x̄^{k+2} = x̄^{k+1} + (1/n)( y_{i_k}^{k+1} − z_{i_k}^{k+1}/β ) − (1/n)( y_{i_k}^k − z_{i_k}^k/β ),   (7)

    where x̄^{k+1} arrives from the neighbor and the remaining terms are local information, so the computation can be conducted locally. One Walkman iteration consists of steps (7), (4), (5) (or its linearized variant (9)), and (6).

13 / 36

  • It is expensive to solve the subproblem

      min_y  f_i(y) + (β/2) ‖ y − ( x^{k+1} + z_i^k/β ) ‖²   (8)

  • When (8) is not easy to solve, we can linearize it and update y_i cheaply:

      y_i^{k+1} = x^{k+1} + (1/β) z_i^k − (1/β) ∇f_i(y_i^k) if i = i_k,   y_i^k otherwise.   (9)

14 / 36
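Putting (7), (4), (5), (6) together gives a runnable sketch of one Walkman run. This is a minimal illustration, assuming r = 0, quadratic f_i, a 7-node cycle graph, and a uniform random walk; the data, seed, and iteration count are illustrative, with β set above the theory's 2σ∗_max + 2 threshold:

```python
import numpy as np

# Walkman sketch combining steps (7), (4), (5), (6), assuming r = 0 and
# quadratic f_i(y) = 0.5*||A_i y - b_i||^2; data, graph, and beta are illustrative.
rng = np.random.default_rng(2)
n, p = 7, 3                       # odd cycle -> the walk is irreducible and aperiodic
A = [0.4 * rng.standard_normal((10, p)) for _ in range(n)]
b = [rng.standard_normal(10) for _ in range(n)]
sigma_max = max(np.linalg.norm(Ai.T @ Ai, 2) for Ai in A)
beta = 2.0 * sigma_max + 3.0      # satisfies beta > 2*sigma*_max + 2 from the theory
neighbors = {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}

# prox_{(1/beta) f_i} is a fixed linear map per node; precompute it.
M = [np.linalg.inv(A[i].T @ A[i] + beta * np.eye(p)) for i in range(n)]
c = [A[i].T @ b[i] for i in range(n)]

Y = np.zeros((n, p))
Z = np.zeros((n, p))
xbar = (Y - Z / beta).mean(axis=0)            # computed once, then maintained via (7)
i_k = 0
for k in range(60000):
    x = xbar                                  # step (4): prox of r is the identity here
    old = (Y[i_k] - Z[i_k] / beta) / n        # node i_k's old contribution to xbar
    Y[i_k] = M[i_k] @ (c[i_k] + beta * x + Z[i_k])   # prox step (5), v = x + z/beta
    Z[i_k] += beta * (x - Y[i_k])                    # dual step (6)
    xbar += (Y[i_k] - Z[i_k] / beta) / n - old       # incremental average (7)
    i_k = int(rng.choice(neighbors[i_k]))            # walker moves to a random neighbor
```

Only node i_k computes anything at step k, and only x̄ travels over one edge, hence the O(1) communication per iteration claimed earlier.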


Walkman [15]

  • A walker carries x̄^k around the network
  • Each local variable y_i^k is expected to converge to x⋆
  • The node activation is Markovian: node i_{k+1} must be a neighbor of i_k.

15 / 36


Outline

  • 1. Decentralized optimization
  • 2. The Walkman method
  • 3. Convergence
  • 4. Communication analysis
  • 5. Simulation results

16 / 36


Assumptions

Assumption (A1: Random walk)

The random walk (i_k)_{k≥0}, i_k ∈ V, forms an irreducible, aperiodic Markov chain with transition probability matrix P and stationary distribution π. This guarantees that each agent is visited infinitely often.

Assumption (A2: Coerciveness)

The objective function r(x) + (1/n) Σ_{i=1}^n f_i(x) is bounded from below over ℝ^p and is coercive over ℝ^p, that is, r(x^k) + (1/n) Σ_{i=1}^n f_i(x^k) → ∞ for any sequence x^k ∈ ℝ^p with ‖x^k‖ → ∞. Hence a finite minimal function value exists, and bounded function values imply bounded sequences x^k.

17 / 36


Assumptions

Assumption (A3: fi smoothness)

Each fi(x) is L-Lipschitz differentiable

Assumption (A4: r is semiconvex)

Function r(x) is γ-semiconvex, that is, r(·) + (γ/2) ‖·‖² is convex.

18 / 36


Convergence property

Theorem

Under A1–A4, for β > max{γ, 2L + 2} (resp. β > max{γ, 2L² + L + 2}), any limit point (x∗, Y∗, Z∗) of the sequence (x^k, Y^k, Z^k) generated by Walkman with prox_{f_i} (resp. ∇f_i) satisfies x∗ = y_i∗, i = 1, …, n, where x∗ is a stationary point of (1) with probability 1, that is,

    Pr( 0 ∈ ∂r(x∗) + (1/n) Σ_{i=1}^n ∇f_i(x∗) ) = 1.

If the objective of (1) is convex, then x∗ is a minimizer.

Implication: Walkman almost surely converges to a stationary point.

19 / 36


Convergence rate

  • We examine the convergence rate for decentralized least squares:

      min_x  (1/(2n)) Σ_{i=1}^n ‖ A_i x − b_i ‖²,

    a special case of problem (1) with r = 0.
  • Node i possesses A_i and b_i.
  • We need the mixing time to characterize the convergence rate.

20 / 36


Mixing time

  • For δ > 0, the mixing time [16, Chapter 11] is defined as the smallest integer τ(δ) such that

      | [P^{τ(δ)}]_{ij} − π_j | ≤ δ π∗,   ∀ i, j ∈ V,   (10)

    where π∗ := min_{i ∈ V} π_i.
  • After τ(δ) steps, each agent j is visited with probability ≈ π_j.
  • Inequality (10) is guaranteed when

      τ(δ) := ⌈ ( 1/(1 − λ₂(P)) ) ln( √2 / (δ π∗) ) ⌉,   (11)

    where λ₂(P) := sup{ ‖f^T P‖ / ‖f‖ : f^T 1 = 0, f ∈ ℝ^n }.

21 / 36
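Definition (10) and bound (11) are easy to check numerically. A small sketch, assuming a lazy random walk on a 10-node cycle (the laziness is just one way to make the chain aperiodic, as A1 requires; the sizes and δ are illustrative):

```python
import numpy as np

# Check definition (10) and bound (11) for a lazy random walk on a cycle.
n = 10
P = np.zeros((n, n))
for i in range(n):
    P[i, i] = 0.5                 # lazy self-loop -> aperiodic chain
    P[i, (i - 1) % n] = 0.25
    P[i, (i + 1) % n] = 0.25

pi = np.full(n, 1.0 / n)          # stationary distribution: P is symmetric, doubly stochastic
lam2 = np.sort(np.abs(np.linalg.eigvalsh(P)))[-2]   # lambda_2(P) for symmetric P

delta = 0.1
tau = int(np.ceil(np.log(np.sqrt(2) / (delta * pi.min())) / (1.0 - lam2)))  # bound (11)

# (10): after tau steps, every row of P^tau is within delta * pi_min of pi.
Pt = np.linalg.matrix_power(P, tau)
assert np.abs(Pt - pi).max() <= delta * pi.min()
```

For a symmetric P the supremum in the slide's definition of λ₂(P) is just the second-largest eigenvalue magnitude, which is why `eigvalsh` suffices here.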


Convergence rate

Theorem

Under A1, for β > 2σ∗_max + 2 with σ∗_max := max_i σ_max(A_i^T A_i), we have linear convergence:

    E ‖Y^t − Y⋆‖² ≤ ( 1 + n(1 − δ) π∗ ν / (4 β² τ(δ)) )^{−⌊ t/τ(δ) ⌋} C₀,   ∀ t ≥ 0,

where ν = (n − 1)(β − σ∗_max)/n², and C₀ is a constant depending only on A₁, …, A_n, b₁, …, b_n, and β.

The quantity τ(δ) plays the role of the number of iterations in an epoch. Walkman solves the least-squares problem at a linear convergence rate.

22 / 36


Outline

  • 1. Decentralized optimization
  • 2. The Walkman method
  • 3. Convergence
  • 4. Communication analysis
  • 5. Simulation results

23 / 36


Communication complexity

  • For simplicity, assume P is a symmetric real matrix modeling the gossip matrix of an undirected graph.
  • Communication complexity of Walkman:

      O( [ ln(n/ε) / ln( 1 + (1 − δ)π∗/τ(δ) ) ] · τ(δ) ),

    where the first factor is the number of epochs and τ(δ) is the communication per epoch.
  • Substituting τ(δ) with (11):

      O( [ ln(n/ε) / ln( 1 + (1 − λ₂(P))/(2n ln(2n)) ) ] · ln(n)/(1 − λ₂(P)) ).

  • The number of edges m is not explicitly involved.

24 / 36


Communication comparison

      Algorithm            | Communication Complexity
      ---------------------|------------------------------------------
      Walkman (proposed)   | O( ln(1/ε) · n ln³(n) / (1 − λ₂(P))² )
      D-ADMM [14]          | O( ln(1/ε) · m / (1 − λ₂(P))^{1/2} )
      EXTRA [5]            | O( ln(1/ε) · m / (1 − λ₂(P)) )
      Exact diffusion [8]  | O( ln(1/ε) · m / (1 − λ₂(P)) )

  • Walkman is more communication-efficient when:

      λ₂(P) ≤ 1 − n^{2/3} [ln(n)]² / m^{2/3} ≈ 1 − (n/m)^{2/3},

    where the approximation holds for ln(n) ≪ n, with the ln(n) factors ignored.
  • When the graph is moderately well connected, Walkman is more communication-efficient.

25 / 36


Communication comparison on complete graph

  • Complete graph: m = O(n²), λ₂(P) = 0.

      Algorithm            | Communication Complexity
      ---------------------|--------------------------
      Walkman (proposed)   | O( ln(1/ε) · n ln³(n) )
      D-ADMM [14]          | O( ln(1/ε) · n² )
      EXTRA [5]            | O( ln(1/ε) · n² )
      Exact diffusion [8]  | O( ln(1/ε) · n² )

26 / 36


Communication comparison on random graph

  • Random graphs [17], G(n, p): an n-node graph in which each edge appears independently with probability p ∈ (0, 1). Set P_{ij} = 1/d_max if nodes i and j are connected and 0 otherwise, where d_max is the maximum degree, and P_{ii} = 1 − Σ_{j≠i} P_{ij}.
  • E[m] = p(n² − n)/2 = O(n²), and 1 − λ₂(P) = O(1).

      Algorithm            | Communication Complexity
      ---------------------|--------------------------
      Walkman (proposed)   | O( ln(1/ε) · n ln³(n) )
      D-ADMM [14]          | O( ln(1/ε) · n² )
      EXTRA [5]            | O( ln(1/ε) · n² )
      Exact diffusion [8]  | O( ln(1/ε) · n² )

27 / 36
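The max-degree weighting on this slide is straightforward to construct. A minimal sketch, assuming one illustrative G(20, 0.5) draw (n, p, and the seed are arbitrary choices, not from the slides):

```python
import numpy as np

# Build the gossip matrix of slide 27 for one G(n, p) draw:
# P_ij = 1/d_max on edges, 0 otherwise, and P_ii = 1 - sum_{j != i} P_ij.
rng = np.random.default_rng(3)
n, p_edge = 20, 0.5
upper = np.triu(rng.random((n, n)) < p_edge, 1)   # each edge appears w.p. p_edge
adj = upper | upper.T                              # symmetric adjacency, no self-loops

d_max = adj.sum(axis=1).max()
P = adj / d_max                                    # off-diagonal entries
np.fill_diagonal(P, 1.0 - P.sum(axis=1))           # remaining mass on the diagonal

lam2 = np.sort(np.abs(np.linalg.eigvalsh(P)))[-2]  # lambda_2(P) for symmetric P
gap = 1.0 - lam2                                   # spectral gap entering tau(delta)
```

The resulting P is symmetric and row-stochastic, so its stationary distribution is uniform and `gap` is the 1 − λ₂(P) quantity appearing in the complexity table.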


Communication comparison on cycle graph

  • Cycle graph: m = O(n), 1 − λ₂(P) = O(1/n²).

      Algorithm            | Communication Complexity
      ---------------------|--------------------------
      Walkman (proposed)   | O( ln(1/ε) · n⁵ ln³(n) )
      D-ADMM [14]          | O( ln(1/ε) · n² )
      EXTRA [5]            | O( ln(1/ε) · n³ )
      Exact diffusion [8]  | O( ln(1/ε) · n³ )

28 / 36


Outline

  • 1. Decentralized optimization
  • 2. The Walkman method
  • 3. Convergence
  • 4. Communication analysis
  • 5. Simulation results

29 / 36


Convex Problem: Least Squares

      min  (1/n) Σ_{i=1}^n (1/2) ‖ A_i x − b_i ‖²

Figure: ‖Y^k − Y∗‖²/n versus communication cost and running time, for Walkman (11b), Walkman (11b'), D-ADMM, EXTRA, Exact Diffusion, and RW Incremental with constant and decaying stepsizes.

  • n = 50 nodes are uniformly placed in a 30 × 30 square; communication radius r = 15.
  • A_i ∈ ℝ^{5×10}, x ∈ ℝ^{10}, and b_i := A_i x_0 + v_i.

30 / 36


Convex Problem: Sparsity Inducing Logistic Regression

      min_{x ∈ ℝ^p}  λ ‖x‖₁ + (1/(nb)) Σ_{i=1}^n Σ_{j=1}^b log( 1 + exp( −y_{ij} v_{ij}^T x ) )

Figure: ‖Y^k − Y∗‖²/n versus communication cost and running time, for Walkman (11b), Walkman (11b'), PG-EXTRA, and RW Incremental with constant and decaying stepsizes.

  • v_{ij} ∈ ℝ⁵ is the feature and y_{ij} ∈ {−1, +1} is the label; b = 10.
  • v_{ij} ∼ N(0, 1).

31 / 36


Nonconvex Problem: Nonnegative Principal Component Analysis

      min_{x : ‖x‖ ≤ 1, x_i ≥ 0 ∀ i ∈ {1, …, p}}  (1/n) Σ_{i=1}^n −x^T ( (1/b) Σ_{j=1}^b y_{ij} y_{ij}^T ) x

Figure: optimality gap and f(x^k) − f(x∗) versus communication cost, running time, and iterations, for Walkman (11b), Walkman (11b'), PG-EXTRA, and RW Incremental with constant and decaying stepsizes.

  • Walkman escapes from the saddle point and reaches a lower function value.

32 / 36


Summary

  • Walkman converges exactly to a stationary point with a fixed step-size
  • Walkman is more communication-efficient than state-of-the-art methods
  • Random-walk algorithms have great potential to save communication
  • Acknowledgements: NSF, NSFC
  • Report: https://arxiv.org/pdf/1804.06568.pdf

33 / 36


References:

[1] A. Nedić and A. Ozdaglar. Distributed subgradient methods for multi-agent optimization. IEEE Transactions on Automatic Control, 54(1):48–61, 2009.
[2] A. H. Sayed. Adaptive networks. Proceedings of the IEEE, 102(4):460–497, April 2014.
[3] G. Mateos, J. A. Bazerque, and G. B. Giannakis. Distributed sparse linear regression. IEEE Transactions on Signal Processing, 58(10):5262–5276, 2010.
[4] J. F. Mota, J. M. Xavier, P. M. Aguiar, and M. Püschel. D-ADMM: A communication-efficient distributed algorithm for separable optimization. IEEE Transactions on Signal Processing, 61(10):2718–2723, 2013.
[5] W. Shi, Q. Ling, G. Wu, and W. Yin. EXTRA: An exact first-order algorithm for decentralized consensus optimization. SIAM Journal on Optimization, 25(2):944–966, 2015.
[6] W. Shi, Q. Ling, G. Wu, and W. Yin. A proximal gradient algorithm for decentralized composite optimization. IEEE Transactions on Signal Processing, 63(22):6013–6023, 2015.
[7] A. Nedić, A. Olshevsky, and W. Shi. Achieving geometric convergence for distributed optimization over time-varying graphs. SIAM Journal on Optimization, 27(4):2597–2633, 2017.

34 / 36


[8] K. Yuan, B. Ying, X. Zhao, and A. H. Sayed. Exact diffusion for distributed optimization and learning – Part I: Algorithm development. To appear in IEEE Transactions on Signal Processing, 2018.
[9] Z. Li, W. Shi, and M. Yan. A decentralized proximal-gradient method with network independent step-sizes and separated convergence rates. arXiv:1704.07807, April 2017.
[10] D. P. Bertsekas. A new class of incremental gradient methods for least squares problems. SIAM Journal on Optimization, 7(4):913–926, 1997.
[11] S. S. Ram, A. Nedić, and V. V. Veeravalli. Incremental stochastic subgradient algorithms for convex optimization. SIAM Journal on Optimization, 20(2):691–717, 2009.
[12] C. G. Lopes and A. H. Sayed. Incremental adaptive strategies over distributed networks. IEEE Transactions on Signal Processing, 55(8):4064–4077, 2007.
[13] C. G. Lopes and A. H. Sayed. Randomized incremental protocols over adaptive networks. In Proc. ICASSP, pages 3514–3517, Dallas, TX, 2010.
[14] W. Shi, Q. Ling, K. Yuan, G. Wu, and W. Yin. On the linear convergence of the ADMM in decentralized consensus optimization. IEEE Transactions on Signal Processing, 62(7):1750–1761, 2014.
[15] X. Mao, K. Yuan, Y. Hu, Y. Gu, A. H. Sayed, and W. Yin. Walkman: A communication-efficient random-walk algorithm for decentralized optimization. arXiv preprint arXiv:1804.06568, 2018.

35 / 36


[16] D. A. Levin and Y. Peres. Markov Chains and Mixing Times, volume 107. Second edition, American Mathematical Society, 2017.
[17] E. N. Gilbert. Random graphs. The Annals of Mathematical Statistics, 30(4):1141–1144, 1959.

36 / 36