Quantized Decentralized Stochastic Learning over Directed Graphs
  1. Quantized Decentralized Stochastic Learning over Directed Graphs. Hossein Taheri¹, joint work with Aryan Mokhtari², Hamed Hassani³, and Ramtin Pedarsani¹. ¹University of California, Santa Barbara; ²University of Texas at Austin; ³University of Pennsylvania. Thirty-seventh International Conference on Machine Learning (ICML), 2020.

  2. Decentralized Optimization
     - Decentralized stochastic learning involves multiple agents (nodes) that collect data and want to learn an ML model collaboratively.
     - Applications include federated learning, multi-agent robotic systems, sensor networks, etc.
     - In many cases, communication links are asymmetric due to failures and bottlenecks, and communication is done over a directed graph [Tsianos et al. 2012, Nedic et al. 2014, Assran et al. 2020].

  3. This Talk
     - Link failure: nodes communicate over a directed graph.
     - High communication cost: nodes communicate compressed information Q(x), where Q: ℝ^d → ℝ^d is a compression operator.

  4. Introduction: Push-sum Algorithm
     Decentralized optimization over directed graphs with exact communication:
         x_i(t+1) = Σ_{j=1}^n w_ij x_j(t) − α(t) ∇f_i(z_i(t))
         y_i(t+1) = Σ_{j=1}^n w_ij y_j(t)
         z_i(t+1) = x_i(t+1) / y_i(t+1)
     - [Nedic et al. 2014] prove that for convex, Lipschitz objectives, choosing α(t) = O(1/√T) gives f(z̄_i(T)) − f⋆ = O(1/√T), where z̄_i(T) = (1/T) Σ_{t=1}^T z_i(t).
     - How can we incorporate quantized message exchanging in this setting? (A sketch of one push-sum iteration follows below.)
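
As a concrete reference, here is a minimal NumPy sketch of one push-sum iteration. Simulating all n nodes with a single matrix update, the column-stochastic matrix W, and the gradient oracle grad_f are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def push_sum_step(X, y, W, grad_f, alpha):
    """One push-sum iteration (a sketch, not the authors' code).

    X: (n, d) array, row i holds x_i(t).
    y: (n,) push-sum weights y_i(t), initialized to ones.
    W: (n, n) column-stochastic mixing matrix; W[i, j] > 0 iff node j sends to node i.
    grad_f: callable, grad_f(i, z) returns node i's gradient at z.
    alpha: step size alpha(t).
    """
    n = X.shape[0]
    Z = X / y[:, None]                      # de-biased estimates z_i(t) = x_i(t) / y_i(t)
    G = np.stack([grad_f(i, Z[i]) for i in range(n)])
    X_next = W @ X - alpha * G              # x_i(t+1) = sum_j w_ij x_j(t) - alpha(t) grad f_i(z_i(t))
    y_next = W @ y                          # y_i(t+1) = sum_j w_ij y_j(t)
    return X_next, y_next
```

Because W only needs to be column-stochastic (each node splits its outgoing mass over its out-neighbors), the total mass Σ_i x_i is preserved without requiring doubly stochastic weights, which is what makes the scheme work on directed graphs.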

  5. Proposed Algorithm: Quantized Push-sum
     We propose the quantized push-sum algorithm for stochastic optimization (a simulation sketch follows below):
         q_i(t) = Q(x_i(t) − x̂_i(t))
         for all nodes k ∈ N_i^out and j ∈ N_i^in do
             send q_i(t) and y_i(t) to k, and receive q_j(t) and y_j(t) from j
             x̂_j(t+1) = x̂_j(t) + q_j(t)
         end for
         v_i(t+1) = x_i(t) − x̂_i(t+1) + Σ_{j ∈ N_i^in} w_ij x̂_j(t+1)
         y_i(t+1) = Σ_{j ∈ N_i^in} w_ij y_j(t)
         z_i(t+1) = v_i(t+1) / y_i(t+1)
         x_i(t+1) = v_i(t+1) − α(t+1) ∇F_i(z_i(t+1))
     - x̂_j(t) is stored in all out-neighbors of node j.
     - x̂_j(t) → x_j(t), and therefore q_j(t) → 0 (similar to [Koloskova et al. 2018]).
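
The following NumPy sketch simulates one iteration of these updates for all n nodes at once. The matrix formulation, the compressor Q, and the stochastic gradient oracle grad_F are assumptions for illustration; in a real deployment each node would hold only its own row plus the x̂_j copies of its in-neighbors.

```python
import numpy as np

def quantized_push_sum_step(X, X_hat, y, W, Q, grad_F, alpha):
    """One quantized push-sum iteration (a sketch under the assumptions above).

    X: (n, d) local models x_i(t).
    X_hat: (n, d) shared copies x_hat_i(t); every out-neighbor of i stores the same row.
    y: (n,) push-sum weights; W: (n, n) column-stochastic mixing matrix.
    Q: compression operator; grad_F: grad_F(i, z) -> stochastic gradient of f_i at z.
    """
    n = X.shape[0]
    # Each node broadcasts only the compressed difference q_i(t),
    # so all copies of x_hat_i stay synchronized.
    Q_msgs = np.stack([Q(X[i] - X_hat[i]) for i in range(n)])
    X_hat = X_hat + Q_msgs                        # x_hat_j(t+1) = x_hat_j(t) + q_j(t)
    V = X - X_hat + W @ X_hat                     # v_i(t+1)
    y = W @ y                                     # y_i(t+1)
    Z = V / y[:, None]                            # z_i(t+1) = v_i(t+1) / y_i(t+1)
    G = np.stack([grad_F(i, Z[i]) for i in range(n)])
    X = V - alpha * G                             # x_i(t+1)
    return X, X_hat, y, Z
```

Note that mixing is applied to the shared copies x̂_j, not to the exact x_j; the residual x_i − x̂_i term keeps the update unbiased as the copies converge.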

  6. Assumptions
     Assumptions on graph and connectivity (a numerical check follows below):
     - The graph is strongly connected, and W_ij ≥ 0, W_ii > 0 for all i, j ∈ [n].
     - Note that this results in ‖W^t − φ1′‖ ≤ Cλ^t for all t ≥ 1, where φ ∈ ℝ^n and 0 < λ < 1.
     Assumptions on local objectives:
     - Lipschitz local gradients: ‖∇f_i(y) − ∇f_i(x)‖ ≤ L‖y − x‖ for all x, y ∈ ℝ^d.
     - Bounded stochastic gradients: E_{ζ_i∼D_i} ‖∇F_i(x, ζ_i)‖² ≤ D² for all x ∈ ℝ^d.
     - Bounded variance: E_{ζ_i∼D_i} ‖∇F_i(x, ζ_i) − ∇f_i(x)‖² ≤ σ² for all x ∈ ℝ^d.
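
The geometric mixing bound can be observed numerically. The toy matrix below (a directed 3-cycle with self-loops; the weights are illustrative assumptions) is column-stochastic and strongly connected, and ‖W^t − φ1′‖ decays geometrically in t.

```python
import numpy as np

# Toy column-stochastic W for a strongly connected directed 3-cycle with
# self-loops (W[i, j] > 0 means node j sends to node i); weights are assumed.
W = np.array([[0.5, 0.0, 0.5],
              [0.5, 0.5, 0.0],
              [0.0, 0.5, 0.5]])

# phi is the Perron vector: W phi = phi, normalized so its entries sum to 1.
eigvals, eigvecs = np.linalg.eig(W)
phi = np.real(eigvecs[:, np.argmin(np.abs(eigvals - 1.0))])
phi /= phi.sum()

# The spectral-norm gap ||W^t - phi 1'|| shrinks like C * lambda^t.
for t in [1, 2, 4, 8, 16]:
    gap = np.linalg.norm(np.linalg.matrix_power(W, t) - np.outer(phi, np.ones(3)), 2)
    print(f"t = {t:2d}: ||W^t - phi 1'|| = {gap:.2e}")
```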

  7. Assumptions
     Assumption on the quantization function:
     - The quantization function Q: ℝ^d → ℝ^d satisfies, for all x ∈ ℝ^d,
           E_Q ‖Q(x) − x‖² ≤ ω² ‖x‖²,    (1)
       where 0 ≤ ω < 1.
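
One standard operator satisfying (1) is top-k sparsification; the sketch below is an illustrative example, not necessarily the quantizer used in the paper's experiments.

```python
import numpy as np

def top_k(x, k):
    """Keep the k largest-magnitude coordinates of x and zero the rest.

    Deterministic top-k satisfies ||Q(x) - x||^2 <= (1 - k/d) ||x||^2,
    i.e. assumption (1) with omega^2 = 1 - k/d, so omega < 1 whenever k >= 1.
    """
    q = np.zeros_like(x)
    idx = np.argpartition(np.abs(x), -k)[-k:]   # indices of the k largest |x_i|
    q[idx] = x[idx]
    return q
```

The theorems below additionally require ω ≤ C(λ, γ), which for top-k translates into a lower bound on k/d: the more slowly the graph mixes, the less aggressive the compression can be.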

  8. Convergence Results (Convex Objectives)
     Define γ := ‖W − I‖_2 and C(λ, γ) := 1 / √( 6(1 + 6C²/(1 − λ)²)(1 + γ²) ).
     Theorem 1. Assume the local objectives f_i are convex for all i ∈ [n]. By choosing ω ≤ C(λ, γ) and α = √n / (8L√T), for all T ≥ 1 it holds that
         E[ f( (1/T) Σ_{t=1}^T z_i(t+1) ) ] − f⋆ = O(1/√(nT)).
     - The time average of the local parameters z_i converges to the exact solution!
     - The convergence rate is the same as for undirected graphs with exact communication (e.g., [Yuan et al. 2016]).
     - The error is proportional to 1/√n, i.e. a linear speedup in the number of nodes. (A numerical illustration of the constants follows below.)
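
To make the theorem's requirements concrete, here is a small computation of the quantizer budget C(λ, γ) and the step size α. All numbers (C, λ, γ, n, T, L) are assumed for demonstration only, since they depend on the specific graph and objective.

```python
import numpy as np

# Illustrative constants: C and lam come from the mixing bound
# ||W^t - phi 1'|| <= C lam^t; gamma := ||W - I||_2. All values assumed.
C, lam = 2.0, 0.8
gamma = 1.2
n, T, L = 10, 10_000, 1.0

C_lg = 1.0 / np.sqrt(6 * (1 + 6 * C**2 / (1 - lam)**2) * (1 + gamma**2))
alpha = np.sqrt(n) / (8 * L * np.sqrt(T))
print(f"quantizer budget: omega <= {C_lg:.4f}")
print(f"step size:        alpha  = {alpha:.4f}")
# With top-k compression (omega^2 = 1 - k/d), the budget means
# k/d >= 1 - C_lg**2: a slowly mixing graph permits only mild compression.
```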

  9. Convergence Results (Non-Convex Objectives)
     Theorem 2. Let ω ≤ C(λ, γ) and α = √n / (L√T). Then, after a sufficiently large number of iterations (T ≥ 4n), it holds that
         (1/T) Σ_{t=1}^T E‖ ∇f( (1/n) Σ_{i=1}^n x_i(t) ) ‖² = O(1/√(nT)).
