Quantized Decentralized Stochastic Learning over Directed Graphs
  1. Quantized Decentralized Stochastic Learning over Directed Graphs. Hossein Taheri¹, joint work with Aryan Mokhtari², Hamed Hassani³, and Ramtin Pedarsani¹. ¹University of California, Santa Barbara; ²University of Texas at Austin; ³University of Pennsylvania. Thirty-seventh International Conference on Machine Learning (ICML), 2020.

  2. Decentralized Optimization
     - Decentralized stochastic learning involves multiple agents (nodes) that collect data and want to learn an ML model collaboratively.
     - Applications include federated learning, multi-agent robotic systems, sensor networks, etc.
     - In many cases, communication links are asymmetric due to failures and bottlenecks, and communication is done over a directed graph [Tsianos et al. 2012, Nedic et al. 2014, Assran et al. 2020].

  3. This Talk
     - Link failure: nodes communicate over a directed graph.
     - High communication cost: nodes communicate compressed information Q(x), where Q: ℝ^d → ℝ^d is a compression operator.

  4. Introduction: Push-sum Algorithm
     Decentralized optimization over directed graphs with exact communication:
         x_i(t+1) = Σ_{j=1}^n w_ij x_j(t) − α(t) ∇f_i(z_i(t))
         y_i(t+1) = Σ_{j=1}^n w_ij y_j(t)
         z_i(t+1) = x_i(t+1) / y_i(t+1)
     - [Nedic et al. 2014] prove that for convex, Lipschitz objectives, choosing α(t) = O(1/√T) gives f(z̄_i(T)) − f⋆ = O(1/√T), where z̄_i(T) = (1/T) Σ_{t=1}^T z_i(t).
     - How can we incorporate quantized message exchanging in this setting? (A sketch of one push-sum iteration follows below.)
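
As a concrete reference, here is a minimal NumPy sketch of one push-sum iteration. Simulating all n nodes with a single matrix update, the column-stochastic matrix W, and the gradient oracle grad_f are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def push_sum_step(X, y, W, grad_f, alpha):
    """One push-sum iteration (a sketch, not the authors' code).

    X: (n, d) array, row i holds x_i(t).
    y: (n,) push-sum weights y_i(t), initialized to ones.
    W: (n, n) column-stochastic mixing matrix; W[i, j] > 0 iff node j sends to node i.
    grad_f: callable, grad_f(i, z) returns node i's gradient at z.
    alpha: step size alpha(t).
    """
    n = X.shape[0]
    Z = X / y[:, None]                      # de-biased estimates z_i(t) = x_i(t) / y_i(t)
    G = np.stack([grad_f(i, Z[i]) for i in range(n)])
    X_next = W @ X - alpha * G              # x_i(t+1) = sum_j w_ij x_j(t) - alpha(t) grad f_i(z_i(t))
    y_next = W @ y                          # y_i(t+1) = sum_j w_ij y_j(t)
    return X_next, y_next
```

Because W only needs to be column-stochastic (each node splits its outgoing mass over its out-neighbors), the total mass Σ_i x_i is preserved without requiring doubly stochastic weights, which is what makes the scheme work on directed graphs.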

  5. Proposed Algorithm: Quantized Push-sum
     We propose the quantized push-sum algorithm for stochastic optimization (a simulation sketch follows below):
         q_i(t) = Q(x_i(t) − x̂_i(t))
         for all nodes k ∈ N_i^out and j ∈ N_i^in do
             send q_i(t) and y_i(t) to k, and receive q_j(t) and y_j(t) from j
             x̂_j(t+1) = x̂_j(t) + q_j(t)
         end for
         v_i(t+1) = x_i(t) − x̂_i(t+1) + Σ_{j ∈ N_i^in} w_ij x̂_j(t+1)
         y_i(t+1) = Σ_{j ∈ N_i^in} w_ij y_j(t)
         z_i(t+1) = v_i(t+1) / y_i(t+1)
         x_i(t+1) = v_i(t+1) − α(t+1) ∇F_i(z_i(t+1))
     - x̂_j(t) is stored in all out-neighbors of node j.
     - x̂_j(t) → x_j(t), and therefore q_j(t) → 0 (similar to [Koloskova et al. 2018]).
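
The following NumPy sketch simulates one iteration of these updates for all n nodes at once. The matrix formulation, the compressor Q, and the stochastic gradient oracle grad_F are assumptions for illustration; in a real deployment each node would hold only its own row plus the x̂_j copies of its in-neighbors.

```python
import numpy as np

def quantized_push_sum_step(X, X_hat, y, W, Q, grad_F, alpha):
    """One quantized push-sum iteration (a sketch under the assumptions above).

    X: (n, d) local models x_i(t).
    X_hat: (n, d) shared copies x_hat_i(t); every out-neighbor of i stores the same row.
    y: (n,) push-sum weights; W: (n, n) column-stochastic mixing matrix.
    Q: compression operator; grad_F: grad_F(i, z) -> stochastic gradient of f_i at z.
    """
    n = X.shape[0]
    # Each node broadcasts only the compressed difference q_i(t),
    # so all copies of x_hat_i stay synchronized.
    Q_msgs = np.stack([Q(X[i] - X_hat[i]) for i in range(n)])
    X_hat = X_hat + Q_msgs                        # x_hat_j(t+1) = x_hat_j(t) + q_j(t)
    V = X - X_hat + W @ X_hat                     # v_i(t+1)
    y = W @ y                                     # y_i(t+1)
    Z = V / y[:, None]                            # z_i(t+1) = v_i(t+1) / y_i(t+1)
    G = np.stack([grad_F(i, Z[i]) for i in range(n)])
    X = V - alpha * G                             # x_i(t+1)
    return X, X_hat, y, Z
```

Note that mixing is applied to the shared copies x̂_j, not to the exact x_j; the residual x_i − x̂_i term keeps the update unbiased as the copies converge.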

  6. Assumptions
     Assumptions on graph and connectivity (a numerical check follows below):
     - The graph is strongly connected, and W_ij ≥ 0, W_ii > 0 for all i, j ∈ [n].
     - Note that this results in ‖W^t − φ1′‖ ≤ Cλ^t for all t ≥ 1, where φ ∈ ℝ^n and 0 < λ < 1.
     Assumptions on local objectives:
     - Lipschitz local gradients: ‖∇f_i(y) − ∇f_i(x)‖ ≤ L‖y − x‖ for all x, y ∈ ℝ^d.
     - Bounded stochastic gradients: E_{ζ_i∼D_i} ‖∇F_i(x, ζ_i)‖² ≤ D² for all x ∈ ℝ^d.
     - Bounded variance: E_{ζ_i∼D_i} ‖∇F_i(x, ζ_i) − ∇f_i(x)‖² ≤ σ² for all x ∈ ℝ^d.
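
The geometric mixing bound can be observed numerically. The toy matrix below (a directed 3-cycle with self-loops; the weights are illustrative assumptions) is column-stochastic and strongly connected, and ‖W^t − φ1′‖ decays geometrically in t.

```python
import numpy as np

# Toy column-stochastic W for a strongly connected directed 3-cycle with
# self-loops (W[i, j] > 0 means node j sends to node i); weights are assumed.
W = np.array([[0.5, 0.0, 0.5],
              [0.5, 0.5, 0.0],
              [0.0, 0.5, 0.5]])

# phi is the Perron vector: W phi = phi, normalized so its entries sum to 1.
eigvals, eigvecs = np.linalg.eig(W)
phi = np.real(eigvecs[:, np.argmin(np.abs(eigvals - 1.0))])
phi /= phi.sum()

# The spectral-norm gap ||W^t - phi 1'|| shrinks like C * lambda^t.
for t in [1, 2, 4, 8, 16]:
    gap = np.linalg.norm(np.linalg.matrix_power(W, t) - np.outer(phi, np.ones(3)), 2)
    print(f"t = {t:2d}: ||W^t - phi 1'|| = {gap:.2e}")
```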

  7. Assumptions
     Assumption on the quantization function:
     - The quantization function Q: ℝ^d → ℝ^d satisfies, for all x ∈ ℝ^d,
           E_Q ‖Q(x) − x‖² ≤ ω² ‖x‖²,    (1)
       where 0 ≤ ω < 1.
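
One standard operator satisfying (1) is top-k sparsification; the sketch below is an illustrative example, not necessarily the quantizer used in the paper's experiments.

```python
import numpy as np

def top_k(x, k):
    """Keep the k largest-magnitude coordinates of x and zero the rest.

    Deterministic top-k satisfies ||Q(x) - x||^2 <= (1 - k/d) ||x||^2,
    i.e. assumption (1) with omega^2 = 1 - k/d, so omega < 1 whenever k >= 1.
    """
    q = np.zeros_like(x)
    idx = np.argpartition(np.abs(x), -k)[-k:]   # indices of the k largest |x_i|
    q[idx] = x[idx]
    return q
```

The theorems below additionally require ω ≤ C(λ, γ), which for top-k translates into a lower bound on k/d: the more slowly the graph mixes, the less aggressive the compression can be.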

  8. Convergence Results (Convex Objectives)
     Define γ := ‖W − I‖_2 and C(λ, γ) := 1 / √( 6(1 + 6C²/(1 − λ)²)(1 + γ²) ).
     Theorem 1. Assume the local objectives f_i are convex for all i ∈ [n]. By choosing ω ≤ C(λ, γ) and α = √n / (8L√T), for all T ≥ 1 it holds that
         E[ f( (1/T) Σ_{t=1}^T z_i(t+1) ) ] − f⋆ = O(1/√(nT)).
     - The time average of the local parameters z_i converges to the exact solution!
     - The convergence rate is the same as for undirected graphs with exact communication (e.g., [Yuan et al. 2016]).
     - The error is proportional to 1/√n, i.e. a linear speedup in the number of nodes. (A numerical illustration of the constants follows below.)
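
To make the theorem's requirements concrete, here is a small computation of the quantizer budget C(λ, γ) and the step size α. All numbers (C, λ, γ, n, T, L) are assumed for demonstration only, since they depend on the specific graph and objective.

```python
import numpy as np

# Illustrative constants: C and lam come from the mixing bound
# ||W^t - phi 1'|| <= C lam^t; gamma := ||W - I||_2. All values assumed.
C, lam = 2.0, 0.8
gamma = 1.2
n, T, L = 10, 10_000, 1.0

C_lg = 1.0 / np.sqrt(6 * (1 + 6 * C**2 / (1 - lam)**2) * (1 + gamma**2))
alpha = np.sqrt(n) / (8 * L * np.sqrt(T))
print(f"quantizer budget: omega <= {C_lg:.4f}")
print(f"step size:        alpha  = {alpha:.4f}")
# With top-k compression (omega^2 = 1 - k/d), the budget means
# k/d >= 1 - C_lg**2: a slowly mixing graph permits only mild compression.
```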

  9. Convergence Results (Non-Convex Objectives)
     Theorem 2. Let ω ≤ C(λ, γ) and α = √n / (L√T). Then, after a sufficiently large number of iterations (T ≥ 4n), it holds that
         (1/T) Σ_{t=1}^T E‖ ∇f( (1/n) Σ_{i=1}^n x_i(t) ) ‖² = O(1/√(nT)).
