Communication-Efficient Decentralized Learning
Yuejie Chi
EdgeComm Workshop, 2020

Acknowledgements
Boyue Li (CMU), Shicong Cen (CMU), Yuxin Chen (Princeton)

Communication-Efficient Distributed Optimization in Networks with Gradient Tracking and Variance Reduction, JMLR, 2020.
Distributed empirical risk minimization

Distributed/federated learning: due to privacy and scalability, data are distributed across multiple locations / workers / agents. Let M = \cup_i M_i be a data partition with equal splitting:

f(x) := \frac{1}{n} \sum_{i=1}^{n} f_i(x), \quad \text{where} \quad f_i(x) := \frac{1}{N/n} \sum_{z \in M_i} \ell(x; z).

[Figure: agents holding local objectives f_1(x), ..., f_5(x).]

Notation: N = number of total samples, n = number of agents, m = N/n = number of local samples per agent.
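To make the notation concrete, here is a minimal numerical sketch of this objective under an equal data split, assuming a least-squares loss ℓ(x; (a, y)) = (a^T x - y)^2 on synthetic data; all variable names and numbers below are illustrative, not from the talk.

```python
# Minimal sketch of the distributed ERM objective with an equal data split.
# Assumes a least-squares loss ell(x; (a, y)) = (a @ x - y)**2 on synthetic
# data; all names and numbers here are illustrative, not from the talk.
import numpy as np

rng = np.random.default_rng(0)
N, n, d = 1000, 5, 10                 # total samples, agents, dimension
m = N // n                            # local samples per agent
A, y = rng.standard_normal((N, d)), rng.standard_normal(N)

# Equal split M = M_1 ∪ ... ∪ M_n across agents.
shards = [(A[i * m:(i + 1) * m], y[i * m:(i + 1) * m]) for i in range(n)]

def f_local(x, Ai, yi):
    """f_i(x) = (1/m) * sum of squared errors over the local shard M_i."""
    return np.mean((Ai @ x - yi) ** 2)

def f_global(x):
    """f(x) = (1/n) * sum_i f_i(x)."""
    return np.mean([f_local(x, Ai, yi) for Ai, yi in shards])

# With an equal split, the average of local losses equals the full-data average.
x0 = np.zeros(d)
assert np.isclose(f_global(x0), np.mean((A @ x0 - y) ** 2))
```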
Decentralized ERM - algorithmic framework

\text{minimize}_{x} \quad f(x) := \frac{1}{n} \sum_{i=1}^{n} f_i(x)
\Downarrow
\text{minimize}_{x_i} \quad \frac{1}{n} \sum_{i=1}^{n} f_i(x_i) \quad \text{subject to} \quad x_i = x_j

- Local computation: agents update local estimates ⇒ need to be scalable!
- Global communication: agents exchange information to reach consensus ⇒ need to be communication-efficient!

Guiding principle: more local computation leads to less communication.
Two distributed schemes

- Master/slave model: a parameter server (PS) coordinates global information sharing.
- Network model: agents share local information over a graph topology.

[Figure: five agents with local objectives f_1(x), ..., f_5(x), connected either to a central parameter server or to each other over a graph.]
Distributed first-order methods in the master/slave setting

x_i^t = \text{LocalUpdate}(f_i, \nabla f(x^t), x^t),

where the parameter server maintains the global quantities by averaging:

x^t = \frac{1}{n} \sum_{i=1}^{n} x_i^{t-1} \quad \text{(parameter consensus)}, \qquad \nabla f(x^t) = \frac{1}{n} \sum_{i=1}^{n} \nabla f_i(x^t) \quad \text{(gradient consensus)},

and each f_i is defined by local data.

Distributed Approximate NEwton (DANE) (Shamir et al., 2014):

x_i^t = \arg\min_x \; f_i(x) - \langle \nabla f_i(x^{t-1}) - \nabla f(x^{t-1}), x \rangle + \frac{\mu}{2} \| x - x^{t-1} \|_2^2

- Quasi-Newton-type method, less sensitive to ill-conditioning.

Distributed Stochastic Variance-Reduced Gradients (Cen et al., 2020):

x_i^{t,s} = x_i^{t,s-1} - \eta \, v_i^{t,s-1}, \quad s = 1, 2, \ldots,

where v_i^{t,s-1} is a variance-reduced stochastic gradient.

- Better local computation efficiency.
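As a concrete illustration of this template, here is a minimal sketch, assuming quadratic local objectives and solving the DANE subproblem approximately with a few inner gradient steps; the problem data, step sizes, and helper names are illustrative rather than a reference implementation of the algorithms above.

```python
# Minimal sketch of the master/slave template with a DANE-style local update.
# Quadratic local objectives f_i(x) = 0.5 x'Q_i x - b_i'x are assumed, and the
# DANE subproblem is solved approximately with a few inner gradient steps;
# problem data, step sizes, and helper names are illustrative.
import numpy as np

rng = np.random.default_rng(1)
n, d = 5, 10
Qs = [np.eye(d) + 0.1 * rng.standard_normal((d, d)) for _ in range(n)]
Qs = [Q @ Q.T for Q in Qs]                    # make each Q_i positive definite
bs = [rng.standard_normal(d) for _ in range(n)]
grad_i = lambda i, x: Qs[i] @ x - bs[i]       # local gradient of f_i

def dane_local_update(i, x_prev, g_global, mu=1.0, inner_steps=50, lr=0.05):
    """Approximately solve argmin_x f_i(x) - <grad f_i(x_prev) - grad f(x_prev), x>
    + (mu/2) ||x - x_prev||^2 by inner gradient descent."""
    shift = grad_i(i, x_prev) - g_global
    x = x_prev.copy()
    for _ in range(inner_steps):
        x -= lr * (grad_i(i, x) - shift + mu * (x - x_prev))
    return x

x = np.zeros(d)
for t in range(100):
    g_global = np.mean([grad_i(i, x) for i in range(n)], axis=0)             # gradient consensus at the PS
    local_iterates = [dane_local_update(i, x, g_global) for i in range(n)]   # local computation
    x = np.mean(local_iterates, axis=0)                                      # parameter consensus at the PS
```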
Naive extension to the network setting

- Communicate: agent i transmits \{x_i^t, \nabla f_i(x_i^t)\} to its neighbors;
- Compute:
  x_i^t \Leftarrow \text{LocalUpdate}(f_i, \text{Avg}\{\nabla f_j(x_j^t)\}_{j \in N_i}, \text{Avg}\{x_j^t\}_{j \in N_i}),
  where the neighborhood averages serve as surrogates of \nabla f(x^t) and x^t, respectively.

[Figure: optimality gap vs. iterations for SVRG and the naive Network-SVRG; the naive scheme doesn't converge to the global optimum!]

Consensus needs to be designed carefully in the network setting!
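For concreteness, the following sketch implements this naive scheme with a ring topology and a single gradient step as LocalUpdate; the topology, quadratic costs, and step size are illustrative assumptions. As emphasized above, this is exactly the construction that can fail to reach the global optimum.

```python
# Sketch of the naive network extension: each agent averages its neighbors'
# iterates and gradients and treats them as surrogates of x^t and grad f(x^t),
# then performs one gradient step as LocalUpdate. Ring topology, quadratic
# costs, and step size are illustrative; as noted above, this naive scheme
# need not converge to the global optimum.
import numpy as np

rng = np.random.default_rng(2)
n, d, eta = 5, 10, 0.1
Qs = [np.eye(d) + 0.1 * rng.standard_normal((d, d)) for _ in range(n)]
Qs = [Q @ Q.T for Q in Qs]
bs = [rng.standard_normal(d) for _ in range(n)]
grad_i = lambda i, x: Qs[i] @ x - bs[i]

neighbors = [[(i - 1) % n, i, (i + 1) % n] for i in range(n)]  # ring graph, incl. self

X = [np.zeros(d) for _ in range(n)]
for t in range(200):
    x_avg = [np.mean([X[j] for j in neighbors[i]], axis=0) for i in range(n)]             # surrogate of x^t
    g_avg = [np.mean([grad_i(j, X[j]) for j in neighbors[i]], axis=0) for i in range(n)]  # surrogate of grad f(x^t)
    X = [x_avg[i] - eta * g_avg[i] for i in range(n)]          # naive LocalUpdate: one gradient step
```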
Average dynamic consensus

Assume that each agent j generates some time-varying quantity r_j^t. How can every agent track the dynamic average

\frac{1}{n} \sum_{j=1}^{n} r_j^t = \frac{1}{n} \mathbf{1}_n^\top r^t, \quad \text{where } r^t = [r_1^t, \cdots, r_n^t]^\top?

- Dynamic average consensus (Zhu and Martinez, 2010):

q^t = \underbrace{W q^{t-1}}_{\text{mixing}} + \underbrace{r^t - r^{t-1}}_{\text{correction}},

where q^t = [q_1^t, \cdots, q_n^t]^\top and W is the mixing matrix.

- Key property: the average of \{q_i^t\} dynamically tracks the average of \{r_i^t\}:

\mathbf{1}_n^\top q^t = \mathbf{1}_n^\top r^t.

M. Zhu and S. Martinez, "Discrete-time dynamic average consensus," Automatica, 2010.
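A minimal sketch of this recursion, assuming a doubly stochastic mixing matrix on a ring and slowly drifting signals r_j^t (both illustrative); it checks numerically that the sum of q^t always equals the sum of r^t when the recursion starts from q^0 = r^0.

```python
# Minimal sketch of dynamic average consensus: q^t = W q^{t-1} + r^t - r^{t-1}.
# The doubly stochastic mixing matrix on a ring and the slowly drifting
# signals r_j^t are illustrative; the point is the invariant 1'q^t = 1'r^t
# when the recursion is initialized with q^0 = r^0.
import numpy as np

n, T = 6, 200

# Doubly stochastic mixing matrix for a ring: weight 1/3 to self and each neighbor.
W = np.zeros((n, n))
for i in range(n):
    W[i, [(i - 1) % n, i, (i + 1) % n]] = 1.0 / 3.0

def r(t):
    """Each agent's time-varying quantity; drifts slowly so tracking can keep up."""
    return np.sin(0.01 * t + np.arange(n))

q = r(0)                                      # initialize q^0 = r^0
for t in range(1, T):
    q = W @ q + r(t) - r(t - 1)
    assert np.isclose(q.sum(), r(t).sum())    # key property: 1'q^t = 1'r^t at every t

print(np.max(np.abs(q - r(T - 1).mean())))    # each q_i hovers near the current average
```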
Gradient tracking

Replace the global quantities in the local update with locally tracked surrogates:

x_j^t \Leftarrow \text{LocalUpdate}(f_j, s_j^t, y_j^t),

where s_j^t stands in for \nabla f(x^t) and y_j^t stands in for x^t.

- Parameter averaging:

y_j^t = \sum_{k \in N_j} w_{jk} x_k^{t-1};

- Gradient tracking:

s_j^t = \sum_{k \in N_j} w_{jk} s_k^{t-1} + \nabla f_j(y_j^t) - \nabla f_j(y_j^{t-1}).

We can now apply the same DANE- and SVRG-type local updates!
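A minimal sketch of this recursion, using a single gradient step as LocalUpdate (i.e., decentralized gradient descent with gradient tracking); the ring topology, quadratic local costs, and step size are illustrative assumptions, not the settings used in the talk.

```python
# Minimal sketch of gradient tracking with a single gradient step as
# LocalUpdate (decentralized gradient descent with gradient tracking).
# Ring topology, quadratic local costs, and step size are illustrative.
import numpy as np

rng = np.random.default_rng(4)
n, d, eta, T = 5, 10, 0.05, 500
Qs = [np.eye(d) + 0.1 * rng.standard_normal((d, d)) for _ in range(n)]
Qs = [Q @ Q.T for Q in Qs]
bs = [rng.standard_normal(d) for _ in range(n)]
grad = lambda j, x: Qs[j] @ x - bs[j]

# Doubly stochastic mixing matrix on a ring (self + two neighbors).
W = np.zeros((n, n))
for j in range(n):
    W[j, [(j - 1) % n, j, (j + 1) % n]] = 1.0 / 3.0

X = np.zeros((n, d))                                      # x_j^0
Y = X.copy()                                              # y_j^0
S = np.stack([grad(j, Y[j]) for j in range(n)])           # s_j^0 = grad f_j(y_j^0)

for t in range(T):
    Y_new = W @ X                                         # parameter averaging: y_j^t
    S = W @ S + np.stack([grad(j, Y_new[j]) - grad(j, Y[j]) for j in range(n)])  # gradient tracking: s_j^t
    X = Y_new - eta * S                                   # LocalUpdate: one step along the tracked gradient
    Y = Y_new

# All agents should approach the minimizer of (1/n) * sum_j f_j.
x_star = np.linalg.solve(np.mean(Qs, axis=0), np.mean(bs, axis=0))
print(np.max(np.abs(X - x_star)))
```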
Linear Regression: Well-Conditioned

f_i(x) = \| y_i - A_i x \|_2^2, \quad A_i \in \mathbb{R}^{1000 \times 40}

[Figure: the optimality gap (f(\bar{x}^{(t)}) - f^\star)/f^\star w.r.t. iterations and w.r.t. gradient evaluations (normalized by the number of samples), for DANE, DGD, Network-DANE, Network-SVRG, and Network-SARAH. Condition number \kappa = 10; ER graph (p = 0.3), 20 agents.]
Linear Regression: Ill-Conditioned

f_i(x) = \| y_i - A_i x \|_2^2, \quad A_i \in \mathbb{R}^{1000 \times 40}

[Figure: the optimality gap (f(\bar{x}^{(t)}) - f^\star)/f^\star w.r.t. iterations and w.r.t. gradient evaluations (normalized by the number of samples), for DANE, DGD, Network-DANE, Network-SVRG, and Network-SARAH. Condition number \kappa = 10^4; ER graph (p = 0.3), 20 agents.]
Extra Mixing

The mixing rate of the graph is \alpha_0 = 0.922. A single round of mixing within each iteration cannot ensure the convergence of Network-SVRG.

[Figure: the optimality gap (f(\bar{x}^{(t)}) - f^\star)/f^\star of Network-SVRG w.r.t. iterations and w.r.t. K times the number of iterations, for K = 1, 2, 5, 20, 50 rounds of mixing per iteration.]
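A small sketch of what extra mixing does to the mixing rate, assuming a symmetric doubly stochastic W: performing K communication rounds per iteration amounts to mixing once with W^K, whose rate is \alpha_0^K. The matrix below is illustrative and does not reproduce the \alpha_0 = 0.922 graph used in the experiments.

```python
# Small sketch of extra mixing: performing K communication rounds per
# iteration amounts to replacing the mixing matrix W by W^K, which lowers
# the mixing rate from alpha_0 to alpha_0^K (exactly so for symmetric W).
# The W below is illustrative and does not match the alpha_0 = 0.922 graph
# used in the experiments.
import numpy as np

def extra_mixing(W, K):
    """K consecutive rounds of mixing are equivalent to mixing once with W^K."""
    return np.linalg.matrix_power(W, K)

n = 20
J = np.full((n, n), 1.0 / n)                 # projector onto the consensus direction
W = 0.1 * J + 0.9 * np.eye(n)                # a symmetric, doubly stochastic example
alpha_0 = np.linalg.norm(W - J, 2)           # mixing rate (second-largest singular value)

for K in (1, 2, 5, 20, 50):
    alpha_K = np.linalg.norm(extra_mixing(W, K) - J, 2)
    print(K, alpha_0 ** K, alpha_K)          # alpha_K equals alpha_0 ** K here
```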
Final remarks

- Gradient tracking provides a way to extend master/slave algorithms (DANE and SVRG) to network settings;
- Computation-communication trade-offs can be probed by employing different local updates and extra mixing.

Future work:
- Convergence analysis in the nonconvex case.

Thank you!

B. Li, S. Cen, Y. Chen, and Y. Chi, "Communication-Efficient Distributed Optimization in Networks with Gradient Tracking and Variance Reduction," JMLR, 2020.