  1. Communication-Efficient Decentralized Learning. Yuejie Chi. EdgeComm Workshop, 2020.

  2. Acknowledgements: Boyue Li (CMU), Shicong Cen (CMU), Yuxin Chen (Princeton). Based on: Communication-Efficient Distributed Optimization in Networks with Gradient Tracking and Variance Reduction, JMLR, 2020.

  3. Distributed empirical risk minimization. Distributed/federated learning: due to privacy and scalability, data are distributed at multiple locations / workers / agents. Let $\mathcal{M} = \cup_i \mathcal{M}_i$ be a data partition with equal splitting: $f(x) := \frac{1}{n} \sum_{i=1}^{n} f_i(x)$, where $f_i(x) := \frac{1}{N/n} \sum_{z \in \mathcal{M}_i} \ell(x; z)$. Here $N$ = number of total samples, $n$ = number of agents, $N/n$ = number of local samples. (Figure: $n$ agents holding local losses $f_1(x), \ldots, f_5(x)$.)
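The equal-splitting identity above can be sanity-checked numerically: averaging the local risks $f_i$ recovers the empirical risk over all $N$ samples. A minimal sketch, using a hypothetical scalar squared loss and a made-up data split:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: N = 12 scalar samples split evenly across n = 3 agents.
N, n = 12, 3
data = rng.normal(size=N)
shards = np.split(data, n)          # M = union of M_i, equal splitting

def loss(x, z):
    # squared loss l(x; z) for a scalar location model
    return (x - z) ** 2

def f_i(x, shard):
    # local empirical risk: average of l(x; z) over the N/n local samples
    return sum(loss(x, z) for z in shard) / (N / n)

def f(x):
    # global risk is the average of the n local risks
    return sum(f_i(x, s) for s in shards) / n

x = 0.5
# averaging local risks equals the empirical risk over all N samples
global_direct = sum(loss(x, z) for z in data) / N
assert abs(f(x) - global_direct) < 1e-12
```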

  7. Decentralized ERM: algorithmic framework. $\text{minimize}_x \; f(x) := \frac{1}{n} \sum_{i=1}^{n} f_i(x)$ $\Downarrow$ $\text{minimize}_{x_i} \; \frac{1}{n} \sum_{i=1}^{n} f_i(x_i)$ subject to $x_i = x_j$. • Local computation: agents update local estimates; ⇒ needs to be scalable! • Global communication: agents exchange information to reach consensus; ⇒ needs to be communication-efficient! Guiding principle: more local computation leads to less communication.

  9. Two distributed schemes. (Figure: left, agents $f_1(x), \ldots, f_5(x)$ connected to a central parameter server; right, the same agents connected over a graph.) Master/slave model: the parameter server (PS) coordinates global information sharing. Network model: agents share local information over a graph topology.

  10. Distributed first-order methods in the master/slave setting. Each agent holds local data and runs $x_i^t = \mathrm{LocalUpdate}(f_i, \nabla f(x^t), x^t)$, while the parameter server maintains parameter consensus, $x^t = \frac{1}{n} \sum_{i=1}^{n} x_i^{t-1}$, and gradient consensus, $\nabla f(x^t) = \frac{1}{n} \sum_{i=1}^{n} \nabla f_i(x^t)$.
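The master/slave template above can be sketched with the simplest possible LocalUpdate, a plain gradient step from the consensus point. This is a minimal illustration with hypothetical quadratic local losses (not from the talk); with full consensus every agent holds the same iterate, so one vector suffices:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical quadratic local losses f_i(x) = 0.5 * ||A_i x - b_i||^2,
# a stand-in for the agents' empirical risks.
n, d = 4, 3
A = [rng.normal(size=(10, d)) for _ in range(n)]
b = [rng.normal(size=10) for _ in range(n)]

def grad_i(i, x):
    return A[i].T @ (A[i] @ x - b[i])

eta = 0.01
x = np.zeros(d)
for _ in range(2000):
    # gradient consensus: the PS averages the agents' local gradients,
    # then every agent takes the same gradient step (LocalUpdate),
    # so parameter consensus is trivial here
    g = sum(grad_i(i, x) for i in range(n)) / n
    x = x - eta * g

# the iterates approach the centralized least-squares solution
A_all, b_all = np.vstack(A), np.concatenate(b)
x_star = np.linalg.lstsq(A_all, b_all, rcond=None)[0]
assert np.linalg.norm(x - x_star) < 1e-5
```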

  14. Distributed Approximate NEwton (DANE) (Shamir et al., 2014): $x_i^t = \operatorname{argmin}_x \; f_i(x) - \langle \nabla f_i(x^{t-1}) - \nabla f(x^{t-1}), x \rangle + \frac{\mu}{2} \| x - x^{t-1} \|_2^2$. • A quasi-Newton-type method that is less sensitive to ill-conditioning.
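For quadratic local losses the DANE subproblem has a closed-form solution, which makes the update easy to test. A minimal sketch under assumed diagonal positive-definite local Hessians (a toy choice, not from the talk, picked so the linear solve and the convergence behavior are easy to verify):

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical quadratic local losses f_i(x) = 0.5 x^T H_i x - c_i^T x with
# diagonal PD Hessians, so the DANE subproblem is a small linear solve.
n, d, mu = 4, 3, 10.0
H = [np.diag(1.0 + 4.0 * rng.random(d)) for _ in range(n)]
c = [rng.normal(size=d) for _ in range(n)]
H_bar = sum(H) / n
c_bar = sum(c) / n

def grad(i, x):
    return H[i] @ x - c[i]

def dane_step(x_prev):
    g = sum(grad(i, x_prev) for i in range(n)) / n     # gradient consensus
    x_new = []
    for i in range(n):
        # x_i^t = argmin_x f_i(x) - <grad f_i(x_prev) - g, x> + (mu/2)||x - x_prev||^2
        # first-order condition: (H_i + mu I) x = c_i + grad f_i(x_prev) - g + mu x_prev
        rhs = c[i] + grad(i, x_prev) - g + mu * x_prev
        x_new.append(np.linalg.solve(H[i] + mu * np.eye(d), rhs))
    return sum(x_new) / n                              # parameter consensus

x = np.zeros(d)
for _ in range(500):
    x = dane_step(x)

x_star = np.linalg.solve(H_bar, c_bar)                 # global minimizer
assert np.linalg.norm(x - x_star) < 1e-8
```

Eliminating the argmin shows each agent takes a step preconditioned by $(H_i + \mu I)^{-1}$, which is why DANE tolerates ill-conditioning better than plain gradient steps.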

  15. Distributed Stochastic Variance-Reduced Gradients (Cen et al., 2020): $x_i^{t,s} \Leftarrow x_i^{t,s-1} - \eta \, v_i^{t,s-1}, \; s = 1, 2, \ldots$, where $v_i^{t,s-1}$ is a variance-reduced stochastic gradient. • Better local computation efficiency.
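The key property of an SVRG-style variance-reduced gradient is that it stays unbiased: the stochastic difference term is corrected by a full gradient evaluated at an anchor point. A minimal single-agent sketch with hypothetical scalar least-squares samples (the anchor point and sample model are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical scalar least-squares samples: l(x; (a_k, y_k)) = 0.5 (a_k x - y_k)^2
a = rng.normal(size=20)
y = rng.normal(size=20)

def grad_sample(x, k):
    return a[k] * (a[k] * x - y[k])

def grad_full(x):
    return np.mean([grad_sample(x, k) for k in range(len(a))])

x_anchor = 0.3        # snapshot point, where the full gradient is known
x_cur = 0.7           # current iterate
g_anchor = grad_full(x_anchor)

def vr_grad(x, k):
    # variance-reduced stochastic gradient:
    # stochastic difference + anchored full gradient
    return grad_sample(x, k) - grad_sample(x_anchor, k) + g_anchor

# unbiasedness: averaging over all samples recovers the exact gradient at x_cur
v_mean = np.mean([vr_grad(x_cur, k) for k in range(len(a))])
assert abs(v_mean - grad_full(x_cur)) < 1e-12
```

As the iterate approaches the anchor, the two sampled terms nearly cancel, so the variance of `vr_grad` vanishes, which is what buys the better local computation efficiency.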

  18. Naive extension to the network setting. (Figure: network of agents $f_1(x), \ldots, f_5(x)$ exchanging $\{x_i^t, \nabla f_i(x_i^t)\}$; plot of the optimality gap vs. #iters for SVRG and naive Network-SVRG: the naive scheme doesn't converge to the global optimum!) • Communicate: agent $i$ transmits $\{x_i^t, \nabla f_i(x_i^t)\}$; • Compute: $x_i^t \Leftarrow \mathrm{LocalUpdate}(f_i, \mathrm{Avg}\{\nabla f_j(x_j^t)\}_{j \in \mathcal{N}_i}, \mathrm{Avg}\{x_j^t\}_{j \in \mathcal{N}_i})$, where the neighborhood averages serve as surrogates of $\nabla f(x^t)$ and $x^t$. Consensus needs to be designed carefully in the network setting!
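The failure mode above comes from the neighborhood average being only a crude surrogate of the global average. A minimal sketch on an assumed 5-agent ring graph with a doubly stochastic mixing matrix (illustrative, not the graph used in the talk): one round of mixing leaves a large consensus error, while repeated mixing does converge to the true average:

```python
import numpy as np

# Hypothetical 5-agent ring graph with a doubly stochastic mixing matrix W:
# each agent averages itself and its two neighbors.
n = 5
W = np.zeros((n, n))
for i in range(n):
    W[i, i] = W[i, (i - 1) % n] = W[i, (i + 1) % n] = 1 / 3

g_local = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # stand-in for local gradients
g_bar = g_local.mean()                           # the quantity we want everywhere

one_round = W @ g_local                          # one neighborhood exchange
many_rounds = np.linalg.matrix_power(W, 200) @ g_local

# a single neighborhood average is only a rough surrogate of the global average...
assert np.abs(one_round - g_bar).max() > 0.5
# ...while repeated mixing converges to it (at a rate set by the graph)
assert np.abs(many_rounds - g_bar).max() < 1e-6
```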

  21. Dynamic average consensus. Assume each agent $j$ generates some time-varying quantity $r_j^t$. How can each agent track the dynamic average $\frac{1}{n} \sum_{j=1}^{n} r_j^t = \frac{1}{n} \mathbf{1}_n^\top r^t$, where $r^t = [r_1^t, \cdots, r_n^t]^\top$? • Dynamic average consensus (Zhu and Martínez, 2010): $q^t = \underbrace{W q^{t-1}}_{\text{mixing}} + \underbrace{r^t - r^{t-1}}_{\text{correction}}$, where $q^t = [q_1^t, \cdots, q_n^t]^\top$ and $W$ is the mixing matrix. • Key property: the average of $\{q_i^t\}$ dynamically tracks the average of $\{r_i^t\}$: $\mathbf{1}_n^\top q^t = \mathbf{1}_n^\top r^t$. M. Zhu and S. Martínez, "Discrete-time dynamic average consensus," Automatica, 2010.
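The tracking identity follows by induction: since $W$ is doubly stochastic, $\mathbf{1}^\top W q^{t-1} = \mathbf{1}^\top q^{t-1}$, and the correction term shifts the sum by exactly $\mathbf{1}^\top (r^t - r^{t-1})$. A minimal simulation, again on an assumed 5-agent ring (any doubly stochastic $W$ works):

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical 5-agent ring mixing matrix (doubly stochastic, symmetric)
n = 5
W = np.zeros((n, n))
for i in range(n):
    W[i, i] = W[i, (i - 1) % n] = W[i, (i + 1) % n] = 1 / 3

# time-varying local quantities r^t (e.g. local gradients changing over time)
T = 50
r = rng.normal(size=(T, n))

q = r[0].copy()                     # initialize q^0 = r^0
for t in range(1, T):
    # mixing + correction: q^t = W q^{t-1} + r^t - r^{t-1}
    q = W @ q + r[t] - r[t - 1]
    # key invariant: the average of q^t always equals the average of r^t
    assert abs(q.mean() - r[t].mean()) < 1e-12
```

Each individual $q_i^t$ additionally converges toward the running average whenever $r^t$ varies slowly, which is exactly the regime gradient tracking exploits.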

  23. Gradient tracking. Replace the unavailable global quantities in the local update: $x_i^t \Leftarrow \mathrm{LocalUpdate}(f_i, s_i^t, y_i^t)$, where $s_i^t$ stands in for $\nabla f(x^t)$ and $y_i^t$ for $x^t$. • Parameter averaging: $y_j^t = \sum_{k \in \mathcal{N}_j} w_{jk} x_k^{t-1}$; • Gradient tracking: $s_j^t = \sum_{k \in \mathcal{N}_j} w_{jk} s_k^{t-1} + \nabla f_j(y_j^t) - \nabla f_j(y_j^{t-1})$. We can now apply the same DANE and SVRG-type local updates!
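The two recursions above can be simulated end to end with the simplest LocalUpdate, a plain gradient step $x_j^t = y_j^t - \eta s_j^t$. A minimal sketch with hypothetical scalar quadratic local losses on an assumed ring graph (all parameters illustrative); unlike the naive scheme, every agent reaches the global minimizer:

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical scalar quadratic local losses f_j(x) = 0.5 * (x - m_j)^2,
# so grad f_j(x) = x - m_j and the global minimizer is the mean of the m_j.
n = 5
m = rng.normal(size=n)
x_star = m.mean()

W = np.zeros((n, n))
for j in range(n):
    W[j, j] = W[j, (j - 1) % n] = W[j, (j + 1) % n] = 1 / 3

def grad(y):
    return y - m            # elementwise: agent j evaluates grad f_j(y_j)

eta = 0.05
x = np.zeros(n)
y_prev = x.copy()
s = grad(y_prev)            # initialize the tracker with the local gradients
for _ in range(3000):
    y = W @ x                              # parameter averaging
    s = W @ s + grad(y) - grad(y_prev)     # gradient tracking
    x = y - eta * s                        # plain gradient LocalUpdate
    y_prev = y

# every agent converges to the *global* minimizer, not its local one
assert np.abs(x - x_star).max() < 1e-6
```

Initializing $s$ with the local gradients makes the dynamic-consensus invariant hold from the start, so the average of the trackers always equals the average gradient.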

  24. Linear regression: well-conditioned. $f_i(x) = \| y_i - A_i x \|_2^2$, $A_i \in \mathbb{R}^{1000 \times 40}$. (Figure: the optimality gap $(f(\bar{x}^{(t)}) - f^\star)/f^\star$ w.r.t. #iters and #grads/#samples, comparing DGD, DANE, Network-DANE, Network-SVRG, and Network-SARAH. Condition number $\kappa = 10$; ER graph ($p = 0.3$), 20 agents.)

  25. Linear regression: ill-conditioned. $f_i(x) = \| y_i - A_i x \|_2^2$, $A_i \in \mathbb{R}^{1000 \times 40}$. (Figure: the optimality gap $(f(\bar{x}^{(t)}) - f^\star)/f^\star$ w.r.t. #iters and #grads/#samples, comparing DGD, DANE, Network-DANE, Network-SVRG, and Network-SARAH. Condition number $\kappa = 10^4$; ER graph ($p = 0.3$), 20 agents.)

  27. Extra mixing. The mixing rate of the graph is $\alpha_0 = 0.922$. A single round of mixing within each iteration cannot ensure the convergence of Network-SVRG. (Figure: the optimality gap of Network-SVRG w.r.t. #iters and $K \cdot$ #iters, for $K = 1$ and $K = 2$ rounds of mixing per iteration.)
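Running $K$ rounds of gossip per iteration amounts to mixing with $W^K$, whose mixing rate is $\alpha_0^K$, so the effective network becomes better connected at the cost of extra communication. A minimal check on an assumed 5-agent ring (the talk's graph has $\alpha_0 = 0.922$; this toy graph's rate differs):

```python
import numpy as np

# Hypothetical ring mixing matrix; K rounds of gossip per iteration
# replace W with W^K, shrinking the mixing rate from alpha_0 to alpha_0^K.
n = 5
W = np.zeros((n, n))
for i in range(n):
    W[i, i] = W[i, (i - 1) % n] = W[i, (i + 1) % n] = 1 / 3

def mixing_rate(M):
    # second-largest eigenvalue magnitude of a symmetric doubly stochastic matrix
    eigs = np.sort(np.abs(np.linalg.eigvalsh(M)))
    return eigs[-2]

alpha_0 = mixing_rate(W)
alpha_2 = mixing_rate(np.linalg.matrix_power(W, 2))
assert abs(alpha_2 - alpha_0 ** 2) < 1e-12   # K = 2 squares the rate
assert alpha_2 < alpha_0
```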
