slide-1
SLIDE 1


 Dual-Way Gradient Sparsification for Asynchronous Distributed Deep Learning

Zijie Yan, Danyang Xiao, MengQiang Chen, Jieying Zhou, Weigang Wu Sun Yat-sen University Guangzhou, China


slide-2
SLIDE 2

Outline

  • 1. Introduction
  • 2. The Proposed Algorithm
  • 3. Performance Evaluation
  • 4. Conclusion and Future Work


slide-3
SLIDE 3

Introduction

  • Training may take an impractically long time
  • Growing volume of training data (e.g., ImageNet is over 1 TB)
  • Increasingly complex models
  • Solution: distributed training
  • The common practice in current DL frameworks
  • Enabled by Parameter Servers (PS) or Ring All-Reduce
  • Synchronous SGD or asynchronous SGD
slide-4
SLIDE 4

Introduction


slide-5
SLIDE 5

Introduction

Communication overhead: Distributed training can significantly reduce the total computation time. However, the communication overhead seriously affects the efficiency of training.

Solutions: Reduce the frequency and the data size of communication.

slide-6
SLIDE 6

Introduction

  • Gradient Quantization
  • 1-Bit SGD, QSGD, TernGrad: use fewer bits to represent each value.
  • Gradient Sparsification
  • Threshold Sparsification: only gradients greater than a predefined threshold are sent.
  • Gradient Dropping: drops the R% of gradients with the smallest absolute values.
  • Deep Gradient Compression: applies momentum correction to compensate for the disappearance of the momentum discounting factor.
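As a concrete illustration of the sparsification idea, here is a minimal sketch of Gradient Dropping in plain NumPy (the function name and the drop ratio are my own choices; the published method also carries the residual across iterations, which is returned here for the caller to accumulate):

```python
import numpy as np

def drop_gradients(grad, drop_ratio=0.99):
    """Zero out the drop_ratio fraction of entries with the smallest
    absolute values; return the sparse gradient to send and the
    residual to accumulate locally for later iterations."""
    flat = grad.ravel()
    k = max(1, int(flat.size * (1.0 - drop_ratio)))  # entries to keep
    threshold = np.partition(np.abs(flat), -k)[-k]   # k-th largest |g|
    mask = np.abs(grad) >= threshold
    sparse = np.where(mask, grad, 0.0)
    residual = grad - sparse  # added to the next iteration's gradient
    return sparse, residual
```

At a 99% drop ratio only about 1% of the entries travel over the network, while nothing is lost: sparse + residual always reconstructs the original gradient.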

slide-7
SLIDE 7

Introduction

Figure: PS-based ASGD. Each worker holds a local model and sends gradients upward to the server; the server holds the global model and sends the global model parameters downward to each worker.

slide-8
SLIDE 8

Contributions

  • Dual-Way Gradient Sparsification (DGS)
  • Model Difference Tracking
  • Dual-way gradient sparsification operations
  • Eliminates the communication bottleneck
  • Sparsification Aware Momentum (SAMomentum)
  • A novel momentum designed for the gradient sparsification scenario
  • Offers a significant optimization boost

slide-9
SLIDE 9

Contributions

Figure: DGS overview. Each worker holds a local model and applies SAMomentum, sending compressed gradients upward; the server sends model differences downward; both directions rely on Model Difference Tracking.

slide-10
SLIDE 10
Model Difference Tracking

  • Notions
  • M_t: the accumulation of updates at time t.
  • G_{k,t}: the model difference between the server and worker k.
  • v_k: the accumulation of model differences sent by the server to worker k.

  • Server receives gradients ∇l_{k,t}
  • Update M_t, accumulate the gradients: M_{t+1} = M_t − η∇l_{k,t}
  • Calculate the model difference: G_{k,t+1} = M_{t+1} − v_{k,t}
  • Accumulate v_k: v_{k,t+1} = v_{k,t} + G_{k,t+1}
  • Send the model difference G_{k,t+1}
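The server-side bookkeeping above can be sketched in plain NumPy (a minimal sketch; the class name, state layout, and learning rate are my own choices, not from the paper):

```python
import numpy as np

class ModelDifferenceTracker:
    """Sketch of the server state from the slides: M accumulates the
    updates, v[k] accumulates the differences already sent to worker k."""
    def __init__(self, num_workers, dim, lr=0.1):
        self.M = np.zeros(dim)                 # accumulated updates M_t
        self.v = np.zeros((num_workers, dim))  # per-worker accumulator v_k
        self.lr = lr

    def receive(self, k, grad):
        # M_{t+1} = M_t - eta * grad
        self.M -= self.lr * grad
        # G_{k,t+1} = M_{t+1} - v_{k,t}
        G = self.M - self.v[k]
        # v_{k,t+1} = v_{k,t} + G_{k,t+1}
        self.v[k] += G
        return G  # model difference sent downward to worker k
```

After a send, v[k] equals M, so the next difference for that worker contains only the updates it has not seen yet; with the secondary compression of later slides, only the sent part would be added to v[k].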

slide-11
SLIDE 11

Model Difference Tracking

  • What's changed?
  • DGS transmits the model difference rather than the global model.
  • Model differences (residual gradients) that have not been sent yet are recorded in G_{k,t+1} = M_{t+1} − v_{k,t}, implicitly avoiding any loss of information.
  • Now we can compress the downward communication!

slide-12
SLIDE 12

Dual-way Gradient Sparsification

Figure: DGS communication. Workers send compressed gradients upward; the server sends the model difference downward. The algorithm has a worker side and a server side.

slide-13
SLIDE 13


Dual-way Gradient Sparsification - Worker Side

Select threshold

slide-14
SLIDE 14


Dual-way Gradient Sparsification - Worker Side

Sparsification
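Slides 13 and 14 are figures; as a rough sketch of the two worker-side steps their captions name (threshold selection, then sparsification), one common sample-based approach looks like this. The sampling trick and every parameter value below are assumptions for illustration, not the paper's exact procedure:

```python
import numpy as np

def select_threshold(grad, keep_ratio=0.001, sample_size=10000, rng=None):
    """Estimate the magnitude threshold that keeps roughly a keep_ratio
    fraction of entries, using a random sample instead of a full sort."""
    rng = rng or np.random.default_rng(0)
    flat = np.abs(grad.ravel())
    if flat.size > sample_size:
        flat = rng.choice(flat, size=sample_size, replace=False)
    k = max(1, int(flat.size * keep_ratio))
    return np.partition(flat, -k)[-k]  # k-th largest sampled magnitude

def sparsify(grad, threshold):
    """Split the gradient into the sparse part to send and the residual
    kept locally for later rounds."""
    mask = np.abs(grad) >= threshold
    return np.where(mask, grad, 0.0), np.where(mask, 0.0, grad)
```

Sampling keeps threshold selection cheap on large layers, at the cost of the kept fraction being only approximately keep_ratio.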

slide-15
SLIDE 15


Dual-way Gradient Sparsification - Server Side

Model Difference Tracking

slide-16
SLIDE 16

Dual-way Gradient Sparsification - Server Side

Secondary compression

  • Secondary compression guarantees the sparsity of the send-ready model difference in the downward communication, no matter how many workers are running.
  • The server implicitly accumulates the remaining gradients locally.
  • Eliminates the overhead of the downward communication.
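Combining the tracking step with a second sparsification pass might look like the following sketch (my own assumptions: a top-k selection rule and the helper name `sparsify_topk` stand in for whatever the paper's figures specify). Because v_k accumulates only what was actually sent, the dropped part of the difference is implicitly carried over to later rounds:

```python
import numpy as np

def sparsify_topk(x, k):
    """Keep the k largest-magnitude entries, zero the rest."""
    idx = np.argsort(-np.abs(x))[:k]
    out = np.zeros_like(x)
    out[idx] = x[idx]
    return out

def server_send(M, v_k, k=2):
    """Secondary compression: sparsify the model difference before it
    goes downward. Only the sent part is added to v_k, so the remainder
    stays in M - v_k and is picked up in a later round."""
    G = M - v_k                      # full model difference
    G_sparse = sparsify_topk(G, k)   # send-ready, always k-sparse
    v_k = v_k + G_sparse             # accumulate only what was sent
    return G_sparse, v_k
```

This is why the downward message stays k-sparse regardless of how many workers' updates have piled up in M.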

slide-17
SLIDE 17

Figure: DGS overview (recap). Workers send compressed gradients upward; the server sends model differences downward. Dual-way Gradient Sparsification and Model Difference Tracking eliminate the communication bottleneck; SAMomentum brings an optimization boost.

Sparsification Aware Momentum

slide-18
SLIDE 18

SAMomentum - Background

  • Momentum is commonly used in deep training and is known to offer a significant optimization boost.
  • However, indeterminate update intervals in gradient sparsification cause the momentum to disappear.

slide-19
SLIDE 19

SAMomentum - Background

Dense update:

u_t = m·u_{t−1} + η∇_t,  θ_{t+1} = θ_t − u_t

After T updates:

u^(i)_{t+T} = η[∇^(i)_{t+T} + ⋯ + m^{T−2}∇^(i)_{t+2} + m^{T−1}∇^(i)_{t+1}] + m^T·u^(i)_t

u^(i)_t denotes the i-th position of the flattened velocity u_t.

slide-20
SLIDE 20

SAMomentum - Background

Sparse update:

r_{k,t} = r_{k,t−1} + η∇_{k,t},  u_t = m·u_{t−1} + sparsify(r_{k,t})
r_{k,t} = unsparsify(r_{k,t}),  θ_{t+1} = θ_t − u_t

unsparsify(·) keeps the remaining (unsent) gradients in the residual r.

After T updates:

u^(i)_{t+T} = η[∇^(i)_{t+T} + ⋯ + ∇^(i)_{t+2} + ∇^(i)_{t+1}] + m^T·u^(i)_t

u^(i)_t denotes the i-th position of the flattened velocity u_t.

slide-21
SLIDE 21

Momentum Disappearing

Dense:  u^(i)_{t+T} = η[∇^(i)_{t+T} + ⋯ + m^{T−2}∇^(i)_{t+2} + m^{T−1}∇^(i)_{t+1}] + m^T·u^(i)_t

Sparse: u^(i)_{t+T} = η[∇^(i)_{t+T} + ⋯ + ∇^(i)_{t+2} + ∇^(i)_{t+1}] + m^T·u^(i)_t

  • The momentum factor m controls the proportion of historical information.
  • The disappearance of m impairs the convergence performance.
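The gap between the two formulas can be checked numerically. This is a toy, single-coordinate simulation I added for illustration (constant gradient, m = 0.9, η = 0.1, one sparse send every T = 5 steps):

```python
# Toy check: a coordinate that is only sent once every T steps loses the
# momentum discounting factors m^{T-j} on its accumulated gradients.
m, eta, T = 0.9, 0.1, 5
grad = 1.0  # constant gradient on one coordinate

# Dense momentum: u_{t+1} = m*u_t + eta*grad, starting from u = 0
u_dense = 0.0
for _ in range(T):
    u_dense = m * u_dense + eta * grad
# closed form: eta * sum(m^j for j = 0..T-1)
assert abs(u_dense - eta * sum(m**j for j in range(T))) < 1e-12

# Sparse: the residual accumulates eta*grad for T steps, then enters
# the velocity in one shot -- every term has coefficient 1, not m^{T-j}
u_sparse = eta * grad * T
```

With the discounting factors gone, old gradients are no longer down-weighted, which is exactly the convergence problem the slide describes.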

slide-22
SLIDE 22

SAMomentum

prev(k): the timestamp of the last update on worker k, which is also the timestamp of its local model.

u_{k,t} = m·u_{k,prev(k)} + η∇_{k,t} + unsparsify(m·u_{k,prev(k)} + η∇_{k,t})·(1/m − 1)
g_{k,t} = sparsify(m·u_{k,prev(k)} + η∇_{k,t})
θ_{t+1} = θ_t − g_{k,t}
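One SAMomentum step, written out literally from the update rule above (a sketch; the top-k selection, parameter values, and function name are my own assumptions):

```python
import numpy as np

def samomentum_step(u_prev, grad, m=0.9, eta=0.1, k=1):
    """One SAMomentum step: v = m*u_prev + eta*grad; the top-k entries
    of v are sent as the sparse update g; the unsent entries are
    rescaled by 1/m so the next multiplication by m cancels and the
    momentum factor does not decay while they wait to be sent."""
    v = m * u_prev + eta * grad
    idx = np.argsort(-np.abs(v))[:k]       # top-k by magnitude
    mask = np.zeros_like(v, dtype=bool)
    mask[idx] = True
    g = np.where(mask, v, 0.0)             # sparsify(v): sent upward
    rest = np.where(mask, 0.0, v)          # unsparsify(v): kept locally
    u_new = v + rest * (1.0 / m - 1.0)     # = g + rest / m
    return u_new, g
```

The 1/m rescaling of the unsent part is what produces the telescoping on the next slide: m·(v/m) restores v exactly, so accumulated gradients enter the velocity with coefficient η instead of a vanishing m-power.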

slide-23
SLIDE 23

SAMomentum

From the parameter perspective (j indexes the steps between two sends):

u^(i)_{k,c+T} = m·u^(i)_{k,c+T−1} + η∇^(i)_{k,c+T}
            = m·((m·u^(i)_{k,c+T−2} + η∇^(i)_{k,c+T−1})·(1/m)) + η∇^(i)_{k,c+T}
            = m·u^(i)_{k,c+T−2} + η∇^(i)_{k,c+T−1} + η∇^(i)_{k,c+T}
            = ⋯ = m·u^(i)_{k,c} + η·Σ_{j=1}^{T} ∇^(i)_{k,c+j}

u^(i)_k is sent at timestamps c and c+T. u^(i)_t denotes the i-th position of the flattened velocity u_t.

slide-24
SLIDE 24

SAMomentum and Enlarged Batch Size

SAMomentum:

u^(i)_{k,c+T} = m·u^(i)_{k,c+T−1} + η∇^(i)_{k,c+T}
            = m·((m·u^(i)_{k,c+T−2} + η∇^(i)_{k,c+T−1})·(1/m)) + η∇^(i)_{k,c+T}
            = m·u^(i)_{k,c+T−2} + η∇^(i)_{k,c+T−1} + η∇^(i)_{k,c+T}
            = ⋯ = m·u^(i)_{k,c} + η·Σ_{j=1}^{T} ∇^(i)_{k,c+j}

Enlarged batch size:

u^(i)_{k,c+T} = m·u^(i)_{k,c} + T·η·(1/T)·(∇^(i)_{k,c+1} + ⋯ + ∇^(i)_{k,c+T})
            = m·u^(i)_{k,c} + η·Σ_{j=1}^{T} ∇^(i)_{k,c+j}

The two expressions coincide: SAMomentum behaves like momentum SGD with a batch size enlarged by a factor of T.

slide-25
SLIDE 25

Experiments Setup

  • 1. Comparison to Other Algorithms
  • Dense:
  • Single-node momentum SGD (MSGD)
  • Asynchronous SGD (ASGD)
  • Sparse:
  • Gradient Dropping (EMNLP 2017)
  • Deep Gradient Compression (ICLR 2018, SOTA)
  • 2. Datasets
  • ImageNet
  • CIFAR-10
slide-26
SLIDE 26

Scalability and Generalization Ability

CIFAR-10:

Workers in total | Batch size per worker | Training Method | Top-1 Accuracy | Δ vs. MSGD
1  | 256 | MSGD      | 93.08% | -
1  | 256 | ASGD      | 91.54% | -1.54%
1  | 256 | GD-async  | 92.15% | -0.93%
1  | 256 | DGC-async | 92.75% | -0.33%
1  | 256 | DGS       | 92.97% | -0.11%
4  | 128 | ASGD      | 90.70% | -2.38%
4  | 128 | GD-async  | 92.01% | -1.07%
4  | 128 | DGC-async | 92.64% | -0.44%
4  | 128 | DGS       | 92.91% | -0.17%
8  | 64  | ASGD      | 90.46% | -2.62%
8  | 64  | GD-async  | 91.81% | -1.27%
8  | 64  | DGC-async | 92.37% | -0.71%
8  | 64  | DGS       | 93.32% | +0.24%
16 | 32  | ASGD      | 90.53% | -3.01%
16 | 32  | GD-async  | 91.43% | -1.65%
16 | 32  | DGC-async | 92.28% | -0.80%
16 | 32  | DGS       | 92.98% | -0.10%
32 | 16  | ASGD      | 88.36% | -4.71%
32 | 16  | GD-async  | 91.00% | -2.08%
32 | 16  | DGC-async | 91.86% | -1.22%
32 | 16  | DGS       | 92.69% | -0.39%

Figures: training curves on 32 nodes and on 4 nodes.
slide-27
SLIDE 27

Scalability and Generalization Ability

ImageNet:

Workers in total | Batch size per iteration | Training Method | Top-1 Accuracy | Δ vs. MSGD
1  | 256 | MSGD      | 69.40% | -
4  | 256 | ASGD      | 66.68% | -2.72%
4  | 256 | GD-async  | 66.26% | -3.14%
4  | 256 | DGC-async | 68.37% | -1.03%
4  | 256 | DGS       | 69.00% | -0.40%
16 | 256 | ASGD      | 66.25% | -3.15%
16 | 256 | GD-async  | 66.19% | -3.21%
16 | 256 | DGC-async | 67.62% | -1.78%
16 | 256 | DGS       | 68.25% | -1.15%

Figures: train loss and test accuracy vs. epochs for DGS, ASGD, DGC-async, and GD-async, on 16 nodes and on 4 nodes.

slide-28
SLIDE 28

Low Bandwidth Results

Fig.: Time vs. training loss on 8 workers with 1 Gbps Ethernet.

slide-29
SLIDE 29

Speedup

Fig.: Speedups for DGS and ASGD on ImageNet with 10 Gbps and 1 Gbps Ethernet.

slide-30
SLIDE 30

Conclusion and Future Work

Conclusion

  • 1. Enable dual-way sparsification for PS-based asynchronous training.
  • 2. Introduce SAMomentum, which brings a significant optimization boost.
  • 3. Experimental results show that DGS outperforms existing gradient sparsification algorithms.

Future Work

  • 1. Apply SAMomentum to synchronous training.
  • 2. Combine DGS with other compression approaches.

slide-31
SLIDE 31

Thanks for listening
