slide-1
SLIDE 1


 Dual-Way Gradient Sparsification for Asynchronous Distributed Deep Learning

Zijie Yan, Danyang Xiao, MengQiang Chen, Jieying Zhou, Weigang Wu Sun Yat-sen University Guangzhou, China


slide-2
SLIDE 2

Outline

  • 1. Introduction
  • 2. The Proposed Algorithm
  • 3. Performance Evaluation
  • 4. Conclusion and Future Work


slide-3
SLIDE 3

Introduction

  • Training may take an impractically long time
  • Growing volume of training data (e.g., ImageNet is over 1 TB)
  • Increasingly complex models
  • Solution: distributed training
  • The common practice in current DL frameworks
  • Enabled by Parameter Servers (PS) or Ring All-Reduce
  • Synchronous SGD or asynchronous SGD
slide-4
SLIDE 4

Introduction


slide-5
SLIDE 5

Introduction

Communication overhead: Distributed training can significantly reduce the total computation time. However, the communication overhead seriously affects the efficiency of training.

Solutions: Reduce the frequency and the data size of communication.

slide-6
SLIDE 6

Introduction

  • Gradient Quantization
  • 1-Bit SGD, QSGD, TernGrad: use fewer bits to represent each value.
  • Gradient Sparsification
  • Threshold Sparsification: only gradients greater than a predefined threshold are sent.
  • Gradient Dropping: drops the R% of gradients with the smallest absolute values.
  • Deep Gradient Compression: applies momentum correction to compensate for the disappearance of the momentum discounting factor.
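As a concrete illustration of the sparsification idea, here is a minimal sketch of Gradient Dropping in plain NumPy (the function name and the drop ratio are my own choices; the published method also carries the residual across iterations, which is returned here for the caller to accumulate):

```python
import numpy as np

def drop_gradients(grad, drop_ratio=0.99):
    """Zero out the drop_ratio fraction of entries with the smallest
    absolute values; return the sparse gradient to send and the
    residual to accumulate locally for later iterations."""
    flat = grad.ravel()
    k = max(1, int(flat.size * (1.0 - drop_ratio)))  # entries to keep
    threshold = np.partition(np.abs(flat), -k)[-k]   # k-th largest |g|
    mask = np.abs(grad) >= threshold
    sparse = np.where(mask, grad, 0.0)
    residual = grad - sparse  # added to the next iteration's gradient
    return sparse, residual
```

At a 99% drop ratio only about 1% of the entries travel over the network, while nothing is lost: sparse + residual always reconstructs the original gradient.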

slide-7
SLIDE 7

Introduction

Figure: PS-based ASGD. Each worker holds a local model and sends gradients upward to the server; the server holds the global model and sends the global model parameters downward to each worker.

slide-8
SLIDE 8

Contributions

  • Dual-Way Gradient Sparsification (DGS)
  • Model Difference Tracking
  • Dual-way gradient sparsification operations
  • Eliminates the communication bottleneck
  • Sparsification Aware Momentum (SAMomentum)
  • A novel momentum designed for the gradient sparsification scenario
  • Offers a significant optimization boost

slide-9
SLIDE 9

Contributions

Figure: DGS overview. Each worker holds a local model and applies SAMomentum, sending compressed gradients upward; the server sends model differences downward; both directions rely on Model Difference Tracking.

slide-10
SLIDE 10
Model Difference Tracking

  • Notions
  • M_t: the accumulation of updates at time t.
  • G_{k,t}: the model difference between the server and worker k.
  • v_k: the accumulation of model differences sent by the server to worker k.

  • Server receives gradients ∇l_{k,t}
  • Update M_t, accumulate the gradients: M_{t+1} = M_t − η∇l_{k,t}
  • Calculate the model difference: G_{k,t+1} = M_{t+1} − v_{k,t}
  • Accumulate v_k: v_{k,t+1} = v_{k,t} + G_{k,t+1}
  • Send the model difference G_{k,t+1}
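The server-side bookkeeping above can be sketched in plain NumPy (a minimal sketch; the class name, state layout, and learning rate are my own choices, not from the paper):

```python
import numpy as np

class ModelDifferenceTracker:
    """Sketch of the server state from the slides: M accumulates the
    updates, v[k] accumulates the differences already sent to worker k."""
    def __init__(self, num_workers, dim, lr=0.1):
        self.M = np.zeros(dim)                 # accumulated updates M_t
        self.v = np.zeros((num_workers, dim))  # per-worker accumulator v_k
        self.lr = lr

    def receive(self, k, grad):
        # M_{t+1} = M_t - eta * grad
        self.M -= self.lr * grad
        # G_{k,t+1} = M_{t+1} - v_{k,t}
        G = self.M - self.v[k]
        # v_{k,t+1} = v_{k,t} + G_{k,t+1}
        self.v[k] += G
        return G  # model difference sent downward to worker k
```

After a send, v[k] equals M, so the next difference for that worker contains only the updates it has not seen yet; with the secondary compression of later slides, only the sent part would be added to v[k].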

slide-11
SLIDE 11

Model Difference Tracking

  • What's changed?
  • DGS transmits the model difference rather than the global model.
  • Model differences (residual gradients) that have not been sent yet are recorded in G_{k,t+1} = M_{t+1} − v_{k,t}, implicitly avoiding any loss of information.
  • Now we can compress the downward communication!

slide-12
SLIDE 12

Dual-way Gradient Sparsification

Figure: DGS communication. Workers send compressed gradients upward; the server sends the model difference downward. The algorithm has a worker side and a server side.

slide-13
SLIDE 13


Dual-way Gradient Sparsification - Worker Side

Select threshold

slide-14
SLIDE 14


Dual-way Gradient Sparsification - Worker Side

Sparsification
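Slides 13 and 14 are figures; as a rough sketch of the two worker-side steps their captions name (threshold selection, then sparsification), one common sample-based approach looks like this. The sampling trick and every parameter value below are assumptions for illustration, not the paper's exact procedure:

```python
import numpy as np

def select_threshold(grad, keep_ratio=0.001, sample_size=10000, rng=None):
    """Estimate the magnitude threshold that keeps roughly a keep_ratio
    fraction of entries, using a random sample instead of a full sort."""
    rng = rng or np.random.default_rng(0)
    flat = np.abs(grad.ravel())
    if flat.size > sample_size:
        flat = rng.choice(flat, size=sample_size, replace=False)
    k = max(1, int(flat.size * keep_ratio))
    return np.partition(flat, -k)[-k]  # k-th largest sampled magnitude

def sparsify(grad, threshold):
    """Split the gradient into the sparse part to send and the residual
    kept locally for later rounds."""
    mask = np.abs(grad) >= threshold
    return np.where(mask, grad, 0.0), np.where(mask, 0.0, grad)
```

Sampling keeps threshold selection cheap on large layers, at the cost of the kept fraction being only approximately keep_ratio.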

slide-15
SLIDE 15


Dual-way Gradient Sparsification - Server Side

Model Difference Tracking

slide-16
SLIDE 16

Dual-way Gradient Sparsification - Server Side

Secondary compression

  • Secondary compression guarantees the sparsity of the send-ready model difference in the downward communication, no matter how many workers are running.
  • The server implicitly accumulates the remaining gradients locally.
  • Eliminates the overhead of the downward communication.
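Combining the tracking step with a second sparsification pass might look like the following sketch (my own assumptions: a top-k selection rule and the helper name `sparsify_topk` stand in for whatever the paper's figures specify). Because v_k accumulates only what was actually sent, the dropped part of the difference is implicitly carried over to later rounds:

```python
import numpy as np

def sparsify_topk(x, k):
    """Keep the k largest-magnitude entries, zero the rest."""
    idx = np.argsort(-np.abs(x))[:k]
    out = np.zeros_like(x)
    out[idx] = x[idx]
    return out

def server_send(M, v_k, k=2):
    """Secondary compression: sparsify the model difference before it
    goes downward. Only the sent part is added to v_k, so the remainder
    stays in M - v_k and is picked up in a later round."""
    G = M - v_k                      # full model difference
    G_sparse = sparsify_topk(G, k)   # send-ready, always k-sparse
    v_k = v_k + G_sparse             # accumulate only what was sent
    return G_sparse, v_k
```

This is why the downward message stays k-sparse regardless of how many workers' updates have piled up in M.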

slide-17
SLIDE 17

Figure: DGS overview (recap). Workers send compressed gradients upward; the server sends model differences downward. Dual-way Gradient Sparsification and Model Difference Tracking eliminate the communication bottleneck; SAMomentum brings an optimization boost.

Sparsification Aware Momentum

slide-18
SLIDE 18

SAMomentum - Background

  • Momentum is commonly used in deep training and is known to offer a significant optimization boost.
  • However, indeterminate update intervals in gradient sparsification cause the momentum to disappear.

slide-19
SLIDE 19

SAMomentum - Background

Dense update:

u_t = m·u_{t−1} + η∇_t,  θ_{t+1} = θ_t − u_t

After T updates:

u^(i)_{t+T} = η[∇^(i)_{t+T} + ⋯ + m^{T−2}∇^(i)_{t+2} + m^{T−1}∇^(i)_{t+1}] + m^T·u^(i)_t

u^(i)_t denotes the i-th position of the flattened velocity u_t.

slide-20
SLIDE 20

SAMomentum - Background

Sparse update:

r_{k,t} = r_{k,t−1} + η∇_{k,t},  u_t = m·u_{t−1} + sparsify(r_{k,t})
r_{k,t} = unsparsify(r_{k,t}),  θ_{t+1} = θ_t − u_t

unsparsify(·) keeps the remaining (unsent) gradients in the residual r.

After T updates:

u^(i)_{t+T} = η[∇^(i)_{t+T} + ⋯ + ∇^(i)_{t+2} + ∇^(i)_{t+1}] + m^T·u^(i)_t

u^(i)_t denotes the i-th position of the flattened velocity u_t.

slide-21
SLIDE 21

Momentum Disappearing

Dense:  u^(i)_{t+T} = η[∇^(i)_{t+T} + ⋯ + m^{T−2}∇^(i)_{t+2} + m^{T−1}∇^(i)_{t+1}] + m^T·u^(i)_t

Sparse: u^(i)_{t+T} = η[∇^(i)_{t+T} + ⋯ + ∇^(i)_{t+2} + ∇^(i)_{t+1}] + m^T·u^(i)_t

  • The momentum factor m controls the proportion of historical information.
  • The disappearance of m impairs the convergence performance.
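The gap between the two formulas can be checked numerically. This is a toy, single-coordinate simulation I added for illustration (constant gradient, m = 0.9, η = 0.1, one sparse send every T = 5 steps):

```python
# Toy check: a coordinate that is only sent once every T steps loses the
# momentum discounting factors m^{T-j} on its accumulated gradients.
m, eta, T = 0.9, 0.1, 5
grad = 1.0  # constant gradient on one coordinate

# Dense momentum: u_{t+1} = m*u_t + eta*grad, starting from u = 0
u_dense = 0.0
for _ in range(T):
    u_dense = m * u_dense + eta * grad
# closed form: eta * sum(m^j for j = 0..T-1)
assert abs(u_dense - eta * sum(m**j for j in range(T))) < 1e-12

# Sparse: the residual accumulates eta*grad for T steps, then enters
# the velocity in one shot -- every term has coefficient 1, not m^{T-j}
u_sparse = eta * grad * T
```

With the discounting factors gone, old gradients are no longer down-weighted, which is exactly the convergence problem the slide describes.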

slide-22
SLIDE 22

SAMomentum

prev(k): the timestamp of the last update on worker k, which is also the timestamp of its local model.

u_{k,t} = m·u_{k,prev(k)} + η∇_{k,t} + unsparsify(m·u_{k,prev(k)} + η∇_{k,t})·(1/m − 1)
g_{k,t} = sparsify(m·u_{k,prev(k)} + η∇_{k,t})
θ_{t+1} = θ_t − g_{k,t}
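One SAMomentum step, written out literally from the update rule above (a sketch; the top-k selection, parameter values, and function name are my own assumptions):

```python
import numpy as np

def samomentum_step(u_prev, grad, m=0.9, eta=0.1, k=1):
    """One SAMomentum step: v = m*u_prev + eta*grad; the top-k entries
    of v are sent as the sparse update g; the unsent entries are
    rescaled by 1/m so the next multiplication by m cancels and the
    momentum factor does not decay while they wait to be sent."""
    v = m * u_prev + eta * grad
    idx = np.argsort(-np.abs(v))[:k]       # top-k by magnitude
    mask = np.zeros_like(v, dtype=bool)
    mask[idx] = True
    g = np.where(mask, v, 0.0)             # sparsify(v): sent upward
    rest = np.where(mask, 0.0, v)          # unsparsify(v): kept locally
    u_new = v + rest * (1.0 / m - 1.0)     # = g + rest / m
    return u_new, g
```

The 1/m rescaling of the unsent part is what produces the telescoping on the next slide: m·(v/m) restores v exactly, so accumulated gradients enter the velocity with coefficient η instead of a vanishing m-power.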

slide-23
SLIDE 23

SAMomentum

From the parameter perspective (j indexes the steps between two sends):

u^(i)_{k,c+T} = m·u^(i)_{k,c+T−1} + η∇^(i)_{k,c+T}
            = m·((m·u^(i)_{k,c+T−2} + η∇^(i)_{k,c+T−1})·(1/m)) + η∇^(i)_{k,c+T}
            = m·u^(i)_{k,c+T−2} + η∇^(i)_{k,c+T−1} + η∇^(i)_{k,c+T}
            = ⋯ = m·u^(i)_{k,c} + η·Σ_{j=1}^{T} ∇^(i)_{k,c+j}

u^(i)_k is sent at timestamps c and c+T. u^(i)_t denotes the i-th position of the flattened velocity u_t.

slide-24
SLIDE 24

SAMomentum and Enlarged Batch Size

SAMomentum:

u^(i)_{k,c+T} = m·u^(i)_{k,c+T−1} + η∇^(i)_{k,c+T}
            = m·((m·u^(i)_{k,c+T−2} + η∇^(i)_{k,c+T−1})·(1/m)) + η∇^(i)_{k,c+T}
            = m·u^(i)_{k,c+T−2} + η∇^(i)_{k,c+T−1} + η∇^(i)_{k,c+T}
            = ⋯ = m·u^(i)_{k,c} + η·Σ_{j=1}^{T} ∇^(i)_{k,c+j}

Enlarged batch size:

u^(i)_{k,c+T} = m·u^(i)_{k,c} + T·η·(1/T)·(∇^(i)_{k,c+1} + ⋯ + ∇^(i)_{k,c+T})
            = m·u^(i)_{k,c} + η·Σ_{j=1}^{T} ∇^(i)_{k,c+j}

The two expressions coincide: SAMomentum behaves like momentum SGD with a batch size enlarged by a factor of T.

slide-25
SLIDE 25

Experiments Setup

  • 1. Comparison to Other Algorithms
  • Dense:
  • Single-node momentum SGD (MSGD)
  • Asynchronous SGD (ASGD)
  • Sparse:
  • Gradient Dropping (EMNLP 2017)
  • Deep Gradient Compression (ICLR 2018, SOTA)
  • 2. Datasets
  • ImageNet
  • CIFAR-10
slide-26
SLIDE 26

Scalability and Generalization Ability

CIFAR-10:

Workers in total | Batch size per worker | Training Method | Top-1 Accuracy | Δ vs. MSGD
1  | 256 | MSGD      | 93.08% | -
1  | 256 | ASGD      | 91.54% | -1.54%
1  | 256 | GD-async  | 92.15% | -0.93%
1  | 256 | DGC-async | 92.75% | -0.33%
1  | 256 | DGS       | 92.97% | -0.11%
4  | 128 | ASGD      | 90.70% | -2.38%
4  | 128 | GD-async  | 92.01% | -1.07%
4  | 128 | DGC-async | 92.64% | -0.44%
4  | 128 | DGS       | 92.91% | -0.17%
8  | 64  | ASGD      | 90.46% | -2.62%
8  | 64  | GD-async  | 91.81% | -1.27%
8  | 64  | DGC-async | 92.37% | -0.71%
8  | 64  | DGS       | 93.32% | +0.24%
16 | 32  | ASGD      | 90.53% | -3.01%
16 | 32  | GD-async  | 91.43% | -1.65%
16 | 32  | DGC-async | 92.28% | -0.80%
16 | 32  | DGS       | 92.98% | -0.10%
32 | 16  | ASGD      | 88.36% | -4.71%
32 | 16  | GD-async  | 91.00% | -2.08%
32 | 16  | DGC-async | 91.86% | -1.22%
32 | 16  | DGS       | 92.69% | -0.39%

Figures: training curves on 32 nodes and on 4 nodes.
slide-27
SLIDE 27

Scalability and Generalization Ability

ImageNet:

Workers in total | Batch size per iteration | Training Method | Top-1 Accuracy | Δ vs. MSGD
1  | 256 | MSGD      | 69.40% | -
4  | 256 | ASGD      | 66.68% | -2.72%
4  | 256 | GD-async  | 66.26% | -3.14%
4  | 256 | DGC-async | 68.37% | -1.03%
4  | 256 | DGS       | 69.00% | -0.40%
16 | 256 | ASGD      | 66.25% | -3.15%
16 | 256 | GD-async  | 66.19% | -3.21%
16 | 256 | DGC-async | 67.62% | -1.78%
16 | 256 | DGS       | 68.25% | -1.15%

Figures: train loss and test accuracy vs. epochs for DGS, ASGD, DGC-async, and GD-async, on 16 nodes and on 4 nodes.

slide-28
SLIDE 28

Low Bandwidth Results

Fig.: Time vs. training loss on 8 workers with 1 Gbps Ethernet.

slide-29
SLIDE 29

Speedup

Fig.: Speedups for DGS and ASGD on ImageNet with 10 Gbps and 1 Gbps Ethernet.

slide-30
SLIDE 30

Conclusion and Future Work

Conclusion

  • 1. Enable dual-way sparsification for PS-based asynchronous training.
  • 2. Introduce SAMomentum, which brings a significant optimization boost.
  • 3. Experimental results show that DGS outperforms existing gradient sparsification algorithms.

Future Work

  • 1. Apply SAMomentum to synchronous training.
  • 2. Combine DGS with other compression approaches.

slide-31
SLIDE 31

Thanks for listening
