PPoPP 20 Feb. 22-26, 2020 San Diego, CA, US spcl.inf.ethz.ch - - PowerPoint PPT Presentation


Taming Unbalanced Training Workloads in Deep Learning with Partial Collective Operations. Shigang Li, Tal Ben-Nun, Salvatore Di Girolamo, Torsten Hoefler (ETH Zurich); Dan Alistarh (IST Austria). spcl.inf.ethz.ch, @spcl_eth. PPoPP'20, Feb. 22-26, 2020, San Diego, CA, US.


slide-1
SLIDE 1

spcl.inf.ethz.ch @spcl_eth

Taming Unbalanced Training Workloads in Deep Learning with Partial Collective Operations

Shigang Li, Tal Ben-Nun, Salvatore Di Girolamo, Torsten Hoefler (ETH Zurich)
Dan Alistarh (IST Austria)

PPoPP'20, Feb. 22-26, 2020, San Diego, CA, US

slide-2
SLIDE 2

Deep learning training: model parallelism

[Figure: model parallelism across processes P0, P1, P2, with the dataset.]

The overall objective function: $f(w) = \mathbb{E}_{\xi \sim D}\left[F(w; \xi)\right]$

ξ is a data point sampled from a distribution D. F is the loss function. w denotes the model parameters. Training: optimize w to minimize f (using SGD).

slide-3
SLIDE 3

Deep learning training: pipeline parallelism

[Figure: pipeline parallelism across processes P0, P1, P2, with the dataset.]

slide-4
SLIDE 4

Deep learning training: data parallelism

[Figure: data parallelism across processes P0, P1, P2, with the dataset.]


Global synchronization using Allreduce.
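A minimal sketch of data-parallel synchronous SGD with a global allreduce may help here. It assumes a toy one-dimensional loss F(w; (x, y)) = 0.5 * (w*x - y)^2 and simulates the allreduce as a plain average over an in-process list rather than a real communication library. The names `allreduce_mean`, `local_gradient`, `sync_sgd_step`, and `train` are illustrative and do not come from the talk.

```python
import random

def allreduce_mean(values):
    """Simulated allreduce: every process receives the mean of all contributions."""
    mean = sum(values) / len(values)
    return [mean] * len(values)

def local_gradient(w, batch):
    """Gradient of the toy loss 0.5 * (w*x - y)**2, averaged over a local batch."""
    return sum((w * x - y) * x for x, y in batch) / len(batch)

def sync_sgd_step(w, shards, lr):
    """One synchronous step: local gradients, then a global average, then the update."""
    grads = [local_gradient(w, shard) for shard in shards]
    g = allreduce_mean(grads)[0]  # the result is identical on every process
    return w - lr * g

def train(num_procs=3, steps=200, lr=0.5, seed=0):
    """Fit w on synthetic data y = 2*x, split evenly across the processes."""
    rng = random.Random(seed)
    true_w = 2.0
    xs = [rng.uniform(-1.0, 1.0) for _ in range(300)]
    data = [(x, true_w * x) for x in xs]
    shards = [data[p::num_procs] for p in range(num_procs)]  # one shard per process
    w = 0.0
    for _ in range(steps):
        w = sync_sgd_step(w, shards, lr)
    return w

print(train())  # close to 2.0
```

Every process applies the same averaged gradient, so all replicas of w stay identical. The cost is that each step waits for all processes to reach the allreduce.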
slide-5
SLIDE 5

Unbalanced training workloads

▪ Load imbalance on application level
  ▪ Recurrent Neural Networks (RNN/LSTM/GRU)
  ▪ Transformers

[Figure: different types of RNNs: one input, multiple outputs; multiple inputs, multiple outputs; multiple inputs, one output.]

▪ Load imbalance on system level
  ▪ Performance variability on multitenant cloud systems
  ▪ System or network noise (interrupts, daemons, page/cache misses, etc.)

[Figure: multitenant cloud system.]

Challenge: stragglers dominate the performance.
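To make the straggler effect concrete, here is a small sketch. It assumes a fully synchronous step in which every process waits for the slowest one. The per-process times are made-up illustration values, not measurements from the talk, and the function names are hypothetical.

```python
def sync_step_time(per_process_ms):
    """A synchronous step finishes only when the slowest process finishes."""
    return max(per_process_ms)

def idle_times(per_process_ms):
    """Time each process spends waiting for the straggler."""
    step = sync_step_time(per_process_ms)
    return [step - t for t in per_process_ms]

# Hypothetical compute times (ms) for four processes; the third is a straggler.
times = [400, 420, 1800, 410]
print(sync_step_time(times))  # 1800
print(idle_times(times))      # [1400, 1380, 0, 1390]
```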

slide-6
SLIDE 6

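Many-to-one RNN for video classification

[Figure: an RNN unrolled over T steps. Inputs x1, x2, x3, ..., xT update the hidden states h0, h1, h2, h3, ..., hT through the cell fw. The final state hT feeds the fully connected layers FC1 and FC2, which output class scores (0.13, 0.14, 0.41, 0.09, 0.13, 0.10) and the label "Playing Basketball". The backward pass runs from the loss L(w) through L(WT), ..., L(W3), L(W2), L(W1).]

RNN: the workload is proportional to T.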
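The proportionality to T can be seen in a minimal many-to-one RNN sketch. It assumes a scalar tanh cell as a stand-in for fw; the weights `w_x` and `w_h` are hypothetical, and the function also counts cell evaluations to expose the cost.

```python
import math

def many_to_one_rnn(xs, w_x=0.5, w_h=0.9, h0=0.0):
    """Run a scalar tanh RNN over all inputs.

    Returns the final hidden state (the only state passed on to the
    classifier in a many-to-one model) and the number of cell evaluations.
    """
    h = h0
    cell_evals = 0
    for x in xs:
        h = math.tanh(w_x * x + w_h * h)  # stand-in for h_t = fw(h_{t-1}, x_t)
        cell_evals += 1
    return h, cell_evals

# The forward cost grows linearly with the sequence length T.
_, short_cost = many_to_one_rnn([0.1] * 29)
_, long_cost = many_to_one_rnn([0.1] * 1776)
print(short_cost, long_cost)  # 29 1776
```

The two lengths used here are the shortest and longest UCF101 videos quoted in the workload statistics (29 and 1,776 frames). The backward pass revisits the same T steps, so its cost scales the same way.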

slide-7
SLIDE 7

Workload statistics for video classification

(a) Video length distribution for the UCF101 dataset
Distribution: 29 ~ 1,776 frames. Mean: 187 frames. Standard deviation: 97 frames.

(b) Runtime distribution for the mini-batches when training an LSTM model on P100
Distribution: 201 ~ 3,410 ms. Mean: 1,235 ms. Standard deviation: 706 ms.

slide-8
SLIDE 8

Transformer

[1] Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. "Attention is all you need." In Advances in NeurIPS, pp. 5998-6008. 2017.

Runtime distribution for the mini-batches when training a Transformer model (using WMT16) on P100
Distribution: 179 ~ 3,482 ms. Mean: 475 ms. Standard deviation: 144 ms.

[Figure: translation example. The encoder reads "知 识 就 是 力 量 。" and the decoder emits "Knowledge is power ."]

The workload is proportional to input_size * output_size.
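As a small worked instance of this cost model, the sketch below applies it to the translation example from the slide. It assumes one token per character on the Chinese side and whitespace-separated tokens on the English side; the function name `relative_workload` is hypothetical, and the result is a relative cost, not a runtime.

```python
def relative_workload(input_tokens, output_tokens):
    """Relative cost of one sentence pair under the input_size * output_size model."""
    return len(input_tokens) * len(output_tokens)

src = ["知", "识", "就", "是", "力", "量", "。"]   # 7 input tokens
tgt = ["Knowledge", "is", "power", "."]          # 4 output tokens
print(relative_workload(src, tgt))  # 28
```

Doubling both sentence lengths quadruples the cost under this model, which is why mini-batches with different sentence lengths take very different times.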

slide-9
SLIDE 9

Training on Cloud

▪ Compared with imbalanced applications (e.g., LSTM, Transformer), the load imbalance on cloud servers is relatively light.

Runtime distribution on Google Cloud with 2xV100 GPUs (batch size=256, ResNet-50 on ImageNet).
Distribution: 399 ~ 1,892 ms. Mean: 454 ms. Standard deviation: 116 ms.

slide-10
SLIDE 10

Deep learning training is robust

[Figure: examples of techniques that training tolerates: 1-bit gradient quantization (f(g) plotted against g, with a probability p(g) between 0.5 and 1), top-k gradient sparsification (f(g) plotted against g), gossiping (processes P-1, P, P+1; Allreduce), and hidden-unit dropout.]
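Two of the lossy gradient transformations named on this slide can be sketched in a few lines. This is an illustration under assumptions, not the method shown in the figure: the quantizer is a deterministic sign-based variant scaled by the mean magnitude (the slide's p(g) suggests a stochastic variant instead), and the function names are hypothetical.

```python
def one_bit_quantize(grad):
    """Keep only the sign of each entry, scaled by the mean magnitude (assumed variant)."""
    scale = sum(abs(g) for g in grad) / len(grad)
    return [scale if g >= 0 else -scale for g in grad]

def top_k_sparsify(grad, k):
    """Keep the k largest-magnitude entries and zero out the rest."""
    order = sorted(range(len(grad)), key=lambda i: abs(grad[i]), reverse=True)
    keep = set(order[:k])
    return [g if i in keep else 0.0 for i, g in enumerate(grad)]

print(one_bit_quantize([1.0, -3.0]))                 # [2.0, -2.0]
print(top_k_sparsify([0.1, -0.5, 0.3, 0.05], k=2))   # [0.0, -0.5, 0.3, 0.0]
```

Both transformations discard information from the gradient, yet training is reported to tolerate them. That tolerance is the robustness that eager-SGD relies on when it uses stale gradients.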

slide-11
SLIDE 11

Eager-SGD to solve the load imbalance problem

[Figure: timelines for Process 0 through Process n, updating W(0) to W(1) to W(2). (a) synch-SGD: the updates are separated by synch-allreduce operations, with idle periods while waiting. (b) eager-SGD: the updates are separated by partial-allreduce operations.]

Eager-SGD exploits the robustness of the training by allowing allreduce on stale gradients.

Algorithm               Communication participants   Steps for update propagation   Consistency mode
D-PSGD [1] (gossip)     2                            O(P)                           synchronous
AD-PSGD [2] (gossip)    1                            O(log P)                       asynchronous
eager-SGD               P                            1                              asynchronous

slide-12
SLIDE 12

Partial Allreduce operations

▪ Two phases: the activation and the collective operation.

[Figure: schedule across P0, P1, P2, P3 showing the activation phase followed by the allreduce phase; the nodes of the P3 schedule are labeled S0-S3, R0-R3, C0, C1, N0, N1.]

▪ Asynchronous execution: an auxiliary thread progresses the execution (activation and collective) in the background.
▪ Multiple initiators: the same operation is executed only once even if there are multiple initiators, i.e., multiple processes arrive at the same time.
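The "multiple initiators, executed once" property can be illustrated with a shared-memory analogy. This sketch assumes threads in one process standing in for the arriving processes; the real mechanism works across processes via messages, and the class `ActivationGuard` and its methods are hypothetical names.

```python
import threading

class ActivationGuard:
    """Lets exactly one caller initiate each operation id."""

    def __init__(self):
        self._lock = threading.Lock()
        self._started = set()
        self.executions = 0

    def try_initiate(self, op_id):
        """Return True only for the first initiator of op_id; later ones are no-ops."""
        with self._lock:
            if op_id in self._started:
                return False
            self._started.add(op_id)
            self.executions += 1
            return True

def run_initiators(num_threads, op_id=0):
    """Have several threads try to initiate the same operation at once."""
    guard = ActivationGuard()
    results = []

    def worker():
        results.append(guard.try_initiate(op_id))

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return guard.executions, results.count(True)

print(run_initiators(8))  # (1, 1)
```

However many threads arrive together, the operation runs once and exactly one caller is treated as the initiator.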

slide-13
SLIDE 13

Solo allreduce and majority allreduce

▪ Two variants: solo allreduce [3] and majority allreduce.
▪ For solo, at least one process "actively" participates.
▪ For majority, a majority of processes must "actively" participate.

                                  Solo allreduce        Majority allreduce
Initiator                         The fastest process   A randomly specified process
Attributes                        Wait-free             Wait for the randomly specified initiator
Expected number of participants   Ω(1)                  Ω(P/2)

[3] Di Girolamo, Salvatore, Pierre Jolivet, Keith D. Underwood, and Torsten Hoefler. "Exploiting offload enabled network interfaces." In 2015 IEEE 23rd Annual Symposium on High-Performance Interconnects, pp. 26-33. IEEE, 2015.
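The difference between the two variants can be sketched with a small simulation. It assumes each process has a known arrival time, that processes arriving no later than the initiator contribute their gradient, and that all others contribute a null (zero) gradient. The result is left as a plain sum, and all function names are hypothetical.

```python
import random

def partial_allreduce(grads, arrival, initiator):
    """Sum the gradients of processes that arrived no later than the initiator."""
    t = arrival[initiator]
    active = [p for p in range(len(grads)) if arrival[p] <= t]
    total = sum(grads[p] for p in active)  # late processes contribute null
    return total, active

def solo_allreduce(grads, arrival):
    """The fastest process initiates, so nobody waits."""
    initiator = min(range(len(arrival)), key=lambda p: arrival[p])
    return partial_allreduce(grads, arrival, initiator)

def majority_allreduce(grads, arrival, rng):
    """A randomly specified process initiates; the others wait for it."""
    initiator = rng.randrange(len(arrival))
    return partial_allreduce(grads, arrival, initiator)

grads = [1.0, 2.0, 3.0, 4.0]
arrival = [30, 10, 20, 40]            # hypothetical arrival times
print(solo_allreduce(grads, arrival))  # (2.0, [1])

# Averaging over every possible initiator gives the expected participant count.
counts = [len(partial_allreduce(grads, arrival, i)[1]) for i in range(4)]
print(sum(counts) / len(counts))       # 2.5, which is at least P/2 = 2
```

With distinct arrival times, a uniformly random initiator is on average in the middle of the arrival order, which is where the Ω(P/2) expectation for majority allreduce comes from.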

slide-14
SLIDE 14

Implementation of eager-SGD based on TensorFlow

[Figure: computation DAG with a forward pass and a backward pass over the layers Conv-BN, Conv-BN-ReLU, Conv-BN, Max Pool, and Addition; four allreduce operations numbered 1 to 4, connected by control dependency edges.]

Customized distributed optimizer based on TensorFlow.

Eager-SGD utilizes the execution engine of TF to exploit the parallelism in the computation DAG.

slide-15
SLIDE 15

Execution of eager-SGD

[Figure: timeline for two processes, P0 and P1, each with a computation thread and a communication thread, over step t and step t+1. Each process has a send buffer and a receive buffer (sendbuff0, recvbuff0, sendbuff1, recvbuff1). At step t, the partial-allreduce takes Gnull from P0 and the step-t gradient from P1. At step t+1, P0 contributes a combined gradient (labeled 0') built from its stale and latest gradients. The update U produces wt+1 from wt.]

1. There are two processes, and P1 is faster.
2. P1 finishes calculating the gradients of step t and triggers partial-allreduce. P0 contributes NULL.
3. P0 finishes step t and discovers that partial-allreduce is already done. P0 copies the stale gradients to its send buffer.
4. P0 catches up with P1 in step t+1. The stale gradients are combined with the latest gradients and then committed to partial-allreduce.
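The four steps above can be traced with a small simulation. It assumes scalar gradients, a send buffer that accumulates gradients which missed an allreduce, and a partial-allreduce that returns a plain sum. The class `Process` and the function `eager_step` are hypothetical names, not the authors' implementation.

```python
class Process:
    def __init__(self):
        self.send_buffer = 0.0  # stale gradients that missed an allreduce

    def contribute(self, fresh_grad):
        """Arrived in time: commit stale plus fresh gradients and clear the buffer."""
        g = self.send_buffer + fresh_grad
        self.send_buffer = 0.0
        return g

    def miss(self, fresh_grad):
        """Arrived late: the allreduce already ran with our NULL contribution."""
        self.send_buffer += fresh_grad

def eager_step(procs, grads, on_time):
    """Return the partial-allreduce sum for one step."""
    total = 0.0
    for proc, grad, arrived in zip(procs, grads, on_time):
        if arrived:
            total += proc.contribute(grad)
        else:
            proc.miss(grad)  # contributes NULL now, keeps the gradient for later
    return total

p0, p1 = Process(), Process()
# Step t: P1 is faster and triggers the allreduce; P0 contributes NULL.
print(eager_step([p0, p1], [1.0, 2.0], [False, True]))  # 2.0
print(p0.send_buffer)                                   # 1.0 (stale gradient kept)
# Step t+1: P0 catches up and commits stale + latest gradients.
print(eager_step([p0, p1], [4.0, 8.0], [True, True]))   # 13.0
```

No gradient is dropped in this model: P0's step-t gradient is delayed by one step and then applied together with its step t+1 gradient.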

slide-16
SLIDE 16

Convergence of eager-SGD

▪ For a learning rate value [formula], eager-SGD converges after [formula] iterations.
▪ Note the dependence on τ (the staleness bound) and P-Q (the number of stale gradients) for the number of iterations T.
▪ Eager-SGD would converge slower if too many stale gradients are used.

Notation: τ is the staleness bound, P is the total number of processes, and Q is the number of processes which contribute the latest gradients.

slide-17
SLIDE 17

Evaluation

▪ CSCS Piz Daint supercomputer.
▪ Cray Aries interconnect network.
▪ Cray MPICH 7.7.2 communication library.
▪ Each node contains a 12-core Intel Xeon E5-2690 CPU and one NVIDIA Tesla P100 GPU.
▪ We compare eager-SGD with the allreduce-based synch-SGD (Horovod and Deep500), the asynchronous centralized SGD (TF parameter server), and the gossip SGDs (D-PSGD, SGP).

Table 1. Neural networks used for evaluation
[Table: networks grouped into simulated load imbalance (traces on cloud machine) and inherent load imbalance.]

slide-18
SLIDE 18

Hyperplane regression (light load imbalance)

Synch-SGD vs eager-SGD for hyperplane regression using 8 GPUs. "synch/eager-SGD-200/300/400" represent 200/300/400 ms load imbalance injection for 1 out of 8 processes.

▪ Eager-SGD (solo) achieves 1.50x, 1.75x, and 2.01x speedup over synch-SGD (Deep500), respectively.
▪ The loss value is equivalent to that of synch-SGD (Deep500).

slide-19
SLIDE 19

ResNet-50 on ImageNet (light load imbalance)

Synch-SGD vs eager-SGD for ResNet-50 on ImageNet using 64 GPUs. "synch/eager-SGD-300/460" represent 300/460 ms load imbalance injection for 4 out of 64 processes.

▪ Eager-SGD (solo) achieves 1.25x and 1.29x speedup over Deep500, respectively, and 1.14x and 1.27x speedup over Horovod, respectively. Top-1 accuracy is almost equivalent (75.2% vs 75.8%).

[Figure: throughput (steps/second) for Asynch-PS, D-PSGD, SGP, and eager-SGD; the axis runs from 0 to 1.4.]

▪ Eager-SGD (solo) achieves 2.64x, 1.26x, and 1.17x speedup over asynch-PS and the gossip-based SGDs (D-PSGD, SGP), respectively.

slide-20
SLIDE 20

LSTM on UCF101 (severe load imbalance)

Top-1 test accuracy and runtime for LSTM on UCF101 using 8 GPUs.

                       eager-SGD (solo)                eager-SGD (majority)
Speedup over Horovod   1.64x                           1.27x
Top-1 test accuracy    60.6% on average, up to 70.4%   69.7% on average, up to 72.8%

slide-21
SLIDE 21

Conclusion

1. Eager-SGD deals with imbalanced workloads using partial allreduce operations.
2. Eager-SGD has two variants, solo and majority.
3. Solo allreduce is suitable for light load imbalance, while majority allreduce works for severe load imbalance.
4. For future work, we will verify the idea of eager-SGD on model-averaging SGD algorithms.

Questions: shigang.li@inf.ethz.ch