Scaling Back-Propagation by Parallel Scan Algorithm
Shang Wang1,2, Yifan Bai1, Gennady Pekhimenko1,2
The original PPTX file can be downloaded from here.
Executive Summary
The back-propagation (BP) algorithm is popularly used in training neural networks.
Problem: BP imposes a strong sequential dependency along layers during the gradient computations.
Key idea: We propose scaling BP by Parallel Scan Algorithm (BPPSA). A scan computes every running result of a binary operator over a sequence; for example, the prefix sums of [1 2 3 4 5 6 7 8] are [1 3 6 10 15 21 28 36].
Key Results: Θ(log n) vs. Θ(n) steps on parallel systems. Up to 108× backward pass speedup (→ 2.17× overall speedup).
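A minimal sketch of the running prefix-sum example from the slides, in plain Python, just to pin down what a scan computes before the parallel version is introduced:

```python
# The slides' running example: an (inclusive) scan of [1..8] under "+"
# produces the prefix sums. A sequential scan takes Theta(n) steps;
# Blelloch's parallel scan takes Theta(log n) steps given enough workers.
from itertools import accumulate

xs = [1, 2, 3, 4, 5, 6, 7, 8]
print(list(accumulate(xs)))  # [1, 3, 6, 10, 15, 21, 28, 36]
```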
1 Rumelhart et al. "Learning representations by back-propagating errors." Nature (1986)
[Figure: a chain of operators (Linear, ReLU, Linear, Loss). The forward pass maps y through each operator m, producing g(y)_m; the backward pass multiplies the upstream gradient by each operator's Jacobian ∂g(y)/∂y, layer by layer.]
1 Keskar, Nitish Shirish et al. "On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima." ICLR (2017)
2 Shallue, Christopher J. et al. "Measuring the Effects of Data Parallelism on Neural Network Training." Journal of Machine Learning Research 20 (2019)
[Figure: the backward pass visits the chain y_1, y_2, ..., y_n one operator m_i at a time; a strong sequential dependency ties every layer to the next.]
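The dependency is easy to see once the backward pass is written out as repeated vector-Jacobian products. A minimal NumPy sketch (shapes and values are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
jacobians = [rng.standard_normal((4, 4)) for _ in range(3)]  # J_1, J_2, J_3 of a 3-op chain
grad = rng.standard_normal(4)                                # d loss / d y_3 from the loss op

# Each step needs the previous result: a strong sequential dependency along layers.
for J in reversed(jacobians):
    grad = J.T @ grad        # d loss / d y_{i-1} = J_i^T (d loss / d y_i)
```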
1 Harlap, Aaron et al. "PipeDream: Fast and Efficient Pipeline Parallel DNN Training." SOSP (2019)
2 Huang, Yanping et al. "GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism." NeurIPS (2019)
[Figure: pipeline-parallel systems (PipeDream, GPipe) split the chain of layers (Conv, Conv, ..., Linear) and their per-layer gradient computations across devices.]
A scan applies a binary operator across a sequence and keeps every partial result: scanning [1 2 3 4 5 6 7 8] under + gives the prefix sums [1 3 6 10 15 21 28 36].
1 Blelloch, Guy E. "Prefix sums and their applications." Technical Report (1990)
[Figure: Blelloch's scan algorithm runs in two tree-shaped phases over time. The up-sweep combines pairs (A, B) into A+B to build partial sums; the down-sweep pushes them back down, turning (A, B) into (B, A+B). With enough parallel workers, both phases finish in Θ(log n) steps.]
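A sketch of Blelloch's scan in plain Python (assuming the input length is a power of two); the two inner loops are the ones that run in parallel on a GPU, which is where the Θ(log n) step count comes from:

```python
def blelloch_scan(xs, op=lambda a, b: a + b, identity=0):
    a = list(xs)
    n = len(a)
    d = 1
    while d < n:                       # up-sweep (reduce) phase
        for i in range(0, n, 2 * d):   # on a GPU this inner loop runs in parallel
            a[i + 2 * d - 1] = op(a[i + d - 1], a[i + 2 * d - 1])
        d *= 2
    a[n - 1] = identity                # down-sweep phase
    d = n // 2
    while d >= 1:
        for i in range(0, n, 2 * d):   # also parallel within each level
            t = a[i + d - 1]
            a[i + d - 1] = a[i + 2 * d - 1]
            a[i + 2 * d - 1] = op(t, a[i + 2 * d - 1])
        d //= 2
    return a

print(blelloch_scan([1, 2, 3, 4, 5, 6, 7, 8]))  # exclusive scan: [0, 1, 3, 6, 10, 15, 21, 28]
```

Note that the down-sweep combines its two operands in a fixed order; with + the order is irrelevant, but it matters for the matrix products used next.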
Reformulating BP as a scan: let G_i = ∇_{y_i} ℓ (the gradient of the loss with respect to y_i) and J_{i+1} = ∂y_{i+1} / ∂y_i (the Jacobian of operator i+1). Scanning the sequence [G_n, J_n, J_{n-1}, ..., J_1] under matrix multiplication produces [I, G_n, G_{n-1}, ..., G_1], i.e., every per-layer gradient.
However, matrix multiplications are noncommutative (AB ≠ BA in general), so the down-sweep must combine its operands in the right order; BPPSA modifies the down-sweep accordingly.
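As a correctness sketch (not the paper's CUDA implementation, and the transpose convention here is an assumption about how the Jacobians are laid out), the scan formulation can be checked with a sequential accumulate over the Jacobian sequence using a noncommutative operator:

```python
from itertools import accumulate
import numpy as np

rng = np.random.default_rng(0)
jacobians = [rng.standard_normal((3, 3)) for _ in range(4)]  # J_1 ... J_4
grad_out = rng.standard_normal(3)                            # G_4 = d loss / d y_4

# Scan [G_4, J_4^T, J_3^T, J_2^T, J_1^T] with op(acc, J) = J @ acc.
elems = [grad_out] + [J.T for J in reversed(jacobians)]
grads = list(accumulate(elems, lambda acc, J: J @ acc))      # [G_4, G_3, G_2, G_1, G_0]
```

This sequential scan is exactly the chain BP already computes; BPPSA runs it with the modified Blelloch up-sweep/down-sweep so the Θ(n) chain becomes Θ(log n) parallel steps.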
Training LeNet-5 on CIFAR-10 (baseline: PyTorch Autograd)
Generating the Jacobians: each operator's Jacobian ∂g(y)/∂y can be obtained by running the operator's gradient function on one-hot vectors, one row at a time, e.g., Conv2d_Grad([1 0 0 ... 0]), Conv2d_Grad([0 1 0 ... 0]), ..., Conv2d_Grad([0 0 ... 0 1]). For the first convolution of VGG-11 on CIFAR-10, where the input y has 3072 elements, storing the resulting Jacobian densely would take 768 MB.
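An illustrative sketch of the same idea using PyTorch autograd as a stand-in for the Conv2d_Grad calls on the slide (the shapes and weights below are made up):

```python
import torch

x = torch.randn(3, 8, 8, requires_grad=True)   # small CIFAR-like input, just for illustration
w = torch.randn(4, 3, 3, 3)                     # arbitrary conv weights
y = torch.nn.functional.conv2d(x.unsqueeze(0), w).flatten()

rows = []
for i in range(y.numel()):
    one_hot = torch.zeros_like(y)
    one_hot[i] = 1.0                            # the "( ... 1 ... )" vectors on the slide
    # One vector-Jacobian product per output coordinate gives one row of the Jacobian.
    (row,) = torch.autograd.grad(y, x, grad_outputs=one_hot, retain_graph=True)
    rows.append(row.flatten())

jacobian = torch.stack(rows)                    # dense here; BPPSA emits a sparse form instead
```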
[Figure: Jacobian structure of Conv2d, ReLU, and MaxPool2D, with entries split into guaranteed zeros, possible zeros, and non-zeros.]
Measured sparsity for the first three ops of VGG-11 on CIFAR-10: Convolution 0.99157, ReLU 0.99998, Max Pooling 0.99994.
The Jacobians are therefore stored in CSR format (data, indices, indptr arrays).
Jacobian calculation speedup for the same three ops: Convolution 8.3×10³×, ReLU 1.2×10⁶×, Max Pooling 1.5×10⁵×.
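A small illustration (not the paper's kernels) of why CSR helps: the Jacobian of an elementwise ReLU is diagonal with entries in {0, 1}, so SciPy's CSR representation keeps exactly the three arrays named on the slide:

```python
import numpy as np
from scipy.sparse import csr_matrix

x = np.array([0.5, -1.0, 2.0, -0.3])
J_relu = np.diag((x > 0).astype(np.float32))     # dense 4x4 Jacobian, mostly zeros
J_csr = csr_matrix(J_relu)                       # keeps only the non-zeros

print(J_csr.data, J_csr.indices, J_csr.indptr)   # the data / indices / indptr arrays
```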
Per-step Complexity (C): runtime of each step.
[Figure: each up-sweep step combines its two operands (A, B) by one matrix multiplication, and each down-sweep step does the same in the order required by noncommutativity.]
Benchmark setup (V100 GPU): a vanilla RNN whose cell computes
h_t^(l) = tanh(W_ih x_t^(l) + b_ih + W_hh h_{t-1}^(l) + b_hh),
trained on a synthetic bitstream dataset in which each step of sample l is drawn as
x_t^(l) ~ Bernoulli(0.05 + c^(l) × 0.1),
where c^(l) is the sample's class label.
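A sketch of this setup under stated assumptions (the sizes and data layout below are guesses; the cell equation is PyTorch's standard tanh RNN):

```python
import torch

T, num_samples, num_classes = 1000, 32, 10          # assumed sizes, for illustration only
labels = torch.randint(num_classes, (num_samples,)) # c^(l)
probs = 0.05 + labels.float() * 0.1                  # per-sample bit probability from the slide
x = torch.bernoulli(probs.repeat(T, 1)).unsqueeze(-1)  # (T, batch, 1) bitstream

# Vanilla RNN cell: h_t = tanh(W_ih x_t + b_ih + W_hh h_{t-1} + b_hh)
rnn = torch.nn.RNN(input_size=1, hidden_size=20, nonlinearity="tanh")
out, h_n = rnn(x)                                    # forward pass only
```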
[Figure: training loss vs. wall-clock time (s), Baseline vs. BPPSA.]
Numerical differences do not affect convergence. BPPSA yields a 2.17× speedup on the overall training time.
[Figure: backward pass speedup over the baseline as the sequence length T grows from 10 to 30k.]
[Figure: backward pass speedup over the baseline as the fraction of GPU per sample (1/B) grows from 1/256 to 1/2.]
[Figure: latency (ms) per iteration vs. the fraction of GPU per sample (1/B) and vs. the sequence length T, measured on RTX 2070 and RTX 2080 Ti GPUs.]
SM: Streaming Multiprocessor; i.e., "Parallel Cores".