SLIDE 1

Scaling Back-Propagation by Parallel Scan Algorithm

Shang Wang1,2, Yifan Bai1, Gennady Pekhimenko1,2


The original PPTX file can be downloaded from here.

SLIDE 2

Executive Summary

The back-propagation (BP) algorithm is popularly used in training deep learning (DL) models and implemented in many DL frameworks (e.g., PyTorch and TensorFlow).


Problem: BP imposes a strong sequential dependency along layers during the gradient computations.

SLIDE 3

Executive Summary

The back-propagation (BP) algorithm is popularly used in training deep learning (DL) models and implemented in many DL frameworks (e.g., PyTorch and TensorFlow).


Problem: BP imposes a strong sequential dependency along layers during the gradient computations. Key idea: We propose scaling BP by Parallel Scan Algorithm (BPPSA):

  • Reformulate BP into a scan operation.

(Figure: a scan example: input 1 2 3 4 5 6 7 8 and its prefix sums 1 3 6 10 15 21 28.)

SLIDE 4

Executive Summary

The back-propagation (BP) algorithm is popularly used in training deep learning (DL) models and implemented in many DL frameworks (e.g., PyTorch and TensorFlow).


Problem: BP imposes a strong sequential dependency along layers during the gradient computations. Key idea: We propose scaling BP by Parallel Scan Algorithm (BPPSA):

  • Reformulate BP into a scan operation.
  • Scaled by a customized parallel algorithm.

Key Results: Θ(log n) vs. Θ(n) steps on parallel systems. Up to 108× backward pass speedup (→ 2.17× overall speedup).

SLIDE 5

Back-propagation1 (BP) Everywhere


1Rumelhart et al. “Learning representations by back-propagating errors.”, Nature (1986)

SLIDE 6

BP’s Strong Sequential Dependency

(Figure: a chain of operators, Linear → ReLU → Linear → Loss $m$. At each operator $g$ with input $\vec{y}$, BP propagates the gradient backward through the transposed Jacobian $\frac{\partial g(\vec{y})}{\partial \vec{y}}$:)

$\nabla_{\vec{y}}\, m = \left(\frac{\partial g(\vec{y})}{\partial \vec{y}}\right)^{T} \nabla_{g(\vec{y})}\, m$

SLIDE 7

BP’s Strong Sequential Dependency

(Same figure as the previous slide: Linear → ReLU → Linear → Loss, with a transposed-Jacobian product $\nabla_{\vec{y}}\, m = \left(\frac{\partial g(\vec{y})}{\partial \vec{y}}\right)^{T} \nabla_{g(\vec{y})}\, m$ at every operator.)

Strong Sequential Dependency along layers: each layer's gradient can only be computed after the gradient of the following layer is available.
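To make the dependency concrete, here is a minimal sketch (hypothetical names, not the paper's code) of the backward pass as a chain of transposed-Jacobian products; each step needs the previous step's result, so a network of n layers takes n sequential steps:

```python
import numpy as np

def backprop_sequential(jacobians, grad_loss):
    """Sequential BP: jacobians[i] is the (transposed) Jacobian of layer i+1's
    input w.r.t. layer i's input; grad_loss is the gradient at the last activation."""
    grad, grads = grad_loss, []
    # Strong sequential dependency: iteration k cannot start before k-1 finishes.
    for J in reversed(jacobians):
        grad = J @ grad
        grads.append(grad)
    return list(reversed(grads))

rng = np.random.default_rng(0)
Js = [rng.standard_normal((3, 3)) for _ in range(4)]   # 4 toy "layers"
g = rng.standard_normal((3, 1))
print(backprop_sequential(Js, g)[0].shape)              # gradient at the first layer
```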

SLIDE 8

Data Parallel Training


Conceptually simple, widely used. Effectively increases the batch size:

  • Generalization gap1
  • Batch size scaling limit2

1Keskar, Nitish Shirish et al. “On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima.” ICLR (2017)
2Shallue, Christopher J. et al. “Measuring the Effects of Data Parallelism on Neural Network Training.” Journal of Machine Learning Research 20 (2019)

Constraint: The model must fit in one device.

Respects BP’s strong sequential dependency.

(Figure: each worker holds a full replica of the model; BP's strong sequential dependency remains within every replica.)

SLIDE 9

Model Parallel Training


Used when the model cannot fit in one device.

1Harlap, Aaron et al. “PipeDream: Fast and Efficient Pipeline Parallel DNN Training.” SOSP (2019)
2Huang, Yanping et al. “GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism.” NeurIPS (2019)

Prior works use pipeline parallel training1,2 to mitigate this problem, but have their own limitations:

  • Linear per-device space complexity.
  • Trade-off between the “bubble of idleness” and a potential effect on convergence.

(Figure: a model partitioned across devices, Conv → Conv → … → Linear, with gradients passed between the partitions.)

BP’s strong sequential dependency limits scalability.

SLIDE 10

Rethinking BP from an Algorithm Perspective


SLIDE 11

Rethinking BP from an Algorithm Perspective


  • Problems with strong sequential dependency were studied in the past (1980s), but in a much simpler context.

  • We propose scaling Back-Propagation by Parallel Scan Algorithm (BPPSA):
  • Reformulate BP as a scan operation.
  • Scale BP by a customized Blelloch Scan algorithm.
  • Leverage sparsity in the Jacobians.
SLIDE 12

What is a Scan1 Operation?

Input sequence: 1 2 3 4 5 6 7 8
Binary, associative operator: + (identity: 0)
Exclusive scan output: 0 1 3 6 10 15 21 28

Compute partial reductions at each step of the sequence.

1Blelloch, Guy E. ”Prefix sums and their applications”. Technical Report (1990)
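As a concrete illustration (a plain sequential sketch, not the parallel algorithm), an exclusive scan outputs, at each position, the reduction of all preceding elements:

```python
def exclusive_scan(xs, op=lambda a, b: a + b, identity=0):
    # out[i] = xs[0] op xs[1] op ... op xs[i-1]; out[0] is the identity.
    out, acc = [], identity
    for x in xs:
        out.append(acc)
        acc = op(acc, x)
    return out

print(exclusive_scan([1, 2, 3, 4, 5, 6, 7, 8]))
# [0, 1, 3, 6, 10, 15, 21, 28]
```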


SLIDE 14

Linear Scan


(Figure: the same scan example, input 1 2 3 4 5 6 7 8 and its prefix sums.)

On a single worker: perform the scan linearly; takes n steps.

Worker (p): an instance of execution, e.g., a core in a multi-core CPU.
Step: executing the operator once.
n: the number of elements.

With more workers: can we achieve sublinear steps?
SLIDE 15

Blelloch Scan: ① Up-sweep Phase


(Figure: up-sweep over 1 2 3 4 5 6 7 8: a reduction tree computes partial sums 3 7 11 15 at the first level and 10 26 at the next; each node combines its two children A and B into A+B.)

Compute partial sums via a reduction tree.

SLIDE 16

Blelloch Scan: ② Down-sweep Phase


(Figure: down-sweep over the same tree: starting from the identity at the root, each node's pair A, B is combined into B and A+B on the way down, producing the exclusive scan 0 1 3 6 10 15 21 28 in parallel.)

Combine partial sums across branches.
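A compact sketch of both phases (sequentially simulated here; on a parallel machine each inner loop runs concurrently, giving about 2·log₂ n steps on the critical path). Names and structure are illustrative, not the paper's CUDA implementation:

```python
def blelloch_exclusive_scan(x, op=lambda a, b: a + b, identity=0):
    a, n = list(x), len(x)
    assert n and (n & (n - 1)) == 0, "sketch assumes n is a power of two"
    # Up-sweep: build a reduction tree of partial sums.
    d = 1
    while d < n:
        for i in range(2 * d - 1, n, 2 * d):   # iterations are independent -> parallelizable
            a[i] = op(a[i - d], a[i])
        d *= 2
    # Down-sweep: push partial results back down, starting from the identity.
    a[n - 1] = identity
    d = n // 2
    while d >= 1:
        for i in range(2 * d - 1, n, 2 * d):   # iterations are independent -> parallelizable
            left = a[i - d]
            a[i - d] = a[i]          # left child receives the parent's value
            a[i] = op(left, a[i])    # right child receives (old left) op parent
        d //= 2
    return a

print(blelloch_exclusive_scan([1, 2, 3, 4, 5, 6, 7, 8]))
# [0, 1, 3, 6, 10, 15, 21, 28]
```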

SLIDE 17

Blelloch Scan: Efficiency


2·log n: logarithmic number of steps along the critical path.

(Figure: the complete up-sweep and down-sweep trees for the example 1 2 3 4 5 6 7 8.)

SLIDE 18

Reformulate BP as a Scan Operation

(Same scan template as before: an input sequence, a binary associative operator, an identity, and an exclusive scan output, now instantiated with gradients and Jacobians instead of numbers.)

Key Insight: matrix multiplication in BP is also binary & associative!

SLIDE 19

Reformulate BP as a Scan Operation

Input sequence: $G_7, J_7, J_6, J_5, J_4, J_3, J_2, J_1$, where $G_i = \nabla_{\vec{y}_i}\, m$ and $J_{i+1} = \left(\frac{\partial \vec{y}_{i+1}}{\partial \vec{y}_i}\right)^{T}$.

Binary, associative operator: $A \diamond B = BA$. Identity: $I$.

Exclusive scan output: $I, G_7, G_6, G_5, G_4, G_3, G_2, G_1$.

Key Insight: matrix multiplication in BP is also binary & associative!
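The reformulation can be checked with a tiny sketch (hypothetical names; the paper parallelizes this scan with the customized Blelloch algorithm rather than the linear loop used here). An exclusive scan with $A \diamond B = BA$ over $[G_n, J_n, \ldots, J_1]$ reproduces exactly the gradients of sequential BP:

```python
import numpy as np

def diamond(a, b):
    """A ◊ B = BA (reversed matrix product); None stands for the identity I."""
    if a is None:
        return b
    if b is None:
        return a
    return b @ a

def exclusive_scan(seq, op, identity=None):
    out, acc = [], identity
    for x in seq:
        out.append(acc)
        acc = op(acc, x)
    return out

rng = np.random.default_rng(0)
n, d = 4, 3
J = [rng.standard_normal((d, d)) for _ in range(n)]   # transposed Jacobians J_1..J_n
G_n = rng.standard_normal((d, 1))                      # gradient at the last activation

scan_out = exclusive_scan([G_n] + J[::-1], diamond)    # [I, G_n, G_{n-1}, ..., G_1]

# Cross-check against plain sequential back-propagation.
grads_bp, g = [G_n], G_n
for Ji in J[::-1][:-1]:      # J_1 contributes to no exclusive output
    g = Ji @ g
    grads_bp.append(g)
assert all(np.allclose(a, b) for a, b in zip(scan_out[1:], grads_bp))
print("scan output matches sequential BP")
```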

SLIDE 20

Scale BP by Blelloch Scan

2·log n: logarithmic number of steps along the critical path!

(Figure: the numeric Blelloch scan example again, as the template for the BP version on the next slides.)

SLIDE 21

Scale BP by Blelloch Scan

2·log n: logarithmic number of steps along the critical path!

(Figure: the Blelloch scan applied to $G_7, J_7, \ldots, J_1$: the up-sweep forms composite Jacobians such as $J_{1:2}, J_{3:4}, J_{5:6}, J_{1:4}$, and the down-sweep emits the gradients $G_7, G_6, \ldots, G_1$.)

SLIDE 22

Scale BP by Blelloch Scan

(Same figure as the previous slide.)

Matrix multiplications are noncommutative, so the down-sweep combine must keep the factors in the right order (BA rather than AB); the Blelloch scan is customized accordingly.


SLIDE 24

Reconstructs the Original BP Exactly

Training LeNet-5 on CIFAR-10 (baseline: PyTorch Autograd)


Our method produces gradients mathematically equivalent to BP. The Jacobians are multiplied in a different order, which leads to numerical differences. We show empirically that such differences do not affect convergence.

SLIDE 25

Jacobians are Memory & Compute Hungry

A full Jacobian can be prohibitively expensive to handle.

  • e.g., the Jacobian of the 1st convolution in VGG-11 on CIFAR-10 images occupies 768 MB of memory.

(Figure: the Jacobian $\frac{\partial g(\vec{y})}{\partial \vec{y}}$ of this convolution is a 65536 × 3072 matrix, i.e., 768 MB.)

SLIDE 26

Jacobians are Memory & Compute Hungry

A full Jacobian can be prohibitively expensive to handle.

  • e.g., the Jacobian of the 1st convolution in VGG-11 on CIFAR-10 images occupies 768 MB of memory.
  • Generated one row at a time by passing basis vectors into Op_Grad() (the VJP function), as sketched after this slide.

(Figure: Conv2d_Grad called once per basis vector, (1 0 0 0 0 0 0), (0 1 0 0 0 0 0), etc.; each call yields one row of the Jacobian.)
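A hedged sketch of that row-by-row VJP procedure using PyTorch's autograd (function and variable names are illustrative, not the paper's Op_Grad implementation):

```python
import torch

def jacobian_by_rows(op, x):
    """Build the Jacobian of op at x one row at a time via vector-Jacobian products."""
    x = x.detach().requires_grad_(True)
    y = op(x)
    rows = []
    for i in range(y.numel()):
        basis = torch.zeros(y.numel())
        basis[i] = 1.0                               # i-th basis vector e_i
        (row,) = torch.autograd.grad(
            y, x, grad_outputs=basis.reshape_as(y), retain_graph=True
        )
        rows.append(row.reshape(-1))                 # row i of the Jacobian
    return torch.stack(rows)                         # (y.numel(), x.numel())

# Tiny demo with ReLU: the Jacobian is diagonal and very sparse.
x = torch.tensor([1.0, -2.0, 3.0])
print(jacobian_by_rows(torch.relu, x))
```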


SLIDE 30

Jacobians are Memory & Compute Hungry

A full Jacobian can be prohibitively expensive to handle.

  • e.g., the Jacobian of the 1st convolution in VGG-11 on CIFAR-10 images occupies 768 MB of memory.
  • Generated one row at a time by passing basis vectors into Op_Grad() (the VJP function).

Conventional ML algorithms avoid using Jacobians directly (including BP).


SLIDE 31

The Jacobians of Many Operators are Sparse


(Figure: Jacobian sparsity patterns of Conv2d, ReLU, and MaxPool2D: guaranteed zeros, possible zeros, and non-zeros.)

Sparsity of the first three ops of VGG-11 on CIFAR-10:
  Convolution: 0.99157
  ReLU: 0.99998
  Max Pooling: 0.99994

Guaranteed zeros follow a deterministic pattern known ahead of training time → potentially better SpGEMM performance.

SLIDE 32

Fast Sparse Jacobians Generation

Therefore, instead of calculating the Jacobians row-wise, generate them directly in Compressed Sparse Row (CSR) format (see the sketch below the speedup table):

(Figure: a tiny CSR example for a Conv2d Jacobian: its data, indices, and indptr arrays.)

Jacobian calculation speedup for the first three ops of VGG-11 on CIFAR-10:
  Convolution: 8.3×10³×
  ReLU: 1.2×10⁶×
  Max Pooling: 1.5×10⁵×
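A minimal sketch of generating a Jacobian directly in CSR form (data, indices, indptr) rather than materializing dense rows; ReLU is the simplest case since its Jacobian is diagonal, with a 1 wherever the input is positive. The helper below is illustrative, not the paper's kernel:

```python
import numpy as np
from scipy.sparse import csr_matrix

def relu_jacobian_csr(x):
    nonzero_cols = np.flatnonzero(x > 0)            # columns holding a 1
    data = np.ones(nonzero_cols.size, dtype=np.float32)
    indices = nonzero_cols
    # Row i has one nonzero iff x[i] > 0; indptr is the running count of nonzeros.
    indptr = np.concatenate([[0], np.cumsum(x > 0)])
    return csr_matrix((data, indices, indptr), shape=(x.size, x.size))

x = np.array([1.0, -2.0, 0.5, -0.1])
print(relu_jacobian_csr(x).toarray())   # diagonal 1s at positions 0 and 2
```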


SLIDE 33

Complexity Analysis

Per-step complexity (C): the runtime of each step.

Runtime: $\Theta(\log n) \cdot C_{\mathrm{BPPSA}}$ for BPPSA vs. $\Theta(n) \cdot C_{\mathrm{BP}}$ for BP.

SLIDE 34

Complexity Analysis

Performance benefits:

1. Large n: deep network, long sequential dependency.

Runtime: $\Theta(\log n) \cdot C_{\mathrm{BPPSA}}$ for BPPSA vs. $\Theta(n) \cdot C_{\mathrm{BP}}$ for BP.

SLIDE 35

Complexity Analysis

Performance benefits:

1. Large n: deep network, long sequential dependency. 2. Reducing per-step complexity: SpGEMM.

Runtime: $\Theta(\log n) \cdot C_{\mathrm{BPPSA}}$ for BPPSA vs. $\Theta(n) \cdot C_{\mathrm{BP}}$ for BP.

SLIDE 36

Complexity Analysis

Performance benefits:

1. Large n: deep network, long sequential dependency. 2. Reducing per-step complexity: SpGEMM.

Constant per-device space complexity! (The up-sweep combine A, B → BA and the down-sweep combine are performed in place.)

Runtime: $\Theta(\log n) \cdot C_{\mathrm{BPPSA}}$ for BPPSA vs. $\Theta(n) \cdot C_{\mathrm{BP}}$ for BP.

SLIDE 37

Methodology: Benchmark

Task: Bitstream Classification


Model: RNN

V100

RNN cell: $h_t = \tanh(W_{ih}\, x_t + b_{ih} + W_{hh}\, h_{t-1} + b_{hh})$

Bitstream data: C = 4 classes; for class $c$, each bit is drawn as $x_t \sim \mathrm{Bernoulli}(0.05 + c \times 0.1)$.
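A small sketch of this synthetic benchmark under the stated assumptions (names and the hidden size are illustrative): generate one bitstream per class and run the Elman cell over it:

```python
import numpy as np

def make_bitstream(c, T, rng):
    # Each bit is i.i.d. Bernoulli(0.05 + c * 0.1) for class c.
    return (rng.random(T) < 0.05 + c * 0.1).astype(np.float32)

rng = np.random.default_rng(0)
T, C, H = 1000, 4, 8                      # sequence length, classes, hidden size
X = np.stack([make_bitstream(c, T, rng) for c in range(C)])   # (C, T)

# Elman RNN cell: h_t = tanh(W_ih x_t + b_ih + W_hh h_{t-1} + b_hh)
W_ih = 0.1 * rng.standard_normal((H, 1))
W_hh = 0.1 * rng.standard_normal((H, H))
b_ih, b_hh = np.zeros(H), np.zeros(H)
h = np.zeros((C, H))
for t in range(T):
    x_t = X[:, t:t + 1]                   # (C, 1): one bit per stream
    h = np.tanh(x_t @ W_ih.T + b_ih + h @ W_hh.T + b_hh)
print(h.shape)                            # (4, 8): final hidden state per class
```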

SLIDE 38

Methodology: Environment


Baseline: PyTorch Autograd. Implementation: custom CUDA 10 kernels. Hardware: RTX 2070 (PyTorch 1.2, cuDNN 7.6.2) and RTX 2080 Ti (PyTorch 1.1, cuDNN 7.5.1).

SLIDE 39

End-to-end Training Speedup

Training curve of BPPSA vs. the baseline with batch size B = 16 and sequence length T = 1000:

(Figure: training loss vs. wall-clock time in seconds, for the baseline and BPPSA.)

Numerical differences do not affect convergence.

SLIDE 40

End-to-end Training Speedup

(Same training curve as the previous slide: BPPSA vs. the baseline, B = 16, T = 1000.)

Numerical differences do not affect convergence. 2.17× speedup on the overall training time.

SLIDE 41

Sensitivity Analysis: Model Length

(Figure: backward pass speedup over the baseline vs. sequence length T, from 10 to 30k.)

Sequence length (T) reflects the model length n. BPPSA scales with the model length (n);

SLIDE 42

Sensitivity Analysis: Model Length

(Same figure as the previous slide.)

Sequence length (T) reflects the model length n. BPPSA scales with the model length (n), until being bounded by the number of workers (p).

SLIDE 43

Sensitivity Analysis: Model Length

(Same figure, with the peak annotated: up to 108× backward pass speedup.)

Sequence length (T) reflects the model length n. BPPSA scales with the model length (n), until being bounded by the number of workers (p).

SLIDE 44

Sensitivity Analysis: Number of Workers

Fraction of GPU per sample (1/B) reflects the number of workers p. BPPSA scales with the number of workers (p).

(Figure: backward pass speedup over the baseline vs. fraction of GPU per sample (1/B), from 1/256 to 1/2.)


SLIDE 48

Sensitivity Analysis: 2070 vs. 2080 Ti

(Figure: latency (ms) per iteration vs. fraction of GPU per sample (1/B), and vs. sequence length (T), for the RTX 2070 and RTX 2080 Ti.)

#SMs(2070) < #SMs(2080 Ti) → Latency(2070) > Latency(2080 Ti)

SM: Streaming Multiprocessor, i.e., “parallel cores”.

SLIDE 49

More Results in the Paper

  • End-to-end benchmarks of GRU training on IRMAS.
  • A more realistic version of the RNN results.
  • Pruned VGG-11 retraining on CIFAR-10.
  • Microbenchmark via FLOP measurements.
  • Evaluate the effectiveness of leveraging the Jacobians’ sparsity in CNNs.


SLIDE 50

Conclusion


BP imposes a strong sequential dependency among layers during the gradient computations, limiting its scalability on parallel systems. We propose scaling Back-Propagation by Parallel Scan Algorithm (BPPSA):

  • Reformulate BP as a scan operation.
  • Scale by a customized Blelloch scan algorithm.
  • Leverage sparsity in the Jacobians.

Key Results: Θ(log n) vs. Θ(n) steps on parallel systems. Up to 108× speedup on the backward pass (→ 2.17× overall speedup).