SLIDE 1

Accelerating Deep Learning Frameworks with Micro-batches

Yosuke Oyama¹* · Tal Ben-Nun² · Torsten Hoefler² · Satoshi Matsuoka³,¹

September 13, 2018

¹Tokyo Institute of Technology ²ETH Zurich ³RIKEN Center for Computational Science
*oyama.y.aa@m.titech.ac.jp, Presenter

1/26

SLIDE 2

Background

SLIDE 3

Background

  • Convolution is one of the key operations in Convolutional Neural Networks (CNNs)

Figure 1: 2D convolution (input X of size H × W with C channels, filter W of size U × V, output Y of size H′ × W′ with C′ channels).

Algorithm 1 Pseudo-code of two-dimensional convolution.

1: for(n = 0; n < N; n++)                  // Mini-batch loop
2:   for(k = 0; k < K; k++)                // Output channel loop
3:     for(h = 0; h < H; h++)              // Height loop
4:       for(w = 0; w < W; w++)            // Width loop
5:         for(c = 0; c < C; c++)          // Input channel loop
6:           for(v = 0; v < V; v++)        // Kernel width loop
7:             for(u = 0; u < U; u++)      // Kernel height loop
8:               Y[n, k, h, w] += W[k, c, v, u] × X[n, c, h + v, w + u];
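The loop nest above can be written out directly; the following is a minimal Python sketch of Algorithm 1 (valid padding, stride 1; the nested-list array layout and function name are illustrative assumptions, not part of cuDNN):

```python
# Naive 2D convolution following Algorithm 1.
# X: input of shape N x C x (H+V-1) x (W+U-1); W: filters of shape K x C x V x U.
def conv2d(X, W, N, K, C, H, Wid, V, U):
    # Y[n][k][h][w]: output of shape N x K x H x Wid
    Y = [[[[0.0 for _ in range(Wid)] for _ in range(H)] for _ in range(K)]
         for _ in range(N)]
    for n in range(N):                          # mini-batch loop
        for k in range(K):                      # output channel loop
            for h in range(H):                  # height loop
                for w in range(Wid):            # width loop
                    for c in range(C):          # input channel loop
                        for v in range(V):      # kernel width loop
                            for u in range(U):  # kernel height loop
                                Y[n][k][h][w] += W[k][c][v][u] * X[n][c][h + v][w + u]
    return Y
```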

2/26

SLIDE 4

Background

  • The NVIDIA cuDNN library provides deep learning primitives for GPUs
  • cuDNN provides several equivalent convolution algorithms

Figure 2: Three different convolution algorithms: GEMM-based (im2col transforms X into X′, then X′ · W = Y), FFT-based (X̂ · Ŵ = Ŷ in the frequency domain, via F and F⁻¹), and Winograd (X̃ · W̃ = Ỹ in the Winograd domain, via Bᵀ, G, Aᵀ); each requires additional workspace.

3/26

SLIDE 5

Background

  • Problem statement: cuDNN may require a workspace as large as the network itself to use efficient convolution algorithms!

Figure 3: Memory consumption (bars: data, weights, workspace) and computation time (line/points) of AlexNet on P100-SXM2 with different workspace sizes (8, 64, 512 MiB); algorithms used include IMPLICIT_PRECOMP_GEMM, FFT, FFT_TILING, and WINOGRAD.

4/26

SLIDE 6

Background

  • Idea: Loop splitting of the convolution’s outermost loop decreases the workspace size (as well as computation efficiency)

Algorithm 2 Pseudo-code of two-dimensional convolution (the mini-batch loop on line 1 is the split target).

1: for(n = 0; n < N; n++)                  // Mini-batch loop
2:   for(k = 0; k < K; k++)                // Output channel loop
3:     for(h = 0; h < H; h++)              // Height loop
4:       for(w = 0; w < W; w++)            // Width loop
5:         for(c = 0; c < C; c++)          // Input channel loop
6:           for(v = 0; v < V; v++)        // Kernel width loop
7:             for(u = 0; u < U; u++)      // Kernel height loop
8:               Y[n, k, h, w] += W[k, c, v, u] × X[n, c, h + v, w + u];
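Splitting the mini-batch loop amounts to partitioning N into micro-batches, each of which is then executed as a separate convolution call with a smaller workspace. A minimal sketch (the helper name and uniform split are illustrative assumptions, not µ-cuDNN's API):

```python
# Partition a mini-batch of `batch_size` samples into micro-batches of at
# most `micro_batch_size`; the last micro-batch absorbs the remainder.
def split_batch(batch_size, micro_batch_size):
    sizes = []
    remaining = batch_size
    while remaining > 0:
        b = min(micro_batch_size, remaining)
        sizes.append(b)
        remaining -= b
    return sizes
```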

5/26

SLIDE 7

Approach and Contribution

  • Approach: µ-cuDNN, a thin wrapper library for cuDNN, which
  • divides a mini-batch into “micro-batches” by applying loop splitting
  • is based on Dynamic Programming (DP) and Integer Linear Programming (ILP)
  • provides a Python interface for high-level optimization

[Timeline figure: cuDNN executes conv1 with N = 256 using GEMM-based convolution, followed by relu1, pool1, and conv2 (N = 256); µ-cuDNN instead runs conv1 as two N = 128 micro-batches using FFT-based convolution, shortening the conv1 step and starting conv2 (as N = 64 micro-batches) earlier.]

  • Contribution:
  • 1.60x speedup for AlexNet on a V100-SXM2 GPU
  • up to 4.54x speedup (1.60x on average) for DeepBench on a V100-SXM2 GPU

6/26

SLIDE 8

µ-cuDNN

SLIDE 9

µ-cuDNN - Software stack

  • µ-cuDNN is a wrapper library for cuDNN, which can be called by
  • 1. a DL framework, as a low-level performance tuning library
  • 2. its dedicated Python frontend, for high-level performance analysis

Figure 4: µ-cuDNN software stack (path 1: user code → DL framework (C/C++) → µ-cuDNN (C) → cuDNN → NVIDIA GPU (CUDA); path 2: the µ-cuDNN Python frontend, connected via a file-based DB on (N)FS).

7/26

SLIDE 10

µ-cuDNN - Methodology

  • µ-cuDNN is enabled by replacing the cuDNN handle type cudnnHandle_t
  • 1. The DL framework passes the layer’s metadata via cudnnGetConvolution*Algorithm
  • 2. µ-cuDNN runs the ILP (or DP) optimizer and returns the resulting workspace size
  • 3. The framework calls cudnnConvolution* with the workspace size
  • 4. µ-cuDNN internally calls the convolution function one or more times

Framework-side pseudo-code:

for(i = 1..L) {
  cudnnGetConvolution*Algorithm(· · · );   // 1., 2.
  cudaMalloc(&ws[i], · · · );
}
// Training loop
for(· · · ) {
  for(i = 1..L)
    cudnnConvolution*(· · · , ws[i], · · · );   // 3., 4.
}

Figure 5: Workflow of µ-cuDNN (the framework exchanges metadata, workspace size, and workspace pointer with µ-cuDNN’s UcudnnConvolution*; the ILP and DP optimizers store results in an in-memory optimization-result cache).

8/26
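The in-memory optimization-result cache can be sketched as a memoizing wrapper around the optimizer (an illustrative Python analogue of the workflow above, not µ-cuDNN's actual C implementation; the class name, optimizer callback, and metadata key are assumptions):

```python
# Memoizing planner: run the (expensive) micro-batch optimizer once per
# distinct layer configuration and serve later queries from an in-memory cache.
class MicroBatchPlanner:
    def __init__(self, optimizer):
        self.optimizer = optimizer  # e.g., the DP or ILP optimizer
        self.cache = {}             # in-memory optimization-result cache

    def get_plan(self, layer_metadata):
        # layer_metadata: a hashable key (shapes, strides, datatype, ...)
        if layer_metadata not in self.cache:
            self.cache[layer_metadata] = self.optimizer(layer_metadata)
        return self.cache[layer_metadata]
```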

SLIDE 15

Workspace policies

  • µ-cuDNN employs one of two workspace utilization policies
  • Workspace Reuse (WR): Each layer reuses a private workspace
  • Workspace Division (WD): Each layer uses a part of a unified workspace

                         WR              WD
Maximum total WS size    O(# of layers)  constant
Optimizer                DP              DP+ILP
WS owner                 DL framework    µ-cuDNN

9/26

SLIDE 16

µ-cuDNN

WR

SLIDE 17

Workspace policies - WR

  • Problem: Given a mini-batch size B and the fastest micro-batch execution time Tµ(b) (b = 1, 2, · · · , B), compute T(B), where

T(b) = min { Tµ(b), min b′=1,2,...,b−1 { T(b′) + T(b − b′) } }

For example, T(120) can combine two micro-batches of b = 60, and T(256) may be realized by running conv1 as four micro-batches of b = 60 plus one of b = 16.

10/26

SLIDE 21

Workspace policies - WR

  • Solution: Use Dynamic Programming:

Algorithm 3 DP-based solution of WR policy.

1: for b = 1 to B do
2:   b̂µ ← argmin bµ=1,2,...,b { Tµ(bµ) + T(b − bµ) }
3:   T(b) ← Tµ(b̂µ) + T(b − b̂µ)
4:   c(b) ← {cµ(b̂µ)} + c(b − b̂µ)
5: end for
6: return c(B) // Configuration; a list of (algorithm ID, batch size)

Example: c(256) = {(FFT, 100), (FFT, 100), (GEMM, 56)} — conv1 runs as two FFT micro-batches of 100 and one GEMM micro-batch of 56.

11/26
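Algorithm 3 can be written out as a short bottom-up DP (a sketch; in practice the per-micro-batch times Tµ and configurations cµ come from benchmarking cuDNN, and the values in the usage below are invented):

```python
# DP-based solution of the WR policy (Algorithm 3): for each batch size b,
# split off the micro-batch size that minimizes Tµ(bµ) + T(b - bµ), and
# record the chosen (algorithm, micro-batch size) configuration list c(b).
def optimize_wr(B, T_mu, c_mu):
    # T_mu[b]: fastest single-call time for micro-batch size b (index 0 unused)
    # c_mu[b]: (algorithm ID, b) realizing T_mu[b]
    T = [0.0] * (B + 1)               # T[b]: best total time for batch size b
    c = [[] for _ in range(B + 1)]    # c[b]: chosen configuration list
    for b in range(1, B + 1):
        best_bmu = min(range(1, b + 1), key=lambda bm: T_mu[bm] + T[b - bm])
        T[b] = T_mu[best_bmu] + T[b - best_bmu]
        c[b] = [c_mu[best_bmu]] + c[b - best_bmu]
    return T[B], c[B]
```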

SLIDE 22

Workspace policies - WR

  • In practice, µ-cuDNN uses one of three micro-batch size granularities
  • µ-cuDNN with the undivided option acts exactly as cuDNN
  • µ-cuDNN increases the number of available algorithms by exploiting a higher computation precision (PSEUDO_HALF) than specified (TRUE_HALF), without decreasing accuracy

Table 1: Micro-batch size policies.

Policy       Micro-batch size set
all          {1, 2, 3, · · · , B}
powerOfTwo   {2⁰, 2¹, 2², · · · , B}
undivided    {B}

Table 2: cuDNN’s convolution datatypes.

Configuration  Data type  Compute type  FFT support
TRUE_HALF      half       half
PSEUDO_HALF    half       float         ✓
FLOAT          float      float         ✓
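Table 1's candidate sets can be generated with a small helper (an illustrative sketch; the function name is not part of µ-cuDNN):

```python
# Candidate micro-batch sizes for a mini-batch size B under each policy.
def micro_batch_sizes(policy, B):
    if policy == "all":
        return list(range(1, B + 1))
    if policy == "powerOfTwo":
        sizes = []
        b = 1
        while b < B:        # powers of two below B ...
            sizes.append(b)
            b *= 2
        sizes.append(B)     # ... plus B itself
        return sizes
    if policy == "undivided":
        return [B]          # behaves exactly like plain cuDNN
    raise ValueError(policy)
```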

12/26

SLIDE 23

µ-cuDNN

WD

SLIDE 24

Workspace policies - WD

  • Problem: choose one configuration f(k) for each kernel k:

argmin f:K→Ck ∑k∈K Tk(f(k))   (minimization of the total computation time)
s.t. ∑k∈K Mk(f(k)) ≤ M   (total memory consumption constraint)

  • where
  • M: Total workspace limit
  • K: A set of convolution kernels
  • Ck: A set of configurations of kernel k
  • Tk(c): The fastest execution time of kernel k with configuration c
  • Mk(c): The workspace size of kernel k with configuration c

  • Solution: Use ILP:

min T = ∑k∈K ∑c∈Ck Tk(c) xk,c
s.t. ∑k∈K ∑c∈Ck Mk(c) xk,c ≤ M
     ∑c∈Ck xk,c = 1 (∀k ∈ K)   (assign one configuration to each layer)
     xk,c ∈ {0, 1} (∀k ∈ K, ∀c ∈ Ck)

  • where f(k) = c ⇔ xk,c = 1

13/26
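For small layer counts, the WD selection problem above can be checked by exhaustive search (an illustrative stand-in for an ILP solver such as the GLPK solver used in the evaluation; the (time, workspace) candidate lists below are invented):

```python
from itertools import product

# Pick one (time, workspace) configuration per kernel so as to minimize
# total time under the total workspace limit M, by brute force.
def solve_wd(configs, M):
    """configs: list (one entry per kernel) of lists of (time, workspace)."""
    best = None
    for choice in product(*configs):
        mem = sum(m for _, m in choice)
        if mem > M:                       # violates the workspace constraint
            continue
        total = sum(t for t, _ in choice)
        if best is None or total < best[0]:
            best = (total, choice)
    return best                           # (total time, chosen configs) or None
```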

SLIDE 31

Workspace policies - WD

  • We only use configurations on the Pareto front to enumerate “desirable” configurations Ck

Figure 6: Pareto front of configurations in the (M(c), T(c)) plane; a dominated configuration c′ is excluded from Ck.
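Pareto-front filtering over (time, workspace) pairs can be sketched as follows (an illustrative helper; the configurations in the usage are invented):

```python
# Keep a configuration only if no other configuration is at least as fast
# AND uses at most as much workspace, with at least one strict improvement.
def pareto_front(configs):
    front = []
    for (t, m) in configs:
        dominated = any(
            (t2 <= t and m2 <= m) and (t2 < t or m2 < m)
            for (t2, m2) in configs
        )
        if not dominated:
            front.append((t, m))
    return front
```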

14/26

SLIDE 32

Workspace policies - WD

Figure 7: Pareto front of AlexNet’s “conv2” layer on P100-SXM2 (computation time vs. workspace size for IMPLICIT_GEMM, IMPLICIT_PRECOMP_GEMM, GEMM, FFT, FFT_TILING, and WINOGRAD_NONFUSED).

15/26

SLIDE 33

µ-cuDNN

High-level optimization frontend

SLIDE 34

High-level optimization frontend

  • µ-cuDNN’s Python interface performs framework-independent performance analysis
  • by passing layers’ metadata via a file-based database

Figure 8: µ-cuDNN software stack (the Python frontend exchanges layer metadata with the µ-cuDNN C library in three steps through a file-based DB on (N)FS).

16/26

SLIDE 35

High-level optimization frontend

  • We provide a function to minimize the training time of data-parallel training by assigning different micro-batch sizes to heterogeneous GPUs
  • We ignore the time to perform the inter-GPU all-reduce, since communication is typically overlapped with computation

Figure 9: Data-parallel training on a heterogeneous GPU cluster (e.g., 750Ti: b = 6, K20Xm: b = 10, K80: b = 16; each GPU alternates forward and backward passes, with all-reduce in between).

17/26

SLIDE 36

High-level optimization frontend

  • Solution: Given
  • G: A set of GPUs
  • B: A mini-batch size
  • tg,b: Computation time on GPU g with a batch size of b
  • ℬ: A batch-size set,

compute

min max g∈G { ∑b∈ℬ tg,b xg,b }   (minimize the slowest GPU’s time)
s.t. ∑b∈ℬ xg,b ≤ 1 (∀g ∈ G)   (select one batch size for each GPU)
     ∑g∈G ∑b∈ℬ b xg,b = B   (the total batch size should be equal to the mini-batch size)
     xg,b ∈ {0, 1} (∀g ∈ G, ∀b ∈ ℬ)

Figure 10: Illustration of the ILP problem (each GPU g selects one batch size b, setting xg,b = 1; the selected sizes sum to B, and the objective T is the slowest GPU’s time).

18/26
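For a handful of GPUs this min-max assignment can likewise be checked by brute force (an illustrative stand-in for the ILP solver; the timing table in the usage is invented, not measured):

```python
from itertools import product

# Assign one batch size per GPU so that the sizes sum to the mini-batch
# size B, minimizing the slowest GPU's time (the makespan).
def assign_batches(times, sizes, B):
    """times: per-GPU dicts {batch_size: time}; sizes: candidate batch sizes."""
    best = None
    for choice in product(sizes, repeat=len(times)):
        if sum(choice) != B:              # must cover the whole mini-batch
            continue
        makespan = max(t[b] for t, b in zip(times, choice))
        if best is None or makespan < best[0]:
            best = (makespan, choice)
    return best                           # (makespan, per-GPU batch sizes) or None
```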

SLIDE 40

Performance evaluation

SLIDE 41

Evaluation environment

  • GPUs: K80, P100-SXM2, V100-SXM2, K20Xm, 750Ti
  • cuDNN: 7.1 (or 6.0 for Caffe and TensorFlow)
  • Frameworks: Caffe 1.0, TensorFlow 1.4.1
  • LP solver: GNU Linear Programming Kit (GLPK) 4.63

Table 3: GPU specification.

GPU        Generation  FP32 [TFlop/s]  FP16¹ [TFlop/s]  Memory [GiB]  Tensor cores  Host
K80        Kepler      8.73            —                24            —             KFC²
P100-SXM2  Pascal      10.6            21.2             16            —             T3³
V100-SXM2  Volta       15.7            125              16            ✓             DGX-1
GTX 750Ti  Maxwell     1.31            —                2             —             KFC
K20Xm      Kepler      3.95            —                6             —             KFC

¹including Tensor Cores’ mixed-precision arithmetic ²TSUBAME-KFC/DL supercomputer ³TSUBAME 3.0 supercomputer

19/26

SLIDE 42

Single convolutional layer

  • µ-cuDNN achieves a 2.33x speedup on AlexNet’s “conv2” layer by utilizing both FFT-based convolution and Winograd’s algorithm
  • GEMM-based convolution requires only 4.3 KiB of workspace but is slow
  • FFT-based convolution is faster than GEMM, but it requires 213 MiB of workspace with a mini-batch size of 256

Figure 11: Time (bars) and micro-batch sizes (labels in bars) of the forward convolution of AlexNet’s “conv2” layer on P100-SXM2, under the policies u (undivided, i.e. plain cuDNN), p (powerOfTwo), and a (all); the all policy is 2.33x faster than undivided.
20/26

SLIDE 43

DeepBench

  • We evaluate µ-cuDNN with DeepBench’s 94 convolutional layers
  • µ-cuDNN achieves up to a 4.54x speedup (1.60x on average) on a V100-SXM2 GPU with Tensor Cores
  • µ-cuDNN exploits PSEUDO_HALF in 69% of the layers
  • µ-cuDNN also achieves 1.16x and 1.73x average speedups for 3 × 3 kernels on P100 and V100, respectively

Figure 12: Relative speedup of DeepBench’s forward convolution against cuDNN (min, 1st quartile, median, mean, 3rd quartile, and max per kernel size from 1×1 to 20×5, on K80, P100-SXM2, P100-SXM2 (half), V100-SXM2, V100-SXM2 (half), and V100-SXM2 (Tensor Cores)).

21/26

SLIDE 44

Caffe - WR policy

  • µ-cuDNN on the Caffe framework achieves a 1.45x speedup (and 1.60x w.r.t. convolutions alone) on a V100-SXM2 GPU
  • It achieves smaller speedups with a tiny (8 MiB) or huge (512 MiB) workspace, where micro-batching is less effective
  • µ-cuDNN achieves similar speedups on TensorFlow

Figure 13: Benchmark results of AlexNet on three different GPUs with different workspace sizes (8, 64, 512 MiB), under the u/p/a policies: (a) K80, up to 1.81x speedup; (b) P100-SXM2, up to 1.40x; (c) V100-SXM2, up to 1.45x (1.60x for convolutions alone).

22/26

SLIDE 45

Caffe - WD policy

  • µ-cuDNN on the Caffe framework achieves 1.38x and 1.14x speedups for the convolutional layers of AlexNet and ResNet-50 on a P100-SXM2 GPU
  • The time to solve the ILP problem was negligible (5.46 ms for ResNet-50)

Figure 14: Benchmark results of AlexNet on P100-SXM2 with different workspace sizes (8, 64, 512 MiB) and policies (WR vs. WD); WD reaches a 1.38x speedup.

23/26

SLIDE 46

Caffe - WD policy

  • µ-cuDNN with the WD policy spares most of the workspace for “conv2” and “conv3”,
  • which are the most time-consuming layers and can be accelerated by efficient convolution algorithms

Figure 15: Assigned workspace division of AlexNet (conv1–conv5; forward (F), backward-filter (BF), and backward-data (BD) passes) on P100-SXM2 under the WR and WD policies.

24/26

SLIDE 47

Case study: Heterogeneous cluster optimization

  • We estimate the time of forward-backward passes of ResNet-18 on three GPUs: 750Ti, K20Xm, and K80
  • Together, the GPUs accelerate the training of ResNet-18 by 2.20x over a single K80 GPU chip using the same mini-batch size
  • The time to perform all-reduce was negligible (2.63 ms for 1 MiB of data over 3 nodes with MVAPICH 2.3a)

Figure 16: Estimated time of forward-backward passes of ResNet-18 on single GPUs and heterogeneous combinations of 750Ti, K20Xm, K80, and K80×2, with the assigned per-GPU batch sizes; times range from 334.1 ms down to 59.5 ms, and the best combination is 2.20x faster than a single K80 chip (130.8 ms).

25/26

SLIDE 48

Conclusion

  • µ-cuDNN is a “free-lunch” auto-tuning library for cuDNN, which
  • does not violate the semantics of computation
  • even improves computation precision for better performance in some cases
  • is independent of the underlying framework
  • generalizes performance analysis around convolutional layers

µ-cuDNN is available online at https://github.com/spcl/ucudnn

26/26