Less is More: Accelerating Deep Neural Networks with Micro-Batching
Yosuke Oyama1,a, Tal Ben-Nun2, Torsten Hoefler2, Satoshi Matsuoka1
1) Tokyo Institute of Technology, 2) ETH Zurich
a) oyama.y.aa@m.titech.ac.jp (presenter)
2017/12/19
• cuDNN is adopted by most deep learning frameworks
• It contains multiple convolution algorithms for CNNs: GEMM, direct, FFT, Winograd, …
• Most algorithms use a workspace: a buffer in GPU memory that stores intermediate data
2D convolution (forward), for an N×C×H×W input X, K×C×U×V filters W, and output Y:

Y[n, k, h, w] = Σ_{c,u,v} W[k, c, u, v] · X[n, c, h+u, w+v]
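As an illustration of the formula (a pure-Python sketch, not cuDNN's implementation), the direct forward convolution can be computed over nested NCHW lists:

```python
# Direct 2D convolution (forward), "valid" mode (no padding, stride 1):
# Y[n, k, h, w] = sum_{c,u,v} W[k, c, u, v] * X[n, c, h+u, w+v]
def conv2d_forward(X, W):
    N, C, H_in, W_in = len(X), len(X[0]), len(X[0][0]), len(X[0][0][0])
    K, _, U, V = len(W), len(W[0]), len(W[0][0]), len(W[0][0][0])
    H_out, W_out = H_in - U + 1, W_in - V + 1
    Y = [[[[0.0] * W_out for _ in range(H_out)] for _ in range(K)]
         for _ in range(N)]
    for n in range(N):
        for k in range(K):
            for h in range(H_out):
                for w in range(W_out):
                    Y[n][k][h][w] = sum(
                        W[k][c][u][v] * X[n][c][h + u][w + v]
                        for c in range(C) for u in range(U) for v in range(V))
    return Y
```

Note that the inner triple sum runs over input channels and the filter window, exactly as in the formula; cuDNN's algorithms (GEMM, FFT, Winograd, …) compute the same result by different means.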
• Concern: there are considerable performance gaps (w.r.t. time and workspace size) among convolution algorithms
• e.g., an inappropriate workspace limit on AlexNet leads to a ~4.51x slowdown
[Figure: execution time vs. workspace size of AlexNet conv2 (forward); mini-batch size 256, NVIDIA Tesla P100-SXM2, cuDNN 7.0. Algorithms: 0 IMPLICIT_GEMM, 1 IMPLICIT_PRECOMP_GEMM, 2 GEMM, 3 DIRECT, 4 FFT, 5 FFT_TILING, 6 WINOGRAD, 7 WINOGRAD_NONFUSED]
If the workspace limit is ≥ 323 MiB, the fastest algorithm is selectable; if it is < 323 MiB, the best remaining algorithm is 4.51x slower.
• Observation: a smaller (“Less”) batch size enables higher (“More”) attainable performance
• Faster algorithms can be enabled by dividing the mini-batch
[Figure: computation performance (images/ms) and workspace size (MiB) of FFT_TILING as a function of batch size]
Dividing the batch retains 93% of the performance with only 58% of the workspace.
• µ-cuDNN divides one mini-batch into more fine-grained batches (“micro-batches”) for cuDNN convolutions
• µ-cuDNN optimizes micro-batch sizes and algorithms using Dynamic Programming (DP) and Integer Linear Programming (ILP)
• µ-cuDNN on an NVIDIA Tesla P100-SXM2 GPU achieves:
  • up to 2.33x speedup for a single convolution
  • up to 1.63x speedup for the convolutions of a CNN
• µ-cuDNN is installed by replacing cudnnHandle_t with UcudnnHandle_t in a deep learning framework (e.g., Caffe requires only 3 lines of modification)
• µ-cuDNN overloads some of the cuDNN functions:
  • it internally divides cudnnConvolution* calls into multiple convolutions
  • it delegates all other functions to cuDNN itself
• Workspace Reuse (WR): each layer reuses a private workspace; the total workspace size is O(#layers)
• Workspace Division (WD): each layer takes a part of one unified workspace; the total workspace size is constant
Comparison of WR and WD:
• Total workspace size: up to [WS limit per layer] × [#layers] (WR) vs. [total WS limit] (WD)
• Micro-batch division is optimized by: DP (WR) vs. DP + ILP (WD)
• The workspace is managed by: the DL framework (WR) vs. µ-cuDNN (WD)
• The WS limit is passed via: the cuDNN interface (WR) vs. an environment variable (WD)
The optimal time T(B) for a mini-batch of size B satisfies the recurrence

T(B) = min( Tµ(B), min_{b=1,2,…,B−1} [ Tµ(b) + T(B−b) ] )

where Tµ(b) is the time of the fastest single convolution of micro-batch size b, with micro-configuration cµ(b) = (a, b); a is the algorithm ID, and Tµ(b) and a are obtained by cudnnFindConvolution*Algorithm. The candidate micro-batch sizes for mini-batch size B are Ball(B) = {1, 2, …, B}, BpowerOfTwo(B) = {2^0, 2^1, …, B}, or Bundivided(B) = {B}.

Each step of the dynamic programming selects

(b̂1, b̂2) = argmin_{b1+b2=b} { Tµ(b1) + T(b2) }
T(b) = Tµ(b̂1) + T(b̂2)
c(b) = {cµ(b̂1)} ∪ c(b̂2)

[Example: a mini-batch of 256 is divided into four micro-batches of 60 plus one of 16, so T(256) = Tµ(60) + T(196), and so on.]
Example: for a mini-batch of 256,

c(256) = {(4, 60), (4, 60), (4, 60), (4, 60), (0, 16)}

Each step peels off one micro-batch with cµ(60) = (4, 60):

c(196) = {(4, 60), (4, 60), (4, 60), (0, 16)}
c(136) = {(4, 60), (4, 60), (0, 16)}
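The recurrence can be sketched in Python. The micro-batch times t_mu below are assumed placeholder values, standing in for the Tµ(b) measurements that cudnnFindConvolution*Algorithm would provide:

```python
# Dynamic-programming sketch of the micro-batch division:
# T(b) = min over candidate sizes b1 of T_mu(b1) + T(b - b1).
def optimize_microbatches(B, t_mu):
    INF = float("inf")
    T = [0.0] + [INF] * B          # T[b]: best total time for batch size b
    split = [0] * (B + 1)          # split[b]: first micro-batch in the optimum
    for b in range(1, B + 1):
        for b1 in sorted(t_mu):    # candidate micro-batch sizes
            if b1 <= b and t_mu[b1] + T[b - b1] < T[b]:
                T[b], split[b] = t_mu[b1] + T[b - b1], b1
    config, b = [], B
    while b > 0:                   # unwind c(b) = {c_mu(b1)} ∪ c(b - b1)
        config.append(split[b])
        b -= split[b]
    return T[B], config
```

With t_mu = {16: 2.0, 60: 5.0, 256: 30.0}, a mini-batch of 256 is divided into four micro-batches of 60 plus one of 16, mirroring the example above.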
• M: total workspace size
• K: the set of convolution kernels
• Ck: the set of configurations for kernel k
• Tk(c), Mk(c): execution time and workspace size of configuration c
min. T = Σ_{k∈K} Σ_{c∈Ck} Tk(c) · xk,c            (total execution time)
s.t.  Σ_{k∈K} Σ_{c∈Ck} Mk(c) · xk,c ≤ M           (total workspace size must be less than M)
      Σ_{c∈Ck} xk,c = 1  (∀k ∈ K)                 (exactly one configuration is selected for each kernel)
      xk,c ∈ {0, 1}      (∀k ∈ K, ∀c ∈ Ck)

where xk,c = 1 means that configuration c is selected for kernel k.
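For illustration, here is a brute-force stand-in for this 0-1 ILP (µ-cuDNN itself uses GLPK, and the real configuration sets are far larger); each configuration is a hypothetical (time, workspace) pair:

```python
from itertools import product

# Pick exactly one configuration per kernel, minimizing total time
# subject to the total workspace not exceeding M.
def select_configs(kernels, M):
    best_time, best_choice = float("inf"), None
    for choice in product(*kernels):          # one config per kernel
        time = sum(t for t, ws in choice)
        ws = sum(ws for t, ws in choice)
        if ws <= M and time < best_time:
            best_time, best_choice = time, choice
    return best_time, best_choice
```

A proper ILP solver explores this space far more efficiently, but the objective and constraints are the same as above.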
[Figure: each kernel conv k selects exactly one configuration (x1,u = 1, x2,v = 1, …, xk,c = 1); the selected workspace sizes Mk(c) stack up against the total limit M while the times Tk(c) sum to the objective.]
• Concern: the total number of configurations is Ω(#algo^B), too many to pass to the solver
• Definition: a configuration c is desirable in a set C ⟺ there is no c' ∈ C with T(c') < T(c) and M(c') < M(c)
(such a c' would make c undesirable, because c' is both faster and uses less workspace)
• We apply the pruning D(C) = {c ∈ C | ∄c' ∈ C, T(c') < T(c) ∧ M(c') < M(c)} at each iteration
• µ-cuDNN uses the GNU Linear Programming Kit (GLPK) as its ILP solver
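The pruning D(C) is a Pareto-front filter over (time, workspace) pairs; a minimal sketch:

```python
# Keep only configurations not strictly dominated in both execution
# time T and workspace size M by some other configuration.
def prune_desirable(configs):   # configs: list of (T, M) pairs
    return [c for c in configs
            if not any(o[0] < c[0] and o[1] < c[1] for o in configs)]
```

Dominated points can never appear in an optimal ILP solution, so discarding them shrinks each Ck without affecting the optimum.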
• All CNN tensors are stored in float, NCHW format
• The workspace size limit is set to 8, 64, or 512 MiB
• NVIDIA Tesla P100-SXM2 @ TSUBAME 3.0: 10.6 SP TFlop/s, 16 GiB HBM2 memory, 732 GiB/s bandwidth
• NVIDIA Tesla K80 @ TSUBAME-KFC/DL: 8.73 SP TFlop/s, 24 GiB GDDR5 memory, 480 GiB/s bandwidth
[Figure: cudnnConvolutionForward time of AlexNet conv2 on NVIDIA Tesla P100-SXM2 under the undivided / powerOfTwo / all policies; workspace size 64 MiB, mini-batch size 256. Numbers on each rectangle are micro-batch sizes: the undivided run uses IMPLICIT_PRECOMP_GEMM on all 256 images, while the divided runs use FFT_TILING and WINOGRAD_NONFUSED on micro-batches of 32 and 48.]
• µ-cuDNN achieved a 1.40x speedup w.r.t. a whole training iteration (1.63x w.r.t. convolutions only) of AlexNet on the P100-SXM2
• The speedup was nearly 1.81x (2.10x, respectively) on the K80
[Figure: benchmark of AlexNet on NVIDIA Tesla P100-SXM2; workspace sizes 8, 64, 512 MiB; mini-batch size 256; u: undivided, p: powerOfTwo, a: all. Stacked execution time of conv1–conv5 and other layers, showing the 1.63x convolution and 1.40x overall speedups.]
• µ-cuDNN achieved a 1.11x speedup w.r.t. a whole training iteration (1.21x w.r.t. convolutions only) of ResNet-18 on the P100-SXM2
[Figure: benchmark of ResNet-18 on NVIDIA Tesla P100-SXM2; workspace sizes 8, 64, 512 MiB; mini-batch size 128; u: undivided, p: powerOfTwo, a: all. Stacked execution time of convolutional and other layers, showing the 1.21x convolution and 1.11x overall speedups.]
[Figure: a desirable configuration set of AlexNet conv2 (forward); mini-batch size 256, P100-SXM2. Execution time vs. workspace size, where each bar shows the proportion of micro-batch sizes and algorithms (IMPLICIT_GEMM, IMPLICIT_PRECOMP_GEMM, GEMM, FFT, FFT_TILING, WINOGRAD_NONFUSED).]
• The ILP-based algorithm utilizes nearly all of the workspace
[Figure: per-pass breakdown of the workspace sizes of AlexNet conv1–conv5 under WR and WD with the undivided / powerOfTwo / all policies; mini-batch size 256, total workspace limit 120 MiB, P100-SXM2.]
• Performance-sensitive kernels (conv2, conv3) aggressively occupy the workspace
• WD even beats WR configurations given 8x more total memory (1.24x and 1.38x speedups)
[Figure: benchmark of AlexNet on NVIDIA Tesla P100-SXM2; WR with 8, 64, or 512 MiB per kernel vs. WD with 120, 960, or 7680 MiB in total; mini-batch size 256; stacked execution time of conv1–conv5 and other layers.]
• µ-cuDNN with WD achieved 1.05x and 1.14x speedups on ResNet-50
[Figure: benchmark of ResNet-50 on NVIDIA Tesla P100-SXM2; WR with 8, 16, or 32 MiB per kernel vs. WD with 1272, 2544, or 5088 MiB in total; mini-batch size 32; stacked execution time of convolutional and other layers.]
• Background: cuDNN is fast, but DL frameworks cannot always utilize it well
• µ-cuDNN is conceptually equivalent to cuDNN:
  • if dividing would be slower than the original cuDNN, it simply uses cuDNN
  • the maximum memory consumption can be controlled
• Results: up to 2.33x speedup for a single convolution, up to 1.63x speedup for the convolutions of a CNN