Less is More: Accelerating Deep Neural Networks with Micro-Batching
SLIDE 1

Less is More: Accelerating Deep Neural Networks with Micro-Batching

Yosuke Oyama¹﹐ᵃ, Tal Ben-Nun², Torsten Hoefler², Satoshi Matsuoka¹
1) Tokyo Institute of Technology  2) ETH Zurich
a) oyama.y.aa@m.titech.ac.jp (presenter)
2017/12/19


SLIDE 2

Background: cuDNN Convolution

• NVIDIA cuDNN: a deep learning kernel library for NVIDIA GPUs
  • Adopted by most deep learning frameworks
  • Contains multiple convolution algorithms for CNNs: GEMM, direct, FFT, Winograd, …
  • Most algorithms use a workspace: a buffer in GPU memory that stores intermediate data (see the query sketch below)
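For illustration, this is how a framework can ask cuDNN for each forward algorithm's workspace demand before picking one. A minimal sketch, assuming cuDNN 7's CUDNN_CONVOLUTION_FWD_ALGO_COUNT and already-configured descriptors; the function name is ours, not cuDNN's.

```cpp
// Sketch: query every forward algorithm's workspace requirement (cuDNN 7+).
// Descriptors are assumed to be created and configured by the caller.
#include <cudnn.h>
#include <cstdio>

void printWorkspaceSizes(cudnnHandle_t handle,
                         cudnnTensorDescriptor_t xDesc,
                         cudnnFilterDescriptor_t wDesc,
                         cudnnConvolutionDescriptor_t convDesc,
                         cudnnTensorDescriptor_t yDesc) {
  for (int a = 0; a < CUDNN_CONVOLUTION_FWD_ALGO_COUNT; a++) {
    size_t bytes = 0;
    cudnnStatus_t st = cudnnGetConvolutionForwardWorkspaceSize(
        handle, xDesc, wDesc, convDesc, yDesc,
        (cudnnConvolutionFwdAlgo_t)a, &bytes);
    // Some algorithms are unsupported for a given shape; skip those.
    if (st == CUDNN_STATUS_SUCCESS)
      std::printf("algo %d: %zu MiB workspace\n", a, bytes >> 20);
  }
}
```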


2D convolution (forward): Y[n, k, h, w] = Σ_{c,u,v} W[k, c, u, v] · X[n, c, h+u, w+v]
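As a reference point, here is a direct loop-nest rendering of that formula. A minimal sketch assuming a "valid" convolution with unit stride, no padding, and NCHW layout; all names are illustrative.

```cpp
// Direct 2D forward convolution: Y[n,k,h,w] = sum_{c,u,v} W[k,c,u,v] * X[n,c,h+u,w+v].
// Assumes unit stride, no padding, NCHW layout; input is (H+U-1) x (W+V-1) per channel.
#include <vector>

void conv2d_forward(const std::vector<float>& X,   // input,  N x C x (H+U-1) x (W+V-1)
                    const std::vector<float>& Wt,  // filter, K x C x U x V
                    std::vector<float>& Y,         // output, N x K x H x W
                    int N, int C, int K, int H, int W, int U, int V) {
  const int HX = H + U - 1, WX = W + V - 1;  // input spatial dims
  for (int n = 0; n < N; n++)
    for (int k = 0; k < K; k++)
      for (int h = 0; h < H; h++)
        for (int w = 0; w < W; w++) {
          float acc = 0.0f;
          for (int c = 0; c < C; c++)
            for (int u = 0; u < U; u++)
              for (int v = 0; v < V; v++)
                acc += Wt[((k * C + c) * U + u) * V + v] *
                       X[((n * C + c) * HX + (h + u)) * WX + (w + v)];
          Y[((n * K + k) * H + h) * W + w] = acc;
        }
}
```

Note the outermost loop runs over the batch dimension n, which is why a mini-batch splits cleanly into independent micro-batches.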

SLIDE 3-5

Background: cuDNN Convolution

• Concern: there are considerable performance gaps (w.r.t. time and workspace size) among convolution algorithms
  • e.g., an inappropriate workspace limit on AlexNet leads to a ~4.51x slowdown

[Figure: execution time [ms] vs. workspace size [MiB] of AlexNet conv2 (forward); mini-batch size 256, NVIDIA Tesla P100-SXM2, cuDNN 7.0. Algorithms: 0 IMPLICIT_GEMM, 1 IMPLICIT_PRECOMP_GEMM, 2 GEMM, 3 DIRECT, 4 FFT, 5 FFT_TILING, 6 WINOGRAD, 7 WINOGRAD_NONFUSED. If the workspace limit is ≥ 323 MiB the fastest algorithm fits; if it is < 323 MiB the best remaining algorithm is ~4.51x slower.]

SLIDE 6-7

Background: cuDNN Convolution

• Observation: less batch size yields more attainable performance
  • Faster algorithms can be enabled by dividing the mini-batch

[Figure: computational performance (images/ms) and workspace size [MiB] vs. batch size for FFT_TILING on AlexNet conv2 (forward). A reduced batch size attains 93% of the performance with 58% of the workspace.]

SLIDE 8

Approach and Contribution

• Approach: µ-cuDNN, a wrapper library for cuDNN
  • µ-cuDNN divides one mini-batch into finer-grained batches (“micro-batches”) for cuDNN convolutions
  • µ-cuDNN optimizes micro-batch sizes and algorithms using dynamic programming (DP) and integer linear programming (ILP)

• Contribution: on an NVIDIA Tesla P100-SXM2 GPU, µ-cuDNN achieves
  • up to 2.33x speedup for a single convolution
  • up to 1.63x speedup for the convolutions of a CNN

SLIDE 9

Proposal: µ-cuDNN

• µ-cuDNN: a transparent C++ wrapper for cuDNN
  • It is installed by replacing cudnnHandle_t with UcudnnHandle_t in deep learning frameworks; e.g., Caffe requires a 3-line modification
  • It overloads some cuDNN functions: cudnnConvolution* calls are internally divided into multiple convolutions, and all other functions are delegated to cuDNN itself (see the sketch below)
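A minimal sketch of the idea behind the overloaded forward call, assuming float NCHW tensors and an already-optimized list of (algorithm, micro-batch size) pairs. This is illustrative, not µ-cuDNN's actual implementation; MicroConfig and microBatchedConvForward are our names.

```cpp
// Sketch: run one forward convolution as a sequence of micro-batches, each with
// its own algorithm. Assumes float NCHW tensors, micro-batch sizes summing to N,
// and a workspace large enough for every chosen algorithm.
#include <cudnn.h>
#include <vector>

struct MicroConfig { cudnnConvolutionFwdAlgo_t algo; int batch; };

void microBatchedConvForward(
    cudnnHandle_t handle, const std::vector<MicroConfig>& config,
    int C, int H, int W,             // input channels / height / width
    int K, int HO, int WO,           // output channels / height / width
    cudnnFilterDescriptor_t wDesc, const float* w,
    cudnnConvolutionDescriptor_t convDesc,
    void* workspace, size_t workspaceSize,
    const float* x, float* y) {
  const float one = 1.0f, zero = 0.0f;
  cudnnTensorDescriptor_t xDesc, yDesc;
  cudnnCreateTensorDescriptor(&xDesc);
  cudnnCreateTensorDescriptor(&yDesc);
  size_t xOff = 0, yOff = 0;
  for (const MicroConfig& mc : config) {
    // Re-describe the tensors with the micro-batch size...
    cudnnSetTensor4dDescriptor(xDesc, CUDNN_TENSOR_NCHW, CUDNN_DATA_FLOAT,
                               mc.batch, C, H, W);
    cudnnSetTensor4dDescriptor(yDesc, CUDNN_TENSOR_NCHW, CUDNN_DATA_FLOAT,
                               mc.batch, K, HO, WO);
    // ...and convolve that slice with its own (possibly faster) algorithm.
    cudnnConvolutionForward(handle, &one, xDesc, x + xOff, wDesc, w, convDesc,
                            mc.algo, workspace, workspaceSize, &zero,
                            yDesc, y + yOff);
    xOff += (size_t)mc.batch * C * H * W;
    yOff += (size_t)mc.batch * K * HO * WO;
  }
  cudnnDestroyTensorDescriptor(xDesc);
  cudnnDestroyTensorDescriptor(yDesc);
}
```

Because convolution is independent across the batch dimension, the concatenated micro-batch outputs are bitwise-equivalent in shape and semantics to the undivided call.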

SLIDE 10

Proposal: Workspace policies of µ-cuDNN

• µ-cuDNN supports two different workspace policies
  • Workspace Reuse (WR): each layer reuses its own private workspace; the total workspace size is O(#layers)
  • Workspace Division (WD): each layer takes part of one unified workspace; the total workspace size is constant

                                 Workspace Reuse (WR)                  Workspace Division (WD)
  Total workspace size           up to [WS limit/layer] * [#layers]    [total WS limit]
  µ-batch division optimized by  DP                                    DP + ILP
  Workspace managed by           DL frameworks                         µ-cuDNN
  WS limit passed via            the cuDNN interface                   an environment variable

SLIDE 11

Proposal: WR using Dynamic Programming

• Problem: given a mini-batch size B and the fastest execution time Tµ(b) for b = 1, 2, …, B, compute

    T(B) = min{ Tµ(B),  min_{b=1,…,B−1} ( T(b) + T(B−b) ) }

  and recover the corresponding mini-batch division (a “configuration” in this work)

[Diagram: T(256) laid out on the time axis as four conv1 micro-batches of 60 plus one of 16; T(60) = Tµ(60), and the remainder is T(196).]

SLIDE 12

Proposal: WR using Dynamic Programming

1. For each b ∈ B_policy(B), benchmark the fastest execution time Tµ(b) and its micro-configuration cµ(b) = (a, b), where a is the algorithm ID and b is the micro-batch size
   • Tµ(b) and a are obtained via cudnnFindConvolution*Algorithm
   • B_all(B) = {1, 2, …, B};  B_powerOfTwo(B) = {2⁰, 2¹, …, B};  B_undivided(B) = {B}
2. For b = 1, 2, …, B, compute
   (b̂₁, b̂₂) = argmin_{b₁+b₂=b} { Tµ(b₁) + T(b₂) },   T(b) = Tµ(b̂₁) + T(b̂₂),   c(b) = {cµ(b̂₁)} ∪ c(b̂₂)
3. Output the configuration (a list of micro-configurations) c(B); a compact sketch follows
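A compact sketch of steps 1-3 for the all policy. The benchmark callback (standing in for cudnnFindConvolution*Algorithm) and all names are illustrative assumptions, not µ-cuDNN's actual code.

```cpp
// DP sketch for the WR policy with B_all. benchmark(b) is assumed to return the
// fastest time T_mu(b) and its algorithm ID for micro-batch size b.
#include <functional>
#include <utility>
#include <vector>

struct MicroConfig { int algo; int batch; };

std::vector<MicroConfig> optimizeWR(
    int B, const std::function<std::pair<double, int>(int)>& benchmark) {
  const double INF = 1e300;
  std::vector<double> Tmu(B + 1), T(B + 1, INF);
  std::vector<int> algo(B + 1), split(B + 1);   // split[b] = chosen b1
  for (int b = 1; b <= B; b++)
    std::tie(Tmu[b], algo[b]) = benchmark(b);   // step 1: T_mu(b), c_mu(b)
  T[0] = 0.0;
  for (int b = 1; b <= B; b++)                  // step 2: DP recurrence
    for (int b1 = 1; b1 <= b; b1++) {
      double t = Tmu[b1] + T[b - b1];           // T_mu(b1) + T(b2); b1 = b is undivided
      if (t < T[b]) { T[b] = t; split[b] = b1; }
    }
  std::vector<MicroConfig> config;              // step 3: recover c(B)
  for (int b = B; b > 0; b -= split[b])
    config.push_back({algo[split[b]], split[b]});
  return config;
}
```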

SLIDE 13

Proposal: WR using Dynamic Programming

Final configuration: c(256) = {(4, 60), (4, 60), (4, 60), (4, 60), (0, 16)}

Recovered step by step along the time axis:
  T(256) = Tµ(60) + T(196), with cµ(60) = (4, 60)
  c(196) = {(4, 60), (4, 60), (4, 60), (0, 16)}
  T(196) = Tµ(60) + T(136), with cµ(60) = (4, 60)
  c(136) = {(4, 60), (4, 60), (0, 16)}
  …

[Diagram: T(256) shown as four conv1 micro-batches with cµ = (4, 60) followed by one with cµ = (0, 16).]

SLIDE 14

Proposal: WD using Integer LP

• Problem:
  • M: total workspace size
  • K: a set of convolution kernels
  • C_k: a set of configurations for kernel k
  • T_k(c), M_k(c): execution time and workspace size for configuration c
  • x_{k,c} = 1 iff configuration c is selected for kernel k

  min   T = Σ_{k∈K} Σ_{c∈C_k} T_k(c) · x_{k,c}               (total execution time)
  s.t.  Σ_{k∈K} Σ_{c∈C_k} M_k(c) · x_{k,c} ≤ M               (total workspace size must not exceed M)
        Σ_{c∈C_k} x_{k,c} = 1   (∀k ∈ K)                     (exactly one configuration per kernel)
        x_{k,c} ∈ {0, 1}        (∀k ∈ K, ∀c ∈ C_k)

SLIDE 15

Proposal: WD using Integer LP

[Diagram: kernels conv1, conv2, …, conv_k each select exactly one configuration (x_{1,u} = 1, x_{2,v} = 1, …, x_{k,c} = 1); the selected configurations' workspace sizes M_k(·) stack up to the total limit M while the total time T is minimized.]

SLIDE 16

Proposal: WD using Integer LP

• Challenge: how to enumerate a practical number of configurations (i.e., 0-1 variables) for each kernel
  • The total number of configurations grows exponentially with B: Ω(#algorithms^B)

• Solution: prune “undesirable” configurations
  • Definition: a configuration c is desirable in a set C iff no c′ ∈ C satisfies both T(c′) < T(c) and M(c′) < M(c)
  • i.e., c is undesirable if some c′ is simultaneously faster and requires less memory; only the Pareto front in the (time, memory) plane survives (see the pruning sketch below)
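A small sketch of this pruning under the definition above; the Config struct and field names are illustrative assumptions. Sorting by time and sweeping once keeps exactly the non-dominated configurations.

```cpp
// Sketch: keep only the "desirable" (Pareto-optimal) configurations over
// (execution time, workspace size). Config fields are illustrative.
#include <algorithm>
#include <cstdint>
#include <vector>

struct Config { double time; size_t memory; /* micro-batch list, algos, ... */ };

std::vector<Config> desirable(std::vector<Config> C) {
  // Sort by time; among equal times, prefer the smaller workspace.
  std::sort(C.begin(), C.end(), [](const Config& a, const Config& b) {
    return a.time != b.time ? a.time < b.time : a.memory < b.memory;
  });
  std::vector<Config> D;
  size_t bestMem = SIZE_MAX;
  for (const Config& c : C) {
    // Every already-kept config is at least as fast, so c survives only if it
    // strictly reduces the workspace size.
    if (c.memory < bestMem) { D.push_back(c); bestMem = c.memory; }
  }
  return D;
}
```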

SLIDE 17

Proposal: WD using Integer LP

1. For each convolution kernel, enumerate all “desirable” configurations using the DP
   • The pruning D(C) = {c ∈ C | ∄ c′ ∈ C: T(c′) < T(c) ∧ M(c′) < M(c)} is applied at each DP iteration
2. Pass the output (the configurations) to the ILP problem
3. Solve the ILP; µ-cuDNN uses the GNU Linear Programming Kit (GLPK) as its solver (a sketch follows)
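Below is a minimal sketch of how the 0-1 ILP from slide 14 could be fed to GLPK's C API. The Config struct and solveWD function are illustrative assumptions, not µ-cuDNN's actual code.

```cpp
// Sketch: solve the 0-1 ILP with GLPK. One binary variable per (kernel,
// configuration); row 1 bounds total workspace, rows 2..|K|+1 force exactly
// one configuration per kernel.
#include <glpk.h>
#include <vector>

struct Config { double time; double memory; };  // illustrative

std::vector<int> solveWD(const std::vector<std::vector<Config>>& C, double M) {
  int nvars = 0;
  for (const auto& Ck : C) nvars += (int)Ck.size();

  glp_prob* lp = glp_create_prob();
  glp_set_obj_dir(lp, GLP_MIN);
  glp_add_rows(lp, 1 + (int)C.size());
  glp_set_row_bnds(lp, 1, GLP_UP, 0.0, M);             // total workspace <= M
  for (int k = 0; k < (int)C.size(); k++)
    glp_set_row_bnds(lp, 2 + k, GLP_FX, 1.0, 1.0);     // one config per kernel
  glp_add_cols(lp, nvars);

  std::vector<int> ia(1), ja(1); std::vector<double> ar(1);  // 1-based arrays
  int j = 0;
  for (int k = 0; k < (int)C.size(); k++)
    for (const Config& c : C[k]) {
      ++j;
      glp_set_col_kind(lp, j, GLP_BV);                 // binary x_{k,c}
      glp_set_obj_coef(lp, j, c.time);                 // objective: total time
      ia.push_back(1);     ja.push_back(j); ar.push_back(c.memory);
      ia.push_back(2 + k); ja.push_back(j); ar.push_back(1.0);
    }
  glp_load_matrix(lp, (int)ia.size() - 1, ia.data(), ja.data(), ar.data());

  glp_iocp parm; glp_init_iocp(&parm);
  parm.presolve = GLP_ON;                              // no prior glp_simplex needed
  glp_intopt(lp, &parm);

  std::vector<int> chosen(C.size());                   // chosen config index per kernel
  j = 0;
  for (int k = 0; k < (int)C.size(); k++)
    for (int i = 0; i < (int)C[k].size(); i++)
      if (glp_mip_col_val(lp, ++j) > 0.5) chosen[k] = i;
  glp_delete_prob(lp);
  return chosen;
}
```

With the pruning of slide 16, each kernel contributes only its Pareto-optimal configurations, which keeps the number of binary variables small enough for GLPK to solve quickly.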

SLIDE 18

Evaluation

• Software: Caffe 1.0, cuDNN 6.0, CUDA 8.0
  • All CNN tensors are stored in float, NCHW format
  • The workspace size limit is set to 8, 64, or 512 MiB

• GPUs:
  • NVIDIA Tesla P100-SXM2 @ TSUBAME 3.0: 10.6 SP TFlop/s, 16 GiB HBM2 memory, 732 GiB/s bandwidth
  • NVIDIA Tesla K80 @ TSUBAME-KFC/DL: 8.73 SP TFlop/s, 24 GiB GDDR5 memory, 480 GiB/s bandwidth

SLIDE 19

Evaluation: WR using Dynamic Programming

• µ-cuDNN achieved a 2.33x speedup on the forward convolution of AlexNet conv2

[Figure: cudnnConvolutionForward time of AlexNet conv2 on NVIDIA Tesla P100-SXM2 under the undivided, powerOfTwo, and all policies; workspace size of 64 MiB, mini-batch size 256. Numbers on the bars are micro-batch sizes (256 undivided; 32s and 48s when divided); algorithms used: IMPLICIT_PRECOMP_GEMM, FFT_TILING, WINOGRAD_NONFUSED.]

SLIDE 20

Evaluation: WR using Dynamic Programming

• µ-cuDNN achieved a 1.40x speedup w.r.t. the whole training iteration (or 1.63x w.r.t. convolutions alone) of AlexNet on the P100-SXM2
  • The speedup was nearly 1.81x (or 2.10x) on the K80

[Figure: benchmark result of AlexNet on NVIDIA Tesla P100-SXM2; workspace sizes of 8, 64, and 512 MiB, mini-batch size 256; policies u (undivided), p (powerOfTwo), a (all); execution time broken down into conv1 … conv5 and other layers.]

SLIDE 21

Evaluation: WR using Dynamic Programming

• µ-cuDNN achieved a 1.11x speedup w.r.t. the whole training iteration (or 1.21x w.r.t. convolutions alone) of ResNet-18 on the P100-SXM2

[Figure: benchmark result of ResNet-18 on NVIDIA Tesla P100-SXM2; workspace sizes of 8, 64, and 512 MiB, mini-batch size 128; execution time broken down into convolutional layers and other layers.]

SLIDE 22

Evaluation: WD using Integer LP

[Figure: the desirable configuration set of AlexNet conv2 (forward) in the (execution time [ms], workspace size [MiB]) plane; mini-batch size 256, P100-SXM2. Each bar shows the proportion of micro-batch sizes and algorithms (IMPLICIT_GEMM, IMPLICIT_PRECOMP_GEMM, GEMM, FFT, FFT_TILING, WINOGRAD_NONFUSED).]

SLIDE 23

Evaluation: WD using Integer LP

• The ILP-based algorithm utilizes the workspace nearly in full
  • Performance-sensitive kernels (conv2, conv3) aggressively occupy the workspace

[Figure: breakdown of workspace sizes of AlexNet under WR (undivided, powerOfTwo, all) and WD; mini-batch size 256, total workspace limit of 120 MiB, P100-SXM2; per-kernel bars for conv1 … conv5, forward and backward passes, against the 120 MiB limit.]

SLIDE 24

Evaluation: WD using Integer LP

• WD outperforms WR under the same total workspace limit
  • WD even beats WR given an 8x larger total memory size (1.24x and 1.38x speedups)

[Figure: benchmark result of AlexNet on NVIDIA Tesla P100-SXM2; WR with 8, 64, and 512 MiB per kernel vs. WD with total budgets of 120, 960, and 7680 MiB; mini-batch size 256; execution time broken down into conv1 … conv5 and other layers.]

SLIDE 25

Evaluation: WD using Integer LP

• WD works even for a fine-grained CNN (ResNet-50), with 1.05x and 1.14x speedups

[Figure: benchmark result of ResNet-50 on NVIDIA Tesla P100-SXM2; WR with 8, 16, and 32 MiB per kernel vs. WD with total budgets of 1272, 2544, and 5088 MiB; mini-batch size 32.]

SLIDE 26

Conclusion

• µ-cuDNN maximizes the performance of cuDNN
  • cuDNN is fast, but DL frameworks cannot utilize it well

• µ-cuDNN is a “free lunch” thanks to its mathematical foundation
  • It is conceptually equivalent to cuDNN
  • If a division would be slower than the original cuDNN call, it simply uses cuDNN
  • The maximum memory consumption can be controlled

• µ-cuDNN brings
  • up to 2.33x speedup for a single convolution
  • up to 1.63x speedup for the convolutions of a CNN