Less is More: Accelerating Deep Neural Networks with Micro-Batching


  1. Less is More: Accelerating Deep Neural Networks with Micro-Batching. Yosuke Oyama 1,a, Tal Ben-Nun 2, Torsten Hoefler 2, Satoshi Matsuoka 1. 1) Tokyo Institute of Technology; 2) ETH Zurich. a) oyama.y.aa@m.titech.ac.jp (presenter). 2017/12/19

  2. Background: cuDNN Convolution
     • NVIDIA cuDNN: a deep learning kernel library for NVIDIA GPUs, adopted by most deep learning frameworks
     • Contains multiple convolution algorithms for CNNs: GEMM, direct, FFT, Winograd, ...
     • Most algorithms use a workspace: a buffer in GPU memory that stores intermediate data (a query sketch follows below)
     • 2D convolution (forward): Y[n,k,h,w] = Σ_{c,u,v} W[k,c,u,v] · X[n,c,h+u,w+v], with mini-batch size N, input channels C, output channels K, image size H×W, and filter size U×V
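For concreteness, the workspace mentioned above is sized per algorithm through cuDNN's query API. The following is a minimal sketch, assuming descriptors are already configured; workspaceBytesFor is an invented helper name, not part of cuDNN or the talk.

```cpp
// Minimal sketch: ask cuDNN how much workspace a given forward-convolution
// algorithm needs. Different algorithms (GEMM, FFT, Winograd, ...) report
// very different sizes; those gaps are what the next slides quantify.
#include <cudnn.h>

size_t workspaceBytesFor(cudnnHandle_t handle,
                         cudnnTensorDescriptor_t xDesc,
                         cudnnFilterDescriptor_t wDesc,
                         cudnnConvolutionDescriptor_t convDesc,
                         cudnnTensorDescriptor_t yDesc,
                         cudnnConvolutionFwdAlgo_t algo) {
  size_t bytes = 0;
  cudnnGetConvolutionForwardWorkspaceSize(handle, xDesc, wDesc, convDesc,
                                          yDesc, algo, &bytes);
  return bytes;  // the caller allocates this buffer before the convolution
}
```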

  3. Background: cuDNN Convolution
     • Concern: there are considerable performance gaps (in both time and workspace size) among the convolution algorithms
     • e.g. an inappropriate workspace limit on AlexNet leads to a ~4.51x slowdown
     [Figure: execution time [ms] vs. workspace size [MiB] for AlexNet conv2 (forward); algorithms: 0 IMPLICIT_GEMM, 1 IMPLICIT_PRECOMP_GEMM, 2 GEMM, 3 DIRECT, 4 FFT, 5 FFT_TILING, 6 WINOGRAD, 7 WINOGRAD_NONFUSED; mini-batch size 256, NVIDIA Tesla P100-SXM2, cuDNN 7.0]

  4. Background: cuDNN Convolution
     • (Build of the previous figure) The fastest available algorithm changes depending on whether the workspace limit is below 323 MiB or at least 323 MiB

  5. Background: cuDNN Convolution
     • (Build of the previous figure) Between those two regimes, the execution-time gap is the 4.51x slowdown quoted above

  6. Background: cuDNN Convolution
     • Observation: a smaller ("less") batch size makes more algorithms executable; faster algorithms can be enabled by dividing the mini-batch
     [Figure: computation performance (images/time [1/ms]) and workspace size [MiB] of FFT_TILING vs. batch size, for AlexNet conv2 (forward)]

  7. Background: cuDNN Convolution
     • (Build of the previous figure) A well-chosen smaller batch size attains 93% of the peak performance with only 58% of the workspace

  8. Approach and Contribution
     • Approach: µ-cuDNN, a wrapper library for cuDNN
       • µ-cuDNN divides one mini-batch into finer-grained batches ("micro-batches") for cuDNN convolution
       • µ-cuDNN optimizes micro-batch sizes and algorithms using dynamic programming (DP) and integer linear programming (ILP)
     • Contribution: on an NVIDIA Tesla P100-SXM2 GPU, µ-cuDNN achieves
       • up to 2.33x speedup for a single convolution
       • up to 1.63x speedup for the convolutions of a CNN

  9. Proposal: µ-cuDNN
     • µ-cuDNN: a transparent C++ wrapper for cuDNN
     • Installed by replacing cudnnHandle_t with UcudnnHandle_t in a deep learning framework
       • e.g. Caffe requires only 3 lines of modification
     • Overloads some of the cuDNN functions (see the sketch below)
       • Internally divides cudnnConvolution* calls into multiple convolutions
       • Delegates all other functions to cuDNN itself
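The "internally divides cudnnConvolution* into multiple convolutions" step can be pictured with the sketch below. This is an illustrative reimplementation, not µ-cuDNN's source: MicroConfig and convolutionForwardMicroBatched are invented names, and the code assumes float tensors in NCHW format (the layout used in the evaluation), so micro-batches are contiguous slices along the N dimension.

```cpp
// Sketch: split one forward convolution into micro-batches along N,
// running each micro-batch with its own algorithm.
#include <cudnn.h>
#include <vector>

struct MicroConfig {
  cudnnConvolutionFwdAlgo_t algo;  // algorithm for this micro-batch
  int batch;                       // micro-batch size; sizes sum to N
};

cudnnStatus_t convolutionForwardMicroBatched(
    cudnnHandle_t handle, const void* alpha,
    cudnnTensorDescriptor_t xDesc, const float* x,
    cudnnFilterDescriptor_t wDesc, const void* wData,
    cudnnConvolutionDescriptor_t convDesc,
    const std::vector<MicroConfig>& config,  // a "configuration"
    void* workspace, size_t workspaceBytes, const void* beta,
    cudnnTensorDescriptor_t yDesc, float* y) {
  // Read the full tensor shapes so per-micro-batch descriptors can be built.
  cudnnDataType_t dt;
  int n, c, h, w, yn, yc, yh, yw, s;  // s: strides, ignored here
  cudnnGetTensor4dDescriptor(xDesc, &dt, &n, &c, &h, &w, &s, &s, &s, &s);
  cudnnGetTensor4dDescriptor(yDesc, &dt, &yn, &yc, &yh, &yw, &s, &s, &s, &s);

  cudnnTensorDescriptor_t xSub, ySub;  // only the N dimension changes
  cudnnCreateTensorDescriptor(&xSub);
  cudnnCreateTensorDescriptor(&ySub);

  int done = 0;  // images processed so far
  for (const MicroConfig& mc : config) {
    cudnnSetTensor4dDescriptor(xSub, CUDNN_TENSOR_NCHW, dt, mc.batch, c, h, w);
    cudnnSetTensor4dDescriptor(ySub, CUDNN_TENSOR_NCHW, dt, mc.batch, yc, yh, yw);
    // In NCHW, image i starts at element i*C*H*W, so each micro-batch is a
    // contiguous slice of the input and output buffers.
    cudnnStatus_t st = cudnnConvolutionForward(
        handle, alpha, xSub, x + (size_t)done * c * h * w,
        wDesc, wData, convDesc, mc.algo, workspace, workspaceBytes,
        beta, ySub, y + (size_t)done * yc * yh * yw);
    if (st != CUDNN_STATUS_SUCCESS) return st;
    done += mc.batch;
  }
  cudnnDestroyTensorDescriptor(xSub);
  cudnnDestroyTensorDescriptor(ySub);
  return CUDNN_STATUS_SUCCESS;
}
```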

  10. Proposal: Workspace policies of µ-cuDNN
     • µ-cuDNN supports two different workspace policies:
     • Workspace Reuse (WR): each layer reuses its own private workspace; the total workspace size is O(#layers)
     • Workspace Division (WD): all layers divide one unified workspace and each uses a part of it; the total workspace size is constant

                                      | Workspace Reuse (WR)                 | Workspace Division (WD)
       Total WS size                  | up to [WS limit/layer] × [#layers]   | [total WS limit]
       µ-batch division optimized by  | DP                                   | DP + ILP
       WS managed by                  | DL frameworks                        | µ-cuDNN
       WS limit passed via            | the cuDNN interface                  | an environment variable

  11. Proposal: WR using Dynamic Programming
     • Problem: given a mini-batch size B and the fastest micro-batch execution time T_µ(b) for b = 1, 2, ..., B, compute
           T(B) = min( T_µ(B), min_{b = 1, ..., B-1} [ T(b) + T(B - b) ] )
       and recover the corresponding mini-batch division (a "configuration" in this work)
     [Figure: a mini-batch of 256 for conv1 divided as 60 + 60 + 60 + 60 + 16; T(256) decomposes as T(60) + T(196), with T(60) = T_µ(60)]

  12. Proposal: WR using Dynamic Programming
     1. For each b ∈ B_policy(B), benchmark the fastest execution time T_µ(b) and its micro-configuration c_µ(b) = (a, b), where a is the algorithm ID and B is the mini-batch size
        • T_µ(b) and a are obtained with cudnnFindConvolution*Algorithm
        • B_all(B) = {1, 2, ..., B}; B_powerOfTwo(B) = {2^0, 2^1, ..., B}; B_undivided(B) = {B}
     2. For b = 1, 2, ..., B, compute
           (b̂_1, b̂_2) = argmin_{b_1 + b_2 = b} { T_µ(b_1) + T(b_2) }
           T(b) = T_µ(b̂_1) + T(b̂_2)
           c(b) = {c_µ(b̂_1)} ∪ c(b̂_2)
     3. Output the configuration (a list of micro-configurations) c(B); a code sketch of this DP follows below
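To make step 2 concrete, here is a standalone C++ sketch of the dynamic program, written from the slide's description rather than taken from µ-cuDNN. Assumptions: Plan, optimizeWR, and timeMu are invented names; timeMu[b] holds the benchmarked T_µ(b) (infinity for sizes outside B_policy); and only micro-batch sizes are tracked, eliding the algorithm IDs that complete each micro-configuration.

```cpp
// DP over mini-batch size, in the equivalent "rod-cutting" form of the
// slide's recurrence: T(b) = min over b1 of T_mu(b1) + T(b - b1), T(0) = 0.
#include <limits>
#include <vector>

struct Plan {
  double time;             // T(b)
  std::vector<int> split;  // micro-batch sizes summing to b
};

// timeMu must have size B+1; timeMu[b] = T_mu(b), or infinity if size b
// is not in the candidate set B_policy.
Plan optimizeWR(const std::vector<double>& timeMu, int B) {
  const double inf = std::numeric_limits<double>::infinity();
  std::vector<Plan> best(B + 1, Plan{inf, {}});
  best[0] = Plan{0.0, {}};
  for (int b = 1; b <= B; ++b) {
    // Try every size b1 for the last micro-batch; the remaining b - b1
    // images are already optimally divided by a smaller subproblem.
    for (int b1 = 1; b1 <= b; ++b1) {
      if (timeMu[b1] == inf || best[b - b1].time == inf) continue;
      double t = timeMu[b1] + best[b - b1].time;
      if (t < best[b].time) {
        best[b].time = t;
        best[b].split = best[b - b1].split;
        best[b].split.push_back(b1);
      }
    }
  }
  return best[B];  // e.g. for B=256 this might yield {60, 60, 60, 60, 16}
}
```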

  13. Proposal: WR using Dynamic Programming
     • Example for conv1 with B = 256: c(256) = {(4, 60), (4, 60), (4, 60), (4, 60), (0, 16)}, built bottom-up:
       c(136) = {(4, 60), (4, 60), (0, 16)}
       c(196) = {c_µ(60)} ∪ c(136) = {(4, 60), (4, 60), (4, 60), (0, 16)}, i.e. T(196) = T_µ(60) + T(136)
       c(256) = {c_µ(60)} ∪ c(196), i.e. T(256) = T_µ(60) + T(196), with c_µ(60) = (4, 60)

  14. Proposal: WD using Integer LP
     • Problem:
           min   T = Σ_{k ∈ K} Σ_{c ∈ C_k} T_k(c) · x_{k,c}             (total execution time)
           s.t.  Σ_{k ∈ K} Σ_{c ∈ C_k} M_k(c) · x_{k,c} ≤ M             (total workspace size must not exceed M)
                 Σ_{c ∈ C_k} x_{k,c} = 1   (∀ k ∈ K)                    (exactly one configuration is selected per kernel)
                 x_{k,c} ∈ {0, 1}          (∀ k ∈ K, ∀ c ∈ C_k)         (x_{k,c} = 1 ⟺ configuration c is selected for kernel k)
     • M: total workspace size limit; K: the set of convolution kernels; C_k: the set of configurations for kernel k; T_k(c), M_k(c): execution time and workspace size of configuration c

  15. Proposal: WD using Integer LP
     [Figure: illustration of the ILP; each kernel conv_k selects exactly one configuration (x_{k,c} = 1 for one c ∈ C_k), the selected workspace sizes M_k(c) stack up against the total limit M, and the objective minimizes the total time T]

  16. Proposal: WD using Integer LP
     • Challenge: enumerating a practical number of configurations (i.e. 0-1 variables) per kernel; the total number of configurations is Ω(#algorithms^B)
     • Solution: prune "undesirable" configurations
     • Definition: a configuration c is desirable in a set C ⟺ there is no c' ∈ C with T(c') < T(c) and M(c') < M(c); otherwise some c' is both faster and needs less memory, so c is undesirable and can be dropped (a sketch of this filter follows below)
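A minimal sketch of this dominance filter, under assumptions: Config and desirable are invented names, and the quadratic scan is acceptable because the pruning runs offline, per kernel, on small candidate sets.

```cpp
// Keep only configurations that are not strictly dominated in both
// execution time and workspace size (the slide's "desirable" set D(C)).
#include <vector>

struct Config {
  double time;    // T(c)
  double memory;  // M(c)
};

std::vector<Config> desirable(const std::vector<Config>& C) {
  std::vector<Config> D;
  for (const Config& c : C) {
    bool dominated = false;
    for (const Config& o : C)
      if (o.time < c.time && o.memory < c.memory) { dominated = true; break; }
    if (!dominated) D.push_back(c);  // c survives the pruning
  }
  return D;
}
```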

  17. Proposal: WD using Integer LP
     1. For each convolution kernel, list all "desirable" configurations using the DP
        • The pruning D(C) = {c ∈ C | ∄ c' ∈ C, T(c') < T(c) ∧ M(c') < M(c)} is applied at every iteration
     2. Pass the output (the configuration lists) to the ILP
     3. Solve the ILP
        • µ-cuDNN uses the GNU Linear Programming Kit (GLPK) as its solver; a sketch of this step follows below
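Slide 14's program maps directly onto the GLPK C API named here. The following is a hedged sketch, not µ-cuDNN's actual code: Cand, solveWD, and the flattened candidate list are invented for illustration, while the glp_* calls are GLPK's real interface. Variables are flattened so that binary column j selects candidate cand[j] for kernel cand[j].kernel.

```cpp
// Solve the WD selection problem as a 0-1 ILP with GLPK.
#include <glpk.h>
#include <vector>

struct Cand { int kernel; double time; double memory; };

// Returns, per kernel, the index into `cand` of the chosen configuration.
std::vector<int> solveWD(const std::vector<Cand>& cand, int numKernels,
                         double wsLimit) {
  glp_prob* lp = glp_create_prob();
  glp_set_obj_dir(lp, GLP_MIN);
  // Row 1: total workspace <= wsLimit; rows 2..K+1: one config per kernel.
  glp_add_rows(lp, 1 + numKernels);
  glp_set_row_bnds(lp, 1, GLP_UP, 0.0, wsLimit);
  for (int k = 0; k < numKernels; ++k)
    glp_set_row_bnds(lp, 2 + k, GLP_FX, 1.0, 1.0);
  glp_add_cols(lp, (int)cand.size());
  // GLPK uses 1-based arrays, hence the dummy element at index 0.
  std::vector<int> ia(1), ja(1);
  std::vector<double> ar(1);
  for (int j = 0; j < (int)cand.size(); ++j) {
    glp_set_col_kind(lp, j + 1, GLP_BV);        // x_{k,c} in {0,1}
    glp_set_obj_coef(lp, j + 1, cand[j].time);  // minimize total time
    ia.push_back(1); ja.push_back(j + 1); ar.push_back(cand[j].memory);
    ia.push_back(2 + cand[j].kernel); ja.push_back(j + 1); ar.push_back(1.0);
  }
  glp_load_matrix(lp, (int)ia.size() - 1, ia.data(), ja.data(), ar.data());
  glp_iocp parm;
  glp_init_iocp(&parm);
  parm.presolve = GLP_ON;  // let the MIP solver presolve the LP relaxation
  glp_intopt(lp, &parm);
  std::vector<int> chosen(numKernels, -1);
  for (int j = 0; j < (int)cand.size(); ++j)
    if (glp_mip_col_val(lp, j + 1) > 0.5) chosen[cand[j].kernel] = j;
  glp_delete_prob(lp);
  return chosen;
}
```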

  18. Evaluation
     • Software: Caffe 1.0, cuDNN 6.0, CUDA 8.0
       • All CNN tensors are stored as float in NCHW format
       • The workspace size limit is set to 8, 64, or 512 MiB
     • GPUs:
       • NVIDIA Tesla P100-SXM2 @ TSUBAME 3.0: 10.6 SP TFlop/s, 16 GiB HBM2 memory, 732 GiB/s bandwidth
       • NVIDIA Tesla K80 @ TSUBAME-KFC/DL: 8.73 SP TFlop/s, 24 GiB GDDR5 memory, 480 GiB/s bandwidth

  19. Evaluation: WR using Dynamic Programming
     • µ-cuDNN achieved a 2.33x speedup on the forward convolution of AlexNet conv2
     [Figure: execution time [ms] of cudnnConvolutionForward for AlexNet conv2 on NVIDIA Tesla P100-SXM2 under the undivided, powerOfTwo, and all policies; workspace size 64 MiB, mini-batch size 256; stacked bars show the micro-batch sizes (256 undivided vs. mixes of 32 and 48) and their algorithms (IMPLICIT_PRECOMP_GEMM, FFT_TILING, WINOGRAD_NONFUSED)]
