Accelerating Deep Learning Frameworks with Micro-batches


  1. Accelerating Deep Learning Frameworks with Micro-batches
     Yosuke Oyama¹* (presenter, oyama.y.aa@m.titech.ac.jp), Tal Ben-Nun², Torsten Hoefler², Satoshi Matsuoka³,¹
     September 13, 2018
     ¹ Tokyo Institute of Technology  ² ETH Zurich  ³ RIKEN Center for Computational Science

  2. Background

  3. Background
     • Convolution is one of the key operations in Convolutional Neural Networks (CNNs)

     Algorithm 1: Pseudo-code of two-dimensional convolution.
     1: for (n = 0; n < N; n++)             // Mini-batch loop
     2:   for (k = 0; k < K; k++)           // Output channel loop
     3:     for (h = 0; h < H; h++)         // Height loop
     4:       for (w = 0; w < W; w++)       // Width loop
     5:         for (c = 0; c < C; c++)     // Input channel loop
     6:           for (v = 0; v < V; v++)   // Kernel height loop
     7:             for (u = 0; u < U; u++) // Kernel width loop
     8:               Y[n, k, h, w] += W[k, c, v, u] × X[n, c, h + v, w + u];

     Figure 1: 2D convolution — an H×W input X with C input channels is convolved with V×U filters W to produce an H′×W′ output Y with C′ output channels.
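The loop nest of Algorithm 1 translates almost directly into code. The following is a minimal, self-contained C++ sketch of direct 2D convolution (stride 1, no padding); the tensor sizes in main are illustrative assumptions, not values from the talk:

#include <cstdio>
#include <vector>

// Direct 2D convolution following Algorithm 1 (stride 1, no padding).
// X: N*C*H*W input, Wt: K*C*V*U filters, Y: N*K*Ho*Wo output,
// with Ho = H - V + 1 and Wo = W - U + 1.
void conv2d(const std::vector<float>& X, const std::vector<float>& Wt,
            std::vector<float>& Y,
            int N, int C, int H, int W, int K, int V, int U) {
  const int Ho = H - V + 1, Wo = W - U + 1;
  for (int n = 0; n < N; n++)                  // Mini-batch loop
    for (int k = 0; k < K; k++)                // Output channel loop
      for (int h = 0; h < Ho; h++)             // Height loop
        for (int w = 0; w < Wo; w++)           // Width loop
          for (int c = 0; c < C; c++)          // Input channel loop
            for (int v = 0; v < V; v++)        // Kernel height loop
              for (int u = 0; u < U; u++)      // Kernel width loop
                Y[((n * K + k) * Ho + h) * Wo + w] +=
                    Wt[((k * C + c) * V + v) * U + u] *
                    X[((n * C + c) * H + (h + v)) * W + (w + u)];
}

int main() {
  // Illustrative sizes (assumptions): batch 2, 3 -> 4 channels, 8x8 input, 3x3 kernel.
  const int N = 2, C = 3, H = 8, W = 8, K = 4, V = 3, U = 3;
  std::vector<float> X(N * C * H * W, 1.0f), Wt(K * C * V * U, 0.5f);
  std::vector<float> Y(N * K * (H - V + 1) * (W - U + 1), 0.0f);
  conv2d(X, Wt, Y, N, C, H, W, K, V, U);
  std::printf("Y[0] = %.1f\n", Y[0]);  // 3 channels * 3x3 taps * 0.5 = 13.5
  return 0;
}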

  4. Background
     • The NVIDIA cuDNN library provides deep learning primitives for GPUs
     • cuDNN provides several equivalent convolution algorithms: GEMM-based (via im2col), Winograd, and FFT-based
     • Each algorithm transforms X and W into a different domain (the im2col'd matrix X′, the Winograd domain via the G, Bᵀ, Aᵀ matrices, or the frequency domain via F and F⁻¹), and these intermediate buffers constitute the workspace

     Figure 2: Three different convolution algorithms (GEMM-based, Winograd, FFT-based); shaded buffers denote workspace.
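Each algorithm trades workspace memory for speed, and a framework can query that cost per algorithm before committing to one. Below is a small C++ sketch against the real cuDNN API (cudnnGetConvolutionForwardWorkspaceSize and the descriptor setters); the layer shape is an illustrative AlexNet-conv2-like assumption, and error checking is reduced to the one call whose status matters here:

#include <cudnn.h>
#include <cstdio>

// Query the workspace each forward-convolution algorithm needs for one layer.
int main() {
  cudnnHandle_t handle;
  cudnnCreate(&handle);

  cudnnTensorDescriptor_t xDesc, yDesc;
  cudnnFilterDescriptor_t wDesc;
  cudnnConvolutionDescriptor_t convDesc;
  cudnnCreateTensorDescriptor(&xDesc);
  cudnnCreateTensorDescriptor(&yDesc);
  cudnnCreateFilterDescriptor(&wDesc);
  cudnnCreateConvolutionDescriptor(&convDesc);

  // Assumed shape: N=256, C=96, 27x27 input, K=256, 5x5 kernel, pad 2, stride 1.
  cudnnSetTensor4dDescriptor(xDesc, CUDNN_TENSOR_NCHW, CUDNN_DATA_FLOAT,
                             256, 96, 27, 27);
  cudnnSetFilter4dDescriptor(wDesc, CUDNN_DATA_FLOAT, CUDNN_TENSOR_NCHW,
                             256, 96, 5, 5);
  cudnnSetConvolution2dDescriptor(convDesc, 2, 2, 1, 1, 1, 1,
                                  CUDNN_CROSS_CORRELATION, CUDNN_DATA_FLOAT);
  cudnnSetTensor4dDescriptor(yDesc, CUDNN_TENSOR_NCHW, CUDNN_DATA_FLOAT,
                             256, 256, 27, 27);  // same spatial size (pad 2)

  const cudnnConvolutionFwdAlgo_t algos[] = {
      CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM,
      CUDNN_CONVOLUTION_FWD_ALGO_WINOGRAD,
      CUDNN_CONVOLUTION_FWD_ALGO_FFT,
      CUDNN_CONVOLUTION_FWD_ALGO_FFT_TILING};
  for (auto algo : algos) {
    size_t ws = 0;
    // Returns the workspace (bytes) this algorithm needs, or an error if the
    // algorithm does not support this layer configuration.
    if (cudnnGetConvolutionForwardWorkspaceSize(handle, xDesc, wDesc, convDesc,
                                                yDesc, algo, &ws) ==
        CUDNN_STATUS_SUCCESS)
      std::printf("algo %d: %.1f MiB\n", (int)algo, ws / (1024.0 * 1024.0));
  }
  cudnnDestroy(handle);
  return 0;
}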

  5. Background
     • Problem statement: cuDNN may require a workspace as large as the network itself to use its efficient convolution algorithms!

     Figure 3: Memory consumption (bars: data, weights, workspace) and computation time (line/points) of AlexNet on P100-SXM2 with per-layer workspace limits of 8, 64, and 512 MiB. Larger limits let cuDNN replace IMPLICIT_PRECOMP_GEMM with the faster FFT, FFT_TILING, and WINOGRAD algorithms, but the workspace then rivals the data and weights in size.

  6. Background
     • Idea: loop splitting applied to the convolution's outermost loop decreases the workspace size (though it also lowers computation efficiency); a sketch of the transformation follows the listing below

     Algorithm 2: Pseudo-code of two-dimensional convolution; the outermost mini-batch loop (line 1) is the one being split.
     1: for (n = 0; n < N; n++)             // Mini-batch loop ← split into micro-batches
     2:   for (k = 0; k < K; k++)           // Output channel loop
     3:     for (h = 0; h < H; h++)         // Height loop
     4:       for (w = 0; w < W; w++)       // Width loop
     5:         for (c = 0; c < C; c++)     // Input channel loop
     6:           for (v = 0; v < V; v++)   // Kernel height loop
     7:             for (u = 0; u < U; u++) // Kernel width loop
     8:               Y[n, k, h, w] += W[k, c, v, u] × X[n, c, h + v, w + u];
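At the granularity of whole convolution calls, the split looks as follows: a C++ sketch that runs the conv2d routine from the earlier sketch once per micro-batch of at most N_mu samples, so any workspace the callee allocates scales with the micro-batch size rather than with N. The micro-batch size N_mu is an assumed parameter:

#include <algorithm>
#include <vector>

// conv2d is the direct-convolution routine from the sketch after Algorithm 1.
void conv2d(const std::vector<float>& X, const std::vector<float>& Wt,
            std::vector<float>& Y,
            int N, int C, int H, int W, int K, int V, int U);

// Mini-batch loop splitting: instead of one convolution over all N samples
// (whose workspace would scale with N), run ceil(N / N_mu) convolutions over
// micro-batches of at most N_mu samples each.
void conv2d_microbatched(const std::vector<float>& X,
                         const std::vector<float>& Wt, std::vector<float>& Y,
                         int N, int C, int H, int W, int K, int V, int U,
                         int N_mu) {
  const int Ho = H - V + 1, Wo = W - U + 1;
  for (int n0 = 0; n0 < N; n0 += N_mu) {   // Micro-batch loop
    const int b = std::min(N_mu, N - n0);  // Size of this micro-batch
    // Copy out the micro-batch (a real implementation would pass offset
    // pointers instead of copying).
    std::vector<float> Xb(X.begin() + (size_t)n0 * C * H * W,
                          X.begin() + (size_t)(n0 + b) * C * H * W);
    std::vector<float> Yb((size_t)b * K * Ho * Wo, 0.0f);
    conv2d(Xb, Wt, Yb, b, C, H, W, K, V, U);  // Workspace now scales with b
    std::copy(Yb.begin(), Yb.end(), Y.begin() + (size_t)n0 * K * Ho * Wo);
  }
}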

  7. Approach and Contribution
     • Approach: µ-cuDNN, a thin wrapper library for cuDNN, which
       • divides a mini-batch into "micro-batches" by applying loop splitting
       • is based on Dynamic Programming (DP) and Integer Linear Programming (ILP)
       • provides a Python interface for high-level optimization

     Timeline illustration: with cuDNN, conv1, relu1, pool1, and conv2 each run once over the full mini-batch (N = 256), conv1 using GEMM-based convolution; with µ-cuDNN, conv1 runs as two N = 128 micro-batches using the faster FFT-based convolution, relu1 and pool1 stay at N = 256, and conv2 runs as N = 64 micro-batches.

     • Contribution:
       • 1.60x speedup for AlexNet on a V100-SXM2 GPU
       • up to 4.54x speedup (1.60x on average) for DeepBench on a V100-SXM2 GPU

  8. µ -cuDNN

  9. µ-cuDNN - Software stack
     • µ-cuDNN is a wrapper library for cuDNN, which can be called by
       1. a DL framework, as a low-level performance tuning library
       2. its dedicated Python frontend, for high-level performance analysis

     Figure 4: µ-cuDNN software stack — user code (Python etc.) calls either a DL framework (path 1, C/C++) or the µ-cuDNN Python frontend (path 2); both sit on the µ-cuDNN C library, which wraps cuDNN and CUDA on NVIDIA GPUs and keeps a file-based DB on a (network) file system.

  10.–14. µ-cuDNN - Methodology
     • µ-cuDNN is enabled by replacing the cuDNN handle type cudnnHandle_t
     1. The DL framework passes each layer's metadata via cudnnGetConvolution*Algorithm
     2. µ-cuDNN runs the ILP (or DP) optimizer and returns the resulting workspace size
     3. The framework calls cudnnConvolution* with the workspace size
     4. µ-cuDNN internally calls the convolution function one or more times (UcudnnConvolution*); a sketch of this handle replacement follows the figure caption below

     Figure 5: Workflow of µ-cuDNN. In the framework's setup loop, cudnnGetConvolution*Algorithm(...) sends metadata to the ILP and DP optimizers and cudaMalloc(&ws[i], ...) allocates the returned workspace sizes; in the training loop, cudnnConvolution*(..., ws[i], ...) is dispatched to UcudnnConvolution*, with optimization results kept in an in-memory cache.
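One plausible reading of "replacing the handle type" is a wrapper that converts implicitly to cudnnHandle_t, so framework code compiles unchanged except for one declaration. The sketch below is hypothetical: UcudnnHandle is an assumed name (after the UcudnnConvolution* symbols in Figure 5), and the real µ-cuDNN types and internals may differ:

#include <cudnn.h>
#include <algorithm>
#include <cstdio>

// Hypothetical wrapper handle (assumed name): owns a real cuDNN handle and
// converts implicitly to cudnnHandle_t, so only the handle declaration in
// the framework changes.
struct UcudnnHandle {
  cudnnHandle_t raw = nullptr;
  UcudnnHandle() { cudnnCreate(&raw); }
  ~UcudnnHandle() { cudnnDestroy(raw); }
  operator cudnnHandle_t() const { return raw; }  // pass-through to cuDNN
};

// Schematic wrapper entry point (argument list abbreviated to the parts
// relevant to splitting): execute the convolution once per micro-batch.
cudnnStatus_t UcudnnConvolutionForwardSketch(
    UcudnnHandle& handle, int N, int microBatch /* chosen by the DP/ILP */) {
  (void)handle;  // unused in this schematic sketch
  for (int n0 = 0; n0 < N; n0 += microBatch) {
    const int b = std::min(microBatch, N - n0);
    // The real wrapper would offset the x/y pointers by n0 samples here and
    // call cudnnConvolutionForward on b-sample descriptors.
    std::printf("convolution on samples [%d, %d)\n", n0, n0 + b);
  }
  return CUDNN_STATUS_SUCCESS;
}

int main() {
  UcudnnHandle handle;  // the only framework-side line that changes
  UcudnnConvolutionForwardSketch(handle, 256, 128);
  return 0;
}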

  15. Workspace policies
     • µ-cuDNN employs one of two workspace utilization policies:
       • Workspace Reuse (WR): each layer reuses its own private workspace
       • Workspace Division (WD): each layer uses a part of one unified workspace

                                 WR               WD
       Maximum total WS size     O(# of layers)   constant
       Optimizer                 DP               DP + ILP
       WS owner                  DL framework     µ-cuDNN

  16. µ-cuDNN WR

  17.–20. Workspace policies - WR
     • Problem: given a mini-batch size B and the fastest single-call execution time T_µ(b) for each micro-batch size b = 1, 2, ..., B, compute T(B), where

       T(b) = min{ T_µ(b),  min_{b′ = 1, 2, ..., b−1} [ T(b′) + T(b − b′) ] }

     • Build-up example (conv1): the first candidate for T(60) is a direct execution, T_µ(60); T(120) can combine two such micro-batches, T_µ(60) + T_µ(60); the final answer T(256) picks the best decomposition overall, here five micro-batches of sizes 60 + 60 + 60 + 60 + 16. A runnable sketch of this dynamic program follows below.
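The WR recurrence above is a one-dimensional dynamic program over micro-batch sizes. Below is a runnable C++ sketch; the T_mu cost model (a fast algorithm that fits the workspace only up to b = 64) is an illustrative assumption, not a measurement from the talk:

#include <cstdio>
#include <vector>

// WR dynamic program: given T_mu[b], the fastest time for one convolution
// call on a micro-batch of size b, compute
//   T(b) = min( T_mu(b), min over b' = 1..b-1 of T(b') + T(b - b') )
// and recover the chosen micro-batch sizes for the full mini-batch B.
int main() {
  const int B = 256;

  // Illustrative cost model (an assumption): a fast algorithm fits in the
  // workspace only up to b = 64; beyond that, only a slower algorithm fits.
  // Each call also pays a fixed 1 ms overhead, which penalizes splitting.
  std::vector<double> T_mu(B + 1, 0.0);  // index 0 unused
  for (int b = 1; b <= B; b++)
    T_mu[b] = (b <= 64 ? 0.04 : 0.08) * b + 1.0;

  std::vector<double> T(B + 1, 0.0);
  std::vector<int> split(B + 1, 0);  // 0 = run unsplit, else first part size
  for (int b = 1; b <= B; b++) {
    T[b] = T_mu[b];
    for (int bp = 1; bp < b; bp++)
      if (T[bp] + T[b - bp] < T[b]) {
        T[b] = T[bp] + T[b - bp];
        split[b] = bp;
      }
  }

  // Recover the micro-batch decomposition of B by unwinding the splits.
  std::printf("T(%d) = %.2f ms, micro-batches:", B, T[B]);
  std::vector<int> stack = {B};
  while (!stack.empty()) {
    const int b = stack.back();
    stack.pop_back();
    if (split[b] == 0)
      std::printf(" %d", b);  // a leaf: executed as one convolution call
    else {
      stack.push_back(split[b]);
      stack.push_back(b - split[b]);
    }
  }
  std::printf("\n");  // under this model: four micro-batches of size 64
  return 0;
}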
