SLIDE 1

Accelerating Deep Learning Frameworks with Micro-batches

Yosuke Oyama¹* · Tal Ben-Nun² · Torsten Hoefler² · Satoshi Matsuoka³,¹

September 13, 2018

¹Tokyo Institute of Technology ²ETH Zurich ³RIKEN Center for Computational Science
*oyama.y.aa@m.titech.ac.jp, Presenter

1/26

SLIDE 2

Background

SLIDE 3

Background

  • Convolution is one of the key operations in Convolutional Neural Networks (CNNs)

Figure 1: 2D convolution (input X of size H × W with C channels, filter W of size U × V, output Y of size H′ × W′ with C′ channels).

Algorithm 1 Pseudo-code of two-dimensional convolution.

1: for(n = 0; n < N; n++)                  // Mini-batch loop
2:   for(k = 0; k < K; k++)                // Output channel loop
3:     for(h = 0; h < H; h++)              // Height loop
4:       for(w = 0; w < W; w++)            // Width loop
5:         for(c = 0; c < C; c++)          // Input channel loop
6:           for(v = 0; v < V; v++)        // Kernel width loop
7:             for(u = 0; u < U; u++)      // Kernel height loop
8:               Y[n, k, h, w] += W[k, c, v, u] × X[n, c, h + v, w + u];
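The loop nest above can be written out directly; the following is a minimal Python sketch of Algorithm 1 (valid padding, stride 1; the nested-list array layout and function name are illustrative assumptions, not part of cuDNN):

```python
# Naive 2D convolution following Algorithm 1.
# X: input of shape N x C x (H+V-1) x (W+U-1); W: filters of shape K x C x V x U.
def conv2d(X, W, N, K, C, H, Wid, V, U):
    # Y[n][k][h][w]: output of shape N x K x H x Wid
    Y = [[[[0.0 for _ in range(Wid)] for _ in range(H)] for _ in range(K)]
         for _ in range(N)]
    for n in range(N):                          # mini-batch loop
        for k in range(K):                      # output channel loop
            for h in range(H):                  # height loop
                for w in range(Wid):            # width loop
                    for c in range(C):          # input channel loop
                        for v in range(V):      # kernel width loop
                            for u in range(U):  # kernel height loop
                                Y[n][k][h][w] += W[k][c][v][u] * X[n][c][h + v][w + u]
    return Y
```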

2/26

SLIDE 4

Background

  • The NVIDIA cuDNN library provides deep learning primitives for GPUs
  • cuDNN provides several equivalent convolution algorithms

Figure 2: Three different convolution algorithms: GEMM-based (im2col transforms X into X′, then X′ · W = Y), FFT-based (X̂ · Ŵ = Ŷ in the frequency domain, via F and F⁻¹), and Winograd (X̃ · W̃ = Ỹ in the Winograd domain, via Bᵀ, G, Aᵀ); each requires additional workspace.

3/26

SLIDE 5

Background

  • Problem statement: cuDNN may require a workspace as large as the network itself to use efficient convolution algorithms!

Figure 3: Memory consumption (bars: data, weights, workspace) and computation time (line/points) of AlexNet on P100-SXM2 with different workspace sizes (8, 64, 512 MiB); algorithms used include IMPLICIT_PRECOMP_GEMM, FFT, FFT_TILING, and WINOGRAD.

4/26

SLIDE 6

Background

  • Idea: Loop splitting of the convolution’s outermost loop decreases the workspace size (as well as computation efficiency)

Algorithm 2 Pseudo-code of two-dimensional convolution (the mini-batch loop on line 1 is the split target).

1: for(n = 0; n < N; n++)                  // Mini-batch loop
2:   for(k = 0; k < K; k++)                // Output channel loop
3:     for(h = 0; h < H; h++)              // Height loop
4:       for(w = 0; w < W; w++)            // Width loop
5:         for(c = 0; c < C; c++)          // Input channel loop
6:           for(v = 0; v < V; v++)        // Kernel width loop
7:             for(u = 0; u < U; u++)      // Kernel height loop
8:               Y[n, k, h, w] += W[k, c, v, u] × X[n, c, h + v, w + u];
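Splitting the mini-batch loop amounts to partitioning N into micro-batches, each of which is then executed as a separate convolution call with a smaller workspace. A minimal sketch (the helper name and uniform split are illustrative assumptions, not µ-cuDNN's API):

```python
# Partition a mini-batch of `batch_size` samples into micro-batches of at
# most `micro_batch_size`; the last micro-batch absorbs the remainder.
def split_batch(batch_size, micro_batch_size):
    sizes = []
    remaining = batch_size
    while remaining > 0:
        b = min(micro_batch_size, remaining)
        sizes.append(b)
        remaining -= b
    return sizes
```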

5/26

SLIDE 7

Approach and Contribution

  • Approach: µ-cuDNN, a thin wrapper library for cuDNN, which
  • divides a mini-batch into “micro-batches” by applying loop splitting
  • is based on Dynamic Programming (DP) and Integer Linear Programming (ILP)
  • provides a Python interface for high-level optimization

[Timeline figure: cuDNN executes conv1 with N = 256 using GEMM-based convolution, followed by relu1, pool1, and conv2 (N = 256); µ-cuDNN instead runs conv1 as two N = 128 micro-batches using FFT-based convolution, shortening the conv1 step and starting conv2 (as N = 64 micro-batches) earlier.]

  • Contribution:
  • 1.60x speedup for AlexNet on a V100-SXM2 GPU
  • up to 4.54x speedup (1.60x on average) for DeepBench on a V100-SXM2 GPU

6/26

SLIDE 8

µ-cuDNN

SLIDE 9

µ-cuDNN - Software stack

  • µ-cuDNN is a wrapper library for cuDNN, which can be called by
  • 1. a DL framework, as a low-level performance tuning library
  • 2. its dedicated Python frontend, for high-level performance analysis

Figure 4: µ-cuDNN software stack (path 1: user code → DL framework (C/C++) → µ-cuDNN (C) → cuDNN → NVIDIA GPU (CUDA); path 2: the µ-cuDNN Python frontend, connected via a file-based DB on (N)FS).

7/26

SLIDE 10

µ-cuDNN - Methodology

  • µ-cuDNN is enabled by replacing the cuDNN handle type cudnnHandle_t
  • 1. The DL framework passes the layer’s metadata via cudnnGetConvolution*Algorithm
  • 2. µ-cuDNN runs the ILP (or DP) optimizer and returns the resulting workspace size
  • 3. The framework calls cudnnConvolution* with the workspace size
  • 4. µ-cuDNN internally calls the convolution function one or more times

Framework-side pseudo-code:

for(i = 1..L) {
  cudnnGetConvolution*Algorithm(· · · );   // 1., 2.
  cudaMalloc(&ws[i], · · · );
}
// Training loop
for(· · · ) {
  for(i = 1..L)
    cudnnConvolution*(· · · , ws[i], · · · );   // 3., 4.
}

Figure 5: Workflow of µ-cuDNN (the framework exchanges metadata, workspace size, and workspace pointer with µ-cuDNN’s UcudnnConvolution*; the ILP and DP optimizers store results in an in-memory optimization-result cache).

8/26
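The in-memory optimization-result cache can be sketched as a memoizing wrapper around the optimizer (an illustrative Python analogue of the workflow above, not µ-cuDNN's actual C implementation; the class name, optimizer callback, and metadata key are assumptions):

```python
# Memoizing planner: run the (expensive) micro-batch optimizer once per
# distinct layer configuration and serve later queries from an in-memory cache.
class MicroBatchPlanner:
    def __init__(self, optimizer):
        self.optimizer = optimizer  # e.g., the DP or ILP optimizer
        self.cache = {}             # in-memory optimization-result cache

    def get_plan(self, layer_metadata):
        # layer_metadata: a hashable key (shapes, strides, datatype, ...)
        if layer_metadata not in self.cache:
            self.cache[layer_metadata] = self.optimizer(layer_metadata)
        return self.cache[layer_metadata]
```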

SLIDE 15

Workspace policies

  • µ-cuDNN employs one of two workspace utilization policies
  • Workspace Reuse (WR): Each layer reuses a private workspace
  • Workspace Division (WD): Each layer uses a part of a unified workspace

                         WR              WD
Maximum total WS size    O(# of layers)  constant
Optimizer                DP              DP+ILP
WS owner                 DL framework    µ-cuDNN

9/26

SLIDE 16

µ-cuDNN

WR

SLIDE 17

Workspace policies - WR

  • Problem: Given a mini-batch size B and the fastest micro-batch execution time Tµ(b) (b = 1, 2, · · · , B), compute T(B), where

T(b) = min { Tµ(b), min b′=1,2,...,b−1 { T(b′) + T(b − b′) } }

For example, T(120) can combine two micro-batches of b = 60, and T(256) may be realized by running conv1 as four micro-batches of b = 60 plus one of b = 16.

10/26

SLIDE 21

Workspace policies - WR

  • Solution: Use Dynamic Programming:

Algorithm 3 DP-based solution of WR policy.

1: for b = 1 to B do
2:   b̂µ ← argmin bµ=1,2,...,b { Tµ(bµ) + T(b − bµ) }
3:   T(b) ← Tµ(b̂µ) + T(b − b̂µ)
4:   c(b) ← {cµ(b̂µ)} + c(b − b̂µ)
5: end for
6: return c(B) // Configuration; a list of (algorithm ID, batch size)

Example: c(256) = {(FFT, 100), (FFT, 100), (GEMM, 56)} — conv1 runs as two FFT micro-batches of 100 and one GEMM micro-batch of 56.

11/26
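Algorithm 3 can be written out as a short bottom-up DP (a sketch; in practice the per-micro-batch times Tµ and configurations cµ come from benchmarking cuDNN, and the values in the usage below are invented):

```python
# DP-based solution of the WR policy (Algorithm 3): for each batch size b,
# split off the micro-batch size that minimizes Tµ(bµ) + T(b - bµ), and
# record the chosen (algorithm, micro-batch size) configuration list c(b).
def optimize_wr(B, T_mu, c_mu):
    # T_mu[b]: fastest single-call time for micro-batch size b (index 0 unused)
    # c_mu[b]: (algorithm ID, b) realizing T_mu[b]
    T = [0.0] * (B + 1)               # T[b]: best total time for batch size b
    c = [[] for _ in range(B + 1)]    # c[b]: chosen configuration list
    for b in range(1, B + 1):
        best_bmu = min(range(1, b + 1), key=lambda bm: T_mu[bm] + T[b - bm])
        T[b] = T_mu[best_bmu] + T[b - best_bmu]
        c[b] = [c_mu[best_bmu]] + c[b - best_bmu]
    return T[B], c[B]
```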

SLIDE 22

Workspace policies - WR

  • In practice, µ-cuDNN uses one of three micro-batch size granularities
  • µ-cuDNN with the undivided option acts exactly as cuDNN
  • µ-cuDNN increases the number of available algorithms by exploiting a higher computation precision (PSEUDO_HALF) than specified (TRUE_HALF), without decreasing accuracy

Table 1: Micro-batch size policies.

Policy       Micro-batch size set
all          {1, 2, 3, · · · , B}
powerOfTwo   {2⁰, 2¹, 2², · · · , B}
undivided    {B}

Table 2: cuDNN’s convolution datatypes.

Configuration  Data type  Compute type  FFT support
TRUE_HALF      half       half
PSEUDO_HALF    half       float         ✓
FLOAT          float      float         ✓
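Table 1's candidate sets can be generated with a small helper (an illustrative sketch; the function name is not part of µ-cuDNN):

```python
# Candidate micro-batch sizes for a mini-batch size B under each policy.
def micro_batch_sizes(policy, B):
    if policy == "all":
        return list(range(1, B + 1))
    if policy == "powerOfTwo":
        sizes = []
        b = 1
        while b < B:        # powers of two below B ...
            sizes.append(b)
            b *= 2
        sizes.append(B)     # ... plus B itself
        return sizes
    if policy == "undivided":
        return [B]          # behaves exactly like plain cuDNN
    raise ValueError(policy)
```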

12/26

SLIDE 23

µ-cuDNN

WD

SLIDE 24

Workspace policies - WD

  • Problem: choose one configuration f(k) for each kernel k:

argmin f:K→Ck ∑k∈K Tk(f(k))   (minimization of the total computation time)
s.t. ∑k∈K Mk(f(k)) ≤ M   (total memory consumption constraint)

  • where
  • M: Total workspace limit
  • K: A set of convolution kernels
  • Ck: A set of configurations of kernel k
  • Tk(c): The fastest execution time of kernel k with configuration c
  • Mk(c): The workspace size of kernel k with configuration c

  • Solution: Use ILP:

min T = ∑k∈K ∑c∈Ck Tk(c) xk,c
s.t. ∑k∈K ∑c∈Ck Mk(c) xk,c ≤ M
     ∑c∈Ck xk,c = 1 (∀k ∈ K)   (assign one configuration to each layer)
     xk,c ∈ {0, 1} (∀k ∈ K, ∀c ∈ Ck)

  • where f(k) = c ⇔ xk,c = 1

13/26
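For small layer counts, the WD selection problem above can be checked by exhaustive search (an illustrative stand-in for an ILP solver such as the GLPK solver used in the evaluation; the (time, workspace) candidate lists below are invented):

```python
from itertools import product

# Pick one (time, workspace) configuration per kernel so as to minimize
# total time under the total workspace limit M, by brute force.
def solve_wd(configs, M):
    """configs: list (one entry per kernel) of lists of (time, workspace)."""
    best = None
    for choice in product(*configs):
        mem = sum(m for _, m in choice)
        if mem > M:                       # violates the workspace constraint
            continue
        total = sum(t for t, _ in choice)
        if best is None or total < best[0]:
            best = (total, choice)
    return best                           # (total time, chosen configs) or None
```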

SLIDE 31

Workspace policies - WD

  • We only use configurations on the Pareto front to enumerate “desirable” configurations Ck

Figure 6: Pareto front of configurations in the (M(c), T(c)) plane; a dominated configuration c′ is excluded from Ck.
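Pareto-front filtering over (time, workspace) pairs can be sketched as follows (an illustrative helper; the configurations in the usage are invented):

```python
# Keep a configuration only if no other configuration is at least as fast
# AND uses at most as much workspace, with at least one strict improvement.
def pareto_front(configs):
    front = []
    for (t, m) in configs:
        dominated = any(
            (t2 <= t and m2 <= m) and (t2 < t or m2 < m)
            for (t2, m2) in configs
        )
        if not dominated:
            front.append((t, m))
    return front
```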

14/26

SLIDE 32

Workspace policies - WD

Figure 7: Pareto front of AlexNet’s “conv2” layer on P100-SXM2 (computation time vs. workspace size for IMPLICIT_GEMM, IMPLICIT_PRECOMP_GEMM, GEMM, FFT, FFT_TILING, and WINOGRAD_NONFUSED).

15/26

SLIDE 33

µ-cuDNN

High-level optimization frontend

SLIDE 34

High-level optimization frontend

  • µ-cuDNN’s Python interface performs framework-independent performance analysis
  • by passing layers’ metadata via a file-based database

Figure 8: µ-cuDNN software stack (the Python frontend exchanges layer metadata with the µ-cuDNN C library in three steps through a file-based DB on (N)FS).

16/26

SLIDE 35

High-level optimization frontend

  • We provide a function to minimize the training time of data-parallel training by assigning different micro-batch sizes to heterogeneous GPUs
  • We ignore the time to perform the inter-GPU all-reduce, since communication is typically overlapped with computation

Figure 9: Data-parallel training on a heterogeneous GPU cluster (e.g., 750Ti: b = 6, K20Xm: b = 10, K80: b = 16; each GPU alternates forward and backward passes, with all-reduce in between).

17/26

SLIDE 36

High-level optimization frontend

  • Solution: Given
  • G: A set of GPUs
  • B: A mini-batch size
  • tg,b: Computation time on GPU g with a batch size of b
  • ℬ: A batch-size set,

compute

min max g∈G { ∑b∈ℬ tg,b xg,b }   (minimize the slowest GPU’s time)
s.t. ∑b∈ℬ xg,b ≤ 1 (∀g ∈ G)   (select one batch size for each GPU)
     ∑g∈G ∑b∈ℬ b xg,b = B   (the total batch size should be equal to the mini-batch size)
     xg,b ∈ {0, 1} (∀g ∈ G, ∀b ∈ ℬ)

Figure 10: Illustration of the ILP problem (each GPU g selects one batch size b, setting xg,b = 1; the selected sizes sum to B, and the objective T is the slowest GPU’s time).

18/26
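For a handful of GPUs this min-max assignment can likewise be checked by brute force (an illustrative stand-in for the ILP solver; the timing table in the usage is invented, not measured):

```python
from itertools import product

# Assign one batch size per GPU so that the sizes sum to the mini-batch
# size B, minimizing the slowest GPU's time (the makespan).
def assign_batches(times, sizes, B):
    """times: per-GPU dicts {batch_size: time}; sizes: candidate batch sizes."""
    best = None
    for choice in product(sizes, repeat=len(times)):
        if sum(choice) != B:              # must cover the whole mini-batch
            continue
        makespan = max(t[b] for t, b in zip(times, choice))
        if best is None or makespan < best[0]:
            best = (makespan, choice)
    return best                           # (makespan, per-GPU batch sizes) or None
```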

SLIDE 40

Performance evaluation

SLIDE 41

Evaluation environment

  • GPUs: K80, P100-SXM2, V100-SXM2, K20Xm, 750Ti
  • cuDNN: 7.1 (or 6.0 for Caffe and TensorFlow)
  • Frameworks: Caffe 1.0, TensorFlow 1.4.1
  • LP solver: GNU Linear Programming Kit (GLPK) 4.63

Table 3: GPU specification.

GPU        Generation  FP32 [TFlop/s]  FP16¹ [TFlop/s]  Memory [GiB]  Tensor cores  Host
K80        Kepler      8.73            —                24            —             KFC²
P100-SXM2  Pascal      10.6            21.2             16            —             T3³
V100-SXM2  Volta       15.7            125              16            ✓             DGX-1
GTX 750Ti  Maxwell     1.31            —                2             —             KFC
K20Xm      Kepler      3.95            —                6             —             KFC

¹including Tensor Cores’ mixed-precision arithmetic ²TSUBAME-KFC/DL supercomputer ³TSUBAME 3.0 supercomputer

19/26

SLIDE 42

Single convolutional layer

  • µ-cuDNN achieves a 2.33x speedup on AlexNet’s “conv2” layer by utilizing both FFT-based convolution and Winograd’s algorithm
  • GEMM-based convolution requires only 4.3 KiB of workspace but is slow
  • FFT-based convolution is faster than GEMM, but it requires 213 MiB of workspace with a mini-batch size of 256

Figure 11: Time (bars) and micro-batch sizes (labels in bars) of the forward convolution of AlexNet’s “conv2” layer on P100-SXM2, under the policies u (undivided, i.e. plain cuDNN), p (powerOfTwo), and a (all); the all policy is 2.33x faster than undivided.
20/26

SLIDE 43

DeepBench

  • We evaluate µ-cuDNN with DeepBench’s 94 convolutional layers
  • µ-cuDNN achieves up to a 4.54x speedup (1.60x on average) on a V100-SXM2 GPU with Tensor Cores
  • µ-cuDNN exploits PSEUDO_HALF in 69% of the layers
  • µ-cuDNN also achieves 1.16x and 1.73x average speedups for 3 × 3 kernels on P100 and V100, respectively

Figure 12: Relative speedup of DeepBench’s forward convolution against cuDNN (min, 1st quartile, median, mean, 3rd quartile, and max per kernel size from 1×1 to 20×5, on K80, P100-SXM2, P100-SXM2 (half), V100-SXM2, V100-SXM2 (half), and V100-SXM2 (Tensor Cores)).

21/26

SLIDE 44

Caffe - WR policy

  • µ-cuDNN on the Caffe framework achieves a 1.45x speedup (and 1.60x w.r.t. convolutions alone) on a V100-SXM2 GPU
  • It achieves smaller speedups with a tiny (8 MiB) or huge (512 MiB) workspace, where micro-batching is less effective
  • µ-cuDNN achieves similar speedups on TensorFlow

Figure 13: Benchmark results of AlexNet on three different GPUs with different workspace sizes (8, 64, 512 MiB), under the u/p/a policies: (a) K80, up to 1.81x speedup; (b) P100-SXM2, up to 1.40x; (c) V100-SXM2, up to 1.45x (1.60x for convolutions alone).

22/26

SLIDE 45

Caffe - WD policy

  • µ-cuDNN on the Caffe framework achieves 1.38x and 1.14x speedups for the convolutional layers of AlexNet and ResNet-50 on a P100-SXM2 GPU
  • The time to solve the ILP problem was negligible (5.46 ms for ResNet-50)

Figure 14: Benchmark results of AlexNet on P100-SXM2 with different workspace sizes (8, 64, 512 MiB) and policies (WR vs. WD); WD reaches a 1.38x speedup.

23/26

SLIDE 46

Caffe - WD policy

  • µ-cuDNN with the WD policy spares most of the workspace for “conv2” and “conv3”,
  • which are the most time-consuming layers and can be accelerated by efficient convolution algorithms

Figure 15: Assigned workspace division of AlexNet (conv1–conv5; forward (F), backward-filter (BF), and backward-data (BD) passes) on P100-SXM2 under the WR and WD policies.

24/26

SLIDE 47

Case study: Heterogeneous cluster optimization

  • We estimate the time of forward-backward passes of ResNet-18 on three GPUs: 750Ti, K20Xm, and K80
  • Together, the GPUs accelerate the training of ResNet-18 by 2.20x over a single K80 GPU chip using the same mini-batch size
  • The time to perform all-reduce was negligible (2.63 ms for 1 MiB of data over 3 nodes with MVAPICH 2.3a)

Figure 16: Estimated time of forward-backward passes of ResNet-18 on single GPUs and heterogeneous combinations of 750Ti, K20Xm, K80, and K80×2, with the assigned per-GPU batch sizes; times range from 334.1 ms down to 59.5 ms, and the best combination is 2.20x faster than a single K80 chip (130.8 ms).

25/26

SLIDE 48

Conclusion

  • µ-cuDNN is a “free-lunch” auto-tuning library for cuDNN, which
  • does not violate the semantics of computation
  • even improves computation precision for better performance in some cases
  • is independent of the underlying framework
  • generalizes performance analysis around convolutional layers

µ-cuDNN is available online at https://github.com/spcl/ucudnn

26/26