GTC 2019 - San Jose
Filling the Performance Gap in Convolution Implementations for NVIDIA GPUs
Antonio J. Peña, Pedro Valero-Lara, Marc Jordà
www.bsc.es
AlexNet Structure
○ Weights are grouped in filters
○ Filters are shared by several output elements
○ Uses convolution operations as part of its computation
Trained filters in the 1st convolutional layer of AlexNet
○ Storage and computational cost do not depend on the input size
○ Translation invariance
○ Filters are trained to detect relevant patterns
Fully-connected layer
Weights, Input (flattened), Output (flattened)
Out_i = ActivationFunc( Sum_{j=0..#In} (W_{i,j} · In_j) + bias )
Convolutional layer
Output = ActivationFunc(ConvolutionOps(Input, Filters) + bias)
○ Input and filter depth are equal
○ Different input subvolume for each output element (dark blue highlight)
○ Output depth = number of filters
○ Filter is translated over the X and Y dimensions
○ # of inputs (aka batch size, N)
○ Input X, Y size (H, W)
○ # of filters (Nf)
○ Filter X, Y size (aka receptive field, Hf, Wf)
○ Depth
○ Stride
○ Padding
○ Dilation
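A minimal sketch of how these parameters determine the output X/Y size (the helper name convOutSize is illustrative, not from the slides):

// Standard output-size relation for one spatial dimension (H or W).
static int convOutSize(int in, int filter, int stride, int padding, int dilation)
{
    int effFilter = dilation * (filter - 1) + 1;     // dilation spreads the filter taps
    return (in + 2 * padding - effFilter) / stride + 1;
}
// Example: convOutSize(5, 3, 1, 0, 1) == 3, matching the 5x5 input / 3x3 filter case below.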
Input (5x5x3)
Filters (3x3x3) Output (3x3x2)
○ 1 input of 5x5x3
○ 2 filters of 3x3x3
○ Stride X and Y = 1
→ 1 output of 3x3x2 (output Z is the number of filters)
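A host-side sketch of the direct convolution in this example, assuming channel-major (NCHW-style) indexing; function and array names are illustrative, not the authors' code:

// 1 input of 5x5x3, 2 filters of 3x3x3, stride 1, no padding -> 1 output of 3x3x2.
void directConvExample(const float *in,    // [C=3][H=5][W=5]
                       const float *filt,  // [Nf=2][C=3][Hf=3][Wf=3]
                       float *out)         // [Nf=2][Ho=3][Wo=3]
{
    const int C = 3, H = 5, W = 5, Nf = 2, Hf = 3, Wf = 3;
    const int Ho = H - Hf + 1, Wo = W - Wf + 1;            // 3x3, since stride = 1 and no padding
    for (int f = 0; f < Nf; ++f)                           // output depth = number of filters
        for (int y = 0; y < Ho; ++y)
            for (int x = 0; x < Wo; ++x) {
                float acc = 0.0f;
                for (int c = 0; c < C; ++c)                // filter depth = input depth
                    for (int ky = 0; ky < Hf; ++ky)
                        for (int kx = 0; kx < Wf; ++kx)
                            acc += in[(c * H + y + ky) * W + (x + kx)] *
                                   filt[((f * C + c) * Hf + ky) * Wf + kx];
                out[(f * Ho + y) * Wo + x] = acc;          // one output element per input subvolume
            }
}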
○ AlexNet, GoogleNet, Resnet50, SqueezeNet, VGG19
○ Initial layers have large input X/Y size, small depth
○ Final layers have small input X/Y size, large depth
Inputs' shape at different layer levels of the CNN
○ Input X/Y size reduction is done with pooling layers
Zero-padding of half the filter size (Wf/2) keeps the output X/Y size equal to the input's
Pooling of 2x2 tiles halves the X & Y size
Pooling applies a reduction operation (e.g. avg, max, …) to each tile
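A minimal sketch of 2x2 max pooling over one channel (illustrative names, not the authors' code); average pooling would instead sum the four values and divide by 4:

#include <math.h>

// Each non-overlapping 2x2 tile is reduced to one value, halving the X and Y sizes.
void maxPool2x2(const float *in, int H, int W, float *out)   // out has size (H/2) x (W/2)
{
    for (int y = 0; y < H / 2; ++y)
        for (int x = 0; x < W / 2; ++x) {
            float m = in[(2 * y) * W + 2 * x];               // top-left element of the tile
            m = fmaxf(m, in[(2 * y) * W + 2 * x + 1]);
            m = fmaxf(m, in[(2 * y + 1) * W + 2 * x]);
            m = fmaxf(m, in[(2 * y + 1) * W + 2 * x + 1]);
            out[y * (W / 2) + x] = m;
        }
}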
○ Stride = 1 in 95% of all convolution configurations
○ Filter X/Y sizes are small: 1x1, 3x3, 5x5, …
Characteristics of convolutional layers with stride=1 in the selected CNNs
○ 1x1 convolutions reduce the depth of inputs to lower the computational cost of the following convolutional layer (with larger filters); a rough multiply count is sketched below
Inception module from GoogleNet: parallel branches (1x1; 1x1 then 3x3; 1x1 then 5x5; pooling then 1x1), concatenated at the output
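A back-of-the-envelope multiply count for this depth-reduction idea, using illustrative sizes that are not taken from the slides (a 28x28x192 input feeding a 5x5 convolution with 32 filters, with and without a 1x1 reduction to 16 channels):

// Multiplications = outputX * outputY * #filters * (filterX * filterY * depth), assuming stride 1 and "same" padding.
constexpr long long kDirect5x5  = 28LL * 28 * 32 * (5 * 5 * 192);  // ~120M multiplies
constexpr long long kReduce1x1  = 28LL * 28 * 16 * (1 * 1 * 192);  // ~2.4M multiplies
constexpr long long kThen5x5    = 28LL * 28 * 32 * (5 * 5 * 16);   // ~10M multiplies
constexpr long long kBottleneck = kReduce1x1 + kThen5x5;           // ~12.4M, roughly 10x fewer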
Convolution as a matrix-matrix multiplication (GEMM): transform inputs and filters into matrices, multiply them, and reshape the result
○ Filters matrix → flattened filters as rows
○ Inputs matrix → elements of input subvolumes as columns (im2col in Matlab)
○ Can exploit existing high-performance GEMM libs (MKL, cuBLAS, …)
○ Requires extra memory for intermediate matrices
○ Inputs' intermediate matrix is larger than inputs themselves
Image from Chetlur et al., cuDNN: Efficient primitives for deep learning
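A minimal sketch of the explicit input transformation (im2col-style), assuming NCHW layout, stride 1 and no padding (names are illustrative): each column of the resulting (C*Hf*Wf) x (Ho*Wo) matrix holds one flattened input subvolume, so the whole convolution becomes a single GEMM with the Nf x (C*Hf*Wf) filter matrix, e.g. via cublasSgemm.

void im2col(const float *in, int C, int H, int W, int Hf, int Wf, float *cols)
{
    const int Ho = H - Hf + 1, Wo = W - Wf + 1;
    for (int c = 0; c < C; ++c)
        for (int ky = 0; ky < Hf; ++ky)
            for (int kx = 0; kx < Wf; ++kx) {
                int row = (c * Hf + ky) * Wf + kx;          // one row per filter element position
                for (int y = 0; y < Ho; ++y)
                    for (int x = 0; x < Wo; ++x)
                        cols[row * (Ho * Wo) + y * Wo + x] =
                            in[(c * H + y + ky) * W + (x + kx)];
            }
}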
Winograd
○ Additions are faster to execute than multiplications
○ Used in fast FIR filter algorithms in signal processing
○ Inputs: g, d
○ Coefficient matrices: A, B, G
FFT
○ FFT + transformation + inverse FFT
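A 1-D sketch of Winograd's minimal-filtering idea, F(2,3): two outputs of a 3-tap filter g applied to four inputs d, using 4 element-wise multiplications instead of 6. The G, B and A coefficient matrices show up as the small +/-1 and 0.5 factors; the 2-D form F(2x2, 3x3) used for 3x3 convolutions nests this construction. Function name is illustrative.

void winogradF23(const float d[4], const float g[3], float y[2])
{
    // Filter transform (G * g); precomputed once per filter in practice.
    float u0 = g[0];
    float u1 = 0.5f * (g[0] + g[1] + g[2]);
    float u2 = 0.5f * (g[0] - g[1] + g[2]);
    float u3 = g[2];
    // Input transform (B^T * d), element-wise product, output transform (A^T * m).
    float m0 = (d[0] - d[2]) * u0;
    float m1 = (d[1] + d[2]) * u1;
    float m2 = (d[2] - d[1]) * u2;
    float m3 = (d[1] - d[3]) * u3;
    y[0] = m0 + m1 + m2;   // == d[0]*g[0] + d[1]*g[1] + d[2]*g[2]
    y[1] = m1 - m2 - m3;   // == d[1]*g[0] + d[2]*g[1] + d[3]*g[2]
}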
As part of our study, we did a performance survey of cuDNN convolution algorithms
○ GEMM, Winograd, FFT
○ Total of 7 variants: 3 of GEMM (1 explicit input transformation, 2 implicit), 2 of Winograd, and 2 of FFT
○ Best algorithm is at Y=1
○ X axis labels are <inputXY>-<batch size>-<filter XY>-<#filters>-<depth>
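A hedged sketch of how one configuration of such a survey can be timed with cuDNN's built-in auto-tuner, cudnnFindConvolutionForwardAlgorithm, which runs every forward algorithm and returns them sorted by runtime. The descriptor values here (64x3x224x224 input, 64 filters of 3x3x3, NCHW, FP32) are illustrative only, not the configurations from the slides, and error checking is omitted.

#include <cudnn.h>
#include <cstdio>

void surveyOneConfig(cudnnHandle_t handle)
{
    cudnnTensorDescriptor_t xDesc, yDesc;
    cudnnFilterDescriptor_t wDesc;
    cudnnConvolutionDescriptor_t convDesc;
    cudnnCreateTensorDescriptor(&xDesc);
    cudnnCreateTensorDescriptor(&yDesc);
    cudnnCreateFilterDescriptor(&wDesc);
    cudnnCreateConvolutionDescriptor(&convDesc);

    cudnnSetTensor4dDescriptor(xDesc, CUDNN_TENSOR_NCHW, CUDNN_DATA_FLOAT, 64, 3, 224, 224);
    cudnnSetFilter4dDescriptor(wDesc, CUDNN_DATA_FLOAT, CUDNN_TENSOR_NCHW, 64, 3, 3, 3);
    cudnnSetConvolution2dDescriptor(convDesc, 1, 1, 1, 1, 1, 1,     // pad, stride, dilation
                                    CUDNN_CROSS_CORRELATION, CUDNN_DATA_FLOAT);

    int n, c, h, w;
    cudnnGetConvolution2dForwardOutputDim(convDesc, xDesc, wDesc, &n, &c, &h, &w);
    cudnnSetTensor4dDescriptor(yDesc, CUDNN_TENSOR_NCHW, CUDNN_DATA_FLOAT, n, c, h, w);

    cudnnConvolutionFwdAlgoPerf_t perf[CUDNN_CONVOLUTION_FWD_ALGO_COUNT];
    int returned = 0;
    cudnnFindConvolutionForwardAlgorithm(handle, xDesc, wDesc, convDesc, yDesc,
                                         CUDNN_CONVOLUTION_FWD_ALGO_COUNT, &returned, perf);
    for (int i = 0; i < returned; ++i)          // perf[0] is the fastest variant
        printf("algo %d: %.3f ms\n", perf[i].algo, perf[i].time);
}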
Survey plots: runtime of each cuDNN variant relative to the best, grouped by filter X/Y size (1x1, 3x3, 5x5)
○ Explicit GEMM is more than 1.5x slower for most of the configurations
Winograd (3x3 and 5x5 filters)
○ Initially designed for this filter size (3x3)
○ Best performing when the filter size is small and the input X & Y size is large
○ For some configurations the best is the other Winograd variant, not shown to reduce clutter
FFT
○ Better suited for larger filter sizes
At the layer level
○ Each filter is used with all the inputs
○ Each input is used with all the filters
At the convolution level
○ Input element reuse is not constant: input z-rows in the center are reused more
○ Each filter z-row is reused the same number of times
○ Inputs are usually larger => more reuse of filter z-rows
○ If stride = 1 (common in CNNs), reuse is done by contiguous subvolumes
Filter element reuse: input elements that reuse two example z-rows of the filter (in matching colors) in a convolution with stride=1
○ Right-most dimension elements are contiguous in memory
○ N: batch ○ C: depth ○ W: width ○ H: height
○ NCHW ○ NHWC
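A minimal sketch of how the same element is addressed in the two layouts (helper names are illustrative): in NCHW the W dimension is contiguous, while in NHWC the depth C is contiguous, so threads that sweep consecutive c values touch consecutive addresses.

inline int idxNCHW(int n, int c, int h, int w, int C, int H, int W)
{
    return ((n * C + c) * H + h) * W + w;    // w varies fastest
}
inline int idxNHWC(int n, int c, int h, int w, int C, int H, int W)
{
    return ((n * H + h) * W + w) * C + c;    // c varies fastest
}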
○ Fully-coalesced warps
○ Some warps may have a gap (overhead similar to misaligned accesses)
○ No need for layout transformations before the actual computation
○ Exploit shared memory and shuffle instructions
○ Faster memory access
Example with warp size = 4
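A hedged device-side sketch of the shuffle idea (not the authors' kernel; the figure uses warp size 4 for illustration, real warps have 32 lanes): each lane loads one consecutive element of a filter z-row in a single coalesced transaction, and a shuffle then redistributes the values inside the warp without further global-memory traffic.

__device__ float loadAndBroadcast(const float *filterZRow, int srcLane)
{
    int lane = threadIdx.x & 31;                       // lane id within the warp
    float v = filterZRow[lane];                        // 32 consecutive floats: one coalesced access
    return __shfl_sync(0xffffffffu, v, srcLane);       // every lane receives the value read by srcLane
}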
Scalar products
○ Output of 1st stage has to be stored in the correct layout
○ AlexNet, GoogleNet, Resnet50, SqueezeNet, VGG19
○ Faster than the best cuDNN variant in 8.31% of the tested configurations
○ Average speedup of 1.46x for these configurations
○ Mainly in smaller batch sizes (up to 16)
○ DL frameworks pick the best algorithm for each convolutional layer
○ Our design better exploits thread block-level parallelism for small batch sizes
○ Too many thread blocks negatively impact our performance for large batch sizes
○ Compute & memory access units not fully utilized
Speedup vs. best cuDNN variant
○ Algorithm has to be adapted to the Tensor Cores matrix-matrix multiplication API (see the WMMA sketch after this list)
○ Work-fusion (e.g. thread coarsening) optimizations
○ Compute units utilization can increase (feedback from the profiler)
○ Improve performance for larger batch and filter sizes
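A hedged sketch of the Tensor Core (WMMA) matrix-multiply API the algorithm would have to target, shown as a generic 16x16x16 half-precision tile multiplication with FP32 accumulation; this is the standard CUDA API, not the adapted convolution kernel.

#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

// One warp multiplies a 16x16x16 tile: C = A * B, accumulating in float.
__global__ void wmmaTile16(const half *a, const half *b, float *c)
{
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> aFrag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> bFrag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> cFrag;

    wmma::fill_fragment(cFrag, 0.0f);
    wmma::load_matrix_sync(aFrag, a, 16);                 // leading dimension = 16
    wmma::load_matrix_sync(bFrag, b, 16);
    wmma::mma_sync(cFrag, aFrag, bFrag, cFrag);
    wmma::store_matrix_sync(c, cFrag, 16, wmma::mem_row_major);
}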
www.bsc.es