

SLIDE 1

GTC 2019 - San Jose

Filling the Performance Gap in Convolution Implementations for NVIDIA GPUs

Antonio J. Peña, Pedro Valero-Lara, Marc Jordà

www.bsc.es

SLIDE 2

Agenda

  • Intro
  • Background

○ Convolutional Neural Networks
○ Convolution operation
○ Common characteristics of CNNs

  • cuDNN convolution algorithms survey
  • Design

○ Data reuse present in conv layers
○ Data layout
○ Algorithm stages

  • Performance evaluation
  • Conclusions & ongoing work
SLIDE 3

Introduction

AlexNet Structure

  • Interest in neural networks resurged in recent years

○ Deep Neural Networks (DNNs)

  • Made possible by

○ Availability of very large annotated datasets (e.g. ImageNet)
○ High-throughput heterogeneous systems

  • Convolutional Neural Networks (CNNs)

○ High accuracy in image classification benchmarks
○ Several algorithms (Direct, GEMM, FFT, Winograd)

  • Our convolution implementation for NVIDIA GPUs

○ Based on direct application of the convolution formula
○ Efficiently exploits in-core memories and global memory accesses

SLIDE 4

Convolutional Neural Networks (CNNs)

  • Inclusion of convolutional layers
  • Convolutional layer

○ Weights are grouped in filters
○ Filters are shared by several output elements
○ Uses convolution operations as part of its computation

Trained filters in the 1st convolutional layer of AlexNet

  • Advantage over fully-connected layers

○ Storage and computational cost does not depend on input or output size

  • Number of filters and their size are a design choice

○ Translation invariance

  • Filters "see" different parts of the input
  • Serves as an automatic feature extractor

○ Filters are trained to detect relevant patterns

Fully-connected layer (weights, flattened input, flattened output):

Out_i = ActivationFunc( Sum_{j=0..#In} W_{i,j} · In_j + bias )

Convolutional layer:

Output = ActivationFunc( ConvolutionOps(Input, Filters) + bias )

SLIDE 5

Convolution Operation

  • Output elements are the scalar product of one filter and a subvolume of the input

○ Input and filter depth are equal
○ Different input subvolume for each output element (dark blue highlight in the figure)

  • Output planes are the convolution of one input with one of the filters

○ Output depth = number of filters
○ Filter is translated over the X and Y dimensions

  • Convolution parameters (a naive reference sketch using them follows below)

○ # of inputs (aka batch size, N)
○ Input X, Y size (H, W)
○ # of filters (Nf)
○ Filter X, Y size (aka receptive field, Hf, Wf)
○ Depth
○ Stride
○ Padding
○ Dilation
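
To make the roles of these parameters concrete, here is a minimal, unoptimized CUDA sketch of the direct convolution formula for NCHW data, one thread per output element. Kernel and variable names are illustrative; dilation is omitted, and this is not the optimized implementation presented later in the talk.

```cuda
// Naive direct convolution, NCHW layout, one thread per output element.
// Illustrative only: no dilation, no shared-memory reuse, no vectorization.
__global__ void naive_conv2d(const float* in,   // N x C x H x W
                             const float* filt, // Nf x C x Hf x Wf
                             float* out,        // N x Nf x Ho x Wo
                             int N, int C, int H, int W,
                             int Nf, int Hf, int Wf,
                             int stride, int pad,
                             int Ho, int Wo)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int total = N * Nf * Ho * Wo;
    if (idx >= total) return;

    // Decode the flat index into (n, f, y, x) output coordinates.
    int x = idx % Wo;
    int y = (idx / Wo) % Ho;
    int f = (idx / (Wo * Ho)) % Nf;
    int n = idx / (Wo * Ho * Nf);

    float acc = 0.0f;
    for (int c = 0; c < C; ++c)               // depth
        for (int fy = 0; fy < Hf; ++fy)       // filter Y
            for (int fx = 0; fx < Wf; ++fx) { // filter X
                int iy = y * stride - pad + fy;
                int ix = x * stride - pad + fx;
                if (iy < 0 || iy >= H || ix < 0 || ix >= W) continue; // zero padding
                acc += in[((n * C + c) * H + iy) * W + ix] *
                       filt[((f * C + c) * Hf + fy) * Wf + fx];
            }
    out[idx] = acc;
}
```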

SLIDE 6

Convolution Operation - Example

Input (5x5x3) * Filters (3x3x3) = Output (3x3x2)

  • Example convolution with 1 input and 2 filters

○ 1 input of 5x5x3
○ 2 filters of 3x3x3
○ Stride X and Y = 1
○ 1 output of 3x3x2 (output Z is the number of filters)
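
The output X/Y size in this example follows from the standard output-size relation (a general identity, not stated on the slide); here with no padding and no dilation:

W_out = (W + 2·pad − Wf) / stride + 1 = (5 + 0 − 3) / 1 + 1 = 3
H_out = (H + 2·pad − Hf) / stride + 1 = (5 + 0 − 3) / 1 + 1 = 3
Output depth = number of filters (Nf) = 2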

SLIDE 7

Convolution parameter values in CNNs

  • Parameters from 5 well-known CNNs

○ AlexNet, GoogleNet, Resnet50, SqueezeNet, VGG19

  • Overall structure

○ Initial layers have large input X/Y size, small depth
○ Final layers have small input X/Y size, large depth

Inputs' shape at different layer levels of the CNN

SLIDE 8

Convolution parameter values in CNNs

  • Parameters from 5 well-known CNNs

○ AlexNet, GoogleNet, Resnet50, SqueezeNet, VGG19

  • Overall structure

○ Initial layers have large input X/Y size, small depth
○ Final layers have small input X/Y size, large depth

  • Padding to maintain X/Y size

○ Input X/Y size reduction is done with pooling layers

Zero-padding of half the filter size keeps the output X & Y size equal to the input's.

Pooling of 2x2 tiles halves the X & Y size; pooling applies a reduction operation (e.g. avg, max, …) to each tile.

SLIDE 9

Convolution parameter values in CNNs

  • Parameters from 5 well-known CNNs

○ AlexNet, GoogleNet, Resnet50, SqueezeNet, VGG19

  • Overall structure

○ Initial layers have large input X/Y size, small depth
○ Final layers have small input X/Y size, large depth

  • Padding to maintain X/Y size

○ Input X/Y size reduction is done with pooling layers

  • Stride = 1 for most convolutions

○ 95% of all convolution configurations

  • Filter sizes are small

○ 1x1, 3x3, 5x5, …

Characteristics of convolutional layers with stride=1 in the selected CNNs

SLIDE 10

Convolution parameter values in CNNs

  • Parameters from 5 well-known CNNs

○ AlexNet, GoogleNet, Resnet50, SqueezeNet, VGG19

  • Overall structure

○ Initial layers have large input X/Y size, small depth
○ Final layers have small input X/Y size, large depth

  • Padding to maintain X/Y size

○ Input X/Y size reduction is done with pooling layers

  • Stride = 1 for most convolutions

○ 95% of all convolution configurations

  • Filter sizes are small

○ 1x1, 3x3, 5x5, …

  • Convolutions with 1x1 filters are a special case

○ Reduce the depth of inputs to reduce the computational cost of the following convolutional layer (with larger filters)

Inception module from GoogleNet (parallel 1x1, 3x3, 5x5, and pooling branches feeding a concat)

SLIDE 11

Convolution algorithms in cuDNN

GEMM-based Algorithm

  • Generate two intermediate matrices, multiply them, and reshape the result

○ Filters matrix → flattened filters as rows
○ Inputs matrix → elements of input subvolumes as columns (im2col in MATLAB)

  • Pros

○ Can exploit existing high-performance GEMM libs (MKL, cuBLAS, …)

  • Cons

○ Requires extra memory for intermediate matrices
○ Inputs' intermediate matrix is larger than the inputs themselves

Image from Chetlur et al., cuDNN: Efficient Primitives for Deep Learning
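
As an illustration of the lowering described above, here is a minimal host-side im2col sketch for a single NCHW input with stride 1 and no padding; the function name and layout choices are assumptions for the example, not cuDNN internals. Each column of the resulting matrix is one flattened input subvolume, so the convolution becomes a single GEMM (e.g. via cuBLAS) between the Nf x (C·Hf·Wf) filters matrix and this (C·Hf·Wf) x (Ho·Wo) matrix.

```cuda
// im2col for one NCHW input (stride 1, no padding), illustrative only.
// cols has C*Hf*Wf rows and Ho*Wo columns, stored row-major.
void im2col_cpu(const float* in, float* cols,
                int C, int H, int W, int Hf, int Wf)
{
    int Ho = H - Hf + 1, Wo = W - Wf + 1;
    for (int c = 0; c < C; ++c)
        for (int fy = 0; fy < Hf; ++fy)
            for (int fx = 0; fx < Wf; ++fx) {
                int row = (c * Hf + fy) * Wf + fx;      // matrix row index
                for (int y = 0; y < Ho; ++y)
                    for (int x = 0; x < Wo; ++x)
                        cols[row * (Ho * Wo) + y * Wo + x] =
                            in[(c * H + y + fy) * W + x + fx];
            }
}
// The convolution is then a GEMM: out[Nf][Ho*Wo] = filters[Nf][C*Hf*Wf] * cols,
// e.g. with cublasSgemm, followed by a bias/activation pass.
```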

SLIDE 12

Convolution algorithms in cuDNN

Arithmetic strength reduction approaches

  • Algorithmic transformation to trade multiplications for additions

○ Additions are faster to execute than multiplications

  • Winograd

○ Used in fast FIR filter algorithms in signal processing
○ Inputs: g, d
○ Coefficient matrices: A, B, G

  • Fast Fourier Transform

○ FFT + transformation (pointwise product in the frequency domain) + inverse FFT
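
For reference, the 2-D Winograd convolution that these variants build on (the standard Lavin–Gray formulation, not taken from the slides) computes an output tile Y from a filter tile g and an input tile d using the coefficient matrices G, B, A mentioned above:

Y = A^T [ (G g G^T) ⊙ (B^T d B) ] A

where ⊙ is the element-wise product. For F(2x2, 3x3), a 4x4 input tile and a 3x3 filter produce a 2x2 output tile with 16 multiplications instead of the 36 required by direct computation.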

SLIDE 13

cuDNN Convolution Algorithms – Performance survey

As part of our study, we did a performance survey of cuDNN convolution algorithms.

  • 3 convolution algorithms

○ GEMM, Winograd, FFT
○ Total of 7 variants: 3 of GEMM (1 explicit input transformation, 2 implicit), 2 of Winograd, and 2 of FFT

  • Convolution configurations from well-known CNNs: AlexNet, GoogleNet, Resnet50, SqueezeNet, VGG19
  • cuDNN 6 on V100-SXM2 (Volta)
  • Performance normalized to the best performing algorithm for each convolution configuration

○ Best algorithm is at Y=1
○ X axis labels are <inputXY>-<batch size>-<filter XY>-<#filters>-<depth>

(Plots: normalized performance for the 1x1, 3x3, and 5x5 filter configurations)
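
For readers who want to reproduce this kind of survey, the sketch below shows one way to time the cuDNN forward-convolution algorithm variants directly. It is a condensed illustration (error checking and cleanup largely omitted), and the convolution configuration shown is an arbitrary example, not one of the configurations from the survey.

```cuda
// Minimal sketch: time each cuDNN forward-convolution algorithm variant.
#include <cudnn.h>
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    // Arbitrary example configuration (not from the survey).
    int N = 16, C = 64, H = 56, W = 56, Nf = 128, Hf = 3, Wf = 3, pad = 1, stride = 1;

    cudnnHandle_t h;  cudnnCreate(&h);
    cudnnTensorDescriptor_t xDesc, yDesc;
    cudnnFilterDescriptor_t wDesc;
    cudnnConvolutionDescriptor_t cDesc;
    cudnnCreateTensorDescriptor(&xDesc);
    cudnnCreateTensorDescriptor(&yDesc);
    cudnnCreateFilterDescriptor(&wDesc);
    cudnnCreateConvolutionDescriptor(&cDesc);

    cudnnSetTensor4dDescriptor(xDesc, CUDNN_TENSOR_NCHW, CUDNN_DATA_FLOAT, N, C, H, W);
    cudnnSetFilter4dDescriptor(wDesc, CUDNN_DATA_FLOAT, CUDNN_TENSOR_NCHW, Nf, C, Hf, Wf);
    cudnnSetConvolution2dDescriptor(cDesc, pad, pad, stride, stride, 1, 1,
                                    CUDNN_CROSS_CORRELATION, CUDNN_DATA_FLOAT);

    int on, oc, oh, ow;
    cudnnGetConvolution2dForwardOutputDim(cDesc, xDesc, wDesc, &on, &oc, &oh, &ow);
    cudnnSetTensor4dDescriptor(yDesc, CUDNN_TENSOR_NCHW, CUDNN_DATA_FLOAT, on, oc, oh, ow);

    float *x, *w, *y;
    cudaMalloc(&x, sizeof(float) * N * C * H * W);
    cudaMalloc(&w, sizeof(float) * Nf * C * Hf * Wf);
    cudaMalloc(&y, sizeof(float) * on * oc * oh * ow);

    cudnnConvolutionFwdAlgo_t algos[] = {
        CUDNN_CONVOLUTION_FWD_ALGO_GEMM,                  // explicit GEMM
        CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_GEMM,
        CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM,
        CUDNN_CONVOLUTION_FWD_ALGO_WINOGRAD,
        CUDNN_CONVOLUTION_FWD_ALGO_WINOGRAD_NONFUSED,
        CUDNN_CONVOLUTION_FWD_ALGO_FFT,
        CUDNN_CONVOLUTION_FWD_ALGO_FFT_TILING,
    };
    float alpha = 1.0f, beta = 0.0f;
    cudaEvent_t t0, t1;  cudaEventCreate(&t0);  cudaEventCreate(&t1);

    for (auto algo : algos) {
        size_t wsBytes = 0;
        if (cudnnGetConvolutionForwardWorkspaceSize(h, xDesc, wDesc, cDesc, yDesc,
                                                    algo, &wsBytes) != CUDNN_STATUS_SUCCESS)
            continue;                           // algorithm unsupported for this configuration
        void* ws = nullptr;
        if (wsBytes) cudaMalloc(&ws, wsBytes);

        cudaEventRecord(t0);
        cudnnStatus_t st = cudnnConvolutionForward(h, &alpha, xDesc, x, wDesc, w, cDesc,
                                                   algo, ws, wsBytes, &beta, yDesc, y);
        cudaEventRecord(t1);
        cudaEventSynchronize(t1);
        float ms = 0.0f;  cudaEventElapsedTime(&ms, t0, t1);
        if (st == CUDNN_STATUS_SUCCESS) printf("algo %d: %.3f ms\n", (int)algo, ms);
        if (ws) cudaFree(ws);
    }
    // Descriptor and memory cleanup omitted for brevity.
    return 0;
}
```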

SLIDE 14

cuDNN Convolution Algorithms – Performance survey

Convolution configurations with 1x1 filters (only GEMM variants support this filter size)

  • The implicit variants clearly outperform explicit GEMM

○ Explicit GEMM is more than 1.5x slower for most of the configurations

  • GEMM-implicit-precomp is better when the batch size is > 1
SLIDE 15

cuDNN Convolution Algorithms – Performance survey

(Plots: configurations with 3x3 and 5x5 filters; note on the 3x3 plot: best is the other Winograd variant, not shown to reduce clutter)

Configurations with 3x3 filters

  • Winograd is clearly the best

○ Initially designed for this filter size

  • GEMM-impl-precomp outperforms it when depth is small and input X & Y size is large

Configurations with 5x5 filters

  • GEMM-impl-precomp is the best performing
  • FFT gets close in a few cases only

○ Better suited for larger filter sizes

SLIDE 16

Design

SLIDE 17

Design – Data reuse

The convolutions of a convolutional layer expose two levels of data reuse.

At the layer level

  • A batch of inputs is convolved with all the layer filters

○ Each filter is used with all the inputs
○ Each input is used with all the filters

(Diagram: Inputs * Filters = Outputs)

SLIDE 18

Design – Data reuse

The convolutions of a convolutional layer expose two levels of data reuse.

At the convolution level

  • Input elements reuse

○ Not constant: input z-rows in the center are reused more

  • Filter elements reuse

○ Each filter z-row is reused the same number of times
○ Inputs are usually larger => more reuse of filter z-rows
○ If stride = 1 (common in CNNs), reuse happens over contiguous subvolumes

At the layer level

  • A batch of inputs is convolved with all the layer filters

○ Each filter is used with all the inputs
○ Each input is used with all the filters

Filter elements reuse: input elements that reuse two example Z-rows of the filter (in matching colors) in a convolution with stride=1

SLIDE 19

Design – Data layout

Flattened representation of the 4-D tensors

  • How the data are stored in memory
  • Denoted as a four-letter acronym, one letter per dimension

○ Elements of the right-most dimension are contiguous in memory

  • Dimensions

○ N: batch
○ C: depth
○ W: width
○ H: height

  • Common layouts in CNNs (see the indexing sketch below)

○ NCHW
○ NHWC
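
A small sketch of how the flat memory offset of element (n, c, y, x) differs between the two layouts; helper names are illustrative.

```cuda
// Flat offset of element (n, c, y, x) in each layout; illustrative helpers.
__host__ __device__ inline size_t idx_nchw(int n, int c, int y, int x,
                                           int C, int H, int W) {
    return ((size_t)(n * C + c) * H + y) * W + x;   // x (W) varies fastest
}

__host__ __device__ inline size_t idx_nhwc(int n, int c, int y, int x,
                                           int C, int H, int W) {
    return ((size_t)(n * H + y) * W + x) * C + c;   // c (C) varies fastest
}
```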

SLIDE 20

Design – Data layout

Considering data layout + data reuse + coalescing, if we have:

  • NCHW layout
  • Warps mapped along W dimension
  • Stride = 1

We get

  • Good coalescing when loading inputs

○ Fully-coalesced warps
○ Some warps may have a gap (overhead similar to misaligned accesses)
○ No need for layout transformations before the actual computation

  • Threads in a warp reuse filter data

○ Exploit shared memory and shuffle instructions
○ Faster memory access

Example with warp size = 4
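
The toy kernel below illustrates this combination, assuming NCHW data, stride 1, and a block size that is a multiple of 32: one warp covers 32 consecutive output-X positions of one output row, so consecutive lanes load consecutive input addresses (coalesced), and a filter element loaded by one lane is broadcast to the whole warp with a shuffle instead of being re-read from memory. Names and structure are illustrative only; this is not the actual BSC kernel.

```cuda
// Toy kernel: one warp computes 32 consecutive output-X positions of one
// output row for one image and one filter, accumulating over one filter row
// (fy fixed by the caller). Launch with blockDim.x a multiple of 32.
__global__ void warp_row_partial(const float* __restrict__ input,   // one image: C x H x W
                                 const float* __restrict__ filter,  // one filter: C x Hf x Wf
                                 float* __restrict__ partial,       // one row of partial results
                                 int C, int H, int W, int Hf, int Wf,
                                 int y, int fy)
{
    int lane = threadIdx.x % 32;
    int x = blockIdx.x * 32 + lane;            // consecutive lanes -> consecutive X
    bool active = (x <= W - Wf);               // keep all lanes in the warp for the shuffle

    float acc = 0.0f;
    for (int c = 0; c < C; ++c)
        for (int fx = 0; fx < Wf; ++fx) {
            // Coalesced: active lanes read consecutive input addresses along W.
            float in_val = active ? input[(c * H + (y + fy)) * W + (x + fx)] : 0.0f;

            // Lane 0 loads the filter element once; broadcast via shuffle so the
            // whole warp reuses it without extra global-memory traffic.
            float f_val = (lane == 0) ? filter[(c * Hf + fy) * Wf + fx] : 0.0f;
            f_val = __shfl_sync(0xffffffffu, f_val, 0);

            acc += in_val * f_val;
        }
    if (active) partial[x] = acc;              // partial result for this (y, fy) pair
}
```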

SLIDE 21

Design – Algorithm

Computation is split into 2 stages:

1. Compute the scalar products between input & filter Z-rows required for the convolutions

  • Exploits the reuse of filter elements in shared memory and registers

Scalar products

SLIDE 22

Design – Algorithm

Computation is split into 2 stages:

1. Compute the scalar products between input & filter Z-rows required for the convolutions

  • Exploits the reuse of filter elements in shared memory and registers

2. Add the partial-results matrices from the 1st stage to obtain each output X-Y plane (a simplified sketch of this stage follows below)

  • Each output element is the sum of one element from each partial-results matrix
  • Not necessary for convolutions with 1x1 filters

○ Output of the 1st stage has to be stored in the correct layout
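
A highly simplified sketch of the second stage's accumulation (not the actual implementation, which arranges and fuses this differently): each element of one output X-Y plane is the sum of the corresponding element in every partial-results plane produced by stage 1.

```cuda
// Stage-2 sketch: sum the corresponding elements of the Hf*Wf partial-result
// planes produced by stage 1 into one output X-Y plane. Illustrative only.
__global__ void sum_partials(const float* __restrict__ partials, // numPartials x Ho x Wo
                             float* __restrict__ out,            // Ho x Wo
                             int numPartials, int planeSize)     // planeSize = Ho * Wo
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= planeSize) return;
    float acc = 0.0f;
    for (int p = 0; p < numPartials; ++p)
        acc += partials[p * planeSize + i];
    out[i] = acc;
}
```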

SLIDE 23

Experimental Evaluation

Evaluation dataset

  • 602 convolution configurations (X & Y sizes, #filters, depth), from

○ AlexNet, GoogleNet, Resnet50, SqueezeNet, VGG19

  • Several input batch sizes: 1, 8, 16, 32, 64, 128, 256
  • Total 4000+ configurations
  • Single-precision floating point
  • Average of 9 executions

Experimental platform

  • IBM POWER9 server
  • V100-SXM2 (Volta) GPU
  • Red Hat Enterprise Linux Server 7.4
  • CUDA 9.2
  • cuDNN 7.1

SLIDE 24

Results

  • Overall, our implementation is faster than the best cuDNN variant in 8.31% of the tested configurations

○ Average speedup of 1.46x for these configurations
○ Mainly at smaller batch sizes (up to 16)
○ DL frameworks pick the best algorithm for each convolutional layer

  • Insights from performance profiling

○ Our design better exploits thread block-level parallelism for small batch sizes
○ Too many thread blocks negatively impact our performance for large batch sizes
○ Compute & memory access units are not fully utilized

(Plot: speedup vs. the best cuDNN variant)

SLIDE 25

Conclusions & Future work

Our implementation is competitive for certain parameter intervals

  • Convolutions with 1x1 filters and small batch sizes
  • Speedups of up to 2.29x

Improvements currently in progress

  • Support for Tensor Cores for FP16 convolutions

○ Algorithm has to be adapted to the Tensor Cores matrix-matrix multiplication API (see the WMMA sketch below)

  • Obtain a better work distribution among thread blocks

○ Work-fusion (e.g. thread coarsening) optimizations
○ Compute units utilization can increase (feedback from profiler)
○ Improve performance for larger batch and filter sizes
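
For context, the Tensor Core matrix-multiply API referred to above is CUDA's warp-level WMMA interface (nvcuda::wmma). The generic sketch below shows a single FP16 16x16x16 tile multiply-accumulate, which is the operation shape the convolution algorithm would have to be recast into; it is an illustration of the API, not the adapted algorithm.

```cuda
// Generic WMMA illustration: one warp multiplies a 16x16 FP16 tile of A by a
// 16x16 FP16 tile of B and accumulates into a 16x16 FP32 tile of C.
// Requires a Volta-or-newer GPU (compile with -arch=sm_70 or later).
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

__global__ void wmma_tile_mma(const half* A, const half* B, float* C)
{
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

    wmma::fill_fragment(c_frag, 0.0f);          // start from a zero accumulator
    wmma::load_matrix_sync(a_frag, A, 16);      // leading dimension 16
    wmma::load_matrix_sync(b_frag, B, 16);
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);
    wmma::store_matrix_sync(C, c_frag, 16, wmma::mem_row_major);
}
```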

SLIDE 26

GTC 2019 - San Jose

Filling the Performance Gap in Convolution Implementations for NVIDIA GPUs

Antonio J. Peña, Pedro Valero-Lara, Marc Jordà

www.bsc.es

For further info: marc.jorda@bsc.es