CUDA OPTIMIZATION TIPS, TRICKS AND TECHNIQUES
Stephen Jones, GTC 2017
2
The art of doing more with less
3
RULE #1: DON’T TRY TOO HARD
[Chart: performance vs. time, with the peak-performance ceiling shown]
4
RULE #1: DON’T TRY TOO HARD
[Chart: performance vs. time; aiming straight for peak performance is an unrealistic effort/reward curve]
5
RULE #1: DON’T TRY TOO HARD
[Chart: performance vs. time, with the peak-performance ceiling shown]
6
RULE #1: DON’T TRY TOO HARD
[Chart: performance vs. time, annotated: reduce this time; don't waste this time; get on this curve]
7
RULE #1: DON’T TRY TOO HARD
[Chart: performance vs. time, annotated: premature excitement ("Wait, it's going slower??"); trough of despair ("4 weeks and this is it?", most people give up here; hire an intern); point of diminishing returns; here be ninjas]
8
PERFORMANCE CONSTRAINTS
[Pie chart: Memory 75%, Occupancy 10%, Compute Intensity 10%, Divergence 3%, Instruction 2%]
9
PERFORMANCE CONSTRAINTS
CPU <> GPU Transfer, Coalescence, Cache Inefficiency, Register Spilling, Divergent Access, Occupancy, Instruction Divergence, Compute Intensity
10
MEMORY ORDERS OF MAGNITUDE
[Diagram: CPU DRAM (150 GB/sec) -> PCIe bus (16 GB/sec) -> GDRAM (300 GB/sec) -> L2 Cache (2,000 GB/sec) -> SM registers & shared memory (20,000 GB/sec)]
11
TALK BREAKDOWN
1. Why Didn’t I Think Of That?
2. CPU Memory to GPU Memory (the PCIe Bus)
3. GPU Memory to the SM
4. Registers & Shared Memory
5. Occupancy, Divergence & Latency
6. Weird Things You Never Thought Of (and probably shouldn’t try)
In no particular order
12
WHERE TO BEGIN?
13
THE OBVIOUS
Start with the Visual Profiler
NVIDIA Visual Profiler
14
CPU <> GPU DATA MOVEMENT
15
PCI ISSUES
Moving data over the PCIe bus (16 GB/sec)
16
PIN YOUR CPU MEMORY
[Diagram: copying data from CPU memory to GPU memory]
17
PIN YOUR CPU MEMORY
[Diagram: the DMA controller transfers the data]
18
PIN YOUR CPU MEMORY
[Diagram: the OS can swap the page out mid-transfer, so the DMA controller cannot safely use pageable memory]
19
PIN YOUR CPU MEMORY
[Diagram: the CPU allocates & pins a staging page, then copies the data locally before the DMA transfer]
20
PIN YOUR CPU MEMORY

[Diagram: the DMA controller transfers user-pinned data from CPU memory straight to GPU memory]

    cudaHostAlloc( &data, size, cudaHostAllocMapped );
    cudaHostRegister( data, size, cudaHostRegisterDefault );
21
PIN YOUR CPU MEMORY
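A minimal sketch of using the pinned-allocation API above (error checking omitted; assumes a CUDA-capable device). With pinned memory the async copy is a true DMA transfer; with pageable memory the driver would first stage it through its own pinned buffer:

```cuda
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 1 << 20;
    float *hostData, *devData;

    cudaHostAlloc(&hostData, bytes, cudaHostAllocDefault);  // pinned: no staging copy
    cudaMalloc(&devData, bytes);

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Real asynchronous DMA, because the source page cannot be swapped out
    cudaMemcpyAsync(devData, hostData, bytes, cudaMemcpyHostToDevice, stream);
    cudaStreamSynchronize(stream);

    cudaFree(devData);
    cudaFreeHost(hostData);
    return 0;
}
```

cudaHostRegister is the alternative when the buffer was already allocated with plain malloc and you want to pin it after the fact.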
22
REMEMBER: PCIe GOES BOTH WAYS
23
Operations in a single stream are ordered, but the hardware can copy and compute at the same time
STREAMS & CONCURRENCY
[Timeline, single stream: copy data to GPU, compute, copy data to host, strictly in sequence]
Hiding the cost of data transfer
24
STREAMS & CONCURRENCY
[Timeline: streams 1 and 2 each copy up, work, and copy back; the copies of one stream overlap the work of the other, saving time vs. the single stream]
25
STREAMS & CONCURRENCY
[Chart: total time with 1, 2, and 8 streams]
Can keep on breaking work into smaller chunks and saving time
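The chunking idea can be sketched as follows; this assumes the host buffer is pinned, and `process` is an illustrative stand-in kernel, not from the talk:

```cuda
#include <cuda_runtime.h>

__global__ void process(float *data, int n) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n) data[i] *= 2.0f;
}

// Split one big copy-up/compute/copy-back into per-stream chunks so the
// hardware can overlap transfers in both directions with kernel work.
// Assumes numStreams <= 8 and n divisible by numStreams.
void chunkedRun(float *pinnedHost, float *dev, int n, int numStreams) {
    cudaStream_t streams[8];
    int chunk = n / numStreams;
    for (int s = 0; s < numStreams; s++) cudaStreamCreate(&streams[s]);
    for (int s = 0; s < numStreams; s++) {
        int off = s * chunk;
        cudaMemcpyAsync(dev + off, pinnedHost + off, chunk * sizeof(float),
                        cudaMemcpyHostToDevice, streams[s]);
        process<<<(chunk + 255) / 256, 256, 0, streams[s]>>>(dev + off, chunk);
        cudaMemcpyAsync(pinnedHost + off, dev + off, chunk * sizeof(float),
                        cudaMemcpyDeviceToHost, streams[s]);
    }
    cudaDeviceSynchronize();
    for (int s = 0; s < numStreams; s++) cudaStreamDestroy(streams[s]);
}
```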
26
SMALL PCIe TRANSFERS
PCIe is designed for large data transfers, but fine-grained copy/compute overlap prefers small transfers. So how small can we go?
[Chart: 1, 2, and 8 streams vs. too many streams]
27
APPARENTLY NOT THAT SMALL
28
FROM GPU MEMORY TO GPU THREADS
29
FEEDING THE MACHINE
From GPU Memory to the SMs
30
USE THE PARALLEL ARCHITECTURE
Threads run in groups of 32; the hardware is optimized to use all SIMT threads at once.
The cache is sized to service sets of 32 requests at a time (one L2 cache line per warp).
High-speed GPU memory works best with linear access.
31
VECTORIZE MEMORY LOADS
[Diagram: threads T0..T31 each load one int]
Multi-Word as well as Multi-Thread
32
VECTORIZE MEMORY LOADS
[Diagram: threads T0..T15 and T16..T31 each load one int2]
Fill multiple cache lines in a single fetch
33
VECTORIZE MEMORY LOADS
[Diagram: threads T0..T7, T8..T15, T16..T23, T24..T31 each load one int4]
Fill multiple cache lines in a single fetch
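A sketch of a vectorized copy using int4; the kernel name and launch shape are illustrative, and it assumes the element count is a multiple of 4 and the pointers are 16-byte aligned:

```cuda
// Each thread moves 16 bytes per load, so a warp fills multiple
// 128-byte cache lines in fewer memory transactions.
__global__ void copyVec4(const int4 *in, int4 *out, int count4) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < count4) out[i] = in[i];
}

// Launch with count/4 total threads, e.g.:
//   copyVec4<<<(count/4 + 255)/256, 256>>>((const int4*)in, (int4*)out, count/4);
```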
34
VECTORIZE MEMORY LOADS
35
DO MULTIPLE LOADS PER THREAD
    __global__ void copy(int2 *input, int2 *output, int max)
    {
        int id = threadIdx.x + blockDim.x * blockIdx.x;
        if( id < max ) {
            output[id] = input[id];
        }
    }

    __global__ void copy(int2 *input, int2 *output, int max, int loadsPerThread)
    {
        int id = threadIdx.x + blockDim.x * blockIdx.x;
        for(int n=0; n<loadsPerThread; n++) {
            if( id >= max ) {
                break;
            }
            output[id] = input[id];
            id += blockDim.x * gridDim.x;
        }
    }
One copy per thread
Maximum overhead
Multiple copies per thread
Amortize overhead
Multi-Thread, Multi-Word AND Multi-Iteration
36
“MAXIMAL” LAUNCHES ARE BEST
37
COALESCED MEMORY ACCESS
It’s not just good enough to use all SIMT threads:
Coalesced: sequential memory accesses are adjacent
Uncoalesced: sequential memory accesses are unassociated
38
SIMT PENALTIES WHEN NOT COALESCED
x = data[threadIdx.x]  ->  a single 32-wide operation
x = data[rand()]       ->  32 one-wide operations
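The two access patterns can be sketched as a pair of toy kernels (the names and the index array are illustrative, not from the talk):

```cuda
// Coalesced: adjacent threads touch adjacent words, so each warp
// issues a single wide memory transaction.
__global__ void coalescedRead(const float *data, float *out) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    out[i] = data[i];
}

// Uncoalesced: arbitrary per-thread indices can cost the warp up to
// 32 separate one-wide transactions.
__global__ void gatherRead(const float *data, const int *idx, float *out) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    out[i] = data[idx[i]];
}
```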
39
SCATTER & GATHER
Gathering: reading randomly, writing sequentially
Scattering: reading sequentially, writing randomly
40
AVOID SCATTER/GATHER IF YOU CAN
41
AVOID SCATTER/GATHER IF YOU CAN
42
SORTING MIGHT BE AN OPTION
If reading non-sequential data is expensive, is it worth sorting it to make it sequential?
[Diagram: gathering from unsorted data is slow; sorting first and then doing a coalesced read is fast]
43
SORTING MIGHT BE AN OPTION
Even if you’re only going to read it twice, then yes!
44
PRE-SORTING TURNS OUT TO BE GOOD
45
DATA LAYOUT: “AOS vs. SOA”
Array-of-Structures:

    #define NPTS (1024 * 1024)

    struct Coefficients_AOS {
        double u[3];
        double x[3][3];
        double p;
        double rho;
        double eta;
    };

    Coefficients_AOS gridData[NPTS];

Structure-of-Arrays:

    struct Coefficients_SOA {
        double u[3][NPTS];
        double x[3][3][NPTS];
        double p[NPTS];
        double rho[NPTS];
        double eta[NPTS];
    };

    Coefficients_SOA gridData;
Single-thread code prefers arrays of structures, for cache efficiency. SIMT code prefers structures of arrays, for execution & memory efficiency.
Sometimes you can’t just sort your data
46
DATA LAYOUT: “AOS vs. SOA”
    #define NPTS (1024 * 1024)

    struct Coefficients_AOS {
        double u[3];
        double x[3][3];
        double p;
        double rho;
        double eta;
    };

    Coefficients_AOS gridData[NPTS];

[Diagram: conceptual layout: u0 u1 u2 | x00 x01 x02 x10 x11 x12 x20 x21 x22 | p | rho | eta]
47
AOS: STRIDED ARRAY ACCESS

[Diagram: Array-of-Structures memory layout]

    double u0 = gridData[threadIdx.x].u[0];

GPU reads data one element at a time, but in parallel by 32 threads in a warp; here each thread's load is a full structure apart, so the accesses are strided.
48
SOA: COALESCED ARRAY ACCESS

[Diagram: Structure-of-Arrays memory layout]

    double u0 = gridData.u[0][threadIdx.x];

GPU reads data one element at a time, but in parallel by 32 threads in a warp; here consecutive threads read consecutive words, so the accesses coalesce.
49
BLOCK-WIDE LOAD VIA SHARED MEMORY
Read data linearly as bytes. Use shared memory to convert to struct
[Diagram: the block copies data linearly from device memory into shared memory]
50
BLOCK-WIDE LOAD VIA SHARED MEMORY
Read data linearly as bytes. Use shared memory to convert to struct
[Diagram: the threads which own the data then grab it from shared memory]
51
CLEVER AOS/SOA TRICKS
52
CLEVER AOS/SOA TRICKS
Helps for any data size
53
HANDY LIBRARY TO HELP YOU
Trove – A utility library for fast AOS/SOA access and transposition https://github.com/bryancatanzaro/trove
54
(AB)USING THE CACHE
55
MAKING THE MOST OF L2-CACHE
L2 cache is fast but small:
(GDRAM: 300 GB/sec -> L2 Cache: 2,000 GB/sec)
Architecture   L2 Cache Size   Total Threads   Cache Bytes per Thread
Kepler         1536 KB         30,720          51
Maxwell        3072 KB         49,152          64
Pascal         4096 KB         114,688         36
56
TRAINING DEEP NEURAL NETWORKS
57
LOTS OF PASSES OVER DATA
[Diagram: input passes through FFT and 3x3, 5x5, 7x7 convolutions, weighted by W1, W2, W3 and summed: "Cat!"]
58
MULTI-RESOLUTION CONVOLUTIONS
Pass 1 : 3x3 Pass 2: 5x5 Pass 3: 7x7
59
TILED, MULTI-RESOLUTION CONVOLUTION
Do 3 passes per tile, with each tile sized to fit in the L2 cache
Pass 1 : 3x3 Pass 2: 5x5 Pass 3: 7x7
60
LAUNCHING FEWER THAN MAXIMUM THREADS
61
SHARED MEMORY: DEFINITELY WORTH IT
62
USING SHARED MEMORY WISELY
Shared memory arranged into “banks” for concurrent SIMT access
▪ 32 threads can read simultaneously so long as they access separate banks
Shared memory has 4-byte and 8-byte “bank” sizes
63
STENCIL ALGORITHM
Many algorithms have high data re-use: potentially good for shared memory “Stencil” algorithms accumulate data from neighbours onto a central point
▪ Stencil has width “W” (in the above case, W=5)
Adjacent threads will share (W-1) items of data – good potential for data re-use
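A sketch of a W=5 stencil staged through shared memory; the block size of 256 and the zero padding at the boundaries are assumptions for illustration, not from the talk:

```cuda
#define RADIUS 2  // W = 2*RADIUS + 1 = 5

// Each block loads its 256-element tile plus a RADIUS-wide halo on
// each side, then every thread reads its neighbours from shared
// memory instead of re-fetching them from device memory.
__global__ void stencil1d(const float *in, float *out, int n) {
    __shared__ float tile[256 + 2 * RADIUS];
    int g = threadIdx.x + blockIdx.x * blockDim.x;  // global index
    int l = threadIdx.x + RADIUS;                   // local (tile) index

    if (g < n) tile[l] = in[g];
    if (threadIdx.x < RADIUS) {                     // first threads load halos
        tile[l - RADIUS] = (g >= RADIUS) ? in[g - RADIUS] : 0.0f;
        int r = g + blockDim.x;
        tile[l + blockDim.x] = (r < n) ? in[r] : 0.0f;
    }
    __syncthreads();

    if (g < n) {
        float sum = 0.0f;
        for (int k = -RADIUS; k <= RADIUS; k++) sum += tile[l + k];
        out[g] = sum;
    }
}
```

Each input element is read from device memory once but used up to W times, which is exactly the re-use shared memory is good at.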
64
STENCILS IN SHARED MEMORY
65
SIZE MATTERS
66
PERSISTENT KERNELS
Avoid multiple kernel launches by caching in shared memory instead of L2
Separate kernel launches with L2 re-use:

    void tiledConvolution()
    {
        convolution<3><<< numblocks, blockdim, 0, s >>>(ptr, chunkSize);
        convolution<5><<< numblocks, blockdim, 0, s >>>(ptr, chunkSize);
        convolution<7><<< numblocks, blockdim, 0, s >>>(ptr, chunkSize);
    }

Single kernel launch with persistent kernel:

    __global__ void convolutionShared(int *data, int count, int sharedelems)
    {
        extern __shared__ int shdata[];
        shdata[threadIdx.x] = data[threadIdx.x + blockDim.x * blockIdx.x];
        __syncthreads();

        convolve<3>(threadIdx.x, shdata, sharedelems);
        __syncthreads();
        convolve<5>(threadIdx.x, shdata, sharedelems);
        __syncthreads();
        convolve<7>(threadIdx.x, shdata, sharedelems);
    }
Revisiting the tiled convolutions
67
PERSISTENT KERNELS
68
OPERATING DIRECTLY FROM CPU MEMORY
Can save memory copies. It’s obvious when you think about it ...
[Timeline: copy data to GPU -> compute -> copy data to host, vs. a single kernel that reads from and writes to CPU memory directly]

With explicit copies: compute only begins when the 1st copy has finished, and the task only ends when the 2nd copy has finished.
Operating directly: compute begins after the first fetch; uses lots of threads to cover host-memory access latency; takes advantage of bi-directional PCIe.
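One way to set this up is mapped pinned memory; this is a sketch with error checking omitted, and `scale` is a hypothetical kernel name:

```cuda
#include <cuda_runtime.h>

int main() {
    float *hostBuf, *devView;
    size_t bytes = 1 << 20;

    // Pinned AND mapped: the GPU can address this host allocation directly
    cudaHostAlloc(&hostBuf, bytes, cudaHostAllocMapped);
    cudaHostGetDevicePointer(&devView, hostBuf, 0);  // device-side alias

    // A kernel launched with devView reads from and writes to CPU memory
    // over PCIe, with no explicit cudaMemcpy, e.g.:
    //   scale<<<grid, block>>>(devView, n);
    // It needs many resident threads to hide the bus latency.

    cudaFreeHost(hostBuf);
    return 0;
}
```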
69
OPERATING DIRECTLY FROM CPU MEMORY
70
OCCUPANCY AND REGISTER LIMITATIONS
Register file is bigger than shared memory and L1 cache!
Occupancy can kill you if you use too many registers
Often worth forcing fewer registers to allow more blocks per SM
But watch out for math functions!
Registers used by math functions:

Function      float   double
log             7      18
cos            16      28
acos            6      18
cosh            7      10
tan            15      28
erfc           14      22
exp             7      10
log10           6      18
normcdf        16      26
cbrt            8      20
sqrt            6      12
rsqrt           5      12
y0             20      30
y1             22      30
fdivide        11      20
pow            11      24
grad. desc.    14      22

    __launch_bounds__(maxThreadsPerBlock, minBlocksPerMultiprocessor)
    __global__ void compute()
    {
        y = acos(pow(log(fdivide(tan(cosh(erfc(x))), 2)), 3));
    }