CUDA OPTIMIZATION TIPS, TRICKS AND TECHNIQUES
Stephen Jones, GTC 2017
SLIDE 1

Stephen Jones, GTC 2017

CUDA OPTIMIZATION TIPS, TRICKS AND TECHNIQUES

SLIDE 2

The art of doing more with less

SLIDE 3

RULE #1: DON’T TRY TOO HARD

[Chart: performance vs. development time, approaching a Peak Performance asymptote]

SLIDE 4

RULE #1: DON’T TRY TOO HARD

[Chart: performance vs. development time; chasing Peak Performance itself is an unrealistic effort/reward trade-off]

SLIDE 5

RULE #1: DON’T TRY TOO HARD

[Chart: performance vs. development time, approaching Peak Performance]

SLIDE 6

RULE #1: DON’T TRY TOO HARD

[Chart: performance vs. development time, annotated: "Get on this curve", "Reduce this time", "Don't waste this time"]

SLIDE 7

RULE #1: DON’T TRY TOO HARD

[Chart: performance vs. development time, annotated along the curve: "Premature excitement", "Wait, it's going slower??", "Trough of despair", "Hire an intern", "Most people give up here", "4 weeks and this is it?", "Point of diminishing returns", "Here be ninjas"]

SLIDE 8

PERFORMANCE CONSTRAINTS

[Pie chart: what typically constrains performance]

Memory: 75%
Occupancy: 10%
Compute Intensity: 10%
Divergence: 3%
Instruction: 2%

SLIDE 9

PERFORMANCE CONSTRAINTS

[Chart: the constraints broken down further]

CPU <> GPU Transfer
Coalescence
Cache Inefficiency
Register Spilling
Divergent Access
Occupancy
Instruction
Divergence
Compute Intensity

SLIDE 10

MEMORY ORDERS OF MAGNITUDE

[Diagram: the memory hierarchy, from CPU DRAM through the PCIe bus to the SMs (regs/shmem)]

CPU DRAM: 150 GB/sec
PCIe bus (CPU <> GPU): 16 GB/sec
GDRAM: 300 GB/sec
L2 Cache: 2,000 GB/sec
L1$ / shared memory / registers (per SM): 20,000 GB/sec

SLIDE 11

TALK BREAKDOWN

1. Why Didn’t I Think Of That?
2. CPU Memory to GPU Memory (the PCIe Bus)
3. GPU Memory to the SM
4. Registers & Shared Memory
5. Occupancy, Divergence & Latency
6. Weird Things You Never Thought Of (and probably shouldn’t try)

In no particular order

SLIDE 12

WHERE TO BEGIN?

SLIDE 13

THE OBVIOUS

Start with the Visual Profiler

NVIDIA Visual Profiler

SLIDE 14

CPU <> GPU DATA MOVEMENT

SLIDE 15

PCI ISSUES

Moving data over the PCIe bus

[Diagram: the SMs (regs/shmem) connected to CPU memory over the 16 GB/sec PCIe bus]

SLIDE 16

PIN YOUR CPU MEMORY

[Diagram: copying data from CPU memory to GPU memory]

SLIDE 17

PIN YOUR CPU MEMORY

[Diagram: the DMA controller moves the data from CPU memory to GPU memory without involving the CPU]

SLIDE 18

PIN YOUR CPU MEMORY

[Diagram: the OS may swap an unpinned page out from under the DMA controller, so DMA from pageable memory is unsafe]

SLIDE 19

PIN YOUR CPU MEMORY

[Diagram: staging through a pinned copy of the data]

The CPU allocates & pins a page, then copies the data locally before the DMA transfer

SLIDE 20

PIN YOUR CPU MEMORY

[Diagram: the DMA controller transfers user-pinned data directly from CPU memory to GPU memory, with no staging copy]

cudaHostAlloc( &data, size, cudaHostAllocMapped );
cudaHostRegister( data, size, cudaHostRegisterDefault );
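A minimal sketch of the two pinning routes above (buffer names and sizes are illustrative, not from the slides); either way, the DMA engine can then copy asynchronously straight out of host memory:

```cuda
#include <cuda_runtime.h>
#include <stdlib.h>

int main() {
    const size_t size = 1 << 20;

    // Route 1: allocate page-locked host memory directly
    float *pinned = NULL;
    cudaHostAlloc((void**)&pinned, size, cudaHostAllocDefault);

    // Route 2: pin an existing malloc() allocation in place
    float *existing = (float*)malloc(size);
    cudaHostRegister(existing, size, cudaHostRegisterDefault);

    float *d_data = NULL;
    cudaMalloc((void**)&d_data, size);

    // Pinned memory allows a true asynchronous DMA transfer
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    cudaMemcpyAsync(d_data, pinned, size, cudaMemcpyHostToDevice, stream);
    cudaStreamSynchronize(stream);

    cudaHostUnregister(existing);
    free(existing);
    cudaFreeHost(pinned);
    cudaFree(d_data);
    cudaStreamDestroy(stream);
    return 0;
}
```

Note that cudaHostRegister takes the pointer itself, while cudaHostAlloc takes the address of the pointer it fills in.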

SLIDE 21

PIN YOUR CPU MEMORY

SLIDE 22

REMEMBER: PCIe GOES BOTH WAYS

SLIDE 23

STREAMS & CONCURRENCY

Hiding the cost of data transfer

Operations in a single stream are ordered, but the hardware can copy and compute at the same time.

[Timeline: a single stream serializes copy-to-GPU, compute, and copy-to-host]

SLIDE 24

STREAMS & CONCURRENCY

[Timeline: single stream vs. streams 1 and 2; each stream's copy-up, work, and copy-back interleave with the other's, so transfer overlaps compute and time is saved]

SLIDE 25

STREAMS & CONCURRENCY

[Timeline: 1 stream vs. 2 streams vs. 8 streams]

Can keep on breaking work into smaller chunks and saving time.
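The chunking idea above can be sketched roughly as follows; the kernel, chunk math, and stream count are illustrative (and h_data must be pinned for the copies to actually overlap):

```cuda
#include <cuda_runtime.h>

__global__ void process(float *data, int n) {
    int id = threadIdx.x + blockDim.x * blockIdx.x;
    if (id < n) data[id] *= 2.0f;   // placeholder work
}

// Assumes: h_data is pinned, total divides evenly, numStreams <= 8
void runChunked(float *h_data, float *d_data, int total, int numStreams) {
    cudaStream_t streams[8];
    int chunk = total / numStreams;
    for (int i = 0; i < numStreams; i++) cudaStreamCreate(&streams[i]);

    for (int i = 0; i < numStreams; i++) {
        int off = i * chunk;
        // Copy-up, compute, and copy-back for this chunk queue in order
        // on one stream, but overlap with other streams' work
        cudaMemcpyAsync(d_data + off, h_data + off, chunk * sizeof(float),
                        cudaMemcpyHostToDevice, streams[i]);
        process<<<(chunk + 255) / 256, 256, 0, streams[i]>>>(d_data + off, chunk);
        cudaMemcpyAsync(h_data + off, d_data + off, chunk * sizeof(float),
                        cudaMemcpyDeviceToHost, streams[i]);
    }
    for (int i = 0; i < numStreams; i++) {
        cudaStreamSynchronize(streams[i]);
        cudaStreamDestroy(streams[i]);
    }
}
```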

SLIDE 26

SMALL PCIe TRANSFERS

PCIe is designed for large data transfers, but fine-grained copy/compute overlap prefers small transfers. So how small can we go?

[Timeline: 1, 2, 8, and "too many" chunks]

SLIDE 27

APPARENTLY NOT THAT SMALL

SLIDE 28

FROM GPU MEMORY TO GPU THREADS

SLIDE 29

FEEDING THE MACHINE

From GPU Memory to the SMs

[Diagram: GDRAM feeding the SMs (regs/shmem)]

SLIDE 30

USE THE PARALLEL ARCHITECTURE

Threads run in groups of 32, and the hardware is optimized to use all SIMT threads at once.

High-speed GPU memory works best with linear access, and the cache is sized to service sets of 32 requests at a time (one L2 cache line).

SLIDE 31

VECTORIZE MEMORY LOADS

[Diagram: threads T0-T31 each load one int, together filling one cache line]

Multi-Word as well as Multi-Thread

SLIDE 32

VECTORIZE MEMORY LOADS

[Diagram: with int2, threads T0-T15 and T16-T31 each load two words, covering two cache lines]

Fill multiple cache lines in a single fetch

SLIDE 33

VECTORIZE MEMORY LOADS

[Diagram: with int4, threads T0-T7, T8-T15, T16-T23 and T24-T31 cover four cache lines]

Fill multiple cache lines in a single fetch
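As a sketch of the vectorized version (kernel name and launch math are illustrative): widening from int to int4 lets each thread move 16 bytes per load, assuming the element count is a multiple of 4 and the base pointers are 16-byte aligned (cudaMalloc guarantees base-pointer alignment):

```cuda
__global__ void copyVec4(const int4 *input, int4 *output, int max4) {
    int id = threadIdx.x + blockDim.x * blockIdx.x;
    if (id < max4) {
        output[id] = input[id];   // one 128-bit load + one 128-bit store
    }
}

// Launched with a quarter of the threads of the scalar copy:
// copyVec4<<<(count/4 + 255)/256, 256>>>((const int4*)in, (int4*)out, count/4);
```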

SLIDE 34

VECTORIZE MEMORY LOADS

SLIDE 35

DO MULTIPLE LOADS PER THREAD

One copy per thread (maximum overhead):

__global__ void copy(int2 *input, int2 *output, int max)
{
    int id = threadIdx.x + blockDim.x * blockIdx.x;
    if( id < max ) {
        output[id] = input[id];
    }
}

Multiple copies per thread (amortize overhead):

__global__ void copy(int2 *input, int2 *output, int max, int loadsPerThread)
{
    int id = threadIdx.x + blockDim.x * blockIdx.x;
    for(int n=0; n<loadsPerThread; n++) {
        if( id >= max ) {
            break;
        }
        output[id] = input[id];
        id += blockDim.x * gridDim.x;
    }
}

Multi-Thread, Multi-Word AND Multi-Iteration

SLIDE 36

“MAXIMAL” LAUNCHES ARE BEST

SLIDE 37

COALESCED MEMORY ACCESS

It’s not just good enough to use all SIMT threads.

Coalesced: sequential memory accesses are adjacent.
Uncoalesced: sequential memory accesses are unassociated.

[Diagram: threads 1-4 accessing adjacent words vs. scattered words]
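The two patterns can be sketched side by side (the permutation array perm is illustrative, standing in for any data-dependent index):

```cuda
// Coalesced: a warp's 32 accesses fall in a contiguous run of cache lines
__global__ void coalescedRead(const float *data, float *out, int n) {
    int id = threadIdx.x + blockDim.x * blockIdx.x;
    if (id < n) out[id] = data[id];
}

// Uncoalesced: the same warp may issue up to 32 separate transactions
__global__ void uncoalescedRead(const float *data, const int *perm,
                                float *out, int n) {
    int id = threadIdx.x + blockDim.x * blockIdx.x;
    if (id < n) out[id] = data[perm[id]];
}
```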

SLIDE 38

SIMT PENALTIES WHEN NOT COALESCED

x = data[threadIdx.x];   // a single 32-wide operation
x = data[rand()];        // 32 one-wide operations

SLIDE 39

SCATTER & GATHER

[Diagram: gather vs. scatter access patterns]

Gathering: reading randomly, writing sequentially
Scattering: reading sequentially, writing randomly

SLIDE 40

AVOID SCATTER/GATHER IF YOU CAN

SLIDE 41

AVOID SCATTER/GATHER IF YOU CAN

SLIDE 42

SORTING MIGHT BE AN OPTION

If reading non-sequential data is expensive, is it worth sorting it to make it sequential?

[Diagram: gathering from unsorted data (2 4 1 3) is slow; after a sort, a coalesced read (1 2 3 4) is fast]

SLIDE 43

SORTING MIGHT BE AN OPTION

Even if you’re only going to read it twice, then yes!

SLIDE 44

PRE-SORTING TURNS OUT TO BE GOOD

SLIDE 45

DATA LAYOUT: “AOS vs. SOA”

Sometimes you can’t just sort your data.

Array-of-Structures:

#define NPTS 1024 * 1024
struct Coefficients_AOS {
    double u[3];
    double x[3][3];
    double p;
    double rho;
    double eta;
};
Coefficients_AOS gridData[NPTS];

Structure-of-Arrays:

#define NPTS 1024 * 1024
struct Coefficients_SOA {
    double u[3][NPTS];
    double x[3][3][NPTS];
    double p[NPTS];
    double rho[NPTS];
    double eta[NPTS];
};
Coefficients_SOA gridData;

Single-thread code prefers arrays of structures, for cache efficiency. SIMT code prefers structures of arrays, for execution & memory efficiency.

SLIDE 46

DATA LAYOUT: “AOS vs. SOA”

#define NPTS 1024 * 1024
struct Coefficients_AOS {
    double u[3];
    double x[3][3];
    double p;
    double rho;
    double eta;
};
Coefficients_AOS gridData[NPTS];

[Diagram: structure definition vs. conceptual layout: u0 u1 u2 | x00 x01 x02 x10 x11 x12 x20 x21 x22 | p rho eta]

SLIDE 47

AOS: STRIDED ARRAY ACCESS

double u0 = gridData[threadIdx.x].u[0];

[Diagram: conceptual layout vs. Array-of-Structures memory layout; consecutive threads read addresses a whole struct apart]

GPU reads data one element at a time, but in parallel by 32 threads in a warp.

SLIDE 48

SOA: COALESCED BUT COMPLEX

double u0 = gridData.u[0][threadIdx.x];

[Diagram: conceptual layout vs. Structure-of-Arrays memory layout; consecutive threads read adjacent addresses]

GPU reads data one element at a time, but in parallel by 32 threads in a warp.

SLIDE 49

BLOCK-WIDE LOAD VIA SHARED MEMORY

Read data linearly as bytes. Use shared memory to convert to struct

[Diagram: the block copies data linearly from device memory into shared memory]

SLIDE 50

BLOCK-WIDE LOAD VIA SHARED MEMORY

Read data linearly as bytes. Use shared memory to convert to struct

Threads which own the data grab it from shared memory.

[Diagram: device memory to shared memory, then shared memory to the owning threads]
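A sketch of the two-step trick above, reusing the Coefficients_AOS struct from the AoS/SoA slides (the kernel itself is an assumption, not code from the talk). The block reads the raw bytes linearly, which coalesces, then each thread pulls its own struct out of fast shared memory:

```cuda
struct Coefficients_AOS {
    double u[3]; double x[3][3]; double p; double rho; double eta;
};

// Launch with dynamic shared memory: blockDim.x * sizeof(Coefficients_AOS)
__global__ void loadViaShared(const Coefficients_AOS *in, double *out) {
    extern __shared__ Coefficients_AOS tile[];
    const int *src = (const int*)(in + blockIdx.x * blockDim.x);
    int *dst = (int*)tile;
    int words = blockDim.x * sizeof(Coefficients_AOS) / sizeof(int);

    // Step 1: the whole block copies the tile linearly, one word per thread
    for (int i = threadIdx.x; i < words; i += blockDim.x)
        dst[i] = src[i];
    __syncthreads();

    // Step 2: each thread reads its own struct from shared memory
    out[blockIdx.x * blockDim.x + threadIdx.x] = tile[threadIdx.x].u[0];
}
```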

SLIDE 51

CLEVER AOS/SOA TRICKS

SLIDE 52

CLEVER AOS/SOA TRICKS

Helps for any data size

SLIDE 53

HANDY LIBRARY TO HELP YOU

Trove – A utility library for fast AOS/SOA access and transposition https://github.com/bryancatanzaro/trove

SLIDE 54

(AB)USING THE CACHE

SLIDE 55

MAKING THE MOST OF L2-CACHE

L2 cache is fast but small:

[Diagram: GDRAM at 300 GB/sec vs. L2 Cache at 2,000 GB/sec]

Architecture   L2 Cache Size   Total Threads   Cache Bytes per Thread
Kepler         1536 KB         30,720          51
Maxwell        3072 KB         49,152          64
Pascal         4096 KB         114,688         36

SLIDE 56

TRAINING DEEP NEURAL NETWORKS

SLIDE 57

LOTS OF PASSES OVER DATA

[Diagram: a network of FFT and 3x3, 5x5, 7x7 convolution layers, combined with weights W1, W2, W3, classifying an image: "Cat!"]

SLIDE 58

MULTI-RESOLUTION CONVOLUTIONS

Pass 1 : 3x3 Pass 2: 5x5 Pass 3: 7x7

SLIDE 59

TILED, MULTI-RESOLUTION CONVOLUTION

Do all 3 passes per tile, with each tile sized to fit in L2 cache.

Pass 1: 3x3, Pass 2: 5x5, Pass 3: 7x7

SLIDE 60

LAUNCHING FEWER THAN MAXIMUM THREADS

SLIDE 61

SHARED MEMORY: DEFINITELY WORTH IT

SLIDE 62

USING SHARED MEMORY WISELY

Shared memory is arranged into “banks” for concurrent SIMT access

▪ 32 threads can read simultaneously so long as they hit separate banks
▪ Shared memory has 4-byte and 8-byte “bank” sizes
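A classic illustration of the bank rules above is padding a shared-memory tile; this sketch (the TILE size and kernel are illustrative, assuming a 32x32 thread block) shows the transpose-style case where padding matters:

```cuda
#define TILE 32

__global__ void columnRead(const float *in, float *out) {
    // 32 floats per row land in 32 different 4-byte banks. Reading a
    // column would hit one bank 32 times; the +1 pad shifts each row
    // by one bank, so column reads are conflict-free.
    __shared__ float tile[TILE][TILE + 1];

    tile[threadIdx.y][threadIdx.x] = in[threadIdx.y * TILE + threadIdx.x];
    __syncthreads();
    out[threadIdx.x * TILE + threadIdx.y] = tile[threadIdx.x][threadIdx.y];
}
```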

SLIDE 63

STENCIL ALGORITHM

Many algorithms have high data re-use: potentially good for shared memory “Stencil” algorithms accumulate data from neighbours onto a central point

▪ Stencil has width “W” (in the above case, W=5)

Adjacent threads will share (W-1) items of data – good potential for data re-use
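A minimal sketch of a W=5 stencil staged through shared memory, so each block reads every global element once instead of W times (names, the zero boundary condition, and the 1D layout are assumptions, not code from the talk):

```cuda
#define RADIUS 2   // W = 2*RADIUS + 1 = 5, matching the slide

// Launch with dynamic shared memory: (blockDim.x + 2*RADIUS) * sizeof(float)
__global__ void stencil5(const float *in, float *out, int n) {
    extern __shared__ float tile[];
    int g = threadIdx.x + blockDim.x * blockIdx.x;   // global index
    int l = threadIdx.x + RADIUS;                    // local index with halo offset

    tile[l] = (g < n) ? in[g] : 0.0f;
    if (threadIdx.x < RADIUS) {
        // The first RADIUS threads also fetch the left and right halos
        tile[l - RADIUS] = (g >= RADIUS) ? in[g - RADIUS] : 0.0f;
        int r = g + blockDim.x;
        tile[l + blockDim.x] = (r < n) ? in[r] : 0.0f;
    }
    __syncthreads();

    if (g < n) {
        float sum = 0.0f;
        for (int i = -RADIUS; i <= RADIUS; i++) sum += tile[l + i];
        out[g] = sum;
    }
}
```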

SLIDE 64

STENCILS IN SHARED MEMORY

SLIDE 65

SIZE MATTERS

SLIDE 66

PERSISTENT KERNELS

Avoid multiple kernel launches by caching in shared memory instead of L2

Separate kernel launches with L2 re-use:

void tiledConvolution() {
    convolution<3><<< numblocks, blockdim, 0, s >>>(ptr, chunkSize);
    convolution<5><<< numblocks, blockdim, 0, s >>>(ptr, chunkSize);
    convolution<7><<< numblocks, blockdim, 0, s >>>(ptr, chunkSize);
}

Single kernel launch with a persistent kernel:

__global__ void convolutionShared(int *data, int count, int sharedelems) {
    extern __shared__ int shdata[];
    shdata[threadIdx.x] = data[threadIdx.x + blockDim.x*blockIdx.x];
    __syncthreads();
    convolve<3>(threadIdx.x, shdata, sharedelems);
    __syncthreads();
    convolve<5>(threadIdx.x, shdata, sharedelems);
    __syncthreads();
    convolve<7>(threadIdx.x, shdata, sharedelems);
}

Revisiting the tiled convolutions

SLIDE 67

PERSISTENT KERNELS

SLIDE 68

OPERATING DIRECTLY FROM CPU MEMORY

Can save memory copies. It’s obvious when you think about it...

[Timeline: copy-to-GPU, compute, copy-to-host vs. a single kernel that reads from and writes to CPU memory directly]

With explicit copies, compute only begins when the 1st copy has finished, and the task only ends when the 2nd copy has finished. Operating directly on CPU memory, compute begins after the first fetch; it uses lots of threads to cover the host-memory access latency and takes advantage of bi-directional PCIe.
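A zero-copy sketch of the idea above, using the mapped pinned memory from the earlier pinning slide (the scale kernel and sizes are illustrative): the kernel's device pointers alias host memory, so every access crosses PCIe and there is no explicit cudaMemcpy at all.

```cuda
#include <cuda_runtime.h>

__global__ void scale(const float *in, float *out, int n) {
    int id = threadIdx.x + blockDim.x * blockIdx.x;
    if (id < n) out[id] = 2.0f * in[id];   // each access goes over PCIe
}

void runZeroCopy(int n) {
    // Must be called before the CUDA context is created
    cudaSetDeviceFlags(cudaDeviceMapHost);

    float *h_in, *h_out, *d_in, *d_out;
    cudaHostAlloc((void**)&h_in,  n * sizeof(float), cudaHostAllocMapped);
    cudaHostAlloc((void**)&h_out, n * sizeof(float), cudaHostAllocMapped);
    cudaHostGetDevicePointer((void**)&d_in,  h_in,  0);
    cudaHostGetDevicePointer((void**)&d_out, h_out, 0);

    // Many threads in flight cover the host-memory latency
    scale<<<(n + 255) / 256, 256>>>(d_in, d_out, n);
    cudaDeviceSynchronize();

    cudaFreeHost(h_in);
    cudaFreeHost(h_out);
}
```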

SLIDE 69

OPERATING DIRECTLY FROM CPU MEMORY

SLIDE 70

OCCUPANCY AND REGISTER LIMITATIONS

The register file is bigger than shared memory and L1 cache! Occupancy can kill you if you use too many registers. It is often worth forcing fewer registers to allow more blocks per SM, but watch out for math functions!

Registers used per call:

Function     float   double
log            7       18
cos           16       28
acos           6       18
cosh           7       10
tan           15       28
erfc          14       22
exp            7       10
log10          6       18
normcdf       16       26
cbrt           8       20
sqrt           6       12
rsqrt          5       12
y0            20       30
y1            22       30
fdivide       11       20
pow           11       24
grad. desc.   14       22

__launch_bounds__(maxThreadsPerBlock, minBlocksPerMultiprocessor)
__global__ void compute() {
    y = acos(pow(log(fdivide(tan(cosh(erfc(x))), 2)), 3));
}

SLIDE 71

THANK YOU!