Comment on bitonic merging; more CUDA performance tuning (CSE 6230: HPC Tools & Apps, Tu Sep 18, 2012)



SLIDE 1

Comment on bitonic merging; more CUDA performance tuning

CSE 6230: HPC Tools & Apps Tu Sep 18, 2012

Tuesday, September 18, 12

SLIDE 2

๏ Comment on bitonic merging, including ideas & hints for Lab 3

Note: Some figures taken from Grama et al. book (2003) http://www-users.cs.umn.edu/~karypis/parbook/ This book is also available online through the GT library – see our course website.

SLIDE 3

Source: Grama et al. (2003)

SLIDE 4

Summary so far: bitonicMerge (bitonic sequence) == sorted Q: How do we get a bitonic sequence?

SLIDE 5

Source: Grama et al. (2003)

SLIDE 6

Source: Grama et al. (2003)

“⊕” = (min, max) “⊖” = (max, min)

SLIDE 10

Source: Grama et al. (2003)

SLIDE 12

Bitonic sort parallel complexity (work-depth)?

SLIDE 13

0: 0000 1: 0001 2: 0010 3: 0011 4: 0100 5: 0101 6: 0110 7: 0111 8: 1000 9: 1001 10: 1010 11: 1011 12: 1100 13: 1101 14: 1110 15: 1111

SLIDE 14

Block Layout (p=4)

0: 0000 1: 0001 2: 0010 3: 0011 4: 0100 5: 0101 6: 0110 7: 0111 8: 1000 9: 1001 10: 1010 11: 1011 12: 1100 13: 1101 14: 1110 15: 1111

log(n/p) steps: no communication; log p steps: communication required

SLIDE 15

Block Layout (p=4)

0: 0000 1: 0001 2: 0010 3: 0011 4: 0100 5: 0101 6: 0110 7: 0111 8: 1000 9: 1001 10: 1010 11: 1011 12: 1100 13: 1101 14: 1110 15: 1111

log(n/p) steps: no communication; log p steps: communication required

rounds of communication = O(log n)
number of pairwise exchanges per round = O(P)
words sent per exchange = O(n/P)
total words sent = O(n log n)

SLIDE 17

0: 0000 1: 0001 2: 0010 3: 0011 4: 0100 5: 0101 6: 0110 7: 0111 8: 1000 9: 1001 10: 1010 11: 1011 12: 1100 13: 1101 14: 1110 15: 1111

Cyclic Layout (p=4): log p steps: communication required; log(n/p) steps: no communication

SLIDE 18

These (block or cyclic) examples are binary exchange algorithms. Question: Can we get the “best” of these two schemes?

SLIDE 19

0: 0000 1: 0001 2: 0010 3: 0011 4: 0100 5: 0101 6: 0110 7: 0111 8: 1000 9: 1001 10: 1010 11: 1011 12: 1100 13: 1101 14: 1110 15: 1111

“Transpose” (p=4): log p steps: no comm; log(n/p) steps: no comm; … all-to-all exchange in between …

SLIDE 20

0: 0000 1: 0001 2: 0010 3: 0011 4: 0100 5: 0101 6: 0110 7: 0111 8: 1000 9: 1001 10: 1010 11: 1011 12: 1100 13: 1101 14: 1110 15: 1111

“Transpose” (p=4): log p steps: no comm; log(n/p) steps: no comm; … all-to-all exchange in between …

rounds of communication = 1
number of pairwise exchanges per round = O(P²)
words sent per exchange = O(n/P²)
total words sent = O(n)

SLIDE 22

Cyclic (p=4): 0 4 8 12 | 1 5 9 13 | 2 6 10 14 | 3 7 11 15
Block (p=4):  0 1 2 3 | 4 5 6 7 | 8 9 10 11 | 12 13 14 15

All-to-all exchange

Matrix transpose

SLIDE 23

“Binary exchange” algorithm (block or cyclic):
- rounds of communication = O(log n)
- number of pairwise exchanges per round = O(P); total = O(P log n)
- words sent per exchange = O(n/P); total words sent = O(n log n)

“Transpose” algorithm (cyclic → all-to-all → block):
- rounds of communication = 1
- total number of pairwise exchanges = O(P²)
- words sent per exchange = O(n/P²); total words sent = O(n)

SLIDE 24

๏ More CUDA tuning: Occupancy and ILP

References:

http://developer.nvidia.com/cuda/get-started-cuda-cc
http://developer.download.nvidia.com/CUDA/training/cuda_webinars_WarpsAndOccupancy.pdf
http://www.cs.berkeley.edu/~volkov/volkov10-GTC.pdf
http://www.cs.berkeley.edu/~volkov/volkov11-unrolling.pdf

SLIDE 25

https://piazza.com/class#fall2012/cse6230/52

SLIDE 30

Occupancy

Occupancy = active warps / maximum active warps. Remember: resources are allocated for the entire block.

Resources are finite; using too many resources per thread may limit the occupancy.

Potential occupancy limiters:
- register usage
- shared memory usage
- block size

Jinx’s Fermi GPUs: 48 max active warps/SM, 32 threads/warp

SLIDE 33

/opt/cuda-4.0/cuda/bin/nvcc -arch=sm_20 --ptxas-options=-v -O3 \
  -o bitmerge-cuda.o -c bitmerge-cuda.cu

ptxas info : Compiling entry function '_Z12bitonicSplitjPfj' for 'sm_20'
ptxas info : Function properties for _Z12bitonicSplitjPfj
  0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 8 registers, 52 bytes cmem[0]

icpc -O3 -g -o bitmerge timer.o bitmerge.o bitmerge-seq.o \
  bitmerge-cilk.o bitmerge-cuda.o \
  -L/opt/cuda-4.0/cuda/bin/../lib64 \
  -Wl,-rpath /opt/cuda-4.0/cuda/bin/../lib64 -lcudart

SLIDE 34

Occupancy Limiters: Registers

Register usage: compile with --ptxas-options=-v. Fermi has 32K registers per SM.

Example 1: kernel uses 20 registers per thread (+1 implicit)
  active threads = 32K/21 = 1560 > 1536, thus an occupancy of 1

Example 2: kernel uses 63 registers per thread (+1 implicit)
  active threads = 32K/64 = 512; 512/1536 = 0.333 occupancy

Can control register usage with the nvcc flag: --maxrregcount

Occupancy = (Active warps) / (Max active warps)

Jinx’s Fermi GPUs: 48 max active warps/SM, 32 threads/warp

SLIDE 38

Recall: Reduction example

SLIDE 41

Recall: Reduction example

b = 256 threads/block ⇒ shmem = 256 * (4 Bytes/int) = 1024 Bytes

SLIDE 42

Occupancy Limiters: Shared Memory

Shared memory usage: compile with --ptxas-options=-v (reports shared memory per block). Fermi has either 16K or 48K of shared memory.

Example 1 (48K shared memory): kernel uses 32 bytes of shared memory per thread
  48K/32 = 1536 threads → occupancy = 1

Example 2 (16K shared memory): kernel uses 32 bytes of shared memory per thread
  16K/32 = 512 threads → occupancy = 0.333

Don’t use too much shared memory; choose the L1/shared config appropriately.

Occupancy = (Active warps) / (Max active warps)

Jinx’s Fermi GPUs: 48 max active warps/SM, 32 threads/warp

SLIDE 45

Occupancy Limiter: Block Size

Each SM can have up to 8 active blocks (Fermi-specific). A small block size limits the total number of threads; avoid small block sizes, generally 128-256 threads is sufficient.

Block size | Active threads        | Occupancy
32         | 256                   | 0.167
64         | 512                   | 0.333
128        | 1024                  | 0.667
192        | 1536                  | 1
256        | 2048 (capped at 1536) | 1

Occupancy = (Active warps) / (Max active warps)

Jinx’s Fermi GPUs: 48 max active warps/SM, 32 threads/warp

SLIDE 47

What Occupancy Do I Need?

Depends on your problem. Many find 66% is enough to saturate the bandwidth.

Look at increasing occupancy only if both of the following are true:
- the kernel is bandwidth bound
- the achieved bandwidth is significantly less than peak

Instruction-level parallelism (ILP) can have a greater effect than increasing occupancy. See Vasily Volkov’s GTC 2010 talk, “Better Performance at Lower Occupancy”: http://nvidia.fullviewmedia.com/gtc2010/0922-a5-2238.html

SLIDE 48

CUDA Occupancy Calculator

A tool to help you investigate occupancy:
http://developer.download.nvidia.com/compute/cuda/4_0/sdk/docs/CUDA_Occupancy_Calculator.xls
Demo: CUDA_Occupancy_Calculator.xls

SLIDE 49

Better Performance at Lower Occupancy

Vasily Volkov UC Berkeley September 22, 2010

SLIDE 50

Prologue

It is common to recommend:
- running more threads per multiprocessor
- running more threads per thread block

Motivation: this is the only way to hide latencies. But…

SLIDE 51

Maximizing occupancy, you may lose performance. Faster codes run at lower occupancy:

Batch of 1024-point complex-to-complex FFTs, single precision:
                  CUFFT 2.2    CUFFT 2.3
Threads per block 256          64            (4x smaller thread blocks)
Occupancy (G80)   33%          17%           (2x lower occupancy)
Performance (G80) 45 Gflop/s   93 Gflop/s    (2x higher performance)

Multiplication of two large matrices, single precision (SGEMM):
                  CUBLAS 1.1   CUBLAS 2.0
Threads per block 512          64            (8x smaller thread blocks)
Occupancy (G80)   67%          33%           (2x lower occupancy)
Performance (G80) 128 Gflop/s  204 Gflop/s   (1.6x higher performance)

SLIDE 52

Two common fallacies:
- multithreading is the only way to hide latency on GPU
- shared memory is as fast as registers

SLIDE 53

Arithmetic latency

Latency: time required to perform an operation
- ≈20 cycles for arithmetic; 400+ cycles for memory
- can’t start a dependent operation during this time
- can hide it by overlapping with other operations

x = a + b; // takes ≈20 cycles to execute
y = a + c; // independent, can start anytime
z = x + d; // dependent, must wait for x to complete (stall)

SLIDE 54

Arithmetic throughput

Latency is often confused with throughput:
- e.g. “arithmetic is 100x faster than memory: arithmetic costs 4 cycles per warp (G80), whence a memory operation costs 400 cycles”
- one is a rate, the other is a time

Throughput: how many operations complete per cycle
- arithmetic: 1.3 Tflop/s = 480 ops/cycle (op = multiply-add)
- memory: 177 GB/s ≈ 32 ops/cycle (op = 32-bit load)

SLIDE 55

Why parallelism? Reason 2: time to move data

Recall: Little’s Law (queuing theory) explains how concurrency helps to hide latency. (Figure: a pipe characterized by its bandwidth and latency.)

SLIDE 56

Why parallelism? Reason 2: time to move data

Little’s Law (queuing theory) explains how concurrency helps to hide latency:

Concurrency = Latency × Bandwidth

Historical note: latency halves every ~9 years; bandwidth doubles every ~3 years.

SLIDE 57

Arithmetic parallelism in numbers

GPU model   Latency (cycles)   Throughput (cores/SM)   Parallelism (operations/SM)
G80-GT200   ≈24                8                       ≈192
GF100       ≈18                32                      ≈576
GF104       ≈18                48                      ≈864

(Latency varies between different types of ops.) Can’t get 100% throughput with less parallelism: not enough operations in flight means idle cycles.

SLIDE 58

Thread-level parallelism (TLP)

It is usually recommended to use threads to supply the needed parallelism, e.g. 192 threads per SM on G80. Four independent operations, one per thread:

thread 1: x = x + a; x = x + b; x = x + c
thread 2: y = y + a; y = y + b; y = y + c
thread 3: z = z + a; z = z + b; z = z + c
thread 4: w = w + a; w = w + b; w = w + c

SLIDE 59

Instruction-level parallelism (ILP)

But you can also use parallelism among instructions in a single thread. One thread, four independent operations per step:

thread: x = x + a; y = y + a; z = z + a; w = w + a
        x = x + b; y = y + b; z = z + b; w = w + b

SLIDE 60

You can use both ILP and TLP on GPU

This applies to all CUDA-capable GPUs. E.g. on G80:
- get ≈100% peak with 25% occupancy if no ILP
- or with 8% occupancy, if 3 operations from each thread can be concurrently processed

On GF104 you must use ILP to get >66% of peak!
- 48 cores/SM; one instruction is broadcast across 16 cores
- so the SM must issue 3 instructions per cycle, but it has only 2 warp schedulers
- instead, it can issue 2 instructions per warp in the same cycle

SLIDE 61

Let’s check it experimentally

Do many arithmetic instructions with no ILP:

#pragma unroll UNROLL
for( int i = 0; i < N_ITERATIONS; i++ )
{
    a = a * b + c;
}

- choose large N_ITERATIONS and a suitable UNROLL
- ensure a, b, and c are in registers and a is used later
- run 1 block (use 1 SM), vary block size
- see what fraction of peak (1.3 Tflop/s ÷ 15 SMs) we get

SLIDE 62

Experimental result (GTX 480): with no ILP, we need 576 threads to get 100% utilization. (Chart: fraction of peak vs. threads per SM, 128-1024; peak = 89.6 Gflop/s.)

SLIDE 63

Introduce instruction-level parallelism

Try ILP=2: two independent instructions per thread:

#pragma unroll UNROLL
for( int i = 0; i < N_ITERATIONS; i++ )
{
    a = a * b + c;
    d = d * b + c;
}

If multithreading were the only way to hide latency on a GPU, we’d have to get the same performance as before.

SLIDE 64

GPUs can hide latency using ILP: with ILP=2, we need only 320 threads to get 100% utilization. (Chart: fraction of peak vs. threads per SM, 128-1024.)

SLIDE 65

Add more instruction-level parallelism

ILP=3: triples of independent instructions:

#pragma unroll UNROLL
for( int i = 0; i < N_ITERATIONS; i++ )
{
    a = a * b + c;
    d = d * b + c;
    e = e * b + c;
}

How far can we push it?

SLIDE 66

Have more ILP, need fewer threads: with ILP=3, 256 threads suffice for 100% utilization. (Chart: fraction of peak vs. threads per SM, 128-1024.)

SLIDE 67

ILP=4: 192 threads suffice for 100% utilization; unfortunately, it doesn’t scale past ILP=4. (Chart: fraction of peak vs. threads per SM, 128-1024.)

SLIDE 68

Summary: can hide latency either way. (Left chart: fraction of peak vs. thread parallelism, 256-1024 threads, fixed instruction parallelism ILP=1. Right chart: fraction of peak vs. instruction parallelism, ILP = 1-6, fixed thread parallelism at 12.5% occupancy.)

SLIDE 69

Fallacy: increasing occupancy is the only way to improve latency hiding. No: increasing ILP is another way.

SLIDE 70

Fallacy:

Occupancy is a metric of utilization – No, ¡it’s ¡only ¡one ¡of ¡the ¡contributing ¡factors.

25

SLIDE 71

Part II: Hide memory latency using fewer threads

SLIDE 72

Hiding memory latency

Apply the same formula, needed parallelism = latency × throughput, but for memory operations:

            Latency            Throughput        Parallelism
Arithmetic  ≈18 cycles         32 ops/SM/cycle   576 ops/SM
Memory      < 800 cycles (?)   < 177 GB/s        < 100 KB

So, to hide memory latency, keep ≈100 KB in flight; less if the kernel is compute bound (needs fewer GB/s).

SLIDE 73

Empirical validation: copy one float per thread. Run many blocks; allocate shared memory dynamically to control occupancy.

__global__ void memcpy( float *dst, float *src )
{
    int block = blockIdx.x + blockIdx.y * gridDim.x;
    int index = threadIdx.x + block * blockDim.x;
    float a0 = src[index];
    dst[index] = a0;
}

SLIDE 74

Copying 1 float per thread (GTX 480): must we maximize occupancy to hide latency? (Chart: fraction of peak vs. occupancy; peak = 177.4 GB/s.)

SLIDE 75

Do more parallel work per thread. Note: threads don’t stall on the memory access itself, only on a data dependency.

__global__ void memcpy( float *dst, float *src )
{
    int iblock = blockIdx.x + blockIdx.y * gridDim.x;
    int index  = threadIdx.x + 2 * iblock * blockDim.x;
    float a0 = src[index];                // no latency stall
    float a1 = src[index + blockDim.x];   // stall
    dst[index] = a0;
    dst[index + blockDim.x] = a1;
}

SLIDE 76

Copying 2 float values per thread: can get away with lower occupancy now. (Chart: fraction of peak vs. occupancy.)

SLIDE 77

Do more parallel work per thread. Note: local arrays are allocated in registers if possible.

__global__ void memcpy( float *dst, float *src )
{
    int iblock = blockIdx.x + blockIdx.y * gridDim.x;
    int index  = threadIdx.x + 4 * iblock * blockDim.x;
    float a[4];   // allocated in registers
    for( int i = 0; i < 4; i++ ) a[i] = src[index + i * blockDim.x];
    for( int i = 0; i < 4; i++ ) dst[index + i * blockDim.x] = a[i];
}

SLIDE 78

Copying 4 float values per thread: a mere 25% occupancy is sufficient. How far can we go? (Chart: fraction of peak vs. occupancy.)

SLIDE 79

Copying 14 float4 values per thread: 84% of peak at 4% occupancy. (Chart: fraction of peak vs. occupancy.)

SLIDE 80

Two ways to hide memory latency. (Left chart: fraction of peak vs. occupancy. Right chart: fraction of peak vs. bytes per thread, 64-256.)

SLIDE 81

Fewer threads = more registers per thread

With 32768 registers per SM, we trade more threads against more registers per thread:
- GF100: 20 registers/thread at 100% occupancy, 63 at 33% occupancy (3x)
- GT200: 16 registers/thread at 100% occupancy, ≈128 at 12.5% occupancy (8x)

Is using more registers per thread better?

SLIDE 82

Only registers are fast enough to get the peak

Consider a*b+c: 2 flops, 12 bytes in, 4 bytes out. At 1.3 Tflop/s, that means reading a, b, c at 8.1 TB/s and writing the result at 2.7 TB/s. Registers can accommodate it. Can shared memory?
- 4 B × 32 banks × 15 SMs × half of 1.4 GHz = 1.3 TB/s only

SLIDE 83

Bandwidth needed vs bandwidth available

Needed to get the peak: 8 TB/s (registers are at least this fast)
Shared memory: 1.3 TB/s (6x less than needed)
Global memory: 177 GB/s (7.6x less than shared memory)

SLIDE 84

Fallacy: “In fact, for all threads of a warp, accessing the shared memory is as fast as accessing a register as long as there are no bank conflicts between the threads.” (CUDA Programming Guide) No: shared memory bandwidth is ≈6x lower than register bandwidth on Fermi (3x before Fermi).

SLIDE 85

Running fast may require low occupancy

- must use registers to run close to the peak
- the larger the bandwidth gap, the more data must come from registers
- this may require many registers = low occupancy

This can often be accomplished by computing multiple outputs per thread.

SLIDE 86

Compute multiple outputs per thread

- more data is local to a thread in registers, so it may need fewer shared memory accesses
- fewer threads, but more parallel work per thread, so low occupancy should not be a problem

(Figure: a 4x4 matrix computed with 16 threads at 1 output/thread, 8 threads at 2 outputs/thread, or 4 threads at 4 outputs/thread.)

SLIDE 87

From Tesla to Fermi: regression?

The gap between shared memory and arithmetic throughput has increased:
- G80-GT200: 16 banks vs 8 thread processors (2:1)
- GF100: 32 banks vs 32 thread processors (1:1)
- GF104: 32 banks vs 48 thread processors (2:3)

Using fast register memory could help, but instead register use is restricted:
- G80-GT200: up to ≈128 registers per thread
- Fermi: up to ≈64 registers per thread

SLIDE 88

๏ Two-level memory optimizations (whiteboard)
