Comment on bitonic merging; more CUDA performance tuning
CSE 6230: HPC Tools & Apps Tu Sep 18, 2012
๏ Comment on bitonic merging, including ideas & hints for Lab 3
Note: Some figures are taken from the Grama et al. book (2003): http://www-users.cs.umn.edu/~karypis/parbook/ The book is also available online through the GT library – see our course website.
[Figures illustrating bitonic merging (source: Grama et al., 2003)]
Block Layout (p = 4), n = 16 elements; each index i is shown in binary:

  0: 0000    1: 0001    2: 0010    3: 0011
  4: 0100    5: 0101    6: 0110    7: 0111
  8: 1000    9: 1001   10: 1010   11: 1011
 12: 1100   13: 1101   14: 1110   15: 1111

With a block layout, the high-order log(p) bits of an index select the processor.
log(n/p) merge steps pair indices that differ in a low-order bit: no communication.
log(p) merge steps pair indices that differ in a high-order (processor) bit: communication required.
Cyclic Layout (p = 4)

With a cyclic layout, the low-order log(p) bits of an index select the processor (i mod p).
log(p) merge steps pair indices that differ in a low-order bit: communication required.
log(n/p) merge steps pair indices that differ in a high-order bit: no communication.
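The bit reasoning above can be checked with a small host-side sketch (purely illustrative C, not Lab 3 code): for each merge distance it reports whether an element's compare-exchange partner stays on the same process under the block and cyclic layouts.

#include <stdio.h>

int main(void) {
    unsigned n = 16, p = 4, m = n / p;           /* n, p assumed powers of two; m = elements per process */
    for (unsigned d = n / 2; d >= 1; d /= 2) {   /* bitonic merge step: partner of index i is i ^ d */
        /* Block layout: element i lives on process i / m, so the partner is
           local exactly when d < m (XOR with d changes only the low bits). */
        /* Cyclic layout: element i lives on process i % p, so the partner is
           local exactly when d >= p (XOR with d leaves the low log(p) bits alone). */
        printf("distance %2u: block -> %s, cyclic -> %s\n", d,
               (d < m)  ? "local" : "remote",
               (d >= p) ? "local" : "remote");
    }
    return 0;
}

For n = 16, p = 4 this prints two remote steps and two local steps for the block layout, and the opposite for the cyclic layout, matching the log(p) / log(n/p) split on the slides.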
"Transpose" (p = 4)

Run the steps that are local in one layout, perform a single all-to-all exchange to switch between the block and cyclic layouts, then run the remaining steps locally in the other layout:

log(p) steps: no communication
… all-to-all exchange …
log(n/p) steps: no communication

The all-to-all exchange is equivalent to a matrix transpose (view the n elements as a p × (n/p) array).
Communication cost: one all-to-all exchange (transpose) vs. a pairwise exchange at every step

                               all-to-all (transpose)    pairwise at each step
rounds of communication        1                         O(log n)
pairwise exchanges per round   O(P^2)                    O(P)
total pairwise exchanges       O(P^2)                    O(P log n)
words sent per exchange        O(n/P^2)                  O(n/P)
total words sent               O(n)                      O(n log n)
๏ More CUDA tuning: Occupancy and ILP
References:
http://developer.nvidia.com/cuda/get-started-cuda-cc
http://developer.download.nvidia.com/CUDA/training/cuda_webinars_WarpsAndOccupancy.pdf
http://www.cs.berkeley.edu/~volkov/volkov10-GTC.pdf
http://www.cs.berkeley.edu/~volkov/volkov11-unrolling.pdf
https://piazza.com/class#fall2012/cse6230/52
Occupancy

Occupancy = (active warps) / (maximum active warps)
Remember: resources are allocated for the entire block.
Resources are finite; using too many resources per thread may limit occupancy.
Potential occupancy limiters: register usage, shared memory usage, block size.
Jinx's Fermi GPUs: 48 max active warps/SM, 32 threads/warp.
/opt/cuda-4.0/cuda/bin/nvcc -arch=sm_20 --ptxas-options=-v -O3 \
    ...

ptxas info : Compiling entry function '_Z12bitonicSplitjPfj' for 'sm_20'
ptxas info : Function properties for _Z12bitonicSplitjPfj
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 8 registers, 52 bytes cmem[0]

icpc -O3 -g -o bitmerge timer.o bitmerge.o bitmerge-seq.o \
    bitmerge-cilk.o bitmerge-cuda.o \
    ...
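For reference, the mangled name above corresponds to bitonicSplit(unsigned int, float*, unsigned int). A compare-exchange kernel with that signature might look like the following sketch; the parameter names and roles are assumptions, and the actual Lab 3 kernel may differ.

__global__ void bitonicSplit (unsigned int d, float* A, unsigned int n)
{
    /* One compare-exchange per pair at distance d (d assumed to be a power of two). */
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int j = i ^ d;                   /* partner index */
    if (i < j && j < n) {                     /* handle each pair exactly once */
        float a = A[i], b = A[j];
        if (a > b) { A[i] = b; A[j] = a; }    /* keep the smaller value at the lower index */
    }
}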
Occupancy Limiters: Registers

Register usage: compile with --ptxas-options=-v.
Fermi has 32K registers per SM.

Example 1: kernel uses 20 registers per thread (+1 implicit).
  Active threads = 32K / 21 ≈ 1560 threads > 1536, thus an occupancy of 1.

Example 2: kernel uses 63 registers per thread (+1 implicit).
  Active threads = 32K / 64 = 512 threads; 512 / 1536 ≈ 0.33 occupancy.

Register usage can be capped with the nvcc flag --maxrregcount.

Occupancy = (active warps) / (max active warps)
Jinx's Fermi GPUs: 48 max active warps/SM, 32 threads/warp.
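For example, register usage could be capped at 20 per thread with something like the following (an illustrative command assembled from the flags shown earlier; the source file name is an assumption):

/opt/cuda-4.0/cuda/bin/nvcc -arch=sm_20 --ptxas-options=-v --maxrregcount=20 -O3 -c bitmerge-cuda.cu

Registers that no longer fit are spilled to local memory, so a lower cap trades occupancy against spill traffic.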
Recall: Reduction example
b = 256 threads/block ⇒ shmem = 256 * (4 Bytes/int) = 1024 Bytes
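The kernel itself is not reproduced in these notes; a minimal sketch consistent with that sizing (256 ints of shared memory per block; the names and exact structure are assumptions, not the lab's actual code):

__global__ void reduceSum (const int* in, int* blockSums, int n)
{
    __shared__ int sdata[256];                  /* 256 * 4 B = 1024 B of shared memory per block */
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;
    sdata[tid] = (i < n) ? in[i] : 0;           /* load one element per thread */
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {   /* tree reduction in shared memory */
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    if (tid == 0) blockSums[blockIdx.x] = sdata[0];  /* one partial sum per block */
}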
Occupancy Limiters: Shared Memory

Shared memory usage: compile with --ptxas-options=-v (reports shared memory per block).
Fermi has either 16K or 48K of shared memory per SM.

Example 1 (48K shared memory): kernel uses 32 bytes of shared memory per thread.
  48K / 32 = 1536 threads.

Example 2 (16K shared memory): kernel uses 32 bytes of shared memory per thread.
  16K / 32 = 512 threads.

Don't use too much shared memory; choose the L1/shared configuration appropriately.

Occupancy = (active warps) / (max active warps)
Jinx's Fermi GPUs: 48 max active warps/SM, 32 threads/warp.
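One way to choose that configuration from host code (a sketch; reduceSum is the hypothetical kernel from the reduction example above, someOtherKernel is a placeholder, and error checking is omitted):

#include <cuda_runtime.h>

/* Called once from host code before launching the kernels. */
void configureCaches( void )
{
    /* prefer 48 KB shared / 16 KB L1 for the shared-memory-heavy reduction */
    cudaFuncSetCacheConfig( reduceSum, cudaFuncCachePreferShared );
    /* prefer 48 KB L1 / 16 KB shared for a kernel that uses little shared memory */
    cudaFuncSetCacheConfig( someOtherKernel, cudaFuncCachePreferL1 );
}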
Occupancy Limiter: Block Size

Each SM can have up to 8 active blocks. A small block size limits the total number of threads; avoid small block sizes, generally 128-256 threads is sufficient.

Block size   Active threads          Occupancy
32           256                     0.1666
64           512                     0.3333
128          1024                    0.6666
192          1536                    1
256          2048 (capped at 1536)   1

Occupancy = (active warps) / (max active warps)
Jinx's Fermi GPUs: 48 max active warps/SM, 32 threads/warp.
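A usage sketch of that guidance, launching the hypothetical reduceSum kernel from the reduction example with 256-thread blocks (assumes the device pointers are already allocated):

void launchReduction( const int *d_in, int *d_blockSums, int n )
{
    int threadsPerBlock = 256;                                    /* within the 128-256 range above */
    int numBlocks = (n + threadsPerBlock - 1) / threadsPerBlock;  /* one input element per thread */
    reduceSum<<< numBlocks, threadsPerBlock >>>( d_in, d_blockSums, n );
}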
What Occupancy Do I Need?

Depends on your problem… Many find that roughly 66% is enough to saturate memory bandwidth.

Look at increasing occupancy only if both of the following are true:
- the kernel is bandwidth bound, and
- the achieved bandwidth is significantly less than peak.

Instruction-level parallelism (ILP) can have a greater effect than increasing occupancy.
See Vasily Volkov's GTC 2010 talk "Better Performance at Lower Occupancy":
http://nvidia.fullviewmedia.com/gtc2010/0922-a5-2238.html
CUDA Occupancy Calculator

A tool to help you investigate occupancy:
http://developer.download.nvidia.com/compute/cuda/4_0/sdk/docs/CUDA_Occupancy_Calculator.xls
Demo: CUDA_Occupancy_Calculator.xls
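A crude, back-of-the-envelope version of what the spreadsheet computes, covering the register, shared-memory, and block-size limits discussed above (Fermi-era limits assumed; the real calculator also accounts for allocation granularities that this sketch ignores):

#include <stdio.h>

static int min3(int a, int b, int c) { int m = a < b ? a : b; return m < c ? m : c; }

int main(void)
{
    /* Assumed Fermi-class limits (see slides): */
    const int max_threads_per_sm = 1536;        /* 48 warps * 32 threads */
    const int max_blocks_per_sm  = 8;
    const int regs_per_sm        = 32 * 1024;
    const int shmem_per_sm       = 48 * 1024;   /* assuming the 48K-shared / 16K-L1 split */

    /* Example kernel resources (assumed; read yours from --ptxas-options=-v): */
    int block_size      = 256;
    int regs_per_thread = 21;      /* 20 + 1 implicit */
    int shmem_per_block = 1024;

    int by_threads = max_threads_per_sm / block_size;
    int by_regs    = regs_per_sm / (regs_per_thread * block_size);
    int by_shmem   = shmem_per_block ? shmem_per_sm / shmem_per_block : max_blocks_per_sm;
    int blocks     = min3(by_threads, by_regs, by_shmem);
    if (blocks > max_blocks_per_sm) blocks = max_blocks_per_sm;

    double occupancy = (double)(blocks * block_size) / max_threads_per_sm;
    printf("active blocks/SM = %d, occupancy = %.2f\n", blocks, occupancy);
    return 0;
}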
                     CUFFT 2.2     CUFFT 2.3
Threads per block    256           64            4x smaller thread blocks
Occupancy (G80)      33%           17%           2x lower occupancy
Performance (G80)    45 Gflop/s    93 Gflop/s    2x higher performance

                     CUBLAS 1.1    CUBLAS 2.0
Threads per block    512           64            8x smaller thread blocks
Occupancy (G80)      67%           33%           2x lower occupancy
Performance (G80)    128 Gflop/s   204 Gflop/s   1.6x higher performance
x = a + b;   // takes ≈20 cycles to execute
y = a + c;   // independent, can start anytime
// (stall)
z = x + d;   // dependent, must wait for completion
Why parallelism? Reason 2: Time to move data

Little's Law (queuing theory) explains how concurrency helps to hide latency.

Historical note: latency halves roughly every 9 years; bandwidth doubles roughly every 3 years.
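In concrete terms, Little's Law says the amount of work that must be in flight to keep a channel busy is its latency times its throughput: concurrency = latency × throughput. For example (illustrative numbers, not taken from the slides): hiding an 800 ns memory latency at 150 GB/s requires roughly 800 ns × 150 GB/s = 120 KB of data in flight at any time.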
#pragma unroll UNROLL
for( int i = 0; i < N_ITERATIONS; i++ ) {
    a = a * b + c;    // single dependent chain: ILP = 1
}
#pragma unroll UNROLL
for( int i = 0; i < N_ITERATIONS; i++ ) {
    a = a * b + c;    // two independent chains: ILP = 2
    d = d * b + c;
}
#pragma unroll UNROLL
for( int i = 0; i < N_ITERATIONS; i++ ) {
    a = a * b + c;    // three independent chains: ILP = 3
    d = d * b + c;
    e = e * b + c;
}
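The loop variants above presumably sit inside a benchmark kernel; a self-contained sketch of such a harness for the three-chain case (entirely assumed, not Volkov's actual code):

#define N_ITERATIONS 4096

__global__ void fma_ilp3( float *out, float b, float c )
{
    /* three independent accumulators -> three FMA chains in flight per thread */
    float a = threadIdx.x;
    float d = a + 1.0f;
    float e = a + 2.0f;
    #pragma unroll 16
    for( int i = 0; i < N_ITERATIONS; i++ ) {
        a = a * b + c;
        d = d * b + c;
        e = e * b + c;
    }
    /* write the results so the compiler cannot optimize the loop away */
    out[threadIdx.x + blockIdx.x * blockDim.x] = a + d + e;
}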
[Two plots: (left) fixed instruction parallelism (ILP=1), 0-100% vs. 256-1024 threads; (right) fixed thread parallelism (12.5% occupancy), 0-100% vs. ILP 1-6]
__global__ void memcpy( float *dst, float *src )
{
    int block = blockIdx.x + blockIdx.y * gridDim.x;
    int index = threadIdx.x + block * blockDim.x;
    float a0 = src[index];
    dst[index] = a0;
}
__global__ void memcpy( float *dst, float *src )
{
    int iblock = blockIdx.x + blockIdx.y * gridDim.x;
    int index  = threadIdx.x + 2 * iblock * blockDim.x;
    float a0 = src[index];
    float a1 = src[index + blockDim.x];   // independent load: no latency stall
    dst[index] = a0;                      // stall: must wait for a0 to arrive
    dst[index + blockDim.x] = a1;
}
__global__ void memcpy( float *dst, float *src )
{
    int iblock = blockIdx.x + blockIdx.y * gridDim.x;
    int index  = threadIdx.x + 4 * iblock * blockDim.x;
    float a[4];   // allocated in registers
    for( int i = 0; i < 4; i++ )
        a[i] = src[index + i * blockDim.x];
    for( int i = 0; i < 4; i++ )
        dst[index + i * blockDim.x] = a[i];
}
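For the four-output version, the grid must be a quarter of the size used for the one-output version; a launch sketch (names assumed, n assumed to be a multiple of 4 * blockDim.x):

void launchMemcpy4( float *d_dst, float *d_src, int n )
{
    int threads = 256;
    int blocks  = n / (4 * threads);               /* each thread copies 4 floats */
    memcpy<<< blocks, threads >>>( d_dst, d_src ); /* the 4-outputs-per-thread kernel above */
}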
More threads, or more registers (more data in flight) per thread
a, b, c @ 8.1 TB/s
a*b+c @ 1.3 Tflop/s, result @ 2.7 TB/s
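Those bandwidth figures follow from simple byte counting (a back-of-the-envelope check, not from the slide itself): 1.3 Tflop/s of a*b+c is about 0.67 × 10^12 fused multiply-adds per second; each one reads three 4-byte operands (≈ 0.67 × 10^12 × 12 B ≈ 8.1 TB/s) and writes one 4-byte result (≈ 2.7 TB/s).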
4x4 matrix: 16 threads at 1 output/thread, 8 threads at 2 outputs/thread, or 4 threads at 4 outputs/thread
๏ Two-level memory optimizations (whiteboard)