GPU tuning, part 1 (updated)
CSE 6230: HPC Tools & Apps Fall 2014 — September 30 & October 2
vuduc.org/cse6230
Recall: [review slides from earlier lectures; figures not reproduced in this transcript]
(See HPCA’10 tutorial)
Von Neumann bottleneck: an xPU with a fast memory (total size = Z), connected to a slow memory.
W ≡ # (fl)ops performed by the xPU; Q ≡ # memory operations ("mops") between slow and fast memory, Q = Q(Z).
Balance analysis — Kung (1986); Hockney & Curington (1989); Blelloch (1994); McCalpin (1995); Williams et al. (2009); Czechowski et al. (2011); …
[Figure: the von Neumann bottleneck: an xPU with fast memory (total size = Z) connected to slow memory; τmem = time/mop, τflop = time/flop.]
T = max(W·τflop, Q·τmem) = W·τflop · max(1, (Q/W)·(τmem/τflop)) = W·τflop · max(1, Bτ/I)
E = W·εflop + Q·εmem = W·εflop · (1 + Bε/I)
Consider: W·τflop/T and W·εflop/E.
Here I ≡ W/Q is the intensity (flop : mop), Bτ ≡ τmem/τflop is the time balance (flop : mop), and Bε ≡ εmem/εflop is the energy balance; W·τflop is the minimum (compute-bound) time.
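As a concrete illustration, here is a minimal host-side sketch (my addition, not from the slides) that evaluates the time model above for a hypothetical kernel; the peak rates are illustrative assumptions loosely modeled on an M2090-class GPU.

#include <stdio.h>

int main (void) {
    /* Assumed machine parameters (illustrative, not measured): */
    double tau_flop = 1.0 / 665e9;        /* seconds per flop (~665 Gflop/s peak)      */
    double tau_mem  = 8.0 / 177e9;        /* seconds per 8-byte mop (~177 GB/s peak)   */
    /* Hypothetical kernel: */
    double W = 2.0e9;                     /* flops */
    double Q = 1.0e9;                     /* mops  */
    double I     = W / Q;                 /* intensity (flop : mop)    */
    double B_tau = tau_mem / tau_flop;    /* time balance (flop : mop) */
    double ratio = B_tau / I;
    double T = W * tau_flop * (ratio > 1.0 ? ratio : 1.0);
    printf ("I = %.2f, B_tau = %.2f, T = %.3e s (%s bound)\n",
            I, B_tau, T, (I < B_tau) ? "memory" : "compute");
    return 0;
}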
“Roofline” — Williams et al. (Comm. ACM ’09)
[Roofline figure: relative performance vs. intensity (FLOP:byte) on log-log axes. Balance estimate for a high-end NVIDIA Fermi in double precision, according to Keckler et al., IEEE Micro (2011): roughly 3.6 flop:byte. Kernels whose intensity falls below the balance point are memory (bandwidth) bound (e.g., sparse matvec, stencils, FFTs); kernels above it are compute bound (e.g., dense matrix multiply).]
(thread- vs. instruction-level parallelism)
See also: https://bitbucket.org/rvuduc/volkov-gtc10 http://www.cs.berkeley.edu/~volkov/volkov10-GTC.pdf http://www.realworldtech.com/fermi/
Latency [time] vs. throughput [ops/time]
#define N <constant-value>
__global__ void kernel (float *pa, float b, float c) {
    float a = *pa;
    for (int i = 0; i < N; ++i)
        a = a * b + c;   // a single dependent FMA chain: each iteration waits on the previous one
    *pa = a;             // write the result back so the loop is not optimized away
}
https://bitbucket.org/rvuduc/volkov-gtc10
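For context, here is a host-side sketch (my addition, not from the slides or the linked code) of how one might time this kernel with CUDA events to estimate the achieved fraction of peak; the launch configuration is an arbitrary assumption, and N must already be set to a real constant.

#include <cstdio>
#include <cuda_runtime.h>

int main () {
    float *d_a;
    cudaMalloc (&d_a, sizeof(float));
    cudaMemset (d_a, 0, sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate (&start);
    cudaEventCreate (&stop);

    const int threads = 256, blocks = 64;             // assumed launch configuration
    cudaEventRecord (start);
    kernel<<<blocks, threads>>> (d_a, 1.01f, 0.99f);  // kernel and N as defined above
    cudaEventRecord (stop);
    cudaEventSynchronize (stop);

    float ms = 0.0f;
    cudaEventElapsedTime (&ms, start, stop);
    double flops = 2.0 * (double)N * threads * blocks;  // one FMA = 2 flops per iteration
    printf ("%.1f Gflop/s\n", flops / (ms * 1e-3) / 1e9);
    cudaFree (d_a);
    return 0;
}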
[Plot: fraction of peak vs. threads per block (32 to 1024) for the dependent-FMA kernel; marked value 0.63.]
#define N <constant-value>
__global__ void kernel (float *pa, float b, float c) {
    float a[2] = {0, 0};
    for (int i = 0; i < N; ++i) {
        a[0] = a[0] * b + c;   // mutually independent: the two FMA chains
        a[1] = a[1] * b + c;   //   can issue back to back (ILP = 2)
    }
    *pa = a[0] + a[1];         // write back so the work is not optimized away
}
[Plot: fraction of peak vs. threads per block for ILP = 1 and ILP = 2; marked values 0.05, 0.09, 0.63, 0.73. Fraction of peak increases with ILP.]
#define N <constant-value>
#define K <constant-value>
__global__ void kernel (float *pa, float b, float c) {
    float a[K] = {0};                  // K mutually independent accumulators
    for (int i = 0; i < N; ++i)
        #pragma unroll
        for (int k = 0; k < K; ++k)
            a[k] = a[k] * b + c;
    *pa = a[0];                        // write back so the loops are not optimized away
}
[Plots: fraction of peak vs. threads per block as the ILP factor K grows through 2, 4, 8, 16; marked values include 0.05, 0.09, 0.13, 0.15 and 0.63, 0.73, 0.89, 0.94, 0.96. Fraction of peak increases with ILP.]
You must exploit ILP to reach peak arithmetic performance!
One instruction controls 16 such units, so three instructions must be available every cycle to keep them all busy. However, the hardware has only two warp schedulers! To compensate, it issues two instructions per warp per cycle.
Latency [time] vs. throughput [ops/time]
Hiding memory latency requires upwards of 25k threads in flight! Given only 1,024 threads, each needs to keep about 106 B of data in flight.
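A back-of-the-envelope way to arrive at numbers of this size (my arithmetic, using an assumed bandwidth and latency, not figures from the slides): by Little's law, the data that must be in flight is bandwidth times latency.

\begin{align*}
\text{bytes in flight} &\approx 177~\text{GB/s} \times 600~\text{ns} \approx 106~\text{KB},\\
\text{threads needed} &\approx 106~\text{KB} \,/\, 4~\text{B per outstanding load} \approx 26{,}500 \;(\approx 25\text{k}),\\
\text{bytes per thread (1,024 threads)} &\approx 106~\text{KB} \,/\, 1024 \approx 106~\text{B}.
\end{align*}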
#define K <constant-value>
__global__ void memcpy (float* dest, float* src) {
    int i = …;                        // index calculation
    float a[K];
    for (int k = 0; k < K; ++k)       // issue K independent loads...
        a[k] = src[f(i,k)];
    for (int k = 0; k < K; ++k)       // ...then K independent stores
        dest[f(i,k)] = a[k];
}
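The next slides report float2/float4 results. As a rough illustration (my addition, not the slides' code), the same copy written with wide vector accesses might look like the following, assuming a simple linear index rather than the unspecified f(i,k) mapping above.

__global__ void memcpy_float4 (float4* dest, const float4* src) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // one float4 (16 B) per thread
    dest[i] = src[i];                                // a single wide load and wide store
}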
[Plots: fraction of peak bandwidth vs. threads per block for the copy kernel. Scalar copies with ILP = 2, 4, 8: marked values 0.25, 0.41, 0.62, 0.64, 0.71, 0.72, 0.70. float2/float4 copies: marked values 0.25, 0.43, 0.65, 0.71, 0.72.]
The tradeoff: more threads vs. more registers per thread.
Feeding the FMA units: operands a, b, c @ 8.1 TB/s; a*b+c @ 1.3 Tflop/s; result @ 2.7 TB/s.
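One standard way to steer this tradeoff (a hedged illustration, my addition: the mechanisms are standard CUDA, but the specific numbers are arbitrary) is to bound registers per thread so that more thread blocks can be resident at once.

__global__ void
__launch_bounds__ (256 /* max threads per block */, 4 /* min resident blocks per SM */)
scaled_copy (float* x) {
    // The compiler caps register usage so that 4 blocks of 256 threads fit per SM,
    // trading registers per thread for more threads in flight.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    x[i] *= 2.0f;
}

// Alternatively, cap registers for a whole compilation unit:
//   nvcc -maxrregcount=32 file.cu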
(Adapted by Jee Choi [GT] from original tutorial by Mark Harris)
See: http://developer.download.nvidia.com/assets/cuda/files/reduction.pdf
[Roofline recap: relative performance vs. intensity (FLOP:byte), with the compute-bound and memory (bandwidth) bound regions marked.]
M2090: ~177 GB/s peak bandwidth [~125 GB/s measured]. The reduction performs roughly one add per 4-byte element, ~0.25 flop/byte, so it sits deep in the bandwidth-bound region; its performance is therefore reported in GB/s.
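A quick consistency check (my arithmetic, assuming the runs in the tables below are limited by the measured M2090 bandwidth quoted above): the best achievable time for 16 M 4-byte integers is

\begin{align*}
16\,\text{M} \times 4~\text{B} &= 64~\text{MiB} \approx 67~\text{MB},\\
67~\text{MB} \,/\, 125~\text{GB/s} &\approx 0.54~\text{ms},
\end{align*}

so even the fastest kernel reported below (1.2 ms, 54 GB/s) leaves roughly a 2x gap to the measured bandwidth.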
[Figure: tree reduction in shared memory with interleaved addressing, iterations 1 through 4; at each step, the threads whose IDs are multiples of 2s add the element s positions away into their own element.]
__global__ void reduce (const int* In, int* Out) {
    extern __shared__ int Local[];                // shared-memory scratch, one int per thread
    int tid = threadIdx.x;                        // Local thread ID
    int i = blockIdx.x*blockDim.x + tid;          // Global index
    Local[tid] = In[i];                           // Load into shared mem
    __syncthreads ();
    for (int s = 1; s < blockDim.x; s *= 2) {     // Reduce loop (interleaved addressing)
        if (tid % (2*s) == 0)                     // Is tid a multiple of 2s (2, 4, 8, …)?
            Local[tid] += Local[tid + s];
        __syncthreads ();
    }
    if (tid == 0)                                 // Thread 0 holds this block's partial sum
        Out[blockIdx.x] = Local[0];
}
Results so far (reduction of N = 16 M integers):

Version                      N      Time     Performance   Speedup
0: Baseline (interleaved)   16 M    7.7 ms    8.7 GB/s       —

Problem: the test (tid % (2*s) == 0) leaves the active threads of each warp scattered, i.e., highly divergent branching.
[Figure: the same tree reduction, contrasting the interleaved version (active thread IDs 0, 2, 4, … then 0, 4, 8, …) with a re-mapped version in which the active threads at every iteration are the contiguous IDs 0, 1, 2, …, so branching within a warp is uniform and whole warps retire early.]
// Version 0: interleaved addressing (divergent)
for (int s = 1; s < blockDim.x; s *= 2) {         // Reduce loop
    if (tid % (2*s) == 0)                         // Is tid a multiple of 2s (2, 4, 8, …)?
        Local[tid] += Local[tid + s];
    __syncthreads ();
}

// Version 1: same element stride, but work mapped to contiguous thread IDs
for (int s = 1; s < blockDim.x; s *= 2) {         // Element stride
    int index = 2*s*tid;                          // Thread-ID-to-element mapping
    if (index < blockDim.x)
        Local[index] += Local[index + s];         // (note: indexed by index, not tid)
    __syncthreads ();
}
Version                      N      Time     Performance   Speedup
0: Baseline (interleaved)   16 M    7.7 ms    8.7 GB/s       —
1: Non-divergent threading  16 M    5.5 ms    12 GB/s        1.4×
Shared memory bank conflicts
[Figure: an array mapped round-robin onto 4 shared-memory banks. Maximum throughput occurs when each access delivers 1 word per bank.]
[Figure: sequential addressing; in each iteration, thread tid adds Local[tid + s] into Local[tid] for a decreasing stride s, so the active threads and the words they touch are contiguous and conflict-free.]
Version                      N      Time     Performance   Speedup
0: Baseline (interleaved)   16 M    7.7 ms    8.7 GB/s       —
1: Non-divergent threading  16 M    5.5 ms    12 GB/s        1.4×
2: Sequential addressing    16 M    4.0 ms    17 GB/s        1.9×
Where version 2 stands:
– No divergence, no bank conflicts.
– Data loading is already coalesced.
– But at most half of the threads are working at any given time.
Version 3: launch half as many threads and do the first add while loading from global memory (see the sketch below):
– Reduces wasted threads.
– Doubles the number of memory requests per thread.
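A hedged sketch (my addition, following the "1/2 threads & first add" idea named in the next table) of how the load step changes; it assumes each block now reduces 2*blockDim.x elements.

int i = blockIdx.x * (blockDim.x * 2) + tid;   // each block covers twice as many elements
Local[tid] = In[i] + In[i + blockDim.x];       // first add performed during the global load
__syncthreads ();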
Version                      N      Time     Performance   Speedup
0: Baseline (interleaved)   16 M    7.7 ms    8.7 GB/s       —
1: Non-divergent threading  16 M    5.5 ms    12 GB/s        1.4×
2: Sequential addressing    16 M    4.0 ms    17 GB/s        1.9×
3: 1/2 threads & first add  16 M    2.2 ms    31 GB/s        3.5×
Once only a single warp remains active:
– We don't need __syncthreads ().
– We don't need the conditionals, because all threads of a warp execute synchronously.
for (int s = … /* decreasing */ …) {
    if (tid < s)
        Local[tid] += Local[tid + s];
    __syncthreads ();
}
for (int s = … /* decreasing, but more than the warp size, e.g., 32 */ …) {
    if (tid < s)
        Local[tid] += Local[tid + s];
    __syncthreads ();
}
__device__ void warpReduce (volatile int* Local, int tid) {
    // Within a single warp, no __syncthreads() and no conditionals are needed.
    Local[tid] += Local[tid + 32];
    Local[tid] += Local[tid + 16];
    Local[tid] += Local[tid +  8];
    Local[tid] += Local[tid +  4];
    Local[tid] += Local[tid +  2];
    Local[tid] += Local[tid +  1];
}

tid=0: … Local[0] += Local[2]; Local[0] += Local[1];
tid=1: … Local[1] += Local[3]; Local[1] += Local[2];
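For completeness, a short sketch (my addition) of how the main loop might hand off to warpReduce once the stride reaches the warp size; the stopping point of 32 matches the first step of warpReduce above.

for (int s = blockDim.x / 2; s > 32; s /= 2) {   // stop once one warp's worth of work remains
    if (tid < s)
        Local[tid] += Local[tid + s];
    __syncthreads ();
}
if (tid < 32)
    warpReduce (Local, tid);                     // final 64 -> 1 without further syncs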
Version                      N      Time     Performance   Speedup
0: Baseline (interleaved)   16 M    7.7 ms    8.7 GB/s       —
1: Non-divergent threading  16 M    5.5 ms    12 GB/s        1.4×
2: Sequential addressing    16 M    4.0 ms    17 GB/s        1.9×
3: 1/2 threads & first add  16 M    2.2 ms    31 GB/s        3.5×
4: Unroll the last warp     16 M    1.3 ms    51 GB/s        5.9×
Complete unrolling:
– C++ templates make the thread-block size a compile-time constant (set via compiler options or at the kernel's call site).
– Template variables (TBS below) are evaluated at compile time, so the branches that do not apply disappear entirely.

template <int TBS>                   // Thread-block size
__global__ void gpuReduce (…) {
    …
    if (TBS >= 1024) { if (tid < 512) { Local[tid] += Local[tid + 512]; } __syncthreads(); }
    if (TBS >=  512) { if (tid < 256) { Local[tid] += Local[tid + 256]; } __syncthreads(); }
    if (TBS >=  256) { if (tid < 128) { Local[tid] += Local[tid + 128]; } __syncthreads(); }
    if (TBS >=  128) { if (tid <  64) { Local[tid] += Local[tid +  64]; } __syncthreads(); }
}
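A hedged sketch (my addition) of how the templated kernel might be dispatched from the host: the block size must be known at compile time, so the host switches over the sizes it supports. The argument list (d_in, d_out), smemBytes, and blocks are placeholder names, since the slide elides gpuReduce's parameters.

switch (threadsPerBlock) {
    case 1024: gpuReduce<1024><<<blocks, 1024, smemBytes>>> (d_in, d_out); break;
    case  512: gpuReduce< 512><<<blocks,  512, smemBytes>>> (d_in, d_out); break;
    case  256: gpuReduce< 256><<<blocks,  256, smemBytes>>> (d_in, d_out); break;
    case  128: gpuReduce< 128><<<blocks,  128, smemBytes>>> (d_in, d_out); break;
    // …smaller sizes as needed
}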
Version                      N      Time     Performance   Speedup
0: Baseline (interleaved)   16 M    7.7 ms    8.7 GB/s       —
1: Non-divergent threading  16 M    5.5 ms    12 GB/s        1.4×
2: Sequential addressing    16 M    4.0 ms    17 GB/s        1.9×
3: 1/2 threads & first add  16 M    2.2 ms    31 GB/s        3.5×
4: Unroll the last warp     16 M    1.3 ms    51 GB/s        5.9×
5: Complete unrolling       16 M    1.2 ms    54 GB/s        6.2×
Version 6: more work per thread (see the sketch below).
– Instead of each thread adding just two elements at the start, have it add more.
– Reduces thread-block scheduling overhead.
– Increases the number of memory requests per thread.
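A hedged sketch (my addition) of version 6's load phase: each thread strides across the input and accumulates several elements before the shared-memory tree reduction begins. The variable n (total element count) is a placeholder, and the snippet assumes n is a multiple of 2*blockDim.x.

int i = blockIdx.x * (blockDim.x * 2) + tid;
int gridSize = blockDim.x * 2 * gridDim.x;
int sum = 0;
while (i < n) {                        // n: total number of elements (assumed parameter)
    sum += In[i] + In[i + blockDim.x];
    i += gridSize;
}
Local[tid] = sum;
__syncthreads ();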
Version                      N      Time     Performance   Speedup
0: Baseline (interleaved)   16 M    7.7 ms    8.7 GB/s       —
1: Non-divergent threading  16 M    5.5 ms    12 GB/s        1.4×
2: Sequential addressing    16 M    4.0 ms    17 GB/s        1.9×
3: 1/2 threads & first add  16 M    2.2 ms    31 GB/s        3.5×
4: Unroll the last warp     16 M    1.3 ms    51 GB/s        5.9×
5: Complete unrolling       16 M    1.2 ms    54 GB/s        6.2×
6: More work per thread
Make work optimal; maximize parallelism.
[Figure: rows of ⊕ operations applied over runs of ← L → consecutive elements, assuming consecutive layout.]
An abstract distributed memory machine:
Processing "nodes" have private memory and communicate by "message passing."
[Figure: a grid of nodes, each an xPU with its own memory.]
Even a single shared-memory machine can look this way: non-uniform memory access (NUMA).
[Figure: two quad-core CPUs (Core0 through Core3 each), each attached to its own DRAM.]
Example: Two quad-core CPUs with logically shared but physically distributed memory.
[Figure: one node containing two quad-core CPUs connected by QPI, each with its own DDR3 DRAM; two I/O hubs; PCIe x16 links to GPU 1, GPU 2, and GPU 3; and an Infiniband adapter connecting this node to others.]
Example: Adding GPUs, which have private memory address spaces, to the previous example.
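As a closing illustration (my addition, not from the slides): a common way to program such a node is one MPI process per GPU, with each rank binding to a device. This is a minimal sketch; the simple rank-modulo binding assumes ranks are placed consecutively on each node.

#include <mpi.h>
#include <cuda_runtime.h>

int main (int argc, char** argv) {
    MPI_Init (&argc, &argv);
    int rank = 0, ndev = 0;
    MPI_Comm_rank (MPI_COMM_WORLD, &rank);
    cudaGetDeviceCount (&ndev);
    cudaSetDevice (rank % ndev);    // naive binding: MPI rank -> GPU within the node
    // ... allocate device buffers, compute, exchange results with MPI_Send/MPI_Recv ...
    MPI_Finalize ();
    return 0;
}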