GPU tuning, part 1 (updated) CSE 6230: HPC Tools & Apps, Fall 2014 - PowerPoint PPT Presentation



slide-1
SLIDE 1

GPU tuning, part 1 (updated)

CSE 6230: HPC Tools & Apps Fall 2014 — September 30 & October 2

vuduc.org/cse6230

slide-2
SLIDE 2

2

Recall:

slide-3
SLIDE 3

2

6 GB/s

Recall:

slide-4
SLIDE 4

3

Recall:

slide-5
SLIDE 5

4

Recall:

slide-6
SLIDE 6

5

Recall:

slide-7
SLIDE 7

6

Recall:

slide-8
SLIDE 8

7

Recall:

slide-9
SLIDE 9

8

Recall:

slide-10
SLIDE 10

9

Recall:

slide-11
SLIDE 11

vuduc.org/cse6230

Performance engineering principles

(See HPCA’10 tutorial)

slide-12
SLIDE 12

von Neumann bottleneck

[Diagram: xPU ↔ fast memory (total size = Z) ↔ slow memory]

W ≡ # (fl)ops; Q ≡ # mem. ops (mops) = Q(Z)

Balance analysis — Kung (1986); Hockney & Curington (1989); Blelloch (1994); McCalpin (1995); Williams et al. (2009); Czechowski et al. (2011); …

slide-13
SLIDE 13

von Neumann bottleneck

[Diagram: xPU ↔ fast memory (total size = Z) ↔ slow memory; τmem = time/mop, τflop = time/flop]

T = max(W·τflop, Q·τmem) = W·τflop · max(1, (Q/W)·(τmem/τflop)) = W·τflop · max(1, Bτ/I)
E = W·εflop + Q·εmem = W·εflop · (1 + Bε/I)
Consider: (W·τflop)/T and (W·εflop)/E

Balance analysis — Kung (1986); Hockney & Curington (1989); Blelloch (1994); McCalpin (1995); Williams et al. (2009); Czechowski et al. (2011); …

SLIDE 14–19

von Neumann bottleneck

[Diagram: xPU ↔ fast memory (total size = Z) ↔ slow memory; τmem = time/mop, τflop = time/flop]

T = max(W·τflop, Q·τmem) = W·τflop · max(1, (Q/W)·(τmem/τflop)) = W·τflop · max(1, Bτ/I)   ← minimum time
E = W·εflop + Q·εmem = W·εflop · (1 + Bε/I)
Consider: (W·τflop)/T and (W·εflop)/E

where I = W/Q is the intensity (flop : mop), and Bτ = τmem/τflop, Bε = εmem/εflop are the machine balance (flop : mop).

Balance analysis — Kung (1986); Hockney & Curington (1989); Blelloch (1994); McCalpin (1995); Williams et al. (2009); Czechowski et al. (2011); …

SLIDE 20–26

[Roofline plot: relative performance (GFLOP/s, ticks 1/32 … 1 of peak) vs. intensity (FLOP : Byte, ticks 1/2 … 128), with the balance point marked at 3.6 flop : byte. Balance estimate for a high-end NVIDIA Fermi in double precision, according to Keckler et al., IEEE Micro (2011).]

Below the balance point, kernels are memory-(bandwidth-)bound; above it, compute-bound. Example kernels, in order of increasing intensity: sparse matvec and stencils, then FFTs (memory-bound), then dense matrix multiply (compute-bound).

“Roofline” — Williams et al. (Comm. ACM ’09)

slide-27
SLIDE 27

vuduc.org/cse6230

TLP vs. ILP

(thread- vs. instruction-level parallelism)

See also:
  • https://bitbucket.org/rvuduc/volkov-gtc10
  • http://www.cs.berkeley.edu/~volkov/volkov10-GTC.pdf
  • http://www.realworldtech.com/fermi/

slide-28
SLIDE 28

Recall Little’s Law, which quantifies the degree of concurrency needed to hide latency.

[Diagram: concurrency = latency (time) × throughput (ops/time)]

slide-29
SLIDE 29

The NVIDIA M2090 implements a fused multiply-add (FMA) with a latency of ~20 cycles. It issues up to 32 FMAs per cycle.

  • Concurrency ~ (20 cy) × (32 ops/cy),
  • or ~640 operations.
  • So, a thread block size of 640 threads should fully hide the latency.

[Diagram: concurrency = latency (time) × throughput (ops/time)]

slide-30
SLIDE 30

#define N <constant-value>

__global__
void kernel (float *pa, float b, float c) {
  float a = *pa;
  #pragma unroll 8
  for (int i = 0; i < N; ++i)
    a = a * b + c;
  *pa = a;
}

https://bitbucket.org/rvuduc/volkov-gtc10

slide-31
SLIDE 31

vuduc.org/cse6230

[Plot: fraction of peak vs. threads per block (32 … 1024), rising from ~0.05 to a plateau at ~0.63]

Plateau starts roughly where expected (~ 640 threads)

https://bitbucket.org/rvuduc/volkov-gtc10

slide-32
SLIDE 32

#define N <constant-value>

__global__
void kernel (float *pa, float b, float c) {
  float a[2] = {0, 0};
  #pragma unroll 8
  for (int i = 0; i < N; ++i) {
    a[0] = a[0] * b + c;
    a[1] = a[1] * b + c;
  }
  *pa += a[0] + a[1];
}

https://bitbucket.org/rvuduc/volkov-gtc10

slide-33
SLIDE 33

#define N <constant-value>

__global__
void kernel (float *pa, float b, float c) {
  float a[2] = {0, 0};
  #pragma unroll 8
  for (int i = 0; i < N; ++i) {
    a[0] = a[0] * b + c;   // These two updates are
    a[1] = a[1] * b + c;   // mutually independent.
  }
  *pa += a[0] + a[1];
}

https://bitbucket.org/rvuduc/volkov-gtc10

slide-34
SLIDE 34

vuduc.org/cse6230

[Plot repeated from slide 31: fraction of peak vs. threads per block, plateau at ~0.63]

Plateau starts roughly where expected (~ 640 threads)

https://bitbucket.org/rvuduc/volkov-gtc10

slide-35
SLIDE 35

vuduc.org/cse6230

[Plot: fraction of peak vs. threads per block, for 1 and 2 independent accumulators; plateaus at ~0.63 and ~0.73]

  • But fraction of peak increases with ILP

https://bitbucket.org/rvuduc/volkov-gtc10

slide-36
SLIDE 36

#define N <constant-value>
#define K <constant-value>

__global__
void kernel (float *pa, float b, float c) {
  float a[K] = {0, …, 0};
  #pragma unroll 8
  for (int i = 0; i < N; ++i)
    #pragma unroll
    for (int k = 0; k < K; ++k)
      a[k] = a[k] * b + c;   // Mutually independent across k
  for (int k = 0; k < K; ++k)
    *pa += a[k];
}

https://bitbucket.org/rvuduc/volkov-gtc10

slide-37
SLIDE 37

vuduc.org/cse6230

[Plot: 1, 2, and 4 independent accumulators; plateaus at ~0.63, ~0.73, and ~0.89]

  • Fraction of peak increases with ILP

https://bitbucket.org/rvuduc/volkov-gtc10

slide-38
SLIDE 38

vuduc.org/cse6230

[Plot: 1, 2, 4, and 8 independent accumulators; plateaus at ~0.63, ~0.73, ~0.89, and ~0.94]

  • Fraction of peak increases with ILP

https://bitbucket.org/rvuduc/volkov-gtc10

slide-39
SLIDE 39

vuduc.org/cse6230

[Plot: 1, 2, 4, 8, and 16 independent accumulators; plateaus at ~0.63, ~0.73, ~0.89, ~0.94, and ~0.96]

  • Fraction of peak increases with ILP

https://bitbucket.org/rvuduc/volkov-gtc10

slide-40
SLIDE 40

In short, you should exploit both TLP and ILP.

  • Indeed, on Jinx’s Fermi-based NVIDIA GPUs, you must exploit ILP to reach peak arithmetic performance!
  • There are 48 FMA-capable functional units, and one instruction controls 16 such units. Therefore, 3 instructions should be available per cycle to keep them busy. However, the hardware has only 2 warp schedulers! To compensate, it issues 2 instructions per warp per cycle.

[Diagram: concurrency = latency (time) × throughput (ops/time)]

slide-41
SLIDE 41

The ILP idea applies to memory, too.

  • On the NVIDIA M2090, the clock is 1.3 GHz, memory latency is ~800 cycles, and pin bandwidth is ~177 GB/s.¹ Therefore, your computation needs
  • (800 cy) / (1.3 GHz) × (177 GB/s) ~ 106 KiB in flight.²
  • ¹ Measured bandwidth is closer to ~120 GB/s.
  • ² If each thread moves 4 B, then you might need upwards of 25k threads! Given 1,024 threads, each needs to move about 106 B.

[Diagram: concurrency = latency (time) × throughput (ops/time)]

slide-42
SLIDE 42

#define K <constant-value>

__global__
void memcpy (float* dest, float* src) {
  int i = …;   // index calculation
  float a[K];
  #pragma unroll
  for (int k = 0; k < K; ++k)
    a[k] = src[f(i,k)];
  #pragma unroll
  for (int k = 0; k < K; ++k)
    dest[f(i,k)] = a[k];
}

slide-43
SLIDE 43

vuduc.org/cse6230

[Plot: fraction of peak bandwidth vs. threads per block, for 1, 2, 4, and 8 values moved per thread; plateaus from ~0.25 up to ~0.72]

Fraction of peak increases with ILP for memory accesses, too

slide-44
SLIDE 44

vuduc.org/cse6230

[Plot: fraction of peak bandwidth vs. threads per block, for float, float2, and float4 word sizes]

Fraction of peak also increases with word size

slide-45
SLIDE 45

These ILP-enhancing techniques have one more positive side effect: they increase register usage!

  • Why does that matter? Here is Volkov’s argument.

slide-46
SLIDE 46

vuduc.org/cse6230

Fewer threads = more registers per thread

Registers per thread:
  – GF100: 20 at 100% occupancy, 63 at 33% occupancy (3×)
  – GT200: 16 at 100% occupancy, ≈128 at 12.5% occupancy (8×)

Is using more registers per thread better?

[Diagram: more threads vs. more registers per thread, out of 32768 registers per SM]

slide-47
SLIDE 47

vuduc.org/cse6230

Only registers are fast enough to get the peak

Consider a*b+c: 2 flops, 12 bytes in, 4 bytes out.
That is 8.1 TB/s of operand traffic for 1.3 Tflop/s! Registers can accommodate it. Can shared memory?
  – 4 B × 32 banks × 15 SMs × half of 1.4 GHz = only 1.3 TB/s

[Diagram: a, b, c read @ 8.1 TB/s; a*b+c @ 1.3 Tflop/s; result written @ 2.7 TB/s]

slide-48
SLIDE 48

vuduc.org/cse6230

Bandwidth needed vs. bandwidth available

[Chart: global memory 177 GB/s → (7.6×) → shared memory 1.3 TB/s → (6×) → ~8 TB/s needed to get the peak. Registers are at least this fast.]

slide-49
SLIDE 49

vuduc.org/cse6230

Application: Tuning reductions

(Adapted by Jee Choi [GT] from original tutorial by Mark Harris)

See: http://developer.download.nvidia.com/assets/cuda/files/reduction.pdf

SLIDE 59–62

[Roofline plot repeated: relative performance (GFLOP/s) vs. intensity (FLOP : Byte), with compute-bound and memory-(bandwidth-)bound regions]

Compute or memory bound?

M2090: ~177 GB/s [125 GB/s measured], ~0.25 flop/byte

slide-63
SLIDE 63

Baseline

[Shared-memory tree reduction, iteration by iteration:
  iteration 1:  8  6  3  7  3  1  9  5  4  2  9  7  5  3  4
  iteration 2: 14  6 10  7  4  1 14  5  4 11  9 12  5  7  4
  iteration 3: 24  6 10  7 18  1 14  5 15 11  9 19  5  7  4
  iteration 4: 42  6 10  7 18  1 14  5 34 11  9 19  5  7  4
  active thread IDs: 2 4 6 8 10 12 14 → 4 8 12 → 8]

slide-64
SLIDE 64

vuduc.org/cse6230

__global__ void reduce (const int* In, int* Out) {
  int tid = threadIdx.x;                 // Local thread ID
  int i = blockIdx.x*blockDim.x + tid;   // Global index

  extern __shared__ int Local[];
  Local[tid] = In[i];                    // Load into shared mem
  __syncthreads ();

  for (int s=1; s<blockDim.x; s*=2) {    // Reduce loop
    if (tid % (2*s) == 0)                // Is a multiple of 2s (2, 4, 8, …)?
      Local[tid] += Local[tid + s];
    __syncthreads ();
  }

  if (tid == 0) Out[blockIdx.x] = Local[0];
}

SLIDE 65–67

(The kernel above, stepped through: the shared-memory load produces iteration 1; each pass of the reduce loop produces iterations 2, 3, … of the figure on slide 63.)

slide-68
SLIDE 68

Performance

  Version                     N     Time    Performance  Speedup
  0: Baseline (interleaved)   16 M  7.7 ms   8.7 GB/s    —

SLIDE 69

(Same kernel repeated. The test tid % (2*s) == 0 causes highly divergent branching within each warp.)

SLIDE 70

Baseline

(Tree-reduction figure repeated from slide 63.)

slide-71
SLIDE 71

Version 1: Non-divergent threading

[Same interleaved reduction, but with compacted thread IDs:
  iteration 1:  8  6  3  7  3  1  9  5  4  2  9  7  5  3  4   (threads 1–7 active)
  iteration 2: 14  6 10  7  4  1 14  5  4 11  9 12  5  7  4   (threads 1–3)
  iteration 3: 24  6 10  7 18  1 14  5 15 11  9 19  5  7  4   (thread 1)
  iteration 4: 42  6 10  7 18  1 14  5 34 11  9 19  5  7  4]

slide-72
SLIDE 72

Version 1 – non-divergent threading

Before:

  for (int s=1; s<blockDim.x; s*=2) {   // Reduce loop
    if (tid % (2*s) == 0)               // Is a multiple of 2s (2, 4, 8, …)?
      Local[tid] += Local[tid + s];

After:

  for (int s=1; s<blockDim.x; s*=2) {   // Element stride
    int index = 2*s*tid;                // Thread ID stride
    if (index < blockDim.x)
      Local[index] += Local[index + s];

[Figure: the same iterations as slide 71, with compacted thread IDs]

slide-73
SLIDE 73

Performance

  Version                     N     Time    Performance  Speedup
  0: Baseline                 16 M  7.7 ms   8.7 GB/s    —
  1: Non-divergent threading  16 M  5.5 ms  12 GB/s      1.4×

slide-74
SLIDE 74

Version 1: Non-divergent threading

(Figure repeated from slide 71.)

Problem: the strided access pattern causes shared memory bank conflicts.

slide-75
SLIDE 75

vuduc.org/cse6230

Example: an array of 9 elements mapped to 4 shared-memory banks. Maximum throughput occurs when delivering 1 word per bank.

[Diagram: array elements 1–9 assigned round-robin to the 4 banks]

slide-76
SLIDE 76

Version 2 – sequential addressing

[Shared-memory reduction with contiguous active threads:
  iteration 1:  8  6  3  7  3  1  9  5  4  2  9  7  5  3  4   (threads 1–7 active)
  iteration 2: 12  6  5 16 10  6 12  9  4  2  9  7  5  3  4   (threads 1–3)
  iteration 3: 22 12 17 25 10  6 12  9  4  2  9  7  5  3  4   (thread 1)
  iteration 4: 39 37 17 25 10  6 12  9  4  2  9  7  5  3  4]

slide-77
SLIDE 77

Performance

  Version                     N     Time    Performance  Speedup
  0: Baseline                 16 M  7.7 ms   8.7 GB/s    —
  1: Non-divergent threading  16 M  5.5 ms  12 GB/s      1.4×
  2: Sequential addressing    16 M  4.0 ms  17 GB/s      1.9×

SLIDE 78–80

Version 3 – first add during load

  • We’ve done most of the basic optimizations:
    – divergence, bank conflicts
    – data loading is already coalesced
  • However, performance is still far from our target. Why?
    – at most half of the threads are working at any given time
  • Halve the number of threads, and have each thread load 2 values and add them before the local reduction in shared memory:
    – reduces wasted threads
    – doubles the number of memory requests per thread

slide-81
SLIDE 81

Performance

  Version                     N     Time    Performance  Speedup
  0: Baseline                 16 M  7.7 ms   8.7 GB/s    —
  1: Non-divergent threading  16 M  5.5 ms  12 GB/s      1.4×
  2: Sequential addressing    16 M  4.0 ms  17 GB/s      1.9×
  3: 1/2 threads & first add  16 M  2.2 ms  31 GB/s      3.5×

slide-82
SLIDE 82

Version 4 – unrolling the last warp

  • With each iteration, the number of working threads, t, halves
  • When t <= 32, only a single warp is left:
    – we don’t need __syncthreads ()
    – we don’t need conditionals, because all threads in the warp execute synchronously
  • So: unroll the loop when t <= 32
slide-83
SLIDE 83

vuduc.org/cse6230

for (int s = … /* decreasing */ …) {
  if (tid < s)
    Local[tid] += Local[tid + s];
  __syncthreads ();
}

[Figure: iterations 2–4 repeated from slide 76]

slide-84
SLIDE 84

vuduc.org/cse6230

for (int s = … /* decreasing, but more than warp size, e.g., 32 */ …) {
  if (tid < s)
    Local[tid] += Local[tid + s];
  __syncthreads ();
}
if (tid < 32)
  warpReduce (Local, tid);

[Figure: iterations 2–4 repeated from slide 76]

slide-85
SLIDE 85

vuduc.org/cse6230

__device__ void warpReduce (volatile int* Local, int tid) {
  Local[tid] += Local[tid + 32];
  Local[tid] += Local[tid + 16];
  Local[tid] += Local[tid + 8];
  Local[tid] += Local[tid + 4];
  Local[tid] += Local[tid + 2];
  Local[tid] += Local[tid + 1];
}

[Figure: iterations 2–4 repeated from slide 76]

slide-86
SLIDE 86

vuduc.org/cse6230

(warpReduce repeated; the last steps execute warp-synchronously, e.g.:)

  tid=0: … Local[0] += Local[2]; Local[0] += Local[1];
  tid=1: … Local[1] += Local[3]; Local[1] += Local[2];

slide-87
SLIDE 87

Performance

  Version                     N     Time    Performance  Speedup
  0: Baseline                 16 M  7.7 ms   8.7 GB/s    —
  1: Non-divergent threading  16 M  5.5 ms  12 GB/s      1.4×
  2: Sequential addressing    16 M  4.0 ms  17 GB/s      1.9×
  3: 1/2 threads & first add  16 M  2.2 ms  31 GB/s      3.5×
  4: Unroll the last warp     16 M  1.3 ms  51 GB/s      5.9×

slide-88
SLIDE 88

Version 5 – unrolling the entire loop

  • The maximum number of threads in a thread block is fixed (per GPU generation)
  • The most commonly used thread block sizes are powers of 2
  • Conditionals based on constant values are evaluated at compile time
  • You can use:
    – C++ templates
    – constants for the thread block size (via compiler options or #define)
slide-89
SLIDE 89

Version 5 – unrolling the entire loop

  • Example: the template parameter is evaluated at compile time

template <int TBS>   // Thread-block size
__global__ void gpuReduce (…) {
  …
  if (TBS >= 1024) if (tid < 512) { Local[tid] += Local[tid+512]; __syncthreads(); }
  if (TBS >=  512) if (tid < 256) { Local[tid] += Local[tid+256]; __syncthreads(); }
  if (TBS >=  256) if (tid < 128) { Local[tid] += Local[tid+128]; __syncthreads(); }
  if (TBS >=  128) if (tid <  64) { Local[tid] += Local[tid+ 64]; __syncthreads(); }
  if (tid < 32) warpReduce (…);
}

slide-90
SLIDE 90

Performance

  Version                     N     Time    Performance  Speedup
  0: Baseline                 16 M  7.7 ms   8.7 GB/s    —
  1: Non-divergent threading  16 M  5.5 ms  12 GB/s      1.4×
  2: Sequential addressing    16 M  4.0 ms  17 GB/s      1.9×
  3: 1/2 threads & first add  16 M  2.2 ms  31 GB/s      3.5×
  4: Unroll the last warp     16 M  1.3 ms  51 GB/s      5.9×
  5: Complete unrolling       16 M  1.2 ms  54 GB/s      6.2×

slide-91
SLIDE 91

Version 6 – more work per thread

  • Increase the amount of work done by each thread:
    – instead of each thread adding two elements during the load, add more
  • Performance impact:
    – reduces thread-block scheduling overhead
    – increases the number of memory requests per thread

slide-92
SLIDE 92

Performance

  Version                     N     Time     Performance  Speedup
  0: Baseline                 16 M  7.7 ms    8.7 GB/s    —
  1: Non-divergent threading  16 M  5.5 ms   12 GB/s      1.4×
  2: Sequential addressing    16 M  4.0 ms   17 GB/s      1.9×
  3: 1/2 threads & first add  16 M  2.2 ms   31 GB/s      3.5×
  4: Unroll the last warp     16 M  1.3 ms   51 GB/s      5.9×
  5: Complete unrolling       16 M  1.2 ms   54 GB/s      6.2×
  6: More work per thread     16 M  0.65 ms  103 GB/s     11.8×

slide-93
SLIDE 93

vuduc.org/cse6230

Additional material

slide-94
SLIDE 94

  • W(n) = total work
  • D(n) = depth or span (critical path length)
  • W(n) / D(n) = inherent parallelism
  • Design criteria: make work optimal; maximize parallelism

slide-95
SLIDE 95

Example: Reduction

  • W(n) = O(n) — work optimal ✓
  • D(n) = O(log n)
  • Parallelism = O(n / log n) ✓

[Figure: binary tree of ⊕ operations]

slide-96
SLIDE 96

Example: Reduction

[Figure: reduction tree; a subtree’s working set = Z words, with ← L → marking a run of consecutive elements, assuming consecutive layout]

slide-97
SLIDE 97

vuduc.org/cse6230

An abstract distributed memory machine:

Processing “nodes” have private memory, communicate by “message passing.”

[Diagram: nine nodes, each an xPU with its own private memory]

slide-98
SLIDE 98

vuduc.org/cse6230

An abstract distributed memory machine:

Processing “nodes” have private memory, communicate by “message passing.”

  • Relevant even on 1 node, due to non-uniform memory access (NUMA).

Example: two quad-core CPUs (Socket-0 and Socket-1), each attached to its own DRAM (memory that is logically shared but physically distributed)

[Diagram: Core0–Core3 on each socket, each socket with local DRAM]

slide-99
SLIDE 99

vuduc.org/cse6230

Example: adding GPUs, which have private memory address spaces, to the previous example.

[Diagram of one node: two quad-core CPUs (CPU 1, CPU 2), each with DDR3 DRAM; QPI links between the CPUs and to the I/O hubs; three GPUs (GPU 1, GPU 2, GPU 3) attached via PCIe x16; Infiniband for off-node communication]