Cartoon parallel architectures; CPUs and GPUs (CSE 6230, Fall 2014)

SLIDE 1

“Cartoon” parallel architectures; CPUs and GPUs

CSE 6230, Fall 2014 (Thu Sep 11)

  • Thanks to Jee Choi (a senior PhD student) for a big “assist”

SLIDES 2–14

(Figure-only slides; no recoverable text.)

SLIDE 15

~ socket

SLIDE 16

~ core

SLIDE 17

~ HWMT+SIMD (“SIMT”)

SLIDE 18

Intel E5-2687W (“Sandy Bridge-EP”) vs. NVIDIA K20X (“Kepler”)

SLIDE 19

Intel E5-2687W (“Sandy Bridge-EP”) vs. NVIDIA K20X (“Kepler”)

~ 500 GF/s (single precision, CPU)

SLIDE 20

Intel E5-2687W (“Sandy Bridge-EP”) vs. NVIDIA K20X (“Kepler”)

~ 500 GF/s (single precision, CPU) vs. ~ 4 TF/s (single precision, GPU)

SLIDE 21

Intel E5-2687W (“Sandy Bridge-EP”) vs. NVIDIA K20X (“Kepler”)

SLIDE 22

Intel E5-2687W (“Sandy Bridge-EP”) vs. NVIDIA K20X (“Kepler”)

~ 50 GB/s (CPU memory bandwidth)

SLIDE 23

Intel E5-2687W (“Sandy Bridge-EP”) vs. NVIDIA K20X (“Kepler”)

~ 50 GB/s (CPU memory bandwidth) vs. ~ 250 GB/s (GPU memory bandwidth)

SLIDE 24

Intel E5-2687W (“Sandy Bridge-EP”) vs. NVIDIA K20X (“Kepler”)

~ 50 GB/s (CPU memory bandwidth) vs. ~ 250 GB/s (GPU memory bandwidth)

~ 6 GB/s (host-to-device link)
SLIDE 25

System Comparison

Intel Xeon E5-2687W vs. NVIDIA K20X (ratio in parentheses):

  • # Cores / SMXs: 8 vs. 14 (1.75×)
  • Clock frequency (max): 3.8 GHz vs. 735 MHz (0.20×)
  • SIMD width / thread processors: 256-bit SIMD vs. 2688 SP + 896 DP thread processors
  • Performance (single precision): 8 cores × 3.8 GHz × (8 Add + 8 Mul) ≈ 486 GF/s vs. 2688 × 735 MHz × 2 (FMA) ≈ 3.95 TF/s (8.12×)
  • Performance (double precision): 8 cores × 3.8 GHz × (4 Add + 4 Mul) ≈ 243 GF/s vs. 896 × 735 MHz × 2 (FMA) ≈ 1.32 TF/s (5.42×)
  • Memory bandwidth: 51.2 GB/s vs. 250 GB/s (4.88×)
  • TDP: 150 W vs. 235 W (1.57×)
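A quick sanity check on the peak-rate rows (my own sketch, not from the slides): the formulas in the table can be evaluated directly, using only the clock rates and per-cycle FLOP counts listed above.

#include <stdio.h>

int main (void)
{
    /* CPU: 8 cores, 3.8 GHz, 8-wide SP add + 8-wide SP mul per cycle */
    double cpu_sp = 8 * 3.8e9 * (8 + 8);   /* ~486 GF/s  */
    double cpu_dp = 8 * 3.8e9 * (4 + 4);   /* ~243 GF/s  */
    /* GPU: 2688 SP / 896 DP thread processors at 735 MHz, FMA = 2 FLOPs */
    double gpu_sp = 2688.0 * 735e6 * 2;    /* ~3.95 TF/s */
    double gpu_dp = 896.0  * 735e6 * 2;    /* ~1.32 TF/s */

    printf ("SP: %.0f vs %.0f GF/s (%.2fx)\n", cpu_sp/1e9, gpu_sp/1e9, gpu_sp/cpu_sp);
    printf ("DP: %.0f vs %.0f GF/s (%.2fx)\n", cpu_dp/1e9, gpu_dp/1e9, gpu_dp/cpu_dp);
    return 0;
}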

SLIDES 26–27

(Figure-only slides; no recoverable text.)

SLIDE 28

~ 6 GB/s (host-to-device link)

SLIDES 29–35

(Figure-only slides; no recoverable text.)

SLIDE 36

“CUDA” is NVIDIA’s implementation of this execution model

SLIDE 37

Thread hierarchy

“Single instruction multiple thread” (SIMT)

SLIDE 38

An example to compare models

Naïve (serial):

    for (i = 0; i < N; i++) A[i] += 2;

OpenMP:

    #pragma omp parallel for
    for (i = 0; i < N; i++) A[i] += 2;

CUDA, with N threads (one element per thread; see the full kernel sketch below):

    int i = /* global thread ID */ blockIdx.x * blockDim.x + threadIdx.x;
    A[i] += 2;
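For concreteness, a minimal sketch of what the CUDA version could look like end to end (my own illustration, not from the slides; the kernel name, launch parameters, and the device pointer d_A are made up):

__global__ void addTwo (int* A, int N)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  /* global thread ID */
    if (i < N)         /* guard in case N is not a multiple of the block size */
        A[i] += 2;
}

/* launch with one thread per element */
int tbSize  = 256;
int nBlocks = (N + tbSize - 1) / tbSize;
addTwo <<<nBlocks, tbSize>>> (d_A, N);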

SLIDE 39

Global thread IDs

(Figure: an array of 16 elements; threadIdx.x runs 0–3 within each block, blockIdx.x runs 0–3 across the blocks, and the resulting global IDs run 0–15.)

global ID = threadIdx.x + blockIdx.x * blockDim.x

SLIDE 40

Global thread IDs

(Same figure, with each global ID mapped to one element of the array A.)

SLIDE 41

Thread hierarchy

  • Given a 3-D grid of thread blocks
    – there are (gridDim.x * gridDim.y * gridDim.z) thread blocks in the grid
    – each block’s position is identified by blockIdx.x, blockIdx.y, and blockIdx.z
  • Similarly for a 3-D thread block
    – dimensions: blockDim.x, blockDim.y, blockDim.z
    – position within the block: threadIdx.x, threadIdx.y, threadIdx.z
  • Thread-to-data mapping depends on how the work is divided amongst the threads (see the sketch below)
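As a small illustration (mine, not from the slides): one common thread-to-data mapping for a 2-D array stored in row-major order, assuming one thread per element; the kernel name is made up.

__global__ void scale2D (float* A, int width, int height)
{
    /* position of this thread in the overall grid */
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;

    if (x < width && y < height)
        A[y * width + x] *= 2.0f;   /* row-major linearization */
}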

SLIDE 42

Memory hierarchy

  • thread: per-thread variables, local memory
  • thread block: shared memory
  • grid: global memory, constant memory (read-only), texture memory (read-only)
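A small sketch (mine, not from the slides) of how these spaces appear in CUDA C; the names are made up and the block size is assumed to be 256.

__constant__ float coeff[16];              /* constant memory: read-only, grid-wide      */

__global__ void useMemorySpaces (float* gData)   /* gData lives in global memory         */
{
    __shared__ float tile[256];            /* shared memory: one copy per thread block   */
                                           /* (assumes blockDim.x == 256)                */
    int   gId = threadIdx.x + blockDim.x * blockIdx.x;
    float x   = gData[gId];                /* per-thread local variable (a register)     */

    tile[threadIdx.x] = x * coeff[0];
    __syncthreads();                       /* make shared-memory writes visible to block */
    gData[gId] = tile[threadIdx.x];
}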

SLIDE 43

CUDA by example
Basic CUDA code

__global__ void test (int* in, int* out, int N)
{
    int gId = threadIdx.x + blockDim.x * blockIdx.x;
    out[gId] = in[gId];
}

int main (int argc, char** argv)
{
    int N = 1048576;
    int tbSize = 256;
    int nBlocks = N / tbSize;

    dim3 grid (nBlocks);
    dim3 block (tbSize);

    test <<<grid, block>>> (d_in, d_out, N);
    cudaThreadSynchronize ();
}

SLIDE 44

CUDA by example
Basic CUDA code

int main (int argc, char** argv)
{
    /* allocate memory on host and device */
    int N = 1048576;
    int *h_in, *h_out, *d_in, *d_out;
    h_in  = (int*) malloc (N * sizeof (int));
    h_out = (int*) malloc (N * sizeof (int));
    cudaMalloc ((void**) &d_in,  N * sizeof (int));
    cudaMalloc ((void**) &d_out, N * sizeof (int));

    /* copy data from host to device */
    cudaMemcpy (d_in, h_in, N * sizeof (int), cudaMemcpyHostToDevice);

    /* body of the problem here */
    . . .

    /* copy data back to host */
    cudaMemcpy (h_out, d_out, N * sizeof (int), cudaMemcpyDeviceToHost);

    /* free memory */
    free (h_in);
    free (h_out);
    cudaFree (d_in);
    cudaFree (d_out);
}

(Callouts on the slide: allocate memory on host and device; copy data from CPU to GPU; copy data from GPU to CPU; free memory.)

SLIDE 45

CUDA by example
What is this code doing?

__global__ void mysteryFunction (int* in)
{
    int tidx, tidy, gIdX, gIdY;
    tidx = threadIdx.x;
    tidy = threadIdx.y;
    gIdX = tidx + blockDim.x * blockIdx.x;
    gIdY = tidy + blockDim.y * blockIdx.y;

    __shared__ int buffer[16][16];
    buffer[tidx][tidy] = in[gIdX + gIdY * blockDim.x * gridDim.x];
    __syncthreads();

    int temp;
    if (tidx > 0 && tidx < 15 && tidy > 0 && tidy < 15) {
        temp = (buffer[tidx][tidy - 1] + buffer[tidx][tidy + 1] +
                buffer[tidx - 1][tidy] + buffer[tidx + 1][tidy] +
                buffer[tidx][tidy]) / 5;
    } else {
        /* take care of boundary conditions */
    }
    in[gIdX + gIdY * blockDim.x * gridDim.x] = temp;
}

SLIDE 46

CUDA by example
What is this code doing?

(Same code as Slide 45, with two callouts pointing at the __shared__ declaration: “shared memory” and “why do we need this?”)

SLIDE 47

Synchronization

  • Within a thread block
    – via __syncthreads ()
  • Global synchronization
    – implicit synchronization between kernels
    – the only way to synchronize globally is to finish the grid and start another grid (see the sketch below)
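A minimal host-side sketch of that pattern (mine, not from the slides; the kernel names, launch configuration, and d_data are made up):

/* all threads of phase1 complete before any thread of phase2 starts */
phase1 <<<grid, block>>> (d_data);
phase2 <<<grid, block>>> (d_data);   /* kernels launched on the same stream run in order */
cudaThreadSynchronize ();            /* host waits for the device to finish */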

SLIDE 48

Scheduling

  • Each thread block gets scheduled on a multiprocessor (SMX)
    – there is no guarantee on the order in which they get scheduled
    – thread blocks run independently of each other
  • Multiple thread blocks can reside on a single SMX simultaneously (occupancy)
    – the number of resident thread blocks is determined by resource usage and availability (shared memory and registers)
  • Once scheduled, each thread block runs to completion

SLIDE 49

Execution

  • Minimum unit of execution: the warp
    – typically 32 threads
  • At any given time, multiple warps will be executing
    – they may come from the same or from different thread blocks
  • A warp of threads can be either
    – executing
    – waiting (for data or for its turn)
  • When a warp stalls, it can be switched out “instantaneously” so that another warp can start executing
    – hardware multi-threading (HWMT)

SLIDE 50

Performance Notes

Thread Divergence

  • On a branch, threads in a warp can diverge
    – execution is serialized: threads taking one branch execute while the others idle
  • Avoid divergence!!!
    – use bitwise operations when possible
    – diverge at the granularity of warps (no penalty); see the sketch below
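A small sketch of the difference (mine, not from the slides); in and out are assumed device arrays, f and g stand for any two per-element computations, and the warp size is assumed to be 32:

int gId = blockIdx.x * blockDim.x + threadIdx.x;

/* Divergent: even and odd lanes of the SAME warp take different paths,
 * so the two paths are serialized. */
if (gId % 2 == 0) out[gId] = f (in[gId]);
else              out[gId] = g (in[gId]);

/* Warp-granularity branch: every thread in a given warp takes the same
 * path, so there is no divergence penalty. */
if ((gId / 32) % 2 == 0) out[gId] = f (in[gId]);
else                     out[gId] = g (in[gId]);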

SLIDE 51

Performance Notes

Occupancy

  • Occupancy = # resident warps / max # warps
    – the number of resident warps is determined by per-thread register and per-block shared-memory usage
    – the max number of warps is specific to the hardware generation
  • More warps means more threads with which to hide latency
    – increases the chance of keeping the GPU busy at all times
    – does not necessarily mean better performance
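A rough worked example (mine, not from the slides; the hardware numbers are the commonly quoted Kepler SMX limits): with 65,536 registers per SMX and a 64-warp maximum, a kernel that uses 64 registers per thread can keep at most 65,536 / (64 × 32) = 32 warps resident, i.e., 50% occupancy, even before shared-memory and block-size limits are considered.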

SLIDE 52

Performance Notes

Bandwidth Utilization

  • Reads from DRAM occur at the granularity of 128-byte transactions
    – requests are further decomposed into aligned cache lines
      • read-only cache: 128 bytes
      • L2 cache: 32 bytes
  • Minimize loading redundant cache lines to maximize bandwidth utilization
    – aligned access to memory
    – sequential access pattern (see the sketch below)
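A small sketch contrasting the two patterns (mine, not from the slides; in is an assumed device array of floats):

int gId = blockIdx.x * blockDim.x + threadIdx.x;

/* Sequential (coalesced): consecutive threads touch consecutive words,
 * so a warp's 32 4-byte loads map onto a single 128-byte transaction. */
float a = in[gId];

/* Strided: consecutive threads are 32 words apart, so each load lands in a
 * different 128-byte line and most of each line's bytes are wasted. */
float b = in[gId * 32];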

SLIDES 53–55

Performance Notes

Bandwidth Utilization

(Figure-only slides illustrating the access patterns; no further recoverable text.)

SLIDE 56

Backup

SLIDE 57

GPU Architecture

SLIDES 58–60

Performance Notes

Bandwidth Utilization II

  • Little’s Law: L = λW
    – L = average number of customers in a store
    – λ = arrival rate
    – W = average time spent
  • Applied to memory: the achieved bandwidth plays the role of λ and the memory latency plays the role of W
    – so keeping the memory system busy requires tens of thousands of in-flight requests!!!
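A rough worked example (mine, not from the slides; the latency figure is an assumed ballpark): at λ ≈ 250 GB/s and a memory latency of W ≈ 500 ns, Little's Law gives L ≈ 250 GB/s × 500 ns ≈ 125 KB of data that must be in flight at all times, i.e., on the order of a thousand concurrent 128-byte transactions. Since each thread typically requests only 4 to 8 bytes, this translates into tens of thousands of outstanding per-thread requests.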

SLIDE 61

In summary

  • Use as many “cheap” threads as possible
    – maximizes occupancy
    – increases the number of memory requests in flight
  • Avoid thread divergence
    – if unavoidable, diverge at the warp level
  • Use an aligned and sequential data access pattern
    – minimizes redundant data loads

SLIDE 62

CUDA by example
Quicksort

  • Let’s now consider quicksort on a GPU
  • Step 1: partition the initial list
    – how do we partition the list amongst thread blocks?
    – recall that thread blocks CANNOT cooperate and thread blocks can run in ANY order
    – however, we need MANY threads and thread blocks in order to see good performance

SLIDE 63

CUDA by example
Quicksort

Input: 4 2 3 5 6 1 9 3 4 7 6 5 9 8 3 1

(Figure: the 16 elements are split evenly among thread blocks 0–3.)

SLIDE 64

CUDA by example
Quicksort

Input: 4 2 3 5 6 1 9 3 4 7 6 5 9 8 3 1

(Figure: within each thread block, threads 0 and 1 split the block’s elements between them.)

SLIDE 65

CUDA by example
Quicksort

Input: 4 2 3 5 6 1 9 3 4 7 6 5 9 8 3 1

(Figure: each thread counts how many of its elements are < pivot (5) and how many are >= pivot (5).)

SLIDE 66

CUDA by example
Quicksort

Input: 4 2 3 5 6 1 9 3 4 7 6 5 9 8 3 1

Do a cumulative sum (prefix sum) over the per-thread < pivot and >= pivot counts. This should be done in shared memory, in parallel.

SLIDE 67

CUDA by example
Quicksort

Input: 4 2 3 5 6 1 9 3 4 7 6 5 9 8 3 1

The cumulative sums tell us how much space each thread block needs and where it should store its values.

SLIDE 68

CUDA by example
Quicksort

Input: 4 2 3 5 6 1 9 3 4 7 6 5 9 8 3 1

(Figure: a temporary array with “start” and “end” pointers into which each thread block will write its < pivot and >= pivot elements.)

SLIDES 69–70

CUDA by example
Quicksort

Input: 4 2 3 5 6 1 9 3 4 7 6 5 9 8 3 1

(Figure: each thread block claims its region of the temporary array by advancing the “start” and “end” pointers with an atomic fetch-and-add (FAA).)

SLIDE 71

CUDA by example
Quicksort

Input: 4 2 3 5 6 1 9 3 4 7 6 5 9 8 3 1

(Figure: elements are written into the reserved regions of the temporary array at the offsets obtained from the FAA; the figure shows 4 3 2 5 placed into the array.)

SLIDE 72

CUDA by example
Quicksort

  • Phew. That was only the first part.
  • This is repeated until there are enough independent partitions to assign one to each thread block (see the kernel sketch below)
  • In the next part, each thread block does something similar, minus the FAA
  • When a sequence becomes small enough, it can be sorted with an alternative sorting algorithm (e.g., bitonic sort)
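To make the partition step concrete, here is a minimal sketch of one way a counting-and-scatter pass could look (my own illustration, not code from the slides; all names and sizes are made up). For brevity it issues one FAA per thread rather than doing the per-block shared-memory prefix sum followed by a single FAA per block, as the slides recommend.

/* One simplified pass of the partition step: each thread handles
 * ELEMS_PER_THREAD consecutive keys; elements < pivot are packed at the front
 * of the temporary array, elements >= pivot at the back.
 * Host side: *d_start is initialized to 0 and *d_end to N. */
#define ELEMS_PER_THREAD 2

__global__ void partitionStep (const int* in, int* tmp, int N, int pivot,
                               int* d_start, int* d_end)
{
    int gId   = blockIdx.x * blockDim.x + threadIdx.x;
    int first = gId * ELEMS_PER_THREAD;

    /* count this thread's elements on each side of the pivot */
    int nLess = 0, nGreq = 0;
    for (int i = first; i < first + ELEMS_PER_THREAD && i < N; i++) {
        if (in[i] < pivot) nLess++;
        else               nGreq++;
    }

    /* reserve space in the temporary array with atomic fetch-and-add */
    int lo = atomicAdd (d_start, nLess);            /* front grows upward   */
    int hi = atomicAdd (d_end, -nGreq) - nGreq;     /* back grows downward  */

    /* scatter the elements into the reserved slots */
    for (int i = first; i < first + ELEMS_PER_THREAD && i < N; i++) {
        if (in[i] < pivot) tmp[lo++] = in[i];
        else               tmp[hi++] = in[i];
    }
}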