
Cartoon parallel architectures; CPUs and GPUs. CSE 6230, Fall 2014 (PowerPoint presentation)



  1. “Cartoon” parallel architectures; CPUs and GPUs. CSE 6230, Fall 2014, Th Sep 11. Thanks to Jee Choi (a senior PhD student) for a big “assist”.

  2.–14. [Figure slides; no extractable text.]

  15. ~ socket

  16. ~ core

  17. ~ HWMT+SIMD (“SIMT”)

  18. Intel E5-2687W “Sandy Bridge-EP” vs. NVIDIA K20X “Kepler”

  19. Intel E5-2687W “Sandy Bridge-EP”: ~500 GF/s (single precision) vs. NVIDIA K20X “Kepler”

  20. Intel E5-2687W “Sandy Bridge-EP”: ~500 GF/s (single precision) vs. NVIDIA K20X “Kepler”: ~4 TF/s (single precision)

  21. Intel E5-2687W “Sandy Bridge-EP” vs. NVIDIA K20X “Kepler”

  22. Intel E5-2687W “Sandy Bridge-EP”: ~50 GB/s vs. NVIDIA K20X “Kepler”

  23. Intel E5-2687W “Sandy Bridge-EP”: ~50 GB/s vs. NVIDIA K20X “Kepler”: ~250 GB/s

  24. Intel E5-2687W “Sandy Bridge-EP”: ~50 GB/s vs. NVIDIA K20X “Kepler”: ~250 GB/s, with a ~6 GB/s link between them

  25. System Comparison: Intel Xeon E5-2687W vs. NVIDIA K20X

      # Cores / SMXs:                  8          vs.  14                            (1.75×)
      Clock frequency (max):           3.8 GHz    vs.  735 MHz                       (0.20×)
      SIMD width / thread processors:  256 bits   vs.  2688 SP + 896 DP
      Performance (single precision):  8 cores × 3.8 GHz × (8 Add + 8 Mul)  vs.  2688 × 735 MHz × 2 (FMA)  (8.12×)
      Performance (double precision):  8 cores × 3.8 GHz × (4 Add + 4 Mul)  vs.  896 × 735 MHz × 2 (FMA)   (5.42×)
      Memory bandwidth:                51.2 GB/s  vs.  250 GB/s                      (4.88×)
      TDP:                             150 W      vs.  235 W                         (1.57×)
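
      The single-precision ratio in the table is just the ratio of the two peak rates (a quick check of the arithmetic):

          CPU:  8 cores × 3.8 GHz × 16 SP flops/cycle (8 Add + 8 Mul)  ≈ 486 GF/s
          GPU:  2688 SP units × 0.735 GHz × 2 flops/cycle (FMA)        ≈ 3951 GF/s
          GPU / CPU ≈ 3951 / 486 ≈ 8.1×, matching the 8.12× in the table.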

  26.–35. [Figure slides; the only extractable text is a “6 GB/s” annotation on slide 28.]

  36. “CUDA” is NVIDIA’s implementation of this execution model

  37. Thread hierarchy: “single instruction, multiple thread” (SIMT)

  38. An example to compare models
      Naïve serial loop:
          for (i = 0; i < N; i++)
              A[i] += 2;
      OpenMP:
          #pragma omp parallel for
          for (i = 0; i < N; i++)
              A[i] += 2;
      CUDA, with N threads (one element per thread):
          int i = f(global thread ID);
          A[i] += 2;
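
      A minimal, self-contained sketch of the CUDA version of this loop. The kernel name, d_A, and launch parameters are illustrative (not from the slides); d_A is assumed to be a device array already allocated and filled, as on slide 44.

          __global__ void addTwo (int* A, int N)
          {
              // global thread ID = f(blockIdx, blockDim, threadIdx)
              int i = threadIdx.x + blockDim.x * blockIdx.x;
              if (i < N)        // guard in case N is not a multiple of the block size
                  A[i] += 2;
          }

          /* launch with one thread per element, e.g. 256 threads per block */
          int tbSize  = 256;
          int nBlocks = (N + tbSize - 1) / tbSize;
          addTwo<<<nBlocks, tbSize>>> (d_A, N);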

  39.–40. Global thread IDs: with four blocks (blockIdx.x = 0…3) of four threads each (threadIdx.x = 0…3), the pair maps to global IDs 0…15, which index into the array A.

  41. Thread hierarchy
      • Given a 3-D grid of thread blocks
        – there are (gridDim.x * gridDim.y * gridDim.z) thread blocks in the grid
        – each block’s position is identified by blockIdx.x, blockIdx.y, and blockIdx.z
      • Similarly for a 3-D thread block
        – blockDim.x, blockDim.y, blockDim.z
        – threadIdx.x, threadIdx.y, threadIdx.z
      • Thread-to-data mapping depends on how the work is divided amongst the threads (see the sketch below)
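
      As one concrete (hypothetical) mapping, a fully linearized global thread ID for a 3-D grid of 3-D blocks can be computed like this:

          __device__ int globalThreadId ()
          {
              // linear index of this block within the grid
              int blockId = blockIdx.x
                          + blockIdx.y * gridDim.x
                          + blockIdx.z * gridDim.x * gridDim.y;

              // threads per block
              int blockSize = blockDim.x * blockDim.y * blockDim.z;

              // linear index of this thread within its block
              int threadId = threadIdx.x
                           + threadIdx.y * blockDim.x
                           + threadIdx.z * blockDim.x * blockDim.y;

              return blockId * blockSize + threadId;
          }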

  42. Memory hierarchy
      (local) variables  – per thread
      local memory       – per thread
      shared memory      – per block
      global memory      – per grid
      constant memory    – read-only
      texture memory     – read-only
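
      A small sketch (illustrative names, not from the slides) showing where the main memory spaces appear in a kernel:

          __constant__ float coeff[4];       // constant memory: read-only, visible to the whole grid

          __global__ void memSpaces (float* in, float* out)   // in/out point to global memory
          {
              __shared__ float tile[256];    // shared memory: one copy per thread block

              int gId = threadIdx.x + blockDim.x * blockIdx.x;
              float x = in[gId];             // x is a per-thread local variable (a register)

              tile[threadIdx.x] = x;
              __syncthreads ();

              out[gId] = coeff[0] * tile[threadIdx.x];
          }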

  43. CUDA by example: Basic CUDA code

      __global__ void test (int* in, int* out, int N)
      {
          int gId = threadIdx.x + blockDim.x * blockIdx.x;
          out[gId] = in[gId];
      }

      int main (int argc, char** argv)
      {
          int N = 1048576;
          int tbSize = 256;             // threads per block
          int nBlocks = N / tbSize;     // blocks in the grid

          dim3 grid (nBlocks);
          dim3 block (tbSize);

          /* d_in and d_out are device pointers, allocated as on the next slide */
          test <<<grid, block>>> (d_in, d_out, N);
          cudaThreadSynchronize ();
      }

  44. CUDA by example: Basic CUDA code (host-side memory management)

      int main (int argc, char** argv)
      {
          /* allocate memory on the host and on the device */
          int *h_in, *h_out, *d_in, *d_out;
          h_in  = (int*) malloc (N * sizeof (int));
          h_out = (int*) malloc (N * sizeof (int));
          cudaMalloc ((void**) &d_in,  N * sizeof (int));
          cudaMalloc ((void**) &d_out, N * sizeof (int));

          /* copy data from CPU (host) to GPU (device) */
          cudaMemcpy (d_in, h_in, N * sizeof (int), cudaMemcpyHostToDevice);

          /* body of the problem here */
          . . .

          /* copy data back from GPU (device) to CPU (host) */
          cudaMemcpy (h_out, d_out, N * sizeof (int), cudaMemcpyDeviceToHost);

          /* free memory */
          free (h_in); free (h_out);
          cudaFree (d_in); cudaFree (d_out);
      }

  45. CUDA by example: What is this code doing?

      __global__ void mysteryFunction (int* in)
      {
          int tidx, tidy, gIdX, gIdY;
          tidx = threadIdx.x;
          tidy = threadIdx.y;
          gIdX = tidx + blockDim.x * blockIdx.x;
          gIdY = tidy + blockDim.y * blockIdx.y;

          __shared__ int buffer[16][16];

          buffer[tidx][tidy] = in[gIdX + gIdY * blockDim.x * gridDim.x];
          __syncthreads ();

          int temp;
          if (tidx > 0 && tidx < 15 && tidy > 0 && tidy < 15) {
              temp = (buffer[tidx][tidy - 1] +
                      buffer[tidx][tidy + 1] +
                      buffer[tidx - 1][tidy] +
                      buffer[tidx + 1][tidy] +
                      buffer[tidx][tidy]) / 5;
          }
          else {
              /* take care of boundary conditions (must also set temp) */
          }
          in[gIdX + gIdY * blockDim.x * gridDim.x] = temp;
      }

  46. CUDA by example: What is this code doing? (same code as slide 45, annotated)
      • __shared__ buffer[16][16] declares shared memory: one 16×16 tile per thread block
      • why do we need __syncthreads() here? Every thread must finish writing its element into buffer before any thread reads its neighbours’ elements.

  47. Synchronization
      • Within a thread block
        – via __syncthreads ();
      • Global synchronization
        – implicit synchronization between kernels
        – the only way to synchronize globally is to finish the grid and start another grid (see the sketch below)
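
      A minimal sketch of that pattern (kernel names are hypothetical): two dependent phases are expressed as two kernel launches in the same stream, so the end of the first grid acts as the global barrier.

          phase1<<<grid, block>>> (d_data);   // all blocks of phase1 must finish ...
          phase2<<<grid, block>>> (d_data);   // ... before any block of phase2 starts
          cudaDeviceSynchronize ();           // host waits for the GPU to drain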

  48. Scheduling
      • Each thread block gets scheduled on a multiprocessor (SMX)
        – there is no guarantee of the order in which they get scheduled
        – thread blocks run independently of each other
      • Multiple thread blocks can reside on a single SMX simultaneously (occupancy)
        – the number of resident thread blocks is determined by resource usage and availability (shared memory and registers)
      • Once scheduled, each thread block runs to completion

  49. Execution
      • Minimum unit of execution: the warp
        – typically 32 threads
      • At any given time, multiple warps will be executing
        – they could be from the same or from different thread blocks
      • A warp of threads is either
        – executing
        – waiting (for data or for its turn)
      • When a warp stalls, it can be switched out “instantaneously” so that another warp can start executing
        – hardware multi-threading

  50. Performance Notes: Thread Divergence
      • On a branch, threads in a warp can diverge
        – execution is serialized
        – threads taking one branch execute while the others idle
      • Avoid divergence!
        – use bitwise operations when possible
        – diverge at the granularity of warps (no penalty); see the sketch below
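
      A sketch of the last point (the conditions below are illustrative): branching on a value that is constant within a warp keeps every warp on a single path, so nothing is serialized.

          __global__ void branchy (float* x)
          {
              int tid  = threadIdx.x + blockDim.x * blockIdx.x;
              int warp = tid / 32;              // warpSize is 32 on current hardware

              if (tid % 2 == 0)                 // diverges inside every warp: both paths run serially
                  x[tid] *= 2.0f;
              else
                  x[tid] += 1.0f;

              if (warp % 2 == 0)                // uniform within a warp: no divergence penalty
                  x[tid] *= 2.0f;
              else
                  x[tid] += 1.0f;
          }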

  51. Performance Notes: Occupancy
      • Occupancy = # resident warps / max # warps
        – # resident warps is determined by per-thread register and per-block shared memory usage
        – max # warps is specific to the hardware generation
      • More warps means more threads with which to hide latency
        – increases the chance of keeping the GPU busy at all times
        – does not necessarily mean better performance
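
      A worked example of the first bullet, assuming Kepler-class (GK110) limits of 65,536 registers per SMX and a maximum of 64 resident warps (2,048 threads):

          a kernel using 64 registers per thread can keep at most
              65,536 / 64 = 1,024 threads = 32 warps
          resident per SMX, so its occupancy is 32 / 64 = 50%.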

  52. Performance Notes: Bandwidth Utilization
      • Reads from DRAM occur at the granularity of 128-byte transactions
        – requests are further decomposed into aligned cache lines
          • read-only cache: 128 bytes
          • L2 cache: 32 bytes
      • Minimize loading redundant cache lines to maximize bandwidth utilization (see the sketch below)
        – aligned access to memory
        – sequential access pattern
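
      A sketch of the two cases (the array and offset are illustrative): with float data and 32 threads per warp, a warp touches 128 consecutive bytes.

          __global__ void access (float* a, int offset)
          {
              int i = threadIdx.x + blockDim.x * blockIdx.x;

              // aligned + sequential: each warp reads one 128-byte span of a
              float x = a[i];

              // shifted by `offset`: the same warp straddles two cache lines,
              // so part of the data moved from DRAM is wasted
              float y = a[i + offset];

              a[i] = x + y;
          }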

  53.–55. Performance Notes: Bandwidth Utilization [figure slides; no extractable text]

  56. Backup

  57. GPU Architecture

  58. Performance Notes: Bandwidth Utilization II
      • Little’s Law: L = λW
        – L = average number of customers in a store
        – λ = arrival rate
        – W = average time spent

  59. Applied to memory: bandwidth plays the role of λ, and memory latency plays the role of W.

  60. Consequence: to keep the memory system saturated, the GPU needs tens of thousands of in-flight requests!

  61. In summary
      • Use as many “cheap” threads as possible
        – maximizes occupancy
        – increases the number of memory requests
      • Avoid thread divergence
        – if unavoidable, diverge at the warp level
      • Use an aligned and sequential data access pattern
        – minimize redundant data loads

  62. CUDA by example: Quicksort
      • Let’s now consider quicksort on a GPU
      • Step 1: partition the initial list
        – how do we partition the list amongst thread blocks?
        – recall that thread blocks CANNOT co-operate and thread blocks can run in ANY order
        – however, we need MANY threads and thread blocks in order to see good performance

  63. CUDA by example: Quicksort
      The initial list  4 2 3 5 | 6 1 9 3 | 4 7 6 5 | 9 8 3 1  is split into four chunks, one per thread block (blocks 0–3); a counting sketch follows below.
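
      One way to start (a sketch, not the lecture’s solution; kernel and variable names are hypothetical): each block independently counts how many of its elements fall below the pivot, so a later pass can compute where each block’s output begins.

          __global__ void countBelowPivot (const int* list, int n, int pivot, int* counts)
          {
              // one shared counter per thread block
              __shared__ int blockCount;
              if (threadIdx.x == 0) blockCount = 0;
              __syncthreads ();

              int gId = threadIdx.x + blockDim.x * blockIdx.x;
              if (gId < n && list[gId] < pivot)
                  atomicAdd (&blockCount, 1);
              __syncthreads ();

              // per-block results go to global memory; a prefix sum over `counts`
              // (a second kernel) would then give each block its output offset
              if (threadIdx.x == 0)
                  counts[blockIdx.x] = blockCount;
          }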
