

SLIDE 1

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

SLIDE 2


Warps as Scheduling Units

– Each block is divided into 32-thread warps

– An implementation technique, not part of the CUDA programming model
– Warps are scheduling units in an SM
– Threads in a warp execute in Single Instruction Multiple Data (SIMD) manner
– The number of threads in a warp may vary in future generations

[Diagram: Block 1, Block 2, and Block 3 each divided into warps of threads t0 t1 t2 … t31.]

SLIDE 3


Warps in Multi-dimensional Thread Blocks

– The thread blocks are first linearized into 1D in row-major order, as sketched below

– In x-dimension first, y-dimension next, and z-dimension last


Figure 6.1: Placing 2D threads into linear order
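
A minimal sketch of this row-major mapping, assuming nothing beyond the rule stated above (the helper name is illustrative, not part of the Teaching Kit):

// Row-major linearization: threadIdx.x varies fastest,
// then threadIdx.y, then threadIdx.z.
__device__ unsigned int linearThreadIndex() {
    return threadIdx.x
         + threadIdx.y * blockDim.x
         + threadIdx.z * blockDim.x * blockDim.y;
}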
SLIDE 4


Blocks are partitioned after linearization

– Linearized thread blocks are partitioned into warps (see the sketch at the end of this slide)

– Thread indices within a warp are consecutive and increasing
– Warp 0 starts with Thread 0

– Partitioning scheme is consistent across devices

– Thus you can use this knowledge in control flow
– However, the exact size of warps may change from generation to generation

– DO NOT rely on any ordering within or between warps

– If there are any dependencies between threads, you must use __syncthreads() to get correct results (more later).
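
As an illustration of the partitioning rule (a sketch; the function name is hypothetical, and warpSize is the CUDA built-in device variable):

// Warp number and lane of a thread, given its linearized index
// within the block. Warp 0 starts at thread 0, and indices within
// a warp are consecutive, as described above.
__device__ void warpAndLane(unsigned int linearTid,
                            unsigned int* warp, unsigned int* lane) {
    *warp = linearTid / warpSize;  // which warp in the block
    *lane = linearTid % warpSize;  // position within the warp
}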

SLIDE 5


SIMD Execution Among Threads in a Warp

– All threads in a warp must execute the same instruction at any point in time
– This works efficiently if all threads follow the same control flow path

– All if-then-else statements make the same decision
– All loops iterate the same number of times

SLIDE 6

Branch Divergence in Warps


– Branch divergence occurs when threads inside a warp branch to different execution paths.

[Diagram: a branch sends some threads in a warp down Path A and others down Path B; the two paths execute one after the other, illustrating a 50% performance loss.]

SLIDE 7


Control Divergence

– Control divergence occurs when threads in a warp take different control flow paths by making different control decisions

– Some take the then-path and others take the else-path of an if-statement
– Some threads take a different number of loop iterations than others

– The execution of threads taking different paths is serialized in current GPUs

– The control paths taken by the threads in a warp are traversed one at a time until there are no more.

– During the execution of each path, all threads taking that path will be executed in parallel
– The number of different paths can be large when considering nested control flow statements

SLIDE 8

Dealing With Branch Divergence

  • A common case: avoid divergence when the branch condition is a function of thread ID

– Example with divergence:

  • if (threadIdx.x > 2) { }
  • This creates two different control paths for threads in a block

– Example without divergence:

  • if (threadIdx.x / WARP_SIZE > 2) { }
  • Also creates two different control paths for threads in a block (see the combined sketch at the end of this slide)

  • Branch granularity is a whole multiple of warp size; all threads in any given warp follow the same path

  • There is a large body of research on dealing with branch divergence
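
A minimal sketch combining the two branch forms above (the kernel name is illustrative; WARP_SIZE is assumed to be 32):

#define WARP_SIZE 32

// Illustrative kernel: the first branch diverges inside warp 0
// (threads 0-2 vs. threads 3-31 take different paths); the second
// branch is aligned to warp boundaries, so each warp follows a
// single path and no warp diverges.
__global__ void branchGranularity(float* data) {
    int i = threadIdx.x;
    if (i > 2)                  // diverges within warp 0
        data[i] *= 2.0f;
    if (i / WARP_SIZE > 2)      // uniform within every warp
        data[i] += 1.0f;
}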

SLIDE 9


Control Divergence Examples

– Divergence can arise when a branch or loop condition is a function of thread indices
– Example kernel statement with divergence:

– if (threadIdx.x > 2) { }
– This creates two different control paths for threads in a block
– Decision granularity < warp size; threads 0, 1, and 2 follow a different path than the rest of the threads in the first warp

– Example without divergence:

– if (blockIdx.x > 2) { }
– Decision granularity is a multiple of block size; all threads in any given warp follow the same path

SLIDE 10


Example: Vector Addition Kernel

// Compute vector sum C = A + B
// Each thread performs one pair-wise addition
__global__
void vecAddKernel(float* A, float* B, float* C, int n) {
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i < n) C[i] = A[i] + B[i];
}

Device Code
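
For context, a matching host-side launch might look like this sketch (the device pointers d_A, d_B, d_C and their allocation are assumed, not shown on the slide):

// Hypothetical host-side launch: one thread per element,
// enough 256-thread blocks to cover all n elements.
int threadsPerBlock = 256;
int numBlocks = (n + threadsPerBlock - 1) / threadsPerBlock;  // ceiling division
vecAddKernel<<<numBlocks, threadsPerBlock>>>(d_A, d_B, d_C, n);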

SLIDE 11


Analysis for vector size of 1,000 elements

– Assume that block size is 256 threads

– 8 warps in each block

– All threads in Blocks 0, 1, and 2 are within valid range

– i values from 0 to 767
– There are 24 warps in these three blocks, none will have control divergence

– Most warps in Block 3 will not have control divergence

– Threads in warps 0-6 (i values 768-991) are all within valid range, thus no control divergence

– One warp in Block 3 will have control divergence

– Threads with i values 992-999 will all be within valid range
– Threads with i values of 1000-1023 will be outside valid range

– Effect of serialization on control divergence will be small

– 1 out of 32 warps has control divergence
– The impact on performance will likely be less than 3% (worked check below)
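
As a worked check of these numbers (a host-side sketch; variable names are illustrative):

// n = 1000 elements, 256-thread blocks, 32-thread warps.
int n = 1000, blockSize = 256, warpSz = 32;
int numBlocks = (n + blockSize - 1) / blockSize;   // = 4 blocks
int numWarps  = numBlocks * (blockSize / warpSz);  // = 32 warps
// Warps covering i = 0..991 are entirely in range; the warp covering
// i = 992..1023 straddles the n = 1000 boundary, so exactly 1 of the
// 32 warps diverges -> roughly 1/32, i.e. about 3% worst case.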

SLIDE 12


Parallel Reduction (max / sum / etc.)


SLIDE 13


One Parallel Reduction Kernel

__shared__ float partialSum[SIZE];
partialSum[threadIdx.x] = X[blockIdx.x*blockDim.x + threadIdx.x];
unsigned int t = threadIdx.x;
for (unsigned int stride = 1; stride < blockDim.x; stride *= 2) {
    __syncthreads();
    if (t % (2 * stride) == 0)
        partialSum[t] += partialSum[t + stride];
}
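
This fragment assumes surrounding kernel code. A minimal self-contained version might look like the sketch below; the kernel name, the SIZE definition, and the final write-back are assumptions, not from the slide:

#define SIZE 256  // assumed block size; must match blockDim.x

// Hypothetical wrapper around the slide's fragment. Each block
// reduces SIZE elements of X into one partial sum in Y[blockIdx.x].
__global__ void reduceSum(float* X, float* Y) {
    __shared__ float partialSum[SIZE];
    unsigned int t = threadIdx.x;
    partialSum[t] = X[blockIdx.x * blockDim.x + t];
    for (unsigned int stride = 1; stride < blockDim.x; stride *= 2) {
        __syncthreads();
        if (t % (2 * stride) == 0)
            partialSum[t] += partialSum[t + stride];
    }
    // Thread 0 performed the last addition into partialSum[0],
    // so it can write the block's result without another barrier.
    if (t == 0) Y[blockIdx.x] = partialSum[0];
}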

SLIDE 14


One Parallel Reduction Kernel

(Same kernel as Slide 13.)

[Diagram: threads t0 t1 t2 … t7 combining pairwise partial sums.]

SLIDE 15


One Parallel Reduction Kernel

(Same kernel as Slide 13.)

[Diagram: threads t0 t1 t2 … t7 spanning Warp 1 and Warp 2; active threads are spread across both warps.]

SLIDE 16


A Better Parallel Reduction Kernel

__shared__ float partialSum[SIZE];
partialSum[threadIdx.x] = X[blockIdx.x*blockDim.x + threadIdx.x];
unsigned int t = threadIdx.x;
for (unsigned int stride = blockDim.x/2; stride >= 1; stride >>= 1) {
    __syncthreads();
    if (t < stride)
        partialSum[t] += partialSum[t + stride];
}
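
Why this version diverges less: active threads form a contiguous range 0..stride-1, so whole warps go idle together and only the warp containing the boundary can diverge. A host-side sketch of that pattern, assuming a 256-thread block (illustrative, not from the slides):

#include <stdio.h>

// For each stride of the "better" kernel, count warps that are only
// partially active (i.e. diverging), assuming 32-thread warps.
int main(void) {
    int blockDim = 256, warp = 32;
    for (int stride = blockDim / 2; stride >= 1; stride >>= 1) {
        // Threads 0..stride-1 are active; a warp diverges only if the
        // stride boundary falls strictly inside it. In the Slide 13
        // kernel, by contrast, active threads are every (2*stride)-th
        // thread, so every active warp diverges at every step.
        int divergent = (stride % warp != 0) ? 1 : 0;
        printf("stride %3d: active threads 0-%d, divergent warps: %d\n",
               stride, stride - 1, divergent);
    }
    return 0;
}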

SLIDE 17


A Better Parallel Reduction Kernel

(Same kernel as Slide 16.)

[Diagram: reduction tree with example values; threads combine partial sums over successive iterations.]

SLIDE 18

Computational Resource Utilization

[Chart: percentage of execution (0-100%) spent with 32, 24-31, 16-23, 8-15, or 1-7 active threads per warp; 32 warps, 32 threads per warp, round-robin scheduling. "Good" marks full utilization; "Bad" marks an example of underutilization.]