

SLIDE 1

Optimizing Parallel Reduction in CUDA

Mark Harris NVIDIA Developer Technology

http://developer.download.nvidia.com/assets/cuda/files/reduction.pdf

Tuesday, September 11, 12

SLIDE 2

Parallel Reduction

Tree-based approach used within each thread block
Need to be able to use multiple thread blocks:
  To process very large arrays
  To keep all multiprocessors on the GPU busy
Each thread block reduces a portion of the array

But how do we communicate partial results between thread blocks?

[Tree diagram: 3 1 7 0 4 1 6 3 → pairwise sums 4 7 5 9 → 11 14 → 25]

SLIDE 3

Solution: Kernel Decomposition

Avoid global sync by decomposing the computation into multiple kernel invocations
In the case of reductions, the code for all levels is the same

Recursive kernel invocation

[Diagram: Level 0: 8 thread blocks each reduce a portion of the array (3 1 7 0 4 1 6 3 → 25); Level 1: one block reduces the 8 partial results]

Level 0: 8 blocks Level 1: 1 block
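A minimal host-side sketch of the decomposition (hypothetical helper names; each `reduce_level` call stands in for one kernel launch):

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// One "kernel launch": each block of `blockSize` elements is reduced to
// a single partial sum, producing one output element per block.
std::vector<int> reduce_level(const std::vector<int>& in, size_t blockSize) {
    std::vector<int> partials;
    for (size_t b = 0; b < in.size(); b += blockSize) {
        int sum = 0;
        for (size_t i = b; i < std::min(b + blockSize, in.size()); ++i)
            sum += in[i];
        partials.push_back(sum);
    }
    return partials;
}

// Recursive decomposition: launch level after level until a single
// value remains. No global sync is needed; the kernel boundary is the sync.
int reduce_all(std::vector<int> data, size_t blockSize) {
    while (data.size() > 1)
        data = reduce_level(data, blockSize);
    return data[0];
}
```

With 64 elements and blockSize = 8 this takes exactly two levels (8 blocks, then 1 block), matching the picture.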

SLIDE 4

What is Our Optimization Goal?

We should strive to reach GPU peak performance
Choose the right metric:
  GFLOP/s: for compute-bound kernels
  Bandwidth: for memory-bound kernels

Reductions have very low arithmetic intensity: 1 flop per element loaded, so they are bandwidth-optimal

Therefore we should strive for peak bandwidth
We will use the G80 GPU for this example:
  384-bit memory interface, 900 MHz DDR
  384 * 1800 / 8 = 86.4 GB/s

Jinx (M2090 nodes): roughly 100 to 150 GB/s?
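The peak-bandwidth arithmetic can be written out as a tiny helper (a sketch; assumes GB = 10^9 bytes and the effective data rate given in millions of transfers per second):

```cpp
#include <cassert>
#include <cmath>

// Theoretical peak memory bandwidth: (bus width in bytes) x (effective
// transfer rate). 900 MHz DDR means 1800 million transfers per second.
double peak_bw_gbps(int bus_width_bits, int data_rate_mts) {
    return (bus_width_bits / 8.0) * data_rate_mts * 1e6 / 1e9;
}
```

`peak_bw_gbps(384, 1800)` reproduces the slide's 86.4 GB/s figure for G80.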

SLIDE 5

Parallel Reduction: Interleaved Addressing

Values (shared memory): 10  1  8  -1  0  -2  3  5  -2  -3  2  7  0  11  0  2

SLIDE 6

Parallel Reduction: Interleaved Addressing

Values (shared memory): 10  1  8  -1  0  -2  3  5  -2  -3  2  7  0  11  0  2

Step 1, stride 1 (thread IDs 0 2 4 6 8 10 12 14):
Values: 11  1  7  -1  -2  -2  8  5  -5  -3  9  7  11  11  2  2

Step 2, stride 2 (thread IDs 0 4 8 12):
Values: 18  1  7  -1  6  -2  8  5  4  -3  9  7  13  11  2  2

Step 3, stride 4 (thread IDs 0 8):
Values: 24  1  7  -1  6  -2  8  5  17  -3  9  7  13  11  2  2

Step 4, stride 8 (thread ID 0):
Values: 41  1  7  -1  6  -2  8  5  17  -3  9  7  13  11  2  2

SLIDE 7

Reduction #1: Interleaved Addressing

__global__ void reduce0(int *g_idata, int *g_odata) {
    extern __shared__ int sdata[];
    // each thread loads one element from global to shared mem
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;
    sdata[tid] = g_idata[i];
    __syncthreads();
    // do reduction in shared mem
    for (unsigned int s=1; s < blockDim.x; s *= 2) {
        if (tid % (2*s) == 0) {
            sdata[tid] += sdata[tid + s];
        }
        __syncthreads();
    }
    // write result for this block to global mem
    if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}

SLIDE 8

Performance for 4M element reduction
(Block size = 128 threads for all tests; time measured on 2^22 ints)

Kernel                                                  Time        Bandwidth
1: interleaved addressing with divergent branching      8.054 ms    2.083 GB/s

Jinx: Kernel 1 ~ 10.6 GB/s
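The bandwidth column is just bytes moved over elapsed time; as a sanity check (host-side sketch counting only the 4 bytes read per int, since the per-block writes are negligible):

```cpp
#include <cassert>

// Achieved bandwidth in GB/s for reducing n 4-byte ints in `ms` milliseconds.
double achieved_bw_gbps(unsigned n_ints, double ms) {
    return n_ints * 4.0 / (ms * 1e-3) / 1e9;
}
```

`achieved_bw_gbps(1u << 22, 8.054)` reproduces Kernel 1's roughly 2.08 GB/s entry.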

SLIDE 9

Reduction #1: Interleaved Addressing

__global__ void reduce1(int *g_idata, int *g_odata) {
    extern __shared__ int sdata[];
    // each thread loads one element from global to shared mem
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;
    sdata[tid] = g_idata[i];
    __syncthreads();
    // do reduction in shared mem
    for (unsigned int s=1; s < blockDim.x; s *= 2) {
        if (tid % (2*s) == 0) {
            sdata[tid] += sdata[tid + s];
        }
        __syncthreads();
    }
    // write result for this block to global mem
    if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}

Problem: highly divergent warps are very inefficient, and % operator is very slow

SLIDE 10

Parallel Reduction: Interleaved Addressing

(Diagram repeated from Slide 6.)

SLIDE 11

Parallel Reduction: Interleaved Addressing

Values (shared memory): 10  1  8  -1  0  -2  3  5  -2  -3  2  7  0  11  0  2

The step-by-step sums are the same as before (final value 41), but with the strided index the active threads are now a contiguous prefix:
Step 1, stride 1: thread IDs 0 1 2 3 4 5 6 7
Step 2, stride 2: thread IDs 0 1 2 3
Step 3, stride 4: thread IDs 0 1
Step 4, stride 8: thread ID 0

New Problem: Shared Memory Bank Conflicts

SLIDE 12

Reduction #2: Interleaved Addressing

Just replace the divergent branch in the inner loop:

for (unsigned int s=1; s < blockDim.x; s *= 2) {
    if (tid % (2*s) == 0) {
        sdata[tid] += sdata[tid + s];
    }
    __syncthreads();
}

With a strided index and non-divergent branch:

for (unsigned int s=1; s < blockDim.x; s *= 2) {
    int index = 2 * s * tid;
    if (index < blockDim.x) {
        sdata[index] += sdata[index + s];
    }
    __syncthreads();
}
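The point of the strided index is which threads stay active. A host-side sketch of the mapping (hypothetical helper name):

```cpp
#include <cassert>

// In the non-divergent variant, thread `tid` works on element 2*s*tid,
// so at stride s the active threads are exactly tid = 0 .. blockDim/(2s)-1:
// a contiguous prefix, letting whole warps retire early instead of every
// warp keeping half its lanes idle.
bool is_active(unsigned tid, unsigned s, unsigned blockDim) {
    return 2 * s * tid < blockDim;
}
```

For blockDim = 128 and s = 1, threads 0..63 are active and threads 64..127 are done, instead of the divergent version's every-other-thread pattern.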

SLIDE 13

Performance for 4M element reduction
(Block size = 128 threads for all tests; time measured on 2^22 ints)

Kernel                                                  Time        Bandwidth     Step     Cumulative
1: interleaved addressing, divergent branching          8.054 ms     2.083 GB/s
2: interleaved addressing, bank conflicts               3.456 ms     4.854 GB/s   2.33x    2.33x

Jinx: Kernel 1 ~ 10.6 GB/s
      Kernel 2 ~ 16.4 GB/s → 1.55x

SLIDE 14

Parallel Reduction: Interleaved Addressing

(Diagram repeated from Slide 11.)

New Problem: Shared Memory Bank Conflicts

SLIDE 15

[Diagram: an array of words 1-9 mapped round-robin onto 4 shared-memory banks. Maximum throughput occurs when delivering 1 word per bank.]
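In this toy model, the bank of a word is just its index modulo the bank count (a sketch; real hardware has 16 or 32 banks, but the idea is the same):

```cpp
#include <cassert>

// Word w lives in bank w % num_banks. Accesses serialize (conflict)
// when two threads touch different words in the same bank.
int bank_of(int word, int num_banks) { return word % num_banks; }
```

Words 0 and 4 share bank 0, so reading both in one access serializes; words 0, 1, 2, 3 hit all four banks and proceed at full throughput.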

SLIDE 16

Parallel Reduction: Sequential Addressing

Values (shared memory): 10  1  8  -1  0  -2  3  5  -2  -3  2  7  0  11  0  2

Step 1, stride 8 (thread IDs 0-7):
Values:  8  -2  10  6  0  9  3  7  -2  -3  2  7  0  11  0  2

Step 2, stride 4 (thread IDs 0-3):
Values:  8  7  13  13  0  9  3  7  -2  -3  2  7  0  11  0  2

Step 3, stride 2 (thread IDs 0-1):
Values: 21  20  13  13  0  9  3  7  -2  -3  2  7  0  11  0  2

Step 4, stride 1 (thread ID 0):
Values: 41  20  13  13  0  9  3  7  -2  -3  2  7  0  11  0  2

Sequential addressing is conflict free

SLIDE 17

Reduction #3: Sequential Addressing

Just replace the strided indexing in the inner loop:

for (unsigned int s=1; s < blockDim.x; s *= 2) {
    int index = 2 * s * tid;
    if (index < blockDim.x) {
        sdata[index] += sdata[index + s];
    }
    __syncthreads();
}

With a reversed loop and threadID-based indexing:

for (unsigned int s=blockDim.x/2; s>0; s>>=1) {
    if (tid < s) {
        sdata[tid] += sdata[tid + s];
    }
    __syncthreads();
}

SLIDE 18

Performance for 4M element reduction
(Block size = 128 threads for all tests; time measured on 2^22 ints)

Kernel                                                  Time        Bandwidth     Step     Cumulative
1: interleaved addressing, divergent branching          8.054 ms     2.083 GB/s
2: interleaved addressing, bank conflicts               3.456 ms     4.854 GB/s   2.33x    2.33x
3: sequential addressing                                1.722 ms     9.741 GB/s   2.01x    4.68x

Jinx: Kernel 1 ~ 10.6 GB/s
      Kernel 2 ~ 16.4 GB/s → 1.55x
      Kernel 3 ~ 22.3 GB/s → 1.35x (2.10x cumulative)

SLIDE 19

Idle Threads

for (unsigned int s=blockDim.x/2; s>0; s>>=1) {
    if (tid < s) {
        sdata[tid] += sdata[tid + s];
    }
    __syncthreads();
}

Problem: Half of the threads are idle on the first loop iteration! This is wasteful.

SLIDE 20

Reduction #4: First Add During Load

Halve the number of blocks, and replace the single load:

// each thread loads one element from global to shared mem
unsigned int tid = threadIdx.x;
unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;
sdata[tid] = g_idata[i];
__syncthreads();

With two loads and the first add of the reduction:

// perform first level of reduction,
// reading from global memory, writing to shared memory
unsigned int tid = threadIdx.x;
unsigned int i = blockIdx.x*(blockDim.x*2) + threadIdx.x;
sdata[tid] = g_idata[i] + g_idata[i+blockDim.x];
__syncthreads();

SLIDE 21

Performance for 4M element reduction
(Block size = 128 threads for all tests; time measured on 2^22 ints)

Kernel                                                  Time        Bandwidth     Step     Cumulative
1: interleaved addressing, divergent branching          8.054 ms     2.083 GB/s
2: interleaved addressing, bank conflicts               3.456 ms     4.854 GB/s   2.33x    2.33x
3: sequential addressing                                1.722 ms     9.741 GB/s   2.01x    4.68x
4: first add during global load                         0.965 ms    17.377 GB/s   1.78x    8.34x

Jinx: Kernel 1 ~ 10.6 GB/s
      Kernel 2 ~ 16.4 GB/s → 1.55x
      Kernel 3 ~ 22.3 GB/s → 1.35x (2.10x cumulative)
      Kernel 4 ~ 37.5 GB/s → 1.22x (4.33x cumulative)

SLIDE 22

Instruction Bottleneck

At 17 GB/s, we're far from bandwidth bound

And we know reduction has low arithmetic intensity

Therefore a likely bottleneck is instruction overhead:

Ancillary instructions that are not loads, stores, or arithmetic for the core computation
In other words: address arithmetic and loop overhead

Strategy: unroll loops

SLIDE 23

Recall

SLIDE 24

Unrolling the Last Warp

As the reduction proceeds, the number of "active" threads decreases

When s <= 32, we have only one warp left

Instructions are SIMD synchronous within a warp. That means when s <= 32:

  We don't need to __syncthreads()
  We don't need "if (tid < s)" because it doesn't save any work

Let's unroll the last 6 iterations of the inner loop

SLIDE 25

Reduction #5: Unroll the Last Warp

__device__ void warpReduce(volatile int* sdata, int tid) {
    sdata[tid] += sdata[tid + 32];
    sdata[tid] += sdata[tid + 16];
    sdata[tid] += sdata[tid + 8];
    sdata[tid] += sdata[tid + 4];
    sdata[tid] += sdata[tid + 2];
    sdata[tid] += sdata[tid + 1];
}

// later...
for (unsigned int s=blockDim.x/2; s>32; s>>=1) {
    if (tid < s)
        sdata[tid] += sdata[tid + s];
    __syncthreads();
}
if (tid < 32) warpReduce(sdata, tid);

Note: This saves useless work in all warps, not just the last one!
Without unrolling, all warps execute every iteration of the for loop and if statement.
IMPORTANT: For this to be correct, we must use the "volatile" keyword!

SLIDE 26

Performance for 4M element reduction
(Block size = 128 threads for all tests; time measured on 2^22 ints)

Kernel                                                  Time        Bandwidth     Step     Cumulative
1: interleaved addressing, divergent branching          8.054 ms     2.083 GB/s
2: interleaved addressing, bank conflicts               3.456 ms     4.854 GB/s   2.33x    2.33x
3: sequential addressing                                1.722 ms     9.741 GB/s   2.01x    4.68x
4: first add during global load                         0.965 ms    17.377 GB/s   1.78x    8.34x
5: unroll last warp                                     0.536 ms    31.289 GB/s   1.80x   15.01x

Jinx: Kernel 1 ~ 10.6 GB/s
      Kernel 4 ~ 37.5 GB/s → 1.22x (4.33x cumulative)
      Kernel 5 ~ 55.5 GB/s → 1.48x (5.24x cumulative)

SLIDE 27

Complete Unrolling

If we knew the number of iterations at compile time, we could completely unroll the reduction

Luckily, the block size is limited by the GPU to 512 threads
Also, we are sticking to power-of-2 block sizes
So we can easily unroll for a fixed block size

But we need to be generic: how can we unroll for block sizes that we don't know at compile time?

Templates to the rescue!
CUDA supports C++ template parameters on device and host functions

Won't cover today; see Harris's original slides.
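A tiny host-side illustration of the trick (not the kernel itself): with the size as a template parameter, every `if` below is decided at compile time and the survivors become straight-line code.

```cpp
#include <cassert>
#include <vector>

// blockSize is a compile-time constant, so the compiler prunes each
// branch: instantiating unrolled_sum<4> leaves four adds and no tests,
// just as the templated kernel's size checks vanish at compile time.
template <unsigned blockSize>
int unrolled_sum(const std::vector<int>& v) {
    int s = 0;
    if (blockSize >= 4) s += v[3];
    if (blockSize >= 3) s += v[2];
    if (blockSize >= 2) s += v[1];
    if (blockSize >= 1) s += v[0];
    return s;
}
```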

SLIDE 28

Performance for 4M element reduction
(Block size = 128 threads for all tests; time measured on 2^22 ints)

Kernel                                                  Time        Bandwidth     Step     Cumulative
1: interleaved addressing, divergent branching          8.054 ms     2.083 GB/s
2: interleaved addressing, bank conflicts               3.456 ms     4.854 GB/s   2.33x    2.33x
3: sequential addressing                                1.722 ms     9.741 GB/s   2.01x    4.68x
4: first add during global load                         0.965 ms    17.377 GB/s   1.78x    8.34x
5: unroll last warp                                     0.536 ms    31.289 GB/s   1.80x   15.01x
6: completely unrolled                                  0.381 ms    43.996 GB/s   1.41x   21.16x

Jinx: Kernel 1 ~ 10.6 GB/s
      Kernel 5 ~ 55.5 GB/s → 1.48x (5.24x cumulative)
      Kernel 6 ~ 62.7 GB/s → 1.13x (5.92x cumulative)

SLIDE 29

Last idea (for today): Algorithmic cascading (coarsening)

SLIDE 30

Last idea (for today): Algorithmic cascading (coarsening)

E_p ≡ T_1 / (p · T_p) = 1 / ( W/W_0 + (p − 1)·D/W_0 ) = Θ(1)
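Unpacking the symbols (a sketch: W_0 is the work of the best sequential algorithm, W and D are the work and depth (span) of the parallel algorithm, and the middle step uses Brent's bound, T_p ≤ (W − D)/p + D):

```latex
E_p \equiv \frac{T_1}{p\,T_p}
    \;\ge\; \frac{W_0}{W + (p-1)\,D}
    \;=\; \frac{1}{W/W_0 + (p-1)\,D/W_0}
```

Keeping E_p = Θ(1) requires both denominator terms to stay bounded: W = Θ(W_0) (work optimality) and (p − 1)·D = O(W_0) (weak scaling), which is what algorithmic cascading buys for reduction.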

SLIDE 31

Last idea (for today): Algorithmic cascading (coarsening)

E_p ≡ T_1 / (p · T_p) = 1 / ( W/W_0 + (p − 1)·D/W_0 ) = Θ(1)

Work optimality: keep W = Θ(W_0)

SLIDE 32

Last idea (for today): Algorithmic cascading (coarsening)

E_p ≡ T_1 / (p · T_p) = 1 / ( W/W_0 + (p − 1)·D/W_0 ) = Θ(1)

SLIDE 33

Last idea (for today): Algorithmic cascading (coarsening)

E_p ≡ T_1 / (p · T_p) = 1 / ( W/W_0 + (p − 1)·D/W_0 ) = Θ(1)

Weak scaling: keep (p − 1)·D = O(W_0)

SLIDE 34

Reduction #7: Multiple Adds / Thread

Replace the load and add of two elements:

unsigned int tid = threadIdx.x;
unsigned int i = blockIdx.x*(blockDim.x*2) + threadIdx.x;
sdata[tid] = g_idata[i] + g_idata[i+blockDim.x];
__syncthreads();

With a while loop to add as many as necessary:

unsigned int tid = threadIdx.x;
unsigned int i = blockIdx.x*(blockSize*2) + threadIdx.x;
unsigned int gridSize = blockSize*2*gridDim.x;
sdata[tid] = 0;
while (i < n) {
    sdata[tid] += g_idata[i] + g_idata[i+blockSize];
    i += gridSize;
}
__syncthreads();

SLIDE 35

(Reduction #7 code repeated from Slide 34.)

Note: gridSize loop stride to maintain coalescing!
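A host-side sketch of one thread's traversal (hypothetical helper; assumes n is a multiple of gridSize so i + blockSize stays in bounds):

```cpp
#include <cassert>
#include <vector>

// Grid-stride accumulation for thread (blockIdx, tid): it sums the pair
// (i, i+blockSize), then jumps by gridSize = blockSize*2*gridDim, so on
// every trip the threads of a warp read consecutive addresses (coalesced)
// instead of each thread walking a private contiguous chunk.
int thread_partial(const std::vector<int>& g, unsigned blockSize,
                   unsigned gridDim, unsigned blockIdx, unsigned tid) {
    unsigned i = blockIdx * (blockSize * 2) + tid;
    unsigned gridSize = blockSize * 2 * gridDim;
    int sum = 0;
    while (i < g.size()) {
        sum += g[i] + g[i + blockSize];
        i += gridSize;
    }
    return sum;
}
```

For 16 ones with blockSize = 2 and gridDim = 2, each of the four (block, thread) pairs accumulates 4, and the partials sum to the full 16.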

SLIDE 36

Performance for 4M element reduction
(Block size = 128 threads for all tests; time measured on 2^22 ints)

Kernel                                                  Time        Bandwidth     Step     Cumulative
1: interleaved addressing, divergent branching          8.054 ms     2.083 GB/s
2: interleaved addressing, bank conflicts               3.456 ms     4.854 GB/s   2.33x    2.33x
3: sequential addressing                                1.722 ms     9.741 GB/s   2.01x    4.68x
4: first add during global load                         0.965 ms    17.377 GB/s   1.78x    8.34x
5: unroll last warp                                     0.536 ms    31.289 GB/s   1.80x   15.01x
6: completely unrolled                                  0.381 ms    43.996 GB/s   1.41x   21.16x
7: multiple elements per thread                         0.268 ms    62.671 GB/s   1.42x   30.04x

Kernel 7 on 32M elements: 73 GB/s!

SLIDE 37

(Performance table repeated from Slide 36.)

Jinx: Kernel 1 ~ 10.6 GB/s
      Kernel 6 ~ 62.7 GB/s → 1.13x (5.92x cumulative)
      Kernel 7 ~ 80.4 GB/s → 1.28x (7.58x cumulative)
      On 2^25 (~32M elements): 102 GB/s

SLIDE 38

Final Optimized Kernel

template <unsigned int blockSize>
__device__ void warpReduce(volatile int *sdata, unsigned int tid) {
    if (blockSize >= 64) sdata[tid] += sdata[tid + 32];
    if (blockSize >= 32) sdata[tid] += sdata[tid + 16];
    if (blockSize >= 16) sdata[tid] += sdata[tid + 8];
    if (blockSize >=  8) sdata[tid] += sdata[tid + 4];
    if (blockSize >=  4) sdata[tid] += sdata[tid + 2];
    if (blockSize >=  2) sdata[tid] += sdata[tid + 1];
}

template <unsigned int blockSize>
__global__ void reduce6(int *g_idata, int *g_odata, unsigned int n) {
    extern __shared__ int sdata[];
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x*(blockSize*2) + tid;
    unsigned int gridSize = blockSize*2*gridDim.x;
    sdata[tid] = 0;
    while (i < n) { sdata[tid] += g_idata[i] + g_idata[i+blockSize]; i += gridSize; }
    __syncthreads();

    if (blockSize >= 512) { if (tid < 256) { sdata[tid] += sdata[tid + 256]; } __syncthreads(); }
    if (blockSize >= 256) { if (tid < 128) { sdata[tid] += sdata[tid + 128]; } __syncthreads(); }
    if (blockSize >= 128) { if (tid <  64) { sdata[tid] += sdata[tid +  64]; } __syncthreads(); }

    if (tid < 32) warpReduce<blockSize>(sdata, tid);
    if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}

SLIDE 39

Types of optimization

Interesting observation:

Algorithmic optimizations (changes to addressing, algorithm cascading): 11.84x speedup, combined!

Code optimizations (loop unrolling): 2.54x speedup, combined
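The combined numbers follow from multiplying the step speedups in the performance table (a quick check, with a hypothetical helper):

```cpp
#include <cassert>
#include <cmath>
#include <initializer_list>

// Combined speedup is the product of the per-step speedups.
double combined(std::initializer_list<double> steps) {
    double s = 1.0;
    for (double x : steps) s *= x;
    return s;
}
```

Algorithmic steps (kernels 2, 3, 4, 7): 2.33 x 2.01 x 1.78 x 1.42 ≈ 11.84x. Code steps (kernels 5, 6): 1.8 x 1.41 ≈ 2.54x.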
