Cartoon parallel architectures; CPUs and GPUs (CSE 6230, Fall 2014)

SLIDE 1

“Cartoon” parallel architectures; CPUs and GPUs

CSE 6230, Fall 2014 (Thu Sep 11)

  • Thanks to Jee Choi (a senior PhD student) for a big “assist”

SLIDES 2–14

(Figure-only slides; no recoverable text.)

SLIDE 15

~ socket

SLIDE 16

~ core

SLIDE 17

~ HWMT+SIMD (“SIMT”)

SLIDE 18

Intel E5-2687W (“Sandy Bridge-EP”) vs. NVIDIA K20X (“Kepler”)

SLIDE 19

Intel E5-2687W (“Sandy Bridge-EP”) vs. NVIDIA K20X (“Kepler”)

~ 500 GF/s (single precision, CPU)

SLIDE 20

Intel E5-2687W (“Sandy Bridge-EP”) vs. NVIDIA K20X (“Kepler”)

~ 500 GF/s (single precision, CPU) vs. ~ 4 TF/s (single precision, GPU)

SLIDE 21

Intel E5-2687W (“Sandy Bridge-EP”) vs. NVIDIA K20X (“Kepler”)

SLIDE 22

Intel E5-2687W (“Sandy Bridge-EP”) vs. NVIDIA K20X (“Kepler”)

~ 50 GB/s (CPU memory bandwidth)

SLIDE 23

Intel E5-2687W (“Sandy Bridge-EP”) vs. NVIDIA K20X (“Kepler”)

~ 50 GB/s (CPU memory bandwidth) vs. ~ 250 GB/s (GPU memory bandwidth)

SLIDE 24

Intel E5-2687W (“Sandy Bridge-EP”) vs. NVIDIA K20X (“Kepler”)

~ 50 GB/s (CPU memory bandwidth) vs. ~ 250 GB/s (GPU memory bandwidth)

~ 6 GB/s (host-to-device link)
SLIDE 25

System Comparison

Intel Xeon E5-2687W vs. NVIDIA K20X (ratio in parentheses):

  • # Cores / SMXs: 8 vs. 14 (1.75×)
  • Clock frequency (max): 3.8 GHz vs. 735 MHz (0.20×)
  • SIMD width / thread processors: 256-bit SIMD vs. 2688 SP + 896 DP thread processors
  • Performance (single precision): 8 cores × 3.8 GHz × (8 Add + 8 Mul) ≈ 486 GF/s vs. 2688 × 735 MHz × 2 (FMA) ≈ 3.95 TF/s (8.12×)
  • Performance (double precision): 8 cores × 3.8 GHz × (4 Add + 4 Mul) ≈ 243 GF/s vs. 896 × 735 MHz × 2 (FMA) ≈ 1.32 TF/s (5.42×)
  • Memory bandwidth: 51.2 GB/s vs. 250 GB/s (4.88×)
  • TDP: 150 W vs. 235 W (1.57×)
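A quick sanity check on the peak-rate rows (my own sketch, not from the slides): the formulas in the table can be evaluated directly, using only the clock rates and per-cycle FLOP counts listed above.

#include <stdio.h>

int main (void)
{
    /* CPU: 8 cores, 3.8 GHz, 8-wide SP add + 8-wide SP mul per cycle */
    double cpu_sp = 8 * 3.8e9 * (8 + 8);   /* ~486 GF/s  */
    double cpu_dp = 8 * 3.8e9 * (4 + 4);   /* ~243 GF/s  */
    /* GPU: 2688 SP / 896 DP thread processors at 735 MHz, FMA = 2 FLOPs */
    double gpu_sp = 2688.0 * 735e6 * 2;    /* ~3.95 TF/s */
    double gpu_dp = 896.0  * 735e6 * 2;    /* ~1.32 TF/s */

    printf ("SP: %.0f vs %.0f GF/s (%.2fx)\n", cpu_sp/1e9, gpu_sp/1e9, gpu_sp/cpu_sp);
    printf ("DP: %.0f vs %.0f GF/s (%.2fx)\n", cpu_dp/1e9, gpu_dp/1e9, gpu_dp/cpu_dp);
    return 0;
}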

SLIDES 26–27

(Figure-only slides; no recoverable text.)

SLIDE 28

~ 6 GB/s (host-to-device link)

SLIDES 29–35

(Figure-only slides; no recoverable text.)

SLIDE 36

“CUDA” is NVIDIA’s implementation of this execution model

SLIDE 37

Thread hierarchy

“Single instruction multiple thread” (SIMT)

SLIDE 38

An example to compare models

Naïve (serial):

    for (i = 0; i < N; i++) A[i] += 2;

OpenMP:

    #pragma omp parallel for
    for (i = 0; i < N; i++) A[i] += 2;

CUDA, with N threads (one element per thread; see the full kernel sketch below):

    int i = /* global thread ID */ blockIdx.x * blockDim.x + threadIdx.x;
    A[i] += 2;
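For concreteness, a minimal sketch of what the CUDA version could look like end to end (my own illustration, not from the slides; the kernel name, launch parameters, and the device pointer d_A are made up):

__global__ void addTwo (int* A, int N)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  /* global thread ID */
    if (i < N)         /* guard in case N is not a multiple of the block size */
        A[i] += 2;
}

/* launch with one thread per element */
int tbSize  = 256;
int nBlocks = (N + tbSize - 1) / tbSize;
addTwo <<<nBlocks, tbSize>>> (d_A, N);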

SLIDE 39

Global thread IDs

(Figure: an array of 16 elements; threadIdx.x runs 0–3 within each block, blockIdx.x runs 0–3 across the blocks, and the resulting global IDs run 0–15.)

global ID = threadIdx.x + blockIdx.x * blockDim.x

SLIDE 40

Global thread IDs

(Same figure, with each global ID mapped to one element of the array A.)

SLIDE 41

Thread hierarchy

  • Given a 3-D grid of thread blocks
    – there are (gridDim.x * gridDim.y * gridDim.z) thread blocks in the grid
    – each block’s position is identified by blockIdx.x, blockIdx.y, and blockIdx.z
  • Similarly for a 3-D thread block
    – dimensions: blockDim.x, blockDim.y, blockDim.z
    – position within the block: threadIdx.x, threadIdx.y, threadIdx.z
  • Thread-to-data mapping depends on how the work is divided amongst the threads (see the sketch below)
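As a small illustration (mine, not from the slides): one common thread-to-data mapping for a 2-D array stored in row-major order, assuming one thread per element; the kernel name is made up.

__global__ void scale2D (float* A, int width, int height)
{
    /* position of this thread in the overall grid */
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;

    if (x < width && y < height)
        A[y * width + x] *= 2.0f;   /* row-major linearization */
}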

SLIDE 42

Memory hierarchy

  • thread: per-thread variables, local memory
  • thread block: shared memory
  • grid: global memory, constant memory (read-only), texture memory (read-only)
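A small sketch (mine, not from the slides) of how these spaces appear in CUDA C; the names are made up and the block size is assumed to be 256.

__constant__ float coeff[16];              /* constant memory: read-only, grid-wide      */

__global__ void useMemorySpaces (float* gData)   /* gData lives in global memory         */
{
    __shared__ float tile[256];            /* shared memory: one copy per thread block   */
                                           /* (assumes blockDim.x == 256)                */
    int   gId = threadIdx.x + blockDim.x * blockIdx.x;
    float x   = gData[gId];                /* per-thread local variable (a register)     */

    tile[threadIdx.x] = x * coeff[0];
    __syncthreads();                       /* make shared-memory writes visible to block */
    gData[gId] = tile[threadIdx.x];
}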

SLIDE 43

CUDA by example
Basic CUDA code

__global__ void test (int* in, int* out, int N)
{
    int gId = threadIdx.x + blockDim.x * blockIdx.x;
    out[gId] = in[gId];
}

int main (int argc, char** argv)
{
    int N = 1048576;
    int tbSize = 256;
    int nBlocks = N / tbSize;

    dim3 grid (nBlocks);
    dim3 block (tbSize);

    test <<<grid, block>>> (d_in, d_out, N);
    cudaThreadSynchronize ();
}

SLIDE 44

CUDA by example
Basic CUDA code

int main (int argc, char** argv)
{
    /* allocate memory on host and device */
    int N = 1048576;
    int *h_in, *h_out, *d_in, *d_out;
    h_in  = (int*) malloc (N * sizeof (int));
    h_out = (int*) malloc (N * sizeof (int));
    cudaMalloc ((void**) &d_in,  N * sizeof (int));
    cudaMalloc ((void**) &d_out, N * sizeof (int));

    /* copy data from host to device */
    cudaMemcpy (d_in, h_in, N * sizeof (int), cudaMemcpyHostToDevice);

    /* body of the problem here */
    . . .

    /* copy data back to host */
    cudaMemcpy (h_out, d_out, N * sizeof (int), cudaMemcpyDeviceToHost);

    /* free memory */
    free (h_in);
    free (h_out);
    cudaFree (d_in);
    cudaFree (d_out);
}

(Callouts on the slide: allocate memory on host and device; copy data from CPU to GPU; copy data from GPU to CPU; free memory.)

SLIDE 45

CUDA by example
What is this code doing?

__global__ void mysteryFunction (int* in)
{
    int tidx, tidy, gIdX, gIdY;
    tidx = threadIdx.x;
    tidy = threadIdx.y;
    gIdX = tidx + blockDim.x * blockIdx.x;
    gIdY = tidy + blockDim.y * blockIdx.y;

    __shared__ int buffer[16][16];
    buffer[tidx][tidy] = in[gIdX + gIdY * blockDim.x * gridDim.x];
    __syncthreads();

    int temp;
    if (tidx > 0 && tidx < 15 && tidy > 0 && tidy < 15) {
        temp = (buffer[tidx][tidy - 1] + buffer[tidx][tidy + 1] +
                buffer[tidx - 1][tidy] + buffer[tidx + 1][tidy] +
                buffer[tidx][tidy]) / 5;
    } else {
        /* take care of boundary conditions */
    }
    in[gIdX + gIdY * blockDim.x * gridDim.x] = temp;
}

SLIDE 46

CUDA by example
What is this code doing?

(Same code as Slide 45, with two callouts pointing at the __shared__ declaration: “shared memory” and “why do we need this?”)

SLIDE 47

Synchronization

  • Within a thread block
    – via __syncthreads ()
  • Global synchronization
    – implicit synchronization between kernels
    – the only way to synchronize globally is to finish the grid and start another grid (see the sketch below)
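A minimal host-side sketch of that pattern (mine, not from the slides; the kernel names, launch configuration, and d_data are made up):

/* all threads of phase1 complete before any thread of phase2 starts */
phase1 <<<grid, block>>> (d_data);
phase2 <<<grid, block>>> (d_data);   /* kernels launched on the same stream run in order */
cudaThreadSynchronize ();            /* host waits for the device to finish */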

SLIDE 48

Scheduling

  • Each thread block gets scheduled on a multiprocessor (SMX)
    – there is no guarantee on the order in which they get scheduled
    – thread blocks run independently of each other
  • Multiple thread blocks can reside on a single SMX simultaneously (occupancy)
    – the number of resident thread blocks is determined by resource usage and availability (shared memory and registers)
  • Once scheduled, each thread block runs to completion

SLIDE 49

Execution

  • Minimum unit of execution: the warp
    – typically 32 threads
  • At any given time, multiple warps will be executing
    – they may come from the same or from different thread blocks
  • A warp of threads can be either
    – executing
    – waiting (for data or for its turn)
  • When a warp stalls, it can be switched out “instantaneously” so that another warp can start executing
    – hardware multi-threading (HWMT)

SLIDE 50

Performance Notes

Thread Divergence

  • On a branch, threads in a warp can diverge
    – execution is serialized: threads taking one branch execute while the others idle
  • Avoid divergence!!!
    – use bitwise operations when possible
    – diverge at the granularity of warps (no penalty); see the sketch below
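A small sketch of the difference (mine, not from the slides); in and out are assumed device arrays, f and g stand for any two per-element computations, and the warp size is assumed to be 32:

int gId = blockIdx.x * blockDim.x + threadIdx.x;

/* Divergent: even and odd lanes of the SAME warp take different paths,
 * so the two paths are serialized. */
if (gId % 2 == 0) out[gId] = f (in[gId]);
else              out[gId] = g (in[gId]);

/* Warp-granularity branch: every thread in a given warp takes the same
 * path, so there is no divergence penalty. */
if ((gId / 32) % 2 == 0) out[gId] = f (in[gId]);
else                     out[gId] = g (in[gId]);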

SLIDE 51

Performance Notes

Occupancy

  • Occupancy = # resident warps / max # warps
    – the number of resident warps is determined by per-thread register and per-block shared-memory usage
    – the max number of warps is specific to the hardware generation
  • More warps means more threads with which to hide latency
    – increases the chance of keeping the GPU busy at all times
    – does not necessarily mean better performance
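A rough worked example (mine, not from the slides; the hardware numbers are the commonly quoted Kepler SMX limits): with 65,536 registers per SMX and a 64-warp maximum, a kernel that uses 64 registers per thread can keep at most 65,536 / (64 × 32) = 32 warps resident, i.e., 50% occupancy, even before shared-memory and block-size limits are considered.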

SLIDE 52

Performance Notes

Bandwidth Utilization

  • Reads from DRAM occur at the granularity of 128-byte transactions
    – requests are further decomposed into aligned cache lines
      • read-only cache: 128 bytes
      • L2 cache: 32 bytes
  • Minimize loading redundant cache lines to maximize bandwidth utilization
    – aligned access to memory
    – sequential access pattern (see the sketch below)
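A small sketch contrasting the two patterns (mine, not from the slides; in is an assumed device array of floats):

int gId = blockIdx.x * blockDim.x + threadIdx.x;

/* Sequential (coalesced): consecutive threads touch consecutive words,
 * so a warp's 32 4-byte loads map onto a single 128-byte transaction. */
float a = in[gId];

/* Strided: consecutive threads are 32 words apart, so each load lands in a
 * different 128-byte line and most of each line's bytes are wasted. */
float b = in[gId * 32];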

SLIDES 53–55

Performance Notes

Bandwidth Utilization

(Figure-only slides illustrating the access patterns; no further recoverable text.)

SLIDE 56

Backup

SLIDE 57

GPU Architecture

SLIDES 58–60

Performance Notes

Bandwidth Utilization II

  • Little’s Law: L = λW
    – L = average number of customers in a store
    – λ = arrival rate
    – W = average time spent
  • Applied to memory: the achieved bandwidth plays the role of λ and the memory latency plays the role of W
    – so keeping the memory system busy requires tens of thousands of in-flight requests!!!
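A rough worked example (mine, not from the slides; the latency figure is an assumed ballpark): at λ ≈ 250 GB/s and a memory latency of W ≈ 500 ns, Little's Law gives L ≈ 250 GB/s × 500 ns ≈ 125 KB of data that must be in flight at all times, i.e., on the order of a thousand concurrent 128-byte transactions. Since each thread typically requests only 4 to 8 bytes, this translates into tens of thousands of outstanding per-thread requests.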

SLIDE 61

In summary

  • Use as many “cheap” threads as possible
    – maximizes occupancy
    – increases the number of memory requests in flight
  • Avoid thread divergence
    – if unavoidable, diverge at the warp level
  • Use an aligned and sequential data access pattern
    – minimizes redundant data loads

SLIDE 62

CUDA by example
Quicksort

  • Let’s now consider quicksort on a GPU
  • Step 1: partition the initial list
    – how do we partition the list amongst thread blocks?
    – recall that thread blocks CANNOT cooperate and thread blocks can run in ANY order
    – however, we need MANY threads and thread blocks in order to see good performance

SLIDE 63

CUDA by example
Quicksort

Input: 4 2 3 5 6 1 9 3 4 7 6 5 9 8 3 1

(Figure: the 16 elements are split evenly among thread blocks 0–3.)

SLIDE 64

CUDA by example
Quicksort

Input: 4 2 3 5 6 1 9 3 4 7 6 5 9 8 3 1

(Figure: within each thread block, threads 0 and 1 split the block’s elements between them.)

SLIDE 65

CUDA by example
Quicksort

Input: 4 2 3 5 6 1 9 3 4 7 6 5 9 8 3 1

(Figure: each thread counts how many of its elements are < pivot (5) and how many are >= pivot (5).)

SLIDE 66

CUDA by example
Quicksort

Input: 4 2 3 5 6 1 9 3 4 7 6 5 9 8 3 1

Do a cumulative sum (prefix sum) over the per-thread < pivot and >= pivot counts. This should be done in shared memory, in parallel.

SLIDE 67

CUDA by example
Quicksort

Input: 4 2 3 5 6 1 9 3 4 7 6 5 9 8 3 1

The cumulative sums tell us how much space each thread block needs and where it should store its values.

SLIDE 68

CUDA by example
Quicksort

Input: 4 2 3 5 6 1 9 3 4 7 6 5 9 8 3 1

(Figure: a temporary array with “start” and “end” pointers into which each thread block will write its < pivot and >= pivot elements.)

SLIDES 69–70

CUDA by example
Quicksort

Input: 4 2 3 5 6 1 9 3 4 7 6 5 9 8 3 1

(Figure: each thread block claims its region of the temporary array by advancing the “start” and “end” pointers with an atomic fetch-and-add (FAA).)

SLIDE 71

CUDA by example
Quicksort

Input: 4 2 3 5 6 1 9 3 4 7 6 5 9 8 3 1

(Figure: elements are written into the reserved regions of the temporary array at the offsets obtained from the FAA; the figure shows 4 3 2 5 placed into the array.)

SLIDE 72

CUDA by example
Quicksort

  • Phew. That was only the first part.
  • This is repeated until there are enough independent partitions to assign one to each thread block (see the kernel sketch below)
  • In the next part, each thread block does something similar, minus the FAA
  • When a sequence becomes small enough, it can be sorted with an alternative sorting algorithm (e.g., bitonic sort)
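To make the partition step concrete, here is a minimal sketch of one way a counting-and-scatter pass could look (my own illustration, not code from the slides; all names and sizes are made up). For brevity it issues one FAA per thread rather than doing the per-block shared-memory prefix sum followed by a single FAA per block, as the slides recommend.

/* One simplified pass of the partition step: each thread handles
 * ELEMS_PER_THREAD consecutive keys; elements < pivot are packed at the front
 * of the temporary array, elements >= pivot at the back.
 * Host side: *d_start is initialized to 0 and *d_end to N. */
#define ELEMS_PER_THREAD 2

__global__ void partitionStep (const int* in, int* tmp, int N, int pivot,
                               int* d_start, int* d_end)
{
    int gId   = blockIdx.x * blockDim.x + threadIdx.x;
    int first = gId * ELEMS_PER_THREAD;

    /* count this thread's elements on each side of the pivot */
    int nLess = 0, nGreq = 0;
    for (int i = first; i < first + ELEMS_PER_THREAD && i < N; i++) {
        if (in[i] < pivot) nLess++;
        else               nGreq++;
    }

    /* reserve space in the temporary array with atomic fetch-and-add */
    int lo = atomicAdd (d_start, nLess);            /* front grows upward   */
    int hi = atomicAdd (d_end, -nGreq) - nGreq;     /* back grows downward  */

    /* scatter the elements into the reserved slots */
    for (int i = first; i < first + ELEMS_PER_THREAD && i < N; i++) {
        if (in[i] < pivot) tmp[lo++] = in[i];
        else               tmp[hi++] = in[i];
    }
}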