Motivation to Learn GPGPU Julius Parulek

Why to Learn About GPU? Computational power of GPU vs. CPU

Why to Learn About GPU? NVIDIA GPU relative performances

Why to Learn About GPU? Hardware

Why to Learn About GPU? Interactive rendering delivers almost off-line quality

Why to Learn About GPU?
- GPU programming parallelizes computation through a data-parallel streaming strategy
  - SIMT: single instruction, multiple threads
  - Tens of thousands of simultaneously active threads
  - Examples: OpenCL, CUDA
- Direct graphical enhancements
  - OpenGL / DirectX
  - Shaders (GLSL, HLSL, Cg)
  - Replaceable state machines on the GPU

Why to Learn About GPU? Application fields
- Force-field simulations: particle systems, molecular simulations, graph drawing
- Voronoi diagrams: data analysis, motion planning, geometric optimization
- Sorting and searching: database and range queries
- Matrix multiplication: physical simulation
- FFT: signal processing
- Visualization: volume rendering, raycasting

Why to Learn About GPU? Aim: real-time and interactive visualization of complex phenomena!

Introduction to CUDA Julius Parulek

CUDA Why to learn CUDA?
- Random and unlimited access to memory
  - Read/write whenever necessary
- Cooperation through shared memory
- Quick to learn
  - Few extensions to C
  - No graphics overhead
- Quick implementation
  - Focus on parallelization and not programming
  - A pen and paper to analyze algorithms
  - A number of blocks, each with a number of threads

CUDA
- CUDA = serial program with parallel kernels
- Differences between CUDA and CPU threads
  - CUDA threads are extremely lightweight
    - Very little creation overhead
    - Instant switching
  - CUDA uses 1000s of threads to achieve efficiency
    - Multi-core CPUs can use only a few
  - CUDA threads are physical threads on NVIDIA GPUs

CUDA Kernel
- A kernel is a simple C program
- A kernel is executed by threads
  - Thousands of threads execute the same kernel
  - Parallelization
- Each thread has its own threadID

CUDA
- Threads are organized into blocks
  - Threads in a block can synchronize (parallel task)
- Blocks are grouped into a grid
  - Blocks are independent (might coordinate)

CUDA Memory hierarchy
- Each thread accesses its own registers
- Each block accesses its own shared memory
- The grid accesses global memory

CUDA Memory space overview

CUDA Kernel/Thread execution
- One kernel at a time is executed as a grid
- A thread executes on one thread processor
- A block executes on one multiprocessor
- Several blocks can execute concurrently on one multiprocessor

CUDA Variable keywords
    __global__ void KernelFunc(...);   // kernel function, runs on the device
    __device__ int GlobalVar;          // variable in device memory
    __shared__ int SharedVar;          // variable in per-block shared memory
Execute the kernel
    KernelFunc<<<500, 128>>>(a, b);
A 1D grid of 500 blocks, each containing a 1D array of 128 threads: 500 x 128 = 64000 threads

CUDA Get the thread ID
CUDA built-in variables
    uint3 threadIdx, blockIdx;
    dim3 blockDim, gridDim;
    int threadID = blockDim.x * blockIdx.x + threadIdx.x;
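
The index formula can be checked with a small experiment in which every thread writes its own global index into an array. This is only a sketch: the names `write_thread_ids` and `collect_thread_ids` are hypothetical (not from the slides), `n > 0` is assumed, and error handling is reduced to returning an empty vector.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Each thread stores its own global index into out[].
__global__ void write_thread_ids(int *out, int n)
{
    int threadID = blockDim.x * blockIdx.x + threadIdx.x;
    if (threadID < n)              // the last block may contain idle threads
        out[threadID] = threadID;
}

// Hypothetical host helper: launches the kernel and copies the ids back.
// Returns an empty vector if a CUDA call fails. Assumes n > 0.
std::vector<int> collect_thread_ids(int n, int threadsPerBlock)
{
    std::vector<int> host(n, -1);
    int *dev = nullptr;
    if (cudaMalloc(&dev, n * sizeof(int)) != cudaSuccess) return {};

    // Round up so that blocks * threadsPerBlock >= n.
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    write_thread_ids<<<blocks, threadsPerBlock>>>(dev, n);

    // cudaMemcpy waits for the kernel to finish before copying.
    cudaError_t err = cudaMemcpy(host.data(), dev, n * sizeof(int),
                                 cudaMemcpyDeviceToHost);
    cudaFree(dev);
    if (err != cudaSuccess) return {};
    return host;
}
```

Calling `collect_thread_ids(1000, 128)` launches 8 blocks of 128 threads (1024 threads in total); the `threadID < n` guard keeps the 24 surplus threads from writing past the end of the array.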

CUDA Get the thread ID
    // A is in shared memory, so it is indexed by the position within the block
    A[threadIdx.x] = begin[threadID];
    // sync threads within a block
    __syncthreads();
    // the first thread of a block has no left neighbor in A
    int left = (threadIdx.x > 0) ? A[threadIdx.x - 1] : 0;
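
The load, synchronize, read-neighbor pattern can be sketched as a complete kernel. The example below is an illustration under stated assumptions, not the author's code: `left_difference`, `gpu_left_difference`, and the fixed block size `TILE` are hypothetical names, and the kernel must be launched with exactly `TILE` threads per block. Each thread computes the difference between its element and the element to its left.

```cuda
#include <cuda_runtime.h>
#include <vector>

#define TILE 128   // assumed block size; the launch must use the same value

__global__ void left_difference(const int *in, int *out, int n)
{
    __shared__ int tile[TILE];
    int threadID = blockDim.x * blockIdx.x + threadIdx.x;

    if (threadID < n) tile[threadIdx.x] = in[threadID];
    // Every thread of the block reaches the barrier, including idle ones.
    __syncthreads();
    if (threadID >= n) return;

    int left;
    if (threadIdx.x > 0)   left = tile[threadIdx.x - 1]; // neighbor in shared memory
    else if (threadID > 0) left = in[threadID - 1];      // neighbor belongs to the previous block
    else                   left = 0;                     // first element has no neighbor
    out[threadID] = tile[threadIdx.x] - left;
}

// Hypothetical host helper; returns an empty vector on failure. Assumes a non-empty input.
std::vector<int> gpu_left_difference(const std::vector<int> &in)
{
    int n = static_cast<int>(in.size());
    size_t bytes = n * sizeof(int);
    int *dIn = nullptr, *dOut = nullptr;
    if (cudaMalloc(&dIn, bytes) != cudaSuccess) return {};
    if (cudaMalloc(&dOut, bytes) != cudaSuccess) { cudaFree(dIn); return {}; }
    cudaMemcpy(dIn, in.data(), bytes, cudaMemcpyHostToDevice);

    left_difference<<<(n + TILE - 1) / TILE, TILE>>>(dIn, dOut, n);

    std::vector<int> out(n);
    cudaError_t err = cudaMemcpy(out.data(), dOut, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    if (err != cudaSuccess) return {};
    return out;
}
```

Shared memory is private to a block, so the first thread of each block falls back to global memory for its left neighbor. The barrier is placed outside the bounds check so that no thread of a block skips it.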

CUDA Get the thread ID
    // A is in shared memory
    A[threadIdx.x] = begin[threadID];
    /* The atomic counter is increased when element threadID is accessed.
       B is an array of unsigned int counters.
       atomicInc(a, i) stores ((*a >= i) ? 0 : (*a + 1)) in *a and returns the old value of *a */
    val = atomicInc(&B[threadID], val);

CUDA Atomic functions
- An atomic operation is guaranteed to be performed without interference from other threads!
- atomicAdd, atomicSub, atomicExch, atomicMin, atomicMax, atomicInc, atomicDec, atomicCAS, atomicAnd, atomicOr, atomicXor
- Notes
  - Using shared memory is faster than using atomic operations in global memory
  - Minimize their usage, because they serialize the code

CUDA Hello World in CUDA
    // add two arrays a + b in O(1) parallel steps
    __global__ void increment_gpu(float *a, float *b, int N)
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx < N)
            a[idx] = a[idx] + b[idx];
    }

    int main()
    {
        …..
        int blockSize = 500;
        dim3 dimBlock(blockSize);
        dim3 dimGrid(ceil(N / (float)blockSize));
        increment_gpu<<<dimGrid, dimBlock>>>(a, b, N);
    }
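
The slide elides the host-side setup. One possible way to fill it in is sketched below, assuming the usual sequence of allocating device memory, copying the inputs, launching the kernel, and copying the result back. The helper `gpu_add`, the kernel name `add_arrays`, and the block size of 256 are assumptions made for this sketch, not part of the original slides; both input vectors are assumed to be non-empty and of equal length.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Same computation as the slide's kernel: a[i] = a[i] + b[i].
__global__ void add_arrays(float *a, const float *b, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        a[idx] = a[idx] + b[idx];
}

// Hypothetical host helper: returns a + b computed on the GPU,
// or an empty vector if a CUDA call fails.
std::vector<float> gpu_add(std::vector<float> a, const std::vector<float> &b)
{
    int n = static_cast<int>(a.size());
    size_t bytes = n * sizeof(float);
    float *dA = nullptr, *dB = nullptr;
    if (cudaMalloc(&dA, bytes) != cudaSuccess) return {};
    if (cudaMalloc(&dB, bytes) != cudaSuccess) { cudaFree(dA); return {}; }

    // Host -> device copies of both inputs.
    cudaMemcpy(dA, a.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, b.data(), bytes, cudaMemcpyHostToDevice);

    int blockSize = 256;                              // assumed block size
    int gridSize = (n + blockSize - 1) / blockSize;   // integer round-up
    add_arrays<<<gridSize, blockSize>>>(dA, dB, n);

    // Device -> host copy; this also waits for the kernel to finish.
    cudaError_t err = cudaMemcpy(a.data(), dA, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dA);
    cudaFree(dB);
    if (err != cudaSuccess) return {};
    return a;
}
```

The integer round-up `(n + blockSize - 1) / blockSize` gives the same grid size as `ceil(N / (float)blockSize)` on the slide without going through floating point.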

CUDA Histogram in CUDA using atomicInc
    // compute a histogram b of the array a with a bin size of d
    // (atomicInc operates on unsigned int counters)
    __global__ void histogram(float *a, unsigned int *b, float d, int N)
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx < N) {
            int bin = int(a[idx] / d);
            atomicInc(b + bin, N);
        }
    }

    int main()
    {
        …..
        int blockSize = 500;
        dim3 dimBlock(blockSize);
        dim3 dimGrid(ceil(N / (float)blockSize));
        histogram<<<dimGrid, dimBlock>>>(a, b, d, N);
    }
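
A self-contained variant of the histogram is sketched below. It departs from the slide in two labeled ways: it uses `atomicAdd` instead of `atomicInc` (so no wrap-around limit is needed), and it clamps the bin index to a `numBins` parameter that the slide does not have. The helper name `gpu_histogram` is hypothetical, and the input is assumed to be non-empty.

```cuda
#include <cuda_runtime.h>
#include <vector>

__global__ void histogram(const float *a, unsigned int *bins,
                          float d, int n, int numBins)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        int bin = static_cast<int>(a[idx] / d);
        // Clamp out-of-range values into the first or last bin (an assumption).
        if (bin < 0) bin = 0;
        if (bin >= numBins) bin = numBins - 1;
        // Many threads may hit the same bin, so the update must be atomic.
        atomicAdd(&bins[bin], 1u);
    }
}

// Hypothetical host helper; returns an empty vector on failure.
std::vector<unsigned int> gpu_histogram(const std::vector<float> &a,
                                        float d, int numBins)
{
    int n = static_cast<int>(a.size());
    float *dA = nullptr;
    unsigned int *dBins = nullptr;
    if (cudaMalloc(&dA, n * sizeof(float)) != cudaSuccess) return {};
    if (cudaMalloc(&dBins, numBins * sizeof(unsigned int)) != cudaSuccess) {
        cudaFree(dA);
        return {};
    }
    cudaMemcpy(dA, a.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemset(dBins, 0, numBins * sizeof(unsigned int));  // counters start at zero

    int blockSize = 256;
    histogram<<<(n + blockSize - 1) / blockSize, blockSize>>>(dA, dBins, d, n, numBins);

    std::vector<unsigned int> bins(numBins);
    cudaError_t err = cudaMemcpy(bins.data(), dBins,
                                 numBins * sizeof(unsigned int),
                                 cudaMemcpyDeviceToHost);
    cudaFree(dA);
    cudaFree(dBins);
    if (err != cudaSuccess) return {};
    return bins;
}
```

Zeroing the counters with `cudaMemset` before the launch matters: device memory returned by `cudaMalloc` is not initialized.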

CUDA Thread accessing privileges
We already know that each thread can:
- Read/write per-thread registers
- Read/write per-block shared memory
- Read/write per-grid global memory
Additionally, each thread can also:
- Read/write per-thread local memory
- Read only per-grid constant memory
- Read only per-grid texture memory
    texture<float4, 2> my_texture;             // declare a texture reference (legacy API)
    float4 texel = tex2D(my_texture, u, v);    // fetch the texel value

CUDA
- Any source file containing CUDA language extensions must be compiled with NVCC
- NVCC separates code running on the host (CPU) from code running on the device (GPU)
    nvcc -deviceemu    // use for debugging
- When running in device emulation mode, one can:
  - Use host native debug support (breakpoints, inspection, etc.)
  - Access any device-specific data from host code and vice versa
  - Call any host function from device code (e.g. printf) and vice versa
  - Detect deadlock situations caused by improper usage of __syncthreads

CUDA Reducing problems
Reduce N values to a single one:
- Sum(v_0, v_1, ..., v_{N-2}, v_{N-1})
- Max(v_0, v_1, ..., v_{N-2}, v_{N-1})
- Min(v_0, v_1, ..., v_{N-2}, v_{N-1})

CUDA Example: Sum(V) = Sum(x_0, x_1, ..., x_{k-2}, x_{k-1})
    __global__ void Sum(int *X, int *Y)
    {
        extern __shared__ int sdata[];
        // each thread loads one element from global to shared mem
        unsigned int tid = threadIdx.x;
        unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
        sdata[tid] = X[i];
        __syncthreads();
        // do reduction in shared mem
        for (unsigned int s = 1; s < blockDim.x; s *= 2) {
            if (tid % (2 * s) == 0) {
                sdata[tid] += sdata[tid + s];
            }
            __syncthreads();
        }
        // write result for this block to global mem
        if (tid == 0) Y[blockIdx.x] = sdata[0];
    }

CUDA Example: Sum(V) = Sum(x_0, x_1, ..., x_{k-2}, x_{k-1})
    __global__ void Sum(int *X, int *Y)
    {
        extern __shared__ int sdata[];
        // each thread loads one element from global to shared mem
        unsigned int tid = threadIdx.x;
        unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
        sdata[tid] = X[i];
        __syncthreads();
        // do reduction in shared mem
        for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
            if (tid < s) {
                sdata[tid] += sdata[tid + s];
            }
            __syncthreads();
        }
        // write result for this block to global mem
        if (tid == 0) Y[blockIdx.x] = sdata[0];
    }

CUDA Example: Sum(V) = Sum(x_0, x_1, ..., x_{k-2}, x_{k-1})
    __global__ void Sum(int *X, int *Y)
    {
        extern __shared__ int sdata[];
        // each thread loads one element from global to shared mem
        unsigned int tid = threadIdx.x;
        unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
        sdata[tid] = X[i];
        __syncthreads();
        // do reduction in shared mem, unrolled for blockDim = 32
        if (tid < 16) sdata[tid] += sdata[tid + 16];
        __syncthreads();
        if (tid < 8) sdata[tid] += sdata[tid + 8];
        __syncthreads();
        if (tid < 4) sdata[tid] += sdata[tid + 4];
        __syncthreads();
        if (tid < 2) sdata[tid] += sdata[tid + 2];
        __syncthreads();
        if (tid < 1) sdata[tid] += sdata[tid + 1];
        __syncthreads();
        // write result for this block to global mem
        if (tid == 0) Y[blockIdx.x] = sdata[0];
    }
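
Each of these kernels produces one partial sum per block, so something still has to combine the partial sums. The sketch below shows one way to do it, assuming the partial results are simply added on the host. It reuses the `tid < s` loop variant, adds a bounds check so the input length need not be a multiple of the block size, and passes the shared-memory size as the third launch parameter. The names `block_sum` and `gpu_sum` and the block size of 256 (a power of two, as the loop requires) are assumptions made here; the input is assumed to be non-empty.

```cuda
#include <cuda_runtime.h>
#include <vector>

__global__ void block_sum(const int *X, int *Y, unsigned int n)
{
    extern __shared__ int sdata[];   // sized by the third launch parameter
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    sdata[tid] = (i < n) ? X[i] : 0;   // pad the last block with zeros
    __syncthreads();

    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    if (tid == 0) Y[blockIdx.x] = sdata[0];
}

// Hypothetical host helper: stores the total in *result and returns true,
// or returns false if a CUDA call fails.
bool gpu_sum(const std::vector<int> &x, long long *result)
{
    unsigned int n = static_cast<unsigned int>(x.size());
    const unsigned int threads = 256;            // must be a power of two
    unsigned int blocks = (n + threads - 1) / threads;

    int *dX = nullptr, *dY = nullptr;
    if (cudaMalloc(&dX, n * sizeof(int)) != cudaSuccess) return false;
    if (cudaMalloc(&dY, blocks * sizeof(int)) != cudaSuccess) { cudaFree(dX); return false; }
    cudaMemcpy(dX, x.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    block_sum<<<blocks, threads, threads * sizeof(int)>>>(dX, dY, n);

    std::vector<int> partial(blocks);
    cudaError_t err = cudaMemcpy(partial.data(), dY, blocks * sizeof(int),
                                 cudaMemcpyDeviceToHost);
    cudaFree(dX);
    cudaFree(dY);
    if (err != cudaSuccess) return false;

    long long total = 0;             // combine the per-block results on the host
    for (int p : partial) total += p;
    *result = total;
    return true;
}
```

For large inputs the partial sums could instead be reduced by launching the kernel again on `dY`; the host loop is used here only to keep the sketch short.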

CUDA Example: Sparse matrix-vector multiplication
- Kernel construction: y = Ax
- Strategy: one thread is assigned to each row of matrix A

CUDA Example: Sparse matrix-vector multiplication
- The sparse matrix A is represented by the arrays data, rows, and cols
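
The slides do not show how `data`, `rows`, and `cols` are laid out. The sketch below assumes the common compressed sparse row (CSR) layout: `data` holds the nonzero values row by row, `cols` holds the column index of each value, and `rows` holds numRows + 1 offsets so that row r occupies positions `rows[r]` to `rows[r + 1] - 1`. Under that assumption, the one-thread-per-row strategy might look like this; `spmv_csr`, `to_device`, and `gpu_spmv` are hypothetical names, and all input arrays are assumed to be non-empty.

```cuda
#include <cuda_runtime.h>
#include <vector>

// One thread computes one row of y = A x (CSR layout assumed).
__global__ void spmv_csr(int numRows, const float *data, const int *rows,
                         const int *cols, const float *x, float *y)
{
    int row = blockDim.x * blockIdx.x + threadIdx.x;
    if (row < numRows) {
        float dot = 0.0f;
        for (int j = rows[row]; j < rows[row + 1]; ++j)
            dot += data[j] * x[cols[j]];
        y[row] = dot;
    }
}

// Hypothetical helper: copies a host vector to newly allocated device memory.
template <typename T>
T *to_device(const std::vector<T> &v)
{
    T *d = nullptr;
    if (cudaMalloc(&d, v.size() * sizeof(T)) != cudaSuccess) return nullptr;
    cudaMemcpy(d, v.data(), v.size() * sizeof(T), cudaMemcpyHostToDevice);
    return d;
}

// Hypothetical host helper; returns an empty vector on failure.
std::vector<float> gpu_spmv(const std::vector<float> &data,
                            const std::vector<int> &rows,
                            const std::vector<int> &cols,
                            const std::vector<float> &x)
{
    int numRows = static_cast<int>(rows.size()) - 1;
    float *dData = to_device(data);
    int *dRows = to_device(rows);
    int *dCols = to_device(cols);
    float *dX = to_device(x);
    float *dY = nullptr;
    cudaError_t err = cudaMalloc(&dY, numRows * sizeof(float));

    std::vector<float> y(numRows);
    bool ok = dData && dRows && dCols && dX && err == cudaSuccess;
    if (ok) {
        int blockSize = 128;
        spmv_csr<<<(numRows + blockSize - 1) / blockSize, blockSize>>>(
            numRows, dData, dRows, dCols, dX, dY);
        ok = cudaMemcpy(y.data(), dY, numRows * sizeof(float),
                        cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    // cudaFree accepts null pointers, so failed allocations are safe to free.
    cudaFree(dData); cudaFree(dRows); cudaFree(dCols); cudaFree(dX); cudaFree(dY);
    if (!ok) return {};
    return y;
}
```

As a usage example, the 3 x 3 matrix with rows (1 0 2), (0 3 0), (4 0 5) is stored as `data = {1, 2, 3, 4, 5}`, `cols = {0, 2, 1, 0, 2}`, `rows = {0, 2, 3, 5}`; multiplying it by x = (1, 2, 3) gives y = (7, 6, 19).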
