Motivation to Learn GPGPU Julius Parulek Why to Learn About GPU? - - PowerPoint PPT Presentation
Motivation to Learn GPGPU Julius Parulek Why to Learn About GPU? - - PowerPoint PPT Presentation
Motivation to Learn GPGPU Julius Parulek Why to Learn About GPU? Computational power of GPU vs. CPU Why to Learn About GPU? NVIDIA GPU relative performances Why to Learn About GPU? Hardware Why to Learn About GPU? Interactive rendering
Why to Learn About GPU?
Computational power of GPU vs. CPU
Why to Learn About GPU?
NVIDIA GPU relative performances
Why to Learn About GPU?
Hardware
Why to Learn About GPU?
Interactive rendering delivers almost off-line quality
Why to Learn About GPU?
GPU programming allows to
parallelize the computation via data parallel streaming strategy
SIMT
Single instruction multiple threads 10000s of simultaneously active threads
Examples
OpenCL, CUDA
Direct graphical enhancements
OpenGL / Direct X Shaders (GLSL,HLSL,Cg)
Replaceable state-machines on the GPU
Why to Learn About GPU?
Application fields
Force-field simulations
Particles systems, molecular simulations, graph drawing
Voronoi diagrams
Data analysis, motion planning, geometric optimization
Sorting and searching
Database and range queries
Matrix multiplication
Physical simulation
FFT
Signal processing
Visualization
Volume rendering, raycasting
Why to Learn About GPU?
Application fields
Force-field simulations
Particles systems, molecular simulations, graph drawing
Voronoi diagrams
Data analysis, motion planning, geometric optimization
Sorting and searching
Database and range queries
Matrix multiplication
Physical simulation
FFT
Signal processing
Visualization
Volume rendering, raycasting
Aim: real-time and interactive visualization of complex phenomena!
Introduction to CUDA
Julius Parulek
CUDA
Why to learn CUDA?
Random and unlimited access to memory
Read/write whenever necessary
Cooperation through shared memory High learning curve
Few extensions to C
No graphics overhead Quick implementation
Focus on parallelization and not programming A pen and a paper to analyze algorithms
Number of blocks of a number of threads
CUDA
CUDA = serial program with parallel kernels Differences between CUDA and CPU threads
CUDA threads are extremely lightweight
Very little creation overhead Instant switching
CUDA uses 1000s of threads to achieve efficiency
Multi-core CPUs can use only a few
CUDA threads are physical
NVIDIA GPUs
CUDA
Kernel
A simple C program
Kernels are executed by thread
Thousands of threads execute the same kernel
Parallelization
Each thread has its own threadID
threadID
CUDA
Threads are organized into blocks
Threads in a block can synchronize (parallel task)
Blocks are grouped into a grid
Blocks are independent (might coordinate)
CUDA
Memory hierarchy
Thread accesses Block accesses Grid accesses
threadID registers shared memory global memory
CUDA
Memory space overview
CUDA
Kernel/Thread execution
One kernel at a time is executed as a grid A block executes on one thread processor Several blocks can execute concurrently on one thread processor (multiprocessor)
CUDA
Variable keywords
__global__ void KernelFunc(...);
kernel function, runs on device
__device__ int GlobalVar;
variable in device memory
__shared__ int SharedVar;
variable in per-block shared memory
Execute the kernel
kernelFunc<<<500,128>>>(a,b)
1D array of 500 blocks where each contains 1D array
- f 128 threads = 500x128=64000 threads
CUDA
Get the thread ID CUDA variables
dim3 threadIdx, blockIdx, blockDim, gridDim; int threadID = blockDim.x*blockIdx.x+threadIdx.x
blockDim.x blockIdx.x threadIdx.x
CUDA
Get the thread ID CUDA variables
dim3 threadIdx, blockIdx, blockDim, gridDim; int threadID = blockDim.x*blockIdx.x+threadIdx.x
blockDim.x blockIdx.x threadIdx.x //A is in shared memory A[threadID] = begin[threadID]; //sync threads within a block __syncthreads(); int left = A[threadID - 1];
CUDA
Get the thread ID CUDA variables
dim3 threadIdx, blockIdx, blockDim, gridDim; int threadID = blockDim.x*blockIdx.x+threadIdx.x
blockDim.x blockIdx.x //A is in shared memory A[threadID] = begin[threadID]; /* atomic counter is increased when element threadID was accessed atomicInc(&a,i) -> (*a>=i) ? 0 : (*a++) */ val = atomicInc(B[threadID],val); int left = A[threadID - 1];
CUDA
Atomic functions
Atomic operation is guaranteed to be performed without interference from other threads! atomicAdd,atomicSub,atomicExch,atomicMin,atomic Max,atomicInc,atomicDec,atomicCAS, atomicAnd, atomicOr, atomicXor Notes
Shared memory is faster then using atomic operations in global memory Minimize their usage
Causing code serialization
20
CUDA
Hello World in CUDA
//add two arrays a+b in O(1) __global__ void increment_gpu( float *a, float *b, int N) { int idx = blockIdx.x * blockDim.x+ threadIdx.x; if (idx < N) a[idx] = a[idx] + b[idx]; } void main() { ….. int blockSize = 500; dim3 dimBlock (blocksize); dim3 dimGrid( ceil( N / (float)blocksize) ); increment_gpu<<<dimGrid, dimBlock>>>(a, b, N); }
CUDA
Hello World in CUDA
//add two arrays a+b in O(1) __global__ void increment_gpu( float *a, float *b, int N) { int idx = blockIdx.x * blockDim.x+ threadIdx.x; if (idx < N) a[idx] = a[idx] + b[idx]; } void main() { ….. int blockSize = 500; dim3 dimBlock (blocksize); dim3 dimGrid( ceil( N / (float)blocksize) ); increment_gpu<<<dimGrid, dimBlock>>>(a, b, N); }
CUDA
Histogram in CUDA using atomicInc
//compute a histogram b of the array a with a bin size of d __global__ void histogram( float *a, int *b, float d, int N) { int idx = blockIdx.x * blockDim.x+ threadIdx.x; if (idx < N){ int bin = int(a[idx]/d); atomicInc(b+bin, N); } } void main() { ….. int blockSize = 500; dim3 dimBlock (blocksize); dim3 dimGrid( ceil( N / (float)blocksize) ); histogram <<<dimGrid, dimBlock>>>(a, b, d, N); }
CUDA
Thread accessing privileges
We already know that each thread can:
Read/write per-thread registers Read/write per-block shared memory Read/write per-grid global memory
Additionally each thread can also:
Read/write per-thread local memory Read only per-grid constant memory Read only per-grid texture memory
texture<float,2> my_texture;
declare texture reference
float4 texel = texfetch (my_texture, u, v);
fetch the texel value
CUDA
Any source file containing CUDA language extensions must be compiled with NVCC
NVCC separates code running on the host from code running
- n the device
When running in device emulation mode, one can:
Use host native debug support (breakpoints, inspection, etc.) Access any device-specific data from host code and vice-versa Call any host function from device code (e.g. printf ) and vice- versa Detect deadlock situations caused by improper usage of __syncthreads
C++ source file nvcc CPU GPU nvcc –deviceemu //use for debugging
CUDA
Reducing problems
Reduce N values to a single one
Sum(v0, v1, … , vN-2, vN-1) Max(v0, v1, … , vN-2, vN-1) Min(v0, v1, … , vN-2, vN-1)
CUDA
Example: Sum(V)=Sum(x0, x1, … , xk-2, xk-1)
CUDA
Example: Sum(V)=Sum(x0, x1, … , xk-2, xk-1)
__global__ void Sum(int *X, int *sum) { extern __shared__ int sdata[]; // each thread loads one element from global to shared mem unsigned int tid = threadIdx.x; unsigned int i = blockIdx.x*blockDim.x + threadIdx.x; sdata[tid] = X[i]; __syncthreads(); // do reduction in shared mem for(unsigned int s=1; s < blockDim.x; s *= 2) { if (tid % (2*s) == 0) { sdata[tid] += sdata[tid + s]; } __syncthreads(); } // write result for this block to global mem if (tid == 0) Y[blockIdx.x] = sdata[0]; }
CUDA
Example: Sum(V)=Sum(x0, x1, … , xk-2, xk-1)
CUDA
Example: Sum(V)=Sum(x0, x1, … , xk-2, xk-1)
__global__ void Sum(int *X, int *sum) { extern __shared__ int sdata[]; // each thread loads one element from global to shared mem unsigned int tid = threadIdx.x; unsigned int i = blockIdx.x*blockDim.x + threadIdx.x; sdata[tid] = X[i]; __syncthreads(); // do reduction in shared mem for(unsigned int s=blockDim.x/2; s>0; s>>=1) { if (tid < s) { sdata[tid] += sdata[tid + s]; } __syncthreads(); } // write result for this block to global mem if (tid == 0) Y[blockIdx.x] = sdata[0]; }
CUDA
Example: Sum(V)=Sum(x0, x1, … , xk-2, xk-1)
__global__ void Sum(int *X, int *sum) { extern __shared__ int sdata[]; // each thread loads one element from global to shared mem unsigned int tid = threadIdx.x; unsigned int i = blockIdx.x*blockDim.x + threadIdx.x; sdata[tid] = X[i]; __syncthreads();
// do reduction in shared mem, unroll for blockDim = 32 If (tid<16) sdata[tid] += sdata[tid + 16]; __syncthreads(); If (tid<8) sdata[tid] += sdata[tid + 8]; __syncthreads(); If (tid<4) sdata[tid] += sdata[tid + 4]; __syncthreads(); If (tid<2) sdata[tid] += sdata[tid + 2]; __syncthreads(); If (tid<1) sdata[tid] += sdata[tid + 1]; __syncthreads();
// write result for this block to global mem if (tid == 0) Y[blockIdx.x] = sdata[0]; }
CUDA
Example: Sparse matrix vector multiplication
Kernel construction: y=Ax Strategy
One thread is assigned to each row in matrix A
CUDA
Example: Sparse matrix vector multiplication
Kernel construction: y=Ax Strategy
One thread is assigned to each row in matrix A
CUDA
Example: Sparse matrix vector multiplication
Kernel construction: y=Ax Strategy
One thread is assigned to each row in matrix A
CUDA
Example: Sparse matrix vector multiplication
Sparse matrix A is represented by data, rows, and cols
CUDA
Example: Sparse matrix vector multiplication
Sparse diagonal matrix A is represented by data and offsets
CUDA
Example: Sparse matrix vector multiplication
CUDA
Example: Sparse matrix vector multiplication
Sparse non-diagonal matrix A is represented by data and indices Indices is the matrix n x k
k is the maximal number of non-zero elements in row
CUDA
Example: Sparse matrix vector multiplication
CUDA
Example: Sparse matrix vector multiplication
Sparse matrix A is represented by data, indices, and ptr
ptr[i] is an offset in data for the i-th row
CUDA
Example: Sparse matrix vector multiplication
CUDA
Example: Sparse matrix vector multiplication
__global__ void spmv_csr_vector_kernel(const int num_rows , const int * ptr , const int * indices , const float * data , const float * x, float * y) { __shared__ float vals[]; int thread_id = blockDim.x * blockIdx.x + threadIdx.x; int row = blockIdx.x; if (row < num_rows){ int row_start = ptr[row]; int row_end = ptr[row+1]; vals[threadIdx.x] = 0; int jj = row_start + threadIdx.x; If (jj<row_end) vals[threadIdx.x] = data[jj] * x[indices[jj]]; __syncthreads(); // parallel reduction in shared memory if (tid < 16) vals[threadIdx.x] += vals[threadIdx.x + 16]; if (tid < 8) vals[threadIdx.x] += vals[threadIdx.x + 8]; if (tid < 4) vals[threadIdx.x] += vals[threadIdx.x + 4]; if (tid < 2) vals[threadIdx.x] += vals[threadIdx.x + 2]; if (tid < 1) vals[threadIdx.x] += vals[threadIdx.x + 1]; // first thread writes the result if (tid == 0) y[row] += vals[threadIdx.x]; } }
CUDA
Ray-casting of implicit functions
//compute a histogram b of the array a with a bin size of d __global__ void raycaste(out* int,int2 N) { int idx = blockIdx.x * blockDim.x+ threadIdx.x; int idy = blockIdx.y * blockDim.y+ threadIdx.y; if (idx < N.x && idy < N.x){ Ray r(idx,idy); p = function_intersection(r,n); clr = compute_shading(p,n);
- ut[idy*N.y+idx]=clr;
} }