

SLIDE 1

Leftovers from yesterday Basics Memory Control flow Applications

GPU computing

Part 2: CUDA examples (with some admixture of introduction)

Ch. Hoelbling

Wuppertal University

Lattice Practices 2011

Christian Hoelbling (Wuppertal) GPU computing

SLIDE 2

Outline

1. Leftovers from yesterday

2. Basics

3. Memory

4. Control flow

5. Applications

SLIDE 3

CUDA overview

CUDA PROGRAMMING MODEL

Threads are bundled in blocks

  • 1, 2 or 3D index
  • shared memory
  • can synchronize
  • branching with penalty
  • current max. number: 1024

Blocks are bundled in grids

  • 1 or 2D index
  • same kernel for all threads
  • cannot synchronize blocks
  • branch without penalty

SLIDE 4

CUDA overview

CUDA HARDWARE MODEL

Multiprocessor: 32 scalar cores

  • 4 FP ops per core per cycle
  • SIMT (SIMD + branching)
  • 32k registers
  • shared memory
  • works on N × 32 threads (= N warps) at a time
  • extremely fast context switches

GPU contains e.g. 16 MPs

  • share global and texture memory
  • work independently

SLIDE 5

CUDA overview

CUDA MEMORY MODEL

Per thread

  • registers
  • local memory (RW), 512k

Per block

  • shared memory (RW, max. 48k)

Per grid

  • global memory (RW)
  • constant memory (R)
  • texture memory (R, separately cached)

From host (CPU)

  • global memory (RW)
  • constant memory (RW)
  • texture memory (RW)
  • all 3 are persistent

SLIDE 6

Application overview

BENCHMARK

[Figure: benchmark plot of Gflops (5-35) vs. lattice volume, comparing GPU 8800 GTX, GPU 7900 GTX, GPU 7800 GTX and CPU P4 SSE]

SLIDE 7

Application overview

THINGS TO NOTE

GPUs need large local lattice

  • efficient use of massive parallelism

Further parallelization difficult

  • at GPU speeds, the bus is a huge bottleneck
  • unnecessary memcopies due to drivers (rapidly improving)
  • better stay on one GPU!

Computation is for free!

  • GPUs are (memory) bandwidth limited
  • it pays off e.g. to reconstruct an SU(3) matrix from its generators!

Reliability is a very serious issue

  • gamers don’t care about one bad pixel
  • some models > 50% failure rate
  • danger of unnoticed memory corruption (no ECC)
  • TESLA more expensive but not necessarily more reliable

SLIDE 8

Application overview

GPUs IN PRACTICE

Really cheap farming (e.g. thermodynamics)

Best for failsafe code

  • matrix inversion: final residue check on CPU
  • convergence in spite of intermittent errors
  • restart when an error is detected
  • control part on CPU

Why not OpenCL?

  • CUDA currently faster
  • similar (low level) syntax
  • nvcc compiles OpenCL via intermediate CUDA
  • vendor lock-in (ATI OpenCL is getting better)

Rapid development

  • render to texture, CUDA, IEEE, double, caches, multi-GPU, device-to-device, . . .

SLIDE 9

Device code

A SIMPLE KERNEL

    #include <stdio.h>
    #include <stdlib.h>
    #include <math.h>

    #define VL 128

    __global__ void add_s(float* s, float* a, float* b)
    /* s = a + b */
    {
      int i = threadIdx.x;
      s[i] = a[i] + b[i];
    }

SLIDE 10

Device code

A SIMPLE KERNEL

    __global__ void add_s(float* s, float* a, float* b)

  • __global__ ☞ function on device, callable from host
  • __device__ ☞ function on device, callable from device
  • __host__ ☞ function on host, callable from host (default)

SLIDE 11

Device code

A SIMPLE KERNEL

    __host__ __device__ void func()
    {
    #if __CUDA_ARCH__ == 100
      // Device code path for compute capability 1.0
    #elif __CUDA_ARCH__ == 200
      // Device code path for compute capability 2.0
    #elif !defined(__CUDA_ARCH__)
      // Host code path
    #endif
    }

__CUDA_ARCH__ ☞ conditional compilation macro

SLIDE 12

Device code

A SIMPLE KERNEL

    int i = threadIdx.x;

  • threadIdx ☞ thread index within the block, a uint3 with components (x,y,z)
  • blockDim ☞ dimensions of the current thread block
  • blockIdx ☞ block index within the grid
  • gridDim ☞ dimensions of the current grid

SLIDE 13

Setup

CALLING THE KERNEL

How are the dimensions determined?

    __global__ void add_s(float* s, float* a, float* b)
    ...

    int main()
    {
      ...
      /* now call the compute kernel on the device */
      add_s<<<dim_grid,dim_block>>>(_s,_a,_b);
      ...
    }

SLIDE 14

Setup

BASIC MEM ALLOCATION

Allocating device memory:

    /* allocate device memory */
    cudaMalloc((void**)&_a, VL*sizeof(float));

Copying it to the device:

    /* copy a over to the device */
    cudaMemcpy(_a, a, VL*sizeof(float), cudaMemcpyHostToDevice);

Copying the result back from the device:

    /* copy result back to host */
    cudaMemcpy(s, _s, VL*sizeof(float), cudaMemcpyDeviceToHost);

Freeing device memory:

    /* free device memory */
    cudaFree(_a);

SLIDE 15

Setup

SOME SETUP

Checking which devices are present:

    /* get device info */
    int deviceCount;
    cudaGetDeviceCount(&deviceCount);

Querying device properties:

    /* check device properties - only use device 0 */
    cudaDeviceProp deviceProp;
    if (cudaGetDeviceProperties(&deviceProp, 0) == cudaSuccess)
      printf(" Device: %s\n", deviceProp.name);

Selecting a specific device:

    /* run on device 0 */
    cudaSetDevice(0);

SLIDE 16

Setup

SIMPLE EXAMPLE CODE

  • All code in one file (demo.cu)
  • No additional include files
  • No libraries to link explicitly
  • Compile with the nvcc wrapper: nvcc -o demo.x demo.cu
  • Direct machine code possible (PTX pseudo-assembly), but usually not efficient

Runtime:

  • driver compiles PTX into the cubin binary format
  • API allows for low level access

SLIDE 17

Timing

TIMING

Create event structures:

    /* create timing events */
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

Record an event:

    /* start time - all streams */
    cudaEventRecord(start, 0);

SLIDE 18

Timing

TIMING (ctd.)

Record a second event, synchronize and get the elapsed time:

    /* end time - all streams */
    cudaEventRecord(stop, 0);
    /* synchronize after asynchronous call */
    cudaEventSynchronize(stop);
    /* get time between events */
    float dt;
    cudaEventElapsedTime(&dt, start, stop);

SLIDE 19

Host memory

MEMORY OPTIMIZATION

  • Bandwidth in device memory: up to ∼ 200 GB/s
  • Bandwidth host-device: up to 8 GB/s (PCIe 2.0 ×16)

☞ Host-to-device transfers are often the bottleneck

Improve by using pinned (non-pageable) host memory:

    /* allocate non-pageable memory */
    cudaHostAlloc((void**)&a, VL*sizeof(float), cudaHostAllocDefault);

SLIDE 20

Host memory

MEMORY OPTIMIZATION

Even faster: write-combined memory

    /* allocate write-combined non-pageable memory */
    cudaHostAlloc((void**)&a, VL*sizeof(float), cudaHostAllocWriteCombined);

Warning: slow CPU access (only for pushing data to the device)

SLIDE 21

Host memory

MEMORY OPTIMIZATION

Copy on demand: mapped memory

    /* allocate mapped memory */
    cudaHostAlloc((void**)&a, VL*sizeof(float), cudaHostAllocMapped);

Access it from within a device kernel:

    /* get a device pointer to the mapped memory */
    cudaHostGetDevicePointer((void**)&_a, a, 0);

SLIDE 22

Host memory

HIDING MEMORY TRANSFER

Asynchronous memcopy to the device:

    /* copy a over to the device */
    cudaMemcpyAsync(_a, a, size, cudaMemcpyHostToDevice, 0);

  • concurrent with CPU code
  • default “stream” 0 serializes GPU operations
  • create non-default streams to parallelize operations on the GPU:

    /* create streams */
    cudaStreamCreate(&stream1);
    cudaStreamCreate(&stream2);
    /* host to device copy */
    cudaMemcpyAsync(_a, a, size, cudaMemcpyHostToDevice, stream1);
    /* overlay with an independent operation */
    add_s<<<NB,NT,0,stream2>>>(_s,_c,_d);

SLIDE 23

Device memory

GLOBAL MEMORY

Simple compute kernel:

    __global__ void cadd_s(float* s, float* a, int n)
    /* iterate s = s*s + a */
    {
      int i = threadIdx.x + blockDim.x*blockIdx.x;
      for (int j = 0; j < n; j++)
        s[i] = s[i]*s[i] + a[i];
    }

SLIDE 24

Device memory

REGISTER MEMORY

The same with local variables:

    __global__ void cadd_s_loc(float* s, float* a, int n)
    /* iterate s = s*s + a using local storage */
    {
      int i = threadIdx.x + blockDim.x*blockIdx.x;
      float r_s = s[i];
      float r_a = a[i];
      for (int j = 0; j < n; j++)
        r_s = r_s*r_s + r_a;
      s[i] = r_s;
    }

SLIDE 25

Device memory

DEVICE MEMORY OVERVIEW

[Figure: device memory hierarchy diagram, with the connection to the host]

SLIDE 26

Device memory

DECLARING MEMORY

☞ Lots of registers!

☞ Global memory (device+host, cached on ≥ 2.0, RW):

    /* declare global variable */
    __device__ float a;

☞ Shared memory (per block, cached, RW):

    /* declare shared variable */
    __shared__ float a;

☞ Constant memory (device+host, cached separately, RO):

    /* declare a constant */
    __constant__ float c;

☞ Local memory (1 thread, cached on ≥ 2.0, RW): slow

☞ Texture memory (device+host, cached separately, RO): obsolete

SLIDE 27

Device memory

WHICH MEMORY TO USE

Registers: up to 128kB per MP

Shared memory is faster than global

Textures have a separate cache

  • optimized for 2D neighbor access
  • best option for old (1.x) devices
  • less efficient than global on new (2.x) devices

Don’t use local!

  • not physically on the GPU
  • uncached on old (1.x) devices

SLIDE 28

Device memory

MEMORY ALIGNMENT

Offset the memory position of the threads:

    __global__ void cadd_s(float* s, float* a, int n, int ofs)
    /* iterate s = s*s + a */
    {
      int i = threadIdx.x + blockDim.x*blockIdx.x + ofs;
      for (int j = 0; j < n; j++)
        s[i] = s[i]*s[i] + a[i];
    }

SLIDE 29

Device memory

MEMORY ALIGNMENT

The same with local variables:

    __global__ void cadd_s_local(float* s, float* a, int n, int ofs)
    /* iterate s = s*s + a using local storage */
    {
      int i = threadIdx.x + blockDim.x*blockIdx.x + ofs;
      float r_s = s[i];
      float r_a = a[i];
      for (int j = 0; j < n; j++)
        r_s = r_s*r_s + r_a;
      s[i] = r_s;
    }

SLIDE 30

Device memory

MISALIGNMENT PENALTY

[Figure: two plots of GFLOPS vs. offset ofs (4, 8, 12, 16), for the global memory version (1-5 GFLOPS) and the register version (5-25 GFLOPS), showing the misalignment penalty]

SLIDE 31

Device memory

ALIGNMENT CONSIDERATIONS

  • Natural alignment: 128 bytes
  • Optimal access for one warp (32 concurrently executed threads)
  • Half warps (16 threads) without penalty
  • Old (1.x) global access: 64B bursts
  • New (2.x) global access: cached, 128B cache lines
  • 2 cache levels; L1 configurable from 16kB to 48kB per MP
  • 64kB combined L1 + shared memory per MP

SLIDE 32

Data types

FLOAT4 DATA

    __global__ void cadd_s_loc(float4* s, float4* a, int n)
    /* iterate s = s*s + a using local storage */
    {
      int i = threadIdx.x + blockDim.x*blockIdx.x;
      float4 r_s = s[i];
      float4 r_a = a[i];
      for (int j = 0; j < n; j++) {
        r_s.x = r_s.x*r_s.x + r_a.x;
        r_s.y = r_s.y*r_s.y + r_a.y;
        r_s.z = r_s.z*r_s.z + r_a.z;
        r_s.w = r_s.w*r_s.w + r_a.w;
      }
      s[i] = r_s;
    }

SLIDE 33

Data types

VECTOR TYPES

CUDA offers vector types

  • float1, float2, float3, float4
  • double1, double2
  • (u)int1. . .4, (u)longlong1. . .2, char1. . .4, . . .

GPU(driver)-native: 128 bit vector instructions

  • origin: color + transparency (rgbw)
  • MPs can handle “vectors” well

Compiler tries to vectorize; being explicit is better

No penalty for swizzling

SLIDE 34

Branches

BRANCHING

  • Branching is possible
  • Within a warp, branches are serialized
  • Use block sizes that are integer multiples of the warp size
  • Branching on blockIdx then carries no penalty

SLIDE 35

Synchronization

SYNCHRONIZATION

Synchronize threads within a block:

    __global__ void example(float* vec)
    /* sum over vector on GPU */
    {
      int i = threadIdx.x;
      ...
      /* synchronize threads within the block */
      __syncthreads();
      ...
    }

☞ Forces the MP to idle
☞ Should be avoided
☞ Declare variables volatile to ensure memory reads

SLIDE 36

Synchronization

ATOMIC OPERATIONS

Uninterrupted read-modify-write:

    __global__ void sum_s(float* vec)
    /* sum over vector on GPU */
    {
      int i = threadIdx.x;
      atomicAdd(&_su, vec[i]);
    }

SLIDE 37

Time savers

LIBRARIES AND TOOLS

  • CUDA-gdb integration
  • Active thread analyzer tool (throughput)
  • Profiler tools
  • New (2.x) devices allow printf from within a kernel
  • Libraries: BLAS, FFT, LAPACK
  • Additional language bindings: C++, FORTRAN

SLIDE 38

QCD

QCD optimizations

  • Arrange data in 64 byte aligned chunks
  • Computation is essentially free
  • Reconstruct the 3rd row of SU(3) matrices on the fly
  • More aggressive: Gell-Mann basis ☞ unstable special cases
  • Fix axial gauge, eliminate one SU(3) matrix
  • Choose a basis where one γµ is diagonal
  • Multi precision solvers (even tried 16 bit)
  • Large local memory parallelization

SLIDE 39

Future

CUDA 4

  • Prerelease this weekend
  • Pin malloc’ed memory: malloc(a); cudaHostRegister(a);
  • Inline PTX assembly
  • Multiple GPUs per thread, device calls from within OpenMP
  • Device to device copy across multiple devices
  • Unified virtual memory space (host + devices)
