

SLIDE 1

Leftovers from yesterday Basics Memory Control flow Applications

GPU computing

Part 2: CUDA examples (with some admixture of introduction)

Ch. Hoelbling

Wuppertal University

Lattice Practices 2011

Christian Hoelbling (Wuppertal) GPU computing

SLIDE 2

Outline

1. Leftovers from yesterday

2. Basics

3. Memory

4. Control flow

5. Applications

SLIDE 3

CUDA overview

CUDA PROGRAMMING MODEL

Threads are bundled in blocks

  • 1, 2 or 3D index
  • shared memory
  • can synchronize
  • branching with penalty
  • current max. number: 1024

Blocks are bundled in grids

  • 1 or 2D index
  • same kernel for all threads
  • cannot synchronize blocks
  • branch without penalty

SLIDE 4

CUDA overview

CUDA HARDWARE MODEL

Multiprocessor: 32 scalar cores

  • 4 FP ops per core per cycle
  • SIMT (SIMD + branching)
  • 32k registers
  • shared memory
  • works on N × 32 threads (= N warps) at a time
  • extremely fast context switches

GPU contains e.g. 16 MPs

  • share global and texture memory
  • work independently

SLIDE 5

CUDA overview

CUDA MEMORY MODEL

Per thread

  • registers
  • local memory (RW), 512k

Per block

  • shared memory (RW, max. 48k)

Per grid

  • global memory (RW)
  • constant memory (R)
  • texture memory (R, separately cached)

From host (CPU)

  • global memory (RW)
  • constant memory (RW)
  • texture memory (RW)
  • all 3 are persistent

SLIDE 6

Application overview

BENCHMARK

[Figure: benchmark plot of Gflops (5-35) vs. lattice volume, comparing GPU 8800 GTX, GPU 7900 GTX, GPU 7800 GTX and CPU P4 SSE]

SLIDE 7

Application overview

THINGS TO NOTE

GPUs need large local lattice

  • efficient use of massive parallelism

Further parallelization difficult

  • at GPU speeds, the bus is a huge bottleneck
  • unnecessary memcopies due to drivers (rapidly improving)
  • better stay on one GPU!

Computation is for free!

  • GPUs are (memory) bandwidth limited
  • it pays off e.g. to reconstruct an SU(3) matrix from its generators!

Reliability is a very serious issue

  • gamers don’t care about one bad pixel
  • some models > 50% failure rate
  • danger of unnoticed memory corruption (no ECC)
  • TESLA more expensive but not necessarily more reliable

SLIDE 8

Application overview

GPUs IN PRACTICE

Really cheap farming (e.g. thermodynamics)

Best for failsafe code

  • matrix inversion: final residue check on CPU
  • convergence in spite of intermittent errors
  • restart when an error is detected
  • control part on CPU

Why not OpenCL?

  • CUDA currently faster
  • similar (low level) syntax
  • nvcc compiles OpenCL via intermediate CUDA
  • vendor lock-in (ATI OpenCL is getting better)

Rapid development

  • render to texture, CUDA, IEEE, double, caches, multi-GPU, device-to-device, . . .

SLIDE 9

Device code

A SIMPLE KERNEL

    #include <stdio.h>
    #include <stdlib.h>
    #include <math.h>

    #define VL 128

    __global__ void add_s(float* s, float* a, float* b)
    /* s = a + b */
    {
      int i = threadIdx.x;
      s[i] = a[i] + b[i];
    }

SLIDE 10

Device code

A SIMPLE KERNEL

    __global__ void add_s(float* s, float* a, float* b)

  • __global__ ☞ function on device, callable from host
  • __device__ ☞ function on device, callable from device
  • __host__ ☞ function on host, callable from host (default)

SLIDE 11

Device code

A SIMPLE KERNEL

    __host__ __device__ void func()
    {
    #if __CUDA_ARCH__ == 100
      // Device code path for compute capability 1.0
    #elif __CUDA_ARCH__ == 200
      // Device code path for compute capability 2.0
    #elif !defined(__CUDA_ARCH__)
      // Host code path
    #endif
    }

__CUDA_ARCH__ ☞ conditional compilation macro

SLIDE 12

Device code

A SIMPLE KERNEL

    int i = threadIdx.x;

  • threadIdx ☞ thread index within the block, a uint3 with components (x,y,z)
  • blockDim ☞ dimensions of the current thread block
  • blockIdx ☞ block index within the grid
  • gridDim ☞ dimensions of the current grid

SLIDE 13

Setup

CALLING THE KERNEL

How are the dimensions determined?

    __global__ void add_s(float* s, float* a, float* b)
    ...

    int main()
    {
      ...
      /* now call the compute kernel on the device */
      add_s<<<dim_grid,dim_block>>>(_s,_a,_b);
      ...
    }

SLIDE 14

Setup

BASIC MEM ALLOCATION

Allocating device memory:

    /* allocate device memory */
    cudaMalloc((void**)&_a, VL*sizeof(float));

Copying it to the device:

    /* copy a over to the device */
    cudaMemcpy(_a, a, VL*sizeof(float), cudaMemcpyHostToDevice);

Copying the result back from the device:

    /* copy result back to host */
    cudaMemcpy(s, _s, VL*sizeof(float), cudaMemcpyDeviceToHost);

Freeing device memory:

    /* free device memory */
    cudaFree(_a);

SLIDE 15

Setup

SOME SETUP

Checking which devices are present:

    /* get device info */
    int deviceCount;
    cudaGetDeviceCount(&deviceCount);

Querying device properties:

    /* check device properties - only use device 0 */
    cudaDeviceProp deviceProp;
    if (cudaGetDeviceProperties(&deviceProp, 0) == cudaSuccess)
      printf(" Device: %s\n", deviceProp.name);

Selecting a specific device:

    /* run on device 0 */
    cudaSetDevice(0);

SLIDE 16

Setup

SIMPLE EXAMPLE CODE

  • All code in one file (demo.cu)
  • No additional include files
  • No libraries to link explicitly
  • Compile with the nvcc wrapper: nvcc -o demo.x demo.cu
  • Direct machine code possible (PTX pseudo-assembly), but usually not efficient

Runtime:

  • driver compiles PTX into the cubin binary format
  • API allows for low level access

SLIDE 17

Timing

TIMING

Create event structures:

    /* create timing events */
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

Record an event:

    /* start time - all streams */
    cudaEventRecord(start, 0);

SLIDE 18

Timing

TIMING (ctd.)

Record a second event, synchronize and get the elapsed time:

    /* end time - all streams */
    cudaEventRecord(stop, 0);
    /* synchronize after asynchronous call */
    cudaEventSynchronize(stop);
    /* get time between events */
    float dt;
    cudaEventElapsedTime(&dt, start, stop);

SLIDE 19

Host memory

MEMORY OPTIMIZATION

  • Bandwidth in device memory: up to ∼ 200 GB/s
  • Bandwidth host-device: up to 8 GB/s (PCIe 2.0 ×16)

☞ Host-to-device transfers are often the bottleneck

Improve by using pinned (non-pageable) host memory:

    /* allocate non-pageable memory */
    cudaHostAlloc((void**)&a, VL*sizeof(float), cudaHostAllocDefault);

SLIDE 20

Host memory

MEMORY OPTIMIZATION

Even faster: write-combined memory

    /* allocate write-combined non-pageable memory */
    cudaHostAlloc((void**)&a, VL*sizeof(float), cudaHostAllocWriteCombined);

Warning: slow CPU access (only for pushing data to the device)

SLIDE 21

Host memory

MEMORY OPTIMIZATION

Copy on demand: mapped memory

    /* allocate mapped memory */
    cudaHostAlloc((void**)&a, VL*sizeof(float), cudaHostAllocMapped);

Access it from within a device kernel:

    /* get a device pointer to the mapped memory */
    cudaHostGetDevicePointer((void**)&_a, a, 0);

SLIDE 22

Host memory

HIDING MEMORY TRANSFER

Asynchronous memcopy to the device:

    /* copy a over to the device */
    cudaMemcpyAsync(_a, a, size, cudaMemcpyHostToDevice, 0);

  • concurrent with CPU code
  • default “stream” 0 serializes GPU operations
  • create non-default streams to parallelize operations on the GPU:

    /* create streams */
    cudaStreamCreate(&stream1);
    cudaStreamCreate(&stream2);
    /* host to device copy */
    cudaMemcpyAsync(_a, a, size, cudaMemcpyHostToDevice, stream1);
    /* overlay with an independent operation */
    add_s<<<NB,NT,0,stream2>>>(_s,_c,_d);

SLIDE 23

Device memory

GLOBAL MEMORY

Simple compute kernel:

    __global__ void cadd_s(float* s, float* a, int n)
    /* iterate s = s*s + a */
    {
      int i = threadIdx.x + blockDim.x*blockIdx.x;
      for (int j = 0; j < n; j++)
        s[i] = s[i]*s[i] + a[i];
    }

SLIDE 24

Device memory

REGISTER MEMORY

The same with local variables:

    __global__ void cadd_s_loc(float* s, float* a, int n)
    /* iterate s = s*s + a using local storage */
    {
      int i = threadIdx.x + blockDim.x*blockIdx.x;
      float r_s = s[i];
      float r_a = a[i];
      for (int j = 0; j < n; j++)
        r_s = r_s*r_s + r_a;
      s[i] = r_s;
    }

SLIDE 25

Device memory

DEVICE MEMORY OVERVIEW

[Figure: device memory hierarchy diagram, with the connection to the host]

SLIDE 26

Device memory

DECLARING MEMORY

☞ Lots of registers!

☞ Global memory (device+host, cached on ≥ 2.0, RW):

    /* declare global variable */
    __device__ float a;

☞ Shared memory (per block, cached, RW):

    /* declare shared variable */
    __shared__ float a;

☞ Constant memory (device+host, cached separately, RO):

    /* declare a constant */
    __constant__ float c;

☞ Local memory (1 thread, cached on ≥ 2.0, RW): slow

☞ Texture memory (device+host, cached separately, RO): obsolete

SLIDE 27

Device memory

WHICH MEMORY TO USE

Registers: up to 128kB per MP

Shared memory is faster than global

Textures have a separate cache

  • optimized for 2D neighbor access
  • best option for old (1.x) devices
  • less efficient than global on new (2.x) devices

Don’t use local!

  • not physically on the GPU
  • uncached on old (1.x) devices

SLIDE 28

Device memory

MEMORY ALIGNMENT

Offset the memory position of the threads:

    __global__ void cadd_s(float* s, float* a, int n, int ofs)
    /* iterate s = s*s + a */
    {
      int i = threadIdx.x + blockDim.x*blockIdx.x + ofs;
      for (int j = 0; j < n; j++)
        s[i] = s[i]*s[i] + a[i];
    }

SLIDE 29

Device memory

MEMORY ALIGNMENT

The same with local variables:

    __global__ void cadd_s_local(float* s, float* a, int n, int ofs)
    /* iterate s = s*s + a using local storage */
    {
      int i = threadIdx.x + blockDim.x*blockIdx.x + ofs;
      float r_s = s[i];
      float r_a = a[i];
      for (int j = 0; j < n; j++)
        r_s = r_s*r_s + r_a;
      s[i] = r_s;
    }

SLIDE 30

Device memory

MISALIGNMENT PENALTY

[Figure: two plots of GFLOPS vs. offset ofs (4, 8, 12, 16), for the global memory version (1-5 GFLOPS) and the register version (5-25 GFLOPS), showing the misalignment penalty]

SLIDE 31

Device memory

ALIGNMENT CONSIDERATIONS

  • Natural alignment: 128 bytes
  • Optimal access for one warp (32 concurrently executed threads)
  • Half warps (16 threads) without penalty
  • Old (1.x) global access: 64B bursts
  • New (2.x) global access: cached, 128B cache lines
  • 2 cache levels; L1 configurable from 16kB to 48kB per MP
  • 64kB combined L1 + shared memory per MP

SLIDE 32

Data types

FLOAT4 DATA

    __global__ void cadd_s_loc(float4* s, float4* a, int n)
    /* iterate s = s*s + a using local storage */
    {
      int i = threadIdx.x + blockDim.x*blockIdx.x;
      float4 r_s = s[i];
      float4 r_a = a[i];
      for (int j = 0; j < n; j++) {
        r_s.x = r_s.x*r_s.x + r_a.x;
        r_s.y = r_s.y*r_s.y + r_a.y;
        r_s.z = r_s.z*r_s.z + r_a.z;
        r_s.w = r_s.w*r_s.w + r_a.w;
      }
      s[i] = r_s;
    }

SLIDE 33

Data types

VECTOR TYPES

CUDA offers vector types

  • float1, float2, float3, float4
  • double1, double2
  • (u)int1. . .4, (u)longlong1. . .2, char1. . .4, . . .

GPU(driver)-native: 128 bit vector instructions

  • origin: color + transparency (rgbw)
  • MPs can handle “vectors” well

Compiler tries to vectorize; being explicit is better

No penalty for swizzling

SLIDE 34

Branches

BRANCHING

  • Branching is possible
  • Within a warp, branches are serialized
  • Use block sizes that are integer multiples of the warp size
  • Branching on blockIdx then carries no penalty

SLIDE 35

Synchronization

SYNCHRONIZATION

Synchronize threads within a block:

    __global__ void example(float* vec)
    /* sum over vector on GPU */
    {
      int i = threadIdx.x;
      ...
      /* synchronize threads within the block */
      __syncthreads();
      ...
    }

☞ Forces the MP to idle
☞ Should be avoided
☞ Declare variables volatile to ensure memory reads

SLIDE 36

Synchronization

ATOMIC OPERATIONS

Uninterrupted read-modify-write:

    __global__ void sum_s(float* vec)
    /* sum over vector on GPU */
    {
      int i = threadIdx.x;
      atomicAdd(&_su, vec[i]);
    }

SLIDE 37

Time savers

LIBRARIES AND TOOLS

  • CUDA-gdb integration
  • Active thread analyzer tool (throughput)
  • Profiler tools
  • New (2.x) devices allow printf from within a kernel
  • Libraries: BLAS, FFT, LAPACK
  • Additional language bindings: C++, FORTRAN

SLIDE 38

QCD

QCD optimizations

  • Arrange data in 64 byte aligned chunks
  • Computation is essentially free
  • Reconstruct the 3rd row of SU(3) matrices on the fly
  • More aggressive: Gell-Mann basis ☞ unstable special cases
  • Fix axial gauge, eliminate one SU(3) matrix
  • Choose a basis where one γµ is diagonal
  • Multi precision solvers (even tried 16 bit)
  • Large local memory parallelization

SLIDE 39

Future

CUDA 4

  • Prerelease this weekend
  • Pin malloc’ed memory: malloc(a); cudaHostRegister(a);
  • Inline PTX assembly
  • Multiple GPUs per thread, device calls from within OpenMP
  • Device to device copy across multiple devices
  • Unified virtual memory space (host + devices)
