GPU WORKSHOP
University of Maryland
Intro to GPU Computing


SLIDE 1

University of Maryland

GPU WORKSHOP

SLIDE 2

AGENDA
1. Intro to GPU Computing
2. OpenACC with hands-on
3. CUDA C/C++ with hands-on

SLIDE 3

Parallel programming—Why do you care?

SLIDE 4

The world IS parallel

SLIDE 5

Accelerator Programming—Why do you care?

SLIDE 6

HPC'S BIGGEST CHALLENGE: POWER

Power of a 300-petaflop CPU-only supercomputer = power for the city of San Francisco.

SLIDE 7

UNPRECEDENTED VALUE TO SCIENTIFIC COMPUTING

AMBER Molecular Dynamics Simulation, DHFR NVE Benchmark:
- 1 Tesla K40 GPU: 102 ns/day
- 64 Sandy Bridge CPUs: 58 ns/day

SLIDE 8

3 WAYS TO ACCELERATE APPLICATIONS

Applications can be accelerated through:
- Libraries: "drop-in" acceleration
- OpenACC directives: easily accelerate applications
- Programming languages: maximum flexibility

SLIDE 9

GPU ACCELERATED LIBRARIES

"Drop-in" acceleration for your applications:
- Linear Algebra (FFT, BLAS, SPARSE, Matrix): NVIDIA cuFFT, cuBLAS, cuSPARSE, NVIDIA Math Lib
- Numerical & Math (RAND, Statistics): NVIDIA cuRAND
- Data Structures & AI (Sort, Scan, Zero Sum): GPU AI (Board Games), GPU AI (Path Finding)
- Visual Processing (Image & Video): NVIDIA NPP, NVIDIA Video Encode

A library-based SAXPY sketch follows.
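"Drop-in" here means a standard API call replaces a hand-written kernel. As a minimal sketch (not from the slides), the SAXPY of the later examples done through cuBLAS; it assumes the CUDA toolkit, uses managed memory (CUDA 6+) for brevity, and omits error checking:

// Sketch: SAXPY via cuBLAS instead of a hand-written kernel.
// Build (assumed): nvcc saxpy_cublas.cu -lcublas
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    int n = 1 << 20;
    float a = 2.0f;
    float *x, *y;

    cudaMallocManaged(&x, n * sizeof(float));   // visible to CPU and GPU
    cudaMallocManaged(&y, n * sizeof(float));
    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

    cublasHandle_t handle;
    cublasCreate(&handle);
    cublasSaxpy(handle, n, &a, x, 1, y, 1);     // y = a*x + y on the GPU
    cudaDeviceSynchronize();                    // wait before reading on host

    printf("y[0] = %f (expect 4.0)\n", y[0]);
    cublasDestroy(handle);
    cudaFree(x); cudaFree(y);
    return 0;
}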

SLIDE 10

3 WAYS TO ACCELERATE APPLICATIONS

Applications can be accelerated through:
- Libraries: "drop-in" acceleration
- OpenACC directives: easily accelerate applications
- Programming languages: maximum flexibility

SLIDE 11

OPENACC DIRECTIVES

Program myscience
  ... serial code ...             (runs on the CPU)
!$acc kernels                     (OpenACC compiler hint)
  do k = 1, n1
    do i = 1, n2
      ... parallel code ...       (runs on the GPU)
    enddo
  enddo
!$acc end kernels
  ...
End Program myscience

- Your original Fortran or C code
- Simple compiler hints
- Compiler parallelizes the code
- Works on many-core GPUs and multicore CPUs

SLIDE 12

FAMILIAR TO OPENMP PROGRAMMERS

OpenMP (CPU):

main()
{
  double pi = 0.0; long i;
  #pragma omp parallel for reduction(+:pi)
  for (i = 0; i < N; i++)
  {
    double t = (double)((i + 0.05) / N);
    pi += 4.0 / (1.0 + t*t);
  }
  printf("pi = %f\n", pi / N);
}

OpenACC (CPU or GPU):

main()
{
  double pi = 0.0; long i;
  #pragma acc kernels
  for (i = 0; i < N; i++)
  {
    double t = (double)((i + 0.05) / N);
    pi += 4.0 / (1.0 + t*t);
  }
  printf("pi = %f\n", pi / N);
}

SLIDE 13

DIRECTIVES: EASY & POWERFUL

- Real-Time Object Detection (Global Manufacturer of Navigation Systems): 5x in 40 hours
- Valuation of Stock Portfolios using Monte Carlo (Global Technology Consulting Company): 2x in 4 hours
- Interaction of Solvents and Biomolecules (University of Texas at San Antonio): 5x in 8 hours

"Optimizing code with directives is quite easy, especially compared to CPU threads or writing CUDA kernels. The most important thing is avoiding restructuring of existing code for production applications."
  - Developer at the Global Manufacturer of Navigation Systems

SLIDE 14

A VERY SIMPLE EXERCISE: SAXPY

SAXPY in Fortran:

subroutine saxpy(n, a, x, y)
  real :: x(:), y(:), a
  integer :: n, i
!$acc kernels
  do i = 1, n
    y(i) = a*x(i) + y(i)
  enddo
!$acc end kernels
end subroutine saxpy

...
! Perform SAXPY on 1M elements
call saxpy(2**20, 2.0, x_d, y_d)
...

SAXPY in C:

void saxpy(int n, float a, float *x, float *restrict y)
{
  #pragma acc kernels
  for (int i = 0; i < n; ++i)
    y[i] = a*x[i] + y[i];
}

...
// Perform SAXPY on 1M elements
saxpy(1<<20, 2.0, x, y);
...
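The slide stops at the call site. As a hedged sketch (not from the deck) of a complete host driver for the C version, assuming an OpenACC-capable compiler such as pgcc with the -acc flag:

// Hypothetical driver for the C SAXPY above.
#include <stdio.h>
#include <stdlib.h>

void saxpy(int n, float a, float *x, float *restrict y);  // defined above

int main(void)
{
    int n = 1 << 20;
    float *x = (float *)malloc(n * sizeof(float));
    float *y = (float *)malloc(n * sizeof(float));
    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

    saxpy(n, 2.0f, x, y);   // compiler manages CPU<->GPU data movement

    printf("y[0] = %f (expect 4.0)\n", y[0]);
    free(x); free(y);
    return 0;
}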

SLIDE 15

GPU Architecture

SLIDE 16

GPU ARCHITECTURE

Two main components:

Global memory
- Analogous to RAM in a CPU server
- Accessible by both GPU and CPU
- Currently up to 12 GB
- ECC on/off options for Quadro and Tesla products

Streaming Multiprocessors (SMs)
- Perform the actual computation
- Each SM has its own control units, registers, execution pipelines, and caches

SLIDE 17

GPU ARCHITECTURE

Streaming Multiprocessor (SM)
- Many CUDA cores per SM (architecture dependent)
- Special-function units: cos/sin/tan, etc.
- Shared memory + L1 cache
- Thousands of 32-bit registers

[SM diagram: instruction cache; two schedulers with dispatch units; register file; 32 CUDA cores; 16 load/store units; 4 special-function units; interconnect network; 64 KB configurable cache/shared memory; uniform cache]

SLIDE 18

GPU ARCHITECTURE

CUDA Core
- Floating-point and integer units
  - IEEE 754-2008 floating-point standard
  - Fused multiply-add (FMA) instruction for both single and double precision
- Logic unit
- Move, compare unit
- Branch unit

[CUDA core diagram: dispatch port, operand collector, FP unit, INT unit, result queue]

SLIDE 19

GPU MEMORY HIERARCHY REVIEW

[Diagram: each SM (SM-0, SM-1, ..., SM-N) has its own registers, L1 cache, and shared memory (SMEM); all SMs share a unified L2 cache, which fronts global memory]

SLIDE 20

GPU ARCHITECTURE

Memory system on each SM: extremely fast but small, i.e., tens of KB. The programmer chooses whether to use this on-chip storage as L1 cache or shared memory.

L1
- Hardware-managed; used for things like register spilling
- Should NOT be treated like a CPU cache

Shared memory (programmer MUST synchronize data accesses!)
- User-managed scratchpad
- For repeated access to the same data, or for multiple threads sharing the same data

SLIDE 21

GPU ARCHITECTURE

Memory system on each GPU board:

Unified L2 cache (hundreds of KB)
- Fast, coherent data sharing across all cores in the GPU

DRAM
- ECC supported for GDDR5 memory

All major internal memories are ECC protected
- Register file, L1 cache, L2 cache

SLIDE 22

CUDA Programming model

SLIDE 23

ANATOMY OF A CUDA C/C++ APPLICATION

- Serial code executes in a host (CPU) thread
- Parallel code executes in many device (GPU) threads across multiple processing elements

[Diagram: a CUDA C/C++ application alternates between serial code on the host (CPU) and parallel code on the device (GPU)]

SLIDE 24

CUDA C: C WITH A FEW KEYWORDS

Standard C code:

void saxpy_serial(int n, float a, float *x, float *y)
{
  for (int i = 0; i < n; ++i)
    y[i] = a*x[i] + y[i];
}

// Invoke serial SAXPY kernel
saxpy_serial(n, 2.0, x, y);

Parallel C code:

__global__ void saxpy_parallel(int n, float a, float *x, float *y)
{
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  if (i < n)
    y[i] = a*x[i] + y[i];
}

// Invoke parallel SAXPY kernel with 256 threads/block
int nblocks = (n + 255) / 256;
saxpy_parallel<<<nblocks, 256>>>(n, 2.0, x, y);

SLIDE 25

CUDA KERNELS

- Parallel portions of the application execute as kernels
- The entire GPU executes a kernel, using many threads
- CUDA threads are lightweight, fast-switching, and thousands execute simultaneously
- CPU (host) executes functions; GPU (device) executes kernels

SLIDE 26

CUDA KERNELS: PARALLEL THREADS

A kernel is a function executed on the GPU as an array of threads in parallel.
- All threads execute the same code, but can take different paths
- Each thread has an ID, used to select input/output data and make control decisions

float x = input[threadIdx.x];
float y = func(x);
output[threadIdx.x] = y;
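The fragment above in context, as a hedged sketch; func, apply_func, input, and output are illustrative names, not from the slides:

__device__ float func(float x) { return 2.0f * x; }   // any per-element work

__global__ void apply_func(const float *input, float *output)
{
    float x = input[threadIdx.x];    // each thread selects its own input
    float y = func(x);
    output[threadIdx.x] = y;         // and writes its own output
}

// Launch with, say, one block of 256 threads:
// apply_func<<<1, 256>>>(d_in, d_out);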
SLIDE 27

CUDA Kernels: Subdivide into Blocks

SLIDE 28

CUDA Kernels: Subdivide into Blocks

Threads are grouped into blocks

SLIDE 29

CUDA Kernels: Subdivide into Blocks

Threads are grouped into blocks Blocks are grouped into a grid

SLIDE 30

CUDA Kernels: Subdivide into Blocks

Threads are grouped into blocks Blocks are grouped into a grid A kernel is executed as a grid of blocks of threads

SLIDE 31

CUDA Kernels: Subdivide into Blocks

Threads are grouped into blocks Blocks are grouped into a grid A kernel is executed as a grid of blocks of threads

GPU

SLIDE 32

Kernel Execution

CUDA thread → CUDA core
- Each thread is executed by a core

CUDA thread block → CUDA Streaming Multiprocessor
- Each block is executed by one SM and does not migrate
- Several concurrent blocks can reside on one SM, depending on the blocks' memory requirements and the SM's memory resources

CUDA kernel grid → CUDA-enabled GPU
- Each kernel is executed on one device
- Multiple kernels can execute on a device at one time

SLIDE 33

Thread blocks allow cooperation

Threads may need to cooperate to:
- Cooperatively load/store blocks of memory that they all will use
- Share results with each other, or cooperate to produce a single result
- Synchronize with each other

A sketch of such cooperation follows.
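As a hedged illustration (not from the slides) of all three kinds of cooperation, a block-level sum: threads stage values in shared memory, synchronize, and combine them into a single result. It assumes 256 threads per block (a power of two); block_sum is an illustrative name:

__global__ void block_sum(const float *in, float *block_results)
{
    __shared__ float partial[256];             // one copy per block
    int tid = threadIdx.x;
    partial[tid] = in[blockIdx.x * blockDim.x + tid];
    __syncthreads();                           // all loads visible to all threads

    // Tree reduction: halve the active threads each round.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride)
            partial[tid] += partial[tid + stride];
        __syncthreads();                       // wait before the next round
    }
    if (tid == 0)
        block_results[blockIdx.x] = partial[0];   // one result per block
}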

SLIDE 34

THREAD BLOCKS ALLOW SCALABILITY

Blocks can execute in any order, concurrently or sequentially. This independence between blocks gives scalability: a kernel scales across any number of SMs.

[Diagram: the same kernel grid of Blocks 0-7 runs two blocks at a time on a device with 2 SMs, and four at a time on a device with 4 SMs]

SLIDE 35

Memory System Hierarchy

SLIDE 36

MEMORY HIERARCHY

Thread:
- Registers

SLIDE 37

MEMORY HIERARCHY

Thread:

- Registers
- Local memory

[Diagram: each thread has its own local memory]

SLIDE 38

MEMORY HIERARCHY

Thread:
- Registers
- Local memory

Block of threads:
- Shared memory

SLIDE 39

MEMORY HIERARCHY: SHARED MEMORY

__shared__ int a[SIZE];

- Allocated per thread block, same lifetime as the block
- Accessible by any thread in the block
- Several uses: sharing data among threads in a block; user-managed cache (reducing gmem accesses). See the sketch below.
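A hedged sketch of the user-managed-cache use: each block stages a tile of global memory once, then every thread reuses neighbors from on-chip storage instead of re-reading gmem. TILE and smooth are illustrative names; the launch is assumed to use TILE threads per block:

#define TILE 256

__global__ void smooth(const float *in, float *out)
{
    __shared__ float a[TILE];        // per-block, lives as long as the block
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    a[threadIdx.x] = in[i];          // one global-memory read per element
    __syncthreads();                 // synchronize before reading neighbors

    if (threadIdx.x > 0 && threadIdx.x < TILE - 1)   // interior threads only
        out[i] = (a[threadIdx.x-1] + a[threadIdx.x] + a[threadIdx.x+1]) / 3.0f;
}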

SLIDE 40

MEMORY HIERARCHY

Thread:
- Registers
- Local memory

Block of threads:
- Shared memory

All blocks:
- Global memory

SLIDE 41

MEMORY HIERARCHY: GLOBAL MEMORY

- Accessible by all threads of any kernel
- Data lifetime: from allocation to deallocation by host code
  - cudaMalloc(void **pointer, size_t nbytes)
  - cudaMemset(void *pointer, int value, size_t count)
  - cudaFree(void *pointer)

SLIDE 42

CUDA memory management

SLIDE 43

MEMORY SPACES

CPU and GPU have separate memory spaces:
- Data is moved across the PCIe bus
- Use functions to allocate/set/copy memory on the GPU, just like standard C

Pointers are just addresses:
- Can't tell from the pointer value whether the address is on the CPU or GPU; must use cudaPointerGetAttributes(...) (see the sketch below)
- Must exercise care when dereferencing: dereferencing a CPU pointer on the GPU will likely crash, and dereferencing a GPU pointer on the CPU will likely crash
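A hedged sketch of the pointer check; note that the attributes field name changed across CUDA versions (.memoryType in older toolkits, .type from CUDA 10), and older toolkits return an error for ordinary host pointers:

#include <cuda_runtime.h>
#include <stdio.h>

void where_is(const void *p)
{
    struct cudaPointerAttributes attr;
    if (cudaPointerGetAttributes(&attr, p) == cudaSuccess)
        printf("%s pointer\n",
               attr.memoryType == cudaMemoryTypeDevice ? "device" : "host");
    else
        printf("plain host pointer (unknown to CUDA)\n");   // older toolkits
}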

SLIDE 44

GPU MEMORY ALLOCATION / RELEASE

Host (CPU) manages device (GPU) memory:
- cudaMalloc(void **pointer, size_t nbytes)
- cudaMemset(void *pointer, int value, size_t count)
- cudaFree(void *pointer)

int n = 1024;
int nbytes = 1024*sizeof(int);
int *d_a = 0;
cudaMalloc((void**)&d_a, nbytes);
cudaMemset(d_a, 0, nbytes);
cudaFree(d_a);

Note: device memory from the GPU's point of view is also referred to as global memory.

SLIDE 45

DATA COPIES

cudaMemcpy(void *dst, void *src, size_t nbytes, enum cudaMemcpyKind direction);
- Returns after the copy is complete
- Blocks the CPU thread until all bytes have been copied
- Doesn't start copying until previous CUDA calls complete

enum cudaMemcpyKind:
- cudaMemcpyHostToDevice
- cudaMemcpyDeviceToHost
- cudaMemcpyDeviceToDevice

Non-blocking memcopies are also provided. A round-trip sketch follows.
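Putting allocation, copies, and a launch together, as a minimal round-trip sketch (scale is an illustrative kernel; a single block is used for brevity):

#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

__global__ void scale(float *d, float s) { d[threadIdx.x] *= s; }

int main(void)
{
    int n = 256, nbytes = n * sizeof(float);
    float *h_a = (float *)malloc(nbytes);
    float *d_a = 0;
    for (int i = 0; i < n; ++i) h_a[i] = (float)i;

    cudaMalloc((void **)&d_a, nbytes);
    cudaMemcpy(d_a, h_a, nbytes, cudaMemcpyHostToDevice);  // blocks until done
    scale<<<1, n>>>(d_a, 2.0f);                            // async launch
    cudaMemcpy(h_a, d_a, nbytes, cudaMemcpyDeviceToHost);  // waits for kernel

    printf("h_a[10] = %f (expect 20.0)\n", h_a[10]);
    cudaFree(d_a); free(h_a);
    return 0;
}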

SLIDE 46

Basic kernels and execution

SLIDE 47

CUDA PROGRAMMING MODEL REVISITED

- Parallel code (kernel) is launched and executed on a device by many threads
- Threads are grouped into thread blocks
- Parallel code is written for a thread; each thread is free to execute a unique code path
- Built-in thread and block ID variables

SLIDE 48

THREAD HIERARCHY

- Threads launched for a parallel section are partitioned into thread blocks
- Grid = all blocks for a given launch
- A thread block is a group of threads that can synchronize their execution and communicate via shared memory

SLIDE 49

IDS AND DIMENSIONS

Device Grid 1 Block (0, 0) Block (1, 0) Block (2, 0) Block (0, 1) Block (1, 1) Block (2, 1)

(Continued)

Threads 3D IDs, unique within a block Blocks 2D IDs, unique within a grid Dimensions set at launch time Can be unique for each grid Built-in variables threadIdx, blockIdx blockDim, gridDim

SLIDE 50

IDS AND DIMENSIONS

Device Grid 1 Block (0, 0) Block (1, 0) Block (2, 0) Block (0, 1) Block (1, 1) Block (2, 1) Block (1, 1)

Thread (0, 1) Thread (1, 1) Thread (2, 1) Thread (3, 1) Thread (4, 1) Thread (0, 2) Thread (1, 2) Thread (2, 2) Thread (3, 2) Thread (4, 2) Thread (0, 0) Thread (1, 0) Thread (2, 0) Thread (3, 0) Thread (4, 0)

Threads 3D IDs, unique within a block Blocks 2D IDs, unique within a grid Dimensions set at launch time Can be unique for each grid Built-in variables threadIdx, blockIdx blockDim, gridDim
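A hedged sketch (index2d is an illustrative name) of how the built-in variables combine into a unique global index, using the 3x2 grid of 5x3 blocks from the figure:

__global__ void index2d(int *out, int width)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;   // global x
    int row = blockIdx.y * blockDim.y + threadIdx.y;   // global y
    out[row * width + col] = row * width + col;        // flatten to 1D
}

// A launch matching the figure above: a 3x2 grid of 5x3 blocks.
// dim3 grid(3, 2), block(5, 3);
// index2d<<<grid, block>>>(d_out, 3 * 5);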

SLIDE 51

LAUNCHING KERNELS ON GPU

Launch parameters (triple-chevron <<<...>>> notation):
- Grid dimensions (up to 2D), dim3 type
- Thread-block dimensions (up to 3D), dim3 type
- Shared memory: number of bytes per block, for extern smem variables declared without size (optional, 0 by default)
- Stream ID (optional, 0 by default)

dim3 grid(16, 16);
dim3 block(16, 16);
kernel<<<grid, block, 0, 0>>>(...);
kernel<<<32, 512>>>(...);
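The two optional parameters in use, as a hedged sketch; the kernel k, launch_k, and the sizes are illustrative, not from the slides:

__global__ void k(void)
{
    extern __shared__ float buf[];   // size comes from the 3rd launch parameter
    buf[threadIdx.x] = 0.0f;
}

void launch_k(void)
{
    cudaStream_t s;
    cudaStreamCreate(&s);
    // 64 blocks of 128 threads, 128 floats of shared memory per block,
    // issued into stream s rather than the default stream 0.
    k<<<64, 128, 128 * sizeof(float), s>>>();
    cudaStreamDestroy(s);
}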

SLIDE 52

GPU KERNEL EXECUTION

- A kernel launches on a grid of blocks: <<<grid,block>>>(arg1, ...)
- Each block is launched on one SM
- A block is divided into warps of 32 threads each (think 32-way vector)
- Warps in a block are scheduled and executed as resources become available
- All threads in a warp execute the same instruction simultaneously (think SIMD)
- The number of blocks per SM is determined by the resources the block requires: registers, shared memory, total warps, etc.
- A block runs to completion on the SM it started on; no migration

SLIDE 53

WARPS (THE REST OF THE STORY…)

- A thread block consists of 32-thread warps
- A warp is executed physically in parallel (SIMD) on a multiprocessor

[Diagram: Thread Block = 32 Threads + 32 Threads + 32 Threads ... (warps), executed on a Multiprocessor]

SLIDE 54

EXECUTION MODEL

Software → Hardware:

Thread → Scalar Processor (CUDA core)
- Threads are executed by scalar processors

Thread Block → Multiprocessor
- Thread blocks are executed on multiprocessors
- Thread blocks do not migrate
- Several concurrent thread blocks can reside on one multiprocessor, limited by multiprocessor resources (shared memory and register file)

Grid → Device
- A kernel is launched as a grid of thread blocks

SLIDE 55

BLOCKS MUST BE INDEPENDENT

Any possible interleaving of blocks should be valid:
- Presumed to run to completion without pre-emption
- Can run in any order
- Can run concurrently OR sequentially

Blocks may coordinate but not synchronize:
- Shared queue pointer: OK (see the sketch below)
- Shared lock: BAD ... any dependence on order easily deadlocks

The independence requirement gives scalability.
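A hedged sketch of the "shared queue pointer: OK" case (not from the slides): blocks coordinate by atomically claiming chunks of work, but never wait on one another, so any block ordering remains valid. worker and next are illustrative names:

__device__ unsigned int next = 0;              // the shared queue pointer

__global__ void worker(float *data, unsigned int n_chunks)
{
    __shared__ unsigned int chunk;
    for (;;) {
        if (threadIdx.x == 0)
            chunk = atomicAdd(&next, 1);       // coordination, not synchronization
        __syncthreads();                       // broadcast chunk within the block
        if (chunk >= n_chunks) return;         // uniform exit: whole block leaves
        data[chunk * blockDim.x + threadIdx.x] *= 2.0f;   // process the chunk
        __syncthreads();                       // finish before claiming the next
    }
}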