GPU WORKSHOP
University of Maryland
Intro to GPU Computing


SLIDE 1

University of Maryland

GPU WORKSHOP

SLIDE 2

AGENDA
1. Intro to GPU Computing
2. OpenACC with hands-on
3. CUDA C/C++ with hands-on

SLIDE 3

Parallel programming—Why do you care?

SLIDE 4

The world IS parallel

SLIDE 5

Accelerator Programming—Why do you care?

SLIDE 6

HPC'S BIGGEST CHALLENGE: POWER

Power of a 300-petaflop CPU-only supercomputer = power for the city of San Francisco.

SLIDE 7

UNPRECEDENTED VALUE TO SCIENTIFIC COMPUTING

AMBER Molecular Dynamics Simulation, DHFR NVE Benchmark:
- 1 Tesla K40 GPU: 102 ns/day
- 64 Sandy Bridge CPUs: 58 ns/day

SLIDE 8

3 WAYS TO ACCELERATE APPLICATIONS

Applications can be accelerated through:
- Libraries: "drop-in" acceleration
- OpenACC directives: easily accelerate applications
- Programming languages: maximum flexibility

SLIDE 9

GPU ACCELERATED LIBRARIES

"Drop-in" acceleration for your applications:
- Linear Algebra (FFT, BLAS, SPARSE, Matrix): NVIDIA cuFFT, cuBLAS, cuSPARSE, NVIDIA Math Lib
- Numerical & Math (RAND, Statistics): NVIDIA cuRAND
- Data Structures & AI (Sort, Scan, Zero Sum): GPU AI (Board Games), GPU AI (Path Finding)
- Visual Processing (Image & Video): NVIDIA NPP, NVIDIA Video Encode

A library-based SAXPY sketch follows.
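"Drop-in" here means a standard API call replaces a hand-written kernel. As a minimal sketch (not from the slides), the SAXPY of the later examples done through cuBLAS; it assumes the CUDA toolkit, uses managed memory (CUDA 6+) for brevity, and omits error checking:

// Sketch: SAXPY via cuBLAS instead of a hand-written kernel.
// Build (assumed): nvcc saxpy_cublas.cu -lcublas
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    int n = 1 << 20;
    float a = 2.0f;
    float *x, *y;

    cudaMallocManaged(&x, n * sizeof(float));   // visible to CPU and GPU
    cudaMallocManaged(&y, n * sizeof(float));
    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

    cublasHandle_t handle;
    cublasCreate(&handle);
    cublasSaxpy(handle, n, &a, x, 1, y, 1);     // y = a*x + y on the GPU
    cudaDeviceSynchronize();                    // wait before reading on host

    printf("y[0] = %f (expect 4.0)\n", y[0]);
    cublasDestroy(handle);
    cudaFree(x); cudaFree(y);
    return 0;
}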

SLIDE 10

3 WAYS TO ACCELERATE APPLICATIONS

Applications can be accelerated through:
- Libraries: "drop-in" acceleration
- OpenACC directives: easily accelerate applications
- Programming languages: maximum flexibility

SLIDE 11

OPENACC DIRECTIVES

Program myscience
  ... serial code ...             (runs on the CPU)
!$acc kernels                     (OpenACC compiler hint)
  do k = 1, n1
    do i = 1, n2
      ... parallel code ...       (runs on the GPU)
    enddo
  enddo
!$acc end kernels
  ...
End Program myscience

- Your original Fortran or C code
- Simple compiler hints
- Compiler parallelizes the code
- Works on many-core GPUs and multicore CPUs

SLIDE 12

FAMILIAR TO OPENMP PROGRAMMERS

OpenMP (CPU):

main()
{
  double pi = 0.0; long i;
  #pragma omp parallel for reduction(+:pi)
  for (i = 0; i < N; i++)
  {
    double t = (double)((i + 0.05) / N);
    pi += 4.0 / (1.0 + t*t);
  }
  printf("pi = %f\n", pi / N);
}

OpenACC (CPU or GPU):

main()
{
  double pi = 0.0; long i;
  #pragma acc kernels
  for (i = 0; i < N; i++)
  {
    double t = (double)((i + 0.05) / N);
    pi += 4.0 / (1.0 + t*t);
  }
  printf("pi = %f\n", pi / N);
}

SLIDE 13

DIRECTIVES: EASY & POWERFUL

- Real-Time Object Detection (Global Manufacturer of Navigation Systems): 5x in 40 hours
- Valuation of Stock Portfolios using Monte Carlo (Global Technology Consulting Company): 2x in 4 hours
- Interaction of Solvents and Biomolecules (University of Texas at San Antonio): 5x in 8 hours

"Optimizing code with directives is quite easy, especially compared to CPU threads or writing CUDA kernels. The most important thing is avoiding restructuring of existing code for production applications."
  - Developer at the Global Manufacturer of Navigation Systems

SLIDE 14

A VERY SIMPLE EXERCISE: SAXPY

SAXPY in Fortran:

subroutine saxpy(n, a, x, y)
  real :: x(:), y(:), a
  integer :: n, i
!$acc kernels
  do i = 1, n
    y(i) = a*x(i) + y(i)
  enddo
!$acc end kernels
end subroutine saxpy

...
! Perform SAXPY on 1M elements
call saxpy(2**20, 2.0, x_d, y_d)
...

SAXPY in C:

void saxpy(int n, float a, float *x, float *restrict y)
{
  #pragma acc kernels
  for (int i = 0; i < n; ++i)
    y[i] = a*x[i] + y[i];
}

...
// Perform SAXPY on 1M elements
saxpy(1<<20, 2.0, x, y);
...
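The slide stops at the call site. As a hedged sketch (not from the deck) of a complete host driver for the C version, assuming an OpenACC-capable compiler such as pgcc with the -acc flag:

// Hypothetical driver for the C SAXPY above.
#include <stdio.h>
#include <stdlib.h>

void saxpy(int n, float a, float *x, float *restrict y);  // defined above

int main(void)
{
    int n = 1 << 20;
    float *x = (float *)malloc(n * sizeof(float));
    float *y = (float *)malloc(n * sizeof(float));
    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

    saxpy(n, 2.0f, x, y);   // compiler manages CPU<->GPU data movement

    printf("y[0] = %f (expect 4.0)\n", y[0]);
    free(x); free(y);
    return 0;
}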

SLIDE 15

GPU Architecture

SLIDE 16

GPU ARCHITECTURE

Two main components:

Global memory
- Analogous to RAM in a CPU server
- Accessible by both GPU and CPU
- Currently up to 12 GB
- ECC on/off options for Quadro and Tesla products

Streaming Multiprocessors (SMs)
- Perform the actual computation
- Each SM has its own control units, registers, execution pipelines, and caches

SLIDE 17

GPU ARCHITECTURE

Streaming Multiprocessor (SM)
- Many CUDA cores per SM (architecture dependent)
- Special-function units: cos/sin/tan, etc.
- Shared memory + L1 cache
- Thousands of 32-bit registers

[SM diagram: instruction cache; two schedulers with dispatch units; register file; 32 CUDA cores; 16 load/store units; 4 special-function units; interconnect network; 64 KB configurable cache/shared memory; uniform cache]

SLIDE 18

GPU ARCHITECTURE

CUDA Core
- Floating-point and integer units
  - IEEE 754-2008 floating-point standard
  - Fused multiply-add (FMA) instruction for both single and double precision
- Logic unit
- Move, compare unit
- Branch unit

[CUDA core diagram: dispatch port, operand collector, FP unit, INT unit, result queue]

SLIDE 19

GPU MEMORY HIERARCHY REVIEW

[Diagram: each SM (SM-0, SM-1, ..., SM-N) has its own registers, L1 cache, and shared memory (SMEM); all SMs share a unified L2 cache, which fronts global memory]

SLIDE 20

GPU ARCHITECTURE

Memory system on each SM: extremely fast but small, i.e., tens of KB. The programmer chooses whether to use this on-chip storage as L1 cache or shared memory.

L1
- Hardware-managed; used for things like register spilling
- Should NOT be treated like a CPU cache

Shared memory (programmer MUST synchronize data accesses!)
- User-managed scratchpad
- For repeated access to the same data, or for multiple threads sharing the same data

SLIDE 21

GPU ARCHITECTURE

Memory system on each GPU board:

Unified L2 cache (hundreds of KB)
- Fast, coherent data sharing across all cores in the GPU

DRAM
- ECC supported for GDDR5 memory

All major internal memories are ECC protected
- Register file, L1 cache, L2 cache

SLIDE 22

CUDA Programming model

SLIDE 23

ANATOMY OF A CUDA C/C++ APPLICATION

- Serial code executes in a host (CPU) thread
- Parallel code executes in many device (GPU) threads across multiple processing elements

[Diagram: a CUDA C/C++ application alternates between serial code on the host (CPU) and parallel code on the device (GPU)]

SLIDE 24

CUDA C: C WITH A FEW KEYWORDS

Standard C code:

void saxpy_serial(int n, float a, float *x, float *y)
{
  for (int i = 0; i < n; ++i)
    y[i] = a*x[i] + y[i];
}

// Invoke serial SAXPY kernel
saxpy_serial(n, 2.0, x, y);

Parallel C code:

__global__ void saxpy_parallel(int n, float a, float *x, float *y)
{
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  if (i < n)
    y[i] = a*x[i] + y[i];
}

// Invoke parallel SAXPY kernel with 256 threads/block
int nblocks = (n + 255) / 256;
saxpy_parallel<<<nblocks, 256>>>(n, 2.0, x, y);

SLIDE 25

CUDA KERNELS

- Parallel portions of the application execute as kernels
- The entire GPU executes a kernel, using many threads
- CUDA threads are lightweight, fast-switching, and thousands execute simultaneously
- CPU (host) executes functions; GPU (device) executes kernels

SLIDE 26

CUDA KERNELS: PARALLEL THREADS

A kernel is a function executed on the GPU as an array of threads in parallel.
- All threads execute the same code, but can take different paths
- Each thread has an ID, used to select input/output data and make control decisions

float x = input[threadIdx.x];
float y = func(x);
output[threadIdx.x] = y;
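The fragment above in context, as a hedged sketch; func, apply_func, input, and output are illustrative names, not from the slides:

__device__ float func(float x) { return 2.0f * x; }   // any per-element work

__global__ void apply_func(const float *input, float *output)
{
    float x = input[threadIdx.x];    // each thread selects its own input
    float y = func(x);
    output[threadIdx.x] = y;         // and writes its own output
}

// Launch with, say, one block of 256 threads:
// apply_func<<<1, 256>>>(d_in, d_out);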
SLIDE 27

CUDA Kernels: Subdivide into Blocks

SLIDE 28

CUDA Kernels: Subdivide into Blocks

Threads are grouped into blocks

SLIDE 29

CUDA Kernels: Subdivide into Blocks

Threads are grouped into blocks Blocks are grouped into a grid

SLIDE 30

CUDA Kernels: Subdivide into Blocks

Threads are grouped into blocks Blocks are grouped into a grid A kernel is executed as a grid of blocks of threads

SLIDE 31

CUDA Kernels: Subdivide into Blocks

Threads are grouped into blocks Blocks are grouped into a grid A kernel is executed as a grid of blocks of threads

GPU

SLIDE 32

Kernel Execution

CUDA thread → CUDA core
- Each thread is executed by a core

CUDA thread block → CUDA Streaming Multiprocessor
- Each block is executed by one SM and does not migrate
- Several concurrent blocks can reside on one SM, depending on the blocks' memory requirements and the SM's memory resources

CUDA kernel grid → CUDA-enabled GPU
- Each kernel is executed on one device
- Multiple kernels can execute on a device at one time

SLIDE 33

Thread blocks allow cooperation

Threads may need to cooperate to:
- Cooperatively load/store blocks of memory that they all will use
- Share results with each other, or cooperate to produce a single result
- Synchronize with each other

A sketch of such cooperation follows.
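As a hedged illustration (not from the slides) of all three kinds of cooperation, a block-level sum: threads stage values in shared memory, synchronize, and combine them into a single result. It assumes 256 threads per block (a power of two); block_sum is an illustrative name:

__global__ void block_sum(const float *in, float *block_results)
{
    __shared__ float partial[256];             // one copy per block
    int tid = threadIdx.x;
    partial[tid] = in[blockIdx.x * blockDim.x + tid];
    __syncthreads();                           // all loads visible to all threads

    // Tree reduction: halve the active threads each round.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride)
            partial[tid] += partial[tid + stride];
        __syncthreads();                       // wait before the next round
    }
    if (tid == 0)
        block_results[blockIdx.x] = partial[0];   // one result per block
}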

SLIDE 34

THREAD BLOCKS ALLOW SCALABILITY

Blocks can execute in any order, concurrently or sequentially. This independence between blocks gives scalability: a kernel scales across any number of SMs.

[Diagram: the same kernel grid of Blocks 0-7 runs two blocks at a time on a device with 2 SMs, and four at a time on a device with 4 SMs]

SLIDE 35

Memory System Hierarchy

SLIDE 36

MEMORY HIERARCHY

Thread:
- Registers

SLIDE 37

MEMORY HIERARCHY

Thread:

- Registers
- Local memory

[Diagram: each thread has its own local memory]

SLIDE 38

MEMORY HIERARCHY

Thread:
- Registers
- Local memory

Block of threads:
- Shared memory

SLIDE 39

MEMORY HIERARCHY: SHARED MEMORY

__shared__ int a[SIZE];

- Allocated per thread block, same lifetime as the block
- Accessible by any thread in the block
- Several uses: sharing data among threads in a block; user-managed cache (reducing gmem accesses). See the sketch below.
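A hedged sketch of the user-managed-cache use: each block stages a tile of global memory once, then every thread reuses neighbors from on-chip storage instead of re-reading gmem. TILE and smooth are illustrative names; the launch is assumed to use TILE threads per block:

#define TILE 256

__global__ void smooth(const float *in, float *out)
{
    __shared__ float a[TILE];        // per-block, lives as long as the block
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    a[threadIdx.x] = in[i];          // one global-memory read per element
    __syncthreads();                 // synchronize before reading neighbors

    if (threadIdx.x > 0 && threadIdx.x < TILE - 1)   // interior threads only
        out[i] = (a[threadIdx.x-1] + a[threadIdx.x] + a[threadIdx.x+1]) / 3.0f;
}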

SLIDE 40

MEMORY HIERARCHY

Thread:
- Registers
- Local memory

Block of threads:
- Shared memory

All blocks:
- Global memory

SLIDE 41

MEMORY HIERARCHY: GLOBAL MEMORY

- Accessible by all threads of any kernel
- Data lifetime: from allocation to deallocation by host code
  - cudaMalloc(void **pointer, size_t nbytes)
  - cudaMemset(void *pointer, int value, size_t count)
  - cudaFree(void *pointer)

SLIDE 42

CUDA memory management

SLIDE 43

MEMORY SPACES

CPU and GPU have separate memory spaces:
- Data is moved across the PCIe bus
- Use functions to allocate/set/copy memory on the GPU, just like standard C

Pointers are just addresses:
- Can't tell from the pointer value whether the address is on the CPU or GPU; must use cudaPointerGetAttributes(...) (see the sketch below)
- Must exercise care when dereferencing: dereferencing a CPU pointer on the GPU will likely crash, and dereferencing a GPU pointer on the CPU will likely crash
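A hedged sketch of the pointer check; note that the attributes field name changed across CUDA versions (.memoryType in older toolkits, .type from CUDA 10), and older toolkits return an error for ordinary host pointers:

#include <cuda_runtime.h>
#include <stdio.h>

void where_is(const void *p)
{
    struct cudaPointerAttributes attr;
    if (cudaPointerGetAttributes(&attr, p) == cudaSuccess)
        printf("%s pointer\n",
               attr.memoryType == cudaMemoryTypeDevice ? "device" : "host");
    else
        printf("plain host pointer (unknown to CUDA)\n");   // older toolkits
}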

SLIDE 44

GPU MEMORY ALLOCATION / RELEASE

Host (CPU) manages device (GPU) memory:
- cudaMalloc(void **pointer, size_t nbytes)
- cudaMemset(void *pointer, int value, size_t count)
- cudaFree(void *pointer)

int n = 1024;
int nbytes = 1024*sizeof(int);
int *d_a = 0;
cudaMalloc((void**)&d_a, nbytes);
cudaMemset(d_a, 0, nbytes);
cudaFree(d_a);

Note: device memory from the GPU's point of view is also referred to as global memory.

SLIDE 45

DATA COPIES

cudaMemcpy(void *dst, void *src, size_t nbytes, enum cudaMemcpyKind direction);
- Returns after the copy is complete
- Blocks the CPU thread until all bytes have been copied
- Doesn't start copying until previous CUDA calls complete

enum cudaMemcpyKind:
- cudaMemcpyHostToDevice
- cudaMemcpyDeviceToHost
- cudaMemcpyDeviceToDevice

Non-blocking memcopies are also provided. A round-trip sketch follows.
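Putting allocation, copies, and a launch together, as a minimal round-trip sketch (scale is an illustrative kernel; a single block is used for brevity):

#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

__global__ void scale(float *d, float s) { d[threadIdx.x] *= s; }

int main(void)
{
    int n = 256, nbytes = n * sizeof(float);
    float *h_a = (float *)malloc(nbytes);
    float *d_a = 0;
    for (int i = 0; i < n; ++i) h_a[i] = (float)i;

    cudaMalloc((void **)&d_a, nbytes);
    cudaMemcpy(d_a, h_a, nbytes, cudaMemcpyHostToDevice);  // blocks until done
    scale<<<1, n>>>(d_a, 2.0f);                            // async launch
    cudaMemcpy(h_a, d_a, nbytes, cudaMemcpyDeviceToHost);  // waits for kernel

    printf("h_a[10] = %f (expect 20.0)\n", h_a[10]);
    cudaFree(d_a); free(h_a);
    return 0;
}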

SLIDE 46

Basic kernels and execution

SLIDE 47

CUDA PROGRAMMING MODEL REVISITED

- Parallel code (kernel) is launched and executed on a device by many threads
- Threads are grouped into thread blocks
- Parallel code is written for a thread; each thread is free to execute a unique code path
- Built-in thread and block ID variables

SLIDE 48

THREAD HIERARCHY

- Threads launched for a parallel section are partitioned into thread blocks
- Grid = all blocks for a given launch
- A thread block is a group of threads that can synchronize their execution and communicate via shared memory

SLIDE 49

IDS AND DIMENSIONS

Device Grid 1 Block (0, 0) Block (1, 0) Block (2, 0) Block (0, 1) Block (1, 1) Block (2, 1)

(Continued)

Threads 3D IDs, unique within a block Blocks 2D IDs, unique within a grid Dimensions set at launch time Can be unique for each grid Built-in variables threadIdx, blockIdx blockDim, gridDim

SLIDE 50

IDS AND DIMENSIONS

Device Grid 1 Block (0, 0) Block (1, 0) Block (2, 0) Block (0, 1) Block (1, 1) Block (2, 1) Block (1, 1)

Thread (0, 1) Thread (1, 1) Thread (2, 1) Thread (3, 1) Thread (4, 1) Thread (0, 2) Thread (1, 2) Thread (2, 2) Thread (3, 2) Thread (4, 2) Thread (0, 0) Thread (1, 0) Thread (2, 0) Thread (3, 0) Thread (4, 0)

Threads 3D IDs, unique within a block Blocks 2D IDs, unique within a grid Dimensions set at launch time Can be unique for each grid Built-in variables threadIdx, blockIdx blockDim, gridDim
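A hedged sketch (index2d is an illustrative name) of how the built-in variables combine into a unique global index, using the 3x2 grid of 5x3 blocks from the figure:

__global__ void index2d(int *out, int width)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;   // global x
    int row = blockIdx.y * blockDim.y + threadIdx.y;   // global y
    out[row * width + col] = row * width + col;        // flatten to 1D
}

// A launch matching the figure above: a 3x2 grid of 5x3 blocks.
// dim3 grid(3, 2), block(5, 3);
// index2d<<<grid, block>>>(d_out, 3 * 5);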

SLIDE 51

LAUNCHING KERNELS ON GPU

Launch parameters (triple-chevron <<<...>>> notation):
- Grid dimensions (up to 2D), dim3 type
- Thread-block dimensions (up to 3D), dim3 type
- Shared memory: number of bytes per block, for extern smem variables declared without size (optional, 0 by default)
- Stream ID (optional, 0 by default)

dim3 grid(16, 16);
dim3 block(16, 16);
kernel<<<grid, block, 0, 0>>>(...);
kernel<<<32, 512>>>(...);
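The two optional parameters in use, as a hedged sketch; the kernel k, launch_k, and the sizes are illustrative, not from the slides:

__global__ void k(void)
{
    extern __shared__ float buf[];   // size comes from the 3rd launch parameter
    buf[threadIdx.x] = 0.0f;
}

void launch_k(void)
{
    cudaStream_t s;
    cudaStreamCreate(&s);
    // 64 blocks of 128 threads, 128 floats of shared memory per block,
    // issued into stream s rather than the default stream 0.
    k<<<64, 128, 128 * sizeof(float), s>>>();
    cudaStreamDestroy(s);
}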

SLIDE 52

GPU KERNEL EXECUTION

- A kernel launches on a grid of blocks: <<<grid,block>>>(arg1, ...)
- Each block is launched on one SM
- A block is divided into warps of 32 threads each (think 32-way vector)
- Warps in a block are scheduled and executed as resources become available
- All threads in a warp execute the same instruction simultaneously (think SIMD)
- The number of blocks per SM is determined by the resources the block requires: registers, shared memory, total warps, etc.
- A block runs to completion on the SM it started on; no migration

SLIDE 53

WARPS (THE REST OF THE STORY…)

- A thread block consists of 32-thread warps
- A warp is executed physically in parallel (SIMD) on a multiprocessor

[Diagram: Thread Block = 32 Threads + 32 Threads + 32 Threads ... (warps), executed on a Multiprocessor]

SLIDE 54

EXECUTION MODEL

Software → Hardware:

Thread → Scalar Processor (CUDA core)
- Threads are executed by scalar processors

Thread Block → Multiprocessor
- Thread blocks are executed on multiprocessors
- Thread blocks do not migrate
- Several concurrent thread blocks can reside on one multiprocessor, limited by multiprocessor resources (shared memory and register file)

Grid → Device
- A kernel is launched as a grid of thread blocks

SLIDE 55

BLOCKS MUST BE INDEPENDENT

Any possible interleaving of blocks should be valid:
- Presumed to run to completion without pre-emption
- Can run in any order
- Can run concurrently OR sequentially

Blocks may coordinate but not synchronize:
- Shared queue pointer: OK (see the sketch below)
- Shared lock: BAD ... any dependence on order easily deadlocks

The independence requirement gives scalability.
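A hedged sketch of the "shared queue pointer: OK" case (not from the slides): blocks coordinate by atomically claiming chunks of work, but never wait on one another, so any block ordering remains valid. worker and next are illustrative names:

__device__ unsigned int next = 0;              // the shared queue pointer

__global__ void worker(float *data, unsigned int n_chunks)
{
    __shared__ unsigned int chunk;
    for (;;) {
        if (threadIdx.x == 0)
            chunk = atomicAdd(&next, 1);       // coordination, not synchronization
        __syncthreads();                       // broadcast chunk within the block
        if (chunk >= n_chunks) return;         // uniform exit: whole block leaves
        data[chunk * blockDim.x + threadIdx.x] *= 2.0f;   // process the chunk
        __syncthreads();                       // finish before claiming the next
    }
}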