

SLIDE 1

GPU COMPUTING WITH OPENACC

SLIDE 2

3 WAYS TO ACCELERATE APPLICATIONS

Applications can be accelerated in three ways:

  • Libraries: "drop-in" acceleration
  • OpenACC Directives: easily accelerate applications
  • Programming Languages: maximum flexibility

SLIDE 3

OPENACC DIRECTIVES

Program myscience
  ... serial code ...
!$acc kernels
  do k = 1,n1
    do i = 1,n2
      ... parallel code ...
    enddo
  enddo
!$acc end kernels
  ...
End Program myscience

The serial code runs on the CPU; the compiler offloads the kernels region to the GPU.

This is your original Fortran or C code; the "!$acc kernels" line is the OpenACC compiler hint.

  • Simple compiler hints
  • Compiler parallelizes code
  • Works on many-core GPUs & multicore CPUs

SLIDE 4

FAMILIAR TO OPENMP PROGRAMMERS

CPU (OpenMP):

main() {
    double pi = 0.0;
    long i;
    #pragma omp parallel for reduction(+:pi)
    for (i = 0; i < N; i++) {
        double t = (double)((i + 0.05) / N);
        pi += 4.0 / (1.0 + t * t);
    }
    printf("pi = %f\n", pi / N);
}

CPU + GPU (OpenACC):

main() {
    double pi = 0.0;
    long i;
    #pragma acc kernels
    for (i = 0; i < N; i++) {
        double t = (double)((i + 0.05) / N);
        pi += 4.0 / (1.0 + t * t);
    }
    printf("pi = %f\n", pi / N);
}

SLIDE 5

OPENACC MEMBERS AND SUPPORTERS

SLIDE 6

DIRECTIVES: EASY & POWERFUL

  • Real-Time Object Detection (Global Manufacturer of Navigation Systems): 5x in 40 hours
  • Valuation of Stock Portfolios using Monte Carlo (Global Technology Consulting Company): 2x in 4 hours
  • Interaction of Solvents and Biomolecules (University of Texas at San Antonio): 5x in 8 hours

"Optimizing code with directives is quite easy, especially compared to CPU threads or writing CUDA kernels. The most important thing is avoiding restructuring of existing code for production applications."
  - Developer at the Global Manufacturer of Navigation Systems

SLIDE 7

FOCUS ON EXPOSING PARALLELISM

With directives, tuning work focuses on exposing parallelism, which makes the code inherently better.

Example: application tuning work using directives for the new Titan system at ORNL.

S3D: research more efficient combustion with next-generation fuels.
CAM-SE: answer questions about specific climate change adaptation and mitigation scenarios.

  • Tuning top 3 kernels (90% of runtime)
  • 3x to 6x faster on CPU+GPU vs. CPU+CPU
  • But also improved the all-CPU version by 50%

  • Tuning top key kernel (50% of runtime)
  • 6.5x faster on CPU+GPU vs. CPU+CPU
  • Improved performance of the CPU version by 100%
SLIDE 8

A VERY SIMPLE EXERCISE: SAXPY

SAXPY in C:

void saxpy(int n, float a, float *x, float *restrict y)
{
    #pragma acc kernels
    for (int i = 0; i < n; ++i)
        y[i] = a*x[i] + y[i];
}

...
// Perform SAXPY on 1M elements
saxpy(1<<20, 2.0, x, y);
...

SAXPY in Fortran:

subroutine saxpy(n, a, x, y)
  real :: x(:), y(:), a
  integer :: n, i
!$acc kernels
  do i=1,n
    y(i) = a*x(i) + y(i)
  enddo
!$acc end kernels
end subroutine saxpy

...
! Perform SAXPY on 1M elements
call saxpy(2**20, 2.0, x_d, y_d)
...

SLIDE 9

DIRECTIVE SYNTAX

Fortran:
  !$acc directive [clause [,] clause] ...
  Often paired with a matching end directive surrounding a structured code block:
  !$acc end directive

C:
  #pragma acc directive [clause [,] clause] ...
  Often followed by a structured code block

SLIDE 10

KERNELS: YOUR FIRST OPENACC DIRECTIVE

Each loop is executed as a separate kernel on the GPU.

!$acc kernels
  do i=1,n            ! kernel 1
    a(i) = 0.0
    b(i) = 1.0
    c(i) = 2.0
  end do

  do i=1,n            ! kernel 2
    a(i) = b(i) + c(i)
  end do
!$acc end kernels

Kernel: a parallel function that runs on the GPU.

SLIDE 11

KERNELS CONSTRUCT

Fortran:
  !$acc kernels [clause ...]
    structured block
  !$acc end kernels

C:
  #pragma acc kernels [clause ...]
  {
    structured block
  }

Clauses:
  if( condition )
  async( expression )
  Also, any data clause (more later)
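As a concrete illustration of the if and async clauses, here is a minimal sketch (not from the deck; the size threshold, queue number, and function name are illustrative assumptions):

void scale(int n, float a, float *restrict x)
{
    /* Offload only when the array is large enough to be worth it,
       and run the kernels region asynchronously on queue 1. */
    #pragma acc kernels if(n > 100000) async(1)
    for (int i = 0; i < n; ++i)
        x[i] = a * x[i];

    /* ...independent host work could overlap here... */

    #pragma acc wait(1)   /* block until queue 1 has finished */
}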

SLIDE 12

COMPLETE SAXPY EXAMPLE CODE

Trivial first example

  • Apply a loop directive
  • Learn compiler commands

#include <stdlib.h>

void saxpy(int n, float a, float *x, float *restrict y)
{
    #pragma acc kernels
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}

int main(int argc, char **argv)
{
    int N = 1<<20; // 1 million floats
    if (argc > 1)
        N = atoi(argv[1]);

    float *x = (float*)malloc(N * sizeof(float));
    float *y = (float*)malloc(N * sizeof(float));

    for (int i = 0; i < N; ++i) {
        x[i] = 2.0f;
        y[i] = 1.0f;
    }

    saxpy(N, 3.0f, x, y);
    return 0;
}

*restrict: "I promise y does not alias x"

SLIDE 13

COMPILE AND RUN

C:       pgcc  -acc -ta=nvidia -Minfo=accel -o saxpy_acc saxpy.c
Fortran: pgf90 -acc -ta=nvidia -Minfo=accel -o saxpy_acc saxpy.f90

Compiler output:

pgcc -acc -Minfo=accel -ta=nvidia -o saxpy_acc saxpy.c
saxpy:
      8, Generating copyin(x[:n-1])
         Generating copy(y[:n-1])
         Generating compute capability 1.0 binary
         Generating compute capability 2.0 binary
      9, Loop is parallelizable
         Accelerator kernel generated
          9, #pragma acc loop worker, vector(256) /* blockIdx.x threadIdx.x */
             CC 1.0 : 4 registers; 52 shared, 4 constant, 0 local memory bytes; 100% occupancy
             CC 2.0 : 8 registers; 4 shared, 64 constant, 0 local memory bytes; 100% occupancy

SLIDE 14

EXAMPLE: JACOBI ITERATION

A common, useful algorithm: it iteratively converges to the correct value (e.g. temperature) by computing new values at each point from the average of neighboring points.

Example: solve the Laplace equation in 2D:

    ∇²f(x,y) = 0

Each point is updated from its four neighbors A(i+1,j), A(i-1,j), A(i,j-1), A(i,j+1):

    Anew(i,j) = ( A(i-1,j) + A(i+1,j) + A(i,j-1) + A(i,j+1) ) / 4

SLIDE 15

JACOBI ITERATION C CODE

while ( error > tol && iter < iter_max ) {          // Iterate until converged
    error = 0.0;

    for ( int j = 1; j < n-1; j++ ) {                // Iterate across matrix elements
        for ( int i = 1; i < m-1; i++ ) {
            // Calculate new value from neighbors
            Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1]
                                + A[j-1][i] + A[j+1][i] );
            // Compute max error for convergence
            error = max(error, abs(Anew[j][i] - A[j][i]));
        }
    }

    for ( int j = 1; j < n-1; j++ ) {                // Swap input/output arrays
        for ( int i = 1; i < m-1; i++ ) {
            A[j][i] = Anew[j][i];
        }
    }

    iter++;
}

SLIDE 16

JACOBI ITERATION FORTRAN CODE

do while ( err > tol .and. iter < iter_max )         ! Iterate until converged
  err = 0._fp_kind

  do j = 1, m                                         ! Iterate across matrix elements
    do i = 1, n
      ! Calculate new value from neighbors
      Anew(i,j) = .25_fp_kind * ( A(i+1, j  ) + A(i-1, j  ) + &
                                  A(i  , j-1) + A(i  , j+1) )
      ! Compute max error for convergence
      err = max(err, Anew(i,j) - A(i,j))
    end do
  end do

  do j = 1, m-2                                       ! Swap input/output arrays
    do i = 1, n-2
      A(i,j) = Anew(i,j)
    end do
  end do

  iter = iter + 1
end do

SLIDE 17

EXERCISES

General instructions (compiling):

  • Exercises are in the "exercises/openacc" directory
  • Solutions are in the "exercise_solutions/openacc" directory
  • module load pgi/14.6
  • To compile, use one of the provided makefiles:
      C:       > make
      Fortran: > make -f Makefile_f90
  • Remember these flags: -acc -ta=nvidia -Minfo=accel

SLIDE 18

EXERCISES

General instructions (running):

To run, use sbatch with one of the provided job files:

  > sbatch runit.acc
  > qstat -u <username>     # check job status

Output is placed in slurm.* when finished.

SLIDE 19

EXERCISE 1

Task: use acc kernels to parallelize the Jacobi loop nests ("Jacobi kernels").

  • Edit laplace2D.c or laplace2D.f90 (your choice) in the 001-laplace2D-kernels directory
  • Add directives where it helps
  • Figure out the proper compilation flags to use
  • Optionally: run the OpenACC version with laplace_acc

Q: Can you get a speedup with just kernels directives?
  • Versus 1 CPU core?
  • Versus 6 CPU cores?

SLIDE 20

EXERCISE 1 SOLUTION: OPENACC C

while ( error > tol && iter < iter_max ) {
    error = 0.0;

    #pragma acc kernels                              // Execute GPU kernel for loop nest
    for ( int j = 1; j < n-1; j++ ) {
        for ( int i = 1; i < m-1; i++ ) {
            Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1]
                                + A[j-1][i] + A[j+1][i] );
            error = max(error, abs(Anew[j][i] - A[j][i]));
        }
    }

    #pragma acc kernels                              // Execute GPU kernel for loop nest
    for ( int j = 1; j < n-1; j++ ) {
        for ( int i = 1; i < m-1; i++ ) {
            A[j][i] = Anew[j][i];
        }
    }

    iter++;
}

SLIDE 21

EXERCISE 1 SOLUTION: OPENACC FORTRAN

do while ( err > tol .and. iter < iter_max )
  err = 0._fp_kind

!$acc kernels                        ! Execute GPU kernel for loop nest
  do j = 1, m
    do i = 1, n
      Anew(i,j) = 0.25 * ( A(i+1,j) + A(i-1,j) + &
                           A(i,j-1) + A(i,j+1) )
      err = max(err, abs(Anew(i,j) - A(i,j)))
    enddo
  enddo
!$acc end kernels

!$acc kernels                        ! Execute GPU kernel for loop nest
  do j = 1, m-2
    do i = 1, n-2
      A(i,j) = Anew(i,j)
    enddo
  enddo
!$acc end kernels

  iter = iter + 1
enddo

SLIDE 22

EXERCISE 1: COMPILER OUTPUT (C)

pgcc -tp sandybridge-64 -acc -ta=nvidia -Minfo=accel -o laplace2d_acc laplace2d.c
main:
     56, Generating present_or_copyout(Anew[1:4094][1:4094])
         Generating present_or_copyin(A[:][:])
         Generating Tesla code
     57, Loop is parallelizable
     59, Loop is parallelizable
         Accelerator kernel generated
         57, #pragma acc loop gang /* blockIdx.y */
         59, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
         63, Max reduction generated for error
     68, Generating present_or_copyin(Anew[1:4094][1:4094])
         Generating present_or_copyout(A[1:4094][1:4094])
         Generating Tesla code
     69, Loop is parallelizable
     71, Loop is parallelizable
         Accelerator kernel generated
         69, #pragma acc loop gang /* blockIdx.y */
         71, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */

SLIDE 23

EXERCISE 1: PERFORMANCE

Execution (4096x4096)       Time (s)   Speedup
CPU, 1 OpenMP thread         108.7       --
CPU, 2 OpenMP threads         58.6       1.85x
CPU, 4 OpenMP threads         31.9       3.41x
CPU, 8 OpenMP threads         20.3       5.35x
OpenACC GPU                  176.1       0.12x   FAIL

CPU speedups are vs. 1 CPU core; the OpenACC speedup is vs. 8 CPU cores.

CPU: Intel E5-2680 v2, 10 cores @ 2.80 GHz
GPU: NVIDIA Tesla K20m

SLIDE 24

WHAT WENT WRONG?

Setting PGI_ACC_TIME=1 produces an accelerator profile at program exit:

time(us): 101,104,174
56: data region reached 1000 times
    56: data copyin reached 8000 times
        device time(us): total=22,030,081 max=2,858 min=2,746 avg=2,753
    68: data copyout reached 8000 times
        device time(us): total=23,018,701 max=6,552 min=2,855 avg=2,877
56: compute region reached 1000 times
    59: kernel launched 1000 times
        grid: [32x4094]  block: [128]
        device time(us): total=6,456,517 max=6,516 min=6,447 avg=6,456
        elapsed time(us): total=6,471,110 max=7,066 min=6,460 avg=6,471
    59: reduction kernel launched 1000 times
        grid: [1]  block: [256]
        device time(us): total=270,280 max=276 min=268 avg=270
        elapsed time(us): total=283,763 max=353 min=282 avg=283
68: data region reached 1000 times
    68: data copyin reached 8000 times
        device time(us): total=23,271,701 max=2,946 min=2,891 avg=2,908
    77: data copyout reached 8000 times
        device time(us): total=23,016,095 max=2,993 min=2,854 avg=2,877
68: compute region reached 1000 times
    71: kernel launched 1000 times
        grid: [32x4094]  block: [128]
        device time(us): total=3,040,799 max=3,050 min=3,037 avg=3,040
        elapsed time(us): total=3,056,315 max=3,131 min=3,052 avg=3,056

Huge data transfer bottleneck!

Computation: 10 seconds. Data movement: 90 seconds.

SLIDE 25

BASIC CONCEPTS

Diagram: the CPU and its memory are connected to the GPU and its memory over the PCI bus. Data is transferred across the bus; computation is offloaded to the GPU.

For efficiency, decouple data movement from compute offload.

SLIDE 26

EXCESSIVE DATA TRANSFERS

while ( error > tol && iter < iter_max ) {
    error = 0.0;

    // A, Anew resident on host
    // Copy: host -> accelerator
    #pragma acc kernels
    // A, Anew resident on accelerator
    for ( int j = 1; j < n-1; j++ ) {
        for ( int i = 1; i < m-1; i++ ) {
            Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1]
                                + A[j-1][i] + A[j+1][i] );
            error = max(error, abs(Anew[j][i] - A[j][i]));
        }
    }
    // A, Anew resident on accelerator
    // Copy: accelerator -> host
    // A, Anew resident on host

    ...
}

These copies happen every iteration of the outer while loop!*

*Note: there are two #pragma acc kernels regions, so there are 4 copies per while loop iteration!

SLIDE 27

Data Management

SLIDE 28

DATA CONSTRUCT

Fortran:
  !$acc data [clause ...]
    structured block
  !$acc end data

C:
  #pragma acc data [clause ...]
  {
    structured block
  }

General clauses:
  if( condition )
  async( expression )

Manage data movement. Data regions may be nested.
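A minimal sketch of a data region wrapping two kernels regions so the arrays stay resident on the GPU between them (not from the deck; the function name, array names, and clause choices are illustrative):

void two_steps(int n, float *restrict a, float *restrict b)
{
    /* a is copied to the GPU at entry; b is copied in at entry and
       back out at exit; both stay on the GPU for both kernels regions. */
    #pragma acc data copyin(a[0:n]) copy(b[0:n])
    {
        #pragma acc kernels
        for (int i = 0; i < n; ++i)
            b[i] = 2.0f * a[i];

        #pragma acc kernels
        for (int i = 0; i < n; ++i)
            b[i] = b[i] + a[i];
    }
}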

SLIDE 29

DATA CLAUSES

copy ( list )      Allocates memory on the GPU, copies data from host to GPU when
                   entering the region, and copies data back to the host when exiting.

copyin ( list )    Allocates memory on the GPU and copies data from host to GPU
                   when entering the region.

copyout ( list )   Allocates memory on the GPU and copies data to the host
                   when exiting the region.

create ( list )    Allocates memory on the GPU but does not copy.

present ( list )   Data is already present on the GPU from another containing
                   data region.

Also: present_or_copy[in|out], present_or_create, deviceptr.
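A short sketch combining two of these clauses (illustrative only; the names solve, A, and Anew are placeholders, not the exercise code):

void solve(int n, float *restrict A, float *restrict Anew)
{
    /* copy:   A is copied host -> GPU at entry and GPU -> host at exit.
       create: Anew is allocated on the GPU but never copied. */
    #pragma acc data copy(A[0:n*n]) create(Anew[0:n*n])
    {
        /* ...kernels regions that read and write A and Anew... */
    }
}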

SLIDE 30

ARRAY SHAPING

The compiler sometimes cannot determine the size of arrays, so you must specify it explicitly using data clauses and an array "shape".

C:
  #pragma acc data copyin(a[0:size-1]), copyout(b[s/4:3*s/4])

Fortran:
  !$acc data copyin(a(1:size)), copyout(b(s/4:3*s/4))

Note: data clauses can be used on data, kernels or parallel constructs.
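For reference, in the OpenACC specification a C array section is written [start:length] (this detail is from the spec, not the slides); a hedged sketch with illustrative names:

void copy_half(int n, float *restrict a, float *restrict b)
{
    /* a[0:n]     -> n elements of a starting at index 0
       b[n/2:n/2] -> n/2 elements of b starting at index n/2 (the second half) */
    #pragma acc data copyin(a[0:n]) copyout(b[n/2:n/2])
    {
        #pragma acc kernels
        for (int i = n/2; i < n; ++i)
            b[i] = a[i];
    }
}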

SLIDE 31

EXERCISE 2: JACOBI DATA DIRECTIVES

Task: use acc data to minimize transfers in the Jacobi example.

  • Start from the given laplace2D.c or laplace2D.f90 (your choice) in the 002-laplace2d-data directory
  • Add directives where it helps (hint: the [do] while loop)

Q: What speedup can you get with data + kernels directives?
  • Versus 6 CPU cores?  OMP_NUM_THREADS=6 ./laplace2d_omp

SLIDE 32

Exercise 2 Solution: OpenACC C

#pragma acc data copy(A), create(Anew)   // Copy A in at beginning of region, out at end;
                                         // allocate Anew on the accelerator
while ( error > tol && iter < iter_max ) {
    error = 0.0;

    #pragma acc kernels
    for ( int j = 1; j < n-1; j++ ) {
        for ( int i = 1; i < m-1; i++ ) {
            Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1]
                                + A[j-1][i] + A[j+1][i] );
            error = max(error, abs(Anew[j][i] - A[j][i]));
        }
    }

    #pragma acc kernels
    for ( int j = 1; j < n-1; j++ ) {
        for ( int i = 1; i < m-1; i++ ) {
            A[j][i] = Anew[j][i];
        }
    }

    iter++;
}

SLIDE 33

Exercise 2 Solution: OpenACC Fortran

!$acc data copy(A), create(Anew)      ! Copy A in at beginning of region, out at end;
                                      ! allocate Anew on the accelerator
do while ( err > tol .and. iter < iter_max )
  err = 0._fp_kind

!$acc kernels
  do j = 1, m
    do i = 1, n
      Anew(i,j) = .25_fp_kind * ( A(i+1, j  ) + A(i-1, j  ) + &
                                  A(i  , j-1) + A(i  , j+1) )
      err = max(err, Anew(i,j) - A(i,j))
    end do
  end do
!$acc end kernels
  ...
  iter = iter + 1
end do
!$acc end data

SLIDE 34

EXERCISE 2: PERFORMANCE

Execution (4096x4096)       Time (s)   Speedup
CPU, 8 OpenMP threads        20.3        --
OpenACC K20m                176.1        0.12x
OpenACC K20m-opt              8.34       2.43x

Speedup vs. 8 CPU cores.

SLIDE 35

WHAT WENT RIGHT?

time(us): 9,901,981
50: data region reached 1 time
    50: data copyin reached 8 times
        device time(us): total=22,039 max=2,763 min=2,751 avg=2,754
    82: data copyout reached 9 times
        device time(us): total=21,851 max=2,734 min=14 avg=2,427
56: compute region reached 1000 times
    59: kernel launched 1000 times
        grid: [32x4094]  block: [128]
        device time(us): total=6,437,470 max=6,491 min=6,429 avg=6,437
        elapsed time(us): total=6,452,030 max=7,162 min=6,442 avg=6,452
    59: reduction kernel launched 1000 times
        grid: [1]  block: [256]
        device time(us): total=269,570 max=324 min=268 avg=269
        elapsed time(us): total=283,752 max=1,057 min=281 avg=283
68: compute region reached 1000 times
    71: kernel launched 1000 times
        grid: [32x4094]  block: [128]
        device time(us): total=3,151,051 max=3,206 min=3,147 avg=3,151
        elapsed time(us): total=3,166,372 max=3,924 min=3,160 avg=3,166

Transfer bottleneck eliminated!

Computation: 10 seconds. Data movement: negligible.

SLIDE 36

FURTHER SPEEDUPS

OpenACC gives us more detailed control over parallelization via the gang, worker, and vector clauses.

  • By understanding more about the OpenACC execution model and GPU hardware organization, we can get higher speedups on this code
  • By understanding bottlenecks in the code via profiling, we can reorganize the code for higher performance
  • We will tackle these in later exercises
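As a taste of what those clauses look like, here is a hedged sketch of gang and vector clauses on the Jacobi loop nest (this is not one of the deck's exercise solutions; the specific gang counts and vector lengths are illustrative assumptions that need tuning for a given GPU):

void jacobi_sweep(int n, int m, float A[n][m], float Anew[n][m])
{
    /* Outer loop: 32 gangs, vector length 16.
       Inner loop: 16 gangs, vector length 32. */
    #pragma acc kernels loop gang(32) vector(16)
    for (int j = 1; j < n-1; j++) {
        #pragma acc loop gang(16) vector(32)
        for (int i = 1; i < m-1; i++) {
            Anew[j][i] = 0.25f * ( A[j][i+1] + A[j][i-1]
                                 + A[j-1][i] + A[j+1][i] );
        }
    }
}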

SLIDE 37

FINDING PARALLELISM IN YOUR CODE

  • (Nested) for loops are best for parallelization
  • Large loop counts are needed to offset GPU/memcpy overhead
  • Iterations of loops must be independent of each other
      To help the compiler: restrict keyword (C), independent clause (see the sketch after this list)
  • Compiler must be able to figure out sizes of data regions
      Can use directives to explicitly control sizes
  • Pointer arithmetic should be avoided if possible
      Use subscripted arrays rather than pointer-indexed arrays
  • Function calls within an accelerated region must be inlineable
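A hedged sketch of the two compiler hints mentioned above, restrict and the independent clause (the function and its arguments are illustrative, not exercise code):

void scaled_copy(int n, const float *restrict src, float *restrict dst)
{
    /* restrict promises src and dst do not alias;
       independent asserts the iterations can run in any order. */
    #pragma acc kernels loop independent
    for (int i = 0; i < n; ++i)
        dst[i] = 2.0f * src[i];
}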

SLIDE 38

TIPS AND TRICKS

  • (PGI) Use the time option to learn where time is being spent: -ta=nvidia,time
  • Eliminate pointer arithmetic
  • Inline function calls in directives regions (PGI): -Minline or -Minline=levels:<N>
  • Use contiguous memory for multi-dimensional arrays
  • Use data regions to avoid excessive memory transfers
  • Conditional compilation with the _OPENACC macro (see the sketch below)
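A hedged sketch of conditional compilation with the _OPENACC macro; acc_get_num_devices and acc_device_nvidia are from the OpenACC runtime API shipped with the PGI compilers, and the printed messages are illustrative:

#include <stdio.h>
#ifdef _OPENACC
#include <openacc.h>
#endif

int main(void)
{
#ifdef _OPENACC
    /* _OPENACC is predefined when the compiler is invoked with OpenACC enabled */
    printf("OpenACC enabled; NVIDIA devices found: %d\n",
           acc_get_num_devices(acc_device_nvidia));
#else
    printf("Compiled without OpenACC support\n");
#endif
    return 0;
}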