JEFF LARKIN, NVIDIA DEVELOPER TECHNOLOGIES
AGENDA
- OpenACC Review
- Optimizing OpenACC Loops
- Routines
- Update Directive
- Asynchronous Programming
- Multi-GPU Programming
- OpenACC Interoperability
- Atomic Directive
- Misc. Advice & Techniques
- Next Steps
OPENACC REVIEW
WHAT ARE COMPILER DIRECTIVES?
Programmer inserts compiler hints:

int main() {
  do_serial_stuff();
  for(int i=0; i < BIGN; i++) {
    …compute intensive work
  }
  do_more_serial_stuff();
}

Compiler generates GPU code:

int main() {
  do_serial_stuff();            // Execution begins on the CPU.
  #pragma acc parallel loop     // Data and execution move to the GPU.
  for(int i=0; i < BIGN; i++) {
    …compute intensive work
  }
  do_more_serial_stuff();       // Data and execution return to the CPU.
}
OPENACC: THE STANDARD FOR GPU DIRECTIVES
Simple: Directives are the easy path to accelerate compute intensive applications Open: OpenACC is an open GPU directives standard, making GPU programming straightforward and portable across parallel and multi-core processors Portable: GPU Directives represent parallelism at a high level, allowing portability to a wide range of architectures with the same code.
1. Identify Available Parallelism
2. Parallelize Loops with OpenACC
3. Optimize Data Locality
4. Optimize Loop Performance
JACOBI ITERATION: C CODE
while ( err > tol && iter < iter_max ) {
  err=0.0;
  for( int j = 1; j < n-1; j++) {
    for(int i = 1; i < m-1; i++) {
      Anew[j][i] = 0.25 * (A[j][i+1] + A[j][i-1] +
                           A[j-1][i] + A[j+1][i]);
      err = max(err, abs(Anew[j][i] - A[j][i]));
    }
  }
  for( int j = 1; j < n-1; j++) {
    for( int i = 1; i < m-1; i++ ) {
      A[j][i] = Anew[j][i];
    }
  }
  iter++;
}
- Iterate until converged
- Iterate across matrix elements
- Calculate new value from neighbors
- Compute max error for convergence
- Swap input/output arrays
JACOBI: FINAL CODE
#pragma acc data copy(A) create(Anew)           // Optimized data locality
while ( err > tol && iter < iter_max ) {
  err=0.0;
  #pragma acc parallel loop reduction(max:err)  // Parallelized loop
  for( int j = 1; j < n-1; j++) {
    for(int i = 1; i < m-1; i++) {
      Anew[j][i] = 0.25 * (A[j][i+1] + A[j][i-1] +
                           A[j-1][i] + A[j+1][i]);
      err = max(err, abs(Anew[j][i] - A[j][i]));
    }
  }
  #pragma acc parallel loop                     // Parallelized loop
  for( int j = 1; j < n-1; j++) {
    for( int i = 1; i < m-1; i++ ) {
      A[j][i] = Anew[j][i];
    }
  }
  iter++;
}
Speed-Up (Higher is Better):
  Single Thread      1.00X
  8 Threads          4.38X
  OpenACC (Step 1)   0.82X
  OpenACC (Step 2)  27.30X
Socket/Socket: 6.24X
Intel Xeon E5-2698 v3 @ 2.30GHz (Haswell) vs. NVIDIA Tesla K40
1. Identify Available Parallelism
2. Parallelize Loops with OpenACC
3. Optimize Data Locality
4. Optimize Loop Performance
SPARSE MATRIX/VECTOR PRODUCT
Performs a matrix/vector product of a sparse matrix:
- Matrices are stored in a row-compressed (CSR) format.
- Parallelism per row will vary, but is generally not very large.
 99 do i=1,a%num_rows
100   tmpsum = 0.0d0
101   row_start = arow_offsets(i)
102   row_end = arow_offsets(i+1)-1
103   do j=row_start,row_end
104     acol = acols(j)
105     acoef = acoefs(j)
106     xcoef = x(acol)
107     tmpsum = tmpsum + acoef*xcoef
108   enddo
109   y(i) = tmpsum
110 enddo
PARALLELIZED SPMV
- Data is already on the device.
- The compiler has vectorized the loop at line 113 and selected a vector length of 256.
- Total application speed-up (including other accelerated routines): 1.08X
106 !$acc parallel loop present(arow_offsets,acols,acoefs) &
107 !$acc& private(row_start,row_end,acol,acoef,xcoef) &
108 !$acc& reduction(+:tmpsum)
109 do i=1,a%num_rows
110   tmpsum = 0.0d0
111   row_start = arow_offsets(i)
112   row_end = arow_offsets(i+1)-1
113   do j=row_start,row_end
114     acol = acols(j)
115     acoef = acoefs(j)
116     xcoef = x(acol)
117     tmpsum = tmpsum + acoef*xcoef
118   enddo
119   y(i) = tmpsum
120 enddo
OPENACC: 3 LEVELS OF PARALLELISM

- Vector threads work in lockstep (SIMD/SIMT parallelism).
- Workers compute a vector.
- Gangs have 1 or more workers and share resources (such as cache, the streaming multiprocessor, etc.).
- Multiple gangs work independently of each other.

[Diagram: two gangs, each composed of workers executing vectors]
OPENACC GANG, WORKER, VECTOR CLAUSES
- gang, worker, and vector can be added to a loop clause.
- A parallel region can only specify one each of gang, worker, and vector.
- Control the sizes using the following clauses on the parallel region:
  num_gangs(n), num_workers(n), vector_length(n)
#pragma acc kernels loop gang
for (int i = 0; i < n; ++i)
  #pragma acc loop vector(128)
  for (int j = 0; j < n; ++j)
    ...

#pragma acc parallel vector_length(128)
#pragma acc loop gang
for (int i = 0; i < n; ++i)
  #pragma acc loop vector
  for (int j = 0; j < n; ++j)
    ...
OPTIMIZED SPMV VECTOR LENGTH
[Chart: Speed-up vs. OpenACC vector length for SPMV, for vector lengths 1024, 512, 256, 128, 64, and 32; y-axis 0.00X to 3.50X]
106 !$acc parallel loop present(arow_offsets,acols,acoefs) &
107 !$acc& private(row_start,row_end,acol,acoef,xcoef) &
108 !$acc& vector_length(32)
109 do i=1,a%num_rows
110   tmpsum = 0.0d0
111   row_start = arow_offsets(i)
112   row_end = arow_offsets(i+1)-1
113   !$acc loop vector reduction(+:tmpsum)
114   do j=row_start,row_end
115     acol = acols(j)
116     acoef = acoefs(j)
117     xcoef = x(acol)
118     tmpsum = tmpsum + acoef*xcoef
119   enddo
120   y(i) = tmpsum
121 enddo
PERFORMANCE LIMITER: OCCUPANCY
We need more threads!
INCREASED PARALLELISM WITH WORKERS
[Chart: Speed-up vs. number of workers (2, 4, 8, 16, 32); y-axis 0.00X to 2.00X]
106 !$acc parallel loop present(arow_offsets,acols,acoefs) &
107 !$acc& private(row_start,row_end,acol,acoef,xcoef) &
108 !$acc& gang worker vector_length(32) num_workers(32)
109 do i=1,a%num_rows
110   tmpsum = 0.0d0
111   row_start = arow_offsets(i)
112   row_end = arow_offsets(i+1)-1
113   !$acc loop vector reduction(+:tmpsum)
114   do j=row_start,row_end
115     acol = acols(j)
116     acoef = acoefs(j)
117     xcoef = x(acol)
118     tmpsum = tmpsum + acoef*xcoef
119   enddo
120   y(i) = tmpsum
121 enddo

6X speed-up over the original.
PERFORMANCE LIMITER: COMPUTE
Now we’re compute bound
PERFORMANCE LIMITER: PARALLELISM
Really, we’re limited by parallelism per-row.
SPEED-UP STEP BY STEP
[Chart: Speed-up for each of 6 optimization steps; y-axis 0.00X to 7.00X. Legend: Identify Parallelism, Parallelize, Optimize Loops, Optimize Data Locality]
OPENACC COLLAPSE CLAUSE
collapse(n): Transform the following n tightly nested loops into one, flattened loop.

- Useful when individual loops lack sufficient parallelism, or when more than 3 loops are nested (gang/worker/vector).
- Loops must be tightly nested.

#pragma acc parallel
#pragma acc loop collapse(2)
for(int i=0; i<N; i++)
  for(int j=0; j<N; j++)
    ...

is equivalent to:

#pragma acc parallel
#pragma acc loop
for(int ij=0; ij<N*N; ij++)
  ...
NEW CASE STUDY: MANDELBROT SET
The application generates the image to the right. Each pixel in the image can be independently calculated.

Skills Used:
- Parallel Loop
- Data Region
- Update Directive
- Asynchronous Pipelining
MANDELBROT CODE
// Calculate value for a pixel
unsigned char mandelbrot(int Px, int Py) {
  double x0=xmin+Px*dx;
  double y0=ymin+Py*dy;
  double x=0.0;
  double y=0.0;
  int i;
  for(i=0;x*x+y*y<4.0 && i<MAX_ITERS;i++) {
    double xtemp=x*x-y*y+x0;
    y=2*x*y+y0;
    x=xtemp;
  }
  return (double)MAX_COLOR*i/MAX_ITERS;
}

// Used in main()
for(int y=0;y<HEIGHT;y++) {
  for(int x=0;x<WIDTH;x++) {
    image[y*WIDTH+x]=mandelbrot(x,y);
  }
}
The mandelbrot() function calculates the color for each pixel. Within main() there is a doubly-nested loop that calculates each pixel independently.
ROUTINES
OPENACC ROUTINE DIRECTIVE
Specifies that the compiler should generate a device copy of the function/subroutine, and what type of parallelism the routine contains.

Clauses:
- gang/worker/vector/seq: Specifies the level of parallelism contained in the routine.
- bind: Specifies an optional name for the routine, also supplied at the call site.
- no_host: The routine will only be used on the device.
- device_type: Specialize this routine for a particular device type.
MANDELBROT: ROUTINE DIRECTIVE
At the function source, the compiler needs to know that the function must be built for the GPU and that it will be called by each thread (sequentially). At the call site, the compiler needs to know that the function will be available on the GPU and that it is a sequential routine.

// mandelbrot.h
#pragma acc routine seq
unsigned char mandelbrot(int Px, int Py);

// Used in main()
#pragma acc parallel loop
for(int y=0;y<HEIGHT;y++) {
  for(int x=0;x<WIDTH;x++) {
    image[y*WIDTH+x]=mandelbrot(x,y);
  }
}
OPENACC ROUTINE: FORTRAN
- The routine directive may appear in a Fortran function or subroutine definition, or in an interface block.
- The save attribute is not supported.
- Nested acc routines require the routine directive within each nested routine.
module mandelbrot_mod
  implicit none
  integer, parameter :: HEIGHT = 16384
  integer, parameter :: WIDTH  = 16384
  integer, parameter :: MAXCOLORS = 255
contains
  real(8) function mandlebrot(px,py)
    implicit none
    !$acc routine(mandlebrot) seq
    ...
  end function mandlebrot
end module mandelbrot_mod
BASELINE PROFILE
Roughly 25% of our time is spent copying, none of it is overlapped. We’re still much faster than the CPU because there’s a lot of work.
PIPELINING DATA TRANSFERS
Two independent operations, serialized:
  H2D | kernel | D2H | H2D | kernel | D2H

Overlapping copying and computation:
  H2D | kernel | D2H
        H2D | kernel | D2H

NOTE: In real applications, your boxes will not be so evenly sized.
PIPELINING MANDELBROT SET
We only have 1 kernel, so there's nothing to overlap. Since each pixel is independent, the computation can be broken up.

Steps:
1. Break up the computation into blocks along rows.
2. Break up the copies according to blocks.
3. Make both computation and copies asynchronous.
STEP 1: BLOCKING COMPUTATION
- Add a loop over blocks.
- Modify the existing row loop to only work within blocks.
- Add a data region around the blocking loop to leave data local to the device.
- Check for correct results.

NOTE: We don't need to copy in the array, so make it an explicit copyout.
24 numblocks = ( argc > 1 ) ? atoi(argv[1]) : 8;
25 blocksize = HEIGHT / numblocks;
26 printf("numblocks: %d, blocksize: %d\n", numblocks, blocksize);
27
28 #pragma acc data copyout(image[:bytes])
29 for(int block=0; block < numblocks; block++)
30 {
31   int ystart = block * blocksize;
32   int yend = ystart + blocksize;
33   #pragma acc parallel loop
34   for(int y=ystart;y<yend;y++) {
35     for(int x=0;x<WIDTH;x++) {
36       image[y*WIDTH+x]=mandelbrot(x,y);
37     }
38   }
39 }
BLOCKING TIMELINE
Now we have 8 kernel launches and no longer copy data to the device, but the execution time has remained the same.
UPDATE DIRECTIVE
OPENACC DATA REGIONS REVIEW
34
28 #pragma acc data copyout(image[:bytes])
29 for(int block=0; block < numblocks; block++)
30 {
31   int ystart = block * blocksize;
32   int yend = ystart + blocksize;
33   #pragma acc parallel loop
34   for(int y=ystart;y<yend;y++) {
35     for(int x=0;x<WIDTH;x++) {
36       image[y*WIDTH+x]=mandelbrot(x,y);
37     }
38   }
39 }
Data is shared within this region.
OPENACC UPDATE DIRECTIVE
Programmer specifies an array (or part of an array) that should be refreshed within a data region.
do_something_on_device()
!$acc update self(a)      ! Copy "a" from GPU to CPU
do_something_on_host()
!$acc update device(a)    ! Copy "a" from CPU to GPU
STEP 2: COPY BY BLOCK
- Change the data region to only create the array on the GPU.
- Use an update directive to copy individual blocks back to the host when complete.
- Check for correct results.
28 #pragma acc data create(image[:bytes])
29 for(int block=0; block < numblocks; block++)
30 {
31   int ystart = block * blocksize;
32   int yend = ystart + blocksize;
33   #pragma acc parallel loop
34   for(int y=ystart;y<yend;y++) {
35     for(int x=0;x<WIDTH;x++) {
36       image[y*WIDTH+x]=mandelbrot(x,y);
37     }
38   }
39   #pragma acc update self(image[ystart*WIDTH:WIDTH*blocksize])
40 }
TIMELINE: UPDATING BY BLOCKS
We're now updating between blocks, but not overlapping.
ASYNCHRONOUS PROGRAMMING
OPENACC ASYNC AND WAIT
- async(n): launches work asynchronously in queue n
- wait(n): blocks the host until all operations in queue n have completed
- Can significantly reduce launch latency, and enables pipelining and concurrent operations.
- If n is not specified, async work goes into a default queue and wait waits on all previously queued work.

#pragma acc parallel loop async(1)
for(int i=0; i<N; i++)
  ...
#pragma acc parallel loop async(1)
for(int i=0; i<N; i++)
  ...
#pragma acc wait(1)
STEP 3: GO ASYNCHRONOUS
1. Make each parallel region asynchronous by placing them in different queues.
2. Make each update asynchronous by placing it in the same queue as the parallel region on which it depends.
3. Synchronize for all to complete.
4. Check for correct results.
31 #pragma acc data create(image[:bytes])
32 for(int block=0; block < numblocks; block++)
33 {
34   int ystart = block * blocksize;
35   int yend = ystart + blocksize;
36   #pragma acc parallel loop async(block)
37   for(int y=ystart;y<yend;y++) {
38     for(int x=0;x<WIDTH;x++) {
39       image[y*WIDTH+x]=mandelbrot(x,y);
40     }
41   }
42   #pragma acc update self(image[ystart*WIDTH:WIDTH*blocksize]) async(block)
43 }
44 #pragma acc wait
TIMELINE: PIPELINING
Notice the kernel launches seem to take differing amounts of time. What if we tried smaller blocks?
VARYING THE NUMBER OF BLOCKS
[Chart: Time (s) vs. number of blocks (2, 4, 8, 16, 32, 64, 128, 256, 512, 1024); y-axis 0.1 to 0.8 s]
SPEED-UP STEP BY STEP
[Chart: Speed-up at each step (y-axis 0.00X to 3.00X): 1. Parallelized, 2. Blocked, 3. Update Added, 4. Asynchronous]

Because of the inherent load imbalance, CPU threads do really poorly here.
ASYNCHRONOUS TIPS
- Reuse streams; they're expensive to create:
  - Pre-create them.
  - Consider async(block%2) to reuse just 2 streams.
- Don't forget to wait.
- Test with 1 stream first.
MULTI-GPU PROGRAMMING
MULTI-GPU OPENACC
- acc_set_device_num(number, device_type): Selects the device to use for all regions that follow.
- acc_get_num_devices(device_type): Queries how many devices are available of a given type.

Most often, one will set a device number once per CPU thread.
MULTI-GPU MANDELBROT
// Allocate space on each device
for (int gpu=0; gpu < 2; gpu++) {
  acc_set_device_num(gpu,acc_device_nvidia);
  #pragma acc enter data create(image[:bytes])
}
// Alternate devices per block
for(int block=0; block < numblocks; block++) {
  int ystart = block * blocksize;
  int yend = ystart + blocksize;
  acc_set_device_num(block%2,acc_device_nvidia);
  #pragma acc parallel loop async(block)
  for(int y=ystart;y<yend;y++) {
    for(int x=0;x<WIDTH;x++) {
      image[y*WIDTH+x]=mandelbrot(x,y);
    }
  }
  #pragma acc update self(image[ystart*WIDTH:WIDTH*blocksize]) async(block)
}
// Clean up the devices
for (int gpu=0; gpu < 2; gpu++) {
  acc_set_device_num(gpu,acc_device_nvidia);
  #pragma acc wait
  #pragma acc exit data delete(image)
}
MULTI-GPU MANDELBROT PROFILE
OPENACC INTEROPERABILITY
OPENACC INTEROPERABILITY
OpenACC plays well with others. You can:
- Add CUDA or accelerated libraries to an OpenACC application
- Add OpenACC to an existing accelerated application
- Share data between OpenACC and CUDA
OPENACC & CUDA STREAMS
OpenACC provides two functions for interoperating with CUDA streams:

void* acc_get_cuda_stream( int async );
int   acc_set_cuda_stream( int async, void* stream );
OPENACC HOST_DATA DIRECTIVE
Exposes the device address of particular objects to the host code.

#pragma acc data copy(x,y)
{
  // x and y are host pointers
  #pragma acc host_data use_device(x,y)
  {
    // x and y are device pointers here
  }
  // x and y are host pointers
}
program main
  integer, parameter :: N = 2**20
  real, dimension(N) :: X, Y
  real :: A = 2.0
!$acc data
  ! Initialize X and Y
  ...
!$acc host_data use_device(x,y)
  call saxpy(n, a, x, y)
!$acc end host_data
!$acc end data
end program

__global__ void saxpy_kernel(int n, float a, float *x, float *y) {
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  if (i < n) y[i] = a*x[i] + y[i];
}

void saxpy(int n, float a, float *dx, float *dy) {
  // Launch CUDA Kernel
  saxpy_kernel<<<4096,256>>>(N, 2.0, dx, dy);
}
HOST_DATA EXAMPLE
OpenACC Main CUDA C Kernel & Wrapper
- It's possible to interoperate from C/C++ or Fortran.
- OpenACC manages the data and passes device pointers to CUDA.
- The CUDA kernel launch is wrapped in a function expecting device arrays.
- The kernel is launched with arrays passed from OpenACC in main.
CUBLAS LIBRARY & OPENACC
OpenACC Main Calling CUBLAS
OpenACC can interface with existing GPU-optimized libraries (from C/C++ or Fortran). This includes…
- CUBLAS
- Libsci_acc
- CUFFT
- MAGMA
- CULA
- Thrust
- …
int N = 1<<20;
float *x, *y;
// Allocate & Initialize X & Y
...
cublasInit();

#pragma acc data copyin(x[0:N]) copy(y[0:N])
{
  #pragma acc host_data use_device(x,y)
  {
    // Perform SAXPY on 1M elements
    cublasSaxpy(N, 2.0, x, 1, y, 1);
  }
}

cublasShutdown();
OPENACC DEVICEPTR
The deviceptr clause informs the compiler that an object is already on the device, so no translation is necessary. Valid for parallel, kernels, and data
cudaMallocManaged((void*)&x,(size_t)n*sizeof(float));
cudaMallocManaged((void*)&y,(size_t)n*sizeof(float));
#pragma acc parallel loop deviceptr(x,y)
for(int i=0; i<n ; i++) {
  y[i] = a*x[i]+y[i];
}
Do not translate x and y, they are already on the device.
DEVICEPTR EXAMPLE
void saxpy(int n, float a,
           float * restrict x,
           float * restrict y) {
  #pragma acc kernels deviceptr(x[0:n],y[0:n])
  {
    for(int i=0; i<n; i++) {
      y[i] += 2.0*x[i];
    }
  }
}

int main(int argc, char **argv) {
  float *x, *y, tmp;
  int n = 1<<20, i;
  cudaMalloc((void*)&x,(size_t)n*sizeof(float));
  cudaMalloc((void*)&y,(size_t)n*sizeof(float));
  ...
  saxpy(n, 2.0, x, y);
  cudaMemcpy(&tmp,y,(size_t)sizeof(float),
             cudaMemcpyDeviceToHost);
  return 0;
}
OpenACC Kernels CUDA C Main
By passing a device pointer to an OpenACC region, it’s possible to add OpenACC to an existing CUDA code. Memory is managed via standard CUDA calls.
OPENACC & THRUST
Thrust (thrust.github.io) is a STL-like library for C++ on accelerators.
- High-level interface
- Host/Device container classes
- Common parallel algorithms
It’s possible to cast Thrust vectors to device pointers for use with OpenACC
int N = 1<<20;
thrust::host_vector<float> x(N), y(N);
for(int i=0; i<N; i++) {
  x[i] = 1.0f;
  y[i] = 0.0f;
}
// Copy to Device
thrust::device_vector<float> d_x = x;
thrust::device_vector<float> d_y = y;
saxpy(N, 2.0, d_x.data().get(), d_y.data().get());
// Copy back to host
y = d_y;

void saxpy(int n, float a,
           float * restrict x,
           float * restrict y) {
  #pragma acc kernels deviceptr(x[0:n],y[0:n])
  {
    for(int i=0; i<n; i++) {
      y[i] += 2.0*x[i];
    }
  }
}
CUDA DEVICE ROUTINES AND OPENACC
#pragma acc routine seq
extern "C" void f1dev( float*, float*, int );
...
#pragma acc parallel loop \
        present( a[0:n], b[0:n] )
for( int i = 0; i < n; ++i ) {
  f1dev( a, b, i );
}

extern "C" __device__ void
f1dev( float* a, float* b, int i ) {
  a[i] = .... b[i] .... ;
}

Even CUDA __device__ functions can be called from OpenACC if declared with acc routine.
OPENACC ACC_MAP_DATA FUNCTION
The acc_map_data (acc_unmap_data) function maps (unmaps) an existing device allocation to an OpenACC variable.
cudaMalloc((void*)&x_d,(size_t)n*sizeof(float));
acc_map_data(x, x_d, n*sizeof(float));
cudaMalloc((void*)&y_d,(size_t)n*sizeof(float));
acc_map_data(y, y_d, n*sizeof(float));

#pragma acc parallel loop
for(int i=0; i<n ; i++) {
  y[i] = a*x[i]+y[i];
}

Allocate device arrays with CUDA and map them to OpenACC variables. Here x and y will reuse the memory of x_d and y_d.
ATOMIC DIRECTIVE
OPENACC ATOMIC DIRECTIVE
atomic: the subsequent block of code is performed atomically with respect to other threads on the accelerator.

Clauses: read, write, update, capture

#pragma acc parallel loop
for(int i=0; i<N; i++) {
  #pragma acc atomic update
  a[i%100]++;
}
OPENACC ATOMIC: HISTOGRAM
19 #pragma acc data copyin(a[0:N]) copyout(h[0:HN])
20 for(int it=0;it<ITERS;it++)
21 {
22   #pragma acc parallel loop
23   for(int i=0;i<HN;i++)
24     h[i]=0;
25
26   #pragma acc parallel loop
27   for(int i=0;i<N;i++) {
28     #pragma acc atomic
29     h[a[i]]+=1;
30   }
31 }
MISC. ADVICE AND TECHNIQUES
WRITE PARALLELIZABLE LOOPS
Use countable loops (C99: while -> for; Fortran: while -> do):

// Not parallelizable
bool found=false;
while(!found && i<N) {
  if(a[i]==val) {
    found=true;
    loc=i;
  }
  i++;
}

// Parallelizable
bool found=false;
for(int i=0;i<N;i++) {
  if(a[i]==val) {
    found=true;
    loc=i;
  }
}

Write rectangular loops (the compiler cannot parallelize triangular loops):

// Triangular
for(int i=0;i<N;i++) {
  for(int j=i;j<N;j++) {
    sum+=A[i][j];
  }
}

// Rectangular
for(int i=0;i<N;i++) {
  for(int j=0;j<N;j++) {
    if(j>=i) sum+=A[i][j];
  }
}

Also avoid pointer arithmetic; use array syntax instead.
C99: RESTRICT KEYWORD
- A declaration of intent given by the programmer to the compiler, applied to a pointer, e.g. float *restrict ptr
- Meaning: "for the lifetime of ptr, only it or a value directly derived from it (such as ptr + 1) will be used to access the object to which it points"*
- Parallelizing compilers often require restrict to determine independence; otherwise the compiler can't parallelize loops that access ptr.
- Note: if the programmer violates the declaration, behavior is undefined.

* http://en.wikipedia.org/wiki/Restrict

float restrict *ptr    // Wrong
float *restrict ptr    // Correct
INLINING
- When possible, aggressively inline functions/routines.
- This is especially important for inner loop calculations.
- Inlined routines frequently perform better than acc routines because the compiler has more information.

#pragma acc routine seq
inline
int IDX(int row, int col, int LDA) {
  return row*LDA+col;
}
KERNEL FUSION
Kernel calls are expensive:
- Each call can take over 10us just to launch.
- It is often a good idea to combine loops of the same trip count that contain very few lines of code.

Kernel fusion (i.e. loop fusion): join nearby kernels into a single kernel.

#pragma acc parallel loop
for (int i = 0; i < n; ++i) {
  a[i]=0;
}
#pragma acc parallel loop
for (int i = 0; i < n; ++i) {
  b[i]=0;
}

becomes:

#pragma acc parallel loop
for (int i = 0; i < n; ++i) {
  a[i]=0;
  b[i]=0;
}
LOOP FISSION
- Loops that are exceptionally long may result in kernels that are resource-bound, resulting in low GPU occupancy.
- This is particularly true for outer parallel loops containing nested loops.
- Caution: this may introduce temporaries.

#pragma acc parallel loop
for (int j = 0; j < m; ++j ) {
  for (int i = 0; i < n; ++i) {
    a[i]=0;
  }
  for (int i = 0; i < n; ++i) {
    b[i]=0;
  }
}

becomes:

#pragma acc parallel loop
for (int j = 0; j < m; ++j )
  for (int i = 0; i < n; ++i) {
    a[i]=0;
  }
#pragma acc parallel loop
for (int j = 0; j < m; ++j )
  for (int i = 0; i < n; ++i) {
    b[i]=0;
  }
MEMORY COALESCING
- Coalesced access: a group of 32 contiguous threads (a "warp") accessing adjacent words. Few transactions and high utilization.
- Uncoalesced access: a warp of 32 threads accessing scattered words. Many transactions and low utilization.
- For best performance, the vector loop should access memory contiguously (stride-1).

[Diagram: threads 1-31 of a warp accessing adjacent words (coalesced) vs. scattered words (uncoalesced)]
COMPLEX DATA LAYOUTS
- OpenACC works best with flat arrays.
- Some compilers handle complex types (structs, classes, derived types) better than others. It doesn't always work, particularly if members are dynamically allocated.
- Workaround: use local pointers to struct members (C99 & Fortran).

// May work
#pragma acc parallel loop \
        copy(a, a.data[0:a.N])
for(i=0;i<a.N;i++)
  a.data[i]=0;

// Generally works
int N=a.N;
float *data=a.data;
#pragma acc parallel loop \
        copy(data[0:N])
for(i=0;i<N;i++)
  data[i]=0;
NEXT STEPS
NEXT STEPS
- Attend more OpenACC sessions at GTC (or go back and watch the videos).
- Try an OpenACC self-paced lab.
- Get a free trial of the PGI Compiler (www.pgroup.com).
- Please remember to fill out your surveys.
- S5340 OpenACC and C++: An Application Perspective (Fri 10:30, 210C)
- S5198 Panel on GPU Computing with OpenACC and OpenMP (Fri 11:00, 210C)