SLIDE 1

GPU Computing with OpenACC Directives

SLIDE 2

GPUs Reaching Broader Set of Developers

[Timeline chart, 2004 to present: GPU adoption grows from 100,000s of early adopters (research universities, supercomputing centers, oil & gas) to 1,000,000s of developers across CAE, CFD, finance, rendering, data analytics, life sciences, defense, weather, climate, and plasma physics.]

SLIDE 3

3 Ways to Accelerate Applications

  • Libraries: "drop-in" acceleration
  • OpenACC Directives: easily accelerate applications
  • Programming Languages: maximum flexibility

SLIDE 4

OpenACC Directives

Your original Fortran or C code, plus simple compiler hints. The compiler parallelizes the code, which works on many-core GPUs and multicore CPUs. Serial code runs on the CPU; the annotated region runs on the GPU.

Program myscience
  ... serial code ...
!$acc kernels          ! OpenACC compiler hint
  do k = 1,n1
    do i = 1,n2
      ... parallel code ...
    enddo
  enddo
!$acc end kernels
  ...
End Program myscience

SLIDE 5

Familiar to OpenMP Programmers

OpenMP (CPU):

main() {
  double pi = 0.0; long i;
  #pragma omp parallel for reduction(+:pi)
  for (i=0; i<N; i++) {
    double t = (double)((i+0.05)/N);
    pi += 4.0/(1.0+t*t);
  }
  printf("pi = %f\n", pi/N);
}

OpenACC (CPU or GPU):

main() {
  double pi = 0.0; long i;
  #pragma acc kernels
  for (i=0; i<N; i++) {
    double t = (double)((i+0.05)/N);
    pi += 4.0/(1.0+t*t);
  }
  printf("pi = %f\n", pi/N);
}

SLIDE 6

OpenACC: The Standard for GPU Directives

  • Easy: directives are the easy path to accelerate compute-intensive applications
  • Open: OpenACC is an open GPU directives standard, making GPU programming straightforward and portable across parallel and multi-core processors
  • Powerful: GPU directives allow complete access to the massive parallel power of a GPU

SLIDE 7

OpenACC Members and Supporters

SLIDE 8

Directives: Easy & Powerful

  • Real-Time Object Detection (Global Manufacturer of Navigation Systems): 5x in 40 hours
  • Valuation of Stock Portfolios using Monte Carlo (Global Technology Consulting Company): 2x in 4 hours
  • Interaction of Solvents and Biomolecules (University of Texas at San Antonio): 5x in 8 hours

"Optimizing code with directives is quite easy, especially compared to CPU threads or writing CUDA kernels. The most important thing is avoiding restructuring of existing code for production applications."
  – Developer at the Global Manufacturer of Navigation Systems

SLIDE 9

Focus on Exposing Parallelism

With directives, tuning work focuses on exposing parallelism, which makes codes inherently better.

Example: application tuning work using directives for the new Titan system at ORNL.

S3D: research more efficient combustion with next-generation fuels
  • Tuned top 3 kernels (90% of runtime)
  • 3 to 6x faster on CPU+GPU vs. CPU+CPU
  • But also improved all-CPU version by 50%

CAM-SE: answer questions about specific climate change adaptation and mitigation scenarios
  • Tuned top key kernel (50% of runtime)
  • 6.5x faster on CPU+GPU vs. CPU+CPU
  • Improved performance of CPU version by 100%
SLIDE 10

A Very Simple Exercise: SAXPY

SAXPY in Fortran:

subroutine saxpy(n, a, x, y)
  real :: x(:), y(:), a
  integer :: n, i
!$acc kernels
  do i=1,n
    y(i) = a*x(i)+y(i)
  enddo
!$acc end kernels
end subroutine saxpy

...
! Perform SAXPY on 1M elements
call saxpy(2**20, 2.0, x_d, y_d)
...

SAXPY in C:

void saxpy(int n, float a, float *x, float *restrict y)
{
  #pragma acc kernels
  for (int i = 0; i < n; ++i)
    y[i] = a*x[i] + y[i];
}

...
// Perform SAXPY on 1M elements
saxpy(1<<20, 2.0, x, y);
...

SLIDE 11

Directive Syntax

Fortran:

!$acc directive [clause [, clause] …]

Often paired with a matching end directive surrounding a structured code block:

!$acc end directive

C:

#pragma acc directive [clause [, clause] …]

Often followed by a structured code block.

SLIDE 12

kernels: Your First OpenACC Directive

Each loop is executed as a separate kernel on the GPU. A kernel is a parallel function that runs on the GPU.

!$acc kernels
  do i=1,n
    a(i) = 0.0
    b(i) = 1.0
    c(i) = 2.0
  end do            ! kernel 1

  do i=1,n
    a(i) = b(i) + c(i)
  end do            ! kernel 2
!$acc end kernels

SLIDE 13

Kernels Construct

Fortran:

!$acc kernels [clause …]
  structured block
!$acc end kernels

C:

#pragma acc kernels [clause …]
{
  structured block
}

Clauses:

if( condition )
async( expression )

Also, any data clause (more later).
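
As a rough sketch of how these clauses combine (the function name, size threshold, and queue number below are illustrative assumptions, not from the slides): if falls back to host execution when the problem is too small to amortize offload overhead, and async lets the host continue while the kernel runs.

// Hedged sketch: illustrative use of the if and async clauses.
void scale(int n, float a, float *restrict x)
{
    // Offload only when n is large enough (threshold is an assumption),
    // and run asynchronously on queue 1.
    #pragma acc kernels if(n > 100000) async(1)
    for (int i = 0; i < n; ++i)
        x[i] = a * x[i];

    #pragma acc wait(1)   // block until queue 1 finishes before x is reused
}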
SLIDE 14

C tip: the restrict keyword

A declaration of intent given by the programmer to the compiler, applied to a pointer, e.g.

float *restrict ptr

Meaning: "for the lifetime of ptr, only it or a value directly derived from it (such as ptr + 1) will be used to access the object to which it points"*

This limits the effects of pointer aliasing. OpenACC compilers often require restrict to determine loop independence; otherwise the compiler can't parallelize loops that access ptr.

Note: if the programmer violates the declaration, behavior is undefined.

http://en.wikipedia.org/wiki/Restrict
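
A minimal sketch of why this matters (the function and array names are illustrative): without restrict the compiler must assume out and in might overlap, so the iterations are not provably independent and the loop cannot be parallelized.

// Hedged sketch: restrict lets the compiler prove loop independence.
void shift(int n, float *restrict out, const float *restrict in)
{
    #pragma acc kernels
    for (int i = 0; i < n - 1; ++i)
        out[i] = in[i + 1];   // safe to parallelize: restrict promises no aliasing
}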

SLIDE 15

Complete SAXPY example code

A trivial first example: apply a loop directive and learn the compiler commands.

#include <stdlib.h>

void saxpy(int n, float a, float *x, float *restrict y)
{
  #pragma acc kernels
  for (int i = 0; i < n; ++i)
    y[i] = a * x[i] + y[i];
}

int main(int argc, char **argv)
{
  int N = 1<<20; // 1 million floats
  if (argc > 1)
    N = atoi(argv[1]);

  float *x = (float*)malloc(N * sizeof(float));
  float *y = (float*)malloc(N * sizeof(float));

  for (int i = 0; i < N; ++i) {
    x[i] = 2.0f;
    y[i] = 1.0f;
  }

  saxpy(N, 3.0f, x, y);
  return 0;
}

*restrict: "I promise y does not alias x"

SLIDE 16

Compile and run

C:

pgcc -acc -ta=nvidia -Minfo=accel -o saxpy_acc saxpy.c

Fortran:

pgf90 -acc -ta=nvidia -Minfo=accel -o saxpy_acc saxpy.f90

Compiler output:

pgcc -acc -Minfo=accel -ta=nvidia -o saxpy_acc saxpy.c
saxpy:
      8, Generating copyin(x[:n-1])
         Generating copy(y[:n-1])
         Generating compute capability 1.0 binary
         Generating compute capability 2.0 binary
      9, Loop is parallelizable
         Accelerator kernel generated
          9, #pragma acc loop worker, vector(256) /* blockIdx.x threadIdx.x */
         CC 1.0 : 4 registers; 52 shared, 4 constant, 0 local memory bytes; 100% occupancy
         CC 2.0 : 8 registers; 4 shared, 64 constant, 0 local memory bytes; 100% occupancy

SLIDE 17

Example: Jacobi Iteration

A common, useful algorithm that iteratively converges to the correct value (e.g. temperature) by computing new values at each point from the average of the neighboring points.

Example: solve the Laplace equation in 2D: $\nabla^2 f(x,y) = 0$

[Stencil diagram: A(i,j) and its four neighbors A(i+1,j), A(i-1,j), A(i,j-1), A(i,j+1)]

$$A_{k+1}(i,j) = \frac{A_k(i-1,j) + A_k(i+1,j) + A_k(i,j-1) + A_k(i,j+1)}{4}$$

SLIDE 18

Jacobi Iteration C Code

while ( error > tol && iter < iter_max ) {    // iterate until converged
  error = 0.0;

  for( int j = 1; j < n-1; j++) {             // iterate across matrix elements
    for( int i = 1; i < m-1; i++) {
      // calculate new value from neighbors
      Anew[j][i] = 0.25 * (A[j][i+1] + A[j][i-1] +
                           A[j-1][i] + A[j+1][i]);
      // compute max error for convergence
      error = max(error, abs(Anew[j][i] - A[j][i]));
    }
  }

  // swap input/output arrays
  for( int j = 1; j < n-1; j++) {
    for( int i = 1; i < m-1; i++ ) {
      A[j][i] = Anew[j][i];
    }
  }

  iter++;
}

SLIDE 19

Jacobi Iteration Fortran Code

do while ( err > tol .and. iter < iter_max )   ! iterate until converged
  err = 0._fp_kind

  do j=1,m                                     ! iterate across matrix elements
    do i=1,n
      ! calculate new value from neighbors
      Anew(i,j) = .25_fp_kind * (A(i+1,j) + A(i-1,j) + &
                                 A(i,j-1) + A(i,j+1))
      ! compute max error for convergence
      err = max(err, Anew(i,j) - A(i,j))
    end do
  end do

  ! swap input/output arrays
  do j=1,m-2
    do i=1,n-2
      A(i,j) = Anew(i,j)
    end do
  end do

  iter = iter + 1
end do

SLIDE 20

OpenMP C Code

while ( error > tol && iter < iter_max ) {
  error = 0.0;

  // parallelize loop across CPU threads
  #pragma omp parallel for shared(m, n, Anew, A)
  for( int j = 1; j < n-1; j++) {
    for( int i = 1; i < m-1; i++) {
      Anew[j][i] = 0.25 * (A[j][i+1] + A[j][i-1] +
                           A[j-1][i] + A[j+1][i]);
      error = max(error, abs(Anew[j][i] - A[j][i]));
    }
  }

  // parallelize loop across CPU threads
  #pragma omp parallel for shared(m, n, Anew, A)
  for( int j = 1; j < n-1; j++) {
    for( int i = 1; i < m-1; i++ ) {
      A[j][i] = Anew[j][i];
    }
  }

  iter++;
}

SLIDE 21

OpenMP Fortran Code

do while ( err > tol .and. iter < iter_max )
  err = 0._fp_kind

  ! parallelize loop across CPU threads
  !$omp parallel do shared(m,n,Anew,A) reduction(max:err)
  do j=1,m
    do i=1,n
      Anew(i,j) = .25_fp_kind * (A(i+1,j) + A(i-1,j) + &
                                 A(i,j-1) + A(i,j+1))
      err = max(err, Anew(i,j) - A(i,j))
    end do
  end do

  ! parallelize loop across CPU threads
  !$omp parallel do shared(m,n,Anew,A)
  do j=1,m-2
    do i=1,n-2
      A(i,j) = Anew(i,j)
    end do
  end do

  iter = iter + 1
end do

SLIDE 22

Exercises: General Instructions (compiling)

Exercises are in the "exercises/openacc" directory in your home directory; solutions are in the "exercise_solutions/openacc" directory.

module load pgi/13.5

To compile, use one of the provided makefiles:

C:       > make
Fortran: > make -f Makefile_f90

Remember these compiler flags: -acc -ta=nvidia -Minfo=accel

SLIDE 23

Exercises: General Instructions (running)

To run, use qsub with one of the provided job files:

> qsub runit.acc
> qstat -u <username>    # prints job status

Output is placed in slurm.* when finished.

An OpenACC job file looks something like this:

#!/bin/bash
./laplace2d_acc

The OpenMP version specifies the number of cores to use:

#!/bin/bash
export OMP_NUM_THREADS=6    # edit this to control the number of cores to use
./laplace2d_omp
SLIDE 24

Exercise 1: Jacobi Kernels

Task: use acc kernels to parallelize the Jacobi loop nests.

  • Edit laplace2D.c or laplace2D.f90 (your choice), in the 001-laplace2D-kernels directory
  • Add directives where they help
  • Figure out the proper compilation command (similar to the SAXPY example)
  • Compile both with and without OpenACC parallelization
  • Optionally compile with OpenMP (the original code has OpenMP directives)
  • Run the OpenACC version with laplace_acc, the OpenMP version with laplace_omp

Q: can you get a speedup with just kernels directives? Versus 1 CPU core? Versus 6 CPU cores?

SLIDE 25

Exercise 1 Solution: OpenACC C

while ( error > tol && iter < iter_max ) {
  error = 0.0;

  // execute GPU kernel for loop nest
  #pragma acc kernels
  for( int j = 1; j < n-1; j++) {
    for( int i = 1; i < m-1; i++) {
      Anew[j][i] = 0.25 * (A[j][i+1] + A[j][i-1] +
                           A[j-1][i] + A[j+1][i]);
      error = max(error, abs(Anew[j][i] - A[j][i]));
    }
  }

  // execute GPU kernel for loop nest
  #pragma acc kernels
  for( int j = 1; j < n-1; j++) {
    for( int i = 1; i < m-1; i++ ) {
      A[j][i] = Anew[j][i];
    }
  }

  iter++;
}

SLIDE 26

Exercise 1 Solution: OpenACC Fortran

do while ( err > tol .and. iter < iter_max )
  err = 0._fp_kind

  ! generate GPU kernel for loop nest
  !$acc kernels
  do j=1,m
    do i=1,n
      Anew(i,j) = .25_fp_kind * (A(i+1,j) + A(i-1,j) + &
                                 A(i,j-1) + A(i,j+1))
      err = max(err, Anew(i,j) - A(i,j))
    end do
  end do
  !$acc end kernels

  ! generate GPU kernel for loop nest
  !$acc kernels
  do j=1,m-2
    do i=1,n-2
      A(i,j) = Anew(i,j)
    end do
  end do
  !$acc end kernels

  iter = iter + 1
end do

SLIDE 27

Exercise 1 Solution: C Makefile

CC = pgcc
CCFLAGS =
ACCFLAGS = -acc -ta=nvidia -Minfo=accel
OMPFLAGS = -fast -mp -Minfo
BIN = laplace2d_omp laplace2d_acc

all: $(BIN)

laplace2d_acc: laplace2d.c
	$(CC) $(CCFLAGS) $(ACCFLAGS) -o $@ $<

laplace2d_omp: laplace2d.c
	$(CC) $(CCFLAGS) $(OMPFLAGS) -o $@ $<

clean:
	$(RM) $(BIN)

SLIDE 28

Exercise 1 Solution: Fortran Makefile

F90 = pgf90
CCFLAGS =
ACCFLAGS = -acc -ta=nvidia -Minfo=accel
OMPFLAGS = -fast -mp -Minfo
BIN = laplace2d_f90_omp laplace2d_f90_acc

all: $(BIN)

laplace2d_f90_acc: laplace2d.f90
	$(F90) $(CCFLAGS) $(ACCFLAGS) -o $@ $<

laplace2d_f90_omp: laplace2d.f90
	$(F90) $(CCFLAGS) $(OMPFLAGS) -o $@ $<

clean:
	$(RM) $(BIN)

SLIDE 29

Exercise 1: Compiler output (C)

pgcc -tp nehalem -acc -ta=nvidia -Minfo=accel -o laplace2d_acc laplace2d.c
main:
     56, Generating present_or_copyout(Anew[1:4094][1:4094])
         Generating present_or_copyin(A[0:][0:])
         Generating NVIDIA code
         Generating compute capability 1.3 binary
         Generating compute capability 2.0 binary
         Generating compute capability 3.0 binary
     57, Loop is parallelizable
     59, Loop is parallelizable
         Accelerator kernel generated
         57, #pragma acc loop gang /* blockIdx.y */
             Cached references to size [3x(x+2)] block of 'A'
         59, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
         63, Max reduction generated for error
     68, Generating present_or_copyin(Anew[1:4094][1:4094])
         Generating present_or_copyout(A[1:4094][1:4094])
         Generating NVIDIA code
         Generating compute capability 1.3 binary
         Generating compute capability 2.0 binary
         Generating compute capability 3.0 binary
     69, Loop is parallelizable
     71, Loop is parallelizable
         Accelerator kernel generated
         69, #pragma acc loop gang /* blockIdx.y */
         71, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */

SLIDE 30

Exercise 1: Performance

Execution (4096x4096)      Time (s)   Speedup
CPU 1 OpenMP thread        95.1       --
CPU 2 OpenMP threads       50.8       1.9x
CPU 4 OpenMP threads       30.5       3.1x
CPU 6 OpenMP threads       24.7       3.9x
OpenACC GPU                150        0.16x   FAIL

CPU speedups are vs. 1 CPU core; the OpenACC speedup is vs. 6 CPU cores.

CPU: Intel E5-2670, 8 cores @ 2.60 GHz. GPU: NVIDIA Tesla K20Xm.

SLIDE 31

What went wrong?

Add PGI_ACC_TIME=1 to your environment:

time(us): 86,029,004
56: compute region reached 1000 times
    56: data copyin reached 1000 times
        device time(us): total=20,538,099 max=20,996 min=20,211 avg=20,538
    59: kernel launched 1000 times
        grid: [32x4094] block: [128]
        device time(us): total=3,078,568 max=3,185 min=3,072 avg=3,078
        elapsed time(us): total=3,091,327 max=3,810 min=3,081 avg=3,091
    59: reduction kernel launched 1000 times
        grid: [1] block: [256]
        device time(us): total=262,167 max=848 min=255 avg=262
        elapsed time(us): total=279,250 max=860 min=266 avg=279
    68: data copyout reached 1000 times
        device time(us): total=20,048,356 max=20,215 min=20,037 avg=20,048
68: compute region reached 1000 times
    68: data copyin reached 1000 times
        device time(us): total=20,536,979 max=21,117 min=20,219 avg=20,536
    71: kernel launched 1000 times
        grid: [32x4094] block: [128]
        device time(us): total=1,515,375 max=1,526 min=1,503 avg=1,515
        elapsed time(us): total=1,902,546 max=2,883 min=1,526 avg=1,902
    77: data copyout reached 1000 times
        device time(us): total=20,049,460 max=20,360 min=20,038 avg=20,049

Huge data transfer bottleneck! Computation: 5 seconds. Data movement: 80 seconds.

SLIDE 32

Basic Concepts

[Diagram: CPU and CPU memory connected to GPU and GPU memory over the PCI bus. Data is transferred across the bus; computation is offloaded to the GPU.]

For efficiency, decouple data movement and compute off-load.

SLIDE 33

Excessive Data Transfers

while ( error > tol && iter < iter_max ) {
  error = 0.0;

  // A, Anew resident on host: copied to the accelerator here
  #pragma acc kernels
  for( int j = 1; j < n-1; j++) {
    for( int i = 1; i < m-1; i++) {
      // A, Anew resident on accelerator
      Anew[j][i] = 0.25 * (A[j][i+1] + A[j][i-1] +
                           A[j-1][i] + A[j+1][i]);
      error = max(error, abs(Anew[j][i] - A[j][i]));
    }
  }
  // A, Anew copied back to the host here
  ...
}

These copies happen on every iteration of the outer while loop!*

*Note: there are two #pragma acc kernels regions, so there are 4 copies per while loop iteration!

SLIDE 34

DATA MANAGEMENT

SLIDE 35

Data Construct

Manages data movement. Data regions may be nested.

Fortran:

!$acc data [clause …]
  structured block
!$acc end data

C:

#pragma acc data [clause …]
{
  structured block
}

General clauses:

if( condition )
async( expression )
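
A minimal sketch of a data region (illustrative names; array shapes use the OpenACC [start:length] section syntax): the array stays resident on the GPU across both compute regions, so it is transferred once instead of twice.

// Hedged sketch: one data region spanning two kernels regions.
void two_passes(int n, float *restrict a)
{
    #pragma acc data copy(a[0:n])
    {
        #pragma acc kernels
        for (int i = 0; i < n; ++i)
            a[i] *= 2.0f;        // first pass: no host round-trip after this

        #pragma acc kernels
        for (int i = 0; i < n; ++i)
            a[i] += 1.0f;        // second pass reuses the device copy
    }                            // a is copied back to the host here
}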

SLIDE 36

Data Clauses

copy( list )     Allocates memory on the GPU, copies data from host to GPU when entering the region, and copies data back to the host when exiting the region.

copyin( list )   Allocates memory on the GPU and copies data from host to GPU when entering the region.

copyout( list )  Allocates memory on the GPU and copies data to the host when exiting the region.

create( list )   Allocates memory on the GPU but does not copy.

present( list )  Data is already present on the GPU from another containing data region.

Also: present_or_copy[in|out], present_or_create, deviceptr.
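
To make the traffic implied by each clause concrete, here is a hedged sketch (illustrative names and shapes): in is only read on the GPU, out is only written, and tmp never needs to exist on the host at all.

// Hedged sketch: matching data clauses to the actual data flow.
void stages(int n, const float *restrict in,
            float *restrict out, float *restrict tmp)
{
    #pragma acc data copyin(in[0:n]) copyout(out[0:n]) create(tmp[0:n])
    {
        #pragma acc kernels
        for (int i = 0; i < n; ++i)
            tmp[i] = 2.0f * in[i];   // tmp lives only on the GPU

        #pragma acc kernels
        for (int i = 0; i < n; ++i)
            out[i] = tmp[i] + 1.0f;  // out is copied back on region exit
    }
}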

SLIDE 37

Array Shaping

The compiler sometimes cannot determine the size of arrays, so you must specify it explicitly using data clauses and an array "shape".

C:

#pragma acc data copyin(a[0:size-1]), copyout(b[s/4:3*s/4])

Fortran:

!$acc data copyin(a(1:size)), copyout(b(s/4:3*s/4))

Note: data clauses can be used on data, kernels or parallel constructs.
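
One hedged sketch of why shaping is needed (illustrative function and names): a bare pointer parameter carries no extent, and a subrange shape can also restrict how much data is moved. Note that the C section syntax in the OpenACC specification is [start:length]; the slide above uses the older PGI-era notation.

// Hedged sketch: explicit shapes on pointers the compiler cannot size.
void middle_half(int n, const float *restrict a, float *restrict b)
{
    // Copy all of a in; copy out only the middle half of b, which is
    // the only part this routine produces.
    #pragma acc data copyin(a[0:n]) copyout(b[n/4:n/2])
    {
        #pragma acc kernels
        for (int i = n/4; i < 3*n/4; ++i)
            b[i] = a[i];
    }
}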

SLIDE 38

Update Construct

Used to update existing data after it has changed in its corresponding copy (e.g. update the device copy after the host copy changes). Moves data from GPU to host, or host to GPU. Data movement can be conditional and asynchronous.

Fortran:

!$acc update [clause …]

C:

#pragma acc update [clause …]

Clauses:

host( list )
device( list )
if( expression )
async( expression )
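
A hedged sketch of a typical use (illustrative names; the checkpoint interval is an assumption): inside a long-lived data region, update host refreshes the host copy periodically without tearing the region down.

// Hedged sketch: periodic host refresh inside a data region.
void iterate(int n, int steps, int interval, float *restrict a)
{
    #pragma acc data copy(a[0:n])
    for (int s = 1; s <= steps; ++s) {
        #pragma acc kernels
        for (int i = 0; i < n; ++i)
            a[i] = 0.5f * (a[i] + 1.0f);   // relax values toward 1.0

        if (s % interval == 0) {
            #pragma acc update host(a[0:n])   // pull current device values
            /* ... write a checkpoint from the host copy of a ... */
        }
    }
}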

SLIDE 39

Exercise 2: Jacobi Data Directives

Task: use acc data to minimize transfers in the Jacobi example.

  • Start from the given laplace2D.c or laplace2D.f90 (your choice), in the 002-laplace2d-data directory
  • Add directives where they help (hint: the [do] while loop)

Q: What speedup can you get with data + kernels directives? Versus 6 CPU cores? (OMP_NUM_THREADS=6 ./laplace2d_omp)

SLIDE 40

Exercise 2 Solution: OpenACC C

// copy A in at the beginning of the region, out at the end;
// allocate Anew on the accelerator only
#pragma acc data copy(A), create(Anew)
while ( error > tol && iter < iter_max ) {
  error = 0.0;

  #pragma acc kernels
  for( int j = 1; j < n-1; j++) {
    for( int i = 1; i < m-1; i++) {
      Anew[j][i] = 0.25 * (A[j][i+1] + A[j][i-1] +
                           A[j-1][i] + A[j+1][i]);
      error = max(error, abs(Anew[j][i] - A[j][i]));
    }
  }

  #pragma acc kernels
  for( int j = 1; j < n-1; j++) {
    for( int i = 1; i < m-1; i++ ) {
      A[j][i] = Anew[j][i];
    }
  }

  iter++;
}

SLIDE 41

Exercise 2 Solution: OpenACC Fortran

! copy A in at the beginning of the region, out at the end;
! allocate Anew on the accelerator only
!$acc data copy(A), create(Anew)
do while ( err > tol .and. iter < iter_max )
  err = 0._fp_kind

  !$acc kernels
  do j=1,m
    do i=1,n
      Anew(i,j) = .25_fp_kind * (A(i+1,j) + A(i-1,j) + &
                                 A(i,j-1) + A(i,j+1))
      err = max(err, Anew(i,j) - A(i,j))
    end do
  end do
  !$acc end kernels

  ...

  iter = iter + 1
end do
!$acc end data

SLIDE 42

Exercise 2: Performance

Execution                  Time (s)   Speedup
CPU 6 OpenMP threads       24.7       --
OpenACC GPU K20Xm          150        0.16x
OpenACC GPU K20Xm-opt      16.5       1.49x

Speedups vs. 6 CPU cores.

SLIDE 43

What went right?

Add PGI_ACC_TIME=1 to your environment:

time(us): 4,837,846
28: data region reached 1 time
    50: data copyin reached 1 time
        device time(us): total=20,239 max=20,239 min=20,239 avg=20,239
    82: data copyout reached 1 time
        device time(us): total=20,063 max=20,063 min=20,063 avg=20,063
56: compute region reached 1000 times
    59: kernel launched 1000 times
        grid: [32x4094] block: [128]
        device time(us): total=3,062,986 max=3,180 min=3,060 avg=3,062
        elapsed time(us): total=3,073,489 max=3,520 min=3,070 avg=3,073
    59: reduction kernel launched 1000 times
        grid: [1] block: [256]
        device time(us): total=256,582 max=309 min=255 avg=256
        elapsed time(us): total=267,788 max=657 min=264 avg=267
68: compute region reached 1000 times
    71: kernel launched 1000 times
        grid: [32x4094] block: [128]
        device time(us): total=1,477,976 max=1,528 min=1,473 avg=1,477
        elapsed time(us): total=1,489,101 max=1,735 min=1,484 avg=1,489

Transfer bottleneck eliminated! Computation: 5 seconds. Data movement: negligible.

SLIDE 44

Further speedups

OpenACC gives us more detailed control over parallelization, via the gang, worker, and vector clauses.

  • By understanding more about the OpenACC execution model and GPU hardware organization, we can get higher speedups on this code
  • By understanding bottlenecks in the code via profiling, we can reorganize the code for higher performance
  • We will tackle these in later exercises; a sketch of the clauses follows below
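
As a preview, here is a hedged sketch of explicit loop mapping on the Jacobi sweep (using C99 variable-length-array parameters for self-containment; the vector width 128 is an assumed, untuned value, not a recommendation from the slides):

// Hedged sketch: explicit gang/vector mapping of the stencil loops.
void sweep(int n, int m, float A[n][m], float Anew[n][m])
{
    #pragma acc kernels
    {
        #pragma acc loop gang            // distribute rows across gangs
        for (int j = 1; j < n-1; j++) {
            #pragma acc loop vector(128) // vectorize within each row
            for (int i = 1; i < m-1; i++)
                Anew[j][i] = 0.25f * (A[j][i+1] + A[j][i-1] +
                                      A[j-1][i] + A[j+1][i]);
        }
    }
}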

SLIDE 45

Finding Parallelism in your code

  • (Nested) for loops are best for parallelization
  • Large loop counts are needed to offset GPU/memcpy overhead
  • Iterations of loops must be independent of each other
      To help the compiler: the restrict keyword (C), the independent clause (see the sketch after this list)
  • The compiler must be able to figure out the sizes of data regions
      You can use directives to explicitly control sizes
  • Pointer arithmetic should be avoided if possible
      Use subscripted arrays rather than pointer-indexed arrays
  • Function calls within an accelerated region must be inlineable
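
A hedged sketch of the independent clause (an illustrative scatter; whether it is safe is the programmer's claim, not the compiler's): the indirect store defeats the compiler's independence analysis, and independent asserts that no two iterations write the same element.

// Hedged sketch: asserting independence the compiler cannot prove.
void scatter(int n, const int *restrict idx,
             const float *restrict src, float *restrict dst)
{
    #pragma acc kernels loop independent
    for (int i = 0; i < n; ++i)
        dst[idx[i]] = src[i];   // safe only if idx contains no duplicates
}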

SLIDE 46

Tips and Tricks

  • (PGI) Use the time option to learn where time is being spent: -ta=nvidia,time
  • Eliminate pointer arithmetic
  • Inline function calls in directives regions: (PGI) -inline or -inline,levels(<N>)
  • Use contiguous memory for multi-dimensional arrays
  • Use data regions to avoid excessive memory transfers
  • Conditional compilation with the _OPENACC macro (see the sketch below)
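
A hedged sketch of the _OPENACC macro (the printed messages are illustrative): OpenACC compilers predefine _OPENACC when directives are enabled, so host-only fallbacks can be selected at compile time.

// Hedged sketch: compile-time branching on _OPENACC.
#include <stdio.h>

int main(void)
{
#ifdef _OPENACC
    printf("Built with OpenACC (version macro = %d)\n", _OPENACC);
#else
    printf("Built without OpenACC support\n");
#endif
    return 0;
}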

SLIDE 47

OpenACC Learning Resources

OpenACC info, specification, FAQ, samples, and more

http://openacc.org

PGI OpenACC resources

http://www.pgroup.com/resources/accel.htm