OpenACC Tutorial
GridKa School 2017: make science && run
Andreas Herten, Forschungszentrum Jülich, 31 August 2017
Outline

— The GPU Platform: Introduction, Threading Model, App Showcase, Parallel Models
— OpenACC: History, OpenMP, Modus Operandi, OpenACC’s Models
— OpenACC by Example: OpenACC Workflow, Identify Parallelism, Parallelize Loops (parallel, loops, pgprof), Directive: kernels, Data Transfers (GPU Memory Spaces, Portability, Clause: copy, Visual Profiler), Data Locality (Analyse Flow, Directive: data), Optimize (Levels of Parallelism, Clause: gang, Memory Coalescing, Pinned)
— Interoperability: The Keyword, Tasks (Task 1 to Task 4)
— Conclusions, List of Tasks
CPU vs. GPU
A matter of specialties

Transporting one (CPU) vs. transporting many (GPU)
Graphics: Lee [1] and Shearings Holidays [2]
CPU vs. GPU

[Chip schematics: the CPU devotes die area to Control and Cache next to a few ALUs; the GPU devotes it to many ALUs; both attached to DRAM]
CPU → GPU → CPU

[Diagram: CPU with its memory, connected via the interconnect to the GPU (scheduler, SMs, L2, DRAM)]

1. Transfer data from CPU memory to GPU memory, transfer program
2. Load GPU program, execute on SMs, get (cached) data from memory; write back
3. Transfer results back to host memory

Old: Manual data transfer invocations – UVA
New: Driver automatically transfers data – UM
Warp the kernel, it’s a thread!

Methods to exploit parallelism:
— Threads → Block
— Blocks → Grid
— Threads & blocks in 3D

[Diagram: threads 0 … 5 grouped into blocks (1, 2, …), blocks forming a grid]

Threads: parallel execution units
— Lightweight → fast switching!
— 1000s of threads execute simultaneously

Parallel execution unit: kernel
Preparations

Task 0⋆: Setup
— Login to JURON: ssh -i mykey train0XX@juron.fz-juelich.de
— Directory of tasks: cd $HOME/GPU/Tasks/Tasks/
— Solutions are always given! You decide when to look. Directory of solutions: $HOME/GPU/Tasks/Solutions/
— Load required modules: module load pgi [cuda]
— vim is available as editor (or copy files with scp or rsync)
Some Applications

Task 0: Getting Started (TASK 0)
— Change to GPU/Tasks/Task0/ directory
— Read Instructions.rst
— Applications: Dot Product, GEMM, Mandelbrot, N-Body
Some Applications

[Benchmark plots, CPU vs. GPU: DDot (MFLOP/s over vector length), DGEMM (GFLOP/s over square-matrix size), Mandelbrot (MPixel/s over image width), N-Body (GFLOP/s over number of particles, for 1/2/4 GPUs in single and double precision)]
Amdahl’s Law

Possible maximum speedup for N parallel processors:
— Total time: t = t_s + t_p
— N processors: t(N) = t_s + t_p / N
— Speedup: s(N) = t / t(N) = (t_s + t_p) / (t_s + t_p / N)
— Efficiency: ε = s / N

[Plot: speedup over number of processors (1 … 4096) for parallel portions of 50 %, 75 %, 90 %, 95 %, 99 %]
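To see what the plateaus in the plot mean, normalize by the parallel fraction p = t_p / (t_s + t_p); a short worked limit (added for illustration, not from the slides):

    s(N) = \frac{1}{(1-p) + p/N}, \qquad \lim_{N\to\infty} s(N) = \frac{1}{1-p}
    % Example: p = 0.95  =>  s_max = 1 / 0.05 = 20

Even a 95 % parallel code therefore never exceeds a speedup of 20, which matches the saturation of the curves.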
Gustafson-Barsis’s Law

“[…] speedup should be measured by scaling the problem to the number of processors, not fixing problem size.”
– John Gustafson

[Plot: scaled speedup over number of processors for serial portions of 1 %, 10 %, 50 %, 75 %, 90 %, 99 %]
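The law itself is not spelled out on the slide; the usual closed form (a standard result, stated here for reference) with serial fraction α on N processors is:

    s(N) = N - \alpha\,(N - 1)
    % Example: \alpha = 0.1, N = 1024  =>  s = 1024 - 0.1 \cdot 1023 \approx 921.7

Scaled speedup grows linearly with N, which is why these curves keep rising instead of saturating.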
!

Parallel programming is not easy! Things to consider:
— Is my application computationally intensive enough?
— What are the levels of parallelism?
— How much data needs to be transferred?
— Is the gain worth the pain?
Different levels of closeness to the GPU when GPU programming, which can ease the pain: OpenACC, OpenMP, Thrust, PyCUDA, CUDA Fortran, CUDA, OpenCL
Ways to accelerate an application:
— Libraries: drop-in acceleration
— Directives (OpenACC): easy acceleration
— Programming languages: flexible acceleration
— 2011: OpenACC 1.0 specification is released (NVIDIA, Cray, PGI, CAPS)
— 2013: OpenACC 2.0: more functionality, portability
— 2015: OpenACC 2.5: enhancements, clarifications
— 2016: OpenACC 2.6 proposed (deep copy, …)
→ https://www.openacc.org/ (also: best practice guide)
Open{MP↔ACC}
Everything’s connected

— OpenACC modeled after OpenMP …
— … but specific for accelerators
— Might eventually be absorbed into OpenMP
— But OpenMP 4.0 now also has an offloading feature

Fork/join model: master thread launches parallel child threads; merge after execution

[Diagram: fork/join execution (master → fork → parallel → join → master), shown for both OpenMP and OpenACC]
Three-step program

1. Annotate code with directives, indicating parallelism
2. OpenACC-capable compiler generates accelerator-specific code
3. $uccess
1 Directives
pragmatic

Compiler directives state intent to the compiler

C/C++:

    #pragma acc kernels
    for (int i = 0; i < 23; i++) {
        // ...
    }

Fortran:

    !$acc kernels
    do i = 1, 24
        ! ...
    end do
    !$acc end kernels

— Ignored by a compiler which does not understand OpenACC
— High-level programming model for accelerators; heterogeneous programs
— OpenACC: compiler directives, library routines, environment variables
— Portable across host systems and accelerator architectures
2 Compiler
Simple and abstracted

Compiler support:
— PGI: best performance, great support, free
— GCC: beta, limited coverage, OSS
— Cray: ???

— Trust the compiler to generate the intended parallelism; check its status output
— No need to know the ins’n’outs of the accelerator; leave it to expert compiler engineers
— One code can target different accelerators: GPUs, or even multi-core CPUs → portability
3 $uccess
Iteration is key

— Serial to parallel: fast. Serial to fast parallel: more time needed
— Start simple → refine ⇒ productivity
— Because of its generality: sometimes the last bit of hardware performance is not accessible
— But: use OpenACC together with other accelerator-targeting techniques (CUDA, libraries, …)

[Cycle: Expose Parallelism → Compile → Measure]
— Main program executes on the host
— Device code is transferred to the accelerator
— Execution on the accelerator is started
— Host waits until return (except: async)

[Timeline: start main program → transfer → wait → run code → finish code → return to host]
[Diagram: host memory ↔ device memory, DMA transfers]

— Usually: two separate memory spaces
— Data needs to be transferred to the device for computation, and transferred back for further evaluation
— Transfers hidden from the programmer – caution: latency, bandwidth, memory size
— Memories are not coherent
— Compiler helps; GPU runtime helps
A binary perspective

OpenACC interpretation needs to be activated with a compile flag:
— PGI: pgcc -acc [-ta=tesla]
— GCC: gcc -fopenacc
Additional flags are possible to improve/modify compilation
A source code perspective

— Compiler directives, ignored by incapable compilers
— Similar to OpenMP
— Support for GPU, multicore CPU, other accelerators (Intel Xeon Phi)

Syntax, C/C++:

    #pragma acc directive [clause [, clause] ...] newline

Syntax, Fortran:

    !$acc directive [clause [, clause] ...]
    !$acc end directive
Example: a data region around two parallel loops

    #pragma acc data copy(x[0:N], y[0:N])
    {
        #pragma acc parallel loop
        for (int i = 0; i < N; i++) {
            x[i] = 1.0;
            y[i] = 2.0;
        }
        #pragma acc parallel loop
        for (int i = 0; i < N; i++) {
            y[i] = i * x[i] + y[i];
        }
    }
Algorithmic description

— Example for acceleration: Jacobi solver
— Iterative solver, converges to correct value
— Each iteration step: compute average of neighboring points
— Example: 2D Poisson equation: ∇²A(x, y) = B(x, y)

[Stencil diagram: data points and boundary points; neighbors A(i−1, j), A(i+1, j), A(i, j−1), A(i, j+1)]

    A_{k+1}(i,j) = -\frac{1}{4}\,\Big( B(i,j) - \big( A_k(i-1,j) + A_k(i,j+1) + A_k(i+1,j) + A_k(i,j-1) \big) \Big)
Source code

    while ( error > tol && iter < iter_max ) {          // Iterate until converged
        error = 0.0;
        for (int ix = ix_start; ix < ix_end; ix++) {    // Iterate across matrix elements
            for (int iy = iy_start; iy < iy_end; iy++) {
                // Calculate new value from neighbors
                Anew[iy*nx+ix] = -0.25 * (rhs[iy*nx+ix] -
                    ( A[iy*nx+ix+1] + A[iy*nx+ix-1]
                    + A[(iy-1)*nx+ix] + A[(iy+1)*nx+ix]));
                // Accumulate error
                error = fmaxr(error, fabsr(Anew[iy*nx+ix]-A[iy*nx+ix]));
            }
        }
        for (int iy = iy_start; iy < iy_end; iy++) {    // Swap input/output
            for (int ix = ix_start; ix < ix_end; ix++) {
                A[iy*nx+ix] = Anew[iy*nx+ix];
            }
        }
        for (int ix = ix_start; ix < ix_end; ix++) {    // Set boundary conditions
            A[0*nx+ix]      = A[(ny-2)*nx+ix];
            A[(ny-1)*nx+ix] = A[1*nx+ix];
        }
        // same for iy
        iter++;
    }
Generate Profile (TASK 1)

Use pgprof to analyze the unaccelerated version of the Jacobi solver. Investigate!

Task 1: Analyze Application
— Change to Task1/ directory
— Compile: make task1 (usually you compile just with make, but this exercise is special)
— Submit profiling run to the batch system: make task1_profile
— Study the bsub call and the pgprof call; try to understand them
— Where is the hotspot? Which parts should be accelerated?
Info during compilation

    $ pgcc -DUSE_DOUBLE -Minfo=all,intensity -fast -Minfo=ccff -Mprof=ccff poisson2d_reference.o poisson2d.c -o poisson2d
    poisson2d.c:
    main:
         68, Generated vector simd code for the loop
             FMA (fused multiply-add) instruction(s) generated
         98, FMA (fused multiply-add) instruction(s) generated
        105, Loop not vectorized: data dependency
        123, Loop not fused: different loop trip count
             Loop not vectorized: data dependency
             Loop unrolled 8 times

Automated optimization by the compiler, due to -fast: vectorization, FMA, unrolling
Info during run

    ======== CPU profiling result (flat):
    Time(%)  Time      Name
     77.52%  999.99ms  main (poisson2d.c:148 0x6d8)
      9.30%  120ms     main (0x704)
      7.75%  99.999ms  main (0x718)
      0.78%  9.9999ms  main (poisson2d.c:128 0x348)
      0.78%  9.9999ms  main (poisson2d.c:123 0x398)
      0.78%  9.9999ms  __xlmass_expd2 (0xffcc011c)
      0.78%  9.9999ms  __c_mcopy8 (0xffcc0054)
      0.78%  9.9999ms  __xlmass_expd2 (0xffcc0034)
    ======== Data collected at 100Hz frequency

78 % in main(). Since everything is in main: limited helpfulness. Let’s look into main!
What is independent?

    while ( error > tol && iter < iter_max ) {          // Data dependency between iterations
        error = 0.0;
        for (int ix = ix_start; ix < ix_end; ix++) {    // Independent loop iterations
            for (int iy = iy_start; iy < iy_end; iy++) {
                Anew[iy*nx+ix] = -0.25 * (rhs[iy*nx+ix] -
                    ( A[iy*nx+ix+1] + A[iy*nx+ix-1]
                    + A[(iy-1)*nx+ix] + A[(iy+1)*nx+ix]));
                error = fmaxr(error, fabsr(Anew[iy*nx+ix]-A[iy*nx+ix]));
            }
        }
        for (int iy = iy_start; iy < iy_end; iy++) {    // Independent loop iterations
            for (int ix = ix_start; ix < ix_end; ix++) {
                A[iy*nx+ix] = Anew[iy*nx+ix];
            }
        }
        for (int ix = ix_start; ix < ix_end; ix++) {    // Independent loop iterations
            A[0*nx+ix]      = A[(ny-2)*nx+ix];
            A[(ny-1)*nx+ix] = A[1*nx+ix];
        }
        // same for iy
        iter++;
    }
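To make the distinction concrete, a minimal sketch (illustrative arrays, not from the deck): the first loop carries a dependency across iterations and must run in order, the second one does not.

    // Loop-carried dependency: iteration i needs the result of iteration i-1
    for (int i = 1; i < N; i++)
        a[i] = a[i-1] + b[i];    // not safe to parallelize

    // Independent iterations: each i touches only its own elements
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];      // safe for #pragma acc parallel loop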
Maybe the second most important directive

— Programmer identifies a block containing parallelism → compiler generates GPU code (kernel)
— Program launch creates gangs of parallel threads on the GPU
— Implicit barrier at end of parallel region
— Each gang executes the same code sequentially

OpenACC: parallel

    #pragma acc parallel [clause [, clause] ...] newline
    {structured block}
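A minimal sketch of a parallel region (illustrative arrays; the loop directive that distributes the work follows on the next slides):

    // Gangs are created at region entry; without a loop directive,
    // each gang would redundantly execute the whole block
    #pragma acc parallel
    {
        #pragma acc loop
        for (int i = 0; i < N; i++)
            y[i] = 2.0 * x[i];
    }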
Clauses

Diverse clauses to augment the parallel region:
— private(var): a copy of variable var is made for each gang
— firstprivate(var): same as private, except var will be initialized with the value from the host
— if(cond): parallel region will execute on the accelerator only if cond is true
— reduction(op:var): reduction with operation op is performed on variable var
— async[(int)]: no implicit barrier at end of parallel region
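A small sketch combining two of these clauses (threshold and data are illustrative):

    double sum = 0.0;
    // Offload only if the problem is large enough; the partial sums of
    // all threads are combined into one value at the end of the region
    #pragma acc parallel loop if(N > 1000) reduction(+:sum)
    for (int i = 0; i < N; i++)
        sum += x[i];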
Maybe the third most important directive

— Programmer identifies a loop eligible for parallelization
— Directive must be directly before the loop
— Optional: describe the type of parallelism

OpenACC: loop

    #pragma acc loop [clause [, clause] ...] newline
    for-loop
Clauses

— independent: iterations of loop are data-independent (implied if in parallel region (and no seq or auto))
— collapse(int): collapse int tightly-nested loops
— seq: this loop is to be executed sequentially (not parallel)
— tile(int[,int]): split loops into loops over tiles of the full size
— auto: compiler decides what to do
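For instance, the two nested Jacobi loops could be fused into one large iteration space (a sketch, not the deck’s solution):

    // collapse(2): treat the ix and iy loops as a single parallel loop
    #pragma acc parallel loop collapse(2)
    for (int ix = ix_start; ix < ix_end; ix++)
        for (int iy = iy_start; iy < iy_end; iy++)
            A[iy*nx+ix] = Anew[iy*nx+ix];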
Maybe the most important directive

— Combined directive: a shortcut, because it’s used so often
— Any clause that is allowed on parallel or loop is allowed
— Restriction: may not appear in the body of another parallel region

OpenACC: parallel loop

    #pragma acc parallel loop [clause [, clause] ...]
    double sum = 0.0;

    #pragma acc parallel loop                       // Kernel 1
    for (int i = 0; i < N; i++) {
        x[i] = 1.0;
        y[i] = 2.0;
    }

    #pragma acc parallel loop reduction(+:sum)      // Kernel 2
    for (int i = 0; i < N; i++) {
        y[i] = i * x[i] + y[i];
        sum += y[i];
    }
Add parallelism (TASK 2)

Add OpenACC parallelism to the main loop in Jacobi, then profile the code. → Congratulations, you are a GPU developer!

Task 2: A First Parallel Loop
— Change to Task2/ directory
— Compile: make
— Submit parallel run to the batch system: make run
— Adapt the bsub call and run with other numbers of iterations, matrix sizes
— Profile: make profile (pgprof or nvprof is prefixed to the call to poisson2d)
Compilation result

    $ make
    pgcc -c -DUSE_DOUBLE -Minfo=accel -fast -acc -ta=tesla:cc60,managed poisson2d_reference.c -o poisson2d_reference.o
    pgcc -DUSE_DOUBLE -Minfo=accel -fast -acc -ta=tesla:cc60,managed poisson2d.c poisson2d_reference.o -o poisson2d
    poisson2d.c:
    main:
        109, Accelerator kernel generated
             Generating Tesla code
        109, Generating reduction(max:error)
        110, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
        112, #pragma acc loop seq
        109, Generating implicit copyin(A[:],rhs[:])
             Generating implicit copyout(Anew[:])
        112, Complex loop carried dependence of Anew-> prevents parallelization
             Loop carried dependence of Anew-> prevents parallelization
             Loop carried backward dependence of Anew-> prevents vectorization
Run result

    $ make run
    bsub -I -R "rusage[ngpus_shared=20]" ./poisson2d
    Job <4444> is submitted to default queue <normal.i>.
    <<Waiting for dispatch ...>>
    <<Starting on juronc11>>
    Jacobi relaxation calculation: max 500 iterations on 2048 x 2048 mesh
    Calculate reference solution and time with serial CPU execution.
        0, 0.249999
      100, 0.249760
      200, 0...
    Calculate current execution.
        0, 0.249999
      100, 0.249760
      200, 0...
    2048x2048: Ref: 60.0827 s, This: 9.5541 s, speedup: 6.29
pgprof / nvprof
NVIDIA’s command line profiler

— Profiles applications, mainly for NVIDIA GPUs, but also CPU code
— GPU: CUDA kernels, API calls, OpenACC
— pgprof vs. nvprof: twins with different configurations
— Generates concise performance reports, full timelines; measures events and metrics (hardware counters)
⇒ Powerful tool for GPU application analysis
→ http://docs.nvidia.com/cuda/profiler-users-guide/
With pgprof

    $ make profile
    ==116606== PGPROF is profiling process 116606, command: ./poisson2d 10
    ==116606== Profiling application: ./poisson2d 10
    Jacobi relaxation calculation: max 10 iterations on 2048 x 2048 mesh
    Calculate reference solution and time with serial CPU execution.
    2048x2048: Ref: 0.8378 s, This: 0.2716 s, speedup: 3.08
    ==116606== Profiling result:
    Time(%)      Time  Calls       Avg       Min       Max  Name
     99.96%  129.82ms     10  12.982ms  11.204ms  20.086ms  main_109_gpu
      0.02%  30.560us     10  3.0560us  2.6240us  3.8720us  main_109_gpu_red
      0.01%  10.304us     10  1.0300us     960ns  1.2480us  [CUDA memcpy HtoD]
      0.00%  6.3680us     10     636ns     608ns     672ns  [CUDA memcpy DtoH]
    ==116606== Unified Memory profiling result:
    Device "Tesla P100-SXM2-16GB (0)"
    Count  Avg Size  Min Size  Max Size  Total Size  Total Time  Name
     3360  204.80KB  64.000KB  960.00KB  672.0000MB  25.37254ms  Host To Device
     3200  204.80KB  64.000KB  960.00KB  640.0000MB  30.94435ms  Device To Host
     2454  GPU Page fault groups
    Total CPU Page faults: 2304
    ==116606== API calls:
    Time(%)      Time  Calls       Avg       Min       Max  Name
     58.17%  639.81ms      5  127.96ms     564ns  189.20ms  cuDevicePrimaryCtxRetain
     26.35%  289.79ms      4  72.449ms  69.684ms  74.126ms  cuDevicePrimaryCtxRelease
More freedom for the compiler

— kernels directive: second way to expose parallelism
— Region may contain parallelism
— Compiler determines parallelization opportunities → more freedom for the compiler
— Rest: same as for parallel

OpenACC: kernels

    #pragma acc kernels [clause [, clause] ...] newline
    {structured block}
    double sum = 0.0;

    #pragma acc kernels    // Kernels created here
    {
        for (int i = 0; i < N; i++) {
            x[i] = 1.0;
            y[i] = 2.0;
        }
        for (int i = 0; i < N; i++) {
            y[i] = i * x[i] + y[i];
            sum += y[i];
        }
    }
kernels vs. parallel

Both approaches are equally valid and can perform equally well.

kernels
— Compiler performs parallel analysis
— Can cover a large area of code with a single directive
— Gives the compiler additional leeway

parallel
— Requires parallel analysis by the programmer
— Will also parallelize what the compiler may miss
— Similar to OpenMP

Both:
— Regions may not contain other kernels/parallel regions
— No branching into or out
— Program must not depend on order of evaluation of clauses
— At most one if clause
Add more parallelism (TASK 3)

Add OpenACC parallelism to the other loops of the while loop (L:123 – L:141). Use either kernels or parallel. Do they perform equally well?

Task 3: More Parallel Loops
— Change to Task3/ directory
— Change the source code
— Compile: make and study the compiler output!
— Submit parallel run to the batch system: make run
Compilation result

    $ make
    pgcc -c -DUSE_DOUBLE -Minfo=accel -fast -acc -ta=tesla:cc60,managed poisson2d_reference.c -o poisson2d_reference.o
    poisson2d.c:
    main:
        109, Accelerator kernel generated
             Generating Tesla code
        109, Generating reduction(max:error)
        110, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
        112, #pragma acc loop seq
        109, ...
        121, Accelerator kernel generated
             Generating Tesla code
        124, #pragma acc loop gang /* blockIdx.x */
        126, #pragma acc loop vector(128) /* threadIdx.x */
        121, Generating implicit copyin(Anew[:])
             Generating implicit copyout(A[:])
        126, Loop is parallelizable
        133, Accelerator kernel genera...
Run result

    $ make run
    bsub -I -R "rusage[ngpus_shared=20]" ./poisson2d
    Job <4458> is submitted to default queue <normal.i>.
    <<Waiting for dispatch ...>>
    <<Starting on juronc15>>
    Jacobi relaxation calculation: max 500 iterations on 2048 x 2048 mesh
    Calculate reference solution and time with serial CPU execution.
        0, 0.249999
      100, 0.249760
      200, 0...
    Calculate current execution.
        0, 0.249999
      100, 0.249760
      200, 0...
    2048x2048: Ref: 64.9401 s, This: 0.4099 s, speedup: 158.45
    while ( error > tol && iter < iter_max ) {
        error = 0.0;

        #pragma acc parallel loop reduction(max:error)
        for (int ix = ix_start; ix < ix_end; ix++) {
            for (int iy = iy_start; iy < iy_end; iy++) {
                Anew[iy*nx+ix] = -0.25 * (rhs[iy*nx+ix] -
                    ( A[iy*nx+ix+1] + A[iy*nx+ix-1]
                    + A[(iy-1)*nx+ix] + A[(iy+1)*nx+ix]));
                error = fmaxr(error, fabsr(Anew[iy*nx+ix]-A[iy*nx+ix]));
            }
        }

        #pragma acc parallel loop
        for (int iy = iy_start; iy < iy_end; iy++) {
            for (int ix = ix_start; ix < ix_end; ix++) {
                A[iy*nx+ix] = Anew[iy*nx+ix];
            }
        }

        #pragma acc parallel loop
        for (int ix = ix_start; ix < ix_end; ix++) {
            A[0*nx+ix]      = A[(ny-2)*nx+ix];
            A[(ny-1)*nx+ix] = A[1*nx+ix];
        }
        // same for iy

        iter++;
    }
— Up to now we did not care about data transfers: compiler and runtime take care of them
— Magic keyword: -ta=tesla:managed
— But: a feature of (recent) NVIDIA GPUs only!
GPU Memory Spaces
Location, location, location

[Diagram: CPU with its memory next to the GPU (scheduler, SMs, L2, DRAM); Unified Virtual Addressing / Unified Memory spanning both]

— At the beginning: CPU and GPU memory very distinct, own addresses
— CUDA 4.0 Unified Virtual Addressing: pointers from one address pool, but data copies are manual
— CUDA 6.0 Unified Memory*: data copied by the driver, but all data at once (Kepler)
— CUDA 8.0 Unified Memory (truly): data copied by the driver; on-demand page faults initiate data migrations (Pascal)
— Managed memory: an NVIDIA-GPU-only feature
— A great OpenACC feature: portability
→ Code should also be fast without -ta=tesla:managed! Let’s remove it from the compile flags:

    $ make
    pgcc -c -DUSE_DOUBLE -Minfo=accel -fast -acc -ta=tesla:cc60 poisson2d_reference.c -o poisson2d_reference.o
    poisson2d.c:
    PGC-S-0155-Compiler failed to translate accelerator region (see -Minfo messages): Could not find allocated-variable index for symbol (poisson2d.c: 110)
    ...
    PGC/power Linux 17.4-0: compilation completed with severe errors
The compiler implicitly created copy clauses to copy data to the device:

    134, Generating implicit copyin(A[:])
         Generating implicit copyout(A[nx*(ny-1)+1:nx-2])

— It couldn’t determine the length of the copied data …
— … which was no problem before – Unified Memory!
— Now it is a problem: we need to provide that information (see also later)

OpenACC: copy

    #pragma acc parallel copy(A[start:end])

Also: copyin(B[s:e]), copyout(C[s:e]), present(D[s:e]), create(E[s:e])
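Applied to the Jacobi swap loop, explicit shapes could look like this (a sketch; the sizes follow the deck’s nx × ny layout):

    // Without managed memory the compiler cannot infer how much of
    // A and Anew to move, so we state the shapes explicitly
    #pragma acc parallel loop copyin(Anew[0:nx*ny]) copy(A[0:nx*ny])
    for (int iy = iy_start; iy < iy_end; iy++)
        for (int ix = ix_start; ix < ix_end; ix++)
            A[iy*nx+ix] = Anew[iy*nx+ix];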
Tell the compiler which data is needed where (TASK 4)

Add copy clauses to the parallel regions. Profile with the Visual Profiler.

Task 4: Data Copies
— Change to Task4/ directory and work on the TODOs
— Compile: make
— Submit parallel run to the batch system: make run (it might take some time)
— Generate a profile with make profile_tofile
Compiler Output

    $ make
    pgcc -c -DUSE_DOUBLE -Minfo=accel -fast -acc -ta=tesla:cc60 poisson2d_reference.c -o poisson2d_reference.o
    poisson2d.c:
    main:
        109, Generating copy(A[:ny*nx],Anew[:ny*nx],rhs[:ny*nx])
        ...
        121, Generating copy(Anew[:ny*nx],A[:ny*nx])
        ...
        131, Generating copy(A[:ny*nx])
             Accelerator kernel generated
             Generating Tesla code
        132, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
        137, Generating copy(A[:ny*nx])
             Accelerator kernel generated
             Generating Tesla code
        138, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
Run Result

    $ make run
    <<Starting on juronc13>>
    Jacobi relaxation calculation: max 500 iterations on 2048 x 2048 mesh
    Calculate reference solution and time with serial CPU execution.
        0, 0.249999
      100, 0.249760
      200, 0...
    Calculate current execution.
        0, 0.249999
      100, 0.249760
      200, 0...
    2048x2048: Ref: 114.7186 s, This: 25.0522 s, speedup: 4.58
GUI tool accompanying pgprof / nvprof

— PGI: start pgprof without parameters; NVIDIA: start nvvp
— Timeline view of all things GPU → study stages and interplay of the application
— Interactive, or with input from the command line profilers
— View launch and run configurations
— Guided and unguided analysis
→ https://developer.nvidia.com/nvidia-visual-profiler
[Visual Profiler screenshots: timeline of the profiled run; zoom in to the kernel calls]
In code

    while (error > tol && iter < iter_max) {
        error = 0.0;                      // A, Anew resident on host

        #pragma acc parallel loop         // copy: A, Anew now resident on device
        for (int ix = ix_start; ix < ix_end; ix++) {
            for (int iy = iy_start; iy < iy_end; iy++) {
                // ...
            }
        }                                 // copy: A, Anew resident on host again

        iter++;
    }

Copies are done in each iteration!
Summary

— Meanwhile, the whole algorithm is using the GPU
— At the beginning of the while loop, data is copied to the device; at the end of the loop, copied back to the host
— Depending on the type of parallel regions in the while loop: data is copied in between regions as well
— Slow! Data copies are expensive!
To manually specify data locations

— Defines region of code in which data remains on the device
— Data is shared among all kernels in the region
— Explicit data transfers

OpenACC: data

    #pragma acc data [clause [, clause] ...] newline
    {structured block}
Clauses

Clauses to augment the data regions:
— copy(var): allocates memory of var on GPU, copies data to GPU at beginning of region, copies data to host at end of region. Specify the size of var as var[lowerBound:size]
— copyin(var): allocates memory of var on GPU, copies data to GPU at beginning of region
— copyout(var): allocates memory of var on GPU, copies data to host at end of region
— create(var): allocates memory of var on GPU
— present(var): data of var is not copied automatically to GPU but considered present
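A sketch of how present interacts with an enclosing data region (illustrative arrays):

    #pragma acc data copyin(b[0:N]) copyout(a[0:N])
    {
        // The data region already placed a and b on the device;
        // present() asserts this instead of triggering new copies
        #pragma acc parallel loop present(a[0:N], b[0:N])
        for (int i = 0; i < N; i++)
            a[i] = 2.0 * b[i];
    }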
    #pragma acc data copyout(y[0:N]) create(x[0:N])
    {
        double sum = 0.0;

        #pragma acc parallel loop
        for (int i = 0; i < N; i++) {
            x[i] = 1.0;
            y[i] = 2.0;
        }

        #pragma acc parallel loop
        for (int i = 0; i < N; i++) {
            y[i] = i * x[i] + y[i];
        }
    }
One data set to rule them all (TASK 5)

Add a data region such that all data resides on the device during the iterations. Optional: see your success in the Visual Profiler.

Task 5: Data Region
— Change to Task5/ directory and work on the TODOs
— Compile: make
— Submit to the batch system: make run
— Generate a profile with make profile_tofile
Compiler Output

    $ make
    pgcc -DUSE_DOUBLE -Minfo=accel -fast -acc -ta=tesla:cc60 poisson2d.c poisson2d_reference.o -o poisson2d
    poisson2d.c:
    main:
        104, Generating copyin(rhs[:ny*nx])
             Generating create(Anew[:ny*nx])
             Generating copy(A[:ny*nx])
        110, Accelerator kernel generated
             Generating Tesla code
        110, Generating reduction(max:error)
        111, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
        113, #pragma acc loop seq
        ...
Run Result

    $ make run
    <<Starting on juronc12>>
    Jacobi relaxation calculation: max 500 iterations on 2048 x 2048 mesh
    Calculate reference solution and time with serial CPU execution.
        0, 0.249999
      100, 0.249760
      200, 0...
    Calculate current execution.
        0, 0.249999
      100, 0.249760
      200, 0...
    2048x2048: Ref: 115.0765 s, This: 0.4807 s, speedup: 239.38
    110, Accelerator kernel generated
         Generating Tesla code
    110, Generating reduction(max:error)
    111, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
    114, #pragma acc loop seq
    114, Complex loop carried dependence of Anew-> prevents parallelization

    110  #pragma acc parallel loop reduction(max:error)
    111  for (int ix = ix_start; ix < ix_end; ix++)
    112  {
    113      // Inner loop
    114      for (int iy = iy_start; iy < iy_end; iy++)
    115      {
    116          Anew[iy*nx+ix] = -0.25 * (rhs[iy*nx+ix] - ( A[iy*nx+ix+1] + A[iy*nx+ix-1] + A[(iy-1)*nx+ix] + A[(iy+1)*nx+ix] ));
    117          error = fmaxr( error, fabsr(Anew[iy*nx+ix]-A[iy*nx+ix]));
    118      }
    119  }

— Outer loop: parallelism with gang and vector
— Inner loop: sequentially per thread (#pragma acc loop seq)
— The inner loop was never parallelized!
— Rule of thumb: expose as much parallelism as possible
3 Levels of Parallelism

[Diagram: gangs contain workers; workers contain vector lanes]

— Vector: vector threads work in lockstep (SIMD/SIMT parallelism)
— Worker: has 1 or more vectors; workers share common resources (e.g. cache)
— Gang: has 1 or more workers; multiple gangs work independently from each other
Member of the Helmholtz Association
CUDA Execution Model
Sofuware Hardware
Thread Scalar Processor Threads executed by scalar processors (CUDA cores) Thread Block Multiprocessor Thread blocks: Executed on multiprocessors (SM) Do not migrate Several concurrent thread blocks can reside on multiprocessor Limit: Multiprocessor resources (register file; shared memory) Grid Device Kernel launched as grid of thread blocks Blocks, grids: Multiple dimensions
Andreas Herten | OpenACC Tutorial | 31 August 2017 # 76 111
Member of the Helmholtz Association
map(||acc,||<<<>>>)
In general: the compiler is free to do what it thinks is best. Usually:
gang    Mapped to blocks (coarse grain)
worker  Mapped to threads (fine grain)
vector  Mapped to threads (fine SIMD/SIMT)
seq     No parallelism; sequential
The exact mapping is compiler dependent. Performance tips:
— Use vector size divisible by 32 — Block size: num_workers × vector_length
Andreas Herten | OpenACC Tutorial | 31 August 2017 # 77 111
Member of the Helmholtz Association
Specify configuration of threads
Three clauses of the parallel regions (parallel, kernels) change the distribution/configuration of groups of threads:
Presence of keyword: distribute using this level
Optional size: control size of the parallel entity
OpenACC: gang, worker, vector
#pragma acc parallel loop gang vector
Also: worker. Sizes: num_gangs(n), num_workers(n), vector_length(n)
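A hedged sketch of how the size clauses combine (the numbers are made up for illustration; on CUDA devices the block size would be num_workers × vector_length = 4 × 32 = 128):

    #pragma acc parallel loop gang num_gangs(1024) num_workers(4) vector_length(32)
    for (int i = 0; i < n; i++) {
        #pragma acc loop worker vector
        for (int j = 0; j < m; j++) {
            c[i*m + j] = a[i*m + j] + b[i*m + j];
        }
    }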
Andreas Herten | OpenACC Tutorial | 31 August 2017 # 78 111
Member of the Helmholtz Association
110, Accelerator kernel generated
     Generating Tesla code
    110, Generating reduction(max:error)
    111, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
    114, #pragma acc loop seq
    114, Complex loop carried dependence of Anew-> prevents parallelization
Compiler reports configuration of parallel entities
— Gang mapped to blockIdx.x — Vector mapped to threadIdx.x — Worker not used
Here: 128 threads per block, with as many blocks as needed. 128 seems to be the default for the Tesla/NVIDIA target.
Andreas Herten | OpenACC Tutorial | 31 August 2017 # 79 111
Member of the Helmholtz Association
Unsequentialize inner loop
TASK 6
Add a vector clause to the inner loop
Study the result with the profiler
Task 6: More Parallelism
Change to Task6/ directory
Work on TODOs
Compile: make
Submit to the batch system: make run
Generate profile with make profile_tofile
Andreas Herten | OpenACC Tutorial | 31 August 2017 # 80 111
Member of the Helmholtz Association
Compiler Output
$ make
pgcc -DUSE_DOUBLE -Minfo=accel -fast -acc -ta=tesla:cc60 poisson2d.c poisson2d_reference.o -o poisson2d
poisson2d.c:
main:
    104, Generating create(Anew[:ny*nx])
         Generating copyin(rhs[:ny*nx])
         Generating copy(A[:ny*nx])
    110, Accelerator kernel generated
         Generating Tesla code
        110, Generating reduction(max:error)
        111, #pragma acc loop gang /* blockIdx.x */
        114, #pragma acc loop vector(128) /* threadIdx.x */
...
Andreas Herten | OpenACC Tutorial | 31 August 2017 # 81 111
Member of the Helmholtz Association
Run Result
$ make run
bsub -I -R "rusage[ngpus_shared=20]" ./poisson2d
Job <4490> is submitted to default queue <normal.i>.
<<Waiting for dispatch ...>>
<<Starting on juronc11>>
Jacobi relaxation calculation: max 500 iterations on 2048 x 2048 mesh
Calculate reference solution and time with serial CPU execution.
0, 0.249999
100, 0.249760
200, 0...
Calculate current execution.
0, 0.249999
100, 0.249760
200, 0...
2048x2048: Ref: 111.7712 s, This: 0.9257 s, speedup: 120.74
Andreas Herten | OpenACC Tutorial | 31 August 2017 # 82 111
Member of the Helmholtz Association
Memory in batch
Coalesced access good
— Threads of warp (group of 32 contiguous threads) access adjacent words — Few transactions, high utilization
Uncoalesced access bad
— Threads of warp access scattered words — Many transactions, low utilization
Best performance: consecutive threads (threadIdx.x) should access contiguous memory words
[Figure: warps of threads 0 1 … 31 accessing adjacent vs. scattered memory words]
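As a hedged illustration (array names and bounds invented), the difference is which loop index varies across consecutive vector threads:

    // Coalesced: consecutive vector threads (j) touch adjacent words
    #pragma acc parallel loop gang
    for (int i = 0; i < n; i++)
        #pragma acc loop vector
        for (int j = 0; j < m; j++)
            out[i*m + j] = in[i*m + j];

    // Uncoalesced: consecutive vector threads (i) are m words apart
    #pragma acc parallel loop gang
    for (int j = 0; j < m; j++)
        #pragma acc loop vector
        for (int i = 0; i < n; i++)
            out[i*m + j] = in[i*m + j];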
Andreas Herten | OpenACC Tutorial | 31 August 2017 # 83 111
Member of the Helmholtz Association
Coalesced or uncoalesced, that is the question
#pragma acc parallel loop reduction(max:error)
for (int ix = ix_start; ix < ix_end; ix++) {
    #pragma acc loop vector
    for (int iy = iy_start; iy < iy_end; iy++) {
        Anew[iy*nx+ix] = -0.25 * (rhs[iy*nx+ix] - ( A[iy*nx+ix+1] + A[iy*nx+ix-1]
                                                  + A[(iy-1)*nx+ix] + A[(iy+1)*nx+ix]));
        //...
Fast-running index: ix. Slow-running index: iy. But the vector loop runs over iy, so consecutive threads access memory locations that are far apart!
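One possible shape of the fix, as a hedged sketch (the actual task code may differ): interchange the loops so the vector loop runs over ix, the fast index.

    #pragma acc parallel loop reduction(max:error)
    for (int iy = iy_start; iy < iy_end; iy++) {
        #pragma acc loop vector
        for (int ix = ix_start; ix < ix_end; ix++) {
            Anew[iy*nx+ix] = -0.25 * (rhs[iy*nx+ix] - ( A[iy*nx+ix+1] + A[iy*nx+ix-1]
                                                      + A[(iy-1)*nx+ix] + A[(iy+1)*nx+ix]));
            error = fmaxr(error, fabsr(Anew[iy*nx+ix] - A[iy*nx+ix]));
        }
    }

Consecutive threads now read adjacent elements of A, Anew, and rhs.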
Andreas Herten | OpenACC Tutorial | 31 August 2017 # 84 111
Member of the Helmholtz Association
Loop change
TASK 7
Interchange the loop order for the Jacobi loops
Also: compare to the loop-fixed CPU reference version
Task 7: Loop Ordering
Change to Task7/ directory
Work on TODOs
Compile: make
Submit to the batch system: make run
Andreas Herten | OpenACC Tutorial | 31 August 2017 # 85 111
Member of the Helmholtz Association
Compiler output (unchanged)
$ make
pgcc -DUSE_DOUBLE -Minfo=accel -fast -acc -ta=tesla:cc60 poisson2d.c poisson2d_reference.o -o poisson2d
poisson2d.c:
main:
    104, Generating create(Anew[:ny*nx])
         Generating copyin(rhs[:ny*nx])
         Generating copy(A[:ny*nx])
    110, Accelerator kernel generated
         Generating Tesla code
        110, Generating reduction(max:error)
        111, #pragma acc loop gang /* blockIdx.x */
        114, #pragma acc loop vector(128) /* threadIdx.x */
...
Andreas Herten | OpenACC Tutorial | 31 August 2017 # 86 111
Member of the Helmholtz Association
Run Result
$ make run
bsub -I -R "rusage[ngpus_shared=20]" ./poisson2d
Job <4490> is submitted to default queue <normal.i>.
<<Waiting for dispatch ...>>
<<Starting on juronc11>>
Jacobi relaxation calculation: max 500 iterations on 2048 x 2048 mesh
Calculate reference solution and time with serial CPU execution.
0, 0.249999
100, 0.249760
200, 0...
Calculate current execution.
0, 0.249999
100, 0.249760
200, 0...
2048x2048: Ref: 113.0214 s, This: 0.3284 s, speedup: 344.15
Andreas Herten | OpenACC Tutorial | 31 August 2017 # 87 111
Member of the Helmholtz Association
Run Result II
$ make run
bsub -I -R "rusage[ngpus_shared=20]" ./poisson2d
Job <4490> is submitted to default queue <normal.i>.
<<Waiting for dispatch ...>>
<<Starting on juronc11>>
Jacobi relaxation calculation: max 500 iterations on 2048 x 2048 mesh
Calculate reference solution and time with serial CPU execution.
0, 0.249999
100, 0.249760
200, 0...
Calculate current execution.
0, 0.249999
100, 0.249760
200, 0...
2048x2048: Ref: 6.8080 s, This: 0.2609 s, speedup: 26.10

(The reference time dropped because the CPU reference now uses the loop-fixed version as well.)
Andreas Herten | OpenACC Tutorial | 31 August 2017 # 88 111
Member of the Helmholtz Association
Pageability
Host memory allocated with malloc() is pageable
— Memory pages can be moved by the OS kernel, e.g. swapped out to disk
— Adds a level of indirection
NVIDIA GPUs can use page-locked memory (pinned memory)
+ Faster (safety guards are skipped)
+ Interleaving of execution and copy (asynchronous)
+ Can be mapped directly into GPU memory
− Scarce resource; OS performance could degrade
OpenACC: Very easy to use pinned memory
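With the PGI compiler used in this tutorial, pinned or unified (managed) memory can be requested as a sub-option of the target flag; a sketch, assuming the same build setup as the earlier tasks:

    $ pgcc -fast -acc -ta=tesla:cc60,pinned  poisson2d.c -o poisson2d   # page-locked host memory
    $ pgcc -fast -acc -ta=tesla:cc60,managed poisson2d.c -o poisson2d   # CUDA unified memory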
Andreas Herten | OpenACC Tutorial | 31 August 2017 # 89 111
Member of the Helmholtz Association
Pinned Memory
TASK 7'
Compare performance with and without pinned memory
Also test unified memory again
Task 7': Pinned Memory
Like in Task 7, but change the compilation to include pinned or managed
Submit to the batch system: make run
Andreas Herten | OpenACC Tutorial | 31 August 2017 # 90 111
Member of the Helmholtz Association
Andreas Herten | OpenACC Tutorial | 31 August 2017 # 91 111
Member of the Helmholtz Association
Andreas Herten | OpenACC Tutorial | 31 August 2017 # 92 111
Member of the Helmholtz Association
OpenACC can operate together with
— Applications — Libraries — CUDA
Andreas Herten | OpenACC Tutorial | 31 August 2017 # 93 111
Member of the Helmholtz Association
OpenACC’s Rosetta Stone
Background
— GPU and CPU are different devices and have different memory → Distinct address spaces
OpenACC hides handling of addresses from user
— For every chunk of accelerated data, two addresses exist — One for CPU data, one for GPU data — OpenACC uses appropriate address in accelerated kernel
But: the automatic handling does not work outside of OpenACC (OpenACC will default to the host address) → host_data use_device exposes the address of the GPU device data for the scope of the directive
Andreas Herten | OpenACC Tutorial | 31 August 2017 # 94 111
Member of the Helmholtz Association
That’s all you need
Usage:
double* foo = new double[N];       // foo on host
#pragma acc data copyin(foo[0:N])  // foo on device
{
    // ...
    #pragma acc host_data use_device(foo)
    some_lib_func(foo);            // receives device address: OK!
    // ...
}
Directive can be used for structured block as well
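A hedged sketch of the structured-block form (function names invented):

    #pragma acc host_data use_device(foo, bar)
    {
        lib_func_a(foo);        // both calls receive device addresses
        lib_func_b(foo, bar);
    }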
Andreas Herten | OpenACC Tutorial | 31 August 2017 # 95 111
Member of the Helmholtz Association
When CUDA is involved
For the inverse case:
— Data has been copied by CUDA or a CUDA-using library — Pointer to data residing on devices is returned → Use this data in OpenACC context
deviceptr clause declares data to be on device
Usage:
float *x;
int n = 4223;
cudaMalloc((void**)&x, (size_t)n * sizeof(float));
// ...
#pragma acc kernels deviceptr(x)
for (int i = 0; i < n; i++) {
    x[i] = i;
}
Andreas Herten | OpenACC Tutorial | 31 August 2017 # 96 111
Member of the Helmholtz Association
Andreas Herten | OpenACC Tutorial | 31 August 2017 # 97 111
Member of the Helmholtz Association
Introduction to BLAS
Use case: Anything linear algebra
BLAS: Basic Linear Algebra Subprograms
— Vector-vector, vector-matrix, matrix-matrix operations
— Specification of routines
— Examples: SAXPY, DGEMV, ZGEMM
→ http://www.netlib.org/blas/
cuBLAS: NVIDIA's linear algebra routines with BLAS interface, readily accelerated
→ http://docs.nvidia.com/cuda/cublas/
Task 8-1: Use cuBLAS for the vector addition, everything else with OpenACC
Andreas Herten | OpenACC Tutorial | 31 August 2017 # 98 111
Member of the Helmholtz Association
cuBLAS OpenACC Interaction
cuBLAS routine used:
cublasDaxpy(cublasHandle_t handle, int n, const double *alpha, const double *x, int incx, double *y, int incy)
handle encapsulates GPU auxiliary data; it needs to be created and destroyed with cublasCreate() and cublasDestroy()
x and y point to addresses on device!
cuBLAS library needs to be linked with -lcublas
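Putting the pieces together, a hedged sketch of the call inside an OpenACC data region (variable names invented; error checking omitted):

    #include <cublas_v2.h>

    cublasHandle_t handle;
    cublasCreate(&handle);

    const double alpha = 1.0;
    #pragma acc data copyin(x[0:n]) copy(y[0:n])
    {
        #pragma acc host_data use_device(x, y)
        cublasDaxpy(handle, n, &alpha, x, 1, y, 1);  // y = alpha*x + y, on device data
    }
    cublasDestroy(handle);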
Andreas Herten | OpenACC Tutorial | 31 August 2017 # 99 111
Member of the Helmholtz Association
Vector Addition with cuBLAS
TASK 8-1
Use cuBLAS for the vector addition
Task 8-1: OpenACC + cuBLAS
Change to Task8-1/ directory
Work on TODOs in vecAddRed.c
— Use host_data use_device to provide the correct pointer
— Check the cuBLAS documentation for details on cublasDaxpy()
Compile: make
Submit to the batch system: make brun
Andreas Herten | OpenACC Tutorial | 31 August 2017 # 100 111
Member of the Helmholtz Association
CUDA Need-to-Know
Use case:
— Working on legacy code — Need the raw power (/flexibility) of CUDA
CUDA need-to-knows:
— Thread → Block → Grid: the total number of threads should map to your problem; threads are always given per block
— A kernel is executed by every thread on the GPU device; the number of kernel threads is set with the triple chevron syntax:
kernel<<<nBlocks, nThreads>>>(arg1, arg2, ...)
— Kernel: function with the __global__ prefix; aware of its index via built-in variables, e.g. threadIdx.x → http://docs.nvidia.com/cuda/
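A minimal sketch (kernel and variable names invented) of a vector-addition kernel and its launch:

    // Executed by every thread; each thread handles one element
    __global__ void vecAdd(const double *a, const double *b, double *c, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
        if (i < n)                                      // guard against overshoot
            c[i] = a[i] + b[i];
    }

    // Launch enough blocks of 256 threads to cover n elements
    vecAdd<<<(n + 255) / 256, 256>>>(a, b, c, n);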
Andreas Herten | OpenACC Tutorial | 31 August 2017 # 101 111
Member of the Helmholtz Association
Vector Addition with CUDA Kernel
TASK 8-2
CUDA kernel for the vector addition, rest OpenACC
Marrying CUDA C and OpenACC:
— All direct CUDA interaction is wrapped in the wrapper file cudaWrapper.cu, compiled with nvcc to an object file (-c)
— vecAddRed.c calls an external function from cudaWrapper.cu (extern)
Task 8-2: OpenACC + CUDA
Change to Task8-2/ directory
Work on TODOs in vecAddRed.c and cudaWrapper.cu
— Use host_data use_device to provide the correct pointer
— Implement the computation in the kernel; implement the call of the kernel
Compile: make
Submit to the batch system: make brun
Andreas Herten | OpenACC Tutorial | 31 August 2017 # 102 111
Member of the Helmholtz Association
Vector Addition with Thrust
TASK 8-3
Thrust
— Template library for CUDA C/C++ (similar to the STL)
— Offers many pre-made algorithms for popular computing tasks
— Usually works with C++ iterators, but understands C arrays as well
→ http://thrust.github.io/
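As a hedged sketch (pointer and size names invented), a reduction over an array that already lives on the device, e.g. a pointer obtained via host_data use_device:

    #include <thrust/reduce.h>
    #include <thrust/execution_policy.h>

    // x holds a device address; thrust::device makes reduce run on the GPU
    double sum = thrust::reduce(thrust::device, x, x + n, 0.0);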
Use Thrust for the reduction, everything else of the vector addition with OpenACC
Task 8-3: OpenACC + CUDA
Change to Task8-3/ directory
Work on TODOs in vecAddRed.c and thrustWrapper.cu
— Use host_data use_device to provide the correct pointer
— Implement the call to thrust::reduce using c_ptr
Compile: make Submit to the batch system: make brun
Andreas Herten | OpenACC Tutorial | 31 August 2017 # 103 111
Member of the Helmholtz Association
Stating the Problem
We want to solve the Poisson equation ΔΦ(x, y) = −ρ(x, y) with periodic boundary conditions in x and y
Needed, e.g., for finding the electrostatic potential Φ for a given charge distribution ρ
Model problem: ρ(x, y) = cos(4πx) sin(2πy), (x, y) ∈ [0, 1)²
Analytically known: Φ(x, y) = Φ₀ cos(4πx) sin(2πy)
Let's solve the Poisson equation with a Fourier transform!
Andreas Herten | OpenACC Tutorial | 31 August 2017 # 104 111
Member of the Helmholtz Association
Introduction to Fourier Transforms
Discrete Fourier Transform and re-transform:

    f̂_k = ∑_{j=0}^{N−1} f_j e^(−2πik·j/N)   ⇔   f_j = ∑_{k=0}^{N−1} f̂_k e^(2πij·k/N)

Time for all f̂_k: O(N²)
Fast Fourier Transform: recursive splitting → O(N log N)
Finding derivatives in Fourier space:

    f′_j = ∑_{k=0}^{N−1} ik f̂_k e^(2πij·k/N)

It's just multiplying by ik!
Andreas Herten | OpenACC Tutorial | 31 August 2017 # 105 111
Member of the Helmholtz Association
Plan for FFT Poisson Solution
Start with the charge density ρ
1. Fourier-transform ρ:                 ρ̂ ← F(ρ)                [cuFFT]
2. Integrate ρ̂ in Fourier space twice:  Φ̂ ← −ρ̂ / (k_x² + k_y²)  [OpenACC]
3. Inverse Fourier-transform Φ̂:         Φ ← F⁻¹(Φ̂)              [cuFFT]
Andreas Herten | OpenACC Tutorial | 31 August 2017 # 106 111
Member of the Helmholtz Association
cuFFT
cuFFT: NVIDIA's (Fast) Fourier Transform library
— 1D, 2D, 3D transforms; complex and real data types
— Asynchronous execution
— Modeled after the FFTW library (API)
— Part of the CUDA Toolkit
→ https://developer.nvidia.com/cufft

cufftDoubleComplex *src, *tgt;               // device data!
cufftHandle plan;
// Set up 2D complex-to-complex transform with dimensions (Nx, Ny)
cufftPlan2d(&plan, Nx, Ny, CUFFT_Z2Z);
cufftExecZ2Z(plan, src, tgt, CUFFT_FORWARD); // FFT
cufftExecZ2Z(plan, tgt, tgt, CUFFT_INVERSE); // inverse FFT, in-place
cufftDestroy(plan);                          // clean-up
Andreas Herten | OpenACC Tutorial | 31 August 2017 # 107 111
Member of the Helmholtz Association
Synchronizing cuFFT
CUDA streams enable interleaving of computational tasks
cuFFT uses streams for asynchronous execution
cuFFT runs in the default CUDA stream; OpenACC does not → trouble
⇒ Force cuFFT onto the OpenACC stream
#include <openacc.h>
// Obtain the OpenACC default stream id
cudaStream_t accStream = (cudaStream_t) acc_get_cuda_stream(acc_async_sync);
// Execute all cuFFT calls of this plan on that stream
cufftSetStream(plan, accStream);
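A hedged sketch of how the transform could then combine with OpenACC data handling (array, plan, and dimension names invented; the task code may differ):

    cufftDoubleComplex *rho = malloc(NX * NY * sizeof(cufftDoubleComplex));
    // ... fill rho on the host ...
    #pragma acc data copy(rho[0:NX*NY])
    {
        #pragma acc host_data use_device(rho)
        cufftExecZ2Z(plan, rho, rho, CUFFT_FORWARD); // in-place FFT on the device copy
    }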
Andreas Herten | OpenACC Tutorial | 31 August 2017 # 108 111
Member of the Helmholtz Association
OpenACC and cuFFT
TASK 8-4
Use case: Fourier transforms
Use cuFFT and OpenACC to solve Poisson's equation
Task 8-4: OpenACC + cuFFT
Change to Task8-4/ directory
Work on TODOs in poisson.c
solveRSpace: force cuFFT onto the correct stream; implement data handling with host_data use_device
solveKSpace: implement data handling and parallelism
Compile: make
Submit to the batch system: make brun
Andreas Herten | OpenACC Tutorial | 31 August 2017 # 109 111
Member of the Helmholtz Association
Andreas Herten | OpenACC Tutorial | 31 August 2017 # 110 111
Member of the Helmholtz Association
OpenACC directives and clauses
#pragma acc parallel loop copyin(A[0:N]) reduction(max:err) vector
Start easy, optimize from there
PGI / NVIDIA Visual Profiler help to find bottlenecks
OpenACC is interoperable with other GPU programming models
Don't forget the CPU version!
a.herten@fz-juelich.de
Andreas Herten | OpenACC Tutorial | 31 August 2017 # 111 111
Member of the Helmholtz Association
Appendix: List of Tasks, Glossary
Andreas Herten | OpenACC Tutorial | 31 August 2017 # 1 8
Member of the Helmholtz Association
Task 0⋆: Setup
Task 0: Getting Started
Task 1: Analyze Application
Task 2: A First Parallel Loop
Task 3: More Parallel Loops
Task 4: Data Copies
Task 5: Data Region
Task 6: More Parallelism
Task 7: Loop Ordering
Task 7': Pinned Memory
Task 8-1: OpenACC + cuBLAS
Task 8-2: OpenACC + CUDA
Task 8-3: OpenACC + CUDA
Task 8-4: OpenACC + cuFFT
Andreas Herten | OpenACC Tutorial | 31 August 2017 # 2 8
Member of the Helmholtz Association
API A programmatic interface to software by well-defined functions.
CUDA Computing platform for GPUs from NVIDIA. Provides,
among others, CUDA C/C++. 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 33, 46, 79, 93, 94, 95, 96, 131, 152, 156, 157, 162, 163, 164, 169, 170, 176, 177
GCC The GNU Compiler Collection, the collection of open
source compilers, among others for C and Fortran. 45, 49
Andreas Herten | OpenACC Tutorial | 31 August 2017 # 3 8
Member of the Helmholtz Association
NVIDIA US technology company creating GPUs. 39, 79, 92, 97,
98, 105, 106, 146, 147, 148, 159, 169, 173, 174, 177
OpenACC Directive-based programming, primarily for many-core machines.
50, 51, 52, 53, 62, 69, 70, 72, 74, 76, 79, 82, 87, 97, 98, 99, 100, 108, 117, 124, 127, 131, 132, 146, 147, 148, 150, 152, 153, 154, 156, 157, 159, 160, 161, 163, 164, 167, 168, 170, 171, 173, 174, 176
OpenCL The Open Computing Language. Framework for writing
code for heterogeneous architectures (CPU, GPU, DSP,
FPGA). The alternative to CUDA. 33
Andreas Herten | OpenACC Tutorial | 31 August 2017 # 4 8
Member of the Helmholtz Association
OpenMP Directive-based programming, primarily for
multi-threaded machines. 2, 33, 40, 41, 42, 50, 84, 85, 86
Pascal GPU architecture from NVIDIA (announced 2016). 93,
94, 95, 96
Thrust A parallel algorithms library for (among others) GPUs.
See https://thrust.github.io/. 33
CPU Central Processing Unit. 4, 5, 6, 7, 8, 9, 10, 11, 12, 45,
50, 79, 93, 94, 95, 96, 140, 143, 153, 154, 173, 174, 177
Andreas Herten | OpenACC Tutorial | 31 August 2017 # 5 8
Member of the Helmholtz Association
GPU Graphics Processing Unit. 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,
23, 24, 25, 26, 33, 45, 48, 50, 70, 76, 79, 92, 93, 94, 95, 96, 97, 98, 105, 115, 116, 118, 146, 147, 148, 153, 154, 160, 162, 173, 174, 177
Andreas Herten | OpenACC Tutorial | 31 August 2017 # 6 8