L8179 ZERO TO GPU HERO WITH OPENACC Jeff Larkin, GTC 2019, March - PowerPoint PPT Presentation

L8179 – ZERO TO GPU HERO WITH OPENACC Jeff Larkin, GTC 2019, March 2019

OUTLINE Topics to be covered ▪ What is OpenACC ▪ Profile-driven Development ▪ OpenACC with CUDA Unified Memory ▪ OpenACC Data Directives ▪ OpenACC Loop Optimizations ▪ Where to Get Help

ABOUT THIS SESSION ▪ The objective of this session is to give you a brief introduction to OpenACC programming for NVIDIA GPUs ▪ This is an instructor-led session; there will be no hands-on portion ▪ For hands-on experience, please consider attending DLIT903 - OpenACC - 2X in 4 Steps or L9112 - Programming GPU-Accelerated POWER Systems with OpenACC if your badge allows ▪ Feel free to interrupt with questions

INTRODUCTION TO OPENACC

OpenACC is a directives-based programming approach to parallel computing designed for performance and portability on CPUs and GPUs for HPC.

Add Simple Compiler Directive

    main()
    {
      <serial code>
      #pragma acc kernels
      {
        <parallel code>
      }
    }
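As a concrete instance of the skeleton above, here is a minimal sketch of a vector update wrapped in a kernels region. The function name `saxpy`, its signature, and the data clauses are illustrative assumptions, not taken from the slides.

```c
/* y = a*x + y for vectors of length n (illustrative helper, not from the slides). */
void saxpy(int n, float a, const float *restrict x, float *restrict y)
{
    /* The kernels region asks the compiler to look for parallelism in the
       enclosed code. The data clauses state the array extents, which the
       compiler cannot infer from bare pointers. */
    #pragma acc kernels copyin(x[0:n]) copy(y[0:n])
    {
        for (int i = 0; i < n; i++) {
            y[i] = a * x[i] + y[i];
        }
    }
}
```

A compiler without OpenACC support skips the pragma and runs the loop sequentially, producing the same result.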

3 WAYS TO ACCELERATE APPLICATIONS
Applications
▪ Libraries: Easy to use, Most Performance
▪ Compiler Directives (OpenACC): Easy to use, Portable code
▪ Programming Languages: Most Performance, Most Flexibility

OPENACC PORTABILITY Describing a generic parallel machine
▪ OpenACC is designed to be portable to many existing and future parallel platforms
▪ The programmer need not think about specific hardware details, but rather express the parallelism in generic terms
▪ An OpenACC program runs on a host (typically a CPU) that manages one or more parallel devices (GPUs, etc.). The host and device(s) are logically thought of as having separate memories.
(Diagram: Host with Host Memory, Device with Device Memory)

OPENACC Three major strengths Low Learning Curve Incremental Single Source

OPENACC Incremental
▪ Maintain existing sequential code
▪ Add annotations to expose parallelism
▪ After verifying correctness, annotate more of the code

Begin with a working sequential code. Parallelize it with OpenACC. Rerun the code to verify correct behavior, remove/alter OpenACC code as needed.

Enhance Sequential Code

    #pragma acc parallel loop
    for( i = 0; i < N; i++ )
    {
      < loop code >
    }

    #pragma acc parallel loop
    for( i = 0; i < N; i++ )
    {
      < loop code >
    }
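The verify step of this incremental workflow could be sketched as follows: keep the sequential loop as a reference, annotate a copy, and compare the two outputs. All three function names are hypothetical helpers introduced for illustration.

```c
/* Reference version: the unchanged sequential loop. */
void scale_serial(int n, double s, const double *in, double *out)
{
    for (int i = 0; i < n; i++) {
        out[i] = s * in[i];
    }
}

/* Same loop with a single OpenACC annotation added. */
void scale_acc(int n, double s, const double *in, double *out)
{
    #pragma acc parallel loop copyin(in[0:n]) copyout(out[0:n])
    for (int i = 0; i < n; i++) {
        out[i] = s * in[i];
    }
}

/* Largest absolute difference between two arrays; 0.0 means identical. */
double max_abs_diff(int n, const double *a, const double *b)
{
    double worst = 0.0;
    for (int i = 0; i < n; i++) {
        double d = a[i] > b[i] ? a[i] - b[i] : b[i] - a[i];
        if (d > worst) {
            worst = d;
        }
    }
    return worst;
}
```

Once `max_abs_diff` reports an acceptable difference for the first annotated loop, the next loop can be annotated and checked the same way.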

OPENACC Low Learning Curve Incremental Single Source ▪ Maintain existing sequential code ▪ Add annotations to expose parallelism ▪ After verifying correctness, annotate more of the code

OPENACC Single Source
▪ Rebuild the same code on multiple architectures
▪ Compiler determines how to parallelize for the desired machine
▪ Sequential code is maintained

The compiler can ignore your OpenACC code additions, so the same code can be used for parallel or sequential execution.

Supported Platforms: POWER, Sunway, x86 CPU, x86 Xeon Phi, NVIDIA GPU, PEZY-SC

    int main(){
      ...
      for(int i = 0; i < N; i++)
        < loop code >
    }

    int main(){
      ...
      #pragma acc parallel loop
      for(int i = 0; i < N; i++)
        < loop code >
    }
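One way to observe the single-source property is the `_OPENACC` preprocessor macro, which the OpenACC specification requires conforming compilers to define when OpenACC is enabled. The sketch below is an illustration under that assumption; both function names are hypothetical.

```c
/* Returns 1 if this file was compiled with OpenACC enabled, 0 otherwise. */
int built_with_openacc(void)
{
#ifdef _OPENACC
    return 1;
#else
    return 0;
#endif
}

/* Squares every element in place. The source is identical for both builds;
   without OpenACC the pragma is ignored and the loop runs sequentially. */
void square_all(int n, double *v)
{
    #pragma acc parallel loop copy(v[0:n])
    for (int i = 0; i < n; i++) {
        v[i] = v[i] * v[i];
    }
}
```

The array contents after `square_all` should be the same in either build; only `built_with_openacc` differs.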

OPENACC
Low Learning Curve
Incremental ▪ Maintain existing sequential code ▪ Add annotations to expose parallelism ▪ After verifying correctness, annotate more of the code
Single Source ▪ Rebuild the same code on multiple architectures ▪ Compiler determines how to parallelize for the desired machine ▪ Sequential code is maintained

OPENACC Low Learning Curve
▪ OpenACC is meant to be easy to use, and easy to learn
▪ Programmer remains in familiar C, C++, or Fortran
▪ No reason to learn low-level details of the hardware.

The programmer will give hints to the compiler about which parts of the code to parallelize. The compiler will then generate parallelism for the target parallel hardware.

    int main(){
      <sequential code>
      #pragma acc kernels      (Compiler Hint)
      {
        <parallel code>
      }
    }

(Diagram: CPU, Parallel Hardware)

OPENACC
Low Learning Curve ▪ OpenACC is meant to be easy to use, and easy to learn ▪ Programmer remains in familiar C, C++, or Fortran ▪ No reason to learn low-level details of the hardware.
Incremental ▪ Maintain existing sequential code ▪ Add annotations to expose parallelism ▪ After verifying correctness, annotate more of the code
Single Source ▪ Rebuild the same code on multiple architectures ▪ Compiler determines how to parallelize for the desired machine ▪ Sequential code is maintained

OPENACC SUCCESSES
▪ LSDalton (Quantum Chemistry, Aarhus University): 12X speedup, 1 week
▪ PowerGrid (Medical Imaging, University of Illinois): 40 days to 2 hours
▪ COSMO (Weather and Climate, MeteoSwiss, CSCS): 40X speedup, 3X energy efficiency
▪ INCOMP3D (CFD, NC State University): 4X speedup
▪ NekCEM (Comp Electromagnetics, Argonne National Lab): 2.5X speedup, 60% less energy
▪ MAESTRO, CASTRO (Astrophysics, Stony Brook University): 4.4X speedup, 4 weeks effort
▪ CloverLeaf (Comp Hydrodynamics, AWE): 4X speedup, Single CPU/GPU code
▪ FINE/Turbo (CFD, NUMECA International): 10X faster routines, 2X faster app

OPENACC SYNTAX

OPENACC SYNTAX Syntax for using OpenACC directives in code

    C/C++:    #pragma acc directive clauses
              <code>
    Fortran:  !$acc directive clauses
              <code>

▪ A pragma in C/C++ gives instructions to the compiler on how to compile the code. Compilers that do not understand a particular pragma can freely ignore it.
▪ A directive in Fortran is a specially formatted comment that likewise instructs the compiler in its compilation of the code and can be freely ignored.
▪ “acc” informs the compiler that what will come is an OpenACC directive
▪ Directives are commands in OpenACC for altering our code.
▪ Clauses are specifiers or additions to directives.
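To make the directive/clause split concrete, the following sketch annotates a dot product in C. The function `dot` is an illustrative helper that does not appear in the slides; `parallel loop` is the directive, and `copyin` and `reduction` are clauses that refine it.

```c
/* Dot product of two length-n vectors (illustrative helper). */
double dot(int n, const double *x, const double *y)
{
    double sum = 0.0;
    /* directive: parallel loop
       clauses:   copyin(...)       the arrays are only read on the device
                  reduction(+:sum)  partial sums are combined into sum */
    #pragma acc parallel loop copyin(x[0:n], y[0:n]) reduction(+:sum)
    for (int i = 0; i < n; i++) {
        sum += x[i] * y[i];
    }
    return sum;
}
```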

EXAMPLE CODE

LAPLACE HEAT TRANSFER Introduction to lab code - visual
We will observe a simple simulation of heat distributing across a metal plate. We will apply a consistent heat to the top of the plate. Then, we will simulate the heat distributing across the plate.
(Figure: heated plate, color scale from Very Hot to Room Temp)

EXAMPLE: JACOBI ITERATION
▪ Iteratively converges to correct value (e.g. Temperature), by computing new values at each point from the average of neighboring points.
▪ Common, useful algorithm
▪ Example: Solve Laplace equation in 2D: $\nabla^2 f(x, y) = 0$

$$A_{k+1}(i,j) = \frac{A_k(i-1,j) + A_k(i+1,j) + A_k(i,j-1) + A_k(i,j+1)}{4}$$

(Stencil diagram: A(i,j) with neighbors A(i-1,j), A(i+1,j), A(i,j-1), A(i,j+1))
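The averaging rule can be motivated by a standard second-order central-difference discretization of the Laplace equation. The uniform grid spacing $h$ below is an assumption introduced for this derivation; it does not appear in the slides.

```latex
% Central differences in both directions on a uniform grid with spacing h:
\nabla^2 A \approx
  \frac{A(i+1,j) - 2A(i,j) + A(i-1,j)}{h^2}
+ \frac{A(i,j+1) - 2A(i,j) + A(i,j-1)}{h^2} = 0

% Multiply by h^2 and solve for the center point:
A(i,j) = \frac{A(i-1,j) + A(i+1,j) + A(i,j-1) + A(i,j+1)}{4}

% Jacobi iteration: evaluate the right-hand side with iterate k
% to obtain iterate k+1 at every interior point.
A_{k+1}(i,j) = \frac{A_k(i-1,j) + A_k(i+1,j) + A_k(i,j-1) + A_k(i,j+1)}{4}
```

Because every new value depends only on the previous iterate, all interior points can be updated independently, which is what makes the method a natural candidate for parallelization.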

JACOBI ITERATION: C CODE

    // Iterate until converged
    while ( err > tol && iter < iter_max ) {
      err=0.0;

      // Iterate across matrix elements
      for( int j = 1; j < n-1; j++) {
        for(int i = 1; i < m-1; i++) {

          // Calculate new value from neighbors
          Anew[j][i] = 0.25 * (A[j][i+1] + A[j][i-1] +
                               A[j-1][i] + A[j+1][i]);

          // Compute max error for convergence
          err = max(err, abs(Anew[j][i] - A[j][i]));
        }
      }

      // Swap input/output arrays
      for( int j = 1; j < n-1; j++) {
        for( int i = 1; i < m-1; i++ ) {
          A[j][i] = Anew[j][i];
        }
      }

      iter++;
    }
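The slide shows only the iteration loop. A self-contained sketch that can be compiled and run might look like the following. The grid size, the boundary values (top row held at 1.0, everything else starting at 0.0), and the function name `jacobi_plate` are assumptions made for illustration; the loop body follows the slide.

```c
#define N 16   /* rows (illustrative size) */
#define M 16   /* columns (illustrative size) */

/* Runs Jacobi sweeps on an N x M plate whose top row is held at 1.0.
   Returns the number of sweeps performed and stores the last maximum
   change in *final_err. */
int jacobi_plate(double tol, int iter_max, double *final_err)
{
    static double A[N][M], Anew[N][M];

    /* Assumed initial state: cold plate with a hot top edge. */
    for (int j = 0; j < N; j++) {
        for (int i = 0; i < M; i++) {
            A[j][i] = (j == 0) ? 1.0 : 0.0;
            Anew[j][i] = A[j][i];
        }
    }

    double err = tol + 1.0;   /* force at least one sweep */
    int iter = 0;
    while (err > tol && iter < iter_max) {
        err = 0.0;
        for (int j = 1; j < N - 1; j++) {
            for (int i = 1; i < M - 1; i++) {
                /* New value is the average of the four neighbors. */
                Anew[j][i] = 0.25 * (A[j][i + 1] + A[j][i - 1] +
                                     A[j - 1][i] + A[j + 1][i]);
                double diff = Anew[j][i] - A[j][i];
                if (diff < 0.0) {
                    diff = -diff;
                }
                if (diff > err) {
                    err = diff;   /* track the largest change */
                }
            }
        }
        /* Copy the new interior values back for the next sweep. */
        for (int j = 1; j < N - 1; j++) {
            for (int i = 1; i < M - 1; i++) {
                A[j][i] = Anew[j][i];
            }
        }
        iter++;
    }
    *final_err = err;
    return iter;
}
```

The absolute value and maximum are written out by hand so the sketch needs no math library.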

PROFILE-DRIVEN DEVELOPMENT

OPENACC DEVELOPMENT CYCLE
▪ Analyze your code to determine most likely places needing parallelization or optimization.
▪ Parallelize your code by starting with the most time consuming parts and check for correctness.
▪ Optimize your code to improve observed speed-up from parallelization.
(Diagram: cycle of Analyze, Parallelize, Optimize)

PROFILING SEQUENTIAL CODE
Profile Your Code
Obtain detailed information about how the code ran. This can include information such as:
▪ Total runtime
▪ Runtime of individual routines
▪ Hardware counters
Identify the portions of code that took the longest to run. We want to focus on these “hotspots” when parallelizing.

Lab Code: Laplace Heat Transfer
Total Runtime: 39.43 seconds
▪ calcNext: 21.49s
▪ swap: 19.04s
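A profiler collects per-routine runtimes automatically. Where none is at hand, a rough manual stand-in is to time individual calls with the standard `clock()` function, as in this sketch. Both `busy_work` and `seconds_in` are hypothetical names; `busy_work` merely stands in for a routine such as the lab code's hotspots.

```c
#include <time.h>

/* Stand-in for a routine under study; not from the slides. */
void busy_work(void)
{
    volatile double sink = 0.0;
    for (int i = 0; i < 1000000; i++) {
        sink += (double)i * 0.5;
    }
}

/* CPU time in seconds consumed by one call of the given routine. */
double seconds_in(void (*routine)(void))
{
    clock_t start = clock();
    routine();
    clock_t stop = clock();
    return (double)(stop - start) / (double)CLOCKS_PER_SEC;
}
```

Comparing `seconds_in` across the candidate routines gives a first indication of which one dominates the runtime.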

PROFILING SEQUENTIAL CODE First sight when using PGPROF ▪ Profiling a simple, sequential code ▪ Our sequential program will only run on the CPU ▪ To view information about how our code ran, we should select the “CPU Details” tab

PROFILING SEQUENTIAL CODE CPU Details ▪ Within the “CPU Details” tab, we can see the various parts of our code, and how long they took to run ▪ We can reorganize this info using the three options in the top-right portion of the tab ▪ We will expand this information, and see more details about our code

PROFILING SEQUENTIAL CODE CPU Details ▪ We can see that there are two places that our code is spending most of its time ▪ 21.49 seconds in the “calcNext” function ▪ 19.04 seconds in a memcpy function ▪ The c_mcopy8 that we see is actually a compiler optimization that is being applied to our “swap” function

PROFILING SEQUENTIAL CODE PGPROF ▪ We are also able to select the different elements in the CPU Details by double-clicking to open the associated source code ▪ Here we have selected the “calcNext:37” element, which opened up our code to show the exact line (line 37) that is associated with that element

OPENACC PARALLEL DIRECTIVE

OPENACC PARALLEL DIRECTIVE Expressing parallelism

    #pragma acc parallel
    {

    }

When encountering the parallel directive, the compiler will generate 1 or more parallel gangs, which execute redundantly.
(Diagram: several gangs, each executing the region)
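The effect of redundant execution can be sketched by comparing a bare parallel region with one that also carries the `loop` directive. The function names and the `copyout` clauses are illustrative assumptions; the behavior described in the comments follows the OpenACC specification's gang-redundant mode.

```c
/* Bare parallel region: every gang runs the whole loop, so the same
   values are computed once per gang (redundant work, same result). */
void fill_redundant(int n, float *a)
{
    #pragma acc parallel copyout(a[0:n])
    {
        for (int i = 0; i < n; i++) {
            a[i] = 2.0f * (float)i;
        }
    }
}

/* Adding "loop" shares the iterations among the gangs instead. */
void fill_partitioned(int n, float *a)
{
    #pragma acc parallel loop copyout(a[0:n])
    for (int i = 0; i < n; i++) {
        a[i] = 2.0f * (float)i;
    }
}
```

Both versions produce the same array; only the second one divides the work, which is why a parallel region is normally paired with a loop directive.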
