

SLIDE 1

MPAS on GPUs Using OpenACC: Portability, Scalability & Performance

  • Dr. Raghu Raj Kumar

Project Scientist I & Group Head, Special Technical Projects (STP) Group, National Center for Atmospheric Research

March 2018

SLIDE 2

Outline

  • Motivation & Goals
  • System & Software Specs
  • Approach & Challenges
  • Results
  • Future Plans
  • Questions


SLIDE 3

Motivation & Goals

  • Motivation
      • Prior work: a shallow-water equation solver and a discontinuous Galerkin kernel
      • Showed promising results on GPUs
      • A unified code base was possible, at least for small-scale applications
  • Goals
      • Port MPAS onto GPUs
      • Optimize performance on GPUs
      • No compromise on portability: stay within 10% of the original CPU performance
      • Scale MPAS on GPUs


SLIDE 4

System Specs

  • NVIDIA’s internal PSG cluster
      • Dual-socket Haswell (32 cores/node) with 4 P100s/node, 12 nodes, intra-node PCIe, inter-node FDR InfiniBand
      • Dual-socket Haswell (32 cores/node) with 4 V100s/node, 2 nodes, intra-node PCIe, inter-node FDR InfiniBand
  • NCAR’s Cheyenne supercomputer
      • Dual-socket Broadwell (36 cores/node), 4,032 nodes
  • IBM’s R92 cluster (internal)
      • Minsky: dual-socket Power8 (20 cores/node) with 4 P100s/node, 90+ nodes, intra-node NVLink, inter-node InfiniBand


SLIDE 5

Software Spec: MPAS Dry Dynamical Core

  • Software
      • MPAS Release 5.2
      • Intel Compiler 17.0, PGI Compiler 17.10
  • Dry baroclinic instability test: no physics, no scalar transport
      • The dry-dynamics test case produces baroclinic storms from analytic initial conditions
      • Split dynamics: 2 sub-steps, 3 split steps (see the namelist sketch below)
      • Resolutions: 120 km (40k grid points, dt=720 s), 60 km (163k grid points, dt=360 s), 30 km (655k grid points, dt=180 s), 15 km (2.6M grid points, dt=60 s)
      • Number of vertical levels = 56
      • Double precision (DP) and single precision (SP)
      • Simulation executed for 16 days; performance shown for 1 timestep
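
For concreteness, the time step and splitting above map onto a handful of entries in MPAS-Atmosphere's namelist.atmosphere. A minimal sketch for the 120 km case, assuming the MPAS 5.x option names:

    &nhyd_model
        config_dt = 720.0                  ! 120 km case; 360/180/60 s at 60/30/15 km
        config_dynamics_split_steps = 3    ! split dynamics steps per model timestep
        config_number_of_sub_steps = 2     ! acoustic sub-steps within each split step
    /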


SLIDE 6

Why Did We Choose the Dycore?


  • Execution time: Physics 45-50%, DyCore 50-55%
  • Lines of code: Physics 110,000, DyCore 10,000

[Flow diagram: MPAS = dynamic core + physics schemes, with each scheme's share of execution time: Radiation Short Wave RRTMG_SW (18.83%), Radiation Long Wave RRTMG_LW (16.43%), MicroPhysics WSM6 (9.62%), Convection New Tiedtke (4.19%), Boundary Layer YSU (1.55%), Gravity Wave Drag GWDO (0.71%)]

Flow Diagram by KISTI

SLIDE 7

Dycore: Zoom in

Dycore execution-time breakdown: dyn_tend 32%, diagnostics 20%, large_step 16%, acoustic_step 13%, MPI 8%, substep 5%, small_step 4%, imp_coef 1%, integration setup 1%, moist coefficients 0%.

MPAS-5, 120 km case with the Intel compiler, running 36 MPI ranks on an Intel Xeon E5-2697 v4 “Broadwell” node.


SLIDE 8

Approach

[Workflow diagram:]
  • Baseline: software & architecture; configuration & accuracy
  • Porting: KGen; OpenACC directives
  • Optimize: KGen; profile & analyze; integrate
  • Verification: benchmark; portability check; testing; code refactoring

SLIDE 9

Challenges Faced: Using the Right Directives


The trade-off between directive styles:

  • Quick approach: lower time for porting, reasonable performance
  • Tuned approach: much higher time for porting, improved performance depending on the loop count (up to 50%)

The contrast is illustrated in the sketch below.
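
A minimal Fortran sketch of the two styles, assuming the trade-off is between a compiler-scheduled kernels region and a hand-scheduled parallel loop (the loop body and variable names are illustrative, not the actual MPAS code):

    ! Quick port: a kernels region lets the compiler choose the schedule
    !$acc kernels
    do iCell = 1, nCells
       do k = 1, nVertLevels
          theta(k,iCell) = theta(k,iCell) + dt*tend(k,iCell)
       end do
    end do
    !$acc end kernels

    ! Tuned port: explicitly map the loops onto gangs and vector lanes
    !$acc parallel loop gang vector collapse(2)
    do iCell = 1, nCells
       do k = 1, nVertLevels
          theta(k,iCell) = theta(k,iCell) + dt*tend(k,iCell)
       end do
    end do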

SLIDE 10

Challenges Faced: Using the Right Approach for Data


Approach 1: per-call data movement

    subroutine dyn_tend
      !$acc enter data copyin(…)
      … <code> …
      !$acc exit data copyout(…)
    end subroutine dyn_tend

Approach 2: data kept resident on the device

    subroutine dyn_tend
      !$acc data present(…)
      … <code> …
      !$acc end data
    end subroutine dyn_tend

Approach 1 means lower time for porting, but repeated, unnecessary data transfers and poor performance. Approach 2 creates a copy on host and device that exist simultaneously: harder to design, but no unnecessary copies of data between host and device. A fuller sketch of the second pattern follows.
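
A self-contained sketch of the persistent-data pattern, with the enter data/exit data done once at init/finalize rather than per call (field names theta and tend_theta are illustrative, not the actual MPAS data structures):

    module dyn_data
       implicit none
       real, allocatable :: theta(:,:), tend_theta(:,:)
    contains
       subroutine init_device_data()
          ! once, after allocation: create persistent device copies
          !$acc enter data copyin(theta) create(tend_theta)
       end subroutine init_device_data

       subroutine dyn_tend(dt)
          real, intent(in) :: dt
          integer :: iCell, k
          ! every timestep: data already resident, no host-device traffic
          !$acc parallel loop collapse(2) present(theta, tend_theta)
          do iCell = 1, size(theta, 2)
             do k = 1, size(theta, 1)
                theta(k,iCell) = theta(k,iCell) + dt*tend_theta(k,iCell)
             end do
          end do
       end subroutine dyn_tend

       subroutine finalize_device_data()
          ! once, at the end: copy results back and free device memory
          !$acc exit data copyout(theta) delete(tend_theta)
       end subroutine finalize_device_data
    end module dyn_data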

SLIDE 11

Challenges Faced: Using the Right Techniques

  • Understand how to optimize
      • Learn the basics of optimization
      • Understand how GPUs work
      • Identify poorly performing snippets
      • Know when to use global, register, and shared memory
      • Nvprof is your best friend!
  • Understand how GPUs and CPUs work
      • Experiment with SIMD-friendly loops (code layout)
      • Experiment with the GPU’s SIMT code (data layout)
      • Learn how to combine the two! (see the sketch below)
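
A sketch of what combining the two can look like: with MPAS's (k, iCell) array layout, one loop nest can serve both models. Array names are illustrative:

    ! CPU/SIMD: the inner k loop is unit-stride and vectorizes well.
    ! GPU/SIMT: collapse(2) exposes nCells*nVertLevels independent
    ! iterations, and the contiguous k index keeps accesses coalesced.
    !$acc parallel loop gang vector collapse(2)
    do iCell = 1, nCells
       do k = 1, nVertLevels
          wTend(k,iCell) = wTend(k,iCell) + rho(k,iCell)*w(k,iCell)
       end do
    end do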


SLIDE 12

Single Node Performance: MPAS Dry Dycore


  • Timers
      • MPAS GPTL timers reported in log files
      • GPU timing: includes no updates from device to host
      • Host updates may be needed for printing values on screen
      • Host updates may be needed for netCDF file output (see the sketch below)
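
A sketch of how such a timed region stays free of device-to-host traffic; mpas_timer_start/stop are the GPTL-backed MPAS timers, the other names are illustrative:

    call mpas_timer_start('dyn_tend')
    !$acc parallel loop gang vector collapse(2) present(theta, tend)
    do iCell = 1, nCells
       do k = 1, nVertLevels
          theta(k,iCell) = theta(k,iCell) + dt*tend(k,iCell)
       end do
    end do
    call mpas_timer_stop('dyn_tend')   ! the timer sees GPU compute only

    if (output_this_step) then
       !$acc update host(theta)        ! pay the transfer cost only when a
    end if                             ! print or netCDF write needs the data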

Time per timestep (seconds):

Dataset        Prec.   Broadwell       P100 w/ Power8   V100 w/ Haswell   P100 w/ Haswell   Speedup      Speedup
                       (fully          (1 GPU, PGI,     (1 GPU, PGI,      (1 GPU, PGI,      Broadwell    Broadwell
                       subscribed,     OpenACC code)    OpenACC code)     OpenACC code)     vs P100      vs V100
                       OpenMP, Intel,
                       base code)
120 km (40k)   SP      0.40            0.28             0.19              0.26              1.54         2.16
120 km (40k)   DP      0.88            0.40             0.29              0.35              2.51         2.99
60 km (163k)   SP      1.90            1.02             0.69              1.01              1.88         2.74
60 km (163k)   DP      3.80            1.54             1.12              1.41              2.70         3.40

(Speedups are the Broadwell time divided by the P100-with-Haswell and V100 times, respectively.)

Taking 40k grid points per node for SP, a 32.8M-point mesh (15 km and 3 km locally refined grid) needs ~800 Volta GPUs.

SLIDE 13

Weak Scaling for MPAS Dry Dycore (SP & DP) on P100 GPU


Time per timestep; 4 GPUs per node, 1 MPI rank per GPU (max 4 MPI ranks per node), intra-node affinity for MPI ranks (see the sketch below), OpenMPI, PCIe (no NVLink), PGI 17.10.

[Chart: time per timestep (0.2-2.2 s) vs. number of GPUs (2-16), four curves: 40k per node SP, 40k per node DP, 163k per node SP, 163k per node DP]
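
One common way to implement the 1-rank-per-GPU affinity is to derive a node-local rank and pin it to a device. A minimal sketch using MPI-3 and the OpenACC runtime API (not necessarily how the MPAS driver does it):

    program bind_gpu
       use mpi
       use openacc
       implicit none
       integer :: ierr, local_comm, local_rank, ngpus
       call MPI_Init(ierr)
       ! MPI-3: split by shared memory to find this rank's position on its node
       call MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0, &
                                MPI_INFO_NULL, local_comm, ierr)
       call MPI_Comm_rank(local_comm, local_rank, ierr)
       ngpus = acc_get_num_devices(acc_device_nvidia)   ! 4 per node on PSG
       call acc_set_device_num(mod(local_rank, ngpus), acc_device_nvidia)
       call MPI_Finalize(ierr)
    end program bind_gpu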

SLIDE 14

Strong Scaling for MPAS Dry Dycore (SP & DP) at 15 km (2.6M Grid Points) on P100 GPUs


Time per timestep; 4 GPUs per node, 1 MPI rank per GPU (max 4 MPI ranks per node), intra-node affinity for MPI ranks, OpenMPI, PCIe (no NVLink), PGI 17.10.

[Chart: inverse time per timestep vs. number of GPUs (8-32), SP and DP]

SLIDE 15

Portability: Performance Comparison of the Base Code and the OpenACC Code on a Fully Subscribed Broadwell Node


Run back on the CPU, the OpenACC code is nearly identical to the base code: the variation is <1% for the 40k dataset and <4% for 163k.

[Chart: time (secs) for base code vs. OpenACC code on the 40k SP, 40k DP, 163k SP, and 163k DP datasets]

SLIDE 16

Future Work

  • Improving MPAS scalability
      • MVAPICH instead of OpenMPI
      • NVLink systems
      • Moving halo-exchange bookkeeping onto GPUs (see the sketch after this list)
      • MPS: preliminary results showed no improvement
  • Scalar transport
      • Currently being integrated
  • Physics
      • Port: 35% remaining
      • Optimize: 65% remaining
      • Radiation and land surface remain on the CPU
  • Development
      • Lagged radiation
      • Adopting SIONlib for faster I/O
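
For the halo-exchange item above, one direction this work can take is to keep halo buffers resident on the device and hand device pointers straight to a CUDA-aware MPI, avoiding a host round trip. A hedged sketch of that pattern, not the current MPAS implementation (buffer names and neighbor logic are illustrative):

    ! sendbuf/recvbuf assumed device-resident via an earlier enter data.
    ! Requires a CUDA-aware MPI (e.g., MVAPICH2-GDR, or OpenMPI with CUDA support).
    !$acc host_data use_device(sendbuf, recvbuf)
    call MPI_Isend(sendbuf, n, MPI_REAL, neighbor, tag, comm, reqs(1), ierr)
    call MPI_Irecv(recvbuf, n, MPI_REAL, neighbor, tag, comm, reqs(2), ierr)
    !$acc end host_data
    call MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE, ierr)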


SLIDE 17

Thank you! Questions?
