

SLIDE 1

LLNL-PRES-769074

This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under contract DE-AC52-07NA27344. Lawrence Livermore National Security, LLC.

Lessons Learned from Porting LLNL Applications to Sierra

GTC 2019

David M. Dawson Lawrence Livermore National Laboratory March 19, 2019

SLIDE 2

▪ 17+ code projects/teams/organizations

— Code development teams
— Advanced architecture and portability specialists (AAPS)
— Tool development teams
— Sierra Center of Excellence (CoE)
— Livermore Computing
— Vendors (IBM, Nvidia)

▪ 78 contributors… and counting!
▪ 4+ years preparing for Sierra

LLNL has been heavily investing in performance on GPUs.

[Diagram: overlapping contributions of the Code Teams, the CoE, and AAPS]

The expertise, creativity, and collaboration of our teams make technological advances like Sierra possible.

SLIDE 3

▪ Real codes – real challenges

— Scale: millions of lines of code in multiple programming languages
— Continue to provide new capabilities to users
— Pedigree: maintain connection to prior V&V efforts
— Libraries: coordinate use of limited memory resources

▪ Portable performance
— Our codes must be fast, reliable, and accurate on multiple systems

Porting strategies must address more than just performance. These considerations represent at least as great a cost as computational performance.

▪ Future proof(ish)

— Heterogeneity is likely here to stay … for a while anyway
— We can’t afford to do this with every new machine
— Reduce time to performance on new machines

  • Greater utilization of these expensive investments

▪ Position ourselves for exascale success!

[Diagram: the LLNL computing ecosystem - laptops, workstations, commodity Linux clusters, DOD & industry systems, emergency response teams, and advanced architectures (Sequoia/Trinity, the GPU-accelerated Sierra, and the exascale El Capitan)]

This is an opportunity to invest in future performance.

SLIDE 4

Lightweight mini-apps are used to study algorithmic behavior and facilitate collaboration with vendors and academia

Mini-apps allow us to leverage vendor and academia expertise in optimizing our full production codes

Application (export controlled/UCNI):
▪ LLNL production code
— Million+ lines of code
— Multiple languages
— New features added regularly
— Multiple physical processes interacting
— Sensitive/proprietary

Mini-App (open source):
▪ Open source research application
— Focused and lightweight
— Single physics (few algorithms)
— Can be shared with vendors and academic collaborators
  • Facilitates performance optimization

[Diagram: lessons learned in the mini-app flow back to the application]

SLIDE 5

▪ All speedup numbers are node-to-node speedup as compared with CTS-1
— What users will generally experience

▪ Most of our codes are primarily memory-bandwidth bound on the CPU
▪ To set expectations, compare relevant effective memory bandwidths of architectures

A note on measuring performance

CPU and GPU performance is difficult to measure. Memory bandwidth is a first-order predictor of performance (as opposed to peak FLOPS).

                                 | CTS-1 (Broadwell) | Sierra EA (2× P8 CPU + 4× P100 GPU) | Sierra (2× P9 CPU + 4× V100 GPU)
DRAM bandwidth per node          | 130 GB/s          | 2,200 GB/s                          | 3,400 GB/s
L2 bandwidth per node            | 3,870 GB/s        |                                     |
Shared memory bandwidth per node |                   | 31,052 GB/s                         | 48,320 GB/s

Annotated bandwidth ratios: 16.9× (EA vs. CTS-1 DRAM), 1.5× (Sierra vs. EA DRAM), 1.6× (Sierra vs. EA shared memory); a worked example follows at the end of this slide.

▪ How does performance scale with relevant memory bandwidth?

— This is not a perfect measure, but it is a good place to start
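As a rough back-of-the-envelope check (derived from the table above, not a figure presented in the deck), a code that is purely DRAM-bandwidth bound could at best expect a node-to-node speedup on the order of the bandwidth ratio:

$$\text{expected speedup} \lesssim \frac{BW_{\text{Sierra DRAM}}}{BW_{\text{CTS-1 DRAM}}} = \frac{3{,}400\ \text{GB/s}}{130\ \text{GB/s}} \approx 26\times$$

Measured speedups land below this bound because not every kernel is bandwidth bound and data still has to move between host and device.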

SLIDE 6

▪ Deterministic transport codes
— Ardra: particle transport
— Teton: thermal radiative transfer

▪ Porting strategy

— Teton

  • OpenMP 4.5
  • CUDA-C

— Ardra

  • RAJA, CHAI, Umpire

▪ Enabling performant sweeps on GPUs was a significant challenge that had not previously been demonstrated
— Memory requirements and algorithmic dependencies create technical challenges on new architectures

The deterministic transport project is realizing significant performance gains through focused refactor and porting efforts

Deterministic transport pushes memory requirements to the limits of the device.

SLIDE 7

▪ We have ported the linear solve (Sweep) to GPUs
— OpenMP 4.5 and CUDA-C

▪ We have ported the non-linear solve to GPUs
— CUDA-C

▪ Teton is Fortran (cannot use RAJA)
— Fortran tools/compilers lag those of C/C++

Teton's computational performance is dominated by two kernels: Linear Solve (Sweep) and Non-linear Solve

We are exploring multiple porting strategies in full production code, including tradeoffs between CUDA-C and OpenMP 4.5.

[Diagram: temperature iteration loop with a novel solution algorithm - Linear Solve (Sweep), 50%-90% of runtime; Grey Acceleration, 5%-20%; synchronization point; Non-Linear Solve (Thermal Iteration), 10%-50%; Check Convergence, <1%; synchronization point]

▪ We accept the risk (for now) of maintaining separate CPU- and GPU-specific versions of a small number of key algorithms
— Algorithms are tailored to the hardware to maximize performance

▪ Can we refactor code with a clever abstraction layer and maintain only one version?

[Chart: 2D and 3D speedups for the Sweep, Non-Linear, and Other kernels, and Overall]

SLIDE 8

Speedup is being measured with a criticality solve

Research
▪ Mini-app research
— Work with Sierra CoE to optimize algorithms
▪ Develop RAJA nested loops
▪ Data structure refactor

Porting
▪ Transition code to RAJA/CHAI/Umpire
▪ Performance poor because of significant data motion
▪ Aiming for correctness, not speed

Performance Tuning
▪ All kernels running on GPU
▪ Data stays resident on GPU (except communication)
▪ Algorithms take advantage of GPU shared memory

[Chart: measured speedup over time, 11/27/17 through 2/6/19, on P100 and V100]

Focused and strategic porting of deterministic transport is yielding significant speedups.

SLIDE 9

Ardra performance tracks closely with cache bandwidth across architectures

Resources    | Nodes | Runtime (s) | Speedup (×)
36 CPU cores | 1     | 38.76       | 1.0
72 cores     | 2     | 18.57       | 2.1
144 cores    | 4     | 8.95        | 4.3
288 cores    | 8     | 5.03        | 7.7
4 P100 GPUs  | 1     | 4.69        | 8.3
8 GPUs       | 2     | 2.56        | 15.1
16 GPUs      | 4     | 1.39        | 27.8
4 V100 GPUs  | 1     | 3.13        | 12.4
8 GPUs       | 2     | 1.73        | 22.4
16 GPUs      | 4     | 1.08        | 35.8
32 GPUs      | 8     | 0.77        | 50.5

[Chart: runtime (s) vs. aggregate memory bandwidth (GB/s) for the criticality search solver on CTS-1 (Broadwell), EA (P8+P100), and Sierra (P9+V100), compared against ideal scaling using Broadwell L2 and P100/V100 shared-memory bandwidths]

SLIDE 10

The Mercury particle transport and Imp IMC thermal radiative transfer capabilities have been ported to Sierra

▪ Particle (Mercury) and thermal photon (Imp) transport consolidated into single code base
— Built from shared infrastructural source code
— Facilitated GPU port

▪ History-based Monte Carlo transport is generally hostile to most advanced architectures
— Particle tracking loop is thousands of lines of branchy, latency-sensitive code

▪ GPU porting strategy
— CUDA "big kernel" history-based particle tracking with CUDA managed memory
— Exploring RAJA for more typical "loops over cells" code

▪ Targeting 2-3× speedup on Sierra
— Based on mini-app results

[Charts: wall time (s) per rank for 4 ranks, non-load balanced vs. load balanced]

Dynamic Heterogeneous Load Balancing
▪ Uses speed information from the previous cycle to balance the particle workload among all ranks (see the sketch below)
▪ Performance limited by longest running rank
▪ Early tests show up to 3× speedup

Monte Carlo transport capabilities are entering the performance tuning phase and exploring heterogeneous load balancing.
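A minimal sketch of the speed-based balancing idea above, assuming an MPI-only decomposition; the function and variable names are illustrative, not Mercury/Imp APIs:

```cpp
#include <mpi.h>
#include <vector>

// Hypothetical sketch: each rank reports its particle-tracking rate from the
// previous cycle, and the next cycle's workload is assigned in proportion to
// that rate so the longest-running rank no longer dominates the cycle time.
std::vector<long> balanceParticles(long localParticles, double localRate /* particles/s, last cycle */)
{
    int nRanks = 0;
    MPI_Comm_size(MPI_COMM_WORLD, &nRanks);

    // Gather every rank's measured rate and the total particle count.
    std::vector<double> rates(nRanks);
    MPI_Allgather(&localRate, 1, MPI_DOUBLE, rates.data(), 1, MPI_DOUBLE, MPI_COMM_WORLD);

    long totalParticles = 0;
    MPI_Allreduce(&localParticles, &totalParticles, 1, MPI_LONG, MPI_SUM, MPI_COMM_WORLD);

    double totalRate = 0.0;
    for (double r : rates) totalRate += r;

    // Target particle count per rank, proportional to its measured speed.
    std::vector<long> targets(nRanks);
    for (int i = 0; i < nRanks; ++i)
        targets[i] = static_cast<long>(totalParticles * (rates[i] / totalRate));
    return targets;   // the caller then migrates particles toward these targets
}
```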

SLIDE 11

We are assessing Imp and Mercury performance on Sierra

▪ Crooked pipe idealized thermal radiative transfer test problem
— 2× speedup overall
— Particle tracking showing decent speedup

Resources | CPU / GPU         | Total Time [minutes] | Particle Time [minutes] | Init/Final Time [minutes]
CTS-1     | 36 cores          | 31.67                | 29.61                   | 2.05
V100+P9   | 4 GPUs + 36 cores | 15.88 (1.99×)        | 11.70 (2.53×)           | 4.18 (0.49×)

▪ Godiva critical sphere surrounded by water, criticality solve*
— 1.1× overall speedup

Resources | CPU / GPU         | Total Time [minutes] | Particle Time [minutes] | Init/Final Time [minutes]
CTS-1     | 36 cores          | 2.53                 | 2.27                    | 0.26
V100+P9   | 4 GPUs + 36 cores | 2.28 (1.11×)         | 1.83 (1.24×)            | 0.45 (0.58×)

* D. E. Cullen, C. J. Clouse, R. Procassini, R. C. Little, “Static and Dynamic Criticality: Are They Different?”, UCRL-TR-201506 (2003)

[Diagram: crooked pipe geometry - optically thick and optically thin regions, source, Te at 10⁻⁶ s, measurement points 1-5]

Monte Carlo transport on GPUs is hard, but progress is being made.

SLIDE 12

HE Performance, Lethality, Vulnerability and Safety Code

— DOE: Stockpile Stewardship and other NNSA programs
— DoD: Munitions and rocket design performance, lethality, vulnerabilities, and safety
— DHS: Transit and structures vulnerabilities and safeguard designs
— Other: Additive manufacturing

[Images: rocket motor; Glory mission and Taurus XL launch; explosive cookoff / violence of reaction; buried blast; component and system-level analysis; fully coupled blast/structural simulation]

▪ Required physics capabilities
— 3D/2D ALE hydrodynamics
— 3D arbitrarily connected hexahedral mesh
— High-explosive modeling
— Material contact
— Advanced material models

The goal is to turn current month-long complex calculations around in a weekend (10× speedup or more).

[Images: blast/impact for TBI; rail gun]

SLIDE 13

We are making great progress enabling a wide range of physics capabilities

▪ Main hydro packages have been ported and are being optimized
▪ Strategy: RAJA/CHAI/Umpire
▪ Current focus
— Porting reactive flow models for high-fidelity HE modeling
— Slides for material contact
— Addressing performance bottlenecks

▪ Successfully run 3D high-resolution problems on GPUs

The ability to run problems like this on a small fraction of Sierra makes high-resolution 3D UQ feasible.

Triple-Point Shock Problem
▪ 3D multi-material ALE hydrodynamics
▪ 17B zones
▪ 512 nodes
— 2,048 GPUs
▪ Only 12% of Sierra
— Or roughly Sequoia

Tracking performance improvements over time

SLIDE 14

A number of optimization opportunities have been identified

Performance bottlenecks must be overcome to achieve memory bandwidth scaling.

Resources  | Nodes | Runtime (hours)
CTS-1:
36 cores   | 1     | 30.7
72 cores   | 2     | 15.8
144 cores  | 4     | 8.1
288 cores  | 8     | 4.2
576 cores  | 16    | 2.2
1152 cores | 32    | 1.1
2304 cores | 64    | 0.6
4608 cores | 128   | 0.3
9216 cores | 256   | 0.2
Sierra:
8 GPUs     | 2     | 1.1
16 GPUs    | 4     | 0.7
32 GPUs    | 8     | 0.5
64 GPUs    | 16    | 0.3
128 GPUs   | 32    | 0.3
256 GPUs   | 64    | 0.2

Node-to-node speedup: 14.5×

▪ Up to 8M zones per GPU
▪ Identified major bottlenecks
— Kernel launch overhead
— GPU register pressure (see the sketch below)
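One generic way register pressure is commonly attacked on NVIDIA GPUs is CUDA's __launch_bounds__ qualifier, which caps threads per block (and with it, registers per thread), trading possible spills for occupancy. The kernel below is only an illustrative stand-in, not an Ares kernel:

```cpp
// Illustrative only: cap the launch configuration so the compiler limits
// register usage per thread; an ideal-gas pressure update stands in for a
// real hydro kernel.
__global__ void __launch_bounds__(256, 4)   // <= 256 threads/block, >= 4 resident blocks/SM
updatePressure(double* p, const double* rho, const double* e, double gm1, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        p[i] = gm1 * rho[i] * e[i];          // p = (gamma - 1) * rho * e
    }
}

// Launch example: updatePressure<<<(n + 255) / 256, 256>>>(p, rho, e, gamma - 1.0, n);
```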

[Chart: runtime vs. problem size for theory, CTS-1 (Broadwell), and 2× P9 + 4× V100]

SLIDE 15

Ares: NIF debris, pulsed power, ICF, and HE simulation code

▪ Physics capabilities:
— ALE-AMR hydrodynamics
— High-order Eulerian hydrodynamics
— Elastic-plastic flow
— 3T plasma physics
— High-explosive (HE) modeling
— Diffusion and deterministic thermal radiative transfer
— Multiphase particle flow
— Laser ray-tracing
— MHD
— Dynamic mixing
— Non-LTE opacities

Applications:

  • Inertial Confinement Fusion (ICF)
  • Pulsed power
  • National Ignition Facility debris
  • HE experiments
SLIDE 16

GPUs show great promise for increasing throughput for Ares applications

14× speedup has been achieved, but further optimizations are being explored.

Resources    | Nodes | Runtime (hours) | Speedup (×)
CTS-1 Broadwell:
576 cores    | 16    | 15.2            | 1.00
1152 cores   | 32    | 7.6             | 2.00
2304 cores   | 64    | 4.0             | 3.80
4608 cores   | 128   | 2.1             | 7.24
Sierra EA (P8 + P100):
32 P100 GPUs | 8     | 2.2             | 6.91
64 P100 GPUs | 16    | 1.4             | 10.86
Sierra (P9 + V100):
32 V100 GPUs | 8     | 1.6             | 9.5
64 V100 GPUs | 16    | 1.1             | 13.82

[Chart: runtime (s) vs. aggregate memory bandwidth (GB/s) - ideal scaling vs. Broadwell (L2), P100 (shared), V100 (shared)]

SLIDE 17

Recent work on Sierra demonstrates that we will be able to study mix phenomena in ICF at unprecedented resolutions

Heroic calculations like these can be turned around in a matter of days.

Resources (V100s) | Nodes | Zones | Runtime (hours)
32                | 8     | 191M  | 1.6
256               | 64    | 1.52B | 3.5
2048              | 512   | 12.2B | 7.8
16384             | 4096  | 97.8B | 13.05

▪ Performance scales with memory bandwidth
▪ Opportunities remain for further optimization

Rayleigh-Taylor Mixing Layer in a Convergent Geometry
▪ 4𝜌
▪ ALE hydrodynamics
▪ Dynamic species
▪ Idealized ICF capsule

SLIDE 18

Kull: HED experiments simulation code

▪ Computational runtime is dominated by transport in ICF calculations

— Thermal radiative transfer can account for an overwhelming majority of the runtime

▪ Kull strategy

— Refactor code for compatibility with RAJA (in progress)
— Initially, provide an environment that facilitates maximizing transport performance
— Enable flexibility for choices made by transport algorithms

  • Multiple levels of parallelism
  • Provide an ecosystem that supports C++/Fortran/OpenMP/CUDA-C/CUDA-Fortran all in one code

Early performance gains will come from thermal radiative transfer, with hydro gains expected as the refactor progresses.

                      | Total Runtime / Speedup | Teton Sweep | Teton NL Solve | Teton GTA/Init/Finalize
CPU (CTS-1)           | 52.45 / 1.0×            | 21.9        | 10.35          | 9.99
GPU Sweep + NL Solver | 26.67 / 1.97×           | 2.75×       | 2.03×          | 0.73×

*Radiating Sphere test problem

SLIDE 19

HYDRA physics packages being ported to run on Sierra GPUs using a staged approach

▪ Initial focus is on porting the most expensive physics packages to GPUs

— Implicit Monte Carlo Photonics (IMC)

  • Evaluated porting options in mini-app
  • IMC package now running on CPUs and GPUs simultaneously

— Non-Local Thermodynamic Equilibrium (NLTE)

  • Currently evaluating mini-app performance on GPUs

— MHD package has been modified to support GPU parallelism
— GPU parallel version of hypre solvers undergoing testing

  • Employed in multigroup diffusion, thermal transport, and charged particle diffusion

— Exploring multiple approaches (OpenMP 4.5, CUDA, RAJA/Umpire/CHAI)

▪ Sierra will enable…
— Higher throughput of high-resolution 3D simulations
— More accurate NLTE models (100× increase in configurations)

Staged porting allows for focused performance tuning on each physics package before putting it all together.

SLIDE 20

▪ MARBL has two hydro modules
— BLAST: High-order unstructured ALE
  • Lessons learned from the high-order Lagrangian mini-app are currently being transferred to BLAST
— Miranda: High-order structured Eulerian
  • Mini-app helping to understand best practices for using OpenMP 4.5 in Fortran code
  • Parallelizing Fortran array operations over thread teams requires an addition to the OpenMP standard (pending)

MARBL, our next-generation high-order ICF code, shows great promise on Sierra

High-order methods in MARBL look to be particularly well suited for performance on Sierra.

SLIDE 21

We are exploring geometric and sampling intersection evaluation methods for solution mapping on the GPU

▪ Geometric (standard Overlink)
— Based on Material Interface Reconstruction (MIR)
— Near machine accuracy
— Initial studies showed not well suited to GPUs

▪ Geometric Lite (suited to GPU)
— Mixed zones are homogenized (slight loss of accuracy)

▪ Sampling
— 8000 samples per zone = 0.25% statistical error
— Backward map is slower for large numbers of mixed zones

[Diagram: variable on “donor” mesh + Cartesian “target” mesh = “donor” variable on “target” mesh]

Transfers mesh-based data from an original donor mesh to a Cartesian target mesh.

SLIDE 22

Geometric methods are proving to be superior to sampling methods on Sierra

[Charts: relative runtimes of Host Geometric, GPU Sampling, GPU Geometric, Host+GPU Geometric, and GPU Geometric Lite on CTS-1, EA, and Sierra - forward map (unstructured to Cartesian) and backward map (Cartesian to unstructured)]

Forward map:
▪ On EA, GPU Sampling was fastest
— But suffered from non-trivial error
▪ Host+GPU Geometric becomes competitive
— Extremely low error
▪ GPU Geometric Lite was fastest on Sierra
— Small error at mixed zones

Backward map:
▪ On EA, GPU Geometric was fastest
— Extremely low error
▪ On Sierra, GPU Geometric Lite was fastest
— Small error at mixed zones
▪ All methods benefit significantly from MPI improvements on Sierra (vs. CTS-1)
▪ Sampling was no faster than Geometric

The algorithm choice indicated by the EA machine turned out NOT to be the best method on Sierra.

SLIDE 23

▪ Upfront research and scoping with mini-apps is invaluable to developing a deep understanding of algorithmic behavior and selecting an appropriate porting strategy
▪ Challenges continue in production codes and multi-physics contexts
▪ Having computer scientists co-located between tools teams and applications teams has been vital
— Co-development of tools
— Feature requirements and feature development often by the same personnel
— Facilitates implementation and adoption
— Easy access to RAJA/CHAI/Umpire expertise

Team structure and a plan for porting are important. Rapid dissemination of experience and best practices between teams has been essential.

SLIDE 24

LLNL Abstractions – RAJA/CHAI/Umpire

▪ RAJA provides excellent portability with little performance loss
— Performance gap continues to close with help from vendors

▪ CHAI provides a hardware-agnostic automated data transfer solution
▪ Umpire across host/library codes allows for cohesive memory management with pools
▪ Architecture-specific implementations are suitable for complex kernels or if a small number of kernels dominate runtime

Abstraction layers can provide excellent performance and portability.

OpenMP 4.5+

▪ Useful abstraction tool for Fortran codes (a C++ offload sketch follows below)
▪ OpenMP support in Nvidia tools is problematic
— Hampers debugging and performance profiling
▪ No way to catch runtime errors with OpenMP
— Makes debugging painful

Our assumption that there would be a significant tradeoff between performance and portability was wrong.
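For reference, a minimal OpenMP 4.5 target-offload loop, written here in C++ (the Fortran codes discussed above use the equivalent directives); the array names and the axpy-style body are placeholders:

```cpp
// Map the arrays to the device, distribute the loop across teams of device
// threads, and map the result back. This is a sketch, not application code.
void scaleAdd(double* y, const double* x, double a, int n)
{
    #pragma omp target teams distribute parallel for map(to: x[0:n]) map(tofrom: y[0:n])
    for (int i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}
```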

SLIDE 25

▪ CHAI
— Hardware-agnostic automatic data migration
— Good performance but large upfront investment
— Compile-time correctness checking

▪ Unified Memory (UM), sketched below
— Essentially no upfront cost
— Automatic memory migration is almost always slow
— Automatic eviction when GPU runs out of memory
— UM allows memory management to be treated as a performance optimization
  • Facilitating porting

▪ Umpire
— Hardware-agnostic memory management abstraction
— Provides memory pools
— Memory introspection for better decision making
  • Where is this pointer?
  • How big is the allocation?
  • What allocator is used?
  • How much memory is being used on this resource?

Memory management and migration is a significant performance factor

You will have to put in work, at one end or the other, to make memory management performant.
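A minimal sketch of the UM trade-off noted above: allocation is nearly free, but performance then hinges on where pages actually live, so prefetching (or similar hints) becomes the tuning knob. The function name is illustrative:

```cpp
#include <cstddef>
#include <cuda_runtime.h>

// Allocate a field in Unified Memory. The pointer is valid on both host and
// device; without hints, pages migrate on demand, which is often slow.
double* allocateField(std::size_t n, int device)
{
    double* data = nullptr;
    cudaMallocManaged(&data, n * sizeof(double));

    // Treat placement as a performance optimization: prefetch the pages to the
    // GPU before the kernels that read them run.
    cudaMemPrefetchAsync(data, n * sizeof(double), device, /*stream=*/0);
    return data;   // release later with cudaFree(data)
}
```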

SLIDE 26

Performance

▪ Kernel launch overhead
— Can be hidden using asynchronous kernel launches (see the sketch below)

▪ Data transfer between memory spaces
— Needs to be avoided or hidden behind other kernels

▪ Memory allocation is significantly more expensive on the GPU
— Necessitates the usage of memory pools

GPUs have performance overheads, not seen on CPUs, that must be managed.
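A minimal CUDA sketch of the asynchronous-launch idea above: queue many small, independent kernels on a stream and synchronize once, so each launch's overhead overlaps with work already executing on the device. The kernel and function names are placeholders, not application code:

```cpp
#include <cuda_runtime.h>

__global__ void scaleKernel(double* x, double a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = a * x[i];
}

void scaleManyFields(double** fields, int numFields, double a, int n)
{
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    int threads = 256, blocks = (n + threads - 1) / threads;

    for (int f = 0; f < numFields; ++f) {
        // Each launch returns immediately; its launch cost overlaps with the
        // previous kernel still running on the device.
        scaleKernel<<<blocks, threads, 0, stream>>>(fields[f], a, n);
    }
    cudaStreamSynchronize(stream);   // one synchronization instead of one per launch
    cudaStreamDestroy(stream);
}
```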

Library Coordination

▪ Different porting strategies/timelines
▪ Un-ported libraries can result in costly CPU/GPU data transfer
▪ The GPU can be considered a communal resource with multiple competing stakeholders
— Memory pools can help

Tools

▪ Debugging: CUDA memcheck, CUDA GDB, TotalView, good ol’ print statements
▪ Performance: NVProf, Archer (ThreadSanitizer)
▪ Common source (via abstractions) provides access to a wider range of tools

Achieving performance requires a deeper understanding of our codes/algorithms, which will yield dividends in the future.

SLIDE 27

▪ Our code teams have undertaken careful and detailed evaluations of porting and execution strategies to optimize our codes for Sierra and maintain portable performance.
▪ A wide range of physics capabilities have been ported to GPUs and are either in the process of exploring optimization strategies or have achieved game-changing speedup.
▪ Maintaining speedup in complex calculations is challenging.
▪ We are exploring multiple porting strategies where appropriate.
▪ Abstraction layers provide excellent performance and portability with minimal tradeoff between the two.

Summary

SLIDE 28

▪ Increased physics and geometric fidelity
— Increased resolution
— Fewer compromises on physics models

▪ Pose questions on the scale of hours instead of days or weeks (or even months)
— This fundamentally changes what questions you ask and how you ask them

▪ Improved turnaround of large 3D calculations
— 3D Uncertainty Quantification becomes feasible

▪ Hero calculations become practical
— Extremely high-resolution calculations in 2 days instead of 45

We are well positioned to use Sierra for high-fidelity 3D studies in what were previously considered “heroic” calculations.

What will “heroic” look like on Sierra?

SLIDE 29

Questions?

SLIDE 30

Disclaimer This document was prepared as an account of work sponsored by an agency of the United States government. Neither the United States government nor Lawrence Livermore National Security, LLC, nor any of their employees makes any warranty, expressed or implied, or assumes any legal liability or responsibility for the accuracy, completeness, or usefulness of any information, apparatus, product, or process disclosed, or represents that its use would not infringe privately owned rights. Reference herein to any specific commercial product, process, or service by trade name, trademark, manufacturer, or otherwise does not necessarily constitute or imply its endorsement, recommendation, or favoring by the United States government or Lawrence Livermore National Security, LLC. The views and opinions of authors expressed herein do not necessarily state or reflect those of the United States government or Lawrence Livermore National Security, LLC, and shall not be used for advertising or product endorsement purposes.

SLIDE 31

▪ A typical code has O(10k) execution loops, but only O(10) loop types
— Provides optimized backends for common loop types

▪ Portability and maintainability
— Algorithms don't change with backend
— Leverage vendor optimizations in all codes

▪ Future proofing
— RAJA PerfSuite and Kripke are part of the CORAL-2 benchmark
  • Reduce "time-to-performance" on future hardware

▪ RAJA is not a universal solution
— Does not support Fortran
— Requires C++11 and lambdas
— Does not yet concisely address some of our application use cases

RAJA provides portable performance and was co-designed with our application requirements

[Diagram: code → RAJA loops → RAJA backends (Sequential, OpenMP, and future backends) → hardware (CPU, Nvidia GPU, AMD GPU, future hardware). The backend can change as technologies wax and wane without modifying algorithms.]
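A minimal sketch of the backend-swap idea on this slide, using RAJA's public forall interface; the daxpy body is a placeholder loop, and the compile-time policy selection shown here is just one simple way an application might switch backends:

```cpp
#include "RAJA/RAJA.hpp"

void daxpy(double* y, const double* x, double a, int n)
{
#if defined(RAJA_ENABLE_CUDA)
    using policy = RAJA::cuda_exec<256>;        // NVIDIA GPU backend
#elif defined(RAJA_ENABLE_OPENMP)
    using policy = RAJA::omp_parallel_for_exec; // CPU threads backend
#else
    using policy = RAJA::seq_exec;              // sequential fallback
#endif

    // The loop body does not change when the backend changes.
    RAJA::forall<policy>(RAJA::RangeSegment(0, n), [=] RAJA_HOST_DEVICE (int i) {
        y[i] = a * x[i] + y[i];
    });
}
```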

SLIDE 32

▪ CHAI (Copy Hiding Abstraction Interface)
— Smart pointer type detects execution location and ensures data locality

▪ Simplifies porting
— No explicit memory copying needed
— Errors caught at compile time
— Umpire backend

CHAI automatically handles runtime data transfers

▪ Portability and future proofing
— Portable to machines that lack UM

▪ Disadvantages
— Additional changes necessary relative to UM
— Eviction policies and/or asynchronous data transfers needed for additional performance optimizations (in progress)

Provides robust portability and performance but requires additional initial porting investment relative to UM.
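A minimal sketch of the behavior described above: a chai::ManagedArray captured by a RAJA loop records where it is being used and copies itself to the right memory space, so the application writes no explicit transfers. Illustrative only, not production code:

```cpp
#include "chai/ManagedArray.hpp"
#include "RAJA/RAJA.hpp"

void chaiExample(int n)
{
    chai::ManagedArray<double> v(n);

    // Host loop: CHAI keeps (or moves) the data in CPU memory.
    RAJA::forall<RAJA::seq_exec>(RAJA::RangeSegment(0, n), [=] (int i) {
        v[i] = static_cast<double>(i);
    });

#if defined(RAJA_ENABLE_CUDA)
    // Device loop: capturing v triggers an automatic copy to GPU memory,
    // with no explicit cudaMemcpy in the application code.
    RAJA::forall<RAJA::cuda_exec<256>>(RAJA::RangeSegment(0, n),
                                       [=] RAJA_DEVICE (int i) {
        v[i] *= 2.0;
    });
#endif

    v.free();   // ManagedArray memory is released explicitly
}
```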

SLIDE 33

▪ Portability, backend based
▪ Easily manage memory throughout a complex memory hierarchy
— Allocate/deallocate/copy/move

▪ Memory pools
— More efficient allocation/deallocation of memory
— Facilitates sharing a memory pool between code components
  • More efficient use of memory (larger problems)

▪ Memory introspection for better decision making
— Where is this pointer?
— How big is the allocation?
— What allocator was used?
— How much memory is being used on this resource?

Umpire provides a unified memory management API

[Diagram: the Umpire API sits on top of implementations (memkind, SICM, tcmalloc, cnmem, cudaMalloc) and hardware (DDR, GDDR). Provides portable memory management and convenient memory pools.]
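A minimal sketch of the usage pattern described on this slide: obtain a device allocator, wrap it in a pool so repeated allocations avoid raw device malloc calls, and use introspection to query the allocator. The pool name is made up, and DynamicPool is one of Umpire's pool strategies (the exact strategy class varies by Umpire version):

```cpp
#include <cstddef>
#include "umpire/ResourceManager.hpp"
#include "umpire/strategy/DynamicPool.hpp"

void umpireExample(std::size_t bytes)
{
    auto& rm = umpire::ResourceManager::getInstance();

    // Pool built on top of the device memory resource.
    auto pool = rm.makeAllocator<umpire::strategy::DynamicPool>(
        "APP_DEVICE_POOL", rm.getAllocator("DEVICE"));

    void* data = pool.allocate(bytes);             // served from the pool, not cudaMalloc
    std::size_t inUse = pool.getCurrentSize();     // introspection: bytes currently allocated
    (void)inUse;
    pool.deallocate(data);                         // returned to the pool for reuse
}
```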

SLIDE 34

Mini-app research has been key to the planning and design of Armus and the success of Ardra on GPUs

Initial Investigations → Armus Framework: 21 months (Research)
Armus Framework → 1st GPU Run: 8 months (Porting)
1st GPU Run → 15× Speedup: 13 months (Performance Tuning)

Research: mini-app research, initial Armus development, RAJA nested loops, early Ardra refactor
Porting: adopt Armus data structures, transition to RAJA, first GPU run
Performance Tuning: performance analysis, tuning, use of GPU shared memory

Research details:
  • Developed Kripke mini-app to explore data structures and programming models
  • Worked with CORAL CoE to develop CUDA version of Kripke
  • Started development of nested loop abstractions in RAJA
  • Developed requirements for a deterministic transport framework, and created Armus
  • Started refactoring Ardra to accommodate GPU-compatible data structures
  • Focus porting activities on 3D static criticality solver

Porting details:
  • Ported code to Armus data structures
  • Transitioned code to RAJA
  • Continued development of RAJA based on issues encountered in Kripke and Ardra
  • First GPU run "worked" but had significant robustness issues

Performance tuning details:
  • Converted vector kernels to use CUDA
  • Ported remaining kernels to RAJA
  • Fixed correctness and robustness issues
  • Started performance analysis and tuning of major kernels
  • Started to take advantage of GPU shared memory

Ardra took an ambitious multi-pronged approach to investing in current and future performance, yielding significant speedup.

SLIDE 35

▪ Mercury/Imp have heterogeneous CPU/GPU load balancing
— Assumes MPI only for now
— libQuo or thread-based balancing can be explored

▪ Uses speed information from the previous cycle to balance the particle workload
▪ Performance limited by longest running rank
▪ 3.2× speedup
— 1 zone thermal emission test problem

Monte Carlo Transport project has implemented heterogeneous CPU/GPU load balancing

[Charts: wall time (seconds) and tracked segments per rank for 4 ranks, before and after load balancing]

SLIDE 36

We are exploring geometric and sampling intersection evaluation methods for solution mapping on the GPU

▪ Geometric (standard Overlink)
— Based on Material Interface Reconstruction (MIR)
— Near machine accuracy
— Initial studies showed not well suited to GPUs*

▪ Geometric Lite (suited to GPU)
— Mixed zones are homogenized (slight loss of accuracy)

▪ Sampling
— 8000 samples per zone = 0.25% statistical error
— Backward map is slower for large numbers of mixed zones

SLIDE 37

▪ What modes of execution are best for performance?
— CPU vs. GPU
— How many MPI processes?

▪ What about multiphysics?
— If some phases use one MPI process per GPU, can we productively use the remaining CPU cores?
— If some phases use one MPI process per CPU core, can we use multiple MPI processes per GPU for the accelerated phases?

There are a lot of questions around how to best utilize the resources at our disposal

We are beginning to investigate some of these questions, but there are many opportunities to explore.

SLIDE 38

▪ Perform initialization on all CPUs
▪ Re-decompose (costly)
▪ Compute on GPUs

Accommodating different modes within a single simulation

▪ Example: 196M zone problem on 8 nodes of the EA system
— 2.58× speedup for generation including redistribution cost
— Saved an hour of runtime (~13% total speedup)

[Diagram: phases of the calculation alternating between CPU and GPU execution]

Modest speedup for the generation phase of the problem. This gets even better when oversubscribing the GPU.

SLIDE 39

▪ Divide work via uneven domain decomposition
— Very difficult to get the load balancing right

▪ Proof of concept implemented in Ares
— RAJA provides the same source code for CPU and GPU

▪ 10% performance improvement over GPU only

Heterogeneous execution or oversubscribing the GPU may yield valuable performance gains

▪ More CPU cores lead to better CPU memory bandwidth utilization
▪ More MPI processes = more communication
▪ Multi-Process Service (MPS) allows kernels launched from different MPI processes to be processed concurrently on the same GPU
— Can result in better utilization of SMs

Oversubscribing the GPU may prove beneficial if improvements in load balance outweigh the additional MPI cost.