SLIDE 1

LULESH and OpenACC:

To Exascale and Beyond!!!

Shaden Smith (1,2), Peter Robinson (2)

(1) University of Minnesota  (2) Lawrence Livermore National Laboratory, Weapons and Complex Integration

This work performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344.

August 21, 2013

LLNL-PRES-642574

SLIDE 2
  • 1. Introduction and Motivations
  • 2. OpenACC
  • 3. Challenges
  • 4. Methodologies and Results
  • 5. Conclusions


SLIDE 3

Exascale Architectures: Heterogeneity

Supercomputers will no longer have simple, homogeneous nodes with many CPU cores. GPUs and other accelerators are dominating the horsepower of new systems.

              Sequoia          Titan                  Tianhe-2
PFLOPS        17.17            17.59                  33.86
Architecture  BG/Q             AMD CPU + NVIDIA GPU   Intel CPU + MIC
Nodes/Cores   98.30K / 1.57M   18.68K / 0.56M         16.00K / 3.12M
Power         7.89 MW          8.20 MW                17.80 MW


SLIDE 4

Graphics Processing Units: GPU Overview

GPUs are massively parallel accelerators designed for graphics processing. They are very good at stream processing:

Scan over a large list of data, doing identical math on each index

The CPU and GPU do not share memory

The programmer must maintain copies on both

[Diagram: the same array (1 2 3 4) is kept in both CPU and GPU memory]
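To make "maintain copies on both" concrete, here is a minimal sketch (not on the slide) of managing the two copies by hand with the CUDA runtime API. OpenACC, introduced later, generates this kind of traffic from pragmas instead.

    #include <cuda_runtime.h>

    int main(void) {
        float A[4] = {1, 2, 3, 4};   /* host copy (CPU) */
        float *d_A;                  /* device copy (GPU) */

        cudaMalloc((void **)&d_A, sizeof(A));                   /* allocate on GPU */
        cudaMemcpy(d_A, A, sizeof(A), cudaMemcpyHostToDevice);  /* CPU -> GPU */
        /* ... launch kernels that read and write d_A ... */
        cudaMemcpy(A, d_A, sizeof(A), cudaMemcpyDeviceToHost);  /* GPU -> CPU */
        cudaFree(d_A);
        return 0;
    }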


SLIDE 5

Proxy Applications: Motivation

Rewriting a large simulation code is a major investment. Instead, extract a small but representative portion that can be modified and also released for public use.

Great for hardware co-design!

Proxy Apps

AMG2013, LULESH, MCB, UMT


SLIDE 6

LULESH Overview: Livermore Unstructured Lagrangian Explicit Shock Hydrodynamics

Data layout, memory access patterns, and computation are very similar to a typical multi-physics code's hydro kernel. At only a few thousand lines of code, it's easy to rewrite for new architectures and programming models.


SLIDE 7
  • 1. Introduction and Motivations
  • 2. OpenACC
  • 3. Challenges
  • 4. Methodologies and Results
  • 5. Conclusions


SLIDE 8

OpenACC - Introduction: What is OpenACC?

A C/C++/Fortran API that supports offloading work to accelerator devices. It uses pragmas to provide the compiler with hints for parallel regions.

Familiar interface for OpenMP programmers!
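For comparison (not on the slide), a minimal sketch of the same loop in plain OpenMP; the OpenACC version below adds the data-movement clauses that a shared-memory model does not need.

    /* OpenMP analogue: A, B, and C live in shared CPU memory,
       so no copy clauses are needed */
    #pragma omp parallel for
    for (int i = 0; i < N; ++i) {
        C[i] = A[i] * B[i];
    }

The OpenACC version from the slide: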

    /* A, B, and C currently on CPU */
    #pragma acc parallel loop copyin(A[0:N], \
                                     B[0:N]) \
                              copyout(C[0:N])
    for (int i = 0; i < N; ++i) {
        C[i] = A[i] * B[i];
    }


SLIDES 9-14

OpenACC - Introduction

    /* A, B, and C currently on CPU */
    #pragma acc parallel loop copyin(A[0:N], \
                                     B[0:N]) \
                              copyout(C[0:N])
    for (int i = 0; i < N; ++i) {
        C[i] = A[i] * B[i];
    }

[Animation: the data movement generated by the pragma, starting from A = 1 2 3 4, B = 4 3 2 1, C = 0 0 0 0 on the CPU and nothing on the accelerator]

1. Alloc + Copy A: A is allocated on the accelerator and copied from the CPU.
2. Alloc + Copy B: B is allocated on the accelerator and copied from the CPU.
3. Alloc C: C is allocated on the accelerator; its contents are undefined.
4. The loop runs on the accelerator, computing C[i] = A[i] * B[i]; the accelerator's C becomes 4 6 6 4.
5. Copy C: the result is copied back, so both CPU and accelerator now hold C = 4 6 6 4.

SLIDE 15

OpenACC - Data Movement: Data Regions

Data regions provide a means of specifying memory transfers. Minimizing data movement between the CPU and accelerator is essential for performance.

    /* A, B, and C allocated on CPU */
    #pragma acc data copyin(A[0:N], \
                            B[0:N]) \
                     copyout(C[0:N])
    {
        /* A, B, and C are now on accelerator */
        compute_C(A, B, C);
        compute_more_C(A, B, C);
    }
    /* C has now been updated on CPU */
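As a sketch of what a routine called inside the data region might look like (an assumption for illustration, not shown on the slide; Real_t and N are taken from the surrounding LULESH-style code), the present clause asserts the arrays are already on the accelerator, so no extra copies are made:

    void compute_C(Real_t *A, Real_t *B, Real_t *C)
    {
        /* arrays were already transferred by the enclosing data region */
        #pragma acc parallel loop present(A[0:N], B[0:N], C[0:N])
        for (int i = 0; i < N; ++i) {
            C[i] = A[i] * B[i];
        }
    }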


SLIDE 16

OpenACC - Availability: Compiler Support

Three compilers have implementations of OpenACC: PGI, CAPS, and Cray.

Our code has only been tested with PGI thus far

LLNL Support

edge and rzgpu both have pgi-accelerator available

Compile on edge84 and rzgpu2
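As an illustration (this command line is an assumption, not from the slide), a typical PGI invocation enables OpenACC with -acc, targets NVIDIA GPUs with -ta=nvidia, and reports what was accelerated with -Minfo=accel:

    pgcc -acc -ta=nvidia -Minfo=accel -o vecmul vecmul.c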


SLIDE 17
  • 1. Introduction and Motivations
  • 2. OpenACC
  • 3. Challenges
  • 4. Methodologies and Results
  • 5. Conclusions


SLIDE 18

Data Management: Implicit Data Regions

When functions are called from within a data region, the programmer must be aware of which memory is found on the accelerator. It's easy to forget where your data is and instead access junk.

    #pragma acc data copyin(A[0:N], \
                            B[0:N]) \
                     copyout(C[0:N])
    {
        compute_C(A, B, C);
        print_intermediate_results(C); /* OUCH! */
        compute_more_C(A, B, C);
    }
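One hedged fix (not on the slide): the "OUCH!" happens because the CPU copy of C is stale inside the region. The OpenACC update directive synchronizes the host copy mid-region so the print sees current data:

        compute_C(A, B, C);
        #pragma acc update host(C[0:N])  /* refresh the CPU copy before printing */
        print_intermediate_results(C);   /* now reads up-to-date values */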


SLIDES 19-20

Maturing Standard: Thread-Local Arrays

The OpenACC standard currently doesn't say what to do with local arrays in accelerated regions. As of pgcc v13.6, these are treated as a shared resource among threads, so thread-local arrays must be unrolled into scalars by hand.

Before:

    for (Index_t i = 0; i < N; ++i) {
        Real_t scratch[4];
        for (Index_t j = 0; j < 4; ++j) {
            scratch[j] = x[i*4 + j];
        }
        /* do work */
    }

After:

    #pragma acc parallel loop copy(x[0:N*4])
    for (Index_t i = 0; i < N; ++i) {
        Real_t scratch0;
        Real_t scratch1;
        Real_t scratch2;
        Real_t scratch3;
        scratch0 = x[i*4 + 0];
        scratch1 = x[i*4 + 1];
        scratch2 = x[i*4 + 2];
        scratch3 = x[i*4 + 3];
        /* do work */
    }

SLIDE 21

Compiler Optimizations: Runtime Errors

Class members are often extracted before entering a data region, because currently you cannot access members within a pragma. If these extracted pointers are not made volatile, they will be optimized away.

    volatile Real_t *x = domain.x();
    Real_t *y = domain.y(); /* y is optimized away */
    #pragma acc data copyin(x[0:N], \
                            y[0:N]) /* runtime error! */
    {
        accelerated_physics(domain);
    }
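Following the slide's own pattern for x, the workaround is to make y volatile as well, so the compiler cannot remove the apparently unused load:

    volatile Real_t *y = domain.y(); /* now survives optimization */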


SLIDES 22-27

Compiler Optimizations

    volatile Real_t *x = domain.x();
    Real_t *y = domain.y(); /* y is optimized away */
    #pragma acc data copyin(x[0:N], \
                            y[0:N]) /* runtime error! */
    {
        accelerated_physics(domain);
    }

[Animation: what the runtime sees once the compiler has removed y]

1. The CPU holds pointers x and y; the accelerator holds neither.
2. Because y is never dereferenced on the CPU, its load is optimized away, leaving only x (0 2 0 2) on the CPU.
3. Alloc + Copy x: x is allocated on the accelerator and copied over.
4. ???: the runtime then tries to copy y, which no longer exists, and the data region fails with a runtime error.

SLIDE 28
  • 1. Introduction and Motivations
  • 2. OpenACC
  • 3. Challenges
  • 4. Methodologies and Results
  • 5. Conclusions


SLIDE 29

Experimental Methodologies: Completed Tasks

OpenACC rewrite of LULESH, which also supports MPI and falls back to OpenMP if not compiled with OpenACC.

This lets us measure the runtime effects of the loop unrolling and other changes we made
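A sketch of how such a fallback is commonly arranged (an assumption; the actual LULESH sources may differ): OpenACC compilers predefine the _OPENACC macro, so the build can select pragmas at compile time.

    #ifdef _OPENACC
    #pragma acc parallel loop copyin(A[0:N], B[0:N]) copyout(C[0:N])
    #else
    #pragma omp parallel for
    #endif
    for (int i = 0; i < N; ++i) {
        C[i] = A[i] * B[i];
    }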

Measurements of Interest

  • OpenACC vs OpenMP
  • OpenACC vs CUDA
  • Weak scaling
  • Strong scaling


SLIDE 30

Runtime Comparisons

[Plot: OpenACC vs OpenMP. Time per iteration (s, log scale) vs. mesh resolution (N^3, 15 to 165). Series: Fermi-OpenACC, Kepler-OpenACC, OMP-1thread, OMP-6thread, OMP-12thread]

OpenMP times were taken using up to 12 threads on a dual hex-core Intel Westmere system


SLIDE 31

Runtime Comparisons

[Plot: OpenACC vs CUDA. Time per iteration (s, log scale) vs. mesh resolution (N^3, 15 to 165). Series: Fermi-OpenACC, Kepler-OpenACC, Fermi-CUDA, Kepler-CUDA]

We used a single, balanced region to emulate the computations done by the CUDA version of LULESH


SLIDE 32

Scaling Study

[Plot: weak scaling of OpenACC. Time / element / iteration (s) vs. number of MPI ranks. Series: Fermi 60^3, Fermi 90^3, Fermi 120^3, OMP-6thd 90^3]

For simplicity, LULESH's decomposition requires scaling with a cubic number of processes. Weak scaling works well once the hardware is fully saturated.


SLIDE 33

Scaling Study

[Plot: strong scaling of OpenACC. Speedup vs. number of GPUs (1, 8, 27, 64). Series: Fermi 168^3, ideal 168^3]

Strong scaling is difficult due to decomposition and the large GPU overhead for small problem sizes


SLIDE 34

Programmability

[Plot: 'diff' length vs. time. Length of diff (lines) vs. time (days)]

Due to large codebase changes (e.g., loop unrolling), large commits were often necessary.


SLIDE 35
  • 1. Introduction and Motivations
  • 2. OpenACC
  • 3. Challenges
  • 4. Methodologies and Results
  • 5. Conclusions


SLIDE 36

OpenACC Evaluation

[Plot: Kepler vs Westmere power consumption. Watts / element / iteration (log scale) vs. mesh resolution (N^3, 15 to 165). Series: Kepler-OpenACC-SingleRegion, Kepler-OpenACC-MultiRegion, Kepler-CUDA, Westmere]

Is OpenACC cost effective?

Power is a major concern for future HPC systems. Kepler K20Xm TDP: 235 W. Westmere Xeon E7 TDP: 95 W × 2 sockets (190 W total).


SLIDE 37

OpenACC Evaluation: Do we recommend OpenACC?

In the future, possibly.

Let the standard and implementations mature first; right now the required code changes are too expensive.

What about OpenMP v4?

The new OpenMP standard supports SIMD constructs as well as accelerator (target) offloading. OpenACC is intended to be merged with OpenMP.
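A sketch of the OpenMP 4.0 analogue of the earlier vector multiply (an illustration, not from the slides): map clauses play the role of OpenACC's copyin/copyout.

    #pragma omp target map(to: A[0:N], B[0:N]) map(from: C[0:N])
    #pragma omp parallel for
    for (int i = 0; i < N; ++i) {
        C[i] = A[i] * B[i];
    }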


SLIDE 38

Thank you!
