

SLIDE 1

San Jose, CA, May 8-11, 2017

porting c++ applications to gpu with openacc for lattice quantum chromodynamics (qcd)

Peter Boyle1, Kate Clark2, Carleton DeTar3, Meifeng Lin4, Verinder Rana4, Alejandro Vaquero3

1 University of Edinburgh, 2 NVIDIA, 3 University of Utah, 4 Brookhaven National Laboratory

SLIDE 2

outline: why, what and how
  • 1. Lattice QCD
  • 2. Grid C++ Data Parallel Library
  • 3. Porting Grid to GPU
  • 4. Performance
  • 5. Conclusions


SLIDE 3

lattice qcd

SLIDE 4

lattice quantum chromodynamics (qcd)


▶ Lattice QCD is a numerical framework to simulate quarks and gluons, the fundamental particles involved in strong interactions, from the theory of QCD.
▶ It is formulated on a discrete four-dimensional space-time grid, or lattice.
▶ Quarks live on the lattice sites, and can propagate through the gluon "lattice links".
▶ Monte Carlo simulations are performed to generate the quantum fields of the gluons, or "the gauge field ensemble".
▶ Complex calculations are made on these gauge ensembles to obtain physics results of relevance to experiments or other theoretical predictions.


SLIDE 5

lattice qcd compute kernel

▶ The core kernel of lattice QCD is matrix-vector multiplication: the so-called Dslash operator.

$$D^{ij}_{\alpha\beta}(x,y)\,\psi^{j}_{\beta}(y) \;=\; \sum_{\mu=1}^{4}\Big[\,(1-\gamma_\mu)_{\alpha\beta}\,U^{ij}_{\mu}(x)\,\delta_{x+\hat\mu,\,y} \;+\; (1+\gamma_\mu)_{\alpha\beta}\,U^{\dagger\,ij}_{\mu}(x+\hat\mu)\,\delta_{x-\hat\mu,\,y}\,\Big]\,\psi^{j}_{\beta}(y)$$

▶ x, y: points of a regular 4-dimensional grid.
▶ γ_µ: fixed 4 × 4 matrices.
▶ U_µ(x): complex SU(3) matrices.
▶ ψ(y): complex 12-component vectors.
▶ The Dslash operations make up 70-90% of the computation time.
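The structure of this stencil can be made concrete with a small, self-contained sketch. The code below is only an illustration of the nearest-neighbour pattern, not Grid's implementation: it omits the (1 ± γ_µ) spin projection, uses the plain rather than adjoint link for the backward hop, and all names (dslash_sketch, mult_add, SU3, Spinor) are invented for this example.

// A minimal structural sketch (not Grid's implementation) of the Dslash
// nearest-neighbour stencil: for every lattice site, multiply SU(3) gauge
// links into the neighbouring quark fields and accumulate the result.
// The (1 +/- gamma_mu) spin projection and halo/ghost-zone bookkeeping
// are omitted; all names here are illustrative only.
#include <array>
#include <complex>
#include <vector>

using Complex = std::complex<double>;
using SU3     = std::array<Complex, 9>;   // 3x3 colour matrix, row major
using Spinor  = std::array<Complex, 12>;  // 4 spin x 3 colour components

// y[s,c] += sum_k U[c][k] * x[s,k] for each of the 4 spin components
static void mult_add(const SU3 &U, const Spinor &x, Spinor &y) {
  for (int s = 0; s < 4; ++s)
    for (int c = 0; c < 3; ++c)
      for (int k = 0; k < 3; ++k)
        y[s * 3 + c] += U[c * 3 + k] * x[s * 3 + k];
}

// Naive Dslash-like stencil on an L^4 periodic lattice.
// U[mu][site] is the link leaving 'site' in direction mu.
void dslash_sketch(int L,
                   const std::array<std::vector<SU3>, 4> &U,
                   const std::vector<Spinor> &in,
                   std::vector<Spinor> &out) {
  auto index = [L](int x, int y, int z, int t) {
    return ((t * L + z) * L + y) * L + x;
  };
  for (int t = 0; t < L; ++t)
   for (int z = 0; z < L; ++z)
    for (int y = 0; y < L; ++y)
     for (int x = 0; x < L; ++x) {
       const int s = index(x, y, z, t);
       const int coord[4] = {x, y, z, t};
       Spinor acc{};                            // zero-initialised accumulator
       for (int mu = 0; mu < 4; ++mu) {
         int fwd[4] = {x, y, z, t}, bwd[4] = {x, y, z, t};
         fwd[mu] = (coord[mu] + 1) % L;         // site x + mu_hat
         bwd[mu] = (coord[mu] - 1 + L) % L;     // site x - mu_hat
         const int sf = index(fwd[0], fwd[1], fwd[2], fwd[3]);
         const int sb = index(bwd[0], bwd[1], bwd[2], bwd[3]);
         // forward hop: U_mu(x) * psi(x + mu_hat)
         mult_add(U[mu][s], in[sf], acc);
         // backward hop would use the adjoint link U_mu(x - mu_hat)^dagger;
         // shown with the plain link here for brevity.
         mult_add(U[mu][sb], in[sb], acc);
       }
       out[s] = acc;
     }
}

Per site the real kernel performs eight 3x3 complex matrix-vector products, so the cost is dominated by streaming the U and ψ fields through memory.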

SLIDE 6

other components of lattice qcd

▶ Numerical algorithms
  ▶ Monte Carlo sampling: Metropolis, Heatbath, ...
  ▶ Molecular Dynamics (combined with Monte Carlo → Hybrid Monte Carlo)
  ▶ Linear equation solvers: Ax = b (a minimal solver sketch follows after this list)
  ▶ Eigenvalue solvers: Ax = λx
▶ Physics applications
  ▶ Actions: discretization schemes for the quarks and gluons
  ▶ Measurements: evaluation of Feynman-diagram-like graphs

[Figure: example Feynman-diagram-like measurement graphs with times t, t′, τ]
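To make the role of the linear solver concrete, here is a minimal conjugate-gradient sketch for Ax = b, the workhorse iteration in lattice QCD, where the operator A is built from the Dslash shown earlier. This is a generic textbook CG, not Grid's solver; the names apply_A and conjugate_gradient and the plain double-precision vectors are illustrative only.

// Generic conjugate gradient for symmetric positive-definite A, given only
// a routine that applies A to a vector (in lattice QCD this would be the
// even-odd preconditioned Dslash operator).
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <functional>
#include <vector>

using Vec = std::vector<double>;

static double dot(const Vec &a, const Vec &b) {
  double s = 0.0;
  for (std::size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
  return s;
}

void conjugate_gradient(const std::function<void(const Vec &, Vec &)> &apply_A,
                        const Vec &b, Vec &x,
                        double tol = 1e-8, int max_iter = 1000) {
  Vec r = b, p = b, Ap(b.size());
  std::fill(x.begin(), x.end(), 0.0);   // start from x = 0, so r = b
  double rr = dot(r, r);
  for (int k = 0; k < max_iter && std::sqrt(rr) > tol; ++k) {
    apply_A(p, Ap);                               // the expensive step
    const double alpha = rr / dot(p, Ap);
    for (std::size_t i = 0; i < x.size(); ++i) {
      x[i] += alpha * p[i];
      r[i] -= alpha * Ap[i];
    }
    const double rr_new = dot(r, r);
    const double beta = rr_new / rr;
    for (std::size_t i = 0; i < p.size(); ++i) p[i] = r[i] + beta * p[i];
    rr = rr_new;
  }
}

Each iteration costs one operator application plus a few vector updates, which is why accelerating the Dslash apply dominates the overall solver performance.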

SLIDE 7

from lattice to reality

▶ Lattice QCD can calculate the properties of fundamental particles such as the neutron and the proton from first principles.
▶ It requires an enormous amount of computing power: years of running on leadership-class supercomputers such as Mira and Titan.
▶ Higher computing efficiency and/or more computing power will allow us to reach higher precision and perform more complex calculations.
▶ Example results:
  ▶ Particle masses calculated from LQCD: S. Dürr et al., Science (2008)
  ▶ Mass differences ΔM [MeV] of isospin multiplets (ΔN, ΔΣ, ΔΞ, ΔD, ΔCG, ΔΞcc), experiment vs. QCD+QED prediction (BMW 2014): S. Borsanyi et al., Science (2015)


SLIDE 8

going into exascale

Physics Objectives

▶ Increase, by an order of magnitude, the precision of certain critical calculations needed to understand fundamental symmetries in high-energy physics.
▶ Extend the calculations of light nuclei and multi-nucleon systems in nuclear physics to quark masses that are closer to their values in nature.

Software Requirements

▶ Efficiency: should be able to efficiently exploit the expected multiple levels of parallelism on exascale architectures; needs to conquer the communication bottleneck.
▶ Flexibility: should allow users to implement different algorithms and physics calculations, and should provide easy access to multi-layered abstractions.
▶ Performance Portability: should be portable, minimizing code changes across architectures while maintaining competitive performance.


SLIDE 9

grid c++ data parallel library

SLIDE 10

grid

▶ Grid1 is a next-generation C++ lattice QCD library being developed by Peter Boyle,

Guido Cossu, Antonin Portelli and Azusa Yamaguchi at the University of Edinburgh. https://github.com/paboyle/Grid

▶ Originally developed and optimized for CPUs. Being used as a testbed for QCD

ECP performance portability.

▶ It uses new features in C++11 for abstractions and programming flexibility. ▶ Data layout designed to match CPU SIMD lanes. ▶ Vector data layout: Decompose four-dimensional grids into sub-domains that

map perfectly onto the target SIMD length.

Overdecomposed physical node SIMD vector Virtual nodes 1Peter Boyle et al. “Grid: A next generation data parallel C++ QCD library”. In: (2015). arXiv: 1512.03487

[hep-lat].
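As a concrete (if simplified) picture of the virtual-node idea, the sketch below packs a small one-dimensional lattice into short vectors so that a single outer loop iteration fills every SIMD lane. This is a schematic illustration only; Grid's actual decomposition is four-dimensional and its index ordering may differ, and all names here are invented.

// Schematic virtual-node packing: 16 physical sites, SIMD width 4,
// so 4 virtual nodes of 4 "outer" sites each.  Site o of every virtual
// node sits in one lane of the same short vector.
#include <array>
#include <cstdio>
#include <vector>

constexpr int SIMD_WIDTH = 4;                        // lanes per vector
constexpr int PHYS_SITES = 16;                       // physical 1-D lattice
constexpr int OUTER      = PHYS_SITES / SIMD_WIDTH;  // sites per virtual node

using vReal = std::array<double, SIMD_WIDTH>;        // stand-in for a SIMD register

// pack: lane l of outer site o holds physical site  l * OUTER + o
std::vector<vReal> pack(const std::vector<double> &phys) {
  std::vector<vReal> v(OUTER);
  for (int o = 0; o < OUTER; ++o)
    for (int l = 0; l < SIMD_WIDTH; ++l)
      v[o][l] = phys[l * OUTER + o];
  return v;
}

int main() {
  std::vector<double> x(PHYS_SITES), y(PHYS_SITES);
  for (int i = 0; i < PHYS_SITES; ++i) { x[i] = i; y[i] = 100 + i; }

  std::vector<vReal> vx = pack(x), vy = pack(y), vz(OUTER);

  // one short outer loop; each iteration touches SIMD_WIDTH physical sites
  for (int o = 0; o < OUTER; ++o)
    for (int l = 0; l < SIMD_WIDTH; ++l)   // this inner loop is what intrinsics
      vz[o][l] = vx[o][l] + vy[o][l];      // or the compiler vectorize

  for (int o = 0; o < OUTER; ++o)
    for (int l = 0; l < SIMD_WIDTH; ++l)
      std::printf("phys site %2d -> outer %d lane %d : %g\n",
                  l * OUTER + o, o, l, vz[o][l]);
  return 0;
}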


SLIDE 11

vectorization for cpus

▶ Vectorization is achieved in different ways on different targets, either using

intrinsics, or explicit short scalar loops for compiler vectorization, and possibly using OpenMP SIMD pragmas depending on target.

▶ But the implementation details are abstracted inside templated data types. //Vectorization #ifdef GEN #include ”Grid_generic.h” #endif #ifdef SSE4 #include ”Grid_sse4.h” #endif #if defined(AVX1) || defined (AVXFMA) || defined(AVX2) || defined(AVXFMA4) #include ”Grid_avx.h” #endif #if defined AVX512 #include ”Grid_avx512.h” #endif // Abstract Data Types typedef Grid_simd< float, SIMD_Ftype > vRealF; typedef Grid_simd< double, SIMD_Dtype > vRealD; typedef Grid_simd< std::complex< float > , SIMD_Ftype > vComplexF; typedef Grid_simd< std::complex< double >, SIMD_Dtype > vComplexD; typedef Grid_simd< Integer, SIMD_Itype > vInteger; 11
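To illustrate how such templated data types hide the target-specific details, here is a minimal sketch of a generic SIMD wrapper. It is not Grid's Grid_simd; the class name SimdGeneric and the aliases at the bottom are invented for this example, and a real intrinsics target would hold an __m256/__m512 register instead of a plain array.

// A generic fallback: plain array storage with elementwise operators that
// the compiler may auto-vectorize.  An SSE/AVX specialization would replace
// the storage and the loops with intrinsics, without changing client code.
#include <complex>
#include <cstddef>

template <class Scalar, std::size_t Lanes>
struct SimdGeneric {
  Scalar v[Lanes];   // generic storage; intrinsics targets use a register type

  friend SimdGeneric operator+(const SimdGeneric &a, const SimdGeneric &b) {
    SimdGeneric r;
    for (std::size_t i = 0; i < Lanes; ++i) r.v[i] = a.v[i] + b.v[i];
    return r;
  }
  friend SimdGeneric operator*(const SimdGeneric &a, const SimdGeneric &b) {
    SimdGeneric r;
    for (std::size_t i = 0; i < Lanes; ++i) r.v[i] = a.v[i] * b.v[i];
    return r;
  }
};

// Code written against aliases like these never sees the SIMD width or target.
using vRealF_sketch    = SimdGeneric<float, 8>;
using vComplexF_sketch = SimdGeneric<std::complex<float>, 4>;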

SLIDE 12

parallelism in grid

▶ Grid uses OpenMP for on-node threading and MPI for inter-node communications.
▶ Lattice-wide operations are done in a big for loop over the outer lattice sites.

PARALLEL_FOR_LOOP
for(int ss=0;ss<lhs._grid->oSites();ss++){
  ret._odata[ss] = trace(lhs._odata[ss]);
}

▶ PARALLEL_FOR_LOOP is a macro currently defined as an OpenMP parallel for construct. It can potentially be replaced with OpenACC for GPU (see the sketch after the code below).

#ifdef GRID_OMP
#include <omp.h>
#define PARALLEL_FOR_LOOP     _Pragma("omp parallel for ")
#define PARALLEL_NESTED_LOOP2 _Pragma("omp parallel for collapse(2)")
#else
#define PARALLEL_FOR_LOOP
#define PARALLEL_NESTED_LOOP2
#endif
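For illustration, the same macros could be pointed at OpenACC roughly as follows. This is a hedged sketch, not code from Grid; the GRID_ACC guard is invented here, and in practice the loop bodies must also be compilable as acc routines, which is exactly the difficulty discussed on the following slides.

// Sketch only: an OpenACC-flavoured definition of the same loop macros.
// GRID_ACC is a hypothetical configuration guard for this example.
#ifdef GRID_ACC
#define PARALLEL_FOR_LOOP     _Pragma("acc parallel loop independent")
#define PARALLEL_NESTED_LOOP2 _Pragma("acc parallel loop collapse(2) independent")
#elif defined(GRID_OMP)
#include <omp.h>
#define PARALLEL_FOR_LOOP     _Pragma("omp parallel for ")
#define PARALLEL_NESTED_LOOP2 _Pragma("omp parallel for collapse(2)")
#else
#define PARALLEL_FOR_LOOP
#define PARALLEL_NESTED_LOOP2
#endif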

SLIDE 13

porting grid to gpu

SLIDE 14

porting grid with openacc

Why OpenACC?

▶ OpenACC, as a directive-based approach, is easy to introduce into any existing code; therefore it should score high in portability.
▶ OpenACC supports compilation for a large number of targets, such as AMD GPUs or even multicore CPUs.
▶ PGI's OpenACC implementation supports Unified Virtual Memory (UVM), simplifying data movement and potentially eliminating the need for a deep copy.

Naïve Top-Down Approach:

▶ Declare acc kernels in the key compute kernels. This did not work due to the complicated call structures.
▶ The use of C++ STL container types also complicates things, as not all STL functions have device versions.

// generate OpenACC kernels
#pragma acc kernels default(present)
for(int ss=0;ss<sites;ss++){
  int sU=ss;
  for(int s=0;s<Ls;s++){
    int sF = s+Ls*sU;
    Kernels::DiracOptDhopSite(st,U,st.comm_buf,sF,sU,in,out);
  }
}

// the routine being called is also very complicated
#pragma acc routine seq
template<class Impl>
void WilsonKernels<Impl>::DiracOptDhopSite(StencilImpl &st, DoubledGaugeField &U,
    std::vector<SiteHalfSpinor,alignedAllocator<SiteHalfSpinor> > &buf,
    int sF, int sU, const FermionField &in, FermionField &out)
{
  // implementation here, which calls other routines
}

SLIDE 15

bottom-up: start with the expression template engine

▶ Grid Expression Template Engine (P. Boyle): 200 lines of code, self-contained, captures a lot of the language features used in Grid.
▶ OpenACC + manual deep copy did not work due to the multi-layer, nested data structures.
▶ PGI's UVM support came to the rescue!

inline Lattice<obj> & operator= (const obj & splatme) {
  int _osites=this->Osites();
#pragma acc parallel loop independent copyin(splatme[0:1])
  for(int ss=0;ss<_osites;ss++){
    _odata[ss] = splatme;
  }
  return *this;
}

template <typename Op, typename T1, typename T2>
inline Lattice<obj> & operator=(const LatticeBinaryExpression<Op,T1,T2> expr) {
  int _osites=this->Osites();
#pragma acc parallel loop independent copyin(expr[0:1])
  for(int ss=0;ss<_osites;ss++){
    _odata[ss] = eval(ss,expr);
  }
  return *this;
}

▶ With PGI 16.10:

pgc++ -I. -acc -fast -ta=tesla:managed --c++11 -Minfo=accel -O3 main.cc -o gpu.x
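Since the whole porting effort revolves around this expression template engine, a stripped-down sketch may help show what eval(ss, expr) is doing. This is not Grid's code: the names LatticeSketch, BinaryExpr, MulOp and eval below are invented for the illustration; the single fused site loop in operator= is the loop that the OpenACC/CUDA/Jitify variants offload.

// Minimal expression-template sketch: z = x * y builds a lightweight
// expression node, and assignment runs one loop that evaluates it per site.
#include <cstddef>
#include <vector>

template <class T>
struct LatticeSketch {
  std::vector<T> odata;
  explicit LatticeSketch(std::size_t n) : odata(n) {}

  // assignment from any expression: a single fused loop, no temporaries
  template <class Expr>
  LatticeSketch &operator=(const Expr &expr) {
    for (std::size_t ss = 0; ss < odata.size(); ++ss)
      odata[ss] = eval(ss, expr);          // this loop is what gets offloaded
    return *this;
  }
};

// leaf: evaluating a lattice just reads one site
template <class T>
inline T eval(std::size_t ss, const LatticeSketch<T> &l) { return l.odata[ss]; }

// node: a binary expression only holds references to its operands
template <class Op, class L, class R>
struct BinaryExpr { const L &l; const R &r; };

template <class Op, class L, class R>
inline auto eval(std::size_t ss, const BinaryExpr<Op, L, R> &e)
    -> decltype(Op::apply(eval(ss, e.l), eval(ss, e.r))) {
  return Op::apply(eval(ss, e.l), eval(ss, e.r));
}

struct MulOp {
  template <class A, class B>
  static auto apply(const A &a, const B &b) -> decltype(a * b) { return a * b; }
};

// operator* does no arithmetic: it only builds the expression node
template <class L, class R>
inline BinaryExpr<MulOp, L, R> operator*(const L &l, const R &r) { return {l, r}; }

With these pieces, z = x * y first builds a BinaryExpr node and then runs one loop that evaluates x.odata[ss] * y.odata[ss] directly into z, with no intermediate lattice.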

SLIDE 16

su(3)xsu(3) mini-app

▶ Interesting because it tests the streaming bandwidth typical of lattice QCD calculations. It is also a component of lattice gauge action calculations.
▶ Custom SU(3) class implementation with OpenACC markup, used as a data type for the ET engine. [A. Vaquero]

//--------------
// suN.h
//--------------
#ifdef _OPENACC
#define OFFLOAD _Pragma("acc routine seq")
#elif defined (__NVCC__)
#define OFFLOAD __host__ __device__
#else
#define OFFLOAD
#endif

template<class vFloat, const int N>
class SuN {
private:
  vFloat A[N*N];
public:
  // code omitted ...
  OFFLOAD friend inline SuN<vFloat,N> operator+=(SuN<vFloat,N> lhs, const SuN<vFloat,N> &rhs) {
    // matrix add code
  }
  OFFLOAD friend inline SuN<vFloat,N> operator*=(SuN<vFloat,N> lhs, const SuN<vFloat,N> &rhs) {
    // matrix multiply code
  }
};

SLIDE 17

su(3)xsu(3) mini-app main

▶ The main program is oblivious to the GPU offloading.

//----------
// main.cc
//----------
typedef SuN<complex<float>,3> Su3f;

int main(int argc, char **argv)
{
  std::chrono::high_resolution_clock::time_point start, stop;
  int Nloop=1000;
  // code omitted ....
  int vol = L*L*L*L;
  Grid grid(vol);
  Lattice<Su3f> z(&grid);
  Lattice<Su3f> x(&grid);
  Lattice<Su3f> y(&grid);

  start = std::chrono::high_resolution_clock::now();  // start timer (assignment elided on the slide)
  for(int i=0;i<Nloop;i++) {
    z=x*y;
  }
  stop = std::chrono::high_resolution_clock::now();

  double time = (std::chrono::duration_cast<std::chrono::microseconds>(stop - start).count())/Nloop*1000.0;
  double bytes     = 3.0*vol*sizeof(Su3f);
  double footprint = 2.0*vol*sizeof(Su3f);
  double flops     = 9*(6.0+8.0+8.0)*vol;
  std::cout << lat << "\t\t" << footprint << " \t\t"
            << bytes/time << "\t\t" << flops/time << std::endl;
}

SLIDE 18

other approaches: cuda

▶ Explicit CUDA kernels + managed memory allocator.
▶ Need to decorate all host/device functions with __host__ __device__.

#ifdef GPU
template<class Expr, class obj>
__global__ void ETapply(int N, obj *_odata, Expr Op)
{
  int ss = blockIdx.x;
  _odata[ss]=eval(ss,Op);
}
#endif

template <typename Op, typename T1, typename T2>
inline Lattice<obj> & operator=(const LatticeBinaryExpression<Op,T1,T2> expr)
{
  int _osites=this->Osites();
#ifdef GPU
  LatticeBinaryExpression<Op,T1,T2> temp = expr;
  ETapply< decltype(temp), obj ><<<_osites,1>>>((int)_osites, this->_odata, temp);
#else
  for(int ss=0;ss<_osites;ss++){
    _odata[ss] = eval(ss,expr);
  }
#endif
  return *this;
}

SLIDE 19

just-in-time compilation with jitify

See S7716 at 4pm Monday: Ben Barsdell, Kate Clark “Jitify: CUDA C++ RUNTIME COMPILATION MADE EASY”.

▶ A front-end to the CUDA runtime compiler, nvrtc.
▶ Eliminates the need to decorate host/device functions.

template<class obj>
template <typename Op, typename T1, typename T2>
Lattice<obj> & Lattice<obj>::operator=(const LatticeBinaryExpression<Op,T1,T2> expr)
{
  auto _osites = this->Osites();
  auto _odata  = this->_odata;
#ifdef USE_JITIFY
  // set JITIFY_OPTIONS="-include ET.h -std=c++11"
  parallel_for(policy, 0, _osites,
               JITIFY_LAMBDA( (_odata,expr), _odata[i]=eval(i,expr); ));
#else
  for (int i=0; i<_osites; i++)
    _odata[i] = eval(i,expr);
#endif
  return *this;
}

SLIDE 20

performance

SLIDE 21

performance on gtx 1080: summary

▶ Test platform: NVIDIA GTX 1080, peak memory bandwidth 288 GB/s.
▶ Lattice-wide SU(3)xSU(3) streaming bandwidth measured with OpenACC, CUDA and Jitify:
  ▶ OpenACC: PGI 16.10
  ▶ CUDA: CUDA 8.0
  ▶ Jitify: nvrtc, CUDA 8.0
▶ coalesced_ptr is used in the CUDA and Jitify implementations, but not in the OpenACC implementation.

[Figure: summary bandwidth chart, with the peak memory bandwidth marked for reference]

SLIDE 22

coalesced_ptr

▶ https://github.com/maddyscientist/coalesced_ptr (Kate Clark)
▶ A smart pointer that automatically provides coalesced memory transactions for arrays of arbitrary structures (see the conceptual sketch below the figure).
▶ Boosts performance for the CUDA and Jitify implementations dramatically.
▶ Doesn't work with the OpenACC implementation due to the use of std::complex.

[Figure: complex SU(3)xSU(3) streaming performance (GB/s) vs. lattice extent L on NVIDIA GTX 1080; CUDA and Jitify with and without coalesced_ptr, OpenACC without coalesced_ptr]
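The effect of coalesced_ptr can be pictured with a plain C++ sketch of the underlying layout transformation. This is conceptual only, not the actual coalesced_ptr implementation, and the class and member names below are invented; on a GPU the "adjacent indices" would be adjacent threads of a warp, so the transposed layout turns their loads into coalesced transactions.

// A plain array of structures (AoS) makes index i load words sizeof(Site)
// apart; storing the same data transposed (structure of arrays) makes
// neighbouring indices touch neighbouring words.  A coalescing smart pointer
// performs this transposition transparently behind operator[].
#include <cstddef>
#include <vector>

struct Site { float re[9]; float im[9]; };   // e.g. one complex SU(3) matrix

class CoalescedStoreSketch {
  std::vector<float> words;   // word k of site i lives at index k*n + i
  std::size_t n;
public:
  explicit CoalescedStoreSketch(std::size_t nsites)
      : words(nsites * sizeof(Site) / sizeof(float)), n(nsites) {}

  void store(std::size_t i, const Site &s) {
    const float *src = reinterpret_cast<const float *>(&s);
    for (std::size_t k = 0; k < sizeof(Site) / sizeof(float); ++k)
      words[k * n + i] = src[k];            // index i varies fastest in memory
  }
  Site load(std::size_t i) const {
    Site s;
    float *dst = reinterpret_cast<float *>(&s);
    for (std::size_t k = 0; k < sizeof(Site) / sizeof(float); ++k)
      dst[k] = words[k * n + i];            // adjacent i means adjacent addresses
    return s;
  }
};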

SLIDE 23

performance comparison: flops

[Figure: complex SU(3)xSU(3) streaming performance (GFlop/s) vs. lattice extent L on NVIDIA GTX 1080; CUDA and Jitify with and without coalesced_ptr, OpenACC without coalesced_ptr]


SLIDE 24

openacc with coalesced_ptr

▶ coalesced_ptr works with the float data type in OpenACC. [V. Rana]
▶ A modified U(3)xU(3) test also gets 70% to 80% of peak memory bandwidth.

[Figure: SU(3)xSU(3) streaming bandwidth (GB/s) vs. lattice extent L with OpenACC; complex without coalesced_ptr, float without coalesced_ptr, float with coalesced_ptr]

SLIDE 25

nollvm vs -Minline (pgi 16.10)

▶ To get the performance shown previously, we also need to turn off the default LLVM code generator in the PGI compiler via the -ta=tesla:managed,nollvm option, or force inlining with -Minline (example command lines follow below).

[Figure: SU(3)xSU(3) streaming bandwidth (GB/s) vs. lattice extent L on NVIDIA GTX 1080; llvm vs. nollvm vs. -Minline builds]
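As an illustration of the two workarounds, assuming the same pgc++ invocation shown on slide 15 (the exact build lines used for the measurements are not shown on the slide):

pgc++ -I. -acc -fast -ta=tesla:managed,nollvm --c++11 -Minfo=accel -O3 main.cc -o gpu_nollvm.x
pgc++ -I. -acc -fast -ta=tesla:managed --c++11 -Minfo=accel -Minline -O3 main.cc -o gpu_minline.x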

SLIDE 26

conclusions

SLIDE 27

conclusions

▶ OpenACC in principle provides an easy way to port Grid to GPUs with minimal code changes.
▶ However, the nested data structures and complex data types in Grid make this challenging.
▶ PGI's UVM support alleviates the deep-copy issue, but its C++ support still needs improvement.
▶ We have succeeded in implementing an OpenACC version of the SU(3)xSU(3) streaming test that delivers reasonable performance compared to CUDA and Jitify.
▶ Memory coalescing is very important for achieving maximum performance on GPUs.
▶ Future work will include a Dslash mini-app that represents the key lattice QCD compute kernel.
▶ Stay tuned!

SLIDE 28

acknowledgments

SLIDE 29

acknowledgments

▶ Partly funded by the US Exascale Computing Project and the BNL-SBU Seed Program.
▶ We are thankful for helpful discussions with the QCD ECP team.
▶ We benefited from the GPU Hackathon at the University of Delaware, where the OpenACC porting was started.
▶ We thank Mathias Wagner (NVIDIA) and Mathew Colgrove (NVIDIA), our mentors at UDelHack, who have continued to support us after the Hackathon.