SLIDE 1

Thomas Jefferson National Accelerator Facility

Lattice QCD, Programming Models and Porting LQCD codes to Exascale

Bálint Joó - Jefferson Lab Feb 19, 2020 HPC Roundtable


SLIDE 5

LQCD as an application

  • Replace spacetime with a 4-dimensional lattice (lattice spacing a)
  • Quark fields on the lattice sites: spinors (either complex 3-vectors, or 4x3 “vectors”)
  • Strong-force gauge fields on the links: 3x3 complex matrices
  • Interactions are typically local
  • closed loops (3-matrix x 3-matrix)
  • covariant stencils (3-matrix x 3-vector)
  • Also lattice-wide summations: global sums, inner products, etc.
  • Extremely well suited to data-parallel approaches
  • complex numbers and factors of 3 are often unfriendly to automatic vectorization - we usually need to build that in

SLIDE 6

Typical LQCD Workflow

Configuration Generation
  • Hybrid Molecular Dynamics Monte Carlo
  • Linear solves for fermion forces
  • Data-parallel code for non-solver parts
  • Strong-scaling limited
  • ‘Large’ long-running jobs

Propagators, eigenvectors, graph nodes & edges, etc.
  • Linear solves for quark propagators on sources
  • e.g. O(1M) solves/config for spectroscopy
  • Solver: same matrix, many right-hand sides
  • Throughput limited
  • Ensemble: many small jobs

Graph Contractions
  • O(10K)-O(100K) diagrams
  • sub-diagram reuse challenge
  • main operation is batched ZGEMM
  • potential large-scale I/O challenge
  • Ensemble: many single-node jobs

Correlation Function Fitting and Analysis
  • workstations

[Figure: correlation function fit plots, D. J. Wilson et al., PRD 91, 054008 (2015)]

SLIDE 8

General Software Organization

  • Level structure worked out over the last 4 iterations of the SciDAC program
  • Data Parallel Layer (QDP), over a communications abstraction layer, presents the programmer with a ‘virtual grid machine’
  • Applications are written on top of the Data Parallel Layer, calling out to highly optimized libraries as needed
  • Grid is a new code, also providing a data parallel layer, with similar layering internally (but not broken out into separate packages)

Layers: Apps (Chroma, CPS, MILC) | Libraries (QUDA, MGProto & QPhiX) | Data Parallel (QDP++/QDP-JIT/QDP-C) | Comms (QMP over MPI/Other Comms); Grid contains these layers internally

Key Goals: Port Data Parallel Layer, Port Libraries, Aim for Performance Portability

SLIDE 9

Exascale & Pre-Exascale Systems

  • Perlmutter (formerly NERSC-9)
  • AMD CPUs, NVIDIA Next Gen GPUs.
  • Slingshot fabric from Cray
  • Aurora
  • Xeon CPUs + Intel Xe Accelerators
  • Slingshot fabric from Cray
  • Frontier
  • AMD CPUs + AMD Radeon GPUs
  • Slingshot fabric from Cray
  • MPI + X programming model
  • Horsepower for all the systems will come from accelerators
  • But the accelerators are different between the 3 systems
SLIDE 10

Node Programming Model Options

Platforms considered: NVIDIA GPU, AMD GPU, Intel Xe, CPUs, Fortran, FPGAs. (Support levels on the original slide were color-coded: Supported / In development or aspirational / Can be made to work via 3rd-party product or hack / Not supported.)

  • OpenMP Offload: compilers maturing, some C++ issues
  • Kokkos/Raja: DPC++ and HIP back ends in development
  • DPC++/SYCL: NVIDIA via POCL or Codeplay back end, AMD via hipSYCL for now, well supported for Intel
  • HIP: Fortran via cross-calling, well supported for AMD GPUs
  • C++ pSTL: the way of the future? Parallelism in the base language; tech previews just now
  • CUDA: Fortran via PGI CUDA Fortran, well supported for NVIDIA GPUs

Disclaimer: this is my current view; products and support levels can change. This picture may become out of date very soon.

SLIDE 11

OpenMP Offload

  • Offloaded zaxpy in OpenMP:

    #pragma omp target teams distribute parallel for simd \
        map(to: x[:N], y[:N]) map(from: z[:N])
    for(int i=0; i < N; i++) {  // N is large
        z[i] = a*x[i] + y[i];
    }

  • The combined directive:
  • omp target - target the accelerator
  • omp teams - create a league of teams
  • omp distribute - distribute the work amongst the teams
  • omp parallel for simd - perform a SIMD-ized parallel for
  • map a, x and y to the accelerator and map the resulting z back out (data movement)
SLIDE 12

HIP

  • HIP is AMD’s “C++ Heterogeneous-Compute Interface for Portability”
  • Take your CUDA API and replace ‘cuda’ with ‘hip’:
  • cudaMemcpy() -> hipMemcpy()
  • kernel<<<blocks,threads>>>(…) -> hipLaunchKernelGGL(kernel,…)
  • and other slight changes
  • You can use the hipify tool to do a first pass of the conversion automatically
  • Open source
  • Portability between NVIDIA and AMD GPUs only
SLIDE 13

Kokkos

  • View - multi-dimensional array; index order specified by Layout, location by MemorySpace policy. Layout allows appropriate memory access for CPU/GPU
  • parallel_for dispatches a C++ lambda
  • Kokkos developers are on the C++ standards committee - working to fold features into C++

    Kokkos::View<float[N],LayoutLeft,CudaSpace> x("x"); // N is large
    Kokkos::View<float[N],LayoutLeft,CudaSpace> y("y");
    Kokkos::View<float[N],LayoutLeft,CudaSpace> z("z");
    float a=0.5;
    Kokkos::parallel_for("zaxpy", N, KOKKOS_LAMBDA (const int& i) {
        z(i) = a*x(i) + y(i); // View provides indexing operator()
    });
SLIDE 14

Portability via Kokkos

  • Kokkos provides portability via back ends: e.g. OpenMP, CUDA, …
  • Most abstractions are provided in a C++ header library
  • parallel_for, reductions, scans
  • Kokkos provides the Kokkos View data type
  • user can customize index order
  • explicit memory movement only
  • select memory space via policy
  • Bind execution to an Execution Space
  • select back end via policy

Back ends: CUDA and OpenMP (stable and production ready); OpenMP target, HIP and SYCL/DPC++ (in development)

SLIDE 15

SYCL

    sycl::queue myQueue;
    sycl::buffer<float,1> x_buf(LARGE_N);
    sycl::buffer<float,1> y_buf(LARGE_N);
    sycl::buffer<float,1> z_buf(LARGE_N);
    // … fill buffers somehow …
    float a = 0.5;
    {
      myQueue.submit([&](sycl::handler& cgh) {
        auto x = x_buf.get_access<sycl::access::mode::read>(cgh);
        auto y = y_buf.get_access<sycl::access::mode::read>(cgh);
        auto z = z_buf.get_access<sycl::access::mode::write>(cgh);
        cgh.parallel_for<class zaxpy>(sycl::range<1>(LARGE_N), [=](sycl::id<1> id) {
          auto i = id[0];
          z[i] = a*x[i] + y[i];
        });
      });
    }

  • The SYCL runtime manages the data in buffers
  • Buffers are accessed only via accessors: in command group (cgh) scope, or via a host accessor
  • The runtime can track accessor use and build a data dependency graph to automate data movement
  • Kernels must have a unique name in C++
  • What does this mean for non-SYCL libraries with pointers? (e.g. MPI)

SLIDE 16

Intel OneAPI DPC++ extensions

    sycl::queue myQueue;
    sycl::device dev = myQueue.get_device();
    sycl::context con = myQueue.get_context();
    float* x = (float*)sycl::malloc_device(LARGE_N*sizeof(float), dev, con);
    float* y = (float*)sycl::malloc_device(LARGE_N*sizeof(float), dev, con);
    float* z = (float*)sycl::malloc_device(LARGE_N*sizeof(float), dev, con);
    // … fill arrays somehow …
    float a = 0.5;
    {
      myQueue.submit([&](sycl::handler& cgh) {
        cgh.parallel_for(sycl::range<1>(LARGE_N), [=](sycl::id<1> id) {
          auto i = id[0];
          z[i] = a*x[i] + y[i];
        });
      });
    }
    // free pointers etc.

  • The USM extension allows management of arrays via host/device pointers (more CUDA-like)
  • Memcpy ops move data between host and device (not shown here)
  • Reductions!
  • The unnamed-lambda extension obviates the need for a class name for parallel_for
  • Libraries (e.g. MPI) can do intelligent things with USM pointers (e.g. direct device access)
  • The subgroup extension allows more explicit SIMD-ization

SLIDE 17

Portability via SYCL

  • Intel LLVM OneAPI/DPC++: SPIR/SPIR-V via Intel OpenCL drivers to Xeon servers, HD Graphics and FPGA; NEW: Codeplay back end emitting PTX for NVIDIA GPUs via the CUDA driver
  • Codeplay ComputeCPP: SPIR/SPIR-V via Intel OpenCL drivers; PTX for NVIDIA GPUs; other CPUs via the POCL driver
  • hipSYCL: AMD GPUs through HIP and the ROCm driver
  • Consistency in implementing the standard (?)
  • Manufacturers all have their favorite standards

SLIDE 18

US LQCD Codes are C++/C

  • For C/C++ codes, OpenMP offload, Kokkos/Raja, or DPC++/SYCL are the most obvious candidates currently. pSTL may become interesting in the near future
  • Performance portability experiments:
  • OpenMP offload: P. A. Boyle, K. Clark, C. DeTar, M. Lin, V. Rana, A. V. Aviles-Castro, “Performance Portability Strategies for Grid C++ expression templates”, arXiv:1710.09409
  • OpenMP offload: P. Steinbrecher and HotQCD - OpenMP implementation for Intel Gen9
  • Kokkos and SYCL: B. Joo, P3HPC @ SC19
  • Early pSTL experiments by K. Clark
  • The lattice developer community is paying attention to DPC++/SYCL, HIP, and OpenMP offload as the porting work for the new machines becomes more urgent
  • I will focus on our local work with the Chroma code and Kokkos and SYCL
SLIDE 19

Wilson Dslash in Kokkos and SYCL

  • When looking at a new programming model, it helps to have a “simple” mini-app to evaluate whether the model is viable
  • We chose the Wilson-Dslash operator as it
  • is sufficiently nontrivial
  • is well understood in terms of performance
  • has many hand-optimized implementations, e.g. QPhiX on KNL, QUDA on NVIDIA GPUs
  • Initial work in Kokkos looked at vectorization
  • More recently we looked at porting to SYCL, and at how portable SYCL is

[Figure: the Dslash nearest-neighbor stencil in the t, y, z directions]

SLIDE 20

Basic Performance Bound for Dslash

  • R = number of reused input spinors
  • Br = read bandwidth
  • Bw = write bandwidth
  • G = size of a gauge link matrix (bytes)
  • S = size of a spinor (bytes)
  • r = 1 (read-for-write), r = 0 (no read-for-write)
  • Simplify: assume Br = Bw = B

    F = 1320 / [ 8G/Br + (8 − R + r)S/Br + S/Bw ]

    AI = 1320 / [ 8G + (9 − R + r)S ]

Wilson Dslash arithmetic intensities (F/B) for 32-bit floating point numbers (G = 72B, S = 96B):

          R=0   R=1   R=2   R=3   R=4   R=5   R=6   R=7
    r=0   0.92  0.98  1.06  1.15  1.25  1.38  1.53  1.72
    r=1   0.86  0.92  0.98  1.06  1.15  1.25  1.38  1.53

SLIDE 21

Vectorizing Dslash for Single RHS

Virtual Node Vectorization (P. Boyle, e.g. in Grid, BFM), e.g. arXiv:1512.03487 [hep-lat]

  • For a vector unit of length N, form a log2(N)-dimensional virtual node (VN) grid
  • Treat the SIMD lanes like a grid of virtual computing elements (virtual nodes, VNs)
  • Lay out the lattice over the virtual node grid; ascribe corresponding sites from the VN grid to vector lanes
  • original site -> (‘outer’ site, lane)
  • All arithmetic changes to straightforward SIMD arithmetic
  • Accessing nearest neighbors:
  • on the edge of the ‘outer’ lattice, communicate between ‘virtual nodes’ (lanes)
  • this is a shuffle operation (e.g. _mm512_shuffle_ps in AVX512)
  • On GPUs:
  • use N=1 (no vectorization) => trivial shuffles ✅
  • or use warp/subgroup level SIMD (less portable)

SLIDE 22

Kokkos Implementation: Kernel

    template<typename VN, typename GT, typename ST, typename TGT, typename TST,
             const int isign, const int target_cb>
    struct VDslashFunctor {
      VSpinorView<ST,VN> s_in;
      VGaugeView<GT,VN> g_in;
      VSpinorView<ST,VN> s_out;
      SiteTable<VN> neigh_table;

      KOKKOS_FORCEINLINE_FUNCTION
      void operator()(const int& xcb, const int& y, const int& z, const int& t) const {
        int site = neigh_table.coords_to_idx(xcb,y,z,t);
        int n_idx;
        typename VN::MaskType mask;
        SpinorSiteView<TST> res_sum;
        HalfSpinorSiteView<TST> proj_res, mult_proj_res;

        for(int spin=0; spin < 4; ++spin)
          for(int color=0; color < 3; ++color)
            ComplexZero(res_sum(color,spin));

        // T-minus direction:
        neigh_table.NeighborTMinus(xcb,y,z,t,n_idx,mask);                       // get neighbor and permutation mask
        KokkosProjectDir3Perm<ST,VN,TST,isign>(s_in,proj_res,n_idx,mask);       // spin project
        mult_adj_u_halfspinor<GT,VN,TST,0>(g_in,proj_res,mult_proj_res,site);   // matrix multiply (neighbor already permuted)
        KokkosRecons23Dir3<TST,VN,isign>(mult_proj_res,res_sum);                // reconstruct
        // … similarly for the other directions: Z-, Y-, X-, X+, Y+, Z+, T+

        #pragma unroll
        for(int spin=0; spin < 4; ++spin)
          for(int color=0; color < 3; ++color) {
            Stream(s_out(site,spin,color),res_sum(color,spin));
          }
      }
    };

  • operator() gets its 4 indices from the multi-dimensional range policy
  • vectorization permutation mask: for neighboring-site accesses at the edges

SLIDE 23

Kokkos Implementation: Dispatch

    template<typename VN, typename GT, typename ST, typename TGT, typename TST>
    class KokkosVDslash {
    public:
      const LatticeInfo& _info;
      SiteTable<VN> _neigh_table;

      KokkosVDslash(const LatticeInfo& info) : _info(info),
        _neigh_table(info.GetCBLatticeDimensions()[0], info.GetCBLatticeDimensions()[1],
                     info.GetCBLatticeDimensions()[2], info.GetCBLatticeDimensions()[3]) {}

      void operator()(const KokkosCBFineVSpinor<ST,VN,4>& fine_in,
                      const KokkosCBFineVGaugeFieldDoubleCopy<GT,VN>& gauge_in,
                      KokkosCBFineVSpinor<ST,VN,4>& fine_out,
                      int plus_minus, const IndexArray& blocks) const {
        int source_cb = fine_in.GetCB();
        int target_cb = (source_cb == EVEN) ? ODD : EVEN;
        const VSpinorView<ST,VN>& s_in = fine_in.GetData();
        const VGaugeView<GT,VN>& g_in = gauge_in.GetData();
        VSpinorView<ST,VN>& s_out = fine_out.GetData();
        IndexArray cb_latdims = _info.GetCBLatticeDimensions();

        MDPolicy policy({0,0,0,0},
                        {cb_latdims[0],cb_latdims[1],cb_latdims[2],cb_latdims[3]},
                        {blocks[0],blocks[1],blocks[2],blocks[3]});

        if( plus_minus == 1 ) {
          if( target_cb == 0 ) {
            VDslashFunctor<VN,GT,ST,TGT,TST,1,0> f = {s_in, g_in, s_out, _neigh_table}; // instantiate functor: set fields
            Kokkos::parallel_for(policy, f);                                            // dispatch
          } else { … }
        }
      }
    };

  • 4D blocked lattice traversal dispatch

SLIDE 24

SYCL Kernel Dispatch

    template<typename VN, typename GT, typename ST, int dir, int cb>
    class dslash_loop;  // just to give the SYCL kernel a name; yuck!

    template<typename VN, typename GT, typename ST>
    class SyCLVDslash {
      const LatticeInfo& _info;
      SiteTable _neigh_table;
    public:
      SyCLVDslash(const LatticeInfo& info) : _info(info),
        _neigh_table(info.GetCBLatticeDimensions()[0], info.GetCBLatticeDimensions()[1],
                     info.GetCBLatticeDimensions()[2], info.GetCBLatticeDimensions()[3]) {}

      void operator()(const SyCLCBFineVSpinor<ST,VN,4>& fine_in,
                      const SyCLCBFineVGaugeFieldDoubleCopy<GT,VN>& gauge_in,
                      SyCLCBFineVSpinor<ST,VN,4>& fine_out,
                      int plus_minus) {
        int source_cb = fine_in.GetCB();
        int target_cb = (source_cb == EVEN) ? ODD : EVEN;
        SyCLVSpinorView<ST,VN> s_in = fine_in.GetData();
        SyCLVGaugeView<GT,VN> g_in = gauge_in.GetData();
        SyCLVSpinorView<ST,VN> s_out = fine_out.GetData();
        IndexArray cb_latdims = _info.GetCBLatticeDimensions();
        int num_sites = fine_in.GetInfo().GetNumCBSites();
        cl::sycl::queue q;

        if( plus_minus == 1 ) {
          if( target_cb == 0 ) {
            q.submit( [&](cl::sycl::handler& cgh) {
              VDslashFunctor<VN,GT,ST,1,0> f {
                s_in.template get_access<cl::sycl::access::mode::read>(cgh),
                g_in.template get_access<cl::sycl::access::mode::read>(cgh),
                s_out.template get_access<cl::sycl::access::mode::write>(cgh),
                _neigh_table.template get_access<cl::sycl::access::mode::read>(cgh)
              }; // set up functor: pass view accessors
              cgh.parallel_for<dslash_loop<VN,GT,ST,1,0>>(cl::sycl::range<1>(num_sites), f); // 1D dispatch for now
            });
          } else { …

  • Ugly: need a ‘typename’ for dispatches, unless you have the Intel -funnamed-lambda extension
  • Get Views out of user data types; pass view accessors to the functor; 1D dispatch for now

SLIDE 25

SYCL Kernel Dispatch (continued)

  • Future: instead of accessors, use USM pointers, or Views implemented using USM pointers

SLIDE 26

Experiments & Standard Candles

  • We measured the performance of the Kokkos & SYCL Dslash kernels on:
  • Volta V100 GPUs, using the Cori GPU system at NERSC
  • Skylake CPUs (single socket), using the CPUs of the Cori GPU system at NERSC
  • KNL systems, using Jefferson Lab 18p cluster nodes
  • a Gen9 GPU, using an Intel NUC system
  • Performance ‘standard candles’:
  • On GPU: Dslash from the QUDA library, with equivalent compression/precision options
  • highly optimized QCD library for GPUs, M. A. Clark et al., Comput. Phys. Commun. 181, 1517 (2010) [arXiv:0911.3191 [hep-lat]]; download via http://lattice.github.io/quda
  • On CPU/KNL: Dslash from the QPhiX library, with equivalent compression/precision options
  • Joo et al., in Kunkel J.M., Ludwig T., Meuer H.W. (eds), Supercomputing, ISC 2013, Lecture Notes in Computer Science, vol 7905, Springer, Berlin, Heidelberg; https://github.com/jeffersonlab/qphix
  • To use SYCL on KNL and GPUs we used POCL v1.8: http://portablecl.org/
SLIDE 27

SYCL on Intel HD Graphics

  • Gen9 GPU in a NUC (max DRAM bandwidth ~38 GB/sec); lattice had 32⁴ sites
  • Used Codeplay Community Edition (1.0.4 Ubuntu) and the Intel public LLVM-based SYCL compiler (version in the paper)
  • Fortran-like complex: (RIRIRI…); vector-like complex: (RRRR…IIII…)
  • since V=1 these are the same layout, but different operations
  • Best performance: sustained 32-36 GB/sec, ~45 GFLOPS => AI ~ 1.25 => R=4-5
SLIDE 28

Combined Single RHS Results

  • Kokkos using the virtual-node SIMD with a ‘Vector Type’ seems to work well
  • ‘Vectype’ is AVX512 or our complex type based on float2
  • Kokkos::complex with the ‘alignas’ keyword works as well as float2
  • SYCL + POCL did well on GPUs (it had linear lattice traversal; if we implemented 4D traversal it may be on par with Kokkos & QUDA - future work)
  • Kokkos without a Vectype did not do well on KNL - we anticipate the compiler doesn’t do well SIMD-izing complex operations(?)

SLIDE 29

LLVM: The Swiss Army Knife

  • LLVM is the compiler technology which underlies the implementations of current programming models:
  • Intel DPC++, HIPCC/HCC, NVCC, …
  • Key concepts are:
  • a front end: e.g. Clang for C++
  • an intermediate representation (IR)
  • back ends: NVPTX, AMDGPU, X86, Power, Arm, etc.
  • LLVM also includes just-in-time compilers
  • compile functions/kernels at run-time
  • powering high-level languages like Julia
  • LLVM can be used to write portable and efficient Domain Specific Languages (DSLs)

[Figure: C++ code -> Clang (front end) -> LLVM IR -> optimization passes -> LLVM IR -> back ends: X86/PowerPC emit .o loaded via dlopen(); NVPTX emits PTX for the CUDA driver; LLVM-SPIRV emits SPIR-V for an OpenCL driver; amdgpu emits GCN for the ROCm driver]

SLIDE 30

QDP-JIT: QDP++ as a DSL

  • QDP-JIT, developed by F. Winter at JLab, allowed us to move all of the QDP++ data parallel layer to GPUs
  • Expression templates (ET) generated CUDA PTX kernels
  • PTX kernels were launched by the CUDA driver
  • Automated memory movement between host/device (via a software cache)
  • Provided data layout flexibility
  • Later, PTX generation moved to the LLVM libraries
  • this turns QDP-JIT into a DSL for QCD
  • A CPU version was developed to target x86/KNL
  • no ‘driver’; LLVM JIT-ed to objects (LLVM Modules)
  • vector-friendly layout was supported (including matching QPhiX)
  • Reduced Amdahl’s law losses by accelerating the whole application, rather than just a library
  • F. T. Winter, M. A. Clark, R. G. Edwards, B. Joo, “A Framework for Lattice QCD Calculations on GPUs”, IPDPS’14, arXiv:1408.5925 [hep-lat] (replotted)

SLIDE 31

QDP-JIT via LLVM for AMD & Intel Xe?

  • NVIDIA GPU approach: tmp3 = u[nu]*tmp; -> build function with the LLVM IR builder (NVVM math functions from libdevice.bc) -> CUfunction -> launch via the CUDA Driver API (cuLaunchKernel()) -> execute!
  • AMD GPU approach: tmp3 = u[nu]*tmp; -> build function with the LLVM IR builder (OCML math functions from libocml.bc) -> LLVM IR/Module?/SPIRV? -> ROCr/HIP kernel launch? / OpenCL driver, dlopen()? -> execute!
  • Intel Xe approach?: tmp3 = u[nu]*tmp; -> build function with the LLVM IR builder (math functions: ???) -> LLVM IR -> SPIRV -> Intel graphics driver (OpenCL?) -> execute!
  • Preliminary discussions about this with the Frontier COE; we need to work with Intel more on this
SLIDE 32

Conclusions & Future Work

  • Both Kokkos and SYCL were sufficiently expressive for Dslash (parallel_for)
  • Kokkos Dslash performed on par with QUDA on NVIDIA GPUs, and QPhiX on KNL (with SIMD type)
  • SYCL performance depends a lot on the combination of compiler and driver
  • LLVM is universal and allows constructing DSLs such as QDP-JIT
  • Ports of QDP-JIT will likely have different branches for each architecture (different dispatch, etc)
  • Libraries are also being ported (not discussed here)
  • Ongoing / Future work with Kokkos and SYCL
  • Warp/Subgroup level SIMD - in progress using Intel’s SYCL Subgroup-ND range extension
  • Targeting AMD - in progress using new Kokkos HIP Back End, now looking at performance
  • Trying out the Kokkos SYCL/DPC++ back end and OpenMP offload back-ends as they develop
  • Evaluate using Kokkos to implement QDP++
  • Considering multi-node device aspects (communication)
  • Lots of ongoing work by the LQCD Software Community on porting codes to ECP systems
SLIDE 33

References

  • KokkosDslash mini-app:
  • Repo: https://github.com/bjoo/KokkosDslash.git
  • Workspace repo (with dependencies): https://github.com/bjoo/KokkosDslashWorkspace.git
  • SyCLDslash mini-app:
  • Repo: https://github.com/bjoo/SyCLDslash.git
  • Workspace repo (with dependencies): https://github.com/bjoo/SyCLDslashWorkspace.git
  • Remember to clone with ‘--recursive’!
  • Intel publicly available SYCL compiler: https://github.com/intel/llvm (sycl branch)
  • Kokkos: https://github.com/kokkos
  • SYCL: https://www.khronos.org/sycl/
  • Codeplay compiler: https://www.codeplay.com/products/computesuite/computecpp
  • USM extension: https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/USM/USM.adoc
  • Subgroup SIMD extension: https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/SubGroupNDRange/SubGroupNDRange.md
  • QUDA: https://github.com/lattice/quda, https://lattice.github.io/quda, M. A. Clark et al., Comput. Phys. Commun. 181, 1517 (2010) [arXiv:0911.3191 [hep-lat]]
  • QPhiX: https://github.com/jeffersonlab/qphix
SLIDE 34

Acknowledgments

  • B. Joo acknowledges funding from the U.S. Department of Energy, Office of Science, Office of Advanced Scientific Computing Research under the Exascale Computing Project (2.2.1.01 ADSE03 Lattice QCD)
  • B. Joo acknowledges funding from the U.S. Department of Energy, Office of Science, Offices of Nuclear Physics, High Energy Physics and Advanced Scientific Computing Research under the SciDAC-4 program
  • B. Joo acknowledges travel funding from NERSC for a summer affiliate appointment for work on Kokkos
  • The 2017 ORNL Hackathon at NASA was a collaboration between, and used resources of, both the National Aeronautics and Space Administration and the Oak Ridge Leadership Computing Facility at Oak Ridge National Laboratory. Oak Ridge National Laboratory is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725
  • We gratefully acknowledge use of computer time at Jefferson Lab (SciPhi XVI cluster), the K80 development node, NERSC Cori and Cori-GPU, and OLCF Summit