S9708 - STRONG SCALING HPC APPLICATIONS: BEST PRACTICES WITH A - PowerPoint PPT Presentation

S9708 - STRONG SCALING HPC APPLICATIONS: BEST PRACTICES WITH A LATTICE QCD CASE STUDY Kate Clark, Mathias Wagner

Lattice Quantum Chromodynamics QUDA Bandwidth Optimization AGENDA Latency Optimization Multi-node scaling NVSHMEM Summary 2

    QUANTUM CHROMODYNAMICS The strong force is one of the basic forces of nature   (along with gravity, em and weak) It’s what binds together the quarks and gluons in the proton and the neutron (as well as hundreds of other particles seen in accelerator experiments) Responsible for the particle zoo seen at sub-nuclear scales (mass, decay rate, etc.) QCD is the theory of the strong force It’s a beautiful theory…   h Ω i = 1 Z d 4 xL ( U ) Ω ( U ) R [ dU ] e − Z …but i01K($C&-("#&?$7))0?01&-"1$G&'"1&-"12 ! 3

LATTICE QUANTUM CHROMODYNAMICS Theory is highly non-linear ⇒ cannot solve directly Must resort to numerical methods to make predictions   Lattice QCD Discretize spacetime ⇒ 4-d dimensional lattice of size Lx x Ly x Lz x Lt Finite spacetime ⇒ periodic boundary conditions PDEs ⇒ finite difference equations Consumer of 10-20% of public supercomputer cycles Traditionally highly optimized on every HPC platform for the past 30 years Andre Walker-Loud S91010: Accelerating our Understanding of Nuclear Physics and the Early Universe Jiqun Tu S9330: Lattice QCD with Tensor Cores ! 4

STEPS IN AN LQCD CALCULATION D αβ ij ( x, y ; U ) ψ β j ( y ) = η α i ( x ) or Ax = b 1. Generate an ensemble of gluon field configurations “gauge generation” Produced in sequence, with hundreds needed per ensemble Simulation Cost ~ a -6 V 5/4 Strong scaling required with 100-1000 TFLOPS sustained for several months 50-90% of the runtime is in the linear solver O(1) solve per linear system Target 16 4 per GPU 2. “Analyze” the configurations Can be farmed out, assuming ~10 TFLOPS per job Task parallelism means that clusters reign supreme here 80-99% of the runtime is in the linear solver   Many solves per system, e.g., O(10 6 ) Target 24 4 -32 4 per GPU ! 5

LATTICE QCD IN A NUTSHELL h Ω i = 1 Z face d 4 xL ( U ) Ω ( U ) R exchange [ dU ] e − wrap around Z face exchange wrap around theory' experiment) ! ! ! 6 Davies' et#al# Large&Hadron&Collider& Brookhaven*Na,onal*Laboratory*

NVIDIA POWERS WORLD'S FASTEST SUPERCOMPUTER Summit Becomes First System To Scale The 100 Petaflops Milestone 122 PF 3 EF 27,648 HPC AI Volta Tensor Core GPUs 7

STRONG SCALING Multiple meanings Same problem size, more nodes, more GPUs Same problem, next generation GPUs Multigrid - strong scaling within the same run (not discussed here) To tame strong scaling we have to understand the limiters Bandwidth limiters Latency limiters ! 8

QUDA 9

QUDA • “QCD on CUDA” – http://lattice.github.com/quda (open source, BSD license) • Effort started at Boston University in 2008, now in wide use as the GPU backend for BQCD, Chroma, CPS, MILC, TIFR, etc. • Provides: — Various solvers for all major fermionic discretizations, with multi-GPU support — Additional performance-critical routines needed for gauge-field generation • Maximize performance – Exploit physical symmetries to minimize memory traffic – Mixed-precision methods – Autotuning for high performance on all CUDA-capable architectures – Domain-decomposed (Schwarz) preconditioners for strong scaling – Eigenvector and deflated solvers (Lanczos, EigCG, GMRES-DR) – Multi-source solvers – Multigrid solvers for optimal convergence • A research tool for how to reach the exascale ! 10

QUDA CONTRIBUTORS 10 years - lots of contributors § Ron Babich (NVIDIA) § Dean Howarth (BU) Simone Bacchio (Cyprus) Bálint Joó (Jlab) § § § Michael Baldhauf (Regensburg) § Hyung-Jin Kim (BNL -> Samsung) Kip Barros (LANL) Bartek Kostrzewa (Bonn) § § Rich Brower (Boston University) Claudio Rebbi (Boston University) § § Nuno Cardoso (NCSA) Hauke Sandmeyer (Bielefeld) § § Kate Clark (NVIDIA)* Guochun Shi (NCSA -> Google) § § Michael Cheng (Boston University) Mario Schröck (INFN) § § Carleton DeTar (Utah University) Alexei Strelchenko (FNAL) § § Justin Foley (Utah -> NIH) Jiqun Tu (Columbia) § § Joel Giedt (Rensselaer Polytechnic Institute) Alejandro Vaquero (Utah University) § § Arjun Gambhir (William and Mary) Mathias Wagner (NVIDIA)* § § Steve Gottlieb (Indiana University) Evan Weinberg (NVIDIA)* § § *this work § Kyriakos Hadjiyiannakou (Cyprus) § Frank Winter (Jlab) ! 11

LINEAR SOLVERS while (| r k |> ε ) { • β k = ( r k , r k )/( r k-1 , r k-1 ) • p k+1 = r k - β k p k q k+1 = A p k+1 QUDA supports a wide range of linear solvers • α = ( r k , r k )/( p k+1, q k+1 ) • r k+1 = r k - α q k+1 CG, BiCGstab, GCR, Multi-shift solvers, etc. • x k+1 = x k + α p k+1 • k = k+1 } Condition number inversely proportional to mass conjugate Light (realistic) masses are highly singular gradient Naive Krylov solvers suffer from critical slowing down at decreasing mass Entire solver algorithm must run on GPUs Time-critical kernel is the stencil application Also require BLAS level-1 type operations ! 12

MAPPING THE DIRAC OPERATOR TO CUDA x    U x D x,x 0 = • Finite difference operator in LQCD is known as Dslash x   x −  x • Assign a single space-time point to each thread μ U x ν V = XYZT threads, e.g., V = 24 4 => 3.3x10 6 threads • Looping over direction each thread must μ x −  – Load the neighboring spinor (24 numbers x8) – Load the color matrix connecting the sites (18 numbers x8) – Do the computation – Save the result (24 numbers) X[1] • Each thread has (Wilson Dslash) 0.92 naive arithmetic intensity • QUDA reduces memory traffic Exact SU(3) matrix compression (18 => 12 or 8 real numbers) Use 16-bit fixed-point representation with mixed-precision solver X[0] ! 13

SINGLE GPU PERFORMANCE “Wilson Dslash” stencil half (1-GPU) single (1-GPU) double (1-GPU) 3000 2500 1013 GB/s 2000 GFLOPS 1500 1119 GB/s 1000 1115 GB/s 500 Tesla V100 CUDA 10.1 0 cf. STREAM 850 GB/s 32 28 24 20 16 12 8 GCC 7.3 Lattice length “strong scaling” ! 14

BANDWIDTH OPTIMIZATION 15

GENERATIONAL COMPARISON F µ ν kernel - batched 3x3 multiplication 6000.00 1000 4000 K80 P100 V100 4500.00 750 3000 GFLOPS 3000.00 2000 500 1500.00 1000 250 4-21% of peak 9-35% of peak 6-37% of peak 0 0.00 0 1000 4000 6000 750 3000 4500 GFLOPS 500 2000 3000 250 1000 1500 0 0 0 ! 16

QUDA’S AUTOTUNER QUDA includes an autotuner for ensuring optimal kernel performance virtual C++ class “ Tunable ” that is derived for each kernel you want to autotune By default Tunable classes will autotune 1-d CTA size, shared memory size, grid size Derived specializations do 2-d and 3-d CTA tuning Tuned parameters are stored in a std::map and dumped to disk for later reuse Built in performance metrics and profiling User just needs to State resource requirements: shared memory per thread and/or per CTA, total number of threads Specify a tuneKey which gives each kernel a unique entry and break any degeneracy ! 17

SINGLE GPU PERFORMANCE “Wilson Dslash” stencil half blocksize=32 single blocksize=32 double blocksize-32 half tuned single tuned double tuned 3000 1312 GB/s 2500 2000 GFLOPS 1291 GB/s 1500 1000 1180 GB/s 500 Tesla V100 CUDA 10.1 0 cf Perfect L2 roofline   32 28 24 20 16 12 8 GCC 7.3 Lattice length ~ 1700 GB/s “strong scaling” ! 18

RECOMPILE AND RUN Autotuning provides performance portability 2,000 Code from 2008 runs unchanged 1,500 1,000 500 GFlop/s 0 2007 2008 2010 2012 2014 2016 2017 ! 19

MULTI GPU BUILDING BLOCKS face Halo packing Kernel exchange wrap Halo packing Kernel around Interior Kernel Interior Kernel Halo communication Halo communication face exchange wrap Halo update Kernel around Halo update Kernel ! 20

Multi-dimensional Kernel Computation 2-d example • Checkerboard updating scheme employed, so only half of the sites are updated per application • Green: source sites • Purple: sites to be updated • Orange: site update complete

Multi-dimensional Kernel Computation Step 1 • Gather boundary sites into contiguous buffers to be shipped off to neighboring GPUs, one direction at a time.

Multi-dimensional Kernel Computation Step 2 • An “interior kernel” updates all local sites to the extent possible. Sites along the boundary receive contributions from local neighbors.

Multi-dimensional Kernel Computation Step 3 • Boundary sites are updated by a series of kernels - one per direction. • A given boundary kernel must wait for its ghost zone to arrive • Note in higher dimensions corner sites have a race condition - serialization of kernels required

S9708 - STRONG SCALING HPC APPLICATIONS: BEST PRACTICES WITH A - PowerPoint PPT Presentation

S9708 - STRONG SCALING HPC APPLICATIONS: BEST PRACTICES WITH A LATTICE QCD CASE STUDY Kate Clark, Mathias Wagner Lattice Quantum Chromodynamics QUDA Bandwidth Optimization AGENDA Latency Optimization Multi-node scaling NVSHMEM Summary 2

HPC @ SAO S.G. Korzennik - SAO HPC Analyst hpc@cfa February 2013 SGK ( hpc@cfa ) HPC @ SAO

Uni.lu HPC School 2020 PS6: HPC Containers: Singularity Uni.lu High Performance Computing (HPC)

HPC Clusters: Best Practices and Performance Study Agenda HPC at HPE System

The HPC Skill Tree A Brief Overview Kai Himstedt On Behalf of the HPC-CF Board BoF:

Outline Scaling Scalinga Plenitude of Power Laws Scaling-at-large Scaling-at-large

UP UP AND OUT: SCALING SOFTWARE WITH AKKA Jonas Bonr CTO Typesafe @jboner Scaling software

Whats new in HPC? Gregory Bauer To keep up-to-date on HPC HPC Guru -

UL HPC School 2017[bis] PS1: Getting Started on the UL HPC platform UL High Performance

UL HPC School 2017 PS5: Advanced Scheduling with SLURM and OAR on UL HPC clusters UL High

UL HPC School 2017 PS1: Getting Started on the UL HPC platform UL High Performance Computing

Analysis of Scaling Algorithms for Matrix & Operator Scaling Contents Scaling Algorithms

CONTAINERS DEMOCRATIZE HPC CJ Newburn, Principal Architect for HPC, NVIDIA GTC19 S9525 -

Bringing Best Practices to a Long-Lived Production Code Charles R. Ferenbaugh HPC Best Practices

Python Best Practices in HPC Roland Haas (NCSA) Email: rhaas@illinois.edu Why use Python in HPC?

Computer Security Summer Scholars 2016 Ma7 Vander Werf HPC System Administrator Security in HPC

Building a Grid System for HPC HPC on Grid High Performance Computing (HPC): Use of computer

The Effects of Race Conditions when Implementing Single-Source Redundant Clock Trees in Triple

TestOps: Continuous Integration when infrastructure is the product Barry Jaspan Senior

Multi-GPU: A Hands-on Exercise Justin Luitjens NVIDIA - Developer Technologies Connection

CONCURRENCY IN C++ Yuqing Xia CSCI 5828 Prof. Kenneth M. Anderson University of Colorado at

Blue Cross Blue Shield Massachusettss Role in Addressing Health Equity and Preventing Chronic

Associations Betw een Practice- Reported Medical Homeness and Health Care Utilization Among

Formal sound verification of Linuxs USB BP keyboard driver Willem Penninckx Jan Tobias

Akka Java Documentation Release 2.2.3 1 Thursday, November 28, 13 Terminology, Concepts

S9708 - STRONG SCALING HPC APPLICATIONS: BEST PRACTICES WITH A - PowerPoint PPT Presentation

S9708 - STRONG SCALING HPC APPLICATIONS: BEST PRACTICES WITH A LATTICE QCD CASE STUDY Kate Clark, Mathias Wagner Lattice Quantum Chromodynamics QUDA Bandwidth Optimization AGENDA Latency Optimization Multi-node scaling NVSHMEM Summary 2

HPC @ SAO S.G. Korzennik - SAO HPC Analyst hpc@cfa February 2013 SGK ( hpc@cfa ) HPC @ SAO

Uni.lu HPC School 2020 PS6: HPC Containers: Singularity Uni.lu High Performance Computing (HPC)

HPC Clusters: Best Practices and Performance Study Agenda HPC at HPE System

The HPC Skill Tree A Brief Overview Kai Himstedt On Behalf of the HPC-CF Board BoF:

Outline Scaling Scalinga Plenitude of Power Laws Scaling-at-large Scaling-at-large

UP UP AND OUT: SCALING SOFTWARE WITH AKKA Jonas Bonr CTO Typesafe @jboner Scaling software

Whats new in HPC? Gregory Bauer To keep up-to-date on HPC HPC Guru -

UL HPC School 2017[bis] PS1: Getting Started on the UL HPC platform UL High Performance

UL HPC School 2017 PS5: Advanced Scheduling with SLURM and OAR on UL HPC clusters UL High

UL HPC School 2017 PS1: Getting Started on the UL HPC platform UL High Performance Computing

Analysis of Scaling Algorithms for Matrix &amp; Operator Scaling Contents Scaling Algorithms

CONTAINERS DEMOCRATIZE HPC CJ Newburn, Principal Architect for HPC, NVIDIA GTC19 S9525 -

Bringing Best Practices to a Long-Lived Production Code Charles R. Ferenbaugh HPC Best Practices

Python Best Practices in HPC Roland Haas (NCSA) Email: rhaas@illinois.edu Why use Python in HPC?

Computer Security Summer Scholars 2016 Ma7 Vander Werf HPC System Administrator Security in HPC

Building a Grid System for HPC HPC on Grid High Performance Computing (HPC): Use of computer

The Effects of Race Conditions when Implementing Single-Source Redundant Clock Trees in Triple

TestOps: Continuous Integration when infrastructure is the product Barry Jaspan Senior

Multi-GPU: A Hands-on Exercise Justin Luitjens NVIDIA - Developer Technologies Connection

CONCURRENCY IN C++ Yuqing Xia CSCI 5828 Prof. Kenneth M. Anderson University of Colorado at

Blue Cross Blue Shield Massachusettss Role in Addressing Health Equity and Preventing Chronic

Associations Betw een Practice- Reported Medical Homeness and Health Care Utilization Among

Formal sound verification of Linuxs USB BP keyboard driver Willem Penninckx Jan Tobias

Akka Java Documentation Release 2.2.3 1 Thursday, November 28, 13 Terminology, Concepts

Analysis of Scaling Algorithms for Matrix & Operator Scaling Contents Scaling Algorithms