EFFECTIVE USE OF MIXED PRECISION FOR HPC (Kate Clark, Smoky Mountain Conference 2019)



SLIDE 1

Kate Clark, Smoky Mountain Conference 2019

EFFECTIVE USE OF MIXED PRECISION FOR HPC

SLIDE 2

AGENDA

  • Why Mixed Precision
  • Lattice Quantum Chromodynamics
  • Mixed Precision and Krylov Solvers
  • Mixed Precision and Multigrid
  • Tensor Cores
  • Future Challenges
  • Summary

SLIDE 3

WHY MIXED PRECISION?

There are many reasons to consider different precisions:
  • Reduce memory traffic
  • Reduce network traffic
  • Reduce memory footprint
  • Accelerated hardware in current architectures
  • Suitable numerical properties for the algorithm at hand
  • Accelerate or even improve the algorithm without compromising the quality of science

SLIDE 4

LATTICE QCD

SLIDE 5

QUANTUM CHROMODYNAMICS

The strong force is one of the basic forces of nature (along with gravity, electromagnetism, and the weak force). It is what binds together the quarks and gluons in the proton and the neutron (as well as hundreds of other particles seen in accelerator experiments), and it is responsible for the particle zoo seen at sub-nuclear scales (masses, decay rates, etc.). QCD is the theory of the strong force. It's a beautiful theory…

…but

Fermi National Accelerator Laboratory

$$\langle \Omega \rangle = \frac{1}{Z} \int [dU]\, e^{-\int d^4x\, \mathcal{L}(U)}\, \Omega(U)$$

SLIDE 6

LATTICE QUANTUM CHROMODYNAMICS

Theory is highly non-linear ⇒ cannot solve directly; must resort to numerical methods to make predictions.

Lattice QCD:
  • Discretize spacetime ⇒ 4-d lattice of size Lx × Ly × Lz × Lt
  • Finite spacetime ⇒ periodic boundary conditions
  • PDEs ⇒ finite-difference equations
  • Consumer of 10-20% of public supercomputer cycles
  • Traditionally highly optimized on every HPC platform for the past 30 years

SLIDE 7

STEPS IN AN LQCD CALCULATION

  • 1. Generate an ensemble of gluon field configurations ("gauge generation")
    – Produced in sequence, with hundreds needed per ensemble
    – Strong scaling required, with 100-1000 TFLOPS sustained for several months
    – 50-90% of the runtime is in the linear solver
    – O(1) solves per linear system
    – Target 16⁴ per GPU
  • 2. "Analyze" the configurations
    – Can be farmed out, assuming ~10 TFLOPS per job
    – Task parallelism means that clusters reign supreme here
    – 80-99% of the runtime is in the linear solver
    – Many solves per system, e.g., O(10⁶)
    – Target 24⁴-32⁴ per GPU

In both cases the kernel is a linear solve,
$$D^{\alpha\beta}_{ij}(x, y; U)\, \psi^\beta_j(y) = \eta^\alpha_i(x), \qquad \text{i.e., } Ax = b$$

Simulation cost ~ a⁻⁶ V^{5/4}

SLIDE 8

LATTICE QCD IN A NUTSHELL

$$\langle \Omega \rangle = \frac{1}{Z} \int [dU]\, e^{-\int d^4x\, \mathcal{L}(U)}\, \Omega(U)$$

[Figure: theory (lattice simulation, with face exchange and wrap-around between processors) ↔ experiment (Brookhaven National Laboratory; Large Hadron Collider); Davies et al.]

SLIDE 9

QUDA

  • “QCD on CUDA” – http://lattice.github.com/quda (C++14, open source, BSD license)
  • Effort started at Boston University in 2008, now in wide use as the GPU backend for

BQCD, Chroma, CPS, MILC, TIFR, etc.

  • Various solvers for all major fermionic discretizations, with multi-GPU support
  • Maximize performance

– Mixed-precision methods (runtime specification of precision for maximum flexibility)
– Exploit physical symmetries to minimize memory traffic
– Autotuning for high performance on all CUDA-capable architectures
– Domain-decomposed (Schwarz) preconditioners for strong scaling
– Eigenvector and deflated solvers (Lanczos, EigCG, GMRES-DR)
– Multi-RHS solvers
– Multigrid solvers for optimal convergence

  • A research tool for how to reach the exascale (and beyond)
SLIDE 10

QUDA CONTRIBUTORS

§ Ron Babich (NVIDIA) § Simone Bacchio (Cyprus) § Michael Baldhauf (Regensburg) § Kip Barros (LANL) § Rich Brower (Boston University) § Nuno Cardoso (NCSA) § Kate Clark (NVIDIA) § Michael Cheng (Boston University) § Carleton DeTar (Utah University) § Justin Foley (Utah -> NIH) § Joel Giedt (Rensselaer Polytechnic Institute) § Arjun Gambhir (William and Mary) § Steve Gottlieb (Indiana University) § Kyriakos Hadjiyiannakou (Cyprus) § Dean Howarth (BU) § Bálint Joó (Jlab) § Hyung-Jin Kim (BNL -> Samsung) § Bartek Kostrzewa (Bonn) § Claudio Rebbi (Boston University) § Hauke Sandmeyer (Bielefeld) § Guochun Shi (NCSA -> Google) § Mario Schröck (INFN) § Alexei Strelchenko (FNAL) § Jiqun Tu (Columbia) § Alejandro Vaquero (Utah University) § Mathias Wagner (NVIDIA) § Evan Weinberg (NVIDIA) § Frank Winter (Jlab)

10 years - lots of contributors

SLIDE 11

MIXED PRECISION AND KRYLOV SOLVERS

SLIDE 12

MAPPING THE DIRAC OPERATOR TO CUDA

  • Finite difference operator in LQCD is known as Dslash
  • Assign a single space-time point to each thread

V = XYZT threads, e.g., V = 24⁴ ⇒ 3.3×10⁵ threads

  • Looping over the directions, each thread must:
    – Load the neighboring spinor (24 numbers × 8)
    – Load the color matrix connecting the sites (18 numbers × 8)
    – Do the computation
    – Save the result (24 numbers)
  • Each thread has (Wilson Dslash) 0.92 naive arithmetic intensity
  • QUDA reduces memory traffic:
    – Exact SU(3) matrix compression (18 ⇒ 12 or 8 real numbers)
    – Use reduced precision

$$D_{x,x'} = \sum_\mu \left( U^{x}_\mu\, \delta_{x+\hat\mu,\,x'} + U^{x-\hat\mu\,\dagger}_\mu\, \delta_{x-\hat\mu,\,x'} \right)$$

[Figure: lattice sites x with link matrices U_μ connecting neighbors]

SLIDE 13

MULTI GPU BUILDING BLOCKS

Halo packing kernel → Interior kernel → Halo communication → Halo update kernel

[Figure: face exchange and wrap-around between neighboring GPUs]

SLIDE 14

SINGLE GPU PERFORMANCE

“Wilson Clover” stencil (Chroma); Tesla V100, CUDA 10.1, GCC 7.3

[Figure: GFLOPS (400-1600) vs lattice length L (8-40) for double and single precision]

SLIDE 15

QUDA’S 16-BIT FIXED POINT FORMAT

Link field (defines the sparse matrix elements):
  • SU(3) matrices that live between all adjacent sites on the 4-d grid
  • All elements ∈ [−1, 1] ⇒ very natural to use a 16-bit fixed-point representation

Fermion field (the vector that appears in the linear solver):
  • Each 4-d grid point consists of a 12-component complex vector
  • No a priori bounds on the elements
  • Use the per-site L∞ norm to normalize the site vector and store in 16-bit fixed point
  • Retain global dynamic range with a local 16-bit mantissa

Low precision is used only as a storage type; computation is done in FP32.

In production since 2009

SLIDE 16

SINGLE GPU PERFORMANCE

“Wilson Clover” stencil (Chroma); Tesla V100, CUDA 10.1, GCC 7.3

[Figure: GFLOPS (750-3000) vs lattice length L (8-40) for double, single, and half precision; each sustains ~1200-1300 GB/s]

SLIDE 17

LINEAR SOLVERS

LQCD requires a range of sparse iterative linear solvers: CG, BiCGstab, GCR, multi-shift solvers, etc.
  • Condition number is inversely proportional to the quark mass
  • Light (realistic) masses are highly singular
  • Naive Krylov solvers suffer from critical slowing down at decreasing mass
  • The entire solver algorithm must run on GPUs
  • The time-critical kernel is the stencil application
  • Also require BLAS level-1 type operations

conjugate gradient:

while (|r_k| > ε) {
  β_k = (r_k, r_k) / (r_{k−1}, r_{k−1})
  p_{k+1} = r_k + β_k p_k
  q_{k+1} = A p_{k+1}
  α_k = (r_k, r_k) / (p_{k+1}, q_{k+1})
  r_{k+1} = r_k − α_k q_{k+1}
  x_{k+1} = x_k + α_k p_{k+1}
  k = k + 1
}

SLIDE 18

RELIABLE UPDATES FOR MIXED PRECISION

The traditional approach to mixed precision is iterative refinement; the disadvantage is that each restart discards the Krylov space. Instead we use Reliable Updates*:
  • As the low-precision solver progresses, the residual will drift
  • Occasionally replace the iterated residual with the high-precision residual
  • Retains the Krylov-space information
  • Maintain a separate partial-solution accumulator
  • Aside: reductions are always done in fp64 regardless of the data precision

Iterative refinement:
while (|r_k| > ε) {
  r_k = b − A x_k
  solve A p_k = r_k
  x_{k+1} = x_k + p_k
}

*Sleijpen and van der Vorst, 1996

Reliable update:
if (|r_k| < δ|b|) {
  r_k = b − A x_k
  b = r_k
  y = y + x_k
  x_k = 0
}

KC, Babich, Barros, Brower, Rebbi (2009)

SLIDE 19

(STABLE) MIXED-PRECISION CG

Three key ingredients:
  • 1. CG convergence relies on the gradient vector being orthogonal to the residual vector: re-project when injecting a new residual (Strzodka and Göddeke, 2006)
  • 2. α (the step size) is chosen to minimize |e|_A, which holds regardless of the underlying precision of the process; but the solution correction is truncated if we keep the solution vector in low precision, so always keep the (partial) solution vectors in high precision
  • 3. The β computation relies on (r_i, r_j) = |r_i|² δ_{i,j}, which is not true in finite precision; the Polak-Ribière form
$$\beta_k = \frac{(r_k,\; r_k - r_{k-1})}{|r_{k-1}|^2}$$
is equivalent and self-stabilizing

while (|r_k| > ε) {
  β_k = (r_k, r_k) / (r_{k−1}, r_{k−1})
  p_{k+1} = r_k + β_k p_k
  q_{k+1} = A p_{k+1}
  α_k = (r_k, r_k) / (p_{k+1}, q_{k+1})
  r_{k+1} = r_k − α_k q_{k+1}
  x_{k+1} = x_k + α_k p_{k+1}
  k = k + 1
}

SLIDE 20

MIXED-PRECISION MILC CG SOLVER

mass = 0.01 ⇒ κ ∼ 10⁴, δ = 0.1

[Figure: residual vs iteration (0-5000, residual 1 down to 10⁻⁸) for double-half (naive), double-half (new), and double; V = 16⁴, m = 0.01]

SLIDE 21

MIXED-PRECISION MILC CG SOLVER

mass = 0.001 ⇒ κ ∼ 10⁶, δ = 0.1

[Figure: residual vs iteration (0-100,000, residual down to 10⁻⁸) for double-half (naive), double-half (new), and double; V = 16⁴, m = 0.001]

SLIDE 22

MIXED-PRECISION MILC CG SOLVER

mass = 0.001 ⇒ κ ∼ 10⁶, δ = 0.1

[Figure: as previous; V = 16⁴, m = 0.001]

Looking at smarter reliable-update triggers based on dynamic error estimates, e.g., van der Vorst and Ye.

SLIDE 23

MIXED-PRECISION MILC CG SOLVER

[Figure: solution time in seconds (0-25) for double, double-single, and double-half solvers]

SLIDE 24

HOW LOW CAN YOU GO?

  • Easily extend to an 8-bit fixed-point format
  • (Aside) finally becoming compute bound
  • How much precision (information) is needed to converge the solver?
  • Condition number dictates the precision and range required

“Wilson Clover” stencil (Chroma)

[Figure: GFLOPS (1250-5000) vs lattice length L (8-40) for double, single, half, and quarter precision]

SLIDE 25

[Figure: CG iterations (10²-10⁵) vs condition number κ (10⁴-10⁸) for double-double, double-single, double-half, and double-quarter]

HOW LOW CAN YOU GO?

“Chroma” linear system, 3×10⁶ dof

SLIDE 26

[Figure: CG iterations (10²-10⁵) vs condition number κ (10⁴-10⁸) for double-double, double-single, double-half, and double-quarter]

HOW LOW CAN YOU GO?

“Chroma” linear system, 3×10⁶ dof

In the regime of physical interest, 8-bit is a bridge too far.

SLIDE 27

MIXED PRECISION AND MULTIGRID

SLIDE 28

WHY MULTIGRID?

[Figure (left): Dirac operator applications (10²-10⁵) vs quark mass (0.40-0.43) for CG, Eig-CG, and MG-GCR on 16³×64, 24³×64, and 32³×96 lattices; Babich et al (2010), Osborn et al (2010), Clark et al (2016)]

Optimality, Speed, Stability

[Figure (right): time to solution (10-50) vs number of nodes (32-512) for BiCGstab and MG; Wilson-clover, strong scaling on Titan (K20X), V = 64³×128, mπ = 197 MeV]

SLIDE 29

ADAPTIVE GEOMETRIC MULTIGRID

Based on “Adaptive Smoothed Aggregation Multigrid” (Brezina et al, 2003): adaptively find candidate null-space vectors, dynamically learning the null space and using it to define the prolongator. The algorithm is self-learning.

Setup:
  • 1. Set the solver to be a simple smoother
  • 2. Apply the current solver to a random vector: v_i = P(D) η_i
  • 3. If convergence is good enough, solver setup is complete
  • 4. Construct the prolongator using fixed coarsening: (1 − P R) v_k = 0
    ➡ Typically use 4⁴ geometric blocks
    ➡ Preserve chirality when coarsening: R = γ₅ P† γ₅ = P†
  • 5. Construct the coarse operator (D_c = R D P)
  • 6. Recurse on the coarse problem
  • 7. Set the solver to be the augmented V-cycle, go to 2

Falgout

Babich, Brannick, Brower, KC, Manteuffel, McCormick, Osborn, Rebbi (2009)

A specialized form of adaptive multigrid, adapted for the non-Hermitian LQCD problem with geometric coarsening

SLIDE 30

MIXED-PRECISION MULTIGRID SOLVER

Ideal algorithm for mixed precision: each MG pass only reduces the residual/error by an order of magnitude, so deploy it as a preconditioner for an outer Krylov solver.

Method:
  • Do the entire MG cycle in 16-bit precision
  • Wrapped by a single-precision GCR solver
  • Use double-precision restarts to ensure convergence

LQCD adaptive MG requires a lot of “state”: the 33% reduction in peak memory means we can run on fewer nodes, with absolutely zero effect on multigrid convergence.

[Figure: coarse-operator performance on Pascal; GFLOPS (225-900) vs lattice length (2-10), fp32 vs 16-bit]

SLIDE 31

MIXED-PRECISION MULTIGRID SETUP

Need to evaluate D_c = R A P, a pseudo-batched triple matrix product, with separate sums for the coarse-operator off-diagonals and diagonals.
  • Need to employ fine-grained parallelization: perform each batched matrix product in parallel
  • Each coarse matrix element has many contributions (many-to-one)
  • Cannot pose this as a reduction, so atomically update the coarse matrix elements
  • Floating-point atomics are non-deterministic
  • Use 32-bit integer atomics for associativity and determinism

SLIDE 32

CHROMA HMC ON SUMMIT

From Titan running 2016 code to Summit running 2019 code we see a >82x speedup in HMC throughput. The multiplicative speedup comes from machine and algorithm: highly optimized multigrid for gauge-field evolution, with mixed precision an important piece of the puzzle.

  • double – outer defect correction
  • single – GCR solver
  • half – MG preconditioner
  • int32 – deterministic parallel coarsening

KC, Bálint Joó, Mathias Wagner, Evan Weinberg, Frank Winter, Boram Yoon

[Figure: HMC wallclock times (seconds) on Titan (K20X) vs Summit (V100): 4006, 1878, 974, 439, 392 s across the five configurations. Summit is 4.1x faster on 2x fewer GPUs (~8x gain), and 10.2x faster on 8x fewer GPUs (~82x gain)]

Data from B. Joó (Jefferson Lab). Chroma w/ QDP-JIT (F. Winter, Jefferson Lab) and QUDA. B. Joó gratefully acknowledges funding through the US DOE SciDAC program (DE-AC05-06OR23177)

SLIDE 33

HOW LOW CAN MULTIGRID GO?

[Figure (left): CG iterations (10²-10⁵) vs condition number κ (10⁴-10⁶) for double-double, double-single, double-half, and double-quarter]

[Figure (right): GCR iterations (10-30) vs κ for double-single, double-single-half, and double-single-half-quarter]

Unlike CG, multigrid is stable with an 8-bit smoother.

SLIDE 34

ONGOING AND FUTURE WORK

  • Plumb 8-bit into the full MG hierarchy
  • Use tensor-core-accelerated direct solve for the coarse grid (cf. Dongarra)
  • Motivation is a latency optimization:
    – Current: iterative local communication and computation ⇒ latency bound
    – Future: single all-gather collective and local direct solve

SLIDE 35

TENSOR CORES

SLIDE 36

LATTICE QCD WITH TENSOR CORES

KC, Jung, Mawhinney, Tu

Follow-up work from arXiv:1804.08593 (Guo, Mawhinney, and Tu): Multi-splitting Preconditioned CG

  • Motivated by the lack of network bandwidth in supercomputing centers (e.g., Summit)
  • Block-Jacobi preconditioner for (M)DWF that correctly applies the Dirichlet boundary condition for the red-black normal operator

SLIDE 37

LATTICE QCD WITH TENSOR CORES

Significantly reduces outer iterations… …but the local preconditioner becomes prohibitively expensive.

KC, Jung, Mawhinney, Tu

SLIDE 38

LATTICE QCD WITH TENSOR CORES

Modern GPUs have high-throughput tensor-core functionality. The preconditioned normal operator factors so that neighboring operator applications can be fused into tensor-core-friendly products:

$$\Big[\,1 - \kappa_b^2\, \underbrace{M_\phi^\dagger D_w^\dagger}_{\text{fuse}}\, M_5^{-\dagger}\, \underbrace{M_\phi^\dagger D_w^\dagger}_{\text{fuse}}\, M_5^{-\dagger}\Big]\Big[\,1 - \kappa_b^2\, \underbrace{M_5^{-1} D_w}_{\text{fuse}}\, M_\phi\, \underbrace{M_5^{-1} D_w}_{\text{fuse}}\, M_\phi\Big]$$

Tesla V100: FP64 7.5 TFLOPS, FP32 15 TFLOPS, Tensor 125 TFLOPS

KC, Jung, Mawhinney, Tu

SLIDE 39

LATTICE QCD WITH TENSOR CORES

Increasingly beneficial as one strong scales

6144 GPUs (Summit)

KC, Jung, Mawhinney, Tu

SLIDE 40

FUTURE CHALLENGES

SLIDE 41

HOW HIGH DO YOU NEED TO GO?

  • Present state-of-the-art LQCD grid size is ~128³×256 ⇒ dof ~10¹⁰
  • LQCD Hybrid Monte Carlo algorithms utilize a Metropolis acceptance test for accept/reject
  • This involves a difference measurement of the action,
$$\delta S = \psi^\dagger A_t^{-1} \psi - \psi^\dagger A_0^{-1} \psi,$$
which is an extensive quantity
  • Presently require a solver tolerance of ~10⁻¹² to ensure a high Metropolis acceptance rate
  • Not long before we have to consider going beyond double precision
  • Already putting double-double (pseudo-quad) precision into QUDA to cover this eventuality

SLIDE 42

SCALING FURTHER

Increasingly latency-limited as we reduce precision (and GPUs get faster):
  • Overhead from calling MPI routines
  • Halo-region updates do not saturate the GPU

NVSHMEM: an implementation of OpenSHMEM, a Partitioned Global Address Space (PGAS) library
  • Removes reliance on the CPU for communication
  • Parallelism for implicit compute-communication overlap
  • Allows kernel-side communication (API and LD/ST) between GPUs
  • Interoperability with MPI and OpenSHMEM libraries
  • Improves performance while making it easier to program

NVSHMEM gets the CPU out of the way

SLIDE 43

MULTI-NODE STRONG SCALING

SuperPOD, Chroma stencil, 64³×128 global volume

[Figure: GFlop/s (100,000-500,000) vs #GPUs (16-1024) for double, single, half, and quarter precision, each with GDR and NVSHMEM variants]

Disclaimer: results from a first implementation in QUDA with a pre-release version of NVSHMEM

SLIDE 44

SUMMARY

SLIDE 45

SUMMARY

  • The state of the art for LQCD linear solvers requires the use of mixed precision
  • Optimal hierarchical solvers can require three or more different precisions
  • Some algorithms are more amenable than others
  • New algorithms are possible that were unfeasible before
  • Judicious choice of precision can lead to a significant speedup: more science

SLIDE 46

SLIDE 47

RECOMPILE AND RUN

[Figure: GFlop/s (500-2,000) across GPU generations: Tesla 2007, Tesla2 2008, Fermi 2010, Kepler 2012, Maxwell 2014, Pascal 2016, Volta 2017]

Code from 2008 runs unchanged