SPEEDING UP CONJUGATE GRADIENT SOLVERS BY 10X Mathias Wagner, - PowerPoint PPT Presentation

SPEEDING UP CONJUGATE GRADIENT SOLVERS BY 10X Mathias Wagner, Developer Technology Engineer GTC 2017 - S7387

HPC BEYOND MOORE’S LAW going wide 4000 CPUs and GPUs becoming wider 3000 increase in flops is driven by more cores CUDA cores 2000 also applies for CPUs (Server to mobile) 1000 need sufficient amount of parallelism to fill architectures 0 Tesla Fermi Kepler Maxwell Pascal 2

HPC BEYOND MOORE’S LAW staying local 10 Tesla GPU Intel CPU Intel MIC CPUs and GPUs becoming wider 8 increase in flops is driven by more 6 cores 4 need sufficient amount of parallelism to fill architectures FLOPS Byte 2 0 3

Lattice Quantum Chromodynamics Classic Conjugate Gradient Solver AGENDA Block Krylov solvers Efficient implementation Time to solution improvements in the wild 4

LATTICE QCD: SOME BASICS Z QCD partition function DAD ¯ Ψ D Ψ e − S E ( T,µ ) Z QCD ( T, µ ) = 4 dimensional grid (=Lattice) includes integral over space and time quarks live on lattice sites gluons live on the links typical sizes: 24 3 x 6 to 256 4 parallelization over lattice sites   (10 5 to 10 9 ) 6

QUDA - LATTICE QCD ON GPUS http://lattice.github.com/quda lattice / quda Watch 36 Unstar 45 Fork 29 Code Issues 107 Pull requests 2 Wiki Pulse Graphs Settings QUDA is a library for performing calculations in lattice QCD on GPUs. http://lattice.github.com/quda — Edit 4,621 commits 49 branches 19 releases 16 contributors Clone or download Clone or download Branch: develop New pull request Create new file Upload files Find file Latest commit f3e2aa7 a day ago … mathiaswagner committed on GitHub Merge pull request #487 from lattice/hotfix/checkerboard-reference include In ColorSpinorParam, if staggered fermions then set field dimension t… 11 days ago lib Correctly set volumeCB for parity subset references - need to check p… a day ago tests Requesting --test 1 with staggered_dslash_test now tests MdagM operator 11 days ago .gitignore Updates to .gitignore and renamed multigrid_benchmark to multigrid_be… 3 months ago 7 CMakeLists.txt added some comments to CMakeLists.txt 3 months ago

QUDA AUTHORS Green: Contributors to this talk Ron Babich (NVIDIA) Dean Howarth (Rensselaer Polytechnic Institute) Michael Baldhauf (Regensburg) Bálint Joó (Jlab) Kip Barros (LANL) Hyung-Jin Kim (BNL -> Samsung) Rich Brower (Boston University) Claudio Rebbi (Boston University) Nuno Cardoso (Lisbon) Guochun Shi (NCSA -> Google) Kate Clark (NVIDIA) Mario Schröck (INFN) Michael Cheng (Boston University) Alexei Strelchenko (FNAL) Carleton DeTar (Utah University) Alejandro Vaquero (Utah University) Justin Foley (NIH) Mathias Wagner (NVIDIA) Joel Giedt (Rensselaer Polytechnic Institute) Evan Weinberg (Boston University) Arjun Gambhir (William and Mary) Frank Winter (Jlab) Steve Gottlieb (Indiana University)

THE OLD WORK HORSE

CONJUGATE GRADIENT procedure CG r (0) = b � Ax (0) p (0) = r (0) for k = 1 , 2 , . . . until converged do z ( k − 1) = Ap ( k − 1) 2 | r ( k − 1) | α ( k − 1) = h ( p ( k − 1) ) ,z ( k − 1) i x ( k ) = x ( k � 1) + α ( k − 1) p ( k − 1) r ( k ) = r ( k − 1) � α ( k − 1) z ( k − 1) 2 β ( k − 1) = | r ( k − 1) | 2 | r ( k ) | p ( k ) = r ( k ) + β ( k − 1) p ( k − 1) end for end procedure 10

CONJUGATE GRADIENT procedure CG r (0) = b � Ax (0) p (0) = r (0) matrix-vector operation dominates runtime for k = 1 , 2 , . . . until converged do z ( k − 1) = Ap ( k − 1) 2 | r ( k − 1) | α ( k − 1) = h ( p ( k − 1) ) ,z ( k − 1) i x ( k ) = x ( k � 1) + α ( k − 1) p ( k − 1) r ( k ) = r ( k − 1) � α ( k − 1) z ( k − 1) 2 β ( k − 1) = | r ( k − 1) | 2 | r ( k ) | p ( k ) = r ( k ) + β ( k − 1) p ( k − 1) end for end procedure 10

CONJUGATE GRADIENT procedure CG r (0) = b � Ax (0) p (0) = r (0) matrix-vector operation dominates runtime for k = 1 , 2 , . . . until converged do z ( k − 1) = Ap ( k − 1) 2 | r ( k − 1) | α ( k − 1) = caxpy BLAS operations caxpy BLAS operations h ( p ( k − 1) ) ,z ( k − 1) i x ( k ) = x ( k � 1) + α ( k − 1) p ( k − 1) r ( k ) = r ( k − 1) � α ( k − 1) z ( k − 1) 2 β ( k − 1) = | r ( k − 1) | 2 | r ( k ) | p ( k ) = r ( k ) + β ( k − 1) p ( k − 1) end for end procedure 10

CONJUGATE GRADIENT procedure CG r (0) = b � Ax (0) p (0) = r (0) matrix-vector operation dominates runtime for k = 1 , 2 , . . . until converged do z ( k − 1) = Ap ( k − 1) 2 | r ( k − 1) | α ( k − 1) = caxpy BLAS operations caxpy BLAS operations h ( p ( k − 1) ) ,z ( k − 1) i x ( k ) = x ( k � 1) + α ( k − 1) p ( k − 1) r ( k ) = r ( k − 1) � α ( k − 1) z ( k − 1) 2 β ( k − 1) = | r ( k − 1) | Reductions Reductions 2 | r ( k ) | p ( k ) = r ( k ) + β ( k − 1) p ( k − 1) end for end procedure 10

STAGGERED FERMION MATRIX (DSLASH) Krylov space solve of fermion matrix dominates runtime within inversion application of sparse Matrix (Dslash) dominates (>80%) 11

        STAGGERED FERMION MATRIX (DSLASH) Krylov space solve of fermion matrix dominates runtime within inversion application of sparse Matrix (Dslash) dominates (>80%) Highly Improved Staggered Quarks (HISQ) use next and 3rd neighbor stencil   3 hn o n oi X µ − U † µ − N † w x = D x,x 0 v x 0 = U x,µ v x +ˆ µ,µ v x − ˆ N x,µ v x +3ˆ µ,µ v x − 3ˆ + x − ˆ µ x − 3ˆ µ µ =0 complex 3-dim vector   complex 3x3 matrix   complex 3x3 matrix complex 3-dim vector   24 byte for fp32 72 byte for fp32 72 byte for fp32 24 byte for fp32 11

        STAGGERED FERMION MATRIX (DSLASH) Krylov space solve of fermion matrix dominates runtime within inversion application of sparse Matrix (Dslash) dominates (>80%) Highly Improved Staggered Quarks (HISQ) use next and 3rd neighbor stencil   3 hn o n oi X µ − U † µ − N † w x = D x,x 0 v x 0 = U x,µ v x +ˆ µ,µ v x − ˆ N x,µ v x +3ˆ µ,µ v x − 3ˆ + x − ˆ µ x − 3ˆ µ µ =0 complex 3-dim vector   complex 3x3 matrix   complex 3x3 matrix complex 3-dim vector   24 byte for fp32 72 byte for fp32 72 byte for fp32 24 byte for fp32 each site (x) loads 1024 bytes for links and 384 bytes for vectors, stores 24 bytes: total 1432 bytes / site performs 1146 flop: arithmetic intensity: 0.8 flop/byte 11

        STAGGERED FERMION MATRIX (DSLASH) Krylov space solve of fermion matrix dominates runtime within inversion application of sparse Matrix (Dslash) dominates (>80%) Highly Improved Staggered Quarks (HISQ) use next and 3rd neighbor stencil   3 hn o n oi X µ − U † µ − N † w x = D x,x 0 v x 0 = U x,µ v x +ˆ µ,µ v x − ˆ N x,µ v x +3ˆ µ,µ v x − 3ˆ + x − ˆ µ x − 3ˆ µ µ =0 complex 3-dim vector   complex 3x3 matrix   complex 3x3 matrix complex 3-dim vector   24 byte for fp32 72 byte for fp32 72 byte for fp32 24 byte for fp32 each site (x) loads 1024 bytes for links and 384 bytes for vectors, stores 24 bytes: total 1432 bytes / site performs 1146 flop: arithmetic intensity: 0.8 flop/byte sensitive to memory bandwidth 11

THE NEW TRACTOR

BLOCK KRYLOV SPACE SOLVERS Share the Krylov space BlockCG solver suggested by O’Leary in early 80’s retooled BlockCG by Dubrulle 2001 In exact precision converges in N / rhs iterations 13

BLOCK KRYLOV SPACE SOLVERS Share the Krylov space BlockCG solver suggested by O’Leary in early 80’s retooled BlockCG by Dubrulle 2001 In exact precision converges in N / rhs iterations Nakamura et al., CPC 183 (2012) 34–37 Application in QCD: Nakamura et. (modified block BiCGStab) Birk and Frommer (block methods, including block methods for multi shift) 13

BLOCK CG share Krylov space between multiple rhs procedure B LOCK CG R (0) = B − AX (0) P (0) = R (0) for k = 1 , 2 , . . . until converged do Z ( k − 1) = AP ( k − 1) ( P ( k − 1) ) H Z ( k − 1) ⇤ − 1 ( R ( k − 1) ) H R ( k − 1) α ( k − 1) = ⇥ X ( k ) = X ( k − 1) + P ( k − 1) α ( k − 1) R ( k ) = R ( k − 1) − Z ( k − 1) α ( k − 1) ( R ( k − 1) ) H R ( k − 1) ⇤ − 1 ( R ( k ) ) H R ( k ) β ( k − 1) = ⇥ P ( k ) = R ( k ) − P ( k − 1) β ( k − 1) end for end procedure 14

BLOCK CG share Krylov space between multiple rhs procedure B LOCK CG R (0) = B − AX (0) n n P (0) = R (0) for k = 1 , 2 , . . . until converged do 1 m Z ( k − 1) = AP ( k − 1) ( P ( k − 1) ) H Z ( k − 1) ⇤ − 1 ( R ( k − 1) ) H R ( k − 1) α ( k − 1) = ⇥ X ( k ) = X ( k − 1) + P ( k − 1) α ( k − 1) R ( k ) = R ( k − 1) − Z ( k − 1) α ( k − 1) ( R ( k − 1) ) H R ( k − 1) ⇤ − 1 ( R ( k ) ) H R ( k ) β ( k − 1) = ⇥ P ( k ) = R ( k ) − P ( k − 1) β ( k − 1) end for end procedure 14

BLOCK CG share Krylov space between multiple rhs procedure B LOCK CG R (0) = B − AX (0) n n P (0) = R (0) for k = 1 , 2 , . . . until converged do 1 m Z ( k − 1) = AP ( k − 1) = ( P ( k − 1) ) H Z ( k − 1) ⇤ − 1 ( R ( k − 1) ) H R ( k − 1) α ( k − 1) = ⇥ X ( k ) = X ( k − 1) + P ( k − 1) α ( k − 1) R ( k ) = R ( k − 1) − Z ( k − 1) α ( k − 1) ( R ( k − 1) ) H R ( k − 1) ⇤ − 1 ( R ( k ) ) H R ( k ) β ( k − 1) = ⇥ P ( k ) = R ( k ) − P ( k − 1) β ( k − 1) end for end procedure 14

SPEEDING UP CONJUGATE GRADIENT SOLVERS BY 10X Mathias Wagner, - PowerPoint PPT Presentation

SPEEDING UP CONJUGATE GRADIENT SOLVERS BY 10X Mathias Wagner, Developer Technology Engineer GTC 2017 - S7387 HPC BEYOND MOORES LAW going wide 4000 CPUs and GPUs becoming wider 3000 increase in flops is driven by more cores CUDA cores

Math 211 Math 211 Lecture #14 M ATLAB s ODE Solvers September 26, 2003 2 Matlab Solvers

Conjugate Gradient (CG) Majid Lesani Alireza Masoum Overview Backpropagation Gradient

10X Genome Assembly Technology and Single Cell CNV Credit: 10X Genomics Diana Burkart-Waco DNA

Tracking Perform ance of the MMax Conjugate Gradient Algorithm Bei Xie and Tam al Bose

Conjugate gradient methods for stochastic Galerkin finite element saddle point matrices B T A

Iterative Methods Mostly for SPD systems Iterative Linear conjugate gradient and its variants

Choosing Priors Probability Intervals 18.05 Spring 2014 Conjugate priors A prior is conjugate

Speeding up the Inter-Planetary File System (IPFS) Speeding up the Inter-Planetary File System

Speeding Up Your Mac A Joe ON Tech Guide Speeding Up Your Mac Basics Three factors affect

Conjugate gradient training algorithm Steepest descent algorithm Definitions: So far: j

OpenFOAMs basic solvers for linear systems of equations Solvers, preconditioners, smoothers

SMT Solvers: A Disruptive Technology John Rushby Computer Science Laboratory SRI International

Gradient Analysis NMDS Indirect Gradient Analysis NMDS Direct Gradient Analysis Objective:

Compound Light Microscope Light & glass lenses used to magnify objects Eyepiece 10X

Speeding Up Quantifjed Bit-Vector SMT Solvers by Bit-Width Reductions and Extensions Martin

Minimization strategy for choice of the stopping index in conjugate gradient type methods for

Matrix Study of Aged SOFC Performance and Materials Degradation Project FE0026095 Acumentrics

Investing in Hedge Funds Questioning the Status Quo The current dynamic global economic and

Engineers Ireland & Cora The Role of Project Management in Medical Device Manufacturing

NCB Engage Project - Increasing Sustainability NCB Engage Project - Increasing Sustainability 6

Preparing Your Buildings for Climate Change January 31, 2019 Agenda Template Presentation

Continuum Model Oregon State University Honors College HERU 2019 Why Create a Donor Continuum?*

BOTTOM CANYON Idaho Panhandle National Forests

Bottom-Up Methods of Estimating Depot Level Repairables (DLR) Costs Tim Lawless James Jay

SPEEDING UP CONJUGATE GRADIENT SOLVERS BY 10X Mathias Wagner, - PowerPoint PPT Presentation

SPEEDING UP CONJUGATE GRADIENT SOLVERS BY 10X Mathias Wagner, Developer Technology Engineer GTC 2017 - S7387 HPC BEYOND MOORES LAW going wide 4000 CPUs and GPUs becoming wider 3000 increase in flops is driven by more cores CUDA cores

Math 211 Math 211 Lecture #14 M ATLAB s ODE Solvers September 26, 2003 2 Matlab Solvers

Conjugate Gradient (CG) Majid Lesani Alireza Masoum Overview Backpropagation Gradient

10X Genome Assembly Technology and Single Cell CNV Credit: 10X Genomics Diana Burkart-Waco DNA

Tracking Perform ance of the MMax Conjugate Gradient Algorithm Bei Xie and Tam al Bose

Conjugate gradient methods for stochastic Galerkin finite element saddle point matrices B T A

Iterative Methods Mostly for SPD systems Iterative Linear conjugate gradient and its variants

Choosing Priors Probability Intervals 18.05 Spring 2014 Conjugate priors A prior is conjugate

Speeding up the Inter-Planetary File System (IPFS) Speeding up the Inter-Planetary File System

Speeding Up Your Mac A Joe ON Tech Guide Speeding Up Your Mac Basics Three factors affect

Conjugate gradient training algorithm Steepest descent algorithm Definitions: So far: j

OpenFOAMs basic solvers for linear systems of equations Solvers, preconditioners, smoothers

SMT Solvers: A Disruptive Technology John Rushby Computer Science Laboratory SRI International

Gradient Analysis NMDS Indirect Gradient Analysis NMDS Direct Gradient Analysis Objective:

Compound Light Microscope Light &amp; glass lenses used to magnify objects Eyepiece 10X

Speeding Up Quantifjed Bit-Vector SMT Solvers by Bit-Width Reductions and Extensions Martin

Minimization strategy for choice of the stopping index in conjugate gradient type methods for

Matrix Study of Aged SOFC Performance and Materials Degradation Project FE0026095 Acumentrics

Investing in Hedge Funds Questioning the Status Quo The current dynamic global economic and

Engineers Ireland &amp; Cora The Role of Project Management in Medical Device Manufacturing

NCB Engage Project - Increasing Sustainability NCB Engage Project - Increasing Sustainability 6

Preparing Your Buildings for Climate Change January 31, 2019 Agenda Template Presentation

Continuum Model Oregon State University Honors College HERU 2019 Why Create a Donor Continuum?*

BOTTOM CANYON Idaho Panhandle National Forests

Bottom-Up Methods of Estimating Depot Level Repairables (DLR) Costs Tim Lawless James Jay

Compound Light Microscope Light & glass lenses used to magnify objects Eyepiece 10X

Engineers Ireland & Cora The Role of Project Management in Medical Device Manufacturing