BLOCK SOLVERS FOR MULTIPLE RIGHT HAND SIDES ON NVIDIA GPUS
Mathias Wagner, Lattice 2016


SLIDE 1

Mathias Wagner, Lattice 2016

BLOCK SOLVERS FOR MULTIPLE RIGHT HAND SIDES ON NVIDIA GPUS

SLIDE 2

AGENDA

- QUDA update
- Dslash for multiple rhs
- Block Krylov solvers

SLIDE 3

UPDATE ON QUDA

SLIDE 4

QUDA

- Effort started at Boston University in 2008; now in wide use as the GPU backend for BQCD, Chroma, CPS, MILC, TIFR, etc.
- Latest release: 0.8.0 (8 February 2016)
- Provides:
  - various solvers for all major fermionic discretizations, with multi-GPU support
  - additional performance-critical routines needed for gauge-field generation
- Maximizes performance:
  - exploits physical symmetries to minimize memory traffic
  - mixed-precision methods
  - autotuning for high performance on all CUDA-capable architectures
  - domain-decomposed (Schwarz) preconditioners for strong scaling
- Eigenvector and deflated solvers (Lanczos, EigCG, GMRES-DR)
- Multigrid solvers for optimal convergence
- A research tool for how to reach the exascale

“QCD on CUDA” – open source, BSD license

SLIDE 5

QUDA - LATTICE QCD ON GPUS

http://lattice.github.com/quda

QUDA is a library for performing calculations in lattice QCD on GPUs.

[Screenshot: the lattice/quda GitHub repository page: 4,621 commits, 49 branches, 19 releases, 16 contributors. Latest commit f3e2aa7, "Merge pull request #487 from lattice/hotfix/checkerboard-reference".]

SLIDE 6

QUDA AUTHORS

Ron Babich (NVIDIA) Michael Baldhauf (Regensburg) Kip Barros (LANL) Rich Brower (Boston University) Nuno Cardoso (Lisbon) Kate Clark (NVIDIA) Michael Cheng (Boston University) Carleton DeTar (Utah University) Justin Foley (Utah -> NIH) Joel Giedt (Rensselaer Polytechnic Institute) Arjun Gambhir (William and Mary) Steve Gottlieb (Indiana University) Dean Howarth (Rensselaer Polytechnic Institute) Bálint Joó (Jlab) Hyung-Jin Kim (BNL -> Samsung) Claudio Rebbi (Boston University) Guochun Shi (NCSA -> Google) Mario Schröck (INFN) Alexei Strelchenko (FNAL) Alejandro Vaquero (INFN) Mathias Wagner (NVIDIA) Frank Winter (Jlab)

SLIDE 7

SOLVERS FOR MULTIPLE RIGHT HAND SIDES

SLIDE 8

CONJUGATE GRADIENT

just as a reminder

procedure CG
    r(0) = b − A x(0)
    p(0) = r(0)
    for k = 1, 2, … until converged do
        z(k−1) = A p(k−1)
        α(k−1) = |r(k−1)|² / ⟨p(k−1), z(k−1)⟩
        x(k) = x(k−1) + α(k−1) p(k−1)
        r(k) = r(k−1) − α(k−1) z(k−1)
        β(k−1) = |r(k)|² / |r(k−1)|²
        p(k) = r(k) + β(k−1) p(k−1)
    end for
end procedure
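As a minimal executable reference for the pseudocode above, here is a NumPy sketch (illustrative only, not the QUDA implementation; `A` stands in for the even-odd preconditioned normal operator):

```python
import numpy as np

def cg(A, b, x0=None, tol=1e-8, maxiter=1000):
    """Plain conjugate gradient for a symmetric positive-definite A,
    following the slide's pseudocode."""
    x = np.zeros_like(b) if x0 is None else x0.copy()
    r = b - A @ x                     # r(0) = b - A x(0)
    p = r.copy()                      # p(0) = r(0)
    rr = r @ r
    for k in range(maxiter):
        z = A @ p                     # z = A p
        alpha = rr / (p @ z)          # alpha = |r|^2 / <p, z>
        x += alpha * p
        r -= alpha * z
        rr_new = r @ r
        if np.sqrt(rr_new) < tol * np.linalg.norm(b):
            break
        beta = rr_new / rr            # beta = |r_new|^2 / |r_old|^2
        p = r + beta * p
        rr = rr_new
    return x
```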

SLIDE 9

QCD PERFORMANCE LIMITERS

- QCD is memory-bandwidth bound: the Dslash arithmetic intensity for HISQ is only ~0.7.
- Exploit SU(3) symmetry: reconstruct the gauge field from 8 or 12 floats.
- Smearing kills the symmetry: stuck with all 18 floats.
- Reuse the gauge field for multiple rhs.

[Chart: Wilson clover Dslash performance in GFLOPS, volume 32⁴, on K80 and P100 in half/single/double precision, for gauge reconstruct 8, 12, and 18.]

[Chart: HISQ Dslash arithmetic intensity (0.7 to 2.8) vs. number of rhs (1 to 16) for reconstruct 18, 13, and 9.]
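A back-of-the-envelope model of why reuse raises the arithmetic intensity. The per-site byte and flop counts below are rough assumptions for single-precision HISQ (8 fat plus 8 long links of 18 floats, 16 neighbor spinor reads plus 1 write of 6 floats, ~1146 flops per site per rhs), not measured values:

```python
def arithmetic_intensity(n_rhs,
                         gauge_bytes=16 * 18 * 4,   # links loaded once per site (assumed counts)
                         spinor_bytes=17 * 6 * 4,   # spinor traffic per rhs per site (assumed)
                         flops_per_rhs=1146):       # HISQ Dslash flops per site (assumed)
    """Flops per byte of a multi-rhs HISQ Dslash: the gauge field is
    read once per site and amortized over n_rhs right-hand sides."""
    bytes_per_site = gauge_bytes + n_rhs * spinor_bytes
    return n_rhs * flops_per_rhs / bytes_per_site
```

With these assumptions the intensity starts near 0.7 for one rhs and more than doubles by 16 rhs, in line with the chart.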

SLIDE 15

TESLA PASCAL P100

Tesla P100 for NVLink-enabled servers: 5.3 TF DP, 10.6 TF SP, 21 TF HP; 720 GB/s memory bandwidth; 16 GB HBM2.

Tesla P100 for PCIe-based servers: 4.7 TF DP, 9.3 TF SP, 18.7 TF HP. Config 1: 16 GB HBM2, 720 GB/s. Config 2: 12 GB HBM2, 540 GB/s.

SLIDE 16

NVIDIA TITAN X

11.0 TF SP 480 GB/sec Memory Bandwidth 12 GB GDDR5X 11 TF SP 480 GB/sec Memory Bandwidth 12 GB GDDR5X Pascal architecture

TITAN X

SLIDE 17

MULTI-SRC DSLASH ON PASCAL

[Chart: multi-src Dslash performance in GFLOP/s (axis 500 to 2000) vs. number of rhs (1 to 16) for P100 double, P100 single, and Titan X single. Volume 24⁴, HISQ, tuned gauge reconstruct.]

SLIDE 21

CONJUGATE GRADIENT

using the multi-src Dslash

- exploit the multi-src Dslash performance
- do all the linear algebra separately for each rhs
- same iteration count as plain CG

procedure CG WITH MULTI-SRC DSLASH
    r_i(0) = b_i − A x_i(0)
    p_i(0) = r_i(0)
    for k = 1, 2, … until converged do
        { z_i(k−1) } = A { p_i(k−1) }          (one multi-src Dslash)
        α_i(k−1) = |r_i(k−1)|² / ⟨p_i(k−1), z_i(k−1)⟩
        x_i(k) = x_i(k−1) + α_i(k−1) p_i(k−1)
        r_i(k) = r_i(k−1) − α_i(k−1) z_i(k−1)
        β_i(k−1) = |r_i(k)|² / |r_i(k−1)|²
        p_i(k) = r_i(k) + β_i(k−1) p_i(k−1)
    end for
end procedure
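The point of this variant is that only the operator application is batched; all scalars stay per-rhs. A NumPy sketch (illustrative; `A @ P` stands in for the multi-src Dslash, and the columns of `B` are the individual right-hand sides):

```python
import numpy as np

def cg_multisrc(A, B, tol=1e-8, maxiter=1000):
    """CG on several rhs at once: one batched A*P per iteration,
    but independent alpha_i and beta_i per column."""
    X = np.zeros_like(B)
    R = B.copy()
    P = R.copy()
    rr = np.sum(R * R, axis=0)                 # |r_i|^2 for each rhs
    bnorm = np.linalg.norm(B, axis=0)
    for k in range(maxiter):
        Z = A @ P                              # batched (multi-src) operator
        alpha = rr / np.sum(P * Z, axis=0)     # per-rhs step size
        X += P * alpha                         # broadcasts over columns
        R -= Z * alpha
        rr_new = np.sum(R * R, axis=0)
        if np.all(np.sqrt(rr_new) < tol * bnorm):
            break
        beta = rr_new / rr
        P = R + P * beta
        rr = rr_new
    return X
```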

SLIDE 22

MULTI-SRC CG ON PASCAL

[Chart: multi-src CG performance in GFLOP/s (axis 500 to 1500) vs. number of rhs (1 to 16) on P100, single precision. Volume 24⁴, HISQ.]

SLIDE 23

BLOCK KRYLOV SPACE SOLVERS

SLIDE 24

BLOCK KRYLOV SOLVERS

Share the Krylov space.

- BlockCG suggested by O'Leary in the early 1980s
- retooled BlockCG by Dubrulle (2001)
- in exact arithmetic, converges in N / #rhs iterations
- applications in QCD: Nakamura et al. (modified block BiCGStab); Birk and Frommer (block methods, including block methods for multiple shifts)

Nakamura et al., CPC 183 (2012) 34-37

SLIDE 26

BLOCK CG

Share the Krylov space between multiple rhs.

procedure BLOCKCG
    R(0) = B − A X(0)
    P(0) = R(0)
    for k = 1, 2, … until converged do
        Z(k−1) = A P(k−1)
        α(k−1) = [(P(k−1))ᴴ Z(k−1)]⁻¹ (R(k−1))ᴴ R(k−1)
        X(k) = X(k−1) + P(k−1) α(k−1)
        R(k) = R(k−1) − Z(k−1) α(k−1)
        β(k−1) = [(R(k−1))ᴴ R(k−1)]⁻¹ (R(k))ᴴ R(k)
        P(k) = R(k) + P(k−1) β(k−1)
    end for
end procedure

[Diagram: X, R, P, and Z are n × m block vectors (n lattice degrees of freedom, m = number of rhs); α and β are small m × m matrices.]
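A direct NumPy transcription of the pseudocode (an illustrative sketch without any reorthogonalization, so it inherits BlockCG's stability caveats discussed later in the deck):

```python
import numpy as np

def block_cg(A, B, tol=1e-8, maxiter=500):
    """BlockCG: X, R, P are n x m blocks; alpha, beta are m x m."""
    X = np.zeros_like(B)
    R = B - A @ X
    P = R.copy()
    RR = R.conj().T @ R                                # m x m Gram matrix R^H R
    bnorm = np.linalg.norm(B)
    for k in range(maxiter):
        Z = A @ P
        alpha = np.linalg.solve(P.conj().T @ Z, RR)    # [P^H Z]^-1 R^H R
        X += P @ alpha
        R -= Z @ alpha
        RR_new = R.conj().T @ R
        if np.linalg.norm(R) < tol * bnorm:
            break
        beta = np.linalg.solve(RR, RR_new)             # [R^H R]^-1 R_new^H R_new
        P = R + P @ beta
        RR = RR_new
    return X
```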

SLIDE 30

REDUCED ITERATION COUNT

HISQ, 32³×8 lattice, Gaussian random sources.

[Chart: residual (1e+00 down to 1e-09) vs. iterations (500 to 4500) for blocks of 1, 2, 4, 8, 12, and 16 rhs.]

SLIDE 37

SCALING

- Dslash exploits reuse of the gauge field.
- Linear algebra: the numbers of dot products and of axpy calls both scale quadratically with the number of rhs:

      y_i = Σ_j a_ij x_j + y_i,    α_ij = ⟨x_i, x_j⟩

[Chart: cost vs. number of rhs (1, 3, 5, 8, 12, 16).]

SLIDE 40

EXPLOIT GPU ARCHITECTURE

To overcome the quadratic scaling: CUDA supports two-dimensional grids of thread blocks, which makes it easy to exploit locality through the texture cache / shared memory.

    y_i = Σ_j a_ij x_j + y_i

For a block of four rhs, each site's update reads every x_j once and reuses it for all y_i:

    y_0(0) = a_00 x_0(0) + a_01 x_1(0) + …
    y_1(0) = a_10 x_0(0) + a_11 x_1(0) + …
    y_2(0) = a_20 x_0(0) + a_21 x_1(0) + …
    y_3(0) = a_30 x_0(0) + a_31 x_1(0) + …

[Diagram: the (i, j) coefficient matrix tiled across a 2D grid of thread blocks; tiling gives cache reuse of the x_j.]
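The tiling idea can be mimicked on the CPU: walk the (i, j) coefficient matrix in 2D tiles so that each x_j is loaded once per tile and reused for every y_i in that tile. A sketch of the scheme only, not the CUDA kernel; the tile size is an arbitrary assumption:

```python
import numpy as np

def tiled_multi_axpy(a, xs, ys, tile=2):
    """Compute y_i += sum_j a[i, j] * x_j tile-by-tile over (i, j),
    so each x_j is reused for all y_i in the current i-tile."""
    m = a.shape[0]
    for i0 in range(0, m, tile):
        for j0 in range(0, m, tile):
            for j in range(j0, min(j0 + tile, m)):
                xj = xs[j]                      # loaded once per (i-tile, j)
                for i in range(i0, min(i0 + tile, m)):
                    ys[i] += a[i, j] * xj       # reused across the i-tile
    return ys
```

The result is identical to the untiled double loop; only the access order, and hence the cache behavior, changes.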

SLIDE 46

ORTHOGONALIZATION

BlockCG is not always numerically stable. The simple approach, Gram-Schmidt or modified Gram-Schmidt, becomes prohibitively expensive.

[Chart: cost per iteration (arbitrary units) vs. number of rhs (2, 4, 8, 12, 16), broken down into Dslash, vector operations, and orthogonalization.]

SLIDE 49

ORTHOGONALIZATION

The simple approach (Gram-Schmidt or modified Gram-Schmidt) becomes prohibitively expensive. CholQR instead:

- Gram matrix B = Rᴴ R: an m × m matrix built from dot products of length n
- Cholesky decomposition of the m × m matrix: Sᴴ S = B
- apply to the vectors, Q = R S⁻¹: axpy-style (output, input) updates that rely on the same kernel as the other vector operations, so linear scaling is attainable
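CholQR fits in a few lines of NumPy (a sketch of the method named on the slide; the heavy lifting is the length-n dot products in B and the axpy-like column updates that form Q):

```python
import numpy as np

def cholqr(R):
    """Orthonormalize the columns of the n x m block R via CholQR:
    B = R^H R (m x m), S^H S = B (Cholesky), Q = R S^-1."""
    B = R.conj().T @ R                      # m x m Gram matrix: length-n dot products
    S = np.linalg.cholesky(B).conj().T      # upper-triangular S with S^H S = B
    Q = np.linalg.solve(S.T, R.T).T         # Q = R S^-1 via a triangular system
    return Q, S
```

By construction Qᴴ Q = S⁻ᴴ B S⁻¹ = I, and R = Q S is recovered exactly, which is why the small m × m factor can later be folded back into the block iteration.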

SLIDE 50

COST OF ONE ITERATION

[Chart: projected cost of one iteration, normalized per rhs (arbitrary units), vs. number of rhs (2, 4, 8, 12, 16), broken down into Dslash, vector operations, and orthogonalization.]

- large benefits from the multi-src Dslash
- linear algebra and orthogonalization stay constant per rhs
- the relative importance of the Dslash decreases

SLIDE 55

SUMMARY

SLIDE 56

WORK TO BE DONE

- stability needs real-world testing; (re)orthogonalization might be necessary
- the iteration-count improvement may depend on the gauge field and the sources: your mileage may vary
- need to finish up the implementation
- add mixed precision

[Chart: residual norm (1e+00 down to 1e-09) vs. iterations (1000 to 4000) for 1, 2, 4, 8, 12, and 16 rhs.]

SLIDE 57

SPEEDUP OVER CG

Multi-rhs block solvers provide an easy drop-in:

- reuse of the gauge field in the Dslash
- reduced iteration count
- no quadratic scaling in the linear algebra and orthogonalization
- no memory overhead / setup cost
- speedups of up to 10x

[Chart: projected speedup per rhs (2.5 to 10) vs. number of rhs (1 to 16) for CG, MSRC CG, Block CG, and BlockCG rQ.]