REVOLUTIONIZING LATTICE QCD PHYSICS WITH HETEROGENEOUS MULTIGRID - PowerPoint PPT Presentation

April 4-7, 2016 | Silicon Valley REVOLUTIONIZING LATTICE QCD PHYSICS WITH HETEROGENEOUS MULTIGRID Kate Clark, April 6th 2016

CONTENTS Introduction to LQCD QUDA Library Adaptive Multigrid QUDA Multigrid Results Conclusion

    QUANTUM CHROMODYNAMICS The strong force is one of the basic forces of nature   (along with gravity, em and weak) It’s what binds together the quarks and gluons in the proton   and the neutron (as well as hundreds of other particles seen in   accelerator experiments) QCD is the theory of the strong force It’s a beautiful theory…   h Ω i = 1 Z d 4 xL ( U ) Ω ( U ) R [ dU ] e − Z …but i01K($C&-("#&?$7))0?01&-"1$G&'"1&-"12 3

LATTICE QUANTUM CHROMODYNAMICS Theory is highly non-linear ⇒ cannot solve directly Must resort to numerical methods to make predictions Lattice QCD Discretize spacetime ⇒ 4-d dimensional lattice of size L x x L y x L z x L t Finitize spacetime ⇒ periodic boundary conditions PDEs ⇒ finite difference equations High-precision tool that allows physicists to explore the contents of nucleus from the comfort of their workstation (supercomputer) Consumer of 10-20% of public supercomputer cycles 4

STEPS IN AN LQCD CALCULATION 1. Generate an ensemble of gluon field (“gauge”) configurations Produced in sequence, with hundreds needed per ensemble Strong scaling required with O(100 TFLOPS) sustained for several months 50-90% of the runtime is in the linear solver D αβ ij ( x, y ; U ) ψ β j ( y ) = η α i ( x ) or Ax = b 2. “Analyze” the configurations Can be farmed out, assuming O(1 TFLOPS) per job. 80-99% of the runtime is in the linear solver   Task parallelism means that clusters reign supreme here U ( x ) 5

Large Hadron Collider Large Hadron Collider 6 Brookhaven National Laboratory Davies et al

QUDA • “QCD on CUDA” – http://lattice.github.com/quda (open source) • Effort started at Boston University in 2008, now in wide use as the GPU backend for BQCD, Chroma, CPS, MILC, TIFR, etc. – Latest release 0.8.0 (8th February 2016) • Provides: — Various solvers for all major fermionic discretizations, with multi-GPU support — Additional performance-critical routines needed for gauge-field generation • Maximize performance – Exploit physical symmetries to minimize memory traffic – Mixed-precision methods – Autotuning for high performance on all CUDA-capable architectures – Domain-decomposed (Schwarz) preconditioners for strong scaling – Eigenvector and deflated solvers (Lanczos, EigCG, GMRES-DR) – Multigrid solvers for optimal convergence • A research tool for how to reach the exascale 7

QUDA COLLABORATORS (multigrid collaborators in green) Steve Gottlieb (Indiana University) § Ron Babich (NVIDIA) § Dean Howarth (Rensselaer Polytechnic Institute) § Michael Baldhauf (Regensburg) § Bálint Joó (Jlab) § Kip Barros (LANL) § Hyung-Jin Kim (BNL -> Samsung) § Rich Brower (Boston University) § Claudio Rebbi (Boston University) § Nuno Cardoso (NCSA) § Guochun Shi (NCSA -> Google) § Kate Clark (NVIDIA) § Mario Schröck (INFN) § Michael Cheng (Boston University) § Alexei Strelchenko (FNAL) § Carleton DeTar (Utah University) § Alejandro Vaquero (INFN) § Justin Foley (Utah -> NIH) § Mathias Wagner (NVIDIA) § Joel Giedt (Rensselaer Polytechnic Institute) § Frank Winter (Jlab) § Arjun Gambhir (William and Mary) § 9

THE DIRAC OPERATOR Quark interactions are described by the Dirac operator First-order PDE acting with a background field Large sparse matrix SU(3) QCD gauge field Dirac spin projector A is the clover matrix (link matrices)   matrices (3x3 color space) (12x12 spin ⊗ color space) (4x4 spin space) 4 M x,x 0 = − 1 P − µ ⊗ U µ µ,x 0 + P + µ ⊗ U µ † X � µ,x 0 � + (4 + m + A x ) δ x,x 0 x δ x +ˆ µ δ x − ˆ x − ˆ 2 µ =1 m quark mass parameter ≡ − 1 2 D x,x 0 + (4 + m + A x ) δ x,x 0 4-d nearest neighbor stencil operator acting on a vector field Eigen spectrum is complex (typically real positive) 10

MAPPING THE DIRAC OPERATOR TO CUDA • Finite difference operator in LQCD is known as Dslash x   • Assign a single space-time point to each thread V = XYZT threads, e.g., V = 24 4 => 3.3x10 6 threads  U x D x,x 0 = • Looping over direction each thread must x   x x −  – Load the neighboring spinor (24 numbers x8) μ U x ν – Load the color matrix connecting the sites (18 numbers x8) – Do the computation μ x −  – Save the result (24 numbers) • Each thread has (Wilson Dslash) 0.92 naive arithmetic intensity • QUDA reduces memory traffic Exact SU(3) matrix compression (18 => 12 or 8 real numbers) Use 16-bit fixed-point representation with mixed-precision solver 11

WILSON-DSLASH PERFORMANCE K20X, ECC on, V = 24 3 xT 800 Half 8 GF Half 8 Half 12 Single 8 GF 700 Single 8 Single 12 600 GFLOPS 500 400 300 200 8 32 128 16 64 Temporal Extent 12

LINEAR SOLVERS while (| r k |> ε ) { • β k = ( r k , r k )/( r k-1 , r k-1 ) • p k+1 = r k - β k p k q k+1 = A p k+1 QUDA supports a wide range of linear solvers • α = ( r k , r k )/( p k+1, q k+1 ) • r k+1 = r k - α q k+1 CG, BiCGstab, GCR, Multi-shift solvers, etc. • x k+1 = x k + α p k+1 • k = k+1 } Condition number inversely proportional to mass conjugate Light (realistic) masses are highly singular gradient Naive Krylov solvers suffer from critical slowing down at decreasing mass Entire solver algorithm must run on GPUs Time-critical kernel is the stencil application Also require BLAS level-1 type operations 13

MULTI-GPU DECOMPOSITION face exchange wrap around face exchange wrap around 14

STRONG SCALING Chroma running on Titan with QUDA Clover Propagator Benchmark on Titan: Strong Scaling, QUDA+Chroma+QDP-JIT(PTX) 450 400 350 3 x256 BiCGStab: 72 3 x256 DD+GCR: 72 300 3 x256 BiCGStab: 96 3 x256 DD+GCR: 96 TFLOPS 250 200 150 100 50 B. Joo, F. Winter (JLab), M. Clark (NVIDIA) 0 15 0 512 1024 1536 2048 2560 3072 3584 4096 4608 Titan Nodes (GPUs)

ADAPTIVE MULTIGRID

WHY MULTIGRID? Wallclock'9me'for'Light'Quark'solves'in'Single' 1e+05 3 96 CG 32 Precision'' 3 64 CG 24 3 64 CG 16 70" 3 64 Eig-CG Average'Run'Time'for'1'source'' 24 3 64 Eig-CG 16 60" Dirac operator applications 3 96 MG-GCR 32 10000 3 64 MG-GCR 24 50" 3 64 MG-GCR 16 (seconds)' 40" 30" 1000 20" 10" 0" 100 QUDA"(32"XK"nodes)" Mul:Grid"(16"XE"nodes)"" -0.43 -0.42 -0.41 -0.4 mass Chroma propagator benchmark   Figure by Balint Joo   Babich et al 2010 MG Chroma integration by Saul Cohen   MG Algorithm by James Osborn 17

INTRODUCTION TO MULTIGRID Stationary iterative solvers effective on high frequency errors Minimal effect on low frequency error Example Free Laplace operator in 2d Ax = 0, x 0 = random Gauss Seidel relaxation Plot error e i = -x i 18

INTRODUCTION TO MULTIGRID Low frequency error modes are smooth Can accurately represent on coarse grid Low frequency on fine Falgout => high frequency on coarse Relaxation effective agin on coarse grid Interpolate back to fine grid 19

MULTIGRID V-CYCLE V-CYCLE Solve 1. Smooth SMOOTHER SMOOTHER & RESIDUAL 2. Compute residual INTERPOLATION 3. Restrict residual RESTRICTION SMOOTHER SMOOTHER 4. Recurse on coarse problem & RESIDUAL 5. Prolongate correction 6. Smooth 7. If not converged, goto 1 DIRECT SOLVE Multigrid has optimal scaling O(N) Linear scaling with problem size Convergence rate independent of condition number For LQCD, we do not know the null space components that need to be preserved on the coarse grid 20

ADAPTIVE GEOMETRIC MULTIGRID Adaptively find candidate null-space vectors Dynamically learn the null space and use this to   define the prolongator Algorithm is self learning Setup 1. Set solver to be simple smoother 2. Apply current solver to random vector v i = P(D) η i 3. If convergence good enough, solver setup complete 4. Construct prolongator using fixed coarsening (1 - P R) v k = 0 Falgout 4 geometric blocks ➡ Typically use 4 † γ 5 = P † ➡ Preserve chirality when coarsening R = γ 5 P 5. Construct coarse operator (D c = R D P) 6. Recurse on coarse problem 7. Set solver to be augmented V-cycle, goto 2 21 Babich et al 2010

ADAPTIVE GEOMETRIC MULTIGRID 4-d Laplace operator 1e+05 N v =0 1 N v =1 1e-05 2 ||r|| N v =2 1e-10 Typically 20-30 vectors   1e-15 N v =3 needed to capture   N v =4 Dirac null space 1e-20 0 20 40 60 80 100 N iter 22

MULTIGRID ON GPUS

THE CHALLENGE OF MULTIGRID ON GPU GPU requirements very different from CPU Each thread is slow, but O(10,000) threads per GPU Fine grids run very efficiently High parallel throughput problem Coarse grids are worst possible scenario More cores than degrees of freedom Increasingly serial and latency bound Little’s law (bytes = bandwidth * latency) Amdahl’s law limiter Multigrid exposes many of the problems expected at the Exascale 24

THE CHALLENGE OF MULTIGRID ON GPU GPU PCIe CPU 25

REVOLUTIONIZING LATTICE QCD PHYSICS WITH HETEROGENEOUS MULTIGRID - PowerPoint PPT Presentation

April 4-7, 2016 | Silicon Valley REVOLUTIONIZING LATTICE QCD PHYSICS WITH HETEROGENEOUS MULTIGRID Kate Clark, April 6th 2016 CONTENTS Introduction to LQCD QUDA Library Adaptive Multigrid QUDA Multigrid Results Conclusion QUANTUM

Lattice QCD Outline 1. Lattice QCD (why and what) 2. Precision flavour physics 3. (g-2) on

LATTICE QCD AND FLAVOR PHYSICS Vittorio Lubicz OUTLINE OUTLINE 1. Motivations for flavor

Z c (3900) from lattice QCD based on Y. Ikeda et al., (HAL QCD), arXiv.1602.03465(hep-lat).

Lattice QCD for Nuclear Physics Saul D. Cohen (for NPLQCD Collaboration) International Workshop

QCD phase diagram: overview of recent lattice results Gergely Endr odi University of

BIG DATA: Revolutionizing construction business through socmed data mining REVOLUTIONIZING

The Roberge-Weiss transition Lattice QCD Lattice basics Phase structure of QCD at imaginary

Pion Transition Form Factor from Lattice QCD in Position Space Cheng Tu Lattice 2018, Michigan

Lattice QCD Precision Science for Muon g-2 and EW Physics Kohtaroh Miura (GSI Helmholtz-Institut

The QCD crossover from Lattice QCD July 25, 2018 Patrick Steinbrecher HotQCD collaboration The

How can lattice QCD describe non-zero Introduction Quantum baryonic density ? statistics and

Lattice QCD analysis of charmed tetraquark candidates Yoichi Ikeda (RCNP , Osaka University) HAL

Hadron interactions from lattice QCD Sinya Aoki University of Tsukuba GGI Workshop New

Lattice QCD & the search for BSM physics in beauty Matthew Wingate DAMTP, University of

Lattice Synergy Curtis A. Meyer Carnegie Mellon University May 15 th , 2009 Lattice QCD

The challenge of discovering QCD critical point M. Stephanov M. Stephanov QCD Critical Point

Jacobi Forms of Lattice Index Andreea Mocanu The University of Nottingham 7th of December, 2016

Global Link Singapore 2018 : Abstract for Science Project Submission Deadline 04 May, 2018

of General Graphs Brahim NEGGAZI 1 , Volker TURAU 2 , Mohammed HADDAD 1 , Hamamache KHEDDOUCI 1 1

Coordinating concurrent transmissions: A constant-factor approximation of maximum-weight

Suzlon Energy Limited H1 FY19 14 November 2018 SECI-1 Sembcorp Project Disclaimer This

ICFCA 2014 12 th International Conference on Formal Concept Analysis Cluj-Napoca, Transylvania,

Normalisation by Completeness Thorsten Altenkirch School of Computer Science University of

INVESTOR PRESENTATION FIRST QUARTER 2015 FORWARD LOOKING INFORMATION This presentation is for

REVOLUTIONIZING LATTICE QCD PHYSICS WITH HETEROGENEOUS MULTIGRID - PowerPoint PPT Presentation

April 4-7, 2016 | Silicon Valley REVOLUTIONIZING LATTICE QCD PHYSICS WITH HETEROGENEOUS MULTIGRID Kate Clark, April 6th 2016 CONTENTS Introduction to LQCD QUDA Library Adaptive Multigrid QUDA Multigrid Results Conclusion QUANTUM

Lattice QCD Outline 1. Lattice QCD (why and what) 2. Precision flavour physics 3. (g-2) on

LATTICE QCD AND FLAVOR PHYSICS Vittorio Lubicz OUTLINE OUTLINE 1. Motivations for flavor

Z c (3900) from lattice QCD based on Y. Ikeda et al., (HAL QCD), arXiv.1602.03465(hep-lat).

Lattice QCD for Nuclear Physics Saul D. Cohen (for NPLQCD Collaboration) International Workshop

QCD phase diagram: overview of recent lattice results Gergely Endr odi University of

BIG DATA: Revolutionizing construction business through socmed data mining REVOLUTIONIZING

The Roberge-Weiss transition Lattice QCD Lattice basics Phase structure of QCD at imaginary

Pion Transition Form Factor from Lattice QCD in Position Space Cheng Tu Lattice 2018, Michigan

Lattice QCD Precision Science for Muon g-2 and EW Physics Kohtaroh Miura (GSI Helmholtz-Institut

The QCD crossover from Lattice QCD July 25, 2018 Patrick Steinbrecher HotQCD collaboration The

How can lattice QCD describe non-zero Introduction Quantum baryonic density ? statistics and

Lattice QCD analysis of charmed tetraquark candidates Yoichi Ikeda (RCNP , Osaka University) HAL

Hadron interactions from lattice QCD Sinya Aoki University of Tsukuba GGI Workshop New

Lattice QCD &amp; the search for BSM physics in beauty Matthew Wingate DAMTP, University of

Lattice Synergy Curtis A. Meyer Carnegie Mellon University May 15 th , 2009 Lattice QCD

The challenge of discovering QCD critical point M. Stephanov M. Stephanov QCD Critical Point

Jacobi Forms of Lattice Index Andreea Mocanu The University of Nottingham 7th of December, 2016

Global Link Singapore 2018 : Abstract for Science Project Submission Deadline 04 May, 2018

of General Graphs Brahim NEGGAZI 1 , Volker TURAU 2 , Mohammed HADDAD 1 , Hamamache KHEDDOUCI 1 1

Coordinating concurrent transmissions: A constant-factor approximation of maximum-weight

Suzlon Energy Limited H1 FY19 14 November 2018 SECI-1 Sembcorp Project Disclaimer This

ICFCA 2014 12 th International Conference on Formal Concept Analysis Cluj-Napoca, Transylvania,

Normalisation by Completeness Thorsten Altenkirch School of Computer Science University of

INVESTOR PRESENTATION FIRST QUARTER 2015 FORWARD LOOKING INFORMATION This presentation is for

Lattice QCD & the search for BSM physics in beauty Matthew Wingate DAMTP, University of