Accelerating Quantum Chromodynamics calculations on GPU based systems with an Adaptive Multi-Grid Algorithm


SLIDE 1

Accelerating Quantum Chromodynamics calculations on GPU based systems with an Adaptive Multi-Grid Algorithm

  • K. Clark (NVIDIA), R. Brower (BU) M. Cheng (BU), A. Gambhir (W&M),
  • B. Joó (Jefferson Lab), A. Strelchenko (FNAL), E. Weinberg (BU)

NVIDIA CUDA Theatre, SC’16

SLIDE 2

Introduction

  • It is believed that the fundamental building blocks of matter are quarks bound together by gluons, via the strong nuclear force.
  • Quantum Chromodynamics (QCD) is the theory which describes the strong interactions.
  • Understanding how QCD makes up matter and how quarks and gluons behave is a subject of intense experimental scrutiny.

  • only ~5% of the mass of the proton comes from the mass of the quarks; the rest comes from the binding
  • gluon self-coupling and gluon excitations can create exotic forms of matter

[Images: meson (2 quarks), baryon (3 quarks), glueball (0 quarks, only gluons); Jefferson Lab and Brookhaven National Lab.]

GlueX in the new Hall-D of Jefferson Lab @ 12 GeV. Hunting for exotics!

SLIDE 3

LQCD Calculation Workflow

  • Gauge Generation: Capability Computing on Leadership Facilities
  • configurations generated in sequence using Markov Chain Monte Carlo technique
  • focus the power of leadership computing onto single task exploiting data parallelism
  • Analysis: Capacity computing, cost effective on Clusters
  • task parallelize over gauge configurations in addition to data parallelism
  • can use clusters, but also LCFs in throughput (ensemble) mode.

[Workflow diagram: Gauge Generation → Analysis Phase 1 → Analysis Phase 2 → Physics Result, with Gauge Configurations and Propagators / Correlation Functions as the intermediate data products.]

SLIDE 4

The Wilson-Clover Fermion Matrix

  • “Dslash” Term is a nearest-neighbor, stencil-like term - very sparse
  • M is J-Hermitian with J = γ5:
  • γ5 is maximally indefinite (eigenvalues are +1, -1)

Clover Term (+ Mass Term) and “Dslash” Term:

  M_{x,y} = \left( N_d + M - \frac{i\,c_{SW}}{8} \sum_{\mu<\nu} [\gamma_\mu, \gamma_\nu]\, F^{\mu\nu}(x) \right) \delta_{x,y} \;-\; \frac{1}{2} \sum_{\mu} \left[ (1 - \gamma_\mu)\, U_\mu(x)\, \delta_{x,x+\hat\mu} \;+\; (1 + \gamma_\mu)\, U^\dagger_\mu(x - \hat\mu)\, \delta_{x,x-\hat\mu} \right]

  \gamma_5 M = M^\dagger \gamma_5, \qquad \gamma_5 = \gamma_1 \gamma_2 \gamma_3 \gamma_4, \qquad \gamma_5^2 = 1, \qquad \gamma_5 = \gamma_5^\dagger
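To make the stencil structure concrete, here is a minimal sketch, assuming a toy model with a single complex component per site and U(1) links, so the spin projectors (1 ± γµ) and the SU(3) colour structure are dropped. It is not the QUDA kernel; it only illustrates the nearest-neighbour coupling pattern that makes the term very sparse.

  #include <array>
  #include <complex>
  #include <cstddef>
  #include <vector>

  using Cplx = std::complex<double>;

  // Toy nearest-neighbour stencil in the spirit of the "Dslash" term.
  // Assumption: one complex component per site, U(1) links, flat site index,
  // and precomputed neighbour tables fwd/bwd for the 4 directions.
  void dslash_like(std::vector<Cplx>& out,
                   const std::vector<Cplx>& in,
                   const std::vector<std::array<Cplx, 4>>& U,   // link U_mu(x)
                   const std::vector<std::array<int, 4>>& fwd,  // site x + mu-hat
                   const std::vector<std::array<int, 4>>& bwd)  // site x - mu-hat
  {
    for (std::size_t x = 0; x < in.size(); ++x) {
      Cplx acc = 0.0;
      for (int mu = 0; mu < 4; ++mu) {
        acc += U[x][mu] * in[fwd[x][mu]];                       // forward hop
        acc += std::conj(U[bwd[x][mu]][mu]) * in[bwd[x][mu]];   // backward hop
      }
      out[x] = -0.5 * acc;                                      // the -1/2 prefactor
    }
  }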

SLIDE 5

Gauge Configuration Generation

  • Gauge Generation proceeds via Hybrid Molecular Dynamics Monte Carlo (e.g. HMC)

  • Momentum Update Step needs ‘Force’ term:
  • Computing F needs to solve linear system:
  • For Wilson-Clover we can use two step solve:

[Figure: MD trajectory on a hypersurface of constant H, with momentum refreshment: (U, p_old) → (U, p) → (U', p').]

  \pi_\mu(x) \leftarrow \pi_\mu(x) + F_\mu(x)\, \delta\tau

  M^\dagger M\, x = b \quad\Rightarrow\quad M^\dagger y = b, \quad M x = y
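A rough sketch of the two-step solve, assuming the caller supplies operator applications and a generic Krylov solver as callbacks; the types and names here are hypothetical stand-ins, not the Chroma/QUDA interfaces.

  #include <functional>
  #include <vector>

  using Vector = std::vector<double>;
  using Op     = std::function<void(Vector& out, const Vector& in)>;           // out = A * in
  using Solver = std::function<void(const Op& A, const Vector& b, Vector& x)>; // solve A x = b

  // Two-step solve of M^dagger M x = b: introduce y = M x, then
  //   step 1: M^dagger y = b,   step 2: M x = y.
  void twoStepSolve(const Op& applyM, const Op& applyMdag,
                    const Solver& krylovSolve,
                    const Vector& b, Vector& x)
  {
    Vector y(b.size(), 0.0);
    krylovSolve(applyMdag, b, y);   // M^dagger y = b
    krylovSolve(applyM,    y, x);   // M x = y
  }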

SLIDE 6

Analysis

  • Quark Line Diagrams describe Physical Processes
  • Each line is a Quark Propagator, solution of:
  • Many solves needed for each field configuration
  • e.g. 256 values of t × 386 sources × 4 values of spin × 2 (light and strange quarks) = 790,528 solves
  • Typically 200-500 configurations are used
  • Single precision is good enough
  • Same Matrix, Multiple Right Hand sides

[Quark-line diagram: pion (π) correlators built from quark propagators q between time slices t0 and t.]

  M q = s
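To illustrate the "same matrix, multiple right-hand sides" pattern, a hedged sketch of the analysis solve loop; makeSource and solveProp are hypothetical callbacks standing in for Chroma's source construction and propagator solves, and the loop counts match the example above.

  #include <functional>
  #include <vector>

  using Vector = std::vector<double>;

  // Sketch of the analysis workload: the same fermion matrix M (per gauge
  // configuration and quark mass) is solved against many right-hand sides.
  void analysisSolves(const std::function<Vector(int t, int src, int spin)>& makeSource,
                      const std::function<Vector(int quark, const Vector& s)>& solveProp)
  {
    const int nT = 256, nSrc = 386, nSpin = 4, nQuark = 2;     // light and strange
    long long nSolves = 0;
    for (int q = 0; q < nQuark; ++q)
      for (int t = 0; t < nT; ++t)
        for (int src = 0; src < nSrc; ++src)
          for (int spin = 0; spin < nSpin; ++spin) {
            Vector prop = solveProp(q, makeSource(t, src, spin)); // one solve of M q = s
            (void)prop;                      // correlator contractions omitted here
            ++nSolves;
          }
    // nSolves == 256 * 386 * 4 * 2 == 790,528
  }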

SLIDE 7

Chroma Software Stack

  • Layered Software
  • Algorithms in Chroma
  • Chroma coded in terms of QDP++
  • Fast Solvers Come from Libraries
  • QUDA on NVIDIA GPUs
  • Different QDP++ Implementations provide ‘performance portability’ for Chroma

  • Chroma is 99% coded in terms of QDP++ constructs
  • QDP-JIT/PTX and QDP-JIT/LLVM using NVVM for GPUs
  • Chroma wraps performance optimized libraries
  • can give e.g. QUDA solvers a ‘Chroma look & feel’

[Diagram: the Chroma software stack follows the USQCD SciDAC software layers, with Chroma on top of QDP++ / QDP-JIT/LLVM / QDP-JIT/PTX and the QPhiX, QUDA, and QMP libraries.]

Example QDP++ Code:

  LatticeFermion psi, chi;
  gaussian(psi);                 // gaussian RNG fill
  // shift sites from forward 0 dir.
  // nearest neighbor communication
  chi = shift(psi, FORWARD, 0);
  // Arithmetic expressions on lattice subsets
  chi[rb[0]] += psi;
  // Global reduction
  Double n2 = norm2(chi);

SLIDE 8

Adaptive Multigrid in LQCD

  • Critical Slowing down is caused by ‘near zero’ modes of M
  • Multi-Grid (MG) method
  • separate (project) low lying and high lying modes
  • reduce error from high lying modes with “smoother”
  • reduce error from low modes on coarse grid
  • Gauge field is ‘stochastic’, so there is no geometric smoothness of the low modes => algebraic multigrid

  • Setting up restriction/prolongation operators has a cost
  • Easily amortized in Analysis with O(100,000) solves

Image Credit: Joanna Griffin, Jefferson Lab Public Affairs
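The idea above can be sketched as a two-level V-cycle: smooth on the fine grid to damp the high modes, restrict the residual, correct the low modes on the coarse grid, prolongate back, and smooth again. This is a generic, minimal sketch with hypothetical callback types, not QUDA's implementation (QUDA recurses this structure over further levels and runs the pieces as GPU kernels).

  #include <cstddef>
  #include <functional>
  #include <vector>

  using Vector   = std::vector<double>;
  using Op       = std::function<Vector(const Vector&)>;                     // apply an operator
  using Smoother = std::function<Vector(const Vector& b, const Vector& x0)>; // a few smoother iterations
  using Solver   = std::function<Vector(const Vector& b)>;                   // coarse-level solve

  static Vector residual(const Op& M, const Vector& b, const Vector& x)
  {
    Vector Mx = M(x), r(b.size());
    for (std::size_t i = 0; i < b.size(); ++i) r[i] = b[i] - Mx[i];
    return r;
  }

  // Two-level V-cycle: pre-smooth the high modes, correct the low modes on
  // the coarse grid, prolongate the correction back, post-smooth.
  Vector vCycle(const Op& M, const Smoother& smooth,
                const Op& R, const Op& P, const Solver& coarseSolve,
                const Vector& b)
  {
    Vector x = smooth(b, Vector(b.size(), 0.0));       // pre-smooth
    Vector e = P(coarseSolve(R(residual(M, b, x))));   // coarse-grid correction
    for (std::size_t i = 0; i < x.size(); ++i) x[i] += e[i];
    return smooth(b, x);                               // post-smooth
  }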

SLIDE 9

QUDA Implementation

  • Outer Flexible Krylov Method: GCR
  • MG V-cycle used as a Preconditioner.
  • Null space:
  • Solve M x = 0 for Nvec random x with BiCGStab
  • Construct R, P, Mc
  • Smoother: fixed number of iterations with MR
  • ‘Bottom Solver’: GCR
  • May be deflated (e.g. FGMRES-DR) later
  • Is recursively preconditioned by next MG level
  • Coarsest levels may have very few sites
  • Turn to other ‘fine grained’ sources of parallelism

[Diagram: MG V-cycle with pre-smooth (S), restrict (R), coarse solve, prolongate (P), and post-smooth (S), recursing through coarse levels 1 and 2 via R2 and P2.]

Fine grid: parallelism over sites. Coarse grid: parallelism over rows and directions.
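As a rough illustration of the null-space step described above (not the QUDA API; all names are hypothetical placeholders): relaxing on M x = 0 from random starting guesses with a few BiCGStab iterations leaves vectors rich in near-zero modes, which are then block-orthonormalized to build P, with R = P† and the coarse operator M_c = R M P.

  #include <functional>
  #include <vector>

  using Vector = std::vector<double>;

  // Hypothetical stand-ins for the real machinery:
  using ApproxSolver = std::function<Vector(const Vector& b, const Vector& x0)>; // a few BiCGStab iterations
  using Rng          = std::function<Vector()>;                                  // random lattice vector

  // Null-space setup sketch: approximate solves of M x = 0 from random starts
  // leave x dominated by the near-zero (low) modes the prolongator must capture.
  std::vector<Vector> generateNullVectors(const ApproxSolver& relaxOnM,
                                          const Rng& randomVector,
                                          int nVec)
  {
    std::vector<Vector> nullVecs;
    for (int i = 0; i < nVec; ++i) {
      Vector x0 = randomVector();
      Vector zero(x0.size(), 0.0);                // right-hand side b = 0
      nullVecs.push_back(relaxOnM(zero, x0));     // approximate solve of M x = 0
    }
    // Next steps (omitted): block-orthonormalize nullVecs to build P, take
    // R = P^dagger, and construct the coarse operator M_c = R M P.
    return nullVecs;
  }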

SLIDE 10

Benefits of Multigrid: Speed

[Plot: wall-clock time (sec) vs. number of Cray XK7 (Titan) nodes (64 to 512) for QUDA BiCGStab and QUDA Adaptive MG; V = 64³ × 128 sites, mπ ~ 200 MeV.]

  • Algorithmic Speed Improvements
  • 5x-10x compared to BiCGStab
  • BiCGStab running in optimal configuration:
  • Mostly low precision with ‘Reliable Update’ flying restarts
  • Mixed Precision (16-bit/64-bit)
  • ‘Gauge Field’ Compression
  • MG is a preconditioner
  • Can run in reduced precision with flexible outer GCR solver.

from K. Clark et al., SC’16 (sneak preview)

10x reduction in wall-clock time

SLIDE 11

Benefits of Multigrid: Optimality

  • MG minimizes error, rather than residuum
  • Solver is better behaved than BiCGStab
  • number of iterations is stable
  • || error || / || residuum || is more stable
  • Important for t-to-same-t propagators
  • single precision is good enough

BUT:

  • want precision guarantee from solve to solve

from Clark et al., SC’16 (sneak preview)

[Plot: || error || / || residuum || for spin/color components 1-11, Multigrid vs. BiCGStab; V = 64³ × 128 sites, Null = {24,24}, 64 nodes of Titan, mπ ~ 192 MeV.]

SLIDE 12

Benefits of Multigrid: Power Efficiency

  • Power Draw of a GPU node during BiCGStab and Multigrid running

  • GPU Power only (nvidia-smi)
  • Once setup is complete, integrated power for 12 solves is much less than for BiCGStab

  • Ongoing optimizations
  • smarter setup
  • move more work to GPU

[Plots: GPU power consumption (W) vs. wall-clock time (sec) for 12 BiCGStab solves and 12 MG solves, with the MG setup phases marked: level 1 null vectors, level 2 null vectors, coarse operator construction (CPU).]

from Clark et al., SC’16 (sneak preview)

SLIDE 13

In Praise of Hackathons

  • Hackathons bring together members of a distributed group for a burst of concentrated activity, to accomplish a concrete development goal.

  • Hackathons
  • clear the calendar
  • are focused — no distractions
  • bring together developers from different sides of interfaces
  • teach new things (me@OLCF Hack: NVProf, NVIDIA Visual Profiler, Allinea MAP)

  • Hackathons for QUDA
  • JLAB: Multi-Node QUDA (way back in 2011?), QUDA Multigrid & Chroma (Jan’16)

  • Fermilab: QUDA Deflation algorithms
  • OLCFHack’16: Multigrid in Chroma Gauge Generation, BiCGStab-L, Staggered Multigrid (Oct’16)

Team Hybrid Titans at OLCFHack. Photograph Courtesy of Sherry Ray, OLCF

SLIDE 14

Summary

  • Taking advantage of modern architectures needs development both in the algorithmic space and in the ‘software’ space

  • algorithmic optimality, performance optimization, integration with existing codes
  • Recent QUDA improvements provide Chroma code (and other users) with improved capability.

  • Multi-Grid solver now in production for propagator calculations on Titan and GPU clusters
  • Multi-Grid solver has been integrated into Chroma for Gauge Generation projects
  • Hackathons (a.k.a. Code-Fests) are a great way to make rapid advances
  • We love Hackathons!
  • Please go and see Kate Clark’s Technical Paper Presentation: 2:30pm, Rm 355-E

SLIDE 15

Thanks and Acknowledgements

  • Thanks for Organizing Hackathons:
  • Chip Watson (Jefferson Lab) for January Multi-Grid Mini-Hackathon at Jefferson Lab
  • OLCF for the October OLCFHack GPU Hackathon in Knoxville
  • Results in this talk were generated on the OLCF Titan system (Cray XK7) utilizing USQCD INCITE (LGT003) allocations.

  • This work is supported by the U.S. Department of Energy, Office of Science, Offices of Nuclear Physics, High Energy Physics, and Advanced Scientific Computing Research.

  • This material is based upon work supported by the U.S. Department of Energy, Office of Science, Office of Nuclear Physics under contract DE-AC05-06OR23177.