

  1. Accelerating Quantum Chromodynamics calculations on GPU based systems with an Adaptive Multi-Grid Algorithm
  K. Clark (NVIDIA), R. Brower (BU), M. Cheng (BU), A. Gambhir (W&M), B. Joó (Jefferson Lab), A. Strelchenko (FNAL), E. Weinberg (BU)
  NVIDIA CUDA Theatre, SC'16

  2. Introduction
  • It is believed that the fundamental building blocks of matter are quarks bound together by gluons via the strong nuclear force.
  • Quantum Chromodynamics (QCD) is the theory which describes the strong interactions.
  • Understanding how QCD makes up matter and how quarks and gluons behave is a subject of intense experimental scrutiny:
    - only ~5% of the mass of a proton comes from the mass of the quarks; the rest comes from the binding
    - gluon self-coupling and gluon excitations can create exotic forms of matter
  [Figure: hadrons by quark content (meson: 2 quarks; baryon: 3 quarks; glueball: gluons only); images courtesy of Jefferson Lab and Brookhaven National Lab. GlueX in the new Hall-D of Jefferson Lab@12 GeV is hunting for exotics!]

  3. LQCD Calculation Workflow
  Gauge Generation -> Gauge Configurations -> Analysis Phase 1 -> Propagators, Correlation Functions -> Analysis Phase 2 -> Physics Result
  • Gauge Generation: capability computing on leadership facilities
    - configurations generated in sequence using a Markov Chain Monte Carlo technique
    - focus the power of leadership computing onto a single task, exploiting data parallelism
  • Analysis: capacity computing, cost effective on clusters
    - task-parallelize over gauge configurations in addition to data parallelism
    - can use clusters, but also LCFs in throughput (ensemble) mode

  4. The Wilson-Clover Fermion Matrix

  M_{x,y} = \left( N_d + M - \frac{i c_{SW}}{8} \sum_{\mu<\nu} [\gamma_\mu, \gamma_\nu] F_{\mu\nu}(x) \right) \delta_{x,y}
            - \frac{1}{2} \sum_{\mu} \left[ (1-\gamma_\mu)\, U_\mu(x)\, \delta_{x+\hat\mu,y} + (1+\gamma_\mu)\, U^\dagger_\mu(x-\hat\mu)\, \delta_{x-\hat\mu,y} \right]

  The hopping sum is the "Dslash" term; the commutator piece is the clover term (N_d + M is the mass term).
  • The "Dslash" term is a nearest neighbor stencil-like term - very sparse
  • M is J-Hermitian with J = γ_5: γ_5 M = M† γ_5, where γ_5 = γ_1 γ_2 γ_3 γ_4, γ_5† = γ_5, γ_5² = 1
  • γ_5 is maximally indefinite (eigenvalues are +1, -1)
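To make the stencil structure concrete, here is a toy sketch of a nearest-neighbor "Dslash-like" application. It is not QUDA or Chroma code: spin and color are collapsed to a single complex component, the lattice is one-dimensional with periodic boundaries, and the (1 ∓ γ_μ) projectors are omitted; only the sparsity pattern (site plus its two neighbors) carries over.

  // Toy nearest-neighbor "Dslash"-like stencil (illustrative only; NOT the
  // QUDA/Chroma operator: one complex component, 1-D periodic lattice).
  #include <complex>
  #include <vector>
  #include <cstdio>

  using cplx = std::complex<double>;

  // y(s) = (Nd + M) x(s) - 1/2 [ U(s) x(s+1) + U(s-1)^* x(s-1) ]
  std::vector<cplx> apply_toy_wilson(const std::vector<cplx>& U,
                                     const std::vector<cplx>& x,
                                     double mass, double Nd = 4.0) {
    const int L = (int)x.size();
    std::vector<cplx> y(L);
    for (int s = 0; s < L; ++s) {
      int fwd = (s + 1) % L, bwd = (s - 1 + L) % L;
      y[s] = (Nd + mass) * x[s]
           - 0.5 * (U[s] * x[fwd] + std::conj(U[bwd]) * x[bwd]);
    }
    return y;
  }

  int main() {
    const int L = 8;
    std::vector<cplx> U(L, cplx(1.0, 0.0));   // free field: all links = 1
    std::vector<cplx> x(L, cplx(0.0, 0.0));
    x[0] = 1.0;                               // point source
    auto y = apply_toy_wilson(U, x, 0.1);
    for (int s = 0; s < L; ++s)
      std::printf("y[%d] = (%.3f, %.3f)\n", s, y[s].real(), y[s].imag());
    return 0;
  }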

  5. Gauge Configuration Generation
  • Gauge generation proceeds via Hybrid Molecular Dynamics Monte Carlo (e.g. HMC)
  • The momentum update step needs a 'Force' term:
      π_μ(x) ← π_μ(x) + F_μ(x) δτ
  • Computing F needs to solve a linear system:
      M† M x = b
  • For Wilson-Clover we can use a two-step solve:
      M† y = b, then M x = y
  [Figure: HMC schematic; starting from (U, p_old), momentum refreshment gives (U, p), and the MD trajectory moves (U, p) to (U', p') along a hypersurface of constant H.]
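The two-step form follows from factoring the normal-equations operator; a short standard derivation (not on the slide beyond the operators shown):

  % Factor the normal-equations solve used for the force term
  M^\dagger M x = b \;\Rightarrow\; x = (M^\dagger M)^{-1} b = M^{-1} (M^\dagger)^{-1} b .
  % Step 1: solve M^\dagger y = b, giving y = (M^\dagger)^{-1} b.
  % Step 2: solve M x = y, giving x = M^{-1}(M^\dagger)^{-1} b = (M^\dagger M)^{-1} b.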

  6. Analysis
  • Quark line diagrams describe physical processes
  • Each line is a quark propagator, the solution of:  M q = s
  • Many solves are needed for each field configuration
    - e.g. 256 values of t x 386 sources x 4 values of spin x 2 (light and strange quarks) = 790,528 solves
    - typically 200-500 configurations are used
  • Single precision is good enough
  • Same matrix, multiple right hand sides
  [Figure: quark-line diagram for a π-π correlator, with quark propagators running from time t_0 to t.]
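As a quick arithmetic check of the solve count quoted above:

  256 \times 386 = 98{,}816, \qquad 98{,}816 \times 4 = 395{,}264, \qquad 395{,}264 \times 2 = 790{,}528 .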

  7. Chroma Software Stack
  • Layered software: the Chroma software stack follows the USQCD SciDAC software layers
    - algorithms in Chroma
    - Chroma coded in terms of QDP++ (implementations: QDP++, QDP-JIT/LLVM, QDP-JIT/PTX)
    - fast solvers come from libraries (QUDA on NVIDIA GPUs, QPhiX); communications via QMP
  • Different QDP++ implementations provide 'performance portability' for Chroma
    - Chroma is 99% coded in terms of QDP++ constructs
    - QDP-JIT/PTX and QDP-JIT/LLVM use NVVM for GPUs
  • Chroma wraps performance-optimized libraries
    - can give e.g. QUDA solvers a 'Chroma look & feel' (see the wrapper sketch below)
  Example QDP++ code:
      LatticeFermion psi, chi;
      gaussian(psi);                 // gaussian RNG fill
      chi = shift(psi, FORWARD, 0);  // shift sites from forward 0 dir.
                                     // (nearest neighbor communication)
      chi[ rb[0] ] += psi;           // arithmetic expressions on lattice subsets
      Double n2 = norm2(chi);        // global reduction
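The 'look & feel' point is a wrapping pattern rather than a specific API. Below is an illustrative C++ sketch of that pattern; SolverInterface, ReferenceSolver and GpuLibrarySolver are made-up names, not Chroma or QUDA classes, and the GPU class merely stands in for a call into an optimized library such as QUDA.

  // Illustrative sketch (not Chroma/QUDA code): one application-facing solver
  // interface, with swappable performance-optimized backends behind it.
  #include <memory>
  #include <vector>
  #include <cstdio>

  struct SolverInterface {                      // application-facing interface
    virtual std::vector<double> solve(const std::vector<double>& rhs) = 0;
    virtual ~SolverInterface() = default;
  };

  struct ReferenceSolver : SolverInterface {    // portable fallback
    std::vector<double> solve(const std::vector<double>& rhs) override {
      std::puts("reference CPU solve");
      return rhs;                               // placeholder "solution"
    }
  };

  struct GpuLibrarySolver : SolverInterface {   // stand-in for a QUDA-backed solver
    std::vector<double> solve(const std::vector<double>& rhs) override {
      std::puts("dispatch to optimized GPU library");
      return rhs;                               // placeholder "solution"
    }
  };

  int main() {
    std::unique_ptr<SolverInterface> solver = std::make_unique<GpuLibrarySolver>();
    std::vector<double> b(8, 1.0);
    auto x = solver->solve(b);                  // the call site never changes
    std::printf("solution has %zu entries\n", x.size());
    return 0;
  }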

  8. Adaptive Multigrid in LQCD
  • Critical slowing down is caused by 'near zero' modes of M
  • Multi-Grid (MG) method:
    - separate (project) low-lying and high-lying modes
    - reduce error from high-lying modes with a "smoother"
    - reduce error from low modes on a coarse grid (see the coarse-grid correction below)
    - the gauge field is 'stochastic', so there is no geometric smoothness in the low modes => algebraic multigrid
    - setting up the restriction/prolongation operators has a cost
    - easily amortized in Analysis, with O(100,000) solves
  Image credit: Joanna Griffin, Jefferson Lab Public Affairs
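A compact statement of the coarse-grid correction described above, in standard algebraic-multigrid notation (R and P are the restriction and prolongation operators whose setup cost the slide mentions; this is the textbook form, not a line from the talk):

  % Galerkin coarse operator and coarse-grid correction
  M_c = R\,M\,P, \qquad r = b - M x, \qquad x \;\leftarrow\; x + P\,M_c^{-1}\,R\,r .
  % The smoother then damps the high modes that the range of P cannot represent.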

  9. QUDA Implementation
  • Outer flexible Krylov method: GCR
  • MG V-cycle used as a preconditioner (pre-smooth, restrict, coarse solve, prolong, post-smooth; see the sketch below)
    - Null space: solve M x = 0 for N_vec random x with BiCGStab; construct R, P, M_c
    - Smoother: fixed number of iterations with MR
    - 'Bottom solver': GCR
      • may be deflated (e.g. FGMRES-DR) later
      • is recursively preconditioned by the next MG level
  • Coarsest levels may have very few sites
    - turn to other 'fine-grained' sources of parallelism: the fine grid parallelizes over sites, the coarse grids over rows and directions
  [Figure: V-cycle diagram with smoother (S), restriction (R) and prolongation (P) between the fine grid, coarse 1 and coarse 2 levels.]
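A schematic of the recursive V-cycle preconditioner structure, for orientation only. The Level struct and the smooth / restrict_to_coarse / prolong_to_fine / bottom_solve placeholders are illustrative stand-ins, not QUDA types or functions.

  // Recursive V-cycle skeleton (illustrative; QUDA's classes and algorithms differ).
  #include <vector>
  #include <cstdio>

  struct Level {
    // a real code would hold the (coarse) operator, R, P and smoother state here
  };

  using Vec = std::vector<double>;

  Vec smooth(const Vec& r)                        { return r; }                    // placeholder smoother
  Vec restrict_to_coarse(const Vec& r)            { return Vec(r.size() / 2, 0.0); } // placeholder R
  Vec prolong_to_fine(const Vec& e, size_t nfine) { return Vec(nfine, 0.0); }      // placeholder P
  Vec bottom_solve(const Vec& r)                  { return r; }                    // GCR(-DR) stand-in

  Vec vcycle(const std::vector<Level>& levels, size_t l, const Vec& r) {
    if (l + 1 == levels.size() || r.size() < 4)   // coarsest level: bottom solver
      return bottom_solve(r);
    Vec e  = smooth(r);                           // pre-smooth
    Vec rc = restrict_to_coarse(r);               // restrict the residual
    Vec ec = vcycle(levels, l + 1, rc);           // recurse: next level preconditions
    Vec corr = prolong_to_fine(ec, r.size());     // prolong the coarse correction
    for (size_t i = 0; i < e.size(); ++i) e[i] += corr[i];
    return smooth(e);                             // post-smooth
  }

  int main() {
    std::vector<Level> levels(3);                 // e.g. fine, coarse 1, coarse 2
    Vec r(16, 1.0);
    Vec e = vcycle(levels, 0, r);
    std::printf("V-cycle returned a correction of size %zu\n", e.size());
    return 0;
  }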

  10. Benefits of Multigrid: Speed
  • Algorithmic speed improvements
    - 5x-10x compared to BiCGStab
  • BiCGStab running in its optimal configuration:
    - mostly low precision with 'Reliable Update' flying restarts (see the mixed-precision sketch below)
    - mixed precision (16-bit/64-bit)
    - 'gauge field' compression
  • MG is a preconditioner
    - can run in reduced precision with the flexible outer GCR solver
  [Figure: wall-clock time (sec, 0-50) vs. Cray XK7 (Titan) nodes (64-512) for QUDA BiCGStab vs. QUDA Adaptive MG; V = 64^3 x 128 sites, m_π ~ 200 MeV; roughly a 10x reduction in wall-clock time.]
  from K. Clark et al., SC'16 - sneak preview
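The 'reliable update' bullet refers to doing most of the work in low precision while periodically recomputing the true residual in high precision and restarting from it. A toy sketch of that mixed-precision defect-correction idea follows; the operator is a made-up diagonal matrix and the inner solve is a simple Richardson-style loop, not the BiCGStab used in production.

  // Mixed-precision defect correction: double-precision solution and residual,
  // single-precision inner "solve" for the correction (toy diagonal operator).
  #include <vector>
  #include <cmath>
  #include <cstdio>

  int main() {
    const int n = 1000;
    std::vector<double> A(n), b(n, 1.0), x(n, 0.0);
    for (int i = 0; i < n; ++i) A[i] = 1.0 + 1e-3 * i;   // toy SPD diagonal operator

    auto resid_norm = [&]() {
      double s = 0.0;
      for (int i = 0; i < n; ++i) { double r = b[i] - A[i] * x[i]; s += r * r; }
      return std::sqrt(s);
    };

    const double tol = 1e-12;
    const double r0 = resid_norm();

    while (resid_norm() > tol * r0) {
      // high-precision residual of the current double-precision solution
      std::vector<float> r(n), e(n, 0.0f);
      for (int i = 0; i < n; ++i) r[i] = (float)(b[i] - A[i] * x[i]);

      // cheap low-precision inner iterations for the correction A e = r
      for (int it = 0; it < 20; ++it)
        for (int i = 0; i < n; ++i)
          e[i] += (r[i] - (float)A[i] * e[i]) / (float)A[i];

      // accumulate the correction in double precision (the "reliable" restart point)
      for (int i = 0; i < n; ++i) x[i] += e[i];
      std::printf("true |r|/|r0| = %.3e\n", resid_norm() / r0);
    }
    return 0;
  }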

  11. Benefits of Multigrid: Optimality
  • MG minimizes the error, rather than the residuum (see the bound below)
  • The solver is better behaved than BiCGStab:
    - the number of iterations is stable
    - || error || / || residuum || is more stable
  • Important for t-to-same-t propagators
    - single precision is good enough BUT:
    - we want a precision guarantee from solve to solve
  [Figure: || error || / || residuum || (0-400) per spin/color component (0-11) for Multigrid vs. BiCGStab; V = 64^3 x 128 sites, Null = {24,24}, 64 nodes of Titan, m_π ~ 192 MeV.]
  from Clark et al., SC'16 - sneak preview
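The distinction between error and residuum can be made precise with the standard bound below (general linear algebra, not data from the talk): a small residuum only guarantees a small error when M is well conditioned, and the near-zero modes that cause critical slowing down are exactly what makes ||M^{-1}|| large.

  % With e = x_exact - x and r = b - M x, we have M e = r, hence
  \|e\| \;\le\; \|M^{-1}\|\,\|r\|, \qquad
  \frac{\|e\|}{\|x_{\rm exact}\|} \;\le\; \kappa(M)\,\frac{\|r\|}{\|b\|} .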

  12. Benefits of Multigrid: Power Efficiency
  • Power draw of a GPU node during BiCGStab and Multigrid running
    - GPU power only (nvidia-smi); a programmatic alternative is sketched below
  • Once setup is complete, the integrated power for 12 MG solves is much less than for 12 BiCGStab solves
  • Ongoing optimizations:
    - smarter setup
    - move more work to the GPU
  [Figure: GPU power consumption (W, 0-120) vs. wall-clock time (sec, 0-600) for 12 BiCGStab solves and for 12 MG solves; the MG trace shows the level-1 and level-2 null-vector generation and coarse-operator construction (CPU) phases before the solves.]
  from Clark et al., SC'16 - sneak preview
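The power traces above were taken with nvidia-smi; an equivalent programmatic sample can be taken with NVIDIA's NVML library. A minimal sketch, assuming NVML is available, GPU 0 is the device of interest, and the program is linked with -lnvidia-ml:

  // Sample board power via NVML, roughly what nvidia-smi reports (sketch only).
  #include <nvml.h>
  #include <cstdio>
  #include <thread>
  #include <chrono>

  int main() {
    if (nvmlInit() != NVML_SUCCESS) { std::puts("NVML init failed"); return 1; }
    nvmlDevice_t dev;
    nvmlDeviceGetHandleByIndex(0, &dev);          // GPU 0
    for (int i = 0; i < 10; ++i) {                // ten one-second samples
      unsigned int mw = 0;
      if (nvmlDeviceGetPowerUsage(dev, &mw) == NVML_SUCCESS)
        std::printf("power draw: %.1f W\n", mw / 1000.0);  // NVML reports milliwatts
      std::this_thread::sleep_for(std::chrono::seconds(1));
    }
    nvmlShutdown();
    return 0;
  }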

  13. In Praise of Hackathons
  • Hackathons bring together members of a distributed group for a burst of concentrated activity, to accomplish a concrete development goal.
  • Hackathons:
    - clear the calendar
    - are focused, with no distractions
    - bring together developers from different sides of an interface
    - teach new things (me @ OLCF Hack: NVProf, NVIDIA Visual Profiler, Allinea MAP)
  • Hackathons for QUDA:
    - JLab: multi-node QUDA (way back in 2011?), QUDA Multigrid & Chroma (Jan '16)
    - Fermilab: QUDA deflation algorithms
    - OLCFHack '16: Multigrid in Chroma gauge generation, BiCGStab-L, staggered Multigrid (Oct '16)
  Team Hybrid Titans at OLCFHack; photograph courtesy of Sherry Ray, OLCF

  14. Summary
  • Taking advantage of modern architectures needs development both in the algorithmic space and in the 'software' space
    - algorithmic optimality, performance optimization, integration with existing codes
  • Recent QUDA improvements provide the Chroma code (and other users) with improved capability
    - the Multi-Grid solver is now in production for propagator calculations on Titan and GPU clusters
    - the Multi-Grid solver has been integrated into Chroma for gauge generation projects
  • Hackathons (a.k.a. Code-Fests) are a great way to make rapid advances
    - We love Hackathons!
  • Please go and see Kate Clark's Technical Paper Presentation: 2:30pm, Rm 355-E

  15. Thanks and Acknowledgements
  • Thanks for organizing Hackathons:
    - Chip Watson (Jefferson Lab) for the January Multi-Grid mini-hackathon at Jefferson Lab
    - OLCF for the October OLCFHack GPU Hackathon in Knoxville
  • Results in this talk were generated on the OLCF Titan system (Cray XK7) utilizing USQCD INCITE (LGT003) allocations.
  • This work is supported by the U.S. Department of Energy, Office of Science, Offices of Nuclear Physics, High Energy Physics, and Advanced Scientific Computing Research.
  • This material is based upon work supported by the U.S. Department of Energy, Office of Science, Office of Nuclear Physics under contract DE-AC05-06OR23177.
