Accelerating Quantum Chromodynamics calculations on GPU based systems with an Adaptive Multi-Grid Algorithm


SLIDE 1

Accelerating Quantum Chromodynamics calculations on GPU based systems with an Adaptive Multi-Grid Algorithm

  • K. Clark (NVIDIA), R. Brower (BU) M. Cheng (BU), A. Gambhir (W&M),
  • B. Joó (Jefferson Lab), A. Strelchenko (FNAL), E. Weinberg (BU)

NVIDIA CUDA Theatre, SC’16

SLIDE 2

Introduction

  • It is believed that the fundamental building blocks of matter are quarks bound together by gluons, via the strong nuclear force.
  • Quantum Chromodynamics (QCD) is the theory which describes the strong interactions.
  • Understanding how QCD makes up matter and how quarks and gluons behave is a subject of intense experimental scrutiny.

  • only ~5% of the mass of the proton comes from the mass of the quarks; the rest comes from the binding
  • gluon self-coupling and gluon excitations can create exotic forms of matter

[Images: meson (2 quarks), baryon (3 quarks), glueball (0 quarks, only gluons); Jefferson Lab and Brookhaven National Lab.]

GlueX in the new Hall-D of Jefferson Lab @ 12 GeV. Hunting for exotics!

SLIDE 3

LQCD Calculation Workflow

  • Gauge Generation: Capability Computing on Leadership Facilities
  • configurations generated in sequence using Markov Chain Monte Carlo technique
  • focus the power of leadership computing onto single task exploiting data parallelism
  • Analysis: Capacity computing, cost effective on Clusters
  • task parallelize over gauge configurations in addition to data parallelism
  • can use clusters, but also LCFs in throughput (ensemble) mode.

[Workflow diagram: Gauge Generation → Analysis Phase 1 → Analysis Phase 2 → Physics Result, with Gauge Configurations and Propagators / Correlation Functions as the intermediate data products.]

SLIDE 4

The Wilson-Clover Fermion Matrix

  • “Dslash” Term is a nearest-neighbor, stencil-like term - very sparse
  • M is J-Hermitian with J = γ5:
  • γ5 is maximally indefinite (eigenvalues are +1, -1)

Clover Term (+ Mass Term) and “Dslash” Term:

  M_{x,y} = \left( N_d + M - \frac{i\,c_{SW}}{8} \sum_{\mu<\nu} [\gamma_\mu, \gamma_\nu]\, F^{\mu\nu}(x) \right) \delta_{x,y} \;-\; \frac{1}{2} \sum_{\mu} \left[ (1 - \gamma_\mu)\, U_\mu(x)\, \delta_{x,x+\hat\mu} \;+\; (1 + \gamma_\mu)\, U^\dagger_\mu(x - \hat\mu)\, \delta_{x,x-\hat\mu} \right]

  \gamma_5 M = M^\dagger \gamma_5, \qquad \gamma_5 = \gamma_1 \gamma_2 \gamma_3 \gamma_4, \qquad \gamma_5^2 = 1, \qquad \gamma_5 = \gamma_5^\dagger
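To make the stencil structure concrete, here is a minimal sketch, assuming a toy model with a single complex component per site and U(1) links, so the spin projectors (1 ± γµ) and the SU(3) colour structure are dropped. It is not the QUDA kernel; it only illustrates the nearest-neighbour coupling pattern that makes the term very sparse.

  #include <array>
  #include <complex>
  #include <cstddef>
  #include <vector>

  using Cplx = std::complex<double>;

  // Toy nearest-neighbour stencil in the spirit of the "Dslash" term.
  // Assumption: one complex component per site, U(1) links, flat site index,
  // and precomputed neighbour tables fwd/bwd for the 4 directions.
  void dslash_like(std::vector<Cplx>& out,
                   const std::vector<Cplx>& in,
                   const std::vector<std::array<Cplx, 4>>& U,   // link U_mu(x)
                   const std::vector<std::array<int, 4>>& fwd,  // site x + mu-hat
                   const std::vector<std::array<int, 4>>& bwd)  // site x - mu-hat
  {
    for (std::size_t x = 0; x < in.size(); ++x) {
      Cplx acc = 0.0;
      for (int mu = 0; mu < 4; ++mu) {
        acc += U[x][mu] * in[fwd[x][mu]];                       // forward hop
        acc += std::conj(U[bwd[x][mu]][mu]) * in[bwd[x][mu]];   // backward hop
      }
      out[x] = -0.5 * acc;                                      // the -1/2 prefactor
    }
  }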

SLIDE 5

Gauge Configuration Generation

  • Gauge Generation proceeds via Hybrid Molecular Dynamics Monte Carlo (e.g. HMC)

  • Momentum Update Step needs ‘Force’ term:
  • Computing F needs to solve linear system:
  • For Wilson-Clover we can use two step solve:

[Figure: MD trajectory on a hypersurface of constant H, with momentum refreshment: (U, p_old) → (U, p) → (U', p').]

  \pi_\mu(x) \leftarrow \pi_\mu(x) + F_\mu(x)\, \delta\tau

  M^\dagger M\, x = b \quad\Rightarrow\quad M^\dagger y = b, \quad M x = y
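A rough sketch of the two-step solve, assuming the caller supplies operator applications and a generic Krylov solver as callbacks; the types and names here are hypothetical stand-ins, not the Chroma/QUDA interfaces.

  #include <functional>
  #include <vector>

  using Vector = std::vector<double>;
  using Op     = std::function<void(Vector& out, const Vector& in)>;           // out = A * in
  using Solver = std::function<void(const Op& A, const Vector& b, Vector& x)>; // solve A x = b

  // Two-step solve of M^dagger M x = b: introduce y = M x, then
  //   step 1: M^dagger y = b,   step 2: M x = y.
  void twoStepSolve(const Op& applyM, const Op& applyMdag,
                    const Solver& krylovSolve,
                    const Vector& b, Vector& x)
  {
    Vector y(b.size(), 0.0);
    krylovSolve(applyMdag, b, y);   // M^dagger y = b
    krylovSolve(applyM,    y, x);   // M x = y
  }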

SLIDE 6

Analysis

  • Quark Line Diagrams describe Physical Processes
  • Each line is a Quark Propagator, solution of:
  • Many solves needed for each field configuration
  • e.g. 256 values of t × 386 sources × 4 values of spin × 2 (light and strange quarks) = 790,528 solves
  • Typically 200-500 configurations are used
  • Single precision is good enough
  • Same Matrix, Multiple Right Hand sides

[Quark-line diagram: pion (π) correlators built from quark propagators q between time slices t0 and t.]

  M q = s
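To illustrate the "same matrix, multiple right-hand sides" pattern, a hedged sketch of the analysis solve loop; makeSource and solveProp are hypothetical callbacks standing in for Chroma's source construction and propagator solves, and the loop counts match the example above.

  #include <functional>
  #include <vector>

  using Vector = std::vector<double>;

  // Sketch of the analysis workload: the same fermion matrix M (per gauge
  // configuration and quark mass) is solved against many right-hand sides.
  void analysisSolves(const std::function<Vector(int t, int src, int spin)>& makeSource,
                      const std::function<Vector(int quark, const Vector& s)>& solveProp)
  {
    const int nT = 256, nSrc = 386, nSpin = 4, nQuark = 2;     // light and strange
    long long nSolves = 0;
    for (int q = 0; q < nQuark; ++q)
      for (int t = 0; t < nT; ++t)
        for (int src = 0; src < nSrc; ++src)
          for (int spin = 0; spin < nSpin; ++spin) {
            Vector prop = solveProp(q, makeSource(t, src, spin)); // one solve of M q = s
            (void)prop;                      // correlator contractions omitted here
            ++nSolves;
          }
    // nSolves == 256 * 386 * 4 * 2 == 790,528
  }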

SLIDE 7

Chroma Software Stack

  • Layered Software
  • Algorithms in Chroma
  • Chroma coded in terms of QDP++
  • Fast Solvers Come from Libraries
  • QUDA on NVIDIA GPUs
  • Different QDP++ Implementations provide ‘performance portability’ for Chroma

  • Chroma is 99% coded in terms of QDP++ constructs
  • QDP-JIT/PTX and QDP-JIT/LLVM using NVVM for GPUs
  • Chroma wraps performance optimized libraries
  • can give e.g. QUDA solvers a ‘Chroma look & feel’

[Diagram: the Chroma software stack follows the USQCD SciDAC software layers, with Chroma on top of QDP++ / QDP-JIT/LLVM / QDP-JIT/PTX and the QPhiX, QUDA, and QMP libraries.]

Example QDP++ Code:

  LatticeFermion psi, chi;
  gaussian(psi);                 // gaussian RNG fill
  // shift sites from forward 0 dir.
  // nearest neighbor communication
  chi = shift(psi, FORWARD, 0);
  // Arithmetic expressions on lattice subsets
  chi[rb[0]] += psi;
  // Global reduction
  Double n2 = norm2(chi);

SLIDE 8

Adaptive Multigrid in LQCD

  • Critical Slowing down is caused by ‘near zero’ modes of M
  • Multi-Grid (MG) method
  • separate (project) low lying and high lying modes
  • reduce error from high lying modes with “smoother”
  • reduce error from low modes on coarse grid
  • Gauge field is ‘stochastic’, so there is no geometric smoothness of the low modes => algebraic multigrid

  • Setting up restriction/prolongation operators has a cost
  • Easily amortized in Analysis with O(100,000) solves

Image Credit: Joanna Griffin, Jefferson Lab Public Affairs
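The idea above can be sketched as a two-level V-cycle: smooth on the fine grid to damp the high modes, restrict the residual, correct the low modes on the coarse grid, prolongate back, and smooth again. This is a generic, minimal sketch with hypothetical callback types, not QUDA's implementation (QUDA recurses this structure over further levels and runs the pieces as GPU kernels).

  #include <cstddef>
  #include <functional>
  #include <vector>

  using Vector   = std::vector<double>;
  using Op       = std::function<Vector(const Vector&)>;                     // apply an operator
  using Smoother = std::function<Vector(const Vector& b, const Vector& x0)>; // a few smoother iterations
  using Solver   = std::function<Vector(const Vector& b)>;                   // coarse-level solve

  static Vector residual(const Op& M, const Vector& b, const Vector& x)
  {
    Vector Mx = M(x), r(b.size());
    for (std::size_t i = 0; i < b.size(); ++i) r[i] = b[i] - Mx[i];
    return r;
  }

  // Two-level V-cycle: pre-smooth the high modes, correct the low modes on
  // the coarse grid, prolongate the correction back, post-smooth.
  Vector vCycle(const Op& M, const Smoother& smooth,
                const Op& R, const Op& P, const Solver& coarseSolve,
                const Vector& b)
  {
    Vector x = smooth(b, Vector(b.size(), 0.0));       // pre-smooth
    Vector e = P(coarseSolve(R(residual(M, b, x))));   // coarse-grid correction
    for (std::size_t i = 0; i < x.size(); ++i) x[i] += e[i];
    return smooth(b, x);                               // post-smooth
  }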

SLIDE 9

QUDA Implementation

  • Outer Flexible Krylov Method: GCR
  • MG V-cycle used as a Preconditioner.
  • Null space:
  • Solve M x = 0 for Nvec random x with BiCGStab
  • Construct R, P, Mc
  • Smoother: fixed number of iterations with MR
  • ‘Bottom Solver’: GCR
  • May be deflated (e.g. FGMRES-DR) later
  • Is recursively preconditioned by next MG level
  • Coarsest levels may have very few sites
  • Turn to other ‘fine grained’ sources of parallelism

[Diagram: MG V-cycle with pre-smooth (S), restrict (R), coarse solve, prolongate (P), and post-smooth (S), recursing through coarse levels 1 and 2 via R2 and P2.]

Fine grid: parallelism over sites. Coarse grid: parallelism over rows and directions.
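As a rough illustration of the null-space step described above (not the QUDA API; all names are hypothetical placeholders): relaxing on M x = 0 from random starting guesses with a few BiCGStab iterations leaves vectors rich in near-zero modes, which are then block-orthonormalized to build P, with R = P† and the coarse operator M_c = R M P.

  #include <functional>
  #include <vector>

  using Vector = std::vector<double>;

  // Hypothetical stand-ins for the real machinery:
  using ApproxSolver = std::function<Vector(const Vector& b, const Vector& x0)>; // a few BiCGStab iterations
  using Rng          = std::function<Vector()>;                                  // random lattice vector

  // Null-space setup sketch: approximate solves of M x = 0 from random starts
  // leave x dominated by the near-zero (low) modes the prolongator must capture.
  std::vector<Vector> generateNullVectors(const ApproxSolver& relaxOnM,
                                          const Rng& randomVector,
                                          int nVec)
  {
    std::vector<Vector> nullVecs;
    for (int i = 0; i < nVec; ++i) {
      Vector x0 = randomVector();
      Vector zero(x0.size(), 0.0);                // right-hand side b = 0
      nullVecs.push_back(relaxOnM(zero, x0));     // approximate solve of M x = 0
    }
    // Next steps (omitted): block-orthonormalize nullVecs to build P, take
    // R = P^dagger, and construct the coarse operator M_c = R M P.
    return nullVecs;
  }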

SLIDE 10

Benefits of Multigrid: Speed

[Plot: wall-clock time (sec) vs. number of Cray XK7 (Titan) nodes (64 to 512) for QUDA BiCGStab and QUDA Adaptive MG; V = 64³ × 128 sites, mπ ~ 200 MeV.]

  • Algorithmic Speed Improvements
  • 5x-10x compared to BiCGStab
  • BiCGStab running in optimal configuration:
  • Mostly low precision with ‘Reliable Update’ flying restarts
  • Mixed Precision (16-bit/64-bit)
  • ‘Gauge Field’ Compression
  • MG is a preconditioner
  • Can run in reduced precision with flexible outer GCR solver.

from K. Clark et al., SC’16 (sneak preview)

10x reduction in wall-clock time

SLIDE 11

Benefits of Multigrid: Optimality

  • MG minimizes error, rather than residuum
  • Solver is better behaved than BiCGStab
  • number of iterations is stable
  • || error || / || residuum || is more stable
  • Important for t-to-same-t propagators
  • single precision is good enough

BUT:

  • want precision guarantee from solve to solve

from Clark et al., SC’16 (sneak preview)

[Plot: || error || / || residuum || for spin/color components 1-11, Multigrid vs. BiCGStab; V = 64³ × 128 sites, Null = {24,24}, 64 nodes of Titan, mπ ~ 192 MeV.]

SLIDE 12

Benefits of Multigrid: Power Efficiency

  • Power Draw of a GPU node during BiCGStab and Multigrid running

  • GPU Power only (nvidia-smi)
  • Once setup is complete, integrated power for 12 solves is much less than for BiCGStab

  • Ongoing optimizations
  • smarter setup
  • move more work to GPU

[Plots: GPU power consumption (W) vs. wall-clock time (sec) for 12 BiCGStab solves and 12 MG solves, with the MG setup phases marked: level 1 null vectors, level 2 null vectors, coarse operator construction (CPU).]

from Clark et al., SC’16 (sneak preview)

SLIDE 13

In Praise of Hackathons

  • Hackathons bring together members of a distributed group for a burst of concentrated activity, to accomplish a concrete development goal.

  • Hackathons
  • clear the calendar
  • are focused — no distractions
  • bring together developers from different sides of interfaces
  • teach new things (me@OLCF Hack: NVProf, NVIDIA Visual Profiler, Allinea MAP)

  • Hackathons for QUDA
  • JLAB: Multi-Node QUDA (way back in 2011?), QUDA Multigrid & Chroma (Jan’16)

  • Fermilab: QUDA Deflation algorithms
  • OLCFHack’16: Multigrid in Chroma Gauge Generation, BiCGStab-L, Staggered Multigrid (Oct’16)

Team Hybrid Titans at OLCFHack. Photograph Courtesy of Sherry Ray, OLCF

SLIDE 14

Summary

  • Taking advantage of modern architectures needs development both in the algorithmic space and in the ‘software’ space

  • algorithmic optimality, performance optimization, integration with existing codes
  • Recent QUDA improvements provide Chroma code (and other users) with improved capability.

  • Multi-Grid solver now in production for propagator calculations on Titan and GPU clusters
  • Multi-Grid solver has been integrated into Chroma for Gauge Generation projects
  • Hackathons (a.k.a. Code-Fests) are a great way to make rapid advances
  • We love Hackathons!
  • Please go and see Kate Clark’s Technical Paper Presentation: 2:30pm, Rm 355-E

SLIDE 15

Thanks and Acknowledgements

  • Thanks for Organizing Hackathons:
  • Chip Watson (Jefferson Lab) for January Multi-Grid Mini-Hackathon at Jefferson Lab
  • OLCF for the October OLCFHack GPU Hackathon in Knoxville
  • Results in this talk were generated on the OLCF Titan system (Cray XK7) utilizing USQCD INCITE (LGT003) allocations.

  • This work is supported by the U.S. Department of Energy, Office of Science, Offices of Nuclear Physics, High Energy Physics, and Advanced Scientific Computing Research.

  • This material is based upon work supported by the U.S. Department of Energy, Office of Science, Office of Nuclear Physics under contract DE-AC05-06OR23177.