SLIDE 1

Many-GPU calculations in Lattice QuantumChromoDynamics

Justin Foley, University of Utah

SuperComputing 2012

November 13, 2012

Justin Foley, University of Utah Many-GPU calculations in Lattice QCD

SLIDE 2

QCD

Quantum ChromoDynamics (QCD) is the theory of the strong nuclear force, one of the fundamental interactions in nature, along with electromagnetism, the weak nuclear force, and gravity.

It describes elementary particles called quarks and gluons.

Analogy with electromagnetism: quarks ∼ electrons (fundamental matter particles); gluons ∼ photons (force carriers).

SLIDE 3

Quarks carry a ‘colour’ charge.

Colour confinement: quarks and gluons bind together to form composite, colourless particles called hadrons, e.g. the protons and neutrons in atomic nuclei.

Deconfinement: quarks and gluons can only break free at extreme temperatures (> 10¹² K) and densities.

SLIDE 4

Applications

Nuclear physics: Can we understand the forces inside atomic nuclei from first principles?

Astrophysics: How did the early universe evolve? Quark stars, quark matter within neutron stars.

High-energy physics: Search for new physics that cannot be described by the current Standard Model.

SLIDE 5

Lattice QCD

No analytic solutions exist for QCD in the low-energy (hadronic) regime.

Solve QCD on a computer instead [K.G. Wilson (1975)].

Approximate space and time by a 4D grid.

[Figure: space-time lattice with quarks and gluons labelled]

Quarks live on the lattice sites, and gluons reside on the links between sites.
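The site-and-link layout can be sketched in a few lines of Python. This is an illustration only: the function names (`site_index`, `site_coords`, `neighbour`, `gauge_degrees_of_freedom`), the toy lattice size, and the choice of letting the first coordinate run fastest are assumptions made for this sketch, not the layout used by MILC or QUDA. Periodic boundary conditions are also assumed.

```python
from math import prod

N_DIM = 4            # four space-time directions
SU3_PARAMETERS = 8   # real parameters needed to specify one SU(3) link matrix

def site_index(coords, dims):
    """Map 4D coordinates to a single index (first coordinate runs fastest)."""
    index, stride = 0, 1
    for c, n in zip(coords, dims):
        index += c * stride
        stride *= n
    return index

def site_coords(index, dims):
    """Inverse of site_index."""
    coords = []
    for n in dims:
        coords.append(index % n)
        index //= n
    return tuple(coords)

def neighbour(coords, mu, step, dims):
    """Site reached by moving `step` (+1 or -1) in direction mu, with periodic wrap."""
    shifted = list(coords)
    shifted[mu] = (shifted[mu] + step) % dims[mu]
    return tuple(shifted)

def gauge_degrees_of_freedom(dims):
    """One gluon link per site per direction, each link an SU(3) matrix."""
    return prod(dims) * N_DIM * SU3_PARAMETERS

dims = (4, 4, 4, 8)                      # hypothetical toy lattice
site = (3, 0, 2, 7)
forward = neighbour(site, 0, +1, dims)   # wraps around in direction 0
```

Each site owns one forward link per direction, so a lattice with V sites carries 4V gluon link variables.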

SLIDE 6

Anatomy of a lattice calculation

Physical quantities are given in terms of integrals over the gluon fields:

$$\langle O \rangle = \frac{\int \mathcal{D}G \, O[G] \, e^{-S[G]}}{\int \mathcal{D}G \, e^{-S[G]}}$$

Large lattices are needed to control discretization and finite-volume effects ⇒ 10⁹-dimensional integrals.

Such integrals are evaluated by Markov-chain Monte Carlo.

The Hybrid Monte Carlo (HMC) algorithm is the method of choice [Duane, Kennedy, Pendleton, and Roweth (1987)]. It combines molecular dynamics with a Metropolis accept-reject step.
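How molecular dynamics and the accept-reject step fit together can be seen on a toy problem. The sketch below runs HMC for a single real variable x with action S(x) = x²/2 standing in for the gluon field and its action. The step size, trajectory length and function names are illustrative assumptions, and nothing here is taken from MILC or QUDA. For this action the exact expectation value of x² is 1, which gives a simple check on the sampler.

```python
import math
import random

def action(x):
    """Toy action S(x) = x^2 / 2 (a stand-in for the gluon action S[G])."""
    return 0.5 * x * x

def force(x):
    """Molecular-dynamics force, -dS/dx."""
    return -x

def leapfrog(x, p, step, n_steps):
    """Integrate Hamilton's equations with the leapfrog scheme."""
    p += 0.5 * step * force(x)          # initial half step in momentum
    for _ in range(n_steps - 1):
        x += step * p
        p += step * force(x)
    x += step * p
    p += 0.5 * step * force(x)          # final half step in momentum
    return x, p

def hmc(n_trajectories, step=0.2, n_steps=10, seed=1):
    rng = random.Random(seed)
    x, samples, accepted = 0.0, [], 0
    for _ in range(n_trajectories):
        p = rng.gauss(0.0, 1.0)                     # refresh the momentum
        h_old = action(x) + 0.5 * p * p
        x_new, p_new = leapfrog(x, p, step, n_steps)
        h_new = action(x_new) + 0.5 * p_new * p_new
        # Metropolis accept-reject step corrects the integration error
        if rng.random() < math.exp(min(0.0, h_old - h_new)):
            x = x_new
            accepted += 1
        samples.append(x)
    return samples, accepted / n_trajectories

samples, acceptance = hmc(20000)
mean_x2 = sum(x * x for x in samples) / len(samples)   # estimate of <x^2>
```

The estimate is the ratio of integrals above evaluated as a plain average over the Markov chain, since the chain samples x with weight e^{-S(x)}.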

SLIDE 7

The use of large space-time lattices and the need to control statistical uncertainties in Monte Carlo integrals make Lattice QCD a major HPC application.

SLIDE 8

MILC

The MILC Collaboration uses a Lattice QCD package written in C with MPI.

It supports the HISQ (Highly-Improved Staggered Quark) lattice formulation.

Four routines take up 99% of HMC time:

1. Solving the linear system Aφ = η.

2. Fermion force: the molecular-dynamics (M.D.) force due to quarks.

3. Link fattening: to suppress discretisation errors.

4. Gauge force.

SLIDE 9

Linear solves

Typically the most expensive part of HMC.

Aφ = η, with A = Q†Q, where Q is the HISQ quark matrix.

[Figure: stencil of the HISQ quark matrix]

Solve using iterative Krylov-subspace methods. Use Conjugate Gradient (CG), since A is Hermitian and positive-definite.
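A minimal Conjugate Gradient sketch for a system of the form Q†Qφ = η is given below. The operator here is a toy assumption: a one-dimensional nearest-neighbour matrix Q = m + D with random U(1) phases on the links, chosen only because D is anti-Hermitian so that Q†Q is Hermitian and positive-definite. It is not the HISQ stencil, and the names `apply_Q`, `apply_Qdag` and `cg` are hypothetical.

```python
import cmath
import random

def apply_Q(x, links, mass):
    """y = (m + D) x, with D a nearest-neighbour, anti-Hermitian hopping term."""
    n = len(x)
    y = []
    for i in range(n):
        fwd = links[i] * x[(i + 1) % n]
        bwd = links[(i - 1) % n].conjugate() * x[(i - 1) % n]
        y.append(mass * x[i] + 0.5 * (fwd - bwd))
    return y

def apply_Qdag(x, links, mass):
    """y = Q^dagger x = (m - D) x, because D is anti-Hermitian."""
    n = len(x)
    y = []
    for i in range(n):
        fwd = links[i] * x[(i + 1) % n]
        bwd = links[(i - 1) % n].conjugate() * x[(i - 1) % n]
        y.append(mass * x[i] - 0.5 * (fwd - bwd))
    return y

def vdot(a, b):
    """Complex inner product <a, b>."""
    return sum(s.conjugate() * t for s, t in zip(a, b))

def cg(apply_A, b, tol=1e-10, max_iter=500):
    """Conjugate Gradient for a Hermitian positive-definite operator."""
    x = [0j] * len(b)
    r = list(b)                 # residual for the initial guess x = 0
    p = list(r)
    rr = vdot(r, r).real
    target = tol * tol * rr
    for iteration in range(1, max_iter + 1):
        Ap = apply_A(p)
        alpha = rr / vdot(p, Ap).real
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rr_new = vdot(r, r).real
        if rr_new <= target:
            return x, iteration
        p = [ri + (rr_new / rr) * pi for ri, pi in zip(r, p)]
        rr = rr_new
    return x, max_iter

rng = random.Random(0)
n, mass = 16, 0.5                                   # hypothetical toy parameters
links = [cmath.exp(1j * rng.uniform(0.0, 2.0 * cmath.pi)) for _ in range(n)]
eta = [complex(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(n)]

def apply_A(v):
    """A = Q^dagger Q, applied as two successive matrix-vector products."""
    return apply_Qdag(apply_Q(v, links, mass), links, mass)

phi, iterations = cg(apply_A, eta)
check = [e - a for e, a in zip(eta, apply_A(phi))]
residual = (vdot(check, check).real / vdot(eta, eta).real) ** 0.5
```

Each CG iteration costs one application of A, that is, one application of Q and one of Q†, which is where the stencil and its neighbour communication enter.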

SLIDE 10

Lattice QCD on GPUs

QUDA: an open-source library for QCD on GPUs (lattice.github.com/quda).

Written in C++ and CUDA.

Linear-solver support for multiple lattice formulations.

QUDA-0.5.0: HMC support for the HISQ formulation.

Interfaces for common CPU packages: BQCD, Chroma, CPS, QDP, MILC.

SLIDE 11

QUDA performance on a 36⁴ lattice on a single K20X (Gflops):

Routine          Single   Double   Mixed
Linear solve     156.5     77.1    157.4
Fermion force    191.2     97.2      -
Fattening        170.7     82.0      -
Gauge force      194.8     98.3      -

Does not include data transfer between QUDA and MILC.

The mixed double-/single-precision solver uses reliable updating [Sleijpen and van der Vorst].
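The idea behind a mixed-precision solve can be illustrated with a simpler scheme than reliable updating. The sketch below uses plain defect correction (iterative refinement): the true residual and the solution are kept in double precision, while the correction is computed by a CG whose vectors are rounded to single precision. This is a swapped-in technique for illustration, not the reliable-update scheme used in QUDA. The matrix is a small tridiagonal stand-in, single precision is emulated with `struct`, and all function names are hypothetical.

```python
import random
import struct

def f32(v):
    """Round a Python float (double) to the nearest IEEE single-precision value."""
    return struct.unpack("f", struct.pack("f", v))[0]

def apply_A(x):
    """A = tridiag(-1, 2.5, -1): a symmetric positive-definite stand-in."""
    n = len(x)
    return [2.5 * x[i]
            - (x[i - 1] if i > 0 else 0.0)
            - (x[i + 1] if i < n - 1 else 0.0) for i in range(n)]

def dot(a, b):
    return sum(s * t for s, t in zip(a, b))

def cg_single(r0, rel_tol=1e-4, max_iter=200):
    """Approximate solve of A e = r0 with all vectors rounded to single precision."""
    e = [0.0] * len(r0)
    r = [f32(t) for t in r0]
    p = list(r)
    rr = dot(r, r)
    target = rel_tol * rel_tol * rr
    for _ in range(max_iter):
        if rr <= target:
            break
        Ap = [f32(t) for t in apply_A(p)]
        alpha = rr / dot(p, Ap)
        e = [f32(ei + alpha * pi) for ei, pi in zip(e, p)]
        r = [f32(ri - alpha * api) for ri, api in zip(r, Ap)]
        rr_new = dot(r, r)
        p = [f32(ri + (rr_new / rr) * pi) for ri, pi in zip(r, p)]
        rr = rr_new
    return e

def mixed_precision_solve(b, tol=1e-12, max_outer=20):
    """Defect correction: double-precision residual, single-precision corrections."""
    x = [0.0] * len(b)
    b_norm = dot(b, b) ** 0.5
    for outer in range(max_outer):
        r = [bi - axi for bi, axi in zip(b, apply_A(x))]   # true residual (double)
        if dot(r, r) ** 0.5 <= tol * b_norm:
            return x, outer
        e = cg_single(r)                                   # cheap low-precision solve
        x = [xi + ei for xi, ei in zip(x, e)]              # accumulate in double
    return x, max_outer

rng = random.Random(2)
b = [rng.uniform(-1.0, 1.0) for _ in range(32)]
x, corrections = mixed_precision_solve(b)
final = [bi - axi for bi, axi in zip(b, apply_A(x))]
rel_residual = dot(final, final) ** 0.5 / dot(b, b) ** 0.5
```

Most of the arithmetic happens in the low-precision inner solve, while the periodic double-precision residual keeps the final answer accurate to double precision; that division of labour is what the two schemes have in common.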

SLIDE 12

[Figure: bar chart of time (s) per routine (linear solves, fermion force*, fattening*, gauge force, other), MILC versus MILC+QUDA, for 2+1-flavor RHMC on 2x(K20X + Sandy Bridge); bar labels: 7.70, 7.86, 1.39, 16.36, 0.15]

Single-precision Rational HMC (RHMC) on a 24³ × 64 lattice.

5.7x net gain in performance.

>7.7x gain by porting the remaining CPU routines.

SLIDE 13

(Mixed-) Double-precision Rational HMC on a 96³ × 192 lattice on Titan

SLIDE 14

Preliminary look at strong scaling on Titan

SLIDE 15

Domain-decomposition methods

Each solver iteration involves the transfer of data between GPUs. In large-scale simulations, the linear solver is communication bound.

Domain decomposition: solve the preconditioned linear system MAφ = Mη, where M ≈ A⁻¹ but applying M involves less or no inter-processor communication [Additive Schwarz method, Schwarz alternating procedure].

This reduces the number of applications of A and hence the inter-GPU communication.

Successfully employed in Lattice QCD and in QUDA [Lüscher (Lattice QCD and the Schwarz alternating procedure); Babich, Clark et al. (Scaling Lattice QCD beyond 100 GPUs)].
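The effect of such a preconditioner can be sketched on a small example. Below, M is a block-Jacobi preconditioner (non-overlapping additive Schwarz): the vector is split into contiguous domains, one per hypothetical GPU, and each domain solves its local system exactly without looking at its neighbours. The matrix is a one-dimensional tridiagonal stand-in rather than the HISQ operator, and the function names and parameter values are assumptions for this sketch. The number of applications of A serves as a proxy for inter-GPU communication.

```python
import random

def apply_A(x, diag):
    """y = A x for A = tridiag(-1, diag, -1); couples neighbouring sites."""
    n = len(x)
    y = []
    for i in range(n):
        left = x[i - 1] if i > 0 else 0.0
        right = x[i + 1] if i < n - 1 else 0.0
        y.append(diag * x[i] - left - right)
    return y

def solve_domain(r, diag):
    """Exact local solve of tridiag(-1, diag, -1) z = r (Thomas algorithm)."""
    n = len(r)
    c, d = [0.0] * n, [0.0] * n
    c[0], d[0] = -1.0 / diag, r[0] / diag
    for i in range(1, n):
        denom = diag + c[i - 1]
        c[i] = -1.0 / denom
        d[i] = (r[i] + d[i - 1]) / denom
    z = [0.0] * n
    z[-1] = d[-1]
    for i in range(n - 2, -1, -1):
        z[i] = d[i] - c[i] * z[i + 1]
    return z

def apply_M(r, diag, n_domains):
    """Block-Jacobi preconditioner: independent solves, no data crosses domains."""
    size = len(r) // n_domains          # assumes the length divides evenly
    z = []
    for k in range(n_domains):
        z.extend(solve_domain(r[k * size:(k + 1) * size], diag))
    return z

def dot(a, b):
    return sum(s * t for s, t in zip(a, b))

def pcg(b, diag, n_domains=None, tol=1e-10, max_iter=1000):
    """(Preconditioned) CG; returns the solution and the number of A applications."""
    precondition = (lambda v: apply_M(v, diag, n_domains)) if n_domains else list
    x = [0.0] * len(b)
    r = list(b)
    z = precondition(r)
    p = list(z)
    rz = dot(r, z)
    target = tol * tol * dot(b, b)
    for iteration in range(1, max_iter + 1):
        Ap = apply_A(p, diag)           # the step that needs neighbour data
        alpha = rz / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        if dot(r, r) <= target:
            return x, iteration
        z = precondition(r)
        rz_new = dot(r, z)
        p = [zi + (rz_new / rz) * pi for zi, pi in zip(z, p)]
        rz = rz_new
    return x, max_iter

rng = random.Random(3)
diag = 2.01                             # hypothetical small "mass" term
b = [rng.uniform(-1.0, 1.0) for _ in range(32)]
x_cg, applications_cg = pcg(b, diag)
x_pcg, applications_pcg = pcg(b, diag, n_domains=4)
```

Comparing `applications_pcg` with `applications_cg` shows the trade: extra local work inside each domain in exchange for fewer applications of A.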

SLIDE 16

Demonstration of principle on a 32³ × 64 lattice on 4 C2070s.

Preconditioning results in a 2.4x reduction in communication.

SLIDE 17

Summary and Outlook

Reported on progress in porting Lattice QCD Monte Carlo to GPUs.

Routines accounting for 99% of MILC HMC time have been ported to QUDA, giving a 5.7x net gain in the single-precision RHMC benchmark.

Amdahl’s law: the remaining CPU routines need to be ported to QUDA.

Persistent data types to reduce host-device data transfer.

Linear solvers dominate state-of-the-art calculations.

New algorithms are being implemented which will reduce inter-GPU communication and extend strong scaling on hundreds of GPUs.
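The Amdahl's-law point can be made concrete with a short calculation. The function below is the standard formula. The 10x routine speedup in the example is a hypothetical value chosen for illustration, not a measured figure from the talk; only the 99% fraction comes from the slides.

```python
def amdahl_speedup(accelerated_fraction, routine_speedup):
    """Overall speedup when a fraction of the runtime is sped up by a given factor."""
    remaining = 1.0 - accelerated_fraction      # part that still runs at the old speed
    return 1.0 / (remaining + accelerated_fraction / routine_speedup)

ported = 0.99                                   # share of HMC time in ported routines
hypothetical = amdahl_speedup(ported, 10.0)     # assumed 10x faster GPU routines
ceiling = amdahl_speedup(ported, float("inf"))  # limit of infinitely fast GPU routines
```

With 99% of the original runtime accelerated, the overall speedup cannot exceed 100x however fast the GPU routines become, because the unported 1% stays on the CPU. Any extra host-device transfer time adds to that remaining part and lowers the ceiling further.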
