SLIDE 1

Solving Domain Wall Dirac Equation Using Multisplitting Preconditioned Conjugate Gradient

Jiqun Tu¹

¹Department of Physics, Columbia University

The 36th International Symposium on Lattice Field Theory, July 23, 2018 @ 16:10.
Talk based on: Duo Guo, Robert D. Mawhinney, and Jiqun Tu, [arXiv:1804.08593].

SLIDE 2

Special thanks to Norman Christ, Chulwoo Jung, and Christopher Kelly.

The RBC & UKQCD collaborations

BNL and BNL/RBRC: Yasumichi Aoki (KEK), Mattia Bruno, Taku Izubuchi, Yong-Chull Jang, Chulwoo Jung, Christoph Lehner, Meifeng Lin, Aaron Meyer, Hiroshi Ohki, Shigemi Ohta (KEK), Amarjit Soni

UC Boulder: Oliver Witzel

Columbia University: Ziyuan Bai, Norman Christ, Duo Guo, Christopher Kelly, Bob Mawhinney, Masaaki Tomii, Jiqun Tu, Bigeng Wang, Tianle Wang, Evan Wickenden, Yidi Zhao

University of Connecticut: Tom Blum, Dan Hoying (BNL), Luchang Jin (RBRC), Cheng Tu

Edinburgh University: Peter Boyle, Guido Cossu, Luigi Del Debbio, Tadeusz Janowski, Richard Kenway, Julia Kettle, Fionn O'haigan, Brian Pendleton, Antonin Portelli, Tobias Tsang, Azusa Yamaguchi

KEK: Julien Frison

University of Liverpool: Nicolas Garron

MIT: David Murphy

Peking University: Xu Feng

University of Southampton: Jonathan Flynn, Vera Guelpers, James Harrison, Andreas Juettner, James Richings, Chris Sachrajda

Stony Brook University: Jun-Sik Yoo, Sergey Syritsyn (RBRC)

York University (Toronto): Renwick Hudspith

SLIDE 3

SUMMIT at ORNL

Figure 1: The New York Times's comment on SUMMIT becoming the world's most powerful supercomputer.

SLIDE 4

Scaling on SUMMIT at ORNL

[Plot: Tflops/node vs. number of nodes; reference lines mark the projected single-Volta MDWF dslash performance and the projected single-Volta peak performance.]

Figure 2: Half-precision Möbius domain wall fermion CG weak scaling with local volume of 16 × 12³ × 12, with 6 NVIDIA Volta GPUs on each compute node. Numbers provided by Chulwoo Jung.
SLIDE 5

Motivation

  • Inter-processor communication is the bottleneck in solving the Dirac equation.
  • For measurements there are many approaches available to improve the situation: Lanczos, EigCG, split-grid, multigrid, etc.
  • This is not the case for evolution.
  • We need a better algorithm to reduce the communication overhead and exploit the abundant local GPU flops.
  • Do more work locally!
SLIDE 6

Previous Work

  • Domain decomposition / multiplicative Schwarz [M. Lüscher 2004].
  • Additive Schwarz [Y. Osaki 2000] and [R. Babich 2011].

SLIDE 7

Multisplitting Algorithm

For reference see [D. O'Leary 1985].

Split the global system $Ax = b$ so that, for the block owned by a given node,

$$A_l x_l + A_s x_s + A_r x_r = b_s,$$

where $A_s$ is the local block-diagonal part of $A$ acting on the locally owned piece $x_s$ of the solution, and $A_l$, $A_r$ couple to the pieces $x_l$, $x_r$ owned by other nodes.

SLIDE 8

Multisplitting Algorithm

Solve $A_l x_l + A_s x_s + A_r x_r = b_s$ by rearranging it into an iterative form:

$$A_s x_s^{(k+1)} = b_s - A_l x_l^{(k)} - A_r x_r^{(k)} = b_s - A x^{(k)} + A_s x_s^{(k)} = r^{(k)} + A_s x_s^{(k)} \equiv \hat{b}_s^{(k)},$$

where $r^{(k)} = b - A x^{(k)}$ restricted to the local block. For each cycle,

  • use communication to calculate the right-hand side $\hat{b}_s^{(k)}$;
  • solve $A_s x_s^{(k+1)} = \hat{b}_s^{(k)}$ locally;
  • the updated solution $x_s^{(k+1)}$ is used to start the next cycle.

Get $A_s$ for each node by chopping off all off-block-diagonal terms, i.e. by applying zero Dirichlet boundary conditions (see the sketch below).
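Below is a minimal numpy sketch of one multisplitting cycle on a toy one-dimensional problem. The block splitting, the `multisplitting_cycle` helper, and the exact local solve are illustrative assumptions, not the lattice implementation (there the local solve is itself a Krylov solve and the residual requires halo communication).

    import numpy as np

    def multisplitting_cycle(A, blocks, b, x):
        """One cycle: communicate to form b_hat_s = r + A_s x_s, then solve locally."""
        r = b - A @ x                          # global residual (needs communication)
        x_new = x.copy()
        for lo, hi in blocks:                  # loop over "nodes" (local blocks)
            A_s = A[lo:hi, lo:hi]              # local block = zero Dirichlet boundary condition
            b_hat = r[lo:hi] + A_s @ x[lo:hi]  # right-hand side b_hat_s
            x_new[lo:hi] = np.linalg.solve(A_s, b_hat)   # local solve, no communication
        return x_new

    # Toy SPD system split across two "nodes"; the residual shrinks cycle by cycle.
    n = 8
    A = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
    b = np.ones(n)
    x = np.zeros(n)
    for _ in range(50):
        x = multisplitting_cycle(A, [(0, 4), (4, 8)], b, x)
    print(np.linalg.norm(b - A @ x))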

SLIDE 9

Möbius Domain Wall Fermion

Even-odd preconditioning:

$$\begin{pmatrix} M_5 & -\kappa_b M^4_{eo} \\ -\kappa_b M^4_{oe} & M_5 \end{pmatrix} \begin{pmatrix} \psi_e \\ \psi_o \end{pmatrix} = \begin{pmatrix} \phi_e \\ \phi_o \end{pmatrix},$$

then instead we solve $D_{PC}\,\psi_e = \hat{\phi}_e$, with

$$D_{PC} \equiv M_5 - \kappa_b^2\, M^4_{eo} M_5^{-1} M^4_{oe}, \qquad M^4_{oe/eo} = D^w_{x,y}\,(b_5\,\delta_{s,t} + c_5 D_5),$$

$$D^w_{x,y} = \sum_\mu \left[ (1+\gamma_\mu)\, U^\dagger_{x-\hat\mu,\mu}\, \delta_{x-\hat\mu,y} + (1-\gamma_\mu)\, U_{x,\mu}\, \delta_{x+\hat\mu,y} \right].$$

Using CG:

$$D^\dagger_{PC} D_{PC}\, \psi_e = D^\dagger_{PC}\, \hat{\phi}_e.$$
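For completeness, $D_{PC}$ is just the Schur complement of the even-odd block system; a short derivation (standard reasoning, not spelled out on the slide):

$$\psi_o = M_5^{-1}\left(\phi_o + \kappa_b M^4_{oe}\,\psi_e\right) \;\Longrightarrow\; \left(M_5 - \kappa_b^2\, M^4_{eo} M_5^{-1} M^4_{oe}\right)\psi_e = \phi_e + \kappa_b\, M^4_{eo} M_5^{-1}\phi_o \equiv \hat{\phi}_e,$$

so $D_{PC}$ is the Schur complement with respect to the odd sites and $\hat{\phi}_e$ absorbs the odd part of the source.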

SLIDE 10

The Normal Operator

  • There are 4 hopping terms in the normal operator:

$$A = D^\dagger_{PC} D_{PC} = \left(M_5 - \kappa_b^2\, M^4_{eo} M_5^{-1} M^4_{oe}\right)^\dagger \left(M_5 - \kappa_b^2\, M^4_{eo} M_5^{-1} M^4_{oe}\right).$$

  • This means we need to enforce the Dirichlet boundary condition on $D^\dagger_{PC} D_{PC}$ as a whole, instead of on the individual hopping terms $M^4_{eo/oe}$ ($D^w_{x,y}$).
  • We need to include the snake terms: terms that hop out of the boundary and hop back.
  • This seems obvious but is not trivial to implement.
SLIDE 11

The Normal Operator

The snake terms:
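As a toy illustration of why the snake terms matter (my own 1D analogue using a generic nearest-neighbour hopping matrix, not the actual Möbius operator): truncating the normal operator $D^\dagger D$ to a local block is not the same as forming the normal operator of the truncated $D$; the difference is exactly the hop-out-and-back contribution at the block boundary.

    import numpy as np

    # 1D stand-in for a hopping operator: diagonal plus nearest-neighbour hops.
    n = 8
    D = 2.0 * np.eye(n) + np.eye(n, k=1) + np.eye(n, k=-1)
    A = D.T @ D                    # "normal operator" D^dagger D

    s = slice(0, 4)                # the block owned by one node

    A_without_snake = D[s, s].T @ D[s, s]   # Dirichlet BC applied to D, then squared
    A_with_snake = A[s, s]                  # Dirichlet BC applied to D^dagger D directly

    # Nonzero only at the block boundary: the paths that hop one site out and back.
    print(A_with_snake - A_without_snake)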

SLIDE 12

Dslash Implementation

Before the 1st hopping term.

SLIDE 13

Dslash Implementation

Before the 1st hopping term.

SLIDE 14

Dslash Implementation

After the 1st hopping term.

SLIDE 15

Dslash Implementation

Before the 2nd hopping term.

SLIDE 16

Dslash Implementation

After the 2nd hopping term.

SLIDE 17

Dslash Implementation

Before the 3rd hopping term.

SLIDE 18

Dslash Implementation

Before the 4th hopping term.

SLIDE 19

Multisplitting Algorithm

  • The algorithm converges once the snake terms are included.
  • However, the convergence rate is slow.
  • Similar to [M. Lüscher 2004], we use its first cycle with zero initial guess as a preconditioner for CG.
  • We use plain CG for the preconditioner solve. Instead of setting a precision stopping condition, we iterate a fixed number of times (the inner iteration count).

SLIDE 20

As a Preconditioner

Preconditioned CG, with the preconditioner $M^{-1}$ applied via a single multisplitting cycle:

$r_0 = b - A x_0$
$z_0 = M^{-1} r_0$
$p_0 = z_0$
$k = 0$
while not converged do
    $\alpha_k = \langle r_k, z_k\rangle / \langle p_k, A p_k\rangle$
    $x_{k+1} = x_k + \alpha_k p_k$
    $r_{k+1} = r_k - \alpha_k A p_k$
    $z_{k+1} = M^{-1} r_{k+1}$   ← one multisplitting cycle $A_s x_s^{(k+1)} = r^{(k)} + A_s x_s^{(k)}$: first cycle only, zero initial guess, iterated a fixed number of times
    $\beta_k = \langle z_{k+1}, r_{k+1}\rangle / \langle z_k, r_k\rangle$
    $p_{k+1} = z_{k+1} + \beta_k p_k$
    $k = k + 1$
end while
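A compact numpy sketch of this preconditioned CG is shown below. The `apply_precond` helper stands in for one multisplitting cycle with zero initial guess, so it just solves each local system $A_s z_s = r_s$ sloppily with a fixed number of plain CG steps; the names and the block handling are illustrative assumptions, not the Grid/QUDA code.

    import numpy as np

    def apply_precond(A, blocks, r, n_inner=6):
        """z = M^{-1} r: first multisplitting cycle with zero initial guess; each
        local system A_s z_s = r_s is solved with a fixed number of plain CG steps."""
        z = np.zeros_like(r)
        for lo, hi in blocks:
            A_s, r_s = A[lo:hi, lo:hi], r[lo:hi]
            x = np.zeros_like(r_s)
            res, p = r_s.copy(), r_s.copy()
            for _ in range(n_inner):               # the "inner iteration count"
                Ap = A_s @ p
                alpha = (res @ res) / (p @ Ap)
                x = x + alpha * p
                res_new = res - alpha * Ap
                beta = (res_new @ res_new) / (res @ res)
                p = res_new + beta * p
                res = res_new
            z[lo:hi] = x
        return z

    def mspcg(A, blocks, b, tol=1e-10, n_inner=6, max_iter=10000):
        """Outer preconditioned CG following the pseudocode above."""
        x = np.zeros_like(b)
        r = b - A @ x
        z = apply_precond(A, blocks, r, n_inner)
        p = z.copy()
        for k in range(max_iter):
            Ap = A @ p
            alpha = (r @ z) / (p @ Ap)
            x = x + alpha * p
            r_new = r - alpha * Ap
            if np.linalg.norm(r_new) <= tol * np.linalg.norm(b):
                return x, k + 1
            z_new = apply_precond(A, blocks, r_new, n_inner)
            beta = (z_new @ r_new) / (z @ r)
            p = z_new + beta * p
            r, z = r_new, z_new
        return x, max_iter

On a small SPD test matrix this should typically need fewer outer iterations than plain CG, at the cost of n_inner extra local solves per outer iteration, mirroring the trade-off tuned on the later slides.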

SLIDE 21

As a Preconditioner

The preconditioner $M$ is the block-diagonal part of $A$, built from the local blocks:

$$M = \sum_s A_s.$$

SLIDE 22

As a Preconditioner

  • Although it starts from a different origin, this is effectively the same as additive Schwarz, provided the Dirichlet boundary condition is treated correctly.
  • Inclusion of the snake terms is crucial.
  • On the naming issue, see [A Unified Representation and Theory of Algebraic Additive Schwarz and Multisplitting Methods, A. Frommer 1997].
  • We call the method Multisplitting Preconditioned CG (MSPCG).
SLIDE 23

Result: 32³ × 64

[Plot: relative precision vs. outer iteration count for plain CG and for MSPCG with 3, 4, and 6 inner iterations (32x64x12ID ensemble).]

Figure 3: MSPCG solve on a 32³ × 64 lattice (a⁻¹ = 1.37 GeV) with physical pion mass. Test performed on CORI at NERSC on 128 KNL nodes.

SLIDE 24

Result: 32³ × 64

[Plot: relative precision vs. outer iteration count for plain CG and for MSPCG with 2 to 10 inner iterations.]

Figure 4: MSPCG solve on the same lattice. Test performed on 64 nodes at Piz Daint. Solving $D^\dagger D x = b$ instead of $D^\dagger D x = D^\dagger b$. Numbers from Kate Clark.

SLIDE 25

Result: 64³ × 128

[Plot: number of outer iterations to converge vs. number of inner iterations (3 to 18), for 64, 128, 256, and 512 nodes.]

Figure 5: MSPCG solve on a 64³ × 128 lattice (a⁻¹ = 2.36 GeV) with physical pion mass. Plain CG takes 18092 iterations to converge to the same precision (10⁻¹⁰). KNL nodes at CORI.

SLIDE 26

Result: 80² × 96 × 192

[Plot: relative precision vs. outer iteration count for plain CG and for MSPCG with 6 inner iterations (80x80x96x192DED ensemble).]

Figure 6: MSPCG solve on an 80² × 96 × 192 lattice (a⁻¹ = 3.00 GeV) with physical pion mass. Test performed on CORI at NERSC with 1024 KNL nodes.

SLIDE 27

Sloppy Preconditioner Solve

  • We observe that the number of outer CG iterations is greatly reduced even if the inner preconditioner is solved sloppily, e.g. iterating only 3-6 times.
  • This observation is supported by several theoretical works, e.g. [G. Golub 1999] and [V. Simoncini 2003].
  • Thus the number of inner iterations in the preconditioner solve is a parameter that can be tuned to achieve the maximum speed-up.

SLIDE 28

SUMMIT

For a 16 × 12³ × 12 local volume on 4 Volta GPUs, the preconditioner dslash runs at 14.13 Tflops. With the same local volume on 1024 six-Volta nodes, the outer dslash runs at 1.55 Tflops/GPU. Assuming a factor of 3 reduction in the outer iteration count with 6 inner iterations, the estimated speed-up from MSPCG is

$$\frac{3 \times \frac{1}{1.55}}{\frac{6 \times 1.87}{14.13 \times (6/4)} + \frac{1}{1.55}} = 1.65,$$

where the first term in the denominator is the preconditioner cost (6 inner iterations, a factor 1.87 more work from the snake terms, running at 14.13 × (6/4) Tflops) and the second term is the outer cost (running at 1.55 Tflops).
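The same estimate as a few lines of Python (numbers copied from this slide; the grouping of the factors is my reading of the formula):

    # Back-of-envelope MSPCG speed-up estimate with the slide's numbers.
    outer_reduction = 3.0            # assumed reduction of the outer iteration count
    inner_iter      = 6              # inner iterations per preconditioner application
    snake_overhead  = 1.87           # extra work per inner iteration from snake terms
    precon_tflops   = 14.13 * (6/4)  # preconditioner dslash, scaled from 4 to 6 GPUs
    outer_tflops    = 1.55           # communication-bound outer dslash

    cost_plain = 1.0 / outer_tflops
    cost_mspcg = inner_iter * snake_overhead / precon_tflops + 1.0 / outer_tflops
    print(outer_reduction * cost_plain / cost_mspcg)   # ~1.65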
SLIDE 29

Code Implementation

  • First tested in CPS.
  • Fully implemented in Grid¹ and Quda² with help from Qlattice³.
  • Great thanks to Kate Clark from NVIDIA.

¹ https://github.com/paboyle/Grid
² https://github.com/lattice/quda
³ https://github.com/waterret/Qlattice

SLIDE 30

Conclusion

  • The amount of inter-processor communication can be reduced, at the expense of more local floating-point computation, by using the multisplitting algorithm as a preconditioner for CG.
  • If the local floating-point computation is cheap enough, this greatly speeds up solving the domain wall fermion Dirac equation.

SLIDE 31

Future Work

  • Ongoing work in Quda: speed up the preconditioner dslash as much as possible.
  • The same approach is expected to work for staggered fermions as well.
  • Spectral analysis of the matrix $A$ and the preconditioner $M$.