Solving Domain Wall Dirac Equation Using Multisplitting Preconditioned Conjugate Gradient


  1. Solving Domain Wall Dirac Equation Using Multisplitting Preconditioned Conjugate Gradient. Jiqun Tu, Department of Physics, Columbia University. The 36th International Symposium on Lattice Field Theory, July 23, 2018 @ 16:10. Talk based on: Duo Guo, Robert D. Mawhinney, and Jiqun Tu, [arXiv:1804.08593].

  2. Special thanks to Norman Christ, Chulwoo Jung, and Christopher Kelly. The RBC & UKQCD collaborations: Tianle Wang BNL and BNL/RBRC University of Liverpool Evan Wickenden Nicolas Garron Yasumichi Aoki (KEK) Yidi Zhao Mattia Bruno MIT University of Connecticut Taku Izubuchi Yong-Chull Jang David Murphy Tom Blum Chulwoo Jung Dan Hoying (BNL) Peking University Christoph Lehner Luchang Jin (RBRC) Xu Feng Meifeng Lin Cheng Tu Aaron Meyer University of Southampton Edinburgh University Hiroshi Ohki Jonathan Flynn Shigemi Ohta (KEK) Peter Boyle Vera Guelpers Amarjit Soni Guido Cossu James Harrison UC Boulder Luigi Del Debbio Andreas Juettner Tadeusz Janowski Oliver Witzel James Richings Richard Kenway Chris Sachrajda Columbia University Julia Kettle Fionn O'haigan Ziyuan Bai Stony Brook University Brian Pendleton Norman Christ Jun-Sik Yoo Antonin Portelli Duo Guo Sergey Syritsyn (RBRC) Tobias Tsang Christopher Kelly Azusa Yamaguchi Bob Mawhinney York University (Toronto) Masaaki Tomii KEK Jiqun Tu Renwick Hudspith Julien Frison Bigeng Wang

  3. SUMMIT at ORNL 3/25 Figure 1: The New York Times’s comment on SUMMIT becoming the world’s most powerful supercomputer.

  4. Scaling on SUMMIT at ORNL 4/25 [Plot: Tflops/node vs. number of nodes (1 to 1000); annotated values include 94.2 Tflops/node (single Volta peak performance, projected), 18.60 Tflops/node (single Volta MDWF dslash performance, projected), 8.85, and 1.55 Tflops/node.] Figure 2: Half precision Möbius domain wall fermion CG weak scaling with local volume of 16 × 12³ × 12. 6 NVIDIA Volta GPUs on each compute node. Numbers provided by Chulwoo Jung.

  5. Motivation 5/25 • Inter-processor communication is the bottleneck for Dirac equation solving. • For measurements there are many approaches available to improve the situation: Lanczos, EigCG, split-grid, multigrid, etc. • This is not the case for evolution. • We need a better algorithm to reduce the communication overhead and exploit the fascinating local GPU flops. • Do more work locally!

  6. Previous Work 6/25 • Domain decomposition / multiplicative Schwarz [M. Lüscher 2004]. • Additive Schwarz [Y. Osaki 2000] and [R. Babich 2011].

  7. Multisplitting Algorithm 7/25 For reference see [D. O'Leary 1985]. Split $Ax = b$ into $A_l x_l + A_s x_s + A_r x_r = b_s$, where $A_s$ is the block of $A$ local to a given node acting on the local part $x_s$ of the solution, $A_l$ and $A_r$ are the off-block-diagonal pieces acting on the parts $x_l$ and $x_r$ owned by neighboring nodes, and $b_s$ is the local part of the source. [Figure: block decomposition of $A$ into $A_l$, $A_s$, $A_r$ and of $x$ into $x_l$, $x_s$, $x_r$.]

  8. Multisplitting Algorithm 8/25 Solve $A_l x_l + A_s x_s + A_r x_r = b_s$ by rearranging into an iterative form: $A_s x_s^{(k+1)} = b_s - A_l x_l^{(k)} - A_r x_r^{(k)} = b_s - \left( A x^{(k)} - A_s x_s^{(k)} \right)_s = r_s^{(k)} + A_s x_s^{(k)} \equiv \hat b_s^{(k)}$. For each cycle, • use communication to calculate the right-hand side $\hat b_s^{(k)}$, • solve $A_s x_s^{(k+1)} = \hat b_s^{(k)}$ locally, • the updated solution $x_s^{(k+1)}$ will be used to ready the next cycle. Get $A_s$ for each node by chopping off all off-block-diagonal terms: applying zero Dirichlet boundary conditions.
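As an illustration, here is a minimal NumPy sketch of the cycle above on a generic SPD matrix. The block partitioning, matrix size, and dense local solve are illustrative assumptions, not part of the talk; on the lattice each block solve would apply the Dirichlet-truncated local operator rather than a dense factorization.

```python
import numpy as np

def multisplitting_cycles(A, b, blocks, n_cycles=50):
    """Stationary multisplitting iteration: each block solves its local
    system with the off-block contributions frozen at the previous iterate."""
    x = np.zeros_like(b)
    for _ in range(n_cycles):
        r = b - A @ x                        # one global (communication) step
        x_new = x.copy()
        for s in blocks:                     # in practice: one block per node, in parallel
            A_s = A[np.ix_(s, s)]            # local operator: Dirichlet truncation of A
            b_hat = r[s] + A_s @ x[s]        # \hat b_s^{(k)} = r_s^{(k)} + A_s x_s^{(k)}
            x_new[s] = np.linalg.solve(A_s, b_hat)   # local solve, no communication
        x = x_new
    return x

# toy example: SPD matrix split into two "nodes"
rng = np.random.default_rng(0)
n = 64
B = rng.standard_normal((n, n))
A = B @ B.T + n * np.eye(n)                  # diagonally dominant SPD toy matrix
b = rng.standard_normal(n)
blocks = [np.arange(0, n // 2), np.arange(n // 2, n)]
x = multisplitting_cycles(A, b, blocks)
print("relative residual:", np.linalg.norm(b - A @ x) / np.linalg.norm(b))
```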

  9. Möbius Domain Wall Fermion 9/25 Even-odd preconditioning: $\begin{pmatrix} M_5 & -\kappa_b M^4_{eo} \\ -\kappa_b M^4_{oe} & M_5 \end{pmatrix} \begin{pmatrix} \psi_e \\ \psi_o \end{pmatrix} = \begin{pmatrix} \phi_e \\ \phi_o \end{pmatrix}$, then instead we solve $D_{PC}\,\psi_e = \hat\phi_e$, with $D_{PC} \equiv M_5 - \kappa_b^2\, M^4_{eo} M_5^{-1} M^4_{oe}$, $M^4_{oe/eo} = D^w_{x,y}\,(b_5\,\delta_{s,t} + c_5\, D_5)$, $D^w_{x,y} = \sum_\mu \left[ (1+\gamma_\mu)\, U^\dagger_{x-\hat\mu,\mu}\, \delta_{x-\hat\mu,y} + (1-\gamma_\mu)\, U_{x,\mu}\, \delta_{x+\hat\mu,y} \right]$. Using CG: $D^\dagger_{PC} D_{PC}\, \psi_e = D^\dagger_{PC}\, \hat\phi_e$.
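To make the Schur-complement structure concrete, below is a small NumPy sketch of the same even-odd construction with generic dense blocks standing in for $M_5$ and $\kappa_b M^4_{eo/oe}$; the matrix sizes, the stand-in blocks, and the plain CGNR loop are illustrative assumptions, not the production implementation.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 40                                                 # stand-in for the even/odd sub-lattice size
M5  = np.eye(n) + 0.1 * rng.standard_normal((n, n))    # stand-in for M_5
Meo = 0.1 * rng.standard_normal((n, n))                # stand-in for kappa_b * M^4_eo
Moe = 0.1 * rng.standard_normal((n, n))                # stand-in for kappa_b * M^4_oe
phi_e, phi_o = rng.standard_normal(n), rng.standard_normal(n)

# even-odd (Schur complement) preconditioning of the 2x2 block system
D_pc      = M5 - Meo @ np.linalg.solve(M5, Moe)        # D_PC = M5 - Meo M5^{-1} Moe
phi_e_hat = phi_e + Meo @ np.linalg.solve(M5, phi_o)   # preconditioned even source

# CG on the normal equations: D_PC^dagger D_PC psi_e = D_PC^dagger phi_e_hat
A, b = D_pc.conj().T @ D_pc, D_pc.conj().T @ phi_e_hat
x = np.zeros(n); r = b - A @ x; p = r.copy()
for _ in range(500):
    Ap = A @ p
    alpha = (r @ r) / (p @ Ap)
    x, r_new = x + alpha * p, r - alpha * Ap
    if np.linalg.norm(r_new) < 1e-10 * np.linalg.norm(b):
        r = r_new; break
    p = r_new + ((r_new @ r_new) / (r @ r)) * p
    r = r_new
print("||D_PC psi_e - phi_e_hat|| =", np.linalg.norm(D_pc @ x - phi_e_hat))
```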

  10. The Normal Operator 10/25 • There are 4 hopping terms in the normal operator: $A = D^\dagger_{PC} D_{PC} = (M_5 - \kappa_b^2\, M^4_{eo} M_5^{-1} M^4_{oe})^\dagger (M_5 - \kappa_b^2\, M^4_{eo} M_5^{-1} M^4_{oe})$. • This means we need to enforce the Dirichlet boundary condition on $D^\dagger_{PC} D_{PC}$ instead of on the individual hopping terms $M^4_{eo/oe}$ ($D^w_{x,y}$). • Need to include the snake terms: terms that hop out of the boundary and hop back. • Seems obvious but not trivial to implement.
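As a toy illustration of why the snake terms matter, the following NumPy sketch compares the local (Dirichlet-truncated) block of a normal operator $A^\dagger A$ with the product of the truncated blocks of $A$. The 1D nearest-neighbor matrix and the block split are assumptions made purely for illustration.

```python
import numpy as np

# 1D nearest-neighbor "hopping" operator standing in for a Dslash-like matrix
n = 12
A = 3.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)

local = np.arange(0, n // 2)                 # sites owned by "this node"

# (a) truncate A first (zero Dirichlet BC on A), then form the normal operator
A_loc = A[np.ix_(local, local)]
naive = A_loc.T @ A_loc

# (b) form the normal operator first, then truncate (what the preconditioner needs)
normal_loc = (A.T @ A)[np.ix_(local, local)]

# the difference is the "snake" contribution: hop out across the boundary and hop back
snake = normal_loc - naive
print(np.round(snake, 2))                    # nonzero only at the boundary site
```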

  11. The Normal Operator 11/25 The snake terms:

  12. Dslash Implementation 12/25 before 1st hopping term

  13. Dslash Implementation 12/25 before 1st hopping term

  14. Dslash Implementation 12/25 after 1st hopping term

  15. Dslash Implementation 12/25 before 2nd hopping term

  16. Dslash Implementation 12/25 after 2nd hopping term

  17. Dslash Implementation 12/25 before 3rd hopping term

  18. Dslash Implementation 12/25 before 4th hopping term

  19. Multisplitting Algorithm 13/25 • The algorithm converges once the snake terms are included. • However, the convergence rate is slow. • Similar to [M. Lüscher 2004], we use its first cycle with zero initial guess as a preconditioner for CG. • We use plain CG for the preconditioner solve. Instead of setting a precision stopping condition, we iterate for a fixed number of times (the inner iteration count).

  20. As a Preconditioner 14/25 Preconditioned CG, with the multisplitting preconditioner M applied as a single cycle:
  r_0 = b - A x_0
  z_0 = M^{-1} r_0
  p_0 = z_0
  k = 0
  while have not converged do
      alpha_k = <r_k, z_k> / <p_k, A p_k>
      x_{k+1} = x_k + alpha_k p_k
      r_{k+1} = r_k - alpha_k A p_k
      z_{k+1} = M^{-1} r_{k+1}    <- multisplitting: A_s x_s^{(k+1)} = r_s^{(k)} + A_s x_s^{(k)}; only the first cycle, zero initial guess, iterate a fixed number of times
      beta_k = <z_{k+1}, r_{k+1}> / <z_k, r_k>
      p_{k+1} = z_{k+1} + beta_k p_k
      k = k + 1
  end while
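For concreteness, here is a minimal NumPy sketch following the pseudocode above: the preconditioner applies a fixed number of plain CG iterations to each node's Dirichlet-truncated block of A with zero initial guess. The matrix, block layout, inner iteration count, and tolerance are illustrative assumptions, not the lattice implementation.

```python
import numpy as np

def local_cg(A_s, b_s, n_iter):
    """Sloppy local solve: fixed number of plain CG iterations, zero initial guess."""
    x = np.zeros_like(b_s)
    r = b_s.copy()
    p = r.copy()
    rr = r @ r
    for _ in range(n_iter):
        Ap = A_s @ p
        alpha = rr / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rr_new = r @ r
        p = r + (rr_new / rr) * p
        rr = rr_new
    return x

def precondition(A, r, blocks, inner_iter):
    """Multisplitting preconditioner: one cycle, i.e. independent local block solves."""
    z = np.zeros_like(r)
    for s in blocks:                          # one block per node, no communication
        z[s] = local_cg(A[np.ix_(s, s)], r[s], inner_iter)
    return z

def mspcg(A, b, blocks, inner_iter=6, tol=1e-10, max_outer=1000):
    x = np.zeros_like(b)
    r = b - A @ x
    z = precondition(A, r, blocks, inner_iter)
    p = z.copy()
    for k in range(max_outer):
        Ap = A @ p
        alpha = (r @ z) / (p @ Ap)
        x += alpha * p
        r_new = r - alpha * Ap
        if np.linalg.norm(r_new) < tol * np.linalg.norm(b):
            return x, k + 1
        z_new = precondition(A, r_new, blocks, inner_iter)
        beta = (z_new @ r_new) / (z @ r)
        p = z_new + beta * p
        r, z = r_new, z_new
    return x, max_outer

# toy run on a generic SPD matrix split into four "nodes"
rng = np.random.default_rng(2)
n = 128
B = rng.standard_normal((n, n))
A = B @ B.T + n * np.eye(n)
b = rng.standard_normal(n)
blocks = [np.arange(i, i + n // 4) for i in range(0, n, n // 4)]
x, outer = mspcg(A, b, blocks, inner_iter=6)
print("outer iterations:", outer, " residual:", np.linalg.norm(b - A @ x))
```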

  21. As a Preconditioner 15/25 $M = \bigoplus_s A_s$: the preconditioner is the block-diagonal part of $A$, i.e. the direct sum over nodes of the local Dirichlet-truncated operators $A_s$. [Figure: block-diagonal $M$ compared with the full matrix $A$.]

  22. As a Preconditioner 16/25 • Although it starts from a different origin, this is now effectively the same as additive Schwarz, provided one treats the Dirichlet boundary condition correctly. • Inclusion of the snake terms is crucial. • Naming issue: [A Unified Representation and Theory of Algebraic Additive Schwarz and Multisplitting Methods, A. Frommer 1997]. • Multisplitting Preconditioned CG (MSPCG).

  23. Result: 32³ × 64 17/25 [Plot: relative precision r²/r₀² (10⁻² down to 10⁻⁸) vs. outer iteration count (0 to 14000), comparing plain CG on the 32x64x12ID ensemble with MSPCG using 3, 4, and 6 inner iterations.] Figure 3: MSPCG solve on a 32³ × 64 lattice (a⁻¹ = 1.37 GeV) with physical pion mass. Test performed on CORI at NERSC on 128 KNL nodes.

  24. Result: 32³ × 64 18/25 [Plot: r²/s² (10⁰ down to 10⁻¹⁰) vs. outer iteration count (0 to 35000), comparing plain CG with MSPCG using 2 to 10 inner iterations.] Figure 4: MSPCG solve on the same lattice. Test performed on 64 nodes at Piz Daint. Solving D†Dx = b instead of D†Dx = D†b. Numbers from Kate Clark.

  25. Result: 64³ × 128 19/25 [Plot: number of outer iterations to converge vs. number of inner iterations (0 to 18) on 64, 128, 256, and 512 nodes; the outer iteration count falls from roughly 9700-10000 at few inner iterations to roughly 4200-4300 at 18 inner iterations.] Figure 5: MSPCG solve on a 64³ × 128 lattice (a⁻¹ = 2.36 GeV) with physical pion mass. Plain CG takes 18092 iterations to converge to the same precision (10⁻¹⁰). KNL at CORI.

  26. Result: 80² × 96 × 192 20/25 [Plot: relative precision r²/r₀² (10⁻² down to 10⁻¹⁰) vs. outer iteration count (0 to 16000), comparing plain CG on the 80x80x96x192DED ensemble with MSPCG using 6 inner iterations.] Figure 6: MSPCG solve on an 80² × 96 × 192 lattice (a⁻¹ = 3.00 GeV) with physical pion mass. Test performed on CORI at NERSC with 1024 KNL nodes.

  27. Sloppy Preconditioner Solve 21/25 • We observe that the number of outer CG iterations is greatly reduced even if the inner preconditioner system is solved sloppily, e.g. iterating only 3-6 times. • This observation is supported by several theoretical works, e.g. [G. Golub 1999] and [V. Simoncini 2003]. • Thus the number of inner iterations in the preconditioner solve is a parameter that can be tuned to achieve maximum speedup.

  28. SUMMIT 22/25 For a 16 × 12³ × 12 local volume on 4 Volta GPUs, the preconditioner runs at 14.13 Tflops. With the same local volume on 1024 nodes of 6 Voltas each, the outer operator runs at 1.55 Tflops/GPU. Assuming a factor of 3 reduction in outer iteration count with 6 inner iterations, each costing 1.87× the outer dslash because of the extra work from the snake terms, the speedup from MSPCG is $$\text{speedup} = \frac{3}{\;1 + \underbrace{\dfrac{6 \times 1.87}{14.13 \times (6/4)}}_{\text{precon. cost}} \Big/ \underbrace{\dfrac{1}{1.55}}_{\text{outer cost}}\;} \approx 1.65.$$
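A quick numerical check of this estimate, using the numbers quoted above; the grouping of the factors is my reading of the slide's fraction, so treat it as a sketch of the bookkeeping rather than the authoritative cost model.

```python
# speedup estimate for MSPCG on SUMMIT, using the numbers quoted above
outer_reduction = 3                 # assumed reduction in outer iteration count
inner_iter      = 6                 # inner (preconditioner) iterations per outer iteration
snake_factor    = 1.87              # extra flops per inner dslash from the snake terms
precon_tflops   = 14.13 * (6 / 4)   # 4-GPU measurement projected to a 6-GPU node
outer_tflops    = 1.55              # communication-bound outer dslash, per GPU

precon_cost = inner_iter * snake_factor / precon_tflops   # relative time per outer iteration
outer_cost  = 1 / outer_tflops
speedup = outer_reduction / (1 + precon_cost / outer_cost)
print(f"estimated speedup: {speedup:.2f}")                # ~1.65
```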

  29. Code Implementation 23/25 • First tested in CPS. • Fully implemented in Grid [1] and Quda [2] with help from Qlattice [3]. • Great thanks to Kate Clark from NVIDIA. [1] https://github.com/paboyle/Grid [2] https://github.com/lattice/quda [3] https://github.com/waterret/Qlattice

  30. Conclusion 24/25 • The amount of inter-processor communication can be reduced, at the expense of more local floating point computation, by using the multisplitting algorithm as a preconditioner for CG. • If the local floating point computation is cheap enough, this greatly speeds up the domain wall fermion Dirac equation solve.

  31. Future Work 25/25 • Ongoing work on Quda: speed up the preconditioner dslash as much as possible. • The same approach is expected to work for staggered fermions as well. • Spectral analysis of the matrix A and the preconditioner M.
