BLOCK SOLVERS FOR MULTIPLE RIGHT HAND SIDES ON NVIDIA GPUS
Mathias Wagner, Lattice 2016


SLIDE 1

Mathias Wagner, Lattice 2016

BLOCK SOLVERS FOR MULTIPLE RIGHT HAND SIDES ON NVIDIA GPUS

SLIDE 2

AGENDA

- QUDA update
- Dslash for multiple rhs
- Block Krylov solvers

SLIDE 3

UPDATE ON QUDA

SLIDE 4

QUDA

- Effort started at Boston University in 2008; now in wide use as the GPU backend for BQCD, Chroma, CPS, MILC, TIFR, etc.
- Latest release: 0.8.0 (8 February 2016)
- Provides:
  - various solvers for all major fermionic discretizations, with multi-GPU support
  - additional performance-critical routines needed for gauge-field generation
- Maximizes performance:
  - exploits physical symmetries to minimize memory traffic
  - mixed-precision methods
  - autotuning for high performance on all CUDA-capable architectures
  - domain-decomposed (Schwarz) preconditioners for strong scaling
- Eigenvector and deflated solvers (Lanczos, EigCG, GMRES-DR)
- Multigrid solvers for optimal convergence
- A research tool for how to reach the exascale

“QCD on CUDA” – open source, BSD license

SLIDE 5

QUDA - LATTICE QCD ON GPUS

http://lattice.github.com/quda

QUDA is a library for performing calculations in lattice QCD on GPUs.

[Screenshot: the lattice/quda GitHub repository page: 4,621 commits, 49 branches, 19 releases, 16 contributors. Latest commit f3e2aa7, "Merge pull request #487 from lattice/hotfix/checkerboard-reference".]

SLIDE 6

QUDA AUTHORS

Ron Babich (NVIDIA) Michael Baldhauf (Regensburg) Kip Barros (LANL) Rich Brower (Boston University) Nuno Cardoso (Lisbon) Kate Clark (NVIDIA) Michael Cheng (Boston University) Carleton DeTar (Utah University) Justin Foley (Utah -> NIH) Joel Giedt (Rensselaer Polytechnic Institute) Arjun Gambhir (William and Mary) Steve Gottlieb (Indiana University) Dean Howarth (Rensselaer Polytechnic Institute) Bálint Joó (Jlab) Hyung-Jin Kim (BNL -> Samsung) Claudio Rebbi (Boston University) Guochun Shi (NCSA -> Google) Mario Schröck (INFN) Alexei Strelchenko (FNAL) Alejandro Vaquero (INFN) Mathias Wagner (NVIDIA) Frank Winter (Jlab)

SLIDE 7

SOLVERS FOR MULTIPLE RIGHT HAND SIDES

SLIDE 8

CONJUGATE GRADIENT

just as a reminder

procedure CG
    r(0) = b − A x(0)
    p(0) = r(0)
    for k = 1, 2, … until converged do
        z(k−1) = A p(k−1)
        α(k−1) = |r(k−1)|² / ⟨p(k−1), z(k−1)⟩
        x(k) = x(k−1) + α(k−1) p(k−1)
        r(k) = r(k−1) − α(k−1) z(k−1)
        β(k−1) = |r(k)|² / |r(k−1)|²
        p(k) = r(k) + β(k−1) p(k−1)
    end for
end procedure
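As a minimal executable reference for the pseudocode above, here is a NumPy sketch (illustrative only, not the QUDA implementation; `A` stands in for the even-odd preconditioned normal operator):

```python
import numpy as np

def cg(A, b, x0=None, tol=1e-8, maxiter=1000):
    """Plain conjugate gradient for a symmetric positive-definite A,
    following the slide's pseudocode."""
    x = np.zeros_like(b) if x0 is None else x0.copy()
    r = b - A @ x                     # r(0) = b - A x(0)
    p = r.copy()                      # p(0) = r(0)
    rr = r @ r
    for k in range(maxiter):
        z = A @ p                     # z = A p
        alpha = rr / (p @ z)          # alpha = |r|^2 / <p, z>
        x += alpha * p
        r -= alpha * z
        rr_new = r @ r
        if np.sqrt(rr_new) < tol * np.linalg.norm(b):
            break
        beta = rr_new / rr            # beta = |r_new|^2 / |r_old|^2
        p = r + beta * p
        rr = rr_new
    return x
```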

SLIDE 9

QCD PERFORMANCE LIMITERS

- QCD is memory-bandwidth bound: the Dslash arithmetic intensity for HISQ is only ~0.7.
- Exploit SU(3) symmetry: reconstruct the gauge field from 8 or 12 floats.
- Smearing kills the symmetry: stuck with all 18 floats.
- Reuse the gauge field for multiple rhs.

[Chart: Wilson clover Dslash performance in GFLOPS, volume 32⁴, on K80 and P100 in half/single/double precision, for gauge reconstruct 8, 12, and 18.]

[Chart: HISQ Dslash arithmetic intensity (0.7 to 2.8) vs. number of rhs (1 to 16) for reconstruct 18, 13, and 9.]
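A back-of-the-envelope model of why reuse raises the arithmetic intensity. The per-site byte and flop counts below are rough assumptions for single-precision HISQ (8 fat plus 8 long links of 18 floats, 16 neighbor spinor reads plus 1 write of 6 floats, ~1146 flops per site per rhs), not measured values:

```python
def arithmetic_intensity(n_rhs,
                         gauge_bytes=16 * 18 * 4,   # links loaded once per site (assumed counts)
                         spinor_bytes=17 * 6 * 4,   # spinor traffic per rhs per site (assumed)
                         flops_per_rhs=1146):       # HISQ Dslash flops per site (assumed)
    """Flops per byte of a multi-rhs HISQ Dslash: the gauge field is
    read once per site and amortized over n_rhs right-hand sides."""
    bytes_per_site = gauge_bytes + n_rhs * spinor_bytes
    return n_rhs * flops_per_rhs / bytes_per_site
```

With these assumptions the intensity starts near 0.7 for one rhs and more than doubles by 16 rhs, in line with the chart.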

SLIDE 15

TESLA PASCAL P100

Tesla P100 for NVLink-enabled servers: 5.3 TF DP, 10.6 TF SP, 21 TF HP; 720 GB/s memory bandwidth; 16 GB HBM2.

Tesla P100 for PCIe-based servers: 4.7 TF DP, 9.3 TF SP, 18.7 TF HP. Config 1: 16 GB HBM2, 720 GB/s. Config 2: 12 GB HBM2, 540 GB/s.

SLIDE 16

NVIDIA TITAN X

11.0 TF SP 480 GB/sec Memory Bandwidth 12 GB GDDR5X 11 TF SP 480 GB/sec Memory Bandwidth 12 GB GDDR5X Pascal architecture

TITAN X

SLIDE 17

MULTI-SRC DSLASH ON PASCAL

[Chart: multi-src Dslash performance in GFLOP/s (axis 500 to 2000) vs. number of rhs (1 to 16) for P100 double, P100 single, and Titan X single. Volume 24⁴, HISQ, tuned gauge reconstruct.]

SLIDE 21

CONJUGATE GRADIENT

using the multi-src Dslash

- exploit the multi-src Dslash performance
- do all the linear algebra separately for each rhs
- same iteration count as plain CG

procedure CG WITH MULTI-SRC DSLASH
    r_i(0) = b_i − A x_i(0)
    p_i(0) = r_i(0)
    for k = 1, 2, … until converged do
        { z_i(k−1) } = A { p_i(k−1) }          (one multi-src Dslash)
        α_i(k−1) = |r_i(k−1)|² / ⟨p_i(k−1), z_i(k−1)⟩
        x_i(k) = x_i(k−1) + α_i(k−1) p_i(k−1)
        r_i(k) = r_i(k−1) − α_i(k−1) z_i(k−1)
        β_i(k−1) = |r_i(k)|² / |r_i(k−1)|²
        p_i(k) = r_i(k) + β_i(k−1) p_i(k−1)
    end for
end procedure
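The point of this variant is that only the operator application is batched; all scalars stay per-rhs. A NumPy sketch (illustrative; `A @ P` stands in for the multi-src Dslash, and the columns of `B` are the individual right-hand sides):

```python
import numpy as np

def cg_multisrc(A, B, tol=1e-8, maxiter=1000):
    """CG on several rhs at once: one batched A*P per iteration,
    but independent alpha_i and beta_i per column."""
    X = np.zeros_like(B)
    R = B.copy()
    P = R.copy()
    rr = np.sum(R * R, axis=0)                 # |r_i|^2 for each rhs
    bnorm = np.linalg.norm(B, axis=0)
    for k in range(maxiter):
        Z = A @ P                              # batched (multi-src) operator
        alpha = rr / np.sum(P * Z, axis=0)     # per-rhs step size
        X += P * alpha                         # broadcasts over columns
        R -= Z * alpha
        rr_new = np.sum(R * R, axis=0)
        if np.all(np.sqrt(rr_new) < tol * bnorm):
            break
        beta = rr_new / rr
        P = R + P * beta
        rr = rr_new
    return X
```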

SLIDE 22

MULTI-SRC CG ON PASCAL

[Chart: multi-src CG performance in GFLOP/s (axis 500 to 1500) vs. number of rhs (1 to 16) on P100, single precision. Volume 24⁴, HISQ.]

SLIDE 23

BLOCK KRYLOV SPACE SOLVERS

SLIDE 24

BLOCK KRYLOV SOLVERS

Share the Krylov space.

- BlockCG suggested by O'Leary in the early 1980s
- retooled BlockCG by Dubrulle (2001)
- in exact arithmetic, converges in N / #rhs iterations
- applications in QCD: Nakamura et al. (modified block BiCGStab); Birk and Frommer (block methods, including block methods for multiple shifts)

Nakamura et al., CPC 183 (2012) 34-37

SLIDE 26

BLOCK CG

Share the Krylov space between multiple rhs.

procedure BLOCKCG
    R(0) = B − A X(0)
    P(0) = R(0)
    for k = 1, 2, … until converged do
        Z(k−1) = A P(k−1)
        α(k−1) = [(P(k−1))ᴴ Z(k−1)]⁻¹ (R(k−1))ᴴ R(k−1)
        X(k) = X(k−1) + P(k−1) α(k−1)
        R(k) = R(k−1) − Z(k−1) α(k−1)
        β(k−1) = [(R(k−1))ᴴ R(k−1)]⁻¹ (R(k))ᴴ R(k)
        P(k) = R(k) + P(k−1) β(k−1)
    end for
end procedure

[Diagram: X, R, P, and Z are n × m block vectors (n lattice degrees of freedom, m = number of rhs); α and β are small m × m matrices.]
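A direct NumPy transcription of the pseudocode (an illustrative sketch without any reorthogonalization, so it inherits BlockCG's stability caveats discussed later in the deck):

```python
import numpy as np

def block_cg(A, B, tol=1e-8, maxiter=500):
    """BlockCG: X, R, P are n x m blocks; alpha, beta are m x m."""
    X = np.zeros_like(B)
    R = B - A @ X
    P = R.copy()
    RR = R.conj().T @ R                                # m x m Gram matrix R^H R
    bnorm = np.linalg.norm(B)
    for k in range(maxiter):
        Z = A @ P
        alpha = np.linalg.solve(P.conj().T @ Z, RR)    # [P^H Z]^-1 R^H R
        X += P @ alpha
        R -= Z @ alpha
        RR_new = R.conj().T @ R
        if np.linalg.norm(R) < tol * bnorm:
            break
        beta = np.linalg.solve(RR, RR_new)             # [R^H R]^-1 R_new^H R_new
        P = R + P @ beta
        RR = RR_new
    return X
```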

SLIDE 30

REDUCED ITERATION COUNT

HISQ, 32³×8 lattice, Gaussian random sources.

[Chart: residual (1e+00 down to 1e-09) vs. iterations (500 to 4500) for blocks of 1, 2, 4, 8, 12, and 16 rhs.]

SLIDE 37

SCALING

- Dslash exploits reuse of the gauge field.
- Linear algebra: the numbers of dot products and of axpy calls both scale quadratically with the number of rhs:

      y_i = Σ_j a_ij x_j + y_i,    α_ij = ⟨x_i, x_j⟩

[Chart: cost vs. number of rhs (1, 3, 5, 8, 12, 16).]

SLIDE 40

EXPLOIT GPU ARCHITECTURE

To overcome the quadratic scaling: CUDA supports two-dimensional grids of thread blocks, which makes it easy to exploit locality through the texture cache / shared memory.

    y_i = Σ_j a_ij x_j + y_i

For a block of four rhs, each site's update reads every x_j once and reuses it for all y_i:

    y_0(0) = a_00 x_0(0) + a_01 x_1(0) + …
    y_1(0) = a_10 x_0(0) + a_11 x_1(0) + …
    y_2(0) = a_20 x_0(0) + a_21 x_1(0) + …
    y_3(0) = a_30 x_0(0) + a_31 x_1(0) + …

[Diagram: the (i, j) coefficient matrix tiled across a 2D grid of thread blocks; tiling gives cache reuse of the x_j.]
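The tiling idea can be mimicked on the CPU: walk the (i, j) coefficient matrix in 2D tiles so that each x_j is loaded once per tile and reused for every y_i in that tile. A sketch of the scheme only, not the CUDA kernel; the tile size is an arbitrary assumption:

```python
import numpy as np

def tiled_multi_axpy(a, xs, ys, tile=2):
    """Compute y_i += sum_j a[i, j] * x_j tile-by-tile over (i, j),
    so each x_j is reused for all y_i in the current i-tile."""
    m = a.shape[0]
    for i0 in range(0, m, tile):
        for j0 in range(0, m, tile):
            for j in range(j0, min(j0 + tile, m)):
                xj = xs[j]                      # loaded once per (i-tile, j)
                for i in range(i0, min(i0 + tile, m)):
                    ys[i] += a[i, j] * xj       # reused across the i-tile
    return ys
```

The result is identical to the untiled double loop; only the access order, and hence the cache behavior, changes.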

SLIDE 46

ORTHOGONALIZATION

BlockCG is not always numerically stable. The simple approach, Gram-Schmidt or modified Gram-Schmidt, becomes prohibitively expensive.

[Chart: cost per iteration (arbitrary units) vs. number of rhs (2, 4, 8, 12, 16), broken down into Dslash, vector operations, and orthogonalization.]

SLIDE 49

ORTHOGONALIZATION

The simple approach (Gram-Schmidt or modified Gram-Schmidt) becomes prohibitively expensive. CholQR instead:

- Gram matrix B = Rᴴ R: an m × m matrix built from dot products of length n
- Cholesky decomposition of the m × m matrix: Sᴴ S = B
- apply to the vectors, Q = R S⁻¹: axpy-style (output, input) updates that rely on the same kernel as the other vector operations, so linear scaling is attainable
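CholQR fits in a few lines of NumPy (a sketch of the method named on the slide; the heavy lifting is the length-n dot products in B and the axpy-like column updates that form Q):

```python
import numpy as np

def cholqr(R):
    """Orthonormalize the columns of the n x m block R via CholQR:
    B = R^H R (m x m), S^H S = B (Cholesky), Q = R S^-1."""
    B = R.conj().T @ R                      # m x m Gram matrix: length-n dot products
    S = np.linalg.cholesky(B).conj().T      # upper-triangular S with S^H S = B
    Q = np.linalg.solve(S.T, R.T).T         # Q = R S^-1 via a triangular system
    return Q, S
```

By construction Qᴴ Q = S⁻ᴴ B S⁻¹ = I, and R = Q S is recovered exactly, which is why the small m × m factor can later be folded back into the block iteration.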

SLIDE 50

COST OF ONE ITERATION

[Chart: projected cost of one iteration, normalized per rhs (arbitrary units), vs. number of rhs (2, 4, 8, 12, 16), broken down into Dslash, vector operations, and orthogonalization.]

- large benefits from the multi-src Dslash
- linear algebra and orthogonalization stay constant per rhs
- the relative importance of the Dslash decreases

SLIDE 55

SUMMARY

SLIDE 56

WORK TO BE DONE

- stability needs real-world testing; (re)orthogonalization might be necessary
- the iteration-count improvement may depend on the gauge field and the sources: your mileage may vary
- need to finish up the implementation
- add mixed precision

[Chart: residual norm (1e+00 down to 1e-09) vs. iterations (1000 to 4000) for 1, 2, 4, 8, 12, and 16 rhs.]

SLIDE 57

SPEEDUP OVER CG

Multi-rhs block solvers provide an easy drop-in:

- reuse of the gauge field in the Dslash
- reduced iteration count
- no quadratic scaling in the linear algebra and orthogonalization
- no memory overhead / setup cost
- speedups of up to 10x

[Chart: projected speedup per rhs (2.5 to 10) vs. number of rhs (1 to 16) for CG, MSRC CG, Block CG, and BlockCG rQ.]