SLIDE 1

Mathias Wagner, Developer Technology Engineer GTC 2017 - S7387

SPEEDING UP CONJUGATE GRADIENT SOLVERS BY 10X

SLIDE 2

HPC BEYOND MOORE’S LAW

CPUs and GPUs are becoming wider: the increase in flops is driven by more cores. This also applies to CPUs, from server to mobile. We need a sufficient amount of parallelism to fill these architectures.

going wide

[Chart: CUDA cores per GPU (1000–4000) across the Tesla, Fermi, Kepler, Maxwell, and Pascal generations]

SLIDE 3

HPC BEYOND MOORE’S LAW


staying local

[Chart: arithmetic intensity (FLOPS/byte, 2–10) for Tesla GPU, Intel CPU, and Intel MIC]

SLIDE 4

AGENDA

Lattice Quantum Chromodynamics
Classic Conjugate Gradient solver
Block Krylov solvers
Efficient implementation
Time-to-solution improvements in the wild

SLIDE 5

LATTICE QCD: SOME BASICS

QCD partition function: a 4-dimensional grid (the lattice). Quarks live on the lattice sites, gluons live on the links. Typical sizes: 32³×8 up to 256⁴. Parallelization is over lattice sites (10⁵ to 10⁹ of them).

$Z_{\mathrm{QCD}}(T,\mu) = \int \mathcal{D}A\, \mathcal{D}\bar\Psi\, \mathcal{D}\Psi\, e^{-S_E(T,\mu)}$

includes integral over space and time

SLIDE 6

QUDA - LATTICE QCD ON GPUS

http://lattice.github.com/quda

QUDA is a library for performing calculations in lattice QCD on GPUs.

[Screenshot: lattice/quda GitHub repository, branch develop: 4,621 commits, 49 branches, 19 releases, 16 contributors; latest commit f3e2aa7, "Merge pull request #487 from lattice/hotfix/checkerboard-reference"]

SLIDE 7

QUDA AUTHORS

Ron Babich (NVIDIA), Michael Baldhauf (Regensburg), Kip Barros (LANL), Rich Brower (Boston University), Nuno Cardoso (Lisbon), Kate Clark (NVIDIA), Michael Cheng (Boston University), Carleton DeTar (Utah University), Justin Foley (NIH), Joel Giedt (Rensselaer Polytechnic Institute), Arjun Gambhir (William and Mary), Steve Gottlieb (Indiana University), Dean Howarth (Rensselaer Polytechnic Institute), Bálint Joó (JLab), Hyung-Jin Kim (BNL -> Samsung), Claudio Rebbi (Boston University), Guochun Shi (NCSA -> Google), Mario Schröck (INFN), Alexei Strelchenko (FNAL), Alejandro Vaquero (Utah University), Mathias Wagner (NVIDIA), Evan Weinberg (Boston University), Frank Winter (JLab)

Green: Contributors to this talk

SLIDE 8

THE OLD WORK HORSE

SLIDE 9

CONJUGATE GRADIENT

procedure CG
    r(0) = b − A x(0)
    p(0) = r(0)
    for k = 1, 2, … until converged do
        z(k−1) = A p(k−1)
        α(k−1) = |r(k−1)|² / ⟨p(k−1), z(k−1)⟩
        x(k) = x(k−1) + α(k−1) p(k−1)
        r(k) = r(k−1) − α(k−1) z(k−1)
        β(k−1) = |r(k)|² / |r(k−1)|²
        p(k) = r(k) + β(k−1) p(k−1)
    end for
end procedure
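The pseudocode above can be sketched in a few lines of plain Python for a small symmetric positive-definite system (an illustration only; the production solver runs as CUDA kernels in QUDA):

```python
# Plain-Python conjugate gradient for A x = b, with A symmetric positive
# definite. Mirrors the pseudocode: one mat-vec, three caxpy-style updates
# and two reductions per iteration.
def cg(A, b, x, tol=1e-10, max_iter=1000):
    n = len(b)
    matvec = lambda v: [sum(A[i][j] * v[j] for j in range(n)) for i in range(n)]
    dot = lambda u, v: sum(ui * vi for ui, vi in zip(u, v))

    r = [bi - Axi for bi, Axi in zip(b, matvec(x))]    # r(0) = b - A x(0)
    p = list(r)                                        # p(0) = r(0)
    rr = dot(r, r)
    for _ in range(max_iter):
        if rr < tol * tol:
            break
        z = matvec(p)                                  # dominant cost: mat-vec
        alpha = rr / dot(p, z)                         # reduction
        x = [xi + alpha * pi for xi, pi in zip(x, p)]  # caxpy
        r = [ri - alpha * zi for ri, zi in zip(r, z)]  # caxpy
        rr, rr_old = dot(r, r), rr                     # reduction
        p = [ri + (rr / rr_old) * pi for ri, pi in zip(r, p)]  # caxpy
    return x

# tiny SPD example: solves [[4,1],[1,3]] x = [1,2]
x = cg([[4.0, 1.0], [1.0, 3.0]], [1.0, 2.0], [0.0, 0.0])
```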

SLIDE 10

CONJUGATE GRADIENT

The matrix-vector operation dominates the runtime; the remainder of each iteration consists of caxpy BLAS operations and reductions.
SLIDE 13

STAGGERED FERMION MATRIX (DSLASH)

The Krylov-space solve of the fermion matrix dominates the runtime; within the inversion, the application of the sparse matrix (Dslash) dominates (>80%). Highly Improved Staggered Quarks (HISQ) use a nearest- and 3rd-neighbor stencil:

$w_x = D_{x,x'} v_{x'} = \sum_{\mu=0}^{3} \left[ U_{x,\mu} v_{x+\hat\mu} - U^\dagger_{x-\hat\mu,\mu} v_{x-\hat\mu} + N_{x,\mu} v_{x+3\hat\mu} - N^\dagger_{x-3\hat\mu,\mu} v_{x-3\hat\mu} \right]$

U and N are complex 3×3 matrices (72 bytes each in fp32); v is a complex 3-dimensional vector (24 bytes in fp32).

Each site x loads 1024 bytes for links and 384 bytes for vectors and stores 24 bytes, 1432 bytes per site in total, while performing 1146 flop: an arithmetic intensity of 0.8 flop/byte. The Dslash is therefore sensitive to memory bandwidth.
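The per-site traffic count can be checked with a few lines (the 1024-byte link figure is taken from the slide as given; the vector traffic follows from 16 neighbor vectors of 24 bytes each in fp32):

```python
# Byte and flop accounting for one HISQ Dslash site.
link_bytes = 1024                # U and N gauge links loaded per site
vector_bytes = 16 * 24           # 16 neighbor vectors, 24 bytes each (fp32)
store_bytes = 24                 # one output vector
total_bytes = link_bytes + vector_bytes + store_bytes
flops = 1146                     # flop performed per site
intensity = flops / total_bytes  # arithmetic intensity in flop/byte
```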

SLIDE 17

THE NEW TRACTOR

SLIDE 18

BLOCK KRYLOV SPACE SOLVERS

Share the Krylov space: the BlockCG solver was suggested by O'Leary in the early 1980s and retooled by Dubrulle in 2001. In exact arithmetic it converges in N / (number of rhs) iterations. Applications in QCD: Nakamura et al. (modified block BiCGStab); Birk and Frommer (block methods, including block methods for multi-shift).

Nakamura et al., CPC 183 (2012) 34–37

SLIDE 20

BLOCK CG

share the Krylov space between multiple rhs

procedure BlockCG
    R(0) = B − A X(0)
    P(0) = R(0)
    for k = 1, 2, … until converged do
        Z(k−1) = A P(k−1)
        α(k−1) = [(P(k−1))ᴴ Z(k−1)]⁻¹ (R(k−1))ᴴ R(k−1)
        X(k) = X(k−1) + P(k−1) α(k−1)
        R(k) = R(k−1) − Z(k−1) α(k−1)
        β(k−1) = [(R(k−1))ᴴ R(k−1)]⁻¹ (R(k))ᴴ R(k)
        P(k) = R(k) + P(k−1) β(k−1)
    end for
end procedure

[Diagram: the n rhs, each of length m, collected into an m × n block; one iteration updates all columns together]
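The block algorithm above can be modeled in plain real-arithmetic Python, with α and β as small matrices obtained from the Gram matrices. This is an illustration only: it has no re-orthogonalization, so it inherits classic BlockCG's instability, which is what BlockCGrQ later fixes.

```python
# Block CG following the pseudocode: the columns of B are the rhs.
def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(r) for r in zip(*A)]

def solve(M, B):
    # Gaussian elimination for a small dense M X = B (no pivoting; fine
    # here because M is a Gram-type matrix, hence positive definite)
    n = len(M)
    aug = [list(M[i]) + list(B[i]) for i in range(n)]
    for i in range(n):
        aug[i] = [v / aug[i][i] for v in aug[i]]
        for j in range(n):
            if j != i:
                f = aug[j][i]
                aug[j] = [vj - f * vi for vj, vi in zip(aug[j], aug[i])]
    return [row[n:] for row in aug]

def block_cg(A, B, max_iter=50, tol=1e-10):
    L, n = len(B), len(B[0])
    X = [[0.0] * n for _ in range(L)]
    R = [row[:] for row in B]                  # R = B - A X with X = 0
    P = [row[:] for row in R]
    RtR = matmul(transpose(R), R)
    for _ in range(max_iter):
        if max(abs(r) for row in R for r in row) < tol:
            break
        Z = matmul(A, P)                       # n mat-vecs
        alpha = solve(matmul(transpose(P), Z), RtR)
        X = [[x + s for x, s in zip(xr, sr)] for xr, sr in zip(X, matmul(P, alpha))]
        R = [[r - s for r, s in zip(rr, sr)] for rr, sr in zip(R, matmul(Z, alpha))]
        RtR, RtR_old = matmul(transpose(R), R), RtR
        beta = solve(RtR_old, RtR)
        P = [[r + s for r, s in zip(rr, sr)] for rr, sr in zip(R, matmul(P, beta))]
    return X

# 4x4 SPD system with two right-hand sides solved together
A = [[4.0, 1.0, 0.0, 0.0], [1.0, 3.0, 1.0, 0.0],
     [0.0, 1.0, 2.0, 1.0], [0.0, 0.0, 1.0, 3.0]]
B = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 1.0]]
X = block_cg(A, B)
```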

SLIDE 24

REDUCED ITERATION COUNT

HISQ, 32³×8, Gaussian random source

[Chart: residual (10⁰ down to 10⁻⁹) vs. iterations (500–4500) for blocks of 1, 2, 4, 8, 12, and 16 rhs]

SLIDE 31

BLOCK CGRQ

Prevent numerical instabilities through orthogonalization.

A ∈ C^(L×L); R, B, X, Q ∈ C^(L×N); C, S ∈ C^(N×N)

procedure BlockCGrQ(X, B)
    R = B − A X
    Q C = R                ▷ QR decomposition
    S = I(N×N)
    P = 0
    while not converged do
        P = Q + P S†
        α = [P† A P]⁻¹
        X = X + P α C
        Q S = Q − A P α    ▷ QR decomposition
        C = S C
    end while
end procedure

SLIDE 32

ORTHOGONALIZATION

The simple approach, Gram-Schmidt or modified Gram-Schmidt, becomes prohibitively expensive. Thin QR instead builds the Gram matrix from dot products of length-n vectors, performs a Cholesky decomposition of the small m × m matrix, and applies the result to the vectors with axpy operations (output, input). It relies only on BLAS operations and reductions; the Cholesky decomposition runs on the CPU.

THIN QR

B = RᴴR        (m × m Gram matrix)
SᴴS = B        (Cholesky decomposition, m × m)
Q = R S⁻¹
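The three steps can be sketched in plain Python (real arithmetic for brevity; an illustration of the technique, not QUDA code):

```python
# Thin QR as on the slide: form the m x m Gram matrix B = R^T R (the only
# work on length-n vectors), Cholesky-factorize B = S^T S on the CPU, then
# apply S^{-1} to the vectors with axpy-style updates.
def thin_qr(R):
    n, m = len(R), len(R[0])          # columns of R are the m block vectors
    # Gram matrix: m*m dot products of length n
    B = [[sum(R[k][i] * R[k][j] for k in range(n)) for j in range(m)]
         for i in range(m)]
    # Cholesky B = S^T S with S upper triangular (small m x m problem)
    S = [[0.0] * m for _ in range(m)]
    for i in range(m):
        for j in range(i, m):
            s = B[i][j] - sum(S[k][i] * S[k][j] for k in range(i))
            S[i][j] = s ** 0.5 if i == j else s / S[i][i]
    # Q = R S^{-1} by back substitution, column by column (axpy-like updates)
    Q = [[0.0] * m for _ in range(n)]
    for j in range(m):
        for k in range(n):
            Q[k][j] = (R[k][j] - sum(Q[k][i] * S[i][j] for i in range(j))) / S[j][j]
    return Q, S

R = [[1.0, 1.0], [1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
Q, S = thin_qr(R)   # columns of Q are orthonormal and R = Q S
```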

SLIDE 33

INCREASED COST PER ITERATION

[Chart legend: Dslash, BLAS, reduction]

Applying the Dslash to m rhs naively scales linearly: constant time per rhs. Instead of a dot product between two vectors we now need to evaluate m × m dot products for the orthogonalization: cost quadratic in m, i.e. linear per rhs. Likewise, caxpy now couples m × m input vectors via an m × m matrix a, again quadratic in m and linear per rhs.

SLIDE 34

COST PER ITERATION

[Chart: relative cost per iteration for one rhs, normalized to naive 1 rhs (0.333–3), vs. number of rhs (1–16); curve: naive]

SLIDE 35

DSLASH FOR MULTIPLE RHS

SLIDE 36

QCD PERFORMANCE LIMITERS

QCD is memory-bandwidth bound: the Dslash arithmetic intensity for HISQ is ~0.7-0.8 flop/byte. Exploit SU(3) symmetry to reconstruct the gauge field from 8/12 floats, and reuse the constant gauge field across multiple rhs.

reconstruct

[Chart: arithmetic intensity (0.7–2.8) vs. number of rhs (1–16) for gauge links stored with 18, 13, and 9 floats]
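A short model shows why gauge-field reuse raises the arithmetic intensity: assuming perfect reuse (an idealization), the link bytes are loaded once per site while vector traffic and flops grow with the number of rhs. Per-site figures are the ones quoted earlier in the talk for full 18-float link storage.

```python
# Arithmetic intensity of the multi-rhs HISQ Dslash under perfect
# gauge-field reuse: 1024 bytes of links amortized over all rhs,
# 384 + 24 bytes of vector traffic and 1146 flop per rhs.
def dslash_intensity(n_rhs, link_bytes=1024):
    bytes_per_site = link_bytes + n_rhs * (384 + 24)   # links loaded once
    flops_per_site = n_rhs * 1146
    return flops_per_site / bytes_per_site

ai_1 = dslash_intensity(1)     # ~0.8 flop/byte, the single-rhs figure
ai_16 = dslash_intensity(16)   # ~2.4 flop/byte
```

The intensity saturates below 1146/408 ≈ 2.8 flop/byte, which is the asymptotic value once the link traffic is fully amortized.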

SLIDE 39

MULTI-SRC DSLASH

Volume 24⁴, HISQ

[Chart: GFLOP/s (1,000–4,000) vs. number of rhs (1–16) for double, single, and half precision, each with its roofline]

For the roofline model: assume 100% cache reuse in the x-direction, 50% in the y-direction.

SLIDE 40

WHY DON'T WE SEE EXPECTED SCALING

The 1-rhs Dslash agrees with the roofline: device memory bandwidth is the performance limiter.

SLIDE 43

CACHE BASED MULTI RHS DSLASH

Checking the extreme case with 16 rhs shows a high number of memory operations: the unified cache is limiting performance.

SLIDE 47

REGISTER OPTIMIZATION

Hybrid implementation.

Cache: use the y-dimension of CUDA blocks for the rhs, so one lattice site is scheduled on the same SM for all rhs; each thread processes one rhs on one lattice site. This is necessary for cache reuse of the gauge field.

Register + cache: as above, but each thread processes multiple rhs on one lattice site, reusing the gauge field from registers. This reduces cache pressure.
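The hybrid index mapping described above can be modeled in a few lines. The flattened thread id and all names here are illustrative, not QUDA's actual implementation:

```python
# Model of the hybrid mapping: the CUDA block x-dimension runs over lattice
# sites, the y-dimension over groups of rhs, and each thread keeps n_reg
# rhs in registers.
def thread_assignments(n_sites, n_rhs, n_reg):
    """Map (site, rhs) -> thread id for a flattened 2D launch."""
    assert n_rhs % n_reg == 0
    n_y = n_rhs // n_reg                       # y-extent of the launch
    mapping = {}
    for site in range(n_sites):                # x-dimension: lattice sites
        for y in range(n_y):                   # y-dimension: rhs groups
            tid = y * n_sites + site
            for i in range(n_reg):             # register-resident rhs
                mapping[(site, y * n_reg + i)] = tid
    return mapping

m = thread_assignments(n_sites=4, n_rhs=16, n_reg=4)
# all 16 rhs of a site are handled by only n_rhs/n_reg threads, each of
# which reuses the site's gauge links from registers across its n_reg rhs
```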

SLIDE 49

REGISTER REUSE DSLASH

Checking the extreme case with 16 rhs now shows a medium number of memory operations: the unified cache is no longer limiting performance.

SLIDE 53

MULTI-SRC DSLASH ON PASCAL

Volume 24⁴, HISQ, tuned gauge reconstruction

[Chart: GFLOP/s (1,000–4,000) vs. number of rhs (1–16) for double, single, and half precision, each with and without register reuse, plus the corresponding rooflines]

SLIDE 54

QUDA AUTOTUNER

runtime tuning of launch parameters and more

How do we decide how to divide between texture (cache) and register reuse? E.g. for 16 rhs the possible combinations are (1,16), (2,8), (4,4), (8,2), (16,1). QUDA already uses an autotuner at runtime to determine optimal launch parameters: tuning happens on the first call of each kernel for a given set of parameters (such as the volume), and the result is cached on disk for reuse in subsequent calculations. The autotuner can also be used for auxiliary parameters, here the number of rhs kept in registers.
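A toy version of this runtime autotuning over an auxiliary parameter: time every (register, cache) split of the rhs on the first call, cache the winner, and reuse it afterwards. The kernel and its cost model are stand-ins, not QUDA code.

```python
import time

_tune_cache = {}  # in QUDA this cache is also persisted to disk

def factor_pairs(n):
    """All (register, cache) splits of n rhs."""
    return [(a, n // a) for a in range(1, n + 1) if n % a == 0]

def tune(kernel, n_rhs):
    key = (kernel.__name__, n_rhs)
    if key not in _tune_cache:                 # tune on first call only
        timings = {}
        for split in factor_pairs(n_rhs):
            t0 = time.perf_counter()
            kernel(split)
            timings[split] = time.perf_counter() - t0
        _tune_cache[key] = min(timings, key=timings.get)
    return _tune_cache[key]

def mock_dslash(split):
    reg_rhs, cache_rhs = split
    # stand-in workload whose cost grows with the imbalance of the split
    sum(range(2000 * (1 + abs(reg_rhs - cache_rhs))))

best = tune(mock_dslash, 16)   # one of (1,16), (2,8), (4,4), (8,2), (16,1)
```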

SLIDE 57

COST PER ITERATION

[Chart: relative cost per iteration for one rhs, normalized to naive 1 rhs (0.333–3), vs. number of rhs (1–16); curves: naive, multi-rhs Dslash]

SLIDE 58

OPTIMIZING LINEAR ALGEBRA

SLIDE 59

EXPLOIT GPU ARCHITECTURE

To overcome the quadratic scaling: CUDA supports two-dimensional grids of blocks, so it is easy to exploit locality via the texture cache / shared memory. The 2D tiling also aligns better with GP100's new datapath configuration.

$y_i = \sum_j a_{ij} x_j + y_i$

y0(0) = a00 x0(0) + a01 x1(0) + …
y1(0) = a10 x0(0) + a11 x1(0) + …
y2(0) = a20 x0(0) + a21 x1(0) + …
y3(0) = a30 x0(0) + a31 x1(0) + …

[Diagram: the (i, j) index space tiled into 2D thread blocks for cache reuse of the x_j]
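The tiled update above can be modeled with plain Python loops (illustration only): within one (i, j) tile the same x_j chunks feed several outputs y_i, which is exactly the locality a 2D CUDA block exploits.

```python
# Tiled evaluation of the block update y_i = sum_j a_ij x_j + y_i.
# The flop count grows quadratically with the block size, but each x_j
# chunk loaded inside a tile is reused for `tile` output vectors.
def block_caxpy(a, xs, ys, tile=2):
    """In-place y_i += sum_j a[i][j] * x_j, processed tile by tile."""
    m = len(ys)
    for i0 in range(0, m, tile):
        for j0 in range(0, m, tile):
            # one tile: the x chunks touched here serve `tile` outputs
            for i in range(i0, min(i0 + tile, m)):
                for j in range(j0, min(j0 + tile, m)):
                    aij = a[i][j]
                    ys[i] = [y + aij * x for y, x in zip(ys[i], xs[j])]
    return ys

a = [[1.0, 2.0], [3.0, 4.0]]
xs = [[1.0, 0.0], [0.0, 1.0]]
out = block_caxpy(a, xs, [[0.0, 0.0], [0.0, 0.0]])
```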

SLIDE 65

MULTI-BLAS

caxpy

[Charts: caxpy time (0.001–0.005 s) vs. block size (1–16), compared against linear and quadratic scaling; achieved GFLOPS (500–2000) and GB/s (550–2200) vs. block size (1–16)]

SLIDE 67

MULTI-REDUCTION

cdot

[Charts: cdot time (0.003–0.012 s) vs. block size (1–16), compared against linear and quadratic scaling; achieved GFLOPS (450–1800) and GB/s (1000–4000) vs. block size (1–16)]

SLIDE 69

COST PER ITERATION

[Chart: relative cost per iteration for one rhs, normalized to naive 1 rhs (0.333–3), vs. number of rhs (1–16); curves: naive, multi-rhs Dslash, full multi]

SLIDE 70

BLOCKCGRQ IN THE WILD

SLIDE 71

ACTUAL PHYSICS PROJECT

G-2 project of the MILC collaboration: the calculations inherently need to solve multiples of 8 rhs. A two-step solution process (sloppy solve + refinement) yields a 7x speedup for production runs.

[Chart: speedup over sequential CG (0–7.0) vs. number of rhs (1–64), double precision]

SLIDE 72

SUMMARY

SLIDE 73

BLOCKCG RQ

more parallelism and lots of locality to exploit

Reuse the gauge field in the Dslash; reuse vectors in BLAS operations and reductions. This avoids the quadratic scaling in the linear algebra and orthogonalization, creates more parallelism to saturate wider architectures, and squeezes 4x the number of FLOPS out of a P100.

[Chart: GFLOPS (225–900) vs. number of rhs (1–16) for the naive, multi-rhs Dslash, and full multi implementations]

SLIDE 76

TIME TO SOLUTION

Combined effect of reduced iteration count and cost per iteration: about a factor of 3 improvement in cost per iteration, while the additional (free) flops in the BLAS operations and reductions drive the reduced iteration count. The reduction in iteration count depends on the number of rhs, the target residual, and the matrix condition. BlockCGrQ is an immediate drop-in for the CG solver, with no overhead costs.

[Charts: residual (10⁰ down to 10⁻⁹) vs. iterations (500–4500) for blocks of 1, 2, 4, 8, 12, and 16 rhs; speedup over sequential CG (0–7.0) vs. number of rhs (1–64), double precision]