SPEEDING UP CONJUGATE GRADIENT SOLVERS BY 10X
Mathias Wagner, Developer Technology Engineer, NVIDIA
GTC 2017 - S7387
HPC BEYOND MOORE'S LAW
going wide
CPUs and GPUs are becoming wider: the increase in flops is driven by more cores. This also applies to CPUs (server to mobile). A sufficient amount of parallelism is needed to fill these architectures.
[Chart: CUDA cores per GPU generation, roughly 1000 to 4000 from Tesla through Fermi, Kepler, Maxwell, and Pascal]
HPC BEYOND MOORE'S LAW
staying local
The increase in flops is driven by more cores, and the flop/byte ratio of the hardware keeps rising, so sufficient data reuse is needed to feed the cores.
[Chart: flops/byte ratio (2 to 10) across generations for Tesla GPUs, Intel CPUs, and Intel MIC]
AGENDA
Lattice Quantum Chromodynamics
Classic Conjugate Gradient solver
Block Krylov solvers
Efficient implementation
Time-to-solution improvements in the wild
LATTICE QCD: SOME BASICS
QCD partition function (includes an integral over space and time):

Z_{\mathrm{QCD}}(T,\mu) = \int \mathcal{D}A \, \mathcal{D}\bar{\Psi} \, \mathcal{D}\Psi \; e^{-S_E(T,\mu)}

Discretized on a 4-dimensional grid (= lattice): quarks live on the lattice sites, gluons live on the links. Typical sizes range from 24³ × 6 to 256⁴, with parallelization over the lattice sites (10⁵ to 10⁹ of them).
QUDA - LATTICE QCD ON GPUS
QUDA is a library for performing calculations in lattice QCD on GPUs: http://lattice.github.com/quda
An actively developed open-source project: 4,621 commits, 49 branches, 19 releases, 16 contributors.
QUDA AUTHORS
Ron Babich (NVIDIA) Michael Baldhauf (Regensburg) Kip Barros (LANL) Rich Brower (Boston University) Nuno Cardoso (Lisbon) Kate Clark (NVIDIA) Michael Cheng (Boston University) Carleton DeTar (Utah University) Justin Foley (NIH) Joel Giedt (Rensselaer Polytechnic Institute) Arjun Gambhir (William and Mary) Steve Gottlieb (Indiana University) Dean Howarth (Rensselaer Polytechnic Institute) Bálint Joó (Jlab) Hyung-Jin Kim (BNL -> Samsung) Claudio Rebbi (Boston University) Guochun Shi (NCSA -> Google) Mario Schröck (INFN) Alexei Strelchenko (FNAL) Alejandro Vaquero (Utah University) Mathias Wagner (NVIDIA) Evan Weinberg (Boston University) Frank Winter (Jlab)
Highlighted in green on the original slide: contributors to this talk.
THE OLD WORK HORSE
CONJUGATE GRADIENT

procedure CG
    r(0) = b − A x(0)
    p(0) = r(0)
    for k = 1, 2, . . . until converged do
        z(k−1) = A p(k−1)                      ▷ matrix-vector operation: dominates runtime
        α(k−1) = |r(k−1)|² / ⟨p(k−1), z(k−1)⟩  ▷ reduction
        x(k) = x(k−1) + α(k−1) p(k−1)          ▷ caxpy (BLAS)
        r(k) = r(k−1) − α(k−1) z(k−1)          ▷ caxpy (BLAS)
        β(k−1) = |r(k)|² / |r(k−1)|²           ▷ reduction
        p(k) = r(k) + β(k−1) p(k−1)            ▷ caxpy (BLAS)
    end for
end procedure

The matrix-vector operation dominates the runtime; the remainder of each iteration consists of caxpy-type BLAS operations and global reductions.
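In code, one iteration is one matrix-vector product, three caxpy-type updates, and two reductions. Below is a minimal host-side sketch of this loop with cuBLAS in fp32 complex; the apply_A callback standing in for the matrix application, the preallocated device buffers, and the absolute tolerance are illustrative assumptions, not QUDA's API.

```
// Minimal CG loop sketch with cuBLAS (complex single precision).
// apply_A, the buffer layout, and the stopping rule are assumptions.
#include <cublas_v2.h>
#include <cuComplex.h>

typedef void (*mat_vec_t)(const cuComplex* in, cuComplex* out);  // z = A p

void cg_solve(cublasHandle_t h, int n, mat_vec_t apply_A,
              cuComplex* d_x, const cuComplex* d_b,
              cuComplex* d_r, cuComplex* d_p, cuComplex* d_z,
              float tol, int max_iter) {
    const cuComplex one = make_cuComplex(1.f, 0.f);
    const cuComplex neg_one = make_cuComplex(-1.f, 0.f);

    apply_A(d_x, d_z);                                 // r = b - A x; p = r
    cublasCcopy(h, n, d_b, 1, d_r, 1);
    cublasCaxpy(h, n, &neg_one, d_z, 1, d_r, 1);
    cublasCcopy(h, n, d_r, 1, d_p, 1);

    float rr;                                          // |r|^2
    cublasScnrm2(h, n, d_r, 1, &rr); rr *= rr;

    for (int k = 0; k < max_iter && rr > tol * tol; ++k) {
        apply_A(d_p, d_z);                             // matrix-vector: dominates runtime
        cuComplex pz;                                  // <p, z>: global reduction
        cublasCdotc(h, n, d_p, 1, d_z, 1, &pz);
        cuComplex alpha = make_cuComplex(rr / cuCrealf(pz), 0.f);  // real for HPD A
        cublasCaxpy(h, n, &alpha, d_p, 1, d_x, 1);                 // x += alpha p
        cuComplex neg_alpha = make_cuComplex(-cuCrealf(alpha), 0.f);
        cublasCaxpy(h, n, &neg_alpha, d_z, 1, d_r, 1);             // r -= alpha z
        float rr_new;                                  // |r|^2: global reduction
        cublasScnrm2(h, n, d_r, 1, &rr_new); rr_new *= rr_new;
        cuComplex beta = make_cuComplex(rr_new / rr, 0.f);
        cublasCscal(h, n, &beta, d_p, 1);                          // p = r + beta p
        cublasCaxpy(h, n, &one, d_r, 1, d_p, 1);
        rr = rr_new;
    }
}
```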
STAGGERED FERMION MATRIX (DSLASH)
The Krylov-space solve of the fermion matrix dominates the runtime; within the inversion, the application of the sparse matrix (Dslash) dominates (>80%). Highly Improved Staggered Quarks (HISQ) use a nearest-neighbor and 3rd-neighbor stencil:

w_x = D_{x,x'} v_{x'} = \sum_{\mu=0}^{3} \left[ U_{x,\mu} v_{x+\hat\mu} - U^\dagger_{x-\hat\mu,\mu} v_{x-\hat\mu} + N_{x,\mu} v_{x+3\hat\mu} - N^\dagger_{x-3\hat\mu,\mu} v_{x-3\hat\mu} \right]

U and N are complex 3×3 matrices (72 bytes each in fp32); v is a complex 3-dimensional vector (24 bytes in fp32).

Each site x loads 1024 bytes of links and 384 bytes of vectors and stores 24 bytes, 1432 bytes per site in total, while performing 1146 flop: an arithmetic intensity of 0.8 flop/byte. The kernel is sensitive to memory bandwidth.
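The arithmetic intensity directly bounds the attainable performance. A back-of-the-envelope roofline check using the per-site numbers from this slide; the 732 GB/s bandwidth figure is an assumed P100-class peak, used only for illustration:

```
// Roofline estimate for the HISQ Dslash from the per-site traffic above.
#include <cstdio>

int main() {
    const double bytes_links   = 1024;   // gauge links loaded per site
    const double bytes_vectors = 384;    // neighbor vectors loaded per site
    const double bytes_store   = 24;     // result vector stored per site
    const double flops_per_site = 1146;

    double bytes_per_site = bytes_links + bytes_vectors + bytes_store;  // 1432
    double intensity = flops_per_site / bytes_per_site;                 // ~0.8 flop/byte

    double bw = 732e9;                   // assumed device bandwidth [B/s]
    printf("arithmetic intensity: %.2f flop/byte\n", intensity);
    printf("bandwidth-bound ceiling: %.0f GFLOP/s\n", intensity * bw / 1e9);
    return 0;
}
```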
THE NEW TRACTOR
BLOCK KRYLOV SPACE SOLVERS
Share the Krylov space
The BlockCG solver was suggested by O'Leary in the early 1980s and retooled by Dubrulle in 2001. In exact arithmetic it converges in N / (number of rhs) iterations. Applications in QCD: Nakamura et al. (modified block BiCGStab); Birk and Frommer (block methods, including block methods for multiple shifts).
Nakamura et al., CPC 183 (2012) 34-37
BLOCK CG
Share the Krylov space between multiple rhs: single vectors (m × 1) become blocks (m × n) for n right-hand sides, and α and β become n × n matrices.

procedure BlockCG
    R(0) = B − A X(0)
    P(0) = R(0)
    for k = 1, 2, . . . until converged do
        Z(k−1) = A P(k−1)
        α(k−1) = [(P(k−1))ᴴ Z(k−1)]⁻¹ (R(k−1))ᴴ R(k−1)
        X(k) = X(k−1) + P(k−1) α(k−1)
        R(k) = R(k−1) − Z(k−1) α(k−1)
        β(k−1) = [(R(k−1))ᴴ R(k−1)]⁻¹ (R(k))ᴴ R(k)
        P(k) = R(k) + P(k−1) β(k−1)
    end for
end procedure

[Diagram: the block updates drawn as m × n matrices combined via n × n matrices α and β]
REDUCED ITERATION COUNT
HISQ, 32³×8, Gaussian random source
[Plot: residual (10⁰ down to 10⁻⁹) vs. iterations (500 to 4500) for 1, 2, 4, 8, 12, and 16 rhs; the iteration count drops as more rhs share the Krylov space]
BLOCKCGRQ
Prevent numerical instabilities through orthogonalization.

A ∈ C^{L×L}; R, B, X, Q ∈ C^{L×N}; C, S ∈ C^{N×N}
procedure BlockCGrQ(X, B)
    R = B − A X
    Q C = R                    ▷ QR decomposition
    S = I_{N×N}
    P = 0
    while not converged do
        P = Q + P S†
        α = (P† A P)⁻¹
        X = X + P α C
        Q S = Q − A P α        ▷ QR decomposition
        C = S C
    end while
end procedure
ORTHOGONALIZATION
The simple approach, Gram-Schmidt or modified Gram-Schmidt, becomes prohibitively expensive. Thin QR instead: form the Gram matrix (dot products of vectors of length n), Cholesky-decompose this small matrix, and apply the factor to the vectors via axpy operations (output, input). This relies on BLAS operations and reductions; the Cholesky decomposition of the small matrix is done on the CPU.
THIN QR
B = Rᴴ R        ▷ Gram matrix, m × m
Sᴴ S = B        ▷ Cholesky decomposition, m × m
Q = R S⁻¹
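A minimal sketch of this thin-QR step with cuBLAS, assuming fp32 complex vectors stored column-major as an L × m matrix on the device. The function name thin_qr, the in-place overwrite of R with Q, and the hand-rolled host Cholesky are illustrative choices, not QUDA's implementation:

```
// Thin QR via Gram matrix + Cholesky: B = R^H R, S^H S = B, Q = R S^{-1}.
#include <cublas_v2.h>
#include <cuComplex.h>
#include <cmath>
#include <vector>

void thin_qr(cublasHandle_t h, int L, int m,
             cuComplex* d_R /* in: R, out: Q */, cuComplex* d_B /* m*m work */) {
    const cuComplex one = make_cuComplex(1.f, 0.f), zero = make_cuComplex(0.f, 0.f);
    // Gram matrix B = R^H R (m x m result from the L x m block of vectors)
    cublasCgemm(h, CUBLAS_OP_C, CUBLAS_OP_N, m, m, L,
                &one, d_R, L, d_R, L, &zero, d_B, m);
    // Small m x m Cholesky on the CPU: B = S^H S, S upper triangular
    std::vector<cuComplex> B(m * m);
    cublasGetMatrix(m, m, sizeof(cuComplex), d_B, m, B.data(), m);
    for (int j = 0; j < m; ++j) {                    // column-major, in place
        float d = cuCrealf(B[j + j * m]);
        for (int k = 0; k < j; ++k) {
            cuComplex s = B[k + j * m];
            d -= cuCrealf(cuCmulf(cuConjf(s), s));
        }
        d = std::sqrt(d);
        B[j + j * m] = make_cuComplex(d, 0.f);
        for (int i = j + 1; i < m; ++i) {            // S_{j,i} for i > j
            cuComplex s = B[j + i * m];
            for (int k = 0; k < j; ++k)
                s = cuCsubf(s, cuCmulf(cuConjf(B[k + j * m]), B[k + i * m]));
            B[j + i * m] = make_cuComplex(cuCrealf(s) / d, cuCimagf(s) / d);
        }
    }
    cublasSetMatrix(m, m, sizeof(cuComplex), B.data(), m, d_B, m);
    // Q = R S^{-1}: right-side triangular solve, overwriting R with Q
    cublasCtrsm(h, CUBLAS_SIDE_RIGHT, CUBLAS_FILL_MODE_UPPER, CUBLAS_OP_N,
                CUBLAS_DIAG_NON_UNIT, L, m, &one, d_B, m, d_R, L);
}
```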
INCREASED COST PER ITERATION
Applying the Dslash to m rhs naively scales linearly: constant time per rhs. The reductions do not: instead of a dot product between two vectors, we now need to evaluate m × m dot products (used for the orthogonalization), which scales quadratically with m, i.e. the cost per rhs grows linearly. Likewise, caxpy now couples m input vectors to m output vectors via an m × m matrix α: again quadratic in m, linear per rhs.
[Chart: per-iteration runtime broken down into Dslash, BLAS, and reduction]
COST PER ITERATION
[Plot: relative cost per iteration for one rhs (normalized to the naive 1-rhs case, 0.33 to 3) vs. number of rhs (1 to 16) for the naive implementation]
DSLASH FOR MULTIPLE RHS
QCD PERFORMANCE LIMITERS
QCD is memory-bandwidth bound; the Dslash arithmetic intensity for HISQ is ~0.7-0.8 flop/byte. One remedy: exploit the SU(3) symmetry and reconstruct the gauge field from 8 or 12 floats instead of 18. Another: the gauge field (A) is constant across the rhs, so reuse it for multiple rhs.
[Plot: arithmetic intensity (0.7 to 2.8 flop/byte) vs. number of rhs (1 to 16) for 18-, 13-, and 9-float link storage]
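As a concrete illustration of the reconstruction idea: for a unitary link with det U = 1, the third row is the complex-conjugated cross product of the first two, so only 12 floats need to be stored. The sketch below is a generic SU(3) helper, not QUDA's code; the 13- and 9-float variants in the plot store an extra element because HISQ's smeared long links are not strictly unitary.

```
// 12-float SU(3) reconstruction: keep the first two rows of the link and
// rebuild the third as the conjugated cross product (valid for det U = 1).
#include <cuComplex.h>

__device__ void reconstruct_row3(const cuComplex u[2][3], cuComplex row3[3]) {
    for (int k = 0; k < 3; ++k) {
        int a = (k + 1) % 3, b = (k + 2) % 3;        // cyclic cross-product indices
        cuComplex cross = cuCsubf(cuCmulf(u[0][a], u[1][b]),
                                  cuCmulf(u[0][b], u[1][a]));
        row3[k] = cuConjf(cross);                    // (row1 x row2)* = row3
    }
}
```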
MULTI-SRC DSLASH
Volume 24⁴, HISQ
[Plot: GFLOP/s (up to ~4000) vs. number of rhs (1 to 16) in double, single, and half precision, each with its roofline; the roofline model assumes 100% cache reuse in the x-direction and 50% in the y-direction]
WHY DON'T WE SEE THE EXPECTED SCALING?
The 1-rhs Dslash agrees with the roofline: there, device memory bandwidth is the performance limiter. With more rhs the measured performance falls increasingly short of the roofline prediction.
CACHE BASED MULTI RHS DSLASH
Check the extreme case with 16 rhs: the kernel issues a high number of memory operations, and the unified cache is limiting performance.
REGISTER OPTIMIZATION
Hybrid implementation
Cache: use the y-dimension of the CUDA blocks for the rhs, so one lattice site is scheduled on the same SM for all rhs and each thread processes one rhs on one lattice site. This is necessary for cache reuse of the gauge field.
Register + cache: again use the y-dimension of the CUDA blocks for the rhs, but let each thread process multiple rhs on one lattice site, reusing the gauge field from registers. This reduces the cache pressure. A sketch of both variants follows below.
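A toy sketch of the two scheduling strategies, reduced to applying a single link matrix per site; the kernel names, data layout, and launch geometry are illustrative assumptions, not QUDA's Dslash:

```
#include <cuComplex.h>

// Cache-reuse variant: threadIdx.y indexes the rhs, so all rhs for a site
// land on the same SM and the repeated link loads hit in cache.
__global__ void apply_link_cache(const cuComplex* __restrict__ U,   // nSites * 9 links
                                 const cuComplex* __restrict__ in,  // nRhs * nSites * 3
                                 cuComplex* __restrict__ out, int nSites, int nRhs) {
    int site = blockIdx.x * blockDim.x + threadIdx.x;
    int rhs  = blockIdx.y * blockDim.y + threadIdx.y;   // one rhs per thread
    if (site >= nSites || rhs >= nRhs) return;
    const cuComplex* u = U + 9 * site;                  // same 72 B re-read by every rhs
    const cuComplex* v = in  + 3 * (rhs * nSites + site);
    cuComplex*       w = out + 3 * (rhs * nSites + site);
    for (int i = 0; i < 3; ++i) {
        cuComplex s = make_cuComplex(0.f, 0.f);
        for (int j = 0; j < 3; ++j) s = cuCaddf(s, cuCmulf(u[3 * i + j], v[j]));
        w[i] = s;
    }
}

// Register-reuse variant: each thread handles NR rhs for its site, so the
// link is loaded into registers once and reused NR times.
template <int NR>
__global__ void apply_link_reg(const cuComplex* __restrict__ U,
                               const cuComplex* __restrict__ in,
                               cuComplex* __restrict__ out, int nSites, int nRhs) {
    int site = blockIdx.x * blockDim.x + threadIdx.x;
    int rhs0 = (blockIdx.y * blockDim.y + threadIdx.y) * NR;  // first of NR rhs
    if (site >= nSites) return;
    cuComplex u[9];
    for (int k = 0; k < 9; ++k) u[k] = U[9 * site + k];       // link held in registers
    for (int r = 0; r < NR && rhs0 + r < nRhs; ++r) {
        const cuComplex* v = in  + 3 * ((rhs0 + r) * nSites + site);
        cuComplex*       w = out + 3 * ((rhs0 + r) * nSites + site);
        for (int i = 0; i < 3; ++i) {
            cuComplex s = make_cuComplex(0.f, 0.f);
            for (int j = 0; j < 3; ++j) s = cuCaddf(s, cuCmulf(u[3 * i + j], v[j]));
            w[i] = s;
        }
    }
}
```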
REGISTER REUSE DSLASH
Check the extreme case with 16 rhs: the kernel issues a medium number of memory operations, and the unified cache is no longer limiting performance.
MULTI-SRC DSLASH ON PASCAL
Volume 24⁴, HISQ, tuned gauge reconstruct
[Plot: GFLOP/s (up to ~4000) vs. number of rhs (1 to 16) for double, single, and half precision, each in the cache-based and register-reuse ("reg") variants, with the corresponding rooflines]
QUDA AUTOTUNER
runtime tuning of launch parameters and more
How do we decide how to split the rhs between texture-cache and register reuse? E.g., for 16 rhs the possible combinations are (1,16), (2,8), (4,4), (8,2), and (16,1). QUDA already uses an autotuner at runtime to determine the optimal launch parameters: tuning runs on the first call of each kernel for a given set of parameters such as the volume, and the results are cached on disk for reuse in subsequent calculations. The same mechanism can also be used for auxiliary parameters, here the number of rhs kept in registers.
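The mechanism boils down to timing every candidate once and remembering the winner. A minimal sketch of that idea; QUDA's actual autotuner is its Tunable class hierarchy, and the harness below, including time_launch and the candidate list, is purely illustrative:

```
#include <cuda_runtime.h>
#include <cstdio>
#include <functional>
#include <vector>

// Average kernel time over a few repetitions using CUDA events.
float time_launch(const std::function<void()>& launch, int reps = 10) {
    cudaEvent_t t0, t1;
    cudaEventCreate(&t0); cudaEventCreate(&t1);
    launch();                               // warm-up
    cudaEventRecord(t0);
    for (int i = 0; i < reps; ++i) launch();
    cudaEventRecord(t1);
    cudaEventSynchronize(t1);
    float ms = 0.f;
    cudaEventElapsedTime(&ms, t0, t1);
    cudaEventDestroy(t0); cudaEventDestroy(t1);
    return ms / reps;
}

// candidates: one launcher per (registers, cache) split, e.g. built from
// apply_link_reg<NR> above for NR in {1, 2, 4, 8, 16}.
int autotune(const std::vector<std::function<void()>>& candidates) {
    int best = 0; float best_ms = 1e30f;
    for (int i = 0; i < (int)candidates.size(); ++i) {
        float ms = time_launch(candidates[i]);
        if (ms < best_ms) { best_ms = ms; best = i; }
    }
    printf("best variant: %d (%.3f ms)\n", best, best_ms);
    return best;   // QUDA would persist this choice in its on-disk tune cache
}
```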
COST PER ITERATION
[Plot: relative cost per iteration for one rhs (normalized to the naive 1-rhs case, 0.33 to 3) vs. number of rhs (1 to 16): naive vs. multi-rhs Dslash]
OPTIMIZING LINEAR ALGEBRA
EXPLOIT GPU ARCHITECTURE
to overcome the quadratic scaling
CUDA supports two-dimensional grids of thread blocks, which makes it easy to exploit locality through the texture cache / shared memory. The blocked caxpy computes

y_i = Σ_j a_ij x_j + y_i

i.e. for one site, y_0 = a_00 x_0 + a_01 x_1 + ..., y_1 = a_10 x_0 + a_11 x_1 + ..., and so on: every input vector x_j contributes to every output vector y_i, so tiling the (i, j) pairs over a 2D grid of blocks turns the quadratic number of vector reads into cached reuse. Splitting the work into smaller units also aligns better with GP100's new datapath configuration. A kernel sketch follows below.
[Diagram: 4 × 4 tiling of the (output, input) vector pairs across CUDA blocks]
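A toy version of such a blocked caxpy kernel; here the outputs are spread over blockIdx.y, so blocks working on the same element range re-read the same x_j elements through the cache. The layout and names are illustrative, not QUDA's multi-BLAS framework, which also tiles the input direction as in the diagram:

```
#include <cuComplex.h>

// x: n input vectors of length len; y: m output vectors; a: m x n (row-major).
// y_i += sum_j a_ij * x_j, one output vector per blockIdx.y.
__global__ void multi_caxpy(const cuComplex* __restrict__ a, int m, int n,
                            const cuComplex* __restrict__ x,
                            cuComplex* __restrict__ y, int len) {
    int e = blockIdx.x * blockDim.x + threadIdx.x;   // element within a vector
    int i = blockIdx.y;                              // output vector index
    if (e >= len || i >= m) return;
    cuComplex acc = y[i * len + e];
    for (int j = 0; j < n; ++j)                      // x_j elements at offset e are
        acc = cuCaddf(acc,                           // shared across outputs via cache
                      cuCmulf(a[i * n + j], x[j * len + e]));
    y[i * len + e] = acc;
}
```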
MULTI-BLAS
caxpy
[Plots: time vs. block size (1 to 16) for the blocked caxpy against linear- and quadratic-scaling references; achieved GFLOPS (up to ~2000) and effective GB/s (up to ~2200) vs. block size]
MULTI-REDUCTION
cdot
[Plots: time vs. block size (1 to 16) for the blocked dot products against linear- and quadratic-scaling references; achieved GFLOPS (up to ~1800) and effective GB/s (up to ~4000) vs. block size]
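The same blocking idea applies to the reductions: load M elements of x and M elements of y once and form all M × M products before reducing, so 2M loads feed M² flops. A toy sketch; the fused kernel below, its layout, and the crude atomic combine are illustrative only, not QUDA's multi-reduction kernels:

```
#include <cuComplex.h>

// c_ij = x_i^H . y_j for an M x M tile in one pass; c must be zero-initialized.
template <int M>
__global__ void multi_cdot(const cuComplex* __restrict__ x,
                           const cuComplex* __restrict__ y,
                           float2* __restrict__ c /* M*M results */, int len) {
    cuComplex acc[M][M] = {};                          // per-thread accumulators
    for (int e = blockIdx.x * blockDim.x + threadIdx.x; e < len;
         e += gridDim.x * blockDim.x) {
        cuComplex xv[M], yv[M];
        for (int i = 0; i < M; ++i) xv[i] = x[i * len + e];   // 2M loads ...
        for (int j = 0; j < M; ++j) yv[j] = y[j * len + e];
        for (int i = 0; i < M; ++i)                           // ... feed M*M products
            for (int j = 0; j < M; ++j)
                acc[i][j] = cuCaddf(acc[i][j], cuCmulf(cuConjf(xv[i]), yv[j]));
    }
    for (int i = 0; i < M; ++i)                        // crude combine; a real kernel
        for (int j = 0; j < M; ++j) {                  // would reduce in shared memory
            atomicAdd(&c[i * M + j].x, cuCrealf(acc[i][j]));
            atomicAdd(&c[i * M + j].y, cuCimagf(acc[i][j]));
        }
}
```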
COST PER ITERATION
[Plot: relative cost per iteration for one rhs (normalized to the naive 1-rhs case, 0.33 to 3) vs. number of rhs (1 to 16): naive vs. multi-rhs Dslash vs. full multi (Dslash + BLAS + reductions)]
BLOCKCGRQ IN THE WILD
ACTUAL PHYSICS PROJECT
g-2 project of the MILC collaboration
The calculations inherently need to solve multiples of 8 rhs. Using a two-step solution process (sloppy solve + refinement) yields a 7x speedup for production runs.
[Plot: speedup over sequential CG (up to 7.0) vs. number of rhs (1, 8, 16, 24, 32, 48, 64), double precision]
SUMMARY
BLOCKCGRQ
more parallelism and lots of locality to exploit
Reuse the gauge field in the Dslash; reuse the vectors in the BLAS operations and reductions. This avoids the quadratic scaling in the linear algebra and orthogonalization, creates more parallelism to saturate wider architectures, and squeezes 4x the number of FLOPS out of a P100.
[Plots: blocked-BLAS GFLOPS and GB/s vs. block size; Dslash GFLOPS (up to ~900) vs. number of rhs for naive, multi-rhs Dslash, and full multi]
TIME TO SOLUTION
Combined effect of reduced iteration count and cost per iteration
The cost per iteration improves by about a factor of 3, and the additional (free) flops in the BLAS and reduction kernels drive the reduced iteration count. How much the iteration count drops depends on the rhs, the target residual, and the condition of the matrix. It is an immediate drop-in for a CG solver, with no overhead costs.
[Plots: residual vs. iterations for 1 to 16 rhs; speedup over sequential CG (up to 7.0) vs. number of rhs (1 to 64), double precision]