SLIDE 1 Raising the Arithmetic Intensity of Krylov solvers
- B. Reps(1), P. Ghysels(2), S. Donfack(4), O. Schenk(4), W. Vanroose(1,3)
(1) Applied Mathematics, University of Antwerp, Belgium  (2) Future Technology Group, Berkeley Lab, USA  (3) Intel ExaScience Lab, Belgium  (4) USI, Lugano, Switzerland
www.exa2ct.eu
SLIDE 2
Exa2ct EU project
SLIDE 3
GMRES, classical Gram-Schmidt
1:  r0 ← b − A x0
2:  v0 ← r0 / ||r0||2
3:  for i = 0, …, m−1 do
4:      w ← A vi
5:      for j = 0, …, i do
6:          hj,i ← ⟨w, vj⟩
7:      end for
8:      ṽi+1 ← w − Σ(j=0..i) hj,i vj
9:      hi+1,i ← ||ṽi+1||2
10:     vi+1 ← ṽi+1 / hi+1,i
11:     {apply Givens rotations to h:,i}
12: end for
13: ym ← argmin ||Hm+1,m ym − ||r0||2 e1||2
14: x ← x0 + Vm ym
Sparse matrix-vector product (w ← A vi)
◮ Only communication with neighbors
◮ Good scaling but poor bandwidth usage

Dot-product (hj,i ← ⟨w, vj⟩)
◮ Global communication
◮ Scales as log(P) and suffers from jitter

Scalar-vector multiplication, vector-vector addition (axpy)
◮ No communication
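As a minimal sketch (hypothetical CSR layout and function names, not tied to any particular solver library), the three kernel classes above in C:

    /* SpMV, CSR format: w ← A v. Needs only neighbor communication in parallel. */
    void spmv(int n, const int *rowptr, const int *col, const double *val,
              const double *v, double *w)
    {
        for (int i = 0; i < n; i++) {
            double s = 0.0;
            for (int k = rowptr[i]; k < rowptr[i+1]; k++)
                s += val[k] * v[col[k]];
            w[i] = s;
        }
    }

    /* Dot-product: requires a global reduction across all processes. */
    double dot(int n, const double *x, const double *y)
    {
        double s = 0.0;
        for (int i = 0; i < n; i++) s += x[i] * y[i];
        return s;
    }

    /* axpy: y ← y + a x. Purely local, no communication. */
    void axpy(int n, double a, const double *x, double *y)
    {
        for (int i = 0; i < n; i++) y[i] += a * x[i];
    }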
SLIDE 4 Performance of a kernel of dependent SpMVs
◮ SIMD: SSE → AVX → 512-bit vectors (mm512) on Xeon Phi.
◮ Similar with NEON on ARM.
Bandwidth is the bottleneck and will remain so even with 3D-stacked memory.
Figure: roofline plot of attainable GFLOP/sec versus arithmetic intensity (flop/byte), showing the peak memory bandwidth and peak floating-point (with vectorization) ceilings.
- S. Williams, A. Waterman and D. Patterson, Roofline: an insightful visual performance model for multicore architectures (2008).
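A small illustration of the roofline bound, with hypothetical machine numbers (300 GFlop/s peak, 50 GB/s stream bandwidth):

    #include <math.h>
    #include <stdio.h>

    /* Roofline: attainable GFlop/s = min(peak compute, AI × peak bandwidth). */
    double attainable(double ai_flop_per_byte, double peak_gflops, double bw_gbs)
    {
        return fmin(peak_gflops, ai_flop_per_byte * bw_gbs);
    }

    int main(void)
    {
        /* hypothetical machine: 300 GFlop/s peak, 50 GB/s stream bandwidth */
        for (double ai = 0.125; ai <= 16.0; ai *= 2.0)
            printf("AI %-6.3f flop/byte -> %6.1f GFlop/s\n",
                   ai, attainable(ai, 300.0, 50.0));
        return 0;
    }

Below the ridge point (here 6 flop/byte) a kernel is bandwidth-bound; raising its arithmetic intensity moves it toward the compute peak.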
SLIDE 5 PATUS (Spatial Blocking)
Figure: (a) 7-point stencil, constant coefficients; (b) 25-point stencil, variable coefficients. Performance (GFlop/s) and memory bandwidth (GBytes/s) versus number of threads on Intel Ivy Bridge (Xeon E5-2660v2), gcc 4.8.2 versus PATUS 1.1.
◮ PATUS gives better performance with lower bandwidth use.
◮ Good results, but bandwidth saturates around 10 cores.
Difficult to do better with spatial blocking alone.
- M. Christen, O. Schenk, and H. Burkhart. PATUS: A code generation and autotuning framework for parallel iterative stencil computations on modern microarchitectures. IPDPS'11.
- M. Christen, O. Schenk, and Y. Cui. PATUS for convenient high-performance stencils: evaluation in earthquake simulations.
SLIDE 6 Arithmetic intensity of k dependent SpMVs
                   1 SpMV       k SpMVs        k SpMVs in place
    flops          2 nnz        2k nnz         2k nnz
    words moved    nnz + 2n     k nnz + 2kn    nnz + 2n
    q (flops/word) ≈ 2          ≈ 2            ≈ 2k
- J. Demmel, CS 294-76 on communication-avoiding algorithms.
- M. Hoemmen, Communication-avoiding Krylov subspace methods, PhD thesis (2010).
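A small helper following the table (hypothetical names; n unknowns, nnz nonzeros, 8 bytes per word):

    /* Arithmetic intensity q = flops / words moved for k dependent SpMVs. */
    double q_separate(long nnz, long n, int k)   /* k separate sweeps      */
    {
        return (2.0 * k * nnz) / ((double)k * nnz + 2.0 * k * n);
    }

    double q_in_place(long nnz, long n, int k)   /* k sweeps, data cached  */
    {
        return (2.0 * k * nnz) / ((double)nnz + 2.0 * n);
    }
    /* For a 5-point stencil (nnz ≈ 5n): q_separate ≈ 10/7 independent of k,
     * while q_in_place ≈ k · 10/7; divide by 8 bytes/word to get flop/byte
     * for the roofline. */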
SLIDE 7 Increasing the arithmetic intensity
◮ Divide the domain into tiles that fit in the cache.
◮ The tile's grid data is loaded into cache once and reused k times.
◮ Redundant work at the tile edges.
Figure: dependency of step s+1 on step s for a 5-point stencil: u(s+1)_{i,j} is computed from u(s)_{i,j}, u(s)_{i±1,j}, u(s)_{i,j−1} and u(s)_{i,j+1}; the overlap region B at the tile boundary grows with each of the steps 1, 2, 3, …, s.
- P. Ghysels, P. Klosiewicz, and W. Vanroose. Improving the arithmetic intensity of multigrid with the help of polynomial smoothers. Num. Lin. Alg. Appl. 19 (2012): 253-267.
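A minimal sketch of the tiling idea, assuming a 5-point stencil and hypothetical tile sizes; the halo of width K is recomputed redundantly, so the K dependent steps touch main memory only once per tile:

    /* K steps of a 5-point stencil on one tile, with a redundant halo of
     * width K so no neighbor-tile data is needed during the K steps. */
    #define TILE 64           /* interior tile size, chosen to fit in cache */
    #define K    4            /* number of dependent stencil applications   */
    #define W    (TILE + 2*K) /* tile + redundant halo of width K per side  */

    /* One Jacobi-like 5-point update on cells [lo, hi) of a W×W buffer. */
    static void step(double dst[W][W], double src[W][W], int lo, int hi)
    {
        for (int i = lo; i < hi; i++)
            for (int j = lo; j < hi; j++)
                dst[i][j] = 0.25 * (src[i-1][j] + src[i+1][j] +
                                    src[i][j-1] + src[i][j+1]);
    }

    /* After step s the valid region shrinks by one cell per side, so the
     * halo work is redundant but avoids all communication within K steps. */
    void tile_k_steps(double buf[2][W][W])
    {
        for (int s = 0; s < K; s++)
            step(buf[(s+1)&1], buf[s&1], 1 + s, W - 1 - s);
    }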
SLIDE 8 Use the help of PLUTO
Pluto is an automatic parallelizer and locality optimizer for multicores that performs a source-to-source code transformation.
Figure: time-tiled iteration space (space x versus time t), with tiles numbered in execution order.
◮ Automatic loop parallelization and vectorization.
◮ Locality optimizations based on the polyhedral model.
◮ Important trade-off between parallelism and temporal locality.
- U. Bondhugula, M. Baskaran, S. Krishnamoorthy, J. Ramanujam, A. Rountev, and P. Sadayappan, Automatic Transformations for Communication-Minimized Parallelization and Locality Optimization in the Polyhedral Model, Int. Conf. Compiler Construction (ETAPS CC), Apr 2008, Budapest, Hungary.
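As an illustration, a PolyBench-style Jacobi sweep of the kind Pluto time-tiles; the #pragma scop markers and the polycc flags are Pluto's own interface, while the function name and sizes are hypothetical:

    /* Hypothetical 1D Jacobi sweep; Pluto's polycc driver tiles the region
     * between the scop pragmas, e.g.:  polycc --tile --parallel jacobi1d.c */
    void jacobi1d(int T, int N, double A[N], double B[N])
    {
        int t, i;
    #pragma scop
        for (t = 0; t < T; t++) {
            for (i = 1; i < N - 1; i++)
                B[i] = 0.33333 * (A[i-1] + A[i] + A[i+1]);
            for (i = 1; i < N - 1; i++)
                A[i] = B[i];
        }
    #pragma endscop
    }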
SLIDE 9 PLUTO (Temporal blocking)
Increase the arithmetic intensity based on the underlying stencils.
Figure: time-tiled iteration space (space x versus time t), with tiles numbered in execution order.
◮ Divide the domain into tiles that fit in the cache.
◮ Reuse the data in the tiles k times.
◮ Use the compiler's auto-vectorization capabilities within each tile.
◮ Increased parallelism and temporal data reuse.
◮ Inner tiles are suitable for external spatial-blocking tools, e.g. PATUS.
◮ The tile representation is suitable for the most efficient scheduling strategies, e.g. TBB.
SLIDE 10 Increasing the arithmetic intensity with a stencil compiler (Pluto)
Figure: roofline plots of GFlop/s versus arithmetic intensity (flop/byte) for naive and Pluto variants, with and without vectorization, against the peak and fitted rooflines.
Dual-socket Intel Sandy Bridge (16 × 2 threads) and Intel Xeon Phi (61 × 4 threads).
SLIDE 11
Stencil Compilers
◮ Pochoir (MIT): http://people.csail.mit.edu/yuantang/
◮ Patus (USI): https://code.google.com/p/patus/
◮ Pluto: http://pluto-compiler.sourceforge.net
◮ Ghost (Erlangen): http://tiny.cc/ghost
◮ HHRT (Tokyo)
◮ Modesto
SLIDE 12
Krylov methods
◮ Classical Krylov
  Kk(A, v) = span{v, Av, A²v, …, A^(k−1) v}
  The residual and error can then be written as
  r^(k) = Pk(A) r^(0)   (1)
SLIDE 13
Krylov methods
◮ Classical Krylov
  Kk(A, v) = span{v, Av, A²v, …, A^(k−1) v}
  The residual and error can then be written as
  r^(k) = Pk(A) r^(0)   (1)
◮ Polynomial Preconditioning
  Pm(A) x = q(A) A x = q(A) b
  Kk(Pm(A), v) = span{v, Pm(A)v, Pm(A)²v, …, Pm(A)^(k−1) v}
  where Pm(A) = (A − σ1)(A − σ2) · … · (A − σm) is a fixed low-order polynomial
  r^(k) = Qk(Pm(A)) r^(0)   (2)
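A minimal sketch (hypothetical spmv interface) of applying Pm(A)v as m shifted matvecs; each factor needs only neighbor communication, never a global reduction:

    /* y ← (A − σ[m−1] I) ··· (A − σ[0] I) v, using only SpMVs and local
     * vector updates; spmv(u, w) is assumed to compute w ← A u. */
    void poly_apply(int n, int m, const double *sigma,
                    void (*spmv)(const double *, double *),
                    const double *v, double *y, double *w)
    {
        for (int i = 0; i < n; i++) y[i] = v[i];
        for (int j = 0; j < m; j++) {
            spmv(y, w);                            /* w ← A y         */
            for (int i = 0; i < n; i++)
                y[i] = w[i] - sigma[j] * y[i];     /* y ← (A − σj) y  */
        }
    }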
SLIDE 14 Incomplete list of the literature on polynomial preconditioning
- Y. Saad, Iterative methods for sparse linear systems, Chapter 12, SIAM (2003).
- D. O'Leary, Yet another polynomial preconditioner for the conjugate gradient algorithm, Lin. Alg. Appl. (1991), p. 377.
- A. van der Ploeg, Preconditioning for sparse matrices with applications (1994).
- M. van Gijzen, A polynomial preconditioner for the GMRES algorithm, J. Comp. Appl. Math. 59 (1995): 91-107.
- A. Basermann, B. Reichel, C. Schelthoff, Preconditioned CG methods for sparse matrices on massively parallel machines, Parallel Computing 23 (1997), 381-398.
…
SLIDE 15
Convergence of CG with different short polynomials
SLIDE 16
Time to solution is reduced
SLIDE 17 Recursive calculation of wk+1 = pk+1(A)v
1: σ1 = θ/δ
2: ρ0 = 1/σ1
3: w0 = 0, w1 = (1/θ) A v
4: Δw1 = w1 − w0
5: for k = 1, 2, … do
6:     ρk = 1/(2σ1 − ρk−1)
7:     Δwk+1 = ρk ((2/δ) A (v − wk) + ρk−1 Δwk)
8:     wk+1 = wk + Δwk+1
9: end for
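A minimal C sketch of this recursion (hypothetical spmv callback and buffer handling); each step costs one matvec plus axpy-type updates, with no dot-products:

    /* w ← p_ksteps(A) v via the recursion above; spmv(x, y) computes
     * y ← A x, and w, dw, r, ar are caller-supplied length-n buffers. */
    void poly_matvec(int n, int ksteps, double theta, double delta,
                     void (*spmv)(const double *, double *),
                     const double *v, double *w, double *dw,
                     double *r, double *ar)
    {
        double sigma1 = theta / delta;
        double rho_prev = 1.0 / sigma1;

        spmv(v, ar);                       /* w1 = (1/θ) A v, Δw1 = w1 − w0 */
        for (int i = 0; i < n; i++)
            w[i] = dw[i] = ar[i] / theta;

        for (int k = 1; k < ksteps; k++) {
            double rho = 1.0 / (2.0 * sigma1 - rho_prev);
            for (int i = 0; i < n; i++) r[i] = v[i] - w[i];
            spmv(r, ar);                   /* ar ← A (v − wk)               */
            for (int i = 0; i < n; i++) {
                dw[i] = rho * ((2.0 / delta) * ar[i] + rho_prev * dw[i]);
                w[i] += dw[i];             /* wk+1 = wk + Δwk+1             */
            }
            rho_prev = rho;
        }
    }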
SLIDE 18
Mangled with Pluto
SLIDE 19 Average cost for each matvec
Figure: averaged time per matvec versus polynomial degree, split into total, matvec, dot-product, and axpy contributions.
2D 2048×2048 finite-difference Poisson problem on an i7-2860QM
SLIDE 20
Patus and Pluto
SLIDE 21
Arithmetic Intensity of Multigrid
MG = (I − I^h_2h (…) I^2h_h A) S^ν1    (3)
where
◮ S is the smoother, for example ω-Jacobi, where S = I − ωD⁻¹A,
◮ I^h_2h is the interpolation and I^2h_h the restriction.
These are all low-arithmetic-intensity kernels.
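A minimal sketch of one ω-Jacobi sweep for the 2D 5-point Poisson stencil (hypothetical names and layout), illustrating why the smoother is bandwidth-bound:

    /* One ω-Jacobi sweep u_new = u + ω D⁻¹ (f − A u) for the 2D 5-point
     * Poisson stencil (D = 4/h² on the diagonal). Three arrays are
     * streamed per sweep for only a handful of flops per point. */
    void omega_jacobi(int n, double omega, double h2,
                      const double *u, const double *f, double *unew)
    {
        for (int i = 1; i < n - 1; i++)
            for (int j = 1; j < n - 1; j++) {
                double Au = (4.0 * u[i*n+j] - u[(i-1)*n+j] - u[(i+1)*n+j]
                             - u[i*n+j-1] - u[i*n+j+1]) / h2;
                unew[i*n+j] = u[i*n+j] + omega * (h2 / 4.0) * (f[i*n+j] - Au);
            }
    }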
SLIDE 22
Multigrid Cost Model
Increasing the number of smoothing steps per level
Figure: convergence rate versus smoothing steps (left); number of V-cycles versus smoothing steps (right).
# MG iterations = ⌈log(10⁻⁸) / log(ρ(V-cycle))⌉, where ρ(V-cycle) is the spectral radius of the V-cycle, rounded up to the next integer.
SLIDE 23
Work Unit Cost model
◮ The work-unit cost model ignores memory bandwidth:
  1 WU = smoother cost = O(n)
◮ total cost = (9ν + 19)(1 + 1/4 + 1/16 + …) log(tol)/log(ρ) WU ≤ (9ν + 19)(4/3) log(tol)/log(ρ) WU
Figure: number of V-cycles (left) and work units (right) versus smoothing steps.
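A minimal sketch of this work-unit model; the V-cycle spectral radius ρ (which depends on ν) and the tolerance are assumed given, e.g. measured:

    #include <math.h>

    /* Work units to reach tol with ν smoothing steps per level; the 4/3
     * bounds the geometric coarse-grid series 1 + 1/4 + 1/16 + … in 2D. */
    double work_units(int nu, double rho, double tol)
    {
        double cycles = ceil(log(tol) / log(rho));   /* # V-cycles */
        return (9.0 * nu + 19.0) * (4.0 / 3.0) * cycles;
    }

Minimizing this over ν reproduces the naive model's optimum: since more smoothing only adds O(n) work while reducing cycles, small ν wins when bandwidth is ignored.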
SLIDE 24 Attainable GFlop/sec and average cost in sec/GFlop
Figure: attainable average performance (GFlop/s, left) and average cost (s/GFlop, right) versus number of smoothing steps, under the peak-BW/peak-flops, no-SIMD, and stream-BW rooflines.
cost(ν1 × ω-Jac) = flops(ν1 × ω-Jac) / roof(q(ν1 × ω-Jac)) ≈ ν1 flops(ω-Jac) / roof(ν1 q1(ω-Jac))    (4)
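A minimal sketch of eq. (4)'s roofline-based cost; the machine parameters are hypothetical and q1 is the flop/byte intensity of a single sweep:

    #include <math.h>

    /* Attainable flop rate (roofline): peak_flops in GFlop/s, bw in GB/s. */
    double roof(double q, double bw, double peak_flops)
    {
        return fmin(peak_flops, q * bw);
    }

    /* Average cost (s/GFlop) of ν fused ω-Jacobi steps per eq. (4):
     * fusing ν sweeps raises the intensity roughly ν-fold. */
    double avg_cost(int nu, double q1, double bw, double peak_flops)
    {
        return 1.0 / roof(nu * q1, bw, peak_flops);
    }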
SLIDE 25
Roofline-based vs naive cost model
Figure: work units versus smoothing steps for the roofline-based and naive cost models.
SLIDE 26 Timings for 2D 8190² (Sandy Bridge)
Figure: time (s) versus smoothing steps for the naive and Pluto implementations, compared with the naive and roofline model predictions.
- P. Ghysels and W. Vanroose, Modelling the performance of geometric multigrid on many-core computer architectures, Exascience.com techreport 09.2013.1, Sept 2013; to appear in SISC.
SLIDE 27
3D results 511³ (Sandy Bridge)
SLIDE 28
Current Krylov Solvers
(Diagram: user-provided matvec feeds the Krylov solver.)
◮ The user provides a matvec routine and selects the Krylov method at the command line.
◮ No opportunities to increase the arithmetic intensity.
SLIDE 29
Current Krylov Solvers
(Diagram: user-provided matvec feeds the Krylov solver.)
◮ The user provides a matvec routine and selects the Krylov method at the command line.
◮ No opportunities to increase the arithmetic intensity.
Future Krylov Solvers
◮ The user provides a stencil routine.
◮ A stencil compiler (Pochoir, Pluto, …) increases the arithmetic intensity.
(Diagram: user stencil → stencil compiler → Krylov solver.)
- W. Vanroose, P. Ghysels, D. Roose and K. Meerbergen, position paper at the DOE Exascale Mathematics workshop, 2013.
SLIDE 30 Conclusions
◮ Two communication bottlenecks:
  ◮ limited bandwidth in Av: no benefit from SIMD;
  ◮ latencies in ⟨u, v⟩: poor strong scaling.
SLIDE 31 Conclusions
◮ Two communication bottlenecks:
  ◮ limited bandwidth in Av: no benefit from SIMD;
  ◮ latencies in ⟨u, v⟩: poor strong scaling.
◮ Replace Av with Pm(A)v with fixed coefficients.
◮ Execute with a stencil compiler.
◮ Combine with multigrid.
◮ Opportunities in filters for eigensolvers.
◮ Many open questions for unstructured matrices.
SLIDE 32 Conclusions
◮ Two communication bottlenecks:
  ◮ limited bandwidth in Av: no benefit from SIMD;
  ◮ latencies in ⟨u, v⟩: poor strong scaling.
◮ Replace Av with Pm(A)v with fixed coefficients.
◮ Execute with a stencil compiler.
◮ Combine with multigrid.
◮ Opportunities in filters for eigensolvers.
◮ Many open questions for unstructured matrices.
◮ Can be combined with pipelined Krylov algorithms, where global reductions are hidden.
Acknowledgements:
◮ EU's Seventh Framework Programme (FP7/2007-2013) under grant agreement no. 610741. www.exa2ct.eu
◮ Intel and the Institute for the Promotion of Innovation through Science and Technology in Flanders (IWT). www.exascience.com