SLIDE 1

Raising the Arithmetic Intensity of Krylov solvers

1 Applied Mathematics, University of Antwerp, Belgium
2 Future Technology Group, Berkeley Lab, USA
3 Intel ExaScience Lab, Belgium
4 USI, Lugano, CH

  • B. Reps(1), P. Ghysels(2), S. Donfack(4), O. Schenk(4), W. Vanroose(1,3)

www.exa2ct.eu

SLIDE 2

Exa2ct EU project

SLIDE 3

GMRES, classical Gram-Schmidt

1:  r_0 ← b − A x_0
2:  v_0 ← r_0 / ||r_0||_2
3:  for i = 0, ..., m−1 do
4:    w ← A v_i
5:    for j = 0, ..., i do
6:      h_{j,i} ← ⟨w, v_j⟩
7:    end for
8:    ṽ_{i+1} ← w − Σ_{j=0}^{i} h_{j,i} v_j
9:    h_{i+1,i} ← ||ṽ_{i+1}||_2
10:   v_{i+1} ← ṽ_{i+1} / h_{i+1,i}
11:   {apply Givens rotations to h_{:,i}}
12: end for
13: y_m ← argmin_y ||H_{m+1,m} y − ||r_0||_2 e_1||_2
14: x ← x_0 + V_m y_m
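For reference, a minimal numpy sketch of this algorithm; it is an illustration, not the talk's code. The Givens rotations of line 11 are replaced by a dense least-squares solve at the end, which is equivalent for this purpose, and the names (gmres_cgs, A, b, x0, m) are illustrative.

    import numpy as np

    def gmres_cgs(A, b, x0, m):
        """One GMRES(m) cycle with classical Gram-Schmidt (sketch)."""
        r0 = b - A @ x0
        beta = np.linalg.norm(r0)
        n = len(b)
        V = np.zeros((n, m + 1))
        H = np.zeros((m + 1, m))
        V[:, 0] = r0 / beta
        for i in range(m):
            w = A @ V[:, i]                      # SpMV: neighbor communication only
            H[:i + 1, i] = V[:, :i + 1].T @ w    # dot products: global reductions
            w = w - V[:, :i + 1] @ H[:i + 1, i]  # axpys: no communication
            H[i + 1, i] = np.linalg.norm(w)
            V[:, i + 1] = w / H[i + 1, i]
        e1 = np.zeros(m + 1)
        e1[0] = beta
        y = np.linalg.lstsq(H, e1, rcond=None)[0]  # min ||H y - beta e1||_2
        return x0 + V[:, :m] @ y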

Kernels and their communication:

◮ Sparse matrix-vector product: only communication with neighbors; good scaling but poor bandwidth usage.

◮ Dot product: global communication; scales as log(P) and suffers from jitter.

◮ Scalar-vector multiplication and vector-vector addition: no communication.

SLIDE 4

Performance of a kernel of dependent SpMVs

◮ SIMD keeps widening: SSE → AVX → 512-bit (mm512) on Xeon Phi.
◮ Similar with NEON on ARM.

Bandwidth is the bottleneck, and it will remain so even with 3D-stacked memory.

[Roofline figure: attainable GFlop/s vs. arithmetic intensity (flop/byte), with the peak-memory-bandwidth slope and the peak floating-point ceiling (with vectorization).]

  • S. Williams, A. Waterman and D. Patterson (2008)
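The roofline bound itself is one line: attainable performance is the minimum of the compute peak and bandwidth times arithmetic intensity. A tiny sketch with placeholder machine numbers (not measurements from the talk):

    peak_gflops = 128.0   # illustrative compute peak, GFlop/s
    peak_bw = 50.0        # illustrative STREAM bandwidth, GB/s

    def roofline(ai):
        """Attainable GFlop/s at arithmetic intensity ai (flop/byte)."""
        return min(peak_gflops, peak_bw * ai)

    for ai in (0.25, 0.5, 1, 2, 4, 8):
        print(f"AI = {ai:5.2f} flop/byte -> {roofline(ai):6.1f} GFlop/s")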
SLIDE 5

PATUS (Spatial Blocking)

[Figure: performance (GFlop/s) and memory bandwidth (GB/s) vs. number of threads on Intel Ivy Bridge (Xeon E5-2660 v2), for gcc 4.8.2 and PATUS 1.1: a. 7-point stencil, constant coefficients; b. 25-point stencil, variable coefficients.]

◮ PATUS leads to better results and lower bandwidth use.
◮ Good results, but bandwidth saturates around 10 cores.

It is difficult to do better with spatial blocking alone.

  • M. Christen, O. Schenk, and H. Burkhart, PATUS: a code generation and autotuning framework for parallel iterative stencil computations on modern microarchitectures, IPDPS'11.
  • M. Christen, O. Schenk, and Y. Cui, PATUS for convenient high-performance stencils: evaluation in earthquake simulations.

SLIDE 6

Arithmetic intensity of k dependent SpMVs

                  1 SpMV        k SpMVs         k SpMVs in place
flops             2 nnz         2k nnz          2k nnz
words moved       nnz + 2n      k(nnz + 2n)     nnz + 2n
q (flops/word)    ≈ 2           ≈ 2             ≈ 2k
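A quick numeric reading of the table, with illustrative values of nnz, n and k (divide q by 8 bytes/word for flop/byte): fusing the k SpMVs in place loads the matrix once, so q grows by roughly a factor k.

    nnz, n, k = 5 * 10**6, 10**6, 8   # e.g. a 5-point stencil matrix

    q_one      = 2 * nnz / (nnz + 2 * n)            # 1 SpMV: q of order 2
    q_k        = 2 * k * nnz / (k * (nnz + 2 * n))  # k SpMVs: still order 2
    q_in_place = 2 * k * nnz / (nnz + 2 * n)        # fused: roughly k times larger
    print(q_one, q_k, q_in_place)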

  • J. Demmel, CS 294-76 on communication-avoiding algorithms.
  • M. Hoemmen, Communication-avoiding Krylov subspace methods (2010).
SLIDE 7

Increasing the arithmetic intensity

◮ Divide the domain into tiles that fit in the cache.
◮ The ground surface of a tile is loaded into cache and reused k times.
◮ Redundant work at the tile edges (see the sketch below).

[Figure: 5-point stencil update, u^{s+1}_{i,j} computed from u^s_{i,j}, u^s_{i±1,j} and u^s_{i,j±1}; a tile of width B is advanced through steps 1, 2, 3, ..., s in cache.]

  • P. Ghysels, P. Klosiewicz, and W. Vanroose, Improving the arithmetic intensity of multigrid with the help of polynomial smoothers, Numer. Linear Algebra Appl. 19 (2012): 253–267.
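A minimal 1D sketch of this overlapped tiling, assuming a periodic 3-point averaging stencil (both the stencil and the names are illustrative): each tile is loaded once with a halo of width k and advanced k steps in cache, recomputing some edge values redundantly.

    import numpy as np

    def k_steps_tiled(u, k, B):
        """k sweeps of v[i] = (u[i-1] + u[i] + u[i+1]) / 3, periodic, by tiles."""
        N = len(u)
        out = np.empty_like(u)
        for a in range(0, N, B):
            b = min(a + B, N)
            idx = np.arange(a - k, b + k) % N   # tile plus halo of width k
            t = u[idx]                          # loaded from memory once
            for s in range(k):                  # k sweeps, all in cache;
                t[1:-1] = (t[:-2] + t[1:-1] + t[2:]) / 3.0  # edge work is redundant
            out[a:b] = t[k:k + (b - a)]         # after k sweeps this part is valid
        return out

After sweep s only the entries at distance more than s from the cut edges of the copy are valid, which is exactly why the halo must have width k and why the overlap work is redundant.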

SLIDE 8

Use the help of PLUTO

Pluto is an automatic parallelizer and locality optimizer for multicores that performs a source-to-source code transformation.

[Figure: time-tiled iteration space over the x and t axes, with parallelogram tiles numbered in execution order.]

◮ Automatic loop parallelization and vectorization.

◮ Locality optimizations based on the polyhedral model.

◮ Important trade-off between parallelism and temporal locality.

  • U. Bondhugula, M. Baskaran, S. Krishnamoorthy, J. Ramanujam, A. Rountev, and P. Sadayappan, Automatic transformations for communication-minimized parallelization and locality optimization in the polyhedral model, Int. Conf. on Compiler Construction (ETAPS CC), April 2008, Budapest, Hungary.

SLIDE 9

PLUTO (Temporal blocking)

Increase the arithmetic intensity based on the underlying stencils.

[Figure: the same time-tiled iteration space as on the previous slide.]

◮ Divide the domain into tiles that fit in the cache.

◮ Reuse the data in the tiles k times.

◮ Use the compiler's auto-vectorization capabilities within each tile.

◮ Increased parallelism and temporal data reuse.

◮ Inner tiles are suitable for external spatial blocking tools, e.g. PATUS.

◮ The tile representation is suitable for efficient scheduling strategies, e.g. TBB.

SLIDE 10

Increasing the arithmetic intensity with a stencil compiler (Pluto)

[Roofline plots: GFlop/s vs. arithmetic intensity (flop/byte) for naive and Pluto code, each with and without vectorization, against the peak and fitted rooflines. Left: dual-socket Intel Sandy Bridge, 16 × 2 threads; right: Intel Xeon Phi, 61 × 4 threads.]

SLIDE 11

Stencil Compilers

◮ Pochoir (MIT): http://people.csail.mit.edu/yuantang/
◮ Patus (USI): https://code.google.com/p/patus/
◮ Pluto: http://pluto-compiler.sourceforge.net
◮ Ghost (Erlangen): http://tiny.cc/ghost
◮ HHRT (Tokyo)
◮ Modesto, ...

SLIDE 13

Krylov methods

◮ Classical Krylov:

K_k(A, v) = span{v, Av, A^2 v, ..., A^{k−1} v}

The residual and error can then be written as

r^{(k)} = P_k(A) r^{(0)}   (1)

◮ Polynomial preconditioning: solve P_m(A) x = q(A) A x = q(A) b, with

K_k(P_m(A), v) = span{v, P_m(A) v, P_m(A)^2 v, ..., P_m(A)^{k−1} v}

where P_m(A) = (A − σ_1)(A − σ_2) · ... · (A − σ_m) is a fixed low-order polynomial. The residual becomes

r^{(k)} = Q_k(P_m(A)) r^{(0)}   (2)
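Applying the fixed polynomial is just m shifted matvecs; a hedged numpy sketch (apply_Pm and the shift list sigma are illustrative names, and any scalar normalization of P_m is left out):

    import numpy as np

    def apply_Pm(A, sigma, v):
        """w = (A - sigma_1 I) ... (A - sigma_m I) v, up to normalization."""
        w = v.copy()
        for s in sigma:
            w = A @ w - s * w   # one SpMV plus one axpy per factor
        return w

A Krylov method then builds K_k(P_m(A), v) simply by calling apply_Pm wherever it would have called the matvec.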

SLIDE 14

Incomplete list of the literature on polynomial preconditioning

  • Y. Saad, Iterative Methods for Sparse Linear Systems, Chapter 12, SIAM (2003).
  • D. O'Leary, Yet another polynomial preconditioner for the conjugate gradient algorithm, Lin. Alg. Appl. (1991), p. 377.
  • A. van der Ploeg, Preconditioning for sparse matrices with applications (1994).
  • M. van Gijzen, A polynomial preconditioner for the GMRES algorithm, J. Comp. Appl. Math. 59 (1995): 91–107.
  • A. Basermann, B. Reichel, and C. Schelthoff, Preconditioned CG methods for sparse matrices on massively parallel machines, 23 (1997): 381–398.
  ...

SLIDE 15

Convergence of CG with different short polynomials

SLIDE 16

Time to solution is reduced

SLIDE 17

Recursive calculation of wk+1 = pk+1(A)v

1: σ_1 ← θ/δ
2: ρ_0 ← 1/σ_1
3: w_0 ← 0, w_1 ← (1/θ) A v
4: Δw_1 ← w_1 − w_0
5: for k = 1, 2, ... do
6:   ρ_k ← 1/(2σ_1 − ρ_{k−1})
7:   Δw_{k+1} ← ρ_k ( (2/δ) A (v − w_k) + ρ_{k−1} Δw_k )
8:   w_{k+1} ← w_k + Δw_{k+1}
9: end for
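A direct numpy transcription of this recurrence, as a sketch; θ and δ are assumed to be the center and half-width of an interval containing the spectrum of A (the usual Chebyshev parameterization; the slide only names θ and δ), and m is the polynomial degree.

    import numpy as np

    def cheb_poly_apply(A, v, theta, delta, m):
        """w_m = p_m(A) v via the three-term recurrence above (sketch)."""
        sigma1 = theta / delta
        rho_prev = 1.0 / sigma1
        w = (A @ v) / theta          # w_1; w_0 = 0
        dw = w.copy()                # delta_w_1 = w_1 - w_0
        for k in range(1, m):
            rho = 1.0 / (2.0 * sigma1 - rho_prev)
            dw = rho * ((2.0 / delta) * (A @ (v - w)) + rho_prev * dw)
            w = w + dw
            rho_prev = rho
        return w

Each step costs one SpMV plus a few axpys, so p_m(A)v is a chain of m dependent stencil applications, exactly the pattern that temporal tiling accelerates.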

SLIDE 18

Mangled with Pluto

SLIDE 19

Average cost for each matvec

[Figure: average time per matvec vs. order of the polynomial; curves for the total, the matvec, the dot products, and the axpys.]

2D 2024 × 2048 finite-difference Poisson problem on an i7-2860QM.

SLIDE 20

Patus and Pluto

SLIDE 21

Arithmetic intensity of multigrid

MG = (I − I^h_{2h} (...) I^{2h}_h A) S^{ν_1},   (3)

where

◮ S is the smoother, for example ω-Jacobi, with S = I − ω D^{−1} A,

◮ I^h_{2h} is the interpolation and I^{2h}_h the restriction.

These are all low-arithmetic-intensity operations.
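For concreteness, a minimal ω-Jacobi smoother sketch (dense numpy, illustrative names); note that ν steps of the smoother are ν dependent stencil applications, the same bandwidth-bound pattern as before.

    import numpy as np

    def jacobi_smooth(A, x, b, omega=2.0 / 3.0, nu=1):
        """nu steps of x <- x + omega D^{-1} (b - A x), i.e. error e <- S^nu e."""
        Dinv = 1.0 / np.diag(A)
        for _ in range(nu):
            x = x + omega * Dinv * (b - A @ x)
        return x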

SLIDE 22

Multigrid Cost Model

Increasing the number of smoothing steps per level

[Figure: convergence rate vs. smoothing steps (left); number of V-cycles vs. smoothing steps (right).]

# MG iterations = log(10^−8) / log(ρ(V-cycle)), where ρ(V-cycle) is the spectral radius of the V-cycle; the count is rounded up to the next integer.

SLIDE 23

Work Unit Cost model

◮ The work-unit cost model ignores memory bandwidth:

1 WU = cost of one smoothing step = O(n)

total work = (9ν + 19)(1 + 1/4 + ...) · log(tol)/log(ρ) WU ≤ (9ν + 19) · (4/3) · log(tol)/log(ρ) WU

[Figure: number of V-cycles vs. smoothing steps (left); work units vs. smoothing steps (right).]
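A small sketch evaluating both formulas; the convergence rates ρ below are made-up placeholders (in the talk they come from the measured V-cycle rates), so only the shape of the trade-off is meaningful.

    import math

    def v_cycles(rho, tol=1e-8):
        """Iterations = ceil(log(tol) / log(rho))."""
        return math.ceil(math.log(tol) / math.log(rho))

    def work_units(nu, rho, tol=1e-8):
        """(9 nu + 19) * 4/3 * log(tol)/log(rho) work units."""
        return (9 * nu + 19) * (4.0 / 3.0) * math.log(tol) / math.log(rho)

    for nu, rho in [(1, 0.6), (2, 0.4), (5, 0.2)]:   # placeholder rates
        print(nu, v_cycles(rho), round(work_units(nu, rho)))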

SLIDE 24

[Figure: attainable GFlop/s (left) and average cost in s/GFlop (right) vs. number of smoothing steps, under three rooflines: peak BW with peak flops, no SIMD, and stream BW.]

cost(ν_1 × ω-Jac) = flops(ν_1 × ω-Jac) / roof(q(ν_1 × ω-Jac)) ≈ ν_1 flops(ω-Jac) / roof(ν_1 q_1(ω-Jac))   (4)

SLIDE 25

Roofline-based vs naive cost model

[Figure: work units vs. number of smoothing steps for the roofline-based and naive cost models.]

SLIDE 26

Timings for 2D, 8190² (Sandy Bridge)

[Figure: time (s) vs. smoothing steps for the naive and Pluto implementations, together with the naive and roofline model predictions.]

  • P. Ghysels and W. Vanroose, Modelling the performance of geometric multigrid on many-core computer architectures, Exascience.com tech report 09.2013.1, Sept. 2013; to appear in SISC.

SLIDE 27

3D results, 511³ (Sandy Bridge)

SLIDE 29

Current Krylov Solvers

[Diagram: User → Matvec → Krylov Solver]

◮ The user provides a matvec routine and selects the Krylov method at the command line.

◮ No opportunities to increase the arithmetic intensity.

Future Krylov Solvers

[Diagram: User → Stencil → Stencil Compiler (Pochoir, Pluto, ...) → Krylov Solver]

◮ The user provides a stencil routine.

◮ A stencil compiler increases the arithmetic intensity.

W. Vanroose, P. Ghysels, D. Roose, and K. Meerbergen, position paper at the DOE Exascale Mathematics workshop, 2013.
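A hypothetical sketch of such an interface (all names invented here, not an existing API): the user hands over the stencil itself, so the solver is free to pass it to a stencil compiler and fuse k applications into one tiled sweep.

    import numpy as np

    def user_stencil(u):
        """User-supplied 5-point Laplacian; u carries a one-cell ghost layer."""
        return (4.0 * u[1:-1, 1:-1]
                - u[:-2, 1:-1] - u[2:, 1:-1]
                - u[1:-1, :-2] - u[1:-1, 2:])

    class StencilKrylovSolver:
        def __init__(self, stencil):
            # Knowing the stencil (not just a black-box matvec), the solver
            # could generate a fused kernel for P_m(A)v via Pluto/Pochoir.
            self.apply_A = stencil

    u = np.pad(np.random.rand(6, 6), 1)  # 6x6 interior plus ghost layer
    w = StencilKrylovSolver(user_stencil).apply_A(u)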

SLIDE 32

Conclusions

◮ Two communication bottlenecks:
  ◮ limited bandwidth in Av: no benefit from SIMD.
  ◮ latencies in ⟨u, v⟩: poor strong scaling.

◮ Replace Av with P_m(A)v with fixed coefficients.

◮ Execute with a stencil compiler.

◮ Combine with multigrid.

◮ Opportunities in filters for eigensolvers.

◮ Many open questions for unstructured matrices.

◮ Can be combined with pipelined Krylov algorithms, where global reductions are hidden.

Acknowledgements:

◮ EU's Seventh Framework Programme (FP7/2007-2013) under grant agreement n 610741. www.exa2ct.eu

◮ Intel and the Institute for the Promotion of Innovation through Science and Technology in Flanders (IWT). www.exascience.com