SLIDE 1

Raising the Arithmetic Intensity of Krylov solvers

1 Applied Mathematics, University of Antwerp, Belgium
2 Future Technology Group, Berkeley Lab, USA
3 Intel ExaScience Lab, Belgium
4 USI, Lugano, CH

  • B. Reps(1), P. Ghysels(2), S. Donfack(4), O. Schenk(4), W. Vanroose(1,3)

www.exa2ct.eu

SLIDE 2

Exa2ct EU project

SLIDE 3

GMRES, classical Gram-Schmidt

1:  r_0 ← b − A x_0
2:  v_0 ← r_0 / ||r_0||_2
3:  for i = 0, ..., m−1 do
4:    w ← A v_i
5:    for j = 0, ..., i do
6:      h_{j,i} ← ⟨w, v_j⟩
7:    end for
8:    ṽ_{i+1} ← w − Σ_{j=0}^{i} h_{j,i} v_j
9:    h_{i+1,i} ← ||ṽ_{i+1}||_2
10:   v_{i+1} ← ṽ_{i+1} / h_{i+1,i}
11:   {apply Givens rotations to h_{:,i}}
12: end for
13: y_m ← argmin_y ||H_{m+1,m} y − ||r_0||_2 e_1||_2
14: x ← x_0 + V_m y_m
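For reference, a minimal numpy sketch of this algorithm; it is an illustration, not the talk's code. The Givens rotations of line 11 are replaced by a dense least-squares solve at the end, which is equivalent for this purpose, and the names (gmres_cgs, A, b, x0, m) are illustrative.

    import numpy as np

    def gmres_cgs(A, b, x0, m):
        """One GMRES(m) cycle with classical Gram-Schmidt (sketch)."""
        r0 = b - A @ x0
        beta = np.linalg.norm(r0)
        n = len(b)
        V = np.zeros((n, m + 1))
        H = np.zeros((m + 1, m))
        V[:, 0] = r0 / beta
        for i in range(m):
            w = A @ V[:, i]                      # SpMV: neighbor communication only
            H[:i + 1, i] = V[:, :i + 1].T @ w    # dot products: global reductions
            w = w - V[:, :i + 1] @ H[:i + 1, i]  # axpys: no communication
            H[i + 1, i] = np.linalg.norm(w)
            V[:, i + 1] = w / H[i + 1, i]
        e1 = np.zeros(m + 1)
        e1[0] = beta
        y = np.linalg.lstsq(H, e1, rcond=None)[0]  # min ||H y - beta e1||_2
        return x0 + V[:, :m] @ y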

Kernels and their communication:

◮ Sparse matrix-vector product: only communication with neighbors; good scaling but poor bandwidth usage.

◮ Dot product: global communication; scales as log(P) and suffers from jitter.

◮ Scalar-vector multiplication and vector-vector addition: no communication.

SLIDE 4

Performance of a kernel of dependent SpMVs

◮ SIMD keeps widening: SSE → AVX → 512-bit (mm512) on Xeon Phi.
◮ Similar with NEON on ARM.

Bandwidth is the bottleneck, and it will remain so even with 3D-stacked memory.

[Roofline figure: attainable GFlop/s vs. arithmetic intensity (flop/byte), with the peak-memory-bandwidth slope and the peak floating-point ceiling (with vectorization).]

  • S. Williams, A. Waterman and D. Patterson (2008)
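The roofline bound itself is one line: attainable performance is the minimum of the compute peak and bandwidth times arithmetic intensity. A tiny sketch with placeholder machine numbers (not measurements from the talk):

    peak_gflops = 128.0   # illustrative compute peak, GFlop/s
    peak_bw = 50.0        # illustrative STREAM bandwidth, GB/s

    def roofline(ai):
        """Attainable GFlop/s at arithmetic intensity ai (flop/byte)."""
        return min(peak_gflops, peak_bw * ai)

    for ai in (0.25, 0.5, 1, 2, 4, 8):
        print(f"AI = {ai:5.2f} flop/byte -> {roofline(ai):6.1f} GFlop/s")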
SLIDE 5

PATUS (Spatial Blocking)

[Figure: performance (GFlop/s) and memory bandwidth (GB/s) vs. number of threads on Intel Ivy Bridge (Xeon E5-2660 v2), for gcc 4.8.2 and PATUS 1.1: a. 7-point stencil, constant coefficients; b. 25-point stencil, variable coefficients.]

◮ PATUS leads to better results and lower bandwidth use.
◮ Good results, but bandwidth saturates around 10 cores.

It is difficult to do better with spatial blocking alone.

  • M. Christen, O. Schenk, and H. Burkhart, PATUS: a code generation and autotuning framework for parallel iterative stencil computations on modern microarchitectures, IPDPS'11.
  • M. Christen, O. Schenk, and Y. Cui, PATUS for convenient high-performance stencils: evaluation in earthquake simulations.

SLIDE 6

Arithmetic intensity of k dependent SpMVs

                  1 SpMV        k SpMVs         k SpMVs in place
flops             2 nnz         2k nnz          2k nnz
words moved       nnz + 2n      k(nnz + 2n)     nnz + 2n
q (flops/word)    ≈ 2           ≈ 2             ≈ 2k
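A quick numeric reading of the table, with illustrative values of nnz, n and k (divide q by 8 bytes/word for flop/byte): fusing the k SpMVs in place loads the matrix once, so q grows by roughly a factor k.

    nnz, n, k = 5 * 10**6, 10**6, 8   # e.g. a 5-point stencil matrix

    q_one      = 2 * nnz / (nnz + 2 * n)            # 1 SpMV: q of order 2
    q_k        = 2 * k * nnz / (k * (nnz + 2 * n))  # k SpMVs: still order 2
    q_in_place = 2 * k * nnz / (nnz + 2 * n)        # fused: roughly k times larger
    print(q_one, q_k, q_in_place)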

  • J. Demmel, CS 294-76 on communication-avoiding algorithms.
  • M. Hoemmen, Communication-avoiding Krylov subspace methods (2010).
SLIDE 7

Increasing the arithmetic intensity

◮ Divide the domain into tiles that fit in the cache.
◮ The ground surface of a tile is loaded into cache and reused k times.
◮ Redundant work at the tile edges (see the sketch below).

[Figure: 5-point stencil update, u^{s+1}_{i,j} computed from u^s_{i,j}, u^s_{i±1,j} and u^s_{i,j±1}; a tile of width B is advanced through steps 1, 2, 3, ..., s in cache.]

  • P. Ghysels, P. Klosiewicz, and W. Vanroose, Improving the arithmetic intensity of multigrid with the help of polynomial smoothers, Numer. Linear Algebra Appl. 19 (2012): 253–267.
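A minimal 1D sketch of this overlapped tiling, assuming a periodic 3-point averaging stencil (both the stencil and the names are illustrative): each tile is loaded once with a halo of width k and advanced k steps in cache, recomputing some edge values redundantly.

    import numpy as np

    def k_steps_tiled(u, k, B):
        """k sweeps of v[i] = (u[i-1] + u[i] + u[i+1]) / 3, periodic, by tiles."""
        N = len(u)
        out = np.empty_like(u)
        for a in range(0, N, B):
            b = min(a + B, N)
            idx = np.arange(a - k, b + k) % N   # tile plus halo of width k
            t = u[idx]                          # loaded from memory once
            for s in range(k):                  # k sweeps, all in cache;
                t[1:-1] = (t[:-2] + t[1:-1] + t[2:]) / 3.0  # edge work is redundant
            out[a:b] = t[k:k + (b - a)]         # after k sweeps this part is valid
        return out

After sweep s only the entries at distance more than s from the cut edges of the copy are valid, which is exactly why the halo must have width k and why the overlap work is redundant.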

SLIDE 8

Use the help of PLUTO

Pluto is an automatic parallelizer and locality optimizer for multicores that performs a source-to-source code transformation.

[Figure: time-tiled iteration space over the x and t axes, with parallelogram tiles numbered in execution order.]

◮ Automatic loop parallelization and vectorization.

◮ Locality optimizations based on the polyhedral model.

◮ Important trade-off between parallelism and temporal locality.

  • U. Bondhugula, M. Baskaran, S. Krishnamoorthy, J. Ramanujam, A. Rountev, and P. Sadayappan, Automatic transformations for communication-minimized parallelization and locality optimization in the polyhedral model, Int. Conf. on Compiler Construction (ETAPS CC), April 2008, Budapest, Hungary.

SLIDE 9

PLUTO (Temporal blocking)

Increase the arithmetic intensity based on the underlying stencils.

[Figure: the same time-tiled iteration space as on the previous slide.]

◮ Divide the domain into tiles that fit in the cache.

◮ Reuse the data in the tiles k times.

◮ Use the compiler's auto-vectorization capabilities within each tile.

◮ Increased parallelism and temporal data reuse.

◮ Inner tiles are suitable for external spatial blocking tools, e.g. PATUS.

◮ The tile representation is suitable for efficient scheduling strategies, e.g. TBB.

SLIDE 10

Increasing the arithmetic intensity with a stencil compiler (Pluto)

[Roofline plots: GFlop/s vs. arithmetic intensity (flop/byte) for naive and Pluto code, each with and without vectorization, against the peak and fitted rooflines. Left: dual-socket Intel Sandy Bridge, 16 × 2 threads; right: Intel Xeon Phi, 61 × 4 threads.]

SLIDE 11

Stencil Compilers

◮ Pochoir (MIT): http://people.csail.mit.edu/yuantang/
◮ Patus (USI): https://code.google.com/p/patus/
◮ Pluto: http://pluto-compiler.sourceforge.net
◮ Ghost (Erlangen): http://tiny.cc/ghost
◮ HHRT (Tokyo)
◮ Modesto, ...

SLIDE 13

Krylov methods

◮ Classical Krylov:

K_k(A, v) = span{v, Av, A^2 v, ..., A^{k−1} v}

The residual and error can then be written as

r^{(k)} = P_k(A) r^{(0)}   (1)

◮ Polynomial preconditioning: solve P_m(A) x = q(A) A x = q(A) b, with

K_k(P_m(A), v) = span{v, P_m(A) v, P_m(A)^2 v, ..., P_m(A)^{k−1} v}

where P_m(A) = (A − σ_1)(A − σ_2) · ... · (A − σ_m) is a fixed low-order polynomial. The residual becomes

r^{(k)} = Q_k(P_m(A)) r^{(0)}   (2)
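Applying the fixed polynomial is just m shifted matvecs; a hedged numpy sketch (apply_Pm and the shift list sigma are illustrative names, and any scalar normalization of P_m is left out):

    import numpy as np

    def apply_Pm(A, sigma, v):
        """w = (A - sigma_1 I) ... (A - sigma_m I) v, up to normalization."""
        w = v.copy()
        for s in sigma:
            w = A @ w - s * w   # one SpMV plus one axpy per factor
        return w

A Krylov method then builds K_k(P_m(A), v) simply by calling apply_Pm wherever it would have called the matvec.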

SLIDE 14

Incomplete list of the literature on polynomial preconditioning

  • Y. Saad, Iterative Methods for Sparse Linear Systems, Chapter 12, SIAM (2003).
  • D. O'Leary, Yet another polynomial preconditioner for the conjugate gradient algorithm, Lin. Alg. Appl. (1991), p. 377.
  • A. van der Ploeg, Preconditioning for sparse matrices with applications (1994).
  • M. van Gijzen, A polynomial preconditioner for the GMRES algorithm, J. Comp. Appl. Math. 59 (1995): 91–107.
  • A. Basermann, B. Reichel, and C. Schelthoff, Preconditioned CG methods for sparse matrices on massively parallel machines, 23 (1997): 381–398.
  ...

SLIDE 15

Convergence of CG with different short polynomials

SLIDE 16

Time to solution is reduced

SLIDE 17

Recursive calculation of wk+1 = pk+1(A)v

1: σ_1 ← θ/δ
2: ρ_0 ← 1/σ_1
3: w_0 ← 0, w_1 ← (1/θ) A v
4: Δw_1 ← w_1 − w_0
5: for k = 1, 2, ... do
6:   ρ_k ← 1/(2σ_1 − ρ_{k−1})
7:   Δw_{k+1} ← ρ_k ( (2/δ) A (v − w_k) + ρ_{k−1} Δw_k )
8:   w_{k+1} ← w_k + Δw_{k+1}
9: end for
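A direct numpy transcription of this recurrence, as a sketch; θ and δ are assumed to be the center and half-width of an interval containing the spectrum of A (the usual Chebyshev parameterization; the slide only names θ and δ), and m is the polynomial degree.

    import numpy as np

    def cheb_poly_apply(A, v, theta, delta, m):
        """w_m = p_m(A) v via the three-term recurrence above (sketch)."""
        sigma1 = theta / delta
        rho_prev = 1.0 / sigma1
        w = (A @ v) / theta          # w_1; w_0 = 0
        dw = w.copy()                # delta_w_1 = w_1 - w_0
        for k in range(1, m):
            rho = 1.0 / (2.0 * sigma1 - rho_prev)
            dw = rho * ((2.0 / delta) * (A @ (v - w)) + rho_prev * dw)
            w = w + dw
            rho_prev = rho
        return w

Each step costs one SpMV plus a few axpys, so p_m(A)v is a chain of m dependent stencil applications, exactly the pattern that temporal tiling accelerates.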

SLIDE 18

Mangled with Pluto

SLIDE 19

Average cost for each matvec

[Figure: average time per matvec vs. order of the polynomial; curves for the total, the matvec, the dot products, and the axpys.]

2D 2024 × 2048 finite-difference Poisson problem on an i7-2860QM.

SLIDE 20

Patus and Pluto

SLIDE 21

Arithmetic intensity of multigrid

MG = (I − I^h_{2h} (...) I^{2h}_h A) S^{ν_1},   (3)

where

◮ S is the smoother, for example ω-Jacobi, with S = I − ω D^{−1} A,

◮ I^h_{2h} is the interpolation and I^{2h}_h the restriction.

These are all low-arithmetic-intensity operations.
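For concreteness, a minimal ω-Jacobi smoother sketch (dense numpy, illustrative names); note that ν steps of the smoother are ν dependent stencil applications, the same bandwidth-bound pattern as before.

    import numpy as np

    def jacobi_smooth(A, x, b, omega=2.0 / 3.0, nu=1):
        """nu steps of x <- x + omega D^{-1} (b - A x), i.e. error e <- S^nu e."""
        Dinv = 1.0 / np.diag(A)
        for _ in range(nu):
            x = x + omega * Dinv * (b - A @ x)
        return x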

SLIDE 22

Multigrid Cost Model

Increasing the number of smoothing steps per level

[Figure: convergence rate vs. smoothing steps (left); number of V-cycles vs. smoothing steps (right).]

# MG iterations = log(10^−8) / log(ρ(V-cycle)), where ρ(V-cycle) is the spectral radius of the V-cycle; the count is rounded up to the next integer.

SLIDE 23

Work Unit Cost model

◮ The work-unit cost model ignores memory bandwidth:

1 WU = cost of one smoothing step = O(n)

total work = (9ν + 19)(1 + 1/4 + ...) · log(tol)/log(ρ) WU ≤ (9ν + 19) · (4/3) · log(tol)/log(ρ) WU

[Figure: number of V-cycles vs. smoothing steps (left); work units vs. smoothing steps (right).]
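A small sketch evaluating both formulas; the convergence rates ρ below are made-up placeholders (in the talk they come from the measured V-cycle rates), so only the shape of the trade-off is meaningful.

    import math

    def v_cycles(rho, tol=1e-8):
        """Iterations = ceil(log(tol) / log(rho))."""
        return math.ceil(math.log(tol) / math.log(rho))

    def work_units(nu, rho, tol=1e-8):
        """(9 nu + 19) * 4/3 * log(tol)/log(rho) work units."""
        return (9 * nu + 19) * (4.0 / 3.0) * math.log(tol) / math.log(rho)

    for nu, rho in [(1, 0.6), (2, 0.4), (5, 0.2)]:   # placeholder rates
        print(nu, v_cycles(rho), round(work_units(nu, rho)))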

SLIDE 24

[Figure: attainable GFlop/s (left) and average cost in s/GFlop (right) vs. number of smoothing steps, under three rooflines: peak BW with peak flops, no SIMD, and stream BW.]

cost(ν_1 × ω-Jac) = flops(ν_1 × ω-Jac) / roof(q(ν_1 × ω-Jac)) ≈ ν_1 flops(ω-Jac) / roof(ν_1 q_1(ω-Jac))   (4)

SLIDE 25

Roofline-based vs naive cost model

[Figure: work units vs. number of smoothing steps for the roofline-based and naive cost models.]

SLIDE 26

Timings for 2D, 8190² (Sandy Bridge)

[Figure: time (s) vs. smoothing steps for the naive and Pluto implementations, together with the naive and roofline model predictions.]

  • P. Ghysels and W. Vanroose, Modelling the performance of geometric multigrid on many-core computer architectures, Exascience.com tech report 09.2013.1, Sept. 2013; to appear in SISC.

SLIDE 27

3D results, 511³ (Sandy Bridge)

SLIDE 29

Current Krylov Solvers

[Diagram: User → Matvec → Krylov Solver]

◮ The user provides a matvec routine and selects the Krylov method at the command line.

◮ No opportunities to increase the arithmetic intensity.

Future Krylov Solvers

[Diagram: User → Stencil → Stencil Compiler (Pochoir, Pluto, ...) → Krylov Solver]

◮ The user provides a stencil routine.

◮ A stencil compiler increases the arithmetic intensity.

W. Vanroose, P. Ghysels, D. Roose, and K. Meerbergen, position paper at the DOE Exascale Mathematics workshop, 2013.
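A hypothetical sketch of such an interface (all names invented here, not an existing API): the user hands over the stencil itself, so the solver is free to pass it to a stencil compiler and fuse k applications into one tiled sweep.

    import numpy as np

    def user_stencil(u):
        """User-supplied 5-point Laplacian; u carries a one-cell ghost layer."""
        return (4.0 * u[1:-1, 1:-1]
                - u[:-2, 1:-1] - u[2:, 1:-1]
                - u[1:-1, :-2] - u[1:-1, 2:])

    class StencilKrylovSolver:
        def __init__(self, stencil):
            # Knowing the stencil (not just a black-box matvec), the solver
            # could generate a fused kernel for P_m(A)v via Pluto/Pochoir.
            self.apply_A = stencil

    u = np.pad(np.random.rand(6, 6), 1)  # 6x6 interior plus ghost layer
    w = StencilKrylovSolver(user_stencil).apply_A(u)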

SLIDE 32

Conclusions

◮ Two communication bottlenecks:
  ◮ limited bandwidth in Av: no benefit from SIMD.
  ◮ latencies in ⟨u, v⟩: poor strong scaling.

◮ Replace Av with P_m(A)v with fixed coefficients.

◮ Execute with a stencil compiler.

◮ Combine with multigrid.

◮ Opportunities in filters for eigensolvers.

◮ Many open questions for unstructured matrices.

◮ Can be combined with pipelined Krylov algorithms, where global reductions are hidden.

Acknowledgements:

◮ EU's Seventh Framework Programme (FP7/2007-2013) under grant agreement n 610741. www.exa2ct.eu

◮ Intel and the Institute for the Promotion of Innovation through Science and Technology in Flanders (IWT). www.exascience.com