SLIDE 1

exascaleproject.org

Recent Advances in Sparse Linear Solver Stacks for Exascale

NCAR Multi-core 9 Workshop

Stephen Thomas 1, Kasia Świrydowicz 1, Shreyas Ananthan 1, Michael Sprague 1, Julien Langou 2, Daniel Bielich 2, Ruipeng Li 3, Rob Falgout 3, Ulrike M. Yang 3, Mark Hoemmen 4, Ichitaro Yamazaki 4, Erik Boman 4, James Elliott 4

1 High Performance Algorithms and Complex Fluids Group, National Renewable Energy Laboratory, Golden, CO
2 Department of Mathematics, University of Colorado Denver, Denver, CO
3 Center for Applied Scientific Computing, Lawrence Livermore National Laboratory, Livermore, CA
4 Sandia National Laboratories, Livermore, CA and Albuquerque, NM

National Center for Atmospheric Research, September 25, 2019

SLIDE 2

Scalable Solvers

Grand challenge ECP simulations

  • High-fidelity wind turbine simulations that capture wake vortex formation
  • Nalu-Wind incompressible Navier-Stokes flow solver, O(100) billion grid cells
  • GMRES with symmetric Gauss-Seidel (SGS) and algebraic multigrid (AMG) preconditioners
  • Integration rate of O(10) seconds per time step, O(1) hour per rotor revolution

Recent advances in scalable solvers and preconditioners

  • Multi-MPI, multi-GPU implementation of low-synch one-reduce ICWY MGS-GMRES (hypre)
  • Multi-MPI, multi-GPU implementation of low-synch one-reduce s-step CA-GMRES (Trilinos)
  • 2× faster Nalu-Wind matrix assembly with optimal hypre CSR memory allocations
  • Low-rank approximate AINV hypre smoothers (NVIDIA V100 GPU, CUDA)
  • 25× faster than the cuSparse lower triangular solve on the GPU

SLIDE 3

Gram-Schmidt and GMRES

Solve Ax = b with iterative Krylov methods in DOE libraries (hypre, Trilinos, PETSc)

  • Start with an initial guess x_0, r_0 = b − A x_0, v_1 = r_0 / ‖r_0‖_2
  • Search for an update to x_0 in the Krylov space K_m = span{v_1, A v_1, A² v_1, ..., A^{m−1} v_1}
  • Krylov vectors form the columns of V, with v_i := V_{:,i}
  • The Arnoldi–GMRES solver is based on the QR factorization

      [ r_0, A V_m ] = V_{m+1} [ ‖r_0‖_2 e_1, H_{m+1,m} ]

  • Theorem. One synchronization per column is sufficient for (MGS, CGS-2) Gram-Schmidt QR and Arnoldi-QR
  • Inverse compact ICWY MGS-GMRES is based on a lower triangular solve, with O(ε) κ(A) loss of orthogonality
  • Recursive classical Gram-Schmidt (CGS-2) achieves O(ε) loss of orthogonality
  • Corollary. The backward stability and loss of orthogonality are equivalent to those of the original algorithms
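The Arnoldi relation above is easy to check numerically. A minimal NumPy sketch of plain MGS-Arnoldi (a baseline for the low-synch variants discussed later; the matrix and sizes are arbitrary):

```python
import numpy as np

def arnoldi_mgs(A, r0, m):
    """Build an orthonormal Krylov basis V_{m+1} and the (m+1) x m
    upper-Hessenberg matrix H satisfying A V_m = V_{m+1} H."""
    n = len(r0)
    V = np.zeros((n, m + 1))
    H = np.zeros((m + 1, m))
    V[:, 0] = r0 / np.linalg.norm(r0)
    for j in range(m):
        w = A @ V[:, j]
        for i in range(j + 1):          # modified Gram-Schmidt sweep
            H[i, j] = V[:, i] @ w
            w -= H[i, j] * V[:, i]
        H[j + 1, j] = np.linalg.norm(w)
        V[:, j + 1] = w / H[j + 1, j]
    return V, H
```

GMRES then minimizes ‖ ‖r_0‖_2 e_1 − H y ‖_2 over y, a small dense least-squares problem.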

SLIDE 4

One-reduce Gram-Schmidt Algorithms

Björck (1967), Lemma 5.1 and Corollary, Ruhe (1983) Modified Gram-Schmidt projector PM aj = (I −Qj−1 T M QT

j−1 )aj

inverse compact WY form T M = (I +Lj−1 )−1 Classical Gram-Schmidt with reorthogonalization PIC aj = (I −Qj−1 T IC QT

j−1 )aj

where we are approximating P = I −Qj−1 (QT

j−1 Qj−1 )−1 QT j−1

T IC = (QT

j−1 Qj−1 )−1 = I −Lj−1 = I −

  • −wT
  • 4
SLIDE 5

Inverse Compact WY Modified Gram-Schmidt

Q^T Q = I + L + L^T; compute L^T one column per iteration.

Algorithm 1 Triangular Solve Modified Gram-Schmidt, left-looking (columns)
 1: for j = 1, 2, ..., n do
 2:   q_j = a_j
 3:   if j > 1 then
 4:     T_{1:j−1, j−1} = Q_{:,1:j−1}^T q_{j−1}                          ⊲ Synchronization
 5:     L_{1:j−1, 1:j−1} = tril(T_{1:j−1, 1:j−1}, −1)
 6:     R_{1:j−1, j} = (I + L_{1:j−1, 1:j−1})^{−1} Q_{:,1:j−1}^T a_j    ⊲ Lower triangular solve
 7:     q_j = q_j − Q_{:,1:j−1} R_{1:j−1, j}
 8:   end if
 9:   R_{jj} = ‖q_j‖_2
10:   q_j = q_j / R_{jj}
11: end for
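A dense NumPy sketch of Algorithm 1, with the two block products on lines 4 and 6 fused into the single reduction the theorem promises; this is an illustration only, not the hypre kernel:

```python
import numpy as np

def icwy_mgs_qr(A):
    """Inverse compact WY MGS QR (Algorithm 1 sketch): one fused block
    product per column, then a lower triangular solve with I + L, where
    L is the strictly lower part of Q^T Q, built one row per iteration."""
    m, n = A.shape
    Q = np.zeros((m, n))
    R = np.zeros((n, n))
    L = np.zeros((n, n))                 # strictly lower triangular
    for j in range(n):
        q = A[:, j].copy()
        if j > 0:
            # single synchronization: Q^T [q_{j-1} | a_j] in one reduction
            P = Q[:, :j].T @ np.column_stack((Q[:, j - 1], A[:, j]))
            L[j - 1, :j - 1] = P[: j - 1, 0]     # new row of L
            # the triangular solve replaces the sequential MGS recurrence
            R[:j, j] = np.linalg.solve(np.eye(j) + L[:j, :j], P[:, 1])
            q -= Q[:, :j] @ R[:j, j]
        R[j, j] = np.linalg.norm(q)
        Q[:, j] = q / R[j, j]
    return Q, R
```

In exact arithmetic L = 0 and the solve reduces to classical Gram-Schmidt; in finite precision the correction reproduces MGS-quality orthogonality.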

SLIDE 6

Algorithm 2 One reduction MGS-GMRES with Lagged Normalization
 1: r_0 = b − A x_0, v_1 = r_0
 2: v_2 = A v_1
 3: (V_2, R, L_2) = mgs(V_2, R, L_1)
 4: for i = 1, 2, ... do
 5:   v_{i+2} = A v_{i+1}                                        ⊲ Matrix-vector product
 6:   [L^T_{:,i+1}, r_{1:i+1,i+2}] = V_{i+1}^T [v_{i+1}, v_{i+2}]    ⊲ Global synchronization
 7:   r_{i+1,i+1} = ‖v_{i+1}‖_2
 8:   v_{i+1} = v_{i+1} / r_{i+1,i+1}                            ⊲ Lagged normalization
 9:   r_{1:i+1,i+2} = r_{1:i+1,i+2} / r_{i+1,i+1}                ⊲ Scale for Arnoldi
10:   v_{i+2} = v_{i+2} / r_{i+1,i+1}                            ⊲ Scale for Arnoldi
11:   r_{i+1,i+2} = r_{i+1,i+2} / r_{i+1,i+1}
12:   L^T_{:,i+1} = L^T_{:,i+1} / r_{i+1,i+1}
13:   r_{1:i+1,i+2} = (I + L_{i+1})^{−1} r_{1:i+1,i+2}           ⊲ Lower triangular solve
14:   v_{i+2} = v_{i+2} − V_{i+1} r_{1:i+1,i+2}
15:   H_i = R_{i+1}
16:   Apply Givens rotations to H_i
17: end for
18: y_m = argmin ‖ H_m y_m − ‖r_0‖_2 e_1 ‖_2
19: x = x_0 + V_m y_m
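The ICWY projection can be dropped straight into GMRES. The sketch below (an illustration, not the hypre implementation) performs one fused block product per iteration plus the lower triangular solve, but still pays a separate reduction for the norm; Algorithm 2 removes that one too via the lagged normalization:

```python
import numpy as np

def icwy_gmres(A, b, x0, m):
    """GMRES with ICWY-MGS orthogonalization: per iteration, one fused
    block product V^T [v_j | w] and a triangular solve with I + L.
    (The norm on H[j+1, j] remains a second reduction in this sketch.)"""
    n = len(b)
    r0 = b - A @ x0
    beta = np.linalg.norm(r0)
    V = np.zeros((n, m + 1))
    H = np.zeros((m + 1, m))
    L = np.zeros((m + 1, m + 1))
    V[:, 0] = r0 / beta
    for j in range(m):
        w = A @ V[:, j]
        # one fused reduction: inner products with v_j and with w together
        P = V[:, :j + 1].T @ np.column_stack((V[:, j], w))
        L[j, :j] = P[:j, 0]                      # new row of strictly lower L
        h = np.linalg.solve(np.eye(j + 1) + L[:j + 1, :j + 1], P[:, 1])
        w -= V[:, :j + 1] @ h
        H[:j + 1, j] = h
        H[j + 1, j] = np.linalg.norm(w)          # separate sync in this sketch
        V[:, j + 1] = w / H[j + 1, j]
    e1 = np.zeros(m + 1)
    e1[0] = beta
    y, *_ = np.linalg.lstsq(H, e1, rcond=None)   # small dense least squares
    return x0 + V[:, :m] @ y
```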

SLIDE 7

Normwise Relative Backward Error

Stopping criterion for MGS-GMRES

  • When does MGS-GMRES reach its minimum error level? The relative residual

      ‖r_k‖_2 / ‖b‖_2 = ‖b − A x_k‖_2 / ‖b‖_2 < tol

    flattens when ‖S‖_2 = 1 and the columns of V_k become linearly dependent
  • Paige, Rozložník and Strakoš (2006): the normwise relative backward error

      ‖r_k‖_2 / ( ‖b‖_2 + ‖A‖_2 ‖x_k‖_2 ) < O(ε)

    is achieved when ‖S‖_2 = 1
  • Paige notes: for a sufficiently nonsingular matrix, σ_min(A) ≫ n² ε ‖A‖_F, one can employ MGS-GMRES to solve Ax = b with the NRBE stopping criterion
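The backward error itself is one line of NumPy (`nrbe` is a hypothetical helper name, not library API):

```python
import numpy as np

def nrbe(A, b, x):
    """Normwise relative backward error of Paige, Rozloznik and Strakos:
    eta(x) = ||b - A x||_2 / (||b||_2 + ||A||_2 ||x||_2).
    GMRES can be stopped once this reaches O(eps)."""
    r = b - A @ x
    return np.linalg.norm(r) / (np.linalg.norm(b)
                                + np.linalg.norm(A, 2) * np.linalg.norm(x))
```

Note eta(0) = 1 by construction, and eta at the exact solution is at roundoff level.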

SLIDE 8

Normwise Relative Backward Error

Figure: Greenbaum, Rozložník and Strakoš (1997). impcol_e matrix

SLIDE 9

Low-Synch GMRES Multi-MPI Multi-GPU

Świrydowicz et al. (2019), Low-Synch Gram-Schmidt and GMRES (NLAA)

(a) Hypre Level-1 BLAS MGS-GMRES (b) Low-synch ICWY MGS-GMRES
Figure: Performance on Eagle, Skylake + V100 GPU. NREL ABL mesh, n = 9M

  • Extended the strong-scaling roll-off by 4× for the ExaWind ABL grid on Skylake + V100

SLIDE 10

Low-Synch GMRES Multi-MPI Multi-GPU

(a) Hypre Gram-Schmidt times (b) Low-synch Gram-Schmidt times
Figure: Gram-Schmidt kernel times on Eagle, Skylake + V100 GPU. NREL ABL mesh, n = 9M.

Gram-Schmidt times are flat for the low-synch algorithm(s)

SLIDE 11

McAlister Blade

Figure: McAlister blade. Computational mesh.

SLIDE 12

McAlister Blade

Figure: McAlister blade. Tip vortex

SLIDE 13

GMRES-AMG reduced iterations with CGS-2 projection

Figure: Projected x0. CGS-2. Nalu-Wind pressure solve. McAlister blade.

SLIDE 14

Iterative Classical Gram-Schmidt CGS-2

Algorithm 3 Iterated Classical Gram-Schmidt (CGS-2)
Input: Q_{j−1} and R_{j−1}, with A_{j−1} = Q_{j−1} R_{j−1}; column vector a^{(0)} = a_j
Output: Q_j and R_j such that A_j = Q_j R_j
 1: for k = 1, 2 do
 2:   s^{(k−1)} = Q_{j−1}^T a^{(k−1)}                ⊲ Global synch
 3:   r_{1:j−1, j} = r_{1:j−1, j} + s^{(k−1)}
 4:   a^{(k)} = a^{(k−1)} − Q_{j−1} s^{(k−1)}
 5: end for
 6: r_{jj} = ‖a^{(2)}‖_2, q_j = a^{(2)} / r_{jj}     ⊲ Global synch
 7: r_j = [r_{1:j−1, j}, r_{jj}]
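A dense NumPy sketch of Algorithm 3 over all columns of A (illustration only):

```python
import numpy as np

def cgs2_qr(A):
    """Iterated classical Gram-Schmidt (CGS-2): two projection passes per
    column ("twice is enough"), each costing one block product, i.e. one
    global synchronization, plus one more synchronization for the norm."""
    m, n = A.shape
    Q = np.zeros((m, n))
    R = np.zeros((n, n))
    for j in range(n):
        q = A[:, j].copy()
        for _ in range(2):                  # k = 1, 2
            s = Q[:, :j].T @ q              # global synch
            R[:j, j] += s                   # accumulate both passes in R
            q -= Q[:, :j] @ s
        R[j, j] = np.linalg.norm(q)         # global synch for the norm
        Q[:, j] = q / R[j, j]
    return Q, R
```

The second pass is what restores O(ε) orthogonality when plain CGS would lose it.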

SLIDE 15

Recursive Classical Gram-Schmidt CGS-2

Orthogonal projector takes the form, A = QR, PIC = I −[Qj−2, qj−1 −Qj−2QT

j−2qj−1 ]Tj−1 [Qj−2, qj−1 ]T

Correction matrix (QT

j−2 Qj−2 )−1 = Tj−1 = I −Lj−1 −LT j−1 takes the form

Tj−1 =

  • Tj−2

T:,1:j−2 T1:j−2,: tj−1,j−1

  • =
  • I

−wT α−1 1

  • wT = Lj−2,:. Thus we have, r1:j−2,j = z, α = r−1

j−1,j−1 and where

  • w,

z

  • =
  • QT

j−2 qj−1,

QT

j−2 aj

  • ,

rj−1,j−1, rj−1,j

  • =
  • (qT

j−1qj−1 −wT w ),

(qT

j−1 aj −wT z )

  • Faster solver and more accurate eigenvalue computations based on Lanczos/Arnoldi algorithms

PETSc-SLEPc. Corrects the Hernandez, Roman, Tomas (2005), (2007) DCGS-2 algorithm Rolling Stones Theorem: You can’t always get what you want, But if you try sometime you find, you get what you need
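The correction T_{j−1} ≈ I − L_{j−1} − L_{j−1}^T is the first-order Neumann approximation of (Q^T Q)^{−1}: for columns with O(ε) non-orthogonality the approximation error is O(ε²). A quick numerical check (the 1e-6 perturbation and sizes are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(4)
Q, _ = np.linalg.qr(rng.standard_normal((40, 6)))
Q += 1e-6 * rng.standard_normal(Q.shape)     # inject O(1e-6) non-orthogonality
Q /= np.linalg.norm(Q, axis=0)               # keep columns normalized
G = Q.T @ Q                                  # = I + L + L^T, L strictly lower
Lmat = np.tril(G, -1)
T_approx = np.eye(6) - Lmat - Lmat.T         # first-order Neumann correction
err = np.linalg.norm(np.linalg.inv(G) - T_approx)
```

Here `err` is at the square of the injected non-orthogonality, while `inv(G) - I` itself is at the 1e-6 level, so the correction captures essentially all of the first-order term.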

SLIDE 16

Algorithm 4 Classical Gram-Schmidt (CGS-2) with Normalization Lag and Reorthogonalization Lag
Input: Q_{j−1} and R_{j−1}, with A_{j−1} = Q_{j−1} R_{j−1}; column vector q_j = a_j
Output: Q_j and R_j such that A_j = Q_j R_j
 1: if j = 1, return
 2: [r_{j−1,j−1}, r_{j−1,j}] = q_{j−1}^T [q_{j−1}, q_j]                     ⊲ Global synch
 3: if j > 2 then
 4:   [w, z] = Q_{j−2}^T [q_{j−1}, q_j]                                     ⊲ Same global synch
 5:   [r_{j−1,j−1}, r_{j−1,j}] = [r_{j−1,j−1} − w^T w, r_{j−1,j} − w^T z]   ⊲ Pythagorean
 6:   r_{1:j−2, j−1} = r_{1:j−2, j−1} + w                                   ⊲ Lagged R update
 7: end if
 8: r_{j−1,j−1} = { r_{j−1,j−1} }^{1/2}                                     ⊲ Lagged norm
 9: if j > 2 then q_{j−1} = q_{j−1} − Q_{j−2} w                             ⊲ Lagged reorthogonalization
10: q_{j−1} = q_{j−1} / r_{j−1,j−1}                                         ⊲ Lagged normalization
11: r_{j−1,j} = r_{j−1,j} / r_{j−1,j−1}
12: r_{1:j−2, j} = z                                                        ⊲ Apply recursive projection
13: q_j = q_j − Q_{j−1} r_j

The Pythagorean form (can) mitigate cancellation: a² − b² = (a − b)(a + b). Smoktunowicz, Barlow, Langou (2008); Ghysels et al. (2013).
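The Pythagorean step (line 5, with the lagged norm on line 8) is what lets the norm be formed from already-communicated quantities: for Q with orthonormal columns and w = Q^T q, ‖q − Q w‖² = ‖q‖² − ‖w‖², so no fresh reduction is needed. A quick check (sizes arbitrary):

```python
import numpy as np

rng = np.random.default_rng(5)
Q, _ = np.linalg.qr(rng.standard_normal((50, 5)))  # orthonormal columns
q = rng.standard_normal(50)
w = Q.T @ q                                  # coefficients from the fused reduction
lagged = np.sqrt(q @ q - w @ w)              # Pythagorean / lagged norm
direct = np.linalg.norm(q - Q @ w)           # what a fresh reduction would give
```

When ‖w‖ is close to ‖q‖ the subtraction cancels, which is why the factored form (a − b)(a + b) only *can* mitigate, not eliminate, the cancellation.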

SLIDE 17

Backward (representation) error

Figure: Representation error for classical Gram-Schmidt

SLIDE 18

Orthogonality

Figure: Orthogonality for classical Gram-Schmidt

SLIDE 19

Trilinos s-step GMRES

Figure: McAlister Solve Time.

SLIDE 20

Trilinos s-step GMRES

Figure: Convergence for s-step GMRES. Laplacian matrix

SLIDE 21

Trilinos s-step GMRES

Figure: Convergence for s-step GMRES. Steam-1 matrix

SLIDE 22

Trilinos s-step GMRES

Figure: Time per iteration s-step GMRES.

SLIDE 23

Trilinos s-step GMRES

Figure: Time per iteration s-step GMRES.

SLIDE 24

Trilinos s-step GMRES

Figure: Strong scaling s-step GMRES versus one-sync GMRES.

SLIDE 25

Benzi-Tuma AINV Biconjugation

Sparse approximate inverse for non-symmetric A:

    W^T A Z = D,   A = W^{−T} D Z^{−1},   A = L D U

Analogous to the incomplete LU (ILU) factorization. Benzi and Tůma (1998), oblique-norm Gram-Schmidt algorithm.

  • L. Fox (1966) presented the non-symmetric algorithm
  • S. Thomas (1992) analysed the loss of orthogonality
  • J. Kopal, M. Tůma, M. Rozložník, A. Smoktunowicz (2014)
  • R. Lowery and J. Langou (2018), representation error
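A dense, no-dropping sketch of the biconjugation process (the real AINV keeps W and Z sparse by dropping small entries; this toy only verifies W^T A Z = D):

```python
import numpy as np

def biconjugation(A):
    """Benzi-Tuma style biconjugation, dense and without dropping: build
    unit upper triangular W, Z with W^T A Z = D diagonal, so the inverse
    is applied as A^{-1} = Z D^{-1} W^T -- matvecs only, no solves."""
    n = A.shape[0]
    W = np.eye(n)
    Z = np.eye(n)
    for i in range(n):
        d = W[:, i] @ A @ Z[:, i]            # pivot (assumed nonzero)
        for j in range(i + 1, n):
            # oblique (A-biorthogonal) Gram-Schmidt updates
            Z[:, j] -= ((W[:, i] @ A @ Z[:, j]) / d) * Z[:, i]
            W[:, j] -= ((W[:, j] @ A @ Z[:, i]) / d) * W[:, i]
    return W, Z, np.diag(W.T @ A @ Z).copy()
```

Without dropping this reproduces A^{-1} exactly; dropping entries of W and Z is what turns it into a sparse *approximate* inverse preconditioner or smoother.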

SLIDE 26

AINV Lower Triangular Smoother

Smoother applied as a Richardson iteration, local to each MPI rank:

    x_{k+1} = x_k + D^{−1} W^T (b − A x_k)

Hybrid M_H or ℓ1 sub-domain smoothers for hypre BoomerAMG:

    G = I − M_H^{−1} A,   M_H = diag(B_k)

Speed advantage of matrix-vector multiplication on the GPU:

  • Lower sweep count than Gauss-Seidel in the V-cycle
  • Faster GMRES-AMG convergence
  • The triangular matvec is at least 25× faster than the cuSparse triangular solver
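One way to see the matvec-for-solve substitution behind that speedup (a toy construction, not the hypre smoother): for unit lower triangular L = I − N with N strictly lower triangular, the Neumann series L^{−1} = I + N + N² + ⋯ terminates, so the inherently sequential forward substitution becomes a handful of parallel-friendly matvecs.

```python
import numpy as np

def lower_solve_via_matvecs(Lmat, r, sweeps):
    """Apply L^{-1} r for unit lower triangular L = I - N via a truncated
    Neumann series: `sweeps` matvecs with the strictly lower part N,
    instead of a sequential triangular solve."""
    N = np.eye(Lmat.shape[0]) - Lmat     # strictly lower triangular
    y = r.copy()
    term = r.copy()
    for _ in range(sweeps):
        term = N @ term                  # one (sparse, in practice) matvec
        y += term
    return y
```

Because N is nilpotent (N^n = 0 for an n×n matrix), n − 1 sweeps reproduce the exact solve; a smoother only needs a couple.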

SLIDE 27

AINV Smoothers for GPU

McAlister 3M DOF. GMRES-AMG pressure solver convergence with AINV smoother

Figure: Pressure GMRES+AMG. AINV (blue 2 sweeps), Gauss-Seidel (black 2 sweeps)

SLIDE 28

Challenges to be addressed

Scientific

  • AINV for Trilinos-MueLu smoothers in s-step CA-GMRES
  • AINV preconditioner to replace SGS in the momentum solve

Technical

  • Testing performance on Summit (multi-node)
  • Verify low AMG setup time on the GPU for Trilinos-MueLu and hypre BoomerAMG
  • Integration with the Nalu-Wind model through the Trilinos Belos/MueLu and hypre BoomerAMG stacks

SLIDE 29

Thank you! Questions?

Acknowledgements:

  • This research was supported by the Exascale Computing Project (17-SC-20-SC), a collaborative effort of two U.S. Department of Energy organizations (Office of Science and the National Nuclear Security Administration) responsible for the planning and preparation of a capable exascale ecosystem, including software, applications, hardware, advanced system engineering, and early testbed platforms, in support of the nation's exascale computing imperative.
  • This research used resources of the Oak Ridge Leadership Computing Facility, which is a DOE Office of Science User Facility supported under Contract DE-AC05-00OR22725. The NREL Eagle supercomputer was also used extensively for these studies.
  • NREL is a national laboratory of the U.S. Department of Energy, Office of Energy Efficiency and Renewable Energy, operated by the Alliance for Sustainable Energy, LLC under Contract No. DE-AC36-08GO28308.
