Recent Advances in Sparse Linear Solver Stacks for Exascale
NCAR Multi-core 9 Workshop

  1. Recent Advances in Sparse Linear Solver Stacks for Exascale
     NCAR Multi-core 9 Workshop
     Stephen Thomas¹, Kasia Świrydowicz¹, Shreyas Ananthan¹, Michael Sprague¹, Julien Langou², Daniel Bielich², Ruipeng Li³, Rob Falgout³, Ulrike M. Yang³, Mark Hoemmen⁴, Ichitaro Yamazaki⁴, Erik Boman⁴, James Elliott⁴
     ¹ High Performance Algorithms and Complex Fluids Group, National Renewable Energy Laboratory, Golden, CO
     ² Department of Mathematics, University of Colorado, Denver, CO
     ³ Center for Applied Scientific Computing, Lawrence Livermore National Laboratory, Livermore, CA
     ⁴ Sandia National Laboratories, Livermore, CA and Albuquerque, NM
     National Center for Atmospheric Research, September 25, 2019
     exascaleproject.org

  2. Scalable Solvers
     Grand-challenge ECP simulations
     • High-fidelity wind turbine simulations to capture wake vortex formation
     • Nalu-Wind incompressible-flow Navier-Stokes solver, O(100) billion grid cells
     • GMRES with symmetric Gauss-Seidel (SGS) and algebraic multigrid (AMG) preconditioners
     • Target integration rate of O(10) seconds per time step, O(1) hour per rotor revolution
     Recent advances in scalable solvers and preconditioners
     • Multi-MPI, multi-GPU implementation of low-sync one-reduce ICWY MGS-GMRES (hypre)
     • Multi-MPI, multi-GPU implementation of low-sync one-reduce s-step CA-GMRES (Trilinos)
     • 2× faster Nalu-Wind matrix assembly with optimal hypre CSR memory allocations
     • Low-rank approximate inverse (AINV) hypre smoothers (NVIDIA V100, CUDA)
     • 25× faster than the cuSPARSE lower triangular solve on the GPU

  3. Gram-Schmidt and GMRES
     Solve $Ax = b$ with iterative Krylov methods in DOE libraries (hypre, Trilinos, PETSc)
     • Start with an initial guess $x_0$, $r_0 = b - Ax_0$, $v_1 = r_0 / \|r_0\|_2$
     • Search for an update to $x_0$ in the Krylov space $\mathcal{K}_m = \mathrm{span}\{\, v_1, Av_1, A^2 v_1, \ldots, A^{m-1} v_1 \,\}$
     • The Krylov vectors form the columns of $V$, with $v_i := V_{:,i}$
     • The Arnoldi-GMRES solver is based on the QR factorization
       $[\, r_0,\ A V_m \,] = V_{m+1}\, [\, \|r_0\|_2\, e_1,\ H_{m+1,m} \,]$
     Theorem. One synchronization per column is sufficient for (MGS, CGS-2) Gram-Schmidt QR and Arnoldi QR.
     • Inverse compact WY (ICWY) MGS-GMRES, based on a lower triangular solve: $O(\varepsilon)\,\kappa(A)$ loss of orthogonality
     • Recursive classical Gram-Schmidt (CGS-2): $O(\varepsilon)$ loss of orthogonality
     Corollary. The backward stability and loss of orthogonality are equivalent to those of the original algorithms.
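     To make the Arnoldi relation above concrete, here is a minimal NumPy sketch of MGS-based Arnoldi (illustrative only, not the hypre or Trilinos code). Each column costs j+1 inner products, and each inner product is a separate global reduction on a distributed machine; that is the synchronization cost the low-synch variants collapse to a single reduction.

         import numpy as np

         def arnoldi_mgs(A, r0, m):
             """Build V_{m+1}, H_{m+1,m} with A V_m = V_{m+1} H (MGS variant)."""
             n = len(r0)
             V = np.zeros((n, m + 1))
             H = np.zeros((m + 1, m))
             beta = np.linalg.norm(r0)
             V[:, 0] = r0 / beta
             for j in range(m):
                 w = A @ V[:, j]
                 for i in range(j + 1):           # j+1 separate reductions
                     H[i, j] = V[:, i] @ w
                     w -= H[i, j] * V[:, i]
                 H[j + 1, j] = np.linalg.norm(w)  # one more reduction
                 V[:, j + 1] = w / H[j + 1, j]    # assumes no breakdown
             return V, H, beta

     GMRES then solves the small least-squares problem min_y ||beta e_1 - H y||_2 and sets x = x_0 + V_m y.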

  4. One-Reduce Gram-Schmidt Algorithms
     Björck (1967), Lemma 5.1 and Corollary; Ruhe (1983)
     Modified Gram-Schmidt projector
     $P_M\, a_j = (\, I - Q_{j-1} T_M Q_{j-1}^T \,)\, a_j$, with the inverse compact WY form $T_M = (I + L_{j-1})^{-1}$
     Classical Gram-Schmidt with reorthogonalization
     $P_{IC}\, a_j = (\, I - Q_{j-1} T_{IC} Q_{j-1}^T \,)\, a_j$
     where we are approximating the projector $P = I - Q_{j-1} (Q_{j-1}^T Q_{j-1})^{-1} Q_{j-1}^T$ with
     $T_{IC} = (Q_{j-1}^T Q_{j-1})^{-1} \approx I - L_{j-1} = I - \begin{bmatrix} 0 & 0 \\ w^T & 0 \end{bmatrix}$
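     The inverse compact WY identity is exact whenever the columns of Q have unit norm, whether or not they are orthogonal: applying the rank-one MGS projections in sequence equals the block form $I - Q (I + L)^{-1} Q^T$ with $L = \mathrm{tril}(Q^T Q, -1)$. A small NumPy check (illustrative):

         import numpy as np

         rng = np.random.default_rng(0)
         Q = rng.standard_normal((50, 5))
         Q /= np.linalg.norm(Q, axis=0)       # unit columns, NOT orthogonal
         a = rng.standard_normal(50)

         p_mgs = a.copy()                     # sequential MGS projections
         for i in range(Q.shape[1]):
             p_mgs -= Q[:, i] * (Q[:, i] @ p_mgs)

         L = np.tril(Q.T @ Q, -1)             # strictly lower part of Q^T Q
         T = np.linalg.inv(np.eye(5) + L)     # T_M = (I + L)^{-1}
         p_wy = a - Q @ (T @ (Q.T @ a))

         print(np.linalg.norm(p_mgs - p_wy))  # ~1e-16: the two forms agree

     In practice the explicit inverse is never formed; the algorithms below apply $T_M$ by a lower triangular solve.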

  5. Inverse Compact WY Modified Gram-Schmidt
     $Q^T Q = I + L + L^T$; compute $L^T$ one column per iteration.
     Algorithm 1: Triangular-solve modified Gram-Schmidt, left-looking (columns)
     1: for j = 1, 2, ..., n do
     2:   q_j = a_j
     3:   if j > 1 then
     4:     T_{1:j-1,j-1} = Q_{:,1:j-1}^T q_{j-1}                        ▷ Synchronization
     5:     L_{1:j-1,1:j-1} = tril( T_{1:j-1,1:j-1}, -1 )
     6:     R_{1:j-1,j} = ( I + L_{1:j-1,1:j-1} )^{-1} Q_{:,1:j-1}^T a_j ▷ Lower triangular solve
     7:     q_j = q_j - Q_{:,1:j-1} R_{1:j-1,j}
     8:   end if
     9:   R_{jj} = ||q_j||_2
     10:  q_j = q_j / R_{jj}
     11: end for
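     A minimal NumPy/SciPy translation of Algorithm 1 (a sketch, not the hypre kernel). The two inner products on lines 4 and 6 are fused into a single block product, which is what makes this a one-reduction algorithm:

         import numpy as np
         from scipy.linalg import solve_triangular

         def icwy_mgs_qr(A):
             """Inverse compact WY MGS QR: one fused reduction per column."""
             m, n = A.shape
             Q = np.zeros((m, n))
             R = np.zeros((n, n))
             L = np.zeros((n, n))              # strictly lower part of Q^T Q
             for j in range(n):
                 qj = A[:, j].copy()
                 if j > 0:
                     # one fused reduction: Q^T [q_{j-1}, a_j]
                     P = Q[:, :j].T @ np.column_stack((Q[:, j - 1], qj))
                     L[j - 1, :j - 1] = P[:j - 1, 0]   # lagged column of L
                     R[:j, j] = solve_triangular(      # (I + L)^{-1} Q^T a_j
                         np.eye(j) + L[:j, :j], P[:, 1], lower=True)
                     qj -= Q[:, :j] @ R[:j, j]
                 R[j, j] = np.linalg.norm(qj)
                 Q[:, j] = qj / R[j, j]
             return Q, R

     In exact arithmetic this reproduces MGS column for column; in floating point it inherits the $O(\varepsilon)\,\kappa(A)$ loss of orthogonality quoted on slide 3.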

  6. Algorithm 2: One-reduce MGS-GMRES with lagged normalization
     1: r_0 = b - Ax_0, v_1 = r_0
     2: v_2 = Av_1
     3: ( V_2, R, L_2 ) = mgs( V_2, R, L_1 )
     4: for i = 1, 2, ... do
     5:   v_{i+2} = Av_{i+1}                                        ▷ Matrix-vector product
     6:   [ L^T_{:,i+1}, r_{i+2} ] = V_{i+1}^T [ v_{i+1}, v_{i+2} ] ▷ Global synchronization
     7:   r_{i+1,i+1} = ||v_{i+1}||_2
     8:   v_{i+1} = v_{i+1} / r_{i+1,i+1}                           ▷ Lagged normalization
     9:   r_{1:i+1,i+2} = r_{1:i+1,i+2} / r_{i+1,i+1}               ▷ Scale for Arnoldi
     10:  v_{i+2} = v_{i+2} / r_{i+1,i+1}                           ▷ Scale for Arnoldi
     11:  r_{i+1,i+2} = r_{i+1,i+2} / r_{i+1,i+1}
     12:  L^T_{:,i+1} = L^T_{:,i+1} / r_{i+1,i+1}
     13:  r_{1:i+1,i+2} = ( I + L_{i+1} )^{-1} r_{1:i+1,i+2}        ▷ Lower triangular solve
     14:  v_{i+2} = v_{i+2} - V_{i+1} r_{1:i+1,i+2}
     15:  H_i = R_{i+1}
     16:  Apply Givens rotations to H_i
     17: end for
     18: y_m = argmin || H_m y_m - ||r_0||_2 e_1 ||_2
     19: x = x_0 + V_m y_m
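     The single synchronization on line 6 is a block inner product that returns, at once, the new column of L^T, the squared norm needed for the lagged normalization, and the projection coefficients for the new Krylov vector. A small illustrative NumPy fragment (variable names are ours, not from the hypre source):

         import numpy as np

         rng = np.random.default_rng(1)
         n, k = 200, 5
         A = rng.standard_normal((n, n))
         V = np.linalg.qr(rng.standard_normal((n, k)))[0]
         V[:, -1] *= 1.7                  # last basis vector not yet normalized
         w = A @ V[:, -1]                 # v_{i+2} = A v_{i+1}

         B = V.T @ np.column_stack((V[:, -1], w))  # ONE global reduction
         L_col = B[:-1, 0]     # new column of L^T (line 6)
         norm_sq = B[-1, 0]    # ||v_{i+1}||^2, for the lagged norm (line 7)
         r_col = B[:, 1]       # V^T A v_{i+1}, projection coefficients
         # Lines 7-14 of Algorithm 2 consume these with no further
         # communication: scale, triangular solve, project.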

  7. Normwise Relative Backward Error
     Stopping criterion for MGS-GMRES
     • When does MGS-GMRES reach its minimum error level? The relative residual
       $\|r_k\|_2 / \|b\|_2 = \|b - Ax_k\|_2 / \|b\|_2 < \mathrm{tol}$
       flattens when $\|S\|_2 = 1$ and the columns of $V_k$ become linearly dependent.
     • Paige, Rozložník and Strakoš (2006): the normwise relative backward error
       $\dfrac{\|r_k\|_2}{\|b\|_2 + \|A\|_2 \|x_k\|_2} < O(\varepsilon)$
       is achieved when $\|S\|_2 = 1$.
     • Paige notes: for a sufficiently nonsingular matrix, $\sigma_{\min}(A) \gg n^2 \varepsilon \|A\|_F$, one can employ MGS-GMRES to solve $Ax = b$ with the NRBE stopping criterion.
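     A sketch of the criterion as one would code it, with $\|A\|_2$ replaced by the computable Frobenius norm (an assumption on our part):

         import numpy as np

         def nrbe(A, x, b):
             """Normwise relative backward error of an approximate solution x."""
             r = b - A @ x
             return np.linalg.norm(r) / (
                 np.linalg.norm(b) + np.linalg.norm(A, 'fro') * np.linalg.norm(x))

     Iteration can stop once nrbe(A, x_k, b) reaches O(machine epsilon); iterating further cannot improve the backward error.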

  8. Normwise Relative Backward Error
     Figure: Greenbaum, Rozložník and Strakoš (1997). impcol_e matrix.

  9. Low-Synch GMRES, Multi-MPI Multi-GPU
     Świrydowicz et al. (2019), Low-Synch Gram-Schmidt and GMRES (NLAA)
     (a) hypre Level-1 BLAS MGS-GMRES  (b) Low-synch ICWY MGS-GMRES
     Figure: Performance on Eagle, Skylake + V100 GPU. NREL ABL mesh, n = 9M.
     • Extended the strong-scaling roll-off by 4× for the ExaWind ABL grid on Skylake + V100.

  10. Low-Synch GMRES, Multi-MPI Multi-GPU
      (a) hypre Gram-Schmidt times  (b) Low-synch Gram-Schmidt times
      Figure: Gram-Schmidt kernel times on Eagle, Skylake + V100 GPU. NREL ABL mesh, n = 9M. Gram-Schmidt times are flat for the low-synch algorithm(s).

  11. McAlister Blade
      Figure: McAlister blade. Computational mesh.

  12. McAlister Blade
      Figure: McAlister blade. Tip vortex.

  13. GMRES-AMG: Reduced Iterations with CGS-2 Projection
      Figure: Projected $x_0$, CGS-2. Nalu-Wind pressure solve. McAlister blade.

  14. Iterated Classical Gram-Schmidt (CGS-2)
      Algorithm 3: Iterated classical Gram-Schmidt (CGS-2)
      Input: matrices Q_{j-1} and R_{j-1}, with A_{j-1} = Q_{j-1} R_{j-1}; column vector q_j = a_j
      Output: Q_j and R_j such that A_j = Q_j R_j
      1: for k = 1, 2 do
      2:   s^{(k-1)} = Q_{j-1}^T a^{(k-1)}                 ▷ Global synch
      3:   r^{(k)}_{1:j-1,j} = r^{(k-1)}_{1:j-1,j} + s^{(k-1)}
      4:   q_j = q_j - Q_{j-1} s^{(k-1)}
      5: end for
      6: r_{jj} = ||q_j||_2, q_j = q_j / r_{jj}            ▷ Global synch
      7: r_j = [ r_{1:j-1,j}, r_{jj} ]
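      As written, Algorithm 3 costs three global reductions per column (two projection passes plus the norm); the recursive and lagged variants on the next slides fuse these into one. A minimal NumPy sketch of one column:

          import numpy as np

          def cgs2_column(Q, a):
              """Orthogonalize one new column a against Q with CGS-2."""
              q = a.copy()
              r = np.zeros(Q.shape[1])
              for _ in range(2):           # two passes, k = 1, 2
                  s = Q.T @ q              # global synch
                  r += s
                  q -= Q @ s
              rjj = np.linalg.norm(q)      # global synch
              return q / rjj, r, rjj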

  15. Recursive Classical Gram-Schmidt (CGS-2)
      With $A = QR$, the orthogonal projector takes the form
      $P_{IC} = I - [\, Q_{j-2},\ q_{j-1} - Q_{j-2} Q_{j-2}^T q_{j-1} \,]\, T_{j-1}\, [\, Q_{j-2},\ q_{j-1} \,]^T$
      The correction matrix $T_{j-1} = (Q_{j-1}^T Q_{j-1})^{-1} = I - L_{j-1} - L_{j-1}^T$ takes the form
      $T_{j-1} = \begin{bmatrix} T_{j-2} & T_{:,1:j-2} \\ T_{1:j-2,:} & t_{j-1,j-1} \end{bmatrix} = \begin{bmatrix} I & 0 \\ -\alpha\, w^T & 1 \end{bmatrix}, \qquad w^T = L_{j-2,:}, \quad \alpha = r_{j-1,j-1}^{-1}$
      Thus, with one fused reduction $[\, w,\ z \,] = [\, Q_{j-2}^T q_{j-1},\ Q_{j-2}^T a_j \,]$, we have
      $r_{j-1,j-1} = (\, q_{j-1}^T q_{j-1} - w^T w \,)^{1/2}, \qquad r_{j-1,j} = (\, q_{j-1}^T a_j - w^T z \,) / r_{j-1,j-1}, \qquad r_{1:j-2,j} = z$
      Faster solves and more accurate eigenvalue computations based on Lanczos/Arnoldi algorithms in PETSc/SLEPc. Corrects the Hernandez, Roman, Tomas (2005, 2007) DCGS-2 algorithm.
      Rolling Stones Theorem: You can't always get what you want, but if you try sometime, you find you get what you need.
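      The lagged norm rests on the Pythagorean identity: for Q with (nearly) orthonormal columns and $w = Q^T q$, $\|q - Qw\|_2^2 = \|q\|_2^2 - \|w\|_2^2$, so the norm of the reorthogonalized vector is available from quantities the fused reduction already produced, before the update is applied. A quick illustrative check:

          import numpy as np

          rng = np.random.default_rng(2)
          Q, _ = np.linalg.qr(rng.standard_normal((40, 6)))
          q = rng.standard_normal(40)
          w = Q.T @ q
          lhs = np.linalg.norm(q - Q @ w) ** 2
          rhs = q @ q - w @ w              # computable before updating q
          print(abs(lhs - rhs))            # agrees to machine precision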

  16. Algorithm 4: Classical Gram-Schmidt (CGS-2) with Normalization Lag and Reorthogonalization Lag
      Input: matrices Q_{j-1} and R_{j-1}, with A_{j-1} = Q_{j-1} R_{j-1}; column vector q_j = a_j
      Output: Q_j and R_j such that A_j = Q_j R_j
      1: if j = 1 then return
      2: [ r_{j-1,j-1}, r_{j-1,j} ] = q_{j-1}^T [ q_{j-1}, q_j ]         ▷ Global synch
      3: if j > 2 then
      4:   [ w, z ] = Q_{j-2}^T [ q_{j-1}, q_j ]                         ▷ Same global synch
      5:   [ r_{j-1,j-1}, r_{j-1,j} ] = [ r_{j-1,j-1} - w^T w, r_{j-1,j} - w^T z ]  ▷ Pythagorean
      6:   r_{1:j-2,j-1} = r_{1:j-2,j-1} + w                             ▷ Lagged R update
      7: end if
      8: r_{j-1,j-1} = ( r_{j-1,j-1} )^{1/2}                             ▷ Lagged norm
      9: if j > 2 then q_{j-1} = q_{j-1} - Q_{j-2} w                     ▷ Lagged reorthogonalization
      10: q_{j-1} = q_{j-1} / r_{j-1,j-1}                                ▷ Lagged normalization
      11: r_{j-1,j} = r_{j-1,j} / r_{j-1,j-1}
      12: r_{1:j-2,j} = z                                                ▷ Apply recursive projection
      13: q_j = q_j - Q_{j-1} r_{1:j-1,j}
      The Pythagorean form can mitigate cancellation: $a^2 - b^2 = (a - b)(a + b)$. Smoktunowicz, Barlow, Langou (2008); Ghysels et al. (2013).
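      A NumPy sketch of a complete QR driver built on Algorithm 4 (our translation, with a cleanup pass for the final column, which otherwise never receives its lagged second pass):

          import numpy as np

          def dcgs2_qr(A):
              """CGS-2 with lagged reorthogonalization and normalization."""
              m, n = A.shape
              Q = np.zeros((m, n))
              R = np.zeros((n, n))
              Q[:, 0] = A[:, 0]                  # q_1 enters unnormalized
              for j in range(1, n):
                  qp = Q[:, j - 1]               # previous column (view), unfinished
                  aj = A[:, j].copy()
                  # one fused global reduction: Q_{j-1}^T [q_{j-1}, a_j]
                  rdiag = qp @ qp                # ||q_{j-1}||^2, uncorrected
                  roff = qp @ aj                 # q_{j-1}^T a_j, uncorrected
                  if j > 1:
                      W = Q[:, :j - 1].T @ np.column_stack((qp, aj))
                      w, z = W[:, 0], W[:, 1]
                      rdiag -= w @ w             # Pythagorean corrections
                      roff -= w @ z
                      R[:j - 1, j - 1] += w      # lagged R update
                      qp -= Q[:, :j - 1] @ w     # lagged reorthogonalization
                      R[:j - 1, j] = z           # recursive projection
                  R[j - 1, j - 1] = np.sqrt(rdiag)    # lagged norm
                  qp /= R[j - 1, j - 1]               # lagged normalization
                  R[j - 1, j] = roff / R[j - 1, j - 1]
                  Q[:, j] = aj - Q[:, :j] @ R[:j, j]  # first projection of a_j
              # cleanup: second pass and normalization for the last column
              w = Q[:, :n - 1].T @ Q[:, n - 1]
              R[:n - 1, n - 1] += w
              Q[:, n - 1] -= Q[:, :n - 1] @ w
              R[n - 1, n - 1] = np.linalg.norm(Q[:, n - 1])
              Q[:, n - 1] /= R[n - 1, n - 1]
              return Q, R

      In exact arithmetic this reproduces CGS-2; in floating point it is expected to retain the $O(\varepsilon)$ loss of orthogonality claimed on slide 3, at one reduction per column.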

  17. Backward (Representation) Error
      Figure: Representation error for classical Gram-Schmidt.

  18. Orthogonality
      Figure: Orthogonality for classical Gram-Schmidt.

  19. Trilinos s-step GMRES
      Figure: McAlister solve time.

  20. Trilinos s-step GMRES
      Figure: Convergence of s-step GMRES. Laplacian matrix.

  21. Trilinos s-step GMRES
      Figure: Convergence of s-step GMRES. Steam-1 matrix.

  22. Trilinos s-step GMRES
      Figure: Time per iteration, s-step GMRES.

  23. Trilinos s-step GMRES
      Figure: Time per iteration, s-step GMRES.

  24. Trilinos s-step GMRES
      Figure: Strong scaling, s-step GMRES versus one-sync GMRES.
