Massive Asynchronous Parallelization of Sparse Matrix Factorizations Edmond Chow School of Computational Science and Engineering Georgia Institute of Technology SIAM Workshop on Exascale Applied Mathematics Challenges and Opportunities
Contribution
A new fine-grained parallel approach for approximately computing certain sparse matrix factorizations, A ≈ XY
◮ Approach designed for large amounts of fine-grained parallelism, such as in GPUs, Intel Xeon Phi, and future processors
◮ Each entry of X and Y computed in parallel (updated iteratively and asynchronously)
◮ Main example: incomplete LU factorizations, A ≈ LU
◮ Potentially applicable to matrix completions, sparse approximate inverses, perhaps others
SIAM EX14 2
Conventional ILU factorization
Given sparse A, compute LU ≈ A, where L and U are sparse. Define S to be the sparsity pattern: (i,j) ∈ S if l_ij or u_ij can be nonzero.

for i = 2 to n do
  for k = 1 to i − 1 and (i,k) ∈ S do
    a_ik = a_ik / a_kk
    for j = k + 1 to n and (i,j) ∈ S do
      a_ij = a_ij − a_ik a_kj
    end
  end
end
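The algorithm above can be sketched in code. A minimal dense-storage version, assuming S is given as a set of index pairs; this is illustrative only, since a real implementation uses sparse data structures:

```python
import numpy as np

def ilu0_dense(A, S):
    """ILU on a dense copy of A, restricted to the pattern S.

    A : (n, n) ndarray; S : set of (i, j) pairs that may be nonzero.
    Returns the factors packed into one matrix: the strictly lower part
    holds L (unit diagonal implied), the upper part holds U.
    """
    A = A.astype(float).copy()
    n = A.shape[0]
    for i in range(1, n):
        for k in range(i):
            if (i, k) not in S:
                continue
            A[i, k] /= A[k, k]                 # a_ik = a_ik / a_kk
            for j in range(k + 1, n):
                if (i, j) in S:
                    A[i, j] -= A[i, k] * A[k, j]   # a_ij -= a_ik a_kj
    return A
```

With S equal to the full pattern this reduces to ordinary Gaussian elimination.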
Existing parallel ILU methods
At each step, find all rows that can be eliminated in parallel (rows that only depend on rows already eliminated)
Level scheduling ILU
Regular grids: van der Vorst 1989, Joubert-Oppe 1994
Irregular problems: Heroux-Vu-Yang 1991, Pakzad-Lloyd-Phillips 1997, Gonzalez-Cabaleiro-Pena 1999, Dong-Cooperman 2011, Gibou-Min 2012, Naumov 2012
Multicolor reordering ILU
Poole-Ortega 1987, Elman-Agron 1989, Jones-Plassmann 1994, Nakajima 2005, Li-Saad 2010, Heuveline-Lukarski-Weiss 2011
Domain decomposition ILU
Ma-Saad 1994, Karypis-Kumar 1997, Vuik et al 1998, Hysom-Pothen 1999, Magolu monga Made-van der Vorst 2002
New fine-grained parallel ILU
◮ Does not use level scheduling or reordering of the matrix
◮ Each nonzero in L and U computed in parallel
◮ Can use asynchronous computations (helps tolerate latency and performance irregularities)
Reformulation of ILU
An ILU factorization with sparsity pattern S has the property

(LU)_{ij} = a_{ij},  (i,j) ∈ S.

Instead of Gaussian elimination, we compute the unknowns

l_{ij}, i > j, (i,j) ∈ S
u_{ij}, i ≤ j, (i,j) ∈ S

using the constraints

\sum_{k=1}^{\min(i,j)} l_{ik} u_{kj} = a_{ij},  (i,j) ∈ S.

If the diagonal of L is fixed, then there are |S| unknowns and |S| constraints.
Solving the constraint equations
The equation corresponding to (i,j) gives

l_{ij} = \frac{1}{u_{jj}} \left( a_{ij} - \sum_{k=1}^{j-1} l_{ik} u_{kj} \right),  i > j

u_{ij} = a_{ij} - \sum_{k=1}^{i-1} l_{ik} u_{kj},  i ≤ j.

The equations have the form x = G(x). It is natural to try to solve these equations via a fixed-point iteration x^{(k+1)} = G(x^{(k)}) with an initial guess x^{(0)}. We update each component of x^{(k+1)} in parallel and asynchronously (each thread uses the latest values available).
Fine-grained ILU algorithm
Set unknowns l_ij and u_ij to initial values
for sweep = 1, 2, ... until convergence do
  parallel for (i,j) ∈ S do
    if i > j then
      l_ij = ( a_ij − \sum_{k=1}^{j−1} l_ik u_kj ) / u_jj
    else
      u_ij = a_ij − \sum_{k=1}^{i−1} l_ik u_kj
    end
  end
end

Arithmetic intensity can be tuned by controlling how often updates are exposed to other threads, at the cost of convergence degradation. The actual implementation uses sparse data structures.
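A sequential sketch of these sweeps, with the parallel-for written as an ordinary loop over S and dense arrays standing in for the sparse data structures; updating L and U in place mimics threads using the latest available values:

```python
import numpy as np

def finegrained_ilu_sweeps(A, S, nsweeps=3):
    """Fixed-point sweeps for the constraints (LU)_ij = a_ij on pattern S.

    Initial guess: L = I + strictly lower part of A, U = upper part of A.
    S is an ordered list of (i, j) pairs; with a Gaussian elimination
    ordering the unknowns are exact after a single sweep.
    """
    n = A.shape[0]
    L = np.tril(A, -1) + np.eye(n)
    U = np.triu(A)
    for _ in range(nsweeps):
        for (i, j) in S:
            if i > j:
                # l_ij = (a_ij - sum_{k<j} l_ik u_kj) / u_jj
                L[i, j] = (A[i, j] - L[i, :j] @ U[:j, j]) / U[j, j]
            else:
                # u_ij = a_ij - sum_{k<i} l_ik u_kj
                U[i, j] = A[i, j] - L[i, :i] @ U[:i, j]
    return L, U
```

With the full pattern and a row-major ordering of S, one sweep reproduces the exact LU factors.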
Solving the constraint equations (square or overdetermined)
Instead of solving the nonlinear equations, solve the least squares minimization problem

min \sum_{(i,j)\in S} \left( a_{ij} - \sum_k l_{ik} u_{kj} \right)^2

for example, using stochastic gradient descent (SGD):

e_{ij} = a_{ij} - l_{i,:} \cdot u_{:,j}
l_{i,:} = l_{i,:} + \gamma \, e_{ij} \, u_{:,j}^T
u_{:,j} = u_{:,j} + \gamma \, e_{ij} \, l_{i,:}^T

for each (i,j) ∈ S, where γ is a steplength parameter. Here, multiple unknowns are updated at each step. Related to nonlinear Kaczmarz. The asynchronous version of SGD is called “Hogwild!” (Niu-Recht-Re-Wright 2012).
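A minimal sequential sketch of these SGD updates on dense factors (the Hogwild! variant would run the loop body from many threads without locks); the function name and dense storage are illustrative assumptions:

```python
import numpy as np

def sgd_sweep(A, L, U, S, gamma):
    """One pass of SGD over the constraints sum_k l_ik u_kj = a_ij, (i,j) in S.

    For each constraint, the residual e_ij drives simultaneous updates to
    row i of L and column j of U, as on the slide.
    """
    for (i, j) in S:
        e = A[i, j] - L[i, :] @ U[:, j]   # residual of one constraint
        dL = gamma * e * U[:, j]          # step for row i of L (old values)
        dU = gamma * e * L[i, :]          # step for column j of U (old values)
        L[i, :] += dL
        U[:, j] += dU
```

A small steplength γ keeps the bilinear iteration from diverging; the squared constraint residual decreases over sweeps.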
Low rank matrix completion problems
Given a subset of the entries in A,
[Figure: matrix with only a subset of entries a_ij known]

find all the other entries, assuming A has rank r, A = FG. Our approach is to solve the bilinear equations

(FG)_{ij} = a_{ij},  (i,j) ∈ S

or the associated nonlinear least squares problem using an asynchronous iterative method. Existing parallel solution approaches use stochastic gradient descent or alternating least squares.
Sparse approximate inverses
Given an SPD matrix A, consider G^T G ≈ A^{-1}, where G is triangular. Our approach is to asynchronously solve

(G A G^T)_{ij} = I_{ij},  (i,j) ∈ S.

As before, the preconditioner G^T G is easily updated when the matrix A changes.
New approach
Write matrix factorizations as bilinear equations and then solve asynchronously
◮ More bilinear equations than original equations
◮ Equations are nonlinear
Potential advantages
◮ Do not need to solve the nonlinear equations exactly (no need to compute the incomplete factorization exactly)
◮ Nonlinear equations may have a good initial guess (e.g., time-dependent problems)
◮ Rich structure in the nonlinear equations that can be exploited
Data dependencies
[Figure: the formula for the unknown at (i,j) (dark square) depends on other unknowns to the left of (i,j) in L and above (i,j) in U (shaded regions).]

A Gaussian elimination order: entries (i,j) ∈ S in darker rows and columns are ordered before those in lighter rows and columns.

If the unknowns are computed in a Gaussian elimination order, then they are computed exactly in a single sweep.
Convergence
Convergence is related to the Jacobian J(x) of G(x). The partial derivatives in the row of J(x) corresponding to l_{ij} are

-\frac{l_{ik}}{u_{jj}}, \quad -\frac{u_{kj}}{u_{jj}}, \quad -\frac{1}{u_{jj}^2} \left( a_{ij} - \sum_{k=1}^{j-1} l_{ik} u_{kj} \right), \quad k ≤ j − 1.

The partial derivatives corresponding to u_{ij} are

-l_{ik}, \quad -u_{kj}, \quad k ≤ i − 1.

◮ G(x) is F-differentiable on its domain because its partial derivatives are well defined and continuous.
◮ J(x) is sparse and has all zeros on the diagonal.
◮ J(x) can be symmetrically reordered to be strictly lower triangular (Gaussian elimination ordering).
Structure of the Jacobian
[Figure: sparsity structure of the Jacobian for a 5-point matrix on a 4×4 grid]
Local convergence
Local convergence of the synchronous iteration
◮ Spectral radius of J(x) is zero for all x
◮ G(x) is F-differentiable
From Ostrowski’s theorem, if G(x) has a fixed point x*, then x* is a point of attraction of the synchronous fixed-point iteration.

Local convergence of the asynchronous iteration
◮ Assume all variables are eventually updated, etc.
◮ Spectral radius of |J(x)| is zero for all x
◮ G(x) is F-differentiable
From Bertsekas (1983), if G(x) has a fixed point x*, then x* is a point of attraction of the asynchronous fixed-point iteration.
Transient convergence
◮ Possible for nonlinear iterations to diverge before converging
◮ Want the norm of J(x) to be small, and thus the partial derivatives that form J(x) to be small (want u_jj to be large)
◮ This heuristic guides the choice to scale A symmetrically such that its diagonal is all ones
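The symmetric scaling in the last bullet can be sketched as follows, assuming positive diagonal entries (an illustrative helper; the slides do not give an implementation):

```python
import numpy as np

def symmetric_diagonal_scale(A):
    """Return D^{-1/2} A D^{-1/2}, where D = diag(A), so the result has
    unit diagonal. Keeping u_jj near 1 keeps the 1/u_jj factors in the
    Jacobian bounded."""
    d = 1.0 / np.sqrt(np.diag(A))
    return A * np.outer(d, d)
```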
Contraction mapping property
Theorem. If A is a 5-point or 7-point finite difference matrix, and if L̃ (Ũ) has sparsity pattern equal to the strictly lower (upper) triangular part of A, then

\| J(\tilde L, \tilde U) \|_1 = \max\left( \|\tilde L\|_{\max},\; \|\tilde U\|_{\max},\; \|A_{L*}\|_1 \right)

where A_{L*} is the strictly lower triangular part of A and \|\cdot\|_{\max} denotes the maximum absolute value among the entries of the matrix.

Example. Any diagonally dominant 5-point or 7-point finite difference matrix has \|J(x^{(0)})\|_1 < 1, where x^{(0)} is the standard initial guess.

Example. For a 5-point or 7-point centered-difference approximation of the Laplacian, \|J(x^{(0)})\|_1 = 0.5. In the 10 × 10 mesh case, \|J(x^*)\|_1 ≈ 0.686.
Measuring convergence of the nonlinear iterations
Nonlinear residual

\| (A - LU)_S \|_F = \left( \sum_{(i,j)\in S} \left( a_{ij} - \sum_{k=1}^{\min(i,j)} l_{ik} u_{kj} \right)^2 \right)^{1/2}

This differs from what we actually want to minimize:

ILU residual

\| A - LU \|_F

Convergence of the preconditioned linear iterations is known to be strongly related to the ILU residual (Duff-Meurant 1989).
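The two residuals can be sketched for dense triangular factors (illustrative helpers, not from the slides):

```python
import numpy as np

def nonlinear_residual(A, L, U, S):
    """Frobenius norm of the constraint residual restricted to pattern S.
    Assumes L is unit lower triangular and U is upper triangular, so
    L[i,:] @ U[:,j] equals the sum over k = 1..min(i,j)."""
    return np.sqrt(sum((A[i, j] - L[i, :] @ U[:, j]) ** 2 for (i, j) in S))

def ilu_residual(A, L, U):
    """Frobenius norm of A - LU over all entries (the quantity linked to
    preconditioner quality by Duff-Meurant 1989)."""
    return np.linalg.norm(A - L @ U, 'fro')
```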
Numerical tests for new parallel ILU algorithm
◮ Do the asynchronous iterations converge?
◮ How fast is convergence with the number of threads?
◮ How good are the approximate factorizations as preconditioners?

Measure performance in terms of solver iteration count. Tests on Intel Xeon Phi. Initial L and U are the lower and upper triangular parts of A. Use Gaussian elimination ordering for the nonlinear equations.
2D FEM Laplacian, n = 203841, RCM ordering, 240 threads
        |         Level 0           |         Level 1           |         Level 2
Sweeps  | PCG    nonlin    ILU      | PCG    nonlin    ILU      | PCG    nonlin    ILU
        | iter   resid     resid    | iter   resid     resid    | iter   resid     resid
0       | 404    1.7e+04   41.1350  | 404    2.3e+04   41.1350  | 404    2.3e+04   41.1350
1       | 318    3.8e+03   32.7491  | 256    5.7e+03   18.7110  | 206    7.0e+03   17.3239
2       | 301    9.7e+02   32.1707  | 207    8.6e+02   12.4703  | 158    1.5e+03    6.7618
3       | 298    1.7e+02   32.1117  | 193    1.8e+02   12.3845  | 132    4.8e+02    5.8985
4       | 297    2.8e+01   32.1524  | 187    4.6e+01   12.4139  | 127    1.6e+02    5.8555
5       | 297    4.4e+00   32.1613  | 186    1.4e+01   12.4230  | 126    6.5e+01    5.8706
IC      | 297              32.1629  | 185              12.4272  | 126               5.8894
2D FEM Laplacian, n = 203841, RCM ordering
[Figure: solver iterations vs. number of threads (up to 60) for fill levels 0, 1, and 2, using 1, 2, and 3 sweeps]

◮ very small number of sweeps required
◮ possible to use a large number of threads
◮ speedup is approx. 50 with 60 threads
◮ 5x faster (3 sweeps) than level-scheduled ILU with 60 threads
Non-diagonally-dominant problems
Finite difference discretization of the convection-diffusion equation

-\left( \frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2} \right) + \beta \left( \frac{\partial (e^{xy} u)}{\partial x} + \frac{\partial (e^{-xy} u)}{\partial y} \right) = g

on a square domain [0,1] × [0,1] with Dirichlet boundary conditions, 450 × 450 mesh. Large values of β of 1500 and 3000 were used to generate non-diagonally-dominant (and nonsymmetric) problems.
Non-diagonally-dominant problem, β = 1500
         |  Solver Iterations  |  Nonlinear Residual
No. of   |  Number of Sweeps   |  Number of Sweeps
Threads  |  1     2     3      |  1        2        3
1        | 30.0  30.0  30.0    | 1.3e-14  9.6e-15  9.6e-15
2        | 23.3  30.0  30.0    | 5.70     0.184    0.0035
4        | 22.3  30.0  30.0    | 10.48    0.746    0.0388
8        | 23.7  30.7  30.0    | 12.43    0.966    0.0821
14       | 24.0  31.0  30.0    | 13.03    1.114    0.1047
20       | 24.0  31.0  30.0    | 13.31    1.159    0.1110
30       | 24.0  31.0  30.0    | 13.32    1.161    0.1094
40       | 24.0  31.0  30.0    | 13.34    1.166    0.1137
50       | 24.0  31.0  30.0    | 13.45    1.178    0.1147
60       | 24.0  31.0  30.0    | 13.41    1.172    0.1145

Solver iteration counts and nonlinear residuals (residuals in units of 10^3)
Univ. Florida sparse matrices (SPD cases)

Matrix          Sweeps   PCG iter   Rel. nonlin. resid.
af shell3       0          1187     1
                1           710     1.22e+00
                2           685     7.60e−02
                3           685     5.17e−03
                IC          643
apache2         0          1674     1
                1          1199     2.04e+00
                2           894     9.92e−01
                3           870     1.53e−01
                IC          896
ecology2        0          1519     1
                1          1775     1.39e+00
                2          1711     3.99e−01
                3          1707     1.98e−02
                IC         1711
G3 circuit      0          1515     1
                1           947     1.68e+00
                2           870     3.33e−01
                3           871     1.04e−02
                IC          780
offshore        0         2000+     1
                1           348     1.44e+00
                2           297     7.86e−02
                3           297     7.38e−05
                IC          300
parabolic fem   0          1260     1
                1           502     1.65e+00
                2           412     5.45e−01
                3           405     4.71e−02
                IC          406
thermal2        0          1984     1
                1          1418     1.61e+00
                2          1314     4.08e−01
                3          1308     2.41e−02
                IC         1257
Iterative and approximate triangular solves
Trade accuracy for parallelism. Approximately solve the triangular system Rx = b by the iteration

x_{k+1} = (I − D^{-1} R) x_k + D^{-1} b

where D is the diagonal part of R.

◮ the iteration matrix G = I − D^{-1}R is strictly triangular and has spectral radius 0 (trivial asymptotic convergence)
◮ for fast convergence, want the norm of G to be small
◮ R from stable ILU factorizations of physical problems is often close to being diagonally dominant

Related to truncated Neumann series approximations (van der Vorst 1982, Benzi-Tuma 1999)
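A minimal sketch of this iteration for a dense triangular R (illustrative; actual implementations use sparse matrix-vector products):

```python
import numpy as np

def jacobi_triangular_solve(R, b, nsteps):
    """Approximate Rx = b by x_{k+1} = (I - D^{-1}R) x_k + D^{-1} b,
    D = diag(R). For triangular R the iteration matrix is strictly
    triangular (nilpotent), so n steps give the exact solution; a few
    steps suffice when R is close to diagonally dominant."""
    dinv = 1.0 / np.diag(R)
    x = dinv * b                        # x_0 = D^{-1} b
    for _ in range(nsteps):
        x = x + dinv * (b - R @ x)      # equivalent update form
    return x
```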
Iterative and approximate triangular solves
In general, we solve the nonsymmetric system Rx = b as x ≈ p(R)b

◮ p may be a fixed GMRES or Chebyshev polynomial approximation to the inverse function
◮ implementations depend on SpMV
◮ preconditioner is a fixed linear operator in the non-asynchronous case
Iterative and approximate triangular solves
Iteration Counts

[Figure: linear system solve iteration counts for level scheduling vs. approximate triangular solves on af_shell, parabolic_fem, apache2, ecology2, thermal2, G3_circuit, and offshore]
Iterative and approximate triangular solves
Xeon Phi Timings
[Figure: linear system solve times (s) on Intel Xeon Phi for level scheduling vs. approximate triangular solves on the same matrices]
Iterative and approximate triangular solves
GTX TITAN Timings
[Figure: linear system solve times (s) on GTX TITAN for level scheduling vs. approximate triangular solves on the same matrices]
Conclusion
Write matrix factorizations A = XY as bilinear equations

a_{ij} = \sum_k x_{ik} y_{kj}

and then solve asynchronously

◮ leads to large amounts of simple, fine-grained parallel work

May be extended especially to matrix factorizations that
◮ are sparse
◮ do not need to be exact
◮ have good initial guesses

Report posted at: http://www.cc.gatech.edu/~echow