Exploiting Matrix Reuse and Data Locality in Sparse Matrix-Vector and Matrix-Transpose-Vector Multiplication on Many-Core Architectures


  1. Matrix reuse and data locality in parallel $y = Az$ and $z = A^T x$

     Exploiting Matrix Reuse and Data Locality in Sparse Matrix-Vector and Matrix-Transpose-Vector Multiplication on Many-Core Architectures

     Kadir Akbudak (2), Ozan Karsavuran (1, speaker), Cevdet Aykanat (1)
     sites.google.com/site/kadircs, kadir.cs@gmail.com
     (1) Bilkent University, Turkey; (2) KAUST, KSA

     SIAM Workshop on Combinatorial Scientific Computing (CSC), Albuquerque, NM, USA, October 10-12, 2016

     O. Karsavuran, K. Akbudak, and C. Aykanat, Locality-Aware Parallel Sparse Matrix-Vector and Matrix-Transpose-Vector Multiplication on Many-Core Architectures, IEEE Transactions on Parallel and Distributed Systems (TPDS), vol. 27(6), pp. 1713-1726, 2016, available at ieeexplore.ieee.org/document/7152923/

  2. Outline

     - Introduction: $y = AA^T x$
     - Open problems & Related work
     - Parallel SpAA^T based on 1D partitioning of A and A^T matrices
       - Quality criteria for efficient parallelization of SpAA^T
       - Proposed SpAA^T algorithms
     - Experiments
     - References

  3. Introduction: $y = AA^T x$

     Thread-level parallelization of $y = AA^T x$ (SpAA^T)

     - $y = AA^T x$ is computed as two sparse matrix-vector multiplies (SpMV):
       - $z = A^T x$: sparse matrix-transpose-vector multiply (SpA^T), and then
       - $y = Az$: sparse matrix-vector multiply (SpA)
     - Thread-level parallelization of repeated and consecutive SpA and SpA^T operations that involve the same sparse matrix $A$
     - Examples:
       - Linear programming (LP) problems via interior point methods
       - Nonsymmetric systems via Bi-CG, CGNE, and Lanczos bi-orthogonalization
       - Least squares problems via LSQR
       - Linear feasibility problems via the Surrogate Constraints method
       - Krylov-based balancing algorithms used as preconditioners for sparse eigensolvers
       - Web page ranking via the HITS algorithm
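
The two-step computation on this slide can be sketched in a few lines of Python. This is a minimal serial illustration, not the authors' implementation: it assumes $A$ is stored once in CSR form, and the function names `spmv` and `spmv_transpose` and the tiny example matrix are hypothetical.

```python
def spmv(row_ptr, col_idx, vals, v, n_rows):
    """y = A v for A stored in CSR form (gather over each row)."""
    y = [0.0] * n_rows
    for i in range(n_rows):
        for p in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += vals[p] * v[col_idx[p]]
    return y


def spmv_transpose(row_ptr, col_idx, vals, x, n_rows, n_cols):
    """z = A^T x using the same CSR arrays of A (scatter into z)."""
    z = [0.0] * n_cols
    for i in range(n_rows):
        for p in range(row_ptr[i], row_ptr[i + 1]):
            z[col_idx[p]] += vals[p] * x[i]
    return z


# Hypothetical 2 x 3 example:
# A = [[1, 0, 2],
#      [0, 3, 0]]
row_ptr = [0, 2, 3]
col_idx = [0, 2, 1]
vals = [1.0, 2.0, 3.0]
x = [1.0, 2.0]

z = spmv_transpose(row_ptr, col_idx, vals, x, 2, 3)  # z = A^T x
y = spmv(row_ptr, col_idx, vals, z, 2)               # y = A z
```

Both kernels sweep the same `row_ptr`, `col_idx`, and `vals` arrays, so the nonzeros of $A$ are read twice; avoiding that second read is the reuse opportunity the talk targets.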

  4. Open problems & Related work

     Open problems
     - Utilize the opportunity of reusing $A$-matrix nonzeros?
     - Obtain close performance for both $z = A^T x$ and $y = Az$ at the same time?
     - Single storage of $A$ for both $z = A^T x$ and $y = Az$
     - Storage of $A^T$ for $z = A^T x$ and a separate storage of $A$ for $y = Az$

     Related work
     - Optimized Sparse Kernel Interface (OSKI), Berkeley
       - Serial
       - Each row/column is reused.
     - Compressed Sparse Blocks (CSB) by Buluç et al. [10]
       - Parallel
       - Same data structure for both SpA and SpA^T operations without any performance degradation
       - Two phase, i.e., SpA and SpA^T are not performed simultaneously

     [Figure: $x$, $z$, and $y$ vectors shown with $A^T$ and $A$.]

  5. Parallel SpAA^T based on 1D partitioning of A and A^T matrices

     Thread-level baseline parallelization of SpAA^T

     [Figure: four panels showing how $A$, $A^T$, and the $x$, $z$, and $y$ vectors are sliced among four threads, using row slices $R_k$ and column slices $C_k$ of $A$: RRp (Row-Row parallel), CCp (Column-Column parallel), CRp (Column-Row parallel), and RCp (Row-Column parallel). Yellow tones mark exclusive accesses by a single thread; red marks concurrent accesses by multiple threads.]

     Figure caption: Four baseline SpAA^T algorithms for computing $y = Az$ after $z = A^T x$ by four threads.
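
As an illustration of one of the four baselines, the sketch below simulates CRp serially, assuming $A$ is stored by columns (CSC). Each loop iteration stands for one thread that owns a column slice $C_k$: it computes $z_k = C_k^T x$ and immediately accumulates $C_k z_k$ into $y$. The function name `crp`, the `slices` argument, and the use of private partial $y$ vectors (summed at the end to stand in for the concurrent writes) are assumptions of this sketch, not details taken from the slides.

```python
def crp(col_ptr, row_idx, vals, x, n_rows, n_cols, slices):
    """Column-Row parallel y = A (A^T x), simulated serially.

    slices: list of (start, end) column ranges, one per hypothetical thread.
    """
    z = [0.0] * n_cols
    partial_ys = []
    for start, end in slices:              # one iteration = one thread's work
        y_part = [0.0] * n_rows            # private copy instead of shared y
        for j in range(start, end):
            lo, hi = col_ptr[j], col_ptr[j + 1]
            # Row-parallel z = A^T x: z_j = c_j^T x, written by this thread only.
            z_j = sum(vals[p] * x[row_idx[p]] for p in range(lo, hi))
            z[j] = z_j
            # Column-parallel y = A z: y += c_j * z_j, reusing the same nonzeros.
            for p in range(lo, hi):
                y_part[row_idx[p]] += vals[p] * z_j
        partial_ys.append(y_part)
    # Reduce the private copies; a threaded version would write to y concurrently.
    y = [sum(entries) for entries in zip(*partial_ys)]
    return y, z


# Hypothetical 2 x 3 example in CSC form:
# A = [[1, 0, 2],
#      [0, 3, 0]]
col_ptr = [0, 1, 2, 3]
row_idx = [0, 1, 0]
vals = [1.0, 3.0, 2.0]
x = [1.0, 2.0]

y, z = crp(col_ptr, row_idx, vals, x, 2, 3, slices=[(0, 2), (2, 3)])
```

Each column's nonzeros and the freshly computed $z_j$ are used twice back to back, but every slice may touch any entry of $y$, which is the concurrent-access pattern the figure marks in red.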

  6. Parallel SpAA^T based on 1D partitioning of A and A^T matrices

     Contributions

     - Identify five quality criteria (QC) that have an impact on the performance of parallel SpAA^T
     - Singly-bordered block-diagonal (SB) form based methods: sbCRp and sbRCp
     - For sbCRp (SB-based Column-Row parallel algorithm), we permute matrix $A$ into a rowwise SB form, which induces a columnwise SB form of matrix $A^T$:

       $$\begin{bmatrix} y_1 \\ y_2 \\ y_3 \\ y_4 \\ y_B \end{bmatrix} = \begin{bmatrix} A_{11} & & & \\ & A_{22} & & \\ & & A_{33} & \\ & & & A_{44} \\ A_{B1} & A_{B2} & A_{B3} & A_{B4} \end{bmatrix} \begin{bmatrix} z_1 \\ z_2 \\ z_3 \\ z_4 \end{bmatrix}$$

     - For sbRCp (SB-based Row-Column parallel algorithm), we permute matrix $A$ into a columnwise SB form, which induces a rowwise SB form of matrix $A^T$:

       $$\begin{bmatrix} y_1 \\ y_2 \\ y_3 \\ y_4 \end{bmatrix} = \begin{bmatrix} A_{11} & & & & A_{1B} \\ & A_{22} & & & A_{2B} \\ & & A_{33} & & A_{3B} \\ & & & A_{44} & A_{4B} \end{bmatrix} \begin{bmatrix} z_1 \\ z_2 \\ z_3 \\ z_4 \\ z_B \end{bmatrix}$$

     - Achieve (a) ($z$-vector reuse) and (b) ($A$-matrix reuse).
     - Objectives of minimizing the size of the row/column border in the SB form of $A$ ≈ achieve QC (c), (d), and (e) in sbCRp/sbRCp.

     Figure caption: Matrix $A$ partitioned into four and the submatrices are processed by four threads.

  7. Parallel SpAA^T based on 1D partitioning of A and A^T matrices

     Quality criteria for efficient parallelization of SpAA^T

     Quality criteria                                                RRp   CRp   RCp   sbCRp   sbRCp
     (a) Reusing z-vector entries generated in z = A^T x             ×     ✓     ×     ✓       ✓¹
         and then read in y = A z
     (b) Reusing matrix nonzeros (together with their indices)       ×     ✓     ×     ✓       ✓²
         in z = A^T x and y = A z
     (c) Exploiting temporal locality in reading input vector        ×³    ×³    ×³    ✓       ✓
         entries in row-parallel SpMVs
     (d) Exploiting temporal locality in updating output vector      −     ×³    ×³    ✓       ✓
         entries in column-parallel SpMVs
     (e) Minimizing the number of concurrent writes performed        ✓     ×     ×     ✓       ✓
         by different threads in column-parallel SpMVs

     ✓  : satisfied
     ✓¹ : satisfied except z_B border subvectors
     ✓² : satisfied except A_kB border submatrices
     ×  : not satisfied
     ×³ : may be satisfied through row/column reordering
     −  : not applicable

     [Figure: CRp (Column-Row parallel) scheme with column slices $C_1, \dots, C_4$ of $A$ and subvectors $z_1, \dots, z_4$.]

  8. Parallel SpAA^T based on 1D partitioning of A and A^T matrices

     Proposed SpAA^T algorithms

     - Maintaining balance on the number of nonzeros at each slice
     - Reducing parallel time under arbitrary task scheduling
     - Reducing border size
     - Reducing # of cache misses due to loss of temporal locality:
       $\lambda(c_j) = |\{A_k : c_j \text{ has at least one nonzero in } A_k,\ k = 1, \dots, K\}|$
     - Reducing # of concurrent writes:
       $\lambda(r_i) = |\{A_k : r_i \text{ has at least one nonzero in } A_k,\ k = 1, \dots, K\}|$

     [Figure: two row orderings of a 6 × 8 example matrix (rows $r_1$ to $r_6$, columns $c_1$ to $c_8$, input entries $x_1$ to $x_8$), one with $\sum \lambda(c_j) = 3 + 5$ and the other with $\sum \lambda(c_j) = 3 + 3$.]

     Figure caption: Matrix $A$ partitioned into three and the submatrices are processed by three threads.
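
The connectivity $\lambda(c_j)$ defined on this slide can be computed directly from the nonzero coordinates. The sketch below assumes a rowwise partition of $A$ into submatrices $A_k$, given as a list that maps each row to its part; the function name `column_connectivity` and the small example are hypothetical, and the example matrix is not the one shown in the slide's figure.

```python
from collections import defaultdict


def column_connectivity(nonzeros, row_part):
    """Return {j: lambda(c_j)} for a rowwise partition of A.

    nonzeros : iterable of (i, j) coordinates of the nonzeros of A
    row_part : row_part[i] is the index k of the submatrix A_k that owns row i
    """
    parts_touched = defaultdict(set)
    for i, j in nonzeros:
        parts_touched[j].add(row_part[i])
    return {j: len(parts) for j, parts in parts_touched.items()}


# Hypothetical 4 x 3 pattern, rows 0-1 in A_0 and rows 2-3 in A_1.
nonzeros = [(0, 0), (1, 0), (1, 1), (2, 1), (2, 2), (3, 2)]
row_part = [0, 0, 1, 1]

lam = column_connectivity(nonzeros, row_part)
# Columns with nonzeros in more than one submatrix couple those submatrices.
coupling_columns = sorted(j for j, v in lam.items() if v > 1)
```

The row connectivity $\lambda(r_i)$ follows from the same function by swapping the roles of rows and columns, that is, passing `(j, i)` pairs together with a column-to-part list.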

  9. Parallel SpAA^T based on 1D partitioning of A and A^T matrices

     Proposed SpAA^T algorithms: Merits of Singly-Bordered Block Diagonal (SB) Form on CRp

     [Figure: CRp on column slices $C_1, \dots, C_4$ of $A$, next to sbCRp on the SB form of $A$ with diagonal blocks $A_{kk}$, border blocks $A_{Bk}$, and border subvectors $x_B$ and $y_B$.]

                             CRp                       sbCRp
     Concurrent accesses     Whole x and y vectors     Only x_B and y_B subvectors

     - Exploits temporal locality in reading $x$-vector entries in row-parallel $z = A^T x$
     - Exploits temporal locality in updating $y$-vector entries in column-parallel $y = Az$
     - Minimizing number of concurrent writes by different threads in column-parallel $y = Az$

     Objective: minimizing border size in the SB form

  10. Parallel SpAA^T based on 1D partitioning of A and A^T matrices

     Proposed SpAA^T algorithms

     sbCRp: SB-based Column-Row parallel

     Require: A_kk and A_Bk matrices; x, y, and z vectors
     for k ← 1 to K in parallel do
         z_k ← A_kk^T x_k
         z_k ← z_k + A_Bk^T x_B        (together: z_k ← C_k^T x)
         y_k ← A_kk z_k
         y_B ← y_B + A_Bk z_k          ⊲ Concurrent writes (together: y ← C_k z_k)
     end for

     [Figure: singly-bordered block-diagonal (SB) form of $A$ and $A^T$ for sbCRp, with diagonal blocks $A_{kk}$, border blocks $A_{Bk}$, and subvectors $x_k$, $x_B$, $z_k$, $y_k$, $y_B$.]

     sbRCp: SB-based Row-Column parallel

     Require: A_kk and A_kB matrices; x, y, and z vectors
     for k ← 1 to K in parallel do
         z_k ← A_kk^T x_k
         z_B ← z_B + A_kB^T x_k        ⊲ Concurrent writes (together: z ← R_k^T x_k)
         y_k ← A_kk z_k
     end for
     for k ← 1 to K in parallel do
         y_k ← y_k + A_kB z_B          (together with y_k ← A_kk z_k: y_k ← R_k z)
     end for

     [Figure: singly-bordered block-diagonal (SB) form of $A$ and $A^T$ for sbRCp, with diagonal blocks $A_{kk}$, border blocks $A_{kB}$, and subvectors $x_k$, $z_k$, $z_B$, $y_k$.]
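
The sbCRp pseudocode above can be traced with a small Python sketch. It is a serial simulation under simplifying assumptions: the blocks $A_{kk}$ and $A_{Bk}$ are held as small dense lists of lists rather than in a sparse format, the parallel loop runs sequentially, and the names `matvec`, `mat_t_vec`, and `sb_crp` are hypothetical.

```python
def matvec(M, v):
    """Dense M v."""
    return [sum(a * b for a, b in zip(row, v)) for row in M]


def mat_t_vec(M, v):
    """Dense M^T v without forming the transpose."""
    return [sum(M[i][j] * v[i] for i in range(len(M))) for j in range(len(M[0]))]


def sb_crp(A_diag, A_border, x_blocks, x_B):
    """Serial simulation of sbCRp.

    A_diag[k] is A_kk, A_border[k] is A_Bk, x_blocks[k] is x_k.
    Returns (y_blocks, y_B, z_blocks).
    """
    y_B = [0.0] * len(x_B)
    y_blocks, z_blocks = [], []
    for k in range(len(A_diag)):           # "in parallel" on the slide
        z_k = mat_t_vec(A_diag[k], x_blocks[k])            # z_k <- A_kk^T x_k
        border_term = mat_t_vec(A_border[k], x_B)
        z_k = [a + b for a, b in zip(z_k, border_term)]    # z_k <- z_k + A_Bk^T x_B
        y_blocks.append(matvec(A_diag[k], z_k))            # y_k <- A_kk z_k
        contrib = matvec(A_border[k], z_k)
        # y_B <- y_B + A_Bk z_k: the only update shared between threads.
        y_B = [a + b for a, b in zip(y_B, contrib)]
        z_blocks.append(z_k)
    return y_blocks, y_B, z_blocks


# Hypothetical SB-form matrix with K = 2:
# A = [[1, 2, 0],    <- A_11 = [[1, 2]]
#      [0, 0, 3],    <- A_22 = [[3]]
#      [1, 0, 2]]    <- border row: A_B1 = [[1, 0]], A_B2 = [[2]]
A_diag = [[[1.0, 2.0]], [[3.0]]]
A_border = [[[1.0, 0.0]], [[2.0]]]
x_blocks = [[1.0], [1.0]]
x_B = [1.0]

y_blocks, y_B, z_blocks = sb_crp(A_diag, A_border, x_blocks, x_B)
```

In this sketch each iteration writes only its own `z_k` and `y_k`; `y_B` is the single vector every iteration updates, which mirrors the "Concurrent writes" annotation in the pseudocode. A threaded version would need to protect or privatize that update.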
