

  1. Parallel Sparse Matrix-Vector and Matrix-Transpose-Vector Multiplication using Compressed Sparse Blocks. Aydın Buluç (UCSB), Jeremy T. Fineman (MIT), Matteo Frigo (Cilk Arts), John R. Gilbert (UCSB), Charles E. Leiserson (MIT & Cilk Arts)

  2. Sparse Matrix-Dense Vector Multiplication (SpMV): y = Ax and y = Aᵀx, where A is an n-by-n sparse matrix with nnz ≪ n² nonzeros.
Applications:
• Iterative methods for solving linear systems: Krylov subspace methods based on Lanczos biorthogonalization: biconjugate gradients (BCG) & quasi-minimal residual (QMR).
• Graph analysis: betweenness centrality computation.
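As a concrete reference for the operation itself, here is a minimal serial sketch of y = Ax with A stored as coordinate (COO) triples; the struct and names are illustrative, not the talk's code. It makes explicit that SpMV touches each nonzero exactly once, so the work is Θ(nnz) rather than Θ(n²).

```cpp
#include <vector>
#include <cstddef>

// One nonzero of the sparse matrix: A(row, col) = val.
struct Triple { std::size_t row, col; double val; };

// y = A*x for an n-by-n matrix given as a list of nonzero triples.
std::vector<double> spmv_coo(const std::vector<Triple>& A,
                             const std::vector<double>& x,
                             std::size_t n) {
    std::vector<double> y(n, 0.0);
    for (const Triple& t : A)          // one multiply-add per nonzero
        y[t.row] += t.val * x[t.col];
    return y;
}
```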

  3. The Landscape: Where does our work fit?
Matrix-specific optimizations (permutations, index/value compression, register blocking); hardware-specific optimizations (prefetching, TLB blocking, vectorization).
Our contribution, and our plane of focus: plenty of parallelism (for any nonzero distribution), with equally fast y = Ax and y = Aᵀx (simultaneously).

  4. Theoretical and Experimental: Main Results
Our parallel algorithms for y ← Ax and y ← Aᵀx using the new compressed sparse blocks (CSB) layout have
• Θ(nnz) work and O(√n lg n) span,
• yielding Θ(nnz / (√n lg n)) parallelism.
[Chart: MFlops/sec vs. number of processors (1-8) for our CSB algorithms, serial (naïve CSR), and Star-P (CSR + blockrow distribution).]

  5. Compressed Sparse Rows (CSR): A Standard Layout
A dense collection of “sparse rows” for an n × n matrix with nnz nonzeros: row pointers, column indices (colind), and values (data).
• Stores entries in row-major order.
• Uses n lg nnz + nnz lg n bits of index data.
• Reading rows in parallel is easy, but columns is hard.
[Figure: example row-pointer, colind, and data arrays.]
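A minimal CSR SpMV sketch (field names are ours, not the authors' released code) shows why the row-major layout makes y = Ax easy to parallelize: each row i writes only its own entry y[i], so the rows are independent.

```cpp
#include <vector>
#include <cstddef>

// CSR: the nonzeros of row i live in colind/data[rowptr[i] .. rowptr[i+1]-1].
// rowptr has n+1 entries; colind/data each have nnz entries.
struct CSR {
    std::vector<std::size_t> rowptr, colind;
    std::vector<double> data;
};

// y = A*x: the outer loop parallelizes trivially, since each i owns y[i].
std::vector<double> csr_spmv(const CSR& A, const std::vector<double>& x) {
    std::size_t n = A.rowptr.size() - 1;
    std::vector<double> y(n, 0.0);
    for (std::size_t i = 0; i < n; ++i)        // safe to run rows in parallel
        for (std::size_t k = A.rowptr[i]; k < A.rowptr[i + 1]; ++k)
            y[i] += A.data[k] * x[A.colind[k]];
    return y;
}
```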

  6. Parallelizing SpMV_T is hard using the standard CSR format
CSR_SPMV_T(A, x, y):
  for i ← 0 to n-1 do
    for k ← A.rowptr[i] to A.rowptr[i+1]-1 do
      y[A.colind[k]] ← y[A.colind[k]] + A.data[k] ∙ x[i]
1. Parallelize the outer loop? × Race conditions on vector y.
  a. Locking on y is not scalable.
  b. Using p copies of y is work-inefficient.
2. Parallelize the inner loop? × Span is Θ(n), thus parallelism is at most Θ(nnz/n).
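The slide's pseudocode can be sketched as runnable (serial) C++, with the CSR layout from the previous slide redeclared here for self-containment; struct and names are illustrative. The scattered writes to y[A.colind[k]] are exactly where two different rows may update the same output entry, which is the race that blocks a safe parallel outer loop.

```cpp
#include <vector>
#include <cstddef>

// Same CSR layout as on the previous slide (illustrative field names).
struct CSR {
    std::vector<std::size_t> rowptr, colind;
    std::vector<double> data;
};

// Serial y = A^T * x.  Unlike csr_spmv, the write target y[A.colind[k]]
// depends on the column index, so distinct rows i can race on one y entry.
std::vector<double> csr_spmv_t(const CSR& A, const std::vector<double>& x) {
    std::size_t n = A.rowptr.size() - 1;
    std::vector<double> y(n, 0.0);
    for (std::size_t i = 0; i < n; ++i)        // NOT safe to parallelize as-is
        for (std::size_t k = A.rowptr[i]; k < A.rowptr[i + 1]; ++k)
            y[A.colind[k]] += A.data[k] * x[i];
    return y;
}
```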

  7. Compressed Sparse Blocks (CSB)
A dense collection of “sparse blocks”: an n × n matrix with nnz nonzeros, tiled into β × β blocks (here β = 4), with a block pointer per block plus per-block rowind and colind offsets and data.
• Store blocks in row-major order (*).
• Store entries within a block in Z-Morton order (*).
• For β = √n, matches CSR on storage.
Reading blockrows or blockcolumns in parallel is now easy.
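A sketch of how a nonzero's global coordinate maps to a block and in-block offsets, assuming β divides n; the names are ours, not the released code. Because in-block offsets are always less than β, each one needs only lg β bits instead of lg n, which is where CSB's storage bound comes from.

```cpp
#include <cstddef>

// Block index (row-major over the (n/beta) x (n/beta) grid of blocks)
// plus the (row, col) offsets of the entry inside its block.
struct BlockCoord { std::size_t blk, r, c; };

// Map global coordinate (i, j) of an n-by-n matrix tiled into
// beta-by-beta blocks to its block and in-block offsets.
BlockCoord locate(std::size_t i, std::size_t j,
                  std::size_t n, std::size_t beta) {
    std::size_t nb = n / beta;                 // blocks per dimension
    return { (i / beta) * nb + (j / beta),     // which block
             i % beta,                         // row offset within block
             j % beta };                       // column offset within block
}
```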

  8. CSB Matrix-Vector Multiplication
Our algorithm uses three levels of parallelism:
1) Multiply each blockrow in parallel, each writing to a disjoint output subvector.
2) If a blockrow is “dense,” parallelize the blockrow multiplication.
3) If a single block is dense, parallelize the block multiplication.
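Level 1 can be sketched serially as follows (illustrative names and a flattened block layout; the real code uses Cilk++ to spawn blockrows in parallel). Each blockrow bi writes only to its own β-length slice of y, so blockrows never race with each other.

```cpp
#include <vector>
#include <cstddef>

// Simplified CSB container: blocks stored row-major, blkptr[b] indexing the
// first nonzero of block b; rowind/colind hold within-block offsets.
struct CSB {
    std::size_t n, beta;
    std::vector<std::size_t> blkptr;           // (n/beta)^2 + 1 entries
    std::vector<std::size_t> rowind, colind;   // offsets, each < beta
    std::vector<double> data;
};

std::vector<double> csb_spmv(const CSB& A, const std::vector<double>& x) {
    std::size_t nb = A.n / A.beta;
    std::vector<double> y(A.n, 0.0);
    for (std::size_t bi = 0; bi < nb; ++bi) {  // cilk_for in the real code
        double* ysub = &y[bi * A.beta];        // disjoint output slice
        for (std::size_t bj = 0; bj < nb; ++bj) {
            std::size_t b = bi * nb + bj;
            const double* xsub = &x[bj * A.beta];
            for (std::size_t k = A.blkptr[b]; k < A.blkptr[b + 1]; ++k)
                ysub[A.rowind[k]] += A.data[k] * xsub[A.colind[k]];
        }
    }
    return y;
}
```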

  9. Blockrow-Vector Multiplication
Recurse in parallel until the pieces are small, then sum the results.
• Divide-and-conquer based on the nonzero count, not spatial position.
• Allocation & accumulation costs of temporary vectors are amortized.
• Lemma: For β = √n, our parallel blockrow-vector multiplication has Θ(r) work and O(√n lg n) span on a blockrow containing r nonzeros.
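The nonzero-count split can be sketched as follows (illustrative, not the authors' code): given the per-block nonzero counts of one blockrow, choose the split point that most evenly halves the nonzeros. The recursion then multiplies the two halves in parallel into temporary vectors and sums them, which is what keeps the work balanced at Θ(r) regardless of how the nonzeros are distributed across blocks.

```cpp
#include <vector>
#include <cstddef>
#include <numeric>

// Return the index i such that blocks [0, i) go to the left recursive call
// and blocks [i, end) go to the right, splitting by nonzero count: the left
// side takes as many whole blocks as fit within half the total nonzeros.
std::size_t split_by_nnz(const std::vector<std::size_t>& blocknnz) {
    std::size_t total =
        std::accumulate(blocknnz.begin(), blocknnz.end(), std::size_t{0});
    std::size_t half = 0, i = 0;
    while (i < blocknnz.size() && half + blocknnz[i] <= total / 2)
        half += blocknnz[i++];
    return i;
}
```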

  10. Block-Vector Multiplication
For any (sub)block with quadrants A00, A01, A10, A11: first perform A00 and A11 in parallel; then A01 and A10 in parallel. Updates on y are race-free.
• With Z-Morton ordering, spatial division into quadrants of a dim × dim (sub)block takes O(lg dim) time using three binary searches.
• Lemma: For β = √n, our parallel block-vector multiplication has Θ(r) work and O(√n) span on a block with r nonzeros.

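The quadrant division can be sketched as follows (illustrative names, assuming entries are stored sorted by Z-Morton code): because Z-Morton order lists quadrants A00, A01, A10, A11 contiguously, three binary searches on the codes locate the quadrant boundaries.

```cpp
#include <vector>
#include <cstddef>
#include <cstdint>
#include <algorithm>

// Interleave row/col bits into a Z-Morton code (16-bit offsets for brevity):
// row bits go to odd positions, column bits to even positions.
std::uint32_t morton(std::uint16_t r, std::uint16_t c) {
    std::uint32_t z = 0;
    for (int b = 0; b < 16; ++b)
        z |= ((std::uint32_t)((r >> b) & 1) << (2 * b + 1))
           | ((std::uint32_t)((c >> b) & 1) << (2 * b));
    return z;
}

// Starting indices of quadrants A01, A10, A11 in the sorted code array
// (A00 always starts at index 0).
struct Quadrants { std::size_t q01, q10, q11; };

Quadrants quadrant_bounds(const std::vector<std::uint32_t>& codes,
                          std::uint16_t dim) {
    // First possible code of each of the three later quadrants.
    std::uint32_t c01 = morton(0, dim / 2);
    std::uint32_t c10 = morton(dim / 2, 0);
    std::uint32_t c11 = morton(dim / 2, dim / 2);
    auto lb = [&](std::uint32_t v) {
        return (std::size_t)(std::lower_bound(codes.begin(), codes.end(), v)
                             - codes.begin());
    };
    return { lb(c01), lb(c10), lb(c11) };      // three binary searches
}
```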

  13. Main Theorem and The Choice of β
Theorem: Our parallel matrix-vector multiplication has Θ(n²/β² + nnz) work and O(β lg(n/β) + n/β) span on an n × n CSB matrix containing nnz nonzeros.
For β = √n, this yields a parallelism of Θ(nnz / (√n lg n)).
On our test matrices, parallelism ranges from 186 to 3498.
[Charts: sensitivity to β in theory and in practice.]

  14. Ax and Aᵀx perform equally well
[Charts: MFlops/sec for the matrix-vector product and the matrix-transpose-vector product on 1, 2, 4, and 8 processors.]
Machine: 4-socket dual-core 2.2 GHz AMD Opteron 8214.
Most test matrices had similar performance when multiplying by the matrix and its transpose.

  15. Test Matrices and Performance Overview
[Chart: MFlops/sec vs. number of processors (1-8) for CSB_SpMV, CSB_SpMV_T, Star-P_SpMV, Star-P_SpMV_T, serial CSR_SpMV, and serial CSR_SpMV_T.]

  16. Reality Check and Related Work
FACT: Sparse matrix-dense vector multiplication (and the transpose) is bandwidth-limited.
This work is motivated by multicore/manycore architectures, where parallelism and memory bandwidth are key resources.
• Previous work mostly focused on reducing communication volume in distributed memory, often by using graph or hypergraph partitioning [Catalyurek & Aykanat ’99].
• Great optimization work for SpMV on multicores by Williams et al. ’09, but without parallelism guarantees or multiplication with the transpose (SpMV_T).
• Blocking for sparse matrices is not new, but mostly for cache performance, not parallelism [Im et al. ’04, Nishtala et al. ’07].

  17. Good Speedup Until Bandwidth Limit
• Slowed down processors (artificially introduced extra instructions) for this test to hide memory issues.
• Shows the algorithm scales well given sufficient memory bandwidth.
[Chart: speedup vs. number of processors, 0-16.]
Ran on the smallest (and one of the most irregular) test matrices.

  18. All about Bandwidth: Harpertown vs. Nehalem
Harpertown: Intel Xeon X5460 @ 3.16 GHz, dual-socket quad-core, FSB @ 1333 MHz.
Nehalem: Intel Core i7 920 @ 2.66 GHz, single-socket quad-core, QuickPath + Hyper-Threading.

  19. Conclusions & Future Work
• CSB allows for efficient multiplication of a sparse matrix and its transpose by a dense vector.
Future work:
• Does CSB work well with other computations? Sparse LU decomposition? Sparse matrix-matrix multiplication?
• For a symmetric matrix, we need only store the upper triangle. Can we multiply with one read (i.e., ½ the bandwidth)?
Code (in C++ and Cilk++) available from: http://gauss.cs.ucsb.edu/~aydin/software.html

  20. Thank You!

  21. CSB Space Usage
Lemma: For β = √n, CSB uses n lg nnz + nnz lg n bits of index data, matching CSR.
Proof:
• (n/β)² = n blocks ⇒ n block pointers.
• Each block pointer uses lg nnz bits, for n lg nnz total.
• Each row (or column) offset requires lg β = (lg n)/2 bits, so rowind and colind each take (nnz/2) lg n bits, for nnz lg n total.
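The lemma's arithmetic can be checked numerically (helper names are ours, for illustration): with β = √n, each of the nnz row and column offsets needs lg √n = (lg n)/2 bits, so the offsets total nnz lg n bits, the same as CSR's column indices.

```cpp
#include <cmath>

// Bits of index data for CSB with beta = sqrt(n):
// n block pointers of lg(nnz) bits each, plus nnz (row, col) offset pairs
// of 2 * lg(sqrt(n)) = lg(n) bits each.
double csb_index_bits(double n, double nnz) {
    double block_ptr_bits = n * std::log2(nnz);
    double offset_bits = nnz * 2.0 * std::log2(std::sqrt(n));
    return block_ptr_bits + offset_bits;
}

// Bits of index data for CSR: n row pointers of lg(nnz) bits each,
// plus nnz column indices of lg(n) bits each.
double csr_index_bits(double n, double nnz) {
    return n * std::log2(nnz) + nnz * std::log2(n);
}
```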
