Parallel Sparse Matrix-Vector and Matrix-Transpose-Vector Multiplication using Compressed Sparse Blocks
Aydın Buluç (UCSB), Jeremy T. Fineman (MIT), Matteo Frigo (Cilk Arts), John R. Gilbert (UCSB), Charles E. Leiserson (MIT & Cilk Arts)
Sparse Matrix-Dense Vector Multiplication (SpMV)
Applications:
- Iterative methods for solving linear systems: Krylov subspace methods based on Lanczos biorthogonalization: biconjugate gradients (BCG) & quasi-minimal residual (QMR)
- Graph analysis: Betweenness centrality computation
y ← Ax and y ← A^T x, where A is an n×n sparse matrix with nnz ≪ n² nonzeros.
The Landscape: Where does our work fit?
- Equally fast y = Ax and y = A^T x (simultaneously)
- Plenty of parallelism (for any nonzero distribution)
- Hardware-specific optimizations (prefetching, TLB blocking, vectorization)
- Matrix-specific optimizations (permutations, index/value compression, register blocking)
This is our plane of focus!
Our Contribution
Theoretical and Experimental: Main Results
Our parallel algorithms for y ← Ax and y ← A^T x using the new compressed sparse blocks (CSB) layout have
- Θ(nnz) work,
- O(√n lg n) span,
- yielding Θ(nnz / (√n lg n)) parallelism.
[Figure: MFlops/sec vs. number of processors (1-8) for our CSB algorithms, Star-P (CSR + blockrow distribution), and serial naïve CSR.]
Compressed Sparse Rows (CSR):
A Standard Layout
- Stores entries in row-major order: a dense collection of "sparse rows" for an n×n matrix with nnz nonzeros.
- Uses n lg nnz + nnz lg n bits of index data.
- Reading rows in parallel is easy, but reading columns is hard.
[Figure: CSR example with row-pointer, column-index, and data arrays.]
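A minimal sketch of the CSR layout and serial SpMV: the array names `rowptr`, `colind`, and `data` follow the slide's pseudocode, but the struct itself and the function name are ours, not the paper's code.

```cpp
#include <vector>
#include <cstddef>

// Illustrative CSR container (names follow the slide's pseudocode).
struct CSR {
    std::size_t n;                    // matrix is n-by-n
    std::vector<std::size_t> rowptr;  // length n+1
    std::vector<std::size_t> colind;  // length nnz
    std::vector<double> data;         // length nnz
};

// Serial y = A*x. Rows are independent, so the outer loop could be
// parallelized directly (e.g., with cilk_for) without races on y.
std::vector<double> spmv(const CSR& A, const std::vector<double>& x) {
    std::vector<double> y(A.n, 0.0);
    for (std::size_t i = 0; i < A.n; ++i)
        for (std::size_t k = A.rowptr[i]; k < A.rowptr[i + 1]; ++k)
            y[i] += A.data[k] * x[A.colind[k]];
    return y;
}
```

This row-parallel ease is exactly what breaks for the transpose product, as the next slide shows.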
Parallelizing SpMV_T is hard using the standard CSR format
CSR_SPMV_T(A, x, y)
  for i ← 0 to n−1 do
    for k ← A.rowptr[i] to A.rowptr[i+1]−1 do
      y[A.colind[k]] ← y[A.colind[k]] + A.data[k] ∙ x[i]
1. Parallelize the outer loop?
   × Race conditions on vector y.
     a. Locking on y is not scalable.
     b. Using p copies of y is work-inefficient.
2. Parallelize the inner loop?
   × Span is Θ(n), thus parallelism is at most O(nnz/n).
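To make point 1b concrete, here is a sketch (ours, not the paper's code) of the serial transpose product and the p-copies workaround: each worker scatters into a private vector, and the copies are summed at the end, costing Θ(p·n) extra work.

```cpp
#include <vector>
#include <cstddef>

struct CSRT {                         // illustrative CSR container
    std::size_t n;
    std::vector<std::size_t> rowptr, colind;
    std::vector<double> data;
};

// y = A^T * x. The scatter to y[colind[k]] is what races if the
// outer loop over i is parallelized naively.
std::vector<double> spmv_t(const CSRT& A, const std::vector<double>& x) {
    std::vector<double> y(A.n, 0.0);
    for (std::size_t i = 0; i < A.n; ++i)
        for (std::size_t k = A.rowptr[i]; k < A.rowptr[i + 1]; ++k)
            y[A.colind[k]] += A.data[k] * x[i];
    return y;
}

// Workaround 1b: p private copies of y, summed afterwards. Correct
// but work-inefficient: Theta(p*n) extra allocation and reduction.
std::vector<double> spmv_t_pcopies(const CSRT& A,
                                   const std::vector<double>& x,
                                   std::size_t p) {
    std::vector<std::vector<double>> yp(p, std::vector<double>(A.n, 0.0));
    for (std::size_t i = 0; i < A.n; ++i) {      // rows would be split
        std::size_t w = i % p;                   // among p workers
        for (std::size_t k = A.rowptr[i]; k < A.rowptr[i + 1]; ++k)
            yp[w][A.colind[k]] += A.data[k] * x[i];
    }
    std::vector<double> y(A.n, 0.0);
    for (std::size_t w = 0; w < p; ++w)          // Theta(p*n) reduction
        for (std::size_t j = 0; j < A.n; ++j)
            y[j] += yp[w][j];
    return y;
}
```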
Compressed Sparse Blocks (CSB)
- Store blocks in row-major order.
- Store entries within each block in Z-Morton order.
- For β = Θ(√n), CSB matches CSR on storage.
- Reading blockrows or blockcolumns in parallel is now easy.
A dense collection of "sparse blocks": an n×n matrix with nnz nonzeros, in β×β blocks.
[Figure: CSB example with block pointers and per-entry row/column offsets within each block.]
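A sketch of what a CSB container might look like; the field names are ours, not the paper's. The key point is that row/column indices are offsets *within* a β×β block, so each needs only lg β bits.

```cpp
#include <cstdint>
#include <cstddef>
#include <utility>
#include <vector>

// Illustrative CSB container. Blocks are beta-by-beta, stored in
// row-major order of blocks; entries within a block appear in
// Z-Morton order.
struct CSB {
    std::size_t n, beta;               // beta ~ sqrt(n) in the paper
    std::vector<std::size_t> blkptr;   // one pointer per block, plus 1
    std::vector<std::uint16_t> rowind; // row offset within its block
    std::vector<std::uint16_t> colind; // column offset within its block
    std::vector<double> data;
};

// Recover global coordinates of entry k, given that it lives in
// block (bi, bj) of the block grid.
std::pair<std::size_t, std::size_t>
global_coords(const CSB& A, std::size_t bi, std::size_t bj, std::size_t k) {
    return {bi * A.beta + A.rowind[k], bj * A.beta + A.colind[k]};
}
```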
CSB Matrix-Vector Multiplication
Our algorithm uses three levels of parallelism:
1) Multiply each blockrow in parallel, each writing to a disjoint output subvector.
2) If a blockrow is "dense," parallelize the blockrow multiplication.
3) If a single block is dense, parallelize the block multiplication.
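Level 1 can be sketched as follows. This is our simplification, not the paper's code: each blockrow stores plain (row-offset, column, value) triples rather than the full CSB index structure. Blockrow i writes only y[i·β .. (i+1)·β), so the outer loop is race-free and could be a cilk_for.

```cpp
#include <cstddef>
#include <vector>

struct Entry { std::size_t rowoff, col; double val; };  // simplified

// y = A*x where blockrows[i] holds the nonzeros of blockrow i, with
// row offsets relative to i*beta. Each outer iteration writes a
// disjoint subvector of y, so it parallelizes without races
// (level 1); levels 2 and 3 would further split dense blockrows
// and dense blocks.
std::vector<double> csb_spmv_level1(
    const std::vector<std::vector<Entry>>& blockrows,
    std::size_t beta, const std::vector<double>& x) {
    std::vector<double> y(blockrows.size() * beta, 0.0);
    for (std::size_t i = 0; i < blockrows.size(); ++i)  // parallel-safe
        for (const Entry& e : blockrows[i])
            y[i * beta + e.rowoff] += e.val * x[e.col];
    return y;
}
```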
Blockrow-Vector Multiplication
- Divide and conquer based on the nonzero count, not on spatial position: recurse in parallel, then sum the results.
- Allocation and accumulation costs of temporary vectors are amortized.
- Lemma: For β = Θ(√n), our parallel blockrow-vector multiplication has Θ(r) work and O(√n lg n) span on a blockrow containing r nonzeros.
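A serial sketch of the divide and conquer (ours, under the simplifying assumption that the blockrow's nonzeros are plain triples): split the entry list in half by count, recurse, and sum a temporary vector for the second half. In the real parallel code the two halves would be spawned concurrently; the temporary makes the second recursion race-free.

```cpp
#include <cstddef>
#include <vector>

struct NZ { std::size_t row, col; double val; };  // simplified triples

// Multiply entries e[lo, hi) against x, accumulating into y.
// Splitting on the nonzero count (not on columns) keeps the two
// halves balanced regardless of the nonzero distribution. The two
// recursive calls would run in parallel, the second into a temporary
// vector that is then summed into y (amortized accumulation).
void blockrow_mult(const std::vector<NZ>& e, std::size_t lo, std::size_t hi,
                   const std::vector<double>& x, std::vector<double>& y) {
    if (hi - lo <= 2) {                      // small case: serial
        for (std::size_t k = lo; k < hi; ++k)
            y[e[k].row] += e[k].val * x[e[k].col];
        return;
    }
    std::size_t mid = lo + (hi - lo) / 2;    // split by nonzero count
    std::vector<double> tmp(y.size(), 0.0);  // temporary for 2nd half
    blockrow_mult(e, lo, mid, x, y);         // these two calls would
    blockrow_mult(e, mid, hi, x, tmp);       // be spawned in parallel
    for (std::size_t j = 0; j < y.size(); ++j)
        y[j] += tmp[j];
}
```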
Block-Vector Multiplication
- With Z-Morton ordering, spatial division into quadrants takes O(lg dim) time on a dim×dim (sub)block, using three binary searches.
- For any (sub)block, first perform A00 and A11 in parallel; then A01 and A10 in parallel. Updates on y are race-free.
- Lemma: For β = Θ(√n), our parallel block-vector multiplication has Θ(r) work and O(√n) span on a block with r nonzeros.
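Why binary search suffices: in Z-Morton order (with the row bit taken as more significant, an assumption of this sketch), all of A00's entries precede A01's, which precede A10's, then A11's. The three quadrant boundaries are therefore split points of a sorted-by-quadrant sequence. The helper names below are ours.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

struct Off { std::size_t r, c; };  // offsets within a dim-by-dim block

// Quadrant of an entry: 0=A00, 1=A01, 2=A10, 3=A11 (row bit high).
std::size_t quadrant(const Off& e, std::size_t dim) {
    return 2 * (e.r >= dim / 2) + (e.c >= dim / 2);
}

// Entries in Z-Morton order are grouped by quadrant, so the three
// boundaries are found with binary searches (O(lg r) each) instead
// of a scan -- the cheap spatial split the slide mentions.
std::vector<std::size_t> quadrant_splits(const std::vector<Off>& e,
                                         std::size_t dim) {
    std::vector<std::size_t> s(3);
    for (std::size_t q = 1; q <= 3; ++q)
        s[q - 1] = std::partition_point(e.begin(), e.end(),
                       [&](const Off& o) { return quadrant(o, dim) < q; })
                   - e.begin();
    return s;
}
```

Given the splits, A00 and A11 touch disjoint halves of the output, so they can run in parallel; likewise A01 and A10 afterwards, keeping updates on y race-free.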
Main Theorem and the Choice of β
Theorem: Our parallel matrix-vector multiplication has Θ(n²/β² + nnz) work and O(√n lg n) span on an n×n CSB matrix containing nnz nonzeros.
For β = Θ(√n) and nnz ≥ n, this yields a parallelism of Θ(nnz / (√n lg n)). On our test matrices, parallelism ranges from 186 to 3498.
[Figure: sensitivity to β, in theory and in practice.]
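The parallelism figure is the standard work-over-span ratio; spelled out for the β = Θ(√n), nnz ≥ n case:

```latex
% Work and span with beta = Theta(sqrt(n)) and nnz >= n:
%   T_1      = Theta(nnz)           (work)
%   T_\infty = O(sqrt(n) lg n)      (span)
\[
  \text{parallelism} \;=\; \frac{T_1}{T_\infty}
  \;=\; \Theta\!\left(\frac{\mathrm{nnz}}{\sqrt{n}\,\lg n}\right)
\]
```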
Ax and A^T x perform equally well
4-socket dual-core 2.2 GHz AMD Opteron 8214. Most test matrices had similar performance when multiplying by the matrix and by its transpose.
[Figure: MFlops/sec for the matrix-vector and matrix-transpose-vector products on 1, 2, 4, and 8 processors.]
[Figure: MFlops/sec vs. processors (1-8) for CSB_SpMV, CSB_SpMV_T, Star-P_SpMV, Star-P_SpMV_T, serial CSR_SpMV, and serial CSR_SpMV_T.]
Test Matrices and Performance Overview
Reality Check and Related Work
FACT: Sparse matrix-dense vector multiplication (and its transpose) is bandwidth-limited. This work is motivated by multicore/manycore architectures, where parallelism and memory bandwidth are key resources.
- Previous work mostly focused on reducing communication volume in distributed memory, often by using graph or hypergraph partitioning [Çatalyürek & Aykanat '99].
- Great optimization work for SpMV on multicores by Williams et al. '09, but without parallelism guarantees or multiplication with the transpose (SpMV_T).
- Blocking for sparse matrices is not new, but it mostly targets cache performance, not parallelism [Im et al. '04, Nishtala et al. '07].
Good Speedup Until Bandwidth Limit
- Slowed down the processors (by artificially introducing extra instructions) to factor out memory-bandwidth effects.
- Shows the algorithm scales well given sufficient memory bandwidth.
- Ran on the smallest (and one of the most irregular) test matrices.
[Figure: speedup vs. number of processors, up to 16.]
All about Bandwidth: Harpertown vs. Nehalem
- Harpertown: Intel Xeon X5460 @ 3.16 GHz, dual-socket quad-core, FSB @ 1333 MHz.
- Nehalem: Intel Core i7 920 @ 2.66 GHz, single-socket quad-core, QuickPath + Hyper-Threading.
Conclusions & Future Work
- CSB allows for efficient multiplication of a sparse matrix and its transpose by a dense vector.
Future Work:
- Does CSB work well with other computations? Sparse LU decomposition? Sparse matrix-matrix multiplication?
- For a symmetric matrix, we need only store the upper triangle. Can we multiply with one read of it (i.e., half the bandwidth)?
Code (in C++ and Cilk++) available from:
http://gauss.cs.ucsb.edu/∼aydin/software.html
Thank You !
CSB Space Usage
Lemma: For β = Θ(√n), CSB uses n lg nnz + nnz lg n bits of index data, matching CSR.
Proof:
- (n/β) × (n/β) = n blocks ⇒ n block pointers.
- Each block pointer uses lg nnz bits, for a total of n lg nnz bits.
- Each row (or column) offset within a β×β block uses lg β = ½ lg n bits, so the nnz entries use 2 · nnz · ½ lg n = nnz lg n bits in total. ∎
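The tally in the proof, written out (our arithmetic, following the slide's outline):

```latex
\[
  \underbrace{n \lg \mathrm{nnz}}_{\text{block pointers}}
  \;+\; \underbrace{\mathrm{nnz} \cdot 2\lg\beta}_{\text{row and column offsets}}
  \;=\; n \lg \mathrm{nnz} + \mathrm{nnz} \cdot 2 \cdot \tfrac{1}{2}\lg n
  \;=\; n \lg \mathrm{nnz} + \mathrm{nnz}\lg n
  \quad (\beta = \Theta(\sqrt{n})),
\]
% matching CSR's n lg(nnz) row-pointer bits plus nnz lg(n) column-index bits.
```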