SLIDE 1
Parallel Sparse Matrix-Vector and Matrix-Transpose-Vector Multiplication using Compressed Sparse Blocks

Aydın Buluç, UCSB

Jeremy T. Fineman (MIT), Matteo Frigo (Cilk Arts), John R. Gilbert (UCSB), Charles E. Leiserson (MIT & Cilk Arts)

SLIDE 2

Sparse Matrix-Dense Vector Multiplication (SpMV)

Applications:

  • Iterative methods for solving linear systems: Krylov subspace methods based on Lanczos biorthogonalization, such as biconjugate gradients (BCG) and quasi-minimal residual (QMR)
  • Graph analysis: betweenness centrality computation

y ← Ax and y ← Aᵀx

A is an n×n sparse matrix with nnz ≪ n² nonzeros.
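As a concrete baseline (not from the talk), both products can be formed from the same nonzero list; a minimal C++ sketch using a hypothetical triplet (COO) type:

```cpp
#include <cassert>
#include <vector>

// Hypothetical triplet (COO) type for illustration: each nonzero (i, j, v)
// contributes v*x[j] to y[i] for y = A*x, and v*x[i] to y[j] for y = A^T*x.
struct Triplet { int i, j; double v; };

std::vector<double> spmv_coo(const std::vector<Triplet>& A,
                             const std::vector<double>& x,
                             int n, bool transpose) {
    std::vector<double> y(n, 0.0);
    for (const Triplet& t : A) {
        if (transpose) y[t.j] += t.v * x[t.i];  // y <- A^T x
        else           y[t.i] += t.v * x[t.j];  // y <- A x
    }
    return y;
}
```

The challenge addressed by the talk is doing both directions fast *in parallel*, not just correctly.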

SLIDE 3

The Landscape: Where does our work fit?

Our contribution: equally fast y = Ax and y = Aᵀx (simultaneously), with plenty of parallelism (for any nonzero distribution).

Complementary work targets hardware-specific optimizations (prefetching, TLB blocking, vectorization) and matrix-specific optimizations (permutations, index/value compression, register blocking).

This is our plane of focus!

SLIDE 4

Theoretical and Experimental: Main Results

Our parallel algorithms for y ← Ax and y ← Aᵀx using the new compressed sparse blocks (CSB) layout have Θ(nnz) work and O(√n lg n) span, yielding Θ(nnz / (√n lg n)) parallelism.

[Figure: MFlops/sec vs. processors (1–8) for our CSB algorithms, Star-P (CSR + blockrow distribution), and serial naïve CSR.]

SLIDE 5

Compressed Sparse Rows (CSR): A Standard Layout

  • Stores entries in row-major order.
  • Uses n lg nnz + nnz lg n bits of index data.
  • Reading rows in parallel is easy, but reading columns is hard.

[Figure: an n×n matrix with nnz nonzeroes stored as a dense collection of "sparse rows": row pointers index into the colind and data arrays.]
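A minimal serial CSR SpMV sketch (illustrative names, not the paper's code), showing why rows parallelize trivially: row i writes only y[i]:

```cpp
#include <cassert>
#include <vector>

// Minimal CSR sketch (illustrative names): rowptr has n+1 entries;
// colind/data hold the nonzeros in row-major order.
struct Csr {
    int n;
    std::vector<int> rowptr, colind;
    std::vector<double> data;
};

// y = A*x: row i touches only y[i], so the outer loop parallelizes trivially.
std::vector<double> csr_spmv(const Csr& A, const std::vector<double>& x) {
    std::vector<double> y(A.n, 0.0);
    for (int i = 0; i < A.n; ++i)
        for (int k = A.rowptr[i]; k < A.rowptr[i + 1]; ++k)
            y[i] += A.data[k] * x[A.colind[k]];
    return y;
}
```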

SLIDE 6

Parallelizing SpMV_T is hard using the standard CSR format

CSR_SPMV_T(A, x, y)
  for i ← 0 to n−1 do
    for k ← A.rowptr[i] to A.rowptr[i+1]−1 do
      y[A.colind[k]] ← y[A.colind[k]] + A.data[k]·x[i]

  1. Parallelize the outer loop?
     × Race conditions on vector y.
       a. Locking on y is not scalable.
       b. Using p copies of y is work-inefficient.

  2. Parallelize the inner loop?
     × Span is Θ(n), thus parallelism is at most Θ(nnz/n).
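The pseudocode above translates to the following serial C++ sketch (illustrative, not the paper's code); note the scattered writes to y[colind[k]] that cause the races just discussed:

```cpp
#include <cassert>
#include <vector>

// Serial y = A^T * x over CSR (illustrative sketch matching the pseudocode).
// Each nonzero in row i scatters into y[colind[k]], so different rows may
// update the same y entry: parallelizing the outer loop races on y.
std::vector<double> csr_spmv_t(int n, const std::vector<int>& rowptr,
                               const std::vector<int>& colind,
                               const std::vector<double>& data,
                               const std::vector<double>& x) {
    std::vector<double> y(n, 0.0);
    for (int i = 0; i < n; ++i)
        for (int k = rowptr[i]; k < rowptr[i + 1]; ++k)
            y[colind[k]] += data[k] * x[i];   // scattered write: the race source
    return y;
}
```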

SLIDE 7

Compressed Sparse Blocks (CSB)

  • Stores blocks in row-major order (*)
  • Stores entries within a block in Z-Morton order (*)
  • For β = Θ(√n), CSB matches CSR on storage.

Reading blockrows or blockcolumns in parallel is now easy.

[Figure: an n×n matrix with nnz nonzeroes, partitioned into β×β blocks (here β = 4), as a dense collection of "sparse blocks": block pointers index into the rowind, colind, and data arrays, whose entries are offsets within a block.]
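A hypothetical CSB container might look like the following sketch (plain ints stand in for the lg β-bit offsets a real implementation would pack):

```cpp
#include <cassert>
#include <vector>

// Hypothetical CSB container sketch: an n×n matrix is cut into (n/beta)^2
// beta×beta blocks. blk_ptr indexes each block's nonzeros (blocks stored in
// row-major order); rowind/colind hold offsets *within* a block, which is
// why a real implementation needs only lg(beta) bits each (plain ints here).
struct Csb {
    int n, beta;                     // matrix and block dimensions
    std::vector<int> blk_ptr;        // (n/beta)^2 + 1 entries
    std::vector<int> rowind, colind; // per-nonzero offsets within a block
    std::vector<double> data;
};

// Nonzero count of block (br, bc): a contiguous range, found in O(1).
int block_nnz(const Csb& A, int br, int bc) {
    int nb = A.n / A.beta;           // blocks per dimension
    int b = br * nb + bc;            // row-major block index
    return A.blk_ptr[b + 1] - A.blk_ptr[b];
}
```

Because each block (and each blockrow or blockcolumn) is a contiguous range of the nonzero arrays, either can be handed to a parallel worker directly.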

SLIDE 8

CSB Matrix-Vector Multiplication

Our algorithm uses three levels of parallelism:

  1. Multiply each blockrow in parallel, each writing to a disjoint output subvector.
  2. If a blockrow is "dense," parallelize the blockrow multiplication.
  3. If a single block is dense, parallelize the block multiplication.
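Level 1 can be sketched as follows, using std::thread where the paper's code uses Cilk++, and (as a simplification) a dense matrix in place of sparse blocks; the point is that the per-blockrow output slices are disjoint, so the loop is race-free:

```cpp
#include <cassert>
#include <thread>
#include <vector>

// Level-1 sketch: blockrow br owns the disjoint slice y[br*beta, (br+1)*beta),
// so the blockrow loop is race-free. std::thread stands in for the Cilk++
// parallel loop, and (as a simplification) A is dense rather than CSB.
void multiply_blockrows(const std::vector<std::vector<double>>& A,
                        const std::vector<double>& x,
                        std::vector<double>& y, int beta) {
    int n = (int)x.size();
    std::vector<std::thread> workers;
    for (int br = 0; br < n / beta; ++br)
        workers.emplace_back([&, br] {
            for (int i = br * beta; i < (br + 1) * beta; ++i)  // rows of this blockrow
                for (int j = 0; j < n; ++j)
                    y[i] += A[i][j] * x[j];                    // writes stay in the slice
        });
    for (std::thread& t : workers) t.join();
}
```

Levels 2 and 3 then subdivide *within* a blockrow or block, which is where the care about races and temporaries on the next slides comes in.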

SLIDE 9

[Figure: the blockrow is split recursively; the halves recurse in parallel, then the results are summed.]

Blockrow-Vector Multiplication

  • Divide-and-conquer based on the nonzero count, not on spatial position.
  • Allocation and accumulation costs of temporary vectors are amortized.
  • Lemma: For β = Θ(√n), our parallel blockrow-vector multiplication has Θ(r) work and O(√n lg n) span on a blockrow containing r nonzeroes.
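The divide-and-conquer can be sketched as below (serial and simplified: the real algorithm recurses in parallel and splits at block boundaries; here the nonzero list of one blockrow is split at its midpoint by count, with a hypothetical Nz type):

```cpp
#include <cassert>
#include <vector>

// Hypothetical Nz type: (row, col) are a nonzero's coordinates within the
// blockrow (row < beta).
struct Nz { int row, col; double val; };

// Serial sketch of the divide-and-conquer over nz[lo, hi).
std::vector<double> blockrow_dc(const std::vector<Nz>& nz, int lo, int hi,
                                const std::vector<double>& x, int beta) {
    std::vector<double> y(beta, 0.0);
    if (hi - lo <= 2) {              // small case: multiply directly
        for (int k = lo; k < hi; ++k)
            y[nz[k].row] += nz[k].val * x[nz[k].col];
        return y;
    }
    int mid = lo + (hi - lo) / 2;    // split by nonzero count, not position
    // The two halves would run in parallel, each into its own temporary...
    std::vector<double> left  = blockrow_dc(nz, lo, mid, x, beta);
    std::vector<double> right = blockrow_dc(nz, mid, hi, x, beta);
    // ...whose allocation/accumulation cost is amortized against the work.
    for (int i = 0; i < beta; ++i) y[i] = left[i] + right[i];
    return y;
}
```

Splitting by count (rather than spatially) is what keeps the work balanced for any nonzero distribution.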

SLIDE 10

Block-Vector Multiplication

  • With Z-Morton ordering, spatial division into quadrants A00, A01, A10, A11 takes O(lg dim) time on a dim×dim (sub)block using three binary searches.
  • Lemma: For β = Θ(√n), our parallel block-vector multiplication has Θ(r) work and O(√n) span on a block with r nonzeroes.

For any (sub)block, first perform A00 and A11 in parallel; then A01 and A10 in parallel. Updates on y are race-free.
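The binary-search step can be sketched as follows (toy 16-bit coordinates, hypothetical names): nonzeros sorted by Z-Morton code fall into four contiguous quadrant ranges, and three lower_bound calls recover the boundaries:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Interleave the bits of (row, col) into a Z-Morton code (toy 16-bit coords).
uint32_t morton(uint16_t row, uint16_t col) {
    uint32_t m = 0;
    for (int b = 0; b < 16; ++b)
        m |= ((uint32_t)((row >> b) & 1) << (2 * b + 1)) |
             ((uint32_t)((col >> b) & 1) << (2 * b));
    return m;
}

// In a dim×dim block (dim a power of two) whose nonzeros are sorted by
// Morton code, the quadrants A00, A01, A10, A11 occupy contiguous ranges of
// dim^2/4 code values each; three binary searches find the split points.
std::vector<std::size_t> quadrant_splits(const std::vector<uint32_t>& codes,
                                         uint32_t dim) {
    uint32_t q = dim * dim / 4;      // Morton codes per quadrant
    std::vector<std::size_t> s;
    for (uint32_t b = 1; b <= 3; ++b)
        s.push_back(std::lower_bound(codes.begin(), codes.end(), b * q)
                    - codes.begin());
    return s;
}
```

With the splits in hand, A00/A11 and then A01/A10 can be multiplied in parallel without touching the same rows of y.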


SLIDE 13

Main Theorem and the Choice of β

Theorem: Our parallel matrix-vector multiplication has O(n²/β² + nnz) work and O(β lg(n/β) + n/β) span on an n×n CSB matrix containing nnz nonzeroes.

For β = Θ(√n) and nnz ≥ n, this yields a parallelism of Θ(nnz / (√n lg n)). On our test matrices, parallelism ranges from 186 to 3498.

[Figure: sensitivity to β in theory vs. sensitivity to β in practice.]
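Substituting β = Θ(√n) into the theorem (assuming nnz ≥ n) recovers the bounds from the results slide:

```latex
W = O\!\left(\tfrac{n^2}{\beta^2} + nnz\right) = O(n + nnz) = O(nnz),
\qquad
S = O\!\left(\beta \lg\tfrac{n}{\beta} + \tfrac{n}{\beta}\right)
  = O(\sqrt{n}\,\lg n),
\qquad
\frac{W}{S} = \Theta\!\left(\frac{nnz}{\sqrt{n}\,\lg n}\right).
```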

SLIDE 14

Ax and Aᵀx perform equally well

Platform: 4-socket dual-core 2.2 GHz AMD Opteron 8214. Most test matrices had similar performance when multiplying by the matrix and by its transpose.

[Figure: MFlops/sec for the matrix-vector product and the matrix-transpose-vector product on 1, 2, 4, and 8 processors.]

SLIDE 15

[Figure: MFlops/sec vs. processors (1–8) for CSB_SpMV, CSB_SpMV_T, Star-P_SpMV, Star-P_SpMV_T, serial CSR_SpMV, and serial CSR_SpMV_T.]

Test Matrices and Performance Overview

SLIDE 16

Reality Check and Related Work

FACT: Sparse matrix-dense vector multiplication (and its transpose) is bandwidth-limited. This work is motivated by multicore/manycore architectures, where parallelism and memory bandwidth are key resources.

  • Previous work mostly focused on reducing communication volume in distributed memory, often by using graph or hypergraph partitioning [Çatalyürek & Aykanat '99].
  • Great optimization work for SpMV on multicores by Williams et al. '09, but without parallelism guarantees or multiplication with the transpose (SpMV_T).
  • Blocking for sparse matrices is not new, but it mostly targets cache performance, not parallelism [Im et al. '04, Nishtala et al. '07].

SLIDE 17

Good Speedup Until Bandwidth Limit

  • We artificially slowed down the processors (by introducing extra instructions) for this test, to factor out memory-bandwidth effects.
  • This shows the algorithm scales well given sufficient memory bandwidth.

[Figure: speedup vs. processors (up to 16), run on the smallest (and one of the most irregular) test matrices.]

SLIDE 18

All about Bandwidth: Harpertown vs Nehalem

Harpertown: Intel Xeon X5460 @ 3.16 GHz, dual-socket quad-core, FSB @ 1333 MHz.
Nehalem: Intel Core i7 920 @ 2.66 GHz, single-socket quad-core, QuickPath + Hyper-Threading.

SLIDE 19

Conclusions & Future Work

  • CSB allows for efficient multiplication of a sparse matrix and its transpose by a dense vector.

Future work:

  • Does CSB work well with other computations? Sparse LU decomposition? Sparse matrix-matrix multiplication?
  • For a symmetric matrix, we need only store the upper triangle. Can we multiply with one read (i.e., half the bandwidth)?

Code (in C++ and Cilk++) available from:
http://gauss.cs.ucsb.edu/∼aydin/software.html

SLIDE 20

Thank You!

SLIDE 21

CSB Space Usage

Lemma: For β = Θ(√n), CSB uses n lg nnz + nnz lg n bits of index data, matching CSR.

Proof:
  • With β = √n, there are (n/β)² = n blocks, hence n block pointers.
  • Each block pointer uses lg nnz bits, for n lg nnz bits in total.
  • Each row (or column) offset requires lg β = (1/2) lg n bits; over the rowind and colind arrays together, this is nnz lg n bits in total.
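Tallying the bits (with β = √n):

```latex
\underbrace{\left(\tfrac{n}{\beta}\right)^{2} \lg nnz}_{\text{block pointers}}
+ \underbrace{2\, nnz \lg \beta}_{\text{rowind}+\text{colind}}
= n \lg nnz + 2\, nnz \cdot \tfrac{1}{2} \lg n
= n \lg nnz + nnz \lg n .
```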