A quad-tree based Sparse BLAS implementation for shared memory parallel computers

SLIDE 1

A quad-tree based Sparse BLAS implementation for shared memory parallel computers

Michele Martone

Università di Roma “Tor Vergata”, Italy

Rome, May 31, 2011

SLIDE 2

Context: Sparse Matrix Computations

◮ numerical matrices which are large and populated mostly by zeros

◮ ubiquitous in scientific/engineering computations (e.g.: PDEs)

◮ used in information retrieval and document ranking

◮ the performance of sparse matrix computations on modern CPUs can be problematic (a fraction of peak)!

◮ there is no “silver bullet” for performance

◮ jargon: performance = time efficiency

SLIDE 3

An example application

Simulation of an automotive engine performed with the aid of the PSBLAS linear solver software (See [BBDM+05]). Courtesy of Salvatore Filippone.

SLIDE 4

Our Focus

The numerical solution of linear systems of the form Ax = b (with A a sparse matrix, x, b dense vectors) using iterative methods requires repeated (and thus fast) computation of (variants of):

◮ SpMV: “y ← A x”

◮ SpMV-T: “y ← Aᵀ x”

◮ SpSV: “x ← L⁻¹ x”

◮ SpSV-T: “x ← L⁻ᵀ x”

SLIDE 5

Context: shared memory parallel, cache based

high performance programming of cache-based, shared memory parallel computers requires:

◮ locality of memory references, because the memory hierarchy has:

  ◮ limited memory bandwidth

  ◮ memory access latency

◮ programming multiple cores for coarse-grained workload partitioning

◮ high synchronization and cache-coherence costs

SLIDE 6

Sparse matrices require indirect addressing

e.g. (in C): k = RP[i]; x[i] = x[i] + VA[k] * y[JA[k]];

◮ additional latency (“randomly” accessing the first element of row i in a CSR matrix requires two dependent memory accesses)

◮ “wasting” cache lines (index JA[k] and its neighbours will be cached but not reused, and so will the value y[JA[k]] and its neighbours)

◮ with many active cores, saturation of the memory subsystem’s traffic capacity
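For reference, here is a minimal sketch of the full CSR SpMV loop the snippet above is taken from (VA, JA, RP as in the slides; the function and parameter names, such as nr for the row count, are illustrative assumptions; indices are 0-based as in C):

    /* Minimal CSR sparse matrix-vector multiply: y <- y + A*x.
       VA: nonzero values, JA: column indices, RP: row pointers. */
    void csr_spmv(int nr, const int *RP, const int *JA, const double *VA,
                  const double *x, double *y)
    {
        for (int i = 0; i < nr; ++i)
            for (int k = RP[i]; k < RP[i + 1]; ++k)
                y[i] += VA[k] * x[JA[k]]; /* two dependent loads: JA[k], then x[JA[k]] */
    }

The inner loop reads x through JA[k]: this is exactly the indirect addressing the bullets above refer to.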

SLIDE 7

Remedies to indirect addressing ?

we could mitigate using...

◮ index compression: less memory traffic

◮ cache blocking: more cached data reuse

◮ dense (or register) blocking: fewer indices (⇒ less memory traffic)

SLIDE 8

Literature starting points

We draw knowledge from:

◮ existing sparse matrix codes (See Filippone and Colajanni [FC00], Buttari [But06])

◮ classical algorithms (See Barrett et al. [BBC+94, § 4.3.1])

◮ novel techniques (See Buluç et al. [BFF+09], Nishtala et al. [NVDY04], Vuduc [Vud03])
SLIDE 9

Basic representation: Coordinate (COO)

$$
A = \begin{pmatrix}
a_{1,1} & a_{1,2} & a_{1,3} &         \\
        & a_{2,2} & a_{2,3} &         \\
        &         & a_{3,3} &         \\
        &         &         & a_{4,4}
\end{pmatrix}
$$

◮ VA = [a1,1, a1,2, a1,3, a2,2, a2,3, a3,3, a4,4] (nonzeroes)

◮ IA = [1, 1, 1, 2, 2, 3, 4] (nonzeroes’ row indices)

◮ JA = [1, 2, 3, 2, 3, 3, 4] (nonzeroes’ column indices)

◮ so, ai,j = VA(n) if IA(n) = i and JA(n) = j
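As an illustration (a sketch, not the thesis’ code), SpMV over COO is a single loop over the triples; parameter names such as nnz are assumptions, and indices are 0-based:

    /* SpMV in COO format: y <- y + A*x, one (IA[n], JA[n], VA[n]) triple
       per nonzero; nnz is the nonzero count. */
    void coo_spmv(int nnz, const int *IA, const int *JA, const double *VA,
                  const double *x, double *y)
    {
        for (int n = 0; n < nnz; ++n)
            y[IA[n]] += VA[n] * x[JA[n]];
    }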

SLIDE 10

Pros and Cons of COO

◮ + good for layout conversion (see the sketch after this list)

◮ + easy implementation of many simple operations

◮ − parallel SpMV/SpMV-T requires sorting of elements

◮ − efficient SpSV is complicated, in parallel even more so
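A sketch of the layout-conversion point, under the assumption that the COO triples are already sorted by row (as in the example above): converting to CSR then amounts to building the row-pointer array, while VA and JA carry over unchanged. Function and parameter names are illustrative:

    /* Build CSR row pointers RP[0..nr] from row-sorted COO row indices IA. */
    void coo_to_csr_rowptr(int nr, int nnz, const int *IA, int *RP)
    {
        for (int i = 0; i <= nr; ++i) RP[i] = 0;
        for (int n = 0; n < nnz; ++n)  RP[IA[n] + 1]++;   /* count nonzeroes per row */
        for (int i = 0; i < nr; ++i)   RP[i + 1] += RP[i]; /* prefix sum */
    }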

SLIDE 11

Standard representation: Compressed Sparse Rows (CSR)

$$
A = \begin{pmatrix}
a_{1,1} & a_{1,2} & a_{1,3} &         \\
        & a_{2,2} & a_{2,3} &         \\
        &         & a_{3,3} &         \\
        &         &         & a_{4,4}
\end{pmatrix}
$$

◮ VA = [a1,1, a1,2, a1,3, a2,2, a2,3, a3,3, a4,4] (nonzeroes)

◮ JA = [1, 2, 3, 2, 3, 3, 4] (nonzeroes’ column indices)

◮ RP = [1, 4, 6, 7, 8] (row pointers, one per row, plus a final fence)

◮ so, the elements of row i are in positions RP(i) to RP(i + 1) − 1 of VA

◮ so, ai,j = VA(n) if RP(i) ≤ n < RP(i + 1) and JA(n) = j
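For instance, looking up a single entry a_{i,j} is a scan of row i’s segment (a sketch with assumed names, 0-based indices; a binary search over the sorted JA segment would also work):

    /* Return a_{i,j} from CSR arrays, or 0.0 for a structural zero. */
    double csr_get(const int *RP, const int *JA, const double *VA, int i, int j)
    {
        for (int k = RP[i]; k < RP[i + 1]; ++k)
            if (JA[k] == j) return VA[k];
        return 0.0;
    }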

SLIDE 12

Pros and Cons of CSR

◮ + common, easy to work with

◮ + parallel SpMV is feasible and reasonably efficient

◮ − parallel SpMV-T is feasible ...but inefficient with large1 matrices

◮ − impractical for parallel SpSV

1That is, when a matrix’s memory representation approaches the machine’s memory size.

SLIDE 13

A recursive matrix storage: RCSR

we propose:

◮ a quad-tree of sparse leaf submatrices

◮ the outcome of recursive partitioning into quadrants

◮ submatrices are stored row-oriented (in CSR)

◮ a unified format for Sparse BLAS2 operations

2Basic Linear Algebra Subprograms

SLIDE 14

Recursive partitioning in RCSR

we stop subdivision of prospective leaves on an expected work/efficiency basis:

◮ leaves too small (e.g.: comparable to the cache capacity)

◮ leaves too sparse for CSR (e.g.: when nnz < rows) — see the sketch below
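A sketch of what such a stopping predicate could look like (threshold logic and names are assumptions, not the thesis’ exact criteria):

    #include <stdbool.h>

    /* Stop subdividing a prospective leaf when it is already small enough
       to be cache friendly, or too sparse to pay off in CSR. */
    bool stop_subdividing(long rows, long nnz, long cache_bytes)
    {
        long bytes = nnz * (long)(sizeof(double) + sizeof(int)) /* VA + JA */
                   + (rows + 1) * (long)sizeof(int);            /* RP */
        if (bytes <= cache_bytes) return true; /* comparable to cache capacity */
        if (nnz < rows) return true;           /* too sparse for CSR */
        return false;
    }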

SLIDE 15

Instance of an Information Retrieval matrix (573286 rows, 230401 columns, 41 · 10^6 nonzeroes):

Courtesy of Diego De Cao.

SLIDE 16

Dual threaded recursive SpMV

We compute y1 in the first thread, y2 in the second:

$$
\begin{pmatrix} y_1 \\ y_2 \end{pmatrix}
= A x
= \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix}
  \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}
= \begin{pmatrix} A_{11} x_1 + A_{12} x_2 \\ A_{21} x_1 + A_{22} x_2 \end{pmatrix}
$$

Recursion continues on each thread.
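A sketch of this two-way split in C with OpenMP sections (the submatrix type and the spmv() helper are assumptions, not the thesis’ API); each thread owns a disjoint part of y, so no synchronization is needed at this level:

    #include <omp.h>

    struct submatrix; /* opaque quad-tree node (assumed) */
    void spmv(const struct submatrix *A, const double *x, double *y); /* y += A*x (assumed) */

    void spmv_quad_2threads(const struct submatrix *A11, const struct submatrix *A12,
                            const struct submatrix *A21, const struct submatrix *A22,
                            const double *x1, const double *x2,
                            double *y1, double *y2)
    {
        #pragma omp parallel sections num_threads(2)
        {
            #pragma omp section
            { spmv(A11, x1, y1); spmv(A12, x2, y1); } /* y1 += A11 x1 + A12 x2 */
            #pragma omp section
            { spmv(A21, x1, y2); spmv(A22, x2, y2); } /* y2 += A21 x1 + A22 x2 */
        }
    }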
SLIDE 17

Single threaded recursive SpSV

$$
Lx = b \;\Rightarrow\;
\begin{pmatrix} L_1 &     \\ M & L_2 \end{pmatrix}
\begin{pmatrix} x_1 \\ x_2 \end{pmatrix}
= \begin{pmatrix} b_1 \\ b_2 \end{pmatrix}
$$

$$
x = \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}
= L^{-1} b
= \begin{pmatrix} L_1 &     \\ M & L_2 \end{pmatrix}^{-1}
  \begin{pmatrix} b_1 \\ b_2 \end{pmatrix}
= \begin{pmatrix} L_1^{-1} b_1 \\ L_2^{-1} (b_2 - M x_1) \end{pmatrix}
$$

This computation is executed recursively.
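The recursion can be sketched as follows (the node layout, helper names, and the in-place convention are assumptions, not the thesis’ code):

    #include <stddef.h>

    struct trinode {              /* hypothetical lower-triangular quad-tree node */
        struct trinode *L1, *L2;  /* diagonal children (NULL on a leaf) */
        struct trinode *M;        /* off-diagonal block coupling x1 into b2 */
        long half;                /* rows of the leading block L1 */
    };

    void leaf_trsv(const struct trinode *L, double *x);                 /* assumed: in-place CSR solve */
    void spmv_sub(const struct trinode *M, const double *x, double *y); /* assumed: y -= M*x */

    /* Forward solve of L x = b, overwriting b with x. */
    void spsv(const struct trinode *L, double *x)
    {
        if (L->L1 == NULL) { leaf_trsv(L, x); return; }
        spsv(L->L1, x);                 /* x1 <- L1^{-1} b1 */
        spmv_sub(L->M, x, x + L->half); /* b2 <- b2 - M x1 */
        spsv(L->L2, x + L->half);       /* x2 <- L2^{-1} b2 */
    }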
SLIDE 18

Pros/Cons of RCSR

Experimentally, on two threads:

◮ + compares well to CSB (an efficient research prototype)

◮ + coarse partitioning of workload (especially good for parallel transposed SpMV)

◮ + locality in the access of the matrix and the vectors

◮ − some matrix patterns may lead to unbalanced subdivisions

◮ − some nonempty leaf matrices may contain empty row intervals

SLIDE 19

Multi-threaded SpMV

(Figure: diagram of the leaf-wise multiply, with submatrices A0, A1 of A acting on subvectors X0, X1 and accumulating into Y0, Y1.)

$y \leftarrow y + \sum_i A_i \times x_i$, with leaves $A_i$; $A = \sum_i A_i$

SLIDE 20

Multi-threaded SpMV

$y \leftarrow y + \sum_i A_i \times x_i$

Threads $t \in \{1..T\}$ execute concurrently: $y_{i_t} \leftarrow y_{i_t} + A_{i_t} \times x_{i_t}$. We prevent race conditions by performing idle cycles when needed; we use3:

◮ per-submatrix visit information

◮ per-thread current submatrix information

3See extra: slide 29

SLIDE 21

Multi-threaded SpSV

(Figure: dependency structure of multi-threaded SpSV over diagonal submatrices L0–L4 and off-diagonal blocks S5, S6, acting on subvectors X0–X6.)

SLIDE 22

Multi-threaded SpSV

$x \leftarrow L^{-1} x = (\sum_i L_i)^{-1} x$. A thread $t \in \{1..T\}$ may perform either:

◮ $x_{i_t} \leftarrow x_{i_t} + L_{i_t} \times s_{i_t}$ (forward substitution)

◮ $x_{i_t} \leftarrow L_{i_t}^{-1} x_{i_t}$ (with $x_{i_t} = s_{i_t}$) (solve)

◮ idle (wait for dependencies)

Consistency with SpMV’s techniques plus honouring dependencies4.

4See extra: slide 30

SLIDE 23

Improving RCSR ?

◮ tuning the leaf submatrices’ representations:

  ◮ compressing numerical indices on leaves

  ◮ using a coordinate representation on some leaves

SLIDE 24

Index compression on leaves—RCSRH

The idea:

◮ when a submatrix is less than 2^16 columns wide, we use a 16-bit integer type for column indices5 (sketched below)

◮ overall, up to 16% memory saving on double precision CSR

◮ overall, up to 32% memory saving on single precision CSR

⇒ likely speedup due to reduced memory traffic!

5Instead of a standard 32-bit integer.
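A back-of-the-envelope sketch of the index-width choice (names are illustrative; the thesis’ generated code differs): with double precision values, going from 4-byte to 2-byte column indices shrinks each CSR nonzero from 12 to 10 bytes.

    #include <stdint.h>

    /* Pick the narrowest unsigned type able to index a leaf's columns;
       index arrays are then stored as uint16_t or uint32_t accordingly. */
    enum idx_kind { IDX16, IDX32 };

    enum idx_kind leaf_index_kind(long cols)
    {
        return (cols < (1L << 16)) ? IDX16 : IDX32;
    }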

SLIDE 25

But how ?

◮ Our code is large (hundreds of thousands of LOC for all of our variants6)!

◮ ⇒ we use custom code generators for all variants

◮ Our matrices are usually dimensioned ≫ 2^16!

◮ ⇒ if possible, we subdivide until submatrices are narrower than 2^16

6A BLAS implementation requires several different operation variants.

SLIDE 26

Consequences — a finer partitioning—RCSRH

Instance of kkt_power (2063494 × 2063494, 813034 nonzeroes) in RCSR (left) and in RCSRH (right).

SLIDE 27

Pros/Cons of RCSRH

◮ + speedup due to reduced memory overhead

◮ − for some matrices, RCSRH requires more memory than RCSR (a consequence of submatrix hypersparsity; see Buluç and Gilbert [BG08])

SLIDE 28

A cure ?

We selectively introduce COO (coordinate format) leaf submatrices to:

◮ use less memory than CSR in particular submatrices (those with more rows than nonzeroes — made precise below)

◮ subdivide very sparse submatrices further, for a better workload partitioning
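The memory crossover behind the first point is easy to make precise (a sketch, assuming equal index widths): CSR stores nnz column indices plus rows + 1 row pointers, while COO stores 2·nnz indices, so COO wins exactly when nnz ≤ rows.

    #include <stdbool.h>

    /* COO index storage (2*nnz) beats CSR's (nnz + rows + 1) iff nnz <= rows. */
    bool leaf_prefers_coo(long rows, long nnz)
    {
        return 2 * nnz < nnz + rows + 1;
    }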

SLIDE 29

Result: RSB

Recursive Sparse Blocks: a hybrid sparse matrix format.

◮ recursive quad-partitioning

◮ CSR/COO leaves

◮ 16-bit indices whenever possible

◮ partitioning with regard to both the underlying cache size and the available threads

SLIDE 30

Pros/Cons of RSB

◮ + scalable parallel SpMV/SpMV-T

◮ + scalable parallel SpSV/SpSV-T

◮ + many other common operations (e.g.: a parallel matrix build algorithm)

◮ + native support for symmetric matrices

◮ − a number of known cases (unbalanced matrices) where parallelism is poor

◮ − some algorithms easy to express/implement for CSR are more complex for RSB

SLIDE 31

Performance of RSB vs MKL

Experimental time efficiency comparison of our RSB prototype to Intel’s proprietary, highly optimized Math Kernel Library (MKL r.10.3-0) sparse matrix routines. We report here results on an Intel Xeon 5670 and publicly available matrices.

SLIDE 32

Comparison to MKL, SPMV

(Figure data: MFlops/sec for series MKL−N12, MKL−T12, RSB−N12, RSB−T12 over matrices including cage15, circuit5M, circuit5M_dc, fem_hifreq_circuit, GL7d17–GL7d21, patents, RM07R, soc-LiveJournal1, spal_4, TSOPF_RS_b2383, wb-edu, wikipedia; values roughly in the 1000–3000 MFlops/sec range.)

Figure: Transposed/Non transposed SpMV performance on M4, versus MKL, 12 threads, large unsymmetric matrices.

SLIDE 33

Comparison to MKL, Symmetric SPMV

(Figure data: MFlops/sec for series MKL12 and RSB12 over symmetric matrices including af_shell1, apache1, audikw_1, BenElechi1, bmw3_2, bmwcra_1, crankseg_1, ecology1, ecology2, F1, fcondp2, filter3D, Ga41As41H72, hood, inline_1, kkt_power, ldoor, mip1, msdoor, nd12k, nd24k, nlpkkt120, parabolic_fem, s3dkq4m2, shipsec8; values roughly in the 2000–8000 MFlops/sec range.)

Figure: SpMV performance on M4, versus MKL, 12 threads, symmetric matrices.

SLIDE 34

Comparison to MKL, SPSV

(Figure data: MFlops/sec for series MKL−N1, MKL−T1, RSB−N1, RSB−T1 over matrices including FEM_3D_thermal2, g7jac18, g7jac2, garon2, ns3Da, ohne2, para-9, poisson3Db, rajat31, sme3Dc, stomach, torso1, each also in a “-clamd” reordered variant; values roughly in the 100–600 MFlops/sec range.)

Figure: Transposed/Non transposed SpSV performance on M4, versus MKL, single thread.

(MKL SpSV is not multithreaded)

SLIDE 35

Conclusions

A shared memory parallel Sparse BLAS library implementation:

◮ on large matrices, better performance than Intel’s highly optimized, proprietary CSR implementation

◮ provides primitives for sparse solvers

◮ usable from within the open source PSBLAS library

◮ may be further tuned

SLIDE 36

Contributions in Conferences

Michele Martone, Salvatore Filippone, Marcin Paprzycki, and Salvatore Tucci. On the usage of 16 bit indices in recursively stored sparse matrices. In Proceedings of the International Symposium on Symbolic and Numeric Algorithms for Scientific Computing, Timisoara, Romania, September 2010.

Michele Martone, Salvatore Filippone, Marcin Paprzycki, and Salvatore Tucci. On BLAS operations with recursively stored sparse matrices. In Proceedings of the International Symposium on Symbolic and Numeric Algorithms for Scientific Computing, Timisoara, Romania, September 2010.


Michele Martone, Salvatore Filippone, Paweł Gepner, Marcin Paprzycki, and Salvatore Tucci. Use of hybrid recursive CSR/COO data structures in sparse matrix-vector multiplication. In Proceedings of the International Multiconference on Computer Science and Information Technology, Wisła, Poland, October 2010.

Michele Martone, Salvatore Filippone, Marcin Paprzycki, and Salvatore Tucci. About the assembly of recursive sparse matrices. In Proceedings of the International Multiconference on Computer Science and Information Technology, Wisła, Poland, October 2010.
SLIDE 37

Future Work

Forthcoming:

◮ interfacing with Matlab/Octave (via the PSBLAS solver)

◮ interfacing with Octave (standalone)

◮ study of further tuning of SpMV/SpMV-T/SpSV/SpSV-T

Possible:

◮ new specialized computational kernels

◮ conversion tools/routines

◮ new formats for leaf submatrices

SLIDE 38

References (1/2)

[BBC+94] R. Barrett, M. Berry, T. F. Chan, J. Demmel, J. Donato, J. Dongarra, V. Eijkhout, R. Pozo, C. Romine, and H. Van der Vorst. Templates for the Solution of Linear Systems: Building Blocks for Iterative Methods, 2nd Edition. SIAM, Philadelphia, PA, 1994.

[BBDM+05] Gino Bella, Alfredo Buttari, Alessandro De Maio, Francesco Del Citto, Salvatore Filippone, and Fabiano Gasperini. FAST-EVP: an engine simulation tool. In Laurence T. Yang, Omer F. Rana, Beniamino Di Martino, and Jack Dongarra, editors, High Performance Computing and Communications, volume 3726 of Lecture Notes in Computer Science, pages 969–978. Springer Berlin / Heidelberg, 2005.

[BFF+09] Aydın Buluç, Jeremy T. Fineman, Matteo Frigo, John R. Gilbert, and Charles E. Leiserson. Parallel sparse matrix-vector and matrix-transpose-vector multiplication using compressed sparse blocks. In Friedhelm Meyer auf der Heide and Michael A. Bender, editors, SPAA, pages 233–244. ACM, 2009.

[BG08] Aydın Buluç and John R. Gilbert. On the representation and multiplication of hypersparse matrices. In IEEE International Parallel and Distributed Processing Symposium (IPDPS 2008), April 2008.

SLIDE 39

References (2/2)

[But06] Alfredo Buttari. Software Tools for Sparse Linear Algebra Computations. PhD thesis, Università degli studi di Roma Tor Vergata, 2006.

[FC00] Salvatore Filippone and Michele Colajanni. PSBLAS: a library for parallel linear algebra computation on sparse matrices. ACM Transactions on Mathematical Software, 26(4):527–550, December 2000.

[NVDY04] Rajesh Nishtala, Richard Vuduc, James W. Demmel, and Katherine A. Yelick. When cache blocking sparse matrix vector multiply works and why. In Proceedings of the PARA’04 Workshop on the State-of-the-art in Scientific Computing, 2004.

[Vud03] Richard Wilson Vuduc. Automatic Performance Tuning of Sparse Matrix Kernels. PhD thesis, University of California, Berkeley, 2003.

SLIDE 40

The end

Thank you for your attention!

(Extra slides following)

SLIDE 41

Dual threaded recursive SpMV-T

We compute y1 in the first thread, y2 in the second:

$$
\begin{pmatrix} y_1 \\ y_2 \end{pmatrix}
= A^T x
= \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix}^T
  \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}
= \begin{pmatrix} A_{11}^T & A_{21}^T \\ A_{12}^T & A_{22}^T \end{pmatrix}
  \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}
= \begin{pmatrix} A_{11}^T x_1 + A_{21}^T x_2 \\ A_{12}^T x_1 + A_{22}^T x_2 \end{pmatrix}
$$

Recursion continues on each thread.
SLIDE 42

Multi-threaded SPMV

$y \leftarrow y + \sum_i A_i \times x_i$

Concurrently on threads $t \in \{1..T\}$: $y_{i_t} \leftarrow y_{i_t} + A_{i_t} \times x_{i_t}$. For any two threads $(t, u)$, the intervals $(y_{i_t}, y_{i_u})$ shall not intersect at any given time — we use a lock plus a usage bitmap to prevent race conditions.
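A sketch of the lock-plus-bitmap idea (an illustration, not the thesis’ code; MAX_INTERVALS and the claim/release API are assumptions):

    #include <omp.h>
    #include <stdbool.h>

    #define MAX_INTERVALS 1024          /* assumed bound on y-intervals */

    static omp_lock_t lock;             /* initialize once with omp_init_lock(&lock) */
    static bool busy[MAX_INTERVALS];    /* usage bitmap over y-intervals */

    /* Try to claim a y-interval before multiplying into it; on failure the
       caller idles (or picks another submatrix) and retries later. */
    static bool try_claim(int interval)
    {
        bool ok = false;
        omp_set_lock(&lock);
        if (!busy[interval]) { busy[interval] = true; ok = true; }
        omp_unset_lock(&lock);
        return ok;
    }

    static void release(int interval)
    {
        omp_set_lock(&lock);
        busy[interval] = false;
        omp_unset_lock(&lock);
    }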

SLIDE 43

Multi-threaded SPSV

$x \leftarrow L^{-1} x = (\sum_i L_i)^{-1} x$. On threads $t \in \{1..T\}$, either task:

1) $x_{i_t} \leftarrow x_{i_t} + L_{i_t} \times s_{i_t}$ (forward substitution)

2) $x_{i_t} \leftarrow L_{i_t}^{-1} x_{i_t}$ (with $x_{i_t} = s_{i_t}$) (solve)

$x_i$: the subvector on $L_i$’s row range; $s_i$: the subvector on $L_i$’s column range (using a lock + usage map + array)

SLIDE 44

(Figure: “Relative Memory Scan Speeds” — x-axis: stride, 1–1024 words; y-axis: words per second relative to 1-stride, 20–100%; series M2/M4 × IND/LIN/RND × O0/O3.)

Figure: The relative performance of some linear scan primitives on M2 (an AMD Opteron 2354) and M4 (an Intel Xeon 5670). We have parameters ws = 8, κ = 1024, wn = 128 · 1024.

SLIDE 45

(Figure: “Relative Memory Scan Speeds” — x-axis: stride, 1–1024 words; y-axis: words per second relative to 1-stride, 20–100%; series M6/M8 × IND/LIN/RND × O0/O3.)

Figure: The relative performance of some linear scan primitives on M6 (an Atom 450N) and M8 (a Pentium III). We have parameters ws = 8, κ = 1024, wn = 128 · 1024.

SLIDE 46

End of Extra slides

Thank you for your extra attention!