A quad-tree based Sparse BLAS implementation for shared memory parallel computers
Michele Martone, Università di Roma Tor Vergata, Italy
Rome, May 31, 2011
Context: Sparse Matrix Computations
◮ numerical matrices which are large and populated mostly by zeros
◮ ubiquitous in scientific/engineering computations (e.g.: PDEs)
◮ used in information retrieval and document ranking
◮ the performance of sparse matrix computations on modern CPUs can be problematic (a fraction of peak)!
◮ there is no "silver bullet" for performance
◮ jargon: performance = time efficiency
An example application
Simulation of an automotive engine performed with the aid of the PSBLAS linear solver software (See [BBDM+05]). Courtesy of Salvatore Filippone.
Our Focus
The numerical solution of linear systems of the form Ax = b (with A a sparse matrix, x, b dense vectors) using iterative methods requires repeated (and thus, fast) computation of (variants of):
◮ SpMV: "y ← A x"
◮ SpMV-T: "y ← Aᵀ x"
◮ SpSV: "x ← L⁻¹ x"
◮ SpSV-T: "x ← L⁻ᵀ x"
Context: shared memory parallel, cache based
High performance programming of cache-based, shared memory parallel computers requires:
◮ locality of memory references, since the memory hierarchy has:
  ◮ limited memory bandwidth
  ◮ memory access latency
◮ programming multiple cores for coarse-grained workload partitioning, since there are:
  ◮ high synchronization and cache-coherence costs
Sparse matrices require indirect addressing
e.g. (in C): k = RP[i]; x[i] = x[i] + VA[k]*y[JA[k]];
◮ additional latency ("random" access to the first element of row i in a CSR matrix requires two dependent memory accesses)
◮ "wasting" cache lines (index JA[k] and its neighbours will be cached but not reused, and so will y[JA[k]] and its neighbours)
◮ with many active cores, saturation of the memory subsystem's traffic capacity
Remedies to indirect addressing?
We could mitigate using...
◮ index compression: less memory traffic
◮ cache blocking: more cached data reuse
◮ dense (or register) blocking: fewer indices (⇒ less memory traffic)
Literature starting points
We draw knowledge from:
◮ existing sparse matrix codes (see Filippone and Colajanni [FC00], Buttari [But06])
◮ classical algorithms (see Barrett et al. [BBC+94, § 4.3.1])
◮ novel techniques (see Buluç [BFF+09], Nishtala et al. [NVDY04], Vuduc [Vud03])
Basic representation: Coordinate (COO)

    A = [ a1,1 a1,2 a1,3  0
           0   a2,2 a2,3  0
           0    0   a3,3  0
           0    0    0   a4,4 ]

◮ VA = [a1,1, a1,2, a1,3, a2,2, a2,3, a3,3, a4,4] (nonzero values)
◮ IA = [1, 1, 1, 2, 2, 3, 4] (nonzero row indices)
◮ JA = [1, 2, 3, 2, 3, 3, 4] (nonzero column indices)
◮ so, ai,j = VA(n) if IA(n) = i, JA(n) = j
Pros and Cons of COO
◮ + good for layout conversion
◮ + easy implementation of many simple operations
◮ − parallel SpMV/SpMV-T requires sorting of elements
◮ − efficient SpSV is complicated, in parallel even more so
Standard representation: Compressed Sparse Rows (CSR)

    A = [ a1,1 a1,2 a1,3  0
           0   a2,2 a2,3  0
           0    0   a3,3  0
           0    0    0   a4,4 ]

◮ VA = [a1,1, a1,2, a1,3, a2,2, a2,3, a3,3, a4,4] (nonzero values)
◮ JA = [1, 2, 3, 2, 3, 3, 4] (nonzero column indices)
◮ RP = [1, 4, 6, 7, 8] (row pointers, one per row)
◮ so, the elements of row i are in positions VA(RP(i)) to VA(RP(i + 1) − 1)
◮ so, ai,j = VA(n) if RP(i) ≤ n < RP(i + 1) and JA(n) = j
Pros and Cons of CSR
◮ + common, easy to work with
◮ + parallel SpMV is feasible and reasonably efficient
◮ − parallel SpMV-T is feasible ...but inefficient with large¹ matrices
◮ − impractical for parallel SpSV

¹That is, when the matrix's memory representation approaches the machine's memory size.
A recursive matrix storage: RCSR
We propose:
◮ a quad-tree of sparse leaf submatrices
◮ the outcome of recursive partitioning in quadrants
◮ submatrices stored row oriented (in CSR)
◮ a unified format for Sparse BLAS² operations

²Basic Linear Algebra Subprograms
Recursive partitioning in RCSR
We stop subdividing a prospective leaf on an expected work/efficiency basis, when:
◮ leaves are too small (e.g.: comparable to the cache capacity)
◮ leaves are too sparse for CSR (e.g.: when nnz < rows)
Instance of an Information Retrieval matrix (573286 rows, 230401 columns, 41 · 10⁶ nonzeroes). Courtesy of Diego De Cao.
Dual threaded recursive SpMV
We compute y1 in the first thread, y2 in the second:
    [ y1 ]        [ A11 A12 ] [ x1 ]   [ A11 x1 + A12 x2 ]
    [ y2 ] = Ax = [ A21 A22 ] [ x2 ] = [ A21 x1 + A22 x2 ]

Recursion continues on each thread.
Single threaded recursive SpSV
Lx = b  ⇒  [ L1  0  ] [ x1 ]   [ b1 ]
           [ M   L2 ] [ x2 ] = [ b2 ]

x = [ x1 ] = L⁻¹b = [ L1  0  ]⁻¹ [ b1 ] = [ L1⁻¹ b1          ]
    [ x2 ]          [ M   L2 ]   [ b2 ]   [ L2⁻¹ (b2 − M x1) ]

This computation is executed recursively.
Pros/Cons of RCSR
Experimentally, on two threads:
◮ + compares well to CSB (an efficient research prototype)
◮ + coarse partitioning of workload (especially good for parallel transposed SpMV)
◮ + locality in the access of the matrix and the vectors
◮ − some matrix patterns may lead to unbalanced subdivisions
◮ − some nonempty leaf matrices may contain empty row intervals
Multi-threaded SpMV
[Diagram: the leaf submatrices Ai of A are multiplied against the corresponding subvectors xi and accumulated into the corresponding subvectors yi.]

y ← y + Σ_i Ai × xi, with leaves Ai; A = Σ_i Ai
Multi-threaded SpMV
y ← y + Σ_i Ai × xi
Threads t ∈ {1..T} execute concurrently: y_it ← y_it + A_it × x_it.
We prevent race conditions by performing idle cycles when needed; we use³:
◮ per-submatrix visit information
◮ per-thread current submatrix information

³See extra: slide 29
Multi-threaded SpSV
[Diagram: the triangular leaves L0...L4 and square leaves S5, S6 of L, applied to the subvectors X0...X6 during the solve.]
Multi-threaded SpSV
x ← L⁻¹x = (Σ_i Li)⁻¹x
A thread t ∈ {1...T} may perform either:
◮ x_it ← x_it + L_it × s_it (forward substitution)
◮ x_it ← L_it⁻¹ x_it (when x_it = s_it) (solve)
◮ idle (wait for dependencies)
Consistency with SpMV's techniques, plus honouring dependencies⁴.

⁴See extra: slide 30
Improving RCSR?
◮ tuning the leaf submatrices' representations:
  ◮ compressing numerical indices on leaves
  ◮ using a coordinate representation on some leaves
Index compression on leaves: RCSRH
The idea:
◮ when a submatrix is less than 2¹⁶ columns wide, we use a 16 bit integer type for column indices⁵
◮ overall, up to 16% memory saving on double precision CSR
◮ overall, up to 32% memory saving on single precision CSR
⇒ likely speedup due to reduced memory traffic!

⁵Instead of a standard 32 bit integer.
But how?
◮ Our code is large (hundreds of thousands of LOC over all of our variants⁶)!
◮ ⇒ we use custom code generators for all variants
◮ Our matrices usually have dimensions ≫ 2¹⁶!
◮ ⇒ if possible, we subdivide until submatrices are narrower than 2¹⁶

⁶A BLAS implementation needs many different operation variants.
Consequences: a finer partitioning (RCSRH)
Instance of kkt_power (2063494 × 2063494, 813034 nonzeroes) in RCSR (left) and in RCSRH (right).
Pros/Cons of RCSRH
◮ + speedup due to reduced memory overhead
◮ − for some matrices, RCSRH requires more memory than RCSR (a consequence of submatrix hypersparsity; see Buluç and Gilbert [BG08])
A cure?
We selectively introduce COO (coordinate format) leaf submatrices to:
◮ use less memory than CSR in particular submatrices (more rows than nonzeroes)
◮ subdivide very sparse submatrices further, for a better workload partitioning
Result: RSB
Recursive Sparse Blocks: a hybrid sparse matrix format.
◮ recursive quad-partitioning
◮ CSR/COO leaves
◮ 16 bit indices whenever possible
◮ partitioning with regard to both the underlying cache size and the available threads
Pros/Cons of RSB
◮ + scalable parallel SpMV/SpMV-T
◮ + scalable parallel SpSV/SpSV-T
◮ + many other common operations (e.g.: a parallel matrix build algorithm)
◮ + native support for symmetric matrices
◮ − a number of known cases (unbalanced matrices) where parallelism is poor
◮ − some algorithms easy to express/implement for CSR are more complex for RSB
Performance of RSB vs MKL
Experimental time efficiency comparison of our RSB prototype to the proprietary, highly optimized sparse matrix routines of Intel's Math Kernel Library (MKL r.10.3-0). We report here results on an Intel Xeon 5670 and publicly available matrices.
Comparison to MKL, SPMV
[Bar chart: MFlops/sec, series MKL-N12, MKL-T12, RSB-N12, RSB-T12, over a suite of large unsymmetric matrices (cage15, circuit5M, circuit5M_dc, fem_hifreq_circuit, the GL7d series, patents, wb-edu, and others).]
Figure: Transposed/Non transposed SpMV performance on M4, versus MKL, 12 threads, large unsymmetric matrices.
Comparison to MKL, Symmetric SPMV
[Bar chart: MFlops/sec, series MKL12 and RSB12, over a suite of symmetric matrices (the af_shell series, audikw_1, the bmw series, inline_1, ldoor, and others).]
Figure: SpMV performance on M4, versus MKL, 12 threads, symmetric matrices.
Comparison to MKL, SPSV
[Bar chart: MFlops/sec, series MKL-N1, MKL-T1, RSB-N1, RSB-T1, over matrices such as FEM_3D_thermal2, the g7jac series, para-9, rajat31, sme3Dc, and their reordered variants.]
Figure: Transposed/Non transposed SpSV performance on M4, versus MKL, single thread.
(MKL SpSV is not multithreaded)
Conclusions
A shared memory parallel Sparse BLAS library implementation:
◮ on large matrices, better performance than Intel's highly optimized, proprietary CSR implementation
◮ provides primitives for sparse solvers
◮ usable from within the open source PSBLAS library
◮ may be further tuned
Contributions in Conferences
◮ Michele Martone, Salvatore Filippone, Marcin Paprzycki, and Salvatore Tucci. On the usage of 16 bit indices in recursively stored sparse matrices. In Proceedings of the International Symposium on Symbolic and Numeric Algorithms for Scientific Computing, Timisoara, Romania, September 2010.
◮ Michele Martone, Salvatore Filippone, Marcin Paprzycki, and Salvatore Tucci. On BLAS operations with recursively stored sparse matrices. In Proceedings of the International Symposium on Symbolic and Numeric Algorithms for Scientific Computing, Timisoara, Romania, September 2010.
◮ Michele Martone, Salvatore Filippone, Paweł Gepner, Marcin Paprzycki, and Salvatore Tucci. Use of hybrid recursive CSR/COO data structures in sparse matrix-vector multiplication. In Proceedings of the International Multiconference on Computer Science and Information Technology, Wisła, Poland, October 2010.
◮ Michele Martone, Salvatore Filippone, Marcin Paprzycki, and Salvatore Tucci. About the assembly of recursive sparse matrices. In Proceedings of the International Multiconference on Computer Science and Information Technology, Wisła, Poland, October 2010.
Future Work
Forthcoming:
◮ interfacing with Matlab/Octave (via the PSBLAS solver)
◮ interfacing with Octave (standalone)
◮ study of further tuning of SpMV/SpMV-T/SpSV/SpSV-T
Possible:
◮ new specialized computational kernels
◮ conversion tools/routines
◮ new formats for leaf submatrices
References (1/2)
[BBC+94] R. Barrett, M. Berry, T. F. Chan, J. Demmel, J. Donato, J. Dongarra, V. Eijkhout, R. Pozo, C. Romine, and H. Van der Vorst. Templates for the Solution of Linear Systems: Building Blocks for Iterative Methods, 2nd Edition. SIAM, Philadelphia, PA, 1994.
[BBDM+05] Gino Bella, Alfredo Buttari, Alessandro De Maio, Francesco Del Citto, Salvatore Filippone, and Fabiano Gasperini. Fast-evp: an engine simulation tool. In Laurence T. Yang, Omer F. Rana, Beniamino Di Martino, and Jack Dongarra, editors, High Performance Computing and Communications, volume 3726 of Lecture Notes in Computer Science, pages 969–978. Springer Berlin / Heidelberg, 2005.
[BFF+09] Aydın Buluç, Jeremy T. Fineman, Matteo Frigo, John R. Gilbert, and Charles E. Leiserson. Parallel sparse matrix-vector and matrix-transpose-vector multiplication using compressed sparse blocks. In Friedhelm Meyer auf der Heide and Michael A. Bender, editors, SPAA, pages 233–244. ACM, 2009.
[BG08] Aydın Buluç and John R. Gilbert. On the representation and multiplication of hypersparse matrices. In IEEE International Parallel and Distributed Processing Symposium (IPDPS 2008), April 2008.
References (2/2)
[But06] Alfredo Buttari. Software Tools for Sparse Linear Algebra Computations. PhD thesis, Università degli studi di Roma Tor Vergata, 2006.
[FC00] Salvatore Filippone and Michele Colajanni. PSBLAS: A library for parallel linear algebra computation on sparse matrices. ACM Transactions on Mathematical Software, 26(4):527–550, December 2000.
[NVDY04] Rajesh Nishtala, Richard Vuduc, James W. Demmel, and Katherine A. Yelick. When cache blocking sparse matrix vector multiply works and why. In Proceedings of the PARA'04 Workshop on the State-of-the-art in Scientific Computing, 2004.
[Vud03] Richard Wilson Vuduc. Automatic Performance Tuning of Sparse Matrix Kernels. PhD thesis, University of California, Berkeley, 2003.
The end
Thank you for your attention!
(Extra slides following)
Dual threaded recursive SpMV-T
We compute y1 in the first thread, y2 in the second:
    [ y1 ]          [ A11 A12 ]ᵀ [ x1 ]   [ A11ᵀ A21ᵀ ] [ x1 ]   [ A11ᵀ x1 + A21ᵀ x2 ]
    [ y2 ] = Aᵀx =  [ A21 A22 ]  [ x2 ] = [ A12ᵀ A22ᵀ ] [ x2 ] = [ A12ᵀ x1 + A22ᵀ x2 ]

Recursion continues on each thread.
Multi-threaded SpMV
y ← y + Σ_i Ai × xi
Concurrently on threads t ∈ {1..T}: y_it ← y_it + A_it × x_it.
For threads (t, u), no pair (y_it, y_iu) shall intersect at a given time; we use a lock plus a usage bitmap to prevent race conditions.
Multi-threaded SpSV
x ← L⁻¹x = (Σ_i Li)⁻¹x
On threads t ∈ {1...T}, either task:
1) x_it ← x_it + L_it × s_it (forward substitution)
2) x_it ← L_it⁻¹ x_it (when x_it = s_it) (solve)
xi: the subvector of x on Li's rows range; si: the subvector of x on Li's columns range (using a lock, a usage map, and an array).
Relative Memory Scan Speeds
[Plot: words per second relative to 1-stride (%), versus stride (1–1024 words), for IND/LIN/RND access patterns compiled at -O0 and -O3, on machines M2 and M4.]
Figure: The relative performance of some linear scan primitives on M2 (an AMD Opteron 2354),M4 (Intel Xeon 5670). We have parameters ws = 8, κ = 1024, wn = 128 · 1024.
Relative Memory Scan Speeds
[Plot: words per second relative to 1-stride (%), versus stride (1–1024 words), for IND/LIN/RND access patterns compiled at -O0 and -O3, on machines M6 and M8.]