A quad-tree based Sparse BLAS implementation for shared memory parallel computers
Michele Martone, Università di Roma Tor Vergata, Italy
Rome, May 31, 2011
Context: Sparse Matrix Computations
◮ numerical matrices which are large and populated mostly by zeros
◮ ubiquitous in scientific/engineering computations (e.g.: PDEs)
◮ used in information retrieval and document ranking
◮ the performance of sparse matrix computations on modern CPUs can be problematic (a fraction of peak)!
◮ there is no "silver bullet" for performance
◮ jargon: performance = time efficiency
An example application
Simulation of an automotive engine performed with the aid of the PSBLAS linear solver software (See [BBDM+05]). Courtesy of Salvatore Filippone.
Our Focus
The numerical solution of linear systems of the form Ax = b (with A a sparse matrix, x, b dense vectors) using iterative methods requires repeated (and thus, fast) computation of (variants of):
◮ SpMV: "y ← A x"
◮ SpMV-T: "y ← Aᵀ x"
◮ SpSV: "x ← L⁻¹ x"
◮ SpSV-T: "x ← L⁻ᵀ x"
Context: shared memory parallel, cache based
High performance programming of cache-based, shared memory parallel computers requires:
◮ locality of memory references, since the memory hierarchy has:
  ◮ limited memory bandwidth
  ◮ memory access latency
◮ programming multiple cores for coarse-grained workload partitioning, since there are:
  ◮ high synchronization and cache-coherence costs
Sparse matrices require indirect addressing
e.g. (in C): k = RP[i]; x[i] = x[i] + VA[k]*y[JA[k]];
◮ additional latency ("random" access to the first element of row i in a CSR matrix requires two dependent memory accesses)
◮ "wasting" cache lines (index JA[k] and its neighbours will be cached but not reused, and so will y[JA[k]] and its neighbours)
◮ with many active cores, saturation of the memory subsystem's traffic capacity
Remedies to indirect addressing?
We could mitigate using...
◮ index compression: less memory traffic
◮ cache blocking: more cached data reuse
◮ dense (or register) blocking: fewer indices (⇒ less memory traffic)
Literature starting points
We draw knowledge from:
◮ existing sparse matrix codes (see Filippone and Colajanni [FC00], Buttari [But06])
◮ classical algorithms (see Barrett et al. [BBC+94, § 4.3.1])
◮ novel techniques (see Buluç [BFF+09], Nishtala et al. [NVDY04], Vuduc [Vud03])
Basic representation: Coordinate (COO)

    A = [ a1,1 a1,2 a1,3  0
           0   a2,2 a2,3  0
           0    0   a3,3  0
           0    0    0   a4,4 ]

◮ VA = [a1,1, a1,2, a1,3, a2,2, a2,3, a3,3, a4,4] (nonzero values)
◮ IA = [1, 1, 1, 2, 2, 3, 4] (nonzero row indices)
◮ JA = [1, 2, 3, 2, 3, 3, 4] (nonzero column indices)
◮ so, ai,j = VA(n) if IA(n) = i, JA(n) = j
Pros and Cons of COO
◮ + good for layout conversion
◮ + easy implementation of many simple operations
◮ − parallel SpMV/SpMV-T requires sorting of elements
◮ − efficient SpSV is complicated, in parallel even more so
Standard representation: Compressed Sparse Rows (CSR)

    A = [ a1,1 a1,2 a1,3  0
           0   a2,2 a2,3  0
           0    0   a3,3  0
           0    0    0   a4,4 ]

◮ VA = [a1,1, a1,2, a1,3, a2,2, a2,3, a3,3, a4,4] (nonzero values)
◮ JA = [1, 2, 3, 2, 3, 3, 4] (nonzero column indices)
◮ RP = [1, 4, 6, 7, 8] (row pointers, one per row)
◮ so, the elements of row i are in positions VA(RP(i)) to VA(RP(i + 1) − 1)
◮ so, ai,j = VA(n) if RP(i) ≤ n < RP(i + 1) and JA(n) = j
Pros and Cons of CSR
◮ + common, easy to work with
◮ + parallel SpMV is feasible and reasonably efficient
◮ − parallel SpMV-T is feasible ...but inefficient with large¹ matrices
◮ − impractical for parallel SpSV

¹That is, when the matrix's memory representation approaches the machine's memory size.
A recursive matrix storage: RCSR
We propose:
◮ a quad-tree of sparse leaf submatrices
◮ the outcome of recursive partitioning in quadrants
◮ submatrices stored row oriented (in CSR)
◮ a unified format for Sparse BLAS² operations

²Basic Linear Algebra Subprograms
Recursive partitioning in RCSR
We stop subdividing a prospective leaf on an expected work/efficiency basis, when:
◮ leaves are too small (e.g.: comparable to the cache capacity)
◮ leaves are too sparse for CSR (e.g.: when nnz < rows)
Instance of an Information Retrieval matrix (573286 rows, 230401 columns, 41 · 10⁶ nonzeroes). Courtesy of Diego De Cao.
Dual threaded recursive SpMV
We compute y1 in the first thread, y2 in the second:
    [ y1 ]        [ A11 A12 ] [ x1 ]   [ A11 x1 + A12 x2 ]
    [ y2 ] = Ax = [ A21 A22 ] [ x2 ] = [ A21 x1 + A22 x2 ]

Recursion continues on each thread.
Single threaded recursive SpSV
Lx = b  ⇒  [ L1  0  ] [ x1 ]   [ b1 ]
           [ M   L2 ] [ x2 ] = [ b2 ]

x = [ x1 ] = L⁻¹b = [ L1  0  ]⁻¹ [ b1 ] = [ L1⁻¹ b1          ]
    [ x2 ]          [ M   L2 ]   [ b2 ]   [ L2⁻¹ (b2 − M x1) ]

This computation is executed recursively.
Pros/Cons of RCSR
Experimentally, on two threads:
◮ + compares well to CSB (an efficient research prototype)
◮ + coarse partitioning of workload (especially good for parallel transposed SpMV)
◮ + locality in the access of the matrix and the vectors
◮ − some matrix patterns may lead to unbalanced subdivisions
◮ − some nonempty leaf matrices may contain empty row intervals
Multi-threaded SpMV
[Diagram: the leaf submatrices Ai of A are multiplied against the corresponding subvectors xi and accumulated into the corresponding subvectors yi.]

y ← y + Σ_i Ai × xi, with leaves Ai; A = Σ_i Ai
Multi-threaded SpMV
y ← y + Σ_i Ai × xi
Threads t ∈ {1..T} execute concurrently: y_it ← y_it + A_it × x_it.
We prevent race conditions by performing idle cycles when needed; we use³:
◮ per-submatrix visit information
◮ per-thread current submatrix information

³See extra: slide 29
Multi-threaded SpSV
[Diagram: the triangular leaves L0...L4 and square leaves S5, S6 of L, applied to the subvectors X0...X6 during the solve.]
Multi-threaded SpSV
x ← L⁻¹x = (Σ_i Li)⁻¹x
A thread t ∈ {1...T} may perform either:
◮ x_it ← x_it + L_it × s_it (forward substitution)
◮ x_it ← L_it⁻¹ x_it (when x_it = s_it) (solve)
◮ idle (wait for dependencies)
Consistency with SpMV's techniques, plus honouring dependencies⁴.

⁴See extra: slide 30
Improving RCSR?
◮ tuning the leaf submatrices' representations:
  ◮ compressing numerical indices on leaves
  ◮ using a coordinate representation on some leaves
Index compression on leaves: RCSRH
The idea:
◮ when a submatrix is less than 2¹⁶ columns wide, we use a 16 bit integer type for column indices⁵
◮ overall, up to 16% memory saving on double precision CSR
◮ overall, up to 32% memory saving on single precision CSR
⇒ likely speedup due to reduced memory traffic!

⁵Instead of a standard 32 bit integer.
But how?
◮ Our code is large (hundreds of thousands of LOC over all of our variants⁶)!
◮ ⇒ we use custom code generators for all variants
◮ Our matrices usually have dimensions ≫ 2¹⁶!
◮ ⇒ if possible, we subdivide until submatrices are narrower than 2¹⁶

⁶A BLAS implementation needs many different operation variants.
Consequences: a finer partitioning (RCSRH)
Instance of kkt_power (2063494 × 2063494, 813034 nonzeroes) in RCSR (left) and in RCSRH (right).
Pros/Cons of RCSRH
◮ + speedup due to reduced memory overhead
◮ − for some matrices, RCSRH requires more memory than RCSR (a consequence of submatrix hypersparsity; see Buluç and Gilbert [BG08])
A cure?
We selectively introduce COO (coordinate format) leaf submatrices to:
◮ use less memory than CSR in particular submatrices (more rows than nonzeroes)
◮ subdivide very sparse submatrices further, for a better workload partitioning
Result: RSB
Recursive Sparse Blocks: a hybrid sparse matrix format.
◮ recursive quad-partitioning
◮ CSR/COO leaves
◮ 16 bit indices whenever possible
◮ partitioning with regard to both the underlying cache size and the available threads
Pros/Cons of RSB
◮ + scalable parallel SpMV/SpMV-T
◮ + scalable parallel SpSV/SpSV-T
◮ + many other common operations (e.g.: a parallel matrix build algorithm)
◮ + native support for symmetric matrices
◮ − a number of known cases (unbalanced matrices) where parallelism is poor
◮ − some algorithms easy to express/implement for CSR are more complex for RSB
Performance of RSB vs MKL
Experimental time efficiency comparison of our RSB prototype to the proprietary, highly optimized sparse matrix routines of Intel's Math Kernel Library (MKL r.10.3-0). We report here results on an Intel Xeon 5670 and publicly available matrices.
Comparison to MKL, SPMV
[Bar chart: MFlops/sec, series MKL-N12, MKL-T12, RSB-N12, RSB-T12, over a suite of large unsymmetric matrices (cage15, circuit5M, circuit5M_dc, fem_hifreq_circuit, the GL7d series, patents, wb-edu, and others).]
Figure: Transposed/Non transposed SpMV performance on M4, versus MKL, 12 threads, large unsymmetric matrices.
Comparison to MKL, Symmetric SPMV
[Bar chart: MFlops/sec, series MKL12 and RSB12, over a suite of symmetric matrices (the af_shell series, audikw_1, the bmw series, inline_1, ldoor, and others).]
Figure: SpMV performance on M4, versus MKL, 12 threads, symmetric matrices.
Comparison to MKL, SPSV
[Bar chart: MFlops/sec, series MKL-N1, MKL-T1, RSB-N1, RSB-T1, over matrices such as FEM_3D_thermal2, the g7jac series, para-9, rajat31, sme3Dc, and their reordered variants.]
Figure: Transposed/Non transposed SpSV performance on M4, versus MKL, single thread.
(MKL SpSV is not multithreaded)
Conclusions
A shared memory parallel Sparse BLAS library implementation:
◮ on large matrices, better performance than Intel's highly optimized, proprietary CSR implementation
◮ provides primitives for sparse solvers
◮ usable from within the open source PSBLAS library
◮ may be further tuned
Contributions in Conferences
◮ Michele Martone, Salvatore Filippone, Marcin Paprzycki, and Salvatore Tucci. On the usage of 16 bit indices in recursively stored sparse matrices. In Proceedings of the International Symposium on Symbolic and Numeric Algorithms for Scientific Computing, Timisoara, Romania, September 2010.
◮ Michele Martone, Salvatore Filippone, Marcin Paprzycki, and Salvatore Tucci. On BLAS operations with recursively stored sparse matrices. In Proceedings of the International Symposium on Symbolic and Numeric Algorithms for Scientific Computing, Timisoara, Romania, September 2010.
◮ Michele Martone, Salvatore Filippone, Paweł Gepner, Marcin Paprzycki, and Salvatore Tucci. Use of hybrid recursive CSR/COO data structures in sparse matrix-vector multiplication. In Proceedings of the International Multiconference on Computer Science and Information Technology, Wisła, Poland, October 2010.
◮ Michele Martone, Salvatore Filippone, Marcin Paprzycki, and Salvatore Tucci. About the assembly of recursive sparse matrices. In Proceedings of the International Multiconference on Computer Science and Information Technology, Wisła, Poland, October 2010.
Future Work
Forthcoming:
◮ interfacing with Matlab/Octave (via the PSBLAS solver)
◮ interfacing with Octave (standalone)
◮ study of further tuning of SpMV/SpMV-T/SpSV/SpSV-T
Possible:
◮ new specialized computational kernels
◮ conversion tools/routines
◮ new formats for leaf submatrices
References (1/2)
[BBC+94] R. Barrett, M. Berry, T. F. Chan, J. Demmel, J. Donato, J. Dongarra, V. Eijkhout, R. Pozo, C. Romine, and H. Van der Vorst. Templates for the Solution of Linear Systems: Building Blocks for Iterative Methods, 2nd Edition. SIAM, Philadelphia, PA, 1994.
[BBDM+05] Gino Bella, Alfredo Buttari, Alessandro De Maio, Francesco Del Citto, Salvatore Filippone, and Fabiano Gasperini. Fast-evp: an engine simulation tool. In Laurence T. Yang, Omer F. Rana, Beniamino Di Martino, and Jack Dongarra, editors, High Performance Computing and Communications, volume 3726 of Lecture Notes in Computer Science, pages 969–978. Springer Berlin / Heidelberg, 2005.
[BFF+09] Aydın Buluç, Jeremy T. Fineman, Matteo Frigo, John R. Gilbert, and Charles E. Leiserson. Parallel sparse matrix-vector and matrix-transpose-vector multiplication using compressed sparse blocks. In Friedhelm Meyer auf der Heide and Michael A. Bender, editors, SPAA, pages 233–244. ACM, 2009.
[BG08] Aydın Buluç and John R. Gilbert. On the representation and multiplication of hypersparse matrices. In IEEE International Parallel and Distributed Processing Symposium (IPDPS 2008), April 2008.
References (2/2)
[But06] Alfredo Buttari. Software Tools for Sparse Linear Algebra Computations. PhD thesis, Università degli studi di Roma Tor Vergata, 2006.
[FC00] Salvatore Filippone and Michele Colajanni. PSBLAS: A library for parallel linear algebra computation on sparse matrices. ACM Transactions on Mathematical Software, 26(4):527–550, December 2000.
[NVDY04] Rajesh Nishtala, Richard Vuduc, James W. Demmel, and Katherine A. Yelick. When cache blocking sparse matrix vector multiply works and why. In Proceedings of the PARA'04 Workshop on the State-of-the-art in Scientific Computing, 2004.
[Vud03] Richard Wilson Vuduc. Automatic Performance Tuning of Sparse Matrix Kernels. PhD thesis, University of California, Berkeley, 2003.
The end
Thank you for your attention!
(Extra slides following)
Dual threaded recursive SpMV-T
We compute y1 in the first thread, y2 in the second:
    [ y1 ]          [ A11 A12 ]ᵀ [ x1 ]   [ A11ᵀ A21ᵀ ] [ x1 ]   [ A11ᵀ x1 + A21ᵀ x2 ]
    [ y2 ] = Aᵀx =  [ A21 A22 ]  [ x2 ] = [ A12ᵀ A22ᵀ ] [ x2 ] = [ A12ᵀ x1 + A22ᵀ x2 ]

Recursion continues on each thread.
Multi-threaded SpMV
y ← y + Σ_i Ai × xi
Concurrently on threads t ∈ {1..T}: y_it ← y_it + A_it × x_it.
For threads (t, u), no pair (y_it, y_iu) shall intersect at a given time; we use a lock plus a usage bitmap to prevent race conditions.
Multi-threaded SpSV
x ← L⁻¹x = (Σ_i Li)⁻¹x
On threads t ∈ {1...T}, either task:
1) x_it ← x_it + L_it × s_it (forward substitution)
2) x_it ← L_it⁻¹ x_it (when x_it = s_it) (solve)
xi: the subvector of x on Li's rows range; si: the subvector of x on Li's columns range (using a lock, a usage map, and an array).
Relative Memory Scan Speeds
[Plot: words per second relative to 1-stride (%), versus stride (1–1024 words), for IND/LIN/RND access patterns compiled at -O0 and -O3, on machines M2 and M4.]
Figure: The relative performance of some linear scan primitives on M2 (an AMD Opteron 2354),M4 (Intel Xeon 5670). We have parameters ws = 8, κ = 1024, wn = 128 · 1024.
Relative Memory Scan Speeds
[Plot: words per second relative to 1-stride (%), versus stride (1–1024 words), for IND/LIN/RND access patterns compiled at -O0 and -O3, on machines M6 and M8.]