

SLIDE 1

Cache-oblivious sparse matrix–vector multiplication

Albert-Jan Yzelman, April 3, 2009. Joint work with Rob Bisseling.

Albert-Jan Yzelman & Rob Bisseling

SLIDE 2

Motivations

Basic implementations can suffer up to a 2× slowdown. Even worse: dedicated libraries may in some cases still show a similar level of inefficiency.

SLIDE 3

Outline

1. Memory and multiplication
2. Cache-friendly data structures
3. Cache-oblivious sparse matrix structure
4. Obtaining SBD form using partitioners
5. Experimental results
6. Conclusions & Future Work

SLIDE 4

Memory and multiplication

SLIDE 5

Cache parameters

  • Size S (in bytes)
  • Line size LS (in bytes)
  • Number of cache lines L = S/LS
  • Number of subcaches k
  • Number of levels

SLIDE 6

Naive cache

k = 1, a modulo-mapped cache: a memory chunk (of length LS) from RAM with start address x is stored in cache line number x mod L:

(figure: main memory (RAM) chunks mapped modulo L onto the cache lines)

SLIDE 7

'Ideal' cache

Instead of the naive modulo mapping, we use a smarter policy. We take k = L = 4, using the 'Least Recently Used' (LRU) policy:

  • Request x1, . . . , x4: the LRU stack becomes [x4 x3 x2 x1]
  • Request x2 (hit): [x2 x4 x3 x1]
  • Request x5 (miss, x1 is evicted): [x5 x2 x4 x3]
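The LRU policy above is easy to simulate. The following sketch (illustrative, not from the slides) models a fully associative LRU cache and reproduces the request sequence on this slide:

```python
from collections import OrderedDict

def lru_accesses(requests, capacity=4):
    """Simulate a fully associative LRU cache (k = L = capacity).
    Returns the final stack, most recently used first, and the miss count."""
    stack = OrderedDict()
    misses = 0
    for x in requests:
        if x in stack:
            stack.move_to_end(x)        # hit: x becomes most recently used
        else:
            misses += 1
            if len(stack) == capacity:  # full: evict the least recently used
                stack.popitem(last=False)
            stack[x] = None
    return list(reversed(stack)), misses

# Reproduce the slide: request x1, ..., x4, then x2, then x5
order, misses = lru_accesses(["x1", "x2", "x3", "x4", "x2", "x5"])
print(order, misses)  # ['x5', 'x2', 'x4', 'x3'] 5
```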

SLIDE 8

Realistic cache

1 < k < L, combining modulo-mapping and the LRU policy

(figure: main memory (RAM) mapped modulo onto k subcaches, each maintained as an LRU stack)

SLIDE 9

Multilevel caches

(figure: main memory (RAM), L2 cache, L1 cache, CPU)

Intel Core2:  L1: S = 32kB, k = 8;   L2: S = 4MB, k = 16
AMD K8:       L1: S = 16kB, k = 2;   L2: S = 1MB, k = 16

SLIDES 10-15

The dense case

Dense matrix–vector multiplication:

    [ a00 a01 a02 a03 ]   [ x0 ]   [ y0 ]
    [ a10 a11 a12 a13 ] · [ x1 ] = [ y1 ]
    [ a20 a21 a22 a23 ]   [ x2 ]   [ y2 ]
    [ a30 a31 a32 a33 ]   [ x3 ]   [ y3 ]

Example with k = L = 2, following the LRU stack while processing the first row:

  [x0] ⇒ [a00 x0] ⇒ [y0 a00 x0] ⇒ [x1 y0 a00 x0] ⇒ [a01 x1 y0 a00 x0] ⇒ [y0 a01 x1 a00 x0]

SLIDE 16

When k, L are a bit larger, we can predict the following: the lower elements of the vector x (that is, x0, x1, . . . , xi for some i < n) are evicted while processing the entire first row. This causes O(n) cache misses on the remaining m − 1 rows.

Fix: stop processing a row before an element of x would be evicted, and first continue row-wise; i.e., process Ax by doing MVs on m × q submatrices: y = A0 x(0) + A1 x(1) + · · · .

Unwanted side effect: now the lower elements of the vector y can be prematurely evicted.

Fix: stop processing a submatrix before an element of y would be evicted; the MV routine is now applied on p × q submatrices. This approach is cache-aware; it is implemented in, e.g., GotoBLAS.
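The blocked scheme just described can be sketched as follows (illustrative; in a cache-aware library the block sizes p and q would be tuned to the cache parameters, here they are arbitrary):

```python
def blocked_matvec(A, x, p=2, q=2):
    """Dense y = A x processed in p-by-q blocks, so that the active
    parts of x and y stay small enough to remain in cache."""
    m, n = len(A), len(x)
    y = [0.0] * m
    for i0 in range(0, m, p):            # block of rows: active y[i0:i0+p]
        for j0 in range(0, n, q):        # block of columns: active x[j0:j0+q]
            for i in range(i0, min(i0 + p, m)):
                for j in range(j0, min(j0 + q, n)):
                    y[i] += A[i][j] * x[j]
    return y

print(blocked_matvec([[1, 2], [3, 4]], [1, 1]))  # [3.0, 7.0]
```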

SLIDES 17-18

The sparse case

Standard data structure: Compressed Row Storage (CRS)

    A = [ 4 1 3 . ]
        [ . . 2 3 ]
        [ 1 . . 2 ]
        [ 7 . 1 1 ]

Stored as:
  nzs: [4 1 3 2 3 1 2 7 1 1]
  col: [0 1 2 2 3 0 3 0 2 3]
  row: [0 3 5 7 10]
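The CRS multiply kernel is a short double loop. A minimal sketch (not from the slides), using the arrays above:

```python
def crs_spmv(nzs, col, row, x):
    """y = A x with A stored in Compressed Row Storage."""
    y = [0.0] * (len(row) - 1)
    for i in range(len(row) - 1):              # for each row i
        for k in range(row[i], row[i + 1]):    # its nonzeros in nzs/col
            y[i] += nzs[k] * x[col[k]]
    return y

# The example matrix from the slide
nzs = [4, 1, 3, 2, 3, 1, 2, 7, 1, 1]
col = [0, 1, 2, 2, 3, 0, 3, 0, 2, 3]
row = [0, 3, 5, 7, 10]
print(crs_spmv(nzs, col, row, [1, 1, 1, 1]))  # [8.0, 5.0, 3.0, 9.0]
```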

SLIDES 19-24

The sparse case

Sparse matrix–vector multiplication (SpMV), with the same LRU view:

  [x?] ⇒ [a0? x?] ⇒ [y0 a0? x?] ⇒ [x? y0 a0? x?] ⇒ [a?? x? y0 a0? x?] ⇒ [y? a?? x? y? a0? x?]

We cannot predict memory accesses in the sparse case!

SLIDE 25

Cache-friendly data structures

SLIDE 26

Band data structures

Instead of storing matrices row- or column-wise, store the matrix diagonals. A = [4 1 2 3 2] (the nonzeros), stored per diagonal as: (2) [4 2 0], (3) [1 3 2].

Reference: Sivan Toledo, Improving the memory-system performance of sparse-matrix vector multiplication, 1997
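A generic diagonal-storage SpMV can be sketched as follows. This is an illustrative DIA-style layout under assumed conventions (offset-keyed diagonals, row-indexed values with padding), not the slide's exact numbering scheme:

```python
def dia_spmv(diagonals, n, x):
    """y = A x with A stored by diagonals: {offset d: values}, where
    A[i][i + d] = values[i] for valid i; out-of-range slots are padding."""
    y = [0.0] * n
    for d, vals in diagonals.items():
        for i in range(max(0, -d), min(n, n - d)):
            y[i] += vals[i] * x[i + d]
    return y

# A 3x3 example: main diagonal and first superdiagonal
A_diags = {0: [4, 2, 2], 1: [1, 3, 0]}   # A = [[4,1,0],[0,2,3],[0,0,2]]
print(dia_spmv(A_diags, 3, [1, 1, 1]))   # [5.0, 5.0, 2.0]
```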

SLIDE 27

Fractal data structures

    A = [ 4 1 . 2 ]
        [ . 2 . 3 ]
        [ 1 . . 2 ]
        [ 7 . 1 . ]

Stored as triplets, ordered along a space-filling (Hilbert) curve:
  nzs: [7 1 4 1 2 2 3 2 1]
  i :  [3 2 0 0 1 0 1 2 3]
  j :  [0 0 0 1 1 3 3 3 2]

Reference: Haase, Liebmann and Plank, A Hilbert-Order Multiplication Scheme for Unstructured Sparse Matrices, 2005
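The Hilbert ordering itself can be computed with the classic iterative index function; sorting the triplets by this key yields a Hilbert-order storage in the spirit of the scheme above (a sketch; the triplet encoding is illustrative):

```python
def xy2d(n, x, y):
    """Index of cell (x, y) along a Hilbert curve over an n-by-n grid
    (n a power of two); the standard iterative formulation."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        if ry == 0:                          # rotate the quadrant
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        s //= 2
    return d

# Sort sparse-matrix triplets (i, j, a_ij) into Hilbert order:
triplets = [(0, 0, 4.0), (0, 1, 1.0), (3, 0, 7.0), (2, 3, 2.0)]
triplets.sort(key=lambda t: xy2d(4, t[0], t[1]))
```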

SLIDE 28

Blocked CRS

    A = [ 4 1 3 . ]
        [ . . 2 3 ]
        [ 1 . . 2 ]
        [ 7 . 1 1 ]

Dense blocks: (4 1 3), (2 3), (1), (2), (7), (1 1)

Stored as:
  nzs: [4 1 3 2 3 1 2 7 1 1]
  col: [0 2 0 3 0 2]
  row: [0 1 2 4 6]
  blk: [0 3 5 6 7 8]

Reference: Pinar and Heath, Improving Performance of Sparse Matrix-Vector Multiplication, 1999
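A Blocked CRS multiply can be sketched with the arrays above. This is illustrative only, and assumes `blk` may be extended with a final sentinel offset:

```python
def bcrs_spmv(nzs, col, row, blk, x):
    """y = A x in Blocked CRS: each block is a run of consecutive nonzeros
    in one row, starting at column col[b]; blk[b] indexes its first value
    in nzs."""
    blk = blk + [len(nzs)]            # append a terminating offset
    y = [0.0] * (len(row) - 1)
    for i in range(len(row) - 1):
        for b in range(row[i], row[i + 1]):     # blocks of row i
            c = col[b]                          # starting column of block b
            for k in range(blk[b], blk[b + 1]):
                y[i] += nzs[k] * x[c + k - blk[b]]
    return y

nzs = [4, 1, 3, 2, 3, 1, 2, 7, 1, 1]
col = [0, 2, 0, 3, 0, 2]
row = [0, 1, 2, 4, 6]
blk = [0, 3, 5, 6, 7, 8]
print(bcrs_spmv(nzs, col, row, blk, [1, 1, 1, 1]))  # [8.0, 5.0, 3.0, 9.0]
```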

SLIDE 29

Zig-zag CRS

Change the order of CRS:

SLIDE 30

Zig-zag CRS

    A = [ 4 1 3 . ]
        [ . . 2 3 ]
        [ 1 . . 2 ]
        [ 7 . 1 1 ]

Stored as (even rows left-to-right, odd rows right-to-left):
  nzs: [4 1 3 3 2 1 2 1 1 7]
  col: [0 1 2 3 2 0 3 3 2 0]
  row: [0 3 5 7 10]

Reference: Yzelman and Bisseling, Cache-oblivious sparse matrix-vector multiplication by using sparse matrix partitioning methods, accepted March 2009
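Note that the SpMV kernel itself is unchanged; only the stored order differs, so consecutive rows traverse x in opposite directions and reuse its cached tail. Converting plain CRS arrays to zig-zag order is a small transformation (a sketch, not the authors' code):

```python
def to_zigzag(nzs, col, row):
    """Reverse the stored order of every odd row of a CRS matrix."""
    z_nzs, z_col = list(nzs), list(col)
    for i in range(len(row) - 1):
        if i % 2 == 1:  # odd rows become right-to-left
            z_nzs[row[i]:row[i + 1]] = reversed(z_nzs[row[i]:row[i + 1]])
            z_col[row[i]:row[i + 1]] = reversed(z_col[row[i]:row[i + 1]])
    return z_nzs, z_col

# Plain CRS arrays of the example matrix above
nzs = [4, 1, 3, 2, 3, 1, 2, 7, 1, 1]
col = [0, 1, 2, 2, 3, 0, 3, 0, 2, 3]
row = [0, 3, 5, 7, 10]
z_nzs, z_col = to_zigzag(nzs, col, row)
print(z_nzs)  # [4, 1, 3, 3, 2, 1, 2, 1, 1, 7]
```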

SLIDE 31

Cache-oblivious sparse matrix structure

SLIDES 32-34

Why not change the structure of the input matrix?

Assuming zig-zag CRS ordering, and allowing only row and column permutations. This is, on a certain level, also matrix-oblivious. A further advantage: since the reordered matrix A′ can be written as PAQ, we can still apply some strategies from the previous sections (e.g., Blocked CRS), or libraries relying on such strategies (e.g., OSKI).
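The reordering A′ = PAQ can be sketched with index maps (the encoding below is illustrative): permuting the input and output vectors consistently reproduces y = Ax exactly.

```python
def spmv_reordered(triplets, rp, cp, x):
    """Multiply with the reordered matrix A'[rp[i]][cp[j]] = A[i][j]
    (A' = P A Q given as row/column index maps rp, cp), then undo the
    permutations so the result equals the original y = A x."""
    n = len(x)
    xp = [0.0] * n
    for j in range(n):
        xp[cp[j]] = x[j]                   # x' = Q^T x
    yp = [0.0] * n
    for i, j, a in triplets:
        yp[rp[i]] += a * xp[cp[j]]         # y' = A' x'
    return [yp[rp[i]] for i in range(n)]   # undo P to recover y

# A = [[1, 2], [3, 4]] as triplets, with both indices swapped by the permutations
trip = [(0, 0, 1.0), (0, 1, 2.0), (1, 0, 3.0), (1, 1, 4.0)]
print(spmv_reordered(trip, [1, 0], [1, 0], [1.0, 1.0]))  # [3.0, 7.0]
```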

SLIDES 35-36

Separated Block Diagonal form

(figures: example matrices permuted to SBD form)

SLIDE 37

Separated Block Diagonal form

Given a specific cache, regarding the number of blocks p to be taken, there is a 'sweet spot' at p = n/(wL), where w is the number of data elements that fit into a single cache line.
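As a small illustration of the sweet-spot formula (the 64-byte line size and 8-byte doubles below are assumptions, not values from the slides):

```python
def sweet_spot_p(n, S, LS, elem_bytes=8):
    """Number of SBD blocks p = n / (w L) for a given cache:
    L = S / LS cache lines, w = LS / elem_bytes elements per line."""
    L = S // LS
    w = LS // elem_bytes
    return max(1, n // (w * L))

# Intel Core2 L1: S = 32 kB; assuming 64-byte lines and 8-byte doubles
print(sweet_spot_p(n=1_000_000, S=32 * 1024, LS=64))  # 244
```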

SLIDE 38

Separated Block Diagonal form

(figure: SBD-form matrix, annotated per block: no cache misses; 1 cache miss per row; and 1, 3, 7, 15 cache misses in the separator blocks)

SLIDES 39-58

Permuting to SBD form

(figures: one example matrix, permuted step by step) Input → column partitioning → column permutation → mixed row detection → row permutation → column subpartitioning → column permutation → no mixed rows: row permutation → continued recursively until the full SBD form is reached.

SLIDE 59

Obtaining SBD form using partitioners

SLIDE 60

Sparse matrix to hypergraph conversion

Transform a sparse matrix A to a hypergraph H = (V, N) according to the row-net model:

(figure: an example matrix with columns 1-7 and its row-net hypergraph)

Columns correspond to vertices and rows to hyperedges (nets).

SLIDE 61

We partition the set of vertices V into V0 (blue) and V1 (green).

(figure: the partitioned hypergraph, vertices 1-9)

Mixed nets are easily detectable as cut nets in the hypergraph representation, marked in red.

SLIDE 62

Eventually we will have partitioned V into many subsets V0, . . . , Vp−1. The nets spread over multiple subsets make up the mixed-row blocks that incur more than 1 cache miss per row.

(figure: cache misses per SBD block, as before)

SLIDE 63

The number of subsets a net is spread over after partitioning is called its connectivity λi. For p = n/(wL), the number of cache misses is given by

    Σ_{i: ni ∈ N} (λi − 1).

This is also called the (λ − 1)-metric, and is already used extensively in parallel computing. Partitioners should also take the load imbalance, typically denoted by ǫ, into account.

References:
  Çatalyürek and Aykanat, Hypergraph-partitioning-based decomposition for parallel sparse-matrix vector multiplication, 1999
  Vastenhouw and Bisseling, A two-dimensional data distribution method for parallel sparse matrix-vector multiplication, 2005

Our method has been implemented by adapting the Mondriaan software.
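The (λ − 1)-metric can be computed directly from a partitioning. A minimal sketch (the net/partition encoding below is illustrative):

```python
def lambda_minus_one(nets, part):
    """Sum of (lambda_i - 1) over all nets: each net (row) is a set of
    vertices (column indices); part maps a vertex to its subset index."""
    total = 0
    for net in nets:
        spread = {part[v] for v in net}   # subsets this net is spread over
        total += len(spread) - 1
    return total

# Two nets over 4 vertices, split into V0 = {0, 1} and V1 = {2, 3};
# only the second net is cut, so the metric is 1
nets = [{0, 1}, {1, 2, 3}]
part = {0: 0, 1: 0, 2: 1, 3: 1}
print(lambda_minus_one(nets, part))  # 1
```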

SLIDE 64

For a truly cache-oblivious approach, the number of subsets p we partition V into must be as large as possible: p → ∞ (in practice, p = n). Taking p much larger does not harm efficiency, and even optimises for the smaller caches closer to the CPU; we optimise access for all cache levels, irrespective of the cache parameters.

SLIDE 65

Experimental results

SLIDES 66-71

(figures: spy plots)
The memplus matrix: p = 1 (original); p = 2, ǫ = 0.1; p = 100, ǫ = 0.1
The rhpentium matrix: p = 1 (original); p = 100, ǫ = 0.1; p = 400, ǫ = 0.1

SLIDE 72

Cache-simulated results

Name        p = 2      p = 100    p = 400
memplus     0.99 (C)   1.05 (C)   1.08 (Z)
rhpentium   0.96 (Z)   0.66 (Z)   0.63 (I)
s3dkt3m2    1.00 (I)   1.00 (C)   1.00 (C)
rand10000   0.91 (C)   0.72 (C)   0.70 (I)
fidap037    0.98 (C)   1.00 (C)   1.01 (C)

Number of cache misses on reordered matrices (ǫ = 0.1) divided by the number of cache misses on the original matrix during SpMV. Simulated is an Intel Core2 L1 cache.

SLIDE 73

(figure: cache-miss ratios for matrices 1-5, original vs. reordered with p = 2, 3, 4, 100, 400, at ǫ = 0.1 and ǫ = 0.3)

Matrices used are: 1. memplus, 2. rhpentium, 3. s3dkt3m2, 4. rand10000, and 5. fidap037.

SLIDE 74

(figure: cache-miss ratios for matrices 11-14, original vs. reordered with p = 2, 3, 4, 5, 10, 15, 20)

Matrices used are: 11. Stanford, 12. Stanford Berkeley, 13. wikipedia20051105, and 14. cage14.

SLIDE 75

(figure: results for matrices 11-14 relative to CRS, with p = 1, 2, 3, 4, 5, 10, 15, 20)

Matrices used are: 11. Stanford, 12. Stanford Berkeley, 13. wikipedia20051105, and 14. cage14.

SLIDE 76

Pre-processing times

Matrix                      Reordering time   SpMV time (original / reordered)
rhpentium, p = 400          1 minute          0.9 / 0.7 ms
memplus, p = 100            4 minutes         0.4 / 0.4 ms
stanford, p = 20            4 minutes         27 / 14 ms
cage14, p = 20              12 minutes        109 / 123 ms
stanford berkeley, p = 20   13 minutes        31 / 30 ms
wikipedia, p = 10           16 minutes        353 / 236 ms
wikipedia, p = 20           45 minutes        353 / 237 ms

SLIDE 77

Name        p = 2   p = 3   p = 4   p = 100   p = 400
memplus      1531    1914    1818    12793     101744
rhpentium    5090    6752    7303    17064      60251
s3dkt3m2      603     673     740     1934       5181
rand10000    1560    1411    1820    23179     103565
fidap037     1005    1068    1131     5657      12761

Name                 p = 2   p = 3   p = 4   p = 10   p = 20
stanford              5139    7603    6828    7922     8332
stanford berkeley    11305   13208   15093   19138    21260
wikipedia20051105     2152    1992    2168    2570     7418
cage14                4238    3987    3635    4583     5611

Table: Cost of reordering in terms of the number of matrix–vector multiplications on the original matrix. Here, ǫ = 0.1. Construction times were measured on an Intel Core 2 (Q6600) machine.

SLIDE 78

Conclusions & Future Work

SLIDE 79

Conclusions: we have introduced a scheme capable of increasing SpMV performance by up to a factor of two, while:

  • remaining cache-oblivious,
  • remaining matrix-oblivious, and
  • keeping open the option of using specialised sparse BLAS libraries in the background.

For already structured matrices our approach does not obtain significant speedups, but neither does it decrease performance by much.

Future work: look into the use of two-dimensional partitioning methods, and speed up matrix partitioning in Mondriaan.
