

SLIDE 1

Cache-oblivious sparse matrix–vector multiplication

Albert-Jan Yzelman, April 3, 2009. Joint work with Rob Bisseling.

Albert-Jan Yzelman & Rob Bisseling

SLIDE 2

Motivations

Basic implementations can suffer up to a 2× slowdown. Even worse: dedicated libraries may in some cases still show a similar level of inefficiency.

SLIDE 3

Outline

1. Memory and multiplication
2. Cache-friendly data structures
3. Cache-oblivious sparse matrix structure
4. Obtaining SBD form using partitioners
5. Experimental results
6. Conclusions & Future Work

SLIDE 4

Memory and multiplication

SLIDE 5

Cache parameters

  • Size S (in bytes)
  • Line size LS (in bytes)
  • Number of cache lines L = S/LS
  • Number of subcaches k
  • Number of levels

SLIDE 6

Naive cache

k = 1, a modulo-mapped cache: a memory chunk (of length LS) from RAM with start address x is stored in cache line number x mod L:

(figure: main memory (RAM) chunks mapped modulo L onto the cache lines)

SLIDE 7

'Ideal' cache

Instead of the naive modulo mapping, we use a smarter policy. We take k = L = 4, using the 'Least Recently Used' (LRU) policy:

  • Request x1, . . . , x4: the LRU stack becomes [x4 x3 x2 x1]
  • Request x2 (hit): [x2 x4 x3 x1]
  • Request x5 (miss, x1 is evicted): [x5 x2 x4 x3]
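The LRU policy above is easy to simulate. The following sketch (illustrative, not from the slides) models a fully associative LRU cache and reproduces the request sequence on this slide:

```python
from collections import OrderedDict

def lru_accesses(requests, capacity=4):
    """Simulate a fully associative LRU cache (k = L = capacity).
    Returns the final stack, most recently used first, and the miss count."""
    stack = OrderedDict()
    misses = 0
    for x in requests:
        if x in stack:
            stack.move_to_end(x)        # hit: x becomes most recently used
        else:
            misses += 1
            if len(stack) == capacity:  # full: evict the least recently used
                stack.popitem(last=False)
            stack[x] = None
    return list(reversed(stack)), misses

# Reproduce the slide: request x1, ..., x4, then x2, then x5
order, misses = lru_accesses(["x1", "x2", "x3", "x4", "x2", "x5"])
print(order, misses)  # ['x5', 'x2', 'x4', 'x3'] 5
```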

SLIDE 8

Realistic cache

1 < k < L, combining modulo-mapping and the LRU policy

(figure: main memory (RAM) mapped modulo onto k subcaches, each maintained as an LRU stack)

SLIDE 9

Multilevel caches

(figure: main memory (RAM), L2 cache, L1 cache, CPU)

Intel Core2:  L1: S = 32kB, k = 8;   L2: S = 4MB, k = 16
AMD K8:       L1: S = 16kB, k = 2;   L2: S = 1MB, k = 16

SLIDES 10-15

The dense case

Dense matrix–vector multiplication:

    [ a00 a01 a02 a03 ]   [ x0 ]   [ y0 ]
    [ a10 a11 a12 a13 ] · [ x1 ] = [ y1 ]
    [ a20 a21 a22 a23 ]   [ x2 ]   [ y2 ]
    [ a30 a31 a32 a33 ]   [ x3 ]   [ y3 ]

Example with k = L = 2, following the LRU stack while processing the first row:

  [x0] ⇒ [a00 x0] ⇒ [y0 a00 x0] ⇒ [x1 y0 a00 x0] ⇒ [a01 x1 y0 a00 x0] ⇒ [y0 a01 x1 a00 x0]

SLIDE 16

When k, L are a bit larger, we can predict the following: the lower elements of the vector x (that is, x0, x1, . . . , xi for some i < n) are evicted while processing the entire first row. This causes O(n) cache misses on the remaining m − 1 rows.

Fix: stop processing a row before an element of x would be evicted, and first continue row-wise; i.e., process Ax by doing MVs on m × q submatrices: y = A0 x(0) + A1 x(1) + · · · .

Unwanted side effect: now the lower elements of the vector y can be prematurely evicted.

Fix: stop processing a submatrix before an element of y would be evicted; the MV routine is now applied on p × q submatrices. This approach is cache-aware; it is implemented in, e.g., GotoBLAS.
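The blocked scheme just described can be sketched as follows (illustrative; in a cache-aware library the block sizes p and q would be tuned to the cache parameters, here they are arbitrary):

```python
def blocked_matvec(A, x, p=2, q=2):
    """Dense y = A x processed in p-by-q blocks, so that the active
    parts of x and y stay small enough to remain in cache."""
    m, n = len(A), len(x)
    y = [0.0] * m
    for i0 in range(0, m, p):            # block of rows: active y[i0:i0+p]
        for j0 in range(0, n, q):        # block of columns: active x[j0:j0+q]
            for i in range(i0, min(i0 + p, m)):
                for j in range(j0, min(j0 + q, n)):
                    y[i] += A[i][j] * x[j]
    return y

print(blocked_matvec([[1, 2], [3, 4]], [1, 1]))  # [3.0, 7.0]
```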

SLIDES 17-18

The sparse case

Standard data structure: Compressed Row Storage (CRS)

    A = [ 4 1 3 . ]
        [ . . 2 3 ]
        [ 1 . . 2 ]
        [ 7 . 1 1 ]

Stored as:
  nzs: [4 1 3 2 3 1 2 7 1 1]
  col: [0 1 2 2 3 0 3 0 2 3]
  row: [0 3 5 7 10]
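The CRS multiply kernel is a short double loop. A minimal sketch (not from the slides), using the arrays above:

```python
def crs_spmv(nzs, col, row, x):
    """y = A x with A stored in Compressed Row Storage."""
    y = [0.0] * (len(row) - 1)
    for i in range(len(row) - 1):              # for each row i
        for k in range(row[i], row[i + 1]):    # its nonzeros in nzs/col
            y[i] += nzs[k] * x[col[k]]
    return y

# The example matrix from the slide
nzs = [4, 1, 3, 2, 3, 1, 2, 7, 1, 1]
col = [0, 1, 2, 2, 3, 0, 3, 0, 2, 3]
row = [0, 3, 5, 7, 10]
print(crs_spmv(nzs, col, row, [1, 1, 1, 1]))  # [8.0, 5.0, 3.0, 9.0]
```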

SLIDES 19-24

The sparse case

Sparse matrix–vector multiplication (SpMV), with the same LRU view:

  [x?] ⇒ [a0? x?] ⇒ [y0 a0? x?] ⇒ [x? y0 a0? x?] ⇒ [a?? x? y0 a0? x?] ⇒ [y? a?? x? y? a0? x?]

We cannot predict memory accesses in the sparse case!

SLIDE 25

Cache-friendly data structures

SLIDE 26

Band data structures

Instead of storing matrices row- or column-wise, store the matrix diagonals. A = [4 1 2 3 2] (the nonzeros), stored per diagonal as: (2) [4 2 0], (3) [1 3 2].

Reference: Sivan Toledo, Improving the memory-system performance of sparse-matrix vector multiplication, 1997
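A generic diagonal-storage SpMV can be sketched as follows. This is an illustrative DIA-style layout under assumed conventions (offset-keyed diagonals, row-indexed values with padding), not the slide's exact numbering scheme:

```python
def dia_spmv(diagonals, n, x):
    """y = A x with A stored by diagonals: {offset d: values}, where
    A[i][i + d] = values[i] for valid i; out-of-range slots are padding."""
    y = [0.0] * n
    for d, vals in diagonals.items():
        for i in range(max(0, -d), min(n, n - d)):
            y[i] += vals[i] * x[i + d]
    return y

# A 3x3 example: main diagonal and first superdiagonal
A_diags = {0: [4, 2, 2], 1: [1, 3, 0]}   # A = [[4,1,0],[0,2,3],[0,0,2]]
print(dia_spmv(A_diags, 3, [1, 1, 1]))   # [5.0, 5.0, 2.0]
```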

SLIDE 27

Fractal data structures

    A = [ 4 1 . 2 ]
        [ . 2 . 3 ]
        [ 1 . . 2 ]
        [ 7 . 1 . ]

Stored as triplets, ordered along a space-filling (Hilbert) curve:
  nzs: [7 1 4 1 2 2 3 2 1]
  i :  [3 2 0 0 1 0 1 2 3]
  j :  [0 0 0 1 1 3 3 3 2]

Reference: Haase, Liebmann and Plank, A Hilbert-Order Multiplication Scheme for Unstructured Sparse Matrices, 2005
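The Hilbert ordering itself can be computed with the classic iterative index function; sorting the triplets by this key yields a Hilbert-order storage in the spirit of the scheme above (a sketch; the triplet encoding is illustrative):

```python
def xy2d(n, x, y):
    """Index of cell (x, y) along a Hilbert curve over an n-by-n grid
    (n a power of two); the standard iterative formulation."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        if ry == 0:                          # rotate the quadrant
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        s //= 2
    return d

# Sort sparse-matrix triplets (i, j, a_ij) into Hilbert order:
triplets = [(0, 0, 4.0), (0, 1, 1.0), (3, 0, 7.0), (2, 3, 2.0)]
triplets.sort(key=lambda t: xy2d(4, t[0], t[1]))
```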

SLIDE 28

Blocked CRS

    A = [ 4 1 3 . ]
        [ . . 2 3 ]
        [ 1 . . 2 ]
        [ 7 . 1 1 ]

Dense blocks: (4 1 3), (2 3), (1), (2), (7), (1 1)

Stored as:
  nzs: [4 1 3 2 3 1 2 7 1 1]
  col: [0 2 0 3 0 2]
  row: [0 1 2 4 6]
  blk: [0 3 5 6 7 8]

Reference: Pinar and Heath, Improving Performance of Sparse Matrix-Vector Multiplication, 1999
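A Blocked CRS multiply can be sketched with the arrays above. This is illustrative only, and assumes `blk` may be extended with a final sentinel offset:

```python
def bcrs_spmv(nzs, col, row, blk, x):
    """y = A x in Blocked CRS: each block is a run of consecutive nonzeros
    in one row, starting at column col[b]; blk[b] indexes its first value
    in nzs."""
    blk = blk + [len(nzs)]            # append a terminating offset
    y = [0.0] * (len(row) - 1)
    for i in range(len(row) - 1):
        for b in range(row[i], row[i + 1]):     # blocks of row i
            c = col[b]                          # starting column of block b
            for k in range(blk[b], blk[b + 1]):
                y[i] += nzs[k] * x[c + k - blk[b]]
    return y

nzs = [4, 1, 3, 2, 3, 1, 2, 7, 1, 1]
col = [0, 2, 0, 3, 0, 2]
row = [0, 1, 2, 4, 6]
blk = [0, 3, 5, 6, 7, 8]
print(bcrs_spmv(nzs, col, row, blk, [1, 1, 1, 1]))  # [8.0, 5.0, 3.0, 9.0]
```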

SLIDE 29

Zig-zag CRS

Change the order of CRS:

SLIDE 30

Zig-zag CRS

    A = [ 4 1 3 . ]
        [ . . 2 3 ]
        [ 1 . . 2 ]
        [ 7 . 1 1 ]

Stored as (even rows left-to-right, odd rows right-to-left):
  nzs: [4 1 3 3 2 1 2 1 1 7]
  col: [0 1 2 3 2 0 3 3 2 0]
  row: [0 3 5 7 10]

Reference: Yzelman and Bisseling, Cache-oblivious sparse matrix-vector multiplication by using sparse matrix partitioning methods, accepted March 2009
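Note that the SpMV kernel itself is unchanged; only the stored order differs, so consecutive rows traverse x in opposite directions and reuse its cached tail. Converting plain CRS arrays to zig-zag order is a small transformation (a sketch, not the authors' code):

```python
def to_zigzag(nzs, col, row):
    """Reverse the stored order of every odd row of a CRS matrix."""
    z_nzs, z_col = list(nzs), list(col)
    for i in range(len(row) - 1):
        if i % 2 == 1:  # odd rows become right-to-left
            z_nzs[row[i]:row[i + 1]] = reversed(z_nzs[row[i]:row[i + 1]])
            z_col[row[i]:row[i + 1]] = reversed(z_col[row[i]:row[i + 1]])
    return z_nzs, z_col

# Plain CRS arrays of the example matrix above
nzs = [4, 1, 3, 2, 3, 1, 2, 7, 1, 1]
col = [0, 1, 2, 2, 3, 0, 3, 0, 2, 3]
row = [0, 3, 5, 7, 10]
z_nzs, z_col = to_zigzag(nzs, col, row)
print(z_nzs)  # [4, 1, 3, 3, 2, 1, 2, 1, 1, 7]
```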

SLIDE 31

Cache-oblivious sparse matrix structure

SLIDES 32-34

Why not change the structure of the input matrix?

Assuming zig-zag CRS ordering, and allowing only row and column permutations. This is, on a certain level, also matrix-oblivious. A further advantage: since the reordered matrix A′ can be written as PAQ, we can still apply some strategies from the previous sections (e.g., Blocked CRS), or libraries relying on such strategies (e.g., OSKI).
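The reordering A′ = PAQ can be sketched with index maps (the encoding below is illustrative): permuting the input and output vectors consistently reproduces y = Ax exactly.

```python
def spmv_reordered(triplets, rp, cp, x):
    """Multiply with the reordered matrix A'[rp[i]][cp[j]] = A[i][j]
    (A' = P A Q given as row/column index maps rp, cp), then undo the
    permutations so the result equals the original y = A x."""
    n = len(x)
    xp = [0.0] * n
    for j in range(n):
        xp[cp[j]] = x[j]                   # x' = Q^T x
    yp = [0.0] * n
    for i, j, a in triplets:
        yp[rp[i]] += a * xp[cp[j]]         # y' = A' x'
    return [yp[rp[i]] for i in range(n)]   # undo P to recover y

# A = [[1, 2], [3, 4]] as triplets, with both indices swapped by the permutations
trip = [(0, 0, 1.0), (0, 1, 2.0), (1, 0, 3.0), (1, 1, 4.0)]
print(spmv_reordered(trip, [1, 0], [1, 0], [1.0, 1.0]))  # [3.0, 7.0]
```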

SLIDES 35-36

Separated Block Diagonal form

(figures: example matrices permuted to SBD form)

SLIDE 37

Separated Block Diagonal form

Given a specific cache, regarding the number of blocks p to be taken, there is a 'sweet spot' at p = n/(wL), where w is the number of data elements that fit into a single cache line.
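As a small illustration of the sweet-spot formula (the 64-byte line size and 8-byte doubles below are assumptions, not values from the slides):

```python
def sweet_spot_p(n, S, LS, elem_bytes=8):
    """Number of SBD blocks p = n / (w L) for a given cache:
    L = S / LS cache lines, w = LS / elem_bytes elements per line."""
    L = S // LS
    w = LS // elem_bytes
    return max(1, n // (w * L))

# Intel Core2 L1: S = 32 kB; assuming 64-byte lines and 8-byte doubles
print(sweet_spot_p(n=1_000_000, S=32 * 1024, LS=64))  # 244
```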

SLIDE 38

Separated Block Diagonal form

(figure: SBD-form matrix, annotated per block: no cache misses; 1 cache miss per row; and 1, 3, 7, 15 cache misses in the separator blocks)

SLIDES 39-58

Permuting to SBD form

(figures: one example matrix, permuted step by step) Input → column partitioning → column permutation → mixed row detection → row permutation → column subpartitioning → column permutation → no mixed rows: row permutation → continued recursively until the full SBD form is reached.

SLIDE 59

Obtaining SBD form using partitioners

SLIDE 60

Sparse matrix to hypergraph conversion

Transform a sparse matrix A to a hypergraph H = (V, N) according to the row-net model:

(figure: an example matrix with columns 1-7 and its row-net hypergraph)

Columns correspond to vertices and rows to hyperedges (nets).

SLIDE 61

We partition the set of vertices V into V0 (blue) and V1 (green).

(figure: the partitioned hypergraph, vertices 1-9)

Mixed nets are easily detectable as cut nets in the hypergraph representation, marked in red.

SLIDE 62

Eventually we will have partitioned V into many subsets V0, . . . , Vp−1. The nets spread over multiple subsets make up the mixed-row blocks that incur more than 1 cache miss per row.

(figure: cache misses per SBD block, as before)

SLIDE 63

The number of subsets a net is spread over after partitioning is called its connectivity λi. For p = n/(wL), the number of cache misses is given by

    Σ_{i: ni ∈ N} (λi − 1).

This is also called the (λ − 1)-metric, and is already used extensively in parallel computing. Partitioners should also take the load imbalance, typically denoted by ǫ, into account.

References:
  Çatalyürek and Aykanat, Hypergraph-partitioning-based decomposition for parallel sparse-matrix vector multiplication, 1999
  Vastenhouw and Bisseling, A two-dimensional data distribution method for parallel sparse matrix-vector multiplication, 2005

Our method has been implemented by adapting the Mondriaan software.
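The (λ − 1)-metric can be computed directly from a partitioning. A minimal sketch (the net/partition encoding below is illustrative):

```python
def lambda_minus_one(nets, part):
    """Sum of (lambda_i - 1) over all nets: each net (row) is a set of
    vertices (column indices); part maps a vertex to its subset index."""
    total = 0
    for net in nets:
        spread = {part[v] for v in net}   # subsets this net is spread over
        total += len(spread) - 1
    return total

# Two nets over 4 vertices, split into V0 = {0, 1} and V1 = {2, 3};
# only the second net is cut, so the metric is 1
nets = [{0, 1}, {1, 2, 3}]
part = {0: 0, 1: 0, 2: 1, 3: 1}
print(lambda_minus_one(nets, part))  # 1
```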

SLIDE 64

For a truly cache-oblivious approach, the number of subsets p we partition V into must be as large as possible: p → ∞ (in practice, p = n). Taking p much larger does not harm efficiency, and even optimises for the smaller caches closer to the CPU; we optimise access for all cache levels, irrespective of the cache parameters.

SLIDE 65

Experimental results

SLIDES 66-71

(figures: spy plots)
The memplus matrix: p = 1 (original); p = 2, ǫ = 0.1; p = 100, ǫ = 0.1
The rhpentium matrix: p = 1 (original); p = 100, ǫ = 0.1; p = 400, ǫ = 0.1

SLIDE 72

Cache-simulated results

Name        p = 2      p = 100    p = 400
memplus     0.99 (C)   1.05 (C)   1.08 (Z)
rhpentium   0.96 (Z)   0.66 (Z)   0.63 (I)
s3dkt3m2    1.00 (I)   1.00 (C)   1.00 (C)
rand10000   0.91 (C)   0.72 (C)   0.70 (I)
fidap037    0.98 (C)   1.00 (C)   1.01 (C)

Number of cache misses on reordered matrices (ǫ = 0.1) divided by the number of cache misses on the original matrix during SpMV. Simulated is an Intel Core2 L1 cache.

SLIDE 73

(figure: cache-miss ratios for matrices 1-5, original vs. reordered with p = 2, 3, 4, 100, 400, at ǫ = 0.1 and ǫ = 0.3)

Matrices used are: 1. memplus, 2. rhpentium, 3. s3dkt3m2, 4. rand10000, and 5. fidap037.

SLIDE 74

(figure: cache-miss ratios for matrices 11-14, original vs. reordered with p = 2, 3, 4, 5, 10, 15, 20)

Matrices used are: 11. Stanford, 12. Stanford Berkeley, 13. wikipedia20051105, and 14. cage14.

SLIDE 75

(figure: results for matrices 11-14 relative to CRS, with p = 1, 2, 3, 4, 5, 10, 15, 20)

Matrices used are: 11. Stanford, 12. Stanford Berkeley, 13. wikipedia20051105, and 14. cage14.

SLIDE 76

Pre-processing times

Matrix                      Reordering time   SpMV time (original / reordered)
rhpentium, p = 400          1 minute          0.9 / 0.7 ms
memplus, p = 100            4 minutes         0.4 / 0.4 ms
stanford, p = 20            4 minutes         27 / 14 ms
cage14, p = 20              12 minutes        109 / 123 ms
stanford berkeley, p = 20   13 minutes        31 / 30 ms
wikipedia, p = 10           16 minutes        353 / 236 ms
wikipedia, p = 20           45 minutes        353 / 237 ms

SLIDE 77

Name        p = 2   p = 3   p = 4   p = 100   p = 400
memplus      1531    1914    1818    12793     101744
rhpentium    5090    6752    7303    17064      60251
s3dkt3m2      603     673     740     1934       5181
rand10000    1560    1411    1820    23179     103565
fidap037     1005    1068    1131     5657      12761

Name                 p = 2   p = 3   p = 4   p = 10   p = 20
stanford              5139    7603    6828    7922     8332
stanford berkeley    11305   13208   15093   19138    21260
wikipedia20051105     2152    1992    2168    2570     7418
cage14                4238    3987    3635    4583     5611

Table: Cost of reordering in terms of the number of matrix–vector multiplications on the original matrix. Here, ǫ = 0.1. Construction times were measured on an Intel Core 2 (Q6600) machine.

SLIDE 78

Conclusions & Future Work

SLIDE 79

Conclusions: we have introduced a scheme capable of increasing SpMV performance by up to a factor of two, while:

  • remaining cache-oblivious,
  • remaining matrix-oblivious, and
  • keeping open the option of using specialised sparse BLAS libraries in the background.

For already structured matrices our approach does not obtain significant speedups, but neither does it decrease performance by much.

Future work: look into the use of two-dimensional partitioning methods, and speed up matrix partitioning in Mondriaan.
