Cache-oblivious sparse matrix–vector multiplication


  1. Cache-oblivious sparse matrix–vector multiplication. Albert-Jan Yzelman, April 3, 2009. Joint work with Rob Bisseling.

  2. Motivation: basic implementations can suffer up to a 2x slowdown. Even worse: dedicated libraries may in some cases still show a similar level of inefficiency.

  3. Outline: 1. Memory and multiplication; 2. Cache-friendly data structures; 3. Cache-oblivious sparse matrix structure; 4. Obtaining SBD form using partitioners; 5. Experimental results; 6. Conclusions & future work.

  4. Section 1: Memory and multiplication.

  5. Cache parameters: cache size S (in bytes); line size L_S (in bytes); number of cache lines L = S / L_S; number of subcaches k; number of cache levels.
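For readability, here is a small sketch (my own, not from the slides) collecting these parameters in one place; the struct and field names are hypothetical.

```cpp
#include <cstddef>

// The cache parameters named on this slide, gathered in a struct.
struct CacheParams {
    std::size_t S;    // cache size in bytes
    std::size_t L_S;  // line size in bytes
    std::size_t k;    // number of subcaches (k = 1: naive cache, k = L: ideal cache)

    std::size_t lines() const { return S / L_S; }  // number of cache lines L = S / L_S
};
```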

  6. Naive cache: k = 1, a modulo-mapped cache. A block of memory (of length L_S) from RAM with start address x is stored in cache line number x mod L. [Figure: blocks of main memory (RAM) mapped onto the cache lines.]
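As a concrete illustration (my own sketch, not from the slides), the naive mapping can be written as a small function; here the byte address x is interpreted through its block index x / L_S, which the slide abbreviates as x mod L.

```cpp
#include <cstddef>

// Naive (k = 1) modulo-mapped cache: the memory block containing byte
// address x always lands in one fixed cache line.
std::size_t naive_cache_line(std::size_t x, std::size_t S, std::size_t L_S) {
    const std::size_t L = S / L_S;       // number of cache lines
    const std::size_t block = x / L_S;   // index of the memory block holding x
    return block % L;                    // the slide's "x mod L", with x read as a block index
}
```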

  7. 'Ideal' cache: instead of the naive modulo mapping, we use a smarter policy. We take k = L = 4, using a 'Least Recently Used' (LRU) replacement policy. Example: after requesting x_1, ..., x_4 the LRU stack holds (most recent first) x_4, x_3, x_2, x_1; requesting x_2 moves x_2 to the top, giving x_2, x_4, x_3, x_1; requesting x_5 evicts the least recently used element x_1, giving x_5, x_2, x_4, x_3.
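A minimal sketch (my own, assuming a fully associative cache of k lines) of the LRU policy the slide steps through; requesting x_1, ..., x_4, then x_2, then x_5 reproduces the stack states above.

```cpp
#include <algorithm>
#include <cstddef>
#include <deque>

// Fully associative cache maintained as a Least Recently Used (LRU) stack.
struct LRUCache {
    std::size_t k;          // capacity: number of cache lines (k = L here)
    std::deque<int> stack;  // front = most recently used tag

    // Request a line; returns true on a hit, false on a miss.
    bool request(int tag) {
        auto it = std::find(stack.begin(), stack.end(), tag);
        if (it != stack.end()) {        // hit: move the tag to the top of the stack
            stack.erase(it);
            stack.push_front(tag);
            return true;
        }
        if (stack.size() == k) stack.pop_back();  // full: evict the least recently used
        stack.push_front(tag);                    // miss: load as most recently used
        return false;
    }
};
```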

  8. Realistic cache: 1 < k < L, combining the modulo mapping with the LRU policy. [Figure: blocks of main memory (RAM) are modulo-mapped onto subcaches; each subcache is maintained as an LRU stack.]
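A sketch of the realistic cache in the same vein (my own, reusing the LRUCache above and reading k as the depth of each LRU stack, so that there are L / k modulo-mapped subcaches; k = 1 then recovers the naive cache and k = L the ideal one).

```cpp
#include <cstddef>
#include <vector>

// Set-associative cache: modulo mapping chooses a subcache, LRU acts within it.
struct SetAssociativeCache {
    std::size_t L_S;              // line size in bytes
    std::vector<LRUCache> sets;   // L / k subcaches, each an LRU stack of k lines

    SetAssociativeCache(std::size_t S, std::size_t L_S, std::size_t k)
        : L_S(L_S), sets(S / L_S / k, LRUCache{k, {}}) {}

    // Access byte address x in RAM; returns true on a cache hit.
    bool access(std::size_t x) {
        const std::size_t block = x / L_S;           // memory block index
        const std::size_t s = block % sets.size();   // modulo mapping to a subcache
        return sets[s].request(static_cast<int>(block));  // LRU within that subcache
    }
};
```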

  9. Multilevel caches. [Figure: the CPU is served by an L1 cache, backed by an L2 cache, backed by main memory (RAM).] Example parameters: Intel Core2: L1 S = 32 kB, k = 8; L2 S = 4 MB, k = 16. AMD K8: L1 S = 16 kB, k = 2; L2 S = 1 MB, k = 16.
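For reference, the quoted parameters expressed with the CacheParams sketch from above (the 64-byte line size is my assumption; the slide does not give L_S).

```cpp
// Cache parameters of the two example machines (line size 64 B assumed).
const CacheParams core2_L1{32 * 1024,        64,  8};
const CacheParams core2_L2{ 4 * 1024 * 1024, 64, 16};
const CacheParams k8_L1   {16 * 1024,        64,  2};
const CacheParams k8_L2   { 1 * 1024 * 1024, 64, 16};
```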

  10. The dense case: dense matrix–vector multiplication y = A x, with A a 4 × 4 matrix (entries a_00, ..., a_33), x = (x_0, x_1, x_2, x_3)^T and y = (y_0, y_1, y_2, y_3)^T. Example with k = L = 2.

  11.–15. [Animation: the first row y_0 = a_00 x_0 + a_01 x_1 + ... is processed step by step; the cache successively receives x_0, a_00, y_0, x_1, a_01, y_0, with the older elements evicted as the cache fills.]
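The loop the animation corresponds to is the textbook dense MV below (a sketch of my own, not the slides' code): every row traverses all of x, so once n is large the early entries of x are evicted before the next row revisits them.

```cpp
#include <cstddef>
#include <vector>

// Naive dense matrix-vector multiplication y = A x, row by row.
void dense_mv(const std::vector<std::vector<double>>& A,
              const std::vector<double>& x, std::vector<double>& y) {
    const std::size_t m = A.size(), n = x.size();
    for (std::size_t i = 0; i < m; ++i) {
        double yi = 0.0;
        for (std::size_t j = 0; j < n; ++j)  // the whole of x is touched on every row
            yi += A[i][j] * x[j];
        y[i] = yi;
    }
}
```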

  16. When k and L are a bit larger, we can predict the following: the lower elements of the vector x (that is, x_0, x_1, ..., x_i for some i < n) are evicted while processing the entire first row. This causes O(n) cache misses on each of the remaining m − 1 rows. Fix: stop processing a row before an element of x would be evicted and first continue row-wise, i.e., process Ax by doing MVs on m × q submatrices: y = A_0 x_0 + A_1 x_1 + ..., where A_j is the j-th block of q columns and x_j the corresponding part of x. Unwanted side effect: now the lower elements of the vector y can be prematurely evicted. Fix: stop processing a submatrix before an element of y would be evicted; the MV routine is now applied to p × q submatrices. This approach is cache-aware; it is implemented in, e.g., GotoBLAS. A sketch follows below.
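A minimal sketch (my own, not GotoBLAS code) of the cache-aware fix described above: the multiply is performed on p × q blocks so that only p entries of y and q entries of x are live at any time; p and q would be tuned to the cache parameters, which is what makes the approach cache-aware rather than cache-oblivious.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Blocked dense MV on p x q submatrices; assumes y is zero on entry.
void blocked_mv(const std::vector<std::vector<double>>& A,
                const std::vector<double>& x, std::vector<double>& y,
                std::size_t p, std::size_t q) {
    const std::size_t m = A.size(), n = x.size();
    for (std::size_t jb = 0; jb < n; jb += q)        // block of q columns (a slice of x)
        for (std::size_t ib = 0; ib < m; ib += p)    // block of p rows (a slice of y)
            for (std::size_t i = ib; i < std::min(ib + p, m); ++i)
                for (std::size_t j = jb; j < std::min(jb + q, n); ++j)
                    y[i] += A[i][j] * x[j];
}
```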

  17. The sparse case. Standard data structure: Compressed Row Storage (CRS).

  18. The sparse case. Standard data structure: Compressed Row Storage (CRS). Example:
      A = [ 4 1 3 0
            0 0 2 3
            1 0 0 2
            7 0 1 1 ]
      stored as:
      nzs: [4 1 3 2 3 1 2 7 1 1]
      col: [0 1 2 2 3 0 3 0 2 3]
      row: [0 3 5 7 10]
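The same example written out as code (a sketch with my own type and field names):

```cpp
#include <vector>

// Compressed Row Storage: row i owns the nonzeroes nzs[row[i]] .. nzs[row[i+1]-1].
struct CRS {
    std::vector<double> nzs;  // nonzero values, stored row by row
    std::vector<int>    col;  // column index of each nonzero
    std::vector<int>    row;  // row pointers, length m + 1
};

// The 4 x 4 example matrix from the slide in CRS form.
const CRS A_example = {
    {4, 1, 3,  2, 3,  1, 2,  7, 1, 1},
    {0, 1, 2,  2, 3,  0, 3,  0, 2, 3},
    {0, 3, 5, 7, 10}
};
```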

  19.–24. The sparse case: sparse matrix–vector multiplication (SpMV). [Animation: the nonzeroes are processed one by one; the cache successively receives x_?, a_0?, y_0, further x_? and a_?? entries, and y_?, where the question marks stand for column indices that depend on the sparsity pattern.] We cannot predict memory accesses in the sparse case!
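The CRS SpMV kernel that produces this access pattern (a sketch of my own, using the CRS type defined above): y and the CRS arrays are traversed sequentially, but x is indexed through col, so which cache lines of x are needed depends entirely on the sparsity pattern.

```cpp
#include <cstddef>
#include <vector>

// SpMV y = A x for a matrix in CRS form.
std::vector<double> spmv(const CRS& A, const std::vector<double>& x) {
    const std::size_t m = A.row.size() - 1;
    std::vector<double> y(m, 0.0);
    for (std::size_t i = 0; i < m; ++i)                  // rows in order: y, nzs, col sequential
        for (int idx = A.row[i]; idx < A.row[i + 1]; ++idx)
            y[i] += A.nzs[idx] * x[A.col[idx]];          // data-dependent access to x
    return y;
}
```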

  25. Section 2: Cache-friendly data structures.
