Cache-Oblivious Algorithms
Paper Reading Group
Matteo Frigo, Charles E. Leiserson, Harald Prokop, Sridhar Ramachandran
Presented by: Maksym Planeta
03.09.2015
Table of Contents
Introduction
Cache-oblivious algorithms
Matrix multiplication
Matrix transposition
Fast Fourier Transform
Sorting
Relieved system model
Experimental evaluation
Conclusion
Matrix multiplication
ORD-MULT(A, B, C)
    for i ← 1 to m
        for j ← 1 to p
            for k ← 1 to n
                Cij ← Cij + Aik × Bkj
Matrix layout
Like in C: elements 0 to 63 of an 8 × 8 matrix are stored row by row.
Figure (a): Row major order
Or like in Fortran: the same elements are stored column by column.
Figure (b): Column major order
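The two layouts differ only in how a 2-D index maps to a flat memory offset. The following is a minimal sketch of that mapping, assuming 0-based indices; the function names are hypothetical and not from the slides.

```python
def row_major_offset(i, j, rows, cols):
    """Offset of element (i, j) when rows are stored contiguously (C style)."""
    return i * cols + j


def col_major_offset(i, j, rows, cols):
    """Offset of element (i, j) when columns are stored contiguously (Fortran style)."""
    return j * rows + i


if __name__ == "__main__":
    rows, cols = 8, 8
    # Walking along row 0: consecutive offsets only in the row-major layout.
    print([row_major_offset(0, j, rows, cols) for j in range(4)])
    print([col_major_offset(0, j, rows, cols) for j in range(4)])
```

Walking along a row touches consecutive words in row-major order but jumps by a full column length in column-major order, which is why the loop order of ORD-MULT matters for cache misses.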
Cache friendly algorithm
BLOCK-MULT(A, B, C, n)
    for i ← 1 to n/s
        for j ← 1 to n/s
            for k ← 1 to n/s
                ORD-MULT(Aik, Bkj, Cij, s)
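A minimal Python sketch of the blocked scheme, assuming square n × n matrices stored as lists of lists and a block size s that divides n. The index-offset arguments (i0, j0, k0) are an assumption of this sketch; the pseudocode passes submatrices instead.

```python
def ord_mult(A, B, C, i0, j0, k0, s):
    """Add the product of the s x s blocks A[i0.., k0..] and B[k0.., j0..] into C[i0.., j0..]."""
    for i in range(i0, i0 + s):
        for j in range(j0, j0 + s):
            acc = C[i][j]
            for k in range(k0, k0 + s):
                acc += A[i][k] * B[k][j]
            C[i][j] = acc


def block_mult(A, B, C, n, s):
    """C += A * B for n x n matrices, tiled into s x s blocks (assumes s divides n)."""
    if n % s != 0:
        raise ValueError("this sketch assumes n mod s == 0")
    for i in range(0, n, s):
        for j in range(0, n, s):
            for k in range(0, n, s):
                ord_mult(A, B, C, i, j, k, s)


if __name__ == "__main__":
    A = [[1, 2], [3, 4]]
    B = [[5, 6], [7, 8]]
    C = [[0, 0], [0, 0]]
    block_mult(A, B, C, n=2, s=1)
    print(C)
```

The block size s is the tuning parameter: it has to be picked so that the blocks being multiplied fit in the cache together, which is exactly the machine-dependent choice a cache-oblivious algorithm avoids.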
BLOCK-MULT issues
Being cache aware is hard:
◮ Cumbersome structure
◮ Complicated choice of s
◮ A poorly chosen s is expensive
◮ Problematic if n mod s ≠ 0
Motivation
◮ Keeping the algorithm simple is nice.
◮ But cache efficiency is a must.
System model
Figure 1: The ideal-cache model. The CPU performs W work. The cache holds Z words, organized by an optimal replacement strategy into Z/L cache lines of length L, and sits between the CPU and main memory. Q counts cache misses.
◮ Two-level memory
◮ Fully associative
◮ Strictly optimal replacement
◮ Automatic replacement
◮ Tall cache: $Z = \Omega(L^2)$, where:
    Z – number of words in the cache
    L – number of words in a cache line
Matrix multiplication
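Given: A[m × n] × B[n × p] → C[m × p]
REC-MULT halves the largest of the three dimensions and recurses:

$$\begin{pmatrix} A_1 \\ A_2 \end{pmatrix} B = \begin{pmatrix} A_1 B \\ A_2 B \end{pmatrix}, \qquad m \ge \max(n, p) \tag{1}$$

$$\begin{pmatrix} A_1 & A_2 \end{pmatrix} \begin{pmatrix} B_1 \\ B_2 \end{pmatrix} = A_1 B_1 + A_2 B_2, \qquad n \ge \max(m, p) \tag{2}$$

$$A \begin{pmatrix} B_1 & B_2 \end{pmatrix} = \begin{pmatrix} A B_1 & A B_2 \end{pmatrix}, \qquad p \ge \max(n, m) \tag{3}$$

$$C_{ij} := C_{ij} + A_{ik} \cdot B_{kj}, \qquad m = n = p = 1 \tag{4}$$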
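The four cases can be sketched in Python as follows. This is an illustrative sketch, not the authors' implementation: matrices are lists of lists, submatrices are addressed by hypothetical offset and size arguments (i0, m, k0, n, j0, p), and the split point is simply the floor of half the dimension.

```python
def rec_mult(A, B, C, i0, m, k0, n, j0, p):
    """C[i0:i0+m, j0:j0+p] += A[i0:i0+m, k0:k0+n] * B[k0:k0+n, j0:j0+p]."""
    if m == 0 or n == 0 or p == 0:
        return
    if m == n == p == 1:
        # Case (4): scalar multiply-add.
        C[i0][j0] += A[i0][k0] * B[k0][j0]
    elif m >= max(n, p):
        # Case (1): split the rows of A (and of C).
        h = m // 2
        rec_mult(A, B, C, i0, h, k0, n, j0, p)
        rec_mult(A, B, C, i0 + h, m - h, k0, n, j0, p)
    elif n >= max(m, p):
        # Case (2): split the shared dimension; both halves add into the same C.
        h = n // 2
        rec_mult(A, B, C, i0, m, k0, h, j0, p)
        rec_mult(A, B, C, i0, m, k0 + h, n - h, j0, p)
    else:
        # Case (3): split the columns of B (and of C).
        h = p // 2
        rec_mult(A, B, C, i0, m, k0, n, j0, h)
        rec_mult(A, B, C, i0, m, k0, n, j0 + h, p - h)


if __name__ == "__main__":
    A = [[1, 2], [3, 4], [5, 6]]      # 3 x 2
    B = [[1, 0, 2, 1], [0, 1, 3, 1]]  # 2 x 4
    C = [[0] * 4 for _ in range(3)]
    rec_mult(A, B, C, 0, 3, 0, 2, 0, 4)
    print(C)
```

No block size appears anywhere in the code: the recursion eventually produces subproblems that fit in the cache, whatever Z and L are.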
Bounds
REC-MULT
Work: $\Theta(n^3)$
Cache misses: $\Theta(n + n^2/L + n^3/(L\sqrt{Z}))$
vs BLOCK-MULT
Work: $\Theta(n^3)$
Cache misses: $\Theta(1 + n^2/L + n^3/(L\sqrt{Z}))$
vs Strassen's [2] (cache oblivious)
Work: $\Theta(n^{\log_2 7})$
Cache misses: $\Theta(1 + n^2/L + n^{\log_2 7}/(L\sqrt{Z}))$
Matrix transposition
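Given: A[m × n] → B[n × m]
REC-TRANSPOSE halves the larger dimension and transposes each half recursively, so that $B_1 = A_1^T$ and $B_2 = A_2^T$. For $n \ge m$:

$$A = \begin{pmatrix} A_1 & A_2 \end{pmatrix}, \qquad B = \begin{pmatrix} B_1 \\ B_2 \end{pmatrix} \tag{5}$$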
Bounds
REC-TRANSPOSE
Work: $\Theta(mn)$
Cache misses: $\Theta(1 + mn/L)$
Asymptotically optimal
Naïve
Work: $\Theta(mn)$
Cache misses: $\Theta(mn)$
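A minimal sketch of the recursive transposition, under the same assumptions as before: lists of lists, hypothetical offset and size arguments, and a split at the floor of half the larger dimension.

```python
def rec_transpose(A, B, i0, m, j0, n):
    """Write the transpose of A[i0:i0+m, j0:j0+n] into B[j0:j0+n, i0:i0+m]."""
    if m == 0 or n == 0:
        return
    if m == 1 and n == 1:
        B[j0][i0] = A[i0][j0]
    elif n >= m:
        # Split the columns of A, which are the rows of B.
        h = n // 2
        rec_transpose(A, B, i0, m, j0, h)
        rec_transpose(A, B, i0, m, j0 + h, n - h)
    else:
        # Split the rows of A, which are the columns of B.
        h = m // 2
        rec_transpose(A, B, i0, h, j0, n)
        rec_transpose(A, B, i0 + h, m - h, j0, n)


if __name__ == "__main__":
    A = [[r * 5 + c for c in range(5)] for r in range(3)]  # 3 x 5
    B = [[None] * 3 for _ in range(5)]                     # 5 x 3
    rec_transpose(A, B, 0, 3, 0, 5)
    print(B)
```

The naive double loop reads one matrix with unit stride and writes the other with a stride of a full row, which is where its Θ(mn) misses come from; the recursion keeps both the source and the destination block small at the same time.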
Discrete Fourier Transform (DFT)
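Compute:

$$Y[i] = \sum_{j=0}^{n-1} X[j]\,\omega_n^{-ij},$$

where $\omega_n = e^{2\pi\sqrt{-1}/n}$.
Assume $n = 2^k$, $k \in \mathbb{N}$.
Choose $n_1 = 2^{\lceil (\log_2 n)/2 \rceil}$, $n_2 = 2^{\lfloor (\log_2 n)/2 \rfloor}$, so that $n = n_1 n_2$.
Factorized Y (Cooley-Tukey algorithm), with $i = i_1 + i_2 n_1$ and $j = j_1 n_2 + j_2$:

$$Y[i_1 + i_2 n_1] = \sum_{j_2=0}^{n_2-1} \left[ \left( \sum_{j_1=0}^{n_1-1} X[j_1 n_2 + j_2]\,\omega_{n_1}^{-i_1 j_1} \right) \omega_n^{-i_1 j_2} \right] \omega_{n_2}^{-i_2 j_2}$$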
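The factorization can be checked numerically. The sketch below evaluates the nested sums directly (it is not a fast or cache-oblivious FFT, only a check of the identity) and compares them with the plain DFT definition. The helper names `split_sizes`, `dft`, `factored_dft` and `w` are hypothetical.

```python
import cmath


def split_sizes(n):
    """Return (n1, n2) with n1 = 2^ceil(k/2), n2 = 2^floor(k/2) for n = 2^k."""
    k = n.bit_length() - 1
    return 1 << ((k + 1) // 2), 1 << (k // 2)


def w(m, e):
    """omega_m^(-e) with omega_m = exp(2*pi*sqrt(-1)/m)."""
    return cmath.exp(-2j * cmath.pi * e / m)


def dft(x):
    """Direct evaluation of Y[i] = sum_j X[j] * omega_n^(-i*j)."""
    n = len(x)
    return [sum(x[j] * w(n, i * j) for j in range(n)) for i in range(n)]


def factored_dft(x):
    """Evaluate the two-level factorization with i = i1 + i2*n1, j = j1*n2 + j2."""
    n = len(x)
    n1, n2 = split_sizes(n)
    y = [0j] * n
    for i1 in range(n1):
        for i2 in range(n2):
            total = 0j
            for j2 in range(n2):
                # Inner DFT of size n1 over the strided subsequence X[j1*n2 + j2].
                inner = sum(x[j1 * n2 + j2] * w(n1, i1 * j1) for j1 in range(n1))
                # Twiddle factor, then one term of the outer DFT of size n2.
                total += inner * w(n, i1 * j2) * w(n2, i2 * j2)
            y[i1 + i2 * n1] = total
    return y


if __name__ == "__main__":
    x = [complex(v) for v in range(8)]
    err = max(abs(a - b) for a, b in zip(dft(x), factored_dft(x)))
    print(err < 1e-9)
```

The identity holds because $\omega_n^{n_2} = \omega_{n_1}$, $\omega_n^{n_1} = \omega_{n_2}$ and $\omega_n^{n_1 n_2} = 1$, so the exponent $ij$ splits into the three factors used in the code.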
Sorting
Mergesort is not optimal with respect to cache misses.
Two cache-oblivious alternatives:
1. Funnelsort
2. Distribution sort
◮ Recursive
◮ Asymptotically cache-optimal
◮ Not every recursive sort is cache optimal
Funnelsort
1. Split the input into $n^{1/3}$ contiguous arrays of size $n^{2/3}$, and sort these arrays recursively
2. Merge the $n^{1/3}$ sorted sequences using an $n^{1/3}$-merger
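The two steps can be sketched as below. This shows only the recursive structure: the standard-library `heapq.merge` is swapped in for the k-merger, so the sketch does not reproduce the cache behaviour of funnelsort. The function name and the base-case threshold of 8 are arbitrary assumptions.

```python
import heapq


def funnelsort_sketch(a):
    """Sort a list by splitting it into about n^(1/3) runs of about n^(2/3) elements."""
    n = len(a)
    if n <= 8:
        # Small inputs: fall back to the built-in sort (threshold is arbitrary).
        return sorted(a)
    size = max(1, round(n ** (2 / 3)))  # length of each contiguous subarray
    runs = [funnelsort_sketch(a[i:i + size]) for i in range(0, n, size)]
    # Stand-in for the n^(1/3)-merger: a plain heap-based multiway merge.
    return list(heapq.merge(*runs))


if __name__ == "__main__":
    data = [(i * 37) % 101 for i in range(200)]
    print(funnelsort_sketch(data) == sorted(data))
```

The cache-optimal bound depends on the recursive layout of the merger and its buffers, which the heap-based merge here does not provide.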
k-merger
Figure 3: Illustration of a k-merger. A k-merger is built recursively out of $\sqrt{k}$ "left" $\sqrt{k}$-mergers $L_1, L_2, \ldots, L_{\sqrt{k}}$, a series of buffers, and one "right" $\sqrt{k}$-merger R.
Bounds
Work: $O(n \log_2 n)$
Optimal cache misses: $O(1 + (n/L)(1 + \log_Z n))$
Relieved system model
◮ LRU replacement instead of optimal replacement
◮ Cache misses remain Θ(Q(n; Z; L))
◮ Multilevel cache
◮ Inclusive cache
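To get a feel for the LRU variant of the model, the following sketch counts misses of a fully associative LRU cache with Z words and lines of L words on a given sequence of word addresses. The function name `lru_misses` and the example parameters (Z = 256, L = 8, a 64 × 64 row-major matrix) are assumptions chosen for illustration.

```python
from collections import OrderedDict


def lru_misses(addresses, Z, L):
    """Count misses of a fully associative LRU cache of Z words with L-word lines."""
    capacity = Z // L          # number of cache lines
    cache = OrderedDict()      # line number -> True, ordered from least to most recent
    misses = 0
    for addr in addresses:
        line = addr // L
        if line in cache:
            cache.move_to_end(line)        # mark as most recently used
        else:
            misses += 1
            cache[line] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict the least recently used line
    return misses


if __name__ == "__main__":
    n, Z, L = 64, 256, 8
    row_wise = [i * n + j for i in range(n) for j in range(n)]
    col_wise = [i * n + j for j in range(n) for i in range(n)]
    print(lru_misses(row_wise, Z, L), lru_misses(col_wise, Z, L))
```

With these parameters the row-wise scan of the row-major matrix misses once per line, while the column-wise scan touches more distinct lines per column than the cache holds and therefore misses on every access.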
Micro-benchmarks
Figure 5: Average time taken to multiply two N × N matrices, divided by N³. (Plot: time in microseconds against N; series: iterative, recursive.)
Figure 4: Average time to transpose an N × N matrix, divided by N². (Plot: time in microseconds against N; series: iterative, recursive.)
Real benchmarks [1]
Each plot compares six static search trees: Classic Binary Search, Cache Oblivious and Cache Aware, each in an explicit and an implicit variant.
Fig. 4.8. Cache misses per lookup for static search algorithms. (Plot: average number of cache misses per lookup against number of items.)
Fig. 4.9. Instruction count per lookup for static search algorithms. (Plot: average number of instructions per lookup against number of items.)
Fig. 4.10. Execution time on Windows for static search algorithms. (Plot: time in microseconds per lookup against number of items.)
FFMK tribute slide
. . . FFTW library, which uses a recursive strategy to exploit caches in Fourier transform calculations. FFTW's code generator produces straight-line "codelets", which are coarsened base cases for the FFT algorithm. Because these codelets are cache oblivious, a C compiler can perform its register allocation efficiently, and yet the codelets can be generated without knowing the number of registers on the target architecture.