Cache-Oblivious Algorithms Paper Reading Group Matteo Frigo - - PowerPoint PPT Presentation



SLIDE 1

Cache-Oblivious Algorithms

Paper Reading Group
Matteo Frigo, Charles E. Leiserson, Harald Prokop, Sridhar Ramachandran
Presented by: Maksym Planeta, 03.09.2015

SLIDE 2

Table of Contents

Introduction
Cache-oblivious algorithms
  Matrix multiplication
  Matrix transposition
  Fast Fourier Transform
  Sorting
Relaxed system model
Experimental evaluation
Conclusion

SLIDE 3

Table of Contents

Introduction
Cache-oblivious algorithms
  Matrix multiplication
  Matrix transposition
  Fast Fourier Transform
  Sorting
Relaxed system model
Experimental evaluation
Conclusion

SLIDE 4

Matrix multiplication

ORD-MULT(A, B, C)
  for i ← 1 to m
    for j ← 1 to p
      for k ← 1 to n
        C_ij ← C_ij + A_ik × B_kj
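As a sketch, the ORD-MULT loop nest translates directly to Python (the function name and the list-of-lists representation are my choices, not from the paper):

```python
def ord_mult(A, B, C):
    """Naive matrix multiply: C += A * B, with A (m x n), B (n x p), C (m x p)."""
    m, n, p = len(A), len(B), len(B[0])
    for i in range(m):
        for j in range(p):
            for k in range(n):
                C[i][j] += A[i][k] * B[k][j]
    return C
```

Each element of C is touched p times in the innermost loop's stride-1 order over B's rows, which is exactly what makes this layout-sensitive.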

SLIDE 5

Matrix layout

Like in C . . .

Figure (a): Row major order. The elements of an 8 × 8 matrix are numbered 0-63 in the order they are laid out in memory, one row after another.
SLIDE 6

Matrix layout

Like in C . . .

Figure (a): Row major order. Elements numbered 0-63 are laid out in memory one row after another.

Figure (b): Column major order. Elements numbered 0-63 are laid out in memory one column after another.

Or like in Fortran
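The two layouts differ only in how a two-dimensional index maps to a one-dimensional memory offset; a minimal sketch (function names are mine):

```python
def row_major_index(i, j, n_cols):
    """C-style layout: row i is contiguous in memory."""
    return i * n_cols + j

def col_major_index(i, j, n_rows):
    """Fortran-style layout: column j is contiguous in memory."""
    return j * n_rows + i
```

Walking along a row is stride-1 in row-major order but stride-n_rows in column-major order, so the same loop nest can be cache-friendly in one language and hostile in the other.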

SLIDE 7

Cache friendly algorithm

BLOCK-MULT(A, B, C, n)
  for i ← 1 to n/s
    for j ← 1 to n/s
      for k ← 1 to n/s
        ORD-MULT(A_ik, B_kj, C_ij, s)
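A sketch of BLOCK-MULT in Python, inlining the ORD-MULT call on s × s blocks and assuming s divides n (one of the pitfalls this approach has):

```python
def block_mult(A, B, C, n, s):
    """Blocked multiply of n x n matrices: update each s x s block of C
    using an ordinary triple loop over s x s blocks of A and B.
    Assumes s divides n."""
    for i0 in range(0, n, s):
        for j0 in range(0, n, s):
            for k0 in range(0, n, s):
                # ORD-MULT on the (i0, j0), (i0, k0), (k0, j0) blocks
                for i in range(i0, i0 + s):
                    for j in range(j0, j0 + s):
                        for k in range(k0, k0 + s):
                            C[i][j] += A[i][k] * B[k][j]
    return C
```

The point of the blocking is that the three s × s working blocks fit in cache together when s is tuned to the cache size, which is exactly the tuning burden the next slides criticize.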

SLIDE 8

BLOCK-MULT issues

Being cache aware is hard:

◮ Cumbersome structure
◮ Complicated choice of s
◮ Expensive if s is chosen poorly
◮ Problematic if n mod s ≠ 0

SLIDE 9

Motivation

◮ Keeping the algorithm simple is nice.
◮ But cache effectiveness is a must.

SLIDE 10

Table of Contents

Introduction
Cache-oblivious algorithms
  Matrix multiplication
  Matrix transposition
  Fast Fourier Transform
  Sorting
Relaxed system model
Experimental evaluation
Conclusion

SLIDE 11

System model

Figure 1: The ideal-cache model. A CPU performing W work operates on a cache of Z words, organized into lines of length L with an optimal replacement strategy; Q counts the cache misses against main memory.

◮ Two-level memory
◮ Fully associative
◮ Strictly optimal replacement
◮ Automatic replacement
◮ Tall cache: Z = Ω(L²), where
    Z is the number of words in the cache, and
    L is the number of words in a cache line

SLIDE 12

Matrix multiplication

Given: A[m × n] × B[n × p] → C[m × p]. Recurse on the largest dimension (a semicolon stacks blocks vertically):

(A1; A2) B = (A1 B; A2 B),          m ≥ max(n, p)   (1)

(A1 A2)(B1; B2) = A1 B1 + A2 B2,    n ≥ max(m, p)   (2)

A (B1 B2) = (A B1   A B2),          p ≥ max(n, m)   (3)

C_ij := C_ij + A_ik · B_kj,         m = n = p = 1   (4)
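The four cases above can be sketched as a recursive multiply (a direct, unoptimized transcription; the name and the list-slicing style are my choices):

```python
def rec_mult(A, B):
    """Recursive multiply following cases (1)-(4): split the largest of
    m, n, p in half until the base case m = n = p = 1."""
    m, n, p = len(A), len(B), len(B[0])
    if m == n == p == 1:                                  # case (4)
        return [[A[0][0] * B[0][0]]]
    if m >= max(n, p):                                    # case (1): split A's rows
        h = m // 2
        return rec_mult(A[:h], B) + rec_mult(A[h:], B)
    if n >= max(m, p):                                    # case (2): split the shared dim, add
        h = n // 2
        A1 = [row[:h] for row in A]
        A2 = [row[h:] for row in A]
        C1, C2 = rec_mult(A1, B[:h]), rec_mult(A2, B[h:])
        return [[x + y for x, y in zip(r1, r2)] for r1, r2 in zip(C1, C2)]
    h = p // 2                                            # case (3): split B's columns
    B1 = [row[:h] for row in B]
    B2 = [row[h:] for row in B]
    C1, C2 = rec_mult(A, B1), rec_mult(A, B2)
    return [r1 + r2 for r1, r2 in zip(C1, C2)]
```

No cache parameter appears anywhere: the recursion reaches subproblems that fit in cache automatically, whatever Z and L happen to be.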

SLIDE 13

Bounds

REC-MULT
  Work: Θ(n³)
  Cache misses: Θ(n + n²/L + n³/(L√Z))

vs BLOCK-MULT
  Work: Θ(n³)
  Cache misses: Θ(1 + n²/L + n³/(L√Z))

vs Strassen's [2] (cache oblivious)
  Work: Θ(n^(log₂ 7))
  Cache misses: Θ(1 + n²/L + n^(log₂ 7)/(L√Z))

SLIDE 14

Matrix transposition

Given: A[m × n] → B[n × m]. Split along the larger dimension, e.g.

A = (A1  A2),   B = (B1; B2)   (5)

and recursively transpose A1 into B1 and A2 into B2 (the semicolon stacks blocks vertically).
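REC-TRANSPOSE can be sketched by always halving the larger dimension of the current sub-block (index bounds instead of copies, so the recursion works on submatrices in place; names are mine):

```python
def rec_transpose(A, B, r0, r1, c0, c1):
    """Transpose the sub-block A[r0:r1, c0:c1] into B, splitting the
    larger dimension in half until a single element remains."""
    rows, cols = r1 - r0, c1 - c0
    if rows == 1 and cols == 1:
        B[c0][r0] = A[r0][c0]
    elif rows >= cols:
        mid = r0 + rows // 2          # split A horizontally, B vertically
        rec_transpose(A, B, r0, mid, c0, c1)
        rec_transpose(A, B, mid, r1, c0, c1)
    else:
        mid = c0 + cols // 2          # split A vertically, B horizontally
        rec_transpose(A, B, r0, r1, c0, mid)
        rec_transpose(A, B, r0, r1, mid, c1)
```

Once a sub-block of A and its image in B both fit in cache together, the rest of that subtree incurs only the compulsory misses, which is where the Θ(1 + mn/L) bound comes from.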
SLIDE 15

Bounds

REC-TRANSPOSE
  Work: Θ(n · m)
  Cache misses: Θ(1 + mn/L), asymptotically optimal

Naïve
  Work: Θ(n · m)
  Cache misses: Θ(n · m)

SLIDE 16

Discrete Fourier Transform (DFT)

Compute:

  Y[i] = Σ_{j=0}^{n−1} X[j] · ω_n^(−ij),   where ω_n = e^(2π√−1/n)

Assume n = 2^k, k ∈ ℕ. Choose n1 = 2^⌈log₂(n)/2⌉ and n2 = 2^⌊log₂(n)/2⌋, so that n = n1 · n2.

Factorized form (Cooley-Tukey algorithm):

  Y[i1 + i2·n1] = Σ_{j2=0}^{n2−1} ( ( Σ_{j1=0}^{n1−1} X[j1·n2 + j2] · ω_{n1}^(−i1·j1) ) · ω_n^(−i1·j2) ) · ω_{n2}^(−i2·j2)
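The factorization can be checked numerically against the direct definition; a sketch (deliberately O(n²) on both sides, mirroring the formulas rather than the fast recursion; function names are mine):

```python
import cmath

def dft(X):
    """Direct DFT: Y[i] = sum_j X[j] * w_n^(-i*j), with w_n = e^(2*pi*i/n)."""
    n = len(X)
    w = cmath.exp(2j * cmath.pi / n)
    return [sum(X[j] * w ** (-i * j) for j in range(n)) for i in range(n)]

def dft_factorized(X):
    """Cooley-Tukey factorization with n = n1 * n2 chosen as on the slide."""
    n = len(X)
    k = n.bit_length() - 1                       # n = 2^k assumed
    n1, n2 = 1 << ((k + 1) // 2), 1 << (k // 2)  # n1 = 2^ceil(k/2), n2 = 2^floor(k/2)
    w_n  = cmath.exp(2j * cmath.pi / n)
    w_n1 = cmath.exp(2j * cmath.pi / n1)
    w_n2 = cmath.exp(2j * cmath.pi / n2)
    Y = [0j] * n
    for i1 in range(n1):
        for i2 in range(n2):
            acc = 0j
            for j2 in range(n2):
                inner = sum(X[j1 * n2 + j2] * w_n1 ** (-i1 * j1)
                            for j1 in range(n1))
                acc += inner * w_n ** (-i1 * j2) * w_n2 ** (-i2 * j2)
            Y[i1 + i2 * n1] = acc
    return Y
```

The inner sums are themselves DFTs of size n1 and n2, which is what the cache-oblivious FFT exploits by recursing on both factors.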

SLIDE 17

Sorting

Mergesort is not optimal with respect to cache misses.

1. Funnelsort
2. Distribution sort

◮ Recursive
◮ Asymptotically cache-optimal
◮ Not every recursive sort is cache-optimal

SLIDE 18

Funnelsort

1. Split the input into n^(1/3) arrays of size n^(2/3), and sort these arrays recursively
2. Merge the n^(1/3) sorted sequences using an n^(1/3)-merger
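A k-merger consumes k sorted sequences and produces their sorted merge. Functionally (ignoring the buffered recursive structure that makes it cache-efficient), that is plain k-way merging, sketched here with the standard library:

```python
import heapq

def k_merge(seqs):
    """Merge k sorted sequences into one sorted list: the input/output
    behavior of a k-merger. heapq.merge keeps a heap of k heads, which is
    NOT the cache-oblivious buffered scheme of Figure 3, just its spec."""
    return list(heapq.merge(*seqs))
```

Funnelsort's contribution is not what the merger computes but how: the recursive buffer layout bounds the misses of the merge, which a heap over k streams does not.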

SLIDE 19

k-merger

Figure 3: Illustration of a k-merger. A k-merger is built recursively out of √k "left" √k-mergers L1, L2, ..., L√k, a series of buffers, and one "right" √k-merger R.
SLIDE 20

Bounds

Work: O(n · log₂ n)
Cache misses: O(1 + (n/L)(1 + log_Z n)), which is optimal

SLIDE 21

Relaxed system model

◮ LRU replacement instead of optimal: still Θ(Q(n; Z; L)) misses
◮ Multilevel caches
◮ Inclusive caches

SLIDE 22

Table of Contents

Introduction
Cache-oblivious algorithms
  Matrix multiplication
  Matrix transposition
  Fast Fourier Transform
  Sorting
Relaxed system model
Experimental evaluation
Conclusion

SLIDE 23

Micro-benchmarks

Figure 5: Average time taken to multiply two N × N matrices, divided by N³ (iterative vs recursive).

Figure 4: Average time to transpose an N × N matrix, divided by N² (iterative vs recursive).
SLIDE 24

Real benchmarks [1]

Fig. 4.8. Cache misses per lookup for static search algorithms. Average number of cache misses per lookup vs number of items, for classic binary search, cache-oblivious, and cache-aware search trees (explicit and implicit layouts).
SLIDE 25

Real benchmarks [1]

Fig. 4.9. Instruction count per lookup for static search algorithms. Average number of instructions per lookup vs number of items, for classic binary search, cache-oblivious, and cache-aware search trees (explicit and implicit layouts).
SLIDE 26

Real benchmarks [1]

Fig. 4.10. Execution time on Windows for static search algorithms. Time in microseconds per lookup vs number of items, for classic binary search, cache-oblivious, and cache-aware search trees (explicit and implicit layouts).
SLIDE 27

Table of Contents

Introduction
Cache-oblivious algorithms
  Matrix multiplication
  Matrix transposition
  Fast Fourier Transform
  Sorting
Relaxed system model
Experimental evaluation
Conclusion

SLIDE 28

FFMK tribute slide

. . . FFTW library, which uses a recursive strategy to exploit caches in Fourier transform calculations. FFTW's code generator produces straight-line "codelets", which are coarsened base cases for the FFT algorithm. Because these codelets are cache oblivious, a C compiler can perform its register allocation efficiently, and yet the codelets can be generated without knowing the number of registers on the target architecture.
SLIDE 29

Open questions

◮ Is there a gap in asymptotic complexity?
◮ Is there a limit to how much better a cache-aware algorithm can be?

SLIDE 30

Conclusion

◮ Can be somewhat slower in practice
◮ Provide cache optimality without knowing the cache size
◮ Based on recursion

SLIDE 31

[1] Richard E. Ladner, Ray Fortna, and Bao-Hoang Nguyen. A comparison of cache aware and cache oblivious static search trees using program instrumentation. In Experimental Algorithmics, pages 78-92. Springer, 2002.

[2] Volker Strassen. Gaussian elimination is not optimal. Numerische Mathematik, 13(4):354-356, 1969.