Cache-Oblivious Algorithms and Data Structures, Gerth Stølting Brodal (presentation slides)

SLIDE 1

Cache-Oblivious Algorithms and Data Structures

Gerth Stølting Brodal

University of Aarhus


SLIDE 2

Outline

  • Motivation
      – A typical workstation
      – A trivial program
  • Memory models
      – I/O model
      – Ideal cache model
  • Basic cache-oblivious algorithms
      – Matrix multiplication
      – Search trees
      – Sorting
  • Some experimental results
  • Conclusion

SLIDE 3

A Typical Workstation

SLIDE 4

Customizing a Dell 650

www.dell.dk www.intel.com

Processor speed       2.4 – 3.2 GHz
L1 cache size         16 KB
L1 cache line size    64 Bytes
L2 cache size         256 – 512 KB
L2 cache line size    128 Bytes
L3 cache size         0.5 – 2 MB
Memory                1/4 – 4 GB
Hard disk             36 – 146 GB, 7,200 – 15,000 RPM
CD/DVD                8 – 48x

Do we want to know?

SLIDE 6

Hierarchical Memory Basics

[Figure: memory hierarchy CPU, L1, L2, L3, RAM, Disk, with access time and space increasing away from the CPU; blocks B1 to B4 are drawn between adjacent levels]

  • Data moved between adjacent memory levels in blocks

SLIDE 7

A Trivial Program

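    /* Link A[0] -> A[d] -> A[2d] -> ... and point the last element back to 0. */
    for (i = 0; i + d < n; i += d)
        A[i] = i + d;
    A[i] = 0;

    /* Follow the chain for 8*1024*1024 steps. */
    for (i = 0, j = 0; j < 8*1024*1024; j++)
        i = A[i];

[Figure: array A of n elements; consecutive chain elements are d positions apart]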
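The loop above can be wrapped into two small functions, as in the following sketch. The function names `build_chain` and `walk_chain` and the choice of `int` elements are assumptions made for illustration; the slide gives only the three statements.

```c
#include <assert.h>

/* Link A[0] -> A[d] -> A[2d] -> ... and close the cycle back to A[0].
   Returns the number of elements on the cycle. */
int build_chain(int *A, int n, int d)
{
    assert(n > 0 && d > 0);
    int i, len = 1;
    for (i = 0; i + d < n; i += d) {
        A[i] = i + d;
        len++;
    }
    A[i] = 0;                 /* last element points back to the start */
    return len;
}

/* Follow the chain for `steps` hops and return the index reached. */
int walk_chain(const int *A, long steps)
{
    int i = 0;
    for (long j = 0; j < steps; j++)
        i = A[i];
    return i;
}
```

To repeat the experiment, one would allocate n ints, call `build_chain(A, n, d)`, and time `walk_chain(A, 8L*1024*1024)` for different n and d; returning the final index keeps the compiler from discarding the loop.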

SLIDE 8

A Trivial Program (cont.)

d = 1

[Plot: running time in seconds against log n, for log n from 5 to 25]

RAM: $n \approx 2^{25} \equiv 128$ MB

SLIDE 9

A Trivial Program (cont.)

d = 1

[Plot: running time in seconds against log n, for log n from 2 to 20]

L1: $n \approx 2^{12} \equiv 16$ KB
L2: $n \approx 2^{16} \equiv 256$ KB

SLIDE 10

A Trivial Program (cont.)

$n = 2^{24}$

[Plot: running time in seconds against log d, for log d from 5 to 25]

Cache line: $d = 2^3 \equiv 32$ Bytes


Do we want to know?

SLIDE 12

A Trivial Program (cont.)

— If you want to know...

Experiments were performed on a DELL 8000 (Pentium III, 850 MHz, 128 MB RAM) running Linux 2.4.2, using gcc version 2.96 with optimization -O3.

L1 instruction and data caches

  • 4-way set associative, 32-byte line size
  • 16 KB instruction cache and 16 KB write-back data cache

L2 level cache

  • 8-way set associative, 32-byte line size
  • 256 KB

www.Intel.com

SLIDE 13

Algorithmic Problem

  • Memory hierarchy has become a fact of life
  • Accessing non-local storage may take a very long time
  • Good locality is important for achieving high performance

              Latency    Relative to CPU
Register      0.5 ns     1
L1 cache      0.5 ns     1-2
L2 cache      3 ns       2-7
DRAM          150 ns     80-200
TLB           500+ ns    200-2000
Disk          10 ms      10^7

(latency increases down the table)

SLIDE 18

Algorithmic Problem

  • Modern hardware is not uniform: many different parameters
      – Number of memory levels
      – Cache sizes
      – Cache line/disk block sizes
      – Cache associativity
      – Cache replacement strategy
      – CPU/BUS/memory speed
  • Programs should ideally run for many different parameters
      – by knowing many of the parameters at runtime
      – by knowing few essential parameters
      – ignoring the memory hierarchies
    (slide annotation next to these options: "practice")
  • Programs are executed on unpredictable configurations
      – Generic portable and scalable software libraries
      – Code downloaded from the Internet, e.g. Java applets
      – Dynamic environments, e.g. multiple processes

SLIDE 20

Hierarchical Memory Models

— many parameters

[Figure: memory hierarchy CPU, L1, L2, L3, RAM, Disk, with access time and space increasing away from the CPU]

  • Limited success because the model is too complicated

SLIDE 22

I/O Model — two parameters

Aggarwal and Vitter 1988

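[Figure: a CPU, a cache of size M, and a memory; data moves between cache and memory in blocks of size B, and each block transfer is one I/O]

  • Measure number of block transfers between two memory levels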

  • Bottleneck in many computations
  • Very successful (simplicity)

Limitations

  • Parameters B and M must be known
  • Does not handle multiple memory levels
  • Does not handle dynamic M

SLIDE 24

Ideal Cache Model — no parameters!?

Frigo, Leiserson, Prokop, Ramachandran 1999

  • Program with only one memory
  • Analyze in the I/O model for arbitrary B and M
  • Optimal off-line cache replacement strategy

[Figure: a CPU, a cache of size M, and a memory; data moves between cache and memory in blocks of size B, and each block transfer is one I/O]

Advantages

  • Optimal on arbitrary level ⇒ optimal on all levels
  • Portability, B and M not hard-wired into algorithm
  • Dynamically changing parameters

SLIDE 25

Justification of the Ideal-Cache Model

Frigo, Leiserson, Prokop, Ramachandran 1999

Optimal replacement
  LRU + 2 × cache size ⇒ at most 2 × cache misses (Sleator and Tarjan, 1985)
  Corollary: $T_{M,B}(N) = O(T_{2M,B}(N))$ ⇒ #cache misses using LRU is $O(T_{M,B}(N))$

Two memory levels
  Optimal cache-oblivious algorithm satisfying $T_{M,B}(N) = O(T_{2M,B}(N))$ ⇒ optimal #cache misses on each level of a multilevel LRU cache

Fully associative cache
  Simulation of LRU
  • Direct mapped cache
  • Explicit memory management
  • Dictionary (2-universal hash functions) of cache lines in memory
  • Expected O(1) access time to a cache line in memory

SLIDE 28

Warm-up: Scanning

    sum = 0
    for i = 1 to N do
        sum = sum + A[i]

$O(N/B)$ I/Os

[Figure: array A of N elements stored in blocks of B elements]

Corollary Cache-oblivious selection requires O(N/B) I/Os

Hoare 1961 / Blum et al. 1973
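The scanning bound can be checked with a small simulation. The sketch below counts block transfers for a left-to-right scan through a fully associative LRU cache; the type `lru_cache` and the functions `lru_new`, `lru_access` and `scan_ios` are hypothetical names, and the array is assumed to start at a block boundary.

```c
#include <assert.h>
#include <stdlib.h>

/* Fully associative LRU cache with M/B slots, each holding one block of B elements. */
typedef struct {
    long B, slots;
    long *block;      /* block id stored in each slot, -1 if the slot is empty */
    long *last_use;   /* time of the most recent access to each slot */
    long clock, ios;  /* ios counts block transfers (cache misses) */
} lru_cache;

lru_cache *lru_new(long M, long B)
{
    assert(B > 0 && M >= B);
    lru_cache *c = malloc(sizeof *c);
    c->B = B; c->slots = M / B; c->clock = 0; c->ios = 0;
    c->block = malloc(c->slots * sizeof *c->block);
    c->last_use = malloc(c->slots * sizeof *c->last_use);
    for (long s = 0; s < c->slots; s++) {
        c->block[s] = -1;
        c->last_use[s] = 0;
    }
    return c;
}

void lru_access(lru_cache *c, long addr)
{
    long id = addr / c->B, victim = 0;
    c->clock++;
    for (long s = 0; s < c->slots; s++) {
        if (c->block[s] == id) {              /* hit: refresh the timestamp */
            c->last_use[s] = c->clock;
            return;
        }
        if (c->last_use[s] < c->last_use[victim])
            victim = s;
    }
    c->block[victim] = id;                    /* miss: replace the least recently used slot */
    c->last_use[victim] = c->clock;
    c->ios++;
}

/* Number of I/Os used by the scan "for i = 1 to N: sum = sum + A[i]". */
long scan_ios(long N, long M, long B)
{
    lru_cache *c = lru_new(M, B);
    for (long i = 0; i < N; i++)
        lru_access(c, i);
    long ios = c->ios;
    free(c->block); free(c->last_use); free(c);
    return ios;
}
```

For a scan the count is ⌈N/B⌉ regardless of M, which is what makes scanning usable as a cache-oblivious building block: the code itself never refers to B or M.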

SLIDE 29

Cache-Oblivious Matrix Multiplication

SLIDE 30

Matrix Multiplication

Problem: $Z = X \cdot Y$, $z_{ij} = \sum_{k=1}^{N} x_{ik} \cdot y_{kj}$

Layout of matrices

[Figure: four 8 × 8 matrices whose cells are numbered in memory order, one per layout: row major, column major, 4 × 4-blocked, bit interleaved]
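The four layouts can be written as index functions that map a cell (i, j) to its position in memory. This is a sketch with hypothetical function names, using 0-based indices. The order of the blocks in the blocked layout and the choice of which index supplies the high bit in the bit-interleaved layout are assumptions, since the figure's numbering did not survive extraction.

```c
#include <assert.h>

/* Memory position of element (i, j) of an n x n matrix, all indices 0-based. */
unsigned row_major(unsigned i, unsigned j, unsigned n) { return i * n + j; }
unsigned col_major(unsigned i, unsigned j, unsigned n) { return j * n + i; }

/* s x s blocks stored one after another; blocks are ordered row by row and
   elements inside a block are row major (assumed). n must be a multiple of s. */
unsigned blocked(unsigned i, unsigned j, unsigned n, unsigned s)
{
    assert(s > 0 && n % s == 0);
    unsigned block = (i / s) * (n / s) + (j / s);
    return block * s * s + (i % s) * s + (j % s);
}

/* Bit-interleaved layout for i, j < 2^16: bits of j go to the even positions,
   bits of i to the odd positions (assumed orientation). */
unsigned bit_interleaved(unsigned i, unsigned j)
{
    unsigned pos = 0;
    for (unsigned b = 0; b < 16; b++) {
        pos |= ((j >> b) & 1u) << (2 * b);
        pos |= ((i >> b) & 1u) << (2 * b + 1);
    }
    return pos;
}
```

The bit-interleaved layout is the recursive one: every aligned 2^k × 2^k submatrix occupies a contiguous range of 4^k positions, for every k at once.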

SLIDE 34

Matrix Multiplication

Algorithm 1: Nested loops
  – Row major

        for i = 1 to N
            for j = 1 to N
                z_ij = 0
                for k = 1 to N
                    z_ij = z_ij + x_ik · y_kj

  – Reading a column of Y ⇒ N I/Os
  – Total $O(N^3)$ I/Os

Algorithm 2: Blocked algorithm (cache-aware)
  – Partition X and Y into blocks of size $s \times s$, $s = \Theta(\sqrt{M})$

    [Figure: 8 × 8 row-major matrix with one s × s block highlighted]

  – Apply Algorithm 1 to the $\frac{N}{s} \times \frac{N}{s}$ matrices where elements are $s \times s$ matrices
  – $s \times s$-blocked or (row major and $M = \Omega(B^2)$):

    $O\left(\left(\frac{N}{s}\right)^3 \cdot \frac{s^2}{B}\right) = O\left(\frac{N^3}{s \cdot B}\right) = O\left(\frac{N^3}{B\sqrt{M}}\right)$ I/Os

  – Optimal (Hong & Kung, 1981)
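Algorithm 2 might be written as follows for row-major matrices. This is a sketch, not the slides' code: the name `matmul_blocked` is hypothetical, and the tile size s is passed in by the caller, which is exactly what makes the algorithm cache-aware.

```c
#include <assert.h>
#include <stddef.h>

static size_t min_sz(size_t a, size_t b) { return a < b ? a : b; }

/* Z = X * Y for n x n row-major matrices, processed tile by tile.
   s is the tile side length; the analysis assumes s = Theta(sqrt(M)). */
void matmul_blocked(const double *X, const double *Y, double *Z, size_t n, size_t s)
{
    assert(s > 0);
    for (size_t i = 0; i < n * n; i++)
        Z[i] = 0.0;
    for (size_t ii = 0; ii < n; ii += s)
        for (size_t jj = 0; jj < n; jj += s)
            for (size_t kk = 0; kk < n; kk += s) {
                size_t imax = min_sz(ii + s, n);
                size_t jmax = min_sz(jj + s, n);
                size_t kmax = min_sz(kk + s, n);
                /* Multiply tile X[ii.., kk..] by tile Y[kk.., jj..] into Z[ii.., jj..]. */
                for (size_t i = ii; i < imax; i++)
                    for (size_t k = kk; k < kmax; k++) {
                        double x = X[i * n + k];
                        for (size_t j = jj; j < jmax; j++)
                            Z[i * n + j] += x * Y[k * n + j];
                    }
            }
}
```

The three tiles touched in the innermost loops hold 3s² elements, so they fit in cache together when s is a suitable constant fraction of √M.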

SLIDE 37

Matrix Multiplication

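Algorithm 3: Recursive algorithm (cache-oblivious)

$$\begin{pmatrix} X_{11} & X_{12} \\ X_{21} & X_{22} \end{pmatrix} \begin{pmatrix} Y_{11} & Y_{12} \\ Y_{21} & Y_{22} \end{pmatrix} = \begin{pmatrix} X_{11}Y_{11} + X_{12}Y_{21} & X_{11}Y_{12} + X_{12}Y_{22} \\ X_{21}Y_{11} + X_{22}Y_{21} & X_{21}Y_{12} + X_{22}Y_{22} \end{pmatrix}$$

  – 8 recursive $\frac{N}{2} \times \frac{N}{2}$ multiplications + 4 $\frac{N}{2} \times \frac{N}{2}$ matrix sums
  – # I/Os if bit interleaved or (row major and $M = \Omega(B^2)$):

$$T(N) \le \begin{cases} O\left(\frac{N^2}{B}\right) & \text{if } N \le \varepsilon\sqrt{M} \\ 8 \cdot T\left(\frac{N}{2}\right) + O\left(\frac{N^2}{B}\right) & \text{otherwise} \end{cases}$$

$$T(N) \le O\left(\frac{N^3}{B\sqrt{M}}\right)$$

  – Optimal (Hong & Kung, 1981)
  – Non-square matrices (Frigo et al., 1999)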
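A sketch of Algorithm 3 for row-major matrices whose side length is a power of two. The names `matmul_rec` and `matmul_oblivious` are hypothetical. As a simplification, the eight recursive products accumulate directly into Z, which replaces the four explicit matrix sums listed on the slide.

```c
#include <assert.h>
#include <stddef.h>

/* Z += X * Y on n x n submatrices of row-major arrays with row stride ld. */
static void matmul_rec(const double *X, const double *Y, double *Z, size_t n, size_t ld)
{
    if (n == 1) {
        Z[0] += X[0] * Y[0];
        return;
    }
    size_t h = n / 2;
    const double *X11 = X, *X12 = X + h, *X21 = X + h * ld, *X22 = X + h * ld + h;
    const double *Y11 = Y, *Y12 = Y + h, *Y21 = Y + h * ld, *Y22 = Y + h * ld + h;
    double *Z11 = Z, *Z12 = Z + h, *Z21 = Z + h * ld, *Z22 = Z + h * ld + h;

    matmul_rec(X11, Y11, Z11, h, ld); matmul_rec(X12, Y21, Z11, h, ld);
    matmul_rec(X11, Y12, Z12, h, ld); matmul_rec(X12, Y22, Z12, h, ld);
    matmul_rec(X21, Y11, Z21, h, ld); matmul_rec(X22, Y21, Z21, h, ld);
    matmul_rec(X21, Y12, Z22, h, ld); matmul_rec(X22, Y22, Z22, h, ld);
}

/* Z = X * Y for n x n row-major matrices; n must be a power of two. */
void matmul_oblivious(const double *X, const double *Y, double *Z, size_t n)
{
    assert(n > 0 && (n & (n - 1)) == 0);
    for (size_t i = 0; i < n * n; i++)
        Z[i] = 0.0;
    matmul_rec(X, Y, Z, n, n);
}
```

No cache parameter appears anywhere in the code. A practical version would stop the recursion at a small base case and use a plain loop there, which changes constants but not the recursion analysed above.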

SLIDE 38

Cache-Oblivious Matrix Multiplication

I/O bound: $O\left(\frac{N^3}{B\sqrt{M}}\right)$

Techniques applied

  • Recursion / divide-and-conquer
  • Recursive memory layout
  • Scanning

SLIDE 39

Cache-Oblivious Search Trees

SLIDE 42

Static Search Trees

Recursive memory layout ≡ van Emde Boas layout

Prokop 1999

[Figure: a tree of height h is split at the middle level (heights ⌈h/2⌉ and ⌊h/2⌋) into a top tree A and bottom trees B1, ..., Bk; memory holds A, B1, ..., Bk in that order, each laid out recursively]

Searches use $O(\log_B N)$ I/Os

Range reportings use $O\left(\log_B N + \frac{K}{B}\right)$ I/Os

Best possible: $(\log_2 e + o(1)) \log_B N$

Bender, Brodal, Fagerberg, Ge, He, Hu, Iacono, López-Ortiz 2003
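A minimal sketch of a static search tree stored in van Emde Boas order, assuming a complete tree with 2^h − 1 keys and explicit child positions in each node. The type `veb_node` and all function names are hypothetical, and giving the top tree height ⌈h/2⌉ is an assumption about how the split is rounded.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct { int key; int left, right; } veb_node;  /* children are array positions, -1 = none */

/* Assign array positions to the top h levels of the subtree rooted at BFS index root. */
static void veb_positions(int root, int h, int *pos, int *next)
{
    if (h == 1) { pos[root] = (*next)++; return; }
    int top = (h + 1) / 2, bot = h - top;
    veb_positions(root, top, pos, next);                  /* top tree A */
    for (int k = 0; k < (1 << top); k++)                  /* bottom trees B1, ..., Bk */
        veb_positions((root << top) + k, bot, pos, next);
}

/* In-order walk over BFS indices that copies the sorted keys into the layout. */
static void veb_fill(int i, int n, const int *pos, const int *sorted, int *k, veb_node *t)
{
    if (i > n) return;
    veb_fill(2 * i, n, pos, sorted, k, t);
    veb_node *v = &t[pos[i]];
    v->key = sorted[(*k)++];
    v->left = 2 * i <= n ? pos[2 * i] : -1;
    v->right = 2 * i + 1 <= n ? pos[2 * i + 1] : -1;
    veb_fill(2 * i + 1, n, pos, sorted, k, t);
}

/* Build a tree of height h from 2^h - 1 sorted keys. The caller frees the result. */
veb_node *veb_build(const int *sorted, int h)
{
    assert(h >= 1 && h < 30);
    int n = (1 << h) - 1, next = 0, k = 0;
    int *pos = malloc((n + 1) * sizeof *pos);
    veb_node *t = malloc(n * sizeof *t);
    veb_positions(1, h, pos, &next);
    veb_fill(1, n, pos, sorted, &k, t);
    free(pos);
    return t;
}

/* Returns the array position of key, or -1 if it is absent. The root is at position 0. */
int veb_search(const veb_node *t, int key)
{
    int p = 0;
    while (p != -1 && t[p].key != key)
        p = key < t[p].key ? t[p].left : t[p].right;
    return p;
}
```

The search is an ordinary binary search tree descent; only the placement of nodes in the array differs from other layouts. For the keys 1 to 15 the array holds 8, 4, 12, 2, 1, 3, 6, 5, 7, 10, 9, 11, 14, 13, 15.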

SLIDE 43

Dynamic Search Trees

  • Embed a dynamic tree of small height into a complete tree
  • Static van Emde Boas layout
  • Rebuild data structure whenever N doubles or halves

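[Figure: binary search tree with keys 6, 4, 1, 3, 5, 8, 7, 11, 10, 13 (listed in preorder)]

Search             $O(\log_B N)$
Range reporting    $O\left(\log_B N + \frac{K}{B}\right)$
Updates            $O\left(\log_B N + \frac{\log^2 N}{B}\right)$

Brodal, Fagerberg, Jacob 2001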

SLIDE 44

Example

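[Figure: binary search tree with keys 6, 4, 1, 3, 5, 8, 7, 11, 10, 13 (listed in preorder)]

The same tree embedded in a complete tree of height 4 and stored in van Emde Boas order (− marks an empty slot):

    6 4 8 1 − 3 5 − − 7 − − 11 10 13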

SLIDE 45

Binary Trees of Small Height

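[Figure: the tree with keys 6, 4, 1, 3, 5, 8, 7, 11, 10, 13 before and after inserting the new key 2; after the insertion the subtree holding keys 1 to 5 has been rebuilt with root 3]

  • If an insertion causes non-small height, then rebuild the subtree at the nearest ancestor with sufficiently few descendants
  • Insertions require amortized time $O(\log^2 N)$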

Andersson and Lai 1990

SLIDE 46

Cache-Oblivious Search Trees

I/O bounds

  Search             $O(\log_B N)$
  Range reporting    $O\left(\log_B N + \frac{K}{B}\right)$
  Updates            $O\left(\log_B N + \frac{\log^2 N}{B}\right)$, or $O(\log_B N)$ (no range reporting)

Techniques applied

  • Recursion
  • Recursive memory layout
  • Scanning (updates, range reporting)

SLIDE 47

Cache-Oblivious Sorting

SLIDE 48

Sorting Problem

K A T A J A I N E N

A A A E I J K N N T

SLIDE 50

Binary Merge-Sort

[Figure: binary merge-sort; the input is combined by successive rounds of pairwise merging into the sorted output]

  • Recursive; two arrays; size O(M) internally in cache
  • $O(N \log N)$ comparisons
  • $O\left(\frac{N}{B}\log_2\frac{N}{M}\right)$ I/Os
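A recursive binary merge-sort over two arrays could look like the sketch below. The function names are hypothetical, and `int` keys are used only to keep the example short.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Merge the sorted runs src[lo..mid) and src[mid..hi) into dst[lo..hi). */
static void merge_runs(const int *src, int *dst, size_t lo, size_t mid, size_t hi)
{
    size_t i = lo, j = mid, k = lo;
    while (i < mid && j < hi)
        dst[k++] = src[j] < src[i] ? src[j++] : src[i++];
    while (i < mid) dst[k++] = src[i++];
    while (j < hi) dst[k++] = src[j++];
}

/* Sort a[lo..hi), using tmp[lo..hi) as the second array. */
static void msort_rec(int *a, int *tmp, size_t lo, size_t hi)
{
    if (hi - lo < 2) return;
    size_t mid = lo + (hi - lo) / 2;
    msort_rec(a, tmp, lo, mid);
    msort_rec(a, tmp, mid, hi);
    merge_runs(a, tmp, lo, mid, hi);
    memcpy(a + lo, tmp + lo, (hi - lo) * sizeof *a);
}

/* Sort a[0..n); tmp must have room for n elements. */
void merge_sort(int *a, int *tmp, size_t n)
{
    assert(a != NULL && tmp != NULL);
    msort_rec(a, tmp, 0, n);
}
```

Each merge is a pair of scans, so once a subproblem fits in cache it causes no further I/Os; the remaining $\log_2(N/M)$ levels above it each cost $O(N/B)$ I/Os, which gives the bound on the slide.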

SLIDE 51

Merge-Sort

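  Degree           I/O
  2                $O\left(\frac{N}{B}\log_2\frac{N}{M}\right)$
  d                $O\left(\frac{N}{B}\log_d\frac{N}{M}\right)$    ($d \le \frac{M}{B} - 1$)
  $\Theta(M/B)$    $O\left(\frac{N}{B}\log_{M/B}\frac{N}{M}\right) = O(\mathrm{Sort}_{M,B}(N))$

  Aggarwal and Vitter 1988

Funnel-Sort

  Degree           I/O
  2                $O\left(\frac{1}{\varepsilon}\mathrm{Sort}_{M,B}(N)\right)$    ($M \ge B^{1+\varepsilon}$)

  Frigo, Leiserson, Prokop and Ramachandran 1999; Brodal and Fagerberg 2002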

SLIDE 52

Lower Bound

Brodal and Fagerberg 2003

              Block size    Memory    I/Os
Machine 1     $B_1$         $M$       $t_1$
Machine 2     $B_2$         $M$       $t_2$

One algorithm, two machines, $B_1 \le B_2$

Trade-off:

$$8 t_1 B_1 + 3 t_1 B_1 \log\frac{8 M t_2}{t_1 B_1} \ge N \log\frac{N}{M} - 1.45 N$$

SLIDE 53

Lower Bound

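                      Assumption                     I/Os
Lazy Funnel-sort      $B \le M^{1-\varepsilon}$      (a) $B_2 = M^{1-\varepsilon}$: $\mathrm{Sort}_{B_2,M}(N)$
                                                     (b) $B_1 = 1$: $\mathrm{Sort}_{B_1,M}(N) \cdot \frac{1}{\varepsilon}$
Binary Merge-sort     $B \le M/2$                    (a) $B_2 = M/2$: $\mathrm{Sort}_{B_2,M}(N)$
                                                     (b) $B_1 = 1$: $\mathrm{Sort}_{B_1,M}(N) \cdot \log M$

[Figure: penalty as a function of the block size B, with B marked at 1, $M^{1/2}$, $M^{1-\varepsilon}$ and $M$]

Theorem: For any cache-oblivious comparison-based sorting algorithm, (a) ⇒ (b). This is tight.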

SLIDE 54

Funnel-Sort

SLIDE 57

k-merger

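Frigo et al., FOCS'99

[Figure: a k-merger M takes k sorted input streams and produces one sorted output stream]

Recursive definition: a k-merger consists of bottom mergers $M_1, \dots, M_{\sqrt{k}}$ and a top merger $M_0$, all of them $k^{1/2}$-mergers, connected through buffers $B_1, \dots, B_{\sqrt{k}}$ of size $k^{3/2}$.

Recursive layout: $M_0, B_1, M_1, B_2, M_2, \dots, B_{\sqrt{k}}, M_{\sqrt{k}}$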

SLIDE 60

Lazy k-merger

Brodal and Fagerberg 2002

[Figure: k-merger with top merger $M_0$, buffers $B_1, \dots, B_{\sqrt{k}}$ and bottom mergers $M_1, \dots, M_{\sqrt{k}}$]

Procedure Fill(v)
    while out-buffer not full
        if left in-buffer empty
            Fill(left child)
        if right in-buffer empty
            Fill(right child)
        perform one merge step

Lemma: If $M \ge B^2$ and the output buffer has size $k^3$, then $O\left(\frac{k^3}{B}\log_M(k^3) + k\right)$ I/Os are done during an invocation of Fill(root)
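The Fill procedure can be sketched on a plain binary tree of mergers. The `merger` struct and the helpers `leaf` and `node` are hypothetical. The end-of-stream handling through the `exhausted` flag is an addition that the pseudocode leaves implicit, and the buffer sizes and recursive memory layout of a real k-merger are not modelled.

```c
#include <assert.h>
#include <stddef.h>

typedef struct merger {
    struct merger *left, *right;  /* both NULL for a leaf, i.e. an input stream */
    int *buf;                     /* output buffer; for a leaf, the sorted input itself */
    size_t cap, head, tail;       /* live elements are buf[head..tail) */
    int exhausted;                /* set once no further elements can arrive */
} merger;

static int is_empty(const merger *m) { return m->head == m->tail; }

merger leaf(int *data, size_t len)
{
    merger m = {NULL, NULL, data, len, 0, len, 1};
    return m;
}

merger node(merger *l, merger *r, int *buf, size_t cap)
{
    merger m = {l, r, buf, cap, 0, 0, 0};
    return m;
}

/* Refill the empty output buffer of the internal node v. */
void fill(merger *v)
{
    assert(v->left && v->right && is_empty(v));
    merger *l = v->left, *r = v->right;
    v->head = v->tail = 0;
    while (v->tail < v->cap) {                        /* out-buffer not full */
        if (is_empty(l) && !l->exhausted) fill(l);
        if (is_empty(r) && !r->exhausted) fill(r);
        if (is_empty(l) && is_empty(r)) {             /* both inputs have run dry */
            v->exhausted = 1;
            break;
        }
        /* One merge step: move the smaller front element to the output. */
        merger *src = is_empty(r) || (!is_empty(l) && l->buf[l->head] <= r->buf[r->head]) ? l : r;
        v->buf[v->tail++] = src->buf[src->head++];
    }
}
```

A consumer repeatedly calls `fill(&root)` and drains `root.buf[root.head..root.tail)` until `root.exhausted` is set. The laziness lies in a child being refilled only at the moment its buffer is found empty.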

SLIDE 62

Funnel-Sort

Brodal and Fagerberg 2002; Frigo, Leiserson, Prokop and Ramachandran 1999

  • Divide the input into $N^{1/3}$ segments of size $N^{2/3}$
  • Recursively Funnel-Sort each segment
  • Merge the sorted segments by an $N^{1/3}$-merger

Merger degree k over the recursion levels: $N^{1/3}, N^{2/9}, N^{4/27}, \dots, 2$

Theorem: Funnel-Sort performs $O(\mathrm{Sort}_{M,B}(N))$ I/Os for $M \ge B^2$
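The sequence of merger degrees follows from applying the same split to each segment. As a short worked derivation, where $n_i$ (segment size at recursion depth $i$) and $k_i$ (merger degree at that depth) are symbols introduced here for illustration:

```latex
n_0 = N, \qquad n_{i+1} = n_i^{2/3}
\;\Longrightarrow\; n_i = N^{(2/3)^i},
\qquad
k_i = n_i^{1/3} = N^{\frac{1}{3}\left(\frac{2}{3}\right)^{i}}
\;\Longrightarrow\;
k_0 = N^{1/3},\quad k_1 = N^{2/9},\quad k_2 = N^{4/27},\ \dots
```

Since the exponent shrinks by a factor of 2/3 per level, the segment size drops to a constant after $O(\log\log N)$ levels of recursion.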

SLIDE 63

Cache-Oblivious Sorting

I/O bounds: $O\left(\frac{N}{B}\log_{M/B}\frac{N}{B}\right)$, assuming a tall cache $M = \Omega(B^{1+\varepsilon})$

Techniques applied

  • Divide-and-conquer
  • Recursive memory layout (k-merger)
  • Scanning (buffers)
  • Buffers of double-exponential size

SLIDE 65

Memory Layouts of Trees

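[Figure: four complete binary trees with 15 nodes, one per layout; each node is labelled with its position in memory. The labels below are listed in preorder.]

DFS              1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
inorder          8 4 2 1 3 6 5 7 12 10 9 11 14 13 15
BFS              1 2 4 8 9 5 10 11 3 6 12 13 7 14 15
van Emde Boas    1 2 4 5 6 7 8 9 3 10 11 12 13 14 15    (in theory best)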
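The four label sequences can be reproduced with a few lines of code. The sketch below numbers the nodes of a complete tree of height h under each layout and lists the labels in preorder; the enum and function names are hypothetical, and the top tree of the van Emde Boas split is assumed to have height ⌈h/2⌉.

```c
#include <assert.h>

enum layout { DFS, INORDER, BFS, VEB };

/* Nodes are addressed by BFS index i (root 1, children 2i and 2i + 1);
   pos[i] receives the 1-based memory position of node i. */
static void number_dfs(int i, int n, int *pos, int *next)
{
    if (i > n) return;
    pos[i] = ++*next;
    number_dfs(2 * i, n, pos, next);
    number_dfs(2 * i + 1, n, pos, next);
}

static void number_inorder(int i, int n, int *pos, int *next)
{
    if (i > n) return;
    number_inorder(2 * i, n, pos, next);
    pos[i] = ++*next;
    number_inorder(2 * i + 1, n, pos, next);
}

static void number_veb(int root, int h, int *pos, int *next)
{
    if (h == 1) { pos[root] = ++*next; return; }
    int top = (h + 1) / 2, bot = h - top;
    number_veb(root, top, pos, next);                  /* top tree first */
    for (int k = 0; k < (1 << top); k++)               /* then each bottom tree */
        number_veb((root << top) + k, bot, pos, next);
}

static void list_preorder(int i, int n, const int *pos, int *out, int *k)
{
    if (i > n) return;
    out[(*k)++] = pos[i];
    list_preorder(2 * i, n, pos, out, k);
    list_preorder(2 * i + 1, n, pos, out, k);
}

/* Write the 2^h - 1 position labels for the given layout to out, in preorder. */
void layout_labels(enum layout kind, int h, int *out)
{
    assert(h >= 1 && h <= 10);
    int n = (1 << h) - 1, next = 0, k = 0;
    int pos[1 << 10];
    switch (kind) {
    case DFS:     number_dfs(1, n, pos, &next); break;
    case INORDER: number_inorder(1, n, pos, &next); break;
    case BFS:     for (int i = 1; i <= n; i++) pos[i] = i; break;
    case VEB:     number_veb(1, h, pos, &next); break;
    }
    list_preorder(1, n, pos, out, &k);
}
```

With h = 4 the van Emde Boas layout places the root and its two children in positions 1 to 3, followed by four bottom trees of three nodes each, which is why a root-to-leaf path touches only two of the contiguous three-node groups.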

SLIDE 66

Searches in Pointer Based Layouts

Brodal, Fagerberg, Jacob 2002

[Plot with axis ticks 0.0001 to 0.001 and 1000 to 1e+06; series: pointer bfs, pointer dfs, pointer vEB, pointer random insert, pointer random layout]

  • van Emde Boas layout wins, followed by the BFS layout

SLIDE 67

Searches with Implicit Layouts

Brodal, Fagerberg, Jacob 2002

[Plot with axis ticks 0.0001 to 0.001 and 1000 to 1e+06; series: implicit bfs, implicit dfs, implicit vEB, implicit in-order, implicit 9-ary bfs]

  • BFS layout wins due to simplicity and caching of topmost levels
  • van Emde Boas layout requires quite complex index computations

SLIDE 68

Searches with Implicit Layouts

Brodal, Fagerberg, Jacob 2002

[Plot with axis ticks 1e-06 to 0.1 and 20 to 29; series: bfs, veb, high1024]

  • van Emde Boas is competitive with a B-tree beyond main memory

SLIDE 69

[Plot: Uniform pairs - Itanium 2; Walltime/n*log n against log n, for log n from 12 to 28; series: funnelsort2, GCC, msort-c, msort-m]

Engineering a Cache-Oblivious Sorting Algorithm, Brodal, Fagerberg, Vinther, 2004

SLIDE 71

Conclusion

  • The ideal cache model helps in designing competitive and robust algorithms
  • Basic techniques: scanning, recursion/divide-and-conquer, recursive memory layout, sorting
  • Many other problems studied: permuting, FFT, matrix transposition, priority queues, graph algorithms, computational geometry...

The overhead involved in being cache-oblivious can be small enough for the nice theoretical properties to transfer into practical advantages
