CSE 373: B-trees — Michael Lee — Wednesday, Jan 31, 2018



SLIDE 1

CSE 373: B-trees

Michael Lee Wednesday, Jan 31, 2018

1

SLIDE 2

Motivation

What we’ve done so far: study different dictionary implementations:
◮ ArrayDictionary
◮ SortedArrayDictionary
◮ Binary search trees
◮ AVL trees
◮ Hash tables
They all make one common assumption: all our data is stored in memory, in RAM.

2



SLIDE 5

Motivation

New challenge: what if our data is too large to store all in RAM? (For example, if we were trying to implement a database.) How can we do this efficiently? Two techniques:
◮ A tree-based technique: excels at range lookups (e.g. “find all users with an age between 20 and 30”, where “age” is the key)
◮ A hash-based technique: excels at specific key-value pair lookups

3

SLIDE 6

A tree-based technique

Idea 1: Use an AVL tree. Suppose the tree has a height of 50. In the best case, how many disk accesses do we need to make? In the worst case? In the best case, the nodes we want happen to be stored in RAM, so we need zero accesses. In the worst case, each node is stored on a different page on disk, so we need to make 50 accesses.

4


SLIDE 8

M-ary search trees

Idea 1:
◮ Instead of having each node have 2 children, make it have M children. Each node contains a sorted array of children nodes.
◮ Pick M so that each node fits into a single page.
Example: (diagram of an M-ary search tree omitted)

5



SLIDE 11

M-ary search trees

◮ What is the height of an M-ary search tree in terms of M and n? Assume the tree is balanced. The height is approximately logM(n).
◮ What is the worst-case runtime of get(...)? We need to examine logM(n) nodes. Per node, we need to find the child to pick. We can do so using binary search: log2(M). Total runtime: height · workPerNode = logM(n) · log2(M).

6
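As a sanity check on this analysis, here is a small Python sketch (not from the slides, the function name is my own) that evaluates height · workPerNode. The product logM(n) · log2(M) collapses to log2(n), so increasing M does not reduce the total number of comparisons; its payoff is fewer disk accesses per lookup.

```python
import math

def mary_search_cost(n, m):
    """Approximate comparison count of get() in a balanced M-ary search
    tree: ~log_M(n) nodes on a root-to-leaf path, times log2(M)
    comparisons for the binary search within each node."""
    height = math.log(n, m)    # ~ log_M(n) levels
    per_node = math.log2(m)    # binary search among up to M children
    return height * per_node   # total ~ log2(n), independent of M

# Changing M leaves the total comparison count at ~log2(n):
print(round(mary_search_cost(10**6, 2), 2))    # ~19.93
print(round(mary_search_cost(10**6, 256), 2))  # ~19.93
```

The point of the exercise: M buys shallower trees (fewer pages touched), not fewer comparisons.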


SLIDE 13

M-ary trees

With M-ary trees, how many disk accesses do we make, assuming each node is stored on one page? Is it logM(n), or logM(n) log2(M)? It’s logM(n) log2(M)! When doing binary search, we need to check the child to see if its key is the one we should pick.

7

SLIDE 14

B-Trees

Idea 2:
◮ Rather than visiting each child, what if we stored the info we need in the parent – store keys?
◮ To avoid redundancy, store values only in leaf nodes.
Internal node: a node that stores only keys and pointers to children nodes.
Leaf node: a node that stores only keys and values.

8


SLIDE 16

B-Trees

An example:

[diagram: root with keys 10 | 20 | 30; leaves of key-value pairs: (1,a)(5,b)(9,f) · (10,k)(15,a)(17,c)(18,d)(19,z) · (25,m)(26,e)(27,a)(29,a) · (31,a)(32,b)(33,f)]

9

SLIDE 17

B-Trees

A larger example (values in leaf nodes omitted):

[diagram: three-level B-tree; root keys 15 | 40; internal-node keys 4 10 · 15 20 25 30 · 45 60; leaf keys ranging from 1 to 100]

10

SLIDE 18

B-tree invariants

The B-tree invariants

1. The B-tree node type invariant
2. The B-tree order invariant
3. The B-tree structure invariant

11

SLIDE 19

The B-tree node type invariant

B-tree node type invariant A B-tree has two types of node: internal nodes, and leaf nodes.

12


SLIDE 21

The B-tree node type invariant

B-tree internal node: an internal node contains M pointers to children and M − 1 sorted keys. Note: M > 2 must be true. Example of an internal node where M = 6:

[diagram: internal node with 5 sorted keys K and 6 child pointers]

B-tree leaf node: a leaf node contains L key-value pairs, sorted by key. Example of a leaf node where L = 3:

[diagram: leaf node with 3 key-value pairs (K, V)]

Note: M and L are parameters the creator of the B-tree must pick.

13
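The two node types can be sketched as small classes; this is an illustrative layout (class and field names are my own, not the slides’), with assertions enforcing the counts described above. The leaf matches the L = 3 example; the internal node uses M = 4 for brevity.

```python
class InternalNode:
    """Stores only sorted keys plus pointers to children: M children,
    M - 1 keys."""
    def __init__(self, keys, children):
        assert len(children) == len(keys) + 1  # M children, M - 1 keys
        assert keys == sorted(keys)            # keys kept in sorted order
        self.keys, self.children = keys, children

class LeafNode:
    """Stores only key-value pairs, sorted by key (up to L of them)."""
    def __init__(self, pairs):
        assert [k for k, _ in pairs] == sorted(k for k, _ in pairs)
        self.pairs = pairs

# L = 3 leaf from the earlier example tree:
leaf = LeafNode([(1, "a"), (5, "b"), (9, "f")])
# An internal node with M = 4 (placeholder strings stand in for children):
root = InternalNode([10, 20, 30], ["c0", "c1", "c2", "c3"])
```

In a real implementation each child pointer would reference another node (or a page id on disk) rather than a placeholder string.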

SLIDE 22

The B-tree order invariant

B-tree order invariant: for any given key k, all subtrees to the left may only contain keys x that satisfy x < k. All subtrees to the right may only contain keys x that satisfy x ≥ k. This means the subtree between two adjacent keys a and b may only contain keys x that satisfy a ≤ x < b.

Example:

[diagram: node with keys 3 | 7 | 12 | 21; subtree key ranges x < 3, 3 ≤ x < 7, 7 ≤ x < 12, 12 ≤ x < 21, 21 ≤ x]

14


SLIDE 24

The B-tree structure invariant

B-tree structure when n ≤ L: if n ≤ L, the root node is a leaf.

[diagram: the whole tree is a single leaf node, here containing the key 12]

B-tree structure when n > L: when n > L, the root node MUST be an internal node containing 2 to M children. All other internal nodes must have ⌈M/2⌉ to M children. All leaf nodes must have ⌈L/2⌉ to L items.

In other words: all nodes must be at least half-full. The only exception is the root, which can have as few as 2 children.

15


SLIDE 26

Why?

◮ Why must M > 2? Otherwise, we could end up with a linked list.
◮ Why do we insist almost all nodes must be at least half-full? It lets us ensure the tree stays balanced.
◮ Why is the root allowed to have as few as 2 children? If n is relatively small compared to M and L, it may not be possible for the root to actually be half-full.

16




SLIDE 30

B-tree get

Try running get(6), get(39)

[diagram: root keys 12 | 44; internal nodes (6) · (20 27 34) · (50); leaves [1 2 3] [6 8 9 10] · [12 14 16 17 19] [20 22 24] [27 28 32] [34 38 39 41] · [44 47 49] [50 60 70]]

What’s the worst-case runtime of get(...)? Num disk accesses? Runtime roughly the same as M-ary trees: log2(L) + logM(n) log2(M). Number of disk accesses is logM(n).

17
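The descent that get(...) performs can be sketched as follows; this is a minimal illustration assuming a hypothetical dict-based node layout (not the slides’ actual code), built from the first example tree’s keys and values.

```python
from bisect import bisect_right

def btree_get(node, key):
    """Walk from the root to a leaf, binary-searching the signpost keys
    at each internal node, then look the key up inside the leaf."""
    while "children" in node:                # still at an internal node
        i = bisect_right(node["keys"], key)  # binary search: log2(M) work
        node = node["children"][i]           # one disk access per level
    return node["pairs"].get(key)            # search within the leaf

# The earlier example: root keys 10 | 20 | 30, four leaves.
tree = {
    "keys": [10, 20, 30],
    "children": [
        {"pairs": {1: "a", 5: "b", 9: "f"}},
        {"pairs": {10: "k", 15: "a", 17: "c", 18: "d", 19: "z"}},
        {"pairs": {25: "m", 26: "e", 27: "a", 29: "a"}},
        {"pairs": {31: "a", 32: "b", 33: "f"}},
    ],
}
print(btree_get(tree, 17))  # c
print(btree_get(tree, 6))   # None (absent key)
```

Using `bisect_right` means a key equal to a signpost descends into the right child, matching the order invariant (x ≥ k goes right).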

SLIDE 31

B-tree put

Suppose we have an empty B-tree where M = 3 and L = 3. Try inserting 3, 18, 14, 30: After inserting 3, 18, 14:

[diagram: a single leaf [3 14 18]]

We want to insert 30, but leaf node is out of space. So, SPLIT the node:

[diagram: root (18) → leaves [3 14] [18 30]]

18



SLIDE 34

B-tree put

Next, try inserting 32 and 36.

[diagram: root (18) → leaves [3 14] [18 30]]

After inserting 32:

[diagram: root (18) → leaves [3 14] [18 30 32]]

We want to insert 36, but the leaf node is full! So, we SPLIT again:

[diagram: root (18 32) → leaves [3 14] [18 30] [32 36]]

19



SLIDE 37

B-tree put

Next, try inserting 15 and 16.

[diagram: root (18 32) → leaves [3 14] [18 30] [32 36]]

After inserting 15:

[diagram: root (18 32) → leaves [3 14 15] [18 30] [32 36]]

We try inserting 16. The node is full, so we SPLIT:

[diagram: overfull root (15 18 32) → leaves [3 14] [15 16] [18 30] [32 36]; three keys is too many for M = 3]

What do we do now?

20



SLIDE 40

B-tree put

Solution: Recursively split the parent!

[diagram: two internal nodes (15) → [3 14] [15 16] and (32) → [18 30] [32 36]]

Then create a new root!

[diagram: new root (18); (15) → [3 14] [15 16]; (32) → [18 30] [32 36]]

21



SLIDE 43

B-tree put

Now, try inserting 12, 40, 45, and 38.

[diagrams: before: root (18); (15) → [3 14] [15 16]; (32) → [18 30] [32 36]. After: root (18); (15) → [3 12 14] [15 16]; (32 40) → [18 30] [32 36 38] [40 45]]

Note: make sure to always fill each “signpost” key with the smallest value in the subtree to its right.

22




SLIDE 47

B-tree put

1. Insert data in the correct leaf in sorted order.
2. If the leaf now has L + 1 items, overflow. Split the leaf into two new nodes:
◮ Original leaf gets the ⌈(L + 1)/2⌉ smaller items
◮ New leaf gets the ⌊(L + 1)/2⌋ larger items
Attach the new child and key to the parent (preserving sorted order).
3. Recursively continue overflowing if necessary. Note: for internal nodes, split using M instead of L.
4. If the root overflows, make a new root.

23
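Step 2 above can be sketched as a small function; this is a sketch under the assumption that a leaf is a plain sorted list and L = 3, as in the running example (the function name is my own).

```python
def insert_and_split(leaf, item, L=3):
    """Insert item into a sorted leaf; if it overflows past L items,
    split it and return the key to push up into the parent."""
    leaf = sorted(leaf + [item])
    if len(leaf) <= L:
        return leaf, None, None       # fits: no overflow
    mid = (L + 2) // 2                # ceil((L + 1) / 2) smaller items stay
    left, right = leaf[:mid], leaf[mid:]
    return left, right, right[0]      # signpost = new leaf's smallest key

# The running example: inserting 30 into the full leaf [3, 14, 18]:
print(insert_and_split([3, 14, 18], 30))  # ([3, 14], [18, 30], 18)
```

This reproduces the first split on SLIDE 31: the leaf splits into [3 14] and [18 30], and 18 becomes the parent’s signpost key.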





SLIDE 52

B-tree put analysis

What is the worst-case runtime?
◮ Time to find correct leaf: Θ(logM(n) log2(M))
◮ Time to insert into leaf: Θ(L)
◮ Time to split leaf: Θ(L)
◮ Time to split parent: Θ(M)
◮ Number of parents we might have to split: Θ(logM(n))
Overall runtime: timeFindLeaf + timeModifyLeaf + timeModifyParents
Putting it all together: Θ(logM(n) log2(M) + L + M logM(n)) = Θ(L + M logM(n))

24


SLIDE 54

B-tree put analysis

Note: runtime in the worst case is Θ(L + M logM(n)). However, splits are very rare! And splitting all the way to the root is even rarer. This means the average runtime is often better (often just Θ(1) or Θ(L)). And at the end of the day, the number of disk accesses matters more: it’s still Θ(logM(n)) no matter how many splits we do.

25

SLIDE 55

B-tree remove

Now, try deleting 32 then 15. The starting B-tree:

[diagram: root (18); (15) → [3 12 14] [15 16]; (32 40) → [18 30] [32 36 38] [40 45]]

After deleting 32:

[diagram: root (18); (15) → [3 12 14] [15 16]; (32 40) → [18 30] [36 38] [40 45]]

26


SLIDE 57

B-tree remove

What happens if we try deleting 15? Problem: invariant is broken!

[diagram: after deleting 15, the leaf [16] is below half-full: root (18); (15) → [3 12 14] [16]; (32 40) → [18 30] [36 38] [40 45]]

Solution: We fjx invariant by adopting a neighbor’s child!

[diagram: the leaf adopts 14 from its neighbor, and the signpost becomes 14: root (18); (14) → [3 12] [14 16]; (32 40) → [18 30] [36 38] [40 45]]

27


SLIDE 59

B-tree remove

Now, try deleting 16. Problem: adopting would break invariant!

[diagrams: after deleting 16, the leaf [14] is below half-full and its neighbor [3 12] has nothing to spare: root (18); (14) → [3 12] [14]; (32 40) → [18 30] [36 38] [40 45]]

Solution: adopt recursively!

[diagram: root (36); (18) → [3 12 14] [18 30]; (40) → [36 38] [40 45]]

28

slide-61
SLIDE 61

B-tree remove

Now, try deleting 16. Problem: adopting would break invariant!

18 15 3 12 14 32 40 18 30 36 38 40 45

Solution: adopt recursively!

36 3 12 14 32 40 18 30 36 38 40 45 36 18 3 12 14 18 30 40 36 38 40 45

28


SLIDE 63

B-tree remove

Now, try deleting 14 and 18. After deleting 14:

[diagrams: before: root (36); (18) → [3 12 14] [18 30]; (40) → [36 38] [40 45]. After deleting 14: root (36); (18) → [3 12] [18 30]; (40) → [36 38] [40 45]]

We try and delete 18....

[diagram: root (36); after deleting 18 its leaves merge, leaving the left internal node with only one child; (40) → [36 38] [40 45]]

29


SLIDE 65

B-tree remove

Problem: invariant is broken, adopting recursively doesn’t work:

[diagram: root (36); the left internal node has only one child; (40) → [36 38] [40 45]]

Solution: Merge!

[diagram: the two internal nodes merge and become the new root (36 40), with three leaf children]

30




SLIDE 69

B-tree remove

1. Remove data from the correct leaf.
2. If the leaf now has fewer than ⌈L/2⌉ items, underflow.
◮ If a neighbor has more than ⌈L/2⌉ items, adopt one!
◮ Otherwise, merge with the neighbor.
3. If we merged, the parent has one fewer child. Recursively underflow if necessary (note: for internal nodes, we use M instead of L).
4. If we merge all the way up to the root and the root now has only one child, delete the root and make its child the root.

31
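The adopt-or-merge decision in step 2 can be sketched like this; an illustrative fragment (names are my own) assuming L = 3 and a single left neighbor, mirroring the adoption and merge steps from the earlier examples.

```python
import math

def fix_underflow(leaf, neighbor, L=3):
    """Decide what to do after a deletion: nothing, adopt from the
    neighbor, or merge with it. Leaves are plain sorted lists."""
    min_items = math.ceil(L / 2)            # half-full threshold
    if len(leaf) >= min_items:
        return "ok", leaf, neighbor         # invariant still holds
    if len(neighbor) > min_items:           # neighbor can spare an item
        donated = neighbor[-1]              # assume neighbor is to the left
        return "adopt", sorted(leaf + [donated]), neighbor[:-1]
    return "merge", sorted(leaf + neighbor), []

# The earlier examples: [16] adopts 14; later, [14] must merge.
print(fix_underflow([16], [3, 12, 14]))  # ('adopt', [14, 16], [3, 12])
print(fix_underflow([14], [3, 12]))      # ('merge', [3, 12, 14], [])
```

A full implementation would also update the parent’s signpost key after an adoption and remove a key after a merge, which is what makes the fix-up recursive.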




SLIDE 73

B-tree remove analysis

What is the worst-case runtime?
◮ Time to find correct leaf: Θ(logM(n) log2(M))
◮ Time to remove from leaf: Θ(L)
◮ Time to adopt/merge with neighbor: Θ(L)
◮ Time to adopt/merge in parent: Θ(M)
◮ Number of parents we might have to fix: Θ(logM(n))
Putting it all together: Θ(L + M logM(n))
As before, average-case runtime is frequently better because merges are very rare.

32





SLIDE 78

Picking M and L

Our original goal: make a disk-friendly dictionary. Why are B-trees so disk-friendly?
◮ All relevant information about a single node fits in one page.
◮ We use as much of the page as we can: each node contains many keys that are all brought in at once with a single disk access, basically “for free”.
◮ The time needed to do a binary search within a node is insignificant compared to disk access time.

33




SLIDE 82

Picking M and L

So, how do we make sure a B-tree node actually fits in one page? How do we pick M and L? Suppose we know the following:

1. One key is k bytes
2. One pointer is p bytes
3. One value is v bytes

Two questions:
◮ What is the size of an internal node? Mp + (M − 1)k
◮ What is the size of a leaf node? L(k + v)

34



SLIDE 85

Picking M and L

We know Mp + (M − 1)k is the size of one internal node, and L(k + v) is the size of a leaf node. Let’s say one page (aka one block) takes up B bytes. Goal: pick the largest M and L that satisfy these two inequalities:

Mp + (M − 1)k ≤ B
L(k + v) ≤ B

If we do the math:

M = ⌊(B + k) / (p + k)⌋
L = ⌊B / (k + v)⌋

35
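These formulas are easy to evaluate directly; here is a quick sketch with illustrative numbers (the page and field sizes are my own assumptions, not from the slides): a 4 KiB page, 4-byte keys, 8-byte pointers, 16-byte values.

```python
# Assumed sizes, in bytes: page B, key k, pointer p, value v.
B, k, p, v = 4096, 4, 8, 16

M = (B + k) // (p + k)   # largest M with M*p + (M - 1)*k <= B
L = B // (k + v)         # largest L with L*(k + v) <= B

print(M, L)  # 341 204

# Both picks fill the page as much as possible: the chosen value fits,
# and one more would not.
assert M * p + (M - 1) * k <= B < (M + 1) * p + M * k
assert L * (k + v) <= B < (L + 1) * (k + v)
```

With these assumed sizes, each internal node fans out to 341 children, so a three-level tree already indexes tens of millions of keys.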
SLIDE 86

Summary

What we’ve done so far: study different dictionary implementations. These implementations all assume data is all stored in RAM.
◮ ArrayDictionary
◮ SortedArrayDictionary
◮ Binary search trees
◮ AVL trees
◮ Hash tables
What if we have a lot of data that must be stored on disk? Use a B-tree, which we intentionally designed to take advantage of how memory is accessed in computers.

36




SLIDE 90

Summary

What you should know for the midterm:
◮ The motivation behind why we made B-trees
◮ How to pick an optimal M and L
◮ A high-level understanding of the B-tree invariants (e.g. be able to recognize when a B-tree is broken)
◮ The get algorithm
What you should know for the final:
◮ The put and remove algorithms
◮ A more detailed understanding of the B-tree invariants

37