Database System Architecture Index Structures Hector Garcia-Molina - - PowerPoint PPT Presentation

database system architecture
SMART_READER_LITE
LIVE PREVIEW

Database System Architecture Index Structures Hector Garcia-Molina - - PowerPoint PPT Presentation

Database System Architecture Index Structures Hector Garcia-Molina Stijn Vansummeren Index structure Any data structure that takes as input a search key and efficiently returns the collection of matching records Sequential File 10 20


slide-1
SLIDE 1

Database System Architecture

Index Structures

Hector Garcia-Molina Stijn Vansummeren

slide-2
SLIDE 2

Index structure

  • Any data structure that takes as

input a search key and efficiently returns the collection of matching records

slide-3
SLIDE 3

Sequential File 20 10 40 30 60 50 80 70 100 90

slide-4
SLIDE 4

Sequential File 20 10 40 30 60 50 80 70 100 90 Dense Index

10 20 30 40 50 60 70 80 90 100 110 120

slide-5
SLIDE 5

Sequential File 20 10 40 30 60 50 80 70 100 90 Sparse Index

10 30 50 70 90 110 130 150 170 190 210 230

slide-6
SLIDE 6

Sequential File 20 10 40 30 60 50 80 70 100 90 Sparse 2nd level

10 30 50 70 90 110 130 150 170 190 210 230 10 90 170 250 330 410 490 570

slide-7
SLIDE 7

Question:

  • Can we build a dense, 2nd level

index for a dense index?

slide-8
SLIDE 8

Sparse vs. Dense Tradeoff

  • Sparse: Less index space per record

can keep more of index in memory

  • Dense: Can tell if any record exists

without accessing file

(Later:

– sparse better for insertions – dense needed for secondary indexes)

slide-9
SLIDE 9

Next:

  • Duplicate keys
  • Deletion/Insertion
  • Secondary indexes
slide-10
SLIDE 10

Duplicate keys

10 10 20 10 30 20 30 30 45 40

slide-11
SLIDE 11

10 10 20 10 30 20 30 30 45 40

10 10 10 20 20 30 30 30

10 10 20 10 30 20 30 30 45 40

10 10 10 20 20 30 30 30

Dense index, one way to implement? Duplicate keys

slide-12
SLIDE 12

10 10 20 10 30 20 30 30 45 40

10 20 30 40

Dense index, better way? Duplicate keys

slide-13
SLIDE 13

10 10 20 10 30 20 30 30 45 40

10 10 20 30

Sparse index, one way? Duplicate keys

careful if looking for 20 or 30!

slide-14
SLIDE 14

10 10 20 10 30 20 30 30 45 40

10 20 30 30

Sparse index, another way? Duplicate keys

place first new key from block

slide-15
SLIDE 15

Deletion from sparse index

20 10 40 30 60 50 80 70

10 30 50 70 90 110 130 150

slide-16
SLIDE 16

Deletion from sparse index

20 10 40 30 60 50 80 70

10 30 50 70 90 110 130 150

  • delete record 40
slide-17
SLIDE 17

Deletion from sparse index

20 10 40 30 60 50 80 70

10 30 50 70 90 110 130 150

  • delete record 30

40 40

slide-18
SLIDE 18

Deletion from sparse index

20 10 40 30 60 50 80 70

10 30 50 70 90 110 130 150

  • delete records 30 & 40

50 70

slide-19
SLIDE 19

Deletion from dense index

20 10 40 30 60 50 80 70

10 20 30 40 50 60 70 80

slide-20
SLIDE 20

Deletion from dense index

20 10 40 30 60 50 80 70

10 20 30 40 50 60 70 80

  • delete record 30

40 40

slide-21
SLIDE 21

Insertion, sparse index case

20 10 30 50 40 60

10 30 40 60

slide-22
SLIDE 22

Insertion, sparse index case

20 10 30 50 40 60

10 30 40 60

  • insert record 34

34

  • ur lucky day!

we have free space where we need it!

slide-23
SLIDE 23

Insertion, sparse index case

20 10 30 50 40 60

10 30 40 60

  • insert record 15

15 20 30 20

  • Illustrated: Immediate

reorganization

  • Variation:
  • insert new block (chained file)
  • update index
slide-24
SLIDE 24

Insertion, sparse index case

20 10 30 50 40 60

10 30 40 60

  • insert record 25

25

  • verflow blocks

(reorganize later...)

slide-25
SLIDE 25

Insertion, dense index case

  • Similar
  • Often more expensive . . .
slide-26
SLIDE 26

Secondary indexes

Sequence field

50 30 70 20 40 80 10 100 60 90

slide-27
SLIDE 27

Secondary indexes

Sequence field

50 30 70 20 40 80 10 100 60 90

  • Sparse index

30 20 80 100 90 ...

does not make sense!

slide-28
SLIDE 28

Secondary indexes

Sequence field

50 30 70 20 40 80 10 100 60 90

  • Dense index

10 20 30 40 50 60 70 ... 10 50 90 ...

sparse high level

slide-29
SLIDE 29

With secondary indexes:

  • Lowest level is dense
  • Other levels are sparse

Also: Pointers are record pointers

(not block pointers; not computed)

slide-30
SLIDE 30

Duplicate values & secondary indexes

10 20 40 20 40 10 40 10 40 30

slide-31
SLIDE 31

Duplicate values & secondary indexes

10 20 40 20 40 10 40 10 40 30

10 10 10 20 20 30 40 40 40 40 ...

  • ne option...

Problem: excess overhead!

  • disk space
  • search time
slide-32
SLIDE 32

Duplicate values & secondary indexes

10 20 40 20 40 10 40 10 40 30

10

another option...

40 30 20

Problem: variable size records in index!

slide-33
SLIDE 33

Duplicate values & secondary indexes

10 20 40 20 40 10 40 10 40 30

10 20 30 40 50 60 ...

buckets

slide-34
SLIDE 34

Why “bucket” idea is useful

Indexes Records Name: primary EMP(name,dept,floor,...) Dept: secondary Floor: secondary

slide-35
SLIDE 35

Query: Get employees in (Toy Dept) ^ (2nd floor)

  • Dept. index

EMP Floor index Toy 2nd

→ Intersect toy bucket and 2nd Floor bucket to get set of matching EMP’s

slide-36
SLIDE 36

This idea used in text information retrieval

Document s

...the cat is fat ... ...was raining cats and dogs... ...Fido the dog ... Inverted lists cat dog

slide-37
SLIDE 37

IR QUERIES

  • Find articles with “cat” and “dog”
  • Find articles with “cat” or “dog”
  • Find articles with “cat” and not

“dog”

  • Find articles with “cat” in title
  • Find articles with “cat” and “dog”

within 5 words

slide-38
SLIDE 38

Summary so far

  • Conventional index

– Basic Ideas: sparse, dense, multi- level… – Duplicate Keys – Deletion/Insertion – Secondary indexes

– Buckets of Postings List

slide-39
SLIDE 39

Outline/summary

  • Conventional Indexes
  • Sparse vs. dense
  • Primary vs. secondary
  • B trees
  • -> Next
  • B+trees vs. indexed sequential
  • Hashing schemes
slide-40
SLIDE 40

Conventional indexes

Advantage:

  • Simple
  • Index is sequential file

good for scans Disadvantage:

  • Inserts expensive, and/or
  • Lose sequentiality & balance
slide-41
SLIDE 41

Example Index (sequential)

continuous free space 10 20 30 40 50 60 70 80 90 39 31 35 36 32 38 34 33

  • verflow area

(not sequential)

slide-42
SLIDE 42
  • NEXT: Another type of index

– Give up on sequentiality of index – Try to get “balance”

slide-43
SLIDE 43

Root

B+Tree Example n=3

100 120 150 180 30 3 5 11 30 35 100 101 110 120 130 150 156 179 180 200

slide-44
SLIDE 44

Sample non-leaf

to keys to keys to keys to keys < 57 57≤ k<81 81≤k<95 ≥95 57 81 95

slide-45
SLIDE 45

Sample leaf node:

From non-leaf node to next leaf in sequence 57 81 95

To record with key 57 To record with key 81 To record with key 85

slide-46
SLIDE 46

In textbook’s notation n=3

Leaf: Non-leaf:

30 35 30 30 35 30

slide-47
SLIDE 47

Lookup record(s) with key = 35 n=3

Root

100 120 150 180 30 3 5 11 30 35 100 101 110 120 130 150 156 179 180 200

slide-48
SLIDE 48

Lookup record(s) with key = 40 n=3

Root

100 120 150 180 30 3 5 11 30 35 100 101 110 120 130 150 156 179 180 200

slide-49
SLIDE 49

Range query: lookup record(s) with 35 <= key <= 150 n=3

Root

100 120 150 180 30 3 5 11 30 35 100 101 110 120 130 150 156 179 180 200

slide-50
SLIDE 50
  • The I/O cost of a lookup in a BTree is

equal to longest path of the root to a leaf

  • Hence, the goal is to keep this longest

path as short as possible

  • In particular: we want all leafs to be at

the same depth in the tree (and hence want a balanced tree)

slide-51
SLIDE 51

Size of nodes: n+1 pointers n keys

(fixed)

slide-52
SLIDE 52

Don’t want nodes to be too empty

  • Use at least

Non-leaf: (n+1)/2 pointers Leaf : (n+1)/2 pointers to data

slide-53
SLIDE 53

Full node min. node Non-leaf Leaf

n=3

120 150 180 30 3 5 11 30 35

counts even if null

slide-54
SLIDE 54

B+tree rules tree of order d

(1) All leaves at same lowest level (balanced tree) (2) Pointers in leaves point to records except for “sequence pointer”

slide-55
SLIDE 55

(3) Number of pointers/keys for B+tree

Non-leaf (non-root) n+1 n (n+1)/2 (n+1)/2- 1 Leaf (non-root) n+1 n Root n+1 n 2 1 Max Max Min Min ptrs keys ptrs→data keys (n+1)/2

(n+1)/2

slide-56
SLIDE 56

Insert into B+tree

(a) simple case

– space available in leaf

(b) leaf overflow (c) non-leaf overflow (d) new root

slide-57
SLIDE 57

(a) Insert key = 32

n=3

3 5 11 30 31 30 100 32

slide-58
SLIDE 58

(a) Insert key = 7

n=3

3 5 11 30 31 30 100 3 5 7 7

slide-59
SLIDE 59

(c) Insert key = 160

n=3

100 120 150 180 150 156 179 180 200 160 180 160 179

slide-60
SLIDE 60

(d) New root, insert 45

n=3

10 20 30 1 2 3 10 12 20 25 30 32 40 40 45 40 30 new root

slide-61
SLIDE 61

(a) Simple case - no example (b) Coalesce with neighbor (sibling) (c) Re-distribute keys (d) Cases (b) or (c) at non-leaf

Deletion from B+tree

slide-62
SLIDE 62

(b) Coalesce with sibling

– Delete 50

10 40 100 10 20 30 40 50

n=4

40

slide-63
SLIDE 63

(c) Redistribute keys

– Delete 50

10 40 100 10 20 30 35 40 50

n=4

35 35

slide-64
SLIDE 64

40 45 30 37 25 26 20 22 10 14 1 3 10 20 30 40

(d) Non-leaf coalese

– Delete 37

n=4

40 30 25 25

new root

slide-65
SLIDE 65

Outline/summary

  • Conventional Indexes
  • Sparse vs. dense
  • Primary vs. secondary
  • B trees
  • B+trees vs. indexed sequential
  • Hashing schemes
  • -> Next