Database System Architecture: Index Structures
Hector Garcia-Molina, Stijn Vansummeren
Index structure
- Any data structure that takes as input a search key and efficiently returns the collection of matching records
Sequential File
[Figure: records 10, 20, …, 100 stored two per block]
Dense Index
[Figure: sequential file as above; the dense index holds one entry per record: 10, 20, 30, …, 120]
Sparse Index
[Figure: sequential file as above; the sparse index holds one entry per block: 10, 30, 50, …, 230]
Sparse 2nd Level
[Figure: first-level sparse index 10, 30, …, 230; a second sparse level indexes the first: 10, 90, 170, 250, …, 570]
Question:
- Can we build a dense, 2nd-level index for a dense index?
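The dense/sparse contrast can be sketched in code. A minimal sketch in Python over the running example (records 10…100, two per block); the list-based blocks and the helper names `dense_lookup`/`sparse_lookup` are assumptions for illustration, not from the slides:

```python
# Dense vs. sparse index lookup over the running example.
from bisect import bisect_right

blocks = [[10, 20], [30, 40], [50, 60], [70, 80], [90, 100]]
dense  = [(k, b) for b, blk in enumerate(blocks) for k in blk]  # one entry per record
sparse = [blk[0] for blk in blocks]                             # one entry per block

def dense_lookup(key):
    # Binary-search the index; a matching entry names the block directly,
    # and a miss is known without ever touching the file.
    keys = [k for k, _ in dense]
    i = bisect_right(keys, key) - 1
    return dense[i][1] if i >= 0 and keys[i] == key else None

def sparse_lookup(key):
    # Find the last block whose first key <= key, then scan that block.
    b = bisect_right(sparse, key) - 1
    return b if b >= 0 and key in blocks[b] else None

print(dense_lookup(60), sparse_lookup(60))  # 2 2
print(dense_lookup(65), sparse_lookup(65))  # None None
```

Note that the sparse index must read a data block to rule out key 65, while the dense index answers from the index alone.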
Sparse vs. Dense Tradeoff
- Sparse: less index space per record, so more of the index can be kept in memory
- Dense: can tell if a record exists without accessing the file
(Later:
– sparse better for insertions
– dense needed for secondary indexes)
Next:
- Duplicate keys
- Deletion/Insertion
- Secondary indexes
Duplicate keys
[Running example: sequential file with duplicates, blocks (10,10) (10,20) (20,30) (30,30) (40,45)]

Dense index, one way to implement?
[Figure: one index entry per record: 10, 10, 10, 20, 20, 30, 30, 30, …]

Dense index, better way?
[Figure: one index entry per distinct key: 10, 20, 30, 40]

Sparse index, one way?
[Figure: one entry per block, holding the block’s first key: 10, 10, 20, 30 (careful if looking for 20 or 30!)]

Sparse index, another way?
[Figure: place the first new key from each block: 10, 20, 30, 30]
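The “better way” dense index (one entry per distinct key, pointing at the first matching record) can be sketched as follows; the list-based layout and the `first_of` helper are assumptions for illustration:

```python
# Dense index over a file with duplicate keys: one entry per distinct key.
from bisect import bisect_left

records = [10, 10, 10, 20, 20, 30, 30, 30, 40, 45]  # sorted, with duplicates

# Index: (key, position of the first record with that key).
index = []
for pos, k in enumerate(records):
    if not index or index[-1][0] != k:
        index.append((k, pos))

def first_of(key):
    keys = [k for k, _ in index]
    i = bisect_left(keys, key)
    return index[i][1] if i < len(keys) and keys[i] == key else None

print(first_of(30))  # 5
```

Because the file is sorted, all duplicates of a key are found by scanning forward from the returned position.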
Deletion from sparse index
[Running example: file blocks (10,20) (30,40) (50,60) (70,80); sparse index 10, 30, 50, 70, 90, 110, 130, 150]

- Delete record 40: remove it from its block; since 40 is not the first key of its block, the index is unchanged.
- Delete record 30: 40 becomes the first record of its block, so index entry 30 is updated to 40.
- Delete records 30 & 40: the block becomes empty; its index entry is removed and the following entries (50, 70, …) move up.
Deletion from dense index
[Running example: same file; dense index 10, 20, 30, 40, 50, 60, 70, 80]

- Delete record 30: remove its index entry; record 40 slides up in its block, and the remaining index entries shift up accordingly.
Insertion, sparse index case
[Running example: file blocks (10,20) (30) (40,50) (60); sparse index 10, 30, 40, 60]

- Insert record 34: our lucky day! We have free space where we need it: 34 goes into the block starting with 30, and the index is unchanged.
- Insert record 15: the first block is full. Illustrated: immediate reorganization, where record 20 is pushed into the next block and index entry 30 becomes 20. Variation: insert a new block (chained file) and update the index.
- Insert record 25: use an overflow block (reorganize later…).
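The three insertion outcomes (free space, immediate reorganization, overflow block) can be sketched with fixed-capacity blocks; `CAP`, the block layout, and the single-step reorganization policy are simplifying assumptions, not the slides’ exact algorithm:

```python
# Insertion into a sequential file with a sparse index, three outcomes.
CAP = 2  # records per block (assumed)

def insert(blocks, overflow, key):
    # Target block: last block whose first key <= key.
    b = max((i for i, blk in enumerate(blocks) if blk and blk[0] <= key),
            default=0)
    if len(blocks[b]) < CAP:                        # (a) free space: lucky day
        blocks[b] = sorted(blocks[b] + [key])
    elif b + 1 < len(blocks) and len(blocks[b + 1]) < CAP:
        blk = sorted(blocks[b] + [key])             # (b) immediate reorganization:
        blocks[b] = blk[:CAP]                       #     push largest key onward
        blocks[b + 1] = sorted(blocks[b + 1] + blk[CAP:])
    else:
        overflow.setdefault(b, []).append(key)      # (c) overflow block

blocks, ovf = [[10, 20], [30], [40, 50], [60]], {}
insert(blocks, ovf, 34)
print(blocks)  # [[10, 20], [30, 34], [40, 50], [60]]

blocks2, ovf2 = [[10, 20], [30], [40, 50], [60]], {}
insert(blocks2, ovf2, 15)
print(blocks2)  # [[10, 15], [20, 30], [40, 50], [60]]

insert(blocks, ovf, 25)  # block 0 full, block 1 now full too: overflow
print(ovf)  # {0: [25]}
```

After case (b) the sparse index entry for the second block must change from 30 to 20, mirroring the slide.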
Insertion, dense index case
- Similar
- Often more expensive . . .
Secondary indexes
[Running example: file sequenced on another field; secondary-key values appear in the order 50, 30, 70, 20, 40, 80, 10, 100, 60, 90]

- Sparse index (entries 30, 20, 80, 100, 90, …) does not make sense! The file is not sorted on the secondary key, so a block’s first key says nothing about the block’s contents.
Secondary indexes
[Same file: 50, 30, 70, 20, 40, 80, 10, 100, 60, 90]

- Dense index: sorted entries 10, 20, 30, 40, 50, 60, 70, … pointing to the records, plus a sparse high level 10, 50, 90, …
With secondary indexes:
- Lowest level is dense
- Other levels are sparse
Also: Pointers are record pointers (not block pointers; not computed)
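A dense secondary index over an unsorted file can be sketched as sorted (key, record-pointer) pairs; here list positions stand in for record pointers, an assumption for illustration:

```python
# Dense secondary index with record pointers over an unsorted file.
from bisect import bisect_left

records = [50, 30, 70, 20, 40, 80, 10, 100, 60, 90]  # secondary-key values

# Dense lowest level: sorted (key, record pointer) pairs.
dense = sorted((key, pos) for pos, key in enumerate(records))

def find(key):
    i = bisect_left(dense, (key, -1))
    return dense[i][1] if i < len(dense) and dense[i][0] == key else None

print(find(40))  # 4: record with key 40 sits at position 4 in the file
```

A sparse level could then be built on top of `dense` itself, since the index, unlike the file, is sorted.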
Duplicate values & secondary indexes
[Running example: file with secondary-key values 10, 20, 40, 20, 40, 10, 40, 10, 40, 30]

- One option: a dense entry per record: 10, 10, 10, 20, 20, 30, 40, 40, 40, 40, … Problem: excess overhead!
  - disk space
  - search time
- Another option: one entry per distinct key, each followed by all its record pointers. Problem: variable-size records in the index!
- Buckets: one entry per distinct key (10, 20, 30, 40, 50, 60, …), each pointing into a bucket of record pointers.
Why “bucket” idea is useful
Indexes on EMP(name, dept, floor, …): name is the primary index; dept and floor are secondary indexes.
Query: Get employees in (Toy Dept) ^ (2nd floor)
→ Intersect the Toy bucket of the dept index with the 2nd-floor bucket of the floor index to get the set of matching EMP’s
This idea is used in text information retrieval.
Documents: “…the cat is fat…”, “…was raining cats and dogs…”, “…Fido the dog…”
Inverted lists: cat → {documents mentioning “cat”}, dog → {documents mentioning “dog”}
IR QUERIES
- Find articles with “cat” and “dog”
- Find articles with “cat” or “dog”
- Find articles with “cat” and not “dog”
- Find articles with “cat” in title
- Find articles with “cat” and “dog” within 5 words
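The boolean queries above map directly onto set operations over inverted lists. A sketch using the three documents from the slide; the crude plural-stripping “stemmer” is an assumption for the example only:

```python
# Inverted lists for boolean IR queries.
docs = {
    1: "the cat is fat",
    2: "was raining cats and dogs",
    3: "Fido the dog",
}

inverted = {}
for d, text in docs.items():
    for w in text.lower().split():
        stem = w.rstrip("s") or w  # crude stemming: strip trailing 's'
        inverted.setdefault(stem, set()).add(d)

cat, dog = inverted["cat"], inverted["dog"]
print(sorted(cat & dog))  # "cat" and "dog":     [2]
print(sorted(cat | dog))  # "cat" or "dog":      [1, 2, 3]
print(sorted(cat - dog))  # "cat" and not "dog": [1]
```

Title and proximity queries need richer postings (field tags, word positions), but the bucket-intersection idea is the same.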
Summary so far
- Conventional index
– Basic ideas: sparse, dense, multi-level…
– Duplicate keys
– Deletion/insertion
– Secondary indexes
– Buckets of postings lists
Outline/summary
- Conventional Indexes
- Sparse vs. dense
- Primary vs. secondary
- B trees (→ next)
- B+trees vs. indexed sequential
- Hashing schemes
Conventional indexes
Advantage:
- Simple
- Index is a sequential file, good for scans
Disadvantage:
- Inserts expensive, and/or
- Lose sequentiality & balance
Example: index (sequential)
[Figure: sequential index 10, 20, …, 90 with continuous free space; after inserting 31–39, the new entries land in an overflow area (not sequential)]

NEXT: another type of index
– Give up on sequentiality of index
– Try to get “balance”
B+Tree Example, n=3
[Figure: root [100]; left child [30]; right child [120|150|180]; leaves (3,5,11) (30,35) (100,101,110) (120,130) (150,156,179) (180,200), chained by sequence pointers]
Sample non-leaf node:
[Figure: keys 57 | 81 | 95; the four pointers lead to keys k < 57, 57 ≤ k < 81, 81 ≤ k < 95, and k ≥ 95]
Sample leaf node:
[Figure: keys 57 | 81 | 95 with pointers to the records with keys 57, 81, and 95; one incoming pointer from a non-leaf node; the last pointer links to the next leaf in sequence]
In textbook’s notation, n=3:
[Figure: a leaf (30,35) and a non-leaf (30,35) drawn in the textbook’s box notation]
Lookup record(s) with key = 35, n=3
[Figure: same tree; the search follows root [100] → [30] → leaf (30,35), where 35 is found]

Lookup record(s) with key = 40, n=3
[Figure: same tree; the search reaches leaf (30,35) and finds no 40]

Range query: lookup record(s) with 35 <= key <= 150, n=3
[Figure: same tree; locate the leaf containing 35, then follow the leaf sequence pointers until a key exceeds 150]
- The I/O cost of a lookup in a B+tree equals the length of the longest path from the root to a leaf
- Hence, the goal is to keep this longest path as short as possible
- In particular: we want all leaves at the same depth in the tree (and hence want a balanced tree)
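A lookup over the n=3 example tree can be sketched as follows; the dict-based node layout is an assumption for illustration, not the textbook’s representation:

```python
# B+tree lookup: descend from the root to the one leaf that could hold the key.
from bisect import bisect_right

def lookup(node, key):
    while not node["leaf"]:
        # Child i covers keys k with keys[i-1] <= k < keys[i].
        i = bisect_right(node["keys"], key)
        node = node["children"][i]
    return key in node["keys"]

# The n=3 example tree from the slides:
leaves = [{"leaf": True, "keys": ks} for ks in
          [[3, 5, 11], [30, 35], [100, 101, 110],
           [120, 130], [150, 156, 179], [180, 200]]]
for a, b in zip(leaves, leaves[1:]):
    a["next"] = b  # sequence pointers for range queries
left  = {"leaf": False, "keys": [30], "children": leaves[0:2]}
right = {"leaf": False, "keys": [120, 150, 180], "children": leaves[2:6]}
root  = {"leaf": False, "keys": [100], "children": [left, right]}

print(lookup(root, 35))  # True
print(lookup(root, 40))  # False
```

Every lookup visits exactly one node per level, which is why keeping the tree shallow and balanced bounds the I/O cost.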
Size of nodes: n keys, n+1 pointers (fixed)

Don’t want nodes to be too empty:
- Non-leaf: use at least ⌈(n+1)/2⌉ pointers
- Leaf: use at least ⌊(n+1)/2⌋ pointers to data

[Figure: full node vs. minimum node for n=3, non-leaf and leaf; all n+1 pointer slots count, even if null]
B+tree rules (tree of order n)
(1) All leaves at same lowest level (balanced tree)
(2) Pointers in leaves point to records, except for the “sequence pointer”
(3) Number of pointers/keys for B+tree:

                     Max ptrs  Max keys  Min ptrs (→ data)  Min keys
Non-leaf (non-root)  n+1       n         ⌈(n+1)/2⌉          ⌈(n+1)/2⌉ − 1
Leaf (non-root)      n+1       n         ⌊(n+1)/2⌋          ⌊(n+1)/2⌋
Root                 n+1       n         2                  1
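The occupancy table can be computed for any n; the rounding conventions (non-leaf rounds up, leaf rounds down) follow the usual B+tree rules:

```python
# Max/min occupancy bounds for B+tree nodes of order n.
from math import ceil, floor

def bplus_bounds(n):
    return {
        "non-leaf": {"max_ptrs": n + 1, "max_keys": n,
                     "min_ptrs": ceil((n + 1) / 2),
                     "min_keys": ceil((n + 1) / 2) - 1},
        "leaf":     {"max_ptrs": n + 1, "max_keys": n,
                     "min_data_ptrs": floor((n + 1) / 2),
                     "min_keys": floor((n + 1) / 2)},
        "root":     {"max_ptrs": n + 1, "max_keys": n,
                     "min_ptrs": 2, "min_keys": 1},
    }

print(bplus_bounds(3))  # for n=3: non-leaf min 2 ptrs / 1 key; leaf min 2 keys
```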
Insert into B+tree
(a) simple case: space available in leaf
(b) leaf overflow
(c) non-leaf overflow
(d) new root
(a) Insert key = 32, n=3
[Figure: 32 fits in leaf (30,31), giving (30,31,32); no change to the index above]
(b) Insert key = 7, n=3
[Figure: leaf (3,5,11) overflows and splits into (3,5) and (7,11); key 7 is copied up into the parent]
(c) Insert key = 160, n=3
[Figure: leaf (150,156,179) splits into (150,156) and (160,179); inserting 160 overflows the full non-leaf (100,120,150,180), which splits in turn, pushing a key up; 180 ends up in the new right non-leaf]
(d) New root: insert key = 45, n=3
[Figure: leaves (1,2,3) (10,12) (20,25) (30,32,40) under non-leaf (10,20,30); leaf (30,32,40) splits into (30,32) and (40,45), the full non-leaf splits in turn, and a new root with key 30 is created]
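The leaf-overflow step (case b) can be sketched as a single split function; the list-based leaf is an assumption, while the split point ⌈(n+1)/2⌉ follows the minimum-occupancy rules above:

```python
# Splitting an overfull B+tree leaf: left keeps ceil((n+1)/2) keys,
# and the first key of the new right leaf is copied (not moved) upward.
from math import ceil

def split_leaf(keys, n):
    """keys has n+1 entries (one too many). Return (left, right, key_up)."""
    cut = ceil((n + 1) / 2)
    left, right = keys[:cut], keys[cut:]
    return left, right, right[0]

print(split_leaf([3, 5, 7, 11], n=3))  # ([3, 5], [7, 11], 7)
```

This reproduces example (b): inserting 7 into leaf (3,5,11) yields leaves (3,5) and (7,11) with key 7 copied into the parent.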
Deletion from B+tree
(a) Simple case: no example
(b) Coalesce with neighbor (sibling)
(c) Re-distribute keys
(d) Cases (b) or (c) at non-leaf
(b) Coalesce with sibling: delete 50, n=4
[Figure: leaves (10,20,30) and (40,50) under non-leaf (10,40,100); after deleting 50, the underfull leaf (40) is merged into (10,20,30,40), and separator key 40 is dropped from the parent]
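Case (b) can be sketched as a merge plus separator removal; the list-based representation is an assumption for illustration:

```python
# Coalescing an underfull leaf with its left sibling:
# merge the keys and drop the separator key from the parent.
def coalesce_leaves(left, right, parent_keys, sep_index):
    """Merge `right` into `left`; remove parent_keys[sep_index]."""
    left = left + right
    parent_keys = parent_keys[:sep_index] + parent_keys[sep_index + 1:]
    return left, parent_keys

# Example (b): after deleting 50, leaf (40) merges into (10,20,30).
print(coalesce_leaves([10, 20, 30], [40], [10, 40, 100], 1))
# ([10, 20, 30, 40], [10, 100])
```

If the merged parent drops below its own minimum, the same step repeats one level up, which is exactly case (d).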
(c) Redistribute keys: delete 50, n=4
[Figure: leaves (10,20,30,35) and (40,50); after deleting 50, the underfull leaf borrows 35 from its sibling, and the parent separator 40 becomes 35]
(d) Non-leaf coalesce: delete 37, n=4
[Figure: leaves (1,3) (10,14) (20,22) (25,26) (30,37) (40,45) under non-leaves with root key 25; after deleting 37, leaf (30) coalesces with a sibling, the underfull non-leaf coalesces in turn, and the tree loses a level, giving a new root]
Outline/summary
- Conventional Indexes
- Sparse vs. dense
- Primary vs. secondary
- B trees
- B+trees vs. indexed sequential
- Hashing schemes (→ next)