database system architecture
play

Database System Architecture Index Structures Hector Garcia-Molina - PowerPoint PPT Presentation

Database System Architecture Index Structures Hector Garcia-Molina Stijn Vansummeren Index structure Any data structure that takes as input a search key and efficiently returns the collection of matching records Sequential File 10 20


  1. Database System Architecture Index Structures Hector Garcia-Molina Stijn Vansummeren

  2. Index structure • Any data structure that takes as input a search key and efficiently returns the collection of matching records

  3. Sequential File 10 20 30 40 50 60 70 80 90 100

  4. Sequential File Dense Index 10 10 20 20 30 30 40 40 50 50 60 60 70 80 70 80 90 100 90 110 100 120

  5. Sequential File Sparse Index 10 10 20 30 50 30 70 40 90 50 110 60 130 150 70 80 170 190 90 210 100 230

  6. Sequential File Sparse 2nd level 10 10 10 20 90 30 170 50 30 250 70 40 90 330 50 110 410 60 130 490 150 70 570 80 170 190 90 210 100 230

  7. Question: • Can we build a dense, 2nd level index for a dense index?

  8. Sparse vs. Dense Tradeoff • Sparse: Less index space per record can keep more of index in memory • Dense: Can tell if any record exists without accessing file (Later: – sparse better for insertions – dense needed for secondary indexes)

  9. Next: • Duplicate keys • Deletion/Insertion • Secondary indexes

  10. Duplicate keys 10 10 10 20 20 30 30 30 40 45

  11. Duplicate keys Dense index, one way to implement? 10 10 10 10 10 10 10 10 10 10 10 10 20 20 20 20 20 20 20 20 30 30 30 30 30 30 30 30 30 30 30 30 40 40 45 45

  12. Duplicate keys Dense index, better way? 10 10 10 20 10 30 20 40 20 30 30 30 40 45

  13. Duplicate keys Sparse index, one way? careful if looking 10 10 10 for 20 or 30! 10 10 20 20 30 20 30 30 30 40 45

  14. Duplicate keys Sparse index, another way? place first new key from block 10 10 10 20 10 30 20 30 20 30 30 30 40 45

  15. Deletion from sparse index 10 10 20 30 30 50 40 70 50 90 60 110 130 70 150 80

  16. Deletion from sparse index ● delete record 40 10 10 20 30 30 50 40 70 50 90 60 110 130 70 150 80

  17. Deletion from sparse index ● delete record 30 10 10 20 40 30 40 30 50 40 70 50 90 60 110 130 70 150 80

  18. Deletion from sparse index ● delete records 30 & 40 10 10 20 50 30 30 50 70 40 70 50 90 60 110 130 70 150 80

  19. Deletion from dense index 10 10 20 20 30 30 40 40 50 50 60 60 70 70 80 80

  20. Deletion from dense index ● delete record 30 10 10 20 20 30 40 40 30 40 40 50 50 60 60 70 70 80 80

  21. Insertion, sparse index case 10 10 20 30 30 40 60 40 50 60

  22. Insertion, sparse index case ● insert record 34 10 10 20 30 30 40 34 60 40 50 60 our lucky day! we have free space where we need it!

  23. Insertion, sparse index case ● insert record 15 10 10 20 15 20 30 30 20 40 30 60 40 50 • Illustrated: Immediate 60 reorganization • Variation: ● insert new block (chained file) ● update index

  24. Insertion, sparse index case ● insert record 25 10 25 10 20 30 30 overflow blocks 40 60 (reorganize later...) 40 50 60

  25. Insertion, dense index case • Similar • Often more expensive . . .

  26. Secondary indexes Sequence field 30 50 20 70 80 40 100 10 90 60

  27. Secondary indexes Sequence • Sparse index field 30 30 50 20 80 20 100 70 80 90 40 ... 100 10 does not make sense! 90 60

  28. Secondary indexes Sequence • Dense index field 10 30 20 50 30 10 20 40 50 70 90 50 ... 80 60 40 sparse 70 ... high 100 10 level 90 60

  29. With secondary indexes: • Lowest level is dense • Other levels are sparse Also: Pointers are record pointers (not block pointers; not computed)

  30. Duplicate values & secondary indexes 20 10 20 40 10 40 10 40 30 40

  31. Duplicate values & secondary indexes one option... 20 10 10 10 10 Problem: 20 20 40 excess overhead! 20 ● disk space 10 30 40 40 ● search time 40 10 40 40 40 30 ... 40

  32. Duplicate values & secondary indexes another option... 20 10 10 20 Problem: 20 40 variable size 10 records in 40 30 40 index! 10 40 30 40

  33. Duplicate values & secondary indexes 20 10 10 20 20 30 40 40 10 50 40 60 ... 10 40 30 40 buckets

  34. Why “bucket” idea is useful Indexes Records Name: primary EMP (name,dept,floor,...) Dept: secondary Floor: secondary

  35. Query: Get employees in (Toy Dept) ^ (2nd floor) Dept. index EMP Floor index Toy 2nd → Intersect toy bucket and 2nd Floor bucket to get set of matching EMP’s

  36. This idea used in text information retrieval Document cat s ...the cat is fat ... dog ...was raining cats and dogs... ...Fido the dog ... Inverted lists

  37. IR QUERIES • Find articles with “cat” and “dog” • Find articles with “cat” or “dog” • Find articles with “cat” and not “dog” • Find articles with “cat” in title • Find articles with “cat” and “dog” within 5 words

  38. Summary so far • Conventional index – Basic Ideas: sparse, dense, multi- level… – Duplicate Keys – Deletion/Insertion – Secondary indexes – Buckets of Postings List

  39. Outline/summary • Conventional Indexes • Sparse vs. dense • Primary vs. secondary • B trees --> Next • B+trees vs. indexed sequential • Hashing schemes

  40. Conventional indexes Advantage: - Simple - Index is sequential file good for scans Disadvantage: - Inserts expensive, and/or - Lose sequentiality & balance

  41. Example Index (sequential) 10 39 20 31 30 35 33 36 continuous 40 50 60 32 38 34 free space 70 80 90 overflow area (not sequential)

  42. • NEXT: Another type of index – Give up on sequentiality of index – Try to get “balance”

  43. B+Tree Example 3 5 11 30 30 35 100 101 110 Root 100 120 130 150 120 156 150 179 180 n=3 180 200

  44. Sample non-leaf 57 81 95 to keys to keys to keys to keys 81 ≤ k<95 ≥ 95 57 ≤ k<81 < 57

  45. Sample leaf node: 57 To record with key 57 To record 81 with key 81 From non-leaf node To record 95 with key 85 in sequence to next leaf

  46. In textbook’s notation n=3 Leaf: 30 35 30 35 Non-leaf: 30 30

  47. 3 5 Lookup record(s) with key = 35 11 30 30 35 100 101 110 Root 100 120 130 150 120 156 150 179 180 180 200 n=3

  48. 3 5 Lookup record(s) with key = 40 11 30 30 35 100 101 110 Root 100 120 130 150 120 156 150 179 180 180 200 n=3

  49. Range query: lookup record(s) with 35 <= key <= 150 n=3 Root 100 120 150 180 30 100 101 110 120 130 150 156 179 180 200 11 30 35 3 5

  50. • The I/O cost of a lookup in a BTree is equal to longest path of the root to a leaf • Hence, the goal is to keep this longest path as short as possible • In particular: we want all leafs to be at the same depth in the tree (and hence want a balanced tree)

  51. Size of nodes: n+1 pointers (fixed) n keys

  52. Don’t want nodes to be too empty • Use at least  (n+1)/2  Non-leaf: pointers Leaf :  (n+1)/2  pointers to data

  53. Leaf Non-leaf n=3 node Full node 3 120 5 150 11 180 30 30 35 min. counts even if null

  54. B+tree rules tree of order d (1) All leaves at same lowest level (balanced tree) (2) Pointers in leaves point to records except for “sequence pointer”

  55. (3) Number of pointers/keys for B+tree Max Max Min Min ptrs keys ptrs → data keys Non-leaf  (n+1)/ 2   (n+1)/ 2  - 1 (non-root) n+1 n Leaf  (n+ 1) / 2   (n+ 1) / 2  (non-root) n+1 n Root n+1 n 2 1

  56. Insert into B+tree (a) simple case – space available in leaf (b) leaf overflow (c) non-leaf overflow (d) new root

  57. (a) Insert key = 32 n=3 100 30 11 30 31 32 3 5

  58. (a) Insert key = 7 n=3 100 30 7 7 11 30 31 3 5 3 5

  59. (c) Insert key = 160 100 160 150 120 156 150 179 180 160 179 180 n=3 180 200

  60. (d) New root, insert n=3 45 new root 30 10 20 30 40 10 12 20 25 30 32 40 40 45 1 2 3

  61. Deletion from B+tree (a) Simple case - no example (b) Coalesce with neighbor (sibling) (c) Re-distribute keys (d) Cases (b) or (c) at non-leaf

  62. (b) Coalesce with n=4 sibling – Delete 50 100 10 40 40 10 20 30 40 50

  63. (c) Redistribute keys n=4 – Delete 50 35 100 10 40 35 10 20 30 35 40 50

  64. (d) Non-leaf coalese n=4 – Delete 37 25 new root 40 25 10 20 30 40 30 30 37 10 14 20 22 25 26 40 45 1 3

  65. Outline/summary • Conventional Indexes • Sparse vs. dense • Primary vs. secondary • B trees • B+trees vs. indexed sequential • Hashing schemes --> Next

Download Presentation
Download Policy: The content available on the website is offered to you 'AS IS' for your personal information and use only. It cannot be commercialized, licensed, or distributed on other websites without prior consent from the author. To download a presentation, simply click this link. If you encounter any difficulties during the download process, it's possible that the publisher has removed the file from their server.

Recommend


More recommend