Hash-Based Indexes


  1. Hash-Based Indexes. UMass Amherst, March 6, 2008. Slides courtesy of R. Ramakrishnan and J. Gehrke.

  2. Introduction
  - As for any index, there are 3 alternatives for data entries k*:
    - the data record itself, with key value k;
    - <k, rid of data record with search key value k>;
    - <k, list of rids of data records with search key value k>.
  - The choice is orthogonal to the indexing technique.
  - Hash-based indexes are best for equality selections; they cannot support range searches.
  - Static and dynamic hashing techniques exist, with trade-offs for dynamic data.

  3. Static Hashing
  [Figure: a key is fed through hash function h; h(key) mod N selects one of N primary bucket pages (numbered 0 .. N-1), each with a chain of overflow pages.]
  - h(k) mod N = the bucket to which the data entry with key k belongs. Distinct keys k1 ≠ k2 can lead to the same bucket.
  - Static: the number of buckets N is fixed.
    - Primary pages are allocated sequentially and never de-allocated;
    - overflow pages are used if needed.

  4. Static Hashing (Contd.)
  - The hash function works on the search key field of record r and must distribute values over the range 0 .. N-1.
  - h(key) = (a * key + b) mod N usually works well; a and b are constants, and a lot is known about how to tune h.
  - Buckets contain data entries.
  - Long overflow chains can develop and degrade performance. Extendible and Linear Hashing are dynamic techniques that fix this problem.
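The static-hashing scheme above can be sketched in a few lines. The bucket count N and the constants a and b are illustrative assumptions, not values from the slides; a Python list stands in for each primary page and its overflow chain.

```python
# A minimal sketch of static hashing: N fixed primary buckets, with a
# Python list standing in for each bucket page plus its overflow chain.
# N, a, and b are illustrative assumptions.

N = 4                 # number of primary buckets, fixed up front
a, b = 7, 3           # constants of the hash function

def h(key):
    """h(key) = (a * key + b) mod N, as on the slide."""
    return (a * key + b) % N

buckets = [[] for _ in range(N)]

def insert(key):
    buckets[h(key)].append(key)    # a long list here = a long overflow chain

def lookup(key):
    return key in buckets[h(key)]  # equality search: only one bucket to scan

for k in [4, 12, 32, 16, 1, 5]:
    insert(k)
```

Note how skew shows up immediately: with these constants, the four even keys all land in the same bucket, which is exactly the overflow-chain degradation the slide warns about.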

  5. Extendible Hashing
  - Situation: a bucket (primary page) becomes full. Why not re-organize the file by doubling the number of buckets?
    - Reading and writing all pages is expensive!
  - Idea: use a directory of pointers to buckets; double the number of buckets by (1) doubling the directory and (2) splitting just the bucket that overflowed.
    - The directory is much smaller than the file, so doubling it is much cheaper. Only one page of data entries is split. No overflow pages!
  - The trick lies in how the hash function is adjusted.

  6. Example
  [Figure: a directory of size 4 (entries 00, 01, 10, 11; global depth D = 2) pointing to data-entry pages Bucket A = {4*, 12*, 32*, 16*}, Bucket B = {1*, 5*, 21*, 13*}, Bucket C = {10*}, and Bucket D = {15*, 7*, 19*}, each with local depth 2.]
  - The directory is an array of size 4, with global depth D = 2.
  - Each bucket has a local depth L (L ≤ D).
  - To find the bucket for r: (1) compute h(r), (2) take the last `global depth' bits of h(r).
    - If h(r) = 5 = binary 101, take the last 2 bits and go to the bucket pointed to by directory entry 01.
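The lookup rule on this slide is a one-line bit mask. The bucket contents below follow the figure; using the identity hash on small integers is an assumption made so the example values line up.

```python
# Directory lookup in extendible hashing: with global depth D = 2, the
# last D bits of h(r) index the directory. Bucket contents mirror the
# slide's figure; the identity hash is an assumption for the example.

D = 2
bucket_A = [4, 12, 32, 16]
bucket_B = [1, 5, 21, 13]
bucket_C = [10]
bucket_D = [15, 7, 19]
directory = [bucket_A, bucket_B, bucket_C, bucket_D]  # slots 00, 01, 10, 11

def find_bucket(r, h=lambda x: x):
    return directory[h(r) & ((1 << D) - 1)]  # last `global depth' bits of h(r)

# h(r) = 5 = binary 101; the last 2 bits are 01, so r goes to bucket B.
```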

  7. Inserts
  - If the bucket is full, split it (allocate a new page, re-distribute the entries).
  - If necessary, double the directory. Whether to double can be decided by comparing the global depth with the local depth of the split bucket:
    - double if local depth = global depth;
    - don't otherwise.
  - Exercise: insert r with h(r) = 20.
  [Figure: the same directory and Buckets A-D as on the previous slide.]

  8. Insert h(r) = 20 (Causes Doubling)
  [Figure: before the insert, Bucket A = {32*, 16*, 4*, 12*} is full with local depth 2 = global depth 2. The directory doubles to 8 entries (global depth 3); Bucket A splits into Bucket A = {32*, 16*} and its `split image' Bucket A2 = {4*, 12*, 20*}, both with local depth 3, while Buckets B, C, D keep local depth 2.]

  9. Points to Note
  - 20 = binary 10100. The last 2 bits (00) tell us r belongs in A or A2; the last 3 bits are needed to tell which.
  - Global depth of the directory: max number of bits needed to tell which bucket an entry belongs to.
  - Local depth of a bucket: number of bits used to determine whether an entry belongs to this bucket.
  - When does a bucket split cause directory doubling? When, before the insert, the local depth of the bucket equals the global depth. The insert causes the local depth to become greater than the global depth, so the directory is doubled by copying it over and `fixing' the pointer to the split-image page. (Use of least-significant bits enables efficient doubling via copying of the directory!)
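The split/double mechanics above can be sketched as a toy in-memory structure. A bucket capacity of 4 and the identity hash on small integers are assumptions chosen so that replaying the slides' example reproduces the doubling on slide 8; real buckets are disk pages.

```python
# A toy in-memory sketch of extendible hashing with least-significant-bit
# directory indexing. CAPACITY = 4 and the identity hash are assumptions.

CAPACITY = 4

class Bucket:
    def __init__(self, local_depth):
        self.local_depth = local_depth
        self.keys = []

class ExtendibleHash:
    def __init__(self):
        self.global_depth = 0
        self.directory = [Bucket(0)]

    def _slot(self, key):
        return key & ((1 << self.global_depth) - 1)  # last D bits

    def insert(self, key):
        bucket = self.directory[self._slot(key)]
        if len(bucket.keys) < CAPACITY:
            bucket.keys.append(key)
            return
        if bucket.local_depth == self.global_depth:
            # Doubling: the new directory is two copies of the old one;
            # each new slot initially shares a bucket with its split image.
            self.directory = self.directory + self.directory
            self.global_depth += 1
        self._split(bucket)
        self.insert(key)  # retry (may split again in pathological cases)

    def _split(self, bucket):
        bucket.local_depth += 1
        image = Bucket(bucket.local_depth)       # the `split image'
        distinguishing_bit = 1 << (bucket.local_depth - 1)
        for i, slot in enumerate(self.directory):
            if slot is bucket and (i & distinguishing_bit):
                self.directory[i] = image        # `fix' pointers to the image
        old_keys, bucket.keys = bucket.keys, []
        for k in old_keys:                       # redistribute one page only
            self.directory[self._slot(k)].keys.append(k)

# Replaying the slides' example reaches global depth 2; inserting
# h(r) = 20 then finds Bucket A full at local depth = global depth,
# which forces the directory to double to depth 3.
eh = ExtendibleHash()
for k in [4, 12, 32, 16, 1, 5, 21, 13, 10, 15, 7, 19]:
    eh.insert(k)
eh.insert(20)
```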

  10. Directory Doubling (inserting 8*)
  [Figure: a directory at global depth 2 doubling to depth 3, drawn twice: once indexing by the least-significant bits of the hash value and once by the most-significant bits. With least-significant bits the new directory is two consecutive copies of the old one; with most-significant bits each old entry must be duplicated in place.]
  - Least significant vs. most significant bits.
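The point of the figure can be shown directly: under the least-significant-bit scheme, doubling is a single contiguous copy, while the most-significant-bit scheme would interleave. The entry labels "A".."D" are placeholders, not the slide's hash values.

```python
# Why least-significant bits make doubling cheap. Labels are placeholders.

old = ["A", "B", "C", "D"]   # directory at global depth 2

# LSB scheme: a new 3-bit index i with its top bit dropped must recover
# the old slot, so new[i] = old[i % 4] -- i.e. old + old, one memcpy-style
# append of the whole directory.
new_lsb = old + old

# MSB scheme: old top-bit index i maps to new indices 2i and 2i + 1, so
# every entry is duplicated in place; no single contiguous copy does it.
new_msb = [entry for entry in old for _ in range(2)]
```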

  11. Comments on Extendible Hashing
  - If the directory fits in memory, an equality search is answered with one disk access; else two.
    - Example: a 100 MB file with 100-byte records and 4K pages has 1,000,000 records (as data entries) and 25,000 directory elements; chances are high that the directory will fit in memory.
  - The directory grows in spurts, and if the distribution of hash values is skewed, the directory can grow large.
  - Entries with the same key value (duplicates) need overflow pages!
  - Delete: removal of a data entry from a bucket.
    - If the bucket becomes empty, it can be merged with its `split image'.
    - If each directory element points to the same bucket as its split image, the directory can be halved.
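The sizing claim checks out arithmetically. Note the slide's "4K" must be read as 4,000 bytes (an assumption here, but it is the only reading under which 1,000,000 entries spread over exactly 25,000 pages):

```python
# Checking the slide's sizing example: 100 MB file, 100-byte records,
# 4K pages (taken as 4,000 bytes, which the slide's arithmetic implies).

file_bytes = 100 * 10**6
record_bytes = 100
page_bytes = 4 * 10**3

records = file_bytes // record_bytes            # data entries in the file
entries_per_page = page_bytes // record_bytes   # 40 entries fit on a page
directory_elements = records // entries_per_page
```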

  12. Summary
  - Hash-based indexes: best for equality searches; cannot support range searches.
  - Static hashing can lead to long overflow chains.
  - Extendible hashing avoids overflow pages by splitting a full bucket when a new data entry is to be added to it. (But duplicates may require overflow pages.)
  - A directory keeps track of the buckets and doubles periodically.
    - It can get large with skewed data; additional I/O is needed if it does not fit in main memory.

  13. Tree-Structured Indexes. CMPSCI 645, March 6, 2008. Slides courtesy of R. Ramakrishnan and J. Gehrke.

  14. Review
  - As for any index, there are 3 alternatives for data entries k*:
    - the data record itself, with key value k;
    - <k, rid of data record with search key value k>;
    - <k, list of rids of data records with search key value k>.
  - The choice is orthogonal to the indexing technique used to locate data entries k*.
  - Tree-structured indexing techniques support both range searches and equality searches.

  15. B+ Tree: The Most Widely Used Index
  - Inserts/deletes keep the tree height-balanced: log_F N cost (F = fanout, N = # leaf pages).
  - Minimum 50% occupancy (except for the root). Each node contains d <= m <= 2d entries, where d is called the order of the tree.
  - Supports equality searches, range searches, and updates efficiently.
  [Figure: index entries at the upper levels (direct search) pointing down to data entries at the leaf level (the "sequence set").]

  16. Example B+ Tree
  - Search begins at the root, and key comparisons direct it to a leaf.
  - Search for 5*, for 15*, and for all data entries >= 24* ...
  [Figure: root with keys 13, 17, 24, 30 over five leaves: {2*, 3*, 5*, 7*}, {14*, 16*}, {19*, 20*, 22*}, {24*, 27*, 29*}, {33*, 34*, 38*, 39*}.]
  - Based on the search for 15*, we know it is not in the tree!
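The three searches on this slide can be replayed on the example tree as data. The tree is hard-coded as the root's key list plus the five leaves; `bisect` stands in for the key comparisons at the root.

```python
# The example B+ tree as data; bisect picks the child whose key range
# covers the search key, exactly as the root comparisons do.

import bisect

root_keys = [13, 17, 24, 30]
leaves = [
    [2, 3, 5, 7],       # keys < 13
    [14, 16],           # 13 <= key < 17
    [19, 20, 22],       # 17 <= key < 24
    [24, 27, 29],       # 24 <= key < 30
    [33, 34, 38, 39],   # keys >= 30
]

def search(key):
    """Equality search: one root comparison, then one leaf scan."""
    return key in leaves[bisect.bisect_right(root_keys, key)]

def scan_from(key):
    """Range search `>= key': find the starting leaf, then follow the
    leaf-level sequence set (here, simply the remaining lists)."""
    i = bisect.bisect_right(root_keys, key)
    return [k for leaf in leaves[i:] for k in leaf if k >= key]
```

The failed search for 15* lands on the leaf {14*, 16*}; since 15 is absent from the only leaf that could hold it, it is provably not in the tree.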

  17. B+ Trees in Practice
  - Typical order: 200. Typical fill-factor: 67%, giving an average fanout of 133.
  - Typical capacities:
    - height 4: 133^4 = 312,900,721 records;
    - height 3: 133^3 = 2,352,637 records.
  - Can often hold the top levels in the buffer pool:
    - level 1 = 1 page = 8 KB;
    - level 2 = 133 pages = ~1 MB;
    - level 3 = 17,689 pages = ~138 MB.
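These capacity figures are worth recomputing, since the height-4 value is often quoted with a round-off (many copies of this deck print 312,900,700; the exact value of 133^4 is 312,900,721):

```python
# Recomputing the slide's capacity figures for average fanout 133.

fanout = 133
height_3 = fanout ** 3   # records reachable from a height-3 tree
height_4 = fanout ** 4   # records reachable from a height-4 tree

# Pages per upper level, with 8 KB pages as on the slide:
level_pages = [fanout ** i for i in range(3)]          # levels 1, 2, 3
level_mbytes = [p * 8 / 1024 for p in level_pages]     # size in MB
```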

  18. Inserting a Data Entry into a B+ Tree
  - Find the correct leaf L.
  - Put the data entry onto L.
    - If L has enough space, done!
    - Else, split L into L and a new node L2:
      - redistribute the entries evenly, copy up the middle key;
      - insert an index entry pointing to L2 into the parent of L.
  - This can happen recursively.
    - To split an index node, redistribute the entries evenly, but push up the middle key. (Contrast with leaf splits.)
  - Splits "grow" the tree; a root split increases its height.
    - Tree growth: the tree gets wider, or one level taller at the top.

  19. Previous Example, Inserting 8*
  [Figure: the same B+ tree as before: root with keys 13, 17, 24, 30 over leaves {2*, 3*, 5*, 7*}, {14*, 16*}, {19*, 20*, 22*}, {24*, 27*, 29*}, {33*, 34*, 38*, 39*}.]

  20. Inserting 8* into the Example B+ Tree
  [Figure: the leaf {2*, 3*, 5*, 7*} splits into {2*, 3*} and {5*, 7*, 8*}, and the entry 5 is to be inserted into the parent node. Note that 5 is copied up and continues to appear in the leaf. The subsequent index split produces {5, 13} and {24, 30}, with the entry 17 to be inserted into the parent node; note that 17 is pushed up and appears only once in the index. Contrast this with the leaf split.]
  - Observe how minimum occupancy is guaranteed in both leaf and index-page splits.
  - Note the difference between copy-up and push-up; be sure you understand the reasons for this.
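The copy-up vs. push-up distinction can be demonstrated with the slide's own values. The two helper functions below are a sketch, not the book's algorithm in full: they split a single overfull node and return the separator that moves up.

```python
# Copy-up (leaf split) vs push-up (index split), on the slide's values.

def split_leaf(entries):
    """Split a sorted, overfull leaf evenly; the middle key is COPIED up,
    so it also remains in the new right leaf (the sequence set must keep
    every data entry)."""
    mid = len(entries) // 2
    left, right = entries[:mid], entries[mid:]
    return left, right, right[0]

def split_index(keys):
    """Split an overfull index node evenly; the middle key is PUSHED up
    and does not remain in either half (it would be dead weight there)."""
    mid = len(keys) // 2
    return keys[:mid], keys[mid + 1:], keys[mid]

left, right, copied = split_leaf([2, 3, 5, 7, 8])    # leaf overfull after 8*
l, r, pushed = split_index([5, 13, 17, 24, 30])      # root overfull
```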

  21. Example B+ Tree After Inserting 8*
  [Figure: new root with key 17; left child with keys 5, 13 over leaves {2*, 3*}, {5*, 7*, 8*}, {14*, 16*}; right child with keys 24, 30 over leaves {19*, 20*, 22*}, {24*, 27*, 29*}, {33*, 34*, 38*, 39*}.]
  - Notice that the root was split, leading to an increase in height.
  - In this example, we could avoid the split by re-distributing entries between siblings, but this is not usually done in practice.

  22. Deleting a Data Entry from a B+ Tree
  - Start at the root, find the leaf L where the entry belongs.
  - Remove the entry.
    - If L is at least half-full, done!
    - If L has only d-1 entries:
      - try to re-distribute, borrowing from a sibling (an adjacent node with the same parent as L);
      - if re-distribution fails, merge L and the sibling.
  - If a merge occurred, an entry (pointing to L or to the sibling) must be deleted from the parent of L.
  - Merges can propagate to the root, decreasing the height.

  23. Current B+ Tree: Delete 19*, Then Delete 20*
  [Figure: the tree after inserting 8*: root with key 17; left child with keys 5, 13 over leaves {2*, 3*}, {5*, 7*, 8*}, {14*, 16*}; right child with keys 24, 30 over leaves {19*, 20*, 22*}, {24*, 27*, 29*}, {33*, 34*, 38*, 39*}.]

  24. Example Tree After (Inserting 8*, Then) Deleting 19* and 20*
  [Figure: the same tree, but the right child now has keys 27, 30 over leaves {22*, 24*}, {27*, 29*}, {33*, 34*, 38*, 39*}.]
  - Deleting 19* is easy.
  - Deleting 20* is done with re-distribution. Notice how the middle key is copied up.
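The re-distribution step on this slide can be replayed on the example values: after deleting 19* and 20*, the leaf {22*} is below minimum occupancy (d = 2), borrows 24* from its right sibling, and the sibling's new low key 27 is copied up to replace the old separator 24. Toy lists stand in for real pages.

```python
# Re-distribution in a B+ tree delete, on the slide's values.

def redistribute(leaf, right_sibling, parent_keys, sep_index):
    """Borrow one entry from the richer right sibling; the sibling's new
    low key is copied up as the separating key in the parent."""
    leaf.append(right_sibling.pop(0))
    parent_keys[sep_index] = right_sibling[0]

leaf = [22]              # below minimum after deleting 19* and 20*
sibling = [24, 27, 29]
parent = [24, 30]        # keys of the right index node before the borrow
redistribute(leaf, sibling, parent, 0)
```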
