hash based indexes
play

Hash-Based Indexes Database Management Systems, R. Ramakrishnan and - PDF document

Hash-Based Indexes Database Management Systems, R. Ramakrishnan and J. Gehrke 1 Introduction As for any index, 3 alternatives for data entries k* : Data record with key value k < k , rid of data record with search key value k >


  1. Hash-Based Indexes Database Management Systems, R. Ramakrishnan and J. Gehrke 1 Introduction � As for any index, 3 alternatives for data entries k* : � Data record with key value k � < k , rid of data record with search key value k > � < k , list of rids of data records with search key k > � Hash-based indexes are best for equality selections . – Provide constant-time searches – But cannot support range searches � Static and dynamic hashing techniques exist – Trade-offs similar to ISAM vs. B+ trees Database Management Systems, R. Ramakrishnan and J. Gehrke 2 Static Hashing � # primary pages fixed, allocated sequentially, never de-allocated; overflow pages if needed. � h ( k ) mod N = bucket to which data entry with key k belongs . (N = # of buckets) 0 h(key) mod N 2 key h N-1 Primary bucket pages Overflow pages Database Management Systems, R. Ramakrishnan and J. Gehrke 3

  2. Static Hashing (Contd.) � Buckets contain data entries . � Hash fn works on search key field of record r. Must distribute values over range 0 ... N-1. – h ( key ) = (a * key + b) usually works well. – a and b are constants; lots known about how to tune h . � Long overflow chains can develop and degrade performance – Extendible and Linear Hashing : Dynamic techniques to fix this problem. Database Management Systems, R. Ramakrishnan and J. Gehrke 4 Extendible Hashing � Main idea: If bucket (primary page) becomes full, why not re-organize file by doubling # of buckets? – Essentially “splitting” buckets � But reading and writing all buckets is expensive! – Idea : Use directory of pointers to buckets , – Double # of buckets by doubling the directory, splitting just the bucket that overflowed! – Directory much smaller than file, so doubling it is much cheaper. – No overflow pages ! Database Management Systems, R. Ramakrishnan and J. Gehrke 5 Insert h (r)=14 2 2 LOCAL DEPTH LOCAL DEPTH Bucket A Bucket A 32*16* 4* 12* 32*16* 4* 12* GLOBAL DEPTH GLOBAL DEPTH 2 2 2 2 Bucket B Bucket B 00 1* 5* 21*13* 00 1* 5* 21*13* 01 01 2 2 10 10 Bucket C Bucket C 10* 11 10* 11 14* 2 2 DIRECTORY DIRECTORY Bucket D Bucket D 15* 7* 19* 15* 7* 19* Database Management Systems, R. Ramakrishnan and J. Gehrke 6

  3. Insert h (r)=20 2 2 LOCAL DEPTH LOCAL DEPTH Bucket A Bucket A 32* 16* 32* 16* GLOBAL DEPTH 4* 12* GLOBAL DEPTH 2 2 2 2 Bucket B Bucket B 00 1* 5* 21*13* 00 1* 5* 21*13* 01 01 2 10 10 2 Bucket C Bucket C 11 10* 11 10* 2 2 DIRECTORY Bucket D DIRECTORY Bucket D 15* 7* 19* 15* 7* 19* 2 Bucket A2 4* 12* 20* (`split image' of Bucket A) Database Management Systems, R. Ramakrishnan and J. Gehrke 7 Insert h (r)=20 2 LOCAL DEPTH 3 LOCAL DEPTH Bucket A 32* 16* GLOBAL DEPTH 32* 16* Bucket A GLOBAL DEPTH 2 2 3 2 Bucket B 00 1* 5* 21*13* 000 1* 5* 21*13* Bucket B 01 001 2 10 2 010 Bucket C 10* 11 10* 011 Bucket C 100 2 2 DIRECTORY 101 Bucket D 15* 7* 19* 15* 7* 19* Bucket D 110 111 2 3 Bucket A2 4* 12* 20* DIRECTORY (`split image' 4* 12* 20* Bucket A2 of Bucket A) (`split image' of Bucket A) Database Management Systems, R. Ramakrishnan and J. Gehrke 8 Insert h (r)=32 LOCAL DEPTH LOCAL DEPTH GLOBAL DEPTH GLOBAL DEPTH 1 Bucket A 1 4* 12* 10* 32* 0 0 Bucket A 0 1* 10* 4* 12* 1 1 DIRECTORY Bucket B 1* DIRECTORY Database Management Systems, R. Ramakrishnan and J. Gehrke 9

  4. Insert h (r)=16 LOCAL DEPTH 2 LOCAL DEPTH Bucket A 32* 16* 4* 12* GLOBAL DEPTH GLOBAL DEPTH 1 1 2 Bucket A 1 4* 12* 10* 32* Bucket B 1* 00 0 01 1 1 2 10 Bucket C Bucket B 10* 11 1* DIRECTORY DIRECTORY Database Management Systems, R. Ramakrishnan and J. Gehrke 10 Insert h (r)=20 3 LOCAL DEPTH 32* 16* Bucket A GLOBAL DEPTH 2 LOCAL DEPTH 3 1 Bucket A 32* 16* 4* 12* GLOBAL DEPTH 000 1* Bucket B 001 1 2 2 010 Bucket B 1* 00 10* 011 Bucket C 01 100 2 10 101 Bucket C 10* 11 110 111 DIRECTORY 3 DIRECTORY 4* 12* 20* Bucket A2 (`split image' of Bucket A) Database Management Systems, R. Ramakrishnan and J. Gehrke 11 Insert h (r)=5, 15, 7, 19 3 3 LOCAL DEPTH LOCAL DEPTH 32* 16* Bucket A 32* 16* Bucket A GLOBAL DEPTH GLOBAL DEPTH 3 1 3 2 5* 15* 7* 000 1* Bucket B 000 1* 5* Bucket B 001 001 2 2 010 010 10* Bucket C 10* Bucket C 011 011 100 100 2 101 101 15* 7* 19* Bucket B2 110 110 (`split image' 111 111 of Bucket B) 3 3 DIRECTORY DIRECTORY 4* 12* 20* Bucket A2 4* 12* 20* Bucket A2 (`split image' (`split image' of Bucket A) of Bucket A) Database Management Systems, R. Ramakrishnan and J. Gehrke 12

  5. Deletions � Inverse of insertion � If removal of data entry makes bucket empty, merge with ‘split image’ � If each directory element points to same bucket as its split image, can halve directory Database Management Systems, R. Ramakrishnan and J. Gehrke 13 Comments on Extendible Hashing � If directory fits in memory, equality search answered with one disk access; else two – 100MB file, 100 bytes/rec, 4K pages contains 1,000,000 records (as data entries) and 25,000 directory elements; chances are high that directory will fit in memory. � Directory grows in spurts, and, if the distribution of hash values is skewed, directory can grow large – Multiple entries with same hash value cause problems! – When would this happen? Database Management Systems, R. Ramakrishnan and J. Gehrke 14 Linear Hashing � This is another dynamic hashing scheme, an alternative to Extendible Hashing � LH handles the problem of long overflow chains without using a directory, and handles duplicates � Main idea: split one bucket at a time in rounds Database Management Systems, R. Ramakrishnan and J. Gehrke 15

  6. Inserting h(r) = 43 Level=2, N=4 Level=2 h PRIMARY h OVERFLOW h h PRIMARY 3 2 PAGES 3 2 Next=0 Next=0 PAGES PAGES 32* 44* 36* 32* 000 00 000 00 Data entry r 9* 25* 5* 9* 25* 5* with h(r)=5 001 01 001 01 14* 18*10*30* 14* 18*10*30* 010 10 Primary 010 10 bucket page 31*35* 7* 11* 31*35* 7* 11* 43* 011 11 011 11 ( This info (The actual contents is for illustration of the linear hashed only!) file) Database Management Systems, R. Ramakrishnan and J. Gehrke 16 Example (Inserting h(r) = 43) Level=2 Level=2 h h PRIMARY OVERFLOW h h OVERFLOW PRIMARY 3 2 PAGES PAGES 3 2 Next=0 PAGES PAGES 32* 32* 000 00 000 00 Next=1 9* 25* 5* 9* 25* 5* 001 01 001 01 14* 18*10*30* 14* 18*10*30* 010 10 010 10 31*35* 7* 11* 43* 31*35* 7* 11* 43* 011 11 011 11 100 00 44* 36* Database Management Systems, R. Ramakrishnan and J. Gehrke 17 Inserting h(r) = 50 (End of a Round) Level=3 PRIMARY OVERFLOW h3 h PAGES 2 PAGES Next=0 Level=2 000 00 32* PRIMARY OVERFLOW PAGES h3 h PAGES 2 001 01 9* 25* 32* 000 00 010 10 66* 18* 10* 34* 50* 9* 25* 001 01 011 11 43* 35* 11* 010 10 66* 18* 10* 34* Next=3 100 00 44* 36* 31* 35* 7* 11* 43* 011 11 101 11 5* 37* 29* 100 44* 36* 00 14* 30* 22* 5* 110 10 101 37*29* 01 14* 30* 22* 31*7* 111 11 110 10 Database Management Systems, R. Ramakrishnan and J. Gehrke 18

  7. Overview of LH File � In the middle of a round. Buckets split in this round: Bucket to be split If h ( search key value ) Level Next is in this range, must use h Level+1 ( search key value ) Buckets that existed at the to decide if entry is in beginning of this round: `split image' bucket. this is the range of h Level `split image' buckets: created (through splitting of other buckets) in this round Database Management Systems, R. Ramakrishnan and J. Gehrke 19 Summary � Hash-based indexes: best for equality searches, cannot support range searches. � Static Hashing can lead to long overflow chains. � Extendible Hashing uses directory doubling to avoid overflow pages – Duplicates may require overflow pages � Linear hashing avoids directory by splitting in rounds – Naturally handles skew and duplicates – Uses overflow buckets (but not very long in practice) Database Management Systems, R. Ramakrishnan and J. Gehrke 20

Download Presentation
Download Policy: The content available on the website is offered to you 'AS IS' for your personal information and use only. It cannot be commercialized, licensed, or distributed on other websites without prior consent from the author. To download a presentation, simply click this link. If you encounter any difficulties during the download process, it's possible that the publisher has removed the file from their server.

Recommend


More recommend