Hashing




  1. Hashing: basic plan

Save items in a key-indexed table (index is a function of the key).

Hash function. Method for computing an array index from a key.
Ex: hash("it") = 3, so "it" is stored at table index 3.

Issues.
- Computing the hash function.
- Equality test: method for checking whether two keys are equal.
- Collision resolution: algorithm and data structure to handle two keys that
  hash to the same array index, e.g. hash("times") = 3 as well.

Classic space-time tradeoff.
- No space limitation: trivial hash function with key as index.
- No time limitation: trivial collision resolution with sequential search.
- Space and time limitations: hashing (the real world).

Computing the hash function

Idealistic goal. Scramble the keys uniformly to produce a table index.
- Efficiently computable.
- Each table index equally likely for each key.
A thoroughly researched problem, still problematic in practical applications.

Ex 1. Phone numbers.
- Bad: first three digits.
- Better: last three digits.

Ex 2. Social Security numbers.
- Bad: first three digits (573 = California, 574 = Alaska; numbers were
  assigned in chronological order within geographic region).
- Better: last three digits.

Practical challenge. Need a different approach for each key type.

Uniform hashing assumption

Uniform hashing assumption. Each key is equally likely to hash to an integer
between 0 and M − 1.

Bins and balls. Throw balls uniformly at random into M bins.
- Birthday problem. Expect two balls in the same bin after ~ √(π M / 2) tosses.
- Coupon collector. Expect every bin to have ≥ 1 ball after ~ M ln M tosses.
- Load balancing. After M tosses, expect the most loaded bin to have
  Θ(log M / log log M) balls.

Some slides created by or adapted from Dr. Kevin Wayne. For more information
see http://www.cs.princeton.edu/courses/archive/fall12/cos226/lectures.php
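A minimal sketch (not from the slides) of an efficiently computable hash function for string keys: Horner's method with a small multiplier, reduced mod the table size at each step. The multiplier 31 and the prime table size 97 are illustrative choices, not prescribed by the notes.

```python
M = 97  # table size; a prime (illustrative choice)

def string_hash(key, m=M):
    """Map a string key to a table index in [0, m - 1] via Horner's method."""
    h = 0
    for ch in key:
        h = (31 * h + ord(ch)) % m  # 31 is a conventional small multiplier
    return h
```

The same key always maps to the same index, and the index always lies in the table's address space, which is all the basic plan requires of the hash function.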

  2. Collisions

Collision. Two distinct keys hashing to the same index.
- Birthday problem ⇒ can't avoid collisions unless you have a ridiculous
  (quadratic) amount of memory.
- Coupon collector + load balancing ⇒ collisions are evenly distributed.

Challenge. Deal with collisions efficiently.

Options for dealing with collisions
1. Open hashing, aka separate chaining: store collisions in a linked list.
2. Closed hashing, aka open addressing: keep keys in the table, shifting to
   unused space. Collision resolution policies:
   - Linear probing
   - Quadratic probing, aka quadratic residue search
   - Double hashing

Separate chaining symbol table [H. P. Luhn, IBM 1953]

Use an array of M < N linked lists.
- Hash: map key to integer i between 0 and M − 1.
- Insert: put at the front of the i-th chain (if not already there).
- Search: need to search only the i-th chain.

Analysis of separate chaining

Proposition. Under the uniform hashing assumption, the probability that the
number of keys in a list is within a constant factor of N / M is extremely
close to 1.

Pf sketch. The distribution of list sizes obeys a binomial distribution.
(Figure: binomial distribution of list sizes for N = 10^4, M = 10^3, α = 10.)

Consequence. The number of probes for search/insert is proportional to N / M
(M times faster than sequential search). For Java key types, equals() and
hashCode() must be implemented consistently.
- M too large ⇒ too many empty chains.
- M too small ⇒ chains too long.
- Typical choice: M ~ N / 5 ⇒ constant-time ops.
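The steps above can be sketched as a small separate-chaining symbol table. This is an illustrative implementation, not the course's official code; Python lists stand in for the linked-list chains, and the class name and sizes are invented for the example.

```python
class ChainingHashTable:
    """Separate chaining: an array of M chains, one per table slot."""

    def __init__(self, m=97):
        self.m = m
        self.chains = [[] for _ in range(m)]

    def _hash(self, key):
        return hash(key) % self.m  # map key to integer in [0, M - 1]

    def put(self, key, value):
        chain = self.chains[self._hash(key)]
        for i, (k, _) in enumerate(chain):
            if k == key:                   # key already present: update it
                chain[i] = (key, value)
                return
        chain.insert(0, (key, value))      # else insert at front of chain

    def get(self, key):
        for k, v in self.chains[self._hash(key)]:  # search only one chain
            if k == key:
                return v
        return None
```

Each operation touches a single chain, so with M ~ N / 5 the expected chain length, and hence the cost of put/get, stays constant.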

  3. Closed hashing

Records are stored directly in a table of size M, at hash index h(x) for
key x. When a collision occurs (a key hashes to an occupied home position),
the record is stored in the first available slot found by a repeatable
collision resolution policy. Formally, the positions

  h_0(x), h_1(x), ..., h_i(x)

are tried in succession, where h_i(x) = (h(x) + f(i)) mod M.

Closed hashing: insert

  Hash(key) into table at position i
  Repeat up to the size of the table {
    If the entry at position i is blank or marked as deleted,
      then insert and exit
    Let i be the next position using the collision resolution function
  }

Closed hashing: search

  Hash(key) into table at position i
  Repeat up to the size of the table {
    If the entry at position i matches the key and is not marked as deleted,
      then found; exit
    If the entry at position i is blank,
      then not found; exit
    Let i be the next position using the collision resolution function
  }
  Not found; exit

Closed hashing: delete

  Hash(key) into table at position i
  Repeat up to the size of the table {
    If the entry at position i matches the key,
      then mark it as deleted and exit
    If the entry at position i is blank,
      then not found; exit
    Let i be the next position using the collision resolution function
  }
  Not found; exit
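The three routines above translate almost line for line into code. This is a minimal sketch (assuming linear probing, f(i) = i, as the resolution function, which the next section defines); the sentinel objects for blank and deleted slots are an implementation choice, not part of the notes.

```python
EMPTY, DELETED = object(), object()  # sentinels: blank slot / tombstone

class ClosedHashTable:
    def __init__(self, m=11):
        self.m = m
        self.slots = [EMPTY] * m

    def _probe(self, x, i):
        return (hash(x) + i) % self.m  # h_i(x) = (h(x) + f(i)) mod M

    def insert(self, key):
        for i in range(self.m):        # repeat up to the size of the table
            j = self._probe(key, i)
            if self.slots[j] is EMPTY or self.slots[j] is DELETED:
                self.slots[j] = key    # first available slot
                return True
        return False                   # table full

    def search(self, key):
        for i in range(self.m):
            j = self._probe(key, i)
            if self.slots[j] is EMPTY:  # blank slot: key cannot be present
                return False
            if self.slots[j] == key:    # tombstones never compare equal
                return True
        return False

    def delete(self, key):
        for i in range(self.m):
            j = self._probe(key, i)
            if self.slots[j] is EMPTY:
                return False
            if self.slots[j] == key:
                self.slots[j] = DELETED  # mark as deleted, don't blank out
                return True
        return False
```

Note why delete must leave a tombstone rather than a blank: a blank slot terminates every later search, so blanking out a slot in the middle of a probe sequence would make keys beyond it unreachable.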

  4. Linear probing

Collision resolution function f(i) = i:

  h_i(x) = (h(x) + i) mod M

(Work example in class.)

Clustering

Cluster. A contiguous block of items.
Observation. New keys are likely to hash into the middle of big clusters.

Knuth's parking problem

Model. Cars arrive at a one-way street with M parking spaces. Each desires a
random space i: if space i is taken, try i + 1, i + 2, etc.

Q. What is the mean displacement of a car?
- Half-full. With M / 2 cars, mean displacement is ~ 3/2.
- Full. With M cars, mean displacement is ~ √(π M / 8).

Analysis of linear probing

Proposition. Under the uniform hashing assumption, the average number of
probes in a linear probing hash table of size M that contains N = α M keys is:

  search hit:           ~ (1/2) (1 + 1/(1 − α))
  search miss / insert: ~ (1/2) (1 + 1/(1 − α)^2)

Parameters.
- M too large ⇒ too many empty array entries.
- M too small ⇒ search time blows up.
- Typical choice: α = N / M ~ 1/2, giving about 3/2 probes per search hit and
  about 5/2 per search miss.
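A quick numeric check (not from the slides) of the two cost formulas, confirming the stated 3/2 and 5/2 figures at the typical load factor α = 1/2:

```python
def probes_hit(alpha):
    """Expected probes for a linear-probing search hit at load factor alpha."""
    return 0.5 * (1 + 1 / (1 - alpha))

def probes_miss(alpha):
    """Expected probes for a search miss / insert at load factor alpha."""
    return 0.5 * (1 + 1 / (1 - alpha) ** 2)
```

probes_hit(0.5) gives 1.5 and probes_miss(0.5) gives 2.5, and the miss cost grows much faster than the hit cost as α approaches 1, which is why the load factor must be kept well below full.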

  5. Performance comparison of search

                            Worst-case cost (n inserts)   Avg.-case cost (n inserts)   Ordered
                            search    insert    delete    search    insert    delete   iteration?
  Sequential search
   (unordered list)         Θ(n)      Θ(n)      Θ(n)      Θ(n)      Θ(n)      Θ(n)     no
  Binary search
   (ordered array)          Θ(log n)  Θ(n)      Θ(n)      Θ(log n)  Θ(n)      Θ(n)     yes
  BST                       Θ(n)      Θ(n)      Θ(n)      Θ(log n)  Θ(log n)  Θ(log n) yes
  AVL                       Θ(log n)  Θ(log n)  Θ(log n)  Θ(log n)  Θ(log n)  Θ(log n) yes
  B-tree                    Θ(log n)  Θ(log n)  Θ(log n)  Θ(log n)  Θ(log n)  Θ(log n) yes
  Hash table                Θ(n)      Θ(n)      Θ(n)      Θ(1)      Θ(1)      Θ(1)     no

Load factors and cost of probing

What size input do we need, using linear probing with a load factor of
α = 0.75 for closed hashing, to achieve a more efficient expected search time
than a balanced binary search tree?

  Search hit:           (1/2) (1 + 1/(1 − 3/4))   = 2.5
  Search miss / insert: (1/2) (1 + 1/(1 − 3/4)^2) = 8.5

A balanced BST costs about log2 n probes per search, so the hash table wins
once log2 n ≥ 8.5, i.e. for input sizes n ≥ 2^8.5 ≈ 362.

(Figures: expected number of probes for insert/search miss, and break-even
input size for hash table vs. BST, each plotted against load factor α.)

Quadratic probing

Collision resolution function f(i) = ± i^2:

  h_i(x) = (h(x) ± i^2) mod M,   for 1 ≤ i ≤ (M − 1)/2

M is a prime number of the form 4j + 3, which guarantees that the probe
sequence is a permutation of the table address space.

Eliminates primary clustering (when collisions group together, causing more
collisions even for keys that hash to different values).

(Work example in class.)
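A small check (not from the slides) of the permutation claim for quadratic probing: with M a prime of the form 4j + 3, the sequence h, h + 1^2, h − 1^2, h + 2^2, h − 2^2, ... taken mod M visits every table address exactly once. The values M = 7 and M = 11 (both of the form 4j + 3) are illustrative choices.

```python
def probe_sequence(h, m):
    """All positions tried by +/- i^2 quadratic probing from home slot h."""
    seq = [h % m]
    for i in range(1, (m - 1) // 2 + 1):
        seq.append((h + i * i) % m)  # h(x) + i^2 mod M
        seq.append((h - i * i) % m)  # h(x) - i^2 mod M
    return seq
```

For example, probe_sequence(3, 7) yields [3, 4, 2, 0, 6, 5, 1]: every slot of a 7-entry table, each exactly once, so an insert never misses an available slot.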

  6. Double hashing

With quadratic probing, secondary clustering remains: keys that collide must
follow the sequence of prior collisions to find an open spot. Double hashing
reduces both primary and secondary clustering: the probe sequence depends on
the original key, not just on one hash value.

Collision resolution function f(i) = i · h_B(x):

  h_i(x) = (h_A(x) + i · h_B(x)) mod M

Works best if M is prime.

Our approach: h_A(x) = x mod M and h_B(x) = R − (x mod R), where R is a
prime < M.

Rehashing

We have already seen how hash table performance falls rapidly as the table
load factor approaches 1 (in practice, any load factor above 1/2 should be
avoided).

To rehash: create a new table whose capacity M′ is the first prime more than
twice as large as M, then scan through the old table and insert each entry
into the new table, ignoring cells marked as deleted.

Running time is Θ(M), a relatively expensive operation on its own. But good
hash table implementations only rehash when the table is half full and then
double in size, so the operation should be rare. Amortized over the M / 2
insertions, the cost is a constant addition per insertion.
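A minimal sketch of the double-hashing probe sequence defined above, for integer keys, with the notes' choice h_A(x) = x mod M and h_B(x) = R − (x mod R). The specific primes M = 13 and R = 11 are assumptions for the example.

```python
M, R = 13, 11  # table size and secondary prime: both prime, R < M

def h_a(x):
    return x % M

def h_b(x):
    return R - (x % R)  # always in [1, R], never 0, so probing never stalls

def probe(x, i):
    """Position tried on the i-th step for integer key x."""
    return (h_a(x) + i * h_b(x)) % M
```

Because M is prime and the step size h_B(x) is never a multiple of M, the probes for any fixed key cycle through all M slots; and because the step size varies with the key, two keys with the same home slot still follow different probe sequences, which is what eliminates secondary clustering.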
