
Expected Value - PowerPoint PPT Presentation



  1. Hashing
     Today’s announcements:
     ◮ PA2 out, due Nov 1, 23:59
     ◮ MT2 Nov 7, 19:00-21:00, WOOD 2
     Today’s Plan
     ◮ Hashing
     ◮ Universal hash functions
     [Figure: a hash table with slots 0 to 20; the key 12345 appears at slot 6.]

  2. Expected Value
     Definition: The expected value of a number X that depends on random events (X is called a random variable) is
         $E[X] = \sum_x \mathrm{Prob}[X = x] \cdot x$.
     Example: X is the sum of two six-sided dice. E[X] = 7, the average of the 36 equally likely sums in the table (rows: first die, columns: second die):

              1   2   3   4   5   6
          1   2   3   4   5   6   7
          2   3   4   5   6   7   8
          3   4   5   6   7   8   9
          4   5   6   7   8   9  10
          5   6   7   8   9  10  11
          6   7   8   9  10  11  12

     Linearity of Expectation
     For any two random variables X and Y, $E[X + Y] = E[X] + E[Y]$.
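The dice example can be checked by brute force. The following is a minimal sketch, not part of the slides: the function names `expectedSingleDie` and `expectedSumOfTwoDice` are hypothetical, and both assume fair dice. Averaging over equally likely outcomes is the same as applying the definition, because grouping outcomes by their value x gives Prob[X = x] · x.

```cpp
#include <cassert>

// E[D] for one fair die with `sides` faces: every face has probability 1/sides.
double expectedSingleDie(int sides = 6) {
    assert(sides > 0);
    long long total = 0;
    for (int face = 1; face <= sides; ++face)
        total += face;
    return static_cast<double>(total) / sides;
}

// E[X] for X = sum of two fair dice, by enumerating all sides*sides outcomes.
// Each outcome (a, b) has probability 1/(sides*sides).
double expectedSumOfTwoDice(int sides = 6) {
    assert(sides > 0);
    long long total = 0;
    for (int a = 1; a <= sides; ++a)
        for (int b = 1; b <= sides; ++b)
            total += a + b;
    return static_cast<double>(total) / (static_cast<double>(sides) * sides);
}
```

For six-sided dice the enumeration agrees with linearity of expectation: the expected sum equals twice the expected value of a single die.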

  3. More Birthdays
     What is the expected number of pairs of people who share a birthday in this room?
     Let $X_{ij} = \begin{cases} 1 & \text{if persons } i \text{ and } j \text{ have the same birthday} \\ 0 & \text{otherwise} \end{cases}$
     $X = \sum_{i<j} X_{ij}$ is the number of pairs who share a birthday.
     $E[X] = E\bigl[\sum_{i<j} X_{ij}\bigr] = \sum_{i<j} E[X_{ij}] = \sum_{i<j} \frac{1}{365}$ (assuming 365 equally likely birthdays).
     Generalized birthdays
     If we randomly put k people into m bins, we expect $\frac{k(k-1)}{2} \cdot \frac{1}{m}$ pairs to share a bin, which is greater than 1 for $k = \sqrt{2m} + 1$.
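The generalized formula is easy to evaluate numerically. This is a small sketch under the slide's assumption that each of the k people lands in one of m bins uniformly at random; the function names are hypothetical.

```cpp
#include <cassert>

// Expected number of pairs sharing a bin: (k choose 2) pairs, each colliding
// with probability 1/m, combined by linearity of expectation.
double expectedSharedPairs(long long k, long long m) {
    assert(k >= 0 && m > 0);
    return static_cast<double>(k) * (k - 1) / (2.0 * m);
}

// Smallest group size k for which more than one shared pair is expected.
long long smallestGroupExpectingAPair(long long m) {
    assert(m > 0);
    long long k = 0;
    while (expectedSharedPairs(k, m) <= 1.0)
        ++k;
    return k;
}
```

For m = 365 the threshold is 28 people, which matches $\sqrt{2m} + 1 \approx 28.02$.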

  4. Hash table approach
     Choose a hash function to map keys to indices.
     [Figure: the keys GNU, Linux, GNU’s Not Unix, Multics, Unics, and Unix are mapped by a hash function into a hash table with slots 0 to m − 1; for example, hash(“GNU”) = 2.]

  5. Hashing string keys with mod and Horner’s Rule

         // m is the table size (assumed to be defined elsewhere).
         int hash( string s ) {
           int h = 0;
           // Horner's Rule: treat the string as a base-256 number, reduced mod m.
           for (int i = (int)s.length() - 1; i >= 0; i--) {
             h = (256 * h + s[i]) % m;
           }
           return h;
         }

     Compare that to the hash function from yacc:

         #define TABLE_SIZE 1024 // must be a power of 2
         int hash( char *s ) {
           int h = *s++;
           while( *s )
             h = (31 * h + *s++) & (TABLE_SIZE - 1);
           return h;
         }

     What’s different?

  6. Fixed hash functions are dangerous!
     Good hash table performance depends on few collisions. If a user knows your hash function, she can cause many elements to hash to the same slot. Why would she want to do that?
     Yacc
     $h(s) = (31^{k-1} s[0] + 31^{k-2} s[1] + \cdots + 31^{0} s[k-1]) \bmod 1024$, where k is the length of s.
     h(“XY”) = h(“xy”). Find many strings that hash to the same slot?
     Protection
     ◮ Use a cryptographically secure hash function (e.g. SHA-512).
     ◮ Choose a new hash function at random for every hash table.
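One way to answer the question about finding many colliding strings: since "XY" and "xy" hash to the same value, any two strings built from the same number of such two-character blocks also collide. The sketch below is an illustration, not the yacc source: `yaccStyleHash` is an adaptation of the slide's function to `std::string` (its handling of the empty string is an assumption), and `collidingStrings` is a hypothetical helper.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

const int TABLE_SIZE = 1024;  // must be a power of 2, as on the yacc slide

// Adaptation of the yacc-style hash: h = 31*h + c, masked to the table size.
int yaccStyleHash(const std::string& s) {
    if (s.empty()) return 0;  // assumption: the empty string maps to slot 0
    int h = static_cast<unsigned char>(s[0]);
    for (std::size_t i = 1; i < s.size(); ++i)
        h = (31 * h + static_cast<unsigned char>(s[i])) & (TABLE_SIZE - 1);
    return h;
}

// Builds all 2^blocks strings made of `blocks` two-character blocks, each
// block being either "XY" or "xy". All of them land in the same slot.
std::vector<std::string> collidingStrings(int blocks) {
    assert(blocks >= 1 && blocks < 20);
    std::vector<std::string> result;
    for (unsigned mask = 0; mask < (1u << blocks); ++mask) {
        std::string s;
        for (int i = 0; i < blocks; ++i)
            s += ((mask >> i) & 1u) ? "xy" : "XY";
        result.push_back(s);
    }
    return result;
}
```

The collision works because 31 · 'X' + 'Y' = 2817 and 31 · 'x' + 'y' = 3841 differ by exactly 1024, so the two blocks are interchangeable modulo the table size wherever they appear in a string.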

  7. Universal hash functions
     A set H of hash functions is universal if for all $x \neq y$, the probability that hash(x) = hash(y) is at most 1/m when hash() is chosen at random from H.
     Example: Let p be a prime number larger than any key. Choose a at random from $\{1, 2, \ldots, p-1\}$ and choose b at random from $\{0, 1, \ldots, p-1\}$.
         $\mathrm{hash}(x) = ((a \cdot x + b) \bmod p) \bmod m$
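The example family can be sketched as a small function object. The names `UniversalHash` and `pickUniversalHash` are hypothetical. The sketch assumes the caller supplies a prime p below 2^32 (so that a · x + b fits in 64 bits) and only hashes keys smaller than p; it does not check that p is actually prime.

```cpp
#include <cassert>
#include <cstdint>
#include <random>

// One member of the family: hash(x) = ((a*x + b) mod p) mod m.
struct UniversalHash {
    std::uint64_t a, b, p, m;

    std::uint64_t operator()(std::uint64_t x) const {
        assert(x < p);  // p must be larger than any key
        return ((a * x + b) % p) % m;
    }
};

// Picks a from {1, ..., p-1} and b from {0, ..., p-1} uniformly at random.
UniversalHash pickUniversalHash(std::uint64_t p, std::uint64_t m,
                                std::mt19937_64& rng) {
    assert(m > 0 && p > 1 && p < (1ULL << 32));
    std::uniform_int_distribution<std::uint64_t> pickA(1, p - 1);
    std::uniform_int_distribution<std::uint64_t> pickB(0, p - 1);
    return UniversalHash{pickA(rng), pickB(rng), p, m};
}
```

A table would call `pickUniversalHash` once when it is created, for instance with the prime 2147483647, and keep the resulting function for its lifetime, so that an adversary cannot predict which keys collide.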

  8. Collisions
     Birthday Paradox
     With probability > 1/2, two people in a room of 23 have the same birthday. (Hash 23 people into m = 365 slots. Collision?)
     General birthday paradox
     If we randomly hash $\sqrt{2m}$ keys into m slots, we get a collision with probability > 1/2.
     Collision
     Unless we know all the keys in advance and design a perfect hash function, we must handle collisions. What do we do when two keys hash to the same slot?
     ◮ separate chaining: store multiple items in each slot
     ◮ open addressing: pick a next slot to try
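The birthday figures can be reproduced with a short calculation. This sketch assumes every key is hashed independently and uniformly into one of m slots; `collisionProbability` is a hypothetical name.

```cpp
#include <cassert>

// Probability that at least two of k keys share a slot among m slots.
// The i-th key (counting from 0) avoids all earlier keys with probability
// (m - i) / m, so the no-collision probability is the product of those terms.
double collisionProbability(int k, int m) {
    assert(k >= 0 && m > 0);
    if (k > m) return 1.0;  // pigeonhole: a collision is certain
    double noCollision = 1.0;
    for (int i = 0; i < k; ++i)
        noCollision *= static_cast<double>(m - i) / m;
    return 1.0 - noCollision;
}
```

With m = 365 the probability first exceeds 1/2 at k = 23, and k = 27 (roughly $\sqrt{2 \cdot 365}$) is already well past that point.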

  9. Hashing with Chaining
     Store multiple items in each slot. How?
     ◮ Common choice is an unordered linked list (a chain).
     ◮ Could use any dictionary ADT implementation.
     Result
     ◮ Can hash more than m items into a table of size m.
     ◮ Performance depends on the length of the chains.
     ◮ Memory is allocated on each insertion.
     [Figure: a hash table with slots 0 to 6; slot 1 holds the chain A, D; slot 3 holds the chain E, B; slot 6 holds C.]
     hash(A) = hash(D) = 1, hash(E) = hash(B) = 3
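A minimal sketch of separate chaining with unordered linked lists follows. The class name `ChainedHashSet` and the choice of key % m as the hash function are assumptions made for illustration, and keys are assumed to be non-negative integers.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <list>
#include <vector>

class ChainedHashSet {
public:
    explicit ChainedHashSet(std::size_t m) : slots_(m) { assert(m > 0); }

    // Adds the key to the front of its chain unless it is already present.
    // The list node is allocated here, on each insertion.
    void insert(int key) {
        std::list<int>& chain = slots_[slotOf(key)];
        if (std::find(chain.begin(), chain.end(), key) == chain.end())
            chain.push_front(key);
    }

    // Searches only the chain of the slot the key hashes to.
    bool contains(int key) const {
        const std::list<int>& chain = slots_[slotOf(key)];
        return std::find(chain.begin(), chain.end(), key) != chain.end();
    }

    std::size_t chainLength(std::size_t slot) const {
        return slots_.at(slot).size();
    }

private:
    // Assumed hash function: key mod table size, for non-negative keys.
    std::size_t slotOf(int key) const {
        assert(key >= 0);
        return static_cast<std::size_t>(key) % slots_.size();
    }

    std::vector<std::list<int>> slots_;
};
```

Because each slot owns a list, the table can hold more than m keys; the cost of `contains` grows with the length of the chain it has to walk.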

  10. Access time for Chaining
     Load Factor
         $\alpha = \frac{\text{\# hashed items}}{\text{table size}} = \frac{n}{m}$
     Assume we have a uniform hash function (every item hashes to each slot with equal probability).
     Search cost
     On average,
     ◮ an unsuccessful search examines $\alpha$ items.
     ◮ a successful search examines $1 + \frac{n-1}{2m} = 1 + \frac{\alpha}{2} - \frac{\alpha}{2n}$ items.
     We want the load factor to be small.
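The successful-search cost can be derived with linearity of expectation. The sketch below is not taken from the slides: it assumes new items are inserted at the head of a chain and that each of the n stored items is equally likely to be the one searched for. The indices i and j are introduced only for this derivation.

```latex
% Items are numbered 1..n in insertion order. Searching for item i examines
% item i itself plus every later item j > i that landed in the same slot,
% which happens with probability 1/m for each such j.
\begin{align*}
E[\text{items examined}]
  &= \frac{1}{n} \sum_{i=1}^{n} \Bigl( 1 + \sum_{j=i+1}^{n} \frac{1}{m} \Bigr) \\
  &= 1 + \frac{1}{nm} \sum_{i=1}^{n} (n - i) \\
  &= 1 + \frac{1}{nm} \cdot \frac{n(n-1)}{2} \\
  &= 1 + \frac{n-1}{2m}
   = 1 + \frac{\alpha}{2} - \frac{\alpha}{2n}.
\end{align*}
```

An unsuccessful search walks one entire chain, whose expected length under the same uniformity assumption is n/m, that is, the load factor.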

  11. Open Addressing
     Allow only one item in each slot. The hash function specifies a sequence of slots to try.
     Insert: If the first slot is occupied, try the next, then the next, ... until an empty slot is found.
     Find: If the first slot doesn’t match, try the next, then the next, ... until a match (found) or an empty slot (not found).
     Result
     ◮ Cannot hash more than m items into a table of size m. [Pigeonhole Principle]
     ◮ Hash table memory allocated once.
     ◮ Performance depends on number of tries.
     [Figure: a hash table with slots 0 to 6; A is in slot 1, D in slot 2, E in slot 3, B in slot 4, and C in slot 6.]

  12. Linear probing
     Try location (hash(k) + i) mod m for i = 0, 1, ...
     Table contents after each insertion, where hash(k) = k % 7:

         slot  insert(76)  insert(93)  insert(40)  insert(47)  insert(10)  insert(55)
               76%7 = 6    93%7 = 2    40%7 = 5    47%7 = 5    10%7 = 3    55%7 = 6
         0                                         47          47          47
         1                                                                 55
         2                 93          93          93          93          93
         3                                                     10          10
         4
         5                             40          40          40          40
         6     76          76          76          76          76          76
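The insertion sequence above can be replayed with a short open-addressing table. This is a sketch under stated assumptions: the class name `LinearProbingTable` and the sentinel `kEmpty` are hypothetical, keys are non-negative integers, the hash function is key % m as in the example, and deletion is left out because it needs extra care in open addressing.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

const int kEmpty = -1;  // marks an unused slot; assumes keys are non-negative

class LinearProbingTable {
public:
    explicit LinearProbingTable(std::size_t m) : slots_(m, kEmpty) {
        assert(m > 0);
    }

    // Tries (hash(key) + i) mod m for i = 0, 1, ... and stores the key in the
    // first empty slot. Returns the slot used, or -1 if the table is full.
    int insert(int key) {
        assert(key >= 0);
        const std::size_t m = slots_.size();
        for (std::size_t i = 0; i < m; ++i) {
            const std::size_t pos = (hash(key) + i) % m;
            if (slots_[pos] == kEmpty || slots_[pos] == key) {
                slots_[pos] = key;
                return static_cast<int>(pos);
            }
        }
        return -1;
    }

    // Follows the same probe sequence until a match or an empty slot.
    bool find(int key) const {
        const std::size_t m = slots_.size();
        for (std::size_t i = 0; i < m; ++i) {
            const std::size_t pos = (hash(key) + i) % m;
            if (slots_[pos] == key) return true;
            if (slots_[pos] == kEmpty) return false;
        }
        return false;  // probed every slot without finding the key
    }

    const std::vector<int>& slots() const { return slots_; }

private:
    std::size_t hash(int key) const {
        return static_cast<std::size_t>(key) % slots_.size();
    }

    std::vector<int> slots_;
};
```

Inserting 76, 93, 40, 47, 10, and 55 into a table of size 7 places 47 in slot 0 and 55 in slot 1, because both wrap around past the occupied slots 5 and 6.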
