Hashing
Today’s announcements:
◮ PA2 out, due Nov 1, 23:59
◮ MT2 Nov 7, 19:00-21:00 WOOD 2
Today’s Plan
◮ Hashing
◮ Universal hash functions
Expected Value
Definition: The expected value of a number X that depends on random events (X is called a random variable) is:
$E[X] = \sum_x \Pr[X = x] \cdot x$
Example: X is the sum of two six-sided dice. Each of the 36 entries in the table below is equally likely (probability 1/36), so $E[X] = 7$.
      1   2   3   4   5   6
  1   2   3   4   5   6   7
  2   3   4   5   6   7   8
  3   4   5   6   7   8   9
  4   5   6   7   8   9  10
  5   6   7   8   9  10  11
  6   7   8   9  10  11  12
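The definition can be checked by brute force. The following is a minimal sketch, not part of the original slides; the function name `expectedSumOfTwoDice` and the `sides` parameter are illustrative assumptions. It sums value times probability over every outcome of two fair dice.

```cpp
#include <cassert>

// Expected value of the sum of two fair dice with `sides` faces each,
// computed straight from the definition: sum over outcomes of Prob * value.
// (Hypothetical helper for illustration.)
double expectedSumOfTwoDice(int sides) {
    assert(sides > 0);
    const double outcomeProbability = 1.0 / (sides * sides);
    double expectation = 0.0;
    for (int first = 1; first <= sides; ++first) {
        for (int second = 1; second <= sides; ++second) {
            expectation += outcomeProbability * (first + second);
        }
    }
    return expectation;
}
```

Calling `expectedSumOfTwoDice(6)` evaluates to 7 up to floating-point rounding, which matches adding the expected value 3.5 of each die separately.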
Linearity of Expectation
For any two random variables X and Y, $E[X + Y] = E[X] + E[Y]$.
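What is the expected number of pairs of people who share a birthday in this room?
Let $X_{ij} = 1$ if persons i and j have the same birthday, and $X_{ij} = 0$ otherwise.
$X = \sum_{i<j} X_{ij}$ is the number of pairs who share a birthday.
$E[X] = E[\sum_{i<j} X_{ij}] = \sum_{i<j} E[X_{ij}] = \sum_{i<j} \frac{1}{365} = \frac{1}{365} \cdot \frac{k(k-1)}{2}$
for k people in the room, assuming all 365 birthdays are equally likely.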
Generalized birthdays
If we randomly put k people into m bins, we expect $\frac{1}{m} \cdot \frac{k(k-1)}{2}$ pairs to share a bin, which is greater than 1 for $k = \sqrt{2m} + 1$.
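To get a feel for the formula, here is a small sketch that evaluates it. The function name `expectedSharedPairs` is an assumption made for illustration only.

```cpp
#include <cassert>

// Expected number of pairs that share a bin when k items are placed
// independently and uniformly at random into m bins: (1/m) * k(k-1)/2.
// (Hypothetical helper for illustration.)
double expectedSharedPairs(long long k, long long m) {
    assert(k >= 0 && m > 0);
    return static_cast<double>(k) * static_cast<double>(k - 1) / (2.0 * m);
}
```

With m = 365, $\sqrt{2m} + 1$ is about 28. The function gives roughly 1.04 for k = 28 and a value below 1 for k = 27.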
Choose a hash function to map keys to indices.
[Figure: a hash function maps the keys Multics, Linux, GNU, Unix, and Unics to slots of a hash table indexed up to m − 1. The figure labels include hash(“GNU”) = 2 and the entry “GNU’s Not Unix”.]
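    // m is the table size, assumed to be defined elsewhere.
    int hash( string s ) {
        int h = 0;
        // Process the characters from last to first.
        for (int i = (int)s.length() - 1; i >= 0; i--) {
            // Cast so that characters above 127 cannot make h negative.
            h = (256 * h + (unsigned char)s[i]) % m;
        }
        return h;
    }

Compare that to the hash function from yacc:

    #define TABLE_SIZE 1024 // must be a power of 2
    // Assumes s is a non-empty, NUL-terminated string.
    int hash( char *s ) {
        int h = *s++;
        while( *s )
            h = (31 * h + *s++) & (TABLE_SIZE - 1);
        return h;
    }

What’s different?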
Good hash table performance depends on few collisions. If a user knows your hash function, she can cause many elements to hash to the same slot. Why would she want to do that?
Yacc: for a string s of length k, the mask & (TABLE_SIZE − 1) is the same as mod 1024, so
$h(s) = (31^{k-1} s[0] + 31^{k-2} s[1] + \cdots + 31^{0} s[k-1]) \bmod 1024$
and h(“XY”) = h(“xy”). Find many strings that hash to the same slot?
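One way to answer the question, sketched under assumptions: because 'x' − 'X' = 'y' − 'Y' = 32, swapping a block “XY” for “xy” changes the polynomial by a multiple of 31 · 32 + 32 = 1024, which vanishes mod 1024. So every string built from those two blocks lands in the same slot. The names `yaccHash` and `collidingStrings` below are illustrative, and `yaccHash` is a `std::string` rewrite of the yacc function rather than the original code.

```cpp
#include <cassert>
#include <string>
#include <vector>

const int TABLE_SIZE = 1024;  // must be a power of 2, as in the yacc example

// The yacc-style hash, rewritten to take a non-empty std::string.
int yaccHash(const std::string& s) {
    assert(!s.empty());
    int h = static_cast<unsigned char>(s[0]);
    for (std::size_t i = 1; i < s.size(); ++i) {
        h = (31 * h + static_cast<unsigned char>(s[i])) & (TABLE_SIZE - 1);
    }
    return h;
}

// Build all 2^blocks strings made of "XY" and "xy" blocks.
// (Hypothetical helper; every string it returns hashes to the same slot.)
std::vector<std::string> collidingStrings(int blocks) {
    assert(blocks >= 1 && blocks < 20);
    std::vector<std::string> result{""};
    for (int b = 0; b < blocks; ++b) {
        std::vector<std::string> next;
        for (const std::string& prefix : result) {
            next.push_back(prefix + "XY");
            next.push_back(prefix + "xy");
        }
        result = next;
    }
    return result;
}
```

For example, `collidingStrings(10)` yields 1024 distinct keys of length 20 that all share one slot, which would turn a chained table into a single long list.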
Protection
◮ Use a cryptographically secure hash function (e.g. SHA-512).
◮ Choose a new hash function at random for every hash table.
A set H of hash functions is universal if for all x ≠ y, the probability that hash(x) = hash(y) is at most 1/m when hash() is chosen at random from H.
Example: Let p be a prime number larger than any key. Choose a at random from {1, 2, ..., p − 1} and choose b at random from {0, 1, ..., p − 1}.
hash(x) = ((a · x + b) mod p) mod m
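The example family could be coded roughly as follows. This is a sketch under stated assumptions: the names `UniversalHash` and `makeUniversalHash` are illustrative, keys are unsigned integers smaller than p, and p is kept below 2^32 so that a · x + b fits in 64 bits.

```cpp
#include <cassert>
#include <cstdint>
#include <random>

// One member of the family hash(x) = ((a*x + b) mod p) mod m.
struct UniversalHash {
    std::uint64_t a, b, p, m;

    std::uint64_t operator()(std::uint64_t x) const {
        assert(x < p);  // p must be larger than any key
        return ((a * x + b) % p) % m;
    }
};

// Pick a from {1, ..., p-1} and b from {0, ..., p-1} uniformly at random.
// The caller is responsible for passing a prime p (not checked here).
UniversalHash makeUniversalHash(std::uint64_t p, std::uint64_t m,
                                std::mt19937_64& rng) {
    assert(p >= 2 && p < (1ULL << 32) && m >= 1);
    std::uniform_int_distribution<std::uint64_t> pickA(1, p - 1);
    std::uniform_int_distribution<std::uint64_t> pickB(0, p - 1);
    return UniversalHash{pickA(rng), pickB(rng), p, m};
}
```

A table would call `makeUniversalHash` once when it is created, for instance with the prime p = 2147483647, so that an adversary who knows the family still cannot predict which keys collide.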
Birthday Paradox
With probability > 1/2, two people in a room of 23 have the same birthday. (Hash 23 people into m = 365 slots. Collision?)
General birthday paradox
If we randomly hash $\sqrt{2m}$ keys into m slots, we get a collision with probability > 1/2.
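The collision probability can be computed exactly rather than estimated. The sketch below assumes keys are hashed independently and uniformly; the function name `collisionProbability` is illustrative.

```cpp
#include <cassert>

// Probability that at least two of k keys share a slot when each key is
// hashed independently and uniformly into m slots.
// (Hypothetical helper for illustration.)
double collisionProbability(int k, int m) {
    assert(k >= 0 && m > 0);
    double noCollision = 1.0;
    for (int i = 0; i < k; ++i) {
        // The next key must avoid the i slots that are already taken.
        noCollision *= static_cast<double>(m - i) / m;
        if (noCollision <= 0.0) return 1.0;  // more keys than slots
    }
    return 1.0 - noCollision;
}
```

With m = 365 the result crosses 1/2 between k = 22 and k = 23, which is the classic birthday threshold.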
Collision
Unless we know all the keys in advance and design a perfect hash function, we must handle collisions. What do we do when two keys hash to the same slot?
◮ separate chaining: store multiple items in each slot
◮ open addressing: pick a next slot to try
Separate chaining: store multiple items in each slot. How?
◮ Common choice is an unordered linked list (a chain).
◮ Could use any dictionary ADT implementation.
Result
◮ Can hash more than m items into a table of size m.
◮ Performance depends on the length of the chains.
◮ Memory is allocated on each insertion.
[Figure: a hash table holding A, D, E, B, and C, where hash(A) = hash(D) = 1 and hash(E) = hash(B) = 3, so A and D share the chain in slot 1 and E and B share the chain in slot 3.]
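A separate-chaining table could be sketched as below. This is an illustration, not the course's implementation: the class name `ChainedHashSet`, the choice of `std::list` for the chains, and the simple multiply-by-31 string hash are all assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <string>
#include <vector>

// A minimal set of strings using separate chaining (illustrative sketch).
class ChainedHashSet {
public:
    explicit ChainedHashSet(std::size_t tableSize) : slots_(tableSize) {
        assert(tableSize > 0);
    }

    // Adds key to the front of its chain unless it is already present.
    void insert(const std::string& key) {
        if (!contains(key)) {
            slots_[slotOf(key)].push_front(key);  // allocates one list node
            ++size_;
        }
    }

    // Walks only the chain of the slot that key hashes to.
    bool contains(const std::string& key) const {
        for (const std::string& item : slots_[slotOf(key)]) {
            if (item == key) return true;
        }
        return false;
    }

    std::size_t size() const { return size_; }

private:
    // Assumed hash function: a simple polynomial hash reduced mod table size.
    std::size_t slotOf(const std::string& key) const {
        std::size_t h = 0;
        for (unsigned char c : key) h = (31 * h + c) % slots_.size();
        return h;
    }

    std::vector<std::list<std::string>> slots_;
    std::size_t size_ = 0;
};
```

A table built with `ChainedHashSet(2)` can still hold five keys, because the chains grow as needed; the price is that each lookup may walk a longer list.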
Load Factor
$\alpha = \frac{\text{\# hashed items}}{\text{table size}} = \frac{n}{m}$
Assume we have a uniform hash function (every item hashes to each slot with equal probability).
Search cost
On average,
◮ an unsuccessful search examines α items.
◮ a successful search examines $1 + \frac{n-1}{2m} = 1 + \frac{\alpha}{2} - \frac{\alpha}{2n}$ items.
We want the load factor to be small.
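The two forms of the successful-search cost are the same expression. A short check, using only the definition α = n/m:

```latex
\begin{align*}
1 + \frac{n-1}{2m}
  &= 1 + \frac{n}{2m} - \frac{1}{2m} \\
  &= 1 + \frac{\alpha}{2} - \frac{\alpha}{2n}
  && \text{since } \frac{1}{m} = \frac{\alpha}{n}.
\end{align*}
```

As a hypothetical example, n = 10 items in m = 5 slots gives α = 2, and a successful search examines 1 + 9/10 = 1.9 items on average.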
Open addressing: allow only one item in each slot. The hash function specifies a sequence of slots to try.
Insert: If the first slot is occupied, try the next, then the next, ... until an empty slot is found.
Find: If the first slot doesn’t match, try the next, then the next, ... until a match (found) or an empty slot (not found).
Result
◮ Cannot hash more than m items into a table of size m. [Pigeonhole Principle]
◮ Hash table memory allocated once.
◮ Performance depends on number of tries.
[Figure: a hash table in which A, D, E, B, and C each occupy their own slot.]
Linear probing: try location (hash(k) + i) mod m for i = 0, 1, ...
Example with m = 7 slots (0 to 6), where hash(k) = k % 7:
operation    hash          slot used
insert(76)   76 % 7 = 6    6
insert(93)   93 % 7 = 2    2
insert(40)   40 % 7 = 5    5
insert(47)   47 % 7 = 5    0 (slots 5 and 6 are occupied)
insert(10)   10 % 7 = 3    3
insert(55)   55 % 7 = 6    1 (slots 6 and 0 are occupied)
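The probe sequence can be traced in code. The sketch below assumes non-negative integer keys, hash(k) = k % m as in the example, and a sentinel value for empty slots; the names `EMPTY`, `insertLinearProbing`, and `findLinearProbing` are illustrative. Deletion is not handled.

```cpp
#include <cassert>
#include <vector>

const int EMPTY = -1;  // sentinel for an unused slot (keys must be >= 0)

// Insert key by trying slots (key % m + i) % m for i = 0, 1, ..., m - 1.
// Returns the slot used, or -1 if the table is full.
int insertLinearProbing(std::vector<int>& table, int key) {
    assert(key >= 0 && !table.empty());
    const int m = static_cast<int>(table.size());
    for (int i = 0; i < m; ++i) {
        const int slot = (key % m + i) % m;
        if (table[slot] == EMPTY) {
            table[slot] = key;
            return slot;
        }
    }
    return -1;
}

// Follow the same probe sequence until the key or an empty slot is found.
// Returns the slot holding key, or -1 if key is absent.
int findLinearProbing(const std::vector<int>& table, int key) {
    assert(key >= 0 && !table.empty());
    const int m = static_cast<int>(table.size());
    for (int i = 0; i < m; ++i) {
        const int slot = (key % m + i) % m;
        if (table[slot] == key) return slot;
        if (table[slot] == EMPTY) return -1;
    }
    return -1;
}
```

Starting from `std::vector<int> table(7, EMPTY)` and inserting 76, 93, 40, 47, 10, 55 in that order leaves 47 in slot 0, 55 in slot 1, 93 in slot 2, 10 in slot 3, slot 4 empty, 40 in slot 5, and 76 in slot 6.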