Hashing and Birthdays



SLIDE 1

Hashing and Birthdays

Today’s announcements:

◮ PA2 out, due Nov 1, 23:59
◮ MT2 Nov 7, 19:00-21:00, WOOD 2

Today’s Plan

◮ Hashing
◮ Birthdays and probability

Warm up: Thinking about AVL trees

◮ AVL trees are binary search trees that allow only slight imbalance
◮ Worst-case O(log n) time for find, insert, and remove
◮ Elements (even siblings) may be scattered in memory

Could we preserve optimal balance always?

[Figure: a binary search tree on the keys 2, 3, 4, 5, 6, 7]

SLIDE 2

Dictionary ADT

Operations

◮ insert
◮ remove
◮ find

key       value (data)
Multics   MULTiplexed Information and Computing Service
Unix      Uniplexed Multics
BSD       Berkeley Software Distribution
GNU       GNU’s Not Unix

◮ insert(Linux, Linus Torvalds’s Unix)
◮ find(Unix) returns “Uniplexed Multics”
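As a rough illustration of the three operations, here is a sketch that uses C++'s std::map as a stand-in dictionary. The names Dictionary, make_example_dictionary, and find_or_empty are made up for this example and do not come from the slides; std::map is typically implemented as a balanced search tree, which ties back to the warm-up.

```cpp
#include <map>
#include <string>

// Hypothetical alias: any dictionary implementation could stand in here.
using Dictionary = std::map<std::string, std::string>;

// Builds the key/value table from the slide and inserts the Linux entry.
Dictionary make_example_dictionary() {
    Dictionary d;
    d.insert({"Multics", "MULTiplexed Information and Computing Service"});
    d.insert({"Unix", "Uniplexed Multics"});
    d.insert({"BSD", "Berkeley Software Distribution"});
    d.insert({"GNU", "GNU's Not Unix"});
    d.insert({"Linux", "Linus Torvalds's Unix"});
    return d;
}

// find: returns the stored value, or an empty string if the key is absent.
std::string find_or_empty(const Dictionary& d, const std::string& key) {
    auto it = d.find(key);
    return it == d.end() ? std::string() : it->second;
}
```

With this sketch, find_or_empty(d, "Unix") returns "Uniplexed Multics", and d.erase("BSD") plays the role of remove.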

SLIDE 3

Hash Table Goal

We can do: a[2]=“GNU’s Not Unix”

[Figure: array a with slots indexed up to m − 1; slot 2 holds “GNU’s Not Unix”]

We want to do: a[“GNU”]=“GNU’s Not Unix”

[Figure: the keys Multics, Linux, GNU, Unix, Unics used directly as indices; the entry for GNU holds “GNU’s Not Unix”]

SLIDE 4

Hash table approach

Choose a hash function to map keys to indices.

[Figure: the keys Multics, Linux, GNU, Unix, Unics pass through the hash function into a hash table with slots indexed up to m − 1; “GNU’s Not Unix” sits in the slot for GNU]

hash(“GNU”) = 2
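To make the idea concrete, here is a minimal sketch of a table that stores each value in the slot chosen by a hash function. Everything named here (kTableSize, toy_hash, ToyTable) is hypothetical, the table size is arbitrary, and the sketch deliberately ignores what happens when two keys land in the same slot.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical table size m, chosen only for this example.
const std::size_t kTableSize = 11;

// A toy hash function: reads the key as base-256 digits, reduced mod m.
std::size_t toy_hash(const std::string& key) {
    std::size_t h = 0;
    for (unsigned char c : key) {
        h = (256 * h + c) % kTableSize;
    }
    return h;
}

// A toy table: one value per slot, no collision handling at all.
struct ToyTable {
    std::vector<std::string> slots;

    ToyTable() : slots(kTableSize) {}

    // Stores value in the slot that the key hashes to (overwrites any old value).
    void insert(const std::string& key, const std::string& value) {
        slots[toy_hash(key)] = value;
    }

    // Returns whatever is stored in the key's slot (empty if nothing was stored).
    const std::string& find(const std::string& key) const {
        return slots[toy_hash(key)];
    }
};
```

The point of the design is that both insert and find cost one hash computation plus one array access, regardless of how many keys are stored.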

SLIDE 5

Collisions

A collision occurs when two different keys x and y map to the same index (i.e. slot in table), hash(x) = hash(y).

[Figure: the hash table from the previous slide with an additional key, Mac OS X, passed through the hash function to illustrate a collision]

Can we prevent collisions?

SLIDE 6

Birthdays and Probability

Probability that someone in this room has a birthday today?

What if this was a birthday party?

Probability that two people in this room have the same birthday?

What if the room contained 366 people? 183?

SLIDE 7

Expected Value

Definition: The expected value of a number X that depends on random events (X is called a random variable) is:

$E[X] = \sum_x \mathrm{Prob}[X = x] \cdot x$

Example: X is the sum of two six-sided dice. What is E[X]?

      1   2   3   4   5   6
  1   2   3   4   5   6   7
  2   3   4   5   6   7   8
  3   4   5   6   7   8   9
  4   5   6   7   8   9  10
  5   6   7   8   9  10  11
  6   7   8   9  10  11  12

Linearity of Expectation

For any two random variables X and Y, E[X + Y] = E[X] + E[Y].
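The dice example can be checked both ways: directly from the definition, by summing over all 36 equally likely outcomes, and through linearity, by doubling the expected value of one die. This is only a sketch assuming fair, independent dice; the function names are made up here.

```cpp
// Expected value of one fair six-sided die, straight from the definition.
double expected_one_die() {
    double e = 0.0;
    for (int face = 1; face <= 6; ++face) {
        e += (1.0 / 6.0) * face;  // Prob[X = face] * face
    }
    return e;
}

// Expected sum of two fair dice, summing over all 36 outcomes in the table.
double expected_sum_two_dice() {
    double e = 0.0;
    for (int a = 1; a <= 6; ++a) {
        for (int b = 1; b <= 6; ++b) {
            e += (1.0 / 36.0) * (a + b);  // each outcome has probability 1/36
        }
    }
    return e;
}
```

Both routes give 7 (up to floating-point rounding): the direct sum over the table, and 3.5 + 3.5 by linearity of expectation.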

SLIDE 8

More Birthdays

What is the expected number of people who share a birthday in this room?

Let $X_{ij} = 1$ if persons i and j have the same birthday, and $X_{ij} = 0$ otherwise.

$X = \sum_{i<j} X_{ij}$ is the number of pairs who share a birthday.

$E[X] = E\left[\sum_{i<j} X_{ij}\right] = \sum_{i<j} E[X_{ij}] = \sum_{i<j} \mathrm{Prob}[X_{ij} = 1]$

Generalized birthdays

If we randomly put k people into m bins, we expect $\frac{1}{m} \cdot \frac{k(k-1)}{2}$ pairs to share a bin, which is greater than 1 for $k = \sqrt{2m} + 1$.
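A quick numeric check of the generalized formula, assuming each person picks one of the m bins uniformly and independently. The function names below are invented for this sketch.

```cpp
// Expected number of pairs sharing a bin: (1/m) * k(k-1)/2.
double expected_sharing_pairs(long long k, long long m) {
    return static_cast<double>(k) * static_cast<double>(k - 1) / 2.0
           / static_cast<double>(m);
}

// Smallest k for which the expected number of sharing pairs exceeds 1.
long long smallest_k_exceeding_one_pair(long long m) {
    long long k = 1;
    while (expected_sharing_pairs(k, m) <= 1.0) {
        ++k;
    }
    return k;
}
```

For m = 365 the threshold is k = 28, which matches $\sqrt{2 \cdot 365} + 1 \approx 28.02$: with 28 people we already expect more than one pair with a common birthday.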

SLIDE 9

Hashing string keys with mod and Horner’s Rule

int hash( string s ) {
    // m is the table size, assumed to be defined elsewhere
    int h = 0;
    for (int i = (int)s.length() - 1; i >= 0; i--) {
        // Horner's Rule: read s as base-256 digits, reducing mod m at each step
        h = (256 * h + (unsigned char)s[i]) % m;
    }
    return h;
}

Compare that to the hash function from yacc:

#define TABLE_SIZE 1024 // must be a power of 2

int hash( char *s ) {
    int h = *s++;
    while( *s )
        h = (31 * h + *s++) & (TABLE_SIZE - 1);
    return h;
}

What’s different?

SLIDE 10

Fixed hash functions are dangerous!

Good hash table performance depends on few collisions.

If a user knows your hash function, she can cause many elements to hash to the same slot. Why would she want to do that?

Yacc: $h(s) = (31^{k-1} s[0] + 31^{k-2} s[1] + \cdots + 31^{0} s[k-1]) \bmod 1024$ (masking with TABLE_SIZE − 1 = 1023 is the same as reducing mod 1024).

h(XY) = h(xy). Find many strings that hash to the same slot?
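One way to approach the last question: since "XY" and "xy" hash alike and have the same length, any string built by concatenating those two blocks hashes like any other with the same number of blocks, giving 2^n colliding strings from n blocks. The sketch below re-implements the yacc-style function on std::string (an adaptation, with an added guard for the empty string) and generates such strings; yacc_style_hash and colliding_strings are names made up for this example.

```cpp
#include <cstddef>
#include <string>
#include <vector>

const int kYaccTableSize = 1024;  // must be a power of 2

// Adaptation of the yacc hash above to std::string.
int yacc_style_hash(const std::string& s) {
    if (s.empty()) {
        return 0;  // guard added for this sketch
    }
    int h = static_cast<unsigned char>(s[0]);
    for (std::size_t i = 1; i < s.size(); ++i) {
        h = (31 * h + static_cast<unsigned char>(s[i])) & (kYaccTableSize - 1);
    }
    return h;
}

// All 2^blocks strings formed by concatenating "XY" or "xy" blocks.
std::vector<std::string> colliding_strings(int blocks) {
    std::vector<std::string> result{""};
    for (int b = 0; b < blocks; ++b) {
        std::vector<std::string> next;
        for (const std::string& prefix : result) {
            next.push_back(prefix + "XY");
            next.push_back(prefix + "xy");
        }
        result = next;
    }
    return result;
}
```

The collision works because 31 * 'x' + 'y' = 3841 and 31 * 'X' + 'Y' = 2817 differ by exactly 1024, so swapping one block for the other never changes the hash.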

Protection

◮ Use a cryptographically secure hash function (e.g. SHA-512).
◮ Choose a new hash function at random for every hash table.
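The second option might look like the sketch below: each table draws its own random multiplier when it is created, so an attacker who knows the code still does not know which function a given table uses. This is one possible family (a polynomial hash modulo the prime 2^31 − 1), not necessarily the one the lecture has in mind; the class name is hypothetical, and std::mt19937_64 is not a cryptographically secure generator, so treat this as an illustration only.

```cpp
#include <cstdint>
#include <random>
#include <string>

// Hypothetical randomized string hash: one random multiplier per table.
class RandomizedStringHash {
public:
    explicit RandomizedStringHash(std::uint64_t table_size)
        : table_size_(table_size) {
        std::random_device rd;
        std::mt19937_64 gen(rd());
        // Multiplier drawn uniformly from [1, kPrime - 1].
        std::uniform_int_distribution<std::uint64_t> dist(1, kPrime - 1);
        multiplier_ = dist(gen);
    }

    // Horner's Rule with the random multiplier, reduced mod kPrime,
    // then mapped to a slot in [0, table_size).
    std::uint64_t operator()(const std::string& s) const {
        std::uint64_t h = 0;
        for (unsigned char c : s) {
            // h and multiplier_ are below 2^31, so the product fits in 64 bits.
            h = (h * multiplier_ + c) % kPrime;
        }
        return h % table_size_;
    }

private:
    static constexpr std::uint64_t kPrime = 2147483647ULL;  // 2^31 - 1
    std::uint64_t table_size_;
    std::uint64_t multiplier_;
};
```

A table would construct one RandomizedStringHash when it is created and use it for every insert, find, and remove, so the same key always maps to the same slot within that table.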
