SLIDE 1

Hash tables

Most data structures that we’re going to see are about storing and manipulating data. When only the dictionary operations Insert, Search and Delete are needed, hash tables can be quite good. Many variations of hash tables exist (or rather of the functions implementing them), from not-so-fast but simple to extremely fast but complicated. Elements are pairs (key, data); keys are distinct. Intuition: you have some, say, “clever array”, and

  • Insert(elem) inserts elem somewhere into the array
  • Search(elem) knows where elem is stored and returns the corresponding data
  • Delete(elem) also knows where elem is and removes it

SLIDE 2

Actual “positions” (somehow) depend on the keys. Important: we want to maintain a dynamic set (insertions and deletions). Given a universe U = {0, . . . , u − 1} for some (typically large) u; keys are from U. Simple approach: an array of size |U|; operations are straightforward, and the element with key k is stored in slot k. But what if K, the set of keys actually stored, is much, much smaller than U? Waste of memory (but time-efficient).
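The simple approach can be sketched in a few lines; a minimal illustration (class and method names are ours, not from the slides):

```python
class DirectAddressTable:
    """Direct addressing: an array of size |U|; the element with key k
    lives in slot k. All operations are O(1), but memory is Theta(|U|)."""

    def __init__(self, universe_size):
        self.slots = [None] * universe_size  # one slot per possible key

    def insert(self, key, data):
        self.slots[key] = data

    def search(self, key):
        return self.slots[key]  # None if no element has this key

    def delete(self, key):
        self.slots[key] = None
```

Even if only a handful of keys are ever stored, the table still occupies |U| slots; exactly the memory waste the slide points out.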

SLIDE 3

What we want is to reduce the size of the table. Hashing: the element with key k is stored in slot h(k); we use a hash function h to compute the slot. We hope to reduce the size of the table to, say, m: h : U → {0, . . . , m − 1} for some m ≪ |U|. We say the element with key k hashes into slot h(k), and that h(k) is the hash value of k.

  • But. . . two or more keys may hash to same slot (collisions)

SLIDE 4

Best idea: just avoid collisions; tailor the hash function accordingly. However: by assumption |U| > m, so there must be at least two keys with the same hash value, thus complete avoidance is impossible. Thus: whatever h we choose, we still need some form of collision resolution.

SLIDE 5

Hashing with chaining. The simplest of all collision-resolution protocols. Does just what you’d expect: each slot really is a list. When elements collide, just insert the new guy into the list (“the chain”). Suppose T is your hash table and h your hash function:

  • Chained-Hash-Insert(T, x): insert x at the head of list T[h(key[x])]
  • Chained-Hash-Search(T, k): search for an element with key k in list T[h(k)]
  • Chained-Hash-Delete(T, x): delete x from list T[h(key[x])]
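A minimal Python sketch of the three operations (names are ours; note the slide’s Delete takes a pointer x, which allows O(1) deletion with doubly-linked lists, while for brevity this sketch deletes by key and therefore searches the chain):

```python
class ChainedHashTable:
    """Hashing with chaining: each slot holds a list (the chain) of
    (key, data) pairs; new elements go to the head, as on the slide."""

    def __init__(self, m):
        self.m = m
        self.table = [[] for _ in range(m)]

    def _h(self, key):
        return key % self.m  # placeholder hash function (division method)

    def insert(self, key, data):
        # assumes key is not already present; otherwise search first
        self.table[self._h(key)].insert(0, (key, data))

    def search(self, key):
        for k, d in self.table[self._h(key)]:  # scan the chain
            if k == key:
                return d
        return None

    def delete(self, key):
        chain = self.table[self._h(key)]
        self.table[self._h(key)] = [(k, d) for k, d in chain if k != key]
```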

SLIDE 6

What about running times?

  • Insert: clearly O(1), under the assumption that the element is not yet in the table; otherwise, search first
  • Search: proportional to the length of the list; more details to come
  • Delete: note the argument is x, not k, thus constant-time access, then another O(1) if lists are doubly linked. If the argument were a key, then a search would be necessary. If lists are singly linked, essentially a search is still necessary (we need the predecessor of x)

SLIDE 7

Given a hash table T with m slots that stores n elements. Def.: the load factor α = n/m (average list size). The analysis is in terms of α (not necessarily greater than one!). Clear: worst-case performance is poor: if all n keys hash to the same slot, then we might just as well have used a single list. Average performance depends on how well the hash function h (that we still don’t know) distributes the keys, on average.

SLIDE 8

We’ll see more details, but for now a (very strong) assumption: any given element is equally likely to hash into any of the m slots, independently of where other elements hash to. This assumption is called simple uniform hashing. Two intuitions come to mind:

  • 1. input is some random sample, hash function is fixed
  • 2. input is fixed, hash function is somehow randomised

SLIDE 9

For j ∈ {0, . . . , m − 1} let nj = length(T[j]). Clearly, n0 + n1 + · · · + nm−1 = n. Also, the average value of nj is E[nj] = α = n/m (recall: “equally likely. . . ”). Another assumption (not necessarily true): the hash function h can be evaluated in O(1) time. Thus, the time required to search for some element with key k depends linearly on the length nh(k) of the list T[h(k)].

SLIDE 10

We consider unsuccessful (no element in the table has key k) and successful searches.

Theorem. Under simple uniform hashing, if collisions are resolved by chaining, then an unsuccessful search takes expected time Θ(1 + α), with α = n/m.

Proof.

  • any key k not already in the table (recall: unsuccessful) is equally likely to hash to any of the m slots (read: they all look the same to us)
  • the expected time to search unsuccessfully for k is the expected time to search to the end of T[h(k)]
  • T[h(k)] has expected length E[nh(k)] = α
  • thus the expected # of examined elements is α
  • add 1 for the evaluation of h

Recall: α could be very small, thus Θ(1 + α) does make sense!
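The α part of the theorem is easy to check empirically. A small simulation sketch (function name and parameters are ours), using Python’s random module as a stand-in for simple uniform hashing:

```python
import random

def avg_unsuccessful_cost(n, m, trials=1000, seed=1):
    """Estimate the expected number of elements examined in an
    unsuccessful search: hash n keys uniformly into m slots, then
    scan the full chain of a uniformly chosen slot."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        counts = [0] * m
        for _ in range(n):
            counts[rng.randrange(m)] += 1  # simple uniform hashing
        total += counts[rng.randrange(m)]  # unsuccessful search scans the whole list
    return total / trials
```

With n = 200 and m = 50 the estimate comes out close to α = 4; adding 1 for the evaluation of h gives the Θ(1 + α) of the theorem.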

SLIDE 11

For successful searches, not all lists are equally likely to be searched: the probability that a list is searched is proportional to the # of elements it contains (under certain assumptions). We assume the element being searched for is equally likely to be any of the n elements in the table. Then we get:

Theorem. Under simple uniform hashing, if collisions are resolved by chaining, then a successful search takes expected time Θ(1 + α), with α = n/m.

Proof.

  • the # of elements examined is 1 more than the # of elements before x in x’s list
  • the elements before x were inserted after x itself (new elements are placed at the front)

SLIDE 12

Let xi be the i-th element inserted into the table, 1 ≤ i ≤ n, and let ki = key(xi). For keys ki, kj, define the Bernoulli r.v. Xij = 1 iff h(ki) = h(kj). Under simple uniform hashing,

P(Xij = 1) = Σ_{z=1}^{m} P(h(ki) = z) · P(h(kj) = z) = Σ_{z=1}^{m} (1/m)² = 1/m
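The sum over the m slots can be checked mechanically; a tiny sketch (function name is ours):

```python
def collision_probability(m):
    """P(h(ki) = h(kj)) for two independent, uniform hash values over
    m slots: sum over the m slots z of P(h(ki) = z) * P(h(kj) = z)."""
    return sum((1 / m) * (1 / m) for _ in range(m))  # m * (1/m)^2 = 1/m
```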

SLIDE 13

Thus E[Xij] = P(Xij = 1) = 1/m, and the expected number of elements examined in a successful search is

E[ (1/n) Σ_{i=1}^{n} (1 + Σ_{j=i+1}^{n} Xij) ]
  = (1/n) Σ_{i=1}^{n} (1 + Σ_{j=i+1}^{n} E[Xij])
  = (1/n) Σ_{i=1}^{n} (1 + Σ_{j=i+1}^{n} 1/m)
  = (1/n) Σ_{i=1}^{n} 1 + (1/(nm)) Σ_{i=1}^{n} Σ_{j=i+1}^{n} 1
  = 1 + (1/(nm)) Σ_{i=1}^{n} (n − i)
  = 1 + (1/(nm)) ( Σ_{i=1}^{n} n − Σ_{i=1}^{n} i )
  = 1 + (1/(nm)) ( n² − n(n + 1)/2 )
  = 1 + n/m − (n + 1)/(2m)
  = 1 + α − ((n + 1)/(2n)) α
  = 1 + α (1 − (n + 1)/(2n))
  = Θ(1 + α)
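The derivation can be double-checked numerically. A small sketch (names are ours) comparing the per-element expectation, averaged over all elements, against the closed form:

```python
def expected_successful_cost(n, m):
    """Element x_i expects (n - i)/m of the later-inserted elements in
    front of it, so a successful search for x_i examines 1 + (n - i)/m
    elements in expectation. Average this over i and compare with the
    closed form 1 + alpha * (1 - (n + 1)/(2n)) from the derivation."""
    direct = sum(1 + (n - i) / m for i in range(1, n + 1)) / n
    alpha = n / m
    closed_form = 1 + alpha * (1 - (n + 1) / (2 * n))
    return direct, closed_form
```

Both expressions simplify to 1 + (n − 1)/(2m), which is Θ(1 + α).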

SLIDE 14

Consequence: if m (the # of slots) is at least proportional to n (the # of elements), then n = O(m) and α = n/m = O(1); thus searching takes constant time on average! Insertion and Deletion also take constant time (even worst-case, if doubly-linked lists are used), thus all operations take constant time on average! (However: we need the assumption of simple uniform hashing.)

SLIDE 15

So far, we haven’t seen a single hash function. What makes a good hash function? It satisfies (more or less) the assumption of simple uniform hashing: each key is equally likely to hash to any of the m slots, independently of where other keys hash to. However, this is typically impossible to guarantee, certainly depending on how the keys are chosen (think of an evil adversary). Sometimes we know the key distribution. Ex: if keys are real random numbers k ∈ [0, 1), independently and uniformly chosen, then h(k) = ⌊k · m⌋ satisfies the condition.
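For that example distribution the hash function is one line (a sketch; the uniformity of ⌊k · m⌋ for uniform k ∈ [0, 1) is exactly the slide’s claim):

```python
import math

def h_uniform(k, m):
    """For k uniform in [0, 1), k*m is uniform in [0, m), so
    floor(k*m) is uniform over {0, ..., m-1}."""
    return math.floor(k * m)
```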

SLIDE 16

Usual assumption: the universe of keys is {0, 1, 2, . . .}, i.e., we somehow interpret real keys as natural numbers (“usually” easy enough. . . ). Two very simple hash functions:

  • 1. Division method: h(k) = k mod m

Ex: the hash table has size 25, key k = 234, then h(k) = 234 mod 25 = 9. Quite fast, but has drawbacks. We want to avoid certain values of m, e.g. powers of 2. Why? If m = 2^p, then h(k) = k mod m = k mod 2^p, the p lowest-order bits of k. Ex: m = 2^5 = 32, k = 168, h(k) = 168 mod 32 = 8 = (1000)2, and k = 168 = (10101000)2. Better to make the hash depend on all bits of the key. A good idea (usually) for m: a prime not too close to a power of two.
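The division method is a one-liner; a sketch with the slide’s two examples as sanity checks:

```python
def h_division(k, m):
    """Division method: h(k) = k mod m."""
    return k % m

# The slide's examples:
#   h_division(234, 25) is 9
#   h_division(168, 32) is 8, which equals the 5 low-order bits of 168
#   (168 & 31); this is why powers of two are discouraged for m.
```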

SLIDE 17
  • 2. Multiplication method: h(k) = ⌊m(kA mod 1)⌋

Uh, what’s that?

  • A is a constant with 0 < A < 1
  • Thus kA is a real number with 0 ≤ kA < k
  • kA mod 1 is the fractional part of kA, i.e., kA − ⌊kA⌋

Ex: A = 0.23, k = 234, then kA = 53.82 and kA mod 1 = 0.82. IOW: kA mod 1 ∈ [0, 1).

  • Therefore m(kA mod 1) ∈ [0, m), and ⌊m(kA mod 1)⌋ ∈ {0, 1, . . . , m − 1}

Voilà! Advantage: the value of m is not critical. Typically a power of two (no good with the division method!), since then the implementation is easy (some comments in the textbook).
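A floating-point sketch of the multiplication method (the default constant A = (√5 − 1)/2 is Knuth’s well-known suggestion, not from the slides; real implementations typically use the fixed-point word trick the slide alludes to rather than floats):

```python
def h_multiplication(k, m, A=(5 ** 0.5 - 1) / 2):
    """Multiplication method: h(k) = floor(m * (kA mod 1)).
    Any constant 0 < A < 1 works; the value of m is not critical."""
    frac = (k * A) % 1.0   # fractional part of k*A, in [0, 1)
    return int(m * frac)   # in {0, ..., m-1}
```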
