SLIDE 1
data structures and algorithms, lecture 9, 2020-09-28
SLIDE 2
hash tables
trees
SLIDE 3
dynamic sets
- elements with a key and (possibly) satellite data
- we wish to add, remove, search (and maybe more)
- we have seen: heaps, stacks, queues, linked lists
- now: hashing
SLIDE 4
hashing
- a hash table is an effective data structure for implementing dictionaries
- keys are for example strings of characters
- worst-case for operations usually in Θ(n), with n the number of items
- in practice often much better, search even in O(1)
- a hash table generalizes an array, where we access address i in O(1)
- applications of hashing in compilers and cryptography
SLIDE 5 direct-address table
- universe of keys: U = {0, . . . , m − 1} with m small
- use an array of length m: T[0 . . . (m − 1)]
- what is stored in T[k]? either nil, if there is no item with key k, or a pointer x to the item (or element) with key x.key = k and possibly satellite data
SLIDE 6
operations for direct-address table
- insert(T, x): add element x: T[x.key] := x
- delete(T, x): remove element x: T[x.key] := nil
- search(T, k): search for key k: return T[k]
- x is a pointer to an element with key x.key and satellite data x.element
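The three operations can be sketched in Python; this is a minimal sketch, and the names `DirectAddressTable`, `Item` and `data` are illustrative, not prescribed by the slides.

```python
class Item:
    def __init__(self, key, data=None):
        self.key = key    # key from the universe {0, ..., m-1}
        self.data = data  # optional satellite data

class DirectAddressTable:
    def __init__(self, m):
        self.T = [None] * m   # T[k] is nil (None) if no item has key k

    def insert(self, x):      # O(1): T[x.key] := x
        self.T[x.key] = x

    def delete(self, x):      # O(1): T[x.key] := nil
        self.T[x.key] = None

    def search(self, k):      # O(1): return T[k]
        return self.T[k]

t = DirectAddressTable(10)
t.insert(Item(3, "satellite data"))
print(t.search(3).data)  # -> satellite data
print(t.search(5))       # -> None
```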
SLIDE 7
analysis of direct-address table
- worst-case of inserting, deleting, searching all in O(1)
- instead of a pointer to the object we can also store the object itself in the array
- drawbacks: if the universe of keys U is large we need a lot of storage, also if we actually use only a small subset of U
- and: keys must be integers
- so: hashing
SLIDE 8
hash tables
- a hash function maps keys to indices (slots) 0, . . . , m − 1 of a hash table, so h : U → {0, . . . , m − 1}
- an element with key k ∈ U hashes to slot h(k)
- usually more keys than indices: |U| ≫ m
- space: reduce storage requirement to the size of the set of actually used keys
- time: ideally computing a hash value is easy, on average in O(1)
SLIDE 9
example of a simplistic hash function
- keys are first names, additional data are phone numbers
- hash function: length modulo 5
- example: (Alice, 0205981555) hashes to slot 0, (Sue, 0620011223) to slot 3, (John, 0201234567) to slot 4
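The toy hash function from this slide can be written out directly (a sketch, following the slide's "length modulo 5" rule):

```python
# the slide's simplistic hash function: a first name is hashed to
# the length of the name modulo 5
def h(name):
    return len(name) % 5

print(h("Alice"))  # 5 mod 5 -> slot 0
print(h("Sue"))    # 3 mod 5 -> slot 3
print(h("John"))   # 4 mod 5 -> slot 4
```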
SLIDE 10
collisions
- problem: different keys may be hashed to the same slot, namely if h(k) = h(k′) with k ≠ k′; this is called a collision
- if the number of keys |U| is larger than the number of slots m, then the hash function h cannot be injective (a function f : A → B is injective if f(a) = f(a′) implies a = a′)
- even if we cannot totally avoid collisions, we try to avoid them as much as possible by taking a ‘good’ hash function
SLIDE 11
do we often have collisions?
- for p items and a hash table of size m: m^p possibilities for a hash function
- if p = 8 and m = 10: already 10^8 possibilities
- there are m!/(m − p)! possibilities for hashing without collision
- if p = 8 and m = 10: 3 · 4 · . . . · 10 such possibilities
- illustration: birthday paradox: for 23 people the probability that everyone has a unique birthday is < 1/2
- that is: for p = 23 and m = 366 the probability of a collision is ≥ 1/2
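The birthday-paradox figure is easy to verify numerically; a small sketch with p = 23 people and m = 366 possible birthdays, as on the slide:

```python
# probability that p items hashed into m slots all land in distinct slots
def prob_all_unique(p, m):
    prob = 1.0
    for i in range(p):
        prob *= (m - i) / m   # the i-th item must avoid the i slots used so far
    return prob

p_unique = prob_all_unique(23, 366)
print(p_unique)             # just below 1/2
print(1 - p_unique >= 0.5)  # so the probability of a collision is >= 1/2
```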
SLIDE 12 how to deal with collisions?
- either using chaining: put items that hash to the same value in a linked list
- or using open addressing: use a probe sequence to find alternative slots if necessary
SLIDE 13
chaining: example
- hash function is month of birth modulo 5
- example: (01.01., Sue) hashes to slot 1; (12.03., John) and (16.08., Madonna) both hash to slot 3 and are chained in a linked list; empty slots are ∅ (nil)
- drawback: pointer structures are expensive
SLIDE 14
solving collisions using chaining
- create a list for each slot
- link records in the same slot into a list
- a slot in the hash table points to the head of a linked list, and is nil if the list is empty
SLIDE 15
chaining with doubly linked lists: worst-case analysis
- insert element x in hash table T: in O(1): insert at the front of a doubly linked list
- delete element x from hash table T: in O(1) if lists are doubly linked: if we have the element available, no search is needed; we use the doubly linked structure
- search key k in hash table T: in O(n), with n the size of the dictionary; worst-case: if every key hashes to the same slot, then search is linear in the total number of elements
- for exam: know and be able to explain this
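A chained hash table can be sketched as follows; for brevity Python lists stand in for the linked lists, and the class name and slot count are illustrative assumptions.

```python
class ChainedHashTable:
    def __init__(self, m):
        self.m = m
        self.T = [[] for _ in range(m)]   # one chain per slot

    def h(self, k):
        return hash(k) % self.m

    def insert(self, k, data):
        # insert at the front of the chain: O(1)
        self.T[self.h(k)].insert(0, (k, data))

    def search(self, k):
        # scan the chain: worst case O(n) if all keys hash to one slot
        for key, data in self.T[self.h(k)]:
            if key == k:
                return data
        return None

    def delete(self, k):
        # O(chain length) here; with a doubly linked list and a node
        # handle this is O(1), as on the slide
        chain = self.T[self.h(k)]
        self.T[self.h(k)] = [(key, d) for key, d in chain if key != k]

t = ChainedHashTable(5)
t.insert("John", "0201234567")
t.insert("Sue", "0620011223")
print(t.search("John"))  # -> 0201234567
t.delete("John")
print(t.search("John"))  # -> None
```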
SLIDE 16
load factor
- assumption: a key is hashed to any arbitrary slot, independent of other keys
- we have n keys and m slots
- probability of h(k) = h(k′) is 1/m
- expected length of the list at T[h(k)] is n/m
- this is called the load factor α = n/m
SLIDE 17
chaining: average case
- for unsuccessful search: compute h(k) and search through the list: in Θ(1 + α)
- for successful search: also in Θ(1 + α)
- so if α ∈ O(1) (constant!) then the average search time is in Θ(1)
- so for example if n ∈ O(m) (number of slots proportional to number of keys)
- if the hash table is too small it does not work properly!
SLIDE 18
intermezzo: choosing a hash function
- in view of the assumption in the analysis: what is a good hash function?
- distributes keys uniformly and seemingly randomly
- regularity of the key distribution should not affect uniformity
- hash values are easy to compute: in O(1)
- (these properties can be difficult to check)
SLIDE 19
possible hash functions with keys natural numbers
- division method: a key k is hashed to k mod m
- pro: easy to compute; contra: not good for all values of m; take for m a prime not too close to a power of 2
- multiplication method: a key k is hashed to ⌊m · (k · c − ⌊k · c⌋)⌋ with c a constant, 0 < c < 1; which c is good?
- remark: we do not consider universal hashing (book 11.3.3)
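Both methods can be sketched directly; the concrete choices m = 701 (a prime not too close to a power of 2) and c = (√5 − 1)/2 (a constant often suggested for the multiplication method) are illustrative assumptions, not fixed by the slide.

```python
import math

m = 701                       # a prime not too close to a power of 2
c = (math.sqrt(5) - 1) / 2    # a constant with 0 < c < 1

def h_division(k):
    # division method: k mod m
    return k % m

def h_multiplication(k):
    # multiplication method: floor(m * fractional_part(k * c))
    return math.floor(m * (k * c - math.floor(k * c)))

print(h_division(123456))        # 123456 mod 701 = 80
print(h_multiplication(123456))  # some slot in 0 .. m-1
```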
SLIDE 20
open addressing
- alternative to chaining for solving collisions
- every slot of the hash table contains either nil or an element
- for hashing, we make a probe sequence h : U × {0, . . . , m − 1} → {0, . . . , m − 1} that for every key k ∈ U is a permutation of the available slots 0, . . . , m − 1
- we only use the table, no pointers; the load factor is at most 1
- for insertion: we try the slots of the probe sequence and take the first available one
- deletion is difficult, so we omit deletion
SLIDE 21 remark: removal is difficult
- suppose the hash function gives probe sequence 2, 4, 0, 3, 1 for key a, and probe sequence 2, 3, 4, 0, 1 for key b
- we insert a, then insert b, then delete a, then search for b
- if deletion of a gives nil in slot 2, then our search for b fails
- if deletion of a is marked by a special marker, which is skipped in a search, then search time is also influenced by the amount of markers (not only by the load factor)
SLIDE 22
open addressing: linear probing
- next probe: try the next address modulo m: h(k, i) = (h′(k) + i) mod m
- the probe sequence for a key k is (h′(k) + 0) mod m, (h′(k) + 1) mod m, (h′(k) + 2) mod m, . . . , (h′(k) + m − 1) mod m
- we get clustering! (and removal is difficult, as in general for open addressing)
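A linear-probing probe sequence can be sketched as follows; the auxiliary hash h′(k) = k mod m and the table size m = 13 are illustrative assumptions:

```python
m = 13

def probe_sequence(k):
    # linear probing: h(k, i) = (h'(k) + i) mod m for i = 0, 1, ..., m-1
    h_prime = k % m
    return [(h_prime + i) % m for i in range(m)]

# the sequence visits every slot exactly once (a permutation of 0..m-1);
# once a run of consecutive slots fills up, new keys landing anywhere in
# the run probe to its end, so runs grow: primary clustering
print(probe_sequence(18))  # starts at 18 mod 13 = 5, then 6, 7, ...
```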
SLIDE 23
open addressing: double hashing
- next probe: use a second hash function: h(k, i) = (h1(k) + i · h2(k)) mod m, with h2(k) relatively prime to the size of the hash table
- the probe sequence for a key k is: (h1(k) + 0 · h2(k)) mod m, (h1(k) + 1 · h2(k)) mod m, (h1(k) + 2 · h2(k)) mod m, . . . , (h1(k) + (m − 1) · h2(k)) mod m
SLIDE 24
double hashing: example
m = 13, h(k) = k mod 13, h′(k) = 7 − (k mod 7); keys are inserted in the order shown

k    h(k)  h′(k)  slots tried
18   5     3      5
41   2     1      2
22   9     6      9
44   5     5      5, 10
59   7     4      7
32   6     3      6
31   5     4      5, 9, 0
73   8     4      8
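The example can be recomputed with a short sketch; `probes` is an illustrative helper that returns the slots tried until a free one is found (note that 31's third probe is (5 + 2 · 4) mod 13 = 0):

```python
m = 13
h1 = lambda k: k % m          # h(k) on the slide
h2 = lambda k: 7 - (k % 7)    # h'(k) on the slide

def probes(k, occupied):
    # double hashing: probe i goes to (h1(k) + i * h2(k)) mod m;
    # returns the slots tried until a free slot is found and taken
    seq = []
    for i in range(m):
        slot = (h1(k) + i * h2(k)) % m
        seq.append(slot)
        if slot not in occupied:
            occupied.add(slot)
            return seq

occupied = set()
for k in [18, 41, 22, 44, 59, 32, 31, 73]:
    print(k, probes(k, occupied))
# 44 needs probes 5, 10 (slot 5 already taken by 18);
# 31 needs probes 5, 9, 0 (5 and 9 taken; (5 + 8) mod 13 = 0)
```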
SLIDE 25
- probe sequence: h(k, 0), h(k, 1), . . . , h(k, m − 1)
- assumption: uniform hashing, that is: each key is equally likely to have any one of the m! permutations as its probe sequence, regardless of what happens to the other keys
- assumption: load factor α = n/m < 1
SLIDE 26
expected number of probes for unsuccessful search
- probe 1: with probability n/m a collision, so go to probe 2
- probe 2: with probability (n − 1)/(m − 1) a collision, so go to probe 3
- probe 3: with probability (n − 2)/(m − 2) a collision, so go to probe 4
- note: (n − i)/(m − i) < n/m = α
- expected number of probes: 1 + (n/m)(1 + ((n − 1)/(m − 1))(1 + ((n − 2)/(m − 2))(. . .))) ≤ 1 + α(1 + α(1 + α(. . .))) ≤ 1 + α + α^2 + α^3 + . . . = Σ_{i=0}^{∞} α^i = 1/(1 − α)
SLIDE 27
- we assume α < 1 and uniform hashing
- then the expected number of probes is in O(1) (constant), so inserting and successful or unsuccessful search are in O(1)
- if the table is 50% full then we expect 2 probes
- if the table is 90% full then we expect 10 probes
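The two figures follow directly from the bound 1/(1 − α); a one-line sketch:

```python
def expected_probes(alpha):
    # expected number of probes for an unsuccessful search under
    # uniform hashing, for load factor 0 <= alpha < 1
    assert 0 <= alpha < 1
    return 1 / (1 - alpha)

print(expected_probes(0.5))  # table 50% full -> 2 probes
print(expected_probes(0.9))  # table 90% full -> 10 probes
```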
SLIDE 28
hash tables
trees
SLIDE 29
recap definitions
- binary tree: every node has at most 2 successors (the empty tree is also a binary tree)
- depth of a node x: length (number of edges) of the path from the root to x
- height of a node x: length of a maximal path from x to a leaf
- height of a tree: height of its root
- the number of levels is the height plus one
SLIDE 30 binary tree: linked implementation
linked data structure with nodes containing
- x.key from a totally ordered set
- x.left points to left child of node x
- x.right points to right child of node x
- x.p points to parent of node x
- if x.p = nil then x is the root
- T.root points to the root of the tree (nil if the tree is empty)
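The linked structure above translates directly to Python; a minimal sketch with the slide's fields (key, left, right, p), using None for nil:

```python
class Node:
    def __init__(self, key):
        self.key = key      # key from a totally ordered set
        self.left = None    # left child, or None (nil)
        self.right = None   # right child, or None (nil)
        self.p = None       # parent; None (nil) means x is the root

class Tree:
    def __init__(self):
        self.root = None    # None (nil) for the empty tree

t = Tree()
t.root = Node(8)
t.root.left = Node(3)
t.root.left.p = t.root      # keep the parent pointer consistent
print(t.root.left.key)      # -> 3
print(t.root.p)             # -> None, so t.root is the root
```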
SLIDE 31
binary tree: alternative implementation
- remember the heap: binary trees can be represented as arrays using the level numbering
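The level numbering from the heap lecture can be sketched with index arithmetic; this version uses 0-based indices (the book's pseudocode is 1-based, where the children sit at 2i and 2i + 1):

```python
# level numbering for a binary tree stored in an array, 0-based:
# node at index i has its children at 2i + 1 and 2i + 2,
# and its parent at (i - 1) // 2
def left(i):   return 2 * i + 1
def right(i):  return 2 * i + 2
def parent(i): return (i - 1) // 2

# a complete binary tree with root 8 and children 3 and 10:
A = [8, 3, 10]
print(A[left(0)], A[right(0)])  # -> 3 10
print(parent(2))                # -> 0
```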
SLIDE 32
tree traversals
- how can we visit all nodes in a tree exactly once?
- we will mainly focus on binary trees