data structures and algorithms 2020 09 28 lecture 9


SLIDE 1

data structures and algorithms 2020 09 28 lecture 9

SLIDE 2
overview

• hash tables
• trees

SLIDE 3

dynamic sets

• elements with a key and (possibly) satellite data
• we wish to add, remove, search (and maybe more)
• we have seen: heaps, stacks, queues, linked lists
• now: hashing

SLIDE 4

hashing

• a hash table is an effective data structure for implementing dictionaries
• keys are for example strings of characters
• worst case for the operations usually in Θ(n), with n the number of items
• in practice often much better, search even in O(1)
• a hash table generalizes an array, where we access address i in O(1)
• applications of hashing in compilers and cryptography

SLIDE 5

direct-address table

• universe of keys: U = {0, . . . , m − 1} with m small
• use an array of length m: T[0 . . . (m − 1)]
• what is stored in T[k]? either nil if there is no item with key k,
  or a pointer x to the item (or element) with key x.key = k and possibly satellite data

SLIDE 6
operations for direct-address table

• insert(T, x): add element x, so T[x.key] := x
• delete(T, x): remove element x, so T[x.key] := nil
• search(T, k): search for key k, return T[k]
• x is a pointer to the element with key x.key and satellite data x.element
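As an illustration, a minimal Python sketch of a direct-address table (the class and attribute names are illustrative, not from the lecture):

```python
from collections import namedtuple

# an item has a key in {0, ..., m-1} and satellite data in .element
Item = namedtuple("Item", ["key", "element"])

class DirectAddressTable:
    def __init__(self, m):
        self.slots = [None] * m      # T[k] is None (nil) if there is no item with key k

    def insert(self, x):             # O(1)
        self.slots[x.key] = x

    def delete(self, x):             # O(1)
        self.slots[x.key] = None

    def search(self, k):             # O(1); returns None if no item has key k
        return self.slots[k]

T = DirectAddressTable(10)
T.insert(Item(3, "satellite data"))
print(T.search(3).element)           # -> 'satellite data'
```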

SLIDE 7

analysis of direct-address table

• worst case of inserting, deleting, searching all in O(1)
• instead of a pointer to the object we can also store the object itself in the array
• drawbacks: if the universe of keys U is large we need a lot of storage,
  also if we actually use only a small subset of U
• and: keys must be integers
• so: hashing

SLIDE 8

hash tables

• a hash function maps keys to the indices (slots) 0, . . . , m − 1 of a hash table,
  so h : U → {0, . . . , m − 1}
• an element with key k ∈ U hashes to slot h(k)
• usually there are more keys than indices: |U| >> m
• space: reduce the storage requirement to the size of the set of actually used keys
• time: ideally computing a hash value is easy, on average in O(1)

SLIDE 9

example of a simplistic hash function

• keys are first names, additional data are phone numbers
• hash function: length of the name modulo 5
• example: (Alice, 0205981555) hashes to slot 0, (Sue, 0620011223) to slot 3,
  and (John, 0201234567) to slot 4

SLIDE 10

collisions

• problem: different keys may be hashed to the same slot,
  namely if h(k) = h(k′) with k ≠ k′; this is called a collision
• if the number of keys |U| is larger than the number of slots m,
  then the hash function h cannot be injective
  (a function f : A → B is injective if f(a) = f(a′) implies a = a′)
• even if we cannot totally avoid collisions, we try to avoid them as much as possible
  by taking a ‘good’ hash function

SLIDE 11

do we often have collisions?

• for p items and a hash table of size m there are m^p possibilities for a hash function
• if p = 8 and m = 10 that is already 10^8 possibilities
• there are m!/(m − p)! possibilities for hashing without collision
• if p = 8 and m = 10 there are 3 · 4 · . . . · 10 such possibilities
• illustration: birthday paradox
• for 23 people the probability that everyone has a unique birthday is < 1/2,
  that is: for p = 23 and m = 366 the probability of a collision is ≥ 1/2
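A small Python sketch (illustrative, not from the slides) that checks these counts and the birthday-paradox probability:

```python
from math import prod

def all_distinct_probability(p, m):
    """Probability that p keys end up in p distinct slots out of m,
    assuming every key is equally likely to land in any slot."""
    return prod((m - i) / m for i in range(p))

print(10 ** 8)                            # m^p hash functions for p = 8, m = 10
print(prod(range(3, 11)))                 # m!/(m-p)! = 3 * 4 * ... * 10 collision-free ones
print(all_distinct_probability(23, 366))  # roughly 0.49 < 1/2: the birthday paradox
```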

SLIDE 12

how to deal with collisions?

• either using chaining: put items that hash to the same value in a linked list
• or using open addressing: use a probe sequence to find alternative slots if necessary

SLIDE 13

chaining: example

• hash function is month of birth modulo 5
• drawback: pointer structures are expensive
• example: slot 1 → (01.01., Sue); slot 3 → (12.03., John) → (16.08., Madonna);
  the remaining slots are ∅

SLIDE 14

solving collisions using chaining

• create a list for each slot
• link records in the same slot into a list
• a slot in the hash table points to the head of a linked list, and is nil if the list is empty
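A minimal Python sketch of chaining (illustrative; a built-in list stands in for the linked list of each slot):

```python
class ChainedHashTable:
    def __init__(self, m, h):
        self.h = h                              # hash function: key -> {0, ..., m-1}
        self.table = [[] for _ in range(m)]     # one chain per slot

    def insert(self, key, data):
        # insert at the front of the chain (O(1) with a real linked list)
        self.table[self.h(key)].insert(0, (key, data))

    def search(self, key):
        # walk the chain of slot h(key): Theta(1 + alpha) on average
        for k, data in self.table[self.h(key)]:
            if k == key:
                return data
        return None

# the example of the previous slide: hash on the month of birth modulo 5
t = ChainedHashTable(5, lambda birthday: birthday[1] % 5)
t.insert((1, 1), "Sue")          # (day, month): month 1 -> slot 1
t.insert((12, 3), "John")        # month 3 -> slot 3
t.insert((16, 8), "Madonna")     # month 8 -> slot 3, chained together with John
print(t.search((16, 8)))         # -> 'Madonna'
```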

SLIDE 15

chaining with doubly linked lists: worst-case analysis

• insert element x in hash table T: in O(1), insert at the front of a doubly linked list
• delete element x from hash table T: in O(1) if lists are doubly linked;
  if we have the element available no search is needed, we use the doubly linked structure
• search key k in hash table T: in O(n), with n the size of the dictionary;
  worst case if every key hashes to the same slot, then linear in the total number of elements
• for the exam: know and be able to explain this

SLIDE 16

load factor

• assumption: a key is hashed to any arbitrary slot, independent of other keys
• we have n keys and m slots
• the probability of h(k) = h(k′) is 1/m
• the expected length of the list at T[h(k)] is n/m
• this is called the load factor α = n/m

SLIDE 17

chaining: average case

• for an unsuccessful search: compute h(k) and search through the list: in Θ(1 + α)
• for a successful search: also in Θ(1 + α)
• so if α ∈ O(1) (constant!) then the average search time is in Θ(1),
  so for example if n ∈ O(m) (number of slots proportional to number of keys)
• if the hash table is too small it does not work properly!

SLIDE 18

intermezzo: choosing a hash function

• in view of the assumption in the analysis: what is a good hash function?
• it distributes keys uniformly and seemingly randomly
• regularity in the key distribution should not affect uniformity
• hash values are easy to compute: in O(1)
• (these properties can be difficult to check)

SLIDE 19

possible hash functions with keys natural numbers

• division method: a key k is hashed to k mod m
  pro: easy to compute
  contra: not good for all values of m; take for m a prime not too close to a power of 2
• multiplication method: a key k is hashed to ⌊m · (k · c − ⌊k · c⌋)⌋,
  with c a constant with 0 < c < 1; which c is good?
• remark: we do not consider universal hashing (book 11.3.3)
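A Python sketch of both methods (illustrative; the constant c = (√5 − 1)/2 below is a commonly suggested choice, not prescribed by the slide):

```python
import math

def division_hash(k, m):
    # division method: k mod m; take for m a prime not too close to a power of 2
    return k % m

def multiplication_hash(k, m, c=(math.sqrt(5) - 1) / 2):
    # multiplication method: floor(m * fractional part of k * c), with 0 < c < 1
    return math.floor(m * (k * c - math.floor(k * c)))

print(division_hash(123456, 701))
print(multiplication_hash(123456, 16))
```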

SLIDE 20
open addressing

• alternative to chaining for resolving collisions
• every slot of the hash table contains either nil or an element
• for hashing, we use a probe sequence h : U × {0, . . . , m − 1} → {0, . . . , m − 1}
  that for every key k ∈ U is a permutation of the available slots 0, . . . , m − 1
• we only use the table, no pointers; the load factor is at most 1
• for insertion: we try the slots of the probe sequence and take the first available one
• deletion is difficult, so we omit deletion

SLIDE 21

remark: removal is difficult

• suppose the hash function gives probe sequence 2, 4, 0, 3, 1 for key a,
  and probe sequence 2, 3, 4, 0, 1 for key b
• we insert a, then insert b, then delete a, then search for b
• if deletion of a puts nil in slot 2, then our search for b fails
• if deletion of a is marked by a special marker, which is skipped in a search,
  then the search time is also influenced by the number of markers (not only by the load factor)

SLIDE 22
open addressing: linear probing

• next probe: try the next address modulo m
• h(k, i) = (h′(k) + i) mod m
• the probe sequence for a key k is
  (h′(k) + 0) mod m, (h′(k) + 1) mod m, (h′(k) + 2) mod m, . . . , (h′(k) + m − 1) mod m
• we get clustering! (and removal is difficult, as in general for open addressing)
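A minimal Python sketch of insertion with linear probing (illustrative; deletion is omitted, as in the lecture):

```python
def linear_probing_insert(table, key, h):
    """Insert key into table (a list with None marking empty slots),
    probing h(key), h(key)+1, ... modulo the table size."""
    m = len(table)
    for i in range(m):
        slot = (h(key) + i) % m
        if table[slot] is None:
            table[slot] = key
            return slot
    raise RuntimeError("hash table overflow")

table = [None] * 11
for k in (18, 29, 7):                          # 18 and 29 collide under k mod 11
    linear_probing_insert(table, k, lambda x: x % 11)
print(table)                                   # 18 in slot 7, 29 in slot 8, 7 in slot 9
```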

SLIDE 23
open addressing: double hashing

• next probe: use the second hash function
• h(k, i) = (h1(k) + i · h2(k)) mod m, with h2(k) relatively prime to the size of the hash table
• the probe sequence for a key k is:
  (h1(k) + 0 · h2(k)) mod m, (h1(k) + 1 · h2(k)) mod m,
  (h1(k) + 2 · h2(k)) mod m, . . . , (h1(k) + (m − 1) · h2(k)) mod m

SLIDE 24

double hashing: example

m = 13, h(k) = k mod 13, h′(k) = 7 − (k mod 7)

 k   h(k)  h′(k)  try
 18   5     3     5
 41   2     1     2
 22   9     6     9
 44   5     5     5, 10
 59   7     4     7
 32   6     3     6
 31   5     4     5, 9, 0
 73   8     4     8

(for key 31 the third probe is (5 + 2 · 4) mod 13 = 0)
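A Python sketch (illustrative) that replays these insertions and reproduces the probe sequences of the table:

```python
def double_hashing_insert(table, k):
    m = len(table)                     # m = 13
    h1, h2 = k % 13, 7 - (k % 7)
    probes = []
    for i in range(m):
        slot = (h1 + i * h2) % m
        probes.append(slot)
        if table[slot] is None:
            table[slot] = k
            return probes
    raise RuntimeError("hash table overflow")

table = [None] * 13
for k in (18, 41, 22, 44, 59, 32, 31, 73):
    print(k, double_hashing_insert(table, k))
# 44 probes 5 (taken by 18), then 10; 31 probes 5 and 9 (taken), then lands in slot 0
```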

SLIDE 25
open addressing: analysis

• probe sequence: h(k, 0), h(k, 1), . . . , h(k, m − 1)
• assumption: uniform hashing, that is: each key is equally likely to have any one of the
  m! permutations as its probe sequence, regardless of what happens to the other keys
• assumption: load factor α = n/m < 1

SLIDE 26

expected number of probes for unsuccessful search

• probe 1: with probability n/m a collision, so go to probe 2
• probe 2: with probability (n − 1)/(m − 1) a collision, so go to probe 3
• probe 3: with probability (n − 2)/(m − 2) a collision, so go to probe 4
• note: (n − i)/(m − i) < n/m = α
• expected number of probes:
  1 + (n/m)(1 + ((n − 1)/(m − 1))(1 + ((n − 2)/(m − 2))(. . .)))
  ≤ 1 + α(1 + α(1 + α(. . .)))
  ≤ 1 + α + α² + α³ + . . . = Σ_{i=0}^∞ α^i = 1/(1 − α)

SLIDE 27
open addressing: remarks

• we assume α < 1 and uniform hashing
• if α is constant, then the expected number of probes for inserting and for
  successful or unsuccessful search is in O(1)
• if the table is 50% full then we expect at most 1/(1 − 0.5) = 2 probes
• if the table is 90% full then we expect at most 1/(1 − 0.9) = 10 probes

SLIDE 28
overview

• hash tables
• trees

SLIDE 29

recap definitions

• binary tree: every node has at most 2 successors (the empty tree is also a binary tree)
• depth of a node x: length (number of edges) of the path from the root to x
• height of a node x: length of a maximal path from x to a leaf
• height of a tree: the height of its root
• the number of levels is the height plus one

SLIDE 30

binary tree: linked implementation

linked data structure with nodes containing

  • x.key from a totally ordered set
  • x.left points to left child of node x
  • x.right points to right child of node x
  • x.p points to parent of node x

if x.p = nil then x is the root
T.root points to the root of the tree (nil if the tree is empty)
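A minimal Python sketch of this linked representation (attribute names follow the slide; None plays the role of nil):

```python
class Node:
    def __init__(self, key):
        self.key = key        # key from a totally ordered set
        self.left = None      # left child
        self.right = None     # right child
        self.p = None         # parent; None means this node is the root

class BinaryTree:
    def __init__(self):
        self.root = None      # None for the empty tree

# build a small tree: 2 is the root, 1 its left child, 3 its right child
T = BinaryTree()
T.root = Node(2)
T.root.left = Node(1);  T.root.left.p = T.root
T.root.right = Node(3); T.root.right.p = T.root
```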

SLIDE 31

binary tree: alternative implementation

remember the heap: binary trees can be represented as arrays, using the level numbering
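A sketch of the level-numbering index arithmetic (assuming 0-based indices, as in Python; the heap lecture may have used 1-based numbering):

```python
def parent(i):
    return (i - 1) // 2

def left(i):
    return 2 * i + 1

def right(i):
    return 2 * i + 2

# the tree with root 2 and children 1 and 3 becomes the array [2, 1, 3]
tree = [2, 1, 3]
print(tree[left(0)], tree[right(0)])   # -> 1 3
```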

SLIDE 32

tree traversals

how can we visit all nodes in a tree exactly once? we will mainly focus on binary trees
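As a first taste, a Python sketch of an inorder traversal (illustrative; it reuses the Node class and the tree T from the linked-implementation sketch above, and visits left subtree, node, right subtree):

```python
def inorder(x, visit):
    # visit every node of the subtree rooted at x exactly once
    if x is not None:
        inorder(x.left, visit)
        visit(x)
        inorder(x.right, visit)

inorder(T.root, lambda node: print(node.key))   # prints 1, 2, 3 for the tree built above
```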