Hashing Dynamic Dictionaries Operations: create insert find - - PowerPoint PPT Presentation



slide-1
SLIDE 1

Hashing

slide-2
SLIDE 2

Dynamic Dictionaries

Operations:

  • create
  • insert
  • find
  • remove
  • max/ min
  • write out in sorted order

Only defined for object classes that are Comparable

slide-4
SLIDE 4

Hash tables

Operations:

  • create
  • insert
  • find
  • remove

Not supported: max/min and writing out in sorted order

Only defined for object classes that have equals defined (they need not be Comparable)

Java-specific: from the Java documentation

slide-5
SLIDE 5

Hash tables – implementation

  • Have a table (an array) of a fixed tableSize
  • A hash function determines where in this table each item should be stored:
    item → hash(item) [a positive integer] % tableSize
  • Example: 2174 % 10 = 4

THE DESIGN QUESTIONS

  1. Choosing tableSize
  2. Choosing a hash function
  3. What to do when a collision occurs
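The item-to-cell mapping can be sketched in Java as below. This is a minimal sketch, not the deck's own code: the class name CellIndex and method indexFor are hypothetical. It uses Math.floorMod because hashCode() may return a negative integer, whereas the slide assumes a positive one.

```java
final class CellIndex {
    // Maps an item to a cell in [0, tableSize) using its hashCode().
    // Math.floorMod keeps the result non-negative even when hashCode() < 0.
    static int indexFor(Object item, int tableSize) {
        return Math.floorMod(item.hashCode(), tableSize);
    }

    public static void main(String[] args) {
        // Integer.hashCode() returns the int value itself, so this is 2174 % 10.
        System.out.println(indexFor(2174, 10)); // prints 4
        System.out.println(indexFor(-7, 10));   // prints 3, whereas -7 % 10 is -7 in Java
    }
}
```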

slide-6
SLIDE 6

Hash tables – tableSize

  • Should depend on the (maximum) number of values to be stored
  • Let λ = [number of values stored] / tableSize
    • λ is the load factor of the hash table
    • Restrict λ to be at most 1 (or ½)
  • Require tableSize to be a prime number
    • to “randomize” away any patterns that may arise in the hash function values
    • The prime should be of the form (4k+3) [for reasons to be detailed later]
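One way to pick a tableSize that satisfies both rules (load factor bound, prime of the form 4k+3) is sketched below. The class TableSizes and its method names are hypothetical, and trial division is just the simplest primality test to show, not a recommendation.

```java
final class TableSizes {
    // Trial-division primality test; adequate for table-size magnitudes.
    static boolean isPrime(int n) {
        if (n < 2) return false;
        for (long d = 2; d * d <= n; d++) {
            if (n % d == 0) return false;
        }
        return true;
    }

    // Smallest prime p >= n with p % 4 == 3 (that is, p = 4k + 3).
    static int nextPrime4kPlus3(int n) {
        int p = Math.max(n, 3);
        while (p % 4 != 3) p++;       // move to the next value of the form 4k + 3
        while (!isPrime(p)) p += 4;   // stay on that form while searching
        return p;
    }

    // Table size that keeps the load factor at or below maxLoad for expectedItems values.
    static int tableSizeFor(int expectedItems, double maxLoad) {
        return nextPrime4kPlus3((int) Math.ceil(expectedItems / maxLoad));
    }

    public static void main(String[] args) {
        System.out.println(tableSizeFor(50, 0.5)); // prints 103
    }
}
```

With 50 expected values and λ ≤ ½, at least 100 cells are needed; 103 is the first prime at or above 100 that has the form 4k+3.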

slide-11
SLIDE 11

Hash tables – the hash function

If the objects to be stored have integer keys (e.g., student IDs), hash(k) = k is generally OK, unless the keys have “patterns”

Otherwise, use some “randomized” way to obtain an integer

Java-specific

  • Every class has a default hashCode() method that returns an integer
  • May be (should be) overridden
  • Required properties
    • consistent with the class’s equals() method
    • need not be consistent across different runs of the program
    • different objects may return the same value!

From the Java 1.5.0 documentation:
http://docs.oracle.com/javase/1.5.0/docs/api/java/lang/Object.html#hashCode%28%29

slide-12
SLIDE 12

Hash tables – collision resolution

The universe of possible items is usually far greater than tableSize

Collision: when multiple items hash on to the same location (aka cell or bucket)

Collision resolution strategies specify what to do in case of collision

  1. Chaining (closed addressing)
  2. Probing (open addressing)
    a. Linear probing
    b. Quadratic probing
    c. Double Hashing
    d. Perfect Hashing
    e. Cuckoo Hashing

slide-17
SLIDE 17

Hash tables – collision resolution: chaining

Maintain a linked list at each cell/bucket (the hash table is an array of linked lists)

Insert: at front of list

  • if the pre-condition is “not already in list,” then insertion is faster (no search of the list is needed)
  • in any case, later-inserted items are often accessed more frequently (the LRU principle)

Example: Insert 02, 12, 22, …, 92 into an initially empty hash table with tableSize = 10
[Note: bad choice of tableSize – only to make the example easier!!]

slide-19
SLIDE 19

Hash tables – collision resolution: chaining

Find and Remove: obvious implementations

  • Worst-case run-time: Θ(N) per operation (all elements in the same list)
  • Average case: O(λ) per operation, where the load factor λ = [number of items stored] / tableSize

Design rule: for chaining, keep λ ≤ 1

  • If λ becomes greater than 1, rehash (later)
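Chaining with insert-at-front, find, and remove can be sketched as follows. This is an illustrative sketch for int keys only. The class name ChainedHashSet is hypothetical, hash(k) = k is assumed, and the demo reuses the 02, 12, …, 92 example with tableSize = 10.

```java
import java.util.LinkedList;

final class ChainedHashSet {
    private final LinkedList<Integer>[] buckets;
    private int size;

    @SuppressWarnings("unchecked")
    ChainedHashSet(int tableSize) {
        buckets = new LinkedList[tableSize];
        for (int i = 0; i < tableSize; i++) buckets[i] = new LinkedList<>();
    }

    private LinkedList<Integer> bucketFor(int key) {
        return buckets[Math.floorMod(key, buckets.length)];
    }

    boolean find(int key) {
        return bucketFor(key).contains(key);
    }

    // Inserts at the front of the list; the duplicate check costs one list scan.
    boolean insert(int key) {
        if (find(key)) return false;
        bucketFor(key).addFirst(key);
        size++;
        return true;
    }

    boolean remove(int key) {
        boolean removed = bucketFor(key).remove(Integer.valueOf(key));
        if (removed) size--;
        return removed;
    }

    double loadFactor() {
        return (double) size / buckets.length;
    }

    public static void main(String[] args) {
        ChainedHashSet set = new ChainedHashSet(10);
        for (int k = 2; k <= 92; k += 10) set.insert(k);
        // All ten keys end in 2, so they share bucket 2: the worst case for find.
        System.out.println(set.bucketFor(2)); // prints [92, 82, 72, 62, 52, 42, 32, 22, 12, 2]
        System.out.println(set.loadFactor()); // prints 1.0
    }
}
```

Here λ = 1 satisfies the design rule, yet every find walks a single list of ten items, which is why the note calls tableSize = 10 a bad choice for keys with this pattern.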

slide-21
SLIDE 21

Hash tables – collision resolution: linear probing

  1. Chaining (closed addressing)
  2. Probing (open addressing)
    a. Linear probing
    b. Quadratic probing
    c. Double Hashing
    d. Perfect Hashing
    e. Cuckoo Hashing

In case of collision, try alternative locations until an empty cell is found [open addressing]

  • Probe sequence: h₀(x), h₁(x), h₂(x), …, with hᵢ(x) = [hash(x) + f(i)] % tableSize
  • The function f(i) is different for the different probing methods
  • Avoids the use of dynamic memory

Linear probing: f(i) is a linear function of i – typically, f(i) = i

Example: insert 89, 18, 49, 58, and 69 into a table of size 10, using linear probing
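The linear probing example can be traced with a short sketch. It assumes hash(x) = x and f(i) = i; the class name LinearProbingDemo is hypothetical.

```java
import java.util.Arrays;

final class LinearProbingDemo {
    // Inserts key with f(i) = i; returns the cell used, or -1 if the table is full.
    static int insert(Integer[] table, int key) {
        int home = Math.floorMod(key, table.length);
        for (int i = 0; i < table.length; i++) {
            int cell = (home + i) % table.length;
            if (table[cell] == null) {
                table[cell] = key;
                return cell;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        Integer[] table = new Integer[10];
        for (int key : new int[] {89, 18, 49, 58, 69}) insert(table, key);
        System.out.println(Arrays.toString(table));
        // prints [49, 58, 69, null, null, null, null, null, 18, 89]
    }
}
```

49 wraps from cell 9 to cell 0. 58 and 69 then have to walk past the cells already filled, ending in cells 1 and 2.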

slide-22
SLIDE 22

Hash tables - review

  • Supports the basic dynamic dictionary ops: insert, find, remove
  • Does not need the class to be Comparable
  • Three design decisions: tableSize, hash function, collision resolution
    • Table size: a prime of the form (4k+3), keeping load factor constraints in mind
    • Hash function: should “randomize” the items; Java’s hashCode() method
    • Collision resolution: chaining
    • Collision resolution: probing (open addressing) – linear probing
  • The clustering problem

slide-23
SLIDE 23

Hash tables - clustering

Two causes of clustering:

  • multiple keys hash on to the same location (secondary clustering)
  • multiple keys hash on to the same cluster (primary clustering)

Secondary clustering is caused by the hash function; primary clustering, by the choice of probe sequence

The number of probes per operation increases with the load factor

slide-24
SLIDE 24

Hash tables – collision resolution: probing

  1. Chaining (closed addressing)
  2. Probing (open addressing)
    a. Linear probing
    b. Quadratic probing
    c. Double Hashing
    d. Perfect Hashing
    e. Cuckoo Hashing

Quadratic probing: f(i) is a quadratic function of i (e.g., f(i) = i²)

Example: insert 89, 18, 49, 58, and 69 into a table of size 10, using quadratic probing
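The same keys under quadratic probing can be traced with the sketch below. It assumes hash(x) = x and f(i) = i²; the class name QuadraticProbingDemo is hypothetical.

```java
import java.util.Arrays;

final class QuadraticProbingDemo {
    // Inserts key with f(i) = i * i; returns the cell used, or -1 if no free
    // cell is found within tableSize probes (possible even when cells remain).
    static int insert(Integer[] table, int key) {
        int home = Math.floorMod(key, table.length);
        for (int i = 0; i < table.length; i++) {
            int cell = (home + i * i) % table.length;
            if (table[cell] == null) {
                table[cell] = key;
                return cell;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        Integer[] table = new Integer[10];
        for (int key : new int[] {89, 18, 49, 58, 69}) insert(table, key);
        System.out.println(Arrays.toString(table));
        // prints [49, null, 58, 69, null, null, null, null, 18, 89]
    }
}
```

58 probes cells 8, 9, then 8 + 4 = 12 → 2. 69 probes cells 9, 0, then 9 + 4 = 13 → 3.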

slide-26
SLIDE 26

Hash tables – collision resolution: quadratic probing

Two causes of clustering:

  • multiple keys hash on to the same location (secondary clustering)
  • multiple keys hash on to the same cluster (primary clustering)

Which one does quadratic probing solve? Primary clustering

Efficient implementation of i² → (i+1)²: compute (i+1) and (2i+1) in parallel, and then add i² and (2i+1)

Choosing tableSize:

  • prime: at least half the table gets probed
  • prime of the form (4k+3) and probe sequence is ±i²: entire table gets probed

Remove: lazy delete must be used
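The two tableSize claims can be checked empirically with a small sketch. The class QuadraticCoverage and its method are hypothetical. The code counts the distinct cells reached from home cell 0 and also uses the incremental update (i+1)² = i² + (2i+1) instead of multiplying.

```java
import java.util.HashSet;
import java.util.Set;

final class QuadraticCoverage {
    // Counts distinct cells reached from home cell 0 by the offsets i*i (and, if
    // alternating is true, also -i*i) for i = 0 .. tableSize - 1.
    static int cellsReached(int tableSize, boolean alternating) {
        Set<Integer> seen = new HashSet<>();
        int square = 0; // invariant: square == (i * i) % tableSize
        for (int i = 0; i < tableSize; i++) {
            seen.add(square);
            if (alternating) seen.add(Math.floorMod(-square, tableSize));
            // (i + 1)^2 = i^2 + (2i + 1): the next square needs no multiplication
            square = (square + 2 * i + 1) % tableSize;
        }
        return seen.size();
    }

    public static void main(String[] args) {
        System.out.println(cellsReached(13, false)); // 13 = 4*3 + 1: prints 7
        System.out.println(cellsReached(13, true));  // prints 7 (±i² adds nothing)
        System.out.println(cellsReached(11, false)); // 11 = 4*2 + 3: prints 6
        System.out.println(cellsReached(11, true));  // prints 11 (entire table)
    }
}
```

For both primes, +i² alone reaches just over half the table. Only the 4k+3 prime (11) reaches every cell once the −i² probes are added.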

slide-27
SLIDE 27

Hash tables – collision resolution: probing

  1. Chaining (closed addressing)
  2. Probing (open addressing)
    a. Linear probing
    b. Quadratic probing
    c. Double Hashing
    d. Perfect Hashing
    e. Cuckoo Hashing

Double hashing: to get rid of secondary clustering

  • Use two hash functions: hash1(.) and hash2(.)
  • Probe sequence “step” size is hash2(.)
    • [It is unlikely that distinct items agree on both hash1(.) and hash2(.)]
  • hash2(.) must never evaluate to zero!
  • A common (good) choice: hash2(x) = R – (x mod R), for R a prime smaller than tableSize

Example: insert 89, 18, 49, 58, and 69 into a table of size 10, using double hashing with hash2(x) = 7 – (x mod 7)
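The double hashing example can be traced with the sketch below. It assumes hash1(x) = x and probe offsets f(i) = i · hash2(x); the class name DoubleHashingDemo is hypothetical.

```java
import java.util.Arrays;

final class DoubleHashingDemo {
    static final int R = 7; // a prime smaller than tableSize, as in the example

    // Step size; always in 1..R, so it never evaluates to zero.
    static int hash2(int x) {
        return R - Math.floorMod(x, R);
    }

    // Probes home, home + step, home + 2*step, ... (mod tableSize).
    static int insert(Integer[] table, int key) {
        int home = Math.floorMod(key, table.length);
        int step = hash2(key);
        for (int i = 0; i < table.length; i++) {
            int cell = (home + i * step) % table.length;
            if (table[cell] == null) {
                table[cell] = key;
                return cell;
            }
        }
        return -1; // no free cell found on this probe sequence
    }

    public static void main(String[] args) {
        Integer[] table = new Integer[10];
        for (int key : new int[] {89, 18, 49, 58, 69}) insert(table, key);
        System.out.println(Arrays.toString(table));
        // prints [69, null, null, 58, null, null, 49, null, 18, 89]
    }
}
```

49 and 69 both start at cell 9 but use different steps (7 and 1), so they end in cells 6 and 0. This is how the second hash function breaks up secondary clustering.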

slide-30
SLIDE 30

Hash tables – collision resolution: Cuckoo hashing

  • Goal: constant-time O(1) find in the worst case
    • Example application: network routing tables
    • [remove also takes O(1) time]
  • Insert has worst-case Θ(N) run-time
  • Keep two hash tables, and use two different hash functions

slide-33
SLIDE 33

Hash tables – collision resolution: Cuckoo hashing

[Figure: TABLE 1 and TABLE 2 side by side, with items A–F placed according to the hash values below]

Item   hash1   hash2
A      0       2
B      0       0
C      1       4
D      1       0
E      3       2
F      3       4

slide-37
SLIDE 37

Hash tables – collision resolution: Cuckoo hashing

Insert

  • Insert into Table 1, using hash1
  • If the cell is already occupied
    • bump the occupying item into the other table (using the appropriate hash function)
    • Repeat
  • Rehash after k repetitions

Each table should be more than half empty

  • Stronger condition than load factor ≤ ½
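The insert procedure can be sketched as below, using the hash values of items A–F from the example. The class CuckooDemo, the table size of 5, and the bump limit of 10 are assumptions made for illustration; a full implementation would rehash where this sketch simply returns false.

```java
import java.util.Arrays;
import java.util.Map;

final class CuckooDemo {
    // Hash values taken from the A–F example: {hash1, hash2} for each item.
    static final Map<String, int[]> HASHES = Map.of(
        "A", new int[] {0, 2}, "B", new int[] {0, 0}, "C", new int[] {1, 4},
        "D", new int[] {1, 0}, "E", new int[] {3, 2}, "F", new int[] {3, 4});
    static final int MAX_BUMPS = 10; // the "k" after which a real table would rehash

    final String[][] tables = new String[2][5];

    // Returns false if the bump limit is reached; the item still in hand is
    // dropped in this sketch, where a full implementation would rehash.
    boolean insert(String item) {
        int t = 0; // always start in Table 1
        for (int bumps = 0; bumps <= MAX_BUMPS; bumps++) {
            int cell = HASHES.get(item)[t];
            String evicted = tables[t][cell];
            tables[t][cell] = item;
            if (evicted == null) return true;
            item = evicted; // the bumped item moves to the other table
            t = 1 - t;
        }
        return false;
    }

    // Worst-case O(1): an item can only be in one of two cells.
    boolean find(String item) {
        int[] h = HASHES.get(item);
        return item.equals(tables[0][h[0]]) || item.equals(tables[1][h[1]]);
    }

    public static void main(String[] args) {
        CuckooDemo c = new CuckooDemo();
        for (String s : new String[] {"A", "B", "C", "D", "E", "F"}) c.insert(s);
        System.out.println(Arrays.toString(c.tables[0])); // prints [A, D, null, F, null]
        System.out.println(Arrays.toString(c.tables[1])); // prints [B, null, E, null, C]
    }
}
```

Inserting F triggers a chain of three bumps: F evicts E, E evicts A from Table 2, A evicts B from Table 1, and B finally settles in Table 2.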

slide-38
SLIDE 38

Rehashing

When load factor becomes too large…

  • (Approximately) double tableSize
  • Scan old table, inserting each non-deleted item into the new table

Worst-case time?

  • O(N²)

Average-case: O(N)

Amortized analysis

  • Average cost per insert, over a sequence of repeated re-hashings
  • [Not great for interactive applications…]
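A rehash for a linear-probing table can be sketched as follows. The class RehashDemo is hypothetical. It has no lazy-deleted cells (null means empty), and it picks the next prime at or above double the old size without enforcing the 4k+3 form.

```java
import java.util.Arrays;

final class RehashDemo {
    static boolean isPrime(int n) {
        if (n < 2) return false;
        for (int d = 2; d * d <= n; d++) {
            if (n % d == 0) return false;
        }
        return true;
    }

    static int nextPrime(int n) {
        while (!isPrime(n)) n++;
        return n;
    }

    // Linear-probing insert; assumes the table has at least one free cell.
    static void insert(Integer[] table, int key) {
        int cell = Math.floorMod(key, table.length);
        while (table[cell] != null) cell = (cell + 1) % table.length;
        table[cell] = key;
    }

    // Every key is re-inserted because key % tableSize changes with tableSize.
    static Integer[] rehash(Integer[] old) {
        Integer[] bigger = new Integer[nextPrime(2 * old.length)];
        for (Integer key : old) {
            if (key != null) insert(bigger, key);
        }
        return bigger;
    }

    public static void main(String[] args) {
        Integer[] old = {49, 58, 69, null, null, null, null, null, 18, 89};
        Integer[] bigger = rehash(old);
        System.out.println(bigger.length);            // prints 23
        System.out.println(Arrays.toString(bigger));  // 69 at cell 0, 49 at cell 3, ...
    }
}
```

Copying alone would not work: 49 sat in cell 0 of the old table only because of probing, and in the new table it belongs in cell 49 % 23 = 3.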

slide-40
SLIDE 40

Hash tables - review

  • Supports the basic dynamic dictionary ops: insert, find, remove
  • Three design decisions: tableSize, hash function, collision resolution
  • Table size: a prime of the form (4k+3), keeping load factor constraints in mind
  • Hash function: Java’s hashCode() method; item goes to hash(item) % tableSize
  • Collision: multiple items at the same location
  • Collision resolution: chaining
  • Collision resolution: probing (open addressing)
    • Linear probing
    • Quadratic probing
    • Double Hashing
    • Cuckoo Hashing
slide-41
SLIDE 41

Java-specific – hashCode() and equals()

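public class Employee {
    String name;
    int id;

    public Employee(String n, int i) { name = n; id = i; }

    // Overrides Object.equals and compares names by value, not by reference
    @Override
    public boolean equals(Object o) {
        if (!(o instanceof Employee)) return false;
        Employee e = (Employee) o;
        return name.equals(e.name) && id == e.id;
    }
    // hashCode() is not overridden here, so equal employees usually get different hash codes
}

… …

public static void main(String[] args) {
    Employee e1 = new Employee("weiss", 1);
    Employee e2 = new Employee("weiss", 1);
    System.out.println(e1.hashCode() + ", " + e2.hashCode()); // usually two different values
    System.out.println(e1 == e2);      // false: two distinct objects
    System.out.println(e1.equals(e2)); // true: same name and id
    e2 = e1;                           // e2 now refers to the same object as e1
}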
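For comparison, a class whose hashCode() is consistent with equals() might look like the sketch below. The class name EmployeeKey is hypothetical, and Objects.hash is only one convenient way to combine the fields.

```java
import java.util.HashSet;
import java.util.Objects;
import java.util.Set;

final class EmployeeKey {
    private final String name;
    private final int id;

    EmployeeKey(String name, int id) {
        this.name = name;
        this.id = id;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof EmployeeKey)) return false;
        EmployeeKey other = (EmployeeKey) o;
        return id == other.id && Objects.equals(name, other.name);
    }

    // Built from the same fields as equals(), so equal objects get equal hash codes.
    @Override
    public int hashCode() {
        return Objects.hash(name, id);
    }

    public static void main(String[] args) {
        EmployeeKey e1 = new EmployeeKey("weiss", 1);
        EmployeeKey e2 = new EmployeeKey("weiss", 1);
        System.out.println(e1 == e2);                       // prints false
        System.out.println(e1.equals(e2));                  // prints true
        System.out.println(e1.hashCode() == e2.hashCode()); // prints true
        Set<EmployeeKey> set = new HashSet<>();
        set.add(e1);
        System.out.println(set.contains(e2));               // prints true
    }
}
```

Without the hashCode() override, the last line would usually print false, because HashSet would look for e2 in a different bucket from the one holding e1.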

slide-42
SLIDE 42

Hash tables – collision resolution: linear probing

  • f(i) can be any linear function (a * i + b)
  • If gcd(a, tableSize) = 1, then linear probing will probe the entire table
  • Primary clustering: blocks of occupied cells start forming even in a relatively empty table

[Figure annotation: any item hashing here…]
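The gcd condition can be checked with a short sketch. The class LinearCoverage and its method names are hypothetical. It counts how many distinct cells the offsets f(i) = a·i + b reach from home cell 0.

```java
final class LinearCoverage {
    static int gcd(int a, int b) {
        return b == 0 ? a : gcd(b, a % b);
    }

    // Number of distinct cells visited by f(i) = a*i + b for i = 0 .. tableSize - 1,
    // starting from home cell 0.
    static int cellsReached(int a, int b, int tableSize) {
        boolean[] seen = new boolean[tableSize];
        int count = 0;
        for (int i = 0; i < tableSize; i++) {
            int cell = Math.floorMod(a * i + b, tableSize);
            if (!seen[cell]) {
                seen[cell] = true;
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(gcd(3, 10) + " -> " + cellsReached(3, 0, 10)); // prints 1 -> 10
        System.out.println(gcd(4, 10) + " -> " + cellsReached(4, 0, 10)); // prints 2 -> 5
    }
}
```

With a = 4 and tableSize = 10 the probe sequence cycles through cells 0, 4, 8, 2, 6 only, so an insert can fail while half the table is still empty.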

slide-43
SLIDE 43

Hash tables – collision resolution: linear probing

[Figure annotation: any item hashing here… grows the cluster by one]

slide-44
SLIDE 44

Hash tables – collision resolution: linear probing

[Figure annotation: any item hashing here… merges the two clusters]

slide-46
SLIDE 46

Hash tables – linear probing: remove

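[Figure: a table with cells indexed up to 9, showing where A, B, and C are stored]

  1. insert A; hash(A) = 4
  2. insert B; hash(B) = 5
  3. insert C; hash(C) = 4
  4. remove B
  5. find C

C collides at cells 4 and 5 and is stored in cell 6. If remove B simply emptied cell 5, find C would stop at that empty cell and wrongly report that C is absent.

Remove must be implemented as lazy delete!!

  • Load factor computed including lazy-deleted items
  • In inserts, may “reclaim” lazy-deleted cells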
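Lazy deletion under linear probing can be sketched as below. The class LazyDeleteTable is hypothetical. The integer keys 4, 5, and 14 stand in for A, B, and C (with hash(k) = k and tableSize = 10 they hash to cells 4, 5, and 4).

```java
import java.util.Arrays;

final class LazyDeleteTable {
    private enum State { EMPTY, OCCUPIED, DELETED }

    private final int[] keys;
    private final State[] states;

    LazyDeleteTable(int tableSize) {
        keys = new int[tableSize];
        states = new State[tableSize];
        Arrays.fill(states, State.EMPTY);
    }

    // Returns the cell holding key, or -1. Probing continues past DELETED
    // cells and stops only at an EMPTY cell.
    int find(int key) {
        int home = Math.floorMod(key, keys.length);
        for (int i = 0; i < keys.length; i++) {
            int cell = (home + i) % keys.length;
            if (states[cell] == State.EMPTY) return -1;
            if (states[cell] == State.OCCUPIED && keys[cell] == key) return cell;
        }
        return -1;
    }

    // Uses the first non-occupied cell on the probe path, reclaiming DELETED cells.
    boolean insert(int key) {
        if (find(key) >= 0) return false;
        int home = Math.floorMod(key, keys.length);
        for (int i = 0; i < keys.length; i++) {
            int cell = (home + i) % keys.length;
            if (states[cell] != State.OCCUPIED) {
                keys[cell] = key;
                states[cell] = State.OCCUPIED;
                return true;
            }
        }
        return false; // table full
    }

    boolean remove(int key) {
        int cell = find(key);
        if (cell < 0) return false;
        states[cell] = State.DELETED; // mark only; do not empty the cell
        return true;
    }

    public static void main(String[] args) {
        LazyDeleteTable t = new LazyDeleteTable(10);
        t.insert(4);   // A -> cell 4
        t.insert(5);   // B -> cell 5
        t.insert(14);  // C -> home cell 4, probes 5, lands in cell 6
        t.remove(5);   // cell 5 becomes DELETED, not EMPTY
        System.out.println(t.find(14)); // prints 6
        t.insert(24);  // home cell 4; reclaims the DELETED cell 5
        System.out.println(t.find(24)); // prints 5
    }
}
```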