Conditional Course Lecture 4 Hash Tables I: Separate Chaining and Open Addressing


  1. Algorithms and Data Structures, Conditional Course
     Lecture 4: Hash Tables I: Separate Chaining and Open Addressing
     Fabian Kuhn, Algorithms and Complexity

  2. Abstract Data Types: Dictionary
     Dictionary: (also: maps, associative arrays)
     • holds a collection of elements where each element is represented by a unique key
     Operations:
     • create : creates an empty dictionary
     • D.insert(key, value) : inserts a new (key, value) pair
       – If there already is an entry with the same key, the old entry is replaced
     • D.find(key) : returns the entry with key key, if there is such an entry (returns some default value otherwise)
     • D.delete(key) : deletes the entry with key key
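
As a rough illustration of the operations listed on this slide, the sketch below expresses them with Python's built-in dict. The mapping of insert/find/delete onto dict calls and the sample values "tutor" and "Anna" are assumptions made for this example, not part of the lecture.

```python
# Sketch: the dictionary operations, expressed with Python's built-in dict.
D = {}                       # create: an empty dictionary
D["Philipp"] = "assistant"   # insert(key, value)
D["Philipp"] = "tutor"       # same key again: the old entry is replaced
found = D.get("Philipp")     # find: returns the stored value
missing = D.get("Anna")      # find on an absent key: default value None
D.pop("Philipp", None)       # delete: removes the entry if it exists
```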

  3. Dictionary so far
     • So far, we saw 3 simple dictionary implementations:

                 Linked List    Array         Array
                 (unsorted)     (unsorted)    (sorted)
       insert    $O(1)$         $O(1)$        $O(n)$
       delete    $O(n)$         $O(n)$        $O(n)$
       find      $O(n)$         $O(n)$        $O(\log n)$

       $n$: current number of elements in dictionary
     • Often the most important operation: find
     • Can we improve find even more?
     • Can we make all operations fast?

  4. Direct Addressing
     With an array, we can make everything fast, ... if the array is sufficiently large.
     Assumption: Keys are integers between 0 and $M - 1$
     [Figure: array with positions 0, …, $M - 1$; position 2 holds “Value 1”, position 4 holds “Value 2”, position 7 holds “Value 3”, all other positions hold None]
     • find(2) → “Value 1”
     • insert(6, “Philipp”): position 6 now holds “Philipp”
     • delete(4): position 4 is reset to None
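
The direct addressing scheme on this slide can be sketched as follows. The class name `DirectAddressTable` and the parameter `universe_size` (the number of possible keys) are hypothetical names chosen for this example; it assumes keys are non-negative integers below `universe_size`.

```python
class DirectAddressTable:
    """Minimal sketch: one array slot per possible integer key."""

    def __init__(self, universe_size):
        # Space is proportional to the number of POSSIBLE keys.
        self.slots = [None] * universe_size

    def insert(self, key, value):
        self.slots[key] = value      # O(1): the key is the array index

    def find(self, key):
        return self.slots[key]       # O(1): None means "no entry"

    def delete(self, key):
        self.slots[key] = None       # O(1)


# Replaying the operations from the slide's figure.
table = DirectAddressTable(universe_size=10)
table.insert(2, "Value 1")
table.insert(4, "Value 2")
table.insert(7, "Value 3")
table.insert(6, "Philipp")
table.delete(4)
```

All three operations take constant time, but the array must be as large as the whole key range.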

  5. Direct Addressing : Problems
     1. Direct addressing requires too much space!
        – If each key can be an arbitrary int (32 bit): We need an array of size $2^{32} \approx 4 \cdot 10^9$. For 64-bit integers, we even need more than $10^{19}$ entries …
     2. What if the keys are not integers?
        – Where do we store the (key, value) pair (“Philipp”, “assistant”)?
        – Where do we store the key 3.14159?
        – Pythagoras: “Everything is number”
          “Everything” can be stored as a sequence of bits: interpret the bit sequence as an integer
        – Makes the space problem even worse!
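
To make the space figures concrete, here is a small back-of-the-envelope calculation. The 8 bytes per slot is an assumption made for this example (for instance, one pointer per array position), not a figure from the lecture.

```python
# Number of array positions needed for direct addressing.
entries_32 = 2 ** 32     # one slot per possible 32-bit key
entries_64 = 2 ** 64     # one slot per possible 64-bit key

# Assumption for illustration: each slot stores one 8-byte pointer.
bytes_per_slot = 8
gib_32 = entries_32 * bytes_per_slot / 2 ** 30   # table size in GiB

print(entries_32)        # 4294967296, roughly 4 * 10^9
print(gib_32)            # 32.0
print(entries_64 > 10 ** 19)
```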

  6. Hashing : Idea
     Problem
     • Huge space $S$ of possible keys
     • Number $n$ of actually used keys is much smaller
       – We would like to use an array of size $\approx n$ (resp. $O(n)$) …
     • How can we map $M$ keys to $O(n)$ array positions?
     [Figure: a random mapping sends the $n$ used keys, out of $M$ possible keys, to an array of size $O(n)$]

  7. Hash Functions
     Key space $S$, $|S| = M$ (all possible keys)
     Array size $m$ ($\approx$ maximum number of keys we want to store)
     Hash function $h : S \to \{0, \dots, m - 1\}$
     • Maps keys of key space $S$ to array positions
     • $h$ should be as close as possible to a random function
       – roughly the same number of keys is mapped to each position in $\{0, \dots, m - 1\}$
       – similar keys should be mapped to different positions
     • $h$ should be computable as fast as possible
       – if possible in time $O(1)$
       – will be considered a basic operation in the following (cost = 1)
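
The slide does not fix a concrete hash function. As a hedged illustration only, the sketch below shows two common textbook choices: the division method for integer keys and a polynomial hash for strings. The function names, the parameter `table_size`, and the base 31 are assumptions made for this example; neither function behaves like a truly random function on structured inputs.

```python
def hash_int(key: int, table_size: int) -> int:
    """Division method: map an integer key to a position 0 .. table_size - 1."""
    return key % table_size


def hash_str(key: str, table_size: int, base: int = 31) -> int:
    """Polynomial hash: read the characters as digits of a number in the
    given base and reduce modulo the table size after every step."""
    position = 0
    for ch in key:
        position = (position * base + ord(ch)) % table_size
    return position


print(hash_int(25, 11))          # 3, since 25 = 2 * 11 + 3
print(hash_str("Philipp", 11))   # some position between 0 and 10
```

The string hash takes time proportional to the key length, which counts as constant only if key lengths are bounded.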

  8. Hash Tables
     1. insert($k_1$, $v_1$)
     2. insert($k_2$, $v_2$)
     3. insert($k_3$, $v_3$)
     [Figure: hash table with positions 0, …, $m - 1$, initially all None. $(k_1, v_1)$ is stored at position 3 and $(k_2, v_2)$ at position 7. Since $h(k_3) = 3$, inserting $k_3$ causes a collision at position 3.]

  9. Hash Tables : Collisions
     Collision: Two keys $k_1$, $k_2$ collide if $h(k_1) = h(k_2)$.
     What should we do in case of a collision?
     • Can we choose the hash function such that there are no collisions?
       – This is only possible if we know the used keys before choosing the hash function.
       – Even then, choosing such a hash function can be very expensive.
     • Use another hash function?
       – One would need to choose a new hash function for every new collision.
       – A new hash function means that one needs to relocate all the already inserted values in the hash table.
     • Further ideas?

  10. Hash Tables : Collisions
      Approaches for Dealing With Collisions
      • Assumption: Keys $k_1$ and $k_2$ collide
      1. Store both (key, value) pairs at the same position
         – The hash table needs to have space to store multiple entries at each position.
         – We do not want to just increase the size of the table (then, we could have just started with a larger table …)
         – Solution: Use linked lists
      2. Store the second key at a different position
         – Can for example be done with a second hash function
         – Problem: At the alternative position, there could again be a collision
         – There are multiple solutions
         – One solution: use many possible new positions (one has to make sure that these positions are usually not used …)

  11. Separate Chaining
      • Each position of the hash table points to a linked list
      [Figure: hash table with positions 0, …, $m - 1$; position 3 points to a list containing $v_1$ and $v_3$, position 7 points to a list containing $v_2$, all other positions are None]
      Space usage: $O(m + n)$
      • table size $m$, no. of elements $n$
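
A minimal sketch of separate chaining is given below. It is an illustration under stated assumptions, not the lecture's implementation: the class name `ChainedHashTable` is hypothetical, Python lists stand in for the linked lists, and Python's built-in `hash` reduced modulo the table size plays the role of the hash function.

```python
class ChainedHashTable:
    """Sketch of a hash table with separate chaining."""

    def __init__(self, table_size=8):
        # One (initially empty) chain per table position.
        self.buckets = [[] for _ in range(table_size)]

    def _bucket(self, key):
        # Assumption: built-in hash() modulo table size as hash function.
        return self.buckets[hash(key) % len(self.buckets)]

    def insert(self, key, value):
        bucket = self._bucket(key)
        for i, (existing_key, _) in enumerate(bucket):
            if existing_key == key:
                bucket[i] = (key, value)   # replace the old entry
                return
        bucket.append((key, value))

    def find(self, key, default=None):
        for existing_key, value in self._bucket(key):
            if existing_key == key:
                return value
        return default

    def delete(self, key):
        bucket = self._bucket(key)
        for i, (existing_key, _) in enumerate(bucket):
            if existing_key == key:
                del bucket[i]
                return


# In CPython, small non-negative integers hash to themselves,
# so the keys 3, 7 and 11 all land in position 3 of a table of size 4.
t = ChainedHashTable(table_size=4)
t.insert(3, "v1")
t.insert(7, "v2")
t.insert(11, "v3")
t.delete(7)
```

Every operation scans only the chain at one position, so its cost is proportional to 1 plus the length of that chain.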

  12. Runtime Hash Table Operations
      To make it simple, first for the case without collisions …
      create: $O(1)$
      insert: $O(1)$
      find: $O(1)$
      delete: $O(1)$
      • As long as there are no collisions, hash tables are extremely fast (if hash functions can be evaluated in constant time)
      • We will see that this is also true with collisions …

  13. Runtime Separate Chaining
      Now, let’s consider collisions …
      create: $O(1)$
      insert: $O(1 + \text{length of list})$
        – If one does not need to check whether the key is already contained, insert can even always be done in time $O(1)$.
      find: $O(1 + \text{length of list})$
      delete: $O(1 + \text{length of list})$
      • We therefore have to see how long the lists become.

  14. Separate Chaining : Worst Case
      Worst case for separate chaining:
      • All keys that appear have the same hash value
      • Results in a linked list of length $n$
      • Probability for random $h$: $\left(\frac{1}{m}\right)^{n-1}$
      [Figure: hash table with positions 0, …, $m - 1$; since $h(k_1) = 3$ and all other keys collide with $k_1$, position 3 points to a single list containing $k_1, k_2, \dots$, all other positions are None]
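
The probability on this slide can be checked with a short calculation. Write $n$ for the number of keys and $m$ for the table size, and assume that $h$ assigns each key an independent, uniformly random position. Summing over the $m$ positions at which all keys could meet gives:

```latex
\Pr[\text{all } n \text{ keys have the same hash value}]
  = \sum_{i=0}^{m-1} \left(\frac{1}{m}\right)^{n}
  = m \cdot \left(\frac{1}{m}\right)^{n}
  = \left(\frac{1}{m}\right)^{n-1}
```

Equivalently, the first key may land anywhere and each of the remaining $n - 1$ keys must hit that same position with probability $1/m$.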

  15. Length of Linked Lists
      • Cost of insert, find, and delete depends on the length of the corresponding list
      • How long do the lists become?
        – Assumption: Size of hash table $m$, number of entries $n$
        – Additional assumption: Hash function $h$ behaves as a random function
      • List lengths correspond to the following random experiment
        $m$ bins and $n$ balls
      • Each ball is thrown (independently) into a random bin
      • Longest list = maximal no. of balls in the same bin
      • Average list length = average no. of balls per bin
        $m$ bins, $n$ balls ⇒ average #balls per bin: $n/m$
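
The balls-and-bins experiment can be simulated directly. The sketch below is an illustration only: the function name `simulate_list_lengths` and its parameters `num_keys` (balls) and `table_size` (bins) are hypothetical, and Python's pseudo-random generator stands in for the random hash function.

```python
import random
from collections import Counter


def simulate_list_lengths(num_keys, table_size, seed=0):
    """Throw num_keys balls independently into table_size random bins
    and return the number of balls in each bin (the list lengths)."""
    rng = random.Random(seed)
    counts = Counter(rng.randrange(table_size) for _ in range(num_keys))
    return [counts[slot] for slot in range(table_size)]


lengths = simulate_list_lengths(num_keys=1000, table_size=1000)
average = sum(lengths) / len(lengths)   # always num_keys / table_size
longest = max(lengths)                  # varies with the random throws

print("average list length:", average)
print("longest list:", longest)
```

The average is fixed by the ratio of balls to bins, while the longest list depends on the random outcome and is typically noticeably larger than the average.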

  16. Balls and Bins
      $n$ balls, $m$ bins
      • Worst-case runtime $= \Theta(\text{max \#balls per bin}) \in O\!\left(\frac{n}{m} + \frac{\log n}{\log\log n}\right)$ with high probability (whp)
        – for $n \le m$: $O\!\left(\frac{\log n}{\log\log n}\right)$
      • The longest list will have length $\Theta\!\left(\frac{\log n}{\log\log n}\right)$.

  17. Balls and Bins
      $n$ balls, $m$ bins
      Expected runtime (for every key):
      • Key in table:
        – List length of a random entry
        – Corresponds to #balls in the bin of a random ball
      • Key not in table:
        – Length of a random list, i.e., #balls in a random bin

  18. Expected Runtime of Find
      Load $\alpha$ of hash table: $\alpha := \frac{n}{m}$
      Cost of search:
      • Search for key $x$ that is not contained in the hash table
        $h(x)$ is a uniformly random position
        ⇒ expected list length = average list length = $\alpha$
      Expected runtime: $O(1 + \alpha)$
      [Figure: find($x$): computing $h(x)$ takes time $O(1)$, going through a random list takes time $O(\alpha)$]

  19. Expected Runtime of Find
      Load $\alpha$ of hash table: $\alpha := \frac{n}{m}$
      Cost of search:
      • Search for key $x$ that is contained in the hash table
        How many keys $y \ne x$ are in the list of $x$?
      • The other keys are distributed randomly, so the expected number corresponds to the expected number of entries in a random list of a hash table with $n - 1$ entries (all entries except $x$).
      • This is: $\frac{n-1}{m} < \frac{n}{m} = \alpha$ ⇒ expected list length of $x$ $< 1 + \alpha$
      Expected runtime: $O(1 + \alpha)$
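
The argument on this slide can be written out as a short derivation. Write $n$ for the number of entries, $m$ for the table size and $\alpha = n/m$ for the load, and assume that $h$ behaves as a random function, so that each other key $y$ lands in the list of $x$ with probability $1/m$. Linearity of expectation then gives:

```latex
\mathbb{E}[\text{length of the list containing } x]
  = 1 + \sum_{y \neq x} \Pr[h(y) = h(x)]
  = 1 + (n - 1) \cdot \frac{1}{m}
  = 1 + \frac{n-1}{m}
  < 1 + \alpha
```

The leading 1 accounts for $x$ itself, which is always in its own list.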

  20. Runtimes Separate Chaining
      create:
      • runtime $O(1)$
      insert, find & delete:
      • worst case: $\Theta(n)$
      • worst case with high probability (for random $h$): $O\!\left(\alpha + \frac{\log n}{\log\log n}\right)$
      • Expected runtime (for fixed key $x$): $O(1 + \alpha)$
        – holds for successful and unsuccessful searches
        – if $\alpha = O(1)$ (i.e., hash table has size $\Omega(n)$), this is $O(1)$
      • Hash tables are extremely efficient and typically have $O(1)$ runtime for all operations.
