
Hash Tables (Lecture #06), Database Systems, Andy Pavlo


  1. Hash Tables, Lecture #06. Database Systems, Andy Pavlo. Computer Science 15-445/15-645, Carnegie Mellon Univ., Fall 2018.

  2. UPCOMING DATABASE EVENTS: MapD Talk → Thursday Sept 20th @ 12:00pm → CIC 4th Floor. CMU 15-445/645 (Fall 2018)

  3. ADMINISTRIVIA: Project #1 is due Wednesday Sept 26th @ 11:59pm. Homework #2 is due Friday Sept 28th @ 11:59pm.

  4. REMINDER: If you have a question during the lecture, raise your hand and stop me. Do not come up to the front after the lecture. There are no stupid questions (*).

  5. COURSE STATUS: We are now going to talk about how to support the DBMS's execution engine to read/write data from pages. Two types of data structures: → Hash Tables → Trees. [Diagram: system layers, top to bottom: Query Planning, Operator Execution, Access Methods, Buffer Pool Manager, Disk Manager.]

  6. DATA STRUCTURES: Internal Meta-data; Core Data Storage; Temporary Data Structures; Table Indexes.

  7. DESIGN DECISIONS: Data Organization → How we lay out the data structure in memory/pages and what information to store to support efficient access. Concurrency → How to enable multiple threads to access the data structure at the same time without causing problems.

  8. HASH TABLES: A hash table implements an associative array abstract data type that maps keys to values. It uses a hash function to compute an offset into the array, from which the desired value can be found.

  9-10. STATIC HASH TABLE: Allocate a giant array that has one slot for every element that you need to record. To find an entry, mod the key by the number of elements to find the offset in the array. [Diagram: hash(key) indexes an array with slots 0 through n; the first frame shows the entries abc, Ø, def, …, xyz, and the second shows the keys abcdefghi, xyz123, defghijk.]

  11. ASSUMPTIONS: You know the number of elements ahead of time. Each key is unique. Perfect hash function → If key1≠key2, then hash(key1)≠hash(key2).
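
Under the three assumptions above, a static hash table needs no collision handling at all. The following is a minimal sketch of that idea, not code from the lecture: the class name `StaticHashTable` and its methods are hypothetical, and small integer keys are used so that Python's built-in `hash` is deterministic.

```python
class StaticHashTable:
    """Fixed-size table that assumes a known size, unique keys, and no collisions."""

    def __init__(self, num_slots):
        self.slots = [None] * num_slots  # one slot per expected element

    def _offset(self, key):
        # "Mod the key by the number of elements to find the offset."
        return hash(key) % len(self.slots)

    def put(self, key, value):
        # Assumption: no other key maps to this offset (perfect hash function).
        self.slots[self._offset(key)] = (key, value)

    def get(self, key):
        entry = self.slots[self._offset(key)]
        if entry is not None and entry[0] == key:
            return entry[1]
        return None


table = StaticHashTable(num_slots=4)
table.put(10, "abc")  # 10 % 4 -> offset 2
table.put(7, "def")   # 7 % 4  -> offset 3
```

If the perfect-hash assumption is violated (for example, inserting key 14 into this 4-slot table), the new entry silently overwrites the old one, which is why the hashing schemes later in the lecture are needed.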

  12. HASH TABLE: Design Decision #1: Hash Function → How to map a large key space into a smaller domain. → Trade-off between being fast vs. collision rate. Design Decision #2: Hashing Scheme → How to handle key collisions after hashing. → Trade-off between allocating a large hash table vs. additional instructions to find/insert keys.

  13. TODAY'S AGENDA: Hash Functions; Static Hashing Schemes; Dynamic Hashing Schemes.

  14. HASH FUNCTIONS: We don’t want to use a cryptographic hash function for our join algorithm. We want something that is fast and will have a low collision rate.

  15. HASH FUNCTIONS: MurmurHash (2008) → Designed to be a fast, general-purpose hash function. Google CityHash (2011) → Based on ideas from MurmurHash2. → Designed to be faster for short keys (<64 bytes). Google FarmHash (2014) → Newer version of CityHash with better collision rates. CLHash (2016) → Fast hashing function based on carry-less multiplication.
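
None of the functions listed above ships with the Python standard library, so the sketch below swaps in 64-bit FNV-1a, a simple non-cryptographic hash that the lecture does not mention. It is here only to illustrate the general shape of such a function (a few cheap arithmetic operations per byte) and how its output is reduced to a slot number. The helper name `slot_for` is hypothetical.

```python
FNV_OFFSET_BASIS = 0xCBF29CE484222325  # standard 64-bit FNV offset basis
FNV_PRIME = 0x100000001B3              # standard 64-bit FNV prime
MASK64 = 0xFFFFFFFFFFFFFFFF


def fnv1a_64(data: bytes) -> int:
    """64-bit FNV-1a: XOR each byte into the state, then multiply by the prime."""
    h = FNV_OFFSET_BASIS
    for byte in data:
        h ^= byte
        h = (h * FNV_PRIME) & MASK64  # keep the state within 64 bits
    return h


def slot_for(key: str, num_slots: int) -> int:
    """Map a large key space into a smaller domain of slot numbers."""
    return fnv1a_64(key.encode("utf-8")) % num_slots
```

A production DBMS would more likely call one of the libraries named on the slide; FNV-1a is used here only because it fits in a few lines.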

  16-17. HASH FUNCTION BENCHMARKS: Intel Core i7-8700K @ 3.70GHz. [Two plots: throughput (MB/sec) versus key size (1 to 251 bytes) for std::hash, MurmurHash3, CityHash, FarmHash, and CLHash, with key sizes 32, 64, 128, and 192 bytes marked.] Source: Fredrik Widlund

  18. STATIC HASHING SCHEMES: Approach #1: Linear Probe Hashing. Approach #2: Robin Hood Hashing. Approach #3: Cuckoo Hashing.

  19. LINEAR PROBE HASHING: Single giant table of slots. Resolve collisions by linearly searching for the next free slot in the table. → To determine whether an element is present, hash to a location in the index and scan for it. → Have to store the key in the index to know when to stop scanning. → Insertions and deletions are generalizations of lookups.

  20-25. LINEAR PROBE HASHING: [Diagram, six animation frames: keys A through F are hashed in turn into a single table of <key>|<value> slots; a key whose slot is already taken moves down to the next free slot. The final frame shows the entries in the order B, A, C, D, E, F.]
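
The lookup, insert, and delete operations described for linear probing can be sketched as follows. This is an illustrative sketch rather than the lecture's implementation: the class name `LinearProbeTable` is hypothetical, and marking deleted slots with a tombstone is one common approach that is assumed here, not something the slides specify.

```python
_EMPTY = object()      # slot that has never held an entry
_TOMBSTONE = object()  # slot whose entry was deleted (assumed deletion strategy)


class LinearProbeTable:
    def __init__(self, num_slots, hash_fn=hash):
        self.slots = [_EMPTY] * num_slots
        self.hash_fn = hash_fn

    def _probe(self, key):
        """Yield slot indexes starting at the hashed location, wrapping around once."""
        n = len(self.slots)
        start = self.hash_fn(key) % n
        for step in range(n):
            yield (start + step) % n

    def get(self, key):
        for i in self._probe(key):
            entry = self.slots[i]
            if entry is _EMPTY:
                return None  # an empty slot ends the scan
            if entry is not _TOMBSTONE and entry[0] == key:
                return entry[1]  # stored key tells us when to stop
        return None

    def put(self, key, value):
        first_free = None
        for i in self._probe(key):
            entry = self.slots[i]
            if entry is _EMPTY:
                if first_free is None:
                    first_free = i
                break
            if entry is _TOMBSTONE:
                if first_free is None:
                    first_free = i  # remember it, but keep scanning for the key
            elif entry[0] == key:
                self.slots[i] = (key, value)  # update in place
                return
        if first_free is None:
            raise RuntimeError("table is full")
        self.slots[first_free] = (key, value)

    def delete(self, key):
        for i in self._probe(key):
            entry = self.slots[i]
            if entry is _EMPTY:
                return False
            if entry is not _TOMBSTONE and entry[0] == key:
                self.slots[i] = _TOMBSTONE  # keep later entries reachable
                return True
        return False
```

Both `put` and `delete` start with the same scan as `get`, which is the sense in which insertions and deletions generalize lookups.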

  26-27. NON-UNIQUE KEYS: Choice #1: Separate Linked List → Store values in separate storage area for each key. Choice #2: Redundant Keys → Store duplicate key entries together in the hash table. [Diagram: for Choice #1, key XYZ points to a value list (value1, value2, value3) and key ABC to a value list (value1, value2); for Choice #2, the table holds the entries XYZ | value1, ABC | value1, XYZ | value2, XYZ | value3, ABC | value2.]
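
The two choices can be contrasted on the slide's own data. The sketch below is illustrative only: `toy_hash` is a deliberately simple stand-in hash chosen for reproducibility, Python lists stand in for the linked lists, and the function names are hypothetical.

```python
from collections import defaultdict

PAIRS = [("XYZ", "value1"), ("ABC", "value1"), ("XYZ", "value2"),
         ("XYZ", "value3"), ("ABC", "value2")]


def toy_hash(key):
    # Stand-in hash (sum of the key's bytes); not a function from the lecture.
    return sum(key.encode("utf-8"))


def build_value_lists(pairs):
    """Choice #1: each key maps to a separate list holding all of its values."""
    lists = defaultdict(list)
    for key, value in pairs:
        lists[key].append(value)
    return lists


def insert_redundant(slots, key, value):
    """Choice #2: store one (key, value) entry per duplicate, using linear probing."""
    n = len(slots)
    start = toy_hash(key) % n
    for step in range(n):
        i = (start + step) % n
        if slots[i] is None:
            slots[i] = (key, value)
            return
    raise RuntimeError("table is full")


def lookup_all(slots, key):
    """Scan from the hashed slot to the first empty slot, collecting every match."""
    n = len(slots)
    start = toy_hash(key) % n
    found = []
    for step in range(n):
        entry = slots[(start + step) % n]
        if entry is None:
            break
        if entry[0] == key:
            found.append(entry[1])
    return found


value_lists = build_value_lists(PAIRS)
slots = [None] * 8
for k, v in PAIRS:
    insert_redundant(slots, k, v)
```

With redundant keys a lookup cannot stop at the first match; it has to keep scanning until it reaches an empty slot.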

  28. OBSERVATION: To reduce the # of wasteful comparisons, it is important to avoid collisions of hashed keys. This requires a hash table with ~2x the number of slots as the number of elements.
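
The effect of table size on wasted comparisons can be checked with a small experiment. The sketch below is an assumption-laden illustration, not a benchmark from the lecture: it fills a linear-probe table with random integer keys and reports the average number of slots examined per insert. The function name `average_probes` and the chosen sizes are hypothetical.

```python
import random


def average_probes(num_slots, num_keys, seed=0):
    """Average number of slots examined per insert into a linear-probe table."""
    assert num_keys < num_slots, "need at least one free slot to terminate"
    rng = random.Random(seed)
    table = [None] * num_slots
    total = 0
    for key in rng.sample(range(10**9), num_keys):
        i = hash(key) % num_slots
        probes = 1
        while table[i] is not None:  # collision: step to the next slot
            i = (i + 1) % num_slots
            probes += 1
        table[i] = key
        total += probes
    return total / num_keys


half_full = average_probes(num_slots=2048, num_keys=1024)    # ~2x slots per element
nearly_full = average_probes(num_slots=2048, num_keys=1843)  # roughly 90% full
```

Comparing `half_full` with `nearly_full` shows how quickly probe counts grow as the table fills up, which is the motivation for over-allocating slots.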

  29. ROBIN HOOD HASHING: Variant of linear probe hashing that steals slots from "rich" keys and gives them to "poor" keys. → Each key tracks the number of positions it is from its optimal position in the table. → On insert, a key takes the slot of another key if the first key is farther away from its optimal position than the second key.

  30-38. ROBIN HOOD HASHING: [Diagram, nine animation frames: each entry is <key> | val [# of "jumps" from first position]. A and B are placed at their first positions, A[0] and B[0]. C hashes to A's slot; A[0] == C[0], so C moves on and is stored as C[1]. D hashes to C's slot; C[1] > D[0], so D moves on and is stored as D[1]. E hashes to A's slot; A[0] == E[0] and C[1] == E[1], but D[1] < E[2], so E takes D's slot as E[2] and D moves down one slot to become D[2]. F hashes to D's new slot; D[2] > F[0], so F is stored in the next slot as F[1]. Final order: B[0], A[0], C[1], E[2], D[2], F[1].]
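
The insertion rule walked through above can be sketched in a few lines. This is a hedged illustration, not the lecture's code: `RobinHoodTable` is a hypothetical name, keys are assumed unique, deletion is omitted, and the `HOMES` mapping assigns made-up first positions chosen so that the example reproduces the slide's final layout.

```python
class RobinHoodTable:
    def __init__(self, num_slots, home_fn):
        self.slots = [None] * num_slots  # each entry: (key, value, distance)
        self.home_fn = home_fn           # returns a key's first (optimal) position

    def put(self, key, value):
        n = len(self.slots)
        i = self.home_fn(key) % n
        entry = (key, value, 0)
        for _ in range(n):
            resident = self.slots[i]
            if resident is None:
                self.slots[i] = entry
                return
            if resident[2] < entry[2]:
                # The resident is "richer" (closer to home): steal its slot
                # and continue inserting the displaced resident instead.
                self.slots[i], entry = entry, resident
            i = (i + 1) % n
            entry = (entry[0], entry[1], entry[2] + 1)
        raise RuntimeError("table is full")

    def get(self, key):
        n = len(self.slots)
        i = self.home_fn(key) % n
        for dist in range(n):
            resident = self.slots[i]
            # Without deletions, a resident closer to home than our current
            # probe distance means the key cannot be further along.
            if resident is None or resident[2] < dist:
                return None
            if resident[0] == key:
                return resident[1]
            i = (i + 1) % n
        return None


# Hypothetical first positions, chosen to match the frames described above.
HOMES = {"A": 1, "B": 0, "C": 1, "D": 2, "E": 1, "F": 4}
table = RobinHoodTable(num_slots=8, home_fn=HOMES.__getitem__)
for k in "ABCDEF":
    table.put(k, "val")
layout = [(e[0], e[2]) for e in table.slots if e is not None]
```

Storing the distance with each entry costs a little space but keeps probe lengths more even across keys.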

  39. CUCKOO HASHING: Use multiple hash tables with different hash functions. → On insert, check every table and pick any one that has a free slot. → If no table has a free slot, evict the element from one of them and then re-hash it to find a new location. Look-ups and deletions are always O(1) because only one location per hash table is checked.

  40-45. CUCKOO HASHING: [Diagram, six animation frames with Hash Table #1 and Hash Table #2: Insert A computes hash1(A) and hash2(A) and stores A | val. Insert B computes hash1(B) and hash2(B) and stores B | val. Insert C computes hash1(C) and hash2(C); in the last frame C | val occupies the slot that held B | val, so the evicted B has to be re-hashed (not shown in these frames).]
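
The insert-with-eviction procedure can be sketched with two tables. This is a minimal illustration under stated assumptions, not the lecture's implementation: the class name `CuckooTable`, the `max_kicks` limit, and the two arithmetic hash functions on integer keys are all hypothetical choices made so the example is reproducible.

```python
class CuckooTable:
    def __init__(self, num_slots, max_kicks=16):
        self.n = num_slots
        self.tables = [[None] * num_slots, [None] * num_slots]
        self.max_kicks = max_kicks  # assumed bound on the eviction chain

    def _slot(self, which, key):
        # Two toy hash functions for integer keys (illustrative only).
        return key % self.n if which == 0 else (key // self.n) % self.n

    def get(self, key):
        # Only one location per table is checked, so lookups are O(1).
        for which in (0, 1):
            entry = self.tables[which][self._slot(which, key)]
            if entry is not None and entry[0] == key:
                return entry[1]
        return None

    def put(self, key, value):
        for which in (0, 1):  # update in place if the key is already stored
            i = self._slot(which, key)
            entry = self.tables[which][i]
            if entry is not None and entry[0] == key:
                self.tables[which][i] = (key, value)
                return
        for which in (0, 1):  # otherwise take any free candidate slot
            i = self._slot(which, key)
            if self.tables[which][i] is None:
                self.tables[which][i] = (key, value)
                return
        entry, which = (key, value), 0
        for _ in range(self.max_kicks):
            i = self._slot(which, entry[0])
            # Place the entry in hand and pick up whatever it displaced.
            entry, self.tables[which][i] = self.tables[which][i], entry
            if entry is None:
                return
            which = 1 - which  # re-hash the evicted entry into the other table
        # The entry still in hand is not stored at this point; a real table
        # would rebuild with new hash functions or a larger size.
        raise RuntimeError("eviction cycle suspected")


table = CuckooTable(num_slots=8)
for k in (1, 9, 17, 73):
    table.put(k, f"val{k}")
```

In this run, key 73 finds both of its candidate slots taken, evicts key 1 from the first table, and key 1 then lands in its alternate slot in the second table.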
