
ADVANCED DATABASE SYSTEMS: Parallel Join Algorithms (Hashing)

Lecture #17: ADVANCED DATABASE SYSTEMS, Parallel Join Algorithms (Hashing). @Andy_Pavlo // 15-721 // Spring 2020. Agenda: Background, Parallel Hash Join, Hash Functions, Hashing Schemes, Evaluation. PARALLEL JOIN ALGO


  1. 20 BUILD PHASE: The threads then scan either the tuples (or partitions) of R. For each tuple, hash the join key attribute for that tuple and add it to the appropriate bucket in the hash table. → The buckets should only be a few cache lines in size.
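
A minimal single-threaded sketch of the build step described above, assuming R is a list of tuples and the join key is one column. The names `build_hash_table`, `key_index`, and `num_buckets` are illustrative and not from the lecture, and a real system would split the scan across threads.

```python
from collections import defaultdict

def build_hash_table(r_tuples, key_index, num_buckets):
    """Hash each tuple's join key and append the tuple to that bucket."""
    buckets = defaultdict(list)
    for tup in r_tuples:
        slot = hash(tup[key_index]) % num_buckets
        buckets[slot].append(tup)
    return buckets

# Join key is column 0. Keys 1 and 9 share a bucket when num_buckets is 8,
# because CPython hashes a small non-negative int to itself.
R = [(1, "a"), (2, "b"), (9, "c")]
table = build_hash_table(R, key_index=0, num_buckets=8)
```

Keeping `num_buckets` large relative to the size of R keeps each bucket short, which is the point of the "few cache lines" guideline.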

  2. 21 HASH TABLE: Design Decision #1: Hash Function → How to map a large key space into a smaller domain. → Trade-off between being fast vs. collision rate. Design Decision #2: Hashing Scheme → How to handle key collisions after hashing. → Trade-off between allocating a large hash table vs. additional instructions to find/insert keys.

  3. 22 HASH FUNCTIONS: We do not want to use a cryptographic hash function for our join algorithm. We want something that is fast and will have a low collision rate. → Best Speed: Always return '1'. → Best Collision Rate: Perfect hashing. See SMHasher for a comprehensive hash function benchmark suite.
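
To make the speed versus collision-rate trade-off concrete, the sketch below compares the "always return 1" extreme with a simple non-cryptographic hash. FNV-1a is used here only as a stand-in because it fits in a few lines; it is not one of the functions named in the lecture, and `count_collisions` is a hypothetical helper.

```python
def always_one(key: bytes) -> int:
    """Fastest possible hash: every key maps to the same value."""
    return 1

def fnv1a_64(key: bytes) -> int:
    """64-bit FNV-1a, a small non-cryptographic hash (stand-in choice)."""
    h = 0xCBF29CE484222325              # FNV offset basis
    for byte in key:
        h ^= byte
        h = (h * 0x100000001B3) & 0xFFFFFFFFFFFFFFFF  # FNV prime, wrap to 64 bits
    return h

def count_collisions(hash_fn, keys, num_slots):
    """Count keys that land in a slot already taken by an earlier key."""
    used = set()
    collisions = 0
    for k in keys:
        slot = hash_fn(k) % num_slots
        if slot in used:
            collisions += 1
        used.add(slot)
    return collisions

keys = [str(i).encode() for i in range(1000)]
constant_collisions = count_collisions(always_one, keys, 2048)
fnv_collisions = count_collisions(fnv1a_64, keys, 2048)
```

The constant function collides on every key after the first, so all tuples end up in one bucket and every probe degenerates into a scan.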

  4. 23 HASH FUNCTIONS: CRC-64 (1975) → Used in networking for error detection. MurmurHash (2008) → Designed to be a fast, general-purpose hash function. Google CityHash (2011) → Designed to be faster for short keys (<64 bytes). Facebook XXHash (2012) → From the creator of zstd compression. Google FarmHash (2014) → Newer version of CityHash with better collision rates.

  5. 24 HASH FUNCTION BENCHMARK: [Figure: throughput (MB/sec, axis from 0 to 28000) versus key size (bytes, axis from 1 to 251) for crc64, std::hash, MurmurHash3, CityHash, FarmHash, and XXHash3 on an Intel Core i7-8700K @ 3.70GHz, with key sizes 32, 64, 128, and 192 marked. Source: Fredrik Widlund]

  6. 25 HASHING SCHEMES: Approach #1: Chained Hashing. Approach #2: Linear Probe Hashing. Approach #3: Robin Hood Hashing. Approach #4: Hopscotch Hashing. Approach #5: Cuckoo Hashing.

  7. 26 CHAINED HASHING: Maintain a linked list of buckets for each slot in the hash table. Resolve collisions by placing all elements with the same hash key into the same bucket. → To determine whether an element is present, hash to its bucket and scan for it. → Insertions and deletions are generalizations of lookups.
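
A minimal sketch of chained hashing as described above. A Python list stands in for each slot's linked list, and the class name `ChainedHashTable` is illustrative. Overwriting the value when a key is inserted twice is an assumption of this sketch; a join hash table would normally keep duplicates.

```python
class ChainedHashTable:
    def __init__(self, num_slots):
        # One chain per slot; a Python list stands in for the linked list.
        self.slots = [[] for _ in range(num_slots)]

    def _chain(self, key):
        return self.slots[hash(key) % len(self.slots)]

    def lookup(self, key):
        # Hash to the chain, then scan it for the key.
        for k, v in self._chain(key):
            if k == key:
                return v
        return None

    def insert(self, key, value):
        # Same scan as lookup; append only if the key is absent.
        chain = self._chain(key)
        for i, (k, _) in enumerate(chain):
            if k == key:
                chain[i] = (key, value)
                return
        chain.append((key, value))

    def delete(self, key):
        # Same scan as lookup; remove the entry if it is found.
        chain = self._chain(key)
        for i, (k, _) in enumerate(chain):
            if k == key:
                del chain[i]
                return True
        return False

t = ChainedHashTable(4)
t.insert(1, "one")
t.insert(5, "five")   # 5 % 4 == 1, so it chains behind key 1
t.insert(2, "two")
```

Insert and delete both start with the same scan as lookup, which is what "generalizations of lookups" refers to.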

  8-16. 27 CHAINED HASHING: [Animation: keys A through F are hashed into the bucket slots one at a time. B and F each land in their own bucket; D is appended to the chain that already holds A, and E is appended to the chain that already holds C. Final frame annotation: 64-bit Bucket Pointers = 48-bit Pointer + 16-bit Bloom Filter.]
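
The last frame's 64-bit bucket pointer (48-bit pointer plus 16-bit Bloom filter) can be illustrated with plain integer arithmetic. This is a sketch under assumptions: placing the filter in the upper 16 bits and choosing the filter bit from the top four bits of a 64-bit hash are illustrative choices, and all function names are hypothetical.

```python
PTR_BITS = 48
PTR_MASK = (1 << PTR_BITS) - 1

def tag_bit(key_hash):
    # Assumed mapping: the top 4 bits of a 64-bit hash select 1 of 16 filter bits.
    return 1 << ((key_hash >> 60) & 0xF)

def pack(pointer, bloom):
    # Assumed layout: filter in the upper 16 bits, pointer in the lower 48.
    assert 0 <= pointer <= PTR_MASK and 0 <= bloom <= 0xFFFF
    return (bloom << PTR_BITS) | pointer

def unpack(word):
    return word & PTR_MASK, word >> PTR_BITS

def add_key(word, key_hash):
    # Record a key's presence in the chain by setting its filter bit.
    pointer, bloom = unpack(word)
    return pack(pointer, bloom | tag_bit(key_hash))

def may_contain(word, key_hash):
    # False means the chain definitely lacks the key, so the scan can be skipped.
    _, bloom = unpack(word)
    return bool(bloom & tag_bit(key_hash))

word = pack(0x7FFFDEADBEEF, 0)
word = add_key(word, 0xA000000000000000)   # sets filter bit 10
```

A probe that gets `False` from `may_contain` can skip following the pointer, which avoids a cache miss on the chain.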

  17. 28 LINEAR PROBE HASHING: Single giant table of slots. Resolve collisions by linearly searching for the next free slot in the table. → To determine whether an element is present, hash to a location in the table and scan for it. → Must store the key in the table to know when to stop scanning. → Insertions and deletions are generalizations of lookups.

  18-25. 29 LINEAR PROBE HASHING: [Animation: keys A through F are inserted one at a time. A key whose hashed slot is already occupied is placed in the next free slot after it. Final slot order, top to bottom: B, A, C, D, E, F.]
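
A minimal sketch of linear probe hashing. The `HOME` map of home slots is hypothetical, chosen so that the sketch reproduces the slot order in the walkthrough above; deletion is omitted because it needs tombstones or slot shifting.

```python
class LinearProbeTable:
    def __init__(self, num_slots, hash_fn=hash):
        self.slots = [None] * num_slots   # each entry is (key, value)
        self.hash_fn = hash_fn

    def insert(self, key, value):
        n = len(self.slots)
        start = self.hash_fn(key) % n
        for step in range(n):
            pos = (start + step) % n
            entry = self.slots[pos]
            # Take the first free slot, or overwrite the same key.
            if entry is None or entry[0] == key:
                self.slots[pos] = (key, value)
                return pos
        raise RuntimeError("table is full")

    def lookup(self, key):
        n = len(self.slots)
        start = self.hash_fn(key) % n
        for step in range(n):
            entry = self.slots[(start + step) % n]
            if entry is None:       # an empty slot ends the scan
                return None
            if entry[0] == key:     # stored keys tell us when to stop
                return entry[1]
        return None

# Hypothetical home slots: C and E collide with A, D with C, F with E's final slot.
HOME = {"A": 1, "B": 0, "C": 1, "D": 2, "E": 1, "F": 4}
table = LinearProbeTable(8, hash_fn=HOME.__getitem__)
for k in "ABCDEF":
    table.insert(k, k.lower())
layout = [e[0] if e else None for e in table.slots]
```

With these home slots, E ends up three slots away from where it hashed, which is the kind of long probe sequence the later schemes try to limit.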

  26. 30 OBSERVATION: To reduce the number of wasteful comparisons during the join, it is important to avoid collisions of hashed keys. This requires a chained hash table with ~2× the number of slots as the number of elements in R.

  27. 31 ROBIN HOOD HASHING: Variant of linear probe hashing that steals slots from "rich" keys and gives them to "poor" keys. → Each key tracks the number of positions it is from its optimal position in the table. → On insert, a key takes the slot of another key if the first key is farther away from its optimal position than the second key. Reference: Robin Hood Hashing, Foundations of Computer Science, 1985.

  28-38. 32 ROBIN HOOD HASHING: [Animation: each entry stores its number of "jumps" from its first position, shown in brackets. A and B are inserted at [0]. C hashes to A's slot; A[0] == C[0], so C moves to the next slot as C[1]. D hashes to C's slot; C[1] > D[0], so D moves to the next slot as D[1]. E hashes to A's slot; A[0] == E[0] and C[1] == E[1], so E keeps moving; D[1] < E[2], so E takes D's slot as E[2] and D shifts down one slot as D[2]. F hashes to D's slot; D[2] > F[0], so F moves to the next slot as F[1]. Final slot order: B[0], A[0], C[1], E[2], D[2], F[1].]
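
A minimal sketch of Robin Hood insertion. The `HOME` map of home slots is hypothetical, chosen so that the sketch reproduces the walkthrough above, and keys are assumed to be distinct (no update-in-place handling).

```python
class RobinHoodTable:
    def __init__(self, num_slots, hash_fn=hash):
        self.slots = [None] * num_slots   # each entry is (key, value, distance)
        self.hash_fn = hash_fn

    def insert(self, key, value):
        n = len(self.slots)
        pos = self.hash_fn(key) % n
        dist = 0                          # jumps from the key's home slot
        for _ in range(n):
            entry = self.slots[pos]
            if entry is None:
                self.slots[pos] = (key, value, dist)
                return
            if entry[2] < dist:
                # The resident is "richer" (closer to home): take its slot
                # and keep probing on behalf of the evicted entry.
                self.slots[pos] = (key, value, dist)
                key, value, dist = entry
            pos = (pos + 1) % n
            dist += 1
        raise RuntimeError("table is full")

# Hypothetical home slots: C and E hash to A's slot, D to C's slot,
# and F to the slot D occupies after being displaced.
HOME = {"A": 1, "B": 0, "C": 1, "D": 2, "E": 1, "F": 4}
table = RobinHoodTable(8, hash_fn=HOME.__getitem__)
for k in "ABCDEF":
    table.insert(k, k.lower())
layout = [(e[0], e[2]) if e else None for e in table.slots]
```

Compared with plain linear probing on the same keys, the largest distance drops from 3 to 2, at the cost of an extra write when E displaces D.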

  39. 33 HOPSCOTCH HASHING: Variant of linear probe hashing where keys can move between positions in a neighborhood. → A neighborhood is a contiguous range of slots in the table. → The size of a neighborhood is a configurable constant. A key is guaranteed to be in its neighborhood or not exist in the table. Reference: Hopscotch Hashing, Symposium on Distributed Computing, 2008.

  40-60. 34 HOPSCOTCH HASHING: [Animation, Neighborhood Size = 3: the slides mark Neighborhood #1, #2, #3, and so on down the table. A hashes to neighborhood #3 and B to neighborhood #1. C is also placed in neighborhood #3, and D is placed in the slot after C. E hashes to neighborhood #3, which is full by then, so D is moved to a later slot and E takes D's old slot. F hashes to neighborhood #6 and is placed after D. Final slot order: B, A, C, E, D, F.]
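
A minimal sketch of hopscotch insertion with a neighborhood size of 3. Several things here are assumptions: the `HOME` map of home slots is hypothetical (chosen to reproduce the slot order in the walkthrough above), neighborhoods do not wrap around the end of the table, and keys are assumed to be distinct.

```python
H = 3  # neighborhood size

class HopscotchTable:
    def __init__(self, num_slots, hash_fn=hash):
        self.slots = [None] * num_slots   # each entry is (key, value)
        self.hash_fn = hash_fn

    def _home(self, key):
        return self.hash_fn(key) % len(self.slots)

    def lookup(self, key):
        # Only the H slots of the key's neighborhood are ever checked.
        home = self._home(key)
        for pos in range(home, min(home + H, len(self.slots))):
            entry = self.slots[pos]
            if entry is not None and entry[0] == key:
                return entry[1]
        return None

    def insert(self, key, value):
        home = self._home(key)
        free = home
        while free < len(self.slots) and self.slots[free] is not None:
            free += 1                      # linear probe for a free slot
        if free == len(self.slots):
            raise RuntimeError("no free slot; resize needed")
        while free - home >= H:            # free slot is outside the neighborhood
            for cand in range(free - H + 1, free):
                cand_home = self._home(self.slots[cand][0])
                if free - cand_home < H:   # cand may legally move to the free slot
                    self.slots[free] = self.slots[cand]
                    self.slots[cand] = None
                    free = cand
                    break
            else:
                raise RuntimeError("cannot displace; resize needed")
        self.slots[free] = (key, value)

# Hypothetical home slots (0-based): A, C, E share one neighborhood.
HOME = {"A": 2, "B": 0, "C": 2, "D": 3, "E": 2, "F": 5}
table = HopscotchTable(8, hash_fn=HOME.__getitem__)
for k in "ABCDEF":
    table.insert(k, k.lower())
layout = [e[0] if e else None for e in table.slots]
```

When E arrives, its neighborhood is full, so D (which is still within its own neighborhood one slot later) hops forward and E takes the vacated slot.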

  61. 35 CUCKOO HASHING: Use multiple tables with different hash functions. → On insert, check every table and pick any one that has a free slot. → If no table has a free slot, evict the element from one of them and then re-hash it to find a new location. Look-ups are always O(1) because only one location per hash table is checked.

  62-66. 36 CUCKOO HASHING: [Animation: two tables, Hash Table #1 and Hash Table #2. Insert X: compute hash1(X) and hash2(X); X is stored in Hash Table #1 at hash1(X). Insert Y: compute hash1(Y) and hash2(Y); Y is stored in Hash Table #2 at hash2(Y).]
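
A minimal sketch of cuckoo hashing with two tables. The two hash functions in `_pos` are hypothetical and only suit small integer keys, `max_kicks` is an illustrative bound on the eviction loop, and keys are assumed to be distinct. A real implementation would rebuild with new hash functions when the bound is hit rather than just raising.

```python
class CuckooTable:
    def __init__(self, num_slots, max_kicks=16):
        self.tables = [[None] * num_slots, [None] * num_slots]
        self.num_slots = num_slots
        self.max_kicks = max_kicks

    def _pos(self, which, key):
        # Two hypothetical hash functions for integer keys.
        if which == 0:
            return key % self.num_slots
        return (key // self.num_slots) % self.num_slots

    def lookup(self, key):
        # Exactly one candidate location per table is checked.
        for which in (0, 1):
            entry = self.tables[which][self._pos(which, key)]
            if entry is not None and entry[0] == key:
                return entry[1]
        return None

    def insert(self, key, value):
        # Check every table and take any free candidate slot.
        for which in (0, 1):
            pos = self._pos(which, key)
            if self.tables[which][pos] is None:
                self.tables[which][pos] = (key, value)
                return
        # Otherwise evict, then re-hash the evicted entry into the other table.
        item, which = (key, value), 0
        for _ in range(self.max_kicks):
            pos = self._pos(which, item[0])
            item, self.tables[which][pos] = self.tables[which][pos], item
            if item is None:
                return
            which = 1 - which
        # The entry still held in `item` has no home at this point.
        raise RuntimeError("eviction cycle; rebuild with new hash functions")

t = CuckooTable(4)
for k, v in [(1, "a"), (5, "b"), (4, "c"), (20, "d")]:
    t.insert(k, v)   # inserting 20 triggers a chain of three evictions
```

Keys 1 and 5 follow the walkthrough above: the first lands in table #1 and the second, finding that slot taken, lands in table #2.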
