

SLIDE 1

DATABASE SYSTEM IMPLEMENTATION

GT 4420/6422 // SPRING 2019 // @JOY_ARULRAJ
LECTURE #20: PARALLEL JOIN ALGORITHMS (HASHING)

SLIDE 2

ANATOMY OF A DATABASE SYSTEM

Process Manager:
→ Connection Manager + Admission Control

Query Processor:
→ Query Parser
→ Query Optimizer
→ Query Executor

Transactional Storage Manager:
→ Lock Manager (Concurrency Control)
→ Access Methods (or Indexes)
→ Buffer Pool Manager
→ Log Manager

Shared Utilities:
→ Memory Manager + Disk Manager
→ Networking Manager

Source: Anatomy of a Database System

SLIDE 3

TODAY’S AGENDA

→ Background
→ Parallel Hash Join
→ Hash Functions
→ Hashing Schemes
→ Evaluation

SLIDE 4

PARALLEL JOIN ALGORITHMS

Perform a join between two relations on multiple threads simultaneously to speed up the operation. Two main approaches:

→ Hash Join
→ Sort-Merge Join

We won’t discuss nested-loop joins…

SLIDE 5

OBSERVATION

Many OLTP DBMSs don’t implement hash join. But an index nested-loop join with a small number of target tuples is more or less equivalent to a hash join.

SLIDE 6

HASHING VS. SORTING

→ 1970s – Sorting
→ 1980s – Hashing
→ 1990s – Equivalent
→ 2000s – Hashing
→ 2010s – Hashing (Partitioned vs. Non-Partitioned)
→ 2020s – ???

SLIDES 7–13

PARALLEL JOIN ALGORITHMS

These slides build up a timeline of the hashing vs. sorting debate in the literature, one paper per slide:

→ Hashing is faster than Sort-Merge.
→ Sort-Merge is faster w/ wider SIMD.
SORT VS. HASH REVISITED: FAST JOIN IMPLEMENTATION ON MODERN MULTI-CORE CPUS (VLDB 2009)

→ Trade-offs between partitioning & non-partitioning Hash Join.
DESIGN AND EVALUATION OF MAIN MEMORY HASH JOIN ALGORITHMS FOR MULTI-CORE CPUS (SIGMOD 2011)

→ Sort-Merge is already faster than Hashing, even without SIMD.
MASSIVELY PARALLEL SORT-MERGE JOINS IN MAIN MEMORY MULTI-CORE DATABASE SYSTEMS (VLDB 2012)

→ Ignore what we said last year.
→ You really want to use Hashing!
MASSIVELY PARALLEL NUMA-AWARE HASH JOINS (IMDM 2013)

→ New optimizations and results for Radix Hash Join.
MAIN-MEMORY HASH JOINS ON MULTI-CORE CPUS: TUNING TO THE UNDERLYING HARDWARE (ICDE 2013)

→ Hold up everyone! Let's look at everything for real!
AN EXPERIMENTAL COMPARISON OF THIRTEEN RELATIONAL EQUI-JOINS IN MAIN MEMORY (SIGMOD 2016)

SLIDE 14

JOIN ALGORITHM DESIGN GOALS

Goal #1: Minimize Synchronization
→ Avoid taking latches during execution.

Goal #2: Minimize CPU Cache Misses
→ Ensure that data is always local to the worker thread.

SLIDE 15

IMPROVING CACHE BEHAVIOR

Factors that affect cache misses in a DBMS:
→ Cache + TLB capacity.
→ Locality (temporal and spatial).

Non-Random Access (Scan):
→ Cluster data to a cache line.
→ Execute more operations per cache line.

Random Access (Lookups):
→ Partition data to fit in cache + TLB.

Source: Johannes Gehrke

SLIDE 16

PARALLEL HASH JOINS

Hash join is the most important operator in a DBMS for OLAP workloads. It’s important that we speed it up by taking advantage of multiple cores.

→ We want to keep all of the cores busy, without becoming memory bound.

SLIDES 17–19

CLOUDERA IMPALA

% of Total CPU Time Spent in Query Operators (Workload: TPC-H Benchmark):

→ HASH JOIN: 49.6%
→ SEQ SCAN: 25.0%
→ AGGREGATE: 19.9%
→ UNION: 3.1%
→ OTHER: 2.4%

SLIDE 20

HASH JOIN (R⨝S)

Phase #1: Partition (optional)
→ Divide the tuples of R and S into sets using a hash on the join key.

Phase #2: Build
→ Scan relation R and create a hash table on the join key.

Phase #3: Probe
→ For each tuple in S, look up its join key in the hash table for R. If a match is found, output the combined tuple.
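To make the build and probe phases concrete, here is a minimal single-threaded sketch in C++. The Tuple layout and names are hypothetical, and std::unordered_multimap stands in for the specialized hash tables discussed later in the lecture:

```cpp
#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical tuple layout, for illustration only.
struct Tuple {
  uint64_t key;      // join key
  uint64_t payload;  // rest of the tuple (abbreviated)
};

// Build: scan R and create a hash table on the join key.
// Probe: for each tuple in S, look up its join key in R's table
// and emit the combined tuple on a match.
std::vector<std::pair<Tuple, Tuple>> HashJoin(const std::vector<Tuple>& r,
                                              const std::vector<Tuple>& s) {
  std::unordered_multimap<uint64_t, Tuple> table;
  table.reserve(r.size());
  for (const Tuple& t : r) table.emplace(t.key, t);  // build phase

  std::vector<std::pair<Tuple, Tuple>> out;
  for (const Tuple& t : s) {                         // probe phase
    auto range = table.equal_range(t.key);
    for (auto it = range.first; it != range.second; ++it)
      out.emplace_back(it->second, t);
  }
  return out;
}
```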

SLIDE 21

PARTITION PHASE

Split the input relations into partitioned buffers by hashing the tuples’ join key(s).

→ Ideally the cost of partitioning is less than the cost of cache misses during the build phase.
→ Sometimes called hybrid hash join.

Contents of the buffers depend on the storage model:
→ NSM: Either the entire tuple or a subset of attributes.
→ DSM: Only the columns needed for the join + offset.

SLIDE 22

PARTITION PHASE

Approach #1: Non-Blocking Partitioning
→ Only scan the input relation once.
→ Produce output incrementally.

Approach #2: Blocking Partitioning (Radix)
→ Scan the input relation multiple times.
→ Only materialize results all at once.
→ Sometimes called radix hash join.

SLIDE 23

NON-BLOCKING PARTITIONING

Scan the input relation only once and generate the output on-the-fly.

Approach #1: Shared Partitions
→ Single global set of partitions that all threads update.
→ Have to use a latch to synchronize threads.

Approach #2: Private Partitions
→ Each thread has its own set of partitions.
→ Have to consolidate them after all threads finish.

SLIDES 24–31

SHARED PARTITIONS

[Animation: worker threads scan the data table (attributes A, B, C) and apply hashP(key) to each tuple to route it into the single global set of partitions P1, P2, …, Pn; each insert must be latched.]

SLIDES 32–37

PRIVATE PARTITIONS

[Animation: each worker thread scans the data table (attributes A, B, C) and applies hashP(key) to fill its own private partitions; these are then combined into the global partitions P1, P2, …, Pn.]
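A minimal sketch of the private-partitions approach in C++ (HashP, the partition count, and the key type are our own illustrative choices): each thread writes only to its own buffers, so no latches are needed, and a final pass consolidates the partitions.

```cpp
#include <algorithm>
#include <cstdint>
#include <thread>
#include <vector>

using Key = uint64_t;
constexpr size_t kNumPartitions = 64;

// hashP(key): top 6 bits of a multiplicative hash -> partition id in [0, 64).
size_t HashP(Key k) { return (k * 0x9E3779B97F4A7C15ull) >> 58; }

std::vector<std::vector<Key>> PartitionPrivate(const std::vector<Key>& input,
                                               size_t num_threads) {
  // Each thread gets its own set of partitions: no synchronization needed.
  std::vector<std::vector<std::vector<Key>>> local(
      num_threads, std::vector<std::vector<Key>>(kNumPartitions));
  std::vector<std::thread> workers;
  const size_t chunk = (input.size() + num_threads - 1) / num_threads;
  for (size_t t = 0; t < num_threads; ++t) {
    workers.emplace_back([&, t] {
      const size_t begin = t * chunk;
      const size_t end = std::min(input.size(), begin + chunk);
      for (size_t i = begin; i < end; ++i)
        local[t][HashP(input[i])].push_back(input[i]);
    });
  }
  for (auto& w : workers) w.join();

  // Consolidate the private partitions after all threads finish.
  std::vector<std::vector<Key>> merged(kNumPartitions);
  for (const auto& parts : local)
    for (size_t p = 0; p < kNumPartitions; ++p)
      merged[p].insert(merged[p].end(), parts[p].begin(), parts[p].end());
  return merged;
}
```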

SLIDE 38

RADIX PARTITIONING

Scan the input relation multiple times to generate the partitions. Multi-step pass over the relation:

→ Step #1: Scan R and compute a histogram of the # of tuples per hash key for the radix at some offset.
→ Step #2: Use this histogram to determine output offsets by computing the prefix sum.
→ Step #3: Scan R again and partition the tuples according to the hash key.

SLIDES 39–43

RADIX

The radix is the value of an integer at a particular position (using its base).

Input: 89 12 23 08 41 64
Radix (ones digit): 9 2 3 8 1 4
Radix (tens digit): 8 1 2 0 4 6

SLIDES 44–48

PREFIX SUM

The prefix sum of a sequence of numbers (x0, x1, …, xn) is a second sequence of numbers (y0, y1, …, yn) that is a running total of the input sequence.

Input:      1 2 3 4 5 6
Prefix Sum: 1 3 6 10 15 21
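The running total is a one-liner with the C++ standard library; this minimal example reproduces the slide's numbers:

```cpp
#include <cstdio>
#include <numeric>
#include <vector>

// Inclusive prefix sum of the slide's example: y_i = x_0 + x_1 + ... + x_i.
int main() {
  std::vector<int> x = {1, 2, 3, 4, 5, 6};
  std::vector<int> y(x.size());
  std::partial_sum(x.begin(), x.end(), y.begin());
  for (int v : y) std::printf("%d ", v);  // prints: 1 3 6 10 15 21
  return 0;
}
```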

SLIDES 49–65

RADIX PARTITIONS

[Animation: two CPUs scan the input; hashP(key) yields the hash values 07, 18, 19, 07 (CPU 0) and 03, 11, 15, 10 (CPU 1), and the leading digit is the radix used for partitioning.]

Step #1: Inspect input, create histograms.
→ CPU 0: Partition 0: 2, Partition 1: 2
→ CPU 1: Partition 0: 1, Partition 1: 3

Step #2: Compute output offsets.
→ Output order: Partition 0, CPU 0 | Partition 0, CPU 1 | Partition 1, CPU 0 | Partition 1, CPU 1

Step #3: Read input and partition.
→ Partition 0: 07, 07, 03
→ Partition 1: 18, 19, 11, 15, 10

Recursively repeat until the target number of partitions has been created.

Source: Spyros Blanas
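Putting the three steps together: a minimal single-threaded sketch of one radix pass in C++ (the function name and 64-bit keys are illustrative assumptions). Step #1 builds the histogram, Step #2 turns it into per-partition output offsets via an exclusive prefix sum, and Step #3 scatters the tuples.

```cpp
#include <cstdint>
#include <vector>

// One radix pass over 64-bit keys. `shift` selects which bits of the
// hash act as the radix for this pass; `radix_bits` sets the fan-out.
std::vector<uint64_t> RadixPass(const std::vector<uint64_t>& keys,
                                unsigned shift, unsigned radix_bits) {
  const size_t fanout = size_t{1} << radix_bits;
  const uint64_t mask = fanout - 1;

  // Step #1: histogram of the # of tuples per radix value.
  std::vector<size_t> hist(fanout, 0);
  for (uint64_t k : keys) ++hist[(k >> shift) & mask];

  // Step #2: exclusive prefix sum -> starting output offset per partition.
  std::vector<size_t> offset(fanout, 0);
  for (size_t i = 1; i < fanout; ++i) offset[i] = offset[i - 1] + hist[i - 1];

  // Step #3: second scan scatters each tuple to its partition's region.
  std::vector<uint64_t> out(keys.size());
  for (uint64_t k : keys) out[offset[(k >> shift) & mask]++] = k;
  return out;  // keys now grouped contiguously by radix value
}
```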

SLIDE 66

BUILD PHASE

The threads then scan either the tuples or the partitions of R. For each tuple, hash the join key attribute for that tuple and add it to the appropriate bucket in the hash table.

→ The buckets should only be a few cache lines in size.

SLIDE 67

HASH TABLE

Design Decision #1: Hash Function
→ How to map a large key space into a smaller domain.
→ Trade-off between speed vs. collision rate.

Design Decision #2: Hashing Scheme
→ How to handle key collisions after hashing.
→ Trade-off between allocating a large hash table vs. additional instructions to find/insert keys.

SLIDE 68

HASH FUNCTIONS

We don’t want to use a cryptographic hash function for our join algorithm. We want something that is fast and has a low collision rate.

SLIDE 69

HASH FUNCTIONS

MurmurHash (2008)
→ Designed to be a fast, general-purpose hash function.

Google CityHash (2011)
→ Based on ideas from MurmurHash2.
→ Designed to be faster for short keys (<64 bytes).

Google FarmHash (2014)
→ Newer version of CityHash with better collision rates.

CLHash (2016)
→ Fast hashing function based on carry-less multiplication.

SLIDES 70–72

HASH FUNCTION BENCHMARKS

[Charts: throughput (MB/sec) vs. key size (bytes, 1–251, with 32/64/128/192 highlighted) for std::hash, MurmurHash3, CityHash, FarmHash, and CLHash. Intel Core i7-8700K @ 3.70GHz.]

Source: Fredrik Widlund

SLIDE 73

HASHING SCHEMES

Approach #1: Chained Hashing
Approach #2: Linear Hashing
Approach #3: Robin Hood Hashing
Approach #4: Cuckoo Hashing

SLIDE 74

CHAINED HASHING

Maintain a linked list of “buckets” for each slot in the hash table. Resolve collisions by placing all elements with the same hash key into the same bucket.

→ To determine whether an element is present, hash to its bucket and scan for it.
→ Insertions and deletions are generalizations of lookups.

SLIDE 75

CHAINED HASHING

[Figure: hash(key) selects a slot; each slot points to a linked list of buckets, terminated by Ø.]

SLIDE 76

LINEAR HASHING

Single giant table of slots. Resolve collisions by linearly searching for the next free slot in the table.

→ To determine whether an element is present, hash to a location in the table and scan for it.
→ Have to store the key in the table to know when to stop scanning.
→ Insertions are generalizations of lookups.
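A minimal linear-probing sketch in C++ (the class and hash function are our own illustration, not a specific DBMS's implementation); note how the stored key tells a lookup when to stop scanning:

```cpp
#include <cstdint>
#include <optional>
#include <utility>
#include <vector>

// Open-addressing table with linear probing. Keys are stored in the
// table so a probe knows when it has reached an empty slot and can stop.
class LinearProbeTable {
 public:
  explicit LinearProbeTable(size_t capacity) : slots_(capacity) {}

  bool Insert(uint64_t key, uint64_t value) {
    for (size_t i = 0; i < slots_.size(); ++i) {
      size_t idx = (Hash(key) + i) % slots_.size();
      if (!slots_[idx]) {              // next free slot
        slots_[idx] = {key, value};
        return true;
      }
    }
    return false;  // table full
  }

  std::optional<uint64_t> Lookup(uint64_t key) const {
    for (size_t i = 0; i < slots_.size(); ++i) {
      size_t idx = (Hash(key) + i) % slots_.size();
      if (!slots_[idx]) return std::nullopt;  // empty slot: stop scanning
      if (slots_[idx]->first == key) return slots_[idx]->second;
    }
    return std::nullopt;
  }

 private:
  static size_t Hash(uint64_t k) { return k * 0x9E3779B97F4A7C15ull; }
  std::vector<std::optional<std::pair<uint64_t, uint64_t>>> slots_;
};
```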

SLIDES 77–85

LINEAR HASHING

[Animation: keys A–F are inserted into the slot chosen by hash(key); when a key collides with an occupied slot, it scans linearly to the next free slot in the table.]

SLIDE 86

OBSERVATION

To reduce the # of wasteful comparisons during the join, it is important to avoid collisions of hashed keys. This requires a chained hash table with ~2x the number of slots as the # of elements in R.

SLIDE 87

ROBIN HOOD HASHING

Variant of linear hashing that steals slots from “rich” keys and gives them to “poor” keys.

→ Each key tracks the number of positions it is from its optimal position in the table.
→ On insert, a key takes the slot of another key if the first key is farther away from its optimal position than the second key.

ROBIN HOOD HASHING (Foundations of Computer Science, 1985)
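A minimal sketch of Robin Hood insertion in C++ (illustrative names; assumes the table never completely fills):

```cpp
#include <cstdint>
#include <optional>
#include <utility>
#include <vector>

// Each entry tracks its probe distance, i.e. the # of "jumps" from the
// slot that hash(key) originally chose for it.
struct Entry {
  uint64_t key;
  uint64_t value;
  size_t dist;
};

class RobinHoodTable {
 public:
  explicit RobinHoodTable(size_t capacity) : slots_(capacity) {}

  void Insert(uint64_t key, uint64_t value) {
    Entry e{key, value, 0};
    size_t idx = Hash(key) % slots_.size();
    while (true) {
      if (!slots_[idx]) {                // free slot: claim it
        slots_[idx] = e;
        return;
      }
      if (slots_[idx]->dist < e.dist)    // resident is "richer" (closer to home)
        std::swap(*slots_[idx], e);      // steal its slot; it probes onward
      idx = (idx + 1) % slots_.size();
      ++e.dist;
    }
  }

 private:
  static size_t Hash(uint64_t k) { return k * 0x9E3779B97F4A7C15ull; }
  std::vector<std::optional<Entry>> slots_;
};
```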

SLIDES 88–103

ROBIN HOOD HASHING

[Animation: keys are inserted with their probe distance (the # of "jumps" from their first position) in brackets. A[0] and B[0] land in their hashed slots. C collides with A (A[0] == C[0]) and probes onward as C[1]. D hashes to C's slot, but C[1] > D[0], so C keeps the slot and D settles as D[1]. E collides with A, C, and D in turn (A[0] == E[0], C[1] == E[1], D[1] < E[2]); E is now "poorer," so E[2] steals D's slot and D probes onward as D[2]. F hashes to D's slot, but D[2] > F[0], so F settles as F[1].]

SLIDE 104

CUCKOO HASHING

Use multiple tables with different hash functions.

→ On insert, check every table and pick any one that has a free slot.
→ If no table has a free slot, evict the element from one of them and then re-hash it to find a new location.

Look-ups are always O(1) because only one location per hash table is checked.
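A minimal two-table cuckoo sketch in C++ (illustrative class and hash functions); the fixed displacement limit stands in for real cycle detection:

```cpp
#include <cstdint>
#include <optional>
#include <utility>
#include <vector>

// A lookup checks exactly one slot per table, so it is always O(1).
class CuckooTable {
 public:
  explicit CuckooTable(size_t capacity) : t1_(capacity), t2_(capacity) {}

  // Returns false after too many displacements (a probable cycle); a real
  // implementation would then rebuild both tables with new hash functions.
  bool Insert(uint64_t key, uint64_t value) {
    std::pair<uint64_t, uint64_t> e{key, value};
    bool evict_from_t1 = true;
    for (int attempt = 0; attempt < 32; ++attempt) {
      size_t i1 = Hash1(e.first) % t1_.size();
      size_t i2 = Hash2(e.first) % t2_.size();
      if (!t1_[i1]) { t1_[i1] = e; return true; }  // free slot in table #1
      if (!t2_[i2]) { t2_[i2] = e; return true; }  // free slot in table #2
      auto& victim = evict_from_t1 ? t1_[i1] : t2_[i2];
      std::swap(e, *victim);      // evict a resident; re-home it next round
      evict_from_t1 = !evict_from_t1;
    }
    return false;
  }

  bool Contains(uint64_t key) const {
    const auto& s1 = t1_[Hash1(key) % t1_.size()];
    const auto& s2 = t2_[Hash2(key) % t2_.size()];
    return (s1 && s1->first == key) || (s2 && s2->first == key);
  }

 private:
  static size_t Hash1(uint64_t k) { return k * 0x9E3779B97F4A7C15ull; }
  static size_t Hash2(uint64_t k) { return (k ^ (k >> 31)) * 0xC2B2AE3D27D4EB4Full; }
  std::vector<std::optional<std::pair<uint64_t, uint64_t>>> t1_, t2_;
};
```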

SLIDES 105–121

CUCKOO HASHING

[Animation: X is inserted; both hash1(X) and hash2(X) point to free slots, and X lands in Hash Table #1. Y is inserted and lands in Hash Table #2. Z hashes to occupied slots in both tables, so Z evicts Y; Y is re-hashed with hash1(Y) and evicts X; X is re-hashed with hash2(X) and lands in a free slot in Hash Table #2.]

SLIDE 122

CUCKOO HASHING

Threads have to make sure that they don’t get stuck in an infinite loop when moving keys. If we find a cycle, then we can rebuild all of the hash tables with new hash functions.

→ With two hash functions, we (probably) won’t need to rebuild the table until it is about 50% full.
→ With three hash functions, we (probably) won’t need to rebuild the table until it is about 90% full.

SLIDE 123

PROBE PHASE

For each tuple in S, hash its join key and check for a match in the corresponding bucket of the hash table constructed for R.

→ If the inputs were partitioned, then assign each thread a unique partition.
→ Otherwise, synchronize the threads’ access to the cursor on S.

SLIDES 124–130

PROBE PHASE – BLOOM FILTER

Create a Bloom Filter during the build phase when the key is likely to not exist in the hash table.

→ Threads check the filter before probing the hash table. This will be faster since the filter will fit in CPU caches.
→ Sometimes called sideways information passing.

[Animation: the build side (A) populates a Bloom filter alongside the hash table; the probe side (B) consults the filter before each probe.]

MICRO ADAPTIVITY IN VECTORWISE (SIGMOD 2013)
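A minimal Bloom filter sketch in C++ (illustrative; real systems tune the bit count and the number of hash probes): a negative answer lets the probe skip the hash table entirely, while a positive answer may be a false positive and must still be verified.

```cpp
#include <cstdint>
#include <vector>

// Tiny Bloom filter with two derived hash probes per key.
// Built during the build phase; consulted before probing the hash table.
class BloomFilter {
 public:
  explicit BloomFilter(size_t num_bits) : bits_(num_bits, false) {}

  void Add(uint64_t key) {
    bits_[H1(key) % bits_.size()] = true;
    bits_[H2(key) % bits_.size()] = true;
  }

  // False means "definitely not in R": skip the hash-table probe.
  // True may be a false positive, so the probe must still verify.
  bool MayContain(uint64_t key) const {
    return bits_[H1(key) % bits_.size()] && bits_[H2(key) % bits_.size()];
  }

 private:
  static uint64_t H1(uint64_t k) { return k * 0x9E3779B97F4A7C15ull; }
  static uint64_t H2(uint64_t k) { return (k ^ (k >> 33)) * 0xC2B2AE3D27D4EB4Full; }
  std::vector<bool> bits_;
};
```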

SLIDE 131

HASH JOIN VARIANTS

                           No-P     Shared-P            Private-P             Radix
Partitioning               No       Yes                 Yes                   Yes
Input scans                –        1                   1                     2
Sync during partitioning   –        Spinlock per tuple  Barrier, once at end  Barrier, 4 * #passes
Hash table                 Shared   Private             Private               Private
Sync during build phase    Yes      No                  No                    No
Sync during probe phase    No       No                  No                    No

SLIDE 132

BENCHMARKS

Primary key – foreign key join:
→ Outer Relation (Build): 16M tuples, 16 bytes each
→ Inner Relation (Probe): 256M tuples, 16 bytes each

Uniform and highly skewed (Zipf; s=1.25) key distributions.
No output materialization.

DESIGN AND EVALUATION OF MAIN MEMORY HASH JOIN ALGORITHMS FOR MULTI-CORE CPUS (SIGMOD 2011)

SLIDES 133–134

HASH JOIN – UNIFORM DATA SET

[Chart: cycles per output tuple, broken down into partition/build/probe time.
No Partitioning: 60.2 | Shared Partitioning: 67.6 | Private Partitioning: 76.8 | Radix: 47.3
Intel Xeon CPU X5650 @ 2.66GHz, 6 cores with 2 threads per core.]

→ Radix is 24% faster than No Partitioning, even though No Partitioning incurs 3.3x the cache misses and 70x the TLB misses.

Source: Spyros Blanas

SLIDE 135

HASH JOIN – SKEWED DATA SET

[Chart: cycles per output tuple, broken down into partition/build/probe time.
No Partitioning: 25.2 | Shared Partitioning: 167.1 | Private Partitioning: 56.5 | Radix: 50.7
Intel Xeon CPU X5650 @ 2.66GHz, 6 cores with 2 threads per core.]

Source: Spyros Blanas

SLIDE 136

OBSERVATION

We have ignored a lot of important parameters for all of these algorithms so far:

→ Whether to use partitioning or not?
→ How many partitions to use?
→ How many passes to take in the partitioning phase?

In a real DBMS, the optimizer will select what it thinks are good values based on what it knows about the data (and maybe the hardware).

SLIDE 137

PARTING THOUGHTS

On modern CPUs, a simple hash join algorithm that does not partition its inputs is competitive. There are additional vectorized execution optimizations possible in hash joins that we didn’t talk about. But these don’t really help…

SLIDE 138

NEXT CLASS

Parallel Sort-Merge Joins