SLIDE 1

15-721 ADVANCED DATABASE SYSTEMS
Lecture #18 – Parallel Join Algorithms (Hashing)
Andy Pavlo (@Andy_Pavlo) // Carnegie Mellon University // Spring 2017

SLIDE 2

TODAY'S AGENDA

Background
Parallel Hash Join
Hash Functions
Hash Table Implementations
Evaluation

SLIDE 3

PARALLEL JOIN ALGORITHMS

Perform a join between two relations on multiple threads simultaneously to speed up the operation. Two main approaches:

→ Hash Join
→ Sort-Merge Join

We won’t discuss nested-loop joins…

SLIDE 4

OBSERVATION

Many OLTP DBMSs don’t implement hash join. But an index nested-loop join with a small number of target tuples is more or less equivalent to a hash join.

SLIDE 5

HASHING VS. SORTING

1970s – Sorting
1980s – Hashing
1990s – Equivalent
2000s – Hashing
2010s – ???

SLIDE 6

PARALLEL JOIN ALGORITHMS

SORT VS. HASH REVISITED: FAST JOIN IMPLEMENTATION ON MODERN MULTI-CORE CPUS (VLDB 2009)
→ Hashing is faster than Sort-Merge.
→ Sort-Merge will be faster with wider SIMD.

MASSIVELY PARALLEL SORT-MERGE JOINS IN MAIN MEMORY MULTI-CORE DATABASE SYSTEMS (VLDB 2012)
→ Sort-Merge is already faster, even without SIMD.

MAIN-MEMORY HASH JOINS ON MULTI-CORE CPUS: TUNING TO THE UNDERLYING HARDWARE (ICDE 2013)
→ New optimizations and results for Radix Hash Join.

Source: Cagri Balkesen

SLIDE 7

JOIN ALGORITHM DESIGN GOALS

Goal #1: Minimize Synchronization

→ Avoid taking latches during execution.

Goal #2: Minimize CPU Cache Misses

→ Ensure that data is always local to worker thread.

SLIDE 8

IMPROVING CACHE BEHAVIOR

Factors that affect cache misses in a DBMS:

→ Cache + TLB capacity.
→ Locality (temporal and spatial).

Non-Random Access (Scan):

→ Clustering to a cache line.
→ Execute more operations per cache line.

Random Access (Lookups):

→ Partition data to fit in cache + TLB.


Source: Johannes Gehrke

SLIDE 9

PARALLEL HASH JOINS

Hash join is the most important operator in a DBMS for OLAP workloads. It’s important that we speed it up by taking advantage of multiple cores.

→ We want to keep all of the cores busy, without becoming memory bound.

DESIGN AND EVALUATION OF MAIN MEMORY HASH JOIN ALGORITHMS FOR MULTI-CORE CPUS (SIGMOD 2011)

SLIDE 10

CLOUDERA IMPALA

% of Total CPU Time Spent in Query Operators (Workload: TPC-H Benchmark):

→ HASH JOIN: 49.6%
→ SEQ SCAN: 25.0%
→ UNION: 3.1%
→ AGGREGATE: 19.9%
→ OTHER: 2.4%

SLIDE 11

HASH JOIN (R⨝S)

Phase #1: Partition (optional)

→ Divide the tuples of R and S into sets using a hash on the join key.

Phase #2: Build

→ Scan relation R and create a hash table on join key.

Phase #3: Probe

→ For each tuple in S, look up its join key in the hash table for R. If a match is found, output the combined tuple.
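
A minimal single-threaded sketch of the build and probe phases (with no partition phase), assuming tuples are plain key/payload pairs and using std::unordered_multimap as the hash table; the names Tuple, R, and S are illustrative, not the implementation evaluated later in the lecture:

```cpp
#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

struct Tuple { uint64_t key; uint64_t payload; };

// Phase #2 (Build): scan R and insert every tuple into a hash table on the join key.
// Phase #3 (Probe): scan S and emit one output pair per matching R tuple.
std::vector<std::pair<Tuple, Tuple>> HashJoin(const std::vector<Tuple>& R,
                                              const std::vector<Tuple>& S) {
    std::unordered_multimap<uint64_t, Tuple> table;
    table.reserve(R.size());
    for (const Tuple& r : R) {            // Build
        table.emplace(r.key, r);
    }
    std::vector<std::pair<Tuple, Tuple>> out;
    for (const Tuple& s : S) {            // Probe
        auto [lo, hi] = table.equal_range(s.key);
        for (auto it = lo; it != hi; ++it) {
            out.emplace_back(it->second, s);
        }
    }
    return out;
}
```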

SLIDE 12

PARTITION PHASE

Split the input relations into partitioned buffers by hashing the tuples’ join key(s).

→ The hash function used for this phase should be different than the one used in the build phase.
→ Ideally the cost of partitioning is less than the cost of cache misses during the build phase.

The contents of the buffers depend on the storage model:

→ NSM: Either the entire tuple or a subset of attributes.
→ DSM: Only the columns needed for the join + offset.

SLIDE 13

PARTITION PHASE

Approach #1: Non-Blocking Partitioning

→ Only scan the input relation once.
→ Produce output incrementally.

Approach #2: Blocking Partitioning (Radix)

→ Scan the input relation multiple times.
→ Only materialize results all at once.

SLIDE 14

NON-BLOCKING PARTITIONING

Scan the input relation only once and generate the output on-the-fly.

Approach #1: Shared Partitions

→ Single global set of partitions that all threads update.
→ Have to use a latch to synchronize threads.

Approach #2: Private Partitions

→ Each thread has its own set of partitions.
→ Have to consolidate them after all threads finish.
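
A rough sketch of the private-partitions approach, assuming a fixed partition count, a modulo hash standing in for hashP(key), and std::thread workers; the consolidation loop at the end is the extra work that the shared-partitions approach trades for per-insert latching:

```cpp
#include <algorithm>
#include <cstdint>
#include <thread>
#include <vector>

constexpr size_t kNumPartitions = 64;
using Partition = std::vector<uint64_t>;

// Each worker hashes its chunk of keys into its own private partitions,
// then the partitions are consolidated once all threads have finished.
std::vector<Partition> PartitionPrivate(const std::vector<uint64_t>& keys,
                                        size_t num_threads) {
    std::vector<std::vector<Partition>> local(
        num_threads, std::vector<Partition>(kNumPartitions));
    std::vector<std::thread> workers;
    size_t chunk = (keys.size() + num_threads - 1) / num_threads;
    for (size_t t = 0; t < num_threads; ++t) {
        workers.emplace_back([&, t] {
            size_t begin = t * chunk;
            size_t end = std::min(keys.size(), begin + chunk);
            for (size_t i = begin; i < end; ++i) {
                local[t][keys[i] % kNumPartitions].push_back(keys[i]);  // hashP(key)
            }
        });
    }
    for (auto& w : workers) w.join();

    // Consolidate: concatenate each thread's buffer for partition p.
    std::vector<Partition> global(kNumPartitions);
    for (size_t p = 0; p < kNumPartitions; ++p)
        for (size_t t = 0; t < num_threads; ++t)
            global[p].insert(global[p].end(), local[t][p].begin(), local[t][p].end());
    return global;
}
```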

SLIDE 15

SHARED PARTITIONS

[Figure: every thread scans the shared Data Table (columns A, B, C), applies hashP(key) to each tuple, and writes into a single global set of partitions P1, P2, …, Pn.]

SLIDE 19

PRIVATE PARTITIONS

[Figure: each thread scans its portion of the Data Table (columns A, B, C), applies hashP(key) to each tuple, and writes into its own private partitions, which are then combined into the global partitions P1, P2, …, Pn.]

SLIDE 22

RADIX PARTITIONING

Scan the input relation multiple times to generate the partitions, using a multi-step pass over the relation:

→ Step #1: Scan R and compute a histogram of the # of tuples per hash key for the radix at some offset.
→ Step #2: Use this histogram to determine output offsets by computing the prefix sum.
→ Step #3: Scan R again and partition the tuples according to the hash key.
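
A single-threaded sketch of one such pass, assuming 64-bit keys and a radix taken from radix_bits of the key starting at bit position shift (both hypothetical parameters); the multi-threaded version on the following slides splits the histogram and prefix sum per CPU:

```cpp
#include <cstdint>
#include <vector>

// One radix pass: histogram, prefix sum for output offsets, then scatter.
std::vector<uint64_t> RadixPass(const std::vector<uint64_t>& keys,
                                unsigned radix_bits, unsigned shift) {
    const size_t fanout = size_t(1) << radix_bits;
    const uint64_t mask = fanout - 1;

    // Step #1: histogram of tuples per radix value.
    std::vector<size_t> hist(fanout, 0);
    for (uint64_t k : keys) hist[(k >> shift) & mask]++;

    // Step #2: exclusive prefix sum -> starting write offset per partition.
    std::vector<size_t> offset(fanout, 0);
    for (size_t p = 1; p < fanout; ++p) offset[p] = offset[p - 1] + hist[p - 1];

    // Step #3: scatter tuples into their partition's output range.
    std::vector<uint64_t> out(keys.size());
    for (uint64_t k : keys) out[offset[(k >> shift) & mask]++] = k;
    return out;  // tuples with the same radix value now occupy a contiguous range
}
```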

SLIDE 23

RADIX

The radix is the value of an integer at a particular position (using its base).

Input:              89 12 23 08 41 64
Radix (ones digit):  9  2  3  8  1  4
Radix (tens digit):  8  1  2  0  4  6

SLIDE 26

PREFIX SUM

The prefix sum of a sequence of numbers (x0, x1, …, xn) is a second sequence of numbers (y0, y1, …, yn) that is a running total of the input sequence.

Input:      1 2 3  4  5  6
Prefix Sum: 1 3 6 10 15 21
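
As a concrete example, the running total can be computed in one pass with std::partial_sum; this is the same computation that later turns per-partition histogram counts into write offsets:

```cpp
#include <numeric>
#include <vector>

int main() {
    std::vector<int> input = {1, 2, 3, 4, 5, 6};
    std::vector<int> prefix(input.size());
    // Inclusive prefix sum: prefix[i] = input[0] + ... + input[i].
    std::partial_sum(input.begin(), input.end(), prefix.begin());
    // prefix == {1, 3, 6, 10, 15, 21}
}
```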

SLIDE 30

RADIX PARTITIONS

Step #1: Inspect input, create histograms

→ CPU 0 histogram: Partition 0: 2, Partition 1: 2
→ CPU 1 histogram: Partition 0: 1, Partition 1: 3

Step #2: Compute output offsets

→ A prefix sum over the per-CPU histograms gives each CPU its write offset into Partition 0 and Partition 1.

Step #3: Read input and partition

→ Each CPU scans its input again and writes every tuple into Partition 0 or Partition 1 at its assigned offset.

Recursively repeat until the target number of partitions has been created.

Source: Spyros Blanas

SLIDE 41

BUILD PHASE

The threads then scan either the tuples or the partitions of R. For each tuple, hash the join key attribute for that tuple and add it to the appropriate bucket in the hash table.

→ The buckets should only be a few cache lines in size.
→ The hash function must be different than the one that was used in the partition phase.

SLIDE 42

HASH FUNCTIONS

We don’t want to use a cryptographic hash function for our join algorithm. We want something that is fast and will have a low collision rate.

SLIDE 43

HASH FUNCTIONS

MurmurHash (2008)

→ Designed to be a fast, general-purpose hash function.

Google CityHash (2011)

→ Based on ideas from MurmurHash2.
→ Designed to be faster for short keys (<64 bytes).

Google FarmHash (2014)

→ Newer version of CityHash with better collision rates.

CLHash (2016)

→ Fast hashing function based on carry-less multiplication.

SLIDE 44

HASH FUNCTION BENCHMARKS

[Figure: hash throughput (MB/sec) vs. key size (bytes) for std::hash, MurmurHash3, CityHash, FarmHash, and CLHash on an Intel Xeon CPU E5-2630v4 @ 2.20GHz. Source: Fredrik Widlund]

SLIDE 46

HASH TABLE IMPLEMENTATIONS

Approach #1: Chained Hash Table
Approach #2: Open-Addressing Hash Table
Approach #3: Cuckoo Hash Table

SLIDE 47

CHAINED HASH TABLE

Maintain a linked list of “buckets” for each slot in the hash table. Resolve collisions by placing all elements with the same hash key into the same bucket.

→ To determine whether an element is present, hash to its bucket and scan for it.
→ Insertions and deletions are generalizations of lookups.
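
A minimal sketch of a chained table for 64-bit keys, assuming one entry per chain node; the build phase described earlier would instead use buckets of a few cache lines each, and this sketch leaks its entries rather than managing their memory:

```cpp
#include <cstdint>
#include <vector>

struct Entry {
    uint64_t key;
    uint64_t value;
    Entry* next;  // chain of entries that hashed to the same slot
};

class ChainedHashTable {
public:
    explicit ChainedHashTable(size_t num_slots) : slots_(num_slots, nullptr) {}

    void Insert(uint64_t key, uint64_t value) {
        size_t slot = Hash(key);
        slots_[slot] = new Entry{key, value, slots_[slot]};  // push onto the chain
    }

    // Hash to the slot and scan its chain for the key.
    const Entry* Lookup(uint64_t key) const {
        for (const Entry* e = slots_[Hash(key)]; e != nullptr; e = e->next)
            if (e->key == key) return e;
        return nullptr;
    }

private:
    size_t Hash(uint64_t key) const { return key % slots_.size(); }
    std::vector<Entry*> slots_;
};
```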

SLIDE 48

CHAINED HASH TABLE

[Figure: hashB(key) maps a key to a slot in the hash table; each slot points to a chain of buckets, or Ø if the slot is empty.]

SLIDE 49

OPEN-ADDRESSING HASH TABLE

Single giant table of slots. Resolve collisions by linearly searching for the next free slot in the table.

→ To determine whether an element is present, hash to a location in the table and scan for it.
→ Have to store the key in the table to know when to stop scanning.
→ Insertions and deletions are generalizations of lookups.
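
A sketch of linear probing over 64-bit keys, assuming the table is never completely full and that one key value (UINT64_MAX here) can be reserved as an empty-slot sentinel:

```cpp
#include <cstdint>
#include <optional>
#include <vector>

class LinearProbeHashTable {
public:
    static constexpr uint64_t kEmpty = UINT64_MAX;  // sentinel: slot unused

    explicit LinearProbeHashTable(size_t num_slots)
        : keys_(num_slots, kEmpty), values_(num_slots, 0) {}

    void Insert(uint64_t key, uint64_t value) {
        size_t i = Hash(key);
        while (keys_[i] != kEmpty) i = (i + 1) % keys_.size();  // next free slot
        keys_[i] = key;
        values_[i] = value;
    }

    // Scan forward from the hash location until the key or an empty slot is found.
    std::optional<uint64_t> Lookup(uint64_t key) const {
        for (size_t i = Hash(key); keys_[i] != kEmpty; i = (i + 1) % keys_.size())
            if (keys_[i] == key) return values_[i];
        return std::nullopt;
    }

private:
    size_t Hash(uint64_t key) const { return key % keys_.size(); }
    std::vector<uint64_t> keys_, values_;
};
```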

SLIDE 50

OPEN-ADDRESSING HASH TABLE

[Figure: hashB(X), hashB(Y), and hashB(Z) map keys X, Y, Z into a single table of slots; when a key's slot is already occupied, the key is stored in the next free slot found by scanning forward.]

SLIDE 54

OBSERVATION

To reduce the # of wasteful comparisons during the join, it is important to avoid collisions of hashed keys. This requires a chained hash table with ~2x the number of slots as the # of elements in R.

SLIDE 55

CUCKOO HASH TABLE

Use multiple hash tables with different hash functions.

→ On insert, check every table and pick any one that has a free slot.
→ If no table has a free slot, evict the element from one of them and then re-hash it to find a new location.

Look-ups and deletions are always O(1) because only one location per hash table is checked.
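
A sketch of a two-table cuckoo insert, assuming two fixed multiplicative hash functions and a bounded number of displacements before giving up (the point at which a real implementation would rebuild with new hash functions):

```cpp
#include <cstdint>
#include <utility>
#include <vector>

class CuckooHashTable {
public:
    explicit CuckooHashTable(size_t slots_per_table)
        : tables_{Slots(slots_per_table), Slots(slots_per_table)} {}

    // Returns false if the displacement limit is hit (caller should rebuild).
    bool Insert(uint64_t key, uint64_t value) {
        Slot s{key, value, true};
        // First see if either table has a free slot for this key.
        for (int t = 0; t < 2; ++t) {
            Slot& dst = tables_[t][Hash(t, s.key)];
            if (!dst.used) { dst = s; return true; }
        }
        // Otherwise displace: evict from table 0, re-home the victim in table 1, and so on.
        int t = 0;
        for (int attempt = 0; attempt < kMaxDisplacements; ++attempt) {
            std::swap(s, tables_[t][Hash(t, s.key)]);
            t = 1 - t;
            Slot& dst = tables_[t][Hash(t, s.key)];
            if (!dst.used) { dst = s; return true; }
        }
        return false;  // probable cycle: rebuild the tables with new hash functions
    }

    bool Lookup(uint64_t key, uint64_t* value) const {
        for (int t = 0; t < 2; ++t) {   // only one location per table is checked
            const Slot& s = tables_[t][Hash(t, key)];
            if (s.used && s.key == key) { *value = s.value; return true; }
        }
        return false;
    }

private:
    struct Slot { uint64_t key = 0, value = 0; bool used = false; };
    using Slots = std::vector<Slot>;
    static constexpr int kMaxDisplacements = 32;

    size_t Hash(int table, uint64_t key) const {
        uint64_t h = (table == 0) ? key * 0x9E3779B97F4A7C15ULL
                                  : (key ^ (key >> 31)) * 0xC2B2AE3D27D4EB4FULL;
        return h % tables_[table].size();
    }
    Slots tables_[2];
};
```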

SLIDE 56

CUCKOO HASH TABLE

[Figure: inserting X, Y, and Z into two hash tables (hashB1 for Table #1, hashB2 for Table #2). X and Y land in free slots. Z maps to occupied slots in both tables, so it evicts one of them and takes its slot; the evicted key Y is re-hashed with hashB1(Y) and in turn evicts X, which is re-hashed with hashB2(X) and lands in a free slot.]

SLIDE 65

CUCKOO HASH TABLE

We have to make sure that we don’t get stuck in an infinite loop when moving keys. If we find a cycle, then we can rebuild all of the hash tables with new hash functions.

→ With two hash functions, we (probably) won’t need to rebuild the table until it is at about 50% full.
→ With three hash functions, we (probably) won’t need to rebuild the table until it is at about 90% full.

SLIDE 66

PROBE PHASE

For each tuple in S, hash its join key and check whether there is a match in the corresponding bucket of the hash table constructed for R.

→ If the inputs were partitioned, then assign each thread a unique partition.
→ Otherwise, synchronize the threads’ access to the cursor on S.
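
A sketch of the partitioned case, assuming the partition phase produced matching partition pairs R_parts[p] / S_parts[p]; each thread builds and probes whole partitions on its own, so neither the hash table nor the cursor on S needs synchronization:

```cpp
#include <cstdint>
#include <thread>
#include <unordered_map>
#include <utility>
#include <vector>

struct Tuple { uint64_t key; uint64_t payload; };
using Partition = std::vector<Tuple>;

// Each thread is assigned whole partitions, so no latching is needed on the
// per-partition hash table or the probe input.
void ProbePartitions(const std::vector<Partition>& R_parts,
                     const std::vector<Partition>& S_parts,
                     size_t num_threads,
                     std::vector<std::vector<std::pair<Tuple, Tuple>>>* results) {
    results->assign(R_parts.size(), {});
    std::vector<std::thread> workers;
    for (size_t t = 0; t < num_threads; ++t) {
        workers.emplace_back([&, t] {
            for (size_t p = t; p < R_parts.size(); p += num_threads) {
                std::unordered_multimap<uint64_t, Tuple> table;   // build on R_parts[p]
                for (const Tuple& r : R_parts[p]) table.emplace(r.key, r);
                for (const Tuple& s : S_parts[p]) {               // probe with S_parts[p]
                    auto [lo, hi] = table.equal_range(s.key);
                    for (auto it = lo; it != hi; ++it)
                        (*results)[p].emplace_back(it->second, s);
                }
            }
        });
    }
    for (auto& w : workers) w.join();
}
```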

SLIDE 67

HASH JOIN VARIANTS

No Partitioning + Shared Hash Table
Non-Blocking Partitioning + Shared Buffers
Non-Blocking Partitioning + Private Buffers
Blocking (Radix) Partitioning

SLIDE 68

HASH JOIN VARIANTS

                           No-P      Shared-P             Private-P              Radix
Partitioning               No        Yes                  Yes                    Yes
Input scans                –         1                    1                      2
Sync during partitioning   –         Spinlock per tuple   Barrier, once at end   Barrier, 4 * #passes
Hash table                 Shared    Private              Private                Private
Sync during build phase    Yes       No                   No                     No
Sync during probe phase    No        No                   No                     No

SLIDE 69

BENCHMARKS

Primary key – foreign key join

→ Outer Relation (Build): 16M tuples, 16 bytes each
→ Inner Relation (Probe): 256M tuples, 16 bytes each

Uniform and highly skewed (Zipf; s=1.25)
No output materialization

DESIGN AND EVALUATION OF MAIN MEMORY HASH JOIN ALGORITHMS FOR MULTI-CORE CPUS (SIGMOD 2011)

SLIDE 70

HASH JOIN – UNIFORM DATA SET

[Figure: cycles per output tuple, broken down into partition / build / probe, on an Intel Xeon CPU X5650 @ 2.66GHz (6 cores, 2 threads per core): No Partitioning 60.2, Shared Partitioning 67.6, Private Partitioning 76.8, Radix 47.3. Callout on the Radix bar: 24% faster than No Partitioning; 3.3x cache misses; 70x TLB misses.]

Source: Spyros Blanas

SLIDE 72

HASH JOIN – SKEWED DATA SET

[Figure: cycles per output tuple, broken down into partition / build / probe, on an Intel Xeon CPU X5650 @ 2.66GHz (6 cores, 2 threads per core): No Partitioning 25.2, Shared Partitioning 167.1, Private Partitioning 56.5, Radix 50.7.]

Source: Spyros Blanas

SLIDE 73

OBSERVATION

We have ignored a lot of important parameters for all of these algorithms so far.

→ Whether to use partitioning or not?
→ How many partitions to use?
→ How many passes to take in the partitioning phase?

In a real DBMS, the optimizer will select what it thinks are good values based on what it knows about the data (and maybe hardware).

SLIDE 74

RADIX HASH JOIN – UNIFORM DATA SET

[Figure: cycles per output tuple, broken down into partition / build / probe, for Radix / 1-Pass and Radix / 2-Pass while varying the number of partitions (64, 256, 512, 1024, 4096, 8192, 32768, 131072), compared against the No Partitioning baseline, on an Intel Xeon CPU X5650 @ 2.66GHz. Chart annotations: +24%, 5%.]

Source: Spyros Blanas

SLIDE 76

EFFECTS OF HYPER-THREADING

[Figure: speedup vs. number of threads (1–12) for No Partitioning and Radix compared against the Ideal curve, on an Intel Xeon CPU X5650 @ 2.66GHz with the uniform data set; the upper thread counts use Hyper-Threading.]

→ Multi-threading hides cache & TLB miss latency.
→ Radix join has fewer cache & TLB misses but this has marginal benefit.
→ The non-partitioned join relies on multi-threading for high performance.

Source: Spyros Blanas

SLIDE 80

PARTING THOUGHTS

On modern CPUs, a simple hash join algorithm that does not partition inputs is competitive.

There are additional vectorized execution optimizations that are possible in hash joins that we didn’t talk about. But these don’t really help…

SLIDE 81

NEXT CLASS

Parallel Sort-Merge Joins
