DATABASE SYSTEM IMPLEMENTATION GT 4420/6422 // SPRING 2019 // @JOY_ARULRAJ
LECTURE #22: PARALLEL JOIN ALGORITHMS (SORTING)
ADMINISTRIVIA
Code Review Submission: April 14th Code Review Updates: April 21st Final Presentation: April 25th
TODAY’S AGENDA
SIMD Background Parallel Sort-Merge Join Evaluation Code Review Guidelines
SINGLE INSTRUCTION, MULTIPLE DATA
A class of CPU instructions that allow the processor to perform the same operation on multiple data points simultaneously. All major ISAs provide microarchitectural support for SIMD operations.
→ x86: MMX, SSE, SSE2, SSE3, SSE4, AVX → PowerPC: Altivec → ARM: NEON
SIMD EXAMPLE
X + Y = Z
→ [x1, x2, …, xn] + [y1, y2, …, yn] = [x1+y1, x2+y2, …, xn+yn]
→ Example: X = (8, 7, 6, 5, 4, 3, 2, 1), Y = (1, 1, 1, 1, 1, 1, 1, 1)
for (i=0; i<n; i++) { Z[i] = X[i] + Y[i]; }
→ SISD: executes one addition per instruction, stepping through the arrays one element at a time.
→ SIMD: loads four values of X and four values of Y into 128-bit SIMD registers, adds them with a single instruction, and stores four results into Z at once.
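The scalar loop above can be vectorized by hand with SSE2 intrinsics. A minimal sketch (the function name and the unaligned-load choice are illustrative, not from the slides):

```cpp
#include <cstdint>
#include <emmintrin.h>  // SSE2: 128-bit integer SIMD

// Z[i] = X[i] + Y[i]: four 32-bit additions per SIMD instruction.
void add_simd(const int32_t* x, const int32_t* y, int32_t* z, int n) {
    int i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128i vx = _mm_loadu_si128(reinterpret_cast<const __m128i*>(x + i));
        __m128i vy = _mm_loadu_si128(reinterpret_cast<const __m128i*>(y + i));
        _mm_storeu_si128(reinterpret_cast<__m128i*>(z + i),
                         _mm_add_epi32(vx, vy));
    }
    for (; i < n; i++) z[i] = x[i] + y[i];  // scalar tail for leftover elements
}
```

Using `_mm_loadu_si128` sidesteps alignment restrictions at some cost; aligned loads (`_mm_load_si128`) require 16-byte-aligned data.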
SIMD TRADE-OFFS
Advantages:
→ Significant performance gains and better resource utilization if an algorithm can be vectorized.
Disadvantages:
→ Implementing an algorithm using SIMD is still mostly a manual process. → SIMD may have restrictions on data alignment. → Gathering data into SIMD registers and scattering it to the correct locations is tricky and/or inefficient.
SORT-MERGE JOIN (R⨝S)
Phase #1: Sort
→ Sort the tuples of R and S based on the join key.
Phase #2: Merge
→ Scan the sorted relations and compare tuples. → The outer relation R only needs to be scanned once.
[Diagram: Relation R and Relation S are each sorted, then merged (⨝) in a single interleaved scan.]
PARALLEL SORT-MERGE JOINS
Sorting is always the most expensive part. Use hardware correctly to speed up the join algorithm as much as possible.
→ Utilize as many CPU cores as possible. → Be mindful of NUMA boundaries. → Use SIMD instructions where applicable.
MULTI-CORE, MAIN-MEMORY JOINS: SORT VS. HASH REVISITED VLDB 2013
PARALLEL SORT-MERGE JOIN (R⨝S)
Phase #1: Partitioning (optional)
→ Partition R and assign the partitions to workers / cores.
Phase #2: Sort
→ Sort the tuples of R and S based on the join key.
Phase #3: Merge
→ Scan the sorted relations and compare tuples. → The outer relation R only needs to be scanned once.
PARTITIONING PHASE
Approach #1: Implicit Partitioning
→ The data was partitioned on the join key when it was loaded into the database. → No extra pass over the data is needed.
Approach #2: Explicit Partitioning
→ Divide only the outer relation and redistribute among the different CPU cores. → Can use the same radix partitioning approach we talked about last time.
SORT PHASE
Create runs of sorted chunks of tuples for both input relations. It used to be that Quicksort was good enough. But NUMA and parallel architectures require us to be more careful…
CACHE-CONSCIOUS SORTING
Level #1: In-Register Sorting
→ Sort runs that fit into CPU registers.
Level #2: In-Cache Sorting
→ Merge Level #1 output into runs that fit into CPU caches. → Repeat until sorted runs are ½ cache size.
Level #3: Out-of-Cache Sorting
→ Used when the runs of Level #2 exceed the size of caches.
SORT VS. HASH REVISITED: FAST JOIN IMPLEMENTATION ON MODERN MULTI-CORE CPUS VLDB 2009
[Diagram: UNSORTED input flows through Level #1, Level #2, and Level #3 to become SORTED output.]
LEVEL #1 – SORTING NETWORKS
Abstract model for sorting keys.
→ Always has fixed wiring “paths” for lists with the same number of elements. → Efficient to execute on modern CPUs because of limited data dependencies.
[Diagram: a four-element sorting network maps input (9, 5, 3, 6) to output (3, 5, 6, 9) through a fixed sequence of compare-exchange stages.]
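The fixed wiring of a four-element network translates directly into straight-line code built from min/max pairs, with no data-dependent branches. A sketch (function names are illustrative):

```cpp
#include <algorithm>

// One compare-exchange "wire crossing" of the network.
inline void cmp_xchg(int& a, int& b) {
    int lo = std::min(a, b);
    int hi = std::max(a, b);
    a = lo;
    b = hi;
}

// Four-element sorting network: five compare-exchanges, and the same
// paths execute for every input, which is what makes it SIMD-friendly.
void sort4(int v[4]) {
    cmp_xchg(v[0], v[1]);
    cmp_xchg(v[2], v[3]);
    cmp_xchg(v[0], v[2]);
    cmp_xchg(v[1], v[3]);
    cmp_xchg(v[1], v[2]);
}
```

On the diagram's input (9, 5, 3, 6) this produces (3, 5, 6, 9).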
LEVEL #1 – SORTING NETWORKS
[Diagram: four SIMD registers of four keys each are sorted across registers and then transposed, so that each register holds a sorted run.]
Instructions:
→ 4 LOAD
→ 10 MIN/MAX (sort across registers)
→ 8 SHUFFLE + 4 STORE (transpose registers)
LEVEL #2 – BITONIC MERGE NETWORK
Like a sorting network, but it merges two locally-sorted lists into a globally-sorted list. Can expand the network to merge progressively larger lists (up to ½ cache size).
Intel’s measurements:
→ 2.25–3.5x speed-up over a SISD implementation.
EFFICIENT IMPLEMENTATION OF SORTING ON MULTI-CORE SIMD CPU ARCHITECTURE VLDB 2008
[Diagram: a sorted run (a1, a2, a3, a4) and a reverse-sorted run (b4, b3, b2, b1) pass through shuffle and min/max stages to produce one sorted run.]
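A scalar sketch of the same network (in a real implementation each compare-exchange stage maps to SIMD shuffle and min/max instructions; the function name is illustrative):

```cpp
#include <algorithm>
#include <array>

// Bitonic merge of two sorted 4-element runs: reversing one run yields
// a bitonic sequence, which compare-exchange stages at strides 4, 2, 1
// turn into a single sorted run.
std::array<int, 8> bitonic_merge8(const std::array<int, 4>& a,
                                  const std::array<int, 4>& b) {
    std::array<int, 8> v = {a[0], a[1], a[2], a[3], b[3], b[2], b[1], b[0]};
    for (int stride = 4; stride >= 1; stride /= 2)
        for (int i = 0; i < 8; i++)
            if (i % (2 * stride) < stride && v[i] > v[i + stride])
                std::swap(v[i], v[i + stride]);  // compare-exchange
    return v;
}
```

Doubling the network width at each step is what expands the merge toward cache-sized runs.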
LEVEL #3 – MULTI-WAY MERGING
Use the Bitonic Merge Networks but split the process up into tasks.
→ Still one worker thread per core. → Link together tasks with a cache-sized FIFO queue.
A task blocks when either its input queue is empty or its output queue is full.
Requires more CPU instructions, but brings bandwidth and compute into balance.
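A toy single-threaded sketch of the task structure (real tasks run on worker threads with bounded, cache-sized queues and block on empty input / full output; here a plain deque stands in for the FIFO):

```cpp
#include <deque>
#include <cstdint>
#include <utility>

using Queue = std::deque<uint64_t>;

// One merge "task": drain two sorted input queues into an output queue.
Queue merge_task(Queue a, Queue b) {
    Queue out;
    while (!a.empty() && !b.empty()) {
        if (a.front() <= b.front()) { out.push_back(a.front()); a.pop_front(); }
        else                        { out.push_back(b.front()); b.pop_front(); }
    }
    for (; !a.empty(); a.pop_front()) out.push_back(a.front());
    for (; !b.empty(); b.pop_front()) out.push_back(b.front());
    return out;
}

// Merge four sorted runs with a binary tree of tasks.
Queue merge4(Queue r0, Queue r1, Queue r2, Queue r3) {
    return merge_task(merge_task(std::move(r0), std::move(r1)),
                      merge_task(std::move(r2), std::move(r3)));
}
```

Linking the tasks with small queues keeps intermediate data cache-resident instead of writing whole merged runs back to memory between passes.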
[Diagram: sorted runs feed a tree of MERGE tasks linked by cache-sized queues, producing one fully sorted output.]
MERGE PHASE
Iterate through the outer table and inner table in lockstep and compare join keys.
May need to backtrack if there are duplicates.
Can be done in parallel at the different cores without synchronization if there are separate output buffers.
SORT-MERGE JOIN VARIANTS
Multi-Way Sort-Merge (M-WAY) Multi-Pass Sort-Merge (M-PASS) Massively Parallel Sort-Merge (MPSM)
MULTI-WAY SORT-MERGE
Outer Table
→ Each core sorts in parallel on local data (levels #1/#2). → Redistribute sorted runs across cores using the multi-way merge (level #3).
Inner Table
→ Same as outer table.
Merge phase is between matching pairs of chunks of outer/inner tables at each core.
MULTI-WAY SORT-MERGE
[Diagram: each core partitions its local-NUMA data and sorts it; the sorted runs are redistributed across cores via the multi-way merge; the inner table follows the same steps; each core then performs a local merge join (⨝) on its matching pair of chunks.]
MULTI-PASS SORT-MERGE
Outer Table
→ Same level #1/#2 sorting as Multi-Way. → But instead of redistributing, it uses a multi-pass naïve merge on sorted runs.
Inner Table
→ Same as outer table.
Merge phase is between matching pairs of chunks of outer table and inner table.
MASSIVELY PARALLEL SORT-MERGE
Outer Table
→ Range-partition outer table and redistribute to cores. → Each core sorts in parallel on their partitions.
Inner Table
→ Not redistributed like outer table. → Each core sorts its local data.
Merge phase is between entire sorted run of outer table and a segment of inner table.
MASSIVELY PARALLEL SORT-MERGE JOINS IN MAIN MEMORY MULTI-CORE DATABASE SYSTEMS VLDB 2012
MASSIVELY PARALLEL SORT-MERGE
[Diagram: the outer table is range-partitioned across NUMA nodes (cross-NUMA partitioning) and each core sorts its partition; the inner table is sorted locally without redistribution; each core then performs cross-partition merge joins (⨝) of its sorted outer run against the inner segments.]
HYPER’s RULES FOR PARALLELIZATION
Rule #1: No random writes to non-local memory
→ Chunk the data, redistribute, and then each core sorts/works on local data.
Rule #2: Only perform sequential reads on non-local memory
→ This allows the hardware prefetcher to hide remote access latency.
Rule #3: No core should ever wait for another
→ Avoid fine-grained latching or sync barriers.
Source: Martina-Cezara Albutiu
EVALUATION
Compare the different join algorithms using a synthetic data set.
→ Sort-Merge: M-WAY, M-PASS, MPSM → Hash: Radix Partitioning
Hardware:
→ 4 Socket Intel Xeon E5-4640 @ 2.4GHz → 8 Cores with 2 Threads Per Core → 512 GB of DRAM
RAW SORTING PERFORMANCE
[Chart: single-threaded sorting throughput (M tuples/sec) vs. number of tuples (in 2^20, from 1 to 256); the SIMD sort is 2.5–3x faster than the C++ STL sort.]
Source: Cagri Balkesen
COMPARISON OF SORT-MERGE JOINS
Workload: 1.6B ⨝ 128M (8-byte tuples)
[Chart: cycles per output tuple for M-WAY, M-PASS, and MPSM, broken down into Partition / Sort / S-Merge / M-Join, with throughput overlaid; annotated values of 7.6, 13.6, and 22.9 across the three algorithms; the effect of hyper-threading is marked.]
Source: Cagri Balkesen
M-WAY JOIN VS. MPSM JOIN
Workload: 1.6B ⨝ 128M (8-byte tuples)
[Chart: throughput (M tuples/sec) vs. number of threads (1–64) for Multi-Way and Massively Parallel; Multi-Way peaks at 315 M/sec vs. 108 M/sec for Massively Parallel, with intermediate annotated points of 130 vs. 54 and 259 vs. 90 M/sec; the hyper-threading region is marked.]
Source: Cagri Balkesen
SORT-MERGE JOIN VS. HASH JOIN
[Chart: cycles per output tuple for SORT vs. HASH on four workloads (128M⨝128M, 1.6B⨝1.6B, 128M⨝512M, 1.6B⨝6.4B), broken down into Partition / Sort / S-Merge / M-Join and Build+Probe. 4 Socket Intel Xeon E5-4640 @ 2.4GHz, 8 Cores with 2 Threads Per Core.]
[Chart: throughput (M tuples/sec) while varying the size of the input relations from 128M to 1.9B tuples; Multi-Way Sort-Merge Join vs. Radix Hash Join.]
Source: Cagri Balkesen
PARTING THOUGHTS
Both join approaches are equally important. Every serious OLAP DBMS supports both. We did not consider the impact of queries where the output needs to be sorted.
CODE REVIEWS
Each group will send a pull request to the target repos.
→ PR must be able to merge cleanly into master branch. → Reviewing group will write comments on that request. → Add the URL to the Google spreadsheet and notify the reviewing team that it is ready.
Please be helpful and courteous.
GENERAL TIPS
The dev team should provide you with a summary of what files/functions the reviewing team should look at.
Review fewer than 400 lines of code at a time, and only for at most 60 minutes.
Use a checklist to outline what kind of problems you are looking for.
Source: SmartBear
CHECKLIST – GENERAL
Does the code work?
Is all the code easily understood?
Is there any redundant or duplicate code?
Is the code as modular as possible?
Can any global variables be replaced?
Is there any commented out code?
Is it using proper debug log functions?
Source: Gareth Wilson
CHECKLIST – DOCUMENTATION
Do comments describe the intent of the code?
Are all functions commented?
Is any unusual behavior described?
Is the use of 3rd-party libraries documented?
Is there any incomplete code?
Source: Gareth Wilson
CHECKLIST – TESTING
Do tests exist and are they comprehensive?
Are the tests actually testing the feature?
Are they relying on hardcoded answers?
What is the code coverage?
Source: Gareth Wilson
NEXT CLASS
Vectorized Execution
Reminder: Code Review April 21st @ 11:59pm