15-721 ADVANCED DATABASE SYSTEMS Lecture #19: Parallel Join Algorithms (Sorting)



SLIDE 1

15-721
ADVANCED DATABASE SYSTEMS

Lecture #19 – Parallel Join Algorithms (Sorting)

@Andy_Pavlo // Carnegie Mellon University // Spring 2017

SLIDE 2

CMU 15-721 (Spring 2017)

TODAY’S AGENDA

→ Background
→ SIMD
→ Parallel Sort-Merge Join
→ Evaluation

SPOILER: This doesn’t work on current Xeon CPUs.
SLIDE 3

SINGLE INSTRUCTION, MULTIPLE DATA

A class of CPU instructions that allows the processor to perform the same operation on multiple data points simultaneously. Both current AMD and Intel CPUs have ISA and microarchitecture support for SIMD operations.

→ MMX, 3DNow!, SSE, SSE2, SSE3, SSE4, AVX

SLIDE 4–9

SIMD EXAMPLE

X + Y = Z

for (i = 0; i < n; i++) { Z[i] = X[i] + Y[i]; }

[Figure: X = 8 7 6 5 4 3 2 1 and Y = 1 1 1 1 1 1 1 1. The SISD version executes one addition (xi + yi) per instruction; the SIMD version loads four elements of X and four of Y into 128-bit SIMD registers and produces four sums per instruction.]
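The scalar loop above maps directly onto SIMD intrinsics. A minimal sketch in C using SSE2's 128-bit registers (the function name `simd_add` and the multiple-of-four assumption are ours, not the lecture's):

```c
#include <emmintrin.h>  /* SSE2: 128-bit integer SIMD */

/* Z[i] = X[i] + Y[i], four 32-bit lanes per instruction.
 * Assumes n is a multiple of 4; a real kernel would also
 * handle the scalar tail. */
void simd_add(const int *X, const int *Y, int *Z, int n) {
    for (int i = 0; i < n; i += 4) {
        __m128i x = _mm_loadu_si128((const __m128i *)&X[i]);
        __m128i y = _mm_loadu_si128((const __m128i *)&Y[i]);
        __m128i z = _mm_add_epi32(x, y);  /* four sums at once */
        _mm_storeu_si128((__m128i *)&Z[i], z);
    }
}
```

With the slide's inputs, each iteration produces four of the eight sums, so the loop body runs twice instead of eight times.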

SLIDE 10

SIMD TRADE-OFFS

Advantages:
→ Significant gains in performance and resource utilization if an algorithm can be vectorized.

Disadvantages:
→ Implementing an algorithm using SIMD is still mostly a manual process.
→ SIMD may have restrictions on data alignment.
→ Gathering data into SIMD registers and scattering it to the correct locations is tricky and/or inefficient.

SLIDE 11

WHY NOT GPUS?

Moving data back and forth between DRAM and the GPU over the PCI-E bus is slow. There are some newer GPU-enabled DBMSs.
→ Examples: MapD, SQream, Kinetica

Emerging co-processors that can share the CPU’s memory may change this.
→ Examples: AMD’s APU, Intel’s Knights Landing

SLIDE 12

SORT-MERGE JOIN (R⨝S)

Phase #1: Sort
→ Sort the tuples of R and S based on the join key.

Phase #2: Merge
→ Scan the sorted relations and compare tuples.
→ The outer relation R only needs to be scanned once.
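The two phases can be sketched in scalar C. This is a minimal sketch, assuming single-column integer join keys; the function name `sort_merge_join` and the pair-counting output are ours, not the lecture's:

```c
#include <stdlib.h>

static int cmp_int(const void *a, const void *b) {
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

/* Count the matching (r, s) pairs of a sort-merge join. */
int sort_merge_join(int *R, int nr, int *S, int ns) {
    /* Phase #1: sort both inputs on the join key. */
    qsort(R, nr, sizeof(int), cmp_int);
    qsort(S, ns, sizeof(int), cmp_int);

    /* Phase #2: merge. The outer relation R is scanned once;
     * for each R tuple we rescan the run of equal keys in S. */
    int matches = 0, i = 0, j = 0;
    while (i < nr && j < ns) {
        if (R[i] < S[j]) {
            i++;
        } else if (R[i] > S[j]) {
            j++;
        } else {
            int k = j;  /* start of the matching run in S */
            while (k < ns && S[k] == R[i]) { matches++; k++; }
            i++;        /* the next R tuple restarts at j (backtrack) */
        }
    }
    return matches;
}
```

Rescanning from `j` on each equal outer key is the duplicate handling that the merge phase slide later calls "backtracking".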

SLIDE 13–17

SORT-MERGE JOIN (R⨝S)

[Figure: Relation R and Relation S are each sorted ("SORT!"), then scanned in lockstep ("MERGE!") to find matching tuples.]

SLIDE 18

PARALLEL SORT-MERGE JOINS

Sorting is always the most expensive part. Take advantage of new hardware to speed things up as much as possible.
→ Utilize as many CPU cores as possible.
→ Be mindful of NUMA boundaries.

MULTI-CORE, MAIN-MEMORY JOINS: SORT VS. HASH REVISITED (VLDB 2013)

SLIDE 19

PARALLEL SORT-MERGE JOIN (R⨝S)

Phase #1: Partitioning (optional)
→ Partition R and assign the partitions to workers / cores.

Phase #2: Sort
→ Sort the tuples of R and S based on the join key.

Phase #3: Merge
→ Scan the sorted relations and compare tuples.
→ The outer relation R only needs to be scanned once.

SLIDE 20

PARTITIONING PHASE

Divide the relations into chunks and assign them to cores.
→ Explicit vs. Implicit

Explicit: divide only the outer relation and redistribute it among the different CPU cores.
→ Can use the same radix partitioning approach we talked about last time.

SLIDE 21

SORT PHASE

Create runs of sorted chunks of tuples for both input relations. It used to be that Quicksort was good enough. But NUMA and parallel architectures require us to be more careful…

SLIDE 22

CACHE-CONSCIOUS SORTING

Level #1: In-Register Sorting
→ Sort runs that fit into CPU registers.

Level #2: In-Cache Sorting
→ Merge the output of Level #1 into runs that fit into CPU caches.
→ Repeat until sorted runs are ½ cache size.

Level #3: Out-of-Cache Sorting
→ Used when the runs of Level #2 exceed the size of caches.

SLIDE 23

CACHE-CONSCIOUS SORTING

[Figure: data moves from UNSORTED to SORTED through Level #1, Level #2, and Level #3.]

SLIDE 24–33

LEVEL #1 – SORTING NETWORKS

Abstract model for sorting keys.
→ Always has fixed wiring “paths” for lists with the same number of elements.
→ Efficient to execute on modern CPUs because of limited data dependencies and no branches.

[Figure: a four-element sorting network steps the input 9 5 3 6 through pairwise compare/exchange wires (5 9 and 3 6, then 3 6 5 9) to produce the sorted output 3 5 6 9.]
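The four-element network on these slides can be sketched branch-free in C. This is one standard five-comparator wiring; the names `cmp_exchange` and `sort4` are ours, not the lecture's:

```c
/* One compare/exchange "wire": no control-flow branch on the
 * keys, just min/max, which is what maps onto SIMD MIN/MAX. */
static void cmp_exchange(int *a, int *b) {
    int lo = *a < *b ? *a : *b;  /* typically compiled to cmov */
    int hi = *a < *b ? *b : *a;
    *a = lo;
    *b = hi;
}

/* Five-comparator sorting network for four keys: the wiring is
 * fixed regardless of the input values. */
void sort4(int v[4]) {
    cmp_exchange(&v[0], &v[1]);
    cmp_exchange(&v[2], &v[3]);
    cmp_exchange(&v[0], &v[2]);
    cmp_exchange(&v[1], &v[3]);
    cmp_exchange(&v[1], &v[2]);
}
```

On the slide's input 9 5 3 6, the first two comparators produce 5 9 and 3 6, and the remaining three yield 3 5 6 9, matching the figure.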

SLIDE 34–37

LEVEL #1 – SORTING NETWORKS

[Figure: sixteen keys (12 21 4 13 / 9 8 6 7 / 1 14 3 0 / 5 11 15 10) are loaded into four SIMD registers, sorted across registers with min/max operations, and then transposed back out as sorted four-element runs.]

Instructions:
→ 4 LOAD
→ 10 MIN/MAX
→ 8 SHUFFLE
→ 4 STORE

SLIDE 38

LEVEL #2 – BITONIC MERGE NETWORK

Like a sorting network, but it can merge two locally-sorted lists into a globally-sorted list. Can expand the network to merge progressively larger lists (up to ½ cache size).

Intel’s Measurements:
→ 2.25–3.5x speed-up over a SISD implementation.

EFFICIENT IMPLEMENTATION OF SORTING ON MULTI-CORE (VLDB 2008)

SLIDE 39

LEVEL #2 – BITONIC MERGE NETWORK

[Figure: a sorted run (a1 a2 a3 a4) and a reverse-sorted run (b4 b3 b2 b1) pass through alternating min/max and SHUFFLE stages to produce a single sorted run.]
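A scalar sketch of the bitonic merge shown above, in C: reversing one sorted run makes the concatenation a bitonic sequence, and log2(n) rounds of fixed min/max wiring then sort it (the SHUFFLE stages in the SIMD version just move lanes between rounds). The function name `bitonic_merge4` is ours, not the lecture's:

```c
/* Merge two sorted 4-element runs into one sorted 8-element run
 * with a bitonic merge network: fixed compare/exchange wiring,
 * independent of the key values. */
void bitonic_merge4(const int a[4], const int b[4], int out[8]) {
    /* Reverse b so that (a, reversed b) is a bitonic sequence. */
    int v[8];
    for (int i = 0; i < 4; i++) { v[i] = a[i]; v[7 - i] = b[i]; }

    /* log2(8) = 3 rounds of compare/exchange at strides 4, 2, 1. */
    for (int k = 4; k >= 1; k /= 2) {
        for (int i = 0; i < 8; i++) {
            if ((i & k) == 0 && v[i] > v[i + k]) {
                int t = v[i]; v[i] = v[i + k]; v[i + k] = t;
            }
        }
    }
    for (int i = 0; i < 8; i++) out[i] = v[i];
}
```

Because the wiring is the same for every input, the inner compare/exchange steps vectorize into the MIN/MAX + SHUFFLE pattern of the previous slides.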

SLIDE 40

LEVEL #3 – MULTI-WAY MERGING

Use the Bitonic Merge Networks, but split the process up into tasks.
→ Still one worker thread per core.
→ Link together tasks with a cache-sized FIFO queue.

A task blocks when either its input queue is empty or its output queue is full.

Requires more CPU instructions, but brings bandwidth and compute into balance.
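Setting the task/queue machinery aside, the core operation each merge task performs can be sketched in C as a k-way merge of sorted runs. This uses a simple linear min-scan; the real implementation merges through bitonic networks and streams through cache-sized queues. The name `kway_merge` and the k ≤ 16 bound are ours:

```c
#include <limits.h>

/* Merge k sorted runs into out. runs[i] has len[i] elements.
 * A task in the lecture's scheme does this incrementally,
 * blocking when its input queue is empty or its output full. */
void kway_merge(const int **runs, const int *len, int k, int *out) {
    int pos[16] = {0};  /* one cursor per run (assumes k <= 16) */
    int total = 0;
    for (int i = 0; i < k; i++) total += len[i];

    for (int n = 0; n < total; n++) {
        int best = -1, best_val = INT_MAX;
        /* Linear scan for the smallest head among the runs. */
        for (int i = 0; i < k; i++)
            if (pos[i] < len[i] && runs[i][pos[i]] < best_val) {
                best = i;
                best_val = runs[i][pos[i]];
            }
        out[n] = best_val;
        pos[best]++;
    }
}
```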

SLIDE 41

LEVEL #3 – MULTI-WAY MERGING

[Figure: sorted runs feed a tree of MERGE tasks that are linked together by cache-sized queues.]

SLIDE 42

MERGE PHASE

Iterate through the outer table and inner table in lockstep and compare join keys. May need to backtrack if there are duplicates. Can be done in parallel at the different cores without synchronization if there are separate output buffers.

SLIDE 43

SORT-MERGE JOIN VARIANTS

→ Multi-Way Sort-Merge (M-WAY)
→ Multi-Pass Sort-Merge (M-PASS)
→ Massively Parallel Sort-Merge (MPSM)

SLIDE 44

MULTI-WAY SORT-MERGE

Outer Table
→ Each core sorts in parallel on local data (Levels #1/#2).
→ Redistribute sorted runs across cores using the multi-way merge (Level #3).

Inner Table
→ Same as outer table.

Merge phase is between matching pairs of chunks of the outer/inner tables at each core.

MULTI-CORE, MAIN-MEMORY JOINS: SORT VS. HASH REVISITED (VLDB 2013)

SLIDE 45–50

MULTI-WAY SORT-MERGE

[Figure: the outer table goes through local-NUMA partitioning, a per-core sort ("SORT!"), and a multi-way merge; the inner table follows the same steps; each core then performs a local merge join (⨝) on its matching pair of sorted chunks.]

SLIDE 51

MULTI-PASS SORT-MERGE

Outer Table
→ Same Level #1/#2 sorting as M-WAY.
→ But instead of redistributing, it uses a multi-pass naïve merge on the sorted runs.

Inner Table
→ Same as outer table.

Merge phase is between matching pairs of chunks of the outer table and inner table.

MULTI-CORE, MAIN-MEMORY JOINS: SORT VS. HASH REVISITED (VLDB 2013)

SLIDE 52

MASSIVELY PARALLEL SORT-MERGE

Outer Table
→ Range-partition the outer table and redistribute the partitions to cores.
→ Each core sorts in parallel on its partitions.

Inner Table
→ Not redistributed like the outer table.
→ Each core sorts its local data.

Merge phase is between the entire sorted run of the outer table and a segment of the inner table.

MASSIVELY PARALLEL SORT-MERGE JOINS IN MAIN MEMORY MULTI-CORE DATABASE SYSTEMS (VLDB 2012)

SLIDE 53–56

MASSIVELY PARALLEL SORT-MERGE

[Figure: the outer table is range-partitioned across NUMA regions, each core sorts its partition ("SORT!"), and each core then merge joins (⨝) its sorted outer run against segments of the inner table across partitions.]

SLIDE 57

HYPER’S RULES FOR PARALLELIZATION

Rule #1: No random writes to non-local memory.
→ Chunk the data, redistribute, and then each core sorts/works on local data.

Rule #2: Only perform sequential reads on non-local memory.
→ This allows the hardware prefetcher to hide the remote access latency.

Rule #3: No core should ever wait for another.
→ Avoid fine-grained latching or sync barriers.

Source: Martina-Cezara Albutiu

SLIDE 58

EVALUATION

Compare the different join algorithms using a synthetic data set.
→ Sort-Merge: M-WAY, M-PASS, MPSM
→ Hash: Radix Partitioning

Hardware:
→ 4-Socket Intel Xeon E4640 @ 2.4 GHz
→ 8 Cores with 2 Threads Per Core
→ 512 GB of DRAM

SLIDE 59

RAW SORTING PERFORMANCE

[Chart: single-threaded sorting throughput (M tuples/sec) as the number of tuples grows from 1 to 256 (in units of 2^20). The SIMD sort is 2.5–3x faster than the C++ STL sort.]

Source: Cagri Balkesen

SLIDE 60

COMPARISON OF SORT-MERGE JOINS

[Chart: cycles per output tuple and throughput (M tuples/sec) for M-WAY, M-PASS, and MPSM, with each bar broken down into Partition, Sort, S-Merge, and M-Join phases; the chart annotations read 13.6, 7.6, and 22.9.]

Workload: 1.6B ⨝ 128M (8-byte tuples)
Source: Cagri Balkesen

SLIDE 61

M-WAY JOIN VS. MPSM JOIN

[Chart: throughput (M tuples/sec) vs. number of threads (1–64; beyond the physical cores the threads are hyper-threaded). The annotated peaks for the Multi-Way and Massively Parallel joins are 315 M/sec and 105 M/sec.]

Workload: 1.6B ⨝ 128M (8-byte tuples)
Source: Cagri Balkesen

SLIDE 62–63

SORT-MERGE JOIN VS. HASH JOIN

[Chart: cycles per output tuple for SORT vs. HASH on four workloads (128M⨝128M, 1.6B⨝1.6B, 128M⨝512M, 1.6B⨝6.4B), with bars broken down into Partition, Sort, S-Merge, M-Join, and Build+Probe phases.]

Hardware: 4-Socket Intel Xeon E4640 @ 2.4 GHz, 8 Cores with 2 Threads Per Core
Source: Cagri Balkesen

SLIDE 64

SORT-MERGE JOIN VS. HASH JOIN

[Chart: throughput (M tuples/sec) of the Multi-Way Sort-Merge Join vs. the Radix Hash Join while varying the size of the input relations (128 to 1920 million tuples).]

Source: Cagri Balkesen

SLIDE 65

WHY DOESN’T ANY OF THIS WORK?

The DBMS has to sort values together with their corresponding 64-bit tuple IDs. Since we have to align our data for SIMD, that means we need to sort 128-bit values. Intel Xeon (not Phi) only supports AVX-256, so we can only store two such values in a 256-bit SIMD register. Sorting networks need four!

SLIDE 66

PARTING THOUGHTS

Both join approaches are equally important. Every serious OLAP DBMS supports both. We did not consider the impact of queries where the output needs to be sorted.

SLIDE 67

NEXT CLASS

Query Code Generation + Compilation

Reminder: First Code Review is due April 11th @ 11:59pm.