SLIDE 1

DATABASE SYSTEM IMPLEMENTATION

GT 4420/6422 // SPRING 2019 // @JOY_ARULRAJ LECTURE #22: PARALLEL JOIN ALGORITHMS (SORTING)

SLIDE 2

ADMINISTRIVIA

Code Review Submission: April 14th
Code Review Updates: April 21st
Final Presentation: April 25th


SLIDE 3

TODAY’S AGENDA

SIMD Background
Parallel Sort-Merge Join
Evaluation
Code Review Guidelines


SLIDE 4

SINGLE INSTRUCTION, MULTIPLE DATA

A class of CPU instructions that allow the processor to perform the same operation on multiple data points simultaneously. All major ISAs provide microarchitectural support for SIMD operations:

→ x86: MMX, SSE, SSE2, SSE3, SSE4, AVX
→ PowerPC: AltiVec
→ ARM: NEON


SLIDES 5–17

SIMD EXAMPLE

X + Y = Z, computed element-wise: Z[i] = X[i] + Y[i].

for (i = 0; i < n; i++) {
    Z[i] = X[i] + Y[i];
}

[Animation: with SISD execution, each "+" instruction adds a single pair of elements, stepping one at a time through X = {8, 7, 6, 5, 4, 3, 2, 1} and Y = {1, 1, 1, 1, 1, 1, 1, 1}. With SIMD, four elements of X and four elements of Y are loaded into two 128-bit SIMD registers, and one instruction writes all four sums into a third 128-bit SIMD register, so the loop covers the data four elements per step.]
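To make the 128-bit register picture concrete, here is a minimal sketch of this loop vectorized with x86 SSE2 intrinsics, assuming 32-bit integer elements (the function name and the scalar tail loop are illustrative, not from the slides):

#include <emmintrin.h>   // SSE2 intrinsics

// Vectorized Z[i] = X[i] + Y[i]: four 32-bit lanes per 128-bit register,
// so each _mm_add_epi32 performs four additions at once.
void add_simd(const int *X, const int *Y, int *Z, int n) {
    int i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128i x = _mm_loadu_si128((const __m128i *)&X[i]);
        __m128i y = _mm_loadu_si128((const __m128i *)&Y[i]);
        _mm_storeu_si128((__m128i *)&Z[i], _mm_add_epi32(x, y));
    }
    for (; i < n; i++)   // scalar tail for the last n % 4 elements
        Z[i] = X[i] + Y[i];
}

The same pattern widens to eight 32-bit lanes with AVX2's 256-bit registers.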

SLIDE 18

SIMD TRADE-OFFS

Advantages:

→ Significant performance gains and better resource utilization if an algorithm can be vectorized.

Disadvantages:

→ Implementing an algorithm using SIMD is still mostly a manual process.
→ SIMD may have restrictions on data alignment.
→ Gathering data into SIMD registers and scattering it to the correct locations is tricky and/or inefficient.


SLIDE 19

SORT-MERGE JOIN (R⨝S)

Phase #1: Sort

→ Sort the tuples of R and S based on the join key.

Phase #2: Merge

→ Scan the sorted relations and compare tuples.
→ The outer relation R only needs to be scanned once.

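For reference, a minimal single-threaded sketch of the two phases in C (tuple_t, cmp_key, and the printf stand-in for emitting a join result are illustrative, not a fixed interface):

#include <stdio.h>
#include <stdlib.h>

typedef struct { int key; int payload; } tuple_t;

static int cmp_key(const void *a, const void *b) {
    const tuple_t *ta = a, *tb = b;
    return (ta->key > tb->key) - (ta->key < tb->key);
}

void sort_merge_join(tuple_t *R, int nR, tuple_t *S, int nS) {
    qsort(R, nR, sizeof(tuple_t), cmp_key);   // Phase #1: sort on join key
    qsort(S, nS, sizeof(tuple_t), cmp_key);
    int i = 0, j = 0;
    while (i < nR && j < nS) {                // Phase #2: merge in lockstep
        if (R[i].key < S[j].key)      i++;
        else if (R[i].key > S[j].key) j++;
        else {
            // Emit a result for every S tuple with this key; j stays at the
            // start of the group so a duplicate R key replays it (the
            // "backtracking" for duplicates discussed on a later slide).
            for (int k = j; k < nS && S[k].key == R[i].key; k++)
                printf("match on key %d\n", R[i].key);
            i++;                              // R is scanned once
        }
    }
}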

SLIDES 20–24

SORT-MERGE JOIN (R⨝S)

[Animation: Relation R and Relation S are each sorted ("SORT!"), then the two sorted relations are scanned in lockstep and merged ("MERGE!").]

SLIDE 25

PARALLEL SORT-MERGE JOINS

Sorting is always the most expensive part. Use hardware correctly to speed up the join algorithm as much as possible.

→ Utilize as many CPU cores as possible.
→ Be mindful of NUMA boundaries.
→ Use SIMD instructions where applicable.


MULTI-CORE, MAIN-MEMORY JOINS: SORT VS. HASH REVISITED VLDB 2013

SLIDE 26

PARALLEL SORT-MERGE JOIN (R⨝S)

Phase #1: Partitioning (optional)

→ Partition R and assign the partitions to workers / cores.

Phase #2: Sort

→ Sort the tuples of R and S based on the join key.

Phase #3: Merge

→ Scan the sorted relations and compare tuples.
→ The outer relation R only needs to be scanned once.


SLIDE 27

PARTITIONING PHASE

Approach #1: Implicit Partitioning

→ The data was partitioned on the join key when it was loaded into the database.
→ No extra pass over the data is needed.

Approach #2: Explicit Partitioning

→ Divide only the outer relation and redistribute it among the different CPU cores.
→ Can use the same radix partitioning approach we talked about last time (a minimal sketch follows below).

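A minimal sketch of one radix-partitioning pass on the low bits of the join key, using the standard histogram / prefix-sum / scatter structure (B, tuple_t, and the function name are illustrative):

#define B 4                          // radix bits: 2^4 = 16 partitions
#define FANOUT (1 << B)

typedef struct { int key; int payload; } tuple_t;

void radix_partition(const tuple_t *in, tuple_t *out, int n) {
    int hist[FANOUT] = {0}, offset[FANOUT];
    for (int i = 0; i < n; i++)      // pass 1: count tuples per partition
        hist[(unsigned)in[i].key & (FANOUT - 1)]++;
    offset[0] = 0;                   // prefix sum gives each partition's
    for (int p = 1; p < FANOUT; p++) // starting offset in the output
        offset[p] = offset[p - 1] + hist[p - 1];
    for (int i = 0; i < n; i++)      // pass 2: scatter to partition regions
        out[offset[(unsigned)in[i].key & (FANOUT - 1)]++] = in[i];
}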

SLIDE 28

SORT PHASE

Create sorted runs of tuples for both input relations. Quicksort used to be good enough, but NUMA and parallel architectures require us to be more careful…


SLIDE 29

CACHE-CONSCIOUS SORTING

Level #1: In-Register Sorting

→ Sort runs that fit into CPU registers.

Level #2: In-Cache Sorting

→ Merge Level #1 output into runs that fit into CPU caches.
→ Repeat until sorted runs are ½ cache size.

Level #3: Out-of-Cache Sorting

→ Used when the runs of Level #2 exceed the size of caches.


SORT VS. HASH REVISITED: FAST JOIN IMPLEMENTATION ON MODERN MULTI-CORE CPUS VLDB 2009

SLIDES 30–36

CACHE-CONSCIOUS SORTING

[Animation: an UNSORTED input is sorted bottom-up. Level #1 produces small in-register runs, Level #2 merges them into runs that fit in cache, and Level #3 merges those runs out of cache until the data is SORTED.]

SLIDES 37–46

LEVEL #1 – SORTING NETWORKS

Abstract model for sorting keys.

→ Always has fixed wiring “paths” for lists with the same number of elements.
→ Efficient to execute on modern CPUs because of limited data dependencies.

[Animation: the input 9 5 3 6 flows through the network's compare-exchange wiring, producing the intermediate pairs (5, 9) and (3, 6) and finally the sorted output 3 5 6 9.]
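A branch-free C sketch of the 4-key network above: five fixed compare-exchanges with no data-dependent control flow, which is exactly the "limited data dependencies" property (function names are illustrative; with optimization each comparator typically compiles to conditional moves or min/max rather than branches):

// One comparator: afterwards *a holds the smaller key, *b the larger.
static inline void cmpxchg(int *a, int *b) {
    int lo = (*a < *b) ? *a : *b;
    int hi = (*a < *b) ? *b : *a;
    *a = lo;
    *b = hi;
}

// Fixed wiring for four keys: (0,1) (2,3) (0,2) (1,3) (1,2).
// E.g., {9, 5, 3, 6} -> {3, 5, 6, 9}, matching the slides' example.
void sort4(int v[4]) {
    cmpxchg(&v[0], &v[1]);
    cmpxchg(&v[2], &v[3]);
    cmpxchg(&v[0], &v[2]);
    cmpxchg(&v[1], &v[3]);
    cmpxchg(&v[1], &v[2]);
}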

SLIDES 47–56

LEVEL #1 – SORTING NETWORKS

[Animation: sixteen keys are sorted four-per-register entirely inside SIMD registers. First, 4 LOAD instructions fill four registers. Then "Sort Across Registers": 10 MIN/MAX instructions run the comparator wiring across the registers, ordering each column of four keys. Finally, "Transpose Registers": 8 SHUFFLE and 4 STORE instructions transpose the registers and write out four sorted runs of four keys each.]
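The 10 MIN/MAX instructions follow from the network: five comparators, each costing one vector min and one vector max, applied to whole registers so four independent 4-key columns are sorted at once. A minimal sketch of one such SIMD comparator with SSE4.1 intrinsics (the function name is illustrative):

#include <smmintrin.h>   // SSE4.1: _mm_min_epi32 / _mm_max_epi32

// Per lane k: *a receives min(a[k], b[k]) and *b receives max(a[k], b[k]).
// Running the sort4 wiring above on registers r0..r3 with this comparator
// sorts four key columns in 5 x 2 = 10 MIN/MAX instructions.
static inline void simd_cmpxchg(__m128i *a, __m128i *b) {
    __m128i lo = _mm_min_epi32(*a, *b);
    __m128i hi = _mm_max_epi32(*a, *b);
    *a = lo;
    *b = hi;
}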

SLIDE 57

LEVEL #2 – BITONIC MERGE NETWORK

Like a sorting network, but it merges two locally-sorted lists into one globally-sorted list. The network can be expanded to merge progressively larger lists (up to ½ cache size).

Intel's measurements:

→ 2.25–3.5x speed-up over SISD implementation.


EFFICIENT IMPLEMENTATION OF SORTING ON MULTI-CORE SIMD CPU ARCHITECTURE VLDB 2008

SLIDES 58–62

LEVEL #2 – BITONIC MERGE NETWORK

[Animation: two sorted runs of four keys, a1…a4 and the reversed b4…b1, enter the network as one bitonic sequence; alternating min/max and SHUFFLE stages then produce a single sorted run of eight keys.]
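A scalar sketch of the 4+4 bitonic merge shown above: reversing the second sorted run yields a bitonic sequence, and compare-exchanges at strides 4, 2, 1 then sort it; the SIMD version implements each round with one min, one max, and shuffles (the function name is illustrative):

// Merge two sorted 4-key runs into one sorted 8-key run.
void bitonic_merge8(const int a[4], const int b[4], int out[8]) {
    int v[8];
    for (int i = 0; i < 4; i++) {    // ascending a, then descending b:
        v[i] = a[i];                 // together a bitonic sequence
        v[i + 4] = b[3 - i];
    }
    for (int stride = 4; stride >= 1; stride /= 2)   // three rounds
        for (int i = 0; i < 8; i++)
            if ((i & stride) == 0 && v[i] > v[i | stride]) {
                int t = v[i];
                v[i] = v[i | stride];
                v[i | stride] = t;
            }
    for (int i = 0; i < 8; i++)
        out[i] = v[i];
}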

SLIDE 63

LEVEL #3 – MULTI-WAY MERGING

Use the Bitonic Merge Networks but split the process up into tasks.

→ Still one worker thread per core.
→ Link together tasks with a cache-sized FIFO queue.

A task blocks when either its input queue is empty or its output queue is full (see the queue sketch below).

Requires more CPU instructions, but brings bandwidth and compute into balance.

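A minimal sketch of the cache-sized FIFO as a bounded ring buffer, assuming one producer task and one consumer task per queue (RING_CAP, tuple_t, and the function names are illustrative; a real implementation would size the buffer to the cache and synchronize the two threads):

#define RING_CAP 4096   // tuples per queue, sized so the buffer fits in cache

typedef struct { int key; int payload; } tuple_t;
typedef struct { tuple_t buf[RING_CAP]; int head, tail; } ring_t;

// Returns 0 when the queue is full: the producing merge task must block.
static int ring_push(ring_t *r, tuple_t t) {
    int next = (r->tail + 1) % RING_CAP;
    if (next == r->head) return 0;
    r->buf[r->tail] = t;
    r->tail = next;
    return 1;
}

// Returns 0 when the queue is empty: the consuming merge task must block.
static int ring_pop(ring_t *r, tuple_t *t) {
    if (r->head == r->tail) return 0;
    *t = r->buf[r->head];
    r->head = (r->head + 1) % RING_CAP;
    return 1;
}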

SLIDES 64–66

LEVEL #3 – MULTI-WAY MERGING

[Animation: sorted runs feed a tree of MERGE tasks; neighboring tasks are linked by cache-sized queues, and tuples stream up the tree into the final merged output.]

SLIDE 67

MERGE PHASE

Iterate through the outer table and inner table in lockstep and compare join keys. May need to backtrack if there are duplicates. Can be done in parallel at the different cores without synchronization if there are separate output buffers.


SLIDE 68

SORT-MERGE JOIN VARIANTS

Multi-Way Sort-Merge (M-WAY)
Multi-Pass Sort-Merge (M-PASS)
Massively Parallel Sort-Merge (MPSM)


SLIDE 69

MULTI-WAY SORT-MERGE

Outer Table

→ Each core sorts in parallel on local data (levels #1/#2).
→ Redistribute sorted runs across cores using the multi-way merge (level #3).

Inner Table

→ Same as outer table.

Merge phase is between matching pairs of chunks of outer/inner tables at each core.


MULTI-CORE, MAIN-MEMORY JOINS: SORT VS. HASH REVISITED VLDB 2013

SLIDES 70–84

MULTI-WAY SORT-MERGE

[Animation: the outer table goes through Local-NUMA Partitioning, each core sorts its runs ("SORT!"), and the sorted runs are redistributed with the Multi-Way Merge. The inner table follows the same steps. Each core then performs a Local Merge Join (⨝) between its matching outer/inner chunks.]

SLIDE 85

MULTI-PASS SORT-MERGE

Outer Table

→ Same level #1/#2 sorting as Multi-Way.
→ But instead of redistributing, it uses a multi-pass naïve merge on sorted runs.

Inner Table

→ Same as outer table.

Merge phase is between matching pairs of chunks of outer table and inner table.


MULTI-CORE, MAIN-MEMORY JOINS: SORT VS. HASH REVISITED VLDB 2013

SLIDE 86

MASSIVELY PARALLEL SORT-MERGE

Outer Table

→ Range-partition the outer table and redistribute it to cores.
→ Each core sorts its partitions in parallel.

Inner Table

→ Not redistributed like the outer table.
→ Each core sorts its local data.

Merge phase is between entire sorted run of outer table and a segment of inner table.


MASSIVELY PARALLEL SORT-MERGE JOINS IN MAIN MEMORY MULTI-CORE DATABASE SYSTEMS VLDB 2012

SLIDES 87–99

MASSIVELY PARALLEL SORT-MERGE

[Animation: the outer table goes through Cross-NUMA Partitioning and each core sorts its partition ("SORT!"); the inner table is sorted locally without redistribution. Each core then performs a Cross-Partition Merge Join (⨝) of its sorted outer run against the inner segments.]

SLIDE 100

HYPER'S RULES FOR PARALLELIZATION

Rule #1: No random writes to non-local memory

→ Chunk the data, redistribute, and then each core sorts/works on local data.

Rule #2: Only perform sequential reads on non-local memory

→ This allows the hardware prefetcher to hide remote access latency.

Rule #3: No core should ever wait for another

→ Avoid fine-grained latching or sync barriers.


Source: Martina-Cezara Albutiu

SLIDE 101

EVALUATION

Compare the different join algorithms using a synthetic data set.

→ Sort-Merge: M-WAY, M-PASS, MPSM
→ Hash: Radix Partitioning

Hardware:

→ 4 Socket Intel Xeon E5-4640 @ 2.4GHz
→ 8 Cores with 2 Threads Per Core
→ 512 GB of DRAM


SLIDES 102–103

RAW SORTING PERFORMANCE

[Chart: single-threaded sorting throughput (M tuples/sec) vs. number of tuples (in 2^20, from 1 to 256), comparing C++ STL sort against SIMD sort. The SIMD sort is 2.5–3x faster. Source: Cagri Balkesen]

SLIDES 104–107

COMPARISON OF SORT-MERGE JOINS

[Chart: cycles per output tuple for M-WAY, M-PASS, and MPSM, broken down into Partition, Sort, S-Merge, and M-Join phases, with throughput (M tuples/sec) overlaid; annotated values are 13.6, 7.6, and 22.9 cycles per output tuple. Workload: 1.6B ⨝ 128M (8-byte tuples). Source: Cagri Balkesen]

SLIDES 108–110

M-WAY JOIN VS. MPSM JOIN

[Chart: throughput (M tuples/sec) vs. number of threads (1–64, with the Hyper-Threading region marked) for Multi-Way and Massively Parallel. Annotated points include 54, 90, 108, 130, 259, and 315 M/sec, with the peak at 315 M/sec. Workload: 1.6B ⨝ 128M (8-byte tuples). Source: Cagri Balkesen]

SLIDES 111–112

SORT-MERGE JOIN VS. HASH JOIN

[Chart: cycles per output tuple for SORT vs. HASH on four workloads (128M⨝128M, 1.6B⨝1.6B, 128M⨝512M, 1.6B⨝6.4B), broken down into Partition, Sort, S-Merge, M-Join, and Build+Probe phases. Hardware: 4 Socket Intel Xeon E5-4640 @ 2.4GHz, 8 Cores with 2 Threads Per Core. Source: Cagri Balkesen]

SLIDE 113

SORT-MERGE JOIN VS. HASH JOIN

[Chart: throughput (M tuples/sec) while varying the size of the input relations (128 to 1920 million tuples), comparing the Multi-Way Sort-Merge Join against the Radix Hash Join. Source: Cagri Balkesen]

SLIDE 114

PARTING THOUGHTS

Both join approaches are equally important. Every serious OLAP DBMS supports both. We did not consider the impact of queries where the output needs to be sorted.


SLIDE 115

CODE REVIEWS

Each group will send a pull request to the target repos.

→ PR must be able to merge cleanly into master branch.
→ Reviewing group will write comments on that request.
→ Add the URL to the Google spreadsheet and notify the reviewing team that it is ready.

Please be helpful and courteous.


SLIDE 116

GENERAL TIPS

The dev team should provide you with a summary of what files/functions the reviewing team should look at.

Review fewer than 400 lines of code at a time, and only for at most 60 minutes.

Use a checklist to outline what kind of problems you are looking for.


Source: SmartBear

SLIDE 117

CHECKLIST – GENERAL

Does the code work?
Is all the code easily understood?
Is there any redundant or duplicate code?
Is the code as modular as possible?
Can any global variables be replaced?
Is there any commented-out code?
Is it using proper debug log functions?


Source: Gareth Wilson

SLIDE 118

CHECKLIST – DOCUMENTATION

Do comments describe the intent of the code?
Are all functions commented?
Is any unusual behavior described?
Is the use of 3rd-party libraries documented?
Is there any incomplete code?


Source: Gareth Wilson

SLIDE 119

CHECKLIST – TESTING

Do tests exist, and are they comprehensive?
Are the tests actually testing the feature?
Are they relying on hardcoded answers?
What is the code coverage?


Source: Gareth Wilson

SLIDE 120

NEXT CLASS

Vectorized Execution

Reminder: Code Review Updates due April 21st @ 11:59pm
