DATABASE SYSTEM IMPLEMENTATION GT 4420/6422 // SPRING 2019



slide-1
SLIDE 1

DATABASE SYSTEM IMPLEMENTATION

GT 4420/6422 // SPRING 2019 // @JOY_ARULRAJ LECTURE #23: VECTORIZED EXECUTION

slide-2
SLIDE 2

ANATOMY OF A DATABASE SYSTEM

Connection Manager + Admission Control Query Parser Query Optimizer Query Executor Lock Manager (Concurrency Control) Access Methods (or Indexes) Buffer Pool Manager Log Manager Memory Manager + Disk Manager Networking Manager

2

Query Transactional Storage Manager Query Processor Shared Utilities Process Manager

Source: Anatomy of a Database System

slide-3
SLIDE 3

TODAY’S AGENDA

Background Hardware Vectorized Algorithms (Columbia)

3

slide-4
SLIDE 4

VECTORIZATION

The process of converting an algorithm's scalar implementation, which processes a single pair of operands at a time, into a vector implementation, which processes one operation on multiple pairs of operands at once.

4

slide-5
SLIDE 5

WHY THIS MATTERS

Say we can parallelize our algorithm over 32 cores, and each core has 4-wide SIMD registers. Potential speed-up: 32× × 4× = 128×

5

slide-6
SLIDE 6

MULTI-CORE CPUS

Use a small number of high-powered cores.

→ Intel Xeon Skylake / Kaby Lake → High power consumption and area per core.

Massively superscalar and aggressive out-of-order execution

→ Instructions are issued from a sequential stream. → Check for dependencies between instructions. → Process multiple instructions per clock cycle.

6

slide-7
SLIDE 7

MANY INTEGRATED CORES (MIC)

Use a larger number of low-powered cores.

→ Intel Xeon Phi → Low power consumption and area per core. → Expanded SIMD instructions with larger register sizes.

Knights Ferry (Columbia Paper)

→ Non-superscalar and in-order execution → Cores = Intel P54C (aka Pentium from the 1990s).

Knights Landing (Since 2016)

→ Superscalar and out-of-order execution. → Cores = Silvermont (aka Atom)

7



slide-10
SLIDE 10

SINGLE INSTRUCTION, MULTIPLE DATA

A class of CPU instructions that allows the processor to perform the same operation on multiple data points simultaneously. All major ISAs provide microarchitectural support for SIMD operations.

→ x86: MMX, SSE, SSE2, SSE3, SSE4, AVX, AVX2, AVX512 → PowerPC: Altivec → ARM: NEON

11

slide-11
SLIDE 11

SIMD EXAMPLE

12

X + Y = Z: [x₁, x₂, ⋯, xₙ] + [y₁, y₂, ⋯, yₙ] = [x₁+y₁, x₂+y₂, ⋯, xₙ+yₙ]

slide-12
SLIDE 12

Z

SIMD EXAMPLE

13

X + Y = Z

8 7 6 5 4 3 2 1

X

for (i=0; i<n; i++) { Z[i] = X[i] + Y[i]; }

1 1 1 1 1 1 1 1

Y

x1 x2 ⋮ xn y1 y2 ⋮ yn x1+y1 x2+y2 ⋮ xn+yn + =

slide-13
SLIDE 13

Z

SIMD EXAMPLE

14

X + Y = Z

8 7 6 5 4 3 2 1

X

SISD

+

for (i=0; i<n; i++) { Z[i] = X[i] + Y[i]; }

1 1 1 1 1 1 1 1

Y

x1 x2 ⋮ xn y1 y2 ⋮ yn x1+y1 x2+y2 ⋮ xn+yn + =

slide-14
SLIDE 14

Z

SIMD EXAMPLE

15

X + Y = Z

8 7 6 5 4 3 2 1

X

SISD

+

for (i=0; i<n; i++) { Z[i] = X[i] + Y[i]; }

9 1 1 1 1 1 1 1 1

Y

x1 x2 ⋮ xn y1 y2 ⋮ yn x1+y1 x2+y2 ⋮ xn+yn + =

slide-15
SLIDE 15

Z

SIMD EXAMPLE

16

X + Y = Z

8 7 6 5 4 3 2 1

X

SISD

+

for (i=0; i<n; i++) { Z[i] = X[i] + Y[i]; }

9 8 7 6 5 4 3 2 1 1 1 1 1 1 1 1

Y

x1 x2 ⋮ xn y1 y2 ⋮ yn x1+y1 x2+y2 ⋮ xn+yn + =

slide-16
SLIDE 16

Z

SIMD EXAMPLE

17

X + Y = Z

8 7 6 5 4 3 2 1

X

for (i=0; i<n; i++) { Z[i] = X[i] + Y[i]; }

1 1 1 1 1 1 1 1

Y

x1 x2 ⋮ xn y1 y2 ⋮ yn x1+y1 x2+y2 ⋮ xn+yn + =

slide-17
SLIDE 17

Z

SIMD EXAMPLE

18

X + Y = Z

8 7 6 5 4 3 2 1

X

for (i=0; i<n; i++) { Z[i] = X[i] + Y[i]; }

1 1 1 1 1 1 1 1

Y

SIMD

+

x1 x2 ⋮ xn y1 y2 ⋮ yn x1+y1 x2+y2 ⋮ xn+yn + =

slide-18
SLIDE 18

Z

SIMD EXAMPLE

19

X + Y = Z

8 7 6 5 4 3 2 1

X

for (i=0; i<n; i++) { Z[i] = X[i] + Y[i]; }

1 1 1 1 1 1 1 1

Y

SIMD

+

8 7 6 5 1 1 1 1

128-bit SIMD Register 128-bit SIMD Register x1 x2 ⋮ xn y1 y2 ⋮ yn x1+y1 x2+y2 ⋮ xn+yn + =

slide-19
SLIDE 19

Z

SIMD EXAMPLE

20

X + Y = Z

8 7 6 5 4 3 2 1

X

for (i=0; i<n; i++) { Z[i] = X[i] + Y[i]; }

9 8 7 6 1 1 1 1 1 1 1 1

Y

SIMD

+

8 7 6 5 1 1 1 1

128-bit SIMD Register 128-bit SIMD Register 128-bit SIMD Register x1 x2 ⋮ xn y1 y2 ⋮ yn x1+y1 x2+y2 ⋮ xn+yn + =

slide-20
SLIDE 20

Z

SIMD EXAMPLE

21

X + Y = Z

8 7 6 5 4 3 2 1

X

for (i=0; i<n; i++) { Z[i] = X[i] + Y[i]; }

9 8 7 6 1 1 1 1 1 1 1 1

Y

SIMD

+

4 3 2 1 1 1 1 1

x1 x2 ⋮ xn y1 y2 ⋮ yn x1+y1 x2+y2 ⋮ xn+yn + =

slide-21
SLIDE 21

Z

SIMD EXAMPLE

22

X + Y = Z

8 7 6 5 4 3 2 1

X

for (i=0; i<n; i++) { Z[i] = X[i] + Y[i]; }

9 8 7 6 5 4 3 2 1 1 1 1 1 1 1 1

Y

SIMD

+

4 3 2 1 1 1 1 1

x1 x2 ⋮ xn y1 y2 ⋮ yn x1+y1 x2+y2 ⋮ xn+yn + =

slide-22
SLIDE 22

STREAMING SIMD EXTENSIONS (SSE)

SSE is a collection of SIMD instructions that target special 128-bit SIMD registers. These registers can be packed with four 32-bit scalars, after which an operation can be performed on each of the four elements simultaneously.

First introduced by Intel in 1999.
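As a minimal illustration (my example, not from the slides), the SSE2 intrinsics below pack four 32-bit integers into 128-bit registers and add all four pairs with one instruction:

```c
#include <emmintrin.h>  /* SSE2: 128-bit integer intrinsics */

/* Adds four pairs of 32-bit integers with a single SIMD add instruction. */
void sse_add4(const int *x, const int *y, int *z) {
    __m128i vx = _mm_loadu_si128((const __m128i *)x);  /* pack x[0..3] */
    __m128i vy = _mm_loadu_si128((const __m128i *)y);  /* pack y[0..3] */
    _mm_storeu_si128((__m128i *)z, _mm_add_epi32(vx, vy));
}
```

On x86-64, SSE2 is part of the baseline ISA, so no runtime feature check is needed for this example.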

23

slide-23
SLIDE 23

SIMD INSTRUCTIONS (1)

Data Movement

→ Moving data in and out of vector registers

Arithmetic Operations

→ Apply operation on multiple data items (e.g., 2 doubles, 4 floats, 16 bytes) → Example: ADD, SUB, MUL, DIV, SQRT, MAX, MIN

Logical Instructions

→ Logical operations on multiple data items → Example: AND, OR, XOR, ANDN, ANDPS, ANDNPS

24

slide-24
SLIDE 24

SIMD INSTRUCTIONS (2)

Comparison Instructions

→ Comparing multiple data items (==,<,<=,>,>=,!=)

Shuffle instructions

→ Move data in between SIMD registers

Miscellaneous

→ Conversion: Transform data between x86 and SIMD registers. → Cache Control: Move data directly from SIMD registers to memory (bypassing CPU cache).

25

slide-25
SLIDE 25

INTEL SIMD EXTENSIONS

26

Width Integers Single-P Double-P

1997 MMX

64 bits

1999 SSE

128 bits (×4)

2001 SSE2

128 bits (×2)

2004 SSE3

128 bits

2006 SSSE 3

128 bits

2006 SSE 4.1

128 bits

2008 SSE 4.2

128 bits

2011 AVX

256 bits (×8) (×4)

2013 AVX2

256 bits

2017 AVX-512

512 bits (×16) (×8)

Source: James Reinders

slide-26
SLIDE 26

WHY NOT GPUS?

Moving data back and forth between DRAM and the GPU over the PCI-E bus is slow. There are some newer GPU-enabled DBMSs:

→ Examples: MapD, SQream, Kinetica

Emerging co-processors that can share the CPU's memory may change this:

→ Examples: AMD's APU, Intel's Knights Landing

27

slide-27
SLIDE 27

VECTORIZATION

Choice #1: Automatic Vectorization Choice #2: Compiler Hints Choice #3: Explicit Vectorization

28

Source: James Reinders

slide-28
SLIDE 28

VECTORIZATION

Choice #1: Automatic Vectorization Choice #2: Compiler Hints Choice #3: Explicit Vectorization

29

Source: James Reinders

Ease of Use Programmer Control

slide-29
SLIDE 29

AUTOMATIC VECTORIZATION

The compiler can identify when instructions inside of a loop can be rewritten as a vectorized operation.

Works for simple loops only and is rare in database operators. Requires hardware support for SIMD instructions.

30

slide-30
SLIDE 30

AUTOMATIC VECTORIZATION

31

void add(int *X, int *Y, int *Z) {
    for (int i = 0; i < MAX; i++) {
        Z[i] = X[i] + Y[i];
    }
}

slide-31
SLIDE 31

AUTOMATIC VECTORIZATION

This loop is not legal to automatically vectorize: as written, the additions must appear to execute sequentially.

32

void add(int *X, int *Y, int *Z) { for (int i=0; i<MAX; i++) { Z[i] = X[i] + Y[i]; } }

slide-32
SLIDE 32

AUTOMATIC VECTORIZATION

This loop is not legal to automatically vectorize. The code is written such that the addition is described as being done sequentially.

33

These might point to the same address!

void add(int *X, int *Y, int *Z) { for (int i=0; i<MAX; i++) { Z[i] = X[i] + Y[i]; } }

slide-33
SLIDE 33

AUTOMATIC VECTORIZATION

This loop is not legal to automatically vectorize. The code is written such that the addition is described as being done sequentially.

34

These might point to the same address!

void add(int *X, int *Y, int *Z) { for (int i=0; i<MAX; i++) { Z[i] = X[i] + Y[i]; } }

*Z=*X+1


slide-35
SLIDE 35

COMPILER HINTS

Provide the compiler with additional information about the code to let it know that it is safe to vectorize. Two approaches:

→ Give explicit information about memory locations. → Tell the compiler to ignore vector dependencies.

36

slide-36
SLIDE 36

COMPILER HINTS

The restrict keyword (standard in C99, and supported by most C++ compilers as an extension) tells the compiler that the arrays point to distinct locations in memory.

37

void add(int *restrict X,
         int *restrict Y,
         int *restrict Z) {
    for (int i = 0; i < MAX; i++) {
        Z[i] = X[i] + Y[i];
    }
}

slide-37
SLIDE 37

COMPILER HINTS

This pragma tells the compiler to ignore loop dependencies for the vectors. It's up to you to make sure that this is correct.

38

void add(int *X, int *Y, int *Z) {
    #pragma ivdep
    for (int i = 0; i < MAX; i++) {
        Z[i] = X[i] + Y[i];
    }
}
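A portable variant of the same hint (my addition, not on the slides) is OpenMP's simd directive, which most compilers accept regardless of vendor; as with ivdep, correctness is still on you:

```c
#define MAX 1024

/* #pragma omp simd asserts that the loop iterations are independent, like
   ivdep but in a compiler-independent way (OpenMP 4.0+).  Compiled without
   OpenMP support, the pragma is simply ignored and the loop runs as scalar. */
void add(int *X, int *Y, int *Z) {
    #pragma omp simd
    for (int i = 0; i < MAX; i++) {
        Z[i] = X[i] + Y[i];
    }
}
```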

slide-38
SLIDE 38

EXPLICIT VECTORIZATION

Use CPU intrinsics to manually marshal data between SIMD registers and execute vectorized instructions. Potentially not portable.

39

slide-39
SLIDE 39

EXPLICIT VECTORIZATION

Store the vectors in 128-bit SIMD registers. Then invoke the intrinsic to add together the vectors and write them to the output location.

40

void add(int *X, int *Y, int *Z) {
    __m128i *vecX = (__m128i*)X;
    __m128i *vecY = (__m128i*)Y;
    __m128i *vecZ = (__m128i*)Z;
    for (int i = 0; i < MAX/4; i++) {
        _mm_store_si128(vecZ++, _mm_add_epi32(*vecX++, *vecY++));
    }
}

slide-40
SLIDE 40

VECTORIZATION DIRECTION

Approach #1: Horizontal

→ Perform operation on all elements together within a single vector.

Approach #2: Vertical

→ Perform operation in an elementwise manner on elements of each vector.

41

Source: Przemysław Karpiński
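In scalar terms (a sketch, not the paper's code), the two directions look like this: horizontal folds the lanes of one vector into a single result, while vertical combines lane i of one vector with lane i of another:

```c
/* Horizontal: one operation combines all elements within a single vector,
   e.g., folding {0, 1, 2, 3} into the sum 6. */
int horizontal_sum(const int *v, int n) {
    int acc = 0;
    for (int i = 0; i < n; i++)
        acc += v[i];
    return acc;
}

/* Vertical: one operation applied elementwise across two vectors,
   e.g., {1, 2, 3, 4} + {1, 1, 1, 1} = {2, 3, 4, 5}. */
void vertical_add(const int *a, const int *b, int *out, int n) {
    for (int i = 0; i < n; i++)
        out[i] = a[i] + b[i];
}
```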

slide-41
SLIDE 41

VECTORIZATION DIRECTION

Approach #1: Horizontal

→ Perform operation on all elements together within a single vector.

Approach #2: Vertical

→ Perform operation in an elementwise manner on elements of each vector.

42

Source: Przemysław Karpiński

0 1 2 3

SIMD Add

6

slide-42
SLIDE 42

VECTORIZATION DIRECTION

Approach #1: Horizontal

→ Perform operation on all elements together within a single vector.

Approach #2: Vertical

→ Perform operation in an elementwise manner on elements of each vector.

43

Source: Przemysław Karpiński

0 1 2 3

SIMD Add

6 0 1 2 3

SIMD Add

1 1 1 1 1 2 3 4

slide-43
SLIDE 43

EXPLICIT VECTORIZATION

Linear Access Operators

→ Predicate evaluation → Compression

Ad-hoc Vectorization

→ Sorting → Merging

Composable Operations

→ Multi-way trees → Bucketized hash tables

44

Source: Orestis Polychroniou

slide-44
SLIDE 44

VECTORIZED DBMS ALGORITHMS

Principles for efficient vectorization: use fundamental vector operations to construct more advanced functionality.

→ Favor vertical vectorization by processing different input data per lane. → Maximize lane utilization by executing different things per lane subset.

45

RETHINKING SIMD VECTORIZATION FOR IN-MEMORY DATABASES (SIGMOD 2015)

slide-45
SLIDE 45

FUNDAMENTAL OPERATIONS

Selective Load Selective Store Selective Gather Selective Scatter

46
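Their semantics can be sketched with scalar loops (real implementations use mask registers and hardware gather/scatter instructions, e.g., in AVX-512): selective load/store move data between masked vector lanes and a contiguous memory region, while gather/scatter address memory through a per-lane index vector:

```c
/* Selective load: contiguous memory values fill only the lanes whose mask
   bit is set; unmasked lanes keep their old contents. */
void selective_load(int *vec, const int *mem, const int *mask, int n) {
    int j = 0;
    for (int i = 0; i < n; i++)
        if (mask[i]) vec[i] = mem[j++];
}

/* Selective store: masked lanes are written out to contiguous memory. */
void selective_store(const int *vec, int *mem, const int *mask, int n) {
    int j = 0;
    for (int i = 0; i < n; i++)
        if (mask[i]) mem[j++] = vec[i];
}

/* Gather: lane i reads mem[idx[i]].  Scatter: lane i writes mem[idx[i]]. */
void gather(int *vec, const int *mem, const int *idx, int n) {
    for (int i = 0; i < n; i++) vec[i] = mem[idx[i]];
}
void scatter(const int *vec, int *mem, const int *idx, int n) {
    for (int i = 0; i < n; i++) mem[idx[i]] = vec[i];
}
```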

slide-46
SLIDE 46

FUNDAMENTAL VECTOR OPERATIONS

47

Selective Load

A B C D

Vector Memory

0 1 0 1

Mask

U V W X Y Z • • •

slide-47
SLIDE 47

FUNDAMENTAL VECTOR OPERATIONS

48

Selective Load

A B C D

Vector Memory

0 1 0 1

Mask

U V W X Y Z • • •

slide-48
SLIDE 48

FUNDAMENTAL VECTOR OPERATIONS

49

Selective Load

A B C D

Vector Memory

0 1 0 1

Mask

U V W X Y Z • • •

slide-49
SLIDE 49

FUNDAMENTAL VECTOR OPERATIONS

50

Selective Load

A B C D

Vector Memory

0 1 0 1

Mask

U V W X Y Z • • •

slide-50
SLIDE 50

FUNDAMENTAL VECTOR OPERATIONS

51

Selective Load

A B C D

Vector Memory

0 1 0 1

Mask

U V W X Y Z • • •

slide-51
SLIDE 51

FUNDAMENTAL VECTOR OPERATIONS

52

Selective Load

A B C D

Vector Memory

0 1 0 1

Mask

U V W X Y Z • • • U

slide-52
SLIDE 52

FUNDAMENTAL VECTOR OPERATIONS

53

Selective Load

A B C D

Vector Memory

0 1 0 1

Mask

U V W X Y Z • • • U

slide-53
SLIDE 53

FUNDAMENTAL VECTOR OPERATIONS

54

Selective Load

A B C D

Vector Memory

0 1 0 1

Mask

U V W X Y Z • • • U V

slide-54
SLIDE 54

FUNDAMENTAL VECTOR OPERATIONS

55

Selective Load Selective Store

A B C D

Vector Memory

0 1 0 1

Mask

U V W X Y Z • • • U V A B C D

Vector

U V W X Y Z • • •

Memory

0 1 0 1

Mask

slide-55
SLIDE 55

FUNDAMENTAL VECTOR OPERATIONS

56

Selective Load Selective Store

A B C D

Vector Memory

0 1 0 1

Mask

U V W X Y Z • • • U V A B C D

Vector

U V W X Y Z • • •

Memory

0 1 0 1

Mask

slide-56
SLIDE 56

FUNDAMENTAL VECTOR OPERATIONS

57

Selective Load Selective Store

A B C D

Vector Memory

0 1 0 1

Mask

U V W X Y Z • • • U V A B C D

Vector

U V W X Y Z • • •

Memory

0 1 0 1

Mask

slide-57
SLIDE 57

FUNDAMENTAL VECTOR OPERATIONS

58

Selective Load Selective Store

A B C D

Vector Memory

0 1 0 1

Mask

U V W X Y Z • • • U V A B C D

Vector

U V W X Y Z • • •

Memory

0 1 0 1

Mask

B

slide-58
SLIDE 58

FUNDAMENTAL VECTOR OPERATIONS

59

Selective Load Selective Store

A B C D

Vector Memory

0 1 0 1

Mask

U V W X Y Z • • • U V A B C D

Vector

U V W X Y Z • • •

Memory

0 1 0 1

Mask

B

slide-59
SLIDE 59

FUNDAMENTAL VECTOR OPERATIONS

60

Selective Load Selective Store

A B C D

Vector Memory

0 1 0 1

Mask

U V W X Y Z • • • U V A B C D

Vector

U V W X Y Z • • •

Memory

0 1 0 1

Mask

B D

slide-60
SLIDE 60

Selective Gather

FUNDAMENTAL VECTOR OPERATIONS

61

A B D

Value Vector Memory

2 1 5 3

Index Vector

U V W X Y Z • • • C A

slide-61
SLIDE 61

Selective Gather

FUNDAMENTAL VECTOR OPERATIONS

62

A B D

Value Vector Memory

2 1 5 3

Index Vector

U V W X Y Z • • • C A

slide-62
SLIDE 62

Selective Gather

FUNDAMENTAL VECTOR OPERATIONS

63

A B D

Value Vector Memory

2 1 5 3

Index Vector

U V W X Y Z • • • C A

2 1 3 5 4

slide-63
SLIDE 63

Selective Gather

FUNDAMENTAL VECTOR OPERATIONS

64

A B D

Value Vector Memory

2 1 5 3

Index Vector

U V W X Y Z • • • C A

2 1 3 5 4

slide-64
SLIDE 64

Selective Gather

FUNDAMENTAL VECTOR OPERATIONS

65

A B D

Value Vector Memory

2 1 5 3

Index Vector

U V W X Y Z • • • C A W

2 1 3 5 4

slide-65
SLIDE 65

Selective Gather

FUNDAMENTAL VECTOR OPERATIONS

66

A B D

Value Vector Memory

2 1 5 3

Index Vector

U V W X Y Z • • • C A W V X Z

2 1 3 5 4

slide-66
SLIDE 66

Selective Gather

FUNDAMENTAL VECTOR OPERATIONS

67

Selective Scatter

A B D

Value Vector Memory

2 1 5 3

Index Vector

U V W X Y Z • • • A B C D

Value Vector

U V W X Y Z • • •

Memory

2 1 5 3

Index Vector

C A W V X Z

2 1 3 5 4

slide-67
SLIDE 67

Selective Gather

FUNDAMENTAL VECTOR OPERATIONS

68

Selective Scatter

A B D

Value Vector Memory

2 1 5 3

Index Vector

U V W X Y Z • • • A B C D

Value Vector

U V W X Y Z • • •

Memory

2 1 5 3

Index Vector

C A W V X Z

2 1 3 5 4

slide-68
SLIDE 68

Selective Gather

FUNDAMENTAL VECTOR OPERATIONS

69

Selective Scatter

A B D

Value Vector Memory

2 1 5 3

Index Vector

U V W X Y Z • • • A B C D

Value Vector

U V W X Y Z • • •

Memory

2 1 5 3

Index Vector

C A W V X Z

2 1 3 5 4 2 1 3 5 4

slide-69
SLIDE 69

Selective Gather

FUNDAMENTAL VECTOR OPERATIONS

70

Selective Scatter

A B D

Value Vector Memory

2 1 5 3

Index Vector

U V W X Y Z • • • A B C D

Value Vector

U V W X Y Z • • •

Memory

2 1 5 3

Index Vector

C A W V X Z

2 1 3 5 4 2 1 3 5 4

slide-70
SLIDE 70

Selective Gather

FUNDAMENTAL VECTOR OPERATIONS

71

Selective Scatter

A B D

Value Vector Memory

2 1 5 3

Index Vector

U V W X Y Z • • • A B C D

Value Vector

U V W X Y Z • • •

Memory

2 1 5 3

Index Vector

C A W V X Z B A C D

2 1 3 5 4 2 1 3 5 4

slide-71
SLIDE 71

ISSUES

Gathers and scatters are not really executed in parallel because the L1 cache allows only one or two distinct accesses per cycle. Gathers are only supported in newer CPUs. Selective loads and stores are also emulated on Xeon CPUs using vector permutations.

72

slide-72
SLIDE 72

VECTORIZED OPERATORS

Selection Scans / Hash Tables / Partitioning. The paper provides additional info:

→ Joins, Sorting, Bloom filters.

73

RETHINKING SIMD VECTORIZATION FOR IN-MEMORY DATABASES (SIGMOD 2015)

slide-73
SLIDE 73

SELECTION SCANS

74

SELECT * FROM table WHERE key >= $(low) AND key <= $(high)

slide-74
SLIDE 74

SELECTION SCANS

75

Scalar (Branching)

i = 0
for t in table:
    key = t.key
    if (key ≥ low) && (key ≤ high):
        copy(t, output[i])
        i = i + 1

slide-75
SLIDE 75

SELECTION SCANS

76

Scalar (Branching)

i = 0 for t in table: key = t.key if (key≥low) && (key≤high): copy(t, output[i]) i = i + 1

slide-76
SLIDE 76

SELECTION SCANS

77

Scalar (Branching)

i = 0 for t in table: key = t.key if (key≥low) && (key≤high): copy(t, output[i]) i = i + 1

Scalar (Branchless)

i = 0
for t in table:
    copy(t, output[i])
    key = t.key
    m = (key ≥ low ? 1 : 0) && (key ≤ high ? 1 : 0)
    i = i + m

slide-77
SLIDE 77

SELECTION SCANS

78

Scalar (Branching)

i = 0 for t in table: key = t.key if (key≥low) && (key≤high): copy(t, output[i]) i = i + 1

Scalar (Branchless)

i = 0 for t in table: copy(t, output[i]) key = t.key m = (key≥low ? 1 : 0) && (key≤high ? 1 : 0) i = i + m

slide-78
SLIDE 78

SELECTION SCANS

79

Scalar (Branching)

i = 0 for t in table: key = t.key if (key≥low) && (key≤high): copy(t, output[i]) i = i + 1

Scalar (Branchless)

i = 0 for t in table: copy(t, output[i]) key = t.key m = (key≥low ? 1 : 0) && (key≤high ? 1 : 0) i = i + m

Source: Bogdan Raducanu
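A runnable C version of the branchless variant (my sketch of the pseudocode above, storing matching offsets rather than copying whole tuples):

```c
/* Branchless selection scan: the tuple offset is written unconditionally and
   the output index advances by the predicate result (0 or 1), so there is no
   branch for the CPU to mispredict on unpredictable selectivities. */
int scan_branchless(const int *keys, int n, int low, int high, int *out) {
    int i = 0;
    for (int t = 0; t < n; t++) {
        out[i] = t;
        i += (keys[t] >= low) & (keys[t] <= high);
    }
    return i;  /* number of matching offsets */
}
```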

slide-79
SLIDE 79

SELECTION SCANS

80

Vectorized

i = 0
for vt in table:
    simdLoad(vt.key, vk)
    vm = (vk ≥ low ? 1 : 0) && (vk ≤ high ? 1 : 0)
    simdStore(vt, vm, output[i])
    i = i + |vm ≠ false|

slide-80
SLIDE 80

SELECTION SCANS

81

Vectorized

i = 0 for vt in table: simdLoad(vt.key, vk) vm = (vk≥low ? 1 : 0) && (vk≤high ? 1 : 0) simdStore(vt, vm, output[i]) i = i + |vm≠false|

slide-81
SLIDE 81

SELECTION SCANS

82

Vectorized

i = 0 for vt in table: simdLoad(vt.key, vk) vm = (vk≥low ? 1 : 0) && (vk≤high ? 1 : 0) simdStore(vt, vm, output[i]) i = i + |vm≠false|

slide-82
SLIDE 82

SELECTION SCANS

83

Vectorized

i = 0 for vt in table: simdLoad(vt.key, vk) vm = (vk≥low ? 1 : 0) && (vk≤high ? 1 : 0) simdStore(vt, vm, output[i]) i = i + |vm≠false|

slide-83
SLIDE 83

SELECTION SCANS

84

Vectorized

i = 0 for vt in table: simdLoad(vt.key, vk) vm = (vk≥low ? 1 : 0) && (vk≤high ? 1 : 0) simdStore(vt, vm, output[i]) i = i + |vm≠false|

slide-84
SLIDE 84

SELECTION SCANS

85

Vectorized

i = 0 for vt in table: simdLoad(vt.key, vk) vm = (vk≥low ? 1 : 0) && (vk≤high ? 1 : 0) simdStore(vt, vm, output[i]) i = i + |vm≠false|

slide-85
SLIDE 85

SELECTION SCANS

86

Vectorized

i = 0 for vt in table: simdLoad(vt.key, vk) vm = (vk≥low ? 1 : 0) && (vk≤high ? 1 : 0) simdStore(vt, vm, output[i]) i = i + |vm≠false|

SELECT * FROM table WHERE key >= "O" AND key <= "U"

slide-86
SLIDE 86

SELECTION SCANS

87

Vectorized

i = 0 for vt in table: simdLoad(vt.key, vk) vm = (vk≥low ? 1 : 0) && (vk≤high ? 1 : 0) simdStore(vt, vm, output[i]) i = i + |vm≠false|

ID  KEY
1   J
2   O
3   Y
4   S
5   U
6   X

SELECT * FROM table WHERE key >= "O" AND key <= "U"

slide-87
SLIDE 87

SELECTION SCANS

88

Vectorized

i = 0 for vt in table: simdLoad(vt.key, vk) vm = (vk≥low ? 1 : 0) && (vk≤high ? 1 : 0) simdStore(vt, vm, output[i]) i = i + |vm≠false|

J O Y S U X

Key Vector

ID 1 KEY J 2 O 3 Y 4 S 5 U 6 X

SELECT * FROM table WHERE key >= "O" AND key <= "U"

slide-88
SLIDE 88

SELECTION SCANS

89

Vectorized

i = 0 for vt in table: simdLoad(vt.key, vk) vm = (vk≥low ? 1 : 0) && (vk≤high ? 1 : 0) simdStore(vt, vm, output[i]) i = i + |vm≠false|

J O Y S U X

Key Vector

ID 1 KEY J 2 O 3 Y 4 S 5 U 6 X

Mask

0 1 0 1 1 0

SIMD Compare

SELECT * FROM table WHERE key >= "O" AND key <= "U"

slide-89
SLIDE 89

SELECTION SCANS

90

Vectorized

i = 0 for vt in table: simdLoad(vt.key, vk) vm = (vk≥low ? 1 : 0) && (vk≤high ? 1 : 0) simdStore(vt, vm, output[i]) i = i + |vm≠false|

J O Y S U X

Key Vector

ID 1 KEY J 2 O 3 Y 4 S 5 U 6 X

Mask

0 1 0 1 1 0

SIMD Compare

0 1 2 3 4 5

All Offsets SELECT * FROM table WHERE key >= "O" AND key <= "U"

slide-90
SLIDE 90

SELECTION SCANS

91

Vectorized

i = 0 for vt in table: simdLoad(vt.key, vk) vm = (vk≥low ? 1 : 0) && (vk≤high ? 1 : 0) simdStore(vt, vm, output[i]) i = i + |vm≠false|

J O Y S U X

Key Vector

ID 1 KEY J 2 O 3 Y 4 S 5 U 6 X

Mask

0 1 0 1 1 0

SIMD Compare

0 1 2 3 4 5

All Offsets

SIMD Store

1 3 4

Matched Offsets SELECT * FROM table WHERE key >= "O" AND key <= "U"

slide-91
SLIDE 91

SELECTION SCANS

92

Scalar (Branching) Scalar (Branchless) Vectorized (Early Mat) Vectorized (Late Mat)

MIC (Xeon Phi 7120P – 61 Cores + 4×HT) Multi-Core (Xeon E3-1275v3 – 4 Cores + 2×HT)

slide-92
SLIDE 92

SELECTION SCANS

93

[Chart: Throughput (billion tuples/sec) vs. selectivity (%) for Scalar (Branching), Scalar (Branchless), Vectorized (Early Mat), and Vectorized (Late Mat), on MIC (Xeon Phi 7120P – 61 Cores + 4×HT) and Multi-Core (Xeon E3-1275v3 – 4 Cores + 2×HT).]

slide-93
SLIDE 93

SELECTION SCANS

94


slide-94
SLIDE 94

SELECTION SCANS

95

[Same charts, with each platform's memory-bandwidth limit marked.]

slide-95
SLIDE 95

SELECTION SCANS

96


slide-96
SLIDE 96

SELECTION SCANS

97


slide-97
SLIDE 97

PAYLOAD KEY

Linear Probing Hash Table

HASH TABLES – PROBING

98

slide-98
SLIDE 98

PAYLOAD KEY

Linear Probing Hash Table

HASH TABLES – PROBING

99

Scalar

k1 Input Key

slide-99
SLIDE 99

PAYLOAD KEY

Linear Probing Hash Table

HASH TABLES – PROBING

100

Scalar

k1 Input Key h1 Hash Index

#

hash(key)

slide-100
SLIDE 100

PAYLOAD KEY

Linear Probing Hash Table

HASH TABLES – PROBING

101

Scalar

k1 Input Key h1 Hash Index

#

hash(key) k1 k9

=

slide-101
SLIDE 101

PAYLOAD KEY

Linear Probing Hash Table

HASH TABLES – PROBING

102

Scalar

k1 Input Key h1 Hash Index

#

hash(key) k1 k9

=

k3

=

k8

=

k1

=

slide-102
SLIDE 102

HASH TABLES – PROBING

103

Scalar

k1 Input Key h1 Hash Index

#

hash(key)

Vectorized (Horizontal)

KEYS PAYLOAD

Linear Probing Bucketized Hash Table

slide-103
SLIDE 103

HASH TABLES – PROBING

104

Scalar

k1 Input Key h1 Hash Index

#

hash(key)

Vectorized (Horizontal)

KEYS PAYLOAD

Linear Probing Bucketized Hash Table k1 Input Key h1 Hash Index

#

hash(key)

slide-104
SLIDE 104

HASH TABLES – PROBING

105

Scalar

k1 Input Key h1 Hash Index

#

hash(key)

Vectorized (Horizontal)

KEYS PAYLOAD

Linear Probing Bucketized Hash Table k1 Input Key h1 Hash Index

#

hash(key)

k9

=

k3 k8 k1

k1

slide-105
SLIDE 105

HASH TABLES – PROBING

106

Scalar

k1 Input Key h1 Hash Index

#

hash(key)

Vectorized (Horizontal)

KEYS PAYLOAD

Linear Probing Bucketized Hash Table k1 Input Key h1 Hash Index

#

hash(key)

k9

=

k3 k8 k1

k1

0 0 0 1

Matched Mask

SIMD Compare

slide-106
SLIDE 106

PAYLOAD

k99 k1 k6 k4

KEY

k5 k88

Linear Probing Hash Table

HASH TABLES – PROBING

107

Vectorized (Vertical)

Input Key Vector k1 k2 k3 k4

slide-107
SLIDE 107

PAYLOAD

k99 k1 k6 k4

KEY

k5 k88

Linear Probing Hash Table

HASH TABLES – PROBING

108

Vectorized (Vertical)

Input Key Vector hash(key)

# # # #

Hash Index Vector h1 h2 h3 h4 k1 k2 k3 k4

slide-108
SLIDE 108

PAYLOAD

k99 k1 k6 k4

KEY

k5 k88

Linear Probing Hash Table

HASH TABLES – PROBING

109

Vectorized (Vertical)

Input Key Vector hash(key)

# # # #

Hash Index Vector h1 h2 h3 h4 k1 k2 k3 k4 k1 k99 k88 k4

= = = =

SIMD Gather

k1 k2 k3 k4

slide-109
SLIDE 109

PAYLOAD

k99 k1 k6 k4

KEY

k5 k88

Linear Probing Hash Table

HASH TABLES – PROBING

110

Vectorized (Vertical)

Input Key Vector hash(key)

# # # #

Hash Index Vector h1 h2 h3 h4 k1 k2 k3 k4 k1 k99 k88 k4

= = = =

SIMD Compare

k1 k2 k3 k4

slide-110
SLIDE 110

PAYLOAD

k99 k1 k6 k4

KEY

k5 k88

Linear Probing Hash Table

HASH TABLES – PROBING

111

Vectorized (Vertical)

Input Key Vector hash(key)

# # # #

Hash Index Vector h1 h2 h3 h4 k1 k2 k3 k4 k1 k99 k88 k4

= = = =

SIMD Compare

1 1 k1 k2 k3 k4

slide-111
SLIDE 111

PAYLOAD

k99 k1 k6 k4

KEY

k5 k88

Linear Probing Hash Table

HASH TABLES – PROBING

112

Vectorized (Vertical)

Input Key Vector hash(key)

# # # #

Hash Index Vector h1 h2 h3 h4 k1 k2 k3 k4 k1 k99 k88 k4

= = = =

SIMD Compare

1 1 k1 k2 k3 k4

slide-112
SLIDE 112

PAYLOAD

k99 k1 k6 k4

KEY

k5 k88

Linear Probing Hash Table

HASH TABLES – PROBING

113

Vectorized (Vertical)

Input Key Vector hash(key)

# # # #

Hash Index Vector h1 h2 h3 h4 k1 k2 k3 k4 k1 k99 k88 k4

= = = =

SIMD Compare

1 1 k1 k2 k3 k4 k5 k6 h5 h2+1 h3+1 h6

slide-113
SLIDE 113

PAYLOAD

k99 k1 k6 k4

KEY

k5 k88

Linear Probing Hash Table

HASH TABLES – PROBING

114

Vectorized (Vertical)

Input Key Vector hash(key)

# # # #

Hash Index Vector h1 h2 h3 h4 k1 k2 k3 k4 k5 k6 h5 h2+1 h3+1 h6
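One probe step of the vertical variant can be emulated with scalar loops. This is a sketch: the lane count, the toy multiplicative hash, and the omitted refill logic are my assumptions — real code uses SIMD gather and compare-mask instructions:

```c
enum { W = 4 };  /* assumed number of vector lanes */

/* Toy multiplicative hash for illustration only. */
static unsigned slot_of(int key, int tsize) {
    return (key * 2654435761u) % (unsigned)tsize;
}

/* One vertical probing step: hash W input keys, gather the keys stored at
   those slots, and compare elementwise.  In the full algorithm, unmatched
   lanes advance to the next slot while matched lanes emit their result and
   are refilled with fresh input keys. */
void probe_step(const int *table_keys, int tsize,
                const int *in_keys, int *match_mask) {
    unsigned slot[W];
    int gathered[W];
    for (int l = 0; l < W; l++)                      /* "SIMD hash"    */
        slot[l] = slot_of(in_keys[l], tsize);
    for (int l = 0; l < W; l++)                      /* "SIMD gather"  */
        gathered[l] = table_keys[slot[l]];
    for (int l = 0; l < W; l++)                      /* "SIMD compare" */
        match_mask[l] = (gathered[l] == in_keys[l]);
}
```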

slide-114
SLIDE 114

HASH TABLES – PROBING

115

Scalar Vectorized (Horizontal) Vectorized (Vertical)

MIC (Xeon Phi 7120P – 61 Cores + 4×HT) Multi-Core (Xeon E3-1275v3 – 4 Cores + 2×HT)

slide-115
SLIDE 115

HASH TABLES – PROBING

116

[Chart: Throughput (billion tuples/sec) vs. hash table size (4KB–64MB) for Scalar, Vectorized (Horizontal), and Vectorized (Vertical), on MIC (Xeon Phi 7120P – 61 Cores + 4×HT) and Multi-Core (Xeon E3-1275v3 – 4 Cores + 2×HT).]

slide-116
SLIDE 116

HASH TABLES – PROBING

117


slide-117
SLIDE 117

HASH TABLES – PROBING

118


slide-118
SLIDE 118

HASH TABLES – PROBING

119

[Same charts, marking where the hash table no longer fits in cache.]

slide-119
SLIDE 119

PARTITIONING – HISTOGRAM

Use scatters and gathers to increment counts. Replicate the histogram to handle collisions.

120

slide-120
SLIDE 120

PARTITIONING – HISTOGRAM

Use scatter and gathers to increment counts. Replicate the histogram to handle collisions.

121

k1 k2 k3 k4 Input Key Vector

slide-121
SLIDE 121

PARTITIONING – HISTOGRAM

Use scatter and gathers to increment counts. Replicate the histogram to handle collisions.

122

k1 k2 k3 k4 Input Key Vector h1 h2 h3 h4 Hash Index Vector

SIMD Add SIMD Radix

+1 +1 +1 Histogram

slide-122
SLIDE 122

PARTITIONING – HISTOGRAM

Use scatter and gathers to increment counts. Replicate the histogram to handle collisions.

123

k1 k2 k3 k4 Input Key Vector h1 h2 h3 h4 Hash Index Vector

SIMD Add SIMD Radix

+1 +1 +1 Histogram

slide-123
SLIDE 123

PARTITIONING – HISTOGRAM

Use scatter and gathers to increment counts. Replicate the histogram to handle collisions.

124

k1 k2 k3 k4 Input Key Vector h1 h2 h3 h4 Hash Index Vector Replicated Histogram +1 +1 +1 +1

SIMD Radix SIMD Scatter

slide-124
SLIDE 124

PARTITIONING – HISTOGRAM

Use scatter and gathers to increment counts. Replicate the histogram to handle collisions.

125

k1 k2 k3 k4 Input Key Vector h1 h2 h3 h4 Hash Index Vector Replicated Histogram +1 +1 +1 +1

# of Vector Lanes

SIMD Radix SIMD Scatter

slide-125
SLIDE 125

PARTITIONING – HISTOGRAM

Use scatter and gathers to increment counts. Replicate the histogram to handle collisions.

126

k1 k2 k3 k4 Input Key Vector h1 h2 h3 h4 Hash Index Vector Replicated Histogram +1 +1 +1 +1

SIMD Add

# of Vector Lanes

SIMD Radix

+1 +2 +1 Histogram

SIMD Scatter
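In scalar terms (a sketch, with an assumed lane count and bucket count), the replicated histogram gives each lane a private copy so simultaneous increments of the same bucket never conflict, and the copies are reduced at the end:

```c
enum { LANES = 4, BUCKETS = 8 };  /* assumed sizes for illustration */

/* Each lane scatters its "+1" into a private replica of the histogram, which
   avoids the conflict when several lanes of one vector hash to the same
   bucket.  A final pass sums the replicas into the real histogram. */
void build_histogram(const int *hashes, int n, int *hist) {
    int replicas[LANES][BUCKETS] = {{0}};
    for (int i = 0; i < n; i++)
        replicas[i % LANES][hashes[i] % BUCKETS]++;  /* lane-private scatter */
    for (int b = 0; b < BUCKETS; b++) {              /* reduce the replicas  */
        hist[b] = 0;
        for (int l = 0; l < LANES; l++)
            hist[b] += replicas[l][b];
    }
}
```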

slide-126
SLIDE 126

JOINS

No Partitioning

→ Build one shared hash table using atomics → Partially vectorized

Min Partitioning

→ Partition building table → Build one hash table per thread → Fully vectorized

Max Partitioning

→ Partition both tables repeatedly → Build and probe cache-resident hash tables → Fully vectorized

127

slide-127
SLIDE 127

JOINS

128

[Chart: Join time (sec), broken into Partition / Build / Probe phases, for Scalar vs. Vector variants of No Partitioning, Min Partitioning, and Max Partitioning. Workload: 200M ⨝ 200M tuples (32-bit keys & payloads) on a Xeon Phi 7120P – 61 Cores + 4×HT.]

slide-128
SLIDE 128

PARTING THOUGHTS

Vectorization is essential for OLAP queries. These algorithms don't work when the data exceeds your CPU cache. We can combine all the intra-query parallelism optimizations we've talked about in a DBMS.

→ Multiple threads processing the same query. → Each thread can execute a compiled plan. → The compiled plan can invoke vectorized operations.

129

slide-129
SLIDE 129

NEXT CLASS

Reminder: No class on 4/18 (Thu). Reminder: Guest lecture on 4/23 (Tue). Reminder: Extra credit due on 4/18 (Thu). Reminder: Final presentation on 4/25 (Thu).

130