DATABASE SYSTEM IMPLEMENTATION
GT 4420/6422 // SPRING 2019 // @JOY_ARULRAJ
LECTURE #23: VECTORIZED EXECUTION
ANATOMY OF A DATABASE SYSTEM
Process Manager
→ Connection Manager + Admission Control
Query Processor
→ Query Parser
→ Query Optimizer
→ Query Executor
Transactional Storage Manager
→ Lock Manager (Concurrency Control)
→ Access Methods (or Indexes)
→ Buffer Pool Manager
→ Log Manager
Shared Utilities
→ Memory Manager + Disk Manager
→ Networking Manager
Source: Anatomy of a Database System
TODAY’S AGENDA
Background
Hardware
Vectorized Algorithms (Columbia)
VECTORIZATION
The process of converting an algorithm's scalar implementation, which processes a single pair of operands at a time, into a vector implementation that processes one operation on multiple pairs of operands at once.
WHY THIS MATTERS
Say we can parallelize our algorithm over 32 cores, and each core has 4-wide SIMD registers.
Potential speed-up: 32× × 4× = 128×
MULTI-CORE CPUS
Use a small number of high-powered cores.
→ Intel Xeon Skylake / Kaby Lake
→ High power consumption and area per core.
Massively superscalar and aggressive out-of-order execution.
→ Instructions are issued from a sequential stream.
→ Check for dependencies between instructions.
→ Process multiple instructions per clock cycle.
MANY INTEGRATED CORES (MIC)
Use a larger number of low-powered cores.
→ Intel Xeon Phi
→ Low power consumption and area per core.
→ Expanded SIMD instructions with larger register sizes.
Knights Ferry (Columbia Paper)
→ Non-superscalar and in-order execution.
→ Cores = Intel P54C (aka Pentium from the 1990s).
Knights Landing (Since 2016)
→ Superscalar and out-of-order execution.
→ Cores = Silvermont (aka Atom).
SINGLE INSTRUCTION, MULTIPLE DATA
A class of CPU instructions that allow the processor to perform the same operation on multiple data points simultaneously. All major ISAs have microarchitectural support for SIMD operations.
→ x86: MMX, SSE, SSE2, SSE3, SSE4, AVX, AVX2, AVX-512
→ PowerPC: AltiVec
→ ARM: NEON
SIMD EXAMPLE

Compute X + Y = Z elementwise, with X = [8 7 6 5 4 3 2 1] and Y = [1 1 1 1 1 1 1 1]:

for (i = 0; i < n; i++) {
  Z[i] = X[i] + Y[i];
}

SISD: a scalar loop adds one pair of elements per iteration (x1 + y1, then x2 + y2, …), taking n iterations.

SIMD: each iteration loads four elements of X and four of Y into 128-bit SIMD registers, adds all four pairs with one instruction, and stores the four sums into a 128-bit register for Z. The same computation finishes in n/4 iterations, producing Z = [9 8 7 6 5 4 3 2].
STREAMING SIMD EXTENSIONS (SSE)
SSE is a collection of SIMD instructions that target special 128-bit SIMD registers. These registers can be packed with four 32-bit scalars, after which an operation can be performed on each of the four elements simultaneously.
First introduced by Intel in 1999.
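A minimal sketch (not from the slides) of the packing described above, using SSE intrinsics on four 32-bit floats:

#include <xmmintrin.h>  /* SSE intrinsics */

/* Pack four 32-bit floats into 128-bit registers, perform all four
 * additions with a single instruction, and unpack the result. */
void add4(const float *x, const float *y, float *z) {
  __m128 vx = _mm_loadu_ps(x);     /* load x[0..3] */
  __m128 vy = _mm_loadu_ps(y);     /* load y[0..3] */
  __m128 vz = _mm_add_ps(vx, vy);  /* four additions at once */
  _mm_storeu_ps(z, vz);            /* store z[0..3] */
}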
SIMD INSTRUCTIONS (1)
Data Movement
→ Moving data in and out of vector registers.
Arithmetic Operations
→ Apply operation on multiple data items (e.g., 2 doubles, 4 floats, 16 bytes).
→ Example: ADD, SUB, MUL, DIV, SQRT, MAX, MIN
Logical Instructions
→ Logical operations on multiple data items.
→ Example: AND, OR, XOR, ANDN, ANDPS, ANDNPS
SIMD INSTRUCTIONS (2)
Comparison Instructions
→ Comparing multiple data items (==, <, <=, >, >=, !=); see the sketch after this list.
Shuffle Instructions
→ Move data between SIMD registers.
Miscellaneous
→ Conversion: Transform data between x86 and SIMD registers.
→ Cache Control: Move data directly from SIMD registers to memory (bypassing the CPU cache).
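A hedged SSE2 sketch (not from the slides) of a comparison instruction: instead of branching, it produces a per-lane mask that later instructions can use for blending or selective stores.

#include <emmintrin.h>  /* SSE2 intrinsics */

/* Compare four 32-bit ints against a broadcast constant; the result is a
 * vector mask (all-1s lanes where the predicate holds), reduced here to a
 * 4-bit scalar bitmask. */
int greater_than_mask(const int *keys, int low) {
  __m128i vk   = _mm_loadu_si128((const __m128i *)keys);
  __m128i vlow = _mm_set1_epi32(low);        /* broadcast low to all lanes */
  __m128i cmp  = _mm_cmpgt_epi32(vk, vlow);  /* lane-wise keys[i] > low */
  return _mm_movemask_ps(_mm_castsi128_ps(cmp));
}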
INTEL SIMD EXTENSIONS

Year  Extension  Width      Lanes
1997  MMX        64 bits    (integers)
1999  SSE        128 bits   ×4 single-precision
2001  SSE2       128 bits   ×2 double-precision
2004  SSE3       128 bits
2006  SSSE3      128 bits
2006  SSE4.1     128 bits
2008  SSE4.2     128 bits
2011  AVX        256 bits   ×8 single / ×4 double
2013  AVX2       256 bits
2017  AVX-512    512 bits   ×16 single / ×8 double
Source: James Reinders
WHY NOT GPUS?
Moving data back and forth between DRAM and the GPU over the PCIe bus is slow.
There are some newer GPU-enabled DBMSs.
→ Examples: MapD, SQream, Kinetica
Emerging co-processors that can share the CPU's memory may change this.
→ Examples: AMD's APU, Intel's Knights Landing
VECTORIZATION

Choice #1: Automatic Vectorization
Choice #2: Compiler Hints
Choice #3: Explicit Vectorization

These choices trade ease of use (highest for automatic vectorization) for programmer control (highest for explicit vectorization).

Source: James Reinders
AUTOMATIC VECTORIZATION
The compiler can identify when instructions inside of a loop can be rewritten as a vectorized operation.
Works for simple loops only and is rare in database operators. Requires hardware support for SIMD instructions.
AUTOMATIC VECTORIZATION
void add(int *X, int *Y, int *Z) {
  for (int i = 0; i < MAX; i++) {
    Z[i] = X[i] + Y[i];
  }
}
AUTOMATIC VECTORIZATION

This loop is not legal to automatically vectorize: the pointers X, Y, and Z might point to the same (or overlapping) addresses, so the code is written such that the additions must be done sequentially. For example, if Z aliases X shifted by one element, each iteration effectively computes *Z = *X + 1 using a value written by the previous iteration, and vectorizing would change the result.

void add(int *X, int *Y, int *Z) {
  for (int i = 0; i < MAX; i++) {
    Z[i] = X[i] + Y[i];
  }
}
COMPILER HINTS
Provide the compiler with additional information about the code to let it know that it is safe to vectorize. Two approaches:
→ Give explicit information about memory locations.
→ Tell the compiler to ignore vector dependencies.
COMPILER HINTS
The restrict keyword in C (standardized in C99) tells the compiler that the arrays are distinct locations in memory.

void add(int *restrict X,
         int *restrict Y,
         int *restrict Z) {
  for (int i = 0; i < MAX; i++) {
    Z[i] = X[i] + Y[i];
  }
}
COMPILER HINTS
This pragma tells the compiler to ignore loop dependencies for the vectors. It's up to you to make sure that this is correct.

void add(int *X, int *Y, int *Z) {
  #pragma ivdep
  for (int i = 0; i < MAX; i++) {
    Z[i] = X[i] + Y[i];
  }
}
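Portability note (an addition, not from the slides): #pragma ivdep is the Intel compiler's spelling; GCC accepts the equivalent #pragma GCC ivdep. Likewise, C++ compilers typically expose restrict through the __restrict extension.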
EXPLICIT VECTORIZATION
Use CPU intrinsics to manually marshal data between SIMD registers and execute vectorized instructions. Potentially not portable.
EXPLICIT VECTORIZATION
Store the vectors in 128-bit SIMD registers. Then invoke the intrinsic to add together the vectors and write them to the output location.
void add(int *X, int *Y, int *Z) {
  __m128i *vecX = (__m128i *)X;
  __m128i *vecY = (__m128i *)Y;
  __m128i *vecZ = (__m128i *)Z;
  for (int i = 0; i < MAX / 4; i++) {
    _mm_store_si128(vecZ++, _mm_add_epi32(*vecX++, *vecY++));
  }
}
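Note that _mm_store_si128 and the vector dereferences assume the arrays are 16-byte aligned; _mm_storeu_si128 and _mm_loadu_si128 are the unaligned variants. The loop also assumes MAX is a multiple of four; any leftover elements need a scalar tail loop.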
VECTORIZATION DIRECTION

Approach #1: Horizontal
→ Perform operation on all elements together within a single vector.
→ Example: a horizontal add over the vector [0 1 2 3] produces the single sum 6.
Approach #2: Vertical
→ Perform operation in an elementwise manner on elements of each vector.
→ Example: a vertical add of [0 1 2 3] and [1 1 1 1] produces [1 2 3 4].

Source: Przemysław Karpiński
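A minimal intrinsics sketch (not from the slides) contrasting the two directions on 4 × 32-bit integers:

#include <tmmintrin.h>  /* SSSE3 (for _mm_hadd_epi32) */

void directions(void) {
  __m128i a = _mm_setr_epi32(0, 1, 2, 3);
  __m128i b = _mm_setr_epi32(1, 1, 1, 1);

  /* Vertical: elementwise add of two vectors -> [1, 2, 3, 4]. */
  __m128i vert = _mm_add_epi32(a, b);

  /* Horizontal: add adjacent pairs within a vector; applying it twice
   * reduces a to its total sum (0+1+2+3 = 6) in every lane. */
  __m128i h1 = _mm_hadd_epi32(a, a);   /* [0+1, 2+3, 0+1, 2+3] */
  __m128i h2 = _mm_hadd_epi32(h1, h1); /* [6, 6, 6, 6] */
  (void)vert; (void)h2;
}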
EXPLICIT VECTORIZATION
Linear Access Operators
→ Predicate evaluation
→ Compression
Ad-hoc Vectorization
→ Sorting
→ Merging
Composable Operations
→ Multi-way trees
→ Bucketized hash tables
Source: Orestis Polychroniou
VECTORIZED DBMS ALGORITHMS
Principles for efficient vectorization by using fundamental vector operations to construct more advanced functionality:
→ Favor vertical vectorization by processing different input data per lane.
→ Maximize lane utilization by executing different things per lane subset.

Reference: Rethinking SIMD Vectorization for In-Memory Databases. SIGMOD 2015.
FUNDAMENTAL OPERATIONS
Selective Load
Selective Store
Selective Gather
Selective Scatter
FUNDAMENTAL VECTOR OPERATIONS

Selective Load
→ Given a vector [A B C D], a mask [0 1 0 1], and memory [U V W X Y Z …]: consecutive values from memory are loaded only into the lanes enabled by the mask, so the vector becomes [A U C V].

Selective Store
→ The inverse: only the mask-enabled lanes of the vector [A B C D] are written to consecutive memory locations, so memory receives [B D …].

Selective Gather
→ Given an index vector [2 1 5 3] and memory [U V W X Y Z …]: each lane fetches the value at its (possibly non-contiguous) offset, so with 0-based offsets the value vector becomes [W V Z X].

Selective Scatter
→ The inverse: each lane of the value vector [A B C D] is written to the memory offset given by its index lane (A→offset 2, B→offset 1, C→offset 5, D→offset 3).
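A hedged AVX-512 sketch of the four primitives (an assumption: the slides describe them abstractly; AVX-512F spells them as expand loads, compress stores, gathers, and scatters):

#include <immintrin.h>  /* AVX-512F intrinsics */

void fundamental_ops(int *mem, const int *idx) {
  __m512i   vec  = _mm512_loadu_si512(mem);
  __m512i   vidx = _mm512_loadu_si512(idx);
  __mmask16 mask = 0xAAAA;  /* every other lane enabled */

  /* Selective load: fill only the masked lanes of vec with
   * consecutive values from memory (an "expand" load). */
  vec = _mm512_mask_expandloadu_epi32(vec, mask, mem);

  /* Selective store: write only the masked lanes of vec to
   * consecutive memory locations (a "compress" store). */
  _mm512_mask_compressstoreu_epi32(mem, mask, vec);

  /* Selective gather: fetch values from non-contiguous offsets mem[idx[i]]. */
  __m512i g = _mm512_i32gather_epi32(vidx, mem, sizeof(int));

  /* Selective scatter: write each lane to the offset mem[idx[i]]. */
  _mm512_i32scatter_epi32(mem, vidx, g, sizeof(int));
}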
ISSUES
Gathers and scatters are not really executed in parallel because the L1 cache only allows one or two distinct accesses per cycle.
Gathers are only supported in newer CPUs.
Selective loads and stores are also emulated in Xeon CPUs using vector permutations.
VECTORIZED OPERATORS
Selection Scans
Hash Tables
Partitioning
The paper provides additional info:
→ Joins, Sorting, Bloom filters.

Reference: Rethinking SIMD Vectorization for In-Memory Databases. SIGMOD 2015.
SELECTION SCANS
SELECT * FROM table
 WHERE key >= $(low) AND key <= $(high)
SELECTION SCANS

Scalar (Branching)

i = 0
for t in table:
  key = t.key
  if (key ≥ low) && (key ≤ high):
    copy(t, output[i])
    i = i + 1

Scalar (Branchless)

i = 0
for t in table:
  copy(t, output[i])
  key = t.key
  m = (key ≥ low ? 1 : 0) && (key ≤ high ? 1 : 0)
  i = i + m

Source: Bogdan Raducanu
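A C sketch of the branchless variant (hypothetical layout: out[] collects qualifying row ids). The copy always executes; the predicate result only advances the output cursor, so there is no branch to mispredict.

/* Branchless selection scan over 32-bit keys. */
int scan_branchless(const int *keys, int n, int low, int high, int *out) {
  int i = 0;
  for (int j = 0; j < n; j++) {
    out[i] = j;                                /* unconditional write */
    i += (keys[j] >= low) & (keys[j] <= high); /* advance only on a match */
  }
  return i;  /* number of matches */
}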
SELECTION SCANS

Vectorized

i = 0
for vt in table:
  simdLoad(vt.key, vk)
  vm = (vk ≥ low ? 1 : 0) && (vk ≤ high ? 1 : 0)
  simdStore(vt, vm, output[i])
  i = i + |vm ≠ false|

Example: SELECT * FROM table WHERE key >= "O" AND key <= "U"

ID:  1 2 3 4 5 6
KEY: J O Y S U X

Key Vector:    [J O Y S U X]
SIMD Compare → Mask: [0 1 0 1 1 0]
All Offsets:   [0 1 2 3 4 5]
SIMD Store (masked) → Matched Offsets: [1 3 4]
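A hedged AVX-512 sketch (not the paper's exact code) of the late-materialization scan above: it emits matching offsets rather than copying tuples, using a masked compress store as the selective store.

#include <immintrin.h>  /* AVX-512F intrinsics */

int scan_simd(const int *keys, int n, int low, int high, int *out) {
  int i = 0;
  __m512i vlo  = _mm512_set1_epi32(low);
  __m512i vhi  = _mm512_set1_epi32(high);
  __m512i lane = _mm512_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7,
                                   8, 9, 10, 11, 12, 13, 14, 15);
  for (int j = 0; j + 16 <= n; j += 16) {
    __m512i vk = _mm512_loadu_si512(keys + j);
    /* Lane-wise range predicate: low <= key <= high. */
    __mmask16 m = _mm512_cmple_epi32_mask(vlo, vk)
                & _mm512_cmple_epi32_mask(vk, vhi);
    /* Selective store: compress the matching offsets to the output. */
    __m512i voff = _mm512_add_epi32(lane, _mm512_set1_epi32(j));
    _mm512_mask_compressstoreu_epi32(out + i, m, voff);
    i += _mm_popcnt_u32(m);  /* advance by the number of matches */
  }
  /* A scalar tail loop for the last (n mod 16) keys is omitted. */
  return i;
}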
SELECTION SCANS

[Chart: throughput (billion tuples/sec) vs. selectivity (1–100%) for Scalar (Branching), Scalar (Branchless), Vectorized (Early Mat), and Vectorized (Late Mat), on MIC (Xeon Phi 7120P – 61 Cores + 4×HT) and Multi-Core (Xeon E3-1275v3 – 4 Cores + 2×HT). Both charts are annotated with the memory-bandwidth limit.]
HASH TABLES – PROBING

Scalar
→ The hash table is a linear-probing table of (KEY, PAYLOAD) entries.
→ Hash the input key k1 to get a hash index h1 (# = hash(key)).
→ Compare k1 against the key stored at that slot (e.g., k9); on a mismatch, probe the following slots one at a time (k3, k8, …) until k1 is found or an empty slot is reached.

Vectorized (Horizontal)
→ Use a linear-probing bucketized hash table: each bucket stores multiple KEYS alongside the PAYLOAD.
→ Hash the input key k1 to get a hash index h1.
→ A single SIMD compare matches k1 against every key in the bucket (k9, k3, k8, k1) at once, producing a matched mask (here [0 0 0 1]).

Vectorized (Vertical)
→ Process a different input key per lane: load the input key vector [k1 k2 k3 k4] and hash every lane to get the hash index vector [h1 h2 h3 h4].
→ SIMD gather fetches the stored keys at those slots (e.g., [k1 k99 k88 k4]).
→ SIMD compare against the input keys yields a match mask (here lanes 1 and 4 match).
→ Matched lanes emit their payloads and are refilled with fresh input keys (k5, k6, with hash indexes h5, h6); unmatched lanes advance to the next probe slot (h2+1, h3+1).
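A hedged AVX-512 sketch (names and layout are assumptions) of one step of vertical probing: a gather fetches a different table slot per lane, and a single compare checks all lanes at once.

#include <immintrin.h>  /* AVX-512F intrinsics */

/* One probe step over 16 lanes: table_keys is the key column of a
 * linear-probing table; hashing and the refill of matched lanes are
 * handled by the (omitted) caller. */
__mmask16 probe_step(const int *table_keys,
                     __m512i hash_idx,
                     __m512i probe_keys) {
  /* Gather the stored keys at each lane's current slot. */
  __m512i stored = _mm512_i32gather_epi32(hash_idx, table_keys, sizeof(int));
  /* Compare every lane's probe key against its gathered key. */
  __mmask16 match = _mm512_cmpeq_epi32_mask(probe_keys, stored);
  /* Matched lanes emit payloads and get fresh keys; unmatched lanes
   * advance: hash_idx = (hash_idx + 1) mod table size. */
  return match;
}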
HASH TABLES – PROBING

[Chart: throughput (billion tuples/sec) vs. hash table size (4KB–64MB) for Scalar, Vectorized (Horizontal), and Vectorized (Vertical) probing, on MIC (Xeon Phi 7120P – 61 Cores + 4×HT) and Multi-Core (Xeon E3-1275v3 – 4 Cores + 2×HT). Throughput drops sharply once the hash table no longer fits in cache ("Out of Cache").]
PARTITIONING – HISTOGRAM

Use scatters and gathers to increment counts. Replicate the histogram to handle collisions.

→ SIMD radix/hash turns the input key vector [k1 k2 k3 k4] into the hash index vector [h1 h2 h3 h4].
→ A naive SIMD scatter of +1 increments into one histogram loses updates when two lanes hash to the same bucket: four keys yield only three increments (+1 +1 +1).
→ Instead, replicate the histogram across the vector lanes (one copy per lane), scatter each lane's +1 into its own copy, and merge the copies at the end to get the correct counts (+1 +2 +1).
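A scalar C sketch of why replication fixes the collision problem (assumptions: LANES models the SIMD lanes, BUCKETS is the number of partitions, and the radix function is hypothetical). Each lane owns a histogram column, so colliding lanes update distinct counters instead of overwriting each other.

#define LANES   4
#define BUCKETS 256

/* Caller must zero-initialize hist[BUCKETS]. */
void histogram(const int *keys, int n, int *hist) {
  int replicated[BUCKETS][LANES] = {{0}};
  for (int j = 0; j < n; j++) {
    int lane   = j % LANES;               /* which SIMD lane handles key j */
    int bucket = keys[j] & (BUCKETS - 1); /* hypothetical radix function */
    replicated[bucket][lane]++;           /* per-lane counter: no collision */
  }
  /* Merge the per-lane copies into the final histogram. */
  for (int b = 0; b < BUCKETS; b++)
    for (int l = 0; l < LANES; l++)
      hist[b] += replicated[b][l];
}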
JOINS
No Partitioning
→ Build one shared hash table using atomics.
→ Partially vectorized.
Min Partitioning
→ Partition the building table.
→ Build one hash table per thread.
→ Fully vectorized.
Max Partitioning
→ Partition both tables repeatedly.
→ Build and probe cache-resident hash tables.
→ Fully vectorized.
JOINS

[Chart: join time (sec), split into Partition / Build / Probe phases, for Scalar vs. Vector implementations of No, Min, and Max Partitioning; 200M ⨝ 200M tuples (32-bit keys & payloads) on a Xeon Phi 7120P – 61 Cores + 4×HT.]
PARTING THOUGHTS
Vectorization is essential for OLAP queries. These algorithms don't work when the data exceeds your CPU cache.
We can combine all the intra-query parallelism optimizations we've talked about in a DBMS:
→ Multiple threads processing the same query.
→ Each thread can execute a compiled plan.
→ The compiled plan can invoke vectorized operations.
NEXT CLASS
Reminder: No class on 4/18 (Thu).
Reminder: Guest lecture on 4/23 (Tue).
Reminder: Extra credit due on 4/18 (Thu).
Reminder: Final presentation on 4/25 (Thu).