

SLIDE 1

ADVANCED DATABASE SYSTEMS

Lecture #20: Vectorized Execution

@Andy_Pavlo // 15-721 // Spring 2019

SLIDE 2 (CMU 15-721, Spring 2019)

Background
Hardware
Vectorized Algorithms (Columbia)

SLIDE 3

VECTORIZATION

The process of converting an algorithm's scalar implementation, which processes a single pair of operands at a time, into a vector implementation, which performs one operation on multiple pairs of operands at once.

SLIDE 4

WHY THIS MATTERS

Say we can parallelize our algorithm over 32 cores, and each core has 4-wide SIMD registers.
Potential speed-up: 32× × 4× = 128×

SLIDE 5

MULTI-CORE CPUS

Use a small number of high-powered cores.
→ Intel Xeon Skylake / Kaby Lake
→ High power consumption and area per core.

Massively superscalar and aggressive out-of-order execution:
→ Instructions are issued from a sequential stream.
→ Check for dependencies between instructions.
→ Process multiple instructions per clock cycle.

SLIDE 6

MANY INTEGRATED CORES (MIC)

Use a larger number of low-powered cores.
→ Intel Xeon Phi
→ Low power consumption and area per core.
→ Expanded SIMD instructions with larger register sizes.

Knights Ferry (Columbia paper)
→ Non-superscalar, in-order execution.
→ Cores = Intel P54C (aka Pentium from the 1990s).

Knights Landing (since 2016)
→ Superscalar, out-of-order execution.
→ Cores = Silvermont (aka Atom).


SLIDE 8

SINGLE INSTRUCTION, MULTIPLE DATA (SIMD)

A class of CPU instructions that allow the processor to perform the same operation on multiple data points simultaneously. All major ISAs have microarchitectural support for SIMD operations.
→ x86: MMX, SSE, SSE2, SSE3, SSE4, AVX, AVX2, AVX-512
→ PowerPC: AltiVec
→ ARM: NEON

SLIDE 9–13

SIMD EXAMPLE

Element-wise vector addition, X + Y = Z:

  for (i = 0; i < n; i++) {
    Z[i] = X[i] + Y[i];
  }

SISD: one addition per iteration — a single pair xᵢ + yᵢ at a time.

SIMD: four elements of X and four elements of Y are packed into 128-bit SIMD registers, and a single instruction adds all four pairs at once.

[Figure: X = (8 7 6 5 4 3 2 1), Y = (1 1 1 1 1 1 1 1). SISD adds one lane per step; SIMD loads 128-bit register groups (e.g., 8 7 6 5 and 1 1 1 1, then 4 3 2 1 and 1 1 1 1) and adds four lanes per instruction.]

SLIDE 14

STREAMING SIMD EXTENSIONS (SSE)

SSE is a collection of SIMD instructions that target special 128-bit SIMD registers. These registers can be packed with four 32-bit scalars, after which an operation can be performed on each of the four elements simultaneously.

First introduced by Intel in 1999.

SLIDE 15

SIMD INSTRUCTIONS (1)

Data Movement
→ Moving data in and out of vector registers.

Arithmetic Operations
→ Apply an operation on multiple data items (e.g., 2 doubles, 4 floats, 16 bytes).
→ Examples: ADD, SUB, MUL, DIV, SQRT, MAX, MIN.

Logical Instructions
→ Logical operations on multiple data items.
→ Examples: AND, OR, XOR, ANDN, ANDPS, ANDNPS.

SLIDE 16

SIMD INSTRUCTIONS (2)

Comparison Instructions
→ Comparing multiple data items (==, <, <=, >, >=, !=).

Shuffle Instructions
→ Move data between SIMD registers.

Miscellaneous
→ Conversion: Transform data between x86 and SIMD registers.
→ Cache Control: Move data directly from SIMD registers to memory (bypassing the CPU cache).

SLIDE 17

INTEL SIMD EXTENSIONS

Year  Extension  Width     Integers  Single-P  Double-P
1997  MMX        64 bits   ✔
1999  SSE        128 bits  ✔         ✔ (×4)
2001  SSE2       128 bits  ✔         ✔         ✔ (×2)
2004  SSE3       128 bits  ✔         ✔         ✔
2006  SSSE3      128 bits  ✔         ✔         ✔
2006  SSE4.1     128 bits  ✔         ✔         ✔
2008  SSE4.2     128 bits  ✔         ✔         ✔
2011  AVX        256 bits  ✔         ✔ (×8)    ✔ (×4)
2013  AVX2       256 bits  ✔         ✔         ✔
2017  AVX-512    512 bits  ✔         ✔ (×16)   ✔ (×8)

Source: James Reinders

SLIDE 18–19

VECTORIZATION

Choice #1: Automatic Vectorization
Choice #2: Compiler Hints
Choice #3: Explicit Vectorization

Moving from Choice #1 toward Choice #3 trades ease of use for programmer control.

Source: James Reinders

SLIDE 20

AUTOMATIC VECTORIZATION

The compiler can identify when instructions inside of a loop can be rewritten as a vectorized operation.

Works for simple loops only and is rare in database operators. Requires hardware support for SIMD instructions.

SLIDE 21–23

AUTOMATIC VECTORIZATION

This loop is not legal to automatically vectorize. The pointers might refer to the same address (e.g., if Z overlaps X, a store *Z = *X + 1 can feed a later load), and the code is written such that the addition is described as being done sequentially.

  void add(int *X, int *Y, int *Z) {
    for (int i = 0; i < MAX; i++) {
      Z[i] = X[i] + Y[i];
    }
  }
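The hazard can be demonstrated in plain C (a sketch; MAX, the function names, and the overlap layout are illustrative). When the output overlaps an input, the sequential semantics differ from a "read all lanes first" vectorized execution, so the compiler must not vectorize without proof that the pointers don't alias:

```c
#include <string.h>

#define MAX 8

/* Sequential semantics: each store may feed a later load. */
void add_seq(int *X, int *Y, int *Z) {
    for (int i = 0; i < MAX; i++)
        Z[i] = X[i] + Y[i];
}

/* "Vectorized" interpretation: read all inputs before any store,
 * as a SIMD load of a full lane group would. */
void add_all_loads_first(int *X, int *Y, int *Z) {
    int x[MAX], y[MAX];
    memcpy(x, X, sizeof x);
    memcpy(y, Y, sizeof y);
    for (int i = 0; i < MAX; i++)
        Z[i] = x[i] + y[i];
}
```

With Z = A + 1 aliasing X = A and Y all zeros, the sequential version propagates A[0] through the whole array, while the loads-first version only shifts it by one slot — two different answers from the "same" loop.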

SLIDE 24

COMPILER HINTS

Provide the compiler with additional information about the code to let it know that it is safe to vectorize. Two approaches:
→ Give explicit information about memory locations.
→ Tell the compiler to ignore vector dependencies.

SLIDE 25

COMPILER HINTS

The restrict keyword (standard in C99; supported as __restrict by most C++ compilers) tells the compiler that the arrays are distinct locations in memory.

  void add(int *restrict X,
           int *restrict Y,
           int *restrict Z) {
    for (int i = 0; i < MAX; i++) {
      Z[i] = X[i] + Y[i];
    }
  }

SLIDE 26

COMPILER HINTS

This pragma tells the compiler to ignore loop dependencies for the vectors. It's up to you to make sure that this is correct.

  void add(int *X, int *Y, int *Z) {
    #pragma ivdep
    for (int i = 0; i < MAX; i++) {
      Z[i] = X[i] + Y[i];
    }
  }

SLIDE 27

EXPLICIT VECTORIZATION

Use CPU intrinsics to manually marshal data between SIMD registers and execute vectorized instructions. Potentially not portable.

SLIDE 28

EXPLICIT VECTORIZATION

Store the vectors in 128-bit SIMD registers, then invoke the intrinsic to add the vectors together and write them to the output location.

  void add(int *X, int *Y, int *Z) {
    __m128i *vecX = (__m128i*)X;
    __m128i *vecY = (__m128i*)Y;
    __m128i *vecZ = (__m128i*)Z;
    for (int i = 0; i < MAX/4; i++) {
      _mm_store_si128(vecZ++, _mm_add_epi32(*vecX++, *vecY++));
    }
  }
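A minimal compilable sketch of the same idea, assuming an x86-64 target (where SSE2 is baseline). It uses unaligned loads/stores so the arrays need no special alignment, and handles a tail that is not a multiple of four with a scalar loop — two details the slide's version glosses over:

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Add n 32-bit ints from X and Y into Z, four lanes at a time. */
void simd_add(const int *X, const int *Y, int *Z, int n) {
    int i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128i vx = _mm_loadu_si128((const __m128i *)(X + i));
        __m128i vy = _mm_loadu_si128((const __m128i *)(Y + i));
        _mm_storeu_si128((__m128i *)(Z + i), _mm_add_epi32(vx, vy));
    }
    for (; i < n; i++)  /* scalar tail for leftover elements */
        Z[i] = X[i] + Y[i];
}
```

Using aligned _mm_store_si128 as on the slide is slightly faster but requires 16-byte-aligned buffers; the unaligned variants trade a little speed for safety.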

SLIDE 29

VECTORIZATION DIRECTION

Approach #1: Horizontal
→ Perform the operation on all elements together within a single vector (e.g., a SIMD add reduces the vector 0 1 2 3 to the single value 6).

Approach #2: Vertical
→ Perform the operation in an elementwise manner on elements of each vector (e.g., 0 1 2 3 + 1 1 1 1 = 1 2 3 4).

SLIDE 30

EXPLICIT VECTORIZATION

Linear Access Operators
→ Predicate evaluation
→ Compression

Ad-hoc Vectorization
→ Sorting
→ Merging

Composable Operations
→ Multi-way trees
→ Bucketized hash tables

Source: Orestis Polychroniou

SLIDE 31

VECTORIZED DBMS ALGORITHMS

Principles for efficient vectorization: use fundamental vector operations to construct more advanced functionality.
→ Favor vertical vectorization by processing different input data per lane.
→ Maximize lane utilization by executing different things per lane subset.

Rethinking SIMD Vectorization for In-Memory Databases (SIGMOD 2015)

SLIDE 32

FUNDAMENTAL OPERATIONS

→ Selective Load
→ Selective Store
→ Selective Gather
→ Selective Scatter

SLIDE 33–42

FUNDAMENTAL VECTOR OPERATIONS

Selective Load: given a vector (A B C D), a mask (0 1 0 1), and contiguous memory (U V W X Y Z …), read consecutive values from memory into only the vector lanes whose mask bit is set, leaving the other lanes unchanged (result: A U C V).

Selective Store: the inverse — write only the vector lanes whose mask bit is set (here B and D) to a contiguous region of memory.
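The semantics of these two operations can be modeled in scalar C (a sketch of what the instructions compute, not of the hardware; lane count and names are illustrative):

```c
/* Selective load: copy consecutive values from mem into the lanes of
 * vec whose mask bit is set; other lanes keep their old values.
 * Returns the number of memory values consumed. */
int selective_load(int *vec, int lanes, const int *mask, const int *mem) {
    int consumed = 0;
    for (int i = 0; i < lanes; i++)
        if (mask[i])
            vec[i] = mem[consumed++];
    return consumed;
}

/* Selective store: compact the masked lanes of vec into contiguous
 * memory. Returns the number of values written. */
int selective_store(const int *vec, int lanes, const int *mask, int *mem) {
    int written = 0;
    for (int i = 0; i < lanes; i++)
        if (mask[i])
            mem[written++] = vec[i];
    return written;
}
```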

SLIDE 43–47

FUNDAMENTAL VECTOR OPERATIONS

Selective Gather: use an index vector (e.g., 2 1 5 3) to fetch non-contiguous values from memory (U V W X Y Z …) into the lanes of a value vector: lane i receives mem[index[i]].

Selective Scatter: the inverse — write each lane of the value vector (A B C D) to the memory location given by the corresponding entry of the index vector: mem[index[i]] = vec[i].
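As with selective load/store, the gather/scatter semantics have a direct scalar model (a sketch; zero-based indices assumed):

```c
/* Selective gather: lane i of vec receives mem[index[i]]. */
void selective_gather(int *vec, int lanes, const int *index, const int *mem) {
    for (int i = 0; i < lanes; i++)
        vec[i] = mem[index[i]];
}

/* Selective scatter: mem[index[i]] receives lane i of vec.
 * If two lanes share an index, the later lane wins -- a conflict
 * hazard that real SIMD scatters must also define or avoid. */
void selective_scatter(const int *vec, int lanes, const int *index, int *mem) {
    for (int i = 0; i < lanes; i++)
        mem[index[i]] = vec[i];
}
```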

SLIDE 48

ISSUES

Gathers and scatters are not really executed in parallel because the L1 cache only allows one or two distinct accesses per cycle. Gathers are only supported in newer CPUs. Selective loads and stores are also implemented on Xeon CPUs using vector permutations.

SLIDE 49

VECTORIZED OPERATORS

Selection Scans
Hash Tables
Partitioning

The paper provides additional info:
→ Joins, Sorting, Bloom filters.

Rethinking SIMD Vectorization for In-Memory Databases (SIGMOD 2015)

SLIDE 50–51

SELECTION SCANS

Scalar (Branching):

  i = 0
  for t in table:
    key = t.key
    if (key≥low) && (key≤high):
      copy(t, output[i])
      i = i + 1

Scalar (Branchless):

  i = 0
  for t in table:
    copy(t, output[i])
    key = t.key
    m = (key≥low ? 1 : 0) && (key≤high ? 1 : 0)
    i = i + m

Source: Bogdan Raducanu
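The branchless variant in runnable C (a sketch that scans plain int keys rather than tuples; names are illustrative). Every row is written to the current output slot, but the cursor only advances on a match, so no branch depends on the data and the branch predictor cannot be tripped up by the selectivity:

```c
/* Branchless selection scan: write the indices of keys in
 * [low, high] to out; return the match count. Every iteration
 * stores unconditionally; i advances only on a match. */
int select_scan(const int *keys, int n, int low, int high, int *out) {
    int i = 0;
    for (int t = 0; t < n; t++) {
        out[i] = t;                                   /* unconditional store */
        int m = (keys[t] >= low) & (keys[t] <= high); /* predicate as 0/1 */
        i += m;                                       /* advance only on match */
    }
    return i;
}
```

Non-matching rows are simply overwritten by the next iteration, which is why out must have room for n entries.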


SLIDE 53–59

SELECTION SCANS

Vectorized:

  i = 0
  for vt in table:
    simdLoad(vt.key, vk)
    vm = (vk≥low ? 1 : 0) && (vk≤high ? 1 : 0)
    simdStore(vt, vm, output[i])
    i = i + |vm≠false|

Example: SELECT * FROM table WHERE key >= "O" AND key <= "U" over the keys (J O Y S U X). A SIMD compare of the key vector against the bounds produces the mask (0 1 0 1 1 0); a selective SIMD store of the all-offsets vector (0 1 2 3 4 5) under that mask emits the matched offsets (1 3 4).

SLIDE 60–61

SELECTION SCANS

[Figure: throughput (billion tuples/sec) vs. selectivity (%) for Scalar (Branching), Scalar (Branchless), Vectorized (Early Mat), and Vectorized (Late Mat), on MIC (Xeon Phi 7120P – 61 cores + 4×HT) and Multi-Core (Xeon E3-1275v3 – 4 cores + 2×HT). The vectorized variants run close to the memory-bandwidth limit on both platforms.]

SLIDE 62–67

HASH TABLES – PROBING

Scalar: hash the input key k1 to get an index h1 into a linear-probing hash table (KEY | PAYLOAD). Compare the key in each slot (k9, then k3, then k8, …) one at a time, advancing until the matching key k1 is found.

Vectorized (Horizontal): use a bucketized hash table where each bucket holds several keys (KEYS | PAYLOAD). Hash k1 to a bucket, then use a single SIMD compare to match the input key against all keys in the bucket (k9 k3 k8 k1) at once, producing a matched mask (0 0 0 1).

SLIDE 68–74

HASH TABLES – PROBING

Vectorized (Vertical): process a different input key per lane against a linear-probing hash table.
→ Hash the input key vector (k1 k2 k3 k4) into a hash index vector (h1 h2 h3 h4).
→ SIMD gather the keys stored at those slots (k1 k99 k88 k4).
→ SIMD compare against the input keys: here lanes 1 and 4 match.
→ Matched lanes emit their payloads and are refilled with new input keys (k5, k6) and fresh hash indices (h5, h6); mismatched lanes advance to the next slot (h2+1, h3+1) and keep probing.
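The lane bookkeeping above can be sketched in scalar C (a model of the control flow, not a SIMD implementation; W, TSIZE, the hash function, and the use of key 0 as the empty-slot marker are all illustrative assumptions):

```c
#define W 4       /* number of SIMD lanes */
#define TSIZE 8   /* table slots, power of two; key 0 marks empty */

static int hashk(int k) { return k & (TSIZE - 1); }

/* Probe n keys against a linear-probing table (tkeys/tpay).
 * pay_out[i] receives the payload for keys[i], or -1 if absent.
 * Each lane carries its own key and probe cursor: matched (or
 * provably absent) lanes refill with fresh input keys, mismatched
 * lanes advance one slot -- the vertical scheme. */
void vertical_probe(const int *tkeys, const int *tpay,
                    const int *keys, int n, int *pay_out) {
    int lkey[W], lidx[W], lin[W];   /* per-lane key, cursor, input slot */
    int next = 0, active = 0;
    for (int w = 0; w < W; w++) {   /* initial fill */
        if (next < n) {
            lin[w] = next; lkey[w] = keys[next];
            lidx[w] = hashk(keys[next]); next++; active++;
        } else lin[w] = -1;
    }
    while (active > 0) {
        for (int w = 0; w < W; w++) {
            if (lin[w] < 0) continue;
            int t = tkeys[lidx[w]];        /* "gather" this lane's slot */
            if (t == lkey[w] || t == 0) {  /* hit, or miss on empty slot */
                pay_out[lin[w]] = (t == lkey[w]) ? tpay[lidx[w]] : -1;
                if (next < n) {            /* refill the finished lane */
                    lin[w] = next; lkey[w] = keys[next];
                    lidx[w] = hashk(keys[next]); next++;
                } else { lin[w] = -1; active--; }
            } else {
                lidx[w] = (lidx[w] + 1) & (TSIZE - 1);  /* keep probing */
            }
        }
    }
}
```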

SLIDE 75–76

HASH TABLES – PROBING

[Figure: throughput (billion tuples/sec) vs. hash table size for Scalar, Vectorized (Horizontal), and Vectorized (Vertical), on MIC (Xeon Phi 7120P – 61 cores + 4×HT) and Multi-Core (Xeon E3-1275v3 – 4 cores + 2×HT). The vectorized advantage shrinks once the table falls out of cache.]

SLIDE 77–80

PARTITIONING – HISTOGRAM

Use scatters and gathers to increment counts. Replicate the histogram to handle collisions.

A naive SIMD scatter-add loses updates when two lanes hash to the same histogram bucket: their +1s collide and one increment goes missing. Instead, keep one copy of the histogram per vector lane (replicated histogram): each lane scatters its +1 into its own replica, and the replicas are summed afterward.
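A scalar model of the replicated-histogram trick (a sketch; LANES, BUCKETS, and the round-robin lane assignment are illustrative):

```c
#define LANES 4
#define BUCKETS 8

/* Build a histogram of bucket indices using one replica per lane,
 * then reduce. Because each lane only ever updates its own replica,
 * two lanes hitting the same bucket in one step cannot lose a +1 --
 * the hazard of a plain SIMD scatter-add. */
void histogram_replicated(const int *bucket_of, int n, int *hist) {
    int rep[LANES][BUCKETS] = {{0}};
    for (int i = 0; i < n; i++)
        rep[i % LANES][bucket_of[i]]++;   /* lane-private update */
    for (int b = 0; b < BUCKETS; b++) {   /* reduce the replicas */
        hist[b] = 0;
        for (int w = 0; w < LANES; w++)
            hist[b] += rep[w][b];
    }
}
```

The reduction cost is paid once per partitioning pass, while the per-tuple update stays conflict-free.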

SLIDE 81

JOINS

No Partitioning
→ Build one shared hash table using atomics.
→ Partially vectorized.

Min Partitioning
→ Partition the build table.
→ Build one hash table per thread.
→ Fully vectorized.

Max Partitioning
→ Partition both tables repeatedly.
→ Build and probe cache-resident hash tables.
→ Fully vectorized.

SLIDE 82

JOINS

[Figure: join time (sec), broken down into Partition, Build, Probe, and Build+Probe phases, for Scalar vs. Vector variants of No, Min, and Max Partitioning; 200M ⨝ 200M tuples (32-bit keys & payloads) on a Xeon Phi 7120P – 61 cores + 4×HT.]

SLIDE 83

PARTING THOUGHTS

Vectorization is essential for OLAP queries. These algorithms don't work when the data exceeds your CPU cache.

We can combine all the intra-query parallelism optimizations we've talked about in a DBMS:
→ Multiple threads processing the same query.
→ Each thread can execute a compiled plan.
→ The compiled plan can invoke vectorized operations.

SLIDE 84

NEXT CLASS

Compilation vs. Vectorization