HARDWARE-CONSCIOUS DATA PROCESSING SYSTEMS Holger Pirk - - PowerPoint PPT Presentation

hardware conscious data processing systems
SMART_READER_LITE
LIVE PREVIEW

HARDWARE-CONSCIOUS DATA PROCESSING SYSTEMS Holger Pirk - - PowerPoint PPT Presentation

HARDWARE-CONSCIOUS DATA PROCESSING SYSTEMS Holger Pirk http://doc.ic.ac.uk/~hlgr Data Processing Performance - A Case Study Data Processing Performance - A Case Study 100 75 50 25 0 April May June July select sum(sales),


slide-1
SLIDE 1

HARDWARE-CONSCIOUS DATA PROCESSING SYSTEMS

Holger Pirk http://doc.ic.ac.uk/~hlgr

slide-2
SLIDE 2

Data Processing Performance

  • A Case Study
slide-3
SLIDE 3

Data Processing Performance

  • A Case Study

25 50 75 100 April May June July

≈ select sum(sales), … where country = US … group by month, …

slide-4
SLIDE 4

Data Processing Performance

  • A Case Study

TPC-H Query 1 (Roughly 10 GB of Data)

≈ select sum(sales), … where country = US … group by month, …

slide-5
SLIDE 5

Data Processing Performance

  • A Case Study

0.01 0.10 1 10 100

select sum(a) where b=6 group by c

Postgres 01 100 s

slide-6
SLIDE 6

Data Processing Performance

  • A Case Study

100 s Postgres 0.01 0.10 1 10 100

select sum(a) where b=6 group by c

slide-7
SLIDE 7

Data Processing Performance

  • A Case Study

100 s Memory Bandwidth bound, i.e., 37ms Postgres

  • n Disk

0.01 0.10 1 10 100

select sum(a) where b>6 group by c

The solution: memory-resident data?

slide-8
SLIDE 8

Data Processing Performance

  • A Case Study

100 s Memory Bandwidth bound, i.e., 37ms Postgres

  • n Disk

Postgres

  • n Ramdisk

0.01 0.10 1 10 100

slide-9
SLIDE 9

Data Processing Performance

  • A Case Study

100 s 96 s Memory Bandwidth bound, i.e., 37ms Postgres

  • n Disk

Postgres

  • n Ramdisk

0.01 0.10 1 10 100

slide-10
SLIDE 10

Data Processing Performance

  • A Case Study

100 s 96 s 3.2 s Memory Bandwidth bound, i.e., 37ms Postgres

  • n Disk

Postgres

  • n Ramdisk

MonetDB 0.01 0.10 1 10 100

30x faster!

slide-11
SLIDE 11

Data Processing Performance

  • A Case Study

100 s 96 s 3.2 s 0.162 s Memory Bandwidth bound, i.e., 37ms Postgres

  • n Disk

Postgres

  • n Ramdisk

MonetDB Voodoo 0.01 0.10 1 10 100

How to get there

slide-12
SLIDE 12

Data Processing Performance

  • A Case Study

100 s 96 s 3.2 s 0.162 s Memory Bandwidth bound, i.e., 37ms Postgres

  • n Disk

Postgres

  • n Ramdisk

MonetDB Voodoo 0.01 0.10 1 10 100

Column at a time Processing

slide-13
SLIDE 13

Optimizer Optimizer

Classic database architecture

SQL Logical Plan Physical Plan DB Kernel OS/Hardware/…

Select, Join, Group, … Tablescan, Hashjoin, … void scan(int* input, int* output){…}

DBMS OS

slide-14
SLIDE 14

Column at a time Processing

  • Postgres is a tuple at a time kernel
  • High interpretation overhead
  • Hard to parallelize
  • MonetDB is a column at a time kernel
  • Overhead amortized over many tuples/values
  • Easy to parallelize

Ramdisk does not impact performance

slide-15
SLIDE 15

…back to our study

100 s 96 s 3.2 s 0.162 s Memory Bandwidth bound, i.e., 37ms Postgres

  • n Disk

Postgres

  • n Ramdisk

MonetDB Voodoo 0.01 0.10 1 10 100

Generating

  • ptimized code
slide-16
SLIDE 16

return join join select R1 R2 R3

1 2 3 4 5 6

for tuple1 in R1: tmp1[hash(tuple1)] = tuple1 for tuple2 in R2: if hash(tuple2) in tmp1: tmp2[hash(tuple2)] = (tuple2, tmp1[hash(tuple2)]) for tuple3 in R3: if some_condition(tuple3): if hash(tuple3) in hash2: yield (tuple3, *tmp2[hash(tuple3)])

Query Compilers

1 2 3 4 5 6

4x Performance improvement: 3.2s to .7s Logical Plan Executable Program Read inputCalculate Hash Insert Value Calculate Hash Probe Table Read input

slide-17
SLIDE 17

Executable

Optimizer Optimizer

Query Compiler Architecture

SQL Logical Plan Physical Plan DB Kernel OS/Hardware/…

OS DBMS

slide-18
SLIDE 18

Optimizer Opti mizer

Query Compiler Architecture

SQL Logical Plan Executable OS/Hardware/…

Compiler OS Data Management System

slide-19
SLIDE 19

Database Performance Engineering

Compiler s Data Management Computer Architecture

slide-20
SLIDE 20

The Performance Engineering Deluge

Branch-Free Selections, Radix-Joins, Vectorized Processing, Archit ecture-Conscious Hashing, SIMD-Parallel Processing, Bitwise Processing, NUMA-Aware Processing, Superscalar (De)compression, Instruction-Cache Aware Processing, Multicore-Parallelism, Co-Processing, …

slide-21
SLIDE 21

Multicore vs. SIMD

SIMD-Parallel Processing Multicore-Parallelism

slide-22
SLIDE 22

sum

A data processing example

sum sum sum sum

slide-23
SLIDE 23

Multicore-Parallelism using TBB

partInput.partition)

  • 1

auto input = load("input"); 2 auto totalssum = 3 parallel_deterministic_reduce( 4 blocked_range<size_t>(0, input.size, 5 input.size / 1024), 6 0, [&input](auto& range, auto partsum) { 7 for(size_t i = range.begin(); 8 i < range.end(); i++) { 9 partsum += input.elements[i].constant; 10 } 11 return partsum; 12 }, 13 [](auto s1, auto s2) { return s1 + s2; });

  • Parallel Reduce
  • Block-Range
  • Per-Partition

Lambda

  • Global

Lambda

slide-24
SLIDE 24

Multicore vs. SIMD

  • Parallel Reduce
  • Block-Range
  • Per-Partition Lambda
  • Global Lambda

SIMD-Parallel Processing Multicore-Parallelism

6 3 9 2 2 4 1 + 8 7 9 3 =

Vectorized Add: Single Instruction Multiple Data parallelism

slide-25
SLIDE 25

A data processing example using SIMD

slide-26
SLIDE 26

A data processing example using SIMD

auto input = load("input"); typedef int v4i __attribute__((vector_size(16))); auto vSize = (sizeof(v4i) / sizeof(int)); v4i sums = {}; for(size_t i = 0; i < input.size / vSize; i++) { sums += ((v4i*)input.elements)[i]; } int* scalarSums = (int*)&sums; auto totalsum = 0l; for(size_t i = 0; i < 4; i++) { totalsum += scalarSums[i]; }

  • SIMD

Datatypes

  • Array Cast
  • Sequential

Reduction

  • Loop Bound

Adaption

slide-27
SLIDE 27

Multicore vs. SIMD

  • SIMD Datatypes
  • Array Cast
  • Sequential Reduction
  • Loop Bound Adaption
  • Parallel Reduce
  • Block-Range
  • Per-Partition Lambda
  • Global Lambda

Particularly problematic when generating code

SIMD-Parallel Processing Multicore-Parallelism

slide-28
SLIDE 28

Query Compiler Architecture

SIMD-Parallel Processing Multicore-Parallelism Join-Ordering

slide-29
SLIDE 29

Optimizer Opti mizer

Query Compiler Architecture

SQL Logical Plan Executable OS/Hardware/…

DBMS Compiler

SIMD-Parallel Processing Multicore-Parallelism Join-Ordering

OS

slide-30
SLIDE 30

Optimizer Opti mizer

Query Compiler Architecture

SQL Logical Plan Executable OS/Hardware/…

DBMS

SIMD-Parallel Processing Multicore-Parallelism

Compiler

Join-Ordering

OS

Data aware Hardware aware

slide-31
SLIDE 31

Optimizer Optimizer

What if…

SQL Logical Plan Executable OS/Hardware/…

DBMS

SIMD-Parallel Processing Multicore-Parallelism

Compiler

Join-Ordering

OS

slide-32
SLIDE 32

Optimizer

What if…

SQL Unified Algebra Executable OS/Hardware/…

Logical Plan

SIMD-Parallel Processing Multicore-Parallelism Join-Ordering

slide-33
SLIDE 33

Optimizer

What if we had an intermediate algebra

SQL Unified Algebra Executable OS/Hardware/…

Logical Plan

SIMD-Parallel Processing Multicore-Parallelism Join-Ordering

Data & Hardware aware

slide-34
SLIDE 34

A portable high-performance database kernel A platform for database performance engineering A Vector Algebra [VLDB 2016/17]

slide-35
SLIDE 35

The Voodoo Vector Algebra

slide-36
SLIDE 36

Design goals

  • Fast and expressive like C
  • Optimizable like relational algebra
  • Portable to different devices
  • (Enable serendipitous discovery of new optimizations)
slide-37
SLIDE 37

Design goals

  • Fast and expressive like C
  • Optimizable like relational algebra
  • Portable to different devices
  • All tuning decisions are explicit in the program
  • Focused on data processing
  • Dataflow ⇒ Optimization are graph transformation rules
  • Least common denominator data model and operator set
slide-38
SLIDE 38

Data model: the least common denominator of targeted hardware

  • struct {int id; 


struct {
 int x; 
 int y
 } position;
 } poi[n]

id pos.x pos.y 15 7 9 21 4 5 11 10 14 92 7 1 78 45 12 65 12

slide-39
SLIDE 39

All Voodoo operators are parallel, some are controlled

  • Map-like (Fully Data Parallel):
  • Project, Zip, Arithmetic, Logical, Bit

Operations, Gather (i.e., z = x[y]), Cross

  • Controlled:
  • FoldSelect, FoldSum, FoldMax, FoldScan,

Scatter (i.e, x[y] = z), …

slide-40
SLIDE 40

A short revision on fold

  • Fold(f, [8,9,5,7,0,5,7,4,5])

f 5 8 9 f 7 f 5 f f 7 f f 5 f 4 f 7 f 5 f

slide-41
SLIDE 41

1 1 1 1 2 4 1 3 1 5 .partition .value

Control: explicit partition assignment

Foldsum

7 .sum ε 9

slide-42
SLIDE 42

Controlled folding - declarative & tunable

2 4 1 3 1 5 .value

Foldsum

7 .sum ε 9

slide-43
SLIDE 43

Controlled folding - declarative & tunable

2 4 1 3 1 5 .value

Foldsum Foldsum

16 .sum 7 .sum ε 9

slide-44
SLIDE 44

Creating Control Vectors

4 4 4 4 4 4 4 4 0 0 0 0 1 1 1 1 0 1 2 3 4 5 6 7

➗ = range(data) constant(data, 4) control vector

slide-45
SLIDE 45

1 input := Load("input") // Single Column: val 2 ids := Range(input) Constant

Partitioned aggregation in Voodoo is controlled

4 partitionIDs := Divide(ids, partitionSize) 5 positions := Partition(partitionIDs) 6 inputWPart := Zip(input, partition) 7 partInput := Scatter(inputWPart, positions) 8 pSum := FoldSum(partInput.val, partInput.partition) 9 totalSum := FoldSum(pSum) 2 ids := Range(input) 3 partitionSize := Constant(1024) 4 partitionIDs := Divide(ids, (1024) partitionSize)

slide-46
SLIDE 46

2 4 1 3 1 5 .value

Lanewise folding abstracts SIMD parallelism…

7 9 .sum

slide-47
SLIDE 47

Creating Control Vectors

2 2 2 2 2 2 2 2 0 1 0 1 0 1 0 1 0 1 2 3 4 5 6 7

mod = range(data) constant(data, 2) partition IDs

slide-48
SLIDE 48

1 input := Load("input") // Single Column: val 2 ids := Range(input) 3 laneCount := Constant(2) 4 partitionIDs := Modulo(ids, laneCount) 5 positions := Partition(partitionIDs) 6 inputWPart := Zip(input, partition) 7 partInput := Scatter(inputWPart, positions) 8 pSum := FoldSum(partInput.val, partInput.partition) 9 totalSum := FoldSum(pSum) 10

…and looks almost the same as multicore-parallelism

1 input := Load("input") // Single Column: val 2 ids := Range(input) 3 partitionSize := Constant(1024) 4 partitionIDs := Divide(ids, partitionSize) 5 positions := Partition(partitionIDs) 6 inputWPart := Zip(input, partition) 7 partInput := Scatter(inputWPart, positions) 8 pSum := FoldSum(partInput.val, partInput.partition) 9 totalSum := FoldSum(pSum)

backend performs simple static analysis

slide-49
SLIDE 49

A portable high-performance database kernel

slide-50
SLIDE 50

Optimizer

The Voodoo query processing system

SQL Unified Algebra Intermediate Language OS/Hardware/…

Logical Plan

"Liberated" from MonetDB

Similar to code generation example

OpenCL Our Contributions

slide-51
SLIDE 51

Let’s extend our example

SELECT SUM(l_quantity) FROM lineitem WHERE l_shipdate> 5

slide-52
SLIDE 52

MonetDB generates a logical plan…

SELECT SUM(l_quantity) FROM lineitem WHERE l_shipdate> 5

sum select lineitem

slide-53
SLIDE 53

…we compile the logical plan to Voodoo…

tmp1 = Load(.lineitem.l_quantity) tmp2 = Load(.lineitem.l_shipdate) tmp3.val = Range(728659,tmp2,0) tmp4.val = Greater(tmp2,.l_shipdate,tmp3,.val) tmp5.val = Range(0,tmp4,1) tmp6 = Zip(.fold,tmp5,.val,.value,tmp4,.val) tmp7.val = FoldSelect(tmp6,.fold,.value) tmp8 = Gather(tmp1,tmp7,.val) tmp14.val = Range(0,tmp8,0) tmp15 = Zip(.fold,tmp14,.val,.value,tmp8,.val) tmp16.L1 = FoldSum(tmp15,.fold,.value) Return(tmp16)

sum select lineitem

slide-54
SLIDE 54

.lineitem.l_quantity .lineitem.l_shipdate Const Greater Range FoldSelect Gather Constant FoldSum

…i.e., a dataflow program

sum select lineitem

slide-55
SLIDE 55

.lineitem.l_quantity .lineitem.l_shipdate Const Greater Range FoldSelect Gather Constant FoldSum

Generating "undergraduate" C from Voodoo

for(size_t i = 0; i < sizeof(l_shipdate); i++) if( ) result +=

.lineitem.l_shipdate C pdate Const Greater .lineitem.l_quantity F Gather Constant FoldSum

l_shipdate[i] > 5

tity Range FoldSelect

l_quantity[i]

slide-56
SLIDE 56

The “graduate" version

.lineitem.l_quantity .lineitem.l_shipdate Const Greater Range FoldSelect Gather Range Constant Divide FoldSum Constant FoldSum

extern size_t inputSize; void fragment1(int* tmp, int* s_date, int* quantity) { for(size_t i = 0; i < grainsize; i++) if(s_date[pId * grainsize + i] > 5)

  • ut[pId] += quantity[pId * grainsize + i]

} void fragment2(int* out, int* tmp) { for(size_t i = 0; i < inputSize / grainsize; i++)

  • utput[0] += tmp[i]

}

control-incompatible folds → hierarchical aggregation Multicore-partitioned fold

slide-57
SLIDE 57

Voodoo outperforms MonetDB (TPC-H subset, SF10)

Monet Voodoo

1 4 5 6 7 8 9 10 11 12 14 15 19 20 500 1000 1500 2000 TPC-H Query Time in ms ′

an old acquaintance

slide-58
SLIDE 58

Voodoo is competitive with Hyper on CPU (selected TPC-H queries, SF10)

120 162 151 63 158 42 47 38 154 76 76 85 420 222 155 591 18 25 73 103 45 30 74 191 279 56 64 96

HyPeR Voodoo

1 4 5 6 7 8 9 10 11 12 14 15 19 20 100 200 300 400 500 600 700 Time in ms

TPC-H Query

slide-59
SLIDE 59

Voodoo is competitive with Hyper on CPU (selected TPC-H queries, SF10)

120 162 151 63 158 42 47 38 154 76 76 85 420 222 155 591 18 25 73 103 45 30 74 191 279 56 64 96

HyPeR Voodoo

1 4 5 6 7 8 9 10 11 12 14 15 19 20 100 200 300 400 500 600 700 Time in ms

TPC-H Query

Generally Competitive

slide-60
SLIDE 60

Voodoo is competitive with Hyper on CPU (selected TPC-H queries, SF10)

120 162 151 63 158 42 47 38 154 76 76 85 420 222 155 591 18 25 73 103 45 30 74 191 279 56 64 96

HyPeR Voodoo

1 4 5 6 7 8 9 10 11 12 14 15 19 20 100 200 300 400 500 600 700 Time in ms

TPC-H Query

Often faster

slide-61
SLIDE 61

Voodoo is competitive with Hyper on CPU (selected TPC-H queries, SF10)

120 162 151 63 158 42 47 38 154 76 76 85 420 222 155 591 18 25 73 103 45 30 74 191 279 56 64 96

HyPeR Voodoo

1 4 5 6 7 8 9 10 11 12 14 15 19 20 100 200 300 400 500 600 700 Time in ms

TPC-H Query

Sometimes slower

slide-62
SLIDE 62

Voodoo is competitive with Hyper on CPU (selected TPC-H queries, SF10)

120 162 151 63 158 42 47 38 154 76 76 85 420 222 155 591 18 25 73 103 45 30 74 191 279 56 64 96

HyPeR Voodoo

1 4 5 6 7 8 9 10 11 12 14 15 19 20 100 200 300 400 500 600 700 Time in ms

TPC-H Query

Q9 is an interesting case where tuning is important

slide-63
SLIDE 63

VOODOO AS A TUNING TOOL

slide-64
SLIDE 64

Selective Foreign-Key Joins

select sum(part.price) from lineitem, part where discount > $x and lineitem.partkey_fk = part.partkey_pk

select items that were sold at a discount greater than x look up their price from the parts table and sum it join

slide-65
SLIDE 65

Selective Foreign-Key Joins

select sum(part.price) from lineitem, part where discount > $x and lineitem.partkey_fk = part.partkey_pk

join select part lineitem

join lookup/Foreign-Key Join primary key, i.e., unique

slide-66
SLIDE 66

Selective Foreign-Key Joins

select sum(price) from lineitem, part where discount > $1 and lineitem.partkey = part.partkey

2 3 2 2 1 3 1 partkey .80 .93 .10 .75 price

Foreign-Key Index

lineitem

part

.1 .01 .06 .07 .5 .25 .0 .15 .03 discount

slide-67
SLIDE 67

Selective Foreign-Key Joins

select sum(price) from lineitem, part where discount > $1 and lineitem.partkey = part.partkey

2 3 2 2 1 3 1 partkey .80 .93 .10 .75 price

select read key/pointer resolve pointer/ read value

.1 .01 .06 .07 .5 .25 .0 .15 .03 discount ✔ ✘ ✔ ✔ ✔ ✔ ✘ ✔ ✘ >.05

slide-68
SLIDE 68

Selective Foreign-Key Joins

select sum(price) from lineitem, part where discount > $1 and lineitem.partkey = part.partkey

✔ ✘ ✔ ✔ ✔ ✔ ✘ ✔ ✘ 2 3 2 2 1 3 1 discount partkey .80 .93 .10 .75 price

slide-69
SLIDE 69

Selective Foreign-Key Joins

select sum(price) from lineitem, part where discount > $1 and lineitem.partkey = part.partkey

✔ ✘ ✔ ✔ ✔ ✔ ✘ ✔ ✘ 2 3 2 2 1 3 1 discount partkey .80 .93 .10 .75 price

slide-70
SLIDE 70

Branching Foreign-Key Joins

.part.p_retailprice .lineitem.lineitem_part .lineitem.l_discount Range Greater Range Zip FoldSelect Gather Gather Range Zip FoldSum return

for(size_t i = 0; i < lineitemsize; i++) if(discount[i] > $1) result += price[partkey[i]];

select sum(price) from lineitem, part where discount > $1 and partkey = partkey

20 40 60 80 100 0.00 0.05 0.10 0.15 0.20 0.25 Selectivity Time in seconds

Key-Lookup Selection

slide-71
SLIDE 71

Predicated Foreign-Key Joins

for(size_t i = 0; i < lineitemsize; i++) if(discount[i] > $1) result += price[partkey[i]];

slide-72
SLIDE 72

Predicated Foreign-Key Joins

for(size_t i = 0; i < lineitemsize; i++) result += (discount[i] > $1) * price[partkey[i]];

slide-73
SLIDE 73

Selective Foreign-Key Joins

select sum(price) from lineitem, part where discount > $1 and lineitem.partkey = part.partkey

✔ ✘ ✔ ✔ ✔ ✔ ✘ ✔ ✘ 2 3 2 2 1 3 1 discount partkey .80 .93 .10 .75 price

slide-74
SLIDE 74

Predicated Foreign-Key Joins

select sum(price) from lineitem, part where discount > $1 and lineitem.partkey = part.partkey

✔ ✘ ✔ ✔ ✔ ✔ ✘ ✔ ✘ 2 3 2 2 1 3 1 discount partkey .80 .93 .10 .75 price

slide-75
SLIDE 75

Predicated Foreign-Key Joins

.part.p_retailprice .lineitem.lineitem_part .lineitem.l_discount Range Greater Range Zip FoldSelect Gather Gather Range Zip FoldSum return

.part.p_retailprice .lineitem.lineitem_part .lineitem.l_discount Range Greater Gather Multiply Range Zip FoldSum return

Multiply

FoldSum( Gather( Gather( FoldSelect(x),y),z)) =>FoldSum( Multiply(x, Gather(y,z)))

slide-76
SLIDE 76

Selective Foreign-Key Joins

for(size_t i = 0; i < lineitemsize; i++) result += (price[i] > $1) * p_discount[partkey[i]];

select sum(price) from lineitem, part where discount > $1 and partkey = partkey

.part.p_retailprice .lineitem.lineitem_part .lineitem.l_discount Range Greater Gather Multiply Range Zip FoldSum return

slide-77
SLIDE 77

Selective Foreign-Key Joins

for(size_t i = 0; i < lineitemsize; i++) result += (price[i] > $1) * p_discount[partkey[i]];

select sum(price) from lineitem, part where discount > $1 and partkey = partkey

Branching Predicated Aggregation

20 40 60 80 100 0.00 0.05 0.10 0.15 0.20 0.25 Selectivity Time in seconds

.part.p_retailprice .lineitem.lineitem_part .lineitem.l_discount Range Greater Gather Multiply Range Zip FoldSum return

slide-78
SLIDE 78

Double-predicated Foreign-Key Joins

.part.p_retailprice .lineitem.lineitem_part .lineitem.l_discount Range Greater Gather Multiply Range Zip FoldSum return .part.p_retailprice .lineitem.lineitem_part .lineitem.l_discount Range Greater Multiply Gather Multiply Range Zip FoldSum return

FoldSum( Multiply( Gather( x,y),z)) =>FoldSum( Multiply( Gather( Multiply(x,z),y),z))

slide-79
SLIDE 79

Predicated Foreign-Key Joins

select sum(price) from lineitem, part where discount > $1 and lineitem.partkey = part.partkey

✔ ✘ ✔ ✔ ✔ ✔ ✘ ✔ ✘ 2 3 2 2 1 3 1 discount partkey .80 .93 .10 .75 price

slide-80
SLIDE 80

Predicated Foreign-Key Joins

select sum(price) from lineitem, part where discount > $1 and lineitem.partkey = part.partkey

✔ ✘ ✔ ✔ ✔ ✔ ✘ ✔ ✘ 2 3 2 2 1 3 1 discount partkey .80 .93 .10 .75 price

slide-81
SLIDE 81

Double-predicated Foreign-Key Joins

select sum(price) from lineitem, part where discount > $1 and lineitem.partkey = part.partkey

✔ ✘ ✔ ✔ ✔ ✔ ✘ ✔ ✘ 2 3 2 2 1 3 1 discount partkey .80 .93 .10 .75 price

slide-82
SLIDE 82

Double-predicated Foreign-Key Joins

for(size_t i = 0; i < lineitemsize; i++) result += (discount[i] > $1) * price[partkey[i]];

select sum(price) from lineitem, part where discount > $1 and partkey = partkey

.part.p_retailprice .lineitem.lineitem_part .lineitem.l_discount Range Greater Multiply Gather Multiply Range Zip FoldSum return

slide-83
SLIDE 83

Double-predicated Foreign-Key Joins

for(size_t i = 0; i < lineitemsize; i++) result += (discount[i] > $1) * price[partkey[i] * (discount[i] > $1)];

select sum(price) from lineitem, part where discount > $1 and partkey = partkey

.part.p_retailprice .lineitem.lineitem_part .lineitem.l_discount Range Greater Multiply Gather Multiply Range Zip FoldSum return

Branching Predicated Aggregation Predicated Lookups

20 40 60 80 100 0.00 0.05 0.10 0.15 0.20 0.25 Selectivity Time in seconds

slide-84
SLIDE 84

Double-predicated Foreign-Key Joins

Branching Predicated Aggregation Predicated Lookups

20 40 60 80 100 0.00 0.05 0.10 0.15 0.20 0.25 Selectivity Time in seconds

slide-85
SLIDE 85

Voodoo Wrap

  • Fast and expressive like C
  • Optimizable like relational algebra
  • Portable to different devices
  • Enables serendipitous discovery 

  • f new optimizations

Optimizer SQL Voodoo Executable OS/Hardware/…

Logical Plan

slide-86
SLIDE 86

Quo Vadis

slide-87
SLIDE 87

FUTURE WORK

Optimizer

SQL Voodoo Executable OS/Hardware/…

Relational Algebra

slide-88
SLIDE 88

FUTURE WORK

Optimizer

SQL Voodoo Executable OS/Hardware/…

Relational Algebra

Graphs Arrays

Tensorflow Streams Domain Specific

[VLDB 2018]

Weld

slide-89
SLIDE 89

QUO VADIS

Optimizer

SQL Voodoo Executable OS/Hardware/…

Relational Algebra

slide-90
SLIDE 90

Plugging holes in the Design Space

120 162 151 63 158 42 47 38 154 76 76 85 420 222 155 591 18 25 73 103 45 30 74 191 279 56 64 96

HyPeR Voodoo

1 4 5 6 7 8 9 10 11 12 14 15 19 20 100 200 300 400 500 600 700 Time in ms

TPC-H Query

Remember these?

slide-91
SLIDE 91

Plugging holes in the Design Space

  • Massively Parallel Top-K [SIGMOD 2018]
  • Incremental Window Computation [ongoing]
  • Dynamic Load Balancing on GPUs [future]
slide-92
SLIDE 92

Auto-Generating Databases

Branching Predicated Aggregation Predicated Lookups

20 40 60 80 100 0.00 0.05 0.10 0.15 0.20 0.25 Selectivity Time in seconds

slide-93
SLIDE 93

Auto-Generating Databases

Branching Predicated Aggregation Predicated Lookups

20 40 60 80 100 0.00 0.05 0.10 0.15 0.20 0.25 Selectivity in % Time in seconds

slide-94
SLIDE 94

Auto-Generating Databases

  • Hardware-conscious modelling & tuning [ongoing]
slide-95
SLIDE 95

Auto-Generating Databases

DBMS Indexing Processing … Hardware

DBMS Indexing Processing Hardware

Self-tuning Self-driving Self-generating

DBMS Hardware

CPU Core 1 L1 Cache TLB Last Level (L3) Cache

Memory

Registers Core 2 L1 Cache TLB L2 Cache Registers

  • n die
  • ff die

L2 Cache

Indexing Processing DBMS Hardware

CPU Core 1 L1 Cache TLB Last Level (L3) Cache

Memory

Registers Core 2 L1 Cache TLB L2 Cache Registers

  • n die
  • ff die

L2 Cache

Indexing Processing

Hardware Model Voodoo Program Synthesis

slide-96
SLIDE 96

Auto-Generating Databases

  • Hardware-conscious modelling & tuning [ongoing]
  • Auto-Generating Databases [future]
slide-97
SLIDE 97

Wrapping up

slide-98
SLIDE 98

Thanks to Collaborators

Sam Madden Mike Stonebraker Anil Shanbhag Malte Schwarzkopf Matei Zaharia Shoumik Palkar

slide-99
SLIDE 99

A new DB research hub — Visit us!

Hyde Park Buckingham Palace London Eye Westminster Imperial College

slide-100
SLIDE 100

Recruiting

  • Interns
  • PhDs
  • Postdocs
slide-101
SLIDE 101

Thank You

slide-102
SLIDE 102

Backup slides

slide-103
SLIDE 103

Domain-Conscious Database Design

merge the sequence ACCCTA into the graph while minimizing the number of new branches count the number of mutations of a gene find the most common allele of a gene T A G A C G A C T C A A C A G C T A

Database: Queries: Multiple sequence alignment

slide-104
SLIDE 104

Predicated Foreign-Key Joins

.part.p_retailprice .lineitem.lineitem_part .lineitem.l_discount Range Greater Range Zip FoldSelect Gather Gather Range Zip FoldSum return

.part.p_retailprice .lineitem.lineitem_part .lineitem.l_discount Range Greater Gather Multiply Range Zip FoldSum return

Multiply

FoldSum( Gather( Gather( FoldSelect(x),y),z)) =>FoldSum( Multiply(x, Gather(y,z)))

slide-105
SLIDE 105

Double-predicated Foreign-Key Joins

.part.p_retailprice .lineitem.lineitem_part .lineitem.l_discount Range Greater Gather Multiply Range Zip FoldSum return .part.p_retailprice .lineitem.lineitem_part .lineitem.l_discount Range Greater Multiply Gather Multiply Range Zip FoldSum return

FoldSum( Multiply( Gather( x,y),z)) =>FoldSum( Multiply( Gather( Multiply(x,z),y),z))

slide-106
SLIDE 106

TPC-H Query 9 in Hyper

  • Select (on dimension/target table) & Join Query
  • HyPeR strategy: select scan + hash-join
  • Ignores foreign key index
  • Voodoo strategy: select scan, build bitmap, join using FK-Index
  • Takes advantage of foreign key index
slide-107
SLIDE 107

294 347 102 213 288

  • 13

13 208 184 170 61 37 47

Voodoo Ocelot

1 4 5 6 8 12 19 50 100 150 200 250 300 350 Time in ms

TPC-H ON GPUS

slide-108
SLIDE 108

Why Exploration? Why not machine Learning?

  • Because it takes ML-experts
  • There often isn’t enough data (e.g., for catastrophic events)
  • It is slow (in particular the training)
slide-109
SLIDE 109

…when generating code

  • LLVM
  • No Multithreading support
  • SIMD support is best-effort
  • No transactional memory support
  • No real GPU support
  • No NUMA support