HARDWARE-CONSCIOUS DATA PROCESSING SYSTEMS Holger Pirk - PowerPoint PPT Presentation

…and looks almost the same as multicore-parallelism 1 input := Load ("input") // Single Column: val 2 ids := Range (input) 3 partitionSize := Constant (1024) 4 partitionIDs := Divide (ids, partitionSize) 5 positions := Partition (partitionIDs) 6 inputWPart := Zip (input, partition) 7 partInput := Scatter (inputWPart, positions) 8 pSum := FoldSum (partInput.val, partInput.partition) backend performs 9 totalSum := FoldSum (pSum) simple static analysis 1 input := Load ("input") // Single Column: val 2 ids := Range (input) 3 laneCount := Constant (2) 4 partitionIDs := Modulo (ids, laneCount) 5 positions := Partition (partitionIDs) 6 inputWPart := Zip (input, partition) 7 partInput := Scatter (inputWPart, positions) 8 pSum := FoldSum (partInput.val, partInput.partition) 9 totalSum := FoldSum (pSum) 10

A portable high-performance database kernel

The Voodoo query processing system SQL "Liberated" from MonetDB Similar to code generation example Logical Plan Optimizer Unified Algebra Our Contributions Intermediate Language OpenCL OS/Hardware/…

Let’s extend our example SELECT SUM(l_quantity) FROM lineitem WHERE l_shipdate> 5

MonetDB generates a logical plan… sum SELECT SUM(l_quantity) select FROM lineitem WHERE l_shipdate> 5 lineitem

…we compile the logical plan to Voodoo… tmp1 = Load(.lineitem.l_quantity) sum tmp2 = Load(.lineitem.l_shipdate) tmp3.val = Range(728659,tmp2,0) tmp4.val = Greater(tmp2,.l_shipdate,tmp3,.val) select tmp5.val = Range(0,tmp4,1) tmp6 = Zip(.fold,tmp5,.val,.value,tmp4,.val) tmp7.val = FoldSelect(tmp6,.fold,.value) lineitem tmp8 = Gather(tmp1,tmp7,.val) tmp14.val = Range(0,tmp8,0) tmp15 = Zip(.fold,tmp14,.val,.value,tmp8,.val) tmp16.L1 = FoldSum(tmp15,.fold,.value) Return(tmp16)

…i.e., a dataflow program Const .lineitem.l_shipdate Greater sum Range .lineitem.l_quantity FoldSelect select Gather lineitem Constant FoldSum

Generating "undergraduate" C from Voodoo C Const Const .lineitem.l_shipdate .lineitem.l_shipdate pdate Greater Greater for(size_t i = 0; i < sizeof(l_shipdate); i++) l_shipdate[i] > 5 if( ) Range Range l_quantity[i] result += .lineitem.l_quantity .lineitem.l_quantity tity F FoldSelect FoldSelect Gather Gather Constant Constant FoldSum FoldSum

The “graduate" version .lineitem.l_shipdate Const Greater extern size_t inputSize; Range void fragment1(int* tmp, int* s_date, int* quantity) { for(size_t i = 0; i < grainsize; i++) .lineitem.l_quantity FoldSelect if(s_date[pId * grainsize + i] > 5) out[pId] += quantity[pId * grainsize + i] Gather } void fragment2(int* out, int* tmp) { Range Constant for(size_t i = 0; i < inputSize / grainsize; i++) output[0] += tmp[i] Divide } FoldSum Multicore-partitioned fold control-incompatible folds Constant → hierarchical aggregation FoldSum

Voodoo outperforms MonetDB (TPC-H subset, SF10) ′ 2000 Monet Voodoo 1500 Time in ms 1000 500 0 1 4 5 6 7 8 9 10 11 12 14 15 19 20 TPC - H Query an old acquaintance

Voodoo is competitive with Hyper on CPU (selected TPC-H queries, SF10) 700 HyPeR Voodoo 591 600 500 Time in ms 420 400 279 300 222 191 200 162 151 158 155 154 120 103 96 76 85 76 73 74 100 63 64 56 47 38 45 30 42 18 25 0 1 4 5 6 7 8 9 10 11 12 14 15 19 20 TPC-H Query

Voodoo is competitive with Hyper on CPU (selected TPC-H queries, SF10) Generally Competitive 700 HyPeR Voodoo 591 600 500 Time in ms 420 400 279 300 222 191 200 162 151 158 155 154 120 103 96 76 85 76 73 74 100 63 64 56 47 38 45 30 42 18 25 0 1 4 5 6 7 8 9 10 11 12 14 15 19 20 TPC-H Query

Voodoo is competitive with Hyper on CPU (selected TPC-H queries, SF10) Often faster 700 HyPeR Voodoo 591 600 500 Time in ms 420 400 279 300 222 191 200 162 151 158 155 154 120 103 96 76 85 76 73 74 100 63 64 56 47 38 45 30 42 18 25 0 1 4 5 6 7 8 9 10 11 12 14 15 19 20 TPC-H Query

Voodoo is competitive with Hyper on CPU (selected TPC-H queries, SF10) Sometimes slower 700 HyPeR Voodoo 591 600 500 Time in ms 420 400 279 300 222 191 200 162 151 158 155 154 120 103 96 76 85 76 73 74 100 63 64 56 47 38 45 30 42 18 25 0 1 4 5 6 7 8 9 10 11 12 14 15 19 20 TPC-H Query

Voodoo is competitive with Hyper on CPU (selected TPC-H queries, SF10) Q9 is an interesting case where tuning is important 700 HyPeR Voodoo 591 600 500 Time in ms 420 400 279 300 222 191 200 162 151 158 155 154 120 103 96 76 85 76 73 74 100 63 64 56 47 38 45 30 42 18 25 0 1 4 5 6 7 8 9 10 11 12 14 15 19 20 TPC-H Query

VOODOO AS A TUNING TOOL

Selective Foreign-Key Joins join select sum(part.price) from lineitem, part where discount > $x and lineitem.partkey_fk = part.partkey_pk select items that were sold at a discount greater than x look up their price from the parts table and sum it

Selective Foreign-Key Joins lookup/Foreign-Key Join join join select sum(part.price) from lineitem, part where discount > $x select part and lineitem.partkey_fk = part.partkey_pk lineitem primary key, i.e., unique

Selective Foreign-Key Joins lineitem part price partkey discount 0 .80 .1 2 .93 .01 0 .10 .06 3 .75 .07 2 .5 2 .25 Foreign-Key Index 1 .0 3 .15 1 .03 select sum(price) from lineitem, part where discount > $1 and lineitem.partkey = part.partkey

Selective Foreign-Key Joins read key/pointer resolve pointer/ select read value price >.05 partkey discount 0 .80 .1 ✔ 2 .93 .01 ✘ 0 .10 .06 ✔ 3 .75 .07 ✔ 2 .5 ✔ 2 .25 ✔ 1 .0 ✘ 3 .15 ✔ 1 .03 ✘ select sum(price) from lineitem, part where discount > $1 and lineitem.partkey = part.partkey

Selective Foreign-Key Joins price discount partkey 0 .80 ✔ 2 .93 ✘ 0 .10 ✔ 3 .75 ✔ 2 ✔ 2 ✔ 1 ✘ 3 ✔ 1 ✘ select sum(price) from lineitem, part where discount > $1 and lineitem.partkey = part.partkey

Branching Foreign-Key Joins .lineitem.l_discount Range select sum(price) Selection Key-Lookup from lineitem, part Greater where discount > $1 and partkey = partkey for(size_t i = 0; i < lineitemsize; i++) Range if(discount[i] > $1) result += price[partkey[i]]; Zip .lineitem.lineitem_part FoldSelect .part.p_retailprice Gather 0.25 0.20 Time in seconds Gather 0.15 Range 0.10 0.05 Zip 0.00 20 40 60 80 100 FoldSum Selectivity return

Predicated Foreign-Key Joins for(size_t i = 0; i < lineitemsize; i++) if(discount[i] > $1) result += price[partkey[i]];

Predicated Foreign-Key Joins for(size_t i = 0; i < lineitemsize; i++) result += (discount[i] > $1) * price[partkey[i]];

Selective Foreign-Key Joins price discount partkey 0 .80 ✔ 2 .93 ✘ 0 .10 ✔ 3 .75 ✔ 2 ✔ 2 ✔ 1 ✘ 3 ✔ 1 ✘ select sum(price) from lineitem, part where discount > $1 and lineitem.partkey = part.partkey

Predicated Foreign-Key Joins price discount partkey 0 .80 ✔ 2 .93 ✘ 0 .10 ✔ 3 .75 ✔ 2 ✔ 2 ✔ 1 ✘ 3 ✔ 1 ✘ select sum(price) from lineitem, part where discount > $1 and lineitem.partkey = part.partkey

Predicated Foreign-Key Joins .lineitem.l_discount Range .lineitem.l_discount Greater .part.p_retailprice .lineitem.lineitem_part Range Range Zip Gather Greater FoldSum( .lineitem.lineitem_part FoldSelect Gather( Range Multiply Gather( .part.p_retailprice Gather Zip FoldSelect(x),y),z)) Gather =>FoldSum( FoldSum Multiply(x, Range Multiply return Gather(y,z))) Zip FoldSum return

Selective Foreign-Key Joins select sum(price) from lineitem, part where discount > $1 and partkey = partkey .lineitem.l_discount for(size_t i = 0; i < lineitemsize; i++) .part.p_retailprice .lineitem.lineitem_part Range result += (price[i] > $1) * p_discount[partkey[i]]; Gather Greater Range Multiply Zip FoldSum return

Selective Foreign-Key Joins select sum(price) from lineitem, part where discount > $1 and partkey = partkey .lineitem.l_discount for(size_t i = 0; i < lineitemsize; i++) .part.p_retailprice .lineitem.lineitem_part Range result += (price[i] > $1) * p_discount[partkey[i]]; Gather Greater 0.25 Range Multiply 0.20 Time in seconds 0.15 Zip 0.10 Branching FoldSum Predicated Aggregation 0.05 return 0.00 20 40 60 80 100 Selectivity

Double-predicated Foreign-Key Joins .lineitem.l_discount .lineitem.l_discount Range .part.p_retailprice .lineitem.lineitem_part Range .lineitem.lineitem_part Greater .part.p_retailprice Multiply Gather Greater FoldSum( Multiply( Range Multiply Gather Gather( Zip x,y),z)) Range Multiply =>FoldSum( FoldSum Zip Multiply( return Gather( FoldSum Multiply(x,z),y),z)) return

Predicated Foreign-Key Joins price discount partkey 0 .80 ✔ 2 .93 ✘ 0 .10 ✔ 3 .75 ✔ 2 ✔ 2 ✔ 1 ✘ 3 ✔ 1 ✘ select sum(price) from lineitem, part where discount > $1 and lineitem.partkey = part.partkey

Double-predicated Foreign-Key Joins price discount partkey 0 .80 ✔ 2 .93 ✘ 0 .10 ✔ 3 .75 ✔ 2 ✔ 2 ✔ 1 ✘ 3 ✔ 1 ✘ select sum(price) from lineitem, part where discount > $1 and lineitem.partkey = part.partkey

Double-predicated Foreign-Key Joins select sum(price) .lineitem.l_discount from lineitem, part where discount > $1 Range and partkey = partkey for(size_t i = 0; i < lineitemsize; i++) .lineitem.lineitem_part Greater result += (discount[i] > $1) * price[partkey[i]]; .part.p_retailprice Multiply Gather Range Multiply Zip FoldSum return

Double-predicated Foreign-Key Joins select sum(price) .lineitem.l_discount from lineitem, part where discount > $1 Range and partkey = partkey for(size_t i = 0; i < lineitemsize; i++) .lineitem.lineitem_part Greater result += (discount[i] > $1) * price[partkey[i] * (discount[i] > $1)]; .part.p_retailprice Multiply Gather 0.25 0.20 Range Multiply Time in seconds 0.15 Zip 0.10 Branching Predicated Aggregation FoldSum 0.05 Predicated Lookups 0.00 return 20 40 60 80 100 Selectivity

Double-predicated Foreign-Key Joins 0.25 0.20 Time in seconds 0.15 0.10 Branching Predicated Aggregation 0.05 Predicated Lookups 0.00 20 40 60 80 100 Selectivity

Voodoo Wrap SQL • Fast and expressive like C Logical Plan • Optimizable like relational algebra Voodoo Optimizer • Portable to different devices Executable • Enables serendipitous discovery   of new optimizations OS/Hardware/…

Quo Vadis

FUTURE WORK SQL Relational Algebra Optimizer Voodoo Executable OS/Hardware/…

FUTURE WORK Domain Streams Specific Tensorflow Arrays Graphs SQL Weld Relational Algebra Optimizer Voodoo Executable OS/Hardware/… [VLDB 2018]

QUO VADIS SQL Relational Algebra Optimizer Voodoo Executable OS/Hardware/…

Plugging holes in the Design Space Remember these? 700 HyPeR Voodoo 591 600 500 Time in ms 420 400 279 300 222 191 200 162 151 158 155 154 120 103 96 76 85 76 73 74 100 63 64 56 47 38 45 30 42 18 25 0 1 4 5 6 7 8 9 10 11 12 14 15 19 20 TPC-H Query

Plugging holes in the Design Space • Massively Parallel Top-K [SIGMOD 2018] • Incremental Window Computation [ongoing] • Dynamic Load Balancing on GPUs [future]

Auto-Generating Databases 0.25 0.20 Time in seconds 0.15 0.10 Branching Predicated Aggregation 0.05 Predicated Lookups 0.00 20 40 60 80 100 Selectivity

Auto-Generating Databases 0.25 0.20 Time in seconds 0.15 0.10 Branching Predicated Aggregation 0.05 Predicated Lookups 0.00 0 20 40 60 80 100 Selectivity in %

Auto-Generating Databases • Hardware-conscious modelling & tuning [ongoing]

Auto-Generating Databases DBMS DBMS Hardware Model Indexing Processing Indexing Processing … Voodoo Program Hardware Hardware Self-driving Synthesis Self-tuning CPU CPU DBMS DBMS Core 1 Core 1 Core 2 Core 2 Registers Registers Registers Registers L1 Cache L1 Cache TLB TLB L1 Cache L1 Cache TLB TLB Indexing Indexing L2 Cache L2 Cache L2 Cache L2 Cache Last Level (L3) Cache Last Level (L3) Cache Processing Processing on die on die off die off die Memory Memory Hardware Hardware Self-generating

Auto-Generating Databases • Hardware-conscious modelling & tuning [ongoing] • Auto-Generating Databases [future]

Wrapping up

Thanks to Collaborators Sam Madden Matei Zaharia Mike Stonebraker Shoumik Palkar Anil Shanbhag Malte Schwarzkopf

A new DB research hub — Visit us! London Eye Westminster Hyde Park Buckingham Palace Imperial College

Recruiting • Interns • PhDs • Postdocs

HARDWARE-CONSCIOUS DATA PROCESSING SYSTEMS Holger Pirk - PowerPoint PPT Presentation

HARDWARE-CONSCIOUS DATA PROCESSING SYSTEMS Holger Pirk http://doc.ic.ac.uk/~hlgr Data Processing Performance - A Case Study Data Processing Performance - A Case Study 100 75 50 25 0 April May June July select sum(sales),

Hardware Observability Framework Hardware Observability Framework Hardware Observability

Chapter 4 Marketing ethics Today Conscious marketing Ethical marketing decision 2

A Cache-conscious Profitability A Cache-conscious Profitability Model for Empirical Tuning of

61A Lecture 30 Announcements Data Processing Data Processing 4 Data Processing Many data sets

VC. VC. Hardware Startup The Hardware Revolu/on The Hardware Revolution Removing Barriers to

Sec Secure ure Hardware Hardware and Hardware and Hardware- En Enabled abled Security

Digital Signal Processing Solutions Digital Signal Processing Solutions SIGNAL PROCESSING

FOOD PROCESSING FOOD PROCESSING GREEN BEAN PROCESSING GREEN BEAN PROCESSING GREEN BEAN

Data Processing on Modern Hardware Jens Teubner, TU Dortmund, DBIS Group

software and hardware for the Internet of Things. Choose hardware Design hardware Design

Mikeys Matthew Gunton, Joseph Min, Dave Jha, and Aidan Gales What is a Price Conscious

CONSCIOUS CAPITALISM Raj Sisodia Director, Mastek Ltd. Professor of Marketing, Bentley

ONE GLOBE ONE PEOPLE The conscious human the world needs you Programme 19.00 to 22.00 The

School to Heart to Home Conscious Discipline Parent Workshop Series 1. Composure The 7 2.

School to Heart to Home Conscious Discipline Parent Workshop Series What are the outcomes of the

IDENTITY BRANDING The Effect of Packaging Design for Eco-Conscious Cleaning Products on

C OLUMN D ATABASES A N D R E W C R O T T Y & A L E X G A L A K AT O S O UTLINE RDBMS

Week 6 Discussion Wednesday, 11/6/19 Reminders PSA4 Submission due Tuesday, November 12 11:59pm

Approaches to GPU computing Manuel Ujaldon Nvidia CUDA Fellow Computer Architecture Department

The Multi2Sim Simulation Framework A CPU-GPU Model for Heterogeneous Computing www.multi2sim.org

World: a 20 meter cube around this pit, centered at the star. Pit Tube Z=10 meters 4.35

BCM Performance Study and Calibrations for F2/EMC Spring 2018 Experiment F2-EMC Collaboration

STM32L476 Parallel I/O Ports References: STM32L4x6 Reference Manual STM32L476xx Data Sheet

Corrections and Transfers (Updated 4/14/12) Bill Cahill 4-5124 bpcahil@iastate.edu Agenda

HARDWARE-CONSCIOUS DATA PROCESSING SYSTEMS Holger Pirk - PowerPoint PPT Presentation

HARDWARE-CONSCIOUS DATA PROCESSING SYSTEMS Holger Pirk http://doc.ic.ac.uk/~hlgr Data Processing Performance - A Case Study Data Processing Performance - A Case Study 100 75 50 25 0 April May June July select sum(sales),

Hardware Observability Framework Hardware Observability Framework Hardware Observability

Chapter 4 Marketing ethics Today Conscious marketing Ethical marketing decision 2

A Cache-conscious Profitability A Cache-conscious Profitability Model for Empirical Tuning of

61A Lecture 30 Announcements Data Processing Data Processing 4 Data Processing Many data sets

VC. VC. Hardware Startup The Hardware Revolu/on The Hardware Revolution Removing Barriers to

Sec Secure ure Hardware Hardware and Hardware and Hardware- En Enabled abled Security

Digital Signal Processing Solutions Digital Signal Processing Solutions SIGNAL PROCESSING

FOOD PROCESSING FOOD PROCESSING GREEN BEAN PROCESSING GREEN BEAN PROCESSING GREEN BEAN

Data Processing on Modern Hardware Jens Teubner, TU Dortmund, DBIS Group

software and hardware for the Internet of Things. Choose hardware Design hardware Design

Mikeys Matthew Gunton, Joseph Min, Dave Jha, and Aidan Gales What is a Price Conscious

CONSCIOUS CAPITALISM Raj Sisodia Director, Mastek Ltd. Professor of Marketing, Bentley

ONE GLOBE ONE PEOPLE The conscious human the world needs you Programme 19.00 to 22.00 The

School to Heart to Home Conscious Discipline Parent Workshop Series 1. Composure The 7 2.

School to Heart to Home Conscious Discipline Parent Workshop Series What are the outcomes of the

IDENTITY BRANDING The Effect of Packaging Design for Eco-Conscious Cleaning Products on

C OLUMN D ATABASES A N D R E W C R O T T Y &amp; A L E X G A L A K AT O S O UTLINE RDBMS

Week 6 Discussion Wednesday, 11/6/19 Reminders PSA4 Submission due Tuesday, November 12 11:59pm

Approaches to GPU computing Manuel Ujaldon Nvidia CUDA Fellow Computer Architecture Department

The Multi2Sim Simulation Framework A CPU-GPU Model for Heterogeneous Computing www.multi2sim.org

World: a 20 meter cube around this pit, centered at the star. Pit Tube Z=10 meters 4.35

BCM Performance Study and Calibrations for F2/EMC Spring 2018 Experiment F2-EMC Collaboration

STM32L476 Parallel I/O Ports References: STM32L4x6 Reference Manual STM32L476xx Data Sheet

Corrections and Transfers (Updated 4/14/12) Bill Cahill 4-5124 bpcahil@iastate.edu Agenda

C OLUMN D ATABASES A N D R E W C R O T T Y & A L E X G A L A K AT O S O UTLINE RDBMS