SLIDE 1

From HyPer to Hyper

Integrating an academic DBMS into a leading analytics and business intelligence platform Tobias Muehlbauer, tmuehlbauer@tableau.com Jan Finis, jfinis@tableau.com

SLIDE 2

The Story of Hyper

SLIDE 3

2008: HyPer started as a research project at the Technical University of Munich
  • Academic success: more than 50 peer-reviewed publications and several awards; commercial spin-off
March 2016: Tableau acquires HyPer (HyPer ⇨ Hyper)
  • Tableau Europe R&D center in Munich with over 30 full-time employees
Early 2018: Hyper replaces the Tableau Data Engine in existing on-prem and SaaS products
Mid 2018: Tableau Prep launches with Hyper as its processing engine
Mid 2019: Hyper API exposes full Hyper SQL capabilities to partners and customers
2020: Hyper as a general-purpose database service in the Tableau Platform

SLIDE 4

Context: One Size Fits All?

One size does not fit all, but what about data lag and disconnect?

SLIDE 5

Context: A Changing Hardware Landscape

In order to leverage modern hardware, databases need to change.

SLIDE 6

The Idea Behind Hyper

SLIDE 7
SLIDE 8

Inside Hyper

  • Optimized for the storage hierarchy: registers, caches, DRAM, NVRAM, local SSD/HDD, network
  • Advanced query optimization
  • Morsel-driven parallelization

Query Compilation & Vectorized Scans

[Diagram: three scan implementations behind a single tuple-at-a-time interface feeding JIT-compiled query pipelines, operating on vectors of e.g. 8192 records:
  • Interpreted vectorized scan on compressed Data Blocks: vectorized evaluation of SARGable predicates on compressed data (aided by indexes and PSMAs), unpacking only the matches and pushing them tuple-at-a-time
  • Interpreted vectorized scan on uncompressed chunks: vectorized evaluation of SARGable predicates and copying of matches
  • JIT-compiled tuple-at-a-time scan on uncompressed chunks: tuple-at-a-time evaluation of scan predicates]
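A minimal sketch of the first scan flavor, assuming dictionary-compressed Data Blocks (all names, the encoding, and the Python setting are illustrative; Hyper itself is C++ and far more elaborate):

```python
# Illustrative model of an interpreted vectorized scan: evaluate a SARGable
# predicate on compressed data, then unpack only the matches and push them
# tuple-at-a-time into the (here simulated) JIT-compiled pipeline.

VECTOR_SIZE = 8192  # records per vector, as on the slide

def scan_compressed(codes, dictionary, predicate, consume):
    """codes: per-row dictionary codes of one column; dictionary: code -> value."""
    # Vectorized phase: evaluate the predicate once per *distinct* value in
    # the dictionary instead of once per row.
    matching = {c for c, v in enumerate(dictionary) if predicate(v)}
    for start in range(0, len(codes), VECTOR_SIZE):
        vector = codes[start:start + VECTOR_SIZE]
        # Build a selection vector of match positions within this vector.
        selection = [i for i, c in enumerate(vector) if c in matching]
        # Unpack only the matches; push them tuple-at-a-time.
        for i in selection:
            consume(dictionary[vector[i]])

out = []
scan_compressed([0, 2, 1, 2, 0], ["a", "b", "c"], lambda v: v != "b", out.append)
# out == ["a", "c", "c", "a"]
```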

Fast MVCC

[Diagram: versioned Accounts table (Owner, Bal) with per-transaction undo buffers, a recentlyCommitted list, and active transactions (e.g., SallyàWendy, SallyàHenry transfers alongside read-only Σ queries)]

  • Main-memory column store; the latest version is stored in-place
  • Physical before-image deltas (i.e., column values) in undo buffers
  • Version information stored in an additional hidden column in the base relation (indexes only store references, i.e., rowIDs, to records in base relations)
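An assumed simplification of this scheme (not Hyper's implementation): the newest value sits in-place in the column store, before-images live in undo buffers, and a hidden per-row version entry links to them so that older readers can reconstruct their snapshot.

```python
# Sketch: in-place latest version + undo-buffer chain of before-images.

class UndoEntry:
    def __init__(self, old_value, commit_time, prev):
        self.old_value, self.commit_time, self.prev = old_value, commit_time, prev

class Column:
    def __init__(self, values):
        self.values = values   # latest versions, stored in-place
        self.version = {}      # hidden column: row -> newest UndoEntry

    def update(self, row, new_value, commit_time):
        # keep the physical before-image, then overwrite in place
        self.version[row] = UndoEntry(self.values[row], commit_time,
                                      self.version.get(row))
        self.values[row] = new_value

    def read(self, row, start_time):
        # undo every change committed after this reader's start time
        value, entry = self.values[row], self.version.get(row)
        while entry is not None and entry.commit_time > start_time:
            value, entry = entry.old_value, entry.prev
        return value

bal = Column([10, 10])
bal.update(0, 7, commit_time=5)          # e.g., a Sally -> Mike transfer
old_reader = bal.read(0, start_time=4)   # started before the commit: sees 10
new_reader = bal.read(0, start_time=6)   # sees the in-place value 7
```

Readers never block writers: the hot path reads the in-place value, and only transactions older than a change pay for walking the undo chain.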

Morsel-Driven Parallelization

[Diagram: a dispatcher assigns morsels of the input relation R to worker threads, which probe shared hash tables HT(S) and HT(T) and store matching tuples into the result]
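The morsel-driven join in the figure, reduced to its essentials (an assumed simplification, not Hyper's implementation): a dispatcher queue hands fixed-size morsels of the probe input to worker threads, which probe a shared read-only hash table built from the build side.

```python
# Sketch of morsel-driven parallelization for a hash join.
import threading, queue

MORSEL_SIZE = 2  # illustrative; real morsels hold many thousands of tuples

def parallel_hash_join(build_rows, probe_rows, workers=4):
    ht = {}
    for key, payload in build_rows:          # build phase (shown serial)
        ht.setdefault(key, []).append(payload)

    dispatcher = queue.Queue()               # the "Dispatcher" in the figure
    for i in range(0, len(probe_rows), MORSEL_SIZE):
        dispatcher.put(probe_rows[i:i + MORSEL_SIZE])

    result, lock = [], threading.Lock()

    def worker():
        while True:
            try:
                morsel = dispatcher.get_nowait()
            except queue.Empty:
                return                        # no morsels left
            local = [(k, b, p) for k, p in morsel for b in ht.get(k, [])]
            with lock:                        # publish this morsel's matches
                result.extend(local)

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads: t.start()
    for t in threads: t.join()
    return result

rows = parallel_hash_join([(16, "a"), (27, "b")],
                          [(16, "x"), (10, "y"), (27, "z")])
```

Because workers pull morsels on demand, fast threads simply grab more work, which keeps all cores busy even under skew.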

SLIDE 9

Hyper in Tableau

[Diagram: Hyper embedded across Tableau Desktop, Online, Public, Server, and Prep: extracts, federation, and the Hyper API]

SLIDE 10

From HyPer to Hyper: Challenges

Support

  • Limited support provided up to 30 months after major product version release
  • Compare performance across releases and database engines
  • Semantic differences

Infrastructure

  • Windows, Linux, and macOS
  • Small laptops to large-scale servers and Cloud deployments

Workload

  • Long tail of query complexity generated by Tableau
  • Wide variety of data set characteristics
SLIDE 11

Tableau Workloads

SLIDE 12

What Tableau Workloads Look Like

  • Most queries are “small”: only 0.5% are larger than 5 KB of SQL text
  • But: huge outliers
    • Largest query in our data set: 6.7 MB
    • Largest query we saw so far: 27 MB

And that’s not all due to constant strings…

SLIDE 13

Need a query plan visualizer? https://github.com/tableau/query-graphs/

SLIDE 14

What Tableau Workloads Look Like

[Chart: number of relational operators per query by percentile — table scans, group-bys, sorts, inner joins, outer joins, and group joins; complex queries contain hundreds of operators]

Vogelsgesang et al.: Get Real: How Benchmarks Fail to Represent the Real World. DBTest’18

SLIDE 15

Replacing Tableau’s Old Data Engine

SLIDE 16

Tableau’s old data engine (TDE): a vector-based engine inspired by MonetDB/X100

First step: replace TDE as the backend of all Tableau products

Goals:

  • Deliver performance at scale
  • Seamless transition for customers

SLIDE 17

TDE: The Gold Standard

Having a gold standard is great! We just ran a lot (60k) of workbooks from Tableau Public and compared results. A simple, measurable goal: get the results to match for all of them and be fast. Then we’re done ☺

SLIDE 18

Challenge: Bug Compatibility?

Is it really worth it to show the same result for all queries?

  • Non-deterministic behavior (parallelization!)
  • Bugs in TDE

Our attitude changed over time:
  1. 100% identical results at all cost; customers don’t want their viz to change! This is non-negotiable!
  2. Well, but what if it changes to be correct?
  3. Who said that visualizations can’t change in the first place?
  4. Let’s do the right thing and fix things once and for all!

SLIDE 19

Compatibility Curiosity: String-to-Date Cast

Is “5/7/2020” May 7 or July 5? TDE’s strategy: be aggressive in finding a valid date. Sounds great, but horrible in the relational model! Silent failure: a sales workbook shows more sales in the first 12 days of each month!

  Input      TDE            Hyper
  5/7/2020   May 7, 2020    May 7, 2020
  15/7/2020  July 15, 2020  NULL
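The two casting strategies can be contrasted in a few lines. A hypothetical Python stand-in (the month/day/year field order and both function names are assumptions for illustration, not Hyper or TDE code):

```python
from datetime import date

def strict_cast(text):
    # Hyper-style: assume one field order and reject anything invalid.
    m, d, y = (int(p) for p in text.split("/"))
    try:
        return date(y, m, d)      # invalid month/day -> NULL, not a guess
    except ValueError:
        return None

def lenient_cast(text):
    # TDE-style: aggressively reinterpret the fields to force a valid date.
    m, d, y = (int(p) for p in text.split("/"))
    if m > 12:
        m, d = d, m               # silently swap month and day
    try:
        return date(y, m, d)
    except ValueError:
        return None

strict_cast("15/7/2020")   # None: 15 is not a month
lenient_cast("15/7/2020")  # date(2020, 7, 15): silently reinterpreted
```

The lenient cast never fails loudly, which is exactly why the error surfaces only as skewed aggregates downstream.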

SLIDE 20

Continuous performance improvements past launch

[Bar chart: improvement of Hyper over TDE across releases]

SLIDE 21

Testing Hyper

SLIDE 22

SQL Level Testing

We started with the SQLite test suite:

  • Added our own test cases for features
  • Added regression tests for defects
  • Added fine-grained expectations (e.g., constant folding)

How to execute the tests?

  • First: our own Hyper front end that parses the file and executes the queries
    • Problem: server / protocol code not tested
  • Second: a client (based on libpq) that parses the file and sends queries to a server
    • Problem: harder to debug; the test driver is not the same process

Example test cases:

query N expectConstantResult
SELECT DATE '2001-09-28' + INTEGER '7'
----
2001-10-05

# DATE + INTEGER -> DATE with overflow
statement error 22003
SELECT DATE '4713-01-01 BC' - 1
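A hypothetical mini-runner for this test format (the real driver either embeds Hyper directly or talks to a server via libpq; all names here are invented for illustration):

```python
# Parse blank-line-separated test cases: `query` records compare result rows,
# `statement error <sqlstate>` records expect a specific error code.

def run_tests(text, execute):
    """execute(sql) -> (rows, error_code_or_None); returns the failing cases."""
    failures = []
    for case in [c for c in text.split("\n\n") if c.strip()]:
        lines = [l for l in case.splitlines() if l and not l.startswith("#")]
        if lines[0].startswith("statement error"):
            expected_code = lines[0].split()[-1]
            _, code = execute("\n".join(lines[1:]))
            if code != expected_code:
                failures.append(case)
        elif lines[0].startswith("query"):
            sep = lines.index("----")          # separates SQL from results
            rows, code = execute("\n".join(lines[1:sep]))
            if code is not None or rows != lines[sep + 1:]:
                failures.append(case)
    return failures
```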

SLIDE 23

Testing even more with SQL Level Tests

  • Use EXPLAIN statements to test the optimizer
  • Introduce a function that scans Hyper’s own log, to test for log messages
  • Introduce trace settings that allow printing specific internals to the log
  • Special test functions with side effects to test further internals
    • E.g., a function that allocates thread-local memory
    • SELECT suicide()

# Deduplication for simple domain queries
query S
EXPLAIN SELECT * FROM (SELECT a FROM t GROUP BY 1) t1, (SELECT a FROM t GROUP BY 1) t2
----
executiontarget(1)
join(2) bnl
explicitscan(3)
explicitscan(6)->(4)
groupby(4)
tablescan(5) t
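The same technique works against any engine with an EXPLAIN facility. As a stand-in, here is the idea using Python's bundled SQLite (Hyper's suite asserts its own textual plan format, as above; SQLite's EXPLAIN QUERY PLAN is used purely for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t(a INTEGER)")

def plan_steps(sql):
    # EXPLAIN QUERY PLAN rows are (id, parent, notused, detail)
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]

steps = plan_steps("SELECT a FROM t WHERE a = 1")
# a plan-shape assertion, analogous to the expected plan in the test above:
assert any("t" in detail for detail in steps)   # the plan touches table t
```

Pinning expected plans in tests catches optimizer regressions the result rows alone would never reveal.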

SLIDE 24

SQL vs. C++ Unit Tests

Big controversy!

SQL:
  • Easy to write, usually very succinct
  • No recompilation needed
  • In vivo: can run code in specific query contexts
  • Good test failure reporting
  • Can update expected test results automatically
  • Resembles customer usage of the system

C++:
  • Tests only the code in question, not the whole SQL layer
  • Runs faster

SLIDE 25

Beyond SQL Testing

SQLite tests are great, but they can’t simulate load from multiple connections. Solution: a load-test DSL

  • Open connections
  • Embed SQLite test statements
  • Execute code blocks in parallel
  • Loops

exec CREATE DATABASE mytestdb;
connection mytestdb user=bob {
  repeat 100 {
    exec CREATE TABLE foo AS (SELECT x FROM generate_series(1,1e6) x);
    parallel 10 {
      exec UPDATE foo SET x = x+10 WHERE x % 2 = 0;
    } and 2 {
      test {
        query N
        SELECT SUM(x) FROM foo WHERE x % 2 = 1;
        ----
        1234567
      }
    }
  }
}
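What the DSL expresses can be sketched directly. An illustrative Python version of the scenario above (a lock stands in for the database's transactional isolation; the real test runs against a Hyper server): writers bump even rows in parallel while a reader checks that the sum over odd rows never changes.

```python
import threading

table = list(range(1, 101))          # "foo": x in 1..100
lock = threading.Lock()

def writer():
    for _ in range(100):
        with lock:                   # one UPDATE foo SET x=x+10 WHERE x%2=0
            for i, x in enumerate(table):
                if x % 2 == 0:
                    table[i] = x + 10

def reader(results):
    for _ in range(100):
        with lock:                   # SELECT SUM(x) FROM foo WHERE x%2=1
            results.append(sum(x for x in table if x % 2 == 1))

sums = []
threads = [threading.Thread(target=writer) for _ in range(4)]
threads.append(threading.Thread(target=reader, args=(sums,)))
for t in threads: t.start()
for t in threads: t.join()
assert all(s == sums[0] for s in sums)   # odd rows untouched by the updates
```

The invariant holds because adding 10 keeps even values even; under concurrency bugs (lost isolation), exactly this kind of cross-connection check is what trips.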

SLIDE 26
A/B Testing: MaxPerf / QueryRunner

1. Goal: check for compatibility with TDE
   I. Correctness and performance
   II. A/B test on the 60k Public workbooks
2. How it works:
   I. Starts up Hyper
   II. Loads a workbook (thus sending queries to Hyper)
   III. Checks the number of marks / mark values and records timings
   IV. A/B test between the old and new branch

[Chart: improvement of A over B]

SLIDE 27

Automatic Regression Testing

1. Measure perf on every commit
2. If perf regresses, file a defect
3. Make sure it’s not just noise
4. If perf improves, make it the new expectation
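Step 3 ("make sure it's not just noise") can be made concrete. A sketch of the commit-by-commit check (thresholds and all names are invented for illustration, not the actual harness):

```python
# Flag a regression only when the slowdown clearly exceeds both a relative
# threshold and the run-to-run variance of the samples.
from statistics import mean, stdev

def check_perf(expected_ms, samples_ms, noise_sigma=3.0, min_regression=1.05):
    m = mean(samples_ms)
    noise = noise_sigma * (stdev(samples_ms) if len(samples_ms) > 1 else 0.0)
    if m > expected_ms * min_regression and m - expected_ms > noise:
        return "regression"          # file a defect
    if m < expected_ms:
        return "improvement"         # make this the new expectation
    return "ok"

check_perf(100.0, [130.0, 131.0, 129.0])   # -> "regression"
check_perf(100.0, [95.0, 96.0, 94.0])      # -> "improvement"
```

Ratcheting expectations downward on improvements is what keeps the baseline honest over hundreds of commits.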

SLIDE 28
Fuzzing

1. Use a fuzzer (e.g., AFL)
2. Feed it the SQL grammar
3. Let it run for a long time

Found several vulnerabilities and defects with fuzzing!
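A toy illustration of step 2 (AFL itself is coverage-guided and written in C; this sketch only shows how feeding a grammar yields syntactically plausible, often nasty, SQL inputs — the grammar and names are made up):

```python
import random

GRAMMAR = {
    "<query>": [["SELECT ", "<expr>", " FROM t WHERE ", "<expr>"]],
    "<expr>":  [["<expr>", " + ", "<expr>"], ["(", "<expr>", ")"],
                ["x"], ["NULL"], ["'", "<junk>", "'"]],
    "<junk>":  [["';--"], ["\\0"], [""]],
}

def generate(symbol, rng, depth=0):
    if symbol not in GRAMMAR:
        return symbol                      # terminal: emit as-is
    options = GRAMMAR[symbol]
    if symbol == "<expr>" and depth > 4:
        options = options[2:]              # force termination at depth
    return "".join(generate(s, rng, depth + 1) for s in rng.choice(options))

rng = random.Random(42)
sql = generate("<query>", rng)   # feed this to the server, watch for crashes
```

Grammar-driven inputs get past the parser far more often than random bytes, so the fuzzer spends its time in the optimizer and executor, where the interesting defects live.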

SLIDE 29

Static Code Analysis

1. Enable all compiler warnings and make them errors: -Wall -Wextra -Werror -Woverloaded-virtual -Wunreachable-code-return …
2. Keep code clean: clang-format, clang-tidy
3. Clang static analyzers and sanitizers
   • Address sanitizer
   • Thread sanitizer and memory sanitizer (third-party libraries must also be re-built!)
   • Undefined behavior sanitizer
4. Code coverage

SLIDE 30

Flaky Tests: The Root of All (Test) Evil

Flaky test: a test that sometimes passes and sometimes fails. Worst case: it succeeds 99.9% of the time.

  • Don’t let flaky tests build up!
  • Don’t get into the habit of muting flaky tests!
  • Treat them as high-priority defects! They kill dev productivity!

Root causes:
  • A real defect
  • A bad test, usually dependent on timing or other nondeterminism
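A sketch of hunting down that worst case (illustrative only): rerun the suspect test many times and measure the observed failure rate, then file whatever fails, however rarely, as a defect rather than muting it.

```python
import random

def flakiness(test_fn, runs=1000):
    # rerun the test many times and report the observed failure rate
    failures = 0
    for _ in range(runs):
        try:
            test_fn()
        except AssertionError:
            failures += 1
    return failures / runs

def timing_dependent_test():
    # stand-in for a race: "succeeds 99.9% of the time"
    assert random.random() > 0.001

rate = flakiness(timing_dependent_test)
# rate is small but usually nonzero: exactly the kind of test to fix, not mute
```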

SLIDE 31

Lessons Learned

SLIDE 32

Benchmark Responsibly!

When benchmarking a system…

  • Contact the vendor
  • Report bugs
  • Share your benchmark ahead of time, if possible
    • Allows vendors to give feedback and double-check its validity
    • A good chance to increase the quality of your benchmark
  • Ask the vendor how to configure the system
  • Don’t let misconfigured benchmarks undermine the credibility of your hard work
  • Shout-out to Andrew C. from Brown University

SLIDE 33

There is more than TPC-X

Generated workloads are real:
  I. More and more tools generate queries
  II. Queries are way more complex than hand-written TPC-X queries
  III. Plenty of interesting (and novel!) problems lurking in other workloads

Public BI Benchmark from CWI:
  • https://github.com/cwida/public_bi_benchmark
  • Ghita, Tomé, Boncz: White-box Compression: Learning and Exploiting Compact Table Representations. CIDR’20
  • Based on Tableau Public data

SLIDE 34

Build your system for production

Build a system, not a throw-away prototype:
  • Much more rewarding, longer-lasting sense of achievement
  • Easier to build upon previous work
  • Results closer to reality (micro-benchmarks leave out crucial parts)

Follow standards (e.g., SQL, PostgreSQL):
  • Easy test adaptation, easier benchmarking

Build that system as if it were for production:
  • Test-driven development is great; defects could make your results invalid!
  • Easier to adapt a stable system to the next benchmark
  • There is a real chance to get your system into production in the end
  • OS compatibility (some perf hacks are highly non-portable!)

SLIDE 35

Conclusion

SLIDE 36

Academic projects can make it into production:
  • Design to become a product (it’s fun!)
  • Replacing an old system: a great gold standard, but hard to beat in all cases

TPC-X is not everything!
  • Experiment with more diverse workloads; real-world queries are complex

Various layers of testing are required:
  • Don’t forget stress testing, don’t regress, and avoid flaky tests

Benchmark responsibly!

Try out the Hyper API! tabsoft.co/hyperapi

SLIDE 37