From HyPer to Hyper
Integrating an academic DBMS into a leading analytics and business intelligence platform Tobias Muehlbauer, tmuehlbauer@tableau.com Jan Finis, jfinis@tableau.com
The Story of Hyper
- 2008: HyPer started as a research project at the Technical University of Munich
- Academic success: more than 50 peer-reviewed publications and several awards; commercial spin-off
- March 2016: Tableau acquires HyPer (HyPer ⇨ Hyper); Tableau Europe R&D center in Munich with over 30 full-time employees
- Early 2018: Hyper replaces the Tableau Data Engine in existing on-prem and SaaS products
- Mid 2018: Tableau Prep launches with Hyper as its processing engine
- Mid 2019: Hyper API exposes full Hyper SQL capabilities to partners and customers
- 2020: Hyper as a general-purpose database service in the Tableau platform
- Optimized for the storage hierarchy: registers → caches → DRAM → NVRAM → local SSD/HDD → network
- Advanced query optimization
- Morsel-driven parallelization
Query Compilation & Vectorized Scans

[Figure: An uncompressed chunk (columns A-D) and a compressed Data Block are read through a single scan interface. An interpreted vectorized scan, working on vectors of e.g. 8192 records, evaluates SARGable predicates (on compressed data in Data Blocks, aided by indexes and PSMAs), unpacks the matches, and pushes them tuple-at-a-time into the JIT-compiled query pipeline.]
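A toy sketch of this scan strategy (the names and the "compression" scheme here are illustrative, not Hyper's actual code): block-level min/max metadata prunes whole blocks, the predicate is evaluated vector-at-a-time on the compressed representation, and only the matches are unpacked and pushed tuple-at-a-time into the pipeline.

```python
# Hypothetical sketch of an interpreted vectorized scan over a compressed
# Data Block. A SARGable predicate is first checked against block-level
# min/max metadata; qualifying vectors are evaluated in a vectorized
# fashion, and only matching tuples are pushed, one at a time, into the
# JIT-compiled pipeline.

VECTOR_SIZE = 8192  # Hyper scans vectors of e.g. 8192 records

class DataBlock:
    def __init__(self, values):
        self.min, self.max = min(values), max(values)
        # toy "compression": store deltas from the block minimum
        self.deltas = [v - self.min for v in values]

    def scan_greater_than(self, threshold, push):
        """Vectorized evaluation of `col > threshold` on compressed data."""
        if self.max <= threshold:          # whole block pruned via metadata
            return
        for start in range(0, len(self.deltas), VECTOR_SIZE):
            vector = self.deltas[start:start + VECTOR_SIZE]
            # evaluate the predicate on the compressed representation
            matches = [i for i, d in enumerate(vector)
                       if self.min + d > threshold]
            # unpack matches and push them tuple-at-a-time
            for i in matches:
                push(self.min + vector[i])

block = DataBlock([3, 17, 8, 42, 5])
out = []
block.scan_greater_than(7, out.append)
# out == [17, 8, 42]
```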
Fast MVCC

[Figure: Accounts table (Owner, Bal) where most records read their balance in place; updated records, e.g. Sally's balance of 7, carry a version chain.]
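The visibility rule behind the figure can be sketched as follows (a simplified, hypothetical model, not Hyper's actual implementation, which uses undo buffers): the newest value lives in place, older before-images hang off a version chain, and a reader sees the newest version committed at or before its start timestamp.

```python
# Toy sketch of timestamp-based MVCC visibility (illustrative only).

class Versioned:
    def __init__(self, value, ts):
        self.value = value      # newest, in-place value
        self.ts = ts            # commit timestamp of the newest version
        self.older = []         # [(ts, before_image)], newest first

    def update(self, value, ts):
        # push the current version onto the chain, overwrite in place
        self.older.insert(0, (self.ts, self.value))
        self.value, self.ts = value, ts

    def read(self, reader_ts):
        """Return the version visible to a reader that started at reader_ts."""
        if self.ts <= reader_ts:
            return self.value
        for ts, before in self.older:
            if ts <= reader_ts:
                return before
        raise LookupError("no visible version")

bal = Versioned(10, ts=1)   # balance committed at ts=1
bal.update(7, ts=5)         # a later transaction withdraws 3
bal.read(3)                 # an old reader still sees 10
bal.read(6)                 # a new reader sees 7
```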
Morsel-Driven Parallelization

[Figure: A multi-way join of a relation R with S and T. Worker threads pull small morsels of R from a central Dispatcher, probe the hash tables HT(S) and HT(T) built from the other inputs, and store matching tuples into the result.]
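The core idea can be sketched with threads (names and sizes are illustrative): the probe input is split into small morsels, workers pull morsels from a shared dispatcher queue and probe a shared hash table, so faster workers naturally grab more work.

```python
# Minimal sketch of morsel-driven parallelization for a hash join.
import queue
from concurrent.futures import ThreadPoolExecutor

MORSEL_SIZE = 4  # real morsels are much larger

def hash_join(build_side, probe_side, workers=3):
    ht = {}                                   # build phase (HT(S))
    for key, payload in build_side:
        ht.setdefault(key, []).append(payload)

    dispatcher = queue.Queue()                # the Dispatcher hands out morsels
    for i in range(0, len(probe_side), MORSEL_SIZE):
        dispatcher.put(probe_side[i:i + MORSEL_SIZE])

    def worker(_):
        out = []
        while True:
            try:
                morsel = dispatcher.get_nowait()
            except queue.Empty:
                return out                    # no morsels left
            for key, payload in morsel:       # probe phase
                for match in ht.get(key, []):
                    out.append((key, match, payload))

    results = []
    with ThreadPoolExecutor(workers) as pool:
        for part in pool.map(worker, range(workers)):
            results.extend(part)
    return results

S = [(16, "a"), (8, "b")]
R = [(16, "x"), (27, "y"), (8, "z"), (10, "u")]
sorted(hash_join(S, R))   # [(8, 'b', 'z'), (16, 'a', 'x')]
```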
[Diagram: Hyper inside the Tableau platform: Desktop, Online, Public, Server, and Prep use Hyper via Extract, the Hyper API, and Federation, on top of shared Support, Infrastructure, and Workload layers.]
And that’s not all due to constant strings…
Need a query plan visualizer? https://github.com/tableau/query-graphs/
[Chart: Number of relational operators per query in real Tableau workloads, reaching up to ~1000 operators per query, broken down by operator type: Table Scan, Group By, Sort, Percentiles, Inner Join, Outer Join, Group Join.]
Vogelsgesang et al.: Get Real: How Benchmarks Fail to Represent the Real World. DBTest’18
Tableau’s old data engine (TDE): a vector-based engine inspired by MonetDB/X100
First step: replace TDE as the backend of all Tableau products
Goals:
Having a gold standard is great! We ran a lot (60k) of workbooks from Tableau Public and compared results. Simple, measurable goal: get results to match for all of them and be fast; then we’re done :)
Is it really worth it to show the same result for all queries?
Our attitude changed over time:
1. 100% identical results at all cost; customers don’t want their viz to change! This is non-negotiable!
2. Well, but what if it changes to be correct?
3. Who said that visualizations can’t change in the first place?
Is “5/7/2020” April 7 or July 5? TDE’s strategy: be aggressive in finding a valid date. Sounds great! But it is horrible in the relational model! Silent failure: a sales workbook shows more sales in the first 12 days of each month!
Input     | TDE           | Hyper
----------|---------------|--------------
5/7/2020  | April 7, 2020 | April 7, 2020
15/7/2020 | July 15, 2020 | NULL
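The two strategies can be sketched side by side (a hypothetical illustration; the function names and the M/D/Y assumption are mine, not Hyper's API). The lenient parser silently swaps fields until something is valid; the strict one parses against one declared format and yields NULL otherwise, so the failure stays visible.

```python
# Strict vs. lenient date parsing (illustrative sketch).
from datetime import date

def parse_date_strict(s):
    """Hyper-style: parse against a fixed M/D/Y format or return NULL."""
    try:
        m, d, y = (int(p) for p in s.split("/"))
        return date(y, m, d)
    except ValueError:
        return None  # e.g. 15/7/2020 has no month 15 -> NULL

def parse_date_lenient(s):
    """TDE-style: aggressively swap fields until some date is valid."""
    a, b, y = (int(p) for p in s.split("/"))
    for m, d in ((a, b), (b, a)):
        try:
            return date(y, m, d)
        except ValueError:
            continue
    return None

parse_date_strict("15/7/2020")   # None (NULL): no silent guessing
parse_date_lenient("15/7/2020")  # date(2020, 7, 15): silent field swap
```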
[Chart: Improvement over TDE across workloads]
We started with the SQLite test suite
(e.g., constant folding)
How to execute the tests?
query N expectConstantResult
SELECT DATE '2001-09-28' + INTEGER '7'

# DATE + INTEGER -> DATE with overflow
statement error 22003
SELECT DATE '4713-01-01 BC' - 1
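A minimal runner for such sqllogictest-style records might look like this (purely illustrative: Hyper's real harness is far richer, and sqlite3 stands in for the engine here; this toy version also ignores the expected SQLSTATE after `statement error`).

```python
# Hypothetical mini-runner for sqllogictest-style records.
import sqlite3

def run_record(conn, record):
    head, sql = record.strip().split("\n", 1)
    parts = head.split()
    if parts[0] == "statement" and parts[1] == "error":
        # expect the statement to fail (real harness also checks SQLSTATE)
        try:
            conn.execute(sql)
            return "FAIL: expected error"
        except sqlite3.Error:
            return "ok"
    elif parts[0] == "query":
        # parts[1] would declare the result type signature, e.g. "N"
        return conn.execute(sql).fetchall()
    raise ValueError("unknown record type")

conn = sqlite3.connect(":memory:")
run_record(conn, "query N\nSELECT 1 + 1")              # [(2,)]
run_record(conn, "statement error\nSELECT * FROM nope")  # 'ok'
```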
- Use the EXPLAIN statement to test the optimizer
- Introduce a function that scans Hyper's own log, to test for log messages; write specific internals to the log
- Add special test functions with side effects to test further internals (e.g., memory)
# Deduplication for simple domain queries
query S
EXPLAIN SELECT * FROM (SELECT a FROM t GROUP BY 1) t1, (SELECT a FROM t GROUP BY 1) t2
join(2) bnl
explicitscan(3)
explicitscan(6)->(4)
groupby(4)
tablescan(5) t
SQL-based tests:
- Easy to write, usually very succinct
- No recompilation needed
- In vivo: can run code in specific query contexts
- Good test failure reporting
- Can update expected test results automatically
- Resemble customer usage of the system
Unit tests:
- Only test the code in question, not the whole SQL layer
- Run faster
Big controversy!
SQLite-style tests are great, but they can't simulate load from multiple connections. Solution: a loadtest DSL.
exec CREATE DATABASE mytestdb;
connection mytestdb user=bob {
  repeat 100 {
    exec CREATE TABLE foo AS (SELECT x FROM generate_series(1,1e6) x);
    parallel 10 {
      exec UPDATE foo SET x = x+10 WHERE x % 2 = 0;
    } and 2 {
      test {
        query N
        SELECT SUM(x) FROM foo WHERE x % 2 = 1;
      }
    }
  }
}
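What `parallel 10 { ... } and 2 { ... }` expresses can be sketched directly with threads (an illustration of the semantics, not the real DSL interpreter): two groups of workers run concurrently against the same database, one issuing updates and one issuing checked queries.

```python
# Thread-based sketch of the loadtest DSL's "parallel N { A } and M { B }".
import threading

def run_parallel(groups):
    """groups: list of (num_workers, fn); all workers run concurrently."""
    threads = [threading.Thread(target=fn)
               for n, fn in groups for _ in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

counter = {"updates": 0, "queries": 0}
lock = threading.Lock()

def update_worker():          # stands in for the UPDATE statement
    with lock:
        counter["updates"] += 1

def query_worker():           # stands in for the checked SELECT
    with lock:
        counter["queries"] += 1

run_parallel([(10, update_worker), (2, query_worker)])
counter   # {'updates': 10, 'queries': 2}
```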
Benchmark harness:
- Checks correctness and perf
- Starts up Hyper and replays workloads (thus sending queries to Hyper)
- Marks values and records times
- Reports the improvement of A over B
1. Measure perf on every commit
2. If perf regresses, file a defect
3. Make sure it's not just noise
4. If perf improves, make it the new expectation
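Steps 2-4 can be sketched as a tolerance-band check (the 5% noise band and the function name are assumptions for illustration): regressions beyond the band file a defect, improvements beyond it ratchet the expectation down.

```python
# Sketch of perf-regression tracking with a noise tolerance.
NOISE_BAND = 0.05   # 5% tolerated run-to-run noise (assumed)

def check_perf(expected_ms, measured_ms):
    if measured_ms > expected_ms * (1 + NOISE_BAND):
        return "regression: file a defect"
    if measured_ms < expected_ms * (1 - NOISE_BAND):
        return "improvement: new expectation = %.1f ms" % measured_ms
    return "within noise"

check_perf(100.0, 112.0)  # 'regression: file a defect'
check_perf(100.0, 103.0)  # 'within noise'
check_perf(100.0, 90.0)   # 'improvement: new expectation = 90.0 ms'
```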
Found several vulnerabilities and defects with fuzzing!
1. Enable all compiler warnings and make them errors: -Wall -Wextra -Werror -Woverloaded-virtual -Wunreachable-code-return ...
2. Keep code clean: clang-format, clang-tidy
3. Clang static analyzers
Flaky test: a test that sometimes passes and sometimes fails. Worst case: it succeeds 99.9% of the time.
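A quick back-of-the-envelope calculation shows why a 99.9% pass rate is the worst case (suite size and run counts below are assumed for illustration): across a large suite run many times a day, a spurious red run becomes near-certain.

```python
# Why 99.9%-reliable tests still break CI at scale (numbers assumed).
pass_rate = 0.999
tests, runs_per_day = 1000, 20    # hypothetical CI load

p_clean_run = pass_rate ** tests           # one fully green suite run
p_clean_day = p_clean_run ** runs_per_day  # a whole green day

round(p_clean_run, 3)   # 0.368: most runs already contain a spurious failure
p_clean_day             # essentially zero
```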
Root causes:
When benchmarking a system …
Generated workloads are real:
- More and more tools generate queries
- Generated queries look different than hand-written TPC-X queries
- Surprises may be lurking in other workloads
Public BI Benchmark from CWI
github.com/cwida/public_bi_benchmark
... Representations, CIDR'20
Build a system, not a throw-away prototype
Follow standards (e.g., SQL, PostgreSQL)
Build that system as if it were for production: defects could make your results invalid!
Academic Projects can make it into production
TPC-X is not everything!
Various layers of testing required
Benchmark responsibly
Try out Hyper API! tabsoft.co/hyperapi