Toward timely, predictable and cost-effective data analytics
Renata Borovica-Gajic, DIAS, EPFL
Big data proliferation
“Big data is when the current technology does not enable users to obtain timely, cost-effective, and quality answers to data-driven questions.” [Steve Todd, Berkeley]
Technology follows Moore’s Law [“Trends in big data analytics”, 2014, Kambatla et al.]
[* “The Digital Universe in 2020: Big Data, Bigger Digital Shadows, and Biggest Growth in the Far East”, 2012, IDC]
What business analysts want
Timely, predictable, cost-effective queries
[Figure: data warehouse cost (million $) split into development, administration, and system costs; WinterCorp, 2013]
[Figure: expected vs. actual response time; 80% reuse within 3 hours]
The gap between expected and actual response time causes user frustration and wasted resources.
What analysts want: minimal data-to-insight time, low infrastructure cost, predictable response time.
Thesis statement
As traditional DBMS rely on predefined assumptions about workload, data and storage, changes cause loss of performance and unpredictability.
Insight
Query execution must adapt at three levels (to workload, data and hardware) to stabilize and optimize performance and cost.
Outline
- Minimize data-to-insight time – Workload-driven adaptation [SIGMOD’12, VLDB’12, CACM’15]
- Improve predictability of response time – Data-driven adaptation [DBTest’12, ICDE’15]
- Reduce analytics cost – Cold storage & hardware-driven adaptation [VLDB’16]
Part 1: Minimize data-to-insight time – workload-driven adaptation
Current technology ≠ efficient exploration
Data-to-insight time
[Figure: execution breakdown (%) of Q1 – I/O, parse, tokenize, convert, processing – for the traditional query stack vs. the raw data querying stack]
- Traditional query stack: data must be loaded before querying, so time to first insight is too long and does not scale with data growth.
- Raw data querying stack: no loading, but per-query overheads (parsing, tokenizing, converting) are too high.
NoDB: Workload-driven data loading & tuning
Optimize the raw data querying stack: not everything is needed for Q1 – let users show what matters by asking queries.
[Figure: execution breakdown (%) of Q1 on the raw data querying stack – processing, convert, tokenize, parse, I/O]
[Figure: response time of Q1–Q4 under raw data querying (constant overhead per query), a DBMS (up-front LOAD), and NoDB]
NoDB adjusts to queries = progressively cheaper access.
PostgresRaw: NoDB from idea to practice
Example raw file (TPC-H supplier):
1|Supplier#01|17|335-1736|5755.94|each slyly... 2|Supplier#02|5|861-2259|4032.68|slyly bold... 3|Supplier#03|1|516-1199|4192.40|blithely...
As the workload’s queries scan the raw file, PostgresRaw builds three auxiliary structures:
1. Positional indexing: pointers to ends of tuples and to attributes within them, so later queries jump directly to the fields they need
2. Cache: previously parsed attribute values (e.g., NationKey, Name), avoiding repeated conversion
3. Statistics: histograms built as a byproduct of scanning
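As a rough illustration of the positional-indexing idea, the sketch below builds a positional map on the first pass over a '|'-delimited file and then uses it to jump straight to an attribute without re-tokenizing. Names and structure are hypothetical, not the actual PostgresRaw code:

```python
# Hypothetical sketch of PostgresRaw-style positional indexing over a raw
# '|'-delimited file (illustrative only, not the real implementation).

def build_positional_map(lines):
    """First query: tokenize fully, remembering where each attribute starts."""
    pmap = []
    for line in lines:
        starts = [0] + [i + 1 for i, ch in enumerate(line) if ch == "|"]
        pmap.append(starts)
    return pmap

def fetch(lines, pmap, row, attr):
    """Later queries: jump straight to the attribute; no full re-parse."""
    line, starts = lines[row], pmap[row]
    end = line.find("|", starts[attr])
    return line[starts[attr]:end if end != -1 else len(line)]

raw = ["1|Supplier#01|17|335-1736|5755.94",
       "2|Supplier#02|5|861-2259|4032.68"]
pmap = build_positional_map(raw)
print(fetch(raw, pmap, 1, 1))  # Supplier#02
```

Each query pays only for the map entries it creates, which is why access becomes progressively cheaper.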
PostgresRaw in action
Setting: 7.5M tuples, 150 attributes, 11GB file; queries touch 10 arbitrary attributes each, with varying selectivity.
[Figure: cumulative execution time (sec) of load plus Q1–Q20 for MySQL CSV Engine (~7000), MySQL, DBMS X (~4806), DBMS X with external files, PostgreSQL, and PostgresRaw]
Per-query performance is comparable to a traditional DBMS, and data-to-insight time is halved with PostgresRaw.
Summary of PostgresRaw
- Query processing engine over raw data files
- Uses user queries for partial data loading and tuning
- Comparable performance to a traditional DBMS
IMPACT
- Enables timely data exploration with zero initialization cost
- Decouples user interest from data growth
Part 2: Improve predictability of response time – data-driven adaptation
Performance hurt after tuning
Index: with or without?
Setting: TPC-H SF10, DBMS-X, tuning tool given 5GB of space for indexes.
[Figure: normalized execution time (log scale) of TPC-H queries Q1–Q22, tuned (with indexes) vs. original (without indexes); after tuning, several queries run orders of magnitude (up to ~400x) slower]

Access path selection problem
[Figure: execution time vs. selectivity – the optimizer chooses an index scan based on estimated selectivity, but at higher actual selectivity the index scan hits a performance cliff relative to a full scan]
Statistics: an unreliable advisor. Runtime re-optimization [MID’98, POP’04, RIO’05, BOU’14] is itself risky.
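The cliff can be pictured with a toy cost model (the constants and formulas below are hypothetical, not DBMS-X's optimizer): a full scan costs the same at any selectivity, while an index scan pays a random I/O per qualifying tuple, so a selectivity underestimate makes the optimizer pick a plan whose actual cost explodes:

```python
# Illustrative access-path cost model (hypothetical constants, not the
# actual DBMS-X optimizer): a full scan reads every page sequentially,
# while an index scan pays one random I/O per qualifying tuple.
PAGES, TUPLES = 10_000, 1_000_000
SEQ_IO, RAND_IO = 1.0, 10.0  # relative I/O costs

def full_scan_cost():
    return PAGES * SEQ_IO                   # flat, independent of selectivity

def index_scan_cost(selectivity):
    return selectivity * TUPLES * RAND_IO   # grows linearly with selectivity

def choose_path(estimated_sel):
    return "index" if index_scan_cost(estimated_sel) < full_scan_cost() else "full"

# Statistics estimate 0.01% selectivity, so the optimizer picks the index...
print(choose_path(0.0001))  # index
# ...but if 5% of tuples actually qualify, the plan costs 50x the full scan:
print(index_scan_cost(0.05) / full_scan_cost())  # 50.0
```

The decision is made once, from estimates, before any tuple is seen; this is the a priori choice that Smooth Scan removes.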
Quest for predictable execution
Goal: remove variability due to (sub-optimal) access path choices.
[Figure: execution time vs. selectivity – Smooth Scan avoids the risk region between index scan and full scan, staying predictable across the 0–100% selectivity range]
Smooth Scan
Morph between index and sequential scan based on the observed result distribution.

Morphing mechanism – modes:
1. Index Access: traditional index access
2. Entire Page Probe: an index probe reads the entire heap page
3. Gradual Flattening Access: a probe also reads adjacent region(s) of pages
[Figure: heap pages touched by the index under modes 1–3]
Morphing policy
- Selectivity increase → mode increase; selectivity decrease → mode decrease
- Morph up while region selectivity ≥ global selectivity (SEL_region ≥ SEL_global); morph down when SEL_region < SEL_global
[Figure: index probes over heap pages; X marks a page with a result, SR is per-region selectivity, SG is the running global selectivity]
Region snooping = data-driven adaptation
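A minimal sketch of the policy above, assuming a single up/down step per observation (mode names follow the talk; the bookkeeping is illustrative, not the actual Smooth Scan code):

```python
# Hypothetical sketch of Smooth Scan's morphing policy (simplified;
# mode names follow the talk, the bookkeeping is illustrative).
INDEX_ACCESS, PAGE_PROBE, FLATTENING = 1, 2, 3

def next_mode(mode, region_sel, global_sel):
    """Morph up when the current region is at least as dense in results
    as the table so far; morph back down when it is sparser."""
    if region_sel >= global_sel:
        return min(mode + 1, FLATTENING)    # selectivity increase -> mode increase
    return max(mode - 1, INDEX_ACCESS)      # selectivity decrease -> mode decrease

mode = INDEX_ACCESS
mode = next_mode(mode, region_sel=0.9, global_sel=0.5)  # dense region: up to PAGE_PROBE
mode = next_mode(mode, region_sel=0.8, global_sel=0.5)  # still dense: up to FLATTENING
mode = next_mode(mode, region_sel=0.1, global_sel=0.5)  # sparse region: back to PAGE_PROBE
print(mode == PAGE_PROBE)  # True
```

Because the mode reacts to observed data rather than pre-computed statistics, a wrong estimate degrades performance gradually instead of falling off a cliff.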
Smooth Scan in action
Setting: micro-benchmark, 25GB table, ORDER BY, selectivity 0–100%.
[Figure: execution time (sec, log scale) vs. selectivity (%) for full scan, index scan, sort scan, and Smooth Scan]
Near-optimal over the entire selectivity range.
Summary of Smooth Scan
- Statistics-oblivious access path
- Uses region snooping to morph between alternatives
- Near-optimal performance for all selectivities
IMPACT
- Removes the access path selection decision
- Improves predictability by reducing variability in query execution
Part 3: Reduce analytics cost – cold storage & hardware-driven adaptation
Proliferation of cold data
“80% of enterprise data is cold, with 60% CAGR” [Horison, 2015]; “cold data: incredibly valuable for analysis” [Intel, 2013]

Cold Storage Devices (CSD) to the rescue
PB-size storage at cost ~tape and latency ~disks.
[Figure: (A) active disks, latency ~10ms, vs. (B) CSD, latency ~10sec – a CSD powers and cools only one disk per group at a time]
CSD in the storage tiering hierarchy
[Figure: storage tiers by data access latency and cost – DRAM (ns, $$$), SSD (µs, $$) in the performance tier; 15k RPM and 7200 RPM HDD (ms–sec) in the capacity tier; tape (min–hour, $) in the archival tier]
Can we shrink tiers to reduce cost? A CSD-based cold tier ($, sec–min latency) can absorb the HDD capacity tier.
[Figure: cost (x1000$) of storing 100TB of data – traditional 3-tier hierarchy ($159,641) vs. CSD-based 2-tier hierarchy [Horison, 2015]]
CSD offers significant cost savings (40%). But… can we run queries over CSD?
Query execution over CSD
Setting: virtualized enterprise datacenter; clients run PostgreSQL, TPC-H SF50, Q12; CSD shared; layout: one client per group.
[Figure: average execution time (x1000 sec) vs. number of clients (groups) for PostgreSQL on CSD, against ideal and HDD baselines]
Lost opportunity: CSD relegated to archival storage.
Skipper to the rescue
[Figure: virtualized enterprise data center – VMs (VM1–VM3) hosting PostgreSQL instances (DB1–DB3) over shared cold storage; Skipper adds network VM cache management and an I/O scheduler with an object-group map; inside PostgreSQL, an MJoin operator hashes scans of A, B, C as objects (A1, B1, C1, A2, …) arrive]

1. Progress-driven caching: favors caching the objects that maximize query progress
2. Multi-way joins: opportunistic execution triggered upon data arrival
3. Novel ranking algorithm: balances access efficiency across groups and fairness across clients
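One way to picture the ranking idea (the scoring function below is a hypothetical illustration, not the paper's actual algorithm): the scheduler picks the next disk group to serve by weighing how many pending requests a spin-up would satisfy against how long clients have waited:

```python
# Hypothetical Skipper-style ranking sketch: trade access efficiency
# (requests served per group spin-up) against fairness (client wait time).
# The weights and scoring are illustrative assumptions.
import time

def rank_groups(pending, wait_start, alpha=0.5, now=None):
    """pending: group -> list of waiting client requests
       wait_start: client -> time it started waiting"""
    now = time.time() if now is None else now
    scores = {}
    for group, requests in pending.items():
        efficiency = len(requests)  # batch more I/O into one spin-up
        fairness = max(now - wait_start[c] for c in requests)
        scores[group] = alpha * efficiency + (1 - alpha) * fairness
    return max(scores, key=scores.get)

pending = {"G1": ["db1", "db2"], "G2": ["db3"]}
wait_start = {"db1": 108.0, "db2": 109.0, "db3": 95.0}
# G1 serves more clients per spin-up, but db3 has waited longest;
# alpha controls the efficiency/fairness trade-off.
print(rank_groups(pending, wait_start, alpha=0.95, now=110.0))  # G1
print(rank_groups(pending, wait_start, alpha=0.5, now=110.0))   # G2
```

A fairness term like this prevents a heavily loaded group from starving clients whose data lives in rarely spun-up groups.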
Skipper in action
Setting: multitenant enterprise datacenter; clients run TPC-H SF50, Q12; CSD shared; layout: one client per group; PostgreSQL on CSD vs. PostgreSQL on HDD.
[Figure: average execution time (x1000 sec) vs. number of clients (groups) for PostgreSQL, Skipper, and ideal]
Skipper approximates the HDD-based capacity tier to within 20% on average.
Summary of Skipper
- Efficient query execution over CSD with:
  1. Rank-based I/O scheduling
  2. Out-of-order execution based on multi-way joins
  3. Progress-based caching policy
- Approximates performance of HDD-based storage tier
IMPACT
- Cold storage can reduce TCO by shrinking storage hierarchy
- Skipper enables data analytics-over-CSD-as-a-service
Thesis contributions
- Minimize data-to-insight time
– Workload-driven adaptation – Skip loading, tune as a byproduct of query execution
- Improve predictability of response time
– Data-driven adaptation – Remove access decisions a priori, transform gradually
- Reduce analytics cost
– Cold storage & hardware-driven adaptation – From plan pull-based to hardware push-based execution
- Uncertainty cured with adaptivity