Toward timely, predictable and cost-effective data analytics
Renata Borovica-Gajic, DIAS, EPFL
Big data proliferation
“Big data is when the current technology does not enable users to obtain timely, cost-effective, and quality answers to data-driven questions.” [Steve Todd, Berkeley]
Technology follows Moore’s Law [“Trends in big data analytics”, 2014, Kambatla et al.]
[* “The Digital Universe in 2020: Big Data, Bigger Digital Shadows, and Biggest Growth in the Far East”, 2012, IDC]
What business analysts want
Timely, predictable, cost-effective queries
[Figure: data warehouse cost (million $) split into development, administration, and system costs; WinterCorp, 2013]
[Figure: expected vs. actual response time; 80% reuse within 3 hours]
The gap between expected and actual response time causes user frustration and wasted resources.
What analysts want: minimal data-to-insight time, low infrastructure cost, predictable response time.
Thesis statement
As traditional DBMS rely on predefined assumptions about workload, data and storage, changes cause loss of performance and unpredictability.
Insight
Query execution must adapt at three levels (to workload, data and hardware) to stabilize and optimize performance and cost.
Outline
- Minimize data-to-insight time – Workload-driven adaptation [SIGMOD’12, VLDB’12, CACM’15]
- Improve predictability of response time – Data-driven adaptation [DBTest’12, ICDE’15]
- Reduce analytics cost – Cold storage & hardware-driven adaptation [VLDB’16]
Part 1: Minimize data-to-insight time – workload-driven adaptation
Current technology ≠ efficient exploration
Data-to-insight time
[Figure: execution breakdown (%) of Q1 – I/O, parse, tokenize, convert, processing – for the traditional query stack vs. the raw data querying stack]
- Traditional query stack: data must be loaded before querying, so time to first insight is too long and does not scale with data growth.
- Raw data querying stack: no loading, but per-query overheads (parsing, tokenizing, converting) are too high.
NoDB: Workload-driven data loading & tuning
Optimize the raw data querying stack: not everything is needed for Q1 – let users show what matters by asking queries.
[Figure: execution breakdown (%) of Q1 on the raw data querying stack – processing, convert, tokenize, parse, I/O]
[Figure: response time of Q1–Q4 under raw data querying (constant overhead per query), a DBMS (up-front LOAD), and NoDB]
NoDB adjusts to queries = progressively cheaper access.
PostgresRaw: NoDB from idea to practice
Example raw file (TPC-H supplier):
1|Supplier#01|17|335-1736|5755.94|each slyly... 2|Supplier#02|5|861-2259|4032.68|slyly bold... 3|Supplier#03|1|516-1199|4192.40|blithely...
As the workload’s queries scan the raw file, PostgresRaw builds three auxiliary structures:
1. Positional indexing: pointers to ends of tuples and to attributes within them, so later queries jump directly to the fields they need
2. Cache: previously parsed attribute values (e.g., NationKey, Name), avoiding repeated conversion
3. Statistics: histograms built as a byproduct of scanning
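As a rough illustration of the positional-indexing idea, the sketch below builds a positional map on the first pass over a '|'-delimited file and then uses it to jump straight to an attribute without re-tokenizing. Names and structure are hypothetical, not the actual PostgresRaw code:

```python
# Hypothetical sketch of PostgresRaw-style positional indexing over a raw
# '|'-delimited file (illustrative only, not the real implementation).

def build_positional_map(lines):
    """First query: tokenize fully, remembering where each attribute starts."""
    pmap = []
    for line in lines:
        starts = [0] + [i + 1 for i, ch in enumerate(line) if ch == "|"]
        pmap.append(starts)
    return pmap

def fetch(lines, pmap, row, attr):
    """Later queries: jump straight to the attribute; no full re-parse."""
    line, starts = lines[row], pmap[row]
    end = line.find("|", starts[attr])
    return line[starts[attr]:end if end != -1 else len(line)]

raw = ["1|Supplier#01|17|335-1736|5755.94",
       "2|Supplier#02|5|861-2259|4032.68"]
pmap = build_positional_map(raw)
print(fetch(raw, pmap, 1, 1))  # Supplier#02
```

Each query pays only for the map entries it creates, which is why access becomes progressively cheaper.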
PostgresRaw in action
Setting: 7.5M tuples, 150 attributes, 11GB file; queries touch 10 arbitrary attributes each, with varying selectivity.
[Figure: cumulative execution time (sec) of load plus Q1–Q20 for MySQL CSV Engine (~7000), MySQL, DBMS X (~4806), DBMS X with external files, PostgreSQL, and PostgresRaw]
Per-query performance is comparable to a traditional DBMS, and data-to-insight time is halved with PostgresRaw.
Summary of PostgresRaw
- Query processing engine over raw data files
- Uses user queries for partial data loading and tuning
- Comparable performance to a traditional DBMS
IMPACT
- Enables timely data exploration with zero initialization cost
- Decouples user interest from data growth
Part 2: Improve predictability of response time – data-driven adaptation
Performance hurt after tuning
Index: with or without?
Setting: TPC-H SF10, DBMS-X, tuning tool given 5GB of space for indexes.
[Figure: normalized execution time (log scale) of TPC-H queries Q1–Q22, tuned (with indexes) vs. original (without indexes); after tuning, several queries run orders of magnitude (up to ~400x) slower]

Access path selection problem
[Figure: execution time vs. selectivity – the optimizer chooses an index scan based on estimated selectivity, but at higher actual selectivity the index scan hits a performance cliff relative to a full scan]
Statistics: an unreliable advisor. Runtime re-optimization [MID’98, POP’04, RIO’05, BOU’14] is itself risky.
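The cliff can be pictured with a toy cost model (the constants and formulas below are hypothetical, not DBMS-X's optimizer): a full scan costs the same at any selectivity, while an index scan pays a random I/O per qualifying tuple, so a selectivity underestimate makes the optimizer pick a plan whose actual cost explodes:

```python
# Illustrative access-path cost model (hypothetical constants, not the
# actual DBMS-X optimizer): a full scan reads every page sequentially,
# while an index scan pays one random I/O per qualifying tuple.
PAGES, TUPLES = 10_000, 1_000_000
SEQ_IO, RAND_IO = 1.0, 10.0  # relative I/O costs

def full_scan_cost():
    return PAGES * SEQ_IO                   # flat, independent of selectivity

def index_scan_cost(selectivity):
    return selectivity * TUPLES * RAND_IO   # grows linearly with selectivity

def choose_path(estimated_sel):
    return "index" if index_scan_cost(estimated_sel) < full_scan_cost() else "full"

# Statistics estimate 0.01% selectivity, so the optimizer picks the index...
print(choose_path(0.0001))  # index
# ...but if 5% of tuples actually qualify, the plan costs 50x the full scan:
print(index_scan_cost(0.05) / full_scan_cost())  # 50.0
```

The decision is made once, from estimates, before any tuple is seen; this is the a priori choice that Smooth Scan removes.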
Quest for predictable execution
Goal: remove variability due to (sub-optimal) access path choices.
[Figure: execution time vs. selectivity – Smooth Scan avoids the risk region between index scan and full scan, staying predictable across the 0–100% selectivity range]
Smooth Scan
Morph between index and sequential scan based on the observed result distribution.

Morphing mechanism – modes:
1. Index Access: traditional index access
2. Entire Page Probe: an index probe reads the entire heap page
3. Gradual Flattening Access: a probe also reads adjacent region(s) of pages
[Figure: heap pages touched by the index under modes 1–3]
Morphing policy
- Selectivity increase → mode increase; selectivity decrease → mode decrease
- Morph up while region selectivity ≥ global selectivity (SEL_region ≥ SEL_global); morph down when SEL_region < SEL_global
[Figure: index probes over heap pages; X marks a page with a result, SR is per-region selectivity, SG is the running global selectivity]
Region snooping = data-driven adaptation
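A minimal sketch of the policy above, assuming a single up/down step per observation (mode names follow the talk; the bookkeeping is illustrative, not the actual Smooth Scan code):

```python
# Hypothetical sketch of Smooth Scan's morphing policy (simplified;
# mode names follow the talk, the bookkeeping is illustrative).
INDEX_ACCESS, PAGE_PROBE, FLATTENING = 1, 2, 3

def next_mode(mode, region_sel, global_sel):
    """Morph up when the current region is at least as dense in results
    as the table so far; morph back down when it is sparser."""
    if region_sel >= global_sel:
        return min(mode + 1, FLATTENING)    # selectivity increase -> mode increase
    return max(mode - 1, INDEX_ACCESS)      # selectivity decrease -> mode decrease

mode = INDEX_ACCESS
mode = next_mode(mode, region_sel=0.9, global_sel=0.5)  # dense region: up to PAGE_PROBE
mode = next_mode(mode, region_sel=0.8, global_sel=0.5)  # still dense: up to FLATTENING
mode = next_mode(mode, region_sel=0.1, global_sel=0.5)  # sparse region: back to PAGE_PROBE
print(mode == PAGE_PROBE)  # True
```

Because the mode reacts to observed data rather than pre-computed statistics, a wrong estimate degrades performance gradually instead of falling off a cliff.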
Smooth Scan in action
Setting: micro-benchmark, 25GB table, ORDER BY, selectivity 0–100%.
[Figure: execution time (sec, log scale) vs. selectivity (%) for full scan, index scan, sort scan, and Smooth Scan]
Near-optimal over the entire selectivity range.
Summary of Smooth Scan
- Statistics-oblivious access path
- Uses region snooping to morph between alternatives
- Near-optimal performance for all selectivities
IMPACT
- Removes the access path selection decision
- Improves predictability by reducing variability in query execution
Part 3: Reduce analytics cost – cold storage & hardware-driven adaptation
Proliferation of cold data
“80% of enterprise data is cold, with 60% CAGR” [Horison, 2015]; “cold data: incredibly valuable for analysis” [Intel, 2013]

Cold Storage Devices (CSD) to the rescue
PB-size storage at cost ~tape and latency ~disks.
[Figure: (A) active disks, latency ~10ms, vs. (B) CSD, latency ~10sec – a CSD powers and cools only one disk per group at a time]
CSD in the storage tiering hierarchy
[Figure: storage tiers by data access latency and cost – DRAM (ns, $$$), SSD (µs, $$) in the performance tier; 15k RPM and 7200 RPM HDD (ms–sec) in the capacity tier; tape (min–hour, $) in the archival tier]
Can we shrink tiers to reduce cost? A CSD-based cold tier ($, sec–min latency) can absorb the HDD capacity tier.
[Figure: cost (x1000$) of storing 100TB of data – traditional 3-tier hierarchy ($159,641) vs. CSD-based 2-tier hierarchy [Horison, 2015]]
CSD offers significant cost savings (40%). But… can we run queries over CSD?
Query execution over CSD
Setting: virtualized enterprise datacenter; clients run PostgreSQL, TPC-H SF50, Q12; CSD shared; layout: one client per group.
[Figure: average execution time (x1000 sec) vs. number of clients (groups) for PostgreSQL on CSD, against ideal and HDD baselines]
Lost opportunity: CSD relegated to archival storage.
Skipper to the rescue
[Figure: virtualized enterprise data center – VMs (VM1–VM3) hosting PostgreSQL instances (DB1–DB3) over shared cold storage; Skipper adds network VM cache management and an I/O scheduler with an object-group map; inside PostgreSQL, an MJoin operator hashes scans of A, B, C as objects (A1, B1, C1, A2, …) arrive]

1. Progress-driven caching: favors caching the objects that maximize query progress
2. Multi-way joins: opportunistic execution triggered upon data arrival
3. Novel ranking algorithm: balances access efficiency across groups and fairness across clients
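One way to picture the ranking idea (the scoring function below is a hypothetical illustration, not the paper's actual algorithm): the scheduler picks the next disk group to serve by weighing how many pending requests a spin-up would satisfy against how long clients have waited:

```python
# Hypothetical Skipper-style ranking sketch: trade access efficiency
# (requests served per group spin-up) against fairness (client wait time).
# The weights and scoring are illustrative assumptions.
import time

def rank_groups(pending, wait_start, alpha=0.5, now=None):
    """pending: group -> list of waiting client requests
       wait_start: client -> time it started waiting"""
    now = time.time() if now is None else now
    scores = {}
    for group, requests in pending.items():
        efficiency = len(requests)  # batch more I/O into one spin-up
        fairness = max(now - wait_start[c] for c in requests)
        scores[group] = alpha * efficiency + (1 - alpha) * fairness
    return max(scores, key=scores.get)

pending = {"G1": ["db1", "db2"], "G2": ["db3"]}
wait_start = {"db1": 108.0, "db2": 109.0, "db3": 95.0}
# G1 serves more clients per spin-up, but db3 has waited longest;
# alpha controls the efficiency/fairness trade-off.
print(rank_groups(pending, wait_start, alpha=0.95, now=110.0))  # G1
print(rank_groups(pending, wait_start, alpha=0.5, now=110.0))   # G2
```

A fairness term like this prevents a heavily loaded group from starving clients whose data lives in rarely spun-up groups.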
Skipper in action
Setting: multitenant enterprise datacenter; clients run TPC-H SF50, Q12; CSD shared; layout: one client per group; PostgreSQL on CSD vs. PostgreSQL on HDD.
[Figure: average execution time (x1000 sec) vs. number of clients (groups) for PostgreSQL, Skipper, and ideal]
Skipper approximates the HDD-based capacity tier to within 20% on average.
Summary of Skipper
- Efficient query execution over CSD with:
  1. Rank-based I/O scheduling
  2. Out-of-order execution based on multi-way joins
  3. Progress-based caching policy
- Approximates performance of HDD-based storage tier
IMPACT
- Cold storage can reduce TCO by shrinking storage hierarchy
- Skipper enables data analytics-over-CSD-as-a-service
Thesis contributions
- Minimize data-to-insight time
– Workload-driven adaptation – Skip loading, tune as a byproduct of query execution
- Improve predictability of response time
– Data-driven adaptation – Remove access decisions a priori, transform gradually
- Reduce analytics cost
– Cold storage & hardware-driven adaptation – From plan pull-based to hardware push-based execution
- Uncertainty cured with adaptivity