[PPT] - Database Learning Yongjoo Park Our Goal: reuse the work. Users PowerPoint Presentation

SLIDE 1

Database Learning

Building databases that become smarter over time

Yongjoo Park Ahmad Shahab Tajik Michael Cafarella Barzan Mozafari

University of Michigan, Ann Arbor

SLIDE 2

Today’s Databases

Users Database

query Answer to query

After answering queries, THE WORK is almost completely WASTED. Small exceptions:

Caching
Identical queries
Indexing/Materialization hints

Our Goal: reuse the work.

1

SLIDE 3

Today’s Databases

Users Database

query Answer to query

After answering queries, THE WORK is almost completely WASTED. Small exceptions:

Caching
Identical queries
Indexing/Materialization hints

Our Goal: reuse the work.

1

SLIDE 4

Today’s Databases

Users Database

query Answer to query

After answering queries, THE WORK is almost completely WASTED. Small exceptions:

Caching
Identical queries
Indexing/Materialization hints

Our Goal: reuse the work.

1

SLIDE 5

Today’s Databases

Users Database

query Answer to query

After answering queries, THE WORK is almost completely WASTED. Small exceptions:

Caching
Identical queries
Indexing/Materialization hints

Our Goal: reuse the work.

1

SLIDE 6

Today’s Databases

Users Database

query Answer to query

After answering queries, THE WORK is almost completely WASTED. Small exceptions:

Caching
Identical queries
Indexing/Materialization hints

Our Goal: reuse the work.

1

SLIDE 7

Today’s Databases

Users Database

query Answer to query

After answering queries, THE WORK is almost completely WASTED. Small exceptions:

Caching
Identical queries
Indexing/Materialization hints

Our Goal: reuse the work.

1

SLIDE 8

Today’s Databases

Users Database

query Answer to query

After answering queries, THE WORK is almost completely WASTED. Small exceptions:

Caching
Identical queries
Indexing/Materialization hints

Our Goal: reuse the work.

1

SLIDE 9

Today’s Databases

Users Database

query Answer to query

After answering queries, THE WORK is almost completely WASTED. Small exceptions:

Caching
Identical queries
Indexing/Materialization hints

Our Goal: reuse the work.

1

SLIDE 10

Today’s Databases

Users Database

query Answer to query

After answering queries, THE WORK is almost completely WASTED. Small exceptions:

Caching
Identical queries
Indexing/Materialization hints

Our Goal: reuse the work.

1

SLIDE 11

Today’s Databases

Users Database

query Answer to query

After answering queries, THE WORK is almost completely WASTED. Small exceptions:

Caching
Identical queries
Indexing/Materialization hints

Our Goal: reuse the work.

1

SLIDE 12

A New Paradigm in AQP Setting

Users

Database Learning Query Synopsis Q1 A1

Slow, Accurate Fast, Inaccurate

Database Qi (1% err) Ai (1% err, 10 sec) Qi (1% err) Ai (1% err, 10 sec) Qn

1 (1% err)

An

1 (1% err, 1 sec)

Qn

1 (10% err)

An

1 (10% err, 1 sec)

1. User: enjoys 1% error bound in 1 second!
2. Formally, always more accurate
3. Popularity of analytic workloads

Approximate solutions

BlinkDB, SnappyData, Yahoo Druid, Facebook Presto, Infobright, etc.

2

SLIDE 13

A New Paradigm in AQP Setting

Users

Database Learning Query Synopsis Q1 A1

Slow, Accurate Fast, Inaccurate

Database Qi (1% err) Ai (1% err, 10 sec) Qi (1% err) Ai (1% err, 10 sec) Qn

1 (1% err)

An

1 (1% err, 1 sec)

Qn

1 (10% err)

An

1 (10% err, 1 sec)

1. User: enjoys 1% error bound in 1 second!
2. Formally, always more accurate
3. Popularity of analytic workloads

Approximate solutions

BlinkDB, SnappyData, Yahoo Druid, Facebook Presto, Infobright, etc.

2

SLIDE 14

A New Paradigm in AQP Setting

Users

Database Learning Query Synopsis Q1 A1

Slow, Accurate Fast, Inaccurate

Database Qi (1% err) Ai (1% err, 10 sec) Qi (1% err) Ai (1% err, 10 sec) Qn

1 (1% err)

An

1 (1% err, 1 sec)

Qn

1 (10% err)

An

1 (10% err, 1 sec)

1. User: enjoys 1% error bound in 1 second!
2. Formally, always more accurate
3. Popularity of analytic workloads

Approximate solutions

BlinkDB, SnappyData, Yahoo Druid, Facebook Presto, Infobright, etc.

2

SLIDE 15

A New Paradigm in AQP Setting

Users

Database Learning Query Synopsis Q1 A1

Slow, Accurate Fast, Inaccurate

Database Qi (1% err) Ai (1% err, 10 sec) Qi (1% err) Ai (1% err, 10 sec) Qn

1 (1% err)

An

1 (1% err, 1 sec)

Qn

1 (10% err)

An

1 (10% err, 1 sec)

1. User: enjoys 1% error bound in 1 second!
2. Formally, always more accurate
3. Popularity of analytic workloads

Approximate solutions

BlinkDB, SnappyData, Yahoo Druid, Facebook Presto, Infobright, etc.

2

SLIDE 16

A New Paradigm in AQP Setting

Users

Database Learning Query Synopsis Q1 A1

Slow, Accurate Fast, Inaccurate

Database Qi (1% err) Ai (1% err, 10 sec) Qi (1% err) Ai (1% err, 10 sec) Qn

1 (1% err)

An

1 (1% err, 1 sec)

Qn

1 (10% err)

An

1 (10% err, 1 sec)

1. User: enjoys 1% error bound in 1 second!
2. Formally, always more accurate
3. Popularity of analytic workloads

Approximate solutions

BlinkDB, SnappyData, Yahoo Druid, Facebook Presto, Infobright, etc.

2

SLIDE 17

A New Paradigm in AQP Setting

Users

Database Learning Query Synopsis (Q1, A1)

Slow, Accurate Fast, Inaccurate

Database Qi (1% err) Ai (1% err, 10 sec) Qi (1% err) Ai (1% err, 10 sec) Qn

1 (1% err)

An

1 (1% err, 1 sec)

Qn

1 (10% err)

An

1 (10% err, 1 sec)

1. User: enjoys 1% error bound in 1 second!
2. Formally, always more accurate
3. Popularity of analytic workloads

Approximate solutions

BlinkDB, SnappyData, Yahoo Druid, Facebook Presto, Infobright, etc.

2

SLIDE 18

A New Paradigm in AQP Setting

Users

Database Learning Query Synopsis Q1 A1

Slow, Accurate Fast, Inaccurate

Database Qi (1% err) Ai (1% err, 10 sec) Qi (1% err) Ai (1% err, 10 sec) Qn+1 (1% err) An

1 (1% err, 1 sec)

Qn

1 (10% err)

An

1 (10% err, 1 sec)

1. User: enjoys 1% error bound in 1 second!
2. Formally, always more accurate
3. Popularity of analytic workloads

Approximate solutions

BlinkDB, SnappyData, Yahoo Druid, Facebook Presto, Infobright, etc.

2

SLIDE 19

A New Paradigm in AQP Setting

Users

Database Learning Query Synopsis Q1 A1

Slow, Accurate Fast, Inaccurate

Database Qi (1% err) Ai (1% err, 10 sec) Qi (1% err) Ai (1% err, 10 sec) Qn

1 (1% err)

An

1 (1% err, 1 sec)

Qn+1 (10% err) An

1 (10% err, 1 sec)

1. User: enjoys 1% error bound in 1 second!
2. Formally, always more accurate
3. Popularity of analytic workloads

Approximate solutions

BlinkDB, SnappyData, Yahoo Druid, Facebook Presto, Infobright, etc.

2

SLIDE 20

A New Paradigm in AQP Setting

Users

Database Learning Query Synopsis Q1 A1

Slow, Accurate Fast, Inaccurate

Database Qi (1% err) Ai (1% err, 10 sec) Qi (1% err) Ai (1% err, 10 sec) Qn

1 (1% err)

An

1 (1% err, 1 sec)

Qn

1 (10% err)

An+1 (10% err, 1 sec)

1. User: enjoys 1% error bound in 1 second!
2. Formally, always more accurate
3. Popularity of analytic workloads

Approximate solutions

BlinkDB, SnappyData, Yahoo Druid, Facebook Presto, Infobright, etc.

2

SLIDE 21

A New Paradigm in AQP Setting

Users

Database Learning Query Synopsis Q1 A1

Slow, Accurate Fast, Inaccurate

Database Qi (1% err) Ai (1% err, 10 sec) Qi (1% err) Ai (1% err, 10 sec) Qn

1 (1% err)

An+1 (1% err, 1 sec) Qn

1 (10% err)

An

1 (10% err, 1 sec)

1. User: enjoys 1% error bound in 1 second!
2. Formally, always more accurate
3. Popularity of analytic workloads

Approximate solutions

BlinkDB, SnappyData, Yahoo Druid, Facebook Presto, Infobright, etc.

2

SLIDE 22

A New Paradigm in AQP Setting

Users

Database Learning Query Synopsis Q1 A1

Slow, Accurate Fast, Inaccurate

Database Qi (1% err) Ai (1% err, 10 sec) Qi (1% err) Ai (1% err, 10 sec) Qn

1 (1% err)

An+1 (1% err, 1 sec) Qn

1 (10% err)

An

1 (10% err, 1 sec)

1. User: enjoys 1% error bound in 1 second!
2. Formally, always more accurate
3. Popularity of analytic workloads

Approximate solutions

BlinkDB, SnappyData, Yahoo Druid, Facebook Presto, Infobright, etc.

2

SLIDE 23

A New Paradigm in AQP Setting

Users

Database Learning Query Synopsis Q1 A1

Slow, Accurate Fast, Inaccurate

Database Qi (1% err) Ai (1% err, 10 sec) Qi (1% err) Ai (1% err, 10 sec) Qn

1 (1% err)

An+1 (1% err, 1 sec) Qn

1 (10% err)

An

1 (10% err, 1 sec)

1. User: enjoys 1% error bound in 1 second!
2. Formally, always more accurate
3. Popularity of analytic workloads

Approximate solutions

BlinkDB, SnappyData, Yahoo Druid, Facebook Presto, Infobright, etc.

2

SLIDE 24

A New Paradigm in AQP Setting

Users

Database Learning Query Synopsis Q1 A1

Slow, Accurate Fast, Inaccurate

Database Qi (1% err) Ai (1% err, 10 sec) Qi (1% err) Ai (1% err, 10 sec) Qn

1 (1% err)

An+1 (1% err, 1 sec) Qn

1 (10% err)

An

1 (10% err, 1 sec)

1. User: enjoys 1% error bound in 1 second!
2. Formally, always more accurate
3. Popularity of analytic workloads ⇒ Approximate solutions
BlinkDB, SnappyData, Yahoo Druid, Facebook Presto, Infobright, etc.

2

SLIDE 25

From Machine Learning To Database Learning

Machine Learning: Past Observations ⇒ Future Predictions Database Learning: Past Answers Future Answers

The more past queries, the more Accurate and Faster

3

SLIDE 26

From Machine Learning To Database Learning

Machine Learning: Past Observations ⇒ Future Predictions Database Learning: Past Answers ⇒ Future Answers

The more past queries, the more Accurate and Faster

3

SLIDE 27

From Machine Learning To Database Learning

Machine Learning: Past Observations ⇒ Future Predictions Database Learning: Past Answers ⇒ Future Answers

The more past queries, the more Accurate and Faster

3

SLIDE 28

Primary Benefit: Exploratory Workloads

· · · . . . . . . . . . . . . . . . Queries use the data in different columns/rows. How to leverage those queries for future queries?

4

SLIDE 29

Primary Benefit: Exploratory Workloads

· · · . . . . . . . . . . . . . . . Queries use the data in different columns/rows. How to leverage those queries for future queries?

4

SLIDE 30

Primary Benefit: Exploratory Workloads

· · · . . . . . . . . . . . . . . . Queries use the data in different columns/rows. How to leverage those queries for future queries?

4

SLIDE 31

Primary Benefit: Exploratory Workloads

· · · . . . . . . . . . . . . . . . Queries use the data in different columns/rows. How to leverage those queries for future queries?

4

SLIDE 32

Primary Benefit: Exploratory Workloads

· · · . . . . . . . . . . . . . . . Queries use the data in different columns/rows. How to leverage those queries for future queries?

4

SLIDE 33

Our Idea

Q1 Q1 A1 Q2 Q2 A2

more queries and answers · · · . . . . . . . . . . . . . . .

?

5

SLIDE 34

Our Idea

Q1 Q1 A1 Q2 Q2 A2

more queries and answers

· · ·

. . . . . . . . . . . . . . .

?

5

SLIDE 35

Our Idea

Q1 (Q1, A1) Q2 Q2 A2

more queries and answers

· · ·

. . . . . . . . . . . . . . .

?

5

SLIDE 36

Our Idea

Q1 (Q1, A1) Q2 Q2 A2

more queries and answers · · · . . . . . . . . . . . . . . .

?

5

SLIDE 37

Our Idea

Q1 Q1 A1 Q2 Q2 A2

more queries and answers

· · ·

. . . . . . . . . . . . . . .

?

5

SLIDE 38

Our Idea

Q1 Q1 A1 Q2 (Q2, A2)

more queries and answers

· · ·

. . . . . . . . . . . . . . .

?

5

SLIDE 39

Our Idea

Q1 Q1 A1 Q2 (Q2, A2)

more queries and answers · · · . . . . . . . . . . . . . . .

?

5

SLIDE 40

Our Idea

Q1 Q1 A1 Q2 Q2 A2

more queries and answers · · · . . . . . . . . . . . . . . .

?

5

SLIDE 41

Concrete Example

1 20 40 60 80 100 20M 30M 40M Week Number SUM(count) True data Ranges observed by past queries Model (with 95% confidence interval) 1 20 40 60 80 100 20M 30M 40M Week Number SUM(count) 1 20 40 60 80 100 20M 30M 40M Week Number SUM(count) 6

SLIDE 42

Concrete Example

1 20 40 60 80 100 20M 30M 40M Week Number SUM(count) True data Ranges observed by past queries Model (with 95% confidence interval) 1 20 40 60 80 100 20M 30M 40M Week Number SUM(count) 1 20 40 60 80 100 20M 30M 40M Week Number SUM(count) 6

SLIDE 43

Concrete Example

1 20 40 60 80 100 20M 30M 40M Week Number SUM(count) True data Ranges observed by past queries Model (with 95% confidence interval) 1 20 40 60 80 100 20M 30M 40M Week Number SUM(count) 1 20 40 60 80 100 20M 30M 40M Week Number SUM(count) 6

SLIDE 44

Concrete Example

1 20 40 60 80 100 20M 30M 40M Week Number SUM(count) True data Ranges observed by past queries Model (with 95% confidence interval) 1 20 40 60 80 100 20M 30M 40M Week Number SUM(count) 1 20 40 60 80 100 20M 30M 40M Week Number SUM(count) 6

SLIDE 45

Concrete Example

1 20 40 60 80 100 20M 30M 40M Week Number SUM(count) True data Ranges observed by past queries Model (with 95% confidence interval) 1 20 40 60 80 100 20M 30M 40M Week Number SUM(count) 1 20 40 60 80 100 20M 30M 40M Week Number SUM(count) 6

SLIDE 46

Concrete Example

1 20 40 60 80 100 20M 30M 40M Week Number SUM(count) True data Ranges observed by past queries Model (with 95% confidence interval) 1 20 40 60 80 100 20M 30M 40M Week Number SUM(count) 1 20 40 60 80 100 20M 30M 40M Week Number SUM(count) 6

SLIDE 47

Design Goals

select X3, avg(Y1) from t where 5 < X1 < 8; select sum(Y2) from t where X2 between Apr and May group by X3;

1. Support a wide class of SQL queries
2. No Assumptions about Data

BlinkDB DBL latency

3. Lightweight

7

SLIDE 48

Design Goals

select X3, avg(Y1) from t where 5 < X1 < 8; select sum(Y2) from t where X2 between Apr and May group by X3;

1. Support a wide class of SQL queries
2. No Assumptions about Data

BlinkDB DBL latency

3. Lightweight

7

SLIDE 49

Design Goals

select X3, avg(Y1) from t where 5 < X1 < 8; select sum(Y2) from t where X2 between Apr and May group by X3;

1. Support a wide class of SQL queries
2. No Assumptions about Data

BlinkDB DBL latency

3. Lightweight

7

SLIDE 50

Outline

Our Approach

8

SLIDE 51

Problem Statement

Problem: Given queries, q1 qn

1 , and their approximate answers,

Find the most likely answer to qn

1 and its estimated error.

Our Result: Under a certain model assumption,

ur answer’s error bound
riginal answer’s error bound

, (in practice, much more accurate) if the error bounds provide the same probabilistic guarantees.

9

SLIDE 52

Problem Statement

Problem: Given queries, {q1, . . . , qn+1}, and their approximate answers, Find the most likely answer to qn+1 and its estimated error. Our Result: Under a certain model assumption,

ur answer’s error bound
riginal answer’s error bound

, (in practice, much more accurate) if the error bounds provide the same probabilistic guarantees.

9

SLIDE 53

Problem Statement

Problem: Given queries, {q1, . . . , qn+1}, and their approximate answers, Find the most likely answer to qn+1 and its estimated error. Our Result: Under a certain model assumption,

ur answer’s error bound ≤ original answer’s error bound

, (in practice, much more accurate) if the error bounds provide the same probabilistic guarantees.

9

SLIDE 54

Our Model Assumption

Assumption: Goodness of the principle of maximum entropy [?] Not overly confident on the unobserved data. Justification:

“

agrees with everything that is known, but carefully avoids assuming anything that is not known — [?]

”

We provide empirical justifications in our report [?].

10

SLIDE 55

Our Model Assumption

Assumption: Goodness of the principle of maximum entropy [?] → Not overly confident on the unobserved data. Justification:

“

agrees with everything that is known, but carefully avoids assuming anything that is not known — [?]

”

We provide empirical justifications in our report [?].

10

SLIDE 56

Our Model Assumption

Assumption: Goodness of the principle of maximum entropy [?] → Not overly confident on the unobserved data. Justification:

“

agrees with everything that is known, but carefully avoids assuming anything that is not known — [?]

”

We provide empirical justifications in our report [?].

10

SLIDE 57

Our Model Assumption

Assumption: Goodness of the principle of maximum entropy [?] → Not overly confident on the unobserved data. Justification:

“

agrees with everything that is known, but carefully avoids assuming anything that is not known — [?]

”

We provide empirical justifications in our report [?].

10

SLIDE 58

Overview of Our Technique

select count(Y2) from t where 1 < X1 < 2; select avg(Y2) from t where 6 < X1 < 8; select sum(Y2) from t where 5 < X1 < 8;

1 2 3

Random variables (our uncertainty on answers) 1

Pr

1 2 3

Probability distribution 2

Pr

3 1 2

Estimated answer 3

Two aggregations involve common values correlation between answers

11

SLIDE 59

Overview of Our Technique

select count(Y2) from t where 1 < X1 < 2; select avg(Y2) from t where 6 < X1 < 8; select sum(Y2) from t where 5 < X1 < 8;

1 2 3

Random variables (our uncertainty on answers) 1

Pr

1 2 3

Probability distribution 2

Pr

3 1 2

Estimated answer 3

Two aggregations involve common values correlation between answers

11

SLIDE 60

Overview of Our Technique

select count(Y2) from t where 1 < X1 < 2; select avg(Y2) from t where 6 < X1 < 8; select sum(Y2) from t where 5 < X1 < 8;

θ1, θ2, θ3

Random variables (our uncertainty on answers) 1

Pr

1 2 3

Probability distribution 2

Pr

3 1 2

Estimated answer 3

Two aggregations involve common values correlation between answers

11

SLIDE 61

Overview of Our Technique

select count(Y2) from t where 1 < X1 < 2; select avg(Y2) from t where 6 < X1 < 8; select sum(Y2) from t where 5 < X1 < 8;

θ1, θ2, θ3

Random variables (our uncertainty on answers) 1

Pr(θ1, θ2, θ3)

Probability distribution 2

Pr

3 1 2

Estimated answer 3

Two aggregations involve common values correlation between answers

11

SLIDE 62

Overview of Our Technique

select count(Y2) from t where 1 < X1 < 2; select avg(Y2) from t where 6 < X1 < 8; select sum(Y2) from t where 5 < X1 < 8;

θ1, θ2, θ3

Random variables (our uncertainty on answers) 1

Pr(θ1, θ2, θ3)

Probability distribution 2

Pr

3 1 2

Estimated answer 3

Two aggregations involve common values → correlation between answers

11

SLIDE 63

Overview of Our Technique

select count(Y2) from t where 1 < X1 < 2; select avg(Y2) from t where 6 < X1 < 8; select sum(Y2) from t where 5 < X1 < 8;

θ1, θ2, θ3

Random variables (our uncertainty on answers) 1

Pr(θ1, θ2, θ3)

Probability distribution 2

Pr(θ3 | θ1, θ2)

Estimated answer 3

Two aggregations involve common values → correlation between answers

11

SLIDE 64

Benefits of Database Learning

Database Learning vs. Indexing

Indexing DBL database size storage

1. Little Storage Overhead

Database Learning vs. Materialized View Selection

date

2. Without Alignment

view selection DBL system uptime

verhead
3. No Upfront Overhead

12

SLIDE 65

Benefits of Database Learning

Database Learning vs. Indexing

Indexing DBL database size storage

1. Little Storage Overhead

Database Learning vs. Materialized View Selection

date

2. Without Alignment

view selection DBL system uptime

verhead
3. No Upfront Overhead

12

SLIDE 66

Benefits of Database Learning

Database Learning vs. Indexing

Indexing DBL database size storage

1. Little Storage Overhead

Database Learning vs. Materialized View Selection

date

2. Without Alignment

view selection DBL system uptime

verhead
3. No Upfront Overhead

12

SLIDE 67

Benefits of Database Learning

Database Learning vs. Indexing

Indexing DBL database size storage

1. Little Storage Overhead

Database Learning vs. Materialized View Selection

date

2. Without Alignment

view selection DBL system uptime

verhead
3. No Upfront Overhead

12

SLIDE 68

Benefits of Database Learning

Database Learning vs. Indexing

Indexing DBL database size storage

1. Little Storage Overhead

Database Learning vs. Materialized View Selection

date

2. Without Alignment

view selection DBL system uptime

verhead
3. No Upfront Overhead

12

SLIDE 69

Outline

Experiment

13

SLIDE 70

Implementation and Experiments

1. Using Spark SQL as a backend
NoLearn: Sampling-based AQP engine
Verdict: Our database learning system
2. Datasets:
Customer1: Query log from an analytic DB vendor
TPC-H: 100G TPC-H dataset
3. Environment:
5 Amazon EC2 workers (m4.2xlarge)
SSD-backed HDFS for Spark’s data loading

14

SLIDE 71

Implementation and Experiments

1. Using Spark SQL as a backend
NoLearn: Sampling-based AQP engine
Verdict: Our database learning system
2. Datasets:
Customer1: Query log from an analytic DB vendor
TPC-H: 100G TPC-H dataset
3. Environment:
5 Amazon EC2 workers (m4.2xlarge)
SSD-backed HDFS for Spark’s data loading

14

SLIDE 72

Implementation and Experiments

1. Using Spark SQL as a backend
NoLearn: Sampling-based AQP engine
Verdict: Our database learning system
2. Datasets:
Customer1: Query log from an analytic DB vendor
TPC-H: 100G TPC-H dataset
3. Environment:
5 Amazon EC2 workers (m4.2xlarge)
SSD-backed HDFS for Spark’s data loading

14

SLIDE 73

Generality of Verdict

Dataset # Analyzed # Supported Percentage Customer1 3,342 2,463 73.7% TPC-H 21 14 63.6%

15

SLIDE 74

Latency-Error Trade-off

1 2 3 4 5 1.0 2.0 3.0 4.0 5.0 Estimated Error (%) Latency (sec)

Customer1 dataset in memory

16

SLIDE 75

Latency-Error Trade-off

1 2 3 4 5 1.0 2.0 3.0 4.0 5.0 Estimated Error (%) Latency (sec) NoLearn

Customer1 dataset in memory

16

SLIDE 76

Latency-Error Trade-off

1 2 3 4 5 1.0 2.0 3.0 4.0 5.0 Estimated Error (%) Latency (sec) NoLearn Verdict

Customer1 dataset in memory

16

SLIDE 77

Latency-Error Trade-off

1 2 3 4 5 1.0 2.0 3.0 4.0 5.0

time for query parsing, planning, etc.

Estimated Error (%) Latency (sec) NoLearn Verdict

Customer1 dataset in memory

16

SLIDE 78

Error Reduction

Customer1 TPC-H 20 40 60 80 100 92.58 95.33 Error Reduction (%)

For the same time budget; Data on SSD

17

SLIDE 79

Speedup (Data in Memory)

Customer1 TPC-H 2 4 6 8 10 12 2.01 2.63 Speedup (x)

For the same target error of 2%.

18

SLIDE 80

Speedup (Data on SSD)

Customer1 TPC-H 2 4 6 8 10 12 10.43 4.09 Speedup (x)

For the same target error of 2%.

19

SLIDE 81

Reliability of Estimated Error Guarantees

1% 2% 4% 8% 16% 32% 5 10 15 20 25 30 35 Guaranteed Error (%) Actual Error (%) mean 5%, 95% percentiles

Guaranteed error = from 95% confidence interval

20

SLIDE 82

Memory and Computational Overhead

1. Memory Overhead:
150KB per query for the Customer1 dataset
8KB per query for the TPC-H dataset
2. Computational Overhead:

Latency Cached No-Cache NoLearn 2.083 sec 52.50 sec Verdict 2.093 sec 52.51 sec Overhead 0.010 sec (0.48%) 0.010 sec (0.02%)

21

SLIDE 83

Memory and Computational Overhead

1. Memory Overhead:
150KB per query for the Customer1 dataset
8KB per query for the TPC-H dataset
2. Computational Overhead:

Latency Cached No-Cache NoLearn 2.083 sec 52.50 sec Verdict 2.093 sec 52.51 sec Overhead 0.010 sec (0.48%) 0.010 sec (0.02%)

21

SLIDE 84

Conclusion and Future Work

1. Database Learning:

Answers to past queries → boost your AQP!

2. Our prototype, Verdict, demonstrated:
Support 73.7% real-world analytical queries
Error reduction up to 95% compared to existing AQP engines

Our next goal: Active Database Learning Aims to build a probabilistic model of data even before any queries submitted

22

SLIDE 85

Conclusion and Future Work

1. Database Learning:

Answers to past queries → boost your AQP!

2. Our prototype, Verdict, demonstrated:
Support 73.7% real-world analytical queries
Error reduction up to 95% compared to existing AQP engines

Our next goal: Active Database Learning Aims to build a probabilistic model of data even before any queries submitted

22

SLIDE 86

Conclusion and Future Work

1. Database Learning:

Answers to past queries → boost your AQP!

2. Our prototype, Verdict, demonstrated:
Support 73.7% real-world analytical queries
Error reduction up to 95% compared to existing AQP engines

Our next goal: Active Database Learning Aims to build a probabilistic model of data even before any queries submitted

22

SLIDE 87

Conclusion and Future Work

1. Database Learning:

Answers to past queries → boost your AQP!

2. Our prototype, Verdict, demonstrated:
Support 73.7% real-world analytical queries
Error reduction up to 95% compared to existing AQP engines

Our next goal: Active Database Learning Aims to build a probabilistic model of data even before any queries submitted

22

SLIDE 88

Conclusion and Future Work

1. Database Learning:

Answers to past queries → boost your AQP!

2. Our prototype, Verdict, demonstrated:
Support 73.7% real-world analytical queries
Error reduction up to 95% compared to existing AQP engines

Our next goal: Active Database Learning Aims to build a probabilistic model of data even before any queries submitted

22

SLIDE 89

Thank You!

22

SLIDE 90

References I

23