Database Learning Yongjoo Park Our Goal: reuse the work. Users - - PowerPoint PPT Presentation

database learning
SMART_READER_LITE
LIVE PREVIEW

Database Learning Yongjoo Park Our Goal: reuse the work. Users - - PowerPoint PPT Presentation

Building databases that become smarter over time Ahmad Shahab Tajik Michael Cafarella Barzan Mozafari University of Michigan, Ann Arbor Database Learning Yongjoo Park Our Goal: reuse the work. Users Database query Answer to query After


slide-1
SLIDE 1

Database Learning

Building databases that become smarter over time

Yongjoo Park Ahmad Shahab Tajik Michael Cafarella Barzan Mozafari

University of Michigan, Ann Arbor

slide-2
SLIDE 2

Today’s Databases

Users Database

query Answer to query

After answering queries, THE WORK is almost completely WASTED. Small exceptions:

  • Caching
  • Identical queries
  • Indexing/Materialization hints

Our Goal: reuse the work.

1

slide-3
SLIDE 3

Today’s Databases

Users Database

query Answer to query

After answering queries, THE WORK is almost completely WASTED. Small exceptions:

  • Caching
  • Identical queries
  • Indexing/Materialization hints

Our Goal: reuse the work.

1

slide-4
SLIDE 4

Today’s Databases

Users Database

query Answer to query

After answering queries, THE WORK is almost completely WASTED. Small exceptions:

  • Caching
  • Identical queries
  • Indexing/Materialization hints

Our Goal: reuse the work.

1

slide-5
SLIDE 5

Today’s Databases

Users Database

query Answer to query

After answering queries, THE WORK is almost completely WASTED. Small exceptions:

  • Caching
  • Identical queries
  • Indexing/Materialization hints

Our Goal: reuse the work.

1

slide-6
SLIDE 6

Today’s Databases

Users Database

query Answer to query

After answering queries, THE WORK is almost completely WASTED. Small exceptions:

  • Caching
  • Identical queries
  • Indexing/Materialization hints

Our Goal: reuse the work.

1

slide-7
SLIDE 7

Today’s Databases

Users Database

query Answer to query

After answering queries, THE WORK is almost completely WASTED. Small exceptions:

  • Caching
  • Identical queries
  • Indexing/Materialization hints

Our Goal: reuse the work.

1

slide-8
SLIDE 8

Today’s Databases

Users Database

query Answer to query

After answering queries, THE WORK is almost completely WASTED. Small exceptions:

  • Caching
  • Identical queries
  • Indexing/Materialization hints

Our Goal: reuse the work.

1

slide-9
SLIDE 9

Today’s Databases

Users Database

query Answer to query

After answering queries, THE WORK is almost completely WASTED. Small exceptions:

  • Caching
  • Identical queries
  • Indexing/Materialization hints

Our Goal: reuse the work.

1

slide-10
SLIDE 10

Today’s Databases

Users Database

query Answer to query

After answering queries, THE WORK is almost completely WASTED. Small exceptions:

  • Caching
  • Identical queries
  • Indexing/Materialization hints

Our Goal: reuse the work.

1

slide-11
SLIDE 11

Today’s Databases

Users Database

query Answer to query

After answering queries, THE WORK is almost completely WASTED. Small exceptions:

  • Caching
  • Identical queries
  • Indexing/Materialization hints

Our Goal: reuse the work.

1

slide-12
SLIDE 12

A New Paradigm in AQP Setting

Users

Database Learning Query Synopsis Q1 A1

Slow, Accurate Fast, Inaccurate

Database Qi (1% err) Ai (1% err, 10 sec) Qi (1% err) Ai (1% err, 10 sec) Qn

1 (1% err)

An

1 (1% err, 1 sec)

Qn

1 (10% err)

An

1 (10% err, 1 sec)

  • 1. User: enjoys 1% error bound in 1 second!
  • 2. Formally, always more accurate
  • 3. Popularity of analytic workloads

Approximate solutions

  • BlinkDB, SnappyData, Yahoo Druid, Facebook Presto, Infobright, etc.

2

slide-13
SLIDE 13

A New Paradigm in AQP Setting

Users

Database Learning Query Synopsis Q1 A1

Slow, Accurate Fast, Inaccurate

Database Qi (1% err) Ai (1% err, 10 sec) Qi (1% err) Ai (1% err, 10 sec) Qn

1 (1% err)

An

1 (1% err, 1 sec)

Qn

1 (10% err)

An

1 (10% err, 1 sec)

  • 1. User: enjoys 1% error bound in 1 second!
  • 2. Formally, always more accurate
  • 3. Popularity of analytic workloads

Approximate solutions

  • BlinkDB, SnappyData, Yahoo Druid, Facebook Presto, Infobright, etc.

2

slide-14
SLIDE 14

A New Paradigm in AQP Setting

Users

Database Learning Query Synopsis Q1 A1

Slow, Accurate Fast, Inaccurate

Database Qi (1% err) Ai (1% err, 10 sec) Qi (1% err) Ai (1% err, 10 sec) Qn

1 (1% err)

An

1 (1% err, 1 sec)

Qn

1 (10% err)

An

1 (10% err, 1 sec)

  • 1. User: enjoys 1% error bound in 1 second!
  • 2. Formally, always more accurate
  • 3. Popularity of analytic workloads

Approximate solutions

  • BlinkDB, SnappyData, Yahoo Druid, Facebook Presto, Infobright, etc.

2

slide-15
SLIDE 15

A New Paradigm in AQP Setting

Users

Database Learning Query Synopsis Q1 A1

Slow, Accurate Fast, Inaccurate

Database Qi (1% err) Ai (1% err, 10 sec) Qi (1% err) Ai (1% err, 10 sec) Qn

1 (1% err)

An

1 (1% err, 1 sec)

Qn

1 (10% err)

An

1 (10% err, 1 sec)

  • 1. User: enjoys 1% error bound in 1 second!
  • 2. Formally, always more accurate
  • 3. Popularity of analytic workloads

Approximate solutions

  • BlinkDB, SnappyData, Yahoo Druid, Facebook Presto, Infobright, etc.

2

slide-16
SLIDE 16

A New Paradigm in AQP Setting

Users

Database Learning Query Synopsis Q1 A1

Slow, Accurate Fast, Inaccurate

Database Qi (1% err) Ai (1% err, 10 sec) Qi (1% err) Ai (1% err, 10 sec) Qn

1 (1% err)

An

1 (1% err, 1 sec)

Qn

1 (10% err)

An

1 (10% err, 1 sec)

  • 1. User: enjoys 1% error bound in 1 second!
  • 2. Formally, always more accurate
  • 3. Popularity of analytic workloads

Approximate solutions

  • BlinkDB, SnappyData, Yahoo Druid, Facebook Presto, Infobright, etc.

2

slide-17
SLIDE 17

A New Paradigm in AQP Setting

Users

Database Learning Query Synopsis (Q1, A1)

Slow, Accurate Fast, Inaccurate

Database Qi (1% err) Ai (1% err, 10 sec) Qi (1% err) Ai (1% err, 10 sec) Qn

1 (1% err)

An

1 (1% err, 1 sec)

Qn

1 (10% err)

An

1 (10% err, 1 sec)

  • 1. User: enjoys 1% error bound in 1 second!
  • 2. Formally, always more accurate
  • 3. Popularity of analytic workloads

Approximate solutions

  • BlinkDB, SnappyData, Yahoo Druid, Facebook Presto, Infobright, etc.

2

slide-18
SLIDE 18

A New Paradigm in AQP Setting

Users

Database Learning Query Synopsis Q1 A1

Slow, Accurate Fast, Inaccurate

Database Qi (1% err) Ai (1% err, 10 sec) Qi (1% err) Ai (1% err, 10 sec) Qn+1 (1% err) An

1 (1% err, 1 sec)

Qn

1 (10% err)

An

1 (10% err, 1 sec)

  • 1. User: enjoys 1% error bound in 1 second!
  • 2. Formally, always more accurate
  • 3. Popularity of analytic workloads

Approximate solutions

  • BlinkDB, SnappyData, Yahoo Druid, Facebook Presto, Infobright, etc.

2

slide-19
SLIDE 19

A New Paradigm in AQP Setting

Users

Database Learning Query Synopsis Q1 A1

Slow, Accurate Fast, Inaccurate

Database Qi (1% err) Ai (1% err, 10 sec) Qi (1% err) Ai (1% err, 10 sec) Qn

1 (1% err)

An

1 (1% err, 1 sec)

Qn+1 (10% err) An

1 (10% err, 1 sec)

  • 1. User: enjoys 1% error bound in 1 second!
  • 2. Formally, always more accurate
  • 3. Popularity of analytic workloads

Approximate solutions

  • BlinkDB, SnappyData, Yahoo Druid, Facebook Presto, Infobright, etc.

2

slide-20
SLIDE 20

A New Paradigm in AQP Setting

Users

Database Learning Query Synopsis Q1 A1

Slow, Accurate Fast, Inaccurate

Database Qi (1% err) Ai (1% err, 10 sec) Qi (1% err) Ai (1% err, 10 sec) Qn

1 (1% err)

An

1 (1% err, 1 sec)

Qn

1 (10% err)

An+1 (10% err, 1 sec)

  • 1. User: enjoys 1% error bound in 1 second!
  • 2. Formally, always more accurate
  • 3. Popularity of analytic workloads

Approximate solutions

  • BlinkDB, SnappyData, Yahoo Druid, Facebook Presto, Infobright, etc.

2

slide-21
SLIDE 21

A New Paradigm in AQP Setting

Users

Database Learning Query Synopsis Q1 A1

Slow, Accurate Fast, Inaccurate

Database Qi (1% err) Ai (1% err, 10 sec) Qi (1% err) Ai (1% err, 10 sec) Qn

1 (1% err)

An+1 (1% err, 1 sec) Qn

1 (10% err)

An

1 (10% err, 1 sec)

  • 1. User: enjoys 1% error bound in 1 second!
  • 2. Formally, always more accurate
  • 3. Popularity of analytic workloads

Approximate solutions

  • BlinkDB, SnappyData, Yahoo Druid, Facebook Presto, Infobright, etc.

2

slide-22
SLIDE 22

A New Paradigm in AQP Setting

Users

Database Learning Query Synopsis Q1 A1

Slow, Accurate Fast, Inaccurate

Database Qi (1% err) Ai (1% err, 10 sec) Qi (1% err) Ai (1% err, 10 sec) Qn

1 (1% err)

An+1 (1% err, 1 sec) Qn

1 (10% err)

An

1 (10% err, 1 sec)

  • 1. User: enjoys 1% error bound in 1 second!
  • 2. Formally, always more accurate
  • 3. Popularity of analytic workloads

Approximate solutions

  • BlinkDB, SnappyData, Yahoo Druid, Facebook Presto, Infobright, etc.

2

slide-23
SLIDE 23

A New Paradigm in AQP Setting

Users

Database Learning Query Synopsis Q1 A1

Slow, Accurate Fast, Inaccurate

Database Qi (1% err) Ai (1% err, 10 sec) Qi (1% err) Ai (1% err, 10 sec) Qn

1 (1% err)

An+1 (1% err, 1 sec) Qn

1 (10% err)

An

1 (10% err, 1 sec)

  • 1. User: enjoys 1% error bound in 1 second!
  • 2. Formally, always more accurate
  • 3. Popularity of analytic workloads

Approximate solutions

  • BlinkDB, SnappyData, Yahoo Druid, Facebook Presto, Infobright, etc.

2

slide-24
SLIDE 24

A New Paradigm in AQP Setting

Users

Database Learning Query Synopsis Q1 A1

Slow, Accurate Fast, Inaccurate

Database Qi (1% err) Ai (1% err, 10 sec) Qi (1% err) Ai (1% err, 10 sec) Qn

1 (1% err)

An+1 (1% err, 1 sec) Qn

1 (10% err)

An

1 (10% err, 1 sec)

  • 1. User: enjoys 1% error bound in 1 second!
  • 2. Formally, always more accurate
  • 3. Popularity of analytic workloads ⇒ Approximate solutions
  • BlinkDB, SnappyData, Yahoo Druid, Facebook Presto, Infobright, etc.

2

slide-25
SLIDE 25

From Machine Learning To Database Learning

Machine Learning: Past Observations ⇒ Future Predictions Database Learning: Past Answers Future Answers

The more past queries, the more Accurate and Faster

3

slide-26
SLIDE 26

From Machine Learning To Database Learning

Machine Learning: Past Observations ⇒ Future Predictions Database Learning: Past Answers ⇒ Future Answers

The more past queries, the more Accurate and Faster

3

slide-27
SLIDE 27

From Machine Learning To Database Learning

Machine Learning: Past Observations ⇒ Future Predictions Database Learning: Past Answers ⇒ Future Answers

The more past queries, the more Accurate and Faster

3

slide-28
SLIDE 28

Primary Benefit: Exploratory Workloads

· · · . . . . . . . . . . . . . . . Queries use the data in different columns/rows. How to leverage those queries for future queries?

4

slide-29
SLIDE 29

Primary Benefit: Exploratory Workloads

· · · . . . . . . . . . . . . . . . Queries use the data in different columns/rows. How to leverage those queries for future queries?

4

slide-30
SLIDE 30

Primary Benefit: Exploratory Workloads

· · · . . . . . . . . . . . . . . . Queries use the data in different columns/rows. How to leverage those queries for future queries?

4

slide-31
SLIDE 31

Primary Benefit: Exploratory Workloads

· · · . . . . . . . . . . . . . . . Queries use the data in different columns/rows. How to leverage those queries for future queries?

4

slide-32
SLIDE 32

Primary Benefit: Exploratory Workloads

· · · . . . . . . . . . . . . . . . Queries use the data in different columns/rows. How to leverage those queries for future queries?

4

slide-33
SLIDE 33

Our Idea

Q1 Q1 A1 Q2 Q2 A2

more queries and answers · · · . . . . . . . . . . . . . . .

?

5

slide-34
SLIDE 34

Our Idea

Q1 Q1 A1 Q2 Q2 A2

more queries and answers

  • · · ·

. . . . . . . . . . . . . . .

?

5

slide-35
SLIDE 35

Our Idea

Q1 (Q1, A1) Q2 Q2 A2

more queries and answers

  • · · ·

. . . . . . . . . . . . . . .

?

5

slide-36
SLIDE 36

Our Idea

Q1 (Q1, A1) Q2 Q2 A2

more queries and answers · · · . . . . . . . . . . . . . . .

?

5

slide-37
SLIDE 37

Our Idea

Q1 Q1 A1 Q2 Q2 A2

more queries and answers

  • · · ·

. . . . . . . . . . . . . . .

?

5

slide-38
SLIDE 38

Our Idea

Q1 Q1 A1 Q2 (Q2, A2)

more queries and answers

  • · · ·

. . . . . . . . . . . . . . .

?

5

slide-39
SLIDE 39

Our Idea

Q1 Q1 A1 Q2 (Q2, A2)

more queries and answers · · · . . . . . . . . . . . . . . .

?

5

slide-40
SLIDE 40

Our Idea

Q1 Q1 A1 Q2 Q2 A2

more queries and answers · · · . . . . . . . . . . . . . . .

?

5

slide-41
SLIDE 41

Concrete Example

1 20 40 60 80 100 20M 30M 40M Week Number SUM(count) True data Ranges observed by past queries Model (with 95% confidence interval) 1 20 40 60 80 100 20M 30M 40M Week Number SUM(count) 1 20 40 60 80 100 20M 30M 40M Week Number SUM(count) 6

slide-42
SLIDE 42

Concrete Example

1 20 40 60 80 100 20M 30M 40M Week Number SUM(count) True data Ranges observed by past queries Model (with 95% confidence interval) 1 20 40 60 80 100 20M 30M 40M Week Number SUM(count) 1 20 40 60 80 100 20M 30M 40M Week Number SUM(count) 6

slide-43
SLIDE 43

Concrete Example

1 20 40 60 80 100 20M 30M 40M Week Number SUM(count) True data Ranges observed by past queries Model (with 95% confidence interval) 1 20 40 60 80 100 20M 30M 40M Week Number SUM(count) 1 20 40 60 80 100 20M 30M 40M Week Number SUM(count) 6

slide-44
SLIDE 44

Concrete Example

1 20 40 60 80 100 20M 30M 40M Week Number SUM(count) True data Ranges observed by past queries Model (with 95% confidence interval) 1 20 40 60 80 100 20M 30M 40M Week Number SUM(count) 1 20 40 60 80 100 20M 30M 40M Week Number SUM(count) 6

slide-45
SLIDE 45

Concrete Example

1 20 40 60 80 100 20M 30M 40M Week Number SUM(count) True data Ranges observed by past queries Model (with 95% confidence interval) 1 20 40 60 80 100 20M 30M 40M Week Number SUM(count) 1 20 40 60 80 100 20M 30M 40M Week Number SUM(count) 6

slide-46
SLIDE 46

Concrete Example

1 20 40 60 80 100 20M 30M 40M Week Number SUM(count) True data Ranges observed by past queries Model (with 95% confidence interval) 1 20 40 60 80 100 20M 30M 40M Week Number SUM(count) 1 20 40 60 80 100 20M 30M 40M Week Number SUM(count) 6

slide-47
SLIDE 47

Design Goals

select X3, avg(Y1) from t where 5 < X1 < 8; select sum(Y2) from t where X2 between Apr and May group by X3;

  • 1. Support a wide class of SQL queries
  • 2. No Assumptions about Data

BlinkDB DBL latency

  • 3. Lightweight

7

slide-48
SLIDE 48

Design Goals

select X3, avg(Y1) from t where 5 < X1 < 8; select sum(Y2) from t where X2 between Apr and May group by X3;

  • 1. Support a wide class of SQL queries
  • 2. No Assumptions about Data

BlinkDB DBL latency

  • 3. Lightweight

7

slide-49
SLIDE 49

Design Goals

select X3, avg(Y1) from t where 5 < X1 < 8; select sum(Y2) from t where X2 between Apr and May group by X3;

  • 1. Support a wide class of SQL queries
  • 2. No Assumptions about Data

BlinkDB DBL latency

  • 3. Lightweight

7

slide-50
SLIDE 50

Outline

Our Approach

8

slide-51
SLIDE 51

Problem Statement

Problem: Given queries, q1 qn

1 , and their approximate answers,

Find the most likely answer to qn

1 and its estimated error.

Our Result: Under a certain model assumption,

  • ur answer’s error bound
  • riginal answer’s error bound

, (in practice, much more accurate) if the error bounds provide the same probabilistic guarantees.

9

slide-52
SLIDE 52

Problem Statement

Problem: Given queries, {q1, . . . , qn+1}, and their approximate answers, Find the most likely answer to qn+1 and its estimated error. Our Result: Under a certain model assumption,

  • ur answer’s error bound
  • riginal answer’s error bound

, (in practice, much more accurate) if the error bounds provide the same probabilistic guarantees.

9

slide-53
SLIDE 53

Problem Statement

Problem: Given queries, {q1, . . . , qn+1}, and their approximate answers, Find the most likely answer to qn+1 and its estimated error. Our Result: Under a certain model assumption,

  • ur answer’s error bound ≤ original answer’s error bound

, (in practice, much more accurate) if the error bounds provide the same probabilistic guarantees.

9

slide-54
SLIDE 54

Our Model Assumption

Assumption: Goodness of the principle of maximum entropy [?] Not overly confident on the unobserved data. Justification:

agrees with everything that is known, but carefully avoids assuming anything that is not known — [?]

We provide empirical justifications in our report [?].

10

slide-55
SLIDE 55

Our Model Assumption

Assumption: Goodness of the principle of maximum entropy [?] → Not overly confident on the unobserved data. Justification:

agrees with everything that is known, but carefully avoids assuming anything that is not known — [?]

We provide empirical justifications in our report [?].

10

slide-56
SLIDE 56

Our Model Assumption

Assumption: Goodness of the principle of maximum entropy [?] → Not overly confident on the unobserved data. Justification:

agrees with everything that is known, but carefully avoids assuming anything that is not known — [?]

We provide empirical justifications in our report [?].

10

slide-57
SLIDE 57

Our Model Assumption

Assumption: Goodness of the principle of maximum entropy [?] → Not overly confident on the unobserved data. Justification:

agrees with everything that is known, but carefully avoids assuming anything that is not known — [?]

We provide empirical justifications in our report [?].

10

slide-58
SLIDE 58

Overview of Our Technique

select count(Y2) from t where 1 < X1 < 2; select avg(Y2) from t where 6 < X1 < 8; select sum(Y2) from t where 5 < X1 < 8;

1 2 3

Random variables (our uncertainty on answers) 1

Pr

1 2 3

Probability distribution 2

Pr

3 1 2

Estimated answer 3

Two aggregations involve common values correlation between answers

11

slide-59
SLIDE 59

Overview of Our Technique

select count(Y2) from t where 1 < X1 < 2; select avg(Y2) from t where 6 < X1 < 8; select sum(Y2) from t where 5 < X1 < 8;

1 2 3

Random variables (our uncertainty on answers) 1

Pr

1 2 3

Probability distribution 2

Pr

3 1 2

Estimated answer 3

Two aggregations involve common values correlation between answers

11

slide-60
SLIDE 60

Overview of Our Technique

select count(Y2) from t where 1 < X1 < 2; select avg(Y2) from t where 6 < X1 < 8; select sum(Y2) from t where 5 < X1 < 8;

θ1, θ2, θ3

Random variables (our uncertainty on answers) 1

Pr

1 2 3

Probability distribution 2

Pr

3 1 2

Estimated answer 3

Two aggregations involve common values correlation between answers

11

slide-61
SLIDE 61

Overview of Our Technique

select count(Y2) from t where 1 < X1 < 2; select avg(Y2) from t where 6 < X1 < 8; select sum(Y2) from t where 5 < X1 < 8;

θ1, θ2, θ3

Random variables (our uncertainty on answers) 1

Pr(θ1, θ2, θ3)

Probability distribution 2

Pr

3 1 2

Estimated answer 3

Two aggregations involve common values correlation between answers

11

slide-62
SLIDE 62

Overview of Our Technique

select count(Y2) from t where 1 < X1 < 2; select avg(Y2) from t where 6 < X1 < 8; select sum(Y2) from t where 5 < X1 < 8;

θ1, θ2, θ3

Random variables (our uncertainty on answers) 1

Pr(θ1, θ2, θ3)

Probability distribution 2

Pr

3 1 2

Estimated answer 3

Two aggregations involve common values → correlation between answers

11

slide-63
SLIDE 63

Overview of Our Technique

select count(Y2) from t where 1 < X1 < 2; select avg(Y2) from t where 6 < X1 < 8; select sum(Y2) from t where 5 < X1 < 8;

θ1, θ2, θ3

Random variables (our uncertainty on answers) 1

Pr(θ1, θ2, θ3)

Probability distribution 2

Pr(θ3 | θ1, θ2)

Estimated answer 3

Two aggregations involve common values → correlation between answers

11

slide-64
SLIDE 64

Benefits of Database Learning

Database Learning vs. Indexing

Indexing DBL database size storage

  • 1. Little Storage Overhead

Database Learning vs. Materialized View Selection

date

  • 2. Without Alignment

view selection DBL system uptime

  • verhead
  • 3. No Upfront Overhead

12

slide-65
SLIDE 65

Benefits of Database Learning

Database Learning vs. Indexing

Indexing DBL database size storage

  • 1. Little Storage Overhead

Database Learning vs. Materialized View Selection

date

  • 2. Without Alignment

view selection DBL system uptime

  • verhead
  • 3. No Upfront Overhead

12

slide-66
SLIDE 66

Benefits of Database Learning

Database Learning vs. Indexing

Indexing DBL database size storage

  • 1. Little Storage Overhead

Database Learning vs. Materialized View Selection

date

  • 2. Without Alignment

view selection DBL system uptime

  • verhead
  • 3. No Upfront Overhead

12

slide-67
SLIDE 67

Benefits of Database Learning

Database Learning vs. Indexing

Indexing DBL database size storage

  • 1. Little Storage Overhead

Database Learning vs. Materialized View Selection

date

  • 2. Without Alignment

view selection DBL system uptime

  • verhead
  • 3. No Upfront Overhead

12

slide-68
SLIDE 68

Benefits of Database Learning

Database Learning vs. Indexing

Indexing DBL database size storage

  • 1. Little Storage Overhead

Database Learning vs. Materialized View Selection

date

  • 2. Without Alignment

view selection DBL system uptime

  • verhead
  • 3. No Upfront Overhead

12

slide-69
SLIDE 69

Outline

Experiment

13

slide-70
SLIDE 70

Implementation and Experiments

  • 1. Using Spark SQL as a backend
  • NoLearn: Sampling-based AQP engine
  • Verdict: Our database learning system
  • 2. Datasets:
  • Customer1: Query log from an analytic DB vendor
  • TPC-H: 100G TPC-H dataset
  • 3. Environment:
  • 5 Amazon EC2 workers (m4.2xlarge)
  • SSD-backed HDFS for Spark’s data loading

14

slide-71
SLIDE 71

Implementation and Experiments

  • 1. Using Spark SQL as a backend
  • NoLearn: Sampling-based AQP engine
  • Verdict: Our database learning system
  • 2. Datasets:
  • Customer1: Query log from an analytic DB vendor
  • TPC-H: 100G TPC-H dataset
  • 3. Environment:
  • 5 Amazon EC2 workers (m4.2xlarge)
  • SSD-backed HDFS for Spark’s data loading

14

slide-72
SLIDE 72

Implementation and Experiments

  • 1. Using Spark SQL as a backend
  • NoLearn: Sampling-based AQP engine
  • Verdict: Our database learning system
  • 2. Datasets:
  • Customer1: Query log from an analytic DB vendor
  • TPC-H: 100G TPC-H dataset
  • 3. Environment:
  • 5 Amazon EC2 workers (m4.2xlarge)
  • SSD-backed HDFS for Spark’s data loading

14

slide-73
SLIDE 73

Generality of Verdict

Dataset # Analyzed # Supported Percentage Customer1 3,342 2,463 73.7% TPC-H 21 14 63.6%

15

slide-74
SLIDE 74

Latency-Error Trade-off

1 2 3 4 5 1.0 2.0 3.0 4.0 5.0 Estimated Error (%) Latency (sec)

Customer1 dataset in memory

16

slide-75
SLIDE 75

Latency-Error Trade-off

1 2 3 4 5 1.0 2.0 3.0 4.0 5.0 Estimated Error (%) Latency (sec) NoLearn

Customer1 dataset in memory

16

slide-76
SLIDE 76

Latency-Error Trade-off

1 2 3 4 5 1.0 2.0 3.0 4.0 5.0 Estimated Error (%) Latency (sec) NoLearn Verdict

Customer1 dataset in memory

16

slide-77
SLIDE 77

Latency-Error Trade-off

1 2 3 4 5 1.0 2.0 3.0 4.0 5.0

time for query parsing, planning, etc.

Estimated Error (%) Latency (sec) NoLearn Verdict

Customer1 dataset in memory

16

slide-78
SLIDE 78

Error Reduction

Customer1 TPC-H 20 40 60 80 100 92.58 95.33 Error Reduction (%)

For the same time budget; Data on SSD

17

slide-79
SLIDE 79

Speedup (Data in Memory)

Customer1 TPC-H 2 4 6 8 10 12 2.01 2.63 Speedup (x)

For the same target error of 2%.

18

slide-80
SLIDE 80

Speedup (Data on SSD)

Customer1 TPC-H 2 4 6 8 10 12 10.43 4.09 Speedup (x)

For the same target error of 2%.

19

slide-81
SLIDE 81

Reliability of Estimated Error Guarantees

1% 2% 4% 8% 16% 32% 5 10 15 20 25 30 35 Guaranteed Error (%) Actual Error (%) mean 5%, 95% percentiles

Guaranteed error = from 95% confidence interval

20

slide-82
SLIDE 82

Memory and Computational Overhead

  • 1. Memory Overhead:
  • 150KB per query for the Customer1 dataset
  • 8KB per query for the TPC-H dataset
  • 2. Computational Overhead:

Latency Cached No-Cache NoLearn 2.083 sec 52.50 sec Verdict 2.093 sec 52.51 sec Overhead 0.010 sec (0.48%) 0.010 sec (0.02%)

21

slide-83
SLIDE 83

Memory and Computational Overhead

  • 1. Memory Overhead:
  • 150KB per query for the Customer1 dataset
  • 8KB per query for the TPC-H dataset
  • 2. Computational Overhead:

Latency Cached No-Cache NoLearn 2.083 sec 52.50 sec Verdict 2.093 sec 52.51 sec Overhead 0.010 sec (0.48%) 0.010 sec (0.02%)

21

slide-84
SLIDE 84

Conclusion and Future Work

  • 1. Database Learning:

Answers to past queries → boost your AQP!

  • 2. Our prototype, Verdict, demonstrated:
  • Support 73.7% real-world analytical queries
  • Error reduction up to 95% compared to existing AQP engines

Our next goal: Active Database Learning Aims to build a probabilistic model of data even before any queries submitted

22

slide-85
SLIDE 85

Conclusion and Future Work

  • 1. Database Learning:

Answers to past queries → boost your AQP!

  • 2. Our prototype, Verdict, demonstrated:
  • Support 73.7% real-world analytical queries
  • Error reduction up to 95% compared to existing AQP engines

Our next goal: Active Database Learning Aims to build a probabilistic model of data even before any queries submitted

22

slide-86
SLIDE 86

Conclusion and Future Work

  • 1. Database Learning:

Answers to past queries → boost your AQP!

  • 2. Our prototype, Verdict, demonstrated:
  • Support 73.7% real-world analytical queries
  • Error reduction up to 95% compared to existing AQP engines

Our next goal: Active Database Learning Aims to build a probabilistic model of data even before any queries submitted

22

slide-87
SLIDE 87

Conclusion and Future Work

  • 1. Database Learning:

Answers to past queries → boost your AQP!

  • 2. Our prototype, Verdict, demonstrated:
  • Support 73.7% real-world analytical queries
  • Error reduction up to 95% compared to existing AQP engines

Our next goal: Active Database Learning Aims to build a probabilistic model of data even before any queries submitted

22

slide-88
SLIDE 88

Conclusion and Future Work

  • 1. Database Learning:

Answers to past queries → boost your AQP!

  • 2. Our prototype, Verdict, demonstrated:
  • Support 73.7% real-world analytical queries
  • Error reduction up to 95% compared to existing AQP engines

Our next goal: Active Database Learning Aims to build a probabilistic model of data even before any queries submitted

22

slide-89
SLIDE 89

Thank You!

22

slide-90
SLIDE 90

References I

23