Database Learning
Building databases that become smarter over time
Yongjoo Park Ahmad Shahab Tajik Michael Cafarella Barzan Mozafari
University of Michigan, Ann Arbor
Database Learning Yongjoo Park Our Goal: reuse the work. Users - - PowerPoint PPT Presentation
Building databases that become smarter over time Ahmad Shahab Tajik Michael Cafarella Barzan Mozafari University of Michigan, Ann Arbor Database Learning Yongjoo Park Our Goal: reuse the work. Users Database query Answer to query After
Database Learning
Building databases that become smarter over time
Yongjoo Park Ahmad Shahab Tajik Michael Cafarella Barzan Mozafari
University of Michigan, Ann Arbor
Today’s Databases
Users Database
query Answer to query
After answering queries, THE WORK is almost completely WASTED. Small exceptions:
Our Goal: reuse the work.
1
Today’s Databases
Users Database
query Answer to query
After answering queries, THE WORK is almost completely WASTED. Small exceptions:
Our Goal: reuse the work.
1
Today’s Databases
Users Database
query Answer to query
After answering queries, THE WORK is almost completely WASTED. Small exceptions:
Our Goal: reuse the work.
1
Today’s Databases
Users Database
query Answer to query
After answering queries, THE WORK is almost completely WASTED. Small exceptions:
Our Goal: reuse the work.
1
Today’s Databases
Users Database
query Answer to query
After answering queries, THE WORK is almost completely WASTED. Small exceptions:
Our Goal: reuse the work.
1
Today’s Databases
Users Database
query Answer to query
After answering queries, THE WORK is almost completely WASTED. Small exceptions:
Our Goal: reuse the work.
1
Today’s Databases
Users Database
query Answer to query
After answering queries, THE WORK is almost completely WASTED. Small exceptions:
Our Goal: reuse the work.
1
Today’s Databases
Users Database
query Answer to query
After answering queries, THE WORK is almost completely WASTED. Small exceptions:
Our Goal: reuse the work.
1
Today’s Databases
Users Database
query Answer to query
After answering queries, THE WORK is almost completely WASTED. Small exceptions:
Our Goal: reuse the work.
1
Today’s Databases
Users Database
query Answer to query
After answering queries, THE WORK is almost completely WASTED. Small exceptions:
Our Goal: reuse the work.
1
A New Paradigm in AQP Setting
Users
Database Learning Query Synopsis Q1 A1
Slow, Accurate Fast, Inaccurate
Database Qi (1% err) Ai (1% err, 10 sec) Qi (1% err) Ai (1% err, 10 sec) Qn
1 (1% err)
An
1 (1% err, 1 sec)
Qn
1 (10% err)
An
1 (10% err, 1 sec)
Approximate solutions
2
A New Paradigm in AQP Setting
Users
Database Learning Query Synopsis Q1 A1
Slow, Accurate Fast, Inaccurate
Database Qi (1% err) Ai (1% err, 10 sec) Qi (1% err) Ai (1% err, 10 sec) Qn
1 (1% err)
An
1 (1% err, 1 sec)
Qn
1 (10% err)
An
1 (10% err, 1 sec)
Approximate solutions
2
A New Paradigm in AQP Setting
Users
Database Learning Query Synopsis Q1 A1
Slow, Accurate Fast, Inaccurate
Database Qi (1% err) Ai (1% err, 10 sec) Qi (1% err) Ai (1% err, 10 sec) Qn
1 (1% err)
An
1 (1% err, 1 sec)
Qn
1 (10% err)
An
1 (10% err, 1 sec)
Approximate solutions
2
A New Paradigm in AQP Setting
Users
Database Learning Query Synopsis Q1 A1
Slow, Accurate Fast, Inaccurate
Database Qi (1% err) Ai (1% err, 10 sec) Qi (1% err) Ai (1% err, 10 sec) Qn
1 (1% err)
An
1 (1% err, 1 sec)
Qn
1 (10% err)
An
1 (10% err, 1 sec)
Approximate solutions
2
A New Paradigm in AQP Setting
Users
Database Learning Query Synopsis Q1 A1
Slow, Accurate Fast, Inaccurate
Database Qi (1% err) Ai (1% err, 10 sec) Qi (1% err) Ai (1% err, 10 sec) Qn
1 (1% err)
An
1 (1% err, 1 sec)
Qn
1 (10% err)
An
1 (10% err, 1 sec)
Approximate solutions
2
A New Paradigm in AQP Setting
Users
Database Learning Query Synopsis (Q1, A1)
Slow, Accurate Fast, Inaccurate
Database Qi (1% err) Ai (1% err, 10 sec) Qi (1% err) Ai (1% err, 10 sec) Qn
1 (1% err)
An
1 (1% err, 1 sec)
Qn
1 (10% err)
An
1 (10% err, 1 sec)
Approximate solutions
2
A New Paradigm in AQP Setting
Users
Database Learning Query Synopsis Q1 A1
Slow, Accurate Fast, Inaccurate
Database Qi (1% err) Ai (1% err, 10 sec) Qi (1% err) Ai (1% err, 10 sec) Qn+1 (1% err) An
1 (1% err, 1 sec)
Qn
1 (10% err)
An
1 (10% err, 1 sec)
Approximate solutions
2
A New Paradigm in AQP Setting
Users
Database Learning Query Synopsis Q1 A1
Slow, Accurate Fast, Inaccurate
Database Qi (1% err) Ai (1% err, 10 sec) Qi (1% err) Ai (1% err, 10 sec) Qn
1 (1% err)
An
1 (1% err, 1 sec)
Qn+1 (10% err) An
1 (10% err, 1 sec)
Approximate solutions
2
A New Paradigm in AQP Setting
Users
Database Learning Query Synopsis Q1 A1
Slow, Accurate Fast, Inaccurate
Database Qi (1% err) Ai (1% err, 10 sec) Qi (1% err) Ai (1% err, 10 sec) Qn
1 (1% err)
An
1 (1% err, 1 sec)
Qn
1 (10% err)
An+1 (10% err, 1 sec)
Approximate solutions
2
A New Paradigm in AQP Setting
Users
Database Learning Query Synopsis Q1 A1
Slow, Accurate Fast, Inaccurate
Database Qi (1% err) Ai (1% err, 10 sec) Qi (1% err) Ai (1% err, 10 sec) Qn
1 (1% err)
An+1 (1% err, 1 sec) Qn
1 (10% err)
An
1 (10% err, 1 sec)
Approximate solutions
2
A New Paradigm in AQP Setting
Users
Database Learning Query Synopsis Q1 A1
Slow, Accurate Fast, Inaccurate
Database Qi (1% err) Ai (1% err, 10 sec) Qi (1% err) Ai (1% err, 10 sec) Qn
1 (1% err)
An+1 (1% err, 1 sec) Qn
1 (10% err)
An
1 (10% err, 1 sec)
Approximate solutions
2
A New Paradigm in AQP Setting
Users
Database Learning Query Synopsis Q1 A1
Slow, Accurate Fast, Inaccurate
Database Qi (1% err) Ai (1% err, 10 sec) Qi (1% err) Ai (1% err, 10 sec) Qn
1 (1% err)
An+1 (1% err, 1 sec) Qn
1 (10% err)
An
1 (10% err, 1 sec)
Approximate solutions
2
A New Paradigm in AQP Setting
Users
Database Learning Query Synopsis Q1 A1
Slow, Accurate Fast, Inaccurate
Database Qi (1% err) Ai (1% err, 10 sec) Qi (1% err) Ai (1% err, 10 sec) Qn
1 (1% err)
An+1 (1% err, 1 sec) Qn
1 (10% err)
An
1 (10% err, 1 sec)
2
From Machine Learning To Database Learning
Machine Learning: Past Observations ⇒ Future Predictions Database Learning: Past Answers Future Answers
The more past queries, the more Accurate and Faster
3
From Machine Learning To Database Learning
Machine Learning: Past Observations ⇒ Future Predictions Database Learning: Past Answers ⇒ Future Answers
The more past queries, the more Accurate and Faster
3
From Machine Learning To Database Learning
Machine Learning: Past Observations ⇒ Future Predictions Database Learning: Past Answers ⇒ Future Answers
The more past queries, the more Accurate and Faster
3
Primary Benefit: Exploratory Workloads
· · · . . . . . . . . . . . . . . . Queries use the data in different columns/rows. How to leverage those queries for future queries?
4
Primary Benefit: Exploratory Workloads
· · · . . . . . . . . . . . . . . . Queries use the data in different columns/rows. How to leverage those queries for future queries?
4
Primary Benefit: Exploratory Workloads
· · · . . . . . . . . . . . . . . . Queries use the data in different columns/rows. How to leverage those queries for future queries?
4
Primary Benefit: Exploratory Workloads
· · · . . . . . . . . . . . . . . . Queries use the data in different columns/rows. How to leverage those queries for future queries?
4
Primary Benefit: Exploratory Workloads
· · · . . . . . . . . . . . . . . . Queries use the data in different columns/rows. How to leverage those queries for future queries?
4
Our Idea
Q1 Q1 A1 Q2 Q2 A2
more queries and answers · · · . . . . . . . . . . . . . . .
5
Our Idea
Q1 Q1 A1 Q2 Q2 A2
more queries and answers
. . . . . . . . . . . . . . .
5
Our Idea
Q1 (Q1, A1) Q2 Q2 A2
more queries and answers
. . . . . . . . . . . . . . .
5
Our Idea
Q1 (Q1, A1) Q2 Q2 A2
more queries and answers · · · . . . . . . . . . . . . . . .
5
Our Idea
Q1 Q1 A1 Q2 Q2 A2
more queries and answers
. . . . . . . . . . . . . . .
5
Our Idea
Q1 Q1 A1 Q2 (Q2, A2)
more queries and answers
. . . . . . . . . . . . . . .
5
Our Idea
Q1 Q1 A1 Q2 (Q2, A2)
more queries and answers · · · . . . . . . . . . . . . . . .
5
Our Idea
Q1 Q1 A1 Q2 Q2 A2
more queries and answers · · · . . . . . . . . . . . . . . .
5
Concrete Example
1 20 40 60 80 100 20M 30M 40M Week Number SUM(count) True data Ranges observed by past queries Model (with 95% confidence interval) 1 20 40 60 80 100 20M 30M 40M Week Number SUM(count) 1 20 40 60 80 100 20M 30M 40M Week Number SUM(count) 6
Concrete Example
1 20 40 60 80 100 20M 30M 40M Week Number SUM(count) True data Ranges observed by past queries Model (with 95% confidence interval) 1 20 40 60 80 100 20M 30M 40M Week Number SUM(count) 1 20 40 60 80 100 20M 30M 40M Week Number SUM(count) 6
Concrete Example
1 20 40 60 80 100 20M 30M 40M Week Number SUM(count) True data Ranges observed by past queries Model (with 95% confidence interval) 1 20 40 60 80 100 20M 30M 40M Week Number SUM(count) 1 20 40 60 80 100 20M 30M 40M Week Number SUM(count) 6
Concrete Example
1 20 40 60 80 100 20M 30M 40M Week Number SUM(count) True data Ranges observed by past queries Model (with 95% confidence interval) 1 20 40 60 80 100 20M 30M 40M Week Number SUM(count) 1 20 40 60 80 100 20M 30M 40M Week Number SUM(count) 6
Concrete Example
1 20 40 60 80 100 20M 30M 40M Week Number SUM(count) True data Ranges observed by past queries Model (with 95% confidence interval) 1 20 40 60 80 100 20M 30M 40M Week Number SUM(count) 1 20 40 60 80 100 20M 30M 40M Week Number SUM(count) 6
Concrete Example
1 20 40 60 80 100 20M 30M 40M Week Number SUM(count) True data Ranges observed by past queries Model (with 95% confidence interval) 1 20 40 60 80 100 20M 30M 40M Week Number SUM(count) 1 20 40 60 80 100 20M 30M 40M Week Number SUM(count) 6
Design Goals
select X3, avg(Y1) from t where 5 < X1 < 8; select sum(Y2) from t where X2 between Apr and May group by X3;
BlinkDB DBL latency
7
Design Goals
select X3, avg(Y1) from t where 5 < X1 < 8; select sum(Y2) from t where X2 between Apr and May group by X3;
BlinkDB DBL latency
7
Design Goals
select X3, avg(Y1) from t where 5 < X1 < 8; select sum(Y2) from t where X2 between Apr and May group by X3;
BlinkDB DBL latency
7
Outline
Our Approach
8
Problem Statement
Problem: Given queries, q1 qn
1 , and their approximate answers,
Find the most likely answer to qn
1 and its estimated error.
Our Result: Under a certain model assumption,
, (in practice, much more accurate) if the error bounds provide the same probabilistic guarantees.
9
Problem Statement
Problem: Given queries, {q1, . . . , qn+1}, and their approximate answers, Find the most likely answer to qn+1 and its estimated error. Our Result: Under a certain model assumption,
, (in practice, much more accurate) if the error bounds provide the same probabilistic guarantees.
9
Problem Statement
Problem: Given queries, {q1, . . . , qn+1}, and their approximate answers, Find the most likely answer to qn+1 and its estimated error. Our Result: Under a certain model assumption,
, (in practice, much more accurate) if the error bounds provide the same probabilistic guarantees.
9
Our Model Assumption
Assumption: Goodness of the principle of maximum entropy [?] Not overly confident on the unobserved data. Justification:
agrees with everything that is known, but carefully avoids assuming anything that is not known — [?]
We provide empirical justifications in our report [?].
10
Our Model Assumption
Assumption: Goodness of the principle of maximum entropy [?] → Not overly confident on the unobserved data. Justification:
agrees with everything that is known, but carefully avoids assuming anything that is not known — [?]
We provide empirical justifications in our report [?].
10
Our Model Assumption
Assumption: Goodness of the principle of maximum entropy [?] → Not overly confident on the unobserved data. Justification:
agrees with everything that is known, but carefully avoids assuming anything that is not known — [?]
We provide empirical justifications in our report [?].
10
Our Model Assumption
Assumption: Goodness of the principle of maximum entropy [?] → Not overly confident on the unobserved data. Justification:
agrees with everything that is known, but carefully avoids assuming anything that is not known — [?]
We provide empirical justifications in our report [?].
10
Overview of Our Technique
select count(Y2) from t where 1 < X1 < 2; select avg(Y2) from t where 6 < X1 < 8; select sum(Y2) from t where 5 < X1 < 8;
1 2 3
Random variables (our uncertainty on answers) 1
Pr
1 2 3
Probability distribution 2
Pr
3 1 2
Estimated answer 3
Two aggregations involve common values correlation between answers
11
Overview of Our Technique
select count(Y2) from t where 1 < X1 < 2; select avg(Y2) from t where 6 < X1 < 8; select sum(Y2) from t where 5 < X1 < 8;
1 2 3
Random variables (our uncertainty on answers) 1
Pr
1 2 3
Probability distribution 2
Pr
3 1 2
Estimated answer 3
Two aggregations involve common values correlation between answers
11
Overview of Our Technique
select count(Y2) from t where 1 < X1 < 2; select avg(Y2) from t where 6 < X1 < 8; select sum(Y2) from t where 5 < X1 < 8;
θ1, θ2, θ3
Random variables (our uncertainty on answers) 1
Pr
1 2 3
Probability distribution 2
Pr
3 1 2
Estimated answer 3
Two aggregations involve common values correlation between answers
11
Overview of Our Technique
select count(Y2) from t where 1 < X1 < 2; select avg(Y2) from t where 6 < X1 < 8; select sum(Y2) from t where 5 < X1 < 8;
θ1, θ2, θ3
Random variables (our uncertainty on answers) 1
Pr(θ1, θ2, θ3)
Probability distribution 2
Pr
3 1 2
Estimated answer 3
Two aggregations involve common values correlation between answers
11
Overview of Our Technique
select count(Y2) from t where 1 < X1 < 2; select avg(Y2) from t where 6 < X1 < 8; select sum(Y2) from t where 5 < X1 < 8;
θ1, θ2, θ3
Random variables (our uncertainty on answers) 1
Pr(θ1, θ2, θ3)
Probability distribution 2
Pr
3 1 2
Estimated answer 3
Two aggregations involve common values → correlation between answers
11
Overview of Our Technique
select count(Y2) from t where 1 < X1 < 2; select avg(Y2) from t where 6 < X1 < 8; select sum(Y2) from t where 5 < X1 < 8;
θ1, θ2, θ3
Random variables (our uncertainty on answers) 1
Pr(θ1, θ2, θ3)
Probability distribution 2
Pr(θ3 | θ1, θ2)
Estimated answer 3
Two aggregations involve common values → correlation between answers
11
Benefits of Database Learning
Database Learning vs. Indexing
Indexing DBL database size storage
Database Learning vs. Materialized View Selection
date
view selection DBL system uptime
12
Benefits of Database Learning
Database Learning vs. Indexing
Indexing DBL database size storage
Database Learning vs. Materialized View Selection
date
view selection DBL system uptime
12
Benefits of Database Learning
Database Learning vs. Indexing
Indexing DBL database size storage
Database Learning vs. Materialized View Selection
date
view selection DBL system uptime
12
Benefits of Database Learning
Database Learning vs. Indexing
Indexing DBL database size storage
Database Learning vs. Materialized View Selection
date
view selection DBL system uptime
12
Benefits of Database Learning
Database Learning vs. Indexing
Indexing DBL database size storage
Database Learning vs. Materialized View Selection
date
view selection DBL system uptime
12
Outline
Experiment
13
Implementation and Experiments
14
Implementation and Experiments
14
Implementation and Experiments
14
Generality of Verdict
Dataset # Analyzed # Supported Percentage Customer1 3,342 2,463 73.7% TPC-H 21 14 63.6%
15
Latency-Error Trade-off
1 2 3 4 5 1.0 2.0 3.0 4.0 5.0 Estimated Error (%) Latency (sec)
Customer1 dataset in memory
16
Latency-Error Trade-off
1 2 3 4 5 1.0 2.0 3.0 4.0 5.0 Estimated Error (%) Latency (sec) NoLearn
Customer1 dataset in memory
16
Latency-Error Trade-off
1 2 3 4 5 1.0 2.0 3.0 4.0 5.0 Estimated Error (%) Latency (sec) NoLearn Verdict
Customer1 dataset in memory
16
Latency-Error Trade-off
1 2 3 4 5 1.0 2.0 3.0 4.0 5.0
time for query parsing, planning, etc.
Estimated Error (%) Latency (sec) NoLearn Verdict
Customer1 dataset in memory
16
Error Reduction
Customer1 TPC-H 20 40 60 80 100 92.58 95.33 Error Reduction (%)
For the same time budget; Data on SSD
17
Speedup (Data in Memory)
Customer1 TPC-H 2 4 6 8 10 12 2.01 2.63 Speedup (x)
For the same target error of 2%.
18
Speedup (Data on SSD)
Customer1 TPC-H 2 4 6 8 10 12 10.43 4.09 Speedup (x)
For the same target error of 2%.
19
Reliability of Estimated Error Guarantees
1% 2% 4% 8% 16% 32% 5 10 15 20 25 30 35 Guaranteed Error (%) Actual Error (%) mean 5%, 95% percentiles
Guaranteed error = from 95% confidence interval
20
Memory and Computational Overhead
Latency Cached No-Cache NoLearn 2.083 sec 52.50 sec Verdict 2.093 sec 52.51 sec Overhead 0.010 sec (0.48%) 0.010 sec (0.02%)
21
Memory and Computational Overhead
Latency Cached No-Cache NoLearn 2.083 sec 52.50 sec Verdict 2.093 sec 52.51 sec Overhead 0.010 sec (0.48%) 0.010 sec (0.02%)
21
Conclusion and Future Work
Answers to past queries → boost your AQP!
Our next goal: Active Database Learning Aims to build a probabilistic model of data even before any queries submitted
22
Conclusion and Future Work
Answers to past queries → boost your AQP!
Our next goal: Active Database Learning Aims to build a probabilistic model of data even before any queries submitted
22
Conclusion and Future Work
Answers to past queries → boost your AQP!
Our next goal: Active Database Learning Aims to build a probabilistic model of data even before any queries submitted
22
Conclusion and Future Work
Answers to past queries → boost your AQP!
Our next goal: Active Database Learning Aims to build a probabilistic model of data even before any queries submitted
22
Conclusion and Future Work
Answers to past queries → boost your AQP!
Our next goal: Active Database Learning Aims to build a probabilistic model of data even before any queries submitted
22
Thank You!
22
References I
23