Database Learning: Toward a Database that Becomes Smarter Every Time - PowerPoint PPT Presentation

SLIDE 1

Database Learning: Toward a Database that Becomes Smarter Every Time

Presented by: Huanyi Chen

SLIDE 2

Where does the data come from?

§ Real world
§ The entire dataset follows a certain underlying distribution


SLIDE 3

Income of a shop

Day | Income (CAD)
1   | 100
2   | 200
3   | 400
4   | 800
5   | 1600

[Chart: Income of a shop per day; y-axis Income (CAD), x-axis days 1-5]


SLIDE 4

Income of a shop

Day | Income (CAD)
1   | 100
2   | 200
3   | 400
4   | 800
5   | 1600
6   | ?

[Chart: Income of a shop per day; y-axis Income (CAD), x-axis days 1-5, with day 6 unknown]


SLIDE 5

Income of a shop


§ Income(n) = 50 · 2^n (n = 1, 2, 3, …)
§ No database needed if we can find the underlying distribution (see the sketch below)

[Chart: Income of a shop per day, extended through day 7 by the rule; y-axis Income (CAD), x-axis days 1-7]
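A minimal sketch of the slide's point (the function name and printouts are illustrative, not from the deck): once the generating rule is known, any day's income can be computed on demand instead of being looked up in a stored table.

```python
# Sketch: when the generating rule is known, the "database" reduces to a formula.
def income(day: int) -> int:
    """Income(n) = 50 * 2**n for n = 1, 2, 3, ..."""
    return 50 * 2 ** day

print([income(d) for d in range(1, 6)])  # [100, 200, 400, 800, 1600] -- stored days
print(income(6))                         # 3200 -- a future day, no stored row needed
```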

SLIDE 6

Which distribution do we care about?

§ The exact underlying distribution that generates the entire dataset and future unseen data?
  § Not possible
§ An exact underlying distribution that generates the entire dataset but excludes future unseen data?
  § Benefits nothing: one can always build a model from every value of a column, but such a model cannot predict anything, and we still need to store future data in order to answer queries
§ A possible distribution that generates the entire dataset and future unseen data!


SLIDE 7

Mismatching data

§ A possible distribution that generates the entire dataset and future unseen data cannot match every value in the dataset
§ Does not work when exact query results are needed
§ Works for approximate query processing (AQP)


SLIDE 8

Approximate Query Processing (AQP)

§ Trade accuracy for response time
§ Results are based on samples
§ Previous query results are of no help to future queries
§ So here comes Database Learning: learning from past query answers!


SLIDE 9

Target

§ Improve future query answers by using previous query answers from an AQP engine

Workflow

Database Learning Engine: Verdict


SLIDE 10

Verdict

§ A query is decomposed into possibly multiple query snippets
  § The answer of a snippet is a single scalar value (see the example below)
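As an illustration of the decomposition (hypothetical schema, table, and group values; the deck does not show a concrete query), a grouped aggregate splits into one snippet per group, and each snippet's answer is a single scalar:

```python
# Sketch with a hypothetical schema: a grouped aggregate decomposes into
# per-group snippets, each of which returns a single scalar value.
query = "SELECT city, AVG(income) FROM shops GROUP BY city"

snippets = [
    "SELECT AVG(income) FROM shops WHERE city = 'Waterloo'",  # one scalar answer
    "SELECT AVG(income) FROM shops WHERE city = 'Toronto'",   # one scalar answer
]
```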


SLIDE 11

Verdict

§ A query is decomposed into possibly multiple query snippets
§ Verdict exploits potential correlations between snippet answers to infer the answer of a new snippet


Day | Income (CAD)
1   | 100
2   | 200
3   | 400
4   | 800
5   | 1600

old avg and new avg (the running averages before and after the newest day) are correlated
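A toy resampling experiment (an illustration I am assuming, not Verdict's actual estimator) shows why the two averages are correlated: estimates of avg(days 1-4) and avg(days 1-5) share four of their five tuples, so their errors co-vary.

```python
import numpy as np

rng = np.random.default_rng(0)
income = np.array([100.0, 200.0, 400.0, 800.0, 1600.0])

old_avgs, new_avgs = [], []
for _ in range(10_000):
    noisy = income + rng.normal(0.0, 50.0, size=5)  # noisy re-observations
    old_avgs.append(noisy[:4].mean())               # "old avg": days 1-4
    new_avgs.append(noisy.mean())                   # "new avg": days 1-5

# The two estimates move together because they share most of their data.
print(np.corrcoef(old_avgs, new_avgs)[0, 1])        # roughly 0.9
```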
SLIDE 12

Inference

§ Observations + Rules = Prediction


Observations | Rules | Prediction
Shop income: 100, 200, 400, 800, 1600 | Income(n) = 50 · 2^n | 3200, 6400, …
Fibonacci: initial 1, 1 | F_n = F_{n-1} + F_{n-2} | 2, 3, 5, 8, …
Verdict: past snippet answers from AQP + AQP answer for the new snippet | maximize the conditional joint probability distribution function (pdf) | improved answer and error for the new snippet
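A sketch of the "observations + rules = prediction" pattern for the table's two toy rows (the Verdict row replaces these hand-written rules with a learned joint pdf):

```python
# Rule for the shop: Income(n) = 50 * 2**n.
def predict_income(n: int) -> int:
    return 50 * 2 ** n

# Rule for Fibonacci: F_n = F_{n-1} + F_{n-2}, seeded with 1, 1.
def fibonacci(seed, length):
    seq = list(seed)
    while len(seq) < length:
        seq.append(seq[-1] + seq[-2])
    return seq

print(predict_income(6), predict_income(7))  # 3200 6400
print(fibonacci([1, 1], 6))                  # [1, 1, 2, 3, 5, 8]
```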

SLIDE 13

Inference: pdf


If we have the joint pdf f(θ_1 = θ̂_1, …, θ_n = θ̂_n, θ_{n+1} = θ̄_{n+1}), then the prediction is the value of θ̄_{n+1} that maximizes the conditional pdf f(θ_{n+1} = θ̄_{n+1} | θ_1 = θ̂_1, …, θ_n = θ̂_n)
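A toy numeric illustration of this maximization (a two-variable example I am assuming; Verdict works with many snippet answers): for jointly normal answers, a grid search over the conditional pdf recovers the closed-form conditional mean derived later.

```python
import numpy as np
from scipy.stats import multivariate_normal

mu = np.array([0.0, 0.0])                 # means of (theta_1, theta_2)
Sigma = np.array([[1.0, 0.8],
                  [0.8, 1.0]])            # correlated snippet answers
observed = 1.5                            # past answer: theta_1 = 1.5

# Maximize f(theta_2 = t | theta_1 = observed) by scanning candidate values t.
grid = np.linspace(-4.0, 4.0, 8001)
joint = multivariate_normal(mu, Sigma)
dens = joint.pdf(np.column_stack([np.full_like(grid, observed), grid]))
print(grid[dens.argmax()])                # ~1.2 = 0.8 * 1.5, the conditional mean
```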

SLIDE 14

Inference: pdf


How to find the pdf?

§ Maximum entropy (ME) principle
§ h(f) = −∫ f(θ⃗) log f(θ⃗) dθ⃗, where θ⃗ = (θ̂_1, …, θ̂_n, θ̄_{n+1})

§ The joint pdf maximizing the above entropy differs depending on the kind of testable information given
§ Verdict uses the first- and second-order statistics of the random variables: means, variances, and covariances (the resulting pdf is stated below)
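For reference, the standard maximum-entropy fact the slide relies on (stated here, not taken from the deck): with only means, variances, and covariances as testable information, the entropy-maximizing joint pdf is the multivariate normal.

```latex
% Among all joint densities with mean vector \mu and covariance matrix \Sigma,
% h(f) = -\int f(\vec{\theta}) \log f(\vec{\theta}) \, d\vec{\theta}
% is maximized by the multivariate normal density:
f(\vec{\theta}) =
  \frac{1}{\sqrt{(2\pi)^{n+1} \lvert \Sigma \rvert}}
  \exp\!\left( -\tfrac{1}{2} (\vec{\theta} - \mu)^{\top} \Sigma^{-1} (\vec{\theta} - \mu) \right)
```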

SLIDE 15

Inference: pdf


SLIDE 16

Inference: model-based answer and error


Generally, computing the above conditional pdf may be a computationally expensive task

SLIDE 17

Inference: model-based answer and error


However, computing the conditional pdf in Lemma 1 is inexpensive: the result is another normal distribution, whose mean μ̄_{n+1} and variance σ̄²_{n+1} are given by the standard Gaussian conditioning formulas (sketched below).
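A numpy sketch of that conditioning step (the partitioned-Gaussian formulas are standard; the helper name is mine): with Σ partitioned over the past answers and the new answer, μ̄ = μ₂ + Σ₂₁Σ₁₁⁻¹(θ̂ − μ₁) and σ̄² = Σ₂₂ − Σ₂₁Σ₁₁⁻¹Σ₁₂.

```python
import numpy as np

def condition_gaussian(mu, Sigma, observed):
    """Condition a joint normal on its first len(observed) coordinates and
    return the conditional mean and variance of the last coordinate."""
    k = len(observed)
    mu1, mu2 = mu[:k], mu[k:]
    S11, S12 = Sigma[:k, :k], Sigma[:k, k:]
    S21, S22 = Sigma[k:, :k], Sigma[k:, k:]
    gain = S21 @ np.linalg.solve(S11, np.eye(k))   # Sigma_21 Sigma_11^{-1}
    cond_mean = mu2 + gain @ (observed - mu1)
    cond_var = S22 - gain @ S12
    return cond_mean.item(), cond_var.item()

# Matches the grid search above: correlation 0.8, observed 1.5 -> mean 1.2.
mean, var = condition_gaussian(np.zeros(2),
                               np.array([[1.0, 0.8], [0.8, 1.0]]),
                               np.array([1.5]))
print(mean, var)   # 1.2 0.36
```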

SLIDE 18

Inference: model-based answer and error


§ Model-based answer
  § θ̈_{n+1} = μ̄_{n+1}
§ Model-based error
  § β̈_{n+1} = σ̄_{n+1}
§ Improved answer and error (see the selection sketch below)
  § (θ̆_{n+1}, β̆_{n+1}) = (θ̈_{n+1}, β̈_{n+1}) if validation succeeds
  § (θ̆_{n+1}, β̆_{n+1}) = (θ_{n+1}, β_{n+1}) if validation fails (return the AQP answers)
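A sketch of the selection step (the specific validation rule here is my assumption; the deck only says validation succeeds or fails): keep the model-based answer when it is consistent with the raw AQP estimate, otherwise fall back to the AQP answer.

```python
# Hypothetical validation rule: accept the model-based answer only if it lies
# within a few error bounds of the raw AQP estimate and tightens the error.
def improved_answer(aqp_answer, aqp_error, model_answer, model_error,
                    num_bounds=2.0):
    consistent = abs(model_answer - aqp_answer) <= num_bounds * aqp_error
    if consistent and model_error <= aqp_error:
        return model_answer, model_error   # validation succeeded
    return aqp_answer, aqp_error           # validation failed: AQP fallback

print(improved_answer(100.0, 10.0, 98.0, 4.0))   # (98.0, 4.0): model-based wins
print(improved_answer(100.0, 10.0, 150.0, 4.0))  # (100.0, 10.0): AQP fallback
```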

SLIDE 19

Inference: means, variances, and covariances


§ Mean (θ⃗)
  § Use the arithmetic mean of the past query answers as the mean of each random variable θ_1, …, θ_n, θ_{n+1}
§ Variances and covariances (Σ)
  § The covariance between two query snippet answers is computable using the covariances between the attribute values involved in computing those answers

SLIDE 20

Inference: means, variances, and covariances


Day | Income (CAD)
1   | 100
2   | 200
3   | 400
4   | 800
5   | 1600

old avg and new avg (the running averages before and after the newest day) are correlated
SLIDE 21

Inference: means, variances, and covariances


Day | Income (CAD) | Income (CAD)
1   | 100          | 100
2   | 200          | 200
3   | 400          | 400
4   | 800          | 800
5   | 1600         | 1600

Inter-tuple Covariances

SLIDE 22

Inference: means, variances, and covariances


§ Estimate the inter-tuple covariances
  § Analytical covariance functions
  § Squared exponential covariance functions: capable of approximating any continuous target function arbitrarily closely as the number of observations (here, query answers) increases
  § Compute variances and covariances (Σ) efficiently (see the sketch below)
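A sketch of a squared exponential covariance function (the lengthscale and scale values are assumptions for illustration): nearby tuples get high covariance, and the resulting matrix can be evaluated analytically for any pair of attribute values.

```python
import numpy as np

def se_cov(x1, x2, lengthscale=1.0, scale=1.0):
    """Squared exponential kernel: k(x, x') = s^2 * exp(-(x - x')^2 / (2 l^2))."""
    return scale ** 2 * np.exp(-((x1 - x2) ** 2) / (2.0 * lengthscale ** 2))

# Inter-tuple covariances for the day attribute: days close together co-vary
# strongly, distant days only weakly.
days = np.arange(1, 6, dtype=float)
K = se_cov(days[:, None], days[None, :], lengthscale=2.0)
print(np.round(K, 3))   # 5x5 matrix: ones on the diagonal, decaying off it
```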

SLIDE 23

Experiments


§ Up to 23× speedup for the same accuracy level
§ Small memory and computational overhead

SLIDE 24

Summary


§ An idea: Database Learning
  § Learning from past query answers
§ An implementation: Verdict
  § Given means, variances, and covariances
  § Apply the maximum entropy principle
  § Find a joint probability distribution function
  § Improve the answer and error by conditioning on past snippet answers
  § https://verdictdb.org

SLIDE 25

Q & A


§ Using testable information other than, or in addition to, means, variances, and covariances?
§ Are there any other possible inferential techniques?
§ Can we cut out the training phase?