
Database Learning: Toward a Database that Becomes Smarter Every Time



  1. Database Learning: Toward a Database that Becomes Smarter Every Time
     Presented by: Huanyi Chen

  2. Where does the data come from?
     § Real world
     § The entire dataset follows a certain underlying distribution

  3. Income of a shop

     # of Day   Income (CAD)
     1          100
     2          200
     3          400
     4          800
     5          1600

     [Chart: Income of a shop per day; y-axis Income (CAD), x-axis days 1 to 5]

  4. Income of a shop

     # of Day   Income (CAD)
     1          100
     2          200
     3          400
     4          800
     5          1600
     6          ?

     [Chart: Income of a shop per day; y-axis Income (CAD), x-axis days 1 to 5]

  5. Income of a shop
     § $\text{Income} = 50 \cdot 2^n$, $(n = 1, 2, 3, \dots)$
     § No database is needed if we can find the underlying distribution

     [Chart: Income of a shop per day; y-axis Income (CAD), x-axis days 1 to 7]

  6. Which distribution do we care about?
     § The exact underlying distribution that generates the entire dataset and future unseen data? Not possible.
     § An exact underlying distribution that generates the entire dataset but excludes future unseen data? It brings no benefit.
       § One can always build a model from every value of a column, but such a model cannot predict anything. We would still need to store future data in order to answer queries.
     § A possible distribution that generates the entire dataset and future unseen data!

  7. Mismatching data
     § A possible distribution that generates the entire dataset and future unseen data cannot match every data point in the dataset
     § Does not work when exact query results are needed
     § Works in approximate query processing (AQP)

  8. Approximate Query Processing (AQP)
     § Trades accuracy for response time
     § Results are based on samples
     § Previous query results do not help future queries
     § Hence Database Learning: learning from past query answers!
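
A minimal sketch of what a sample-based answer looks like, assuming a uniform random sample and the usual standard-error estimate for a mean. The function name `approximate_avg` and the synthetic `incomes` column are hypothetical and only serve the illustration; they do not come from any AQP engine named in the slides.

```python
import random
import statistics


def approximate_avg(values, sample_size, seed=0):
    """Estimate AVG(values) from a uniform sample; return (answer, error)."""
    rng = random.Random(seed)                 # fixed seed keeps the sketch repeatable
    sample = rng.sample(values, sample_size)  # uniform sample without replacement
    answer = statistics.fmean(sample)
    # Standard error of the sample mean: s / sqrt(sample size).
    error = statistics.stdev(sample) / sample_size ** 0.5
    return answer, error


# Synthetic column standing in for a large table (an assumption for illustration).
incomes = [100 + (i * 37) % 1500 for i in range(100_000)]
answer, error = approximate_avg(incomes, sample_size=1_000)
exact = statistics.fmean(incomes)
print(f"approximate AVG = {answer:.1f} +/- {error:.1f}, exact AVG = {exact:.1f}")
```

The sketch reads only 1% of the rows, which is the accuracy-for-response-time trade the slide describes. Each call starts from scratch, so earlier answers contribute nothing to later ones.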

  9. Database Learning Engine: Verdict
     Target Workflow
     § Improve future query answers by using previous query answers from an AQP engine

  10. Verdict
      § A query is decomposed into possibly multiple query snippets
      § The answer of a snippet is a single scalar value
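
One way to picture the decomposition, as a sketch only: the slides state that each snippet returns a single scalar, and the rule used here (one snippet per aggregate and per group value) is an assumption about how a grouped query could be split to satisfy that. `Snippet`, `decompose`, and the `shop` table with its `city` column are hypothetical names.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Snippet:
    """One scalar-valued piece of a query: one aggregate over one group."""
    aggregate: str
    table: str
    predicate: str


def decompose(aggregates, table, group_column, group_values):
    """Split a grouped aggregate query into one snippet per (aggregate, group)."""
    return [
        Snippet(agg, table, f"{group_column} = {value!r}")
        for agg, value in product(aggregates, group_values)
    ]


# Hypothetical query:
#   SELECT AVG(income), SUM(income) FROM shop GROUP BY city
snippets = decompose(["AVG(income)", "SUM(income)"], "shop", "city", ["Waterloo", "Toronto"])
for s in snippets:
    print(f"SELECT {s.aggregate} FROM {s.table} WHERE {s.predicate}")
```

Two aggregates over two groups yield four snippets, each of which evaluates to one number.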

  11. Verdict
      § A query is decomposed into possibly multiple query snippets
      § Verdict exploits potential correlations between snippet answers to infer the answer of a new snippet

      # of Day   Income (CAD)
      1          100
      2          200
      3          400
      4          800
      5          1600

      The slide labels an "old avg" and a "new avg" over rows of this table; old avg and new avg are correlated.

  12. Inference
      § Observations + Rules = Prediction
      § Shop income
        Observations: 100, 200, 400, 800, 1600
        Rules: $\text{Income} = 50 \cdot 2^n$
        Prediction: 3200, 6400, …
      § Fibonacci
        Observations: Initial: 1, 1
        Rules: $F_n = F_{n-1} + F_{n-2}$
        Prediction: 2, 3, 5, 8, …
      § Verdict
        Observations: Past snippet answers + AQP answer for the new snippet
        Rules: Maximize the conditional joint probability distribution function (pdf)
        Prediction: Improved answer and error for the new snippet
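
The first two rows of this scheme can be run directly. The sketch below is illustrative only; `extrapolate`, `income_rule`, and `fibonacci_rule` are hypothetical names, and a rule is modelled as a function from the values seen so far to the next value.

```python
def extrapolate(observations, rule, count):
    """Extend observations by applying rule(values_so_far) count times."""
    values = list(observations)
    for _ in range(count):
        values.append(rule(values))
    return values[len(observations):]  # only the newly predicted values


def income_rule(values):
    """Shop income on day n is 50 * 2**n; the next day is len(values) + 1."""
    return 50 * 2 ** (len(values) + 1)


def fibonacci_rule(values):
    """Each Fibonacci term is the sum of the previous two."""
    return values[-1] + values[-2]


print(extrapolate([100, 200, 400, 800, 1600], income_rule, 2))  # [3200, 6400]
print(extrapolate([1, 1], fibonacci_rule, 4))                   # [2, 3, 5, 8]
```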

  13. Inference: pdf
      If we have the joint pdf
      $f(\theta_1 = \theta'_1, \dots, \theta_{n+1} = \theta'_{n+1}, \bar{\theta}_{n+1} = \bar{\theta}'_{n+1})$,
      then the prediction is the value of $\bar{\theta}'_{n+1}$ that maximizes the conditional pdf
      $f(\bar{\theta}_{n+1} = \bar{\theta}'_{n+1} \mid \theta_1 = \theta'_1, \dots, \theta_{n+1} = \theta'_{n+1})$.

  14. Inference: pdf
      § How to find the pdf?
        § Maximum entropy (ME) principle
        § $h(f) = -\int f(\vec{\theta}) \cdot \log f(\vec{\theta}) \, d\vec{\theta}$, where $\vec{\theta} = (\theta_1, \dots, \theta_{n+1}, \bar{\theta}_{n+1})$
      § The joint pdf maximizing the above entropy differs depending on the kinds of given testable information
      § Verdict uses the first- and second-order statistics of the random variables: means, variances, and covariances.
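
As background (a standard result, stated here rather than taken from the slides): among all densities with a given mean vector and covariance matrix, the multivariate normal has the largest entropy. Writing $\vec{\mu}$ and $\Sigma$ for the given means and covariances, and noting that $\vec{\theta}$ has $n+2$ components, the density takes the form:

```latex
f(\vec{\theta}) = \frac{1}{\sqrt{(2\pi)^{n+2} \, |\Sigma|}}
  \exp\!\left( -\tfrac{1}{2} (\vec{\theta} - \vec{\mu})^{\top} \Sigma^{-1} (\vec{\theta} - \vec{\mu}) \right)
```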

  15. Inference: pdf

  16. Inference: model-based answer and error
      In general, computing the above conditional pdf may be computationally expensive.

  17. Inference: model-based answer and error
      However, the conditional pdf in Lemma 1 is inexpensive to compute; the result is another normal distribution, with mean $\mu_c$ and variance $\sigma_c^2$ available in closed form.
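
A minimal sketch of why this step is cheap, using the textbook conditioning identity for a multivariate normal: the conditional mean is `mu_target + k^T S^-1 (observed - mu_obs)` and the conditional variance is `var_target - k^T S^-1 k`. This is the general identity, not necessarily the exact expression on the slide. The names `solve`, `condition_gaussian`, `cov_obs` (S), and `cov_cross` (k) are assumptions introduced for the illustration.

```python
def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def condition_gaussian(mu_obs, mu_target, cov_obs, cov_cross, var_target, observed):
    """Mean and variance of the target variable given the observed variables.

    cov_obs   : covariance matrix of the observed variables (symmetric)
    cov_cross : covariances between each observed variable and the target
    """
    weights = solve(cov_obs, cov_cross)  # S^-1 k, valid because S is symmetric
    diffs = [o - m for o, m in zip(observed, mu_obs)]
    mean_c = mu_target + sum(w * d for w, d in zip(weights, diffs))
    var_c = var_target - sum(w * k for w, k in zip(weights, cov_cross))
    return mean_c, var_c


# One observed variable with variance 2, covariance 1 with the target.
mean_c, var_c = condition_gaussian([0.0], 0.0, [[2.0]], [1.0], 1.0, [2.0])
print(mean_c, var_c)  # 1.0 0.5
```

The cost is one linear solve over the observed variables, and the conditional variance never exceeds the prior variance of the target.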

  18. Inference: model-based answer and error
      § Model-based answer
        § $\ddot{\theta}_{n+1} = \mu_c$
      § Model-based error
        § $\ddot{\beta}_{n+1} = \sigma_c$
      § Improved answer and error
        § $(\hat{\theta}_{n+1}, \hat{\beta}_{n+1}) = (\ddot{\theta}_{n+1}, \ddot{\beta}_{n+1})$ (if validation succeeds)
        § $(\hat{\theta}_{n+1}, \hat{\beta}_{n+1}) = (\theta_{n+1}, \beta_{n+1})$ (if validation fails, return the AQP answers)
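
The selection rule on this slide can be sketched as a small function. The slides do not describe the validation test itself, so it is passed in here as a boolean; `improved_answer` and its parameter names are hypothetical.

```python
def improved_answer(raw, raw_error, model, model_error, validation_ok):
    """Return (answer, error): the model-based pair if validation succeeded,
    otherwise the raw pair from the AQP engine."""
    if validation_ok:
        return model, model_error
    return raw, raw_error


# Illustrative numbers only, not results from the slides.
print(improved_answer(620.0, 40.0, 605.0, 12.0, validation_ok=True))   # (605.0, 12.0)
print(improved_answer(620.0, 40.0, 605.0, 12.0, validation_ok=False))  # (620.0, 40.0)
```

Falling back to the raw pair means a failed validation leaves the user with exactly what the AQP engine returned.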

  19. Inference: means, variances, and covariances
      § Mean ($\vec{\mu}$)
        § The arithmetic mean of the past query answers is used as the mean of each random variable $\theta_1, \dots, \theta_{n+1}, \bar{\theta}_{n+1}$.
      § Variances and covariances ($\Sigma$)
        § The covariance between two query snippet answers is computable from the covariances between the attribute values involved in computing those answers

  20. Inference: means, variances, and covariances

      # of Day   Income (CAD)
      1          100
      2          200
      3          400
      4          800
      5          1600

      The slide labels an "old avg" and a "new avg" over rows of this table; old avg and new avg are correlated.

  21. Inference: means, variances, and covariances
      Inter-tuple covariances

      # of Day   Income (CAD)   Income (CAD)
      1          100            100
      2          200            200
      3          400            400
      4          800            800
      5          1600           1600

  22. Inference: means, variances, and covariances
      § Estimate the inter-tuple covariances
        § Analytical covariance functions
        § Squared exponential covariance functions: capable of approximating any continuous target function arbitrarily closely as the number of observations (here, query answers) increases
        § Compute variances and covariances ($\Sigma$) efficiently
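
A minimal sketch of a squared exponential covariance function in its common textbook form, `k(x1, x2) = sigma^2 * exp(-(x1 - x2)^2 / (2 * length^2))`. The exact parameterization used by Verdict is not given in the slides, so the `sigma` and `length` parameters and both function names are assumptions.

```python
import math


def squared_exponential(x1, x2, sigma=1.0, length=1.0):
    """k(x1, x2) = sigma^2 * exp(-(x1 - x2)^2 / (2 * length^2))."""
    return sigma ** 2 * math.exp(-((x1 - x2) ** 2) / (2 * length ** 2))


def covariance_matrix(points, sigma=1.0, length=1.0):
    """Covariance between every pair of points (here, day numbers)."""
    return [[squared_exponential(a, b, sigma, length) for b in points] for a in points]


days = [1, 2, 3, 4, 5]
for row in covariance_matrix(days, sigma=1.0, length=2.0):
    print(" ".join(f"{value:.3f}" for value in row))
```

Nearby days get a covariance close to `sigma^2` and distant days a covariance close to zero, which is what lets answers over nearby ranges inform one another.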

  23. Experiments
      § Up to 23× speedup for the same accuracy level
      § Small memory and computational overhead

  24. Summary
      § An idea: Database Learning
        § Learning from past query answers
      § An implementation: Verdict
        § Given means, variances, and covariances
        § Apply the maximum entropy principle
        § Find a joint probability distribution function
        § Improve the answer and error by conditioning on snippet answers
      § https://verdictdb.org

  25. Q & A
      § Could we use testable information other than, or in addition to, means, variances, and covariances?
      § Are there any other possible inferential techniques?
      § Can we cut out the training phase?
