a probabilis c approach to data summariza on
play

A Probabilis+c Approach to Data Summariza+on Laurel Orr, Magdalena - PowerPoint PPT Presentation

A Probabilis+c Approach to Data Summariza+on Laurel Orr, Magdalena Balazinska, and Dan Suciu DB Research Day 2015 1 Original Database (All Flights in US) What are the most popular flights? Flights from Los Angeles FAST to San Diego Summary


  1. A Probabilis+c Approach to Data Summariza+on Laurel Orr, Magdalena Balazinska, and Dan Suciu DB Research Day 2015 1

  2. Original Database (All Flights in US) What are the most popular flights? Flights from Los Angeles FAST to San Diego Summary APPROXIMATE 2

  3. Exis+ng Summariza+on Techniques Sampling Random, Stra+fied, Weighted,… Aggrega-on Online, Error Bounded (BlinkDB) AVG MIN MAX COUNT(*) 3

  4. Aggrega-on Sampling Flights (origin, des+na+on, fl_+me, …) ~ 2.6 GB SELECT origin, COUNT(*) FROM Flights GROUP BY origin; Full Query Time: 20 sec SELECT * FROM Flights WHERE origin=‘SEATTLE, WA’ LIMIT 10; Full Query Time: 0.4 sec SELECT origin, COUNT(*) FROM Flights WHERE dest = ‘LAUREL, MS’ AND fl_time < 120 GROUP BY origin; Full Query Time: 30 sec 4

  5. IDEA Find a compact, probabilis)c representa+on of our database Flights with high = Popular probability of existence By knowing the probability of rela+ons and tuples, we can answer queries probabilis+cally 5

  6. The Simplest Summary Assume there is some concrete rela+on R(A, B), and you summarized R by its ac+ve domain and cardinality. Given this summary alone, what are the possible rela+ons R could have been (possible worlds of R)? 6

  7. Possible World Seman+cs ac+ve domain A A A A A B a 1 a 1 a 2 a 2 4 a 1 b 1 a 1 a 2 a 1 a 2 a 2 b 2 X n = 2 B B B B slofed instance b 1 4 b 1 b 2 b 2 A B b 1 b 2 b 1 b 2 id1 Possible Instances = 16 id2 Pr( I ) = 1 X Pr( I ) = 1 16 I ∈ P W D set of all possible instances (stand for Possible WorlDs) 7

  8. Possible World Seman+cs ac+ve domain A A A A A B a 1 a 1 a 2 a 2 4 a 1 b 1 a 1 a 2 a 1 a 2 a 2 b 2 X n = 2 B B B B slofed instance b 1 4 b 1 b 2 b 2 A B b 1 b 2 b 1 b 2 id1 Possible Instances = 16 id2 16 = 7 1 X Pr(( a 1 , b 1 )) = 16 Tuple Probability I ∈ P W D ( a 1 ,b 1 ) ∈ I 8

  9. Adding Constraints probabilis+c instance ac+ve domain |σ R.A = a1 (R)| = 70 E [ | σ I.A = a 1 ( I ) | ] = 70 A B a 1 b 1 |σ R.A = a2 (R)| = 30 E [ | σ I.A = a 2 ( I ) | ] = 30 a 2 b 2 … . . . n = 100 E [ | σ I.A = a 1 ∧ I.B = b 1 ( I ) | ] = 40 |σ R.A = a1 ^ R.B = b1 (R)| = 40 X | σ I.A = a 1 ( I ) | Pr( I ) = 70 I ∈ P W D How can we solve for X | σ I.A = a 2 ( I ) | Pr( I ) = 30 Pr(I)? I ∈ P W D . . . X | σ I.A = a 1 ∧ I.B = b 1 ( I ) | Pr( I ) = 40 I ∈ P W D 9

  10. Principle of Maximum Entropy The Principle of Maximum Entropy states that subject to prior data, the probability distribu+on which best represents the state of knowledge is the one that has the largest entropy In other words, you want to maximize X Pr ( I ) ∗ log( Pr ( I )) − I ∈ P W D over all possible worlds 10

  11. More Formally R(A 1 , …, A m ), |R| = n D i = dis+nct domain of A i, Tup = {D 1 x D 2 x … x D m }, Φ = set of equality predicates φ | σ φ ( R ) | Pr ( I ) = P − n Y α φ φ ∈ Φ X Y P = α φ t ∈ T up φ ∈ Φ | φ ( t )= true all possible tuples in our ac+ve domain 11

  12. To include constraints on each φ s R ( φ ) = | σ φ ( R ) | = E [ | σ φ ( I ) | ] We can show deriva+ve of P with s R ( φ ) = n α φ P α φ respect to α φ P To solve, maximize the poten+al func+on by gradient descent X ln( α φ ) s R ( φ ) − ln( P n ) Ψ = φ ∈ Φ 12

  13. Query Transforma+on Aggregates: take expected value SELECT origin, COUNT(*) GROUP BY + COUNT(*) FROM Flights GROUP BY origin; For each origin o E[|σ origin=o (Flights)|] φ E [ | σ φ ( I ) | ] = n α φ P α φ P SELECT origin, E[| σ origin (Flights)|] An equa+on in terms of the α’s FROM Flights, alpha_origin,... we have calculated and stored WHERE origin=alphas.origin GROUP BY origin; 13

  14. Op+miza+ons X Y P = α φ t ∈ T up φ ∈ Φ | φ ( t )= true all possible tuples in our ac+ve domain 1. Factorize P (solve 1D predicates independently) 2. Add relevant 2+D predicates (ex: [A = a1 ^ B = b1]) 3. Remove tuples that don’t exist X Y P ∗ = P − α φ t ∈ ( T up − R ) φ ∈ Φ | φ ( t )= true 4. Change Basis (for correla+ons) new afribute AB = f(A, B) (ex: AB = A – B) 14

  15. Experiment with TPC-H SELECT order_date, ship_date, COUNT(*) FROM orders JOIN lineitem GROUP BY order_date, ship_date; Error = | Est − True | Est + True 15

  16. Change Basis: order_date – ship_date 16

  17. Conclusion • Introduced new way to summarize and approximately query massive datasets – Complements sampling and approximate aggrega+on • Allows fine grained control over which afributes and values get summarized • Encouraging preliminary results • S+ll need to befer address scalability and expand query language • Need to understand how best to choose sta+s+cs 17

Download Presentation
Download Policy: The content available on the website is offered to you 'AS IS' for your personal information and use only. It cannot be commercialized, licensed, or distributed on other websites without prior consent from the author. To download a presentation, simply click this link. If you encounter any difficulties during the download process, it's possible that the publisher has removed the file from their server.

Recommend


More recommend