A Probabilis+c Approach to Data Summariza+on Laurel Orr, Magdalena - PowerPoint PPT Presentation

A Probabilis+c Approach to Data Summariza+on Laurel Orr, Magdalena Balazinska, and Dan Suciu DB Research Day 2015 1

Original Database (All Flights in US) What are the most popular flights? Flights from Los Angeles FAST to San Diego Summary APPROXIMATE 2

Exis+ng Summariza+on Techniques Sampling Random, Stra+fied, Weighted,… Aggrega-on Online, Error Bounded (BlinkDB) AVG MIN MAX COUNT(*) 3

Aggrega-on Sampling Flights (origin, des+na+on, fl_+me, …) ~ 2.6 GB SELECT origin, COUNT(*) FROM Flights GROUP BY origin; Full Query Time: 20 sec SELECT * FROM Flights WHERE origin=‘SEATTLE, WA’ LIMIT 10; Full Query Time: 0.4 sec SELECT origin, COUNT(*) FROM Flights WHERE dest = ‘LAUREL, MS’ AND fl_time < 120 GROUP BY origin; Full Query Time: 30 sec 4

IDEA Find a compact, probabilis)c representa+on of our database Flights with high = Popular probability of existence By knowing the probability of rela+ons and tuples, we can answer queries probabilis+cally 5

The Simplest Summary Assume there is some concrete rela+on R(A, B), and you summarized R by its ac+ve domain and cardinality. Given this summary alone, what are the possible rela+ons R could have been (possible worlds of R)? 6

Possible World Seman+cs ac+ve domain A A A A A B a 1 a 1 a 2 a 2 4 a 1 b 1 a 1 a 2 a 1 a 2 a 2 b 2 X n = 2 B B B B slofed instance b 1 4 b 1 b 2 b 2 A B b 1 b 2 b 1 b 2 id1 Possible Instances = 16 id2 Pr( I ) = 1 X Pr( I ) = 1 16 I ∈ P W D set of all possible instances (stand for Possible WorlDs) 7

Possible World Seman+cs ac+ve domain A A A A A B a 1 a 1 a 2 a 2 4 a 1 b 1 a 1 a 2 a 1 a 2 a 2 b 2 X n = 2 B B B B slofed instance b 1 4 b 1 b 2 b 2 A B b 1 b 2 b 1 b 2 id1 Possible Instances = 16 id2 16 = 7 1 X Pr(( a 1 , b 1 )) = 16 Tuple Probability I ∈ P W D ( a 1 ,b 1 ) ∈ I 8

Principle of Maximum Entropy The Principle of Maximum Entropy states that subject to prior data, the probability distribu+on which best represents the state of knowledge is the one that has the largest entropy In other words, you want to maximize X Pr ( I ) ∗ log( Pr ( I )) − I ∈ P W D over all possible worlds 10

More Formally R(A 1 , …, A m ), |R| = n D i = dis+nct domain of A i, Tup = {D 1 x D 2 x … x D m }, Φ = set of equality predicates φ | σ φ ( R ) | Pr ( I ) = P − n Y α φ φ ∈ Φ X Y P = α φ t ∈ T up φ ∈ Φ | φ ( t )= true all possible tuples in our ac+ve domain 11

To include constraints on each φ s R ( φ ) = | σ φ ( R ) | = E [ | σ φ ( I ) | ] We can show deriva+ve of P with s R ( φ ) = n α φ P α φ respect to α φ P To solve, maximize the poten+al func+on by gradient descent X ln( α φ ) s R ( φ ) − ln( P n ) Ψ = φ ∈ Φ 12

Query Transforma+on Aggregates: take expected value SELECT origin, COUNT(*) GROUP BY + COUNT(*) FROM Flights GROUP BY origin; For each origin o E[|σ origin=o (Flights)|] φ E [ | σ φ ( I ) | ] = n α φ P α φ P SELECT origin, E[| σ origin (Flights)|] An equa+on in terms of the α’s FROM Flights, alpha_origin,... we have calculated and stored WHERE origin=alphas.origin GROUP BY origin; 13

Op+miza+ons X Y P = α φ t ∈ T up φ ∈ Φ | φ ( t )= true all possible tuples in our ac+ve domain 1. Factorize P (solve 1D predicates independently) 2. Add relevant 2+D predicates (ex: [A = a1 ^ B = b1]) 3. Remove tuples that don’t exist X Y P ∗ = P − α φ t ∈ ( T up − R ) φ ∈ Φ | φ ( t )= true 4. Change Basis (for correla+ons) new afribute AB = f(A, B) (ex: AB = A – B) 14

Experiment with TPC-H SELECT order_date, ship_date, COUNT(*) FROM orders JOIN lineitem GROUP BY order_date, ship_date; Error = | Est − True | Est + True 15

Change Basis: order_date – ship_date 16

Conclusion • Introduced new way to summarize and approximately query massive datasets – Complements sampling and approximate aggrega+on • Allows fine grained control over which afributes and values get summarized • Encouraging preliminary results • S+ll need to befer address scalability and expand query language • Need to understand how best to choose sta+s+cs 17

A Probabilis+c Approach to Data Summariza+on Laurel Orr, Magdalena - PowerPoint PPT Presentation

A Probabilis+c Approach to Data Summariza+on Laurel Orr, Magdalena Balazinska, and Dan Suciu DB Research Day 2015 1 Original Database (All Flights in US) What are the most popular flights? Flights from Los Angeles FAST to San Diego Summary

Mul$-Document Summariza$on DELIVERABLE 4: CONTENT REALIZATION AND

An An aut automated ed pr probabilis babilistic ic alg algorit ithm hm for the he de

Probabilis)c Reasoning for Assembly-Based 3D Modeling

Crea%ng Consistent Scene Graphs Using a Probabilis%c Grammar 1

Func%onal Probabilis%c Programming CUFP 2013 Avi Pfeffer Charles

Probability and Statistics for Computer Science Probabilis+c analysis is mathema+cal, but

DataCamp Data Types for Data Science DataCamp Data Types for Data Science Data types Data type

A Machine Learning Approach A Machine Learning Approach A Machine Learning Approach A Machine

An NFR Pattern Approach to Dealing An NFR Pattern Approach to Dealing An NFR Pattern Approach to

Object oriented Object oriented Object oriented Object oriented approach and UML approach and

Outcome Based Approach in Outcome Based Approach in Outcome Based Approach in Outcome Based

Why probabili)es? Document representa)on is uncertain Understanding

Introduc:on protocol ? Iden:fica:on based on payload Payload

Applica'on: Diagnosis Daphne Koller Medical Diagnosis: Pathfinder (1992) Help

Probability and Statistics for Computer Science (II)

Probability and Statistics for Computer Science

Eligibility and Affordability Gaps for Californias Uninsured Laurel Lucia Director, Health

Sma marter Balanced Summa mmative Assessme ments 2018 Prelimi minary Report Jammie Behrendt,

FY 20-21 CHALLENGES & OPPORTUNITIES Presenters: Steve Mokrohisky, County Administrator

MULTIMESSENGER IMPLICATIONS OF BNS DETECTIONS GERGELY D LYA ELTE Dept. of Atomic Physics ELFT

Morgantown Regional Bike & Morgantown Regional Bike and Pedestrian Transportation Plan

Graph Coloring and Machine Proofs in Computer Science, 1977-2017 Andrew W. Appel Princeton

Why Mixed Effects Models? Mixed Effects Models Recap/Intro Three issues with ANOVA

Estimating Urban CO 2 Fluxes at High Spatial Resolution from In Situ Observations Alexis

A Probabilis+c Approach to Data Summariza+on Laurel Orr, Magdalena - PowerPoint PPT Presentation

A Probabilis+c Approach to Data Summariza+on Laurel Orr, Magdalena Balazinska, and Dan Suciu DB Research Day 2015 1 Original Database (All Flights in US) What are the most popular flights? Flights from Los Angeles FAST to San Diego Summary

Mul$-Document Summariza$on DELIVERABLE 4: CONTENT REALIZATION AND

An An aut automated ed pr probabilis babilistic ic alg algorit ithm hm for the he de

Probabilis)c Reasoning for Assembly-Based 3D Modeling

Crea%ng Consistent Scene Graphs Using a Probabilis%c Grammar 1

Func%onal Probabilis%c Programming CUFP 2013 Avi Pfeffer Charles

Probability and Statistics for Computer Science Probabilis+c analysis is mathema+cal, but

DataCamp Data Types for Data Science DataCamp Data Types for Data Science Data types Data type

A Machine Learning Approach A Machine Learning Approach A Machine Learning Approach A Machine

An NFR Pattern Approach to Dealing An NFR Pattern Approach to Dealing An NFR Pattern Approach to

Object oriented Object oriented Object oriented Object oriented approach and UML approach and

Outcome Based Approach in Outcome Based Approach in Outcome Based Approach in Outcome Based

Why probabili)es? Document representa)on is uncertain Understanding

Introduc:on protocol ? Iden:fica:on based on payload Payload

Applica'on: Diagnosis Daphne Koller Medical Diagnosis: Pathfinder (1992) Help

Probability and Statistics for Computer Science (II)

Probability and Statistics for Computer Science

Eligibility and Affordability Gaps for Californias Uninsured Laurel Lucia Director, Health

Sma marter Balanced Summa mmative Assessme ments 2018 Prelimi minary Report Jammie Behrendt,

FY 20-21 CHALLENGES &amp; OPPORTUNITIES Presenters: Steve Mokrohisky, County Administrator

MULTIMESSENGER IMPLICATIONS OF BNS DETECTIONS GERGELY D LYA ELTE Dept. of Atomic Physics ELFT

Morgantown Regional Bike &amp; Morgantown Regional Bike and Pedestrian Transportation Plan

Graph Coloring and Machine Proofs in Computer Science, 1977-2017 Andrew W. Appel Princeton

Why Mixed Effects Models? Mixed Effects Models Recap/Intro Three issues with ANOVA

Estimating Urban CO 2 Fluxes at High Spatial Resolution from In Situ Observations Alexis

FY 20-21 CHALLENGES & OPPORTUNITIES Presenters: Steve Mokrohisky, County Administrator

Morgantown Regional Bike & Morgantown Regional Bike and Pedestrian Transportation Plan