A Probabilistic Approach to Data Summarization
Laurel Orr, Magdalena Balazinska, and Dan Suciu
DB Research Day 2015
Original Database (All Flights in US)
What are the most popular flights?
Flights from Los Angeles to San Diego
Summary
FAST
AVG MIN MAX COUNT(*)
SELECT origin, COUNT(*) FROM Flights GROUP BY origin;
Full Query Time: 30 sec
SELECT * FROM Flights WHERE origin = 'SEATTLE, WA' LIMIT 10;
Full Query Time: 0.4 sec
SELECT origin, COUNT(*) FROM Flights WHERE dest = 'LAUREL, MS' AND fl_time < 120 GROUP BY origin;
Full Query Time: 20 sec
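The three queries above can be tried on a toy table to see what each one returns. This is a minimal sketch using Python's built-in sqlite3 module. The rows are invented for illustration and are not taken from the 2.6 GB Flights data, so the timings on the slide will not be reproduced.

```python
import sqlite3

# Hypothetical rows standing in for the Flights relation (origin, dest, fl_time).
rows = [
    ("SEATTLE, WA", "LAUREL, MS", 100),
    ("SEATTLE, WA", "SAN DIEGO, CA", 150),
    ("LOS ANGELES, CA", "SAN DIEGO, CA", 45),
    ("LOS ANGELES, CA", "LAUREL, MS", 110),
    ("LOS ANGELES, CA", "LAUREL, MS", 200),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Flights (origin TEXT, dest TEXT, fl_time INTEGER)")
conn.executemany("INSERT INTO Flights VALUES (?, ?, ?)", rows)

# Query 1: number of flights per origin (scans the whole table).
counts_by_origin = dict(
    conn.execute("SELECT origin, COUNT(*) FROM Flights GROUP BY origin")
)

# Query 2: a small sample of rows for one origin (can stop early thanks to LIMIT).
sample = conn.execute(
    "SELECT * FROM Flights WHERE origin = 'SEATTLE, WA' LIMIT 10"
).fetchall()

# Query 3: per-origin counts restricted by destination and flight time.
filtered = dict(
    conn.execute(
        "SELECT origin, COUNT(*) FROM Flights "
        "WHERE dest = 'LAUREL, MS' AND fl_time < 120 GROUP BY origin"
    )
)

print(counts_by_origin)
print(sample)
print(filtered)
```

The two aggregate queries must touch every qualifying row, which is consistent with the slide reporting them as much slower than the LIMIT query.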
Flights (origin, destination, fl_time, …) ~ 2.6 GB
A a1 a1 A a1 a2 A a2 a1 A a2 a2 B b1 b1 B b1 b2 B b2 b1 B b2 b2
I ∈ PWD
A a1 a2 B b1 b2
A B id1 id2
A a1 a1 A a1 a2 A a2 a1 A a2 a2 B b1 b1 B b1 b2 B b2 b1 B b2 b2
A a1 a2 B b1 b2
Tuple Probability A B id1 id2
probabilistic instance
A a1 a2 B b1 b2
I ∈ PWD
I ∈ PWD
I ∈ PWD
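The possible-worlds view on these slides can be made concrete with a small enumeration. The sketch below assumes the index reads I ∈ PWD (the set of possible instances), that A has domain {a1, a2}, B has domain {b1, b2}, and that an instance assigns one tuple to each of the ids id1 and id2. The uniform distribution over instances is purely an assumption for illustration; the slides do not state which distribution is used.

```python
from fractions import Fraction
from itertools import product

dom_A = ["a1", "a2"]
dom_B = ["b1", "b2"]
n = 2  # number of tuple ids (id1, id2)

# All possible tuples over (A, B).
tup = list(product(dom_A, dom_B))

# All possible instances: one tuple per id, so |Tup|^n worlds.
pwd = list(product(tup, repeat=n))

# Assumption: every instance is equally likely.
prob = {I: Fraction(1, len(pwd)) for I in pwd}


def expected_count(pred):
    """E[|sigma_pred(I)|] = sum over I in PWD of Pr(I) * (tuples of I satisfying pred)."""
    return sum(p * sum(1 for t in I if pred(t)) for I, p in prob.items())


e_a1 = expected_count(lambda t: t[0] == "a1")
e_a1_b1 = expected_count(lambda t: t == ("a1", "b1"))
print(len(pwd), e_a1, e_a1_b1)
```

Enumeration is only feasible for toy domains, since the number of instances grows as |Tup|^n.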
all possible tuples in
SELECT origin, COUNT(*) FROM Flights GROUP BY origin;
SELECT origin, E[|σ_origin(Flights)|] FROM Flights, alpha_origin,... WHERE origin=alphas.origin GROUP BY origin;
GROUP BY + COUNT(*): for each origin o, E[|σ_{origin=o}(Flights)|], where φ denotes the predicate origin = o
An equation in terms of the α's we have calculated and stored
all possible tuples in
t ∈ (Tup − R)
φ ∈ Φ | φ(t) = true
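One way to read the index fragments above is as a sum over tuples t of a product of α_φ over the statistics φ that hold on t. The sketch below illustrates that reading under explicit assumptions. Each φ is a single attribute = value predicate with a stored positive α. The n tuples are drawn independently with probability proportional to their weight. The restriction to Tup − R is not modeled. The domains, α values, and n are all hypothetical, and this is not presented as the authors' exact formula.

```python
from itertools import product
from math import prod

# Hypothetical domains and stored alpha values, one per (attribute, value) statistic.
domains = {"origin": ["SEA", "LAX"], "dest": ["SAN", "LAX", "SEA"]}
alphas = {
    ("origin", "SEA"): 1.0,
    ("origin", "LAX"): 3.0,
    ("dest", "SAN"): 2.0,
    ("dest", "LAX"): 1.0,
    ("dest", "SEA"): 1.0,
}
n = 100  # assumed number of tuples in the relation

attrs = list(domains)
# All possible tuples in Tup, as dicts from attribute to value.
tup = [dict(zip(attrs, vals)) for vals in product(*domains.values())]


def weight(t):
    """Product of alpha_phi over the statistics phi with phi(t) = true."""
    return prod(a for (attr, val), a in alphas.items() if t[attr] == val)


# Normalizing sum over all possible tuples.
total = sum(weight(t) for t in tup)


def expected_count(pred):
    """Expected number of tuples satisfying pred, computed from the alphas only."""
    return n * sum(weight(t) for t in tup if pred(t)) / total


# GROUP BY origin + COUNT(*): one expectation per origin value o.
by_origin = {
    o: expected_count(lambda t, o=o: t["origin"] == o) for o in domains["origin"]
}
print(by_origin)
```

The point of the sketch is that the answer depends only on the stored α values and n, so the base data is never scanned at query time.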
Change Basis: order_date – ship_date