A Probabilis+c Approach to Data Summariza+on Laurel Orr, Magdalena - - PowerPoint PPT Presentation

a probabilis c approach to data summariza on
SMART_READER_LITE
LIVE PREVIEW

A Probabilis+c Approach to Data Summariza+on Laurel Orr, Magdalena - - PowerPoint PPT Presentation

A Probabilis+c Approach to Data Summariza+on Laurel Orr, Magdalena Balazinska, and Dan Suciu DB Research Day 2015 1 Original Database (All Flights in US) What are the most popular flights? Flights from Los Angeles FAST to San Diego Summary


slide-1
SLIDE 1

A Probabilis+c Approach to Data Summariza+on

Laurel Orr, Magdalena Balazinska, and Dan Suciu DB Research Day 2015

1

slide-2
SLIDE 2

Original Database

(All Flights in US) Summary Flights from Los Angeles to San Diego

FAST APPROXIMATE

2

What are the most popular flights?

slide-3
SLIDE 3

Exis+ng Summariza+on Techniques

Sampling Aggrega-on

Random, Stra+fied, Weighted,…

AVG MIN MAX COUNT(*)

Online, Error Bounded (BlinkDB)

3

slide-4
SLIDE 4

Sampling Aggrega-on

SELECT origin, COUNT(*) FROM Flights GROUP BY origin; SELECT * FROM Flights WHERE origin=‘SEATTLE, WA’ LIMIT 10; SELECT origin, COUNT(*) FROM Flights WHERE dest = ‘LAUREL, MS’ AND fl_time < 120 GROUP BY origin;

Full Query Time: 30 sec Full Query Time: 0.4 sec Full Query Time: 20 sec

4

Flights (origin, des+na+on, fl_+me, …) ~ 2.6 GB

slide-5
SLIDE 5

IDEA Find a compact, probabilis)c representa+on of our database

Flights with high probability of existence = Popular

5

By knowing the probability of rela+ons and tuples, we can answer queries probabilis+cally

slide-6
SLIDE 6

The Simplest Summary

6

Assume there is some concrete rela+on R(A, B), and you summarized R by its ac+ve domain and cardinality. Given this summary alone, what are the possible rela+ons R could have been (possible worlds of R)?

slide-7
SLIDE 7

Possible World Seman+cs

A a1 a1 A a1 a2 A a2 a1 A a2 a2 B b1 b1 B b1 b2 B b2 b1 B b2 b2

4 4 X

7

Possible Instances = 16

X

I∈P W D

Pr(I) = 1

set of all possible instances (stand for Possible WorlDs)

Pr(I) = 1 16

A a1 a2 B b1 b2

n = 2 ac+ve domain

A B id1 id2

slofed instance

slide-8
SLIDE 8

Possible World Seman+cs

A a1 a1 A a1 a2 A a2 a1 A a2 a2 B b1 b1 B b1 b2 B b2 b1 B b2 b2

4 4 X

8

Possible Instances = 16

Pr((a1, b1)) = X

I∈P W D (a1,b1)∈I

1 16 = 7 16

A a1 a2 B b1 b2

n = 2 ac+ve domain

Tuple Probability A B id1 id2

slofed instance

slide-9
SLIDE 9

Adding Constraints

9

|σR.A = a1(R)| = 70 |σR.A = a2(R)| = 30 … |σR.A = a1 ^ R.B = b1(R)| = 40

probabilis+c instance

How can we solve for Pr(I)?

A a1 a2 B b1 b2

n = 100 ac+ve domain

E[|σI.A=a1(I)|] = 70 E[|σI.A=a2(I)|] = 30 . . . E[|σI.A=a1∧I.B=b1(I)|] = 40

X

I∈P W D

|σI.A=a1(I)| Pr(I) = 70 X

I∈P W D

|σI.A=a2(I)| Pr(I) = 30 . . . X

I∈P W D

|σI.A=a1∧I.B=b1(I)| Pr(I) = 40

slide-10
SLIDE 10

Principle of Maximum Entropy

The Principle of Maximum Entropy states that subject to prior data, the probability distribu+on which best represents the state of knowledge is the one that has the largest entropy In other words, you want to maximize

10

− X

I∈P W D

Pr(I) ∗ log(Pr(I))

  • ver all possible worlds
slide-11
SLIDE 11

More Formally

R(A1, …, Am), |R| = n Di = dis+nct domain of Ai, Tup = {D1 x D2 x … x Dm}, Φ = set of equality predicates φ

11

Pr(I) = P −n Y

φ∈Φ

α

|σφ(R)| φ

all possible tuples in

  • ur ac+ve domain

P = X

t∈T up

Y

φ∈Φ|φ(t)=true

αφ

slide-12
SLIDE 12

To include constraints on each φ

sR(φ) = |σφ(R)| = E[|σφ(I)|]

12

deriva+ve of P with respect to αφ

We can show

sR(φ) = nαφPαφ P

To solve, maximize the poten+al func+on by gradient descent

Ψ = X

φ∈Φ

ln(αφ)sR(φ) − ln(P n)

slide-13
SLIDE 13

Query Transforma+on

Aggregates: take expected value

SELECT origin, COUNT(*) FROM Flights GROUP BY origin; SELECT origin, E[|σorigin(Flights)|] FROM Flights, alpha_origin,... WHERE origin=alphas.origin GROUP BY origin;

13

GROUP BY + COUNT(*) E[|σorigin=o (Flights)|] φ For each origin o

E[|σφ(I)|] = nαφPαφ P

An equa+on in terms of the α’s we have calculated and stored

slide-14
SLIDE 14

Op+miza+ons

14

  • 1. Factorize P (solve 1D predicates independently)
  • 2. Add relevant 2+D predicates (ex: [A = a1 ^ B = b1])
  • 3. Remove tuples that don’t exist
  • 4. Change Basis (for correla+ons)

new afribute AB = f(A, B) (ex: AB = A – B)

P = X

t∈T up

Y

φ∈Φ|φ(t)=true

αφ

all possible tuples in

  • ur ac+ve domain

P∗ = P − X

t∈(T up−R)

Y

φ∈Φ|φ(t)=true

αφ

slide-15
SLIDE 15

15

SELECT order_date, ship_date, COUNT(*) FROM orders JOIN lineitem GROUP BY order_date, ship_date;

Error = |Est − True| Est + True

Experiment with TPC-H

slide-16
SLIDE 16

16

Change Basis: order_date – ship_date

slide-17
SLIDE 17

Conclusion

17

  • Introduced new way to summarize and approximately

query massive datasets

– Complements sampling and approximate aggrega+on

  • Allows fine grained control over which afributes and

values get summarized

  • Encouraging preliminary results
  • S+ll need to befer address scalability and expand query

language

  • Need to understand how best to choose sta+s+cs