Dissociation-based Optimization in Probabilistic Databases



SLIDE 1

Dissociation-based Optimization in Probabilistic Databases

Maarten Van den Heuvel ¹, Floris Geerts ¹, Martin Theobald ²

¹ Universiteit Antwerpen, Belgium
² Ulm University, Germany
SLIDE 2

Contents

  • Introduction
  • Issues with safety
  • Dissociation: make (probabilistically) unsafe queries safe
  • Top-k: using summaries to speed up inference in safe queries

SLIDE 3

Introduction

Which director is most likely to have directed a movie starring an award-winning actor?

PlayedIn
Movie      Actor              P
Star Wars  Ewan McGregor      0.9
Star Wars  Samuel L. Jackson  0.7
Star Trek  Samuel L. Jackson  0.2

WonBy
Actor              Prize   P
Ewan McGregor      Oscar   0.9
Samuel L. Jackson  Grammy  0.8

DirectedBy
Director      Movie      P
George Lucas  Star Wars  0.9
J.J. Abrams   Star Trek  0.8
George Lucas  Star Trek  0.1

Top-1 query

SLIDE 4

Introduction

Q(X):- DirectedBy(X, Y), PlayedIn(Y, Z), WonBy(Z, U)


Top-1 query

SLIDE 5

Introduction

Q(X):- DirectedBy(X, Y), PlayedIn(Y, Z), WonBy(Z, U)

Answers
Director      P
George Lucas  0.827
J.J. Abrams   0.128
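The answer probabilities shown here can be reproduced by brute force. The sketch below assumes tuple-independent tables and enumerates every possible world, summing the weight of the worlds in which Q(X) returns a given director. The function name `exact_answers` and the dictionary layout are illustrative choices, not part of the slides.

```python
from itertools import product

# Example tables from the slides: tuple -> marginal probability.
played_in = {("Star Wars", "Ewan McGregor"): 0.9,
             ("Star Wars", "Samuel L. Jackson"): 0.7,
             ("Star Trek", "Samuel L. Jackson"): 0.2}
won_by = {("Ewan McGregor", "Oscar"): 0.9,
          ("Samuel L. Jackson", "Grammy"): 0.8}
directed_by = {("George Lucas", "Star Wars"): 0.9,
               ("J.J. Abrams", "Star Trek"): 0.8,
               ("George Lucas", "Star Trek"): 0.1}


def exact_answers():
    """Sum the probability of every possible world in which Q(X) holds."""
    facts = ([("D", t, p) for t, p in directed_by.items()]
             + [("P", t, p) for t, p in played_in.items()]
             + [("W", t, p) for t, p in won_by.items()])
    answers = {}
    # Each tuple is either present or absent: 2^8 = 256 worlds here.
    for world in product([False, True], repeat=len(facts)):
        weight = 1.0
        present = {"D": [], "P": [], "W": []}
        for (rel, tup, p), keep in zip(facts, world):
            weight *= p if keep else 1.0 - p
            if keep:
                present[rel].append(tup)
        # Evaluate Q(X):- DirectedBy(X, Y), PlayedIn(Y, Z), WonBy(Z, U).
        directors = {x
                     for (x, y) in present["D"]
                     for (y2, z) in present["P"] if y2 == y
                     for (z2, _u) in present["W"] if z2 == z}
        for x in directors:
            answers[x] = answers.get(x, 0.0) + weight
    return answers


answers = exact_answers()
for director, p in sorted(answers.items()):
    print(f"{director}: {p:.3f}")
```

The number of worlds doubles with every tuple, so this only works for toy inputs like the one above.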

SLIDE 6

Introduction

Q(X):- DirectedBy(X, Y), PlayedIn(Y, Z), WonBy(Z, U)

Answers
Director      P
George Lucas  0.827
J.J. Abrams   0.128

Top-1 query: not interested in exact P

SLIDE 7

Introduction

Q(X):- DirectedBy(X, Y), PlayedIn(Y, Z), WonBy(Z, U)

Answers
Director      P
George Lucas  0.827
J.J. Abrams   0.128

  • Not interested in exact probability
  • Interested in ranking
SLIDE 8

Introduction

Q(X):- DirectedBy(X, Y), PlayedIn(Y, Z), WonBy(Z, U)

Answers
Director      P
George Lucas  0.827
J.J. Abrams   0.128

  • Not interested in exact probability
  • Interested in ranking
  • Upper and lower bounds are enough

SLIDE 9

Complexity

Some queries always have a query plan that uses only probabilistic operators: these queries are safe.

  • Prob-Join (⋈): P(s) = P(t1) * … * P(tn)
  • Prob-Project (π): P(s) = 1 - (1 - P(t1)) * … * (1 - P(tn))

[Figure: query plan πx(PlayedIn ⋈y πy(WonBy))]

Q(X):- PlayedIn(X, Y), WonBy(Y, Z)

Projection always eliminates duplicates.

PTIME in data size to calculate P(X)
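The two operators can be sketched directly from their formulas. The code below is an illustration under the assumption of tuple-independent tables. It evaluates the plan πx(PlayedIn ⋈y πy(WonBy)) on the example data. The helper names `prob_project` and `prob_join` are made up for this sketch.

```python
from collections import defaultdict
from math import prod

played_in = {("Star Wars", "Ewan McGregor"): 0.9,
             ("Star Wars", "Samuel L. Jackson"): 0.7,
             ("Star Trek", "Samuel L. Jackson"): 0.2}
won_by = {("Ewan McGregor", "Oscar"): 0.9,
          ("Samuel L. Jackson", "Grammy"): 0.8}


def prob_project(rel, key):
    """Prob-Project with duplicate elimination: 1 - prod(1 - P(t_i)) per group."""
    groups = defaultdict(list)
    for tup, p in rel.items():
        groups[key(tup)].append(p)
    return {k: 1.0 - prod(1.0 - p for p in ps) for k, ps in groups.items()}


def prob_join(left, right, left_key, right_key):
    """Prob-Join: multiply the probabilities of matching tuples."""
    return {(l, r): pl * pr
            for l, pl in left.items()
            for r, pr in right.items()
            if left_key(l) == right_key(r)}


# Plan for Q(X):- PlayedIn(X, Y), WonBy(Y, Z)
winners = prob_project(won_by, key=lambda t: t[0])             # actor -> P
joined = prob_join(played_in, winners,
                   left_key=lambda t: t[1], right_key=lambda a: a)
movies = prob_project(joined, key=lambda t: t[0][0])            # movie -> P

for movie, p in sorted(movies.items()):
    print(f"{movie}: {p:.4f}")
```

Each operator makes one pass over its input groups, which is where the polynomial cost in the data size comes from.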

SLIDE 10

Complexity


Q(X):- DirectedBy(X, Y), PlayedIn(Y, Z), WonBy(Z, U)


This query has no plan that uses only probabilistic operators, because those operators assume independence: the query is unsafe.

#P-hard in data size to calculate P(X)

SLIDE 11

Idea 1: Approximation using safe queries

Answers
Director      P
George Lucas  0.827
J.J. Abrams   0.128

[Figure: axis P marking Plow(X), P(X) and Pup(X)]

  • Use Qlow for lower bound
  • Use Qup for upper bound

SLIDE 12

What if we pretend independence?

Q(X):- DirectedBy(X, Y), PlayedIn(Y, Z), WonBy(Z, U)


Approximation using safe queries

SLIDE 13

What if we pretend independence?

PlayedIn
Movie      Actor   P
Star Wars  Ewan    0.9
Star Wars  Samuel  0.7
Star Trek  Samuel  0.2

WonBy
Movie*     Actor   Prize   P
Star Wars  Ewan    Oscar   0.9
Star Trek  Ewan    Oscar   0.9
Star Wars  Samuel  Grammy  0.8
Star Trek  Samuel  Grammy  0.8

Q'(X):- DirectedBy(X, Y), PlayedIn(Y, Z), WonBy(Y*, Z, U)

DirectedBy
Director      Movie      P
George Lucas  Star Wars  0.9
J.J. Abrams   Star Trek  0.8
George Lucas  Star Trek  0.1

  • Dissociation(1) gives upper and lower bounds
  • Use query plan of safe dissociated query on original data
  • Works for self-join free conjunctive queries
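The second bullet above, running the safe plan of the dissociated query on the original data, can be sketched as follows. The plan shape πx(DirectedBy ⋈y πy(PlayedIn ⋈z πz(WonBy))) is our reading of the dissociation WonBy(Y*, Z, U) and is not spelled out on the slide. The names `independent_or` and `dissociated_plan` are illustrative.

```python
from collections import defaultdict
from math import prod

played_in = {("Star Wars", "Ewan McGregor"): 0.9,
             ("Star Wars", "Samuel L. Jackson"): 0.7,
             ("Star Trek", "Samuel L. Jackson"): 0.2}
won_by = {("Ewan McGregor", "Oscar"): 0.9,
          ("Samuel L. Jackson", "Grammy"): 0.8}
directed_by = {("George Lucas", "Star Wars"): 0.9,
               ("J.J. Abrams", "Star Trek"): 0.8,
               ("George Lucas", "Star Trek"): 0.1}


def independent_or(probs):
    """Prob-Project formula: 1 - prod(1 - p), pretending the inputs are independent."""
    return 1.0 - prod(1.0 - p for p in probs)


def dissociated_plan():
    """Evaluate the assumed safe plan of Q' on the ORIGINAL (undissociated) tables."""
    # pi_z(WonBy): probability that an actor won some prize.
    wins = defaultdict(list)
    for (actor, _prize), p in won_by.items():
        wins[actor].append(p)
    p_wins = {a: independent_or(ps) for a, ps in wins.items()}

    # pi_y(PlayedIn join_z ...): each WonBy tuple is reused once per movie,
    # as if it had an independent copy for that movie.
    per_movie = defaultdict(list)
    for (movie, actor), p in played_in.items():
        if actor in p_wins:
            per_movie[movie].append(p * p_wins[actor])
    p_movie = {m: independent_or(ps) for m, ps in per_movie.items()}

    # pi_x(DirectedBy join_y ...)
    per_director = defaultdict(list)
    for (director, movie), p in directed_by.items():
        if movie in p_movie:
            per_director[director].append(p * p_movie[movie])
    return {d: independent_or(ps) for d, ps in per_director.items()}


approx = dissociated_plan()
for director, p in sorted(approx.items()):
    print(f"{director}: {p:.4f}")
```

On this data the plan value for George Lucas lies slightly above the exact 0.827, while the value for J.J. Abrams coincides with the exact 0.128, because no tuple is reused in his derivation.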

Approximation using safe queries

SLIDE 14

Dissociation

Q(X):- DirectedBy(X, Y), PlayedIn(Y, Z), WonBy(Y*, Z, U)

Q(X):- DirectedBy(X, Y, Z*), PlayedIn(Y, Z), WonBy(Z, U)

Downsides of dissociation:

  • Exponential number of dissociations in the query size
  • Different dissociations => different accuracy
  • Possibly insufficient differentiation
  • Need to execute Q to know
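To illustrate that different dissociations give different accuracy, the sketch below evaluates both dissociations from this slide for the answer George Lucas, using the example tables from the introduction. The plan shapes are our own reading of the two dissociated queries, and `ior` is a hypothetical helper implementing the Prob-Project formula.

```python
def ior(*probs):
    """Independent-or: 1 - (1 - p1) * ... * (1 - pn)."""
    miss = 1.0
    for p in probs:
        miss *= 1.0 - p
    return 1.0 - miss


# Dissociation A: WonBy(Y*, Z, U). Group by movie first, then by director.
star_wars = ior(0.9 * 0.9, 0.7 * 0.8)    # Ewan and Samuel in Star Wars
star_trek = ior(0.2 * 0.8)               # Samuel in Star Trek
plan_a = ior(0.9 * star_wars, 0.1 * star_trek)

# Dissociation B: DirectedBy(X, Y, Z*). Group by actor first, then by director.
ewan = ior(0.9 * 0.9) * 0.9              # Lucas/Star Wars/Ewan, then WonBy(Ewan)
samuel = ior(0.9 * 0.7, 0.1 * 0.2) * 0.8 # Lucas via both movies, then WonBy(Samuel)
plan_b = ior(ewan, samuel)

print(f"dissociation A: {plan_a:.4f}")
print(f"dissociation B: {plan_b:.4f}")
```

Both values sit above the exact 0.827 reported on the earlier slides, but A is much closer than B. Which one is tighter depends on the data, which is why it is hard to know without executing the plans.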
SLIDE 15

Idea 2: Approximation using summaries

[Figure: axis P marking Plow(X), P(X) and Pup(X)]

Safe queries alone are not efficient enough: why not approximate these bounds with more bounds?

[Figure: axis P marking Plow(X), P(X) and Pup(X)]

SLIDE 16

Approximation using summaries

[Figure: query plan πx(PlayedIn ⋈y πy(WonBy))]

PlayedIn(X,Y)
Movie      Actor              P
Star Wars  Ewan McGregor      0.9
Star Wars  Samuel L. Jackson  0.7
Star Trek  Samuel L. Jackson  0.2

WonBy(Y,Z)
Actor              Prize   P
Ewan McGregor      Oscar   0.9
Samuel L. Jackson  Grammy  0.8
…                  …       …

Q(X):- PlayedIn(X, Y), WonBy(Y, Z)

SLIDE 17

Approximation using summaries

[Figure: query plan πx(PlayedIn ⋈y πy(WonBy))]

WonBy(Y,Z)
Actor              Prize   P
Ewan McGregor      Oscar   0.9
Samuel L. Jackson  Grammy  0.8
…                  …       …

Answers
Actor              Pup
Ewan McGregor      ?
Samuel L. Jackson  ?

SLIDE 18

Approximation using summaries

[Figure: query plan πx(PlayedIn ⋈y πy(WonBy))]

WonBy(Y,Z)
Actor              Prize   P
Ewan McGregor      Oscar   0.9
Samuel L. Jackson  Grammy  0.8
…                  …       Pmax

Answers(πy)
Actor              Pup
Ewan McGregor      0.965
Samuel L. Jackson  0.853

Depends on all n tuples WonBy(Ewan, …)

  • P(s) = 1 - (1 - P(t1)) * … * (1 - P(tn))
  • Pup(s) = 1 - (1 - 0.9) * (1 - Pmax)^(n-1)
SLIDE 19

Approximation using summaries


Summary:

  • Hold Pmax
  • Upper bound on n
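The upper bound built from these two summary values can be written as a one-line function. This is a minimal sketch of the formula Pup(s) = 1 - (1 - p) * (1 - Pmax)^(n-1). The parameter names and the sample values of Pmax and n are assumptions chosen for illustration, not numbers from the slides.

```python
def upper_bound(p_seen, p_max, n):
    """Pup(s) = 1 - (1 - p_seen) * (1 - p_max)^(n - 1).

    p_seen: probability of the one tuple already read (0.9 for Ewan's Oscar)
    p_max:  summary value bounding the probability of every unread tuple
    n:      summary upper bound on the number of tuples in the group
    """
    return 1.0 - (1.0 - p_seen) * (1.0 - p_max) ** (n - 1)


def exact_project(probs):
    """P(s) = 1 - (1 - P(t1)) * ... * (1 - P(tn)) once all tuples are known."""
    miss = 1.0
    for p in probs:
        miss *= 1.0 - p
    return 1.0 - miss


# Hypothetical group: one tuple read (0.9), two unread tuples with P <= 0.5.
bound = upper_bound(0.9, p_max=0.5, n=3)
actual = exact_project([0.9, 0.4, 0.3])
print(f"Pup = {bound:.3f}, exact = {actual:.3f}")
```

The bound holds as long as no unread tuple exceeds Pmax and the group has at most n tuples. It gets looser as either summary value becomes more conservative.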
SLIDE 20

Approximation using summaries

[Figure: query plan πx(PlayedIn ⋈y πy(WonBy))]

  • Recursively propagate the bounds to the query answers

Answers(πy)
Actor              Pup    Plow
Ewan McGregor      0.965  0.9
Samuel L. Jackson  0.853  0.8

Answers(πx)
Movie      Pup   Plow
Star Wars  0.82  0.30
Star Trek  0.64  0.12

SLIDE 21

Approximation using summaries

[Figure: query plan πx(PlayedIn ⋈y πy(WonBy))]

  • Recursively propagate the bounds to the query answers
  • Read more data and update bounds:
      • Lower Pmax
      • Better estimate for n

Answers(πy)
Actor              Pup    Plow
Ewan McGregor      0.92   0.9
Samuel L. Jackson  0.832  0.8

Answers(πx)
Movie      Pup   Plow
Star Wars  0.82  0.62
Star Trek  0.58  0.37
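How the bounds tighten as more data is read can be sketched for a single projection group. The code assumes that tuples arrive in descending order of probability, so the last tuple read serves as Pmax for everything still unread. That ordering is an assumption of this sketch and is not stated on the slide. The function name, the sample probabilities, and `n_upper` are all hypothetical.

```python
def bounds_while_reading(probs_desc, n_upper):
    """Yield (Plow, Pup) after each tuple of one projection group is read.

    probs_desc: tuple probabilities, sorted in descending order (assumed)
    n_upper:    summary upper bound on the number of tuples in the group
    """
    miss = 1.0  # probability that none of the tuples read so far is present
    for read, p in enumerate(probs_desc, start=1):
        miss *= 1.0 - p
        unseen = max(n_upper - read, 0)
        low = 1.0 - miss                        # unread tuples contribute nothing
        up = 1.0 - miss * (1.0 - p) ** unseen   # unread tuples are all as likely as p
        yield low, up


steps = list(bounds_while_reading([0.9, 0.4, 0.3], n_upper=3))
for i, (low, up) in enumerate(steps, start=1):
    print(f"after {i} tuple(s): Plow = {low:.3f}, Pup = {up:.3f}")
```

With each tuple read, Pmax drops and fewer tuples remain unread, so the interval shrinks until both bounds meet at the exact value.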

SLIDE 22

Approximation using summaries

[Figure: query plan πx(PlayedIn ⋈y πy(WonBy))]

Stop if enough differentiation:

  • No possible candidates
  • No overlapping bounds

Answers(πy)
Actor              Pup    Plow
Ewan McGregor      0.92   0.9
Samuel L. Jackson  0.832  0.8

Answers(πx)
Movie      Pup   Plow
Star Wars  0.82  0.62
Star Trek  0.58  0.37
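The "no overlapping bounds" test can be sketched as a small check over the current intervals. The function name `top_k_decided` and the (Plow, Pup) tuple layout are illustrative assumptions. The numbers are the Answers(πx) bounds from this slide and from the earlier slide with the looser bounds.

```python
def top_k_decided(bounds, k=1):
    """Return True if the top-k answers are separated from all others.

    bounds: answer -> (Plow, Pup). The top-k is decided when the smallest
    lower bound among the k best answers is at least the largest upper
    bound among the remaining answers.
    """
    ranked = sorted(bounds.items(), key=lambda item: item[1][0], reverse=True)
    top, rest = ranked[:k], ranked[k:]
    if not rest:
        return True  # no remaining candidates
    weakest_top_low = min(low for _name, (low, _up) in top)
    best_rest_up = max(up for _name, (_low, up) in rest)
    return weakest_top_low >= best_rest_up


early = {"Star Wars": (0.30, 0.82), "Star Trek": (0.12, 0.64)}
later = {"Star Wars": (0.62, 0.82), "Star Trek": (0.37, 0.58)}

print(top_k_decided(early))  # intervals still overlap
print(top_k_decided(later))  # Star Wars is the top-1 answer
```

Once the check succeeds, reading can stop without ever computing the exact probabilities.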

SLIDE 23

Dissociation++


Choosing a good dissociation is costly, but:

  • Accuracy depends on the number of faulty independence assumptions:
      • Estimate it with the n statistics in the summaries!
SLIDE 24

Questions/Challenges

  • Implementation: ongoing
  • Accuracy:
      • Bounds good enough to differentiate?
      • Statistics good enough to approximate faulty independence assumptions?
  • Summary: what are good summaries regarding size, detail, …?

Thank you for your attention!

SLIDE 25

References

(1) Gatterbauer, W., & Suciu, D. (2014). Oblivious bounds on the probability of boolean functions. ACM Transactions on Database Systems (TODS), 39(1), 5.
(2) Gatterbauer, W., & Suciu, D. (2015). Approximate lifted inference with probabilistic databases. Proceedings of the VLDB Endowment, 8(5), 629-640.
(3) Dylla, M., Miliaraki, I., & Theobald, M. (2013, April). Top-k query processing in probabilistic databases with non-materialized views. In Data Engineering (ICDE), 2013 IEEE 29th International Conference on (pp. 122-133). IEEE.