1
Techniques for Managing Probabilistic Data — Dan Suciu, University of Washington
2
Databases Are Deterministic
- Applications since the 1970s required precise semantics
– Accounting, inventory
- Database tools are deterministic
– A tuple is an answer or is not
- Underlying theory assumes determinism
– FO (First Order Logic)
3
Future of Data Management
We need to cope with uncertainties !
- Represent uncertainties as probabilities
- Extend data management tools to handle probabilistic data
Major paradigm shift affecting both foundations and systems
4
IMDB:
- Lots of data !
- Well maintained and clean
- But no reviews!
Example: Alice Looks for Movies
I’d like to know which movies are really good…
5
On the web there are lots of reviews…
6
How do I know…
…which movie they talk about?
…if the review is positive or negative?
…if I should trust the reviewer?
Alice needs:
- fuzzy joins
- information extraction
- sentiment analysis
- social networks
7
Find actors in Pulp Fiction who appeared in two bad movies five years earlier
Find years when ‘Anthony Hopkins’ starred in a good movie
A probabilistic database can help Alice store and query her uncertain data
8
Application 1: Using Fuzzy Joins
IMDB Reviews
Title               Year
Twelve Monkeys      1995
Monkey Love 1997    1997
Monkey Love 1935    1935
Monkey Love Planet  2005
titles don’t match
Review       By   Rating
12 Monkeys   Joe  4
Monkey Boy   Jim  2
Monkey Love  Joe  2
9
Result of a Fuzzy Join
TitleReviewMatchp
Title               Review       P
Twelve Monkeys      12 Monkeys   0.7
Monkey Love 1997    12 Monkeys   0.45
Monkey Love 1935    Monkey Love  0.82
Monkey Love 1935    Monkey Boy   0.68
Monkey Love Planet  Monkey Love  0.8
[Arasu’2006]
10
Queries over Fuzzy Joins
IMDB:
Title            Year
Twelve Monkeys   1995
Monkey Love 97   1997
Monkey Love 35   1935
Monkey Love PL   2005

Reviews:
Review       By   Rating
12 Monkeys   Joe  4
Monkey Boy   Jim  2
Monkey Love  Joe  2

TitleReviewMatchp:
Title               Review       P
Twelve Monkeys      12 Monkeys   0.7
Monkey Love 97      12 Monkeys   0.45
Monkey Love 35      Monkey Love  0.82
Monkey Love 35      Monkey Boy   0.68
Monkey Love Planet  Monkey Love  0.8
Who reviewed movies made in 1935 ?
By     P
Joe    0.73
Fred   0.68
Jim    0.43
. . .  0.12
IMDB Reviews TitleReviewMatchp
SELECT DISTINCT z.By FROM IMDB x, TitleReviewMatchp y, Amazon z WHERE x.title=y.title and x.year=1935 and y.review=z.review
Ranked !
Find movies reviewed by Jim and Joe
SELECT DISTINCT x.Title
FROM IMDB x, TitleReviewMatchp y1, Amazon z1, TitleReviewMatchp y2, Amazon z2
WHERE . . . z1.By=‘Joe’ . . . z2.By=‘Jim’ . . .
Title       P
Gone with…  0.73
Amadeus     0.68
. . .       0.43
Answer:
11
Application 2: Information Extraction
ID  House-No  Street         City         P
1   52        Goregaon West  Mumbai       0.1
1   52-A      Goregaon West  Mumbai       0.4
1   52        Goregaon       West Mumbai  0.2
1   52-A      Goregaon       West Mumbai  0.2
2   . . .     . . .          . . .        . . .
2   . . .
[Gupta&Sarawagi’2006]
...52 A Goregaon West Mumbai ...
Here probabilities are meaningful ≈20% of such extractions are correct
Addressp
12
Queries
SELECT DISTINCT x.name FROM Person x, Addressp y WHERE x.ID = y.ID and y.city = ‘West Mumbai’
Find people living in ‘West Mumbai’ Today’s practice is to retain only the most likely extraction; this results in low recall for these queries. A probabilistic database keeps all extractions: higher recall.
SELECT DISTINCT x.name, u.name FROM Person x, Addressp y, Person u, Addressp v WHERE x.ID = y.ID and y.city = v.city and u.ID = v.ID
Find people of the same age, living in the same city
13
Application 3: Social Networks
Name1  Name2  P
Alice  Bob    0.5
Alice  Kim    0.2
Bob    Kim    0.9
Bob    Alice  0.5
Kim    Fred   0.75
Fred   Kim    0.4

Name   Age  City
Alice  25   Rome
Fred   21   Venice
Bob    30   Rome
Kim    27   Milan
http://www.ilike.com/
Give 50 free tickets to most influential people in Venice
[Adar&Re’2007]
14
Application 4: RFID Data
[Welbourne’2007] RFID Ecosystem at the UW
15
RFID Data
Time  Person  Location  P
1     Jim     L54       0.1
1     Jim     L39       0.4
1     Jim     L44       0.2
1     Jim     L10       0.3
2     Jim     L54       0.3
2     Jim     L12       0.6
2     Jim     L10       0.1
3     Jim     L12       0.4
3     Jim     L54       0.6
Courtesy of Julie Letchner Particle filter with 100 particles
16
- Raw data is noisy:
– SIGHTING(tagID, antennaID, time)
- Derived data = Probabilistic
– “John is located at L32 at 9:15”, prob = 0.6
– “John carried laptop x77 at 11:03”, prob = 0.8
– . . .
- Queries
– “Which people were in Room 478 yesterday ?”
RFID Data = massive, streaming, probabilistic
17
A Model for Uncertainties
- Data is probabilistic
- Queries formulated in a standard language
- Answers are annotated with probabilities
This tutorial: Managing Probabilistic Data
18
Long History
Cavallo&Pitarelli:1987
Barbara, Garcia-Molina, Porter:1992
Lakshmanan, Leone, Ross&Subrahmanian:1997
Fuhr&Roellke:1997
Dalvi&S:2004
Widom:2005
19
Modern Probabilistic DBMS
- Trio at Stanford [Widom et al.]
– Uncertainty and Lineage ULDB
- MystiQ at the University of Washington [S. et al.]
– Query evaluation, optimization
- University of Maryland [Getoor, Desphande et al.]
– Complex probabilistic models, PRMS
- Orion at Purdue University [Prabhakar et al.]
– Sensor data, continuous random variables
- Data Furnace at Berkeley [Garofalakis, Franklin, Hellerstein]
Focus today: Query Evaluation/Optimization
20
Has this been solved by AI ?

                 AI (Input: KB)           Databases (Fix q; Input: DB)
Deterministic    Theorem prover           Query processing
Probabilistic    Probabilistic inference  [this tutorial]

No: probabilistic inference is notoriously expensive
21
Outline
Part 1:
- Motivation
- Data model
- Basic query evaluation
Part 2:
- The dichotomy of query evaluation
- Implementation and optimization
- Six Challenges
22
What is a Probabilistic Database (PDB) ?
Object    Time  Person  P
Laptop77  9:07  John    0.62
Laptop77  9:07  Jim     0.34
Book302   9:18  Mary    0.45
Book302   9:18  John    0.33
Book302   9:18  Fred    0.11
HasObjectp What does it mean ?
Keys: Object, Time. Non-keys: Person. Probability: P. [Barbara et al. 1992]
Background
Finite probability space = (Ω, P)
Ω = {ω1, . . ., ωn} = set of outcomes
P : Ω → [0,1], with P(ω1) + . . . + P(ωn) = 1
Event: E ⊆ Ω, P(E) = ∑ω∈E P(ω)
“Independent”: P(E1 ∩ E2) = P(E1) P(E2)
“Mutually exclusive” or “disjoint”: P(E1 ∩ E2) = 0
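These definitions can be made concrete with a tiny finite probability space. The outcomes and probabilities below are illustrative, not taken from the tutorial:

```python
from itertools import product

# Outcome space: truth assignments to two independent "coin flips"
# X and Y with P(X=1) = 0.3, P(Y=1) = 0.6.
px, py = 0.3, 0.6
omega = {(x, y): (px if x else 1 - px) * (py if y else 1 - py)
         for x, y in product([0, 1], repeat=2)}

assert abs(sum(omega.values()) - 1.0) < 1e-9   # P(w1) + ... + P(wn) = 1

def prob(event):
    # P(E) = sum of P(w) over the outcomes w in E
    return sum(p for w, p in omega.items() if event(w))

E1 = lambda w: w[0] == 1          # event "X happened"
E2 = lambda w: w[1] == 1          # event "Y happened"
both = lambda w: E1(w) and E2(w)

# Independence: P(E1 and E2) = P(E1) * P(E2)
print(prob(both), prob(E1) * prob(E2))   # 0.18 for both
```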
24
Possible Worlds Semantics
Object    Time  Person  P
Laptop77  9:07  John    p1
Laptop77  9:07  Jim     p2
Book302   9:18  Mary    p3
Book302   9:18  John    p4
Book302   9:18  Fred    p5
Ω={
[The 12 possible worlds: each world keeps at most one person per (Object, Time) group, i.e. {John, Jim, none} for Laptop77 and {Mary, John, Fred, none} for Book302]
}
Example world probabilities: p1p3, p1p4, . . ., p1(1 - p3 - p4 - p5), . . .
Possible worlds PDB
HasObjectp
HasObject
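The possible-worlds construction above can be sketched directly: each key group independently picks one of its alternatives (or none), and a world's probability is the product of the choices. The numeric probabilities are the illustrative ones from the earlier HasObjectp slide:

```python
from itertools import product

# Tuple-disjoint/independent table: alternatives within a key are
# disjoint; different keys are independent.
table = {
    ("Laptop77", "9:07"): [("John", 0.62), ("Jim", 0.34)],
    ("Book302", "9:18"): [("Mary", 0.45), ("John", 0.33), ("Fred", 0.11)],
}

def worlds(table):
    keys = list(table)
    # For each key: one alternative, or None (the "tuple absent" case)
    choices = [table[k] + [(None, 1 - sum(p for _, p in table[k]))]
               for k in keys]
    for combo in product(*choices):
        world = {k: v for k, (v, _) in zip(keys, combo) if v is not None}
        p = 1.0
        for _, q in combo:
            p *= q
        yield world, p

ws = list(worlds(table))
print(len(ws))                                   # 3 * 4 = 12 worlds
assert abs(sum(p for _, p in ws) - 1.0) < 1e-9   # a probability space
```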
25
Representation of a Probabilistic Database
- Impossible to enumerate all worlds !
- Need concise representation formalism
- Here we discuss two simple formalisms:
– Independent tuples – Independent/disjoint tuples
- They are incomplete
- They become complete by adding views
26
Meetsp(Person1, Person2, Time, P)
Person1  Person2  Time  P
John     Jim      9:12  p1
Mary     Sue      9:20  p2
John     Mary     9:20  p3
Independent tuples
Definition: A tuple-independent table is: Rp(A1, A2, …, Am, P)
Terminology: Trio calls such a tuple a maybe tuple: it may be in, or it may not be in.
27
Object    Time  Person  P
Laptop77  9:07  John    p1
Laptop77  9:07  Jim     p2
Book302   9:18  Mary    p3
Book302   9:18  John    p4
Book302   9:18  Fred    p5
HasObjectp(Object, Time, Person, P)
(Tuples within a key group are disjoint; tuples across key groups are independent.)
Definition: A tuple-disjoint/independent table is: Rp(A1, A2, …, Am, B1, …, Bn, P)
Terminology: Disjoint tuples are also called exclusive. Trio calls them x-tuples.
28
Two Approaches to Queries
- Standard queries, probabilistic answers
– Query: “find all movies with rating > 4” – Answers: list of tuples with probabilities
- Novel types of queries
– Query: find all Movie-review matches with probability in [0.3, 0.8] – Answer: …
This tutorial Open research direction (not well studied in literature)
29
Queries in Datalog Notation
SELECT DISTINCT m.year FROM Movie m, Review r WHERE m.id = r.mid and r.rating > 3 q(y) :- Moviep(x,y), Reviewp(x,z), z>3 SQL Conjunctive query (datalog)
30
Semantics 1: Possible Tuples
Moviep Reviewp Answer
q(y) :- Moviep(x,y), Reviewp(x,z), z>3
Moviep:
id   year  P
m42  1995  0.6
m99  2002  0.8
m76  2002  0.3

Reviewp:
mid  rating  P
m42  7       0.5
m42  4       0.3
m42  9       0.9
m99  7       0.6
m99  5       0.2
m76  6       0.3

Answer:
year  P
1995
2002
[Figure: possible worlds of Movie and Review; each world is a subset of the tuples above]
Summing over the worlds where each answer appears:
P(1995) = p1+p4+p5+p8+p9,  P(2002) = p3+p4+p7
(where p1, p2, . . . are the world probabilities)
31
Formal Definition
Definition: Let q be a query and a a candidate answer; q(a) is a boolean query.
Probabilistic event: E = {ω | ω ⊨ q(a)}
P(q(a)) = P(E) = ∑ω ⊨ q(a) P(ω), over the tuple probability space (Ω, P)

Example: q(y) :- Moviep(x,y), Reviewp(x,z), z>3
For a = 1995:  q(1995) :- Moviep(x,1995), Reviewp(x,z), z>3
P(q(1995)) = marginal probability of q(1995)
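The definition can be checked by brute-force enumeration of possible worlds, exponential but concrete. This is a sketch over a small tuple-independent instance (a subset of the example tables, with illustrative probabilities):

```python
from itertools import product

movie = [("m42", 1995, 0.6), ("m99", 2002, 0.8)]              # (id, year, p)
review = [("m42", 7, 0.5), ("m42", 4, 0.3), ("m99", 5, 0.2)]  # (mid, rating, p)

def q_1995(mw, rw):
    # Boolean query q(1995) :- Movie(x,1995), Review(x,z), z > 3
    return any(y == 1995 and any(m == x and z > 3 for m, z in rw)
               for x, y in mw)

total = 0.0
for mbits in product([0, 1], repeat=len(movie)):
    for rbits in product([0, 1], repeat=len(review)):
        pw = 1.0                         # probability of this world
        for b, (_, _, p) in zip(mbits, movie):
            pw *= p if b else 1 - p
        for b, (_, _, p) in zip(rbits, review):
            pw *= p if b else 1 - p
        mw = [(i, y) for b, (i, y, _) in zip(mbits, movie) if b]
        rw = [(m, z) for b, (m, z, _) in zip(rbits, review) if b]
        if q_1995(mw, rw):               # sum P(w) over worlds w |= q(1995)
            total += pw

# Matches the extensional formula p1 * (1 - (1-q1)(1-q2)) = 0.6 * 0.65
print(round(total, 4))   # 0.39
```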
32
Semantics 2: Possible Answers

q(y) :- Moviep(x,y), Reviewp(x,z), z>3

Each possible world (with probability p1, p2, p3, . . .) yields one possible answer, itself a set of years, e.g.:
{1995, 2002}, {1990, 1999}, {1990, 1999, 2002}, {1950, 1960, 1970}, {1930}, . . .
33
Formal Definition
A view v maps the probability space (Ω, P) to a new probability space (Ω’, P’):
Definition:
Ω’ = {ω’ | ∃ω ∈ Ω, v(ω) = ω’}
P’(ω’) = ∑ω : v(ω)=ω’ P(ω)

“Image probability space” [Green&Tannen’06]
34
Query Semantics
- Possible tuples:
– Simple, intuitive user interface
– Query evaluation is probabilistic inference
– But is not compositional
Best for expressing user queries

- Possible answers:
– Is compositional
– Open research problems: user interface, query evaluation
Best for defining views
35
Complex Models = Simple + Views
ID  House-No  Street         City         P
1   52        Goregaon West  Mumbai       0.06
1   52-A      Goregaon West  Mumbai       0.15
1   52        Goregaon       West Mumbai  0.12
1   52-A      Goregaon       West Mumbai  0.3
2   . . .     . . .          . . .        . . .
2   . . .
[Gupta&Sarawagi’2006]
Addressp
Suppose House-No is extracted independently from Street and City.
36
ID  House-No  Street         City         P
1   52        Goregaon West  Mumbai       0.06
1   52-A      Goregaon West  Mumbai       0.15
1   52        Goregaon       West Mumbai  0.12
1   52-A      Goregaon       West Mumbai  0.3
2   . . .     . . .          . . .        . . .
Addressp
AddrHp:
ID  House-No  P
1   52        0.2
1   52-A      0.5
2   . . .     . .

AddrSCp:
ID  Street         City         P
1   Goregaon West  Mumbai       0.3
1   Goregaon       West Mumbai  0.6
2   . . .          . . .        . . .
Address(x,y,z,u) :- AddrH(x,y), AddrSC(x,z,u)
View:
37
Complex Models = Simple + Views
Standard query rewriting:
Address(x,y,z,u) :- AddrH(x,y), AddrSC(x,z,u)
View:
q(x) :- Address(x,y,z,’West Mumbai’)
User query:
q(x) :- AddrH(x,y), AddrSC(x,z,’West Mumbai’)
Rewritten query
38
Complex Models = Simple + Views
Theorem [Dalvi&S’2007] Independent/disjoint tables + conjunctive views = a complete representation system
- In this simple example the view is already representable as a tuple disjoint/independent table
- In general, views can define more complex probability spaces over possible worlds that are not disjoint/independent
39
Discussion of Data Model
Tuple-disjoint/independent tables:
- Simple model, can store in any DBMS
More advanced models:
- Symbolic boolean expressions [Fuhr and Roellke]
- Trio: add lineage [Widom05, Das Sarma’06, Benjelloun’06]
- Probabilistic Relational Models [Getoor’2006]
- Graphical models [Sen&Deshpande’07]
40
Outline
Part 1:
- Motivation
- Data model
- Basic query evaluation
Part 2:
- The dichotomy of query evaluation
- Implementation and optimization
- Six Challenges
41
Extensional Operators
Object    Person  Location  P
Laptop77  John    L45       p1
Laptop77  Jim     L45       p2
Laptop77  Jim     L66       p3
Book302   Mary    L66       p4
Book302   Mary    L45       p5
Book302   Jim     L66       p6
Book302   John    L45       p7
Book302   Fred    L45       p8
q(z) :- HasObjectp(Book302, y, z)
Location  P
L66       p4+p6
L45       p5+p7+p8
42
Disjoint Project
Πd : p1, p2, p3 ↦ p1 + p2 + p3
43
Extensional Operators
Object    Person  Location  P
Laptop77  John    L45       p1
Laptop77  Jim     L45       p2
Laptop77  Jim     L66       p3
Book302   Mary    L66       p4
Book302   Mary    L45       p5
Book302   Jim     L66       p6
Book302   John    L45       p7
Book302   Fred    L45       p8
q(y,z) :- HasObjectp(x,y,z)
Person  Location  P
Jim     L66       1-(1-p3)(1-p6)
John    L45       1-(1-p1)(1-p7)
. . .
44
Independent Project
Πi : p1, p2, p3 ↦ 1-(1-p1)(1-p2)(1-p3)
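The two extensional projections can be sketched as small aggregation routines: a disjoint project sums probabilities within a group (the alternatives are mutually exclusive), while an independent project combines them as 1 - ∏(1 - p). The row encoding is an assumption for illustration:

```python
from collections import defaultdict
from functools import reduce

def project_disjoint(rows, key):
    # Πd: alternatives within a group are mutually exclusive -> sum
    out = defaultdict(float)
    for r in rows:
        out[key(r)] += r["p"]
    return dict(out)

def project_independent(rows, key):
    # Πi: tuples within a group are independent -> 1 - prod(1 - p)
    groups = defaultdict(list)
    for r in rows:
        groups[key(r)].append(r["p"])
    return {k: 1 - reduce(lambda a, p: a * (1 - p), ps, 1.0)
            for k, ps in groups.items()}

rows = [{"person": "Jim", "loc": "L66", "p": 0.3},
        {"person": "Jim", "loc": "L66", "p": 0.6}]
print(project_disjoint(rows, lambda r: (r["person"], r["loc"])))
print(project_independent(rows, lambda r: (r["person"], r["loc"])))
# disjoint: 0.3 + 0.6 = 0.9;  independent: 1 - 0.7*0.4 = 0.72
```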
45
Answer:
year  P
1995  p1 × (1 - (1 - q1)(1 - q2)(1 - q3))
2002  1 - (1 - p2 × (1 - (1 - q4)(1 - q5))) × (1 - p3 × q6)

Movie:
id   year  P
m42  1995  p1
m99  2002  p2
m76  2002  p3

Review:
mid  rating  P
m42  7       q1
m42  4       q2
m42  9       q3
m99  7       q4
m99  5       q5
m76  6       q6
q(y) :- Moviep(x,y), Reviewp(x,z),z>3
A Taste of Query Evaluation
46
q(y) :- Moviep(x,y), Reviewp(x,z)

Answer depends on the query plan!

INCORRECT plan: join Movie(x,y) and Review(x,z) on x, then a single independent project:
1 - (1 - p1q1)(1 - p1q2)(1 - p1q3)    (wrong: the three joined tuples all share the Movie tuple p1, so they are not independent)

CORRECT (“safe plan”): independent-project Review on x first, then join, then project:
1 - (1 - q1)(1 - q2)(1 - q3)  →  p1(1 - (1 - q1)(1 - q2)(1 - q3))  =  P(q(1995))
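The discrepancy between the two plans can be verified numerically. The probabilities below are illustrative placeholders for p1, q1, q2, q3:

```python
p1, q1, q2, q3 = 0.6, 0.5, 0.3, 0.9

# Ground truth for the boolean query q(1995): the Movie tuple is present
# AND at least one of its reviews is present.
truth = p1 * (1 - (1 - q1) * (1 - q2) * (1 - q3))

# Safe plan: independent-project Review on x first, then join with Movie
safe = p1 * (1 - (1 - q1) * (1 - q2) * (1 - q3))

# Unsafe plan: join first, then one independent project, which wrongly
# treats p1*q1, p1*q2, p1*q3 as independent events (they share p1!)
unsafe = 1 - (1 - p1 * q1) * (1 - p1 * q2) * (1 - p1 * q3)

print(round(truth, 4), round(safe, 4), round(unsafe, 4))
```

The safe plan reproduces the true marginal (0.579 here); the unsafe plan overestimates it.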
47
Safe Plans are Efficient
- Very efficient: run almost as fast as
regular queries
- Require only simple modifications of the
relational operators
- Or can be translated back into SQL and sent to any RDBMS
Can we always generate a safe plan ?
48
A Hard Query
Rp:
A  B   P
a  x1  p1
a  x2  p2

S:
B   C
x1  y1
x1  y2
x2  y1

Tp:
C   D  P
y1  c  q1
y2  c  q2
h(u,v) :- Rp(u,x),S(x,y),Tp(y,v)
Any plan built from ⋈ and Πi produces intermediate results, such as (1-(1-p1)(1-p2))q1, that are correlated with the other intermediate results, so the final combination is incorrect. Unsafe!
For h(a,c) there is no safe plan!
49
Independent Queries
Let q1, q2 be two boolean queries.
Definition: q1, q2 are “independent” if P(q1 ∧ q2) = P(q1) P(q2)
Then also: P(q1 ∨ q2) = 1 - (1 - P(q1))(1 - P(q2))
50
Quiz: which are independent ?
q1                              q2                              Indep.?
Moviep(m41,y)                   Reviewp(m41,z)
Moviep(m42,y), Reviewp(m42,z)   Moviep(m77,y), Reviewp(m77,z)
Moviep(m42,y), Reviewp(m42,z)   Moviep(m42,1995)
Moviep(m42,y), Reviewp(m42,7)   Moviep(m42,y), Reviewp(m42,4)
Rp(x,y,z,z,u), Rp(x,x,x,y,y)    Rp(a,a,b,b,c)
51
Answers
Prop: If no two subgoals unify then q1, q2 are independent.
Note: this is a sufficient, but not a necessary, condition.
Theorem: Independence is Πp2-complete [Miklau&S’04]
Reducible to query containment [Machanavajjhala&Gehrke’06]
Reducible to query containment [Machanavajjhala&Gehrke’06]
q1                              q2                              Indep.?
Moviep(m41,y)                   Reviewp(m41,z)                  YES
Moviep(m42,y), Reviewp(m42,z)   Moviep(m77,y), Reviewp(m77,z)   YES
Moviep(m42,y), Reviewp(m42,z)   Moviep(m42,1995)                NO
Moviep(m42,y), Reviewp(m42,7)   Moviep(m42,y), Reviewp(m42,4)   NO
Rp(x,y,z,z,u), Rp(x,x,x,y,y)    Rp(a,a,b,b,c)                   YES
52
Disjoint Queries
Let q1, q2 be two boolean queries.
Definition: q1, q2 are “disjoint” if P(q1 ∧ q2) = 0
This holds iff q1, q2 depend on two disjoint tuples t1, t2.
53
Quiz: which are disjoint ?
q1                                 q2                                Disjoint?
HasObjectp(‘book’, ‘9’, ‘Mary’, x) HasObjectp(‘book’, ‘9’, ‘Jim’, x)
HasObjectp(‘book’, t, ‘Mary’, x)   HasObjectp(‘book’, t, ‘Jim’, x)
HasObjectp(‘book’, ‘9’, u, x)      HasObjectp(‘book’, ‘9’, v, x)
54
Answers
q1                                 q2                                Disjoint?
HasObjectp(‘book’, ‘9’, ‘Mary’, x) HasObjectp(‘book’, ‘9’, ‘Jim’, x) Y
HasObjectp(‘book’, t, ‘Mary’, x)   HasObjectp(‘book’, t, ‘Jim’, x)   N
HasObjectp(‘book’, ‘9’, u, x)      HasObjectp(‘book’, ‘9’, v, x)     N
Proposition: q1, q2 are “disjoint” if they contain subgoals g1, g2 that:
- have the same values for the key attributes
- have only constants in those key positions
- have at least one different constant in the non-key attributes
55
Definition of Safe Operators
Join ⋈ of q1(x), q2(x): “safe” if ∀a: q1(a), q2(a) are independent
Independent project Πi, q(x) → q: “safe” if ∀a, b: q(a), q(b) are independent
Selection σx=a, q(x) → q(x): always “safe”
Disjoint project Πd, q(x) → q: “safe” if ∀a, b: q(a), q(b) are disjoint
56
Example 1
⋈x
Πiy
Movie(x,y) Review(x,z)
q(yc) :- Moviep(x,yc), Reviewp(x,z)    (yc “is a constant”)
Plan: apply Πiy to q1(x,z) :- Movie(x,yc), Review(x,z). Unsafe!
Because these are dependent:
q1(m42,7) = Movie(m42,yc), Review(m42,7)
q1(m42,4) = Movie(m42,yc), Review(m42,4)
57
Example 2
⋈x
Πiy
Movie(x,y) Review(x,z)
Safe !
q(yc) :- Moviep(x,yc), Reviewp(x,z)    (yc “is a constant”)
Plan: first project within each x-group, then Πix over q1(x) :- Movie(x,yc), Review(x,z).
Now these are independent:
q1(m42) = Movie(m42,yc), Review(m42,z)
q1(m77) = Movie(m77,yc), Review(m77,z)
58
Complexity Class #P
Definition #P is the class of functions f(x) for which there exists a PTIME non-deterministic Turing machine M s.t. f(x) = number of accepting computations of M on input x
Examples:
SAT = “given formula Φ, is Φ satisfiable ?” = NP-complete
#SAT = “given formula Φ, count # of satisfying assignments” = #P-complete [Valiant’79]
59
All You Need to Know About #P
Class                       Example                                SAT    #SAT
3CNF                        (X∨Y∨Z)∧(¬X∨U∨W) …                     NP     #P
2CNF                        (X∨Y)∧(¬X∨U) …                         PTIME  #P
Positive, partitioned 2CNF  (X1∨Y1)∧(X1∨Y4)∧(X2∨Y1)∧(X3∨Y1) …      PTIME  #P
Positive, partitioned 2DNF  (X1∧Y1)∨(X1∧Y4)∨(X2∧Y1)∨(X3∧Y1) …      PTIME  #P
[Valiant’79] [Provan&Ball’83] Here NP, #P means “NP-complete, #P-complete”
60
#P-Hard Queries
Rp:
A   P
x1  0.5
x2  0.5
x3  0.5

S:
A   B
x1  y1
x2  y1
x1  y2
x3  y2

Tp:
B   P
y1  0.5
y2  0.5
hd1 :- Rp(x), S(x,y), Tp(y)
Theorem: The query hd1 is #P-hard.
Proof: reduction from positive, partitioned 2DNF. E.g. Φ = x1y1 ∨ x2y1 ∨ x1y2 ∨ x3y2 reduces to the instance above, with #Φ = P(hd1) × 2^n
See also [Graedel et al. 98]
61
#P-Hard Queries
- #P-hard queries do not have safe plans
- Do not have any PTIME algorithm
– Unless P = NP
- Can be evaluated using probabilistic inference
– Exponential time exact algorithms or – PTIME approximations, e.g. Luby&Karp
- In our experience with MystiQ, unsafe queries
are 2 orders of magnitude slower than safe queries, and that only after optimizations
62
Lessons
What do users want ?
- Arbitrary queries, not just safe queries
– Safe query very fast – Unsafe query begs for optimizations
What should the system do ?
- Aggressively check if a query is safe
- If not, aggressively search safe subqueries
Key problem: identifying the safe queries
63
Dichotomy Property

REP, LANG have the DICHOTOMY PROPERTY if ∀ q ∈ LANG: (1) the complexity of q is PTIME, or (2) the complexity of q is #P-hard.
REP = a representation formalism (independent or independent/disjoint)
LANG = a query language. CQ = conjunctive queries; CQ1 = conjunctive queries without self-joins
Theorems: The dichotomy property holds for:
- 1. CQ1 and independent dbs.
- 2. CQ1 and disjoint/independent dbs.
- 3. CQ and independent dbs.
64
Summary So Far
- Lots of applications need probabilistic data
- Tuple disjoint/independent data model
– Sufficient for many applications – Can be made complete through views – Ideal for studying query evaluation
- Query evaluation
– Some (many ?) queries are inherently hard – Main optimization tool: safe queries
65
Outline
Part 1:
- Motivation
- Data model
- Basic query evaluation
Part 2:
- The dichotomy of query evaluation
- Implementation and optimization
- Six Challenges
66
Dichotomy Property

REP, LANG have the DICHOTOMY PROPERTY if ∀ q ∈ LANG: (1) the complexity of q is PTIME, or (2) the complexity of q is #P-hard.
REP = a representation formalism (independent or independent/disjoint)
LANG = a query language. CQ = conjunctive queries; CQ1 = conjunctive queries without self-joins
Theorems: The dichotomy property holds for:
- 1. CQ1 and independent dbs.
- 2. CQ1 and disjoint/independent dbs.
- 3. CQ and independent dbs.
PTIME queries:
R(x, y), S(x, z)
R(x, y), S(y), T(‘a’, y)
R(x), S(x, y), T(y), U(u, y), W(‘a’, u)
. . .

#P-hard queries:
hd1 = R(x), S(x, y), T(y)
hd2 = R(x,y), S(y)
hd3 = R(x,y), S(x,y)
. . .

Will discuss next how to decide their complexity and how to evaluate the PTIME queries
68
Hierarchical Queries
sg(x) = set of subgoals containing the variable x in a key position
Definition: A query q is hierarchical if for all x, y: sg(x) ⊇ sg(y) or sg(x) ⊆ sg(y) or sg(x) ∩ sg(y) = ∅
Non-hierarchical: h1 = R(x), S(x, y), T(y)    (sg(x) = {R, S} and sg(y) = {S, T} overlap without containment)
Hierarchical: q = R(x, y), S(x, z)    (sg(x) = {R, S} contains both sg(y) = {R} and sg(z) = {S})
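The hierarchy test can be sketched directly from the definition: compute sg(x) for each variable and check that every pair of sets is nested or disjoint. For simplicity this sketch treats every attribute position as a key position; the subgoal encoding is an assumption for illustration:

```python
def sg(query):
    # query: list of subgoals (name, vars); sg maps each variable to the
    # set of subgoal indices in which it occurs.
    out = {}
    for i, (_, vs) in enumerate(query):
        for v in vs:
            out.setdefault(v, set()).add(i)
    return out

def hierarchical(query):
    # q is hierarchical iff every pair sg(x), sg(y) is nested or disjoint
    s = sg(query)
    return all(s[x] >= s[y] or s[x] <= s[y] or not (s[x] & s[y])
               for x in s for y in s)

h1 = [("R", ["x"]), ("S", ["x", "y"]), ("T", ["y"])]   # R(x), S(x,y), T(y)
q  = [("R", ["x", "y"]), ("S", ["x", "z"])]            # R(x,y), S(x,z)
print(hierarchical(h1), hierarchical(q))   # False True
```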
69
Case 1: CQ1 + Independent
- Dichotomy established in [Dalvi&S’2004]
- CQ1 (conjunctive queries, no self-joins):
– R(x,y), S(y,z) OK – R(x,y), R(y,z) Not OK
- Independent tuples only:
– R(x,y) OK – S(y,z) Not OK
70
CQ1 + Independent
Theorem Forall q ∈ CQ1:
- q is hierarchical, has a safe plan, and is in PTIME,
OR
- q is not hierarchical and is #P-hard
[Dalvi&S’2004]
71
The PTIME Queries
q = R(x, y), S(x, z)    (hierarchy: x above y and z)
Safe plan: Πi-x ( Πd-y Rp(x,y) ⋈x Πd-z Sp(x,z) )
(Πi-x = independent project away x; Πd-y, Πd-z = disjoint projects away y, z)

Algorithm: convert a Hierarchy to a Safe Plan
- 1. Root variable u → Πi-u
- 2. Connected components → Join
- 3. Single subgoal → Leaf node
72
q = R(x, y), S(x, z)
Plan: Π-x ( Π-y Rp(x,y) ⋈x Π-z Sp(x,z) )

Rp:
A   B   P
a1  b1  p1
a2  b2  p2

Sp:
A   C   P
a1  c1  q1
a1  c2  q2
a2  c3  q3
a2  c4  q4
a2  c5  q5

After Π-z on Sp:
A   P
a1  1-(1-q1)(1-q2)
a2  1-(1-q3)(1-q4)(1-q5)

After the join with Π-y Rp:
A   P
a1  p1(1-(1-q1)(1-q2))
a2  p2(1-(1-q3)(1-q4)(1-q5))

P(q) = 1 - (1-p1(1-(1-q1)(1-q2))) × (1-p2(1-(1-q3)(1-q4)(1-q5)))
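This safe plan is ordinary relational aggregation. A sketch with illustrative values p1 = 0.6, p2 = 0.7 and q1..q5 = 0.5, 0.4, 0.3, 0.2, 0.1:

```python
from collections import defaultdict

R = [("a1", "b1", 0.6), ("a2", "b2", 0.7)]            # (A, B, P) = p1, p2
S = [("a1", "c1", 0.5), ("a1", "c2", 0.4),            # (A, C, P) = q1..q5
     ("a2", "c3", 0.3), ("a2", "c4", 0.2), ("a2", "c5", 0.1)]

# Independent project of S away from z: per A-value, 1 - prod(1 - q)
acc = defaultdict(lambda: 1.0)
for a, _, q in S:
    acc[a] *= 1 - q
s_proj = {a: 1 - v for a, v in acc.items()}

# Join with R on A (multiply), then a final independent project over x
p = 1.0
for a, _, pr in R:
    p *= 1 - pr * s_proj.get(a, 0.0)
p = 1 - p
print(round(p, 6))   # 0.621376, matching the closed-form P(q)
```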
73
The #P-Hard Queries
[D&S’2004] The #P-hard queries are precisely the non-hierarchical queries.
hd1 :- R(x), S(x, y), T(y)
More general example: q :- …, R(x, …), S(x, y, …), T(y, …), …
Theorem: Testing if q is PTIME or #P-hard is in AC0
74
Quiz: What is their complexity ?
q                                       PTIME or #P ?
R(x,y), S(y,a,u), T(y,y,v)
R(x,y), S(x,y,z), T(x,z)
R(x,a), S(y,u,x), T(u,y), U(x,y)
R(x,y,z), S(z,u,y), T(y,v,z,x), U(y)
75
Hint…
(Hint: draw the variable hierarchy of each query.)

q                                       PTIME or #P ?
R(x,y), S(y,a,u), T(y,y,v)
R(x,y), S(x,y,z), T(x,z)
R(x,a), S(y,u,x), T(u,y), U(x,y)
R(x,y,z), S(z,u,y), T(y,v,z,x), U(y)
76
…Answer
q                                       PTIME or #P ?
R(x,y), S(y,a,u), T(y,y,v)              PTIME
R(x,y), S(x,y,z), T(x,z)                #P
R(x,a), S(y,u,x), T(u,y), U(x,y)        #P
R(x,y,z), S(z,u,y), T(y,v,z,x), U(y)    PTIME
77
Case 2: CQ1+Disjoint/independent
- Dichotomy: in [Dalvi et al.’06,Dalvi&S’07]
- Some safe plans also in [Andritsos’2006]
- CQ1 (conjunctive queries, no self-joins)
- Disjoint/independent tables are OK
Theorem Forall q ∈ CQ1
- q has a safe plan and is in PTIME, OR
- q is #P-hard
78
The PTIME Queries
Algorithm: find a Safe Plan
- 1. Root variable u → Πi-u
- 2. Variable u occurs in a subgoal with constant keys → ΠD-u
- 3. Connected components → Join
- 4. Single subgoal → Leaf node
q(y) :- R(x,y,z)
Plan: Πi-x ( ΠD-z R(x,y,z) ), where the leaf query is q1(xc,yc) :- R(xc,yc,z)

R:
x   y  z   P
a1  b  c1  p1
a1  b  c2  p2
a2  b  c1  p3
a2  b  c2  p4

After ΠD-z:
x   y  P
a1  b  p1+p2
a2  b  p3+p4

After Πi-x:
y  P
b  1-(1-p1-p2)(1-p3-p4)
79
q = R(x), S(x, y), T(y), U(u, y), W(‘a’, u)

Safe plan (sketch):
- Πi-x ( Rp(x) ⋈x Sp(x,y) )        independent project away x
- ΠD-u ( Wp(‘a’,u) ⋈u Up(u,y) )    disjoint project away u
- join both results with Tp(y) on y, then ΠD-y    disjoint project away y
80
The #P-Hard Queries
hd1 = R(x), S(x,y), T(y)
hd2 = R(x,y), S(y)
hd3 = R(x,y), S(x,y)
There are variations on hd2, hd3 (see paper) [Dalvi&S’2007]. In general, a query is #P-hard if it can be “rewritten” to hd1, hd2, hd3, or one of their “variations”.
Theorem: Testing if q is PTIME or #P-hard is PTIME-complete.
81
Case 3: Any conjunctive query, independent tables
[Dalvi&S’2007b]
Let q be hierarchical
- x ⊇ y denotes: x is above y in the hierarchy
- x ≡ y denotes: x ⊇ y and x ⊆ y

Definition: An inversion is a chain of unifications: x ⊃ y with u1 ≡ v1 with … with un ≡ vn with x’ ⊂ y’

Theorem: For all q ∈ CQ:
- If q is non-hierarchical, or has an inversion*, then it is #P-hard
- Otherwise it is in PTIME
*without “eraser”: see paper.
82
The #P-hard Queries
Hierarchical queries with “inversions” [Dalvi&S’2007b]:
hi1 = R(x), S(x,y), S(x’,y’), T(y’)
   x ⊃ y unifies with x’ ⊂ y’
hi2 = R(x), S(x,y), S(u,v), S’(u,v), S’(x’,y’), T(y’)
   x ⊃ y unifies with u ≡ v, which unifies with x’ ⊂ y’
83
The #P-hard Queries
A query with a long inversion: hik = R(x), S0(x,y), S0(u1,v1), S1(u1,v1), S1(u2,v2), S2(u2,v2), . . ., Sk(x’,y’), T(y’)
84
The #P-hard Queries
Sometimes inversions are exposed only after making a copy of the query:
q = R(x,y), R(y,z)
Copies: R(x,y), R(y,z) and R(x’,y’), R(y’,z’)
85
The PTIME Queries
Find movies with high reviews from Joe and Jim:
q(x) :- Movie(x,y), Match(x,r), Review(r,Joe,s), s > 4, Match(x,r’), Review(r’,Jim,s’), s’ > 4
The Review subgoals don’t unify; the Match subgoals unify, but there is no inversion.
Note: the query is hierarchical because x is a “constant”.
86
The PTIME Queries
Note: no “safe plans” are known! The PTIME algorithm for an inversion-free query is given in terms of expressions, not plans. Example [Dalvi&S’2007b]:
q :- R(a,x), R(y,b)
p(q) = p(R(a,b)) + (1 - p(R(a,b))) × (1 - ∏y∈Dom, y≠a (1 - p(R(y,b)))) × (1 - ∏x∈Dom, x≠b (1 - p(R(a,x))))
Open Problem: what are the natural operators that allow us to compute inversion-free queries in a database engine ?
87
Query                      Complexity  Why
R(a,x), R(y,b)             PTIME
R(a,x), R(x,b)             PTIME
R(x,y), R(y,z)             #P          Inversion
R(x,y), R(y,z), R(z,u)     #P          Non-hierarchical
R(x,y), R(y,z), R(z,x)     #P          Non-hierarchical
R(x,y), R(y,z), R(x,z)     #P          Non-hierarchical
88
History
- [Graedel, Gurevitch, Hirsch’98]
– L(x,y),R(x,z),S(y),S(z) is #P-hard This is non-hierarchical, with a self-join
- [Dalvi&S’2004]
– R(x),S(x,y),T(y) is #P-hard This is non-hierarchical, w/o self-joins – Without self-joins: non-hierarchical = #P-hard, and hierarchical = PTIME
- [Dalvi&S’2007]
– All non-hierarchical queries are #P-hard
89
What we know:
- Three dichotomies, of increasing complexity
- Dichotomy for aggregates in HAVING
What is open
- CQ + independent/disjoint
- Extensions to ≤, ≥, ≠
- Extensions to unions of conjunctive queries
Summary on the Dichotomy
WHY WE CARE: Safe queries = most powerful optimization we have [Re&S.2007]
90
Outline
Part 1:
- Motivation
- Data model
- Basic query evaluation
Part 2:
- The dichotomy of query evaluation
- Implementation and optimization
- Six Challenges
91
Implementation and Optimization
Topics:
- General probabilistic inference
- Optimization 1: Safe-subplans
- Optimization 2: Top K
- Performance of MystiQ
92
General Query Evaluation
- Query q + database DB → boolean expression Φq,DB
- Run any probabilistic inference algorithm on Φq,DB
This approach is taken in Trio
93
Background: Probability of Boolean Expressions

Given: P(X1) = p1, P(X2) = p2, P(X3) = p3. Compute P(Φ) for:
Φ = X1X2 ∨ X1X3 ∨ X2X3

Ω restricted to the satisfying assignments:
X1  X2  X3  P            Φ
0   1   1   (1-p1)p2p3   1
1   0   1   p1(1-p2)p3   1
1   1   0   p1p2(1-p3)   1
1   1   1   p1p2p3       1

Pr(Φ) = (1-p1)p2p3 + p1(1-p2)p3 + p1p2(1-p3) + p1p2p3

In general, computing P(Φ) is #P-complete [Valiant:1979]
94
Query q + Database PDB → Φ

q = R(x, y), S(x, z)

Rp:
A   B   P
a1  b1  p1  X1
a2  b2  p2  X2

Sp:
A   C   P
a1  c1  q1  Y1
a1  c2  q2  Y2
a2  c3  q3  Y3
a2  c4  q4  Y4
a2  c5  q5  Y5

Φ = X1Y1 ∨ X1Y2 ∨ X2Y3 ∨ X2Y4 ∨ X2Y5
Probabilistic Networks
Nodes = random variables; edges = dependence
Φ = X1Y1 ∨ X1Y2 ∨ X2Y3 ∨ X2Y4 ∨ X2Y5    (for q = R(x,y), S(x,z))
Studied intensively in KR. Typical networks:
- Bayesian networks
- Markov networks
- Boolean expressions
96
Inference Algorithms for Boolean Expressions
- Randomized:
– Naïve Monte Carlo – Luby and Karp
- Deterministic
– Algorithmic guarantees: [Trevisan’04], [Luby&Velickovic’91] – Inference algorithms in AI: variable elimination, junction trees,… – Tractable cases: bounded-width trees [Zabiyaka&Darwiche’06]
97
Naive Monte Carlo Simulation
E = X1X2 ∨ X1X3 ∨ X2X3

Cnt ← 0
repeat N times
    randomly choose X1, X2, X3 ∈ {0,1}
    if E(X1, X2, X3) = 1 then Cnt = Cnt + 1
P = Cnt/N
return P   /* ≈ Pr(E) */
Theorem (0-1 estimator): If N ≥ (1/Pr(E)) × (4 ln(2/δ)/ε²) then Pr[ |P/Pr(E) - 1| > ε ] < δ
Note: 1/Pr(E) may be big (in theory).
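The naïve estimator is a few lines of runnable code; here it is for E = X1X2 ∨ X1X3 ∨ X2X3 with P(Xi) = 1/2 (the exact answer is 0.5, since 4 of the 8 equally likely assignments satisfy E):

```python
import random

random.seed(0)

def E(x1, x2, x3):
    return (x1 and x2) or (x1 and x3) or (x2 and x3)

def naive_mc(n):
    # Repeat n times: draw a random assignment, count how often E holds
    cnt = 0
    for _ in range(n):
        bits = [random.random() < 0.5 for _ in range(3)]
        cnt += E(*bits)
    return cnt / n

print(naive_mc(100_000))   # close to the exact value 0.5
```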
98
Improved Monte Carlo Simulation
E = C1 ∨ C2 ∨ . . . ∨ Cm

Cnt ← 0; S ← Pr(C1) + … + Pr(Cm)
repeat N times
    randomly choose i ∈ {1, 2, …, m}, with prob. Pr(Ci)/S
    randomly choose X1, …, Xn ∈ {0,1} s.t. Ci = 1
    if C1 = 0 and C2 = 0 and … and Ci-1 = 0 then Cnt = Cnt + 1
P = Cnt/N × S
return P   /* ≈ Pr(E) */

Theorem: If N ≥ m × (4 ln(2/δ)/ε²) then Pr[ |P/Pr(E) - 1| > ε ] < δ
Now it’s in PTIME [Karp&Luby:1983] [Graedel,Gurevitch,Hirsch:1998]
99
An Example
Rp:
A   B   P
a1  b1  p1
a1  b2  p2
a2  b1  p3

Sp:
B   C   P
b1  c1  q1
b2  c1  q2
b2  c2  q3
b2  c3  q4

Tp:
C   D   P
c1  d1  r1
c1  d2  r2
c2  d1  r3
c2  d2  r4
c2  d3  r5

Step 1: evaluate this query on the representation to get the data:
qTemp(x,y,p,y,z,q,z,u,r) :- R(x,y,p), S(y,z,q), T(z,u,r)
q(x,u) :- Rp(x,y), Sp(y,z), Tp(z,u)
[Re,Dalvi&S’2007]
100
Rp:
A   B   P
a1  b1  p1
a1  b2  p2
a2  b1  p3

Sp:
B   C   P
b1  c1  q1
b2  c1  q2
b2  c2  q3
b2  c3  q4

Tp:
C   D   P
c1  d1  r1
c1  d2  r2
c2  d1  r3
c2  d2  r4
c2  d3  r5
qTemp(x,y,p,y,z,q,z,u, r) :- R(x,y,p), S(y,z,q), T(z,u,r)
Temp:
A   B   P    B   C   P    C   D   P
a1  b1  p1   b1  c1  q1   c1  d1  r1
a1  b2  p2   b2  c2  q3   c2  d1  r3
a2  b1  . .  . . . .      . . . .
101
Step 2: group Temp by the head variables in q
q(x,u) :- Rp(x,y), Sp(y,z), Tp(z,u)
Temp grouped by the head variables (x, u):

q(a1,d1):
A   B   P    B   C   P    C   D   P
a1  b1  p1   b1  c1  q1   c1  d1  r1
a1  b2  p2   b2  c2  q3   c2  d1  r3
. . .

q(a1,d2):
a1  b1  p1   b1  c1  q1   c1  d2  r2
. . .
102
Step 3: each group is a DNF formula; run Monte Carlo
Group q(a1,d1):
A   B   P    B   C   P    C   D   P
a1  b1  p1   b1  c1  q1   c1  d1  r1
a1  b2  p2   b2  c2  q3   c2  d1  r3
. . .

Φa1,d1 = X11Y11Z11 ∨ X12Y22Z21 ∨ …      P(Φa1,d1) = s1
Φa1,d2 = X11Y11Z12 ∨ . . .              P(Φa1,d2) = s2
. . .
where X11 = R(a1,b1), X12 = R(a1,b2), Y11 = S(b1,c1), etc.
103
Temp:
A   B   P    B   C   P    C   D   P
a1  b1  p1   b1  c1  q1   c1  d1  r1
a1  b1  p1   b1  c1  q1   c1  d2  r2
…

Step 4: collect all results, return top k

Answer to q(x,u):
A   D   P
a1  d1  s1
a1  d2  s2
…

Remark:
- The DBMS executes only the query qTemp: only selections and joins are done in the engine
- The probabilistic inference is done in the middleware
104
Summary on Monte Carlo
General method for evaluating P(q), ∀ q ∈ CQ
- Naïve MC: N = O(1/P(q)) steps
- Luby&Karp: N = O(m) steps
Lessons from MystiQ: no big difference
- Typically: P(q) ≈ 0.1 or higher
- Typically: m ≈ 5 - 10 or higher
Typical number of steps: N ≈ 100,000, and this is for one single tuple in the answer!
105
Optimization 1: Safe Subqueries
Main idea:
- 1. Find subqueries of q that are safe and “representable”
- 2. Evaluate the subqueries using safe plans
- 3. Rewrite q to qopt by using the subqueries, then evaluate qopt using Monte Carlo
The “representability” problem is discussed in [Re&S.2007]
106
Example
- 1. Find the following subquery of q :- Rp(x,y), Sp(y,z), Tp(y,z,u):
sq(y) :- Sp(y,z), Tp(y,z,u)
- sq is safe: sq = Πdy(S ⋈ T)
- sq(b) is independent from sq(b’), whenever b ≠ b’
We illustrate with a boolean query (for simplicity):
107
- 2. Compute sq(y) on the representation using the safe plan:
SELECT S.B, sum(S.P*T.P) as P
FROM S, T
WHERE S.B=T.B and S.C=T.C
GROUP BY S.B
SQp:
B   P
b1  t1
b2  t2
. .
- 3. Rewrite q to qopt:
qopt :- Rp(x,y), SQp(y)
- Send this to the engine: qTempopt(x,p,y,q) :- R(x,y,p), SQp(y,q)
- Continue as before: run Monte Carlo on the result
What’s improved:
- Some of the probabilistic inference pushed in RDBMS
- Monte Carlo runs on a smaller DNF
108
Optimization 2: Top-K Ranking
Main idea:
- Number of potential answers is huge
– 100s or 1000s
- Users want to see only the top-k
– Typical: top 10, or top 20
Catch 22:
- Run the expensive Monte Carlo only on top k
- But to discover the top-k we need to run MC !
Interleave Monte Carlo steps with ranking
[Re’2007]
109
Modeling Monte Carlo Simulation
[Figure: as the number N of simulation steps grows (N = 0, 1, 2, 3, …), the interval bounding p shrinks]
110
Final, ranked Answer
A      D      P
a1     d1     0.2 – 0.7
a2     d2     0.6 – 0.8
a3     d3     0 – 1.0
a1000  d1000  0.3 – 0.9
Current Approximation
q(x,u) :- Rp(x,y), Sp(y,z), Tp(z,u)
Top-k:
A     D     P
a49   d49   0.99
a22   d22   0.90
a87   b87   0.85

Bottom n-k:
a522  b522  0.01
111
Last Quiz: which one should we simulate next ?
We have n objects. How to find the top k? Example: looking for top k = 2. Which one should we simulate next?
[Figure: five probability intervals p1, . . ., p5 on the [0,1] axis]
112
Multisimulation
Critical region (L, R): L = k’th largest left endpoint, R = (k+1)’th largest right endpoint (here k = 2)
113
Multisimulation Algorithm
End: when the critical region is “empty”
114
Multisimulation Algorithm
Case 1: pick a “double crosser” and simulate it
115
Multisimulation Algorithm
Case 2: pick both a “left” AND a “right” crosser, and simulate both
116
Multisimulation Algorithm
Case 3: pick a “max crosser” and simulate it
117
Multisimulation Algorithm
Theorem: (1) It runs in < 2 × the optimal number of steps; (2) no other deterministic algorithm does better.
[Re’2007]
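The three cases above can be sketched in a simplified multisimulation loop. This is a reading of the algorithm, not the exact version from [Re’2007]: the Hoeffding-style interval width, tie-breaking, budget, and the use of known probabilities as a stand-in Monte Carlo oracle are all assumptions for illustration.

```python
import math
import random

def multisimulation(true_ps, k, budget=20000, seed=7):
    """Approximate the indices of the top-k objects by probability.
    true_ps stands in for the Monte Carlo oracle: simulating object i
    flips a coin with its (hidden) probability."""
    rng = random.Random(seed)
    m = len(true_ps)
    n, hits = [0] * m, [0] * m

    def step(i):                       # one Monte Carlo step for object i
        hits[i] += rng.random() < true_ps[i]
        n[i] += 1

    def bounds(i):                     # Hoeffding-style interval (a sketch)
        p, w = hits[i] / n[i], 2.0 / math.sqrt(n[i])
        return max(0.0, p - w), min(1.0, p + w)

    for i in range(m):                 # warm start: one step each
        step(i)
    spent = m
    while spent < budget:
        ivs = [bounds(i) for i in range(m)]
        L = sorted((lo for lo, _ in ivs), reverse=True)[k - 1]
        R = sorted((hi for _, hi in ivs), reverse=True)[k]
        if L >= R:                     # critical region (L, R) is empty: done
            break
        crossers = [i for i in range(m) if ivs[i][0] < R and ivs[i][1] > L]
        double = [i for i in crossers if ivs[i][0] <= L and ivs[i][1] >= R]
        left = [i for i in crossers if ivs[i][0] <= L]
        right = [i for i in crossers if ivs[i][1] >= R]
        if double:                     # Case 1: a double crosser
            step(double[0]); spent += 1
        elif left and right:           # Case 2: a left AND a right crosser
            step(left[0]); step(right[0]); spent += 2
        elif crossers:                 # Case 3: the max crosser
            step(max(crossers, key=lambda i: ivs[i][1])); spent += 1
        else:
            break
    est = [hits[i] / n[i] for i in range(m)]
    return sorted(range(m), key=lambda i: -est[i])[:k]
```

Because only intervals crossing the critical region are refined, simulation effort concentrates on the objects near the top-k boundary, which is exactly the behavior the performance slides report.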
118
Performance of MystiQ
10 million probabilistic tuples; DB2
[Re’2007] Finding the top k: O(1); finding and sorting the top k: O(k)
119
10 million probabilistic tuples; DB2
[Re’2007] Simulation steps are concentrated in the top ≈ k buckets
120
10 million probabilistic tuples; DB2
[Re’2007]
N = naïve (simulate all), MS = top-k multisimulation, SP = adds safe-plan optimization
SQL query time vs. Monte Carlo time
Times in seconds (logarithmic scale!)
121
Summary of Implementation and Systems
- General-purpose inference algorithms
– Several available, but sloooow!
– Run outside the RDBMS
- Optimization 1: push some of the probabilistic inference into the engine through “safe plans”
- Optimization 2: exploit the fact that users want top-k answers only
122
Outline
Part 1:
- Motivation
- Data model
- Basic query evaluation
Part 2:
- The dichotomy of query evaluation
- Implementation and optimization
- Six Challenges
123
- 1. Query Optimization
Even a #P-hard query often has subqueries that are in PTIME. Needed:
- Combine safe plans + probabilistic inference
- “Interesting independence/disjointness”
- Model a probabilistic engine as black-box
CHALLENGE 1: Integrate a black-box probabilistic inference in a query processor.
[Re’2007,Re’2007b]
124
- 2. Probabilistic Inference
Open the box! Move from logical to physical: examine specific inference algorithms from KR:
- Variable elimination
- Junction trees
- Bounded treewidth
[Sen&Deshpande’2007] [Bravo&Ramakrishnan’2007]
CHALLENGE 2: (1) Study the space of optimization alternatives. (2) Estimate the cost of specific probabilistic inference algorithms.
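To make the KR algorithms listed above concrete, here is a minimal variable-elimination sketch on a three-variable chain X1 → X2 → X3; the network shape and all probability tables are invented toy numbers, not from the cited systems.

```python
# Binary chain X1 -> X2 -> X3; compute P(X3) by eliminating X1, then X2.
# All conditional probability tables below are made-up toy values.
p_x1 = {0: 0.7, 1: 0.3}
p_x2_given_x1 = {(0, 0): 0.9, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.8}
p_x3_given_x2 = {(0, 0): 0.6, (0, 1): 0.4, (1, 0): 0.5, (1, 1): 0.5}

# Eliminate X1: m1(x2) = sum_{x1} P(x1) * P(x2 | x1)
m1 = {x2: sum(p_x1[x1] * p_x2_given_x1[(x1, x2)] for x1 in (0, 1))
      for x2 in (0, 1)}
# Eliminate X2: m2(x3) = sum_{x2} m1(x2) * P(x3 | x2)
m2 = {x3: sum(m1[x2] * p_x3_given_x2[(x2, x3)] for x2 in (0, 1))
      for x3 in (0, 1)}
# m2 is the marginal P(X3); the cost grew linearly in the chain length
# rather than exponentially in the number of variables.
```

Junction trees and bounded-treewidth methods generalize this elimination order from chains to arbitrary (low-treewidth) graphs, which is what a cost estimator would have to model.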
125
- 3. Open Theory Problems
- Self-joins are much harder to study
– Solved only for independent tuples
- Extend to richer query language
– Unions, predicates (< , ≤, ≠), aggregates
- Do hardness results still hold for Pr = 1/2 ?
CHALLENGE 3: Complete the analysis of the query complexity over probabilistic databases
[D&S’2007]
126
- 4. Complex Probabilistic Model
- Independent and disjoint tuples are
insufficient for real applications
- Capturing complex correlations:
– Lineage
– Graphical models
[Getoor’06, Sen&Deshpande’07] [Das Sarma’06, Benjelloun’06]
CHALLENGE 4: Explore the connection between complex models and views
[Verma&Pearl’1990]
127
- 5. Constraints
Needed to clean uncertainties in the data
- Hard constraints:
– Semantics = conditional probability
- Soft constraints:
– What is the semantics ?
Lots of prior work, but still little understood
[Shen’06, Andritsos’06, Richardson’06,Chaudhuri’07]
CHALLENGE 5: Study the impact of hard/soft constraints on query evaluation
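One way to read “semantics = conditional probability” for hard constraints: condition the distribution over possible worlds on the constraint holding. A tiny enumeration sketch, with invented tuples and probabilities:

```python
import math
from itertools import product

# Two independent tuples with a hard constraint: t1 and t2 cannot both
# be present (e.g., two contradictory records). Toy values throughout.
probs = {"t1": 0.6, "t2": 0.5}
names = list(probs)

def world_prob(world):
    """Probability of a possible world under tuple independence."""
    return math.prod(probs[t] if t in world else 1 - probs[t] for t in names)

worlds = [frozenset(t for t, bit in zip(names, bits) if bit)
          for bits in product((0, 1), repeat=len(names))]
legal = [w for w in worlds if not ({"t1", "t2"} <= w)]   # constraint holds

z = sum(world_prob(w) for w in legal)      # P(constraint) = 1 - 0.6*0.5 = 0.7
p_t1 = sum(world_prob(w) for w in legal if "t1" in w) / z   # P(t1 | constraint)
```

Conditioning lowers P(t1) from 0.6 to 0.3/0.7 ≈ 0.43: the constraint induces a (negative) correlation between tuples that were independent, which is precisely what makes query evaluation under constraints hard.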
128
- 6. Information Leakage
A view V should not leak information about a secret S: P(S) ≈ P(S | V)
- Issues: Which prior P ? What is ≈ ?
Probability Logic:
- U ⇒ V means P(V | U) ≈ 1
CHALLENGE 6: Define a probability logic for reasoning about information leakage
[Evfimievski’03, Miklau&S’04, DMS’05] [Pearl’88, Adams’98]
129
Conclusions
- Prohibitive cost of cleaning data
- Represent uncertainties explicitly
- Need new approaches to data management
A call to arms: The management of probabilistic data
130
Bibliography
[Ada98] Ernest Adams. A Primer of Probability Logic. CSLI Publications, Stanford, California, 1998.
[AFM06] P. Andritsos, A. Fuxman, and R. J. Miller. Clean answers over dirty databases. In ICDE, 2006.
[AGK06] A. Arasu, V. Ganti, and R. Kaushik. Efficient exact set-similarity joins. In VLDB, pages 918–929, 2006.
[AKO07a] L. Antova, C. Koch, and D. Olteanu. 10^(10^6) worlds and beyond: Efficient representation and processing of incomplete information. In ICDE, 2007.
[AKO07b] L. Antova, C. Koch, and D. Olteanu. World-set decompositions: Expressiveness and efficient algorithms. In ICDT, pages 194–208, 2007.
131
Bibliography
[AS06] S. Abiteboul and P. Senellart. Querying and updating probabilistic information in XML. In EDBT, pages 1059–1068, 2006.
[BDJ+06] D. Burdick, P. Deshpande, T. S. Jayram, R. Ramakrishnan, and S. Vaithyanathan. Efficient allocation algorithms for OLAP over imprecise data. In VLDB, pages 391–402, 2006.
[BDSHW06] O. Benjelloun, A. Das Sarma, A. Halevy, and J. Widom. ULDBs: Databases with uncertainty and lineage. In VLDB, pages 953–964, 2006.
[BGHK96] F. Bacchus, A. Grove, J. Halpern, and D. Koller. From statistical knowledge bases to degrees of belief. Artificial Intelligence, 87(1-2):75–143, 1996.
132
Bibliography
[BGMP92] D. Barbara, H. Garcia-Molina, and D. Porter. The management of probabilistic data. IEEE Trans. Knowl. Data Eng., 4(5):487–502, 1992.
[BZ06] G. Borriello and F. Zhao. World-Wide Sensor Web: 2006 UWMSR Summer Institute, Semiahmoo Resort, Blaine, WA, 2006. www.cs.washington.edu/mssi/2006/schedule.html.
[CDLS99] R. Cowell, P. Dawid, S. Lauritzen, and D. Spiegelhalter, editors. Probabilistic Networks and Expert Systems. Springer, 1999.
[Coo90] G. Cooper. Computational complexity of probabilistic inference using Bayesian belief networks (research note). Artificial Intelligence, 42:393–405, 1990.
133
Bibliography
[CPWL06] T. Choudhury, M. Philipose, D. Wyatt, and J. Lester. Towards activity databases: Using sensors and statistical models to summarize people’s lives. IEEE Data Eng. Bull., 29(1):49–58, March 2006.
[CRF03] W. Cohen, P. Ravikumar, and S. Fienberg. A comparison of string distance metrics for name-matching tasks. In IIWeb, pages 73–78, 2003.
[Dal07] Nilesh Dalvi. Query evaluation on a database given by a random graph. In ICDT, pages 149–163, 2007.
[Dar03] Adnan Darwiche. A differential approach to inference in Bayesian networks. Journal of the ACM, 50(3):280–305, 2003.
[DGM+04] A. Deshpande, C. Guestrin, S. Madden, J. M. Hellerstein, and W. Hong. Model-driven data acquisition in sensor networks. In VLDB, 2004.
134
Bibliography
[DGM+05] A. Deshpande, C. Guestrin, S. Madden, J. M. Hellerstein, and W. Hong. Using probabilistic models for data management in acquisitional environments. In CIDR, pages 317–328, 2005.
[DGR01] A. Deshpande, M. Garofalakis, and R. Rastogi. Independence is good: Dependency-based histogram synopses for high-dimensional data. In SIGMOD, pages 199–210, 2001.
[DL93] P. Dagum and M. Luby. Approximating probabilistic inference in Bayesian belief networks is NP-hard. Artificial Intelligence, 60:141–153, 1993.
[DMS05] N. Dalvi, G. Miklau, and D. Suciu. Asymptotic conditional probabilities for conjunctive queries. In ICDT, 2005.
[dR95] Michel de Rougemont. The reliability of queries. In PODS, pages 286–291, 1995.
135
Bibliography
[DRC+06] A. Doan, R. Ramakrishnan, F. Chen, P. DeRose, Y. Lee, R. McCann, M. Sayyadian, and W. Shen. Community information management. IEEE Data Engineering Bulletin, Special Issue on Probabilistic Data Management, 29(1):64–72, March 2006.
[DRS06] N. Dalvi, Chris Re, and D. Suciu. Query evaluation on probabilistic databases. IEEE Data Engineering Bulletin, 29(1):25–31, 2006.
[DS04] N. Dalvi and D. Suciu. Efficient query evaluation on probabilistic databases. In VLDB, Toronto, Canada, 2004.
[DS05] N. Dalvi and D. Suciu. Answering queries from statistics and probabilistic views. In VLDB, 2005.
[DS07a] N. Dalvi and D. Suciu. The dichotomy of conjunctive queries on probabilistic structures. In PODS, pages 293–302, 2007.
136
Bibliography
[DS07b] N. Dalvi and D. Suciu. Management of probabilistic data: Foundations and challenges. In PODS, pages 1–12, Beijing, China, 2007. (Invited talk.)
[DSBHW06] A. Das Sarma, O. Benjelloun, A. Halevy, and J. Widom. Working models for uncertain data. In ICDE, 2006.
[ea07] M. Balazinska et al. Data management in the world-wide sensor web. IEEE Pervasive Computing, 2007. To appear.
[FHM05] M. Franklin, A. Halevy, and D. Maier. From databases to dataspaces: a new abstraction for information management. SIGMOD Record, 34(4):27–33, 2005.
137
Bibliography
[FR97] Norbert Fuhr and Thomas Roelleke. A probabilistic relational algebra for the integration of information retrieval and database systems. ACM Trans. Inf. Syst., 15(1):32–66, 1997.
[FS69] Ivan Fellegi and Alan Sunter. A theory for record linkage. Journal of the American Statistical Society, 64:1183–1210, 1969.
[Get06] Lise Getoor. An introduction to probabilistic graphical models for relational data. IEEE Data Engineering Bulletin, Special Issue on Probabilistic Data Management, 29(1):32–40, March 2006.
[GGH98] E. Grädel, Y. Gurevich, and C. Hirsch. The complexity of query reliability. In PODS, pages 227–234, 1998.
138
Bibliography
[GHR95] R. Greenlaw, J. Hoover, and W. Ruzzo. Limits to Parallel Computation: P-Completeness Theory. Oxford University Press, New York, Oxford, 1995.
[GS06a] Minos Garofalakis and Dan Suciu. Special issue on probabilistic data management. IEEE Data Engineering Bulletin, pages 1–72, 2006.
[GS06b] R. Gupta and S. Sarawagi. Creating probabilistic databases from information extraction models. In VLDB, pages 965–976, 2006.
[GT06] T. Green and V. Tannen. Models for incomplete and probabilistic information. IEEE Data Engineering Bulletin, 29(1):17–24, March 2006.
[Hal06] J. Halpern. From statistical knowledge bases to degrees of belief: an overview. In PODS, pages 110–113, 2006.
139
Bibliography
[Hec02] D. Heckerman. Tutorial on graphical models, June 2002.
[HFM06] A. Halevy, M. Franklin, and D. Maier. Principles of dataspace systems. In PODS, pages 1–9, 2006.
[HGS03] E. Hung, L. Getoor, and V.S. Subrahmanian. PXML: A probabilistic semistructured data model and algebra. In ICDE, 2003.
[HRO06] A. Halevy, A. Rajaraman, and J. Ordille. Data integration: The teenage years. In VLDB, pages 9–16, 2006.
[IMH+04] I.F. Ilyas, V. Markl, P.J. Haas, P. Brown, and A. Aboulnaga. CORDS: Automatic discovery of correlations and soft functional dependencies. In SIGMOD, pages 647–658, 2004.
140
Bibliography
[JGF06] S. Jeffery, M. Garofalakis, and M. Franklin. Adaptive cleaning for RFID data streams. In VLDB, pages 163–174, 2006.
[JKR+06] T.S. Jayram, R. Krishnamurthy, S. Raghavan, S. Vaithyanathan, and H. Zhu. Avatar information extraction system. IEEE Data Engineering Bulletin, 29(1):40–48, 2006.
[JKV07] T.S. Jayram, S. Kale, and E. Vee. Efficient aggregation algorithms for probabilistic data. In SODA, 2007.
[KBS06] N. Khoussainova, M. Balazinska, and D. Suciu. Towards correcting input data errors probabilistically using integrity constraints. In MobiDB, pages 43–50, 2006.
[KL83] R. Karp and M. Luby. Monte-Carlo algorithms for enumeration and reliability problems. In Proceedings of the Annual ACM Symposium on Theory of Computing, 1983.
141
Bibliography
[Kol] D. Koller. Representation, reasoning, learning. Computers and Thought 2001 Award talk.
[Kol05] P. Kolaitis. Schema mappings, data exchange, and metadata management. In PODS, pages 61–75, 2005.
[LCK+05] J. Lester, T. Choudhury, N. Kern, G. Borriello, and B. Hannaford. A hybrid discriminative/generative approach for modeling human activities. In IJCAI, pages 766–772, 2005.
[LLRS97] L. Lakshmanan, N. Leone, R. Ross, and V.S. Subrahmanian. ProbView: A flexible probabilistic database system. ACM Trans. Database Syst., 22(3), 1997.
[MCD+07] J. Madhavan, S. Cohen, X. Dong, A. Halevy, S. Jeffery, D. Ko, and C. Yu. Web-scale data integration: You can afford to pay as you go. In CIDR, pages 342–350, 2007.
142
Bibliography
[MS04] G. Miklau and D. Suciu. A formal analysis of information disclosure in data exchange. In SIGMOD, 2004.
[PB83] J. S. Provan and M. O. Ball. The complexity of counting cuts and of computing the probability that a graph is connected. SIAM J. Comput., 12(4):777–788, 1983.
[Pea88] Judea Pearl. Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann, 1988.
[RD07a] C. Re and D. Suciu. Efficient evaluation of HAVING queries on a probabilistic database. In Proceedings of DBPL, 2007.
[RD07b] C. Re and D. Suciu. Materialized views in probabilistic databases for information exchange and query optimization. In Proceedings of VLDB, 2007.
143
Bibliography
[RDS07] C. Re, N. Dalvi, and D. Suciu. Efficient top-k query evaluation on probabilistic data. In ICDE, 2007.
[RSG05] R. Ross, V.S. Subrahmanian, and J. Grant. Aggregate operators in probabilistic databases. JACM, 52(1), 2005.
[Sar] Sunita Sarawagi. Automation in information extraction and data integration. Tutorial presented at VLDB 2002.
[SD07] Prithviraj Sen and Amol Deshpande. Representing and querying correlated tuples in probabilistic databases. In ICDE, 2007.
[SLD05] W. Shen, X. Li, and A. Doan. Constraint-based entity matching. In AAAI, pages 862–867, 2005.
144
Bibliography
[Val79] L. Valiant. The complexity of enumeration and reliability problems. SIAM J. Comput., 8:410–421, 1979.
[vKdKA05] M. van Keulen, A. de Keijzer, and W. Alink. A probabilistic XML approach to data integration. In ICDE, pages 459–470, 2005.
[VP90] T. Verma and J. Pearl. Causal networks: Semantics and expressiveness. Uncertainty in Artificial Intelligence, 4:69–76, 1990.
[Win99] William Winkler. The state of record linkage and current research problems. Technical report, Statistical Research Division, U.S. Bureau of the Census, 1999.
[ZD06] Y. Zabiyaka and A. Darwiche. Functional treewidth: Bounding complexity in the presence of functional dependencies. In SAT, pages