techniques for managing probabilistic data

Techniques for managing probabilistic data. Dan Suciu, University of Washington. Databases Are Deterministic: applications since the 1970s required precise semantics (accounting, inventory). Database tools are deterministic. …


  1. Semantics 1: Possible Tuples

     Review^p:

     | mid | rating | P   |
     |-----|--------|-----|
     | m42 | 7      | 0.5 |
     | m42 | 4      | 0.3 |
     | m42 | 9      | 0.9 |
     | m99 | 7      | 0.6 |
     | m99 | 5      | 0.2 |
     | m76 | 6      | 0.3 |

     Movie^p:

     | id  | year | P   |
     |-----|------|-----|
     | m42 | 1995 | 0.6 |
     | m99 | 2002 | 0.8 |
     | m76 | 2002 | 0.3 |

     q(y) :- Movie^p(x, y), Review^p(x, z), z > 3

     [Figure: the possible worlds, with probabilities p1, p2, ..., each listing the Movie and Review tuples it contains.]

     Answer^p (the probability of a year is the sum of the probabilities of the worlds in which it is an answer):

     | year | P                      |
     |------|------------------------|
     | 1995 | p1 + p4 + p5 + p8 + p9 |
     | 2002 | p3 + p4 + p7           |

  2. Formal Definition

     • Probability space: $(\Omega, P)$
     • Query $q$ and tuple $a$: $q(a)$ is a Boolean query
     • Probabilistic event: $E = \{\omega \mid \omega \models q(a)\}$

     Definition: $P(q(a)) = P(E) = \sum_{\omega \models q(a)} P(\omega)$

     Example: for q(y) :- Movie^p(x, y), Review^p(x, z), z > 3 and the tuple 1995, the Boolean query is q(1995) :- Movie^p(x, 1995), Review^p(x, z), z > 3, and P(q(1995)) is the marginal probability of q(1995).
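
The definition can be checked by brute force on the Movie/Review example. The sketch below is only an illustration: it assumes every tuple is an independent event (the tuple-independent model), and the helper names `query_holds` and `marginal_probability` are made up for this example.

```python
from itertools import product

# Tables from the Movie/Review example. Assumption: every tuple is an
# independent event with the listed marginal probability.
MOVIE = [("m42", 1995, 0.6), ("m99", 2002, 0.8), ("m76", 2002, 0.3)]
REVIEW = [("m42", 7, 0.5), ("m42", 4, 0.3), ("m42", 9, 0.9),
          ("m99", 7, 0.6), ("m99", 5, 0.2), ("m76", 6, 0.3)]


def query_holds(movies, reviews, year):
    """Boolean query q(year): some movie of that year has a review with rating > 3."""
    return any(mid == rid and y == year and rating > 3
               for (mid, y, _) in movies
               for (rid, rating, _) in reviews)


def marginal_probability(year):
    """P(q(year)): sum of P(world) over all possible worlds where q(year) holds."""
    tuples = [("M", t) for t in MOVIE] + [("R", t) for t in REVIEW]
    total = 0.0
    for present in product([False, True], repeat=len(tuples)):
        weight = 1.0
        movies, reviews = [], []
        for keep, (table, t) in zip(present, tuples):
            weight *= t[2] if keep else 1.0 - t[2]
            if keep:
                (movies if table == "M" else reviews).append(t)
        if query_holds(movies, reviews, year):
            total += weight
    return total


print(round(marginal_probability(1995), 5))
print(round(marginal_probability(2002), 5))
```

The enumeration visits all 2^9 worlds, so it only serves as a reference for small inputs.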

  3. Semantics 2: Possible Answers

     q(y) :- Movie^p(x, y), Review^p(x, z), z > 3

     [Figure: every possible world is mapped to the answer that q returns on it. The result is a set of possible answers, each a set of years, with probabilities p1, p2, p3, ...]

  4. Formal Definition

     • Probability space: $(\Omega, P)$
     • View: $v$
     • New probability space: $(\Omega', P')$

     Definition ("image probability space" [Green&Tannen’06]):

     $\Omega' = \{\omega' \mid \exists \omega \in \Omega,\ v(\omega) = \omega'\}$

     $P'(\omega') = \sum_{\omega :\, v(\omega) = \omega'} P(\omega)$

  5. Query Semantics

     • Possible tuples: best for expressing user queries
       – Simple, intuitive user interface
       – Query evaluation is probabilistic inference
       – But is not compositional
     • Possible answers: best for defining views
       – Is compositional
       – Open research problems: user interface, query evaluation

  6. Complex Models = Simple + Views

     Example adapted from [Gupta&Sarawagi’2006]

     Address^p:

     | ID | House-No | Street        | City        | P    |
     |----|----------|---------------|-------------|------|
     | 1  | 52       | Goregaon West | Mumbai      | 0.06 |
     | 1  | 52-A     | Goregaon West | Mumbai      | 0.15 |
     | 1  | 52       | Goregaon      | West Mumbai | 0.12 |
     | 1  | 52-A     | Goregaon      | West Mumbai | 0.3  |
     | 2  | ...      | ...           | ...         | ...  |

     Suppose House-No is extracted independently from Street and City.

  7. Complex Models = Simple + Views (continued)

     The Address^p table of the previous slide factors into two tables:

     AddrH^p:

     | ID | House-No | P   |
     |----|----------|-----|
     | 1  | 52       | 0.2 |
     | 1  | 52-A     | 0.5 |
     | 2  | ...      | ... |

     AddrSC^p:

     | ID | Street        | City        | P   |
     |----|---------------|-------------|-----|
     | 1  | Goregaon West | Mumbai      | 0.3 |
     | 1  | Goregaon      | West Mumbai | 0.6 |
     | 2  | ...           | ...         | ... |

     View: Address(x, y, z, u) :- AddrH(x, y), AddrSC(x, z, u)

  8. Complex Models = Simple + Views

     Standard query rewriting:

     • View: Address(x, y, z, u) :- AddrH(x, y), AddrSC(x, z, u)
     • User query: q(x) :- Address(x, y, z, ’West Mumbai’)
     • Rewritten query: q(x) :- AddrH(x, y), AddrSC(x, z, ’West Mumbai’)
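
A small numeric check of the rewriting, using the ID 1 rows of the address example. This is a sketch under stated assumptions: tuples sharing an ID are treated as mutually exclusive alternatives, the two base tables are independent of each other, and the function names are hypothetical.

```python
# Assumed block-disjoint tables: tuples sharing an ID are mutually exclusive
# alternatives, and AddrH is independent of AddrSC.
ADDR_H = [(1, "52", 0.2), (1, "52-A", 0.5)]
ADDR_SC = [(1, "Goregaon West", "Mumbai", 0.3),
           (1, "Goregaon", "West Mumbai", 0.6)]


def address_view():
    """Materialize Address(x,y,z,u) :- AddrH(x,y), AddrSC(x,z,u) with probabilities."""
    return [(i, h, s, c, ph * psc)
            for (i, h, ph) in ADDR_H
            for (j, s, c, psc) in ADDR_SC if i == j]


def q_on_view(city):
    """q(x) over the view: alternatives of one ID are disjoint, so probabilities add."""
    result = {}
    for (i, _h, _s, c, p) in address_view():
        if c == city:
            result[i] = result.get(i, 0.0) + p
    return result


def q_rewritten(city):
    """q(x) over the base tables: P(some house number) * P(some street with that city)."""
    house = {}
    for (i, _h, p) in ADDR_H:
        house[i] = house.get(i, 0.0) + p
    street_city = {}
    for (i, _s, c, p) in ADDR_SC:
        if c == city:
            street_city[i] = street_city.get(i, 0.0) + p
    return {i: house[i] * street_city[i] for i in house if i in street_city}


print(q_on_view("West Mumbai"), q_rewritten("West Mumbai"))
```

Both routes give the same probability for ID 1, which is what the rewriting relies on.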

  9. Complex Models = Simple + Views

     • In this simple example the view is already representable as a tuple disjoint/independent table
     • In general, views can define more complex probability spaces over possible worlds that are not disjoint/independent

     Theorem [Dalvi&S’2007]: Independent/disjoint tables + conjunctive views = a complete representation system

  10. Discussion of Data Model

     Tuple-disjoint/independent tables:
     • Simple model, can be stored in any DBMS

     More advanced models:
     • Symbolic Boolean expressions (Fuhr and Roellke)
     • Trio: add lineage [Widom05, Das Sarma’06, Benjelloun 06]
     • Probabilistic Relational Models [Getoor’2006]
     • Graphical models [Sen&Deshpande’07]

  11. Outline

     Part 1:
     • Motivation
     • Data model
     • Basic query evaluation

     Part 2:
     • The dichotomy of query evaluation
     • Implementation and optimization
     • Six Challenges

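  12. Extensional Operators

     HasObject^p:

     | Object   | Person | Location | P  |
     |----------|--------|----------|----|
     | Laptop77 | John   | L45      | p1 |
     | Laptop77 | Jim    | L45      | p2 |
     | Laptop77 | Jim    | L66      | p3 |
     | Book302  | Mary   | L66      | p4 |
     | Book302  | Mary   | L45      | p5 |
     | Book302  | Jim    | L66      | p6 |
     | Book302  | John   | L45      | p7 |
     | Book302  | Fred   | L45      | p8 |

     q(z) :- HasObject^p(Book302, y, z)

     | Location | P            |
     |----------|--------------|
     | L66      | p4 + p6      |
     | L45      | p5 + p7 + p8 |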

  13. Disjoint Project

     $\Pi^d$ merges input tuples with probabilities $p_1, p_2, p_3$ into one output tuple with probability $p_1 + p_2 + p_3$.

  14. Extensional Operators

     HasObject^p: same table as on slide 12.

     q(y, z) :- HasObject^p(x, y, z)

     | Person | Location | P                 |
     |--------|----------|-------------------|
     | Jim    | L66      | 1 - (1-p3)(1-p6)  |
     | John   | L45      | 1 - (1-p1)(1-p7)  |
     | ...    | ...      | ...               |

  15. Independent Project

     $\Pi^i$ merges input tuples with probabilities $p_1, p_2, p_3$ into one output tuple with probability $1 - (1-p_1)(1-p_2)(1-p_3)$.
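
The two projection operators can be sketched in a few lines. The numeric probabilities below are invented stand-ins for p1..p8 (an assumption for illustration only); tuples that share an Object are treated as disjoint, tuples of different Objects as independent, and the function names are hypothetical.

```python
from collections import defaultdict
from math import prod

# HasObject^p rows as (Object, Person, Location, probability).
# The probabilities are made-up values standing in for p1..p8.
HAS_OBJECT = [
    ("Laptop77", "John", "L45", 0.3),
    ("Laptop77", "Jim", "L45", 0.2),
    ("Laptop77", "Jim", "L66", 0.4),
    ("Book302", "Mary", "L66", 0.1),
    ("Book302", "Mary", "L45", 0.2),
    ("Book302", "Jim", "L66", 0.3),
    ("Book302", "John", "L45", 0.1),
    ("Book302", "Fred", "L45", 0.2),
]


def disjoint_project(rows, key):
    """Merge mutually exclusive tuples: probabilities add up."""
    out = defaultdict(float)
    for row in rows:
        out[key(row)] += row[-1]
    return dict(out)


def independent_project(rows, key):
    """Merge independent tuples: 1 - product of (1 - p)."""
    groups = defaultdict(list)
    for row in rows:
        groups[key(row)].append(row[-1])
    return {k: 1 - prod(1 - p for p in ps) for k, ps in groups.items()}


# q(z) :- HasObject(Book302, y, z): the Book302 tuples are disjoint.
book = [r for r in HAS_OBJECT if r[0] == "Book302"]
locations = disjoint_project(book, key=lambda r: r[2])

# q(y, z) :- HasObject(x, y, z): tuples of different objects are independent.
pairs = independent_project(HAS_OBJECT, key=lambda r: (r[1], r[2]))

print(locations)
print(pairs[("Jim", "L66")], pairs[("John", "L45")])
```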

  16. A Taste of Query Evaluation

     q(y) :- Movie^p(x, y), Review^p(x, z), z > 3

     Movie^p:

     | id  | year | P  |
     |-----|------|----|
     | m42 | 1995 | p1 |
     | m99 | 2002 | p2 |
     | m76 | 2002 | p3 |

     Review^p:

     | mid | rating | P  |
     |-----|--------|----|
     | m42 | 7      | q1 |
     | m42 | 4      | q2 |
     | m42 | 9      | q3 |
     | m99 | 7      | q4 |
     | m99 | 5      | q5 |
     | m76 | 6      | q6 |

     Answer:

     • 1995: $p_1 \times (1 - (1-q_1)(1-q_2)(1-q_3))$
     • 2002: $1 - \bigl(1 - p_2 \times (1 - (1-q_4)(1-q_5))\bigr) \times \bigl(1 - p_3 \times q_6\bigr)$

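  17. Answer depends on query plan!

     q(y) :- Movie^p(x, y), Review^p(x, z), evaluated for q(1995), where Movie(m42, 1995) has probability p1 and the three Review(m42, z) tuples have probabilities q1, q2, q3.

     • INCORRECT plan: $\Pi^i_y(\mathrm{Movie}(x,y) \bowtie_x \mathrm{Review}(x,z))$. The join produces tuples with probabilities $p_1 q_1$, $p_1 q_2$, $p_1 q_3$, and the projection returns $1 - (1 - p_1 q_1)(1 - p_1 q_2)(1 - p_1 q_3)$.
     • CORRECT plan ("safe plan"): $\Pi^i_y(\mathrm{Movie}(x,y) \bowtie_x \Pi^i_x(\mathrm{Review}(x,z)))$. The inner projection returns $1 - (1-q_1)(1-q_2)(1-q_3)$, the join returns $p_1 (1 - (1-q_1)(1-q_2)(1-q_3))$, and the outer projection returns $1 - (1 - p_1 (1 - (1-q_1)(1-q_2)(1-q_3)))(1 - \dots)\dots$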
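
The difference between the two plans shows up numerically. The following sketch plugs in assumed values for p1 and q1..q3 (they are not from the slides) and compares both plans with a brute-force enumeration of the possible worlds.

```python
from itertools import product
from math import prod

P1 = 0.6             # assumed probability of Movie(m42, 1995)
Q = [0.5, 0.3, 0.9]  # assumed probabilities of the three Review(m42, z) tuples


def unsafe_plan():
    """Join first, then project: wrongly treats the joined tuples as independent."""
    return 1 - prod(1 - P1 * q for q in Q)


def safe_plan():
    """Project z out of Review first, then join with Movie."""
    some_review = 1 - prod(1 - q for q in Q)
    return P1 * some_review


def brute_force():
    """Ground truth: sum the weights of all worlds where the query is true."""
    total = 0.0
    for movie, *reviews in product([False, True], repeat=1 + len(Q)):
        weight = P1 if movie else 1 - P1
        for present, q in zip(reviews, Q):
            weight *= q if present else 1 - q
        if movie and any(reviews):
            total += weight
    return total


print(unsafe_plan(), safe_plan(), brute_force())
```

The unsafe plan overestimates because the three joined tuples all share the same Movie tuple.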

  18. Safe Plans are Efficient

     • Very efficient: run almost as fast as regular queries
     • Require only simple modifications of the relational operators
     • Or can be translated back into SQL and sent to any RDBMS

     Can we always generate a safe plan?

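  19. A Hard Query

     h(u, v) :- R^p(u, x), S(x, y), T^p(y, v)

     R^p:

     | A | B  | P  |
     |---|----|----|
     | a | x1 | p1 |
     | a | x2 | p2 |

     S:

     | B  | C  |
     |----|----|
     | x1 | y1 |
     | x1 | y2 |
     | x2 | y1 |

     T^p:

     | C  | D | P  |
     |----|---|----|
     | y1 | c | q1 |
     | y2 | c | q2 |

     [Plan figure for h(a, c): $\Pi^i$ over ($\Pi^i$(R ⋈ S) ⋈ T). After the inner projection and the join with T, the tuple for y1 has probability $(1-(1-p_1)(1-p_2))\,q_1$.] The top $\Pi^i$ is unsafe: the tuples it merges are not independent, because they share tuples of R.

     There is no safe plan!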

  20. Independent Queries

     Let q1, q2 be two Boolean queries.

     Definition: q1, q2 are "independent" if $P(q_1, q_2) = P(q_1)\,P(q_2)$

     Also: $P(q_1 \lor q_2) = 1 - (1 - P(q_1))(1 - P(q_2))$

  21. Quiz: which are independent?

     | q1                                   | q2                                   | Indep.? |
     |--------------------------------------|--------------------------------------|---------|
     | Movie^p(m41, y)                      | Review^p(m41, z)                     |         |
     | Movie^p(m42, y), Review^p(m42, z)    | Movie^p(m77, y), Review^p(m77, z)    |         |
     | Movie^p(m42, y), Review^p(m42, z)    | Movie^p(m42, 1995)                   |         |
     | Movie^p(m42, y), Review^p(m42, 7)    | Movie^p(m42, y), Review^p(m42, 4)    |         |
     | R^p(x, y, z, z, u), R^p(x, x, x, y, y) | R^p(a, a, b, b, c)                 |         |

  22. Answers

     | q1                                   | q2                                   | Indep.? |
     |--------------------------------------|--------------------------------------|---------|
     | Movie^p(m41, y)                      | Review^p(m41, z)                     | YES     |
     | Movie^p(m42, y), Review^p(m42, z)    | Movie^p(m77, y), Review^p(m77, z)    | YES     |
     | Movie^p(m42, y), Review^p(m42, z)    | Movie^p(m42, 1995)                   | NO      |
     | Movie^p(m42, y), Review^p(m42, 7)    | Movie^p(m42, y), Review^p(m42, 4)    | NO      |
     | R^p(x, y, z, z, u), R^p(x, x, x, y, y) | R^p(a, a, b, b, c)                 | YES     |

     Proposition: If no two subgoals unify then q1, q2 are independent.
     Note: this is a sufficient but not necessary condition (in the last row two subgoals unify, yet the queries are independent).

     Theorem: Independence is $\Pi^p_2$-complete [Miklau&S’04]. Reducible to query containment [Machanavajjhala&Gehrke’06].

  23. Disjoint Queries

     Let q1, q2 be two Boolean queries.

     Definition: q1, q2 are "disjoint" if $P(q_1, q_2) = 0$

     This holds iff q1, q2 depend on two disjoint tuples t1, t2.

  24. Quiz: which are disjoint?

     | q1                                    | q2                                   | Disjoint? |
     |---------------------------------------|--------------------------------------|-----------|
     | HasObject^p(‘book’, ‘9’, ‘Mary’, x)   | HasObject^p(‘book’, ‘9’, ‘Jim’, x)   |           |
     | HasObject^p(‘book’, t, ‘Mary’, x)     | HasObject^p(‘book’, t, ‘Jim’, x)     |           |
     | HasObject^p(‘book’, ‘9’, u, x)        | HasObject^p(‘book’, ‘9’, v, x)       |           |

  25. Answers

     | q1                                    | q2                                   | Disjoint? |
     |---------------------------------------|--------------------------------------|-----------|
     | HasObject^p(‘book’, ‘9’, ‘Mary’, x)   | HasObject^p(‘book’, ‘9’, ‘Jim’, x)   | Y         |
     | HasObject^p(‘book’, t, ‘Mary’, x)     | HasObject^p(‘book’, t, ‘Jim’, x)     | N         |
     | HasObject^p(‘book’, ‘9’, u, x)        | HasObject^p(‘book’, ‘9’, v, x)       | N         |

     Proposition: q1, q2 are "disjoint" if they contain subgoals g1, g2 that:
     • have the same values for the key attributes
     • these values are constants
     • have at least one different constant in the non-key attributes

  26. Definition of Safe Operators

     • Selection $\sigma_{x=a}(q(x))$: always "safe"
     • Join $q_1(x) \bowtie q_2(x)$: "safe" if ∀ a, $q_1(a)$ and $q_2(a)$ are independent
     • Independent project $\Pi^i(q(x))$: "safe" if ∀ a, b, $q(a)$ and $q(b)$ are independent
     • Disjoint project $\Pi^d(q(x))$: "safe" if ∀ a, b, $q(a)$ and $q(b)$ are disjoint

  27. Example 1

     q(y_c) :- Movie^p(x, y_c), Review^p(x, z), where y_c "is a constant"

     Plan: $\Pi^i_y(\mathrm{Movie}(x,y) \bowtie_x \mathrm{Review}(x,z))$

     The input of the projection is q1(x, z) :- Movie(x, y_c), Review(x, z). The projection is unsafe because these are dependent:
     • q1(m42, 7) = Movie(m42, y_c), Review(m42, 7)
     • q1(m42, 4) = Movie(m42, y_c), Review(m42, 4)

  28. Example 2

     q(y_c) :- Movie^p(x, y_c), Review^p(x, z), where y_c "is a constant"

     Plan: $\Pi^i_y(\mathrm{Movie}(x,y) \bowtie_x \Pi^i_x(\mathrm{Review}(x,z)))$

     The input of the top projection is q1(x) :- Movie(x, y_c), Review(x, z). Now these are independent, so the plan is safe:
     • q1(m42) = Movie(m42, y_c), Review(m42, z)
     • q1(m77) = Movie(m77, y_c), Review(m77, z)

  29. Complexity Class #P [Valiant’79]

     Definition: #P is the class of functions f(x) for which there exists a PTIME non-deterministic Turing machine M such that f(x) = number of accepting computations of M on input x.

     Examples:
     • SAT = "given formula Φ, is Φ satisfiable?" is NP-complete
     • #SAT = "given formula Φ, count the number of satisfying assignments" is #P-complete

  30. All You Need to Know About #P [Valiant’79] [Provan&Ball’83]

     | Class                      | Example                                     | SAT   | #SAT |
     |----------------------------|---------------------------------------------|-------|------|
     | 3CNF                       | (X ∨ Y ∨ Z) ∧ (¬X ∨ U ∨ W) …                | NP    | #P   |
     | 2CNF                       | (X ∨ Y) ∧ (¬X ∨ U) …                        | PTIME | #P   |
     | Positive, partitioned 2CNF | (X1 ∨ Y1) ∧ (X1 ∨ Y4) ∧ (X2 ∨ Y1) ∧ (X3 ∨ Y1) … | PTIME | #P |
     | Positive, partitioned 2DNF | (X1 ∧ Y1) ∨ (X1 ∧ Y4) ∨ (X2 ∧ Y1) ∨ (X3 ∧ Y1) … | PTIME | #P |

     Here NP, #P means "NP-complete, #P-complete".

  31. #P-Hard Queries (see also [Graedel et al. 98])

     hd1 :- R^p(x), S(x, y), T^p(y)

     Theorem: The query hd1 is #P-hard.

     Proof: Reduction from partitioned, positive 2DNF. E.g. Φ = x1 y1 ∨ x2 y1 ∨ x1 y2 ∨ x3 y2 reduces to:

     R^p:

     | A  | P   |
     |----|-----|
     | x1 | 0.5 |
     | x2 | 0.5 |
     | x3 | 0.5 |

     S:

     | A  | B  |
     |----|----|
     | x1 | y1 |
     | x2 | y1 |
     | x1 | y2 |
     | x3 | y2 |

     T^p:

     | B  | P   |
     |----|-----|
     | y1 | 0.5 |
     | y2 | 0.5 |

     $\#\Phi = P(\mathrm{hd1}) \cdot 2^n$
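
The identity #Φ = P(hd1) · 2^n can be verified on the example formula by enumeration. The sketch below is only a sanity check with hypothetical function names; it treats each R and T tuple as present independently with probability 0.5 and S as deterministic, as in the reduction.

```python
from itertools import product

# Clauses of the example formula: x1 y1 v x2 y1 v x1 y2 v x3 y2.
# The same pairs are the tuples of the deterministic relation S.
CLAUSES = [("x1", "y1"), ("x2", "y1"), ("x1", "y2"), ("x3", "y2")]
VARS = sorted({v for clause in CLAUSES for v in clause})


def count_satisfying():
    """#Phi: number of truth assignments that satisfy the 2DNF formula."""
    count = 0
    for bits in product([False, True], repeat=len(VARS)):
        assignment = dict(zip(VARS, bits))
        if any(assignment[x] and assignment[y] for x, y in CLAUSES):
            count += 1
    return count


def prob_hd1(p=0.5):
    """P(hd1): R(x) and T(y) tuples are present independently with probability p."""
    total = 0.0
    for bits in product([False, True], repeat=len(VARS)):
        world = dict(zip(VARS, bits))
        weight = 1.0
        for present in bits:
            weight *= p if present else 1 - p
        if any(world[x] and world[y] for x, y in CLAUSES):
            total += weight
    return total


n = len(VARS)
print(count_satisfying(), prob_hd1() * 2 ** n)
```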

  32. #P-Hard Queries

     • #P-hard queries do not have safe plans
     • Do not have any PTIME algorithm
       – Unless P = NP
     • Can be evaluated using probabilistic inference
       – Exponential time exact algorithms, or
       – PTIME approximations, e.g. Luby&Karp
     • In our experience with MystiQ, unsafe queries are 2 orders of magnitude slower than safe queries, and that only after optimizations

  33. Lessons

     What do users want?
     • Arbitrary queries, not just safe queries
       – Safe query → very fast
       – Unsafe query → begs for optimizations

     What should the system do?
     • Aggressively check if a query is safe
     • If not, aggressively search for safe subqueries

     Key problem: identifying the safe queries

  34. Dichotomy Property

     • LANG = a query language
     • REP = a representation formalism (independent or independent/disjoint)

     REP, LANG have the DICHOTOMY PROPERTY if ∀ q ∈ LANG: (1) the complexity of q is PTIME, or (2) the complexity of q is #P-hard.

     LANG:
     • CQ = conjunctive queries
     • CQ^1 = conjunctive queries without self-joins

     Theorems: the dichotomy property holds for:
     1. CQ^1 and independent dbs.
     2. CQ^1 and disjoint/independent dbs.
     3. CQ and independent dbs.

  35. Summary So Far

     • Lots of applications need probabilistic data
     • Tuple disjoint/independent data model
       – Sufficient for many applications
       – Can be made complete through views
       – Ideal for studying query evaluation
     • Query evaluation
       – Some (many?) queries are inherently hard
       – Main optimization tool: safe queries

  36. Outline

     Part 1:
     • Motivation
     • Data model
     • Basic query evaluation

     Part 2:
     • The dichotomy of query evaluation
     • Implementation and optimization
     • Six Challenges

  37. Dichotomy Property

     • LANG = a query language
     • REP = a representation formalism (independent or independent/disjoint)

     REP, LANG have the DICHOTOMY PROPERTY if ∀ q ∈ LANG: (1) the complexity of q is PTIME, or (2) the complexity of q is #P-hard.

     LANG:
     • CQ = conjunctive queries
     • CQ^1 = conjunctive queries without self-joins

     Theorems: the dichotomy property holds for:
     1. CQ^1 and independent dbs.
     2. CQ^1 and disjoint/independent dbs.
     3. CQ and independent dbs.

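  38. PTIME Queries and #P-Hard Queries

     Underlined attributes form the key.

     PTIME queries:
     • $R(\underline{x}, \underline{y}),\ S(\underline{x}, \underline{z})$
     • $R(\underline{x}, y),\ S(\underline{y}),\ T(\underline{\text{‘a’}}, y)$
     • $R(\underline{x}),\ S(\underline{x}, \underline{y}),\ T(\underline{y}),\ U(\underline{u}, y),\ W(\underline{\text{‘a’}}, u)$
     • ...

     #P-hard queries:
     • hd1 = $R(\underline{x}),\ S(\underline{x}, \underline{y}),\ T(\underline{y})$
     • hd2 = $R(\underline{x}, y),\ S(\underline{y})$
     • hd3 = $R(\underline{x}, y),\ S(x, \underline{y})$
     • ...

     We will discuss next how to decide their complexity and how to evaluate PTIME queries.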

  39. Hierarchical Queries

     sg(x) = set of subgoals containing the variable x in a key position

     Definition: A query q is hierarchical if for all x, y: sg(x) ⊇ sg(y) or sg(x) ⊆ sg(y) or sg(x) ∩ sg(y) = ∅

     • Non-hierarchical: h1 = R(x), S(x, y), T(y)
     • Hierarchical: q = R(x, y), S(x, z)

     [Figure: for h1, the sets sg(x) = {R, S} and sg(y) = {S, T} overlap without containment; for q, sg(x) = {R, S} contains sg(y) = {R} and sg(z) = {S}.]
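
The hierarchy test is easy to mechanize. The sketch below assumes a Boolean conjunctive query over tuple-independent tables, so every attribute counts as a key position; the query encoding (a list of `(relation, arguments)` pairs plus a set of constants) and the function names are choices made for this illustration.

```python
from itertools import combinations


def subgoal_sets(query, constants=()):
    """Map each variable to the set of subgoal positions it occurs in."""
    sg = {}
    for idx, (_relation, args) in enumerate(query):
        for arg in args:
            if arg not in constants:
                sg.setdefault(arg, set()).add(idx)
    return sg


def is_hierarchical(query, constants=()):
    """True if every two variables have nested or disjoint subgoal sets."""
    sg = subgoal_sets(query, constants)
    return all(a <= b or b <= a or not (a & b)
               for a, b in combinations(sg.values(), 2))


h1 = [("R", ["x"]), ("S", ["x", "y"]), ("T", ["y"])]
q = [("R", ["x", "y"]), ("S", ["x", "z"])]
print(is_hierarchical(h1), is_hierarchical(q))
```

For queries without self-joins over independent tables, this test is what separates the PTIME queries from the #P-hard ones.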

  40. Case 1: CQ^1 + Independent

     • Dichotomy established in [Dalvi&S’2004]
     • CQ^1 (conjunctive queries, no self-joins):
       – R(x, y), S(y, z): OK
       – R(x, y), R(y, z): not OK
     • Independent tuples only (underlined attributes form the key):
       – $R(\underline{x}, \underline{y})$: OK
       – $S(\underline{y}, z)$: not OK

  41. CQ^1 + Independent [Dalvi&S’2004]

     Theorem: For all q ∈ CQ^1:
     • q is hierarchical, has a safe plan, and is in PTIME, OR
     • q is not hierarchical and is #P-hard

  42. The PTIME Queries

     Algorithm: convert a hierarchy to a safe plan
     1. Root variable u → $\Pi^i_{-u}$ (independent project)
     2. Connected components → join
     3. Single subgoal → leaf node

     Example: q = R(x, y), S(x, z), where x is the root variable and y, z are below it, becomes

     $\Pi^i_{-x}\bigl(\Pi^i_{-y}(R^p(x, y)) \bowtie_x \Pi^i_{-z}(S^p(x, z))\bigr)$

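  43. Example: q = R(x, y), S(x, z)

     R^p:

     | A  | B  | P  |
     |----|----|----|
     | a1 | b1 | p1 |
     | a2 | b2 | p2 |

     S^p:

     | A  | C  | P  |
     |----|----|----|
     | a1 | c1 | q1 |
     | a1 | c2 | q2 |
     | a2 | c3 | q3 |
     | a2 | c4 | q4 |
     | a2 | c5 | q5 |

     Evaluating the plan bottom-up:
     • $\Pi_{-z}$ over S^p: a1 has probability $1-(1-q_1)(1-q_2)$, a2 has probability $1-(1-q_3)(1-q_4)(1-q_5)$
     • $\bowtie_x$ with $\Pi_{-y}$ over R^p: a1 has probability $p_1(1-(1-q_1)(1-q_2))$, a2 has probability $p_2(1-(1-q_3)(1-q_4)(1-q_5))$
     • $\Pi_{-x}$: $P(q) = 1 - \bigl(1 - p_1(1-(1-q_1)(1-q_2))\bigr)\bigl(1 - p_2(1-(1-q_3)(1-q_4)(1-q_5))\bigr)$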
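
The bottom-up evaluation above can be reproduced numerically and compared against the possible-worlds definition. The probabilities below are assumed values standing in for p1, p2 and q1..q5, and the function names are hypothetical.

```python
from itertools import product
from math import prod

# Assumed numeric stand-ins for p1, p2 and q1..q5 (tuple-independent tables).
R = {("a1", "b1"): 0.5, ("a2", "b2"): 0.4}
S = {("a1", "c1"): 0.3, ("a1", "c2"): 0.6,
     ("a2", "c3"): 0.2, ("a2", "c4"): 0.5, ("a2", "c5"): 0.1}


def project_onto_first(table):
    """Independent project that keeps only the first attribute."""
    groups = {}
    for (a, _other), p in table.items():
        groups.setdefault(a, []).append(p)
    return {a: 1 - prod(1 - p for p in ps) for a, ps in groups.items()}


def safe_plan():
    """Project y and z away, join on x, then project x away."""
    r, s = project_onto_first(R), project_onto_first(S)
    joined = {a: r[a] * s[a] for a in r if a in s}
    return 1 - prod(1 - p for p in joined.values())


def brute_force():
    """Sum the weights of all worlds with an R tuple and an S tuple sharing x."""
    tuples = ([("R", k, p) for k, p in R.items()]
              + [("S", k, p) for k, p in S.items()])
    total = 0.0
    for present in product([False, True], repeat=len(tuples)):
        weight = 1.0
        r_keys, s_keys = set(), set()
        for keep, (name, key, p) in zip(present, tuples):
            weight *= p if keep else 1 - p
            if keep:
                (r_keys if name == "R" else s_keys).add(key[0])
        if r_keys & s_keys:
            total += weight
    return total


print(safe_plan(), brute_force())
```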

  44. The #P-Hard Queries [D&S’2004]

     These are precisely the non-hierarchical queries.

     • Example: hd1 :- R(x), S(x, y), T(y)
     • More general: q :- …, R(x, …), S(x, y, …), T(y, …), …

     Theorem: Testing if q is PTIME or #P-hard is in AC^0

  45. Quiz: What is their complexity? (PTIME or #P?)

     • R(x, y), S(y, a, u), T(y, y, v)
     • R(x, y), S(x, y, z), T(x, z)
     • R(x, a), S(y, u, x), T(u, y), U(x, y)
     • R(x, y, z), S(z, u, y), T(y, v, z, x), U(y)

  46. Hint…

     [Figure: for each of the four quiz queries, a diagram of its variables (x, y, z, u, v) and the subgoals (R, S, T, U) in which each variable occurs.]

  47. …Answer

     | q                                             | PTIME or #P? |
     |-----------------------------------------------|--------------|
     | R(x, y), S(y, a, u), T(y, y, v)               | PTIME        |
     | R(x, y), S(x, y, z), T(x, z)                  | #P           |
     | R(x, a), S(y, u, x), T(u, y), U(x, y)         | #P           |
     | R(x, y, z), S(z, u, y), T(y, v, z, x), U(y)   | PTIME        |

  48. Case 2: CQ^1 + Disjoint/independent

     • Dichotomy: in [Dalvi et al.’06, Dalvi&S’07]
     • Some safe plans also in [Andritsos’2006]
     • CQ^1 (conjunctive queries, no self-joins)
     • Disjoint/independent tables are OK

     Theorem: For all q ∈ CQ^1:
     • q has a safe plan and is in PTIME, OR
     • q is #P-hard

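  49. The PTIME Queries

     Algorithm: find a safe plan
     1. Root variable u → $\Pi^i_{-u}$
     2. Variable u occurs in a subgoal with constant keys → $\Pi^D_{-u}$
     3. Connected components → join
     4. Single subgoal → leaf node

     Example: q(y) :- $R(\underline{x}, y, z)$ (x is the key)

     $R(\underline{x}, y, z)$:

     | x  | y | z  | P  |
     |----|---|----|----|
     | a1 | b | c1 | p1 |
     | a1 | b | c2 | p2 |
     | a2 | b | c1 | p3 |
     | a2 | b | c2 | p4 |

     $\Pi^D_{-z}$, computing q1(x_c, y_c) :- $R(\underline{x_c}, y_c, z)$:

     | x  | y | P       |
     |----|---|---------|
     | a1 | b | p1 + p2 |
     | a2 | b | p3 + p4 |

     $\Pi^i_{-x}$, computing q(y):

     | y | P                             |
     |---|-------------------------------|
     | b | 1 - (1 - p1 - p2)(1 - p3 - p4) |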
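
The example can be replayed with numbers. In this sketch the probabilities are assumed stand-ins for p1..p4, tuples with the same key x are mutually exclusive, different keys are independent, and the function names are hypothetical.

```python
from itertools import product
from math import prod

# R(x, y, z) with key x: alternatives within a block are mutually exclusive,
# blocks are independent. Probabilities are assumed stand-ins for p1..p4.
BLOCKS = {
    "a1": [("b", "c1", 0.2), ("b", "c2", 0.3)],
    "a2": [("b", "c1", 0.1), ("b", "c2", 0.4)],
}


def safe_plan(y):
    """Disjoint project of z inside each block, then independent project of x."""
    per_key = {x: sum(p for (yy, _z, p) in alts if yy == y)
               for x, alts in BLOCKS.items()}
    return 1 - prod(1 - p for p in per_key.values())


def brute_force(y):
    """Enumerate worlds: each block picks one alternative or none."""
    choices = []
    for alts in BLOCKS.values():
        none_p = 1 - sum(alt[2] for alt in alts)
        choices.append([(None, none_p)]
                       + [((yy, z), p) for (yy, z, p) in alts])
    total = 0.0
    for world in product(*choices):
        weight = prod(p for _t, p in world)
        if any(t is not None and t[0] == y for t, _p in world):
            total += weight
    return total


print(safe_plan("b"), brute_force("b"))
```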

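  50. Example: a safe plan with disjoint and independent projects

     Query: $R(\underline{x}),\ S(\underline{x}, \underline{y}),\ T(\underline{y}),\ U(\underline{u}, y),\ W(\underline{\text{‘a’}}, u)$

     Plan, from the root down:
     • $\Pi^D_{-u}$ (disjoint project) over
     • $W^p(\text{‘a’}, u) \bowtie_u$
     • $\Pi^D_{-y}$ (disjoint project) over
     • a join $\bowtie_y$ of $T^p(y)$, $U^p(u, y)$ and
     • $\Pi^I_{-x}$ (independent project) over $R^p(x) \bowtie_x S^p(x, y)$

     [Figure: the variable hierarchy of the query (u, y, x) next to the plan tree.]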

  51. The #P-Hard Queries [Dalvi&S’2007]

     • hd1 = $R(\underline{x}),\ S(\underline{x}, \underline{y}),\ T(\underline{y})$
     • hd2 = $R(\underline{x}, y),\ S(\underline{y})$
     • hd3 = $R(\underline{x}, y),\ S(x, \underline{y})$

     There are variations on hd2, hd3 (see paper).

     In general, a query is #P-hard if it can be "rewritten" to hd1, hd2, hd3 or one of their "variations".

     Theorem: Testing if q is PTIME or #P-hard is PTIME-complete

  52. Case 3: Any conjunctive query, independent tables [Dalvi&S’2007b]

     Let q be hierarchical.
     • x ⊇ y denotes: x is above y in the hierarchy
     • x ≡ y denotes: x ⊇ y and x ⊆ y

     Definition: An inversion is a chain of unifications: x ⊃ y with u_1 ≡ v_1 with … with u_n ≡ v_n with x’ ⊂ y’

     Theorem: For all q ∈ CQ:
     • If q is non-hierarchical, or has an inversion*, then it is #P-hard
     • Otherwise it is in PTIME

     *without "eraser": see paper.

  53. The #P-hard Queries [Dalvi&S’2007b]

     Hierarchical queries with "inversions":

     • hi1 = R(x), S(x, y), S(x’, y’), T(y’)
       – x ⊃ y unifies with x’ ⊂ y’
     • hi2 = R(x), S(x, y), S(u, v), S’(u, v), S’(x’, y’), T(y’)
       – x ⊃ y unifies with u ≡ v, which unifies with x’ ⊂ y’

     [Figure: the variable hierarchies of hi1 and hi2.]

  54. The #P-hard Queries

     A query with a long inversion:

     $hi_k = R(x),\ S_0(x, y),\ S_0(u_1, v_1),\ S_1(u_1, v_1),\ S_1(u_2, v_2),\ S_2(u_2, v_2),\ \dots,\ S_k(x', y'),\ T(y')$

  55. The #P-hard Queries

     Sometimes inversions are exposed only after making a copy of the query:

     • q = R(x, y), R(y, z)
     • q with a renamed copy: R(x, y), R(y, z), R(x’, y’), R(y’, z’)

  56. The PTIME Queries

     Find movies with high reviews from Joe and Jim:

     q(x) :- Movie(x, y), Match(x, r), Review(r, Joe, s), s > 4, Match(x, r’), Review(r’, Jim, s’), s’ > 4

     • The two Match subgoals unify, but there is no inversion
     • The two Review subgoals don’t unify

     Note: the query is hierarchical because x is a "constant"

  57. The PTIME Queries [Dalvi&S’2007b]

     Note: no "safe plans" are known! The PTIME algorithm for an inversion-free query is given in terms of expressions, not plans.

     Example: q :- R(a, x), R(y, b)

     $p(q) = p(R(a,b)) + \bigl(1 - p(R(a,b))\bigr)\Bigl(1 - \prod_{y \in Dom,\, y \ne a} (1 - p(R(y,b)))\Bigr)\Bigl(1 - \prod_{x \in Dom,\, x \ne b} (1 - p(R(a,x)))\Bigr)$

     Open Problem: what are the natural operators that allow us to compute inversion-free queries in a database engine?
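
The expression for q :- R(a, x), R(y, b) can be checked against possible-worlds enumeration. The sketch reads the expression as: either R(a, b) is present, or it is absent and both some R(a, x) with x ≠ b and some R(y, b) with y ≠ a are present. The table contents below are invented for illustration, and the function names are hypothetical.

```python
from itertools import product
from math import prod

# Assumed tuple-independent table R over the domain {a, b, c}.
R = {("a", "b"): 0.3, ("a", "c"): 0.5, ("a", "a"): 0.2,
     ("c", "b"): 0.4, ("b", "b"): 0.6, ("c", "c"): 0.7}


def formula():
    """p(R(a,b)) + (1 - p(R(a,b))) * P(some R(a,x), x != b) * P(some R(y,b), y != a)."""
    p_ab = R.get(("a", "b"), 0.0)
    some_ax = 1 - prod(1 - p for (s, t), p in R.items() if s == "a" and t != "b")
    some_yb = 1 - prod(1 - p for (s, t), p in R.items() if t == "b" and s != "a")
    return p_ab + (1 - p_ab) * some_ax * some_yb


def brute_force():
    """q holds in a world if it has a tuple R(a, _) and a tuple R(_, b)."""
    items = list(R.items())
    total = 0.0
    for present in product([False, True], repeat=len(items)):
        weight = 1.0
        world = []
        for keep, (key, p) in zip(present, items):
            weight *= p if keep else 1 - p
            if keep:
                world.append(key)
        if any(s == "a" for s, _t in world) and any(t == "b" for _s, t in world):
            total += weight
    return total


print(formula(), brute_force())
```

The two sets of tuples in the second term do not overlap (a tuple in both would have to be R(a, b)), which is why their probabilities multiply.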

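  58. Examples

     | Query                      | Complexity | Why              |
     |----------------------------|------------|------------------|
     | R(a, x), R(y, b)           | PTIME      |                  |
     | R(a, x), R(x, b)           | PTIME      |                  |
     | R(x, y), R(y, z)           | #P         | Inversion        |
     | R(x, y), R(y, z), R(z, u)  | #P         | Non-hierarchical |
     | R(x, y), R(y, z), R(z, x)  | #P         | Non-hierarchical |
     | R(x, y), R(y, z), R(x, z)  | #P         | Non-hierarchical |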

  59. History

     • [Graedel, Gurevich, Hirsch’98]
       – L(x, y), R(x, z), S(y), S(z) is #P-hard. This is non-hierarchical, with a self-join
     • [Dalvi&S’2004]
       – R(x), S(x, y), T(y) is #P-hard. This is non-hierarchical, without self-joins
       – Without self-joins: non-hierarchical = #P-hard, and hierarchical = PTIME
     • [Dalvi&S’2007]
       – All non-hierarchical queries are #P-hard

  60. Summary on the Dichotomy

     WHY WE CARE: Safe queries = most powerful optimization we have

     What we know:
     • Three dichotomies, of increasing complexity
     • Dichotomy for aggregates in HAVING [Re&S.2007]

     What is open:
     • CQ + independent/disjoint
     • Extensions to ≤, ≥, ≠
     • Extensions to unions of conjunctive queries

  61. Outline

     Part 1:
     • Motivation
     • Data model
     • Basic query evaluation

     Part 2:
     • The dichotomy of query evaluation
     • Implementation and optimization
     • Six Challenges

  62. Implementation and Optimization

     Topics:
     • General probabilistic inference
     • Optimization 1: Safe-subplans
     • Optimization 2: Top K
     • Performance of MystiQ

  63. General Query Evaluation

     • Query q + database DB → Boolean expression $\Phi_q^{DB}$
     • Run any probabilistic inference algorithm on $\Phi_q^{DB}$

     This approach is taken in Trio.

  64. Background: Probability of Boolean Expressions

     Given: $P(X_1) = p_1$, $P(X_2) = p_2$, $P(X_3) = p_3$ and $\Phi = X_1 X_2 \lor X_1 X_3 \lor X_2 X_3$. Compute $P(\Phi)$.

     Ω:

     | X1 | X2 | X3 | Φ | P (where Φ = 1)  |
     |----|----|----|---|------------------|
     | 0  | 0  | 0  | 0 |                  |
     | 0  | 0  | 1  | 0 |                  |
     | 0  | 1  | 0  | 0 |                  |
     | 0  | 1  | 1  | 1 | (1-p1) p2 p3     |
     | 1  | 0  | 0  | 0 |                  |
     | 1  | 0  | 1  | 1 | p1 (1-p2) p3     |
     | 1  | 1  | 0  | 1 | p1 p2 (1-p3)     |
     | 1  | 1  | 1  | 1 | p1 p2 p3         |

     $\Pr(\Phi) = (1-p_1)p_2p_3 + p_1(1-p_2)p_3 + p_1p_2(1-p_3) + p_1p_2p_3$

     #P-complete [Valiant:1979]
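
The truth-table computation translates directly into code. This is a minimal sketch with hypothetical function names; it is exponential in the number of variables, which is consistent with the problem being #P-complete in general.

```python
from itertools import product


def phi(x1, x2, x3):
    """Phi = X1 X2 v X1 X3 v X2 X3."""
    return (x1 and x2) or (x1 and x3) or (x2 and x3)


def exact_probability(formula, probs):
    """Sum the probabilities of all truth assignments that satisfy the formula."""
    total = 0.0
    for assignment in product([0, 1], repeat=len(probs)):
        if formula(*assignment):
            weight = 1.0
            for bit, p in zip(assignment, probs):
                weight *= p if bit else 1 - p
            total += weight
    return total


print(exact_probability(phi, [0.5, 0.5, 0.5]))
print(exact_probability(phi, [0.2, 0.5, 0.9]))
```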

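  65. Query q + Database PDB → Φ

     q = R(x, y), S(x, z)

     PDB:

     R^p:

     | A  | B  | P  | Variable |
     |----|----|----|----------|
     | a1 | b1 | p1 | X1       |
     | a2 | b2 | p2 | X2       |

     S^p:

     | A  | C  | P  | Variable |
     |----|----|----|----------|
     | a1 | c1 | q1 | Y1       |
     | a1 | c2 | q2 | Y2       |
     | a2 | c3 | q3 | Y3       |
     | a2 | c4 | q4 | Y4       |
     | a2 | c5 | q5 | Y5       |

     $\Phi = X_1Y_1 \lor X_1Y_2 \lor X_2Y_3 \lor X_2Y_4 \lor X_2Y_5$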
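
Building Φ from the query and the database amounts to running the join while collecting the Boolean variables of the participating tuples. A minimal sketch for this specific query, with hypothetical function names:

```python
# Each row carries the name of its Boolean variable instead of a probability.
R = [("a1", "b1", "X1"), ("a2", "b2", "X2")]
S = [("a1", "c1", "Y1"), ("a1", "c2", "Y2"),
     ("a2", "c3", "Y3"), ("a2", "c4", "Y4"), ("a2", "c5", "Y5")]


def lineage(r_rows, s_rows):
    """DNF for q = R(x, y), S(x, z): one conjunct per pair of joining tuples."""
    return [(x_var, y_var)
            for (a, _b, x_var) in r_rows
            for (a2, _c, y_var) in s_rows
            if a == a2]


def show(dnf):
    """Render the DNF in the notation used on the slide."""
    return " ∨ ".join("".join(term) for term in dnf)


print(show(lineage(R, S)))
```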

  66. Probabilistic Networks

     • Nodes = random variables
     • Edges = dependence

     Studied intensively in KR. Typical networks:
     • Bayesian networks
     • Markov networks
     • Boolean expressions

     [Figure: for R(x, y), S(x, z), the expression $\Phi = X_1Y_1 \lor X_1Y_2 \lor X_2Y_3 \lor X_2Y_4 \lor X_2Y_5$ drawn as a network: ∨ nodes above five ∧ nodes above the leaves X1, X2, Y1, ..., Y5 with probabilities p1, p2, q1, ..., q5.]

  67. Inference Algorithms for Boolean Expressions

     • Randomized:
       – Naïve Monte Carlo
       – Luby and Karp
     • Deterministic:
       – Algorithmic guarantees: [Trevisan’04], [Luby&Velickovic’91]
       – Inference algorithms in AI: variable elimination, junction trees, …
       – Tractable cases: bounded-width trees [Zabiyaka&Darwiche’06]

  68. Naive Monte Carlo Simulation

     $E = X_1X_2 \lor X_1X_3 \lor X_2X_3$

         Cnt ← 0
         repeat N times
             randomly choose X1, X2, X3 ∈ {0,1}
             if E(X1, X2, X3) = 1 then Cnt = Cnt + 1
         P = Cnt/N
         return P    /* ≈ Pr(E) */

     Theorem (0-1 estimator): If $N \ge \frac{1}{\Pr(E)} \times \frac{4\ln(2/\delta)}{\varepsilon^2}$ then $\Pr\bigl[\,|P/\Pr(E) - 1| > \varepsilon\,\bigr] < \delta$

     The factor $1/\Pr(E)$ may be big (in theory).
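
A runnable version of the naive simulation, as a sketch: variables are sampled uniformly from {0, 1} as in the pseudocode, a seeded `random.Random` is passed in so runs are reproducible, and the function names are hypothetical.

```python
import random


def majority(x1, x2, x3):
    """E = X1 X2 v X1 X3 v X2 X3."""
    return (x1 and x2) or (x1 and x3) or (x2 and x3)


def naive_monte_carlo(formula, num_vars, n_samples, rng):
    """Estimate Pr(E) as the fraction of uniformly sampled assignments satisfying E."""
    count = 0
    for _ in range(n_samples):
        assignment = [rng.random() < 0.5 for _ in range(num_vars)]
        if formula(*assignment):
            count += 1
    return count / n_samples


estimate = naive_monte_carlo(majority, 3, 20000, random.Random(0))
print(estimate)
```

For this formula the exact value under uniform sampling is 0.5, so the estimate should land close to it; the sample bound in the theorem grows as Pr(E) shrinks.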

  69. Improved Monte Carlo Simulation [Karp&Luby:1983] [Graedel,Gurevich,Hirsch:1998]

     $E = C_1 \lor C_2 \lor \dots \lor C_m$

         Cnt ← 0; S ← Pr(C1) + … + Pr(Cm)
         repeat N times
             randomly choose i ∈ {1, 2, …, m}, with prob. Pr(Ci) / S
             randomly choose X1, …, Xn ∈ {0,1} s.t. Ci = 1
             if C1 = 0 and C2 = 0 and … and Ci-1 = 0 then Cnt = Cnt + 1
         P = Cnt/N * S
         return P    /* ≈ Pr(E) */

     Theorem: If $N \ge m \times \frac{4\ln(2/\delta)}{\varepsilon^2}$ then $\Pr\bigl[\,|P/\Pr(E) - 1| > \varepsilon\,\bigr] < \delta$

     Now it’s in PTIME.
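
A sketch of the improved estimator for a monotone DNF, such as the lineage of a conjunctive query. It assumes each variable is an independent event with its own probability and that S is the sum of the clause probabilities, so the estimate is S × Cnt/N directly; a 2^n normalization would only be needed if S counted satisfying assignments instead. The variable probabilities and function names are invented for illustration.

```python
import random
from math import prod


def karp_luby(clauses, probs, n_samples, rng):
    """Estimate Pr(C1 v ... v Cm) for clauses given as lists of positive variables."""
    weights = [prod(probs[v] for v in clause) for clause in clauses]
    s = sum(weights)
    count = 0
    for _ in range(n_samples):
        # Pick clause i with probability Pr(Ci) / S.
        i = rng.choices(range(len(clauses)), weights=weights)[0]
        forced = set(clauses[i])
        # Sample a world conditioned on Ci being true.
        world = {v: (v in forced) or (rng.random() < p) for v, p in probs.items()}
        # Count the sample only if no earlier clause is also true.
        if not any(all(world[v] for v in clauses[j]) for j in range(i)):
            count += 1
    return s * count / n_samples


# Phi = X1Y1 v X1Y2 v X2Y3 v X2Y4 v X2Y5 with assumed probabilities.
PROBS = {"X1": 0.5, "X2": 0.4, "Y1": 0.3, "Y2": 0.6,
         "Y3": 0.2, "Y4": 0.5, "Y5": 0.1}
CLAUSES = [["X1", "Y1"], ["X1", "Y2"], ["X2", "Y3"], ["X2", "Y4"], ["X2", "Y5"]]

estimate = karp_luby(CLAUSES, PROBS, 20000, random.Random(0))
print(estimate)
```

The counted fraction is at least 1/m because every satisfying world is counted for exactly one clause, which is what makes the sample bound independent of Pr(E).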

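  70. An Example [Re,Dalvi&S’2007]

     q(x, u) :- R^p(x, y), S^p(y, z), T^p(z, u)

     R^p:

     | A  | B  | P  |
     |----|----|----|
     | a1 | b1 | p1 |
     | a1 | b2 | p2 |
     | a2 | b1 | p3 |

     S^p:

     | B  | C  | P  |
     |----|----|----|
     | b1 | c1 | q1 |
     | b2 | c1 | q2 |
     | b2 | c2 | q3 |
     | b2 | c3 | q4 |

     T^p:

     | C  | D  | P  |
     |----|----|----|
     | c1 | d1 | r1 |
     | c1 | d2 | r2 |
     | c2 | d1 | r3 |
     | c2 | d2 | r4 |
     | c2 | d3 | r5 |

     Step 1: evaluate this query on the representation to get the data:

     qTemp(x, y, p, y, z, q, z, u, r) :- R(x, y, p), S(y, z, q), T(z, u, r)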

  71. An Example (continued)

     With R^p, S^p, T^p as on the previous slide, the query

     qTemp(x, y, p, y, z, q, z, u, r) :- R(x, y, p), S(y, z, q), T(z, u, r)

     produces Temp:

     | A  | B  | P  | B  | C  | P  | C  | D  | P  |
     |----|----|----|----|----|----|----|----|----|
     | a1 | b1 | p1 | b1 | c1 | q1 | c1 | d1 | r1 |
     | a1 | b2 | p2 | b2 | c2 | q3 | c2 | d1 | r3 |
     | a2 | b1 | ...| ...| ...| ...| ...| ...| ...|
