Querying Probabilistic XML Databases

Querying Probabilistic XML Databases, Sept. 21st 2012, Asma Souihli



  1. Querying Probabilistic XML Databases. Sept. 21st 2012. Asma Souihli, Network and Computer Science Department

  2. XML for semi-structured data (tree-like structure)

  3. Probabilistic Data - PrXML (Jung-Hee Yun and Chin-Wan Chung, 2012)

  4. Context: Uncertainty

  5. Context
     - In many of these tasks, information is described in a semi-structured manner,
     - especially when the source (e.g., XML or HTML) is already in this form.
     - Representation by means of a hierarchy of nodes is natural.

  6. Outline
     1. PrXML Models: Local Dependency; Long-distance Dependency
     2. Querying P-documents: Types of Queries; Probabilistic Lineage; Complexity of Queries
     3. The ProApproX System: Computation Algorithms; Lineage Decomposition Techniques; Evaluation Plans; Experiments
     4. Conclusions

  7. PrXML Models – Local Dependency: local dependency (mux and ind nodes)

  8. PrXML Models – Long-distance Dependency
     - Local dependency (mux and ind nodes): PrXML{ind,mux}
     - Long-distance dependency (conjunction of independent events, cie): PrXML{cie}
     - The translation from PrXML{ind,mux} to PrXML{cie} is tractable.
     [Figure: a parent node whose children have probabilities 0.3 and 0.7 in PrXML{ind,mux} is translated into a PrXML{cie} parent whose children are annotated e1 and ¬e1, with Pr(e1) = 0.3. Other PrXML{cie} children carry annotations such as e2 and e3 ∧ e4.]
     (S. Abiteboul, B. Kimelfeld, Y. Sagiv, and P. Senellart, 2009)
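
To make the cie model more concrete, here is a minimal Python sketch of how one possible world could be drawn from a PrXML{cie} document. The `Node` class, the `prune` and `sample_world` helpers, and the small `EXAMPLE` tree are all hypothetical and only loosely inspired by the slides; they are not taken from ProApproX.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Node:
    """A p-document node; `condition` is a conjunction of event literals.

    Each literal is (event_name, expected_value), so ("e5", False) means "not e5".
    """
    label: str
    condition: tuple = ()
    children: list = field(default_factory=list)


def prune(node, valuation):
    """Return the part of the tree that survives under a fixed event valuation."""
    if not all(valuation[event] == value for event, value in node.condition):
        return None  # the node and its whole subtree disappear in this world
    kept = [c for c in (prune(child, valuation) for child in node.children) if c is not None]
    return Node(node.label, node.condition, kept)


def sample_world(root, probs, rng):
    """Draw every event independently, then keep the nodes whose conditions hold."""
    valuation = {event: rng.random() < p for event, p in probs.items()}
    return prune(root, valuation)


# Hypothetical toy document (assumption: structure chosen for illustration only).
EXAMPLE = Node("Employee", (), [
    Node("Details", (("e2", True),), [
        Node("e-mail", (("e1", True),)),
        Node("phone", (("e4", True), ("e5", False))),
    ]),
])
```

Calling `sample_world(EXAMPLE, {"e1": 0.9, "e2": 0.8, "e4": 0.1, "e5": 0.6}, random.Random(0))` returns one ordinary tree; a node annotated with a conjunction survives only when every literal of the conjunction holds and all its ancestors survive.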

  9. Example
     Event probabilities: Pr(e1) = .9, Pr(e2) = .8, Pr(e3) = .4, Pr(e4) = .1, Pr(e5) = .6, Pr(e6) = .3, Pr(e7) = .2, Pr(e8) = .8
     [Figure: p-document tree rooted at Repository, with an Employee node that has a Name (Asma Souihli) and Details (work and Contact). Nodes are annotated with events e1, e2, e3, e5, e6, e7, e8 and e4 ∧ ¬e5. Leaves include work places (Telecom Paristech, Gaumont Pathé), addresses (Paris 13, Paris 15), a phone number (0622330011) and e-mail addresses (souihli@enst.fr, asma.souihli@gmail.com).]

  10. Outline (same as slide 6)

  11. Querying P-documents – Types of Queries
      - Tree Pattern Queries (TPQ)
      - Tree Pattern Queries with joins (TPQJ)
      [Figure: example tree patterns over nodes A, B, C and D.]

  12. Example
      Q1: /Employee[Name="Asma Souihli"]//e-mail/text()
      Event probabilities: Pr(e1) = .9, Pr(e2) = .8, Pr(e6) = .3, Pr(e8) = .8, Pr(e9) = .6, Pr(e10) = .7
      Matches and their lineage clauses:
      - enst.fr: C1 = e2 ∧ e8 ∧ e1
      - gmail.com: C2 = e2 ∧ e8 ∧ e6
      - sap.com: C3 = e2 ∧ e9 ∧ e10
      - gmail.com: C4 = e2 ∧ e9 ∧ e6
      [Figure: the Repository / Employee subtree with Name (Asma Souihli), Details, two Contact nodes and four e-mail leaves (souihli@enst.fr, souihli@sap.com, and asma.souihli@gmail.com twice), annotated with events e1, e2, e6, e8, e9 and e10.]

  13. Querying PrXML – Probabilistic Lineage
      - Probability to find an e-mail (probabilistic lineage, in DNF shape):
        Pr(Q1) = Pr(C1 ∨ C2 ∨ C3 ∨ C4)
      - Possible results:
        Pr(asma.souihli@gmail.com) = Pr(C2 ∨ C4)
        Pr(souihli@enst.fr) = Pr(C1)
        Pr(souihli@sap.com) = Pr(C3)
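
As a small illustration of the two kinds of lineage above, the following Python sketch groups the four clauses of Q1 by answer value. The names `MATCHES`, `lineage_by_answer` and `boolean_lineage` are illustrative choices, and clauses are assumed to contain positive literals only, as in this example.

```python
from collections import defaultdict

# (answer value, clause) pairs for Q1; each clause is a conjunction of events.
MATCHES = [
    ("souihli@enst.fr", ("e2", "e8", "e1")),         # C1
    ("asma.souihli@gmail.com", ("e2", "e8", "e6")),  # C2
    ("souihli@sap.com", ("e2", "e9", "e10")),        # C3
    ("asma.souihli@gmail.com", ("e2", "e9", "e6")),  # C4
]


def lineage_by_answer(matches):
    """DNF lineage of each distinct answer: the disjunction of its clauses."""
    grouped = defaultdict(list)
    for answer, clause in matches:
        grouped[answer].append(clause)
    return dict(grouped)


def boolean_lineage(matches):
    """DNF lineage of the boolean query: the disjunction of all clauses."""
    return [clause for _, clause in matches]
```

Here `lineage_by_answer(MATCHES)["asma.souihli@gmail.com"]` holds the two clauses C2 and C4, while `boolean_lineage(MATCHES)` holds all four.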

  14. Querying PrXML – Probabilistic Lineage
      When is a linear computation possible?
      - If C1 and C2 are independent, then:
        Pr(C1 ∧ C2) = Pr(C1) × Pr(C2)
        Pr(C1 ∨ C2) = 1 − (1 − Pr(C1)) × (1 − Pr(C2))
      - If C1 and C2 are inconsistent (mutually exclusive), then:
        Pr(C1 ∨ C2) = Pr(C1) + Pr(C2)
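
The three rules translate directly into code. This is a minimal sketch with hypothetical function names; deciding whether two sub-formulas really are independent or mutually exclusive is left to the caller.

```python
def pr_and_independent(p1, p2):
    """Pr(C1 and C2) when C1 and C2 are independent."""
    return p1 * p2


def pr_or_independent(p1, p2):
    """Pr(C1 or C2) when C1 and C2 are independent."""
    return 1 - (1 - p1) * (1 - p2)


def pr_or_exclusive(p1, p2):
    """Pr(C1 or C2) when C1 and C2 can never hold together."""
    return p1 + p2
```

For instance, with two independent events of probability .8 and .6, `pr_or_independent(0.8, 0.6)` gives approximately 0.92.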

  15. Back to the Example
      Event probabilities: Pr(e1) = .9, Pr(e2) = .8, Pr(e6) = .3, Pr(e8) = .8, Pr(e9) = .6, Pr(e10) = .7
      Pr(@enst.fr) = Pr(C1) = Pr(e2 ∧ e8 ∧ e1) = .8 × .8 × .9 = 0.576
      Pr(@sap.com) = Pr(C3) = Pr(e2 ∧ e9 ∧ e10) = .8 × .6 × .7 = 0.336
      Pr(@gmail.com) = Pr(C2 ∨ C4) = Pr((e2 ∧ e8 ∧ e6) ∨ (e2 ∧ e9 ∧ e6))
      Factorization: Pr(@gmail.com) = Pr((e2 ∧ e6) ∧ (e8 ∨ e9)) = .8 × .3 × (1 − (1 − .8)(1 − .6)) = 0.2208
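
As a sanity check on the factorization, the short Python sketch below computes Pr(@gmail.com) twice: once from the factorized form, and once from the unfactorized clauses with two-clause inclusion-exclusion. The variable names are illustrative; the probabilities are the ones given on the slide.

```python
PROBS = {"e2": 0.8, "e6": 0.3, "e8": 0.8, "e9": 0.6}

# Factorized form (e2 AND e6) AND (e8 OR e9): every factor uses distinct events,
# so the independent-AND and independent-OR rules apply.
factorized = PROBS["e2"] * PROBS["e6"] * (1 - (1 - PROBS["e8"]) * (1 - PROBS["e9"]))

# Unfactorized form: Pr(C2) + Pr(C4) - Pr(C2 AND C4).
pr_c2 = PROBS["e2"] * PROBS["e8"] * PROBS["e6"]
pr_c4 = PROBS["e2"] * PROBS["e9"] * PROBS["e6"]
pr_c2_and_c4 = PROBS["e2"] * PROBS["e6"] * PROBS["e8"] * PROBS["e9"]
inclusion_exclusion = pr_c2 + pr_c4 - pr_c2_and_c4
```

Both values agree with the 0.2208 on the slide up to floating-point rounding.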

  16. Querying PrXML – Probabilistic Lineage
      Event probabilities: Pr(e1) = .9, Pr(e2) = .8, Pr(e6) = .3, Pr(e8) = .8, Pr(e9) = .6, Pr(e10) = .7
      Pr(Q1) = Pr(C1 ∨ C2 ∨ C3 ∨ C4)
             = Pr[(e2 ∧ e8 ∧ e1) ∨ (e2 ∧ e8 ∧ e6) ∨ (e2 ∧ e9 ∧ e10) ∨ (e2 ∧ e9 ∧ e6)]
      Factorization:
             = Pr[e2 ∧ ((e8 ∧ (e1 ∨ e6)) ∨ (e9 ∧ (e10 ∨ e6)))]
      Difficult to evaluate: e6 occurs in both branches of the disjunction, so the branches are not independent.
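
For this small formula the shared event can still be handled by hand. The derivation below conditions on e6 (a Shannon expansion, which is one standard way to break the dependency and is not necessarily the decomposition used by ProApproX): once e6 is fixed, the remaining sub-formulas are over distinct events, so the independence rules apply.

```latex
\begin{align*}
\Pr(Q_1) &= \Pr(e_2)\,\bigl[\Pr(e_6)\,\Pr(e_8 \lor e_9)
            + (1 - \Pr(e_6))\,\Pr\bigl((e_8 \land e_1) \lor (e_9 \land e_{10})\bigr)\bigr] \\
\Pr(e_8 \lor e_9) &= 1 - (1 - 0.8)(1 - 0.6) = 0.92 \\
\Pr\bigl((e_8 \land e_1) \lor (e_9 \land e_{10})\bigr)
         &= 1 - (1 - 0.8 \times 0.9)(1 - 0.6 \times 0.7) = 1 - 0.28 \times 0.58 = 0.8376 \\
\Pr(Q_1) &= 0.8 \times (0.3 \times 0.92 + 0.7 \times 0.8376) = 0.8 \times 0.86232 = 0.689856
\end{align*}
```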

  17. Solutions
      - One possible (naïve) way is to find the truth value assignments that satisfy the propositional formula (the probabilistic lineage), out of 2^(#literals) possible assignments (worlds),
      - and sum the probabilities of these satisfying assignments to get the answer.

      e1     e2     e6     e8     e9     e10    Probability  C1 ∨ C2 ∨ C3 ∨ C4
      false  false  false  false  false  false  0.0845       false
      false  false  false  false  false  true   0.3345       false
      false  false  false  false  true   false  0.87         false
      …      …      …      …      …      …      …            …
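
The naïve approach can be sketched in a few lines of Python. This is an illustrative implementation, not the ProApproX code: `PROBS`, `CLAUSES` and `naive_probability` are hypothetical names, events are assumed independent, and clauses are assumed to contain positive literals only (which holds for the lineage of Q1).

```python
from itertools import product

PROBS = {"e1": 0.9, "e2": 0.8, "e6": 0.3, "e8": 0.8, "e9": 0.6, "e10": 0.7}
CLAUSES = [
    ("e2", "e8", "e1"),   # C1
    ("e2", "e8", "e6"),   # C2
    ("e2", "e9", "e10"),  # C3
    ("e2", "e9", "e6"),   # C4
]


def naive_probability(clauses, probs):
    """Sum the probabilities of all worlds (truth assignments) satisfying the DNF."""
    variables = sorted({v for clause in clauses for v in clause})
    total = 0.0
    # Enumerate all 2^n truth assignments over the n variables of the formula.
    for values in product([False, True], repeat=len(variables)):
        world = dict(zip(variables, values))
        if any(all(world[v] for v in clause) for clause in clauses):
            weight = 1.0
            for v in variables:
                weight *= probs[v] if world[v] else 1 - probs[v]
            total += weight
    return total
```

With six events this enumerates only 64 worlds, and `naive_probability(CLAUSES, PROBS)` evaluates Pr(Q1); the loop doubles in length with every additional event.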

  18. Querying PrXML – Complexity of Queries
      - Computing the probability of the satisfying assignments of the DNF (lineage formula) is a #P-hard problem.
        - No polynomial-time algorithm for the exact solution if P ≠ NP
        - #P problems ask "how many" rather than "are there any". For instance: how many graph colorings using k colors are there for a particular graph G?

  19. Querying PrXML – Complexity of Queries
      - A union of sets (clauses) problem: #P-hard problem

        $$\Pr(C_1) = \Pr(e_1)\,\Pr(e_2)$$
        $$\Pr(C_2) = \Pr(e_2)\,\Pr(e_3)$$
        $$\Pr(C_1 \land C_2) = \Pr(e_1)\,\Pr(e_2)\,\Pr(e_3)$$
        $$\Pr(C_1 \lor C_2) = \Pr(C_1) + \Pr(C_2) - \Pr(C_1 \land C_2)$$

      - For dependent probabilistic clauses $C_1, \dots, C_n$ the inclusion-exclusion principle becomes:

        $$\Pr\Bigl(\bigvee_{i=1}^{n} C_i\Bigr) = \sum_{k=1}^{n} (-1)^{k-1} \sum_{\substack{J \subseteq \{1,\dots,n\} \\ |J| = k}} \Pr(C_J), \qquad \text{where } C_J = \bigwedge_{j \in J} C_j$$
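
The inclusion-exclusion formula can be turned into code almost literally. The sketch below is an illustration under the same assumptions as the slides' example (independent events, clauses with positive literals only), so that Pr(C_J) is the product of the probabilities of all events occurring in the clauses of J. The names `PROBS`, `CLAUSES` and `sieve_probability` are hypothetical.

```python
from itertools import combinations

PROBS = {"e1": 0.9, "e2": 0.8, "e6": 0.3, "e8": 0.8, "e9": 0.6, "e10": 0.7}
CLAUSES = [
    ("e2", "e8", "e1"),   # C1
    ("e2", "e8", "e6"),   # C2
    ("e2", "e9", "e10"),  # C3
    ("e2", "e9", "e6"),   # C4
]


def sieve_probability(clauses, probs):
    """Probability of a DNF via inclusion-exclusion over all non-empty clause subsets."""
    total = 0.0
    for k in range(1, len(clauses) + 1):
        sign = 1 if k % 2 == 1 else -1  # (-1)^(k-1)
        for subset in combinations(clauses, k):
            # The conjunction of the chosen clauses holds iff every event in their union holds.
            events = set().union(*subset)
            term = 1.0
            for event in events:
                term *= probs[event]
            total += sign * term
    return total
```

For the four clauses of Q1 this sums 15 terms. The number of terms is 2^m − 1 for m clauses, which is why this method is exponential in the number of clauses.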

  20. Outline (same as slide 6)

  21. The ProApproX System [CIKM 2012, SIGMOD 2011]
      - Translates the PrXML database into a probabilistic database with only cie nodes
      - Translates the user query into a lineage query
      [Figure: system pipeline in five numbered steps. Labels: User input (XPath query Q); Query translation; BaseX (querying) over the PrXML database; ProApproX (processing): lineage preprocessing, exploration (best execution plan), compilation, computation; Result Pr(Q); User interface; Answer.]

  22. Back to the Example
      Q1: /Employee[Name="Asma Souihli"]//e-mail/text()

      - To get the lineage for the boolean projection:

            (: one result per match: the distinct events on the paths to the matched leaves :)
            for $x1 in /employee
            for $x2 in $x1/name[.="Asma Souihli"]
            for $x3 in $x1//email/text()
            let $leaves := ($x2, $x3)
            let $atts := (for $i in $leaves
                          return $i/ancestor-or-self::*/attribute(event))
            return text { distinct-values(for $att in $atts return string($att)) }

      - To get lineages of answers:

            (: one <match> per distinct answer value, with one <clause> per match :)
            for $val in distinct-values(/employee[name="Asma Souihli"]//email/text())
            order by $val
            return <match>{$val}{
              for $x1 in /employee
              for $x2 in $x1/name[.="Asma Souihli"]
              for $x3 in $x1//email/text()
              let $leaves := ($x2, $x3)
              let $atts := (for $i in $leaves
                            return $i/ancestor-or-self::*/attribute(event))
              where $x3 = $val
              return <clause>{distinct-values(for $att in $atts return string($att))}</clause>
            }</match>

  23. The ProApproX System [CIKM 2012, SIGMOD 2011]
      - Translates the PrXML database into a probabilistic database with only cie nodes
      - Translates the user query into a lineage query
      - Is built on top of a native XML DBMS
      - Processes the lineage formula to get the probability of the query (and of each matching answer)
      [Figure: same system pipeline as on slide 21.]

  24. The ProApproX System – Computation Algorithms
      - Additive approximation: for a fixed error ε and a DNF F, A(F) is an additive ε-approximation of Pr(F) with a probability of at least δ (a fixed reliability factor) if:
        Pr(F) − ε ≤ A(F) ≤ Pr(F) + ε
      - Multiplicative approximation: for a fixed error ε and a DNF F, A(F) is a multiplicative ε-approximation of Pr(F) with a probability of at least δ if:
        (1 − ε) Pr(F) ≤ A(F) ≤ (1 + ε) Pr(F)
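
The difference between the two guarantees shows up mostly for small probabilities. The following Python sketch encodes only the two interval conditions (the "with a probability of at least δ" part concerns the randomized algorithm producing the estimate and is not modeled here); the function names are hypothetical.

```python
def is_additive_approximation(estimate, true_probability, epsilon):
    """Check Pr(F) - eps <= A(F) <= Pr(F) + eps."""
    return true_probability - epsilon <= estimate <= true_probability + epsilon


def is_multiplicative_approximation(estimate, true_probability, epsilon):
    """Check (1 - eps) * Pr(F) <= A(F) <= (1 + eps) * Pr(F)."""
    lower = (1 - epsilon) * true_probability
    upper = (1 + epsilon) * true_probability
    return lower <= estimate <= upper
```

For a true probability of 0.001 and ε = 0.01, the estimate 0.0 satisfies the additive condition but not the multiplicative one: the additive interval has a fixed width, while the multiplicative interval shrinks with Pr(F).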

  25. DEMO 1 [SIGMOD 2011]

  26. The ProApproX System – Computation Algorithms
      Exact computations:
      - The naïve algorithm (possible worlds): finds the satisfying assignments out of the 2^(#variables) possible truth value assignments; O(2^n)
      - The sieve algorithm (the inclusion-exclusion principle): exponential in the number of clauses m; O(2^m)

  27. The ProApproX System – Computation Algorithms
      Approximations:
      - Naïve Monte Carlo sampling for additive approximation: linear, but could take exponentially many samples to converge to a good approximation for low probabilities
      - Biased Monte Carlo sampling for multiplicative approximation (Kimelfeld, Kosharovsky, and Sagiv, 2009): running time grows in O(n³ ln n) in the number of clauses
      - Self-Adjusting Coverage Algorithm for the DNF probability problem (R. M. Karp, M. Luby, and N. Madras, 1989): linear in the length of F times ln(1/δ)/ε²
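
To give a feel for the sampling approaches, here is a hedged Python sketch of two estimators. The first is naïve Monte Carlo sampling. The second is the basic coverage (Karp-Luby style) estimator, which is a simpler relative of the self-adjusting coverage algorithm cited above and not that algorithm itself. Both assume independent events and clauses with positive literals only; all names (`PROBS`, `CLAUSES`, `naive_monte_carlo`, `coverage_estimate`) are illustrative.

```python
import random

PROBS = {"e1": 0.9, "e2": 0.8, "e6": 0.3, "e8": 0.8, "e9": 0.6, "e10": 0.7}
CLAUSES = [
    ("e2", "e8", "e1"),   # C1
    ("e2", "e8", "e6"),   # C2
    ("e2", "e9", "e10"),  # C3
    ("e2", "e9", "e6"),   # C4
]


def clause_probability(clause, probs):
    """Probability of a conjunction of independent positive events."""
    p = 1.0
    for event in clause:
        p *= probs[event]
    return p


def naive_monte_carlo(clauses, probs, samples, rng):
    """Fraction of randomly sampled worlds that satisfy the DNF."""
    variables = sorted({v for clause in clauses for v in clause})
    hits = 0
    for _ in range(samples):
        world = {v: rng.random() < probs[v] for v in variables}
        if any(all(world[v] for v in clause) for clause in clauses):
            hits += 1
    return hits / samples


def coverage_estimate(clauses, probs, samples, rng):
    """Basic coverage estimator: sample a clause, then a world satisfying it."""
    variables = sorted({v for clause in clauses for v in clause})
    weights = [clause_probability(clause, probs) for clause in clauses]
    total_weight = sum(weights)
    accumulated = 0.0
    for _ in range(samples):
        # Pick a clause with probability proportional to Pr(clause).
        chosen = rng.choices(clauses, weights=weights, k=1)[0]
        # Sample a world conditioned on the chosen clause being true.
        world = {v: True if v in chosen else rng.random() < probs[v] for v in variables}
        # Count how many clauses this world satisfies (at least the chosen one).
        satisfied = sum(all(world[v] for v in clause) for clause in clauses)
        accumulated += 1.0 / satisfied
    return total_weight * accumulated / samples
```

Both functions take an explicit `random.Random` instance so that runs are reproducible, for example `coverage_estimate(CLAUSES, PROBS, 20000, random.Random(0))`. The coverage estimator only samples worlds that satisfy at least one clause, so it does not waste samples when Pr(F) is very small, which is the situation where naïve sampling struggles.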

  28. The ProApproX System – Computation Algorithms
      - Possibility to derive a multiplicative approximation from an additive approximation (and vice versa)
      - Cost models and cost constants
