 
Querying Probabilistic XML Databases
Sept. 21st, 2012
Asma Souihli, Network and Computer Science Department
XML for semi-structured data (tree-like structure)
Probabilistic Data - PrXML
Jung-Hee Yun and Chin-Wan Chung, 2012.
Context: Uncertainty
Context
• In many of these tasks, information is described in a semi-structured manner
• Especially when the source (e.g., XML or HTML) is already in this form
• Representation by means of a hierarchy of nodes is natural
Outline
1. PrXML Models
   o Local Dependency
   o Long-distance Dependency
2. Querying P-documents
   o Types of Queries
   o Probabilistic Lineage
   o Complexity of Queries
3. The ProApproX System
   o Computation Algorithms
   o Lineage Decomposition Techniques
   o Evaluation Plans
   o Experiments
4. Conclusions
PrXML Models – Local Dependency
• Local dependency (mux and ind nodes)
PrXML Models – Long-distance Dependency
• Local dependency (mux and ind nodes): PrXML{ind,mux}
• Long-distance dependency (conjunction of independent events, cie): PrXML{cie}
[Figure: translation of a PrXML{ind,mux} tree into PrXML{cie}. A choice between two children with probabilities 0.3 and 0.7 becomes children annotated e1 and ¬e1 (with Pr(e1) = 0.3). In the cie example, children under ancestor and parent nodes carry event conditions such as e2 and e3 ∧ e4. The figure carries the label "Tractable".]
S. Abiteboul, B. Kimelfeld, Y. Sagiv, and P. Senellart. 2009
Example
Event probabilities: Pr(e1) = .9, Pr(e2) = .8, Pr(e3) = .4, Pr(e4) = .1, Pr(e5) = .6, Pr(e6) = .3, Pr(e7) = .2, Pr(e8) = .8
[Figure: PrXML{cie} document rooted at Repository. An Employee node has a Name (Asma Souihli) and Details; the subtrees (work, Contact, Place, Phone, e-mail, address) are annotated with event conditions such as e1, e2, e3, e6, e7, e8 and e4 ∧ ¬e5. Leaf values include souihli@enst.fr, asma.souihli@gmail.com, 0622330011, Telecom Paristech, Gaumont Pathé, Paris 13 and Paris 15.]
Querying P-documents – Types of Queries
o Tree Pattern Queries (TPQ)
o Tree Pattern Queries with joins (TPQJ)
[Figure: example tree patterns over nodes A, B, C, D]
Example
• Q1: /Employee[Name="Asma Souihli"]//e-mail/text()
Event probabilities: e1 = .9, e2 = .8, e6 = .3, e8 = .8, e9 = .6, e10 = .7
Lineage clauses, one per matching e-mail node:
   C1 (souihli@enst.fr): e2 ∧ e8 ∧ e1
   C2 (asma.souihli@gmail.com): e2 ∧ e8 ∧ e6
   C3 (souihli@sap.com): e2 ∧ e9 ∧ e10
   C4 (asma.souihli@gmail.com): e2 ∧ e9 ∧ e6
[Figure: Repository with an Employee node (e2) holding a Name (Asma Souihli) and Details; two Contact nodes (e8 and e9) hold e-mail nodes annotated e1 and e6, and e10 and e6, respectively.]
Querying PrXML – Probabilistic Lineage
• Probability to find an e-mail (probabilistic lineage, DNF shape):
     Pr(Q1) = Pr(C1 ∨ C2 ∨ C3 ∨ C4)
• Possible results:
     Pr(asma.souihli@gmail.com) = Pr(C2 ∨ C4)
     Pr(souihli@enst.fr) = Pr(C1)
     Pr(souihli@sap.com) = Pr(C3)
Querying PrXML – Probabilistic Lineage
• When is a linear computation possible?
   o if C1 and C2 are independent, then:
        Pr(C1 ∧ C2) = Pr(C1) × Pr(C2)
        Pr(C1 ∨ C2) = 1 − (1 − Pr(C1)) × (1 − Pr(C2))
   o if C1 and C2 are inconsistent (mutually exclusive), then:
        Pr(C1 ∨ C2) = Pr(C1) + Pr(C2)
Back to the Example
Event probabilities: e1 = .9, e2 = .8, e6 = .3, e8 = .8, e9 = .6, e10 = .7
• Pr(@enst.fr) = Pr(C1) = Pr(e2 ∧ e8 ∧ e1) = .8 × .8 × .9 = 0.576
• Pr(@sap.com) = Pr(C3) = Pr(e2 ∧ e9 ∧ e10) = .8 × .6 × .7 = 0.336
• Pr(@gmail.com) = Pr(C2 ∨ C4) = Pr[(e2 ∧ e8 ∧ e6) ∨ (e2 ∧ e9 ∧ e6)]
  Factorization: Pr(@gmail.com) = Pr[(e2 ∧ e6) ∧ (e8 ∨ e9)] = .8 × .3 × (1 − (1 − .8)(1 − .6)) = 0.2208
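The rules for independent and mutually exclusive events, applied to the factorized lineage of each answer, can be checked numerically. The following is a minimal Python sketch, not part of ProApproX; the helper names (`pr_and`, `pr_or_independent`, `pr_or_exclusive`) are hypothetical, and the event probabilities are the ones listed for Q1.

```python
from math import prod

def pr_and(*ps):
    """Probability of a conjunction of independent events."""
    return prod(ps)

def pr_or_independent(*ps):
    """Probability of a disjunction of independent events."""
    return 1.0 - prod(1.0 - p for p in ps)

def pr_or_exclusive(*ps):
    """Probability of a disjunction of mutually exclusive events."""
    return sum(ps)

# Event probabilities from the Q1 example.
pr = {"e1": 0.9, "e2": 0.8, "e6": 0.3, "e8": 0.8, "e9": 0.6, "e10": 0.7}

# C1 and C3 are plain conjunctions of independent events.
p_enst = pr_and(pr["e2"], pr["e8"], pr["e1"])
p_sap = pr_and(pr["e2"], pr["e9"], pr["e10"])

# C2 v C4 factorizes into (e2 ^ e6) ^ (e8 v e9); no event is repeated,
# so every operator joins independent sub-formulas.
p_gmail = pr_and(pr["e2"], pr["e6"], pr_or_independent(pr["e8"], pr["e9"]))

print(p_enst, p_sap, p_gmail)
```

Up to floating-point rounding, the three values should agree with the 0.576, 0.336 and 0.2208 obtained by hand. The same approach does not carry over to Pr(Q1) as a whole, because its factorized form still repeats e6.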
Querying PrXML – Probabilistic Lineage
Event probabilities: e1 = .9, e2 = .8, e6 = .3, e8 = .8, e9 = .6, e10 = .7
Pr(Q1) = Pr(C1 ∨ C2 ∨ C3 ∨ C4)
       = Pr[(e2 ∧ e8 ∧ e1) ∨ (e2 ∧ e8 ∧ e6) ∨ (e2 ∧ e9 ∧ e10) ∨ (e2 ∧ e9 ∧ e6)]
Factorization:
       = Pr[e2 ∧ ((e8 ∧ (e1 ∨ e6)) ∨ (e9 ∧ (e10 ∨ e6)))]
Difficult to evaluate: e6 occurs in both disjuncts, so they are not independent and the rules for independent events do not apply directly.
Solutions
• One possible (naïve) way is to find the truth value assignments that satisfy the propositional formula (the probabilistic lineage), out of 2^#literals possible assignments/worlds
• And sum the probabilities of these satisfying assignments to get the answer

e1     e2     e6     e8     e9     e10    Probability   C1 ∨ C2 ∨ C3 ∨ C4
false  false  false  false  false  false  0.0845        false
false  false  false  false  false  true   0.3345        false
false  false  false  false  true   false  0.87          false
…      …      …      …      …      …      …             …
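The naïve possible-worlds computation described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the ProApproX implementation: events are independent, every clause contains only positive literals (as in Q1), and the names `PR`, `CLAUSES`, `dnf_holds` and `naive_probability` are hypothetical.

```python
from itertools import product

# Event probabilities and lineage clauses C1..C4 of the Q1 example.
PR = {"e1": 0.9, "e2": 0.8, "e6": 0.3, "e8": 0.8, "e9": 0.6, "e10": 0.7}
CLAUSES = [
    {"e2", "e8", "e1"},   # C1
    {"e2", "e8", "e6"},   # C2
    {"e2", "e9", "e10"},  # C3
    {"e2", "e9", "e6"},   # C4
]

def dnf_holds(clauses, world):
    """True if at least one clause has all of its events true in this world."""
    return any(all(world[v] for v in clause) for clause in clauses)

def naive_probability(clauses, pr):
    """Sum the probabilities of all worlds that satisfy the DNF."""
    variables = sorted(pr)
    total = 0.0
    # Enumerate all 2^n truth value assignments (worlds).
    for values in product([False, True], repeat=len(variables)):
        world = dict(zip(variables, values))
        if dnf_holds(clauses, world):
            weight = 1.0
            for v in variables:
                weight *= pr[v] if world[v] else 1.0 - pr[v]
            total += weight
    return total

print(naive_probability(CLAUSES, PR))
```

With six events the loop visits only 64 worlds, but the count doubles with every additional event, which is why this approach does not scale.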
Querying PrXML – Complexity of Queries
• Computing the probability of the satisfying assignments of the DNF (lineage formula) is a #P-hard problem
   o No polynomial-time algorithm for the exact solution if P ≠ NP
   o #P problems ask "how many" rather than "are there any"
     e.g., how many colorings using k colors are there for a particular graph G?
Querying PrXML – Complexity of Queries
• A union of sets (clauses) problem: #P-hard problem

$$\Pr(C_1) = \Pr(e_1) \cdot \Pr(e_2)$$
$$\Pr(C_2) = \Pr(e_2) \cdot \Pr(e_3)$$
$$\Pr(C_1 \wedge C_2) = \Pr(e_1) \cdot \Pr(e_2) \cdot \Pr(e_3)$$
$$\Pr(C_1 \vee C_2) = \Pr(C_1) + \Pr(C_2) - \Pr(C_1 \wedge C_2)$$

For dependent probabilistic clauses C_1 ... C_n the inclusion-exclusion principle becomes:

$$\Pr\Big(\bigvee_{i=1}^{n} C_i\Big) = \sum_{k=1}^{n} (-1)^{k-1} \sum_{J \subseteq \{1,\dots,n\},\ |J| = k} \Pr(C_J)$$

where:

$$C_J = \bigwedge_{j \in J} C_j$$
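The inclusion-exclusion formula can be turned into code almost term by term. The sketch below is one possible reading, assuming independent events and clauses made of positive literals only, so that Pr(C_J) is the product of the probabilities of all events occurring in the clauses of J. The function names are hypothetical.

```python
from itertools import combinations
from math import prod

PR = {"e1": 0.9, "e2": 0.8, "e6": 0.3, "e8": 0.8, "e9": 0.6, "e10": 0.7}
CLAUSES = [
    {"e2", "e8", "e1"},   # C1
    {"e2", "e8", "e6"},   # C2
    {"e2", "e9", "e10"},  # C3
    {"e2", "e9", "e6"},   # C4
]

def conjunction_probability(clauses, pr):
    """Pr(C_J): each event in the union of the clauses counts once."""
    events = set().union(*clauses)
    return prod(pr[e] for e in events)

def sieve_probability(clauses, pr):
    """Pr(C_1 v ... v C_n) by inclusion-exclusion over all subsets J."""
    total = 0.0
    for k in range(1, len(clauses) + 1):
        sign = (-1) ** (k - 1)
        for subset in combinations(clauses, k):
            total += sign * conjunction_probability(subset, pr)
    return total

print(sieve_probability(CLAUSES, PR))
```

The outer loops visit every non-empty subset of clauses, so the cost is exponential in the number of clauses rather than in the number of events.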
The ProApproX System [CIKM 2012, SIGMOD 2011]
• Translates into a probabilistic database with only cie nodes
• Translates the user query into a lineage query
[Figure: ProApproX architecture in five numbered steps. The user enters an XPath query Q in the user interface; the query is translated; BaseX queries the PrXML database; ProApproX processes the lineage (lineage preprocessing, exploration of the best execution plan, compilation, computation); the result Pr(Q) is returned as the answer.]
Back to the Example
Q1: /Employee[Name="Asma Souihli"]//e-mail/text()
• To get the lineage for the boolean projection:
    for $x1 in /employee
    for $x2 in $x1/name[.="Asma Souihli"]
    for $x3 in $x1//email/text()
    let $leaves := ($x2, $x3)
    let $atts := (for $i in $leaves
                  return $i/ancestor-or-self::*/attribute(event))
    return text { distinct-values(for $att in $atts return string($att)) }
• To get lineages of answers:
    for $val in distinct-values(/employee[name="Asma Souihli"]//email/text())
    order by $val
    return <match>{$val}{
      for $x1 in /employee
      for $x2 in $x1/name[.="Asma Souihli"]
      for $x3 in $x1//email/text()
      let $leaves := ($x2, $x3)
      let $atts := (for $i in $leaves
                    return $i/ancestor-or-self::*/attribute(event))
      where $x3 = $val
      return <clause>{distinct-values(for $att in $atts return string($att))}</clause>
    }</match>
The ProApproX System [CIKM 2012, SIGMOD 2011]
• Is built on top of a native XML DBMS
• Processes the lineage formula to get the probability of the query (and of each matching answer)
The ProApproX System – Computation Algorithms
• Additive approximation:
   o For a fixed error ε and a DNF F, A(F) is an additive ε-approximation of Pr(F) with a probability of at least δ (a fixed reliability factor) if:
        Pr(F) − ε ≤ A(F) ≤ Pr(F) + ε
• Multiplicative approximation:
   o For a fixed error ε and a DNF F, A(F) is a multiplicative ε-approximation of Pr(F) with a probability of at least δ if:
        (1 − ε) Pr(F) ≤ A(F) ≤ (1 + ε) Pr(F)
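One standard way to obtain an additive ε-approximation with reliability δ is plain Monte Carlo sampling with a sample count taken from Hoeffding's inequality, N ≥ ln(2 / (1 − δ)) / (2ε²). The sketch below is an illustration under assumptions, not the ProApproX code: events are independent, clauses contain only positive literals, and all function names are hypothetical.

```python
import math
import random

PR = {"e1": 0.9, "e2": 0.8, "e6": 0.3, "e8": 0.8, "e9": 0.6, "e10": 0.7}
CLAUSES = [
    {"e2", "e8", "e1"},   # C1
    {"e2", "e8", "e6"},   # C2
    {"e2", "e9", "e10"},  # C3
    {"e2", "e9", "e6"},   # C4
]

def hoeffding_samples(eps, delta):
    """Number of samples so that |A(F) - Pr(F)| <= eps holds with
    probability at least delta (Hoeffding's inequality)."""
    return math.ceil(math.log(2.0 / (1.0 - delta)) / (2.0 * eps ** 2))

def naive_monte_carlo(clauses, pr, eps, delta, rng):
    """Fraction of randomly drawn worlds that satisfy the DNF."""
    n = hoeffding_samples(eps, delta)
    hits = 0
    for _ in range(n):
        # Draw each event independently according to its probability.
        world = {v: rng.random() < p for v, p in pr.items()}
        if any(all(world[v] for v in clause) for clause in clauses):
            hits += 1
    return hits / n

rng = random.Random(0)  # fixed seed only to make the run repeatable
print(naive_monte_carlo(CLAUSES, PR, eps=0.01, delta=0.95, rng=rng))
```

The sample count depends only on ε and δ, not on the formula. An additive guarantee is of limited use when Pr(F) is itself smaller than ε, which is the case the multiplicative definition addresses.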
DEMO 1 [SIGMOD 2011]
The ProApproX System – Computation Algorithms
• Exact computations:
   o The naïve algorithm (possible worlds): finding the satisfying assignments out of 2^#variables possible truth value assignments, O(2^n)
   o The sieve algorithm (the inclusion-exclusion principle): exponential in the number of clauses m, O(2^m)
The ProApproX System – Computation Algorithms
• Approximations:
   o Naïve Monte Carlo sampling for additive approximation: linear, but could take exponentially many samples to converge to a good approximation for low probabilities
   o Biased Monte Carlo sampling for multiplicative approximation: running time grows in O(n³ ln n) in the number of clauses (Kimelfeld, Kosharovsky, and Sagiv. 2009)
   o Self-Adjusting Coverage Algorithm for the DNF probability problem: linear in the length of F times ln(1/δ) / ε² (R. M. Karp, M. Luby, and N. Madras. 1989)
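To give a feel for coverage-based sampling, here is a sketch of the basic Karp-Luby coverage estimator for DNF probability. It is deliberately the simpler, fixed-sample-count variant and not the self-adjusting algorithm cited above, and it is not taken from ProApproX. It assumes independent events and positive literals only; `karp_luby` and its parameters are hypothetical names.

```python
import random
from math import prod

PR = {"e1": 0.9, "e2": 0.8, "e6": 0.3, "e8": 0.8, "e9": 0.6, "e10": 0.7}
CLAUSES = [
    {"e2", "e8", "e1"},   # C1
    {"e2", "e8", "e6"},   # C2
    {"e2", "e9", "e10"},  # C3
    {"e2", "e9", "e6"},   # C4
]

def karp_luby(clauses, pr, samples, rng):
    """Basic coverage estimator: sample (clause, world) pairs and count
    those where the sampled clause is the first one the world satisfies."""
    weights = [prod(pr[v] for v in clause) for clause in clauses]
    total = sum(weights)  # sum of Pr(C_i), an upper bound on Pr(F)
    hits = 0
    for _ in range(samples):
        # Pick clause i with probability Pr(C_i) / total.
        i = rng.choices(range(len(clauses)), weights=weights)[0]
        # Draw a world conditioned on clause i being true.
        world = {v: (v in clauses[i]) or (rng.random() < p)
                 for v, p in pr.items()}
        # Clause i is satisfied by construction, so a first clause exists.
        first = next(j for j, clause in enumerate(clauses)
                     if all(world[v] for v in clause))
        if first == i:
            hits += 1
    return total * hits / samples

rng = random.Random(1)  # fixed seed only to make the run repeatable
print(karp_luby(CLAUSES, PR, samples=20000, rng=rng))
```

Because every sampled world satisfies the formula, the fraction being estimated is at least 1 divided by the number of clauses, so the estimator does not degrade when Pr(F) is very small. This is the property that makes coverage sampling suitable for multiplicative approximation.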
The ProApproX System – Computation Algorithms
• Possibility to derive a multiplicative approximation from an additive approximation (and vice versa)
• Cost models and cost constants: