Querying Probabilistic XML Databases
Asma Souihli
Sept. 21st, 2012
Network and Computer Science Department
XML for semi-structured data (tree-like structure)
Probabilistic Data - PrXML
Jung-Hee Yun and Chin-Wan Chung, 2012.
Context
(mux and ind nodes)
(Conjunction of independent events, cie)
[Figure: a parent node whose two children carry probabilities 0.3 and 0.7 (mux and ind nodes); ancestor and parent nodes whose children carry event annotations e1, e2 and e3 ∧ e4 (cie).]
Tractable translation: PrXML{ind,mux} to PrXML{cie}
and P. Senellart. 2009
(With Pr(e1) = 0.3)
[Figure: example PrXML document. Node labels: Repository, t1, t2, Employee, Name, Details, work, Place, Contact, Phone, address, e-mail. Values: Asma Souihli (Name); Telecom Paristech, Gaumont Pathé (Place); 0622330011 (Phone); Paris 13, Paris 15 (address); souihli@enst.fr, asma.souihli@gmail.com (e-mail). Event annotations on the nodes: e1, e2, e3, e4 ∧ ¬e5, e5, e6, e7, e8.]
Pr(e1) = 0.9, Pr(e2) = 0.8, Pr(e3) = 0.4, Pr(e4) = 0.1, Pr(e5) = 0.6, Pr(e6) = 0.3, Pr(e7) = 0.2, Pr(e8) = 0.8
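As a rough illustration of how such annotations are read, the sketch below computes the probability of a conjunction of event literals, assuming the events are independent and each appears at most once in the conjunction. The function names and the "!" prefix for negation are hypothetical choices made for this example only.

```python
from math import prod

# Event probabilities as listed for the example document.
PR = {"e1": 0.9, "e2": 0.8, "e3": 0.4, "e4": 0.1,
      "e5": 0.6, "e6": 0.3, "e7": 0.2, "e8": 0.8}

def literal_probability(literal, pr=PR):
    """Probability of one literal; a leading '!' marks negation (hypothetical notation)."""
    if literal.startswith("!"):
        return 1.0 - pr[literal[1:]]
    return pr[literal]

def conjunction_probability(literals, pr=PR):
    """Probability of a conjunction of literals over distinct, independent events."""
    return prod(literal_probability(lit, pr) for lit in literals)

# The annotation e4 ∧ ¬e5 holds with probability Pr(e4) * (1 - Pr(e5)).
p = conjunction_probability(["e4", "!e5"])
print(round(p, 4))
```

The product rule only applies because the events are independent; a conjunction mentioning the same event twice would need deduplication first.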
enst.fr: e2 ∧ e8 ∧ e1 (C1)
gmail.com: e2 ∧ e8 ∧ e6 (C2)
sap.com: e2 ∧ e9 ∧ e10 (C3)
gmail.com: e2 ∧ e9 ∧ e6 (C4)
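[Figure: example document. Under Repository, t1 holds an Employee with Name (Asma Souihli), Details and a Contact with e-mail nodes souihli@enst.fr and asma.souihli@gmail.com, annotated with events e2, e1, e6, e8; t2 holds a Contact with e-mail nodes asma.souihli@gmail.com and souihli@sap.com, annotated with events e10, e6, e9.]
Pr(e1) = 0.9, Pr(e2) = 0.8, Pr(e9) = 0.6, Pr(e10) = 0.7, Pr(e6) = 0.3, Pr(e8) = 0.8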
e1     e2     e6     e8     e9     e10    Probability  C1 ∨ C2 ∨ C3 ∨ C4
false  false  false  false  false  false  0.0845       false
false  false  false  false  false  true   0.3345       false
false  false  false  false  true   false  0.87         false
…      …      …      …      …      …      …            …
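The truth-table approach can be sketched as follows: enumerate every truth assignment of the events, weight it by its probability, and sum the weights of the assignments that satisfy the DNF. This is a minimal sketch assuming independent events and the clauses C1 to C4 with the event probabilities of the example; the function name is hypothetical.

```python
from itertools import product

PR = {"e1": 0.9, "e2": 0.8, "e6": 0.3, "e8": 0.8, "e9": 0.6, "e10": 0.7}

# Each clause is the set of events that must all be true.
CLAUSES = [
    {"e2", "e8", "e1"},   # C1
    {"e2", "e8", "e6"},   # C2
    {"e2", "e9", "e10"},  # C3
    {"e2", "e9", "e6"},   # C4
]

def exact_dnf_probability(clauses, pr):
    """Sum the probabilities of all truth assignments that satisfy the DNF."""
    events = sorted(pr)
    total = 0.0
    for values in product([False, True], repeat=len(events)):
        world = dict(zip(events, values))
        if any(all(world[e] for e in clause) for clause in clauses):
            weight = 1.0
            for e in events:
                weight *= pr[e] if world[e] else 1.0 - pr[e]
            total += weight
    return total

print(exact_dnf_probability(CLAUSES, PR))
```

With six events the loop visits 2^6 = 64 assignments; the count doubles with every additional event, which is why this method does not scale.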
How many graph colorings using k colors are there for a particular graph G?
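For dependent probabilistic clauses C_1 ... C_n, the inclusion-exclusion principle becomes:

\[
\Pr\Bigl(\bigvee_{i=1}^{n} C_i\Bigr) = \sum_{k=1}^{n} (-1)^{k-1} \sum_{\substack{J \subseteq \{1,\dots,n\} \\ |J| = k}} \Pr(C_J)
\qquad \text{where: } C_J = \bigwedge_{j \in J} C_j
\]

\[
\begin{aligned}
\Pr(C_1) &= \Pr(e_1) \cdot \Pr(e_2) \\
\Pr(C_2) &= \Pr(e_2) \cdot \Pr(e_3) \\
\Pr(C_1 \wedge C_2) &= \Pr(e_1) \cdot \Pr(e_2) \cdot \Pr(e_3) \\
\Pr(C_1 \vee C_2) &= \Pr(C_1) + \Pr(C_2) - \Pr(C_1 \wedge C_2)
\end{aligned}
\]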
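The inclusion-exclusion computation can be sketched in a few lines, assuming a DNF whose clauses are conjunctions of positive, independent events (so the conjunction of several clauses is the union of their event sets). The function name and the sample probabilities are illustrative assumptions.

```python
from itertools import combinations
from math import prod

def sieve_probability(clauses, pr):
    """Inclusion-exclusion over a positive DNF; clauses are sets of independent events."""
    total = 0.0
    for k in range(1, len(clauses) + 1):
        sign = 1.0 if k % 2 == 1 else -1.0
        for subset in combinations(clauses, k):
            # The conjunction of the chosen clauses requires every event in their union.
            events = set().union(*subset)
            total += sign * prod(pr[e] for e in events)
    return total

# Two clauses sharing event e2: C1 = e1 ∧ e2, C2 = e2 ∧ e3.
pr = {"e1": 0.9, "e2": 0.8, "e3": 0.4}
print(sieve_probability([{"e1", "e2"}, {"e2", "e3"}], pr))
```

The outer loops visit every non-empty subset of clauses, so the cost is exponential in the number of clauses.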
[Figure: ProApproX architecture, with steps numbered 1 to 5. User interface: user input (XPath query Q) and answer (result Pr(Q)). ProApproX (processing): query translation, lineage preprocessing, compilation, exploration (best execution plan), computation. BaseX (querying) over the PrXML database.]
[CIKM 2012, SIGMOD 2011]
(: Lineage of the Boolean query: event annotations on the paths to the matched nodes :)
for $x1 in /employee
for $x2 in $x1/name[. = "Asma Souihli"]
for $x3 in $x1//email/text()
let $leaves := ($x2, $x3)
let $atts := (for $i in $leaves return $i/ancestor-or-self::*/attribute(event))
return text { distinct-values(for $att in $atts return string($att)) }

(: Lineage of each distinct answer: one <match> per value, one <clause> per match :)
for $val in distinct-values(/employee[name = "Asma Souihli"]//email/text())
return
  <match>{$val}{
    for $x1 in /employee
    for $x2 in $x1/name[. = "Asma Souihli"]
    for $x3 in $x1//email/text()
    let $leaves := ($x2, $x3)
    let $atts := (for $i in $leaves return $i/ancestor-or-self::*/attribute(event))
    where $x3 = $val
    return <clause>{distinct-values(for $att in $atts return string($att))}</clause>
  }</match>
query (and of each matching answer)
Additive approximation:
ε-approximation of Pr(F) with a probability of at least δ (a fixed reliability factor) if:
Multiplicative approximation:
ε-approximation of Pr(F) with a probability of at least δ if:
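The conditions themselves are not reproduced here. Under the usual textbook definitions (an assumption, not taken from the slides), with \tilde{p} denoting the estimate returned by the algorithm, they would read:

```latex
% Additive approximation
\Pr\bigl(\,\lvert \tilde{p} - \Pr(F) \rvert \le \varepsilon \,\bigr) \ge \delta

% Multiplicative approximation
\Pr\bigl(\,\lvert \tilde{p} - \Pr(F) \rvert \le \varepsilon \cdot \Pr(F) \,\bigr) \ge \delta
```

The multiplicative guarantee is the stronger one when Pr(F) is small, since the allowed error shrinks with Pr(F).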
Finding the satisfying assignments out of 2^#variables possible truth value assignments
Exponential in the number of clauses m
Linear, but could take exponentially many samples to converge to a good approximation for low probabilities
Running time grows in O(n^3 ln n), with n the number of clauses
Linear in the length of F times ln(1/δ)/ε^2
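To make the sampling idea concrete, here is a minimal sketch of a naive Monte Carlo estimator for a positive DNF over independent events. It is an illustration under stated assumptions, not the implementation discussed in the talk: the sample count comes from Hoeffding's inequality, which yields an additive guarantee, δ is read as the reliability (for example 0.95), and all names are hypothetical.

```python
import math
import random

def hoeffding_samples(eps, delta):
    """Samples sufficient for an additive eps-approximation with probability >= delta."""
    return math.ceil(math.log(2.0 / (1.0 - delta)) / (2.0 * eps ** 2))

def monte_carlo_dnf(clauses, pr, eps=0.1, delta=0.95, rng=None):
    """Estimate Pr(DNF) as the fraction of sampled worlds that satisfy some clause."""
    rng = rng or random.Random()
    n = hoeffding_samples(eps, delta)
    hits = 0
    for _ in range(n):
        # Draw one possible world: each event is true with its own probability.
        world = {e: rng.random() < p for e, p in pr.items()}
        if any(all(world[e] for e in clause) for clause in clauses):
            hits += 1
    return hits / n

pr = {"e1": 0.9, "e2": 0.8, "e6": 0.3, "e8": 0.8, "e9": 0.6, "e10": 0.7}
clauses = [{"e2", "e8", "e1"}, {"e2", "e8", "e6"},
           {"e2", "e9", "e10"}, {"e2", "e9", "e6"}]
estimate = monte_carlo_dnf(clauses, pr, eps=0.1, delta=0.95, rng=random.Random(0))
print(estimate)
```

Each sample costs time linear in the size of the formula. An additive error of 0.1 is useless when the true probability is itself far below 0.1, which is where a multiplicative guarantee would require many more samples.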
Kimelfeld, Kosharovsky, and Sagiv. 2009
Factorization: e3 ∧ (e4 ∨ e5)
Exact/naïve algorithm OR approximation
(e6 ∧ e8) ∨ (e1 ∧ e2) ∨ (e3 ∧ e4) ∨ (e3 ∧ e5) ∨ (¬e3) ∨ (e6 ∧ e7) ∨ (e8)
approximations of Pr(ψ1) and Pr(ψ2), to a factor of ε1 and ε2, respectively. Then 1 - (1 - p̃1)(1 - p̃2) is an additive approximation of Pr(φ) to a factor of ε if:
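The condition is not reproduced here. As a hedged illustration, the sketch below combines two estimates for a disjunction of two independent sub-formulas and computes a sufficient error bound that follows from elementary algebra: if each estimate is off by at most eps1 and eps2, the combined estimate is off by at most eps1 + eps2 + eps1 * eps2. Whether this is the exact condition stated in the talk is an assumption; the function names are hypothetical.

```python
def combine_independent_or(p1, p2):
    """Probability of a disjunction of two independent formulas."""
    return 1.0 - (1.0 - p1) * (1.0 - p2)

def additive_error_bound(eps1, eps2):
    """Upper bound on the additive error of the combined estimate.

    With a = 1 - p1, b = 1 - p2 and errors d1, d2 (|d1| <= eps1, |d2| <= eps2):
    |(a + d1)(b + d2) - a*b| = |a*d2 + b*d1 + d1*d2| <= eps1 + eps2 + eps1*eps2,
    because a and b lie in [0, 1].
    """
    return eps1 + eps2 + eps1 * eps2

# True probabilities 0.72 and 0.42, estimates off by 0.01 and 0.02.
exact = combine_independent_or(0.72, 0.42)
approx = combine_independent_or(0.73, 0.44)
print(abs(approx - exact) <= additive_error_bound(0.01, 0.02))
```

A target error ε for the whole formula can then be split between the two sub-formulas by choosing any eps1, eps2 with eps1 + eps2 + eps1 * eps2 ≤ ε, giving the cheaper sub-formula the tighter budget if desired.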
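[Figure: tree containing a ∨ node, annotated with costs: cost(φ) = 200, cost(ψ1) = 1, cost(ψ2) = 35, cost(ψ3) = 8, cost(ψ4) = 6, cost(ψ5) = 3, cost(ψ6) = 2, cost(ψ7) = 1, cost(ψ8) = 15, cost(ψ9) = 8, cost(ψ10) = 12, cost(ψ11) = 10, cost(ψ12) = 9.]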
Running time of the different algorithms on the MondialDB dataset
Proportion of time (MondialDB - Best Tree)
Relative error on the probabilities computed by the algorithm on MondialDB over each non-join query, with respect to the exact probability values (ε = 0.1, δ = 95%)
Running time of the different algorithms on a given query of the movie dataset (times greater than 5 s are not shown)
Running time of the different algorithms
architectures