Querying Probabilistic XML Databases
Sept. 21st, 2012
Asma Souihli, Network and Computer Science Department

XML for semi-structured data (tree-like structure)

Probabilistic Data - PrXML
Jung-Hee Yun and Chin-Wan Chung, 2012.

Context
Uncertainty

Context
• In many of these tasks, information is described in a semi-structured manner
• Especially when the source (e.g., XML or HTML) is already in this form
• Representation by means of a hierarchy of nodes is natural

Outline
1. PrXML Models
   o Local Dependency
   o Long-distance Dependency
2. Querying P-documents
   o Types of Queries
   o Probabilistic Lineage
   o Complexity of Queries
3. The ProApproX System
   o Computation Algorithms
   o Lineage Decomposition Techniques
   o Evaluation Plans
   o Experiments
4. Conclusions

PrXML Models – Local Dependency
Local dependency (mux and ind nodes)

PrXML Models – Long-distance Dependency
• Local dependency (mux and ind nodes): PrXML{ind,mux}
• Long-distance dependency (conjunction of independent events, cie): PrXML{cie}
• Tractable translation from PrXML{ind,mux} to PrXML{cie}
[Figure: on the left, a parent node whose children carry the probabilities 0.3 and 0.7; on the right, its PrXML{cie} translation, where the children are annotated e1 and ¬e1 (with Pr(e1) = 0.3). A second PrXML{cie} tree shows children annotated e2 and e3 ∧ e4 below an ancestor node.]
S. Abiteboul, B. Kimelfeld, Y. Sagiv, and P. Senellart. 2009
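The translation of a mux node into cie annotations can be sketched as follows. This is one common construction, not necessarily the exact one used in the cited work: child i is guarded by "no earlier child was chosen, and a fresh event fires", with the fresh event's probability rescaled accordingly. The function names and the event names e1, e2, ... are hypothetical.

```python
from math import prod


def mux_to_cie(child_probs):
    """Turn the child probabilities of a mux node into cie conditions.

    Returns (events, conditions). `events` maps a fresh event name to its
    probability; `conditions[i]` is a list of (event, polarity) literals
    whose conjunction guards child i.
    """
    events = {}
    conditions = []
    negated_prefix = []  # literals stating that no earlier child was chosen
    remaining = 1.0      # probability mass not yet assigned to a child
    for i, p in enumerate(child_probs, start=1):
        # Probability of child i given that no earlier child was chosen.
        conditional = p / remaining if remaining > 1e-12 else 0.0
        if conditional >= 1.0 - 1e-12:
            # Child i is certain once earlier children are ruled out.
            conditions.append(list(negated_prefix))
        else:
            name = f"e{i}"
            events[name] = conditional
            conditions.append(negated_prefix + [(name, True)])
            negated_prefix = negated_prefix + [(name, False)]
        remaining -= p
    return events, conditions


def condition_probability(condition, events):
    """Probability of a conjunction of independent event literals."""
    return prod(events[e] if positive else 1.0 - events[e]
                for e, positive in condition)


# The slide's example: children with probabilities 0.3 and 0.7
# become children annotated e1 and not-e1, with Pr(e1) = 0.3.
events, conditions = mux_to_cie([0.3, 0.7])
print(events, conditions)
```

An ind node is simpler under the same assumptions: each child gets its own fresh event whose probability is the child's probability.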

Example
Event probabilities: Pr(e1) = .9, Pr(e2) = .8, Pr(e3) = .4, Pr(e4) = .1, Pr(e5) = .6, Pr(e6) = .3, Pr(e7) = .2, Pr(e8) = .8
[Figure: p-document tree rooted at Repository, with an Employee node that has a Name (Asma Souihli) and a Details node containing work (Place, address) and Contact (Phone, e-mail) subtrees. Nodes carry event annotations such as e1, e2, e8 and e4 ∧ ¬e5. Leaf values include Telecom Paristech, Gaumont Pathé, Paris 13, Paris 15, 0622330011, souihli@enst.fr and asma.souihli@gmail.com.]

Querying P-documents – Types of Queries
o Tree Pattern Queries (TPQ)
o Tree Pattern Queries with joins (TPQJ)
[Figure: example tree patterns over nodes labeled A, B, C and D.]

Example
• Q1: /Employee[Name="Asma Souihli"]//e-mail/text()
Event probabilities: e1 = .9, e2 = .8, e6 = .3, e8 = .8, e9 = .6, e10 = .7
Lineage clauses, one per match:
  C1 (enst.fr): e2 ∧ e8 ∧ e1
  C2 (gmail.com): e2 ∧ e8 ∧ e6
  C3 (sap.com): e2 ∧ e9 ∧ e10
  C4 (gmail.com): e2 ∧ e9 ∧ e6
[Figure: p-document tree rooted at Repository, with an Employee node that has a Name (Asma Souihli) and a Details node (e2); two Contact nodes annotated e8 and e9; e-mail leaves souihli@enst.fr (e1) and asma.souihli@gmail.com (e6) under the first Contact, souihli@sap.com (e10) and asma.souihli@gmail.com (e6) under the second.]

Querying PrXML – Probabilistic Lineage
• Probability to find an e-mail (probabilistic lineage, in DNF shape):
  Pr(Q1) = Pr(C1 ∨ C2 ∨ C3 ∨ C4)
• Possible results:
  Pr(asma.souihli@gmail.com) = Pr(C2 ∨ C4)
  Pr(souihli@enst.fr) = Pr(C1)
  Pr(souihli@sap.com) = Pr(C3)
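As a minimal sketch of how the lineage is organized, the matches of Q1 can be grouped by answer value so that each answer gets its own DNF, while the Boolean query keeps all clauses. The data structure (a list of pairs, clauses as sets of positive events) and the function name are illustrative assumptions.

```python
from collections import defaultdict

# One (answer, clause) pair per match of Q1; each clause is a conjunction
# of positive events, written as a frozenset of event names.
MATCHES = [
    ("souihli@enst.fr", frozenset({"e2", "e8", "e1"})),         # C1
    ("asma.souihli@gmail.com", frozenset({"e2", "e8", "e6"})),  # C2
    ("souihli@sap.com", frozenset({"e2", "e9", "e10"})),        # C3
    ("asma.souihli@gmail.com", frozenset({"e2", "e9", "e6"})),  # C4
]


def lineage_by_answer(matches):
    """Group clauses by answer: each answer's lineage is the OR of its clauses."""
    grouped = defaultdict(list)
    for answer, clause in matches:
        grouped[answer].append(clause)
    return dict(grouped)


answer_lineage = lineage_by_answer(MATCHES)
boolean_lineage = [clause for _, clause in MATCHES]  # C1 v C2 v C3 v C4

for answer, clauses in answer_lineage.items():
    print(answer, [sorted(c) for c in clauses])
```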

Querying PrXML – Probabilistic Lineage
• When is a linear computation possible?
  o If C1 and C2 are independent, then:
    Pr(C1 ∧ C2) = Pr(C1) × Pr(C2)
    Pr(C1 ∨ C2) = 1 − (1 − Pr(C1)) × (1 − Pr(C2))
  o If C1 and C2 are inconsistent (mutually exclusive), then:
    Pr(C1 ∨ C2) = Pr(C1) + Pr(C2)

Back to the Example
Event probabilities: e1 = .9, e2 = .8, e6 = .3, e8 = .8, e9 = .6, e10 = .7
Pr(@enst.fr) = Pr(C1) = Pr(e2 ∧ e8 ∧ e1) = .8 × .8 × .9 = 0.576
Pr(@sap.com) = Pr(C3) = Pr(e2 ∧ e9 ∧ e10) = .8 × .6 × .7 = 0.336
Pr(@gmail.com) = Pr(C2 ∨ C4) = Pr((e2 ∧ e8 ∧ e6) ∨ (e2 ∧ e9 ∧ e6))
Factorization: Pr(@gmail.com) = Pr((e2 ∧ e6) ∧ (e8 ∨ e9)) = .8 × .3 × (1 − (1 − .8)(1 − .6)) = 0.2208
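The three per-answer probabilities can be reproduced with the independence rules for ∧ and ∨. This is a small sketch; the helper names are illustrative, and the events are assumed independent, as in the cie model.

```python
from math import prod

PR = {"e1": 0.9, "e2": 0.8, "e6": 0.3, "e8": 0.8, "e9": 0.6, "e10": 0.7}


def pr_and(*probs):
    """Probability of a conjunction of independent formulas."""
    return prod(probs)


def pr_or_independent(*probs):
    """Probability of a disjunction of independent formulas."""
    return 1.0 - prod(1.0 - p for p in probs)


def pr_or_exclusive(*probs):
    """Probability of a disjunction of mutually exclusive formulas."""
    return sum(probs)


p_enst = pr_and(PR["e2"], PR["e8"], PR["e1"])    # C1
p_sap = pr_and(PR["e2"], PR["e9"], PR["e10"])    # C3
# C2 v C4 factorizes as (e2 ^ e6) ^ (e8 v e9); all parts are independent.
p_gmail = pr_and(PR["e2"], PR["e6"], pr_or_independent(PR["e8"], PR["e9"]))

print(round(p_enst, 4), round(p_sap, 4), round(p_gmail, 4))
```

The factorization step matters: C2 and C4 share e2 and e6, so applying pr_or_independent to Pr(C2) and Pr(C4) directly would give a wrong result.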

Querying PrXML – Probabilistic Lineage
Event probabilities: e1 = .9, e2 = .8, e6 = .3, e8 = .8, e9 = .6, e10 = .7
Pr(Q1) = Pr(C1 ∨ C2 ∨ C3 ∨ C4)
       = Pr[(e2 ∧ e8 ∧ e1) ∨ (e2 ∧ e8 ∧ e6) ∨ (e2 ∧ e9 ∧ e10) ∨ (e2 ∧ e9 ∧ e6)]
Factorization:
       = Pr[e2 ∧ ((e8 ∧ (e1 ∨ e6)) ∨ (e9 ∧ (e10 ∨ e6)))]
Difficult to evaluate! The two disjuncts both contain e6, so they are neither independent nor mutually exclusive, and the linear rules do not apply directly.
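For this small formula, one way around the shared variable is to condition on e6 (Shannon expansion): once e6 is fixed, the remaining subformulas are independent. The slides do not state this technique; it is used here only to obtain a reference value for Pr(Q1).

```python
from math import prod

PR = {"e1": 0.9, "e2": 0.8, "e6": 0.3, "e8": 0.8, "e9": 0.6, "e10": 0.7}


def pr_or(*probs):
    """Disjunction of independent formulas."""
    return 1.0 - prod(1.0 - p for p in probs)


# Inner formula: (e8 ^ (e1 v e6)) v (e9 ^ (e10 v e6))
# Case e6 = true: it simplifies to e8 v e9.
inner_if_e6 = pr_or(PR["e8"], PR["e9"])
# Case e6 = false: it simplifies to (e8 ^ e1) v (e9 ^ e10), two independent branches.
inner_if_not_e6 = pr_or(PR["e8"] * PR["e1"], PR["e9"] * PR["e10"])

# Weight both cases by Pr(e6) and multiply by the independent factor e2.
p_q1 = PR["e2"] * (PR["e6"] * inner_if_e6 + (1.0 - PR["e6"]) * inner_if_not_e6)
print(round(p_q1, 6))  # approximately 0.689856
```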

Solutions
• One possible (naïve) way is to find the truth-value assignments that satisfy the propositional formula (the probabilistic lineage), out of 2^n possible assignments (worlds) for n event variables
• Then sum the probabilities of these satisfying assignments to get the answer

e1     e2     e6     e8     e9     e10    Probability  C1 ∨ C2 ∨ C3 ∨ C4
false  false  false  false  false  false  0.000336     false
false  false  false  false  false  true   0.000784     false
false  false  false  false  true   false  0.000504     false
…      …      …      …      …      …      …            …
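A minimal sketch of this naïve possible-worlds computation for the Q1 lineage, assuming independent events and clauses made of positive events only (the function name is illustrative):

```python
from itertools import product

PR = {"e1": 0.9, "e2": 0.8, "e6": 0.3, "e8": 0.8, "e9": 0.6, "e10": 0.7}
CLAUSES = [
    {"e2", "e8", "e1"},   # C1
    {"e2", "e8", "e6"},   # C2
    {"e2", "e9", "e10"},  # C3
    {"e2", "e9", "e6"},   # C4
]


def dnf_probability_naive(clauses, pr):
    """Sum the probabilities of all worlds that satisfy at least one clause."""
    variables = sorted(pr)
    total = 0.0
    # Enumerate all 2^n truth-value assignments (worlds).
    for values in product([False, True], repeat=len(variables)):
        world = dict(zip(variables, values))
        if any(all(world[e] for e in clause) for clause in clauses):
            weight = 1.0
            for event, value in world.items():
                weight *= pr[event] if value else 1.0 - pr[event]
            total += weight
    return total


p_q1 = dnf_probability_naive(CLAUSES, PR)
print(round(p_q1, 6))  # approximately 0.689856
```

With 6 events this visits 64 worlds; each extra event doubles the work.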

Querying PrXML – Complexity of Queries
• Computing the total probability of the satisfying assignments of the DNF (lineage formula) is a #P-hard problem
  o No polynomial-time algorithm for the exact solution if P ≠ NP
  o #P problems ask "how many" rather than "are there any"
    Example: how many graph colorings using k colors are there for a particular graph G?

Querying PrXML – Complexity of Queries
• A union of sets (clauses) problem: #P-hard problem

\[
\begin{aligned}
\Pr(C_1) &= \Pr(e_1)\,\Pr(e_2) \\
\Pr(C_2) &= \Pr(e_2)\,\Pr(e_3) \\
\Pr(C_1 \wedge C_2) &= \Pr(e_1)\,\Pr(e_2)\,\Pr(e_3) \\
\Pr(C_1 \vee C_2) &= \Pr(C_1) + \Pr(C_2) - \Pr(C_1 \wedge C_2)
\end{aligned}
\]

For dependent probabilistic clauses C_1, ..., C_n the inclusion-exclusion principle becomes:

\[
\Pr\Big(\bigvee_{i=1}^{n} C_i\Big) = \sum_{k=1}^{n} (-1)^{k-1} \sum_{\substack{J \subseteq \{1,\dots,n\} \\ |J| = k}} \Pr(C_J),
\qquad \text{where } C_J = \bigwedge_{j \in J} C_j
\]
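The inclusion-exclusion formula can be sketched directly. The sketch assumes independent events and clauses that contain positive events only, so that the conjunction C_J is simply the union of the event sets of its clauses; the function name is illustrative.

```python
from itertools import combinations
from math import prod

PR = {"e1": 0.9, "e2": 0.8, "e6": 0.3, "e8": 0.8, "e9": 0.6, "e10": 0.7}
CLAUSES = [
    {"e2", "e8", "e1"},   # C1
    {"e2", "e8", "e6"},   # C2
    {"e2", "e9", "e10"},  # C3
    {"e2", "e9", "e6"},   # C4
]


def dnf_probability_sieve(clauses, pr):
    """Inclusion-exclusion over all non-empty subsets J of the clauses."""
    total = 0.0
    for k in range(1, len(clauses) + 1):
        sign = (-1) ** (k - 1)
        for subset in combinations(clauses, k):
            # C_J: conjunction of the clauses in J = union of their events.
            events = set().union(*subset)
            total += sign * prod(pr[e] for e in events)
    return total


p_q1 = dnf_probability_sieve(CLAUSES, PR)
print(round(p_q1, 6))  # approximately 0.689856
```

The outer loops visit 2^m − 1 subsets for m clauses, which is where the exponential cost in the number of clauses comes from.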

The ProApproX System [CIKM 2012, SIGMOD 2011]
• Translates into a probabilistic database with only cie nodes
• Translates the user query into a lineage query
[Figure: system architecture in five numbered steps. User input (XPath query Q), query translation, BaseX querying the PrXML database, ProApproX processing (lineage preprocessing, exploration of the best execution plan, compilation, computation), and the result Pr(Q) returned as the answer in the user interface.]

Back to the Example
Q1: /Employee[Name="Asma Souihli"]//e-mail/text()

• To get the lineage for the Boolean projection:

    for $x1 in /employee
    for $x2 in $x1/name[.="Asma Souihli"]
    for $x3 in $x1//email/text()
    let $leaves := ($x2, $x3)
    (: collect the event attributes on the path from each matched leaf to the root :)
    let $atts := (for $i in $leaves return $i/ancestor-or-self::*/attribute(event))
    return text { distinct-values(for $att in $atts return string($att)) }

• To get lineages of answers:

    (: one <match> per distinct answer value, one <clause> per match of that value :)
    for $val in distinct-values(/employee[name="Asma Souihli"]//email/text())
    order by $val
    return
      <match>{$val}{
        for $x1 in /employee
        for $x2 in $x1/name[.="Asma Souihli"]
        for $x3 in $x1//email/text()
        let $leaves := ($x2, $x3)
        let $atts := (for $i in $leaves return $i/ancestor-or-self::*/attribute(event))
        where $x3 = $val
        return <clause>{distinct-values(for $att in $atts return string($att))}</clause>
      }</match>
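The same idea, collecting the event attributes found between each matched leaf and the root, can be sketched outside XQuery. The sketch below uses Python's standard ElementTree in place of BaseX; the small document, the placement of the event attributes and the function names are all hypothetical and only chosen to reproduce the clauses C1 to C4.

```python
import xml.etree.ElementTree as ET

# Hypothetical cie document: the `event` attribute holds a node's condition.
DOC = """
<employee>
  <name>Asma Souihli</name>
  <details event="e2">
    <contact event="e8">
      <email event="e1">souihli@enst.fr</email>
      <email event="e6">asma.souihli@gmail.com</email>
    </contact>
    <contact event="e9">
      <email event="e10">souihli@sap.com</email>
      <email event="e6">asma.souihli@gmail.com</email>
    </contact>
  </details>
</employee>
"""

root = ET.fromstring(DOC)
# ElementTree has no parent pointers, so build a child -> parent map once.
PARENT = {child: node for node in root.iter() for child in node}


def ancestor_events(node):
    """Events on the node itself and on all of its ancestors."""
    events = []
    while node is not None:
        if "event" in node.attrib:
            events.append(node.attrib["event"])
        node = PARENT.get(node)
    return events


def answer_lineage(employee, person):
    """Map each e-mail value to its list of clauses (sets of events)."""
    lineage = {}
    name = employee.find("name")
    if name is None or name.text != person:
        return lineage
    for email in employee.iter("email"):
        clause = frozenset(ancestor_events(name) + ancestor_events(email))
        lineage.setdefault(email.text, []).append(clause)
    return lineage


for value, clauses in sorted(answer_lineage(root, "Asma Souihli").items()):
    print(value, [sorted(c) for c in clauses])
```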

The ProApproX System [CIKM 2012, SIGMOD 2011]
• Translates into a probabilistic database with only cie nodes
• Translates the user query into a lineage query
• Is built on top of a native XML DBMS
• Processes the lineage formula to get the probability of the query (and of each matching answer)

The ProApproX System – Computation Algorithms
• Additive approximation
  o For a fixed error ε and a DNF F, A(F) is an additive ε-approximation of Pr(F) with a probability of at least δ (a fixed reliability factor) if:
    Pr(F) − ε ≤ A(F) ≤ Pr(F) + ε
• Multiplicative approximation
  o For a fixed error ε and a DNF F, A(F) is a multiplicative ε-approximation of Pr(F) with a probability of at least δ if:
    (1 − ε) Pr(F) ≤ A(F) ≤ (1 + ε) Pr(F)
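The two guarantees differ most for small probabilities, which a short check makes concrete. The predicate names and the numbers are illustrative; the reliability factor δ concerns how often a randomized estimate satisfies the bound and is not modeled here.

```python
def is_additive_approx(estimate, exact, eps):
    """Pr(F) - eps <= A(F) <= Pr(F) + eps."""
    return exact - eps <= estimate <= exact + eps


def is_multiplicative_approx(estimate, exact, eps):
    """(1 - eps) * Pr(F) <= A(F) <= (1 + eps) * Pr(F)."""
    return (1.0 - eps) * exact <= estimate <= (1.0 + eps) * exact


# For a rare event, an estimate that is five times too large still
# counts as a good additive approximation, but not a multiplicative one.
exact, estimate, eps = 0.001, 0.005, 0.01
print(is_additive_approx(estimate, exact, eps))        # True
print(is_multiplicative_approx(estimate, exact, eps))  # False
```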

DEMO 1 [SIGMOD 2011]

The ProApproX System – Computation Algorithms
• Exact computations:
  o The naïve algorithm (possible worlds): finding the satisfying assignments out of 2^n possible truth-value assignments, with n the number of variables: O(2^n)
  o The sieve algorithm (inclusion-exclusion principle): exponential in the number of clauses m: O(2^m)

The ProApproX System – Computation Algorithms
• Approximations:
  o Naïve Monte Carlo sampling for additive approximation: linear, but could take exponentially many samples to converge to a good approximation for low probabilities
  o Biased Monte Carlo sampling for multiplicative approximation (Kimelfeld, Kosharovsky, and Sagiv. 2009): running time grows in O(n³ ln n) in the number of clauses
  o Self-Adjusting Coverage Algorithm for the DNF probability problem (R. M. Karp, M. Luby, and N. Madras. 1989): linear in the length of F times ln(1/δ) / ε²
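Two of these sampling schemes can be sketched on the Q1 lineage. The first is naïve Monte Carlo. The second is the basic coverage estimator in the style of Karp and Luby, not the self-adjusting variant cited above, and it is not claimed to be ProApproX's implementation. Clauses are assumed to contain positive, independent events only; the sample counts and seed are arbitrary.

```python
import random
from math import prod

PR = {"e1": 0.9, "e2": 0.8, "e6": 0.3, "e8": 0.8, "e9": 0.6, "e10": 0.7}
CLAUSES = [
    frozenset({"e2", "e8", "e1"}),   # C1
    frozenset({"e2", "e8", "e6"}),   # C2
    frozenset({"e2", "e9", "e10"}),  # C3
    frozenset({"e2", "e9", "e6"}),   # C4
]


def sample_world(pr, rng, forced=frozenset()):
    """Draw each event independently; events in `forced` are set to true."""
    return {e: (e in forced) or (rng.random() < p) for e, p in pr.items()}


def satisfied(clause, world):
    return all(world[e] for e in clause)


def naive_monte_carlo(clauses, pr, samples, rng):
    """Fraction of sampled worlds that satisfy the DNF."""
    hits = 0
    for _ in range(samples):
        world = sample_world(pr, rng)
        hits += any(satisfied(c, world) for c in clauses)
    return hits / samples


def coverage_estimate(clauses, pr, samples, rng):
    """Basic coverage estimator: sample a clause, then a world satisfying it."""
    weights = [prod(pr[e] for e in c) for c in clauses]
    total = sum(weights)
    hits = 0
    for _ in range(samples):
        # Pick clause i with probability Pr(C_i) / sum of all Pr(C_j).
        i = rng.choices(range(len(clauses)), weights=weights)[0]
        world = sample_world(pr, rng, forced=clauses[i])
        # Count the world only for the first clause it satisfies,
        # so that each satisfying world is counted exactly once.
        first = next(j for j, c in enumerate(clauses) if satisfied(c, world))
        hits += (first == i)
    return total * hits / samples


rng = random.Random(0)
print(naive_monte_carlo(CLAUSES, PR, 10000, rng))
print(coverage_estimate(CLAUSES, PR, 10000, rng))
```

Both estimates should land near the exact value 0.689856 for this formula. The coverage estimator keeps a bounded relative error when Pr(F) is small, which is what makes it suitable for multiplicative guarantees.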

The ProApproX System – Computation Algorithms
• Possibility to derive a multiplicative approximation from an additive approximation (and vice versa)
• Cost models and cost constants:
