Inferring XML Schema Definitions from XML Data

Geert Jan Bex, Frank Neven and Stijn Vansummeren
Hasselt University and transnational University of Limburg, Belgium

Overview
• Introduction
• Complete algorithm iLocal
• Heuristic iXSD
• Experiments
• Conclusions

Motivation for schemas
• Why schemas?
  – automation & optimization of search
  – integration of XML data sources
  – translation & processing of XML data
  – used by software tools, e.g., JAXB, Castor
  – schema matching & model management
• Why infer schemas? (observations on real-world XML & XSDs)
  – 50% of XML documents on the web have no schema [Barbosa et al., 2005]
  – 33% of schemas are not valid [Bex et al., 2004, 2005]

Motivation for XSD inference
• DTD inference
  – XTract [Garofalakis et al., 2003]
  – trang [Clark]
  – iDTD [Bex et al., 2006]
• XSD inference
  – trang
  – XStruct
  – JAXB, .Net
These tools output XSD syntax, but the result is equivalent to a DTD: their expressive power is limited to that of DTDs!

How do DTDs and XSDs differ?
[Figure: example XML tree. A store element contains order elements and a stock element. Each item below order has children id, qty, price; each item below stock has children id, qty and possibly nested item elements.]
In DTDs, either:
  item → id, qty, (price + item*)
or:
  order_item → id, qty, price
  stock_item → id, qty, stock_item*
With XSDs, both kinds of item can be described under the same element name.

XSD: abstract syntax

  <xsd:element name="store" type="store"/>
  <xsd:complexType name="store">
    <xsd:sequence>
      <xsd:element name="order" type="order" minOccurs="0" maxOccurs="unbounded"/>
      <xsd:element name="stock" type="stock"/>
    </xsd:sequence>
  </xsd:complexType>
  <xsd:complexType name="order">
    <xsd:sequence>
      <xsd:element name="customer" type="customer"/>
      <xsd:element name="item" type="item1" minOccurs="1" maxOccurs="unbounded"/>
    </xsd:sequence>
  </xsd:complexType>

In abstract syntax:
  root → store[store]
  store → order[order]*, stock[stock]
  order → customer[customer], item[item1]+

Motivating example for XSD
[Figure: the example store tree, with order items carrying id, qty, price and stock items carrying id, qty and nested items]
DTD:
  root → store
  store → order*, stock
  order → customer, item+
  stock → item+
  item → id, qty, (price + item*)
XSD:
  root → store[store]
  store → order[order]*, stock[stock]
  order → customer[customer], item[item1]+
  stock → item[item2]+
  item1 → id[id], qty[qty], price[price]
  item2 → id[id], qty[qty], item[item2]*

Inference of XSDs
XML → XSD
• Problem: infer an XSD from an XML corpus
• Requirement: the result must be concise, i.e., humans can interpret and validate it
• But… Theorem [Gold, 1967]: in general, this is impossible to learn from positive data only

XSD property
W3C specs: Element Declarations Consistent (EDC): no two elements with the same name but distinct types in the same content model
Not allowed:
  sometype → item[item1]+, item[item2]+
The content model of an element depends on its context.
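The EDC constraint can be illustrated with a small check. This is only a sketch: the representation of a content model as a list of (element name, type name) pairs and the function name `violates_edc` are assumptions made for illustration, not part of the W3C specification or of the slides.

```python
def violates_edc(content_model):
    """Return True if one element name occurs with two distinct types.

    content_model: list of (element_name, type_name) pairs, a simplified
    stand-in for the particles of a real XSD content model.
    """
    seen = {}
    for name, type_name in content_model:
        # setdefault stores the first type seen for this element name
        if seen.setdefault(name, type_name) != type_name:
            return True
    return False


# sometype -> item[item1]+, item[item2]+   (the forbidden rule above)
bad = [("item", "item1"), ("item", "item2")]
# order -> customer[customer], item[item1]+
good = [("customer", "customer"), ("item", "item1")]

print(violates_edc(bad))   # True
print(violates_edc(good))  # False
```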

XML validation for XSD
[Figure: the example store tree with every element annotated by its type, e.g. item[item1] below order and item[item2] below stock]
XSD:
  root → store[store]
  store → order[order]*, stock[stock]
  order → customer[customer], item[item1]+
  stock → item[item2]+
  item1 → id[id], qty[qty], price[price]
  item2 → id[id], qty[qty], item[item2]*
If the XML is valid: the type assignment is determined by the path from the element to the root.

XML validation for XSD
Theorem [Martens et al., 2006]: The content model of an element is uniquely determined by the path from the root to that element.
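The following sketch shows why the path determines the type: under EDC, each type maps a child element name to exactly one child type, so types can be assigned top-down. The dictionary `XSD`, the function `assign_types`, and the tiny sample document are illustrative assumptions. The code only assigns types; it does not check content models.

```python
import xml.etree.ElementTree as ET

# type name -> {child element name -> child type name}
# (the example XSD in abstract syntax; EDC makes each inner map a function)
XSD = {
    "root": {"store": "store"},
    "store": {"order": "order", "stock": "stock"},
    "order": {"customer": "customer", "item": "item1"},
    "stock": {"item": "item2"},
    "item1": {"id": "id", "qty": "qty", "price": "price"},
    "item2": {"id": "id", "qty": "qty", "item": "item2"},
}


def assign_types(elem, parent_type="root", path=""):
    """Yield (path, type) pairs; the type depends only on the ancestor path."""
    type_name = XSD.get(parent_type, {}).get(elem.tag)  # None if not allowed
    here = f"{path}/{elem.tag}"
    yield here, type_name
    for child in elem:
        yield from assign_types(child, type_name, here)


doc = ET.fromstring(
    "<store>"
    "<order><customer/><item><id/><qty/><price/></item></order>"
    "<stock><item><id/><qty/><item><id/><qty/></item></item></stock>"
    "</store>"
)
types = dict(assign_types(doc))
print(types["/store/order/item"])  # item1
print(types["/store/stock/item"])  # item2
```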

XSD observations: local context
• Large, diverse corpus of real-world XSDs [Bex et al., 2004, Martens et al., 2006]
  – 98% of XSDs have only local context: the relevant ancestor path has length at most 3, i.e., up to the "great-grandfather"
[Figure: the path store / order / item above the children id, qty, price]

XSD observations: SOREs
• Large, diverse corpus of real-world XSDs [Bex et al., 2004, Martens et al., 2006]
  – 99% of regular expressions are single occurrence
• What's a Single Occurrence Regular Expression (SORE)?
  – header, protein, organism, reference*, comment*, genetics*, complex*, function*, classification?, keywords?, feature*, summary, sequence
  – authors, citation, volume?, month?, year, pages?, (title + descr)?, xrefs?
  – title, (author, affiliation?)+, abstract
• … and what's not
  – title, ((author, affiliation)+ + (editor, affiliation)+), abstract
    (duplicate element names: affiliation occurs twice)
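The single-occurrence property is purely syntactic, so it can be checked by counting element names. A minimal sketch, assuming expressions are written as on the slide and that element names match the simple pattern used below (the function name `is_single_occurrence` is hypothetical):

```python
import re


def is_single_occurrence(expr):
    """True if every element name occurs at most once in the expression."""
    # Operators (, + * ? and parentheses) are skipped; only names are collected.
    names = re.findall(r"[A-Za-z_][\w.-]*", expr)
    return len(names) == len(set(names))


sore = "title, (author, affiliation?)+, abstract"
not_sore = "title, ((author, affiliation)+ + (editor, affiliation)+), abstract"

print(is_single_occurrence(sore))      # True
print(is_single_occurrence(not_sore))  # False
```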

Main result
Theorem: XSDs with local context and SORE content models are learnable from positive examples only.

Algorithm iLocal
[Figure: the corpus, consisting of two store trees]
For every ancestor path, collect the child strings that occur in the corpus:
  λ → {store}
  store → {order order stock, stock}
  store/order → {customer item item, customer item}
  store/stock → {item, item item}
  store/order/item → {id qty price}
  store/stock/item → {id qty, id qty item item}
  store/stock/item/item → {id qty item item, id qty}
  store/stock/item/item/item → {id qty}
Paths are types [Martens et al., 2006].
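The collection step can be sketched with the standard library XML parser. The function name `child_strings` and the two-document corpus are assumptions for illustration (the corpus is smaller than the one on the slide). Unlike the slide, the sketch also records leaf elements, whose only child string is the empty string.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def child_strings(docs):
    """Map each ancestor path to the set of child strings seen below it."""
    table = defaultdict(set)

    def visit(elem, path):
        here = path + (elem.tag,)
        # The child string is the sequence of child element names, in order.
        table["/".join(here)].add(" ".join(child.tag for child in elem))
        for child in elem:
            visit(child, here)

    for root in docs:
        table[""].add(root.tag)  # the empty path plays the role of λ
        visit(root, ())
    return table


docs = [
    ET.fromstring(
        "<store><order><customer/><item><id/><qty/><price/></item></order>"
        "<stock><item><id/><qty/></item></stock></store>"
    ),
    ET.fromstring(
        "<store><stock><item><id/><qty/>"
        "<item><id/><qty/></item></item></stock></store>"
    ),
]
table = child_strings(docs)
for path in ("", "store", "store/stock/item"):
    print(repr(path), sorted(table[path]))
```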

Algorithm iLocal
Full ancestor paths:
  λ → {store}
  store → {order order stock, stock}
  store/order → {customer item item, customer item}
  store/stock → {item, item item}
  store/order/item → {id qty price}
  store/stock/item → {id qty, id qty item item}
  store/stock/item/item → {id qty item item, id qty}
  store/stock/item/item/item → {id qty}
With locality k = 2, only the last two labels of each path are kept and the corresponding sets are merged:
  λ → {store}
  store → {order order stock, stock}
  store/order → {customer item item, customer item}
  store/stock → {item, item item}
  order/item → {id qty price}
  stock/item → {id qty, id qty item item}
  item/item → {id qty item item, id qty}
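The truncation to locality k can be sketched as follows. The function name `localize` is hypothetical; the input table is the one shown on the slide, and k is assumed to be at least 1.

```python
def localize(table, k):
    """Keep only the last k labels of each path and merge the string sets."""
    if k < 1:
        raise ValueError("k must be at least 1")
    merged = {}
    for path, strings in table.items():
        labels = path.split("/") if path else []
        key = "/".join(labels[-k:])
        merged.setdefault(key, set()).update(strings)
    return merged


table = {
    "": {"store"},
    "store": {"order order stock", "stock"},
    "store/order": {"customer item item", "customer item"},
    "store/stock": {"item", "item item"},
    "store/order/item": {"id qty price"},
    "store/stock/item": {"id qty", "id qty item item"},
    "store/stock/item/item": {"id qty item item", "id qty"},
    "store/stock/item/item/item": {"id qty"},
}
local = localize(table, 2)
for path in sorted(local):
    print(repr(path), sorted(local[path]))
```

The two deepest paths both end in item/item, so their sets are merged into a single entry, as on the slide.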

Algorithm iLocal
Child strings per type (k = 2):
  λ → {store}
  store → {order order stock, stock}
  store/order → {customer item item, customer item}
  store/stock → {item, item item}
  order/item → {id qty price}
  stock/item → {id qty, id qty item item}
  item/item → {id qty item item, id qty}
Applying iSOA and ToSORE [Bex et al., 2006] to each set yields the XSD:
  root → store[store]
  store → order[store/order]*, stock[store/stock]
  store/order → customer[order/customer], item[order/item]+
  store/stock → item[stock/item]+
  order/item → id[item/id], qty[item/qty], price[item/price]
  stock/item → id[item/id], qty[item/qty], item[item/item]*
  item/item → id[item/id], qty[item/qty], item[item/item]*
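The slide does not spell out iSOA or ToSORE. As a rough illustration of the first step only, the sketch below assumes that a single occurrence automaton can be represented by the set of observed "may be followed by" pairs, with artificial source and sink markers. The names `build_soa`, `<src>` and `<sink>` are hypothetical, and the conversion of the automaton into a SORE is not shown.

```python
def build_soa(strings):
    """Collect the edges of an automaton with one state per element name."""
    edges = set()
    for s in strings:
        symbols = ["<src>"] + s.split() + ["<sink>"]
        # One edge for every pair of adjacent symbols in a child string.
        edges.update(zip(symbols, symbols[1:]))
    return edges


# Child strings observed for store/order
soa = build_soa({"customer item item", "customer item"})
for edge in sorted(soa):
    print(edge)
```

The loop on item together with the single path from customer is what a later step could turn into customer, item+.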

Algorithm iLocal
• Theorem: iLocal is sound: the corpus is valid with respect to the inferred XSD
• Theorem: iLocal is k-complete: if the corpus is "sufficiently large", then the inferred XSD is equivalent to the target XSD

Algorithm Minimize
Before (stock/item and item/item are duplicate types):
  root → store[store]
  store → order[store/order]*, stock[store/stock]
  store/order → customer[order/customer], item[order/item]+
  store/stock → item[stock/item]+
  order/item → id[item/id], qty[item/qty], price[item/price]
  stock/item → id[item/id], qty[item/qty], item[item/item]*
  item/item → id[item/id], qty[item/qty], item[item/item]*
After Minimize:
  root → store[store]
  store → order[store/order]*, stock[store/stock]
  store/order → customer[order/customer], item[order/item]+
  store/stock → item[item2]+
  order/item → id[item/id], qty[item/qty], price[item/price]
  item2 → id[item/id], qty[item/qty], item[item2]*
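One way to find duplicate types is partition refinement: start with all types in one block and split blocks until types in the same block have the same content model up to block membership. This is a sketch under simplifying assumptions, not necessarily how Minimize is implemented: content models are restricted to sequences of (element, type, quantifier) triples, types without a rule (such as item/id) are treated as opaque, and the function name `minimize` is hypothetical.

```python
def minimize(xsd):
    """Return a map from type name to block id; equal ids mean duplicates."""
    block = {t: 0 for t in xsd}
    while True:
        def signature(t):
            # Replace each child type by its current block (if it has a rule).
            return tuple((name, block.get(child, child), q)
                         for name, child, q in xsd[t])

        ids = {}
        new_block = {}
        for t in xsd:
            key = (block[t], signature(t))
            new_block[t] = ids.setdefault(key, len(ids))
        # Blocks are only ever split, so an equal count means a fixpoint.
        if len(set(new_block.values())) == len(set(block.values())):
            return new_block
        block = new_block


xsd = {
    "store": [("order", "store/order", "*"), ("stock", "store/stock", "")],
    "store/order": [("customer", "order/customer", ""), ("item", "order/item", "+")],
    "store/stock": [("item", "stock/item", "+")],
    "order/item": [("id", "item/id", ""), ("qty", "item/qty", ""), ("price", "item/price", "")],
    "stock/item": [("id", "item/id", ""), ("qty", "item/qty", ""), ("item", "item/item", "*")],
    "item/item": [("id", "item/id", ""), ("qty", "item/qty", ""), ("item", "item/item", "*")],
}
blocks = minimize(xsd)
print(blocks["stock/item"] == blocks["item/item"])  # True
```

On the example, stock/item and item/item end up in the same block, which corresponds to the merged type item2 on the slide.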

In practice: incomplete data
[Figure: a corpus with a single store tree whose stock contains nested item elements]
iLocal, k = 2:
  stock/item → {id qty, id qty item item}
  item/item → {id qty item, id qty}
iSOA, ToSORE:
  stock/item → id[item/id], qty[item/qty], item[item/item]*
  item/item → id[item/id], qty[item/qty], item[item/item]?
Minimize can't minimize!
Incomplete data ⇒ iLocal derives too many types!

Practical heuristics
• Define a "distance" between types
  – details: see paper
• For two types: if their distance is below ε, unify them (= Reduce)
• Our practical algorithm iXSD: [formula not recoverable from the slide text]
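The slide defers the actual distance to the paper, so the sketch below substitutes a clearly hypothetical one: the Jaccard distance between the sets of child element names of two types. It only illustrates the "unify if closer than ε" step. The names `jaccard_distance` and `reduce_types`, and the threshold value, are assumptions.

```python
def jaccard_distance(a, b):
    """Hypothetical stand-in for the paper's distance between two types."""
    return 1 - len(a & b) / len(a | b) if a | b else 0.0


def reduce_types(alphabets, eps):
    """Unify types whose distance is below eps.

    alphabets: type name -> set of child element names.
    Returns a map from each type to the representative of its merged group.
    """
    parent = {t: t for t in alphabets}

    def find(t):
        while parent[t] != t:
            t = parent[t]
        return t

    names = sorted(alphabets)
    for i, s in enumerate(names):
        for t in names[i + 1:]:
            if jaccard_distance(alphabets[s], alphabets[t]) < eps:
                parent[find(t)] = find(s)  # unify the two groups
    return {t: find(t) for t in names}


alphabets = {
    "order/item": {"id", "qty", "price"},
    "stock/item": {"id", "qty", "item"},
    "item/item": {"id", "qty", "item"},
}
merged = reduce_types(alphabets, eps=0.3)
print(merged["stock/item"] == merged["item/item"])  # True
print(merged["order/item"] == merged["item/item"])  # False
```

With this stand-in distance, the two types that Minimize could not merge on incomplete data (stock/item and item/item) are unified, while order/item stays separate.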
