outline
play

Outline XML Documents: Quick Review on XML Limitations and - PDF document

Relational Databases for Querying Outline XML Documents: Quick Review on XML Limitations and Opportunities The Problem Definition General Approach to Querying XML using a Jayavel Shanmugasundaram RDBMS Kristin Tufte Gang He


  1. Relational Databases for Querying Outline XML Documents: • Quick Review on XML Limitations and Opportunities • The Problem Definition • General Approach to Querying XML using a Jayavel Shanmugasundaram RDBMS Kristin Tufte Gang He • The focus: XML DTD to Relational Schema Chun Zhang – General idea and problems David DeWitt – 3 inlining techniques Jeffrey Naughton • Basic • Shared Presented by Mingwei JI • Hybrid Discussion led by Atsushi WATASE • Summary What is the big deal about XML? XML in One Slide • eXtensible Markup Language (XML) • Fast emerging as dominant standard for data • Hierarchical document format for information representation on WWW exchange in WWW • Self describing data (tags) • Exciting database opportunity: • Semi-structured – Unlike HTML, tags are not only for presentation • Nested element structure having a root – Can capture semantics • Element data can have – Can query the web if we can query XML!!! – Attributes – Sub-elements XML Example DTDs: Schema for XML Docs <book> <!ELEMENT book (booktitle, author)> <booktitle> The Selfish Gene </booktitle> <author id = “dawkins”> <!ELEMENT booktitle (#PCDATA)> <name> <firstname> Richard </firstname> <!ELEMENT author (name, address)> <lastname> Dawkins </lastname> <!ATTLIST author id ID #REQUIRED> </name> <address> … </address> </author> </book> 1

  2. The Problem Approach • Given: • Process a DTD to generate a relational – DTDs schema ( our focus today ) – Collection of XML documents conforming • Parse XML documents conforming to to DTDs DTDs and load them into tuples of • Query: RDBMS – Based on DTD schemas • Translate semi-structured queries over – Over collection of XML documents, XML documents into SQL queries over performing selections, joins, etc. RDBMS – Producing an XML result • Convert the results back to XML Approach DTD to Relational Schema • General Approach: XML XML-QL XML DTD Documents Query Result – Each Element ==> Relation Automatic Translation Layer – Each Attribute of Element ==> Column of Relation Relational SQL Relational Tuples Schema – Connect elements using foreign keys Query Result Translation • Problem? Information – Fragmentation! Commercial RDBMS Fragmentation: Example DTD to Relational Schema <!ELEMENT author (name, address)> author (authorID: integer, id: string) <!ATTLIST author id ID #REQUIRED> name (nameID: integer, authorID: integer) <!ELEMENT name (firstname?, lastname)> firstname (firstnameID: integer, nameID: integer, value: string) • “inlining” <!ELEMENT firstname (#PCDATA)> lastname (lastnameID: integer, nameID: integer, <!ELEMENT lastname (#PCDATA)> value: string) “Put as many descendants or sub-elements of <!ELEMENT address ANY> address (addressID: integer, authorID: integer, value: an element as possible into a single relation” string) • Results in 5 relations • Just retrieving first and last names of an author requires three joins! 2

  3. book monograph article DTD Graph Basic Inlining Technique • Intuition: ? title booktitle – Inline as many sub-elements as possible * editor contactauthor – Do not inline only if it is a set sub-element. * – Connect relations using foreign keys authorID • Complications: author name – A document can be rooted at any element • Create separate relational schema for each name authorid address root – Recursion ? • Detect cycles in schema firstname lastname Basic : Relational Schema Basic Inlining Technique: Pros book (bookID: integer, book.booktitle : string, book.author.name.firstname: string, • Reduces number of joins for queries like book.author.name.lastname: string, “get the first and last names of a book book.author.address: string, author.authorid: string) author” … • Efficient for queries such as “list all author (authorID: integer, authors of books” author.name.firstname: string, author.name.lastname: string, author.address: string, author.authorid: string) … address (…) Basic Inlining Technique: Cons Experimental Results • Queries like “list all authors with name • Is Basic practical? Jack” – Many DTDs have recursion – Union of 5 queries! – Large strongly connected components • Generate a Large number of relations: – Schema translation program ran out of virtual memory!!! – Separate relational schema for each element as root • Concentrate on Shared vs. Hybrid 3

  4. Shared Inlining Technique Shared Inlining Technique • Intuition: The principal idea behind “ Shared” – Inline as many sub-elements as possible – To identify the element nodes that are – Do not inline only if it is a shared, recursive represented in multiple relations in “ Basic ” or set sub-element. and to share them by creating separate • Technique: relations for these elements. – Necessary and Sufficient Condition for – To ensure that an element node is shared/ recursive element: represented in exactly one relation In-degree >= 2 in DTD graph Shared: Relational Schema Shared Inlining Techniques: Pros book (…) article (…) monograph (…) title (…) + Reduces number of joins for queries like author (authorID: integer, “get the first and last names of an author” author.parentID: integer, + Efficient for queries such as “list all author.parentCODE: integer, authors with name Jack” author.name.isroot: boolean, author.name.firstname.isroot: :boolean, author.name.firstname: string, author.name.lastname.isroot: boolean, author.name.lastname: string, author.address.isroot: boolean, author.address: string, author.authorid: string) Shared Inlining Technique: Cons Hybrid Inlining Technique - Sharing whenever possible implies extra • Inlines some elements that are shared in joins for path expressions Shared • “Article with a given title name” – Elements with in-degree >= 2 that are not set sub-elements or recursive • Handles set and recursive sub-elements as in Shared 4

  5. Hybrid: Relational Schema Hybrid Inlining Technique: Pros book (…) + Reduces joins through shared elements article (…) monograph (…) (that are not set or recursive elements) author (authorID: integer, author.parentID: integer, + Shares some strengths of Shared : author.parentCODE: integer, • Reduces joins for queries like “get first and author.name.isroot: boolean, author.name.firstname.isroot: boolean, last names of a book author” author.name.firstname: string, author.name.lastname.isroot: boolean, author.name.lastname: string, author.address.isroot: boolean, author.address: string, author.authorid: string) Hybrid Inlining Technique: Cons Group Discussion • Q1. Assume we’ll store XML documents into database. - Requires more SQL sub-queries to Which one would you pick and why? retrieve all authors with first name Jack – Create an XML database from scratch • Tradeoff between reducing number of queries – Use a relational backend and reducing number of joins • Q2. In what condition, would you pick another one? • Shared and Hybrid target query- and join- • Q3. Assume we pick “relational” to the Q1. Which inlining reduction respectively technique would you use? – Basic – Shared – Hybrid Summary Reference • Relational model can be used to query • http://www.cs.wisc.edu/~jai/pubs.html XML documents by Jayavel Shanmugasundaram • 4 steps • Converting XML DTD to relations: – General idea – 3 Inling Techniques • Basic • Shared • Hybrid 5

  6. Thank You! 6

Download Presentation
Download Policy: The content available on the website is offered to you 'AS IS' for your personal information and use only. It cannot be commercialized, licensed, or distributed on other websites without prior consent from the author. To download a presentation, simply click this link. If you encounter any difficulties during the download process, it's possible that the publisher has removed the file from their server.

Recommend


More recommend