CS490W: Semi-Structured Data


  1. CS490W Semi-Structured Data: XML Data and Retrieval
     Luo Si, Department of Computer Science, Purdue University

     XML and Retrieval: Outline
     - Semi-Structured Data
       - XML, Examples, Application
     - XML Search
       - XQuery
       - XIRQL
     - Text-Based XML Retrieval
       - Vector-space model
       - INEX

     Semi-Structured Data
     XML has been used as the standard representation of semi-structured data.
     - eXtensible Markup Language (XML) is a W3C-recommended general-purpose markup language that supports a wide variety of applications.
     - A framework for defining markup languages
     - Open vocabulary for tags
     - Each set of XML tags corresponds to a different application
     - Facilitates the sharing of data across different information systems, particularly systems connected via the Internet
     - Examples: RSS, XHTML, MathML

     Structure of XML
     - XML data is organized by documents, like unstructured data
     - There are structures (nodes/tags) within the documents
     - Each XML document is an ordered, labeled tree
     - Element nodes are labeled with
       - Node name (e.g., chapter)
       - Node attributes and their values (e.g., size=1000; time=01/01/2007)
       - May have child nodes or data
     - Data (e.g., text strings) exist within leaf nodes

     XML Example

         <book id="ML_Tom">
           <title> Machine Learning </title>
           <author>
             <firstname> Tom </firstname>
             <surname> Mitchell </surname>
           </author>
           ...
           <p> Machine Learning Applications. ..</p>
           ...
         </book>

     The example contains elements, attributes/values, and data (text strings).

     XML Example (tree view)
     The same kind of document drawn as a tree:

         book
           title
           author
             firstname
             surname
           chapter
             title
             para
             para
             para
           chapter
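To make the "ordered, labeled tree" view concrete, the following is a minimal sketch that parses the book example with Python's standard `xml.etree.ElementTree` module and walks it in document order. The helper name `walk` is made up for illustration, and the parts the slide elides with "..." are simply left out rather than filled in.

```python
import xml.etree.ElementTree as ET

# The book example from the slides; the elided "..." parts are omitted.
BOOK_XML = """
<book id="ML_Tom">
  <title> Machine Learning </title>
  <author>
    <firstname> Tom </firstname>
    <surname> Mitchell </surname>
  </author>
  <p> Machine Learning Applications. </p>
</book>
"""


def walk(node, depth=0):
    """Yield (depth, name, attributes, text) for every element, in document order."""
    text = (node.text or "").strip()
    yield depth, node.tag, dict(node.attrib), text
    for child in node:  # iterating an element visits its children in document order
        yield from walk(child, depth + 1)


root = ET.fromstring(BOOK_XML)
rows = list(walk(root))
for depth, name, attrs, text in rows:
    # Indentation mirrors the depth of the node in the tree.
    print("  " * depth + name, attrs, repr(text))
```

Each row carries the three kinds of label the slide lists: the node name, its attributes and values, and (for leaf nodes) the text data.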

  2. Elements
     - Elements are defined by markup tags
     - Elements: <TagName attr_a="value"…>text</TagName>
       - ID of the element is TagName
       - Attribute: attr_a; value: "value"
       - Data/text: "text"
       - End tag: </TagName>

     XML, HTML, SGML
     - 1986: SGML (ISO 8879-1986)
     - Nov 1995: HTML 2.0
     - Nov 1996: Simplified and stripped-down SGML draft (dubbed XML)
     - Jan 1997: HTML 3.2
     - Aug 1997: XML working draft
     - Dec 1997: XML 1.0 proposed recommendation
     - Jan 1998: XML
     - Feb 1999: XHTML

     XML and HTML
     - Both are derived from SGML
     - HTML is a markup language mainly for display in browsers
     - XML is a framework for markup languages
     - HTML defines display
     - XML defines the data structure; the display is separated from the content
     - HTML can be formalized as XML (XHTML)

     Why XML?
     - Unlike a relational database, XML data does not require relational schemata, etc., because the data itself contains this information.
     - Unlike HTML, the widely used Web format, which only ensures the correct presentation of the formatted data, XML also describes the data itself, so the data stays usable by other applications.

     XML Applications
     - CML: Chemical Markup Language
     - WML: Wireless Markup Language
     - ThML: Theological Markup Language

     XML Applications: CML
     - CML (Chemical Markup Language) is a new approach to managing molecular information using tools such as XML and Java. It was the first domain-specific implementation based strictly on XML.
     - Example start tag: <molecule convention="MDLMol" id="baclofen" title="BACLOFEN">
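The anatomy of an element described above (tag name, attribute, value, text, end tag) can be checked with a short sketch. It uses Python's standard `xml.etree.ElementTree`; the helper names `describe` and `is_well_formed` are hypothetical and exist only for this illustration.

```python
import xml.etree.ElementTree as ET


def describe(xml_text):
    """Split a single element into the parts named on the slide."""
    elem = ET.fromstring(xml_text)
    return {
        "name": elem.tag,                 # the TagName
        "attributes": dict(elem.attrib),  # attribute name -> value
        "text": elem.text,                # data between start tag and end tag
    }


def is_well_formed(xml_text):
    """Return True if the parser accepts the text, False on a parse error."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


parts = describe('<TagName attr_a="value">text</TagName>')
print(parts)

# The end tag has to match the start tag, otherwise the parser rejects the input.
print(is_well_formed('<TagName attr_a="value">text</TagName>'))
print(is_well_formed('<TagName attr_a="value">text</Other>'))
```

The second helper shows why the end tag matters: a start tag closed by a different name is not well-formed XML and is rejected before any application logic runs.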

  3. XML Applications: WML
     - WML (Wireless Markup Language) is a content format for devices that implement the Wireless Application Protocol (WAP) specification, such as mobile phones.

         <?xml version="1.0"?>
         <!DOCTYPE wml PUBLIC "-//PHONE.COM//DTD WML 1.1//EN" "http://www.phone.com/dtd/wml11.dtd" >
         <wml>
           <card id="main" title="First Card">
             <p mode="wrap">This is a sample WML page.</p>
           </card>
         </wml>

     XML Applications: ThML
     - ThML: Theological Markup Language

         <ThML>
           <ThML.body>
             <div1>
               <div2 title="Genesis" id="Gen">
                 <div3 title="Chapter 1">
                   <p>
                     <scripture/>
                     In the beginning God created the heaven and the earth.
                     <scripture/>
                     And the earth was without form, and void; and darkness was upon the face of the deep.
                     And the Spirit of God moved upon the face of the waters.
                   </p>
                 </div3>
               </div2>
             </div1>
           </ThML.body>
         </ThML>

     XML Files
     - Schema/DTD: syntax definition of an XML language; Document Type Definition (DTD file)
     - XML provides an application-independent way of sharing data. With a DTD, independent groups of people can agree to use a common DTD for interchanging data. However, this is often NOT the case.
     - Example: a DTD (the declarations inside <!DOCTYPE note [ ... ]>) followed by an XML document that conforms to it

         <?xml version="1.0"?>
         <!DOCTYPE note [
           <!ELEMENT note (to,from,heading,body)>
           <!ELEMENT to (#PCDATA)>
           <!ELEMENT from (#PCDATA)>
           <!ELEMENT heading (#PCDATA)>
           <!ELEMENT body (#PCDATA)>
         ]>
         <note>
           <to>Tove</to>
           <from>Jani</from>
           <heading>Reminder</heading>
           <body>Don't forget me this weekend!</body>
         </note>

     XML Files: XML Schema
     - XML Schema is recommended by the W3C as the successor of DTDs. It is often referred to informally as XSD (XML Schema Definition), the initialism for XML Schema instances. XSDs are far more powerful than DTDs in describing XML languages.

         <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
           <xs:element name="country" type="Country"/>
           <xs:complexType name="Country">
             <xs:sequence>
               <xs:element name="name" type="xs:string"/>
               <xs:element name="population" type="xs:decimal"/>
             </xs:sequence>
           </xs:complexType>
         </xs:schema>

     XML Search
     - Most XML search protocols use a database-based approach
       - Non-text data match
       - Exact keyword (text) match
       - Evaluate XML path expressions
       - No concept of relevance
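What the `note` DTD constrains can be illustrated with a small hand-rolled check. This is only a sketch under stated assumptions: Python's `xml.etree.ElementTree` does not validate against a DTD, so the content model is copied by hand into a dictionary (`CONTENT_MODEL`, a hypothetical name), and the check covers only fixed child sequences and text-only elements, not the full DTD syntax.

```python
import xml.etree.ElementTree as ET

# Hand-written mirror of the DTD declarations: element name -> required child sequence.
# An empty tuple stands for (#PCDATA): text only, no child elements.
CONTENT_MODEL = {
    "note": ("to", "from", "heading", "body"),
    "to": (),
    "from": (),
    "heading": (),
    "body": (),
}


def conforms(elem):
    """Check an element (and its subtree) against CONTENT_MODEL."""
    expected = CONTENT_MODEL.get(elem.tag)
    if expected is None:
        return False  # element is not declared at all
    if tuple(child.tag for child in elem) != expected:
        return False  # wrong children, wrong order, or a missing child
    return all(conforms(child) for child in elem)


valid = ET.fromstring(
    "<note><to>Tove</to><from>Jani</from>"
    "<heading>Reminder</heading><body>Don't forget me this weekend!</body></note>"
)
swapped = ET.fromstring(
    "<note><from>Jani</from><to>Tove</to>"
    "<heading>Reminder</heading><body>Don't forget me this weekend!</body></note>"
)
missing = ET.fromstring("<note><to>Tove</to><from>Jani</from><heading>Reminder</heading></note>")

print(conforms(valid), conforms(swapped), conforms(missing))
```

Both rejected documents are well-formed XML; they fail only because they break the agreed structure, which is the role a shared DTD plays when groups exchange data.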

  4. XML Search
     - Traditional XML search from the database-based approach
       - XQuery
       - Searches multiple types of data: value-based (e.g., price of a book); ids (ISBN of a book); keyword match (text)
     - XML text search from the information retrieval approach
       - XIRQL
       - Vector-space based
       - Searches text data: estimates the relevance of XML elements with respect to the query
       - Query may contain path expressions

     XML Search: XQuery
     - SQL for XML
     - Used for text-rich documents; data-oriented documents (non-text); mixed documents
     - Considers: path expressions (XPath); XML Schema datatypes
     - It is still a working draft; details are being improved

     XML Search: XQuery principal forms
     - XQuery considers some principal forms
       - Path expressions
       - Conditional expressions
       - Datatype expressions
       - List expressions
       - etc.
       - Programming language: Flowers (FLWOR) expression
     - Principal forms can be evaluated with respect to a context

     Principal Forms
     - Path query

         /book//title contains "Information Retrieval"

       The title of the book contains the keywords "Information Retrieval".
     - Conditional expressions

         $h/title,
         IF $h/@type = "Journal" THEN ….

       Applies if the type of an article is journal.

     Flowers (FLWR)
     - Programming language: Flowers (FLWR) expression
     - XQuery defines FLWOR or FLWR (often pronounced 'flower') as an expression that supports iteration and binding of variables to intermediate results.
       - for and let create a sequence of tuples
       - where filters the tuples on a boolean expression
       - order by sorts the tuples, using any comparable data
       - return gets evaluated once for every tuple

     Flowers (FLWR): example

         for $d in document("depts.xml")//deptno
         let $e := document("emps.xml")//employee[deptno = $d]
         where count($e) >= 10
         order by avg($e/salary) descending
         return
           <big-dept>
             { $d,
               <headcount>{count($e)}</headcount>,
               <avgsal>{avg($e/salary)}</avgsal> }
           </big-dept>
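The path query and the FLWOR clauses above can be mimicked step by step in Python, as a rough sketch rather than an XQuery implementation. The element layout of `depts.xml` and `emps.xml` is not given in the slides, so the tiny in-memory documents below are assumptions, as are the helper names `titles_containing` and `big_departments`. The `return` step yields plain tuples instead of building `<big-dept>` elements, and the demo uses a headcount threshold of 2 instead of 10 to keep the data small.

```python
import xml.etree.ElementTree as ET
from statistics import mean


def titles_containing(book, phrase):
    """Rough analogue of /book//title contains "phrase" (exact substring match)."""
    return [t.text.strip() for t in book.findall(".//title")
            if t.text and phrase in t.text]


def big_departments(depts, emps, min_headcount=10):
    """Mimic the for / let / where / order by / return clauses of the example."""
    tuples = []
    for d in depts.iter("deptno"):                    # for $d in ...//deptno
        e = [emp for emp in emps.iter("employee")     # let $e := ...//employee[deptno = $d]
             if emp.findtext("deptno") == d.text]
        if len(e) >= min_headcount:                   # where count($e) >= threshold
            avgsal = mean(float(emp.findtext("salary")) for emp in e)
            tuples.append((d.text, len(e), avgsal))   # return: one result per tuple
    tuples.sort(key=lambda t: t[2], reverse=True)     # order by avg(...) descending
    return tuples


# Assumed document shapes, made up for this illustration.
book = ET.fromstring(
    "<book><title>Modern Information Retrieval</title>"
    "<chapter><title>Information Retrieval Models</title></chapter>"
    "<chapter><title>Indexing</title></chapter></book>"
)
depts = ET.fromstring(
    "<depts><dept><deptno>D1</deptno></dept><dept><deptno>D2</deptno></dept>"
    "<dept><deptno>D3</deptno></dept></depts>"
)
emps = ET.fromstring(
    "<emps>"
    "<employee><deptno>D1</deptno><salary>100</salary></employee>"
    "<employee><deptno>D1</deptno><salary>200</salary></employee>"
    "<employee><deptno>D2</deptno><salary>300</salary></employee>"
    "<employee><deptno>D2</deptno><salary>500</salary></employee>"
    "<employee><deptno>D3</deptno><salary>50</salary></employee>"
    "</emps>"
)

matches = titles_containing(book, "Information Retrieval")
result = big_departments(depts, emps, min_headcount=2)
print(matches)
print(result)
```

Both functions return exact matches only, which reflects the database-style approach described earlier: an element either satisfies the query or it does not, with no relevance score attached.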
