XML and Web Data
CMPT 354: Database I -- XML 2
Data in HTML
- HyperText Markup Language
– Different data elements are set out using tags
- No schema?
– Based on the data itself, we can make a reasonable guess about the structure
– “Self-describing”
Object and Schema
Semi-structured Data
- Object-like: it can be represented as a collection of objects
- Schemaless: it is not guaranteed to conform to any type structure
- Self-describing
– Often carries only the names of the attributes and has a lower degree of organization than the data in the database
- Semi-structured data: data with the above characteristics
Schemaless But Self-Describing
(#12345,
 [ListName: "Students",
  Contents: {
   [Name: "John Doe", Id: "111111111", Address: [Number: 123, Street: "Main St"]],
   [Name: "Joe Public", Id: "666666666", Address: [Number: 666, Street: "Hollow Rd"]]
  }
 ]
)
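A minimal Python sketch of what "schemaless but self-describing" means in practice. The nested dict is a hypothetical rendering of the object above (the object identifier #12345 is omitted), and `describe` is an illustrative helper, not part of any library. It guesses the structure from the data alone, without consulting a schema.

```python
# Hypothetical rendering of the object above: keys play the role of attribute names.
students = {
    "ListName": "Students",
    "Contents": [
        {"Name": "John Doe", "Id": "111111111",
         "Address": {"Number": 123, "Street": "Main St"}},
        {"Name": "Joe Public", "Id": "666666666",
         "Address": {"Number": 666, "Street": "Hollow Rd"}},
    ],
}


def describe(value):
    """Infer a structure description from the data itself (no schema involved)."""
    if isinstance(value, dict):
        return {key: describe(child) for key, child in value.items()}
    if isinstance(value, list):
        return [describe(child) for child in value]
    return type(value).__name__  # leaf value: report its Python type name


structure = describe(students)
print(structure)
```

Nothing guarantees that the second entry has the same attributes as the first; the description is only a reasonable guess based on the data at hand.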
XML
- Extensible Markup Language
– A standard adopted in 1998 by the W3C (World Wide Web Consortium)
- Optional mechanisms for specifying document structure
– DTD: the Document Type Definition language, part of the XML standard
– XML Schema: a more recent specification built on top of XML
- Query languages for XML
– XPath: lightweight
– XSLT: document transformation language
– XQuery: a full-blown language
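To give a feel for the lightweight end of this list, here is a hedged sketch using Python's standard `xml.etree.ElementTree`, which supports only a small subset of XPath. The sample document is an assumption modeled on the student list used elsewhere in these slides.

```python
import xml.etree.ElementTree as ET

# Assumed sample document, modeled on the student list in these slides.
doc = """<PersonList Type="Student">
  <Title Value="Student List"/>
  <Contents>
    <Person><Name>John Doe</Name><Id>111111111</Id></Person>
    <Person><Name>Joe Public</Name><Id>666666666</Id></Person>
  </Contents>
</PersonList>"""

root = ET.fromstring(doc)

# Path expression: all Name elements under Contents/Person.
names = [name.text for name in root.findall("./Contents/Person/Name")]

# Predicate: the Person whose Id child has exactly this text.
joe = root.find(".//Person[Id='666666666']/Name")

print(names)
print(joe.text)
```

XSLT and XQuery are not covered by the standard library and are not shown here.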
From HTML to XML
HTML and XML
- HTML
– A fixed number of tags
– Each tag has its own well-defined meaning
- E.g., <table> … </table>
- XML: HTML-like language
– An arbitrary number of user-defined tags
– No a priori semantics
– Mainly for data exchange
– Displayed using a stylesheet
Important Differences
- XML contains a large assortment of tags chosen by the document author
– The only valid tags in HTML are those sanctioned by the official specification of the language; other tags are ignored
- Every opening tag must have a matching closing tag, and the tags must be properly nested
– E.g., <a><b></a></b> is not allowed
– Some HTML tags are not required to be closed, e.g., <p>
- The document has a root element: the element that contains all other elements
Example
[Figure: an XML document annotated with the labels "Mandatory statement", "Root element", "XML elements", "Element names", and "Element contents"]
Hierarchical Structure
PersonList (Type: Student)
  Title
  Contents
    Person
      Name: John Doe
      Id: 111111111
      Address
        Number: 123
        Street: Main St
    Person
      Name: Joe Public
      Id: 666666666
      Address
        Number: 666
        Street: Hollow Rd
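The hierarchy above can be recovered mechanically from a document. The sketch below is illustrative only: the XML text is a reconstruction from the values shown in these slides, and `outline` is a hypothetical helper that walks the element tree recursively.

```python
import xml.etree.ElementTree as ET

# Reconstructed document (an assumption based on the values in these slides).
doc = """<PersonList Type="Student">
  <Title Value="Student List"/>
  <Contents>
    <Person>
      <Name>John Doe</Name><Id>111111111</Id>
      <Address><Number>123</Number><Street>Main St</Street></Address>
    </Person>
    <Person>
      <Name>Joe Public</Name><Id>666666666</Id>
      <Address><Number>666</Number><Street>Hollow Rd</Street></Address>
    </Person>
  </Contents>
</PersonList>"""


def outline(element, depth=0):
    """Return one indented line per element, parents before children."""
    text = (element.text or "").strip()
    label = f"{element.tag}: {text}" if text else element.tag
    lines = ["  " * depth + label]
    for child in element:  # children are visited in document order
        lines.extend(outline(child, depth + 1))
    return lines


tree_lines = outline(ET.fromstring(doc))
print("\n".join(tree_lines))
```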
Attributes
- <PersonList Type="Student">
– Type is the name of an attribute that belongs to the element PersonList
– Student is the attribute value
– All attribute values must be quoted
– Text strings between tags do not need to be quoted
- Empty element
– <Title Value="Student List"/>
– The element has one attribute and no content
– A shorthand for <Title Value="Student List"></Title>
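A small sketch, using Python's standard `xml.etree.ElementTree`, of how a parser sees these attributes. It checks that the empty-element shorthand and the long form parse to the same thing, and that an unquoted attribute value is rejected. The variable names are illustrative.

```python
import xml.etree.ElementTree as ET

person_list = ET.fromstring('<PersonList Type="Student"></PersonList>')
short_form = ET.fromstring('<Title Value="Student List"/>')
long_form = ET.fromstring('<Title Value="Student List"></Title>')

# Attributes are exposed as a name -> value mapping.
print(person_list.attrib)

# Both spellings of the empty element carry one attribute and no content.
same = (short_form.attrib == long_form.attrib
        and short_form.text is None and long_form.text is None
        and len(short_form) == 0 and len(long_form) == 0)
print(same)

# An unquoted attribute value is not accepted by the parser.
try:
    ET.fromstring("<PersonList Type=Student></PersonList>")
    unquoted_accepted = True
except ET.ParseError:
    unquoted_accepted = False
print(unquoted_accepted)
```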
Processing Instructions & Comments
- Processing instructions
– <?xml version="1.0" ?>
– Contain anything the author might want to communicate to the XML processor, e.g., <?my-command go bring coffee?>
– Rarely used
- Comment
– <!-- A comment -->
– Can occur everywhere except inside markup, i.e., between the symbols < and >
– An integral part of the document
– May be used by a receiver (e.g., a browser)
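As a rough illustration that processing instructions and comments are part of the document a receiver sees, the sketch below parses a small assumed document with Python's standard `xml.dom.minidom` and lists the top-level nodes. The Title element is borrowed from the earlier slide purely as a root element.

```python
from xml.dom import Node, minidom

# Assumed sample: declaration, a processing instruction, a comment, then the root.
text = ('<?xml version="1.0" ?>'
        '<?my-command go bring coffee?>'
        '<!-- A comment -->'
        '<Title Value="Student List"/>')

document = minidom.parseString(text)

found = []
for node in document.childNodes:
    if node.nodeType == Node.PROCESSING_INSTRUCTION_NODE:
        found.append(("pi", node.target, node.data))
    elif node.nodeType == Node.COMMENT_NODE:
        found.append(("comment", node.data))
    elif node.nodeType == Node.ELEMENT_NODE:
        found.append(("element", node.tagName))

print(found)
```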
CDATA Construct
- Includes strings of characters which contain markup elements that might make the document ill formed
- <![CDATA[ This is an example of markup in HTML: <b><i> Example </b></i>]]>
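A hedged sketch of the effect, using Python's standard `xml.etree.ElementTree`. The wrapping `Note` element is hypothetical; it is only there to give the CDATA section a root. Inside CDATA the markup is plain text, while the same characters outside CDATA make the document ill formed.

```python
import xml.etree.ElementTree as ET

html_snippet = " This is an example of markup in HTML: <b><i> Example </b></i>"

# Inside a CDATA section the markup is treated as character data.
protected = ET.fromstring(f"<Note><![CDATA[{html_snippet}]]></Note>")
print(protected.text)
print(len(protected))  # number of child elements

# Without CDATA the improperly nested tags are parsed as markup and rejected.
try:
    ET.fromstring(f"<Note>{html_snippet}</Note>")
    unprotected_ok = True
except ET.ParseError:
    unprotected_ok = False
print(unprotected_ok)
```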
XML Elements and Data Objects
- XML allows mixed data/text structure
- XML elements are ordered
- XML has only one primitive type, string, and very weak facilities for specifying constraints

<Address>
<Number> 123 </Number>
<Street> Main St </Street>
</Address>
is different from
<Address>
<Street> Main St </Street>
<Number> 123 </Number>
</Address>

A legal XML document:
<Address> Sally lives on <Street> Main St </Street> house number <Number> 123 </Number> in the beautiful Anytown, Canada. </Address>
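The three points above can be observed with Python's standard `xml.etree.ElementTree`. This is an illustrative sketch, not a statement about how any particular database treats XML; the variable names are made up for the example.

```python
import xml.etree.ElementTree as ET

first = ET.fromstring("<Address><Number> 123 </Number><Street> Main St </Street></Address>")
second = ET.fromstring("<Address><Street> Main St </Street><Number> 123 </Number></Address>")

# Elements are ordered: the two documents list their children differently.
order_first = [child.tag for child in first]
order_second = [child.tag for child in second]
print(order_first, order_second)

# Mixed content: text and subelements are interleaved inside Address.
mixed = ET.fromstring(
    "<Address> Sally lives on <Street> Main St </Street> house number "
    "<Number> 123 </Number> in the beautiful Anytown, Canada. </Address>"
)
sentence = " ".join("".join(mixed.itertext()).split())
print(sentence)

# Only strings: the parser hands back text, and any typing is up to the application.
number_text = mixed.find("Number").text
number = int(number_text)
print(repr(number_text), number)
```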
Use of Attributes
- An element can have any number of user-defined attributes
- What attributes can do can also be achieved with elements
– An attribute may occur only once within a tag, while subelements with the same tag may be repeated
- Attributes introduce ambiguity as to whether to represent information as attributes or elements
– Sometimes convenient for representing data, but the same can be done with elements
– The use of attributes is expected to decline

<Address>
<Number> 123 </Number>
<Street> Main St </Street>
</Address>

<Address Number="123" Street="Main St"/>
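To make the "attributes can be replaced by elements" point concrete, here is a minimal sketch with Python's standard `xml.etree.ElementTree`. The helper `attributes_to_elements` is hypothetical and deliberately simple: it ignores any existing text or children of the input element.

```python
import xml.etree.ElementTree as ET


def attributes_to_elements(element):
    """Return a new element whose subelements mirror the attributes of `element`.

    Illustrative only: existing text and children of `element` are ignored.
    """
    converted = ET.Element(element.tag)
    for name, value in element.attrib.items():
        ET.SubElement(converted, name).text = value
    return converted


compact = ET.fromstring('<Address Number="123" Street="Main St"/>')
expanded = attributes_to_elements(compact)
expanded_text = ET.tostring(expanded, encoding="unicode")
print(expanded_text)
```

The reverse conversion is not always possible, since a subelement may repeat while an attribute may occur only once.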
Attributes in Markup
<Act Number="5">
<Scene Number="1" Place="Mantua. A street">
…
<Apothecary Voice="scared">
Such mortal drugs I have; but Mantua’s law
Is death to any he that utters them.
</Apothecary>
<Romeo Voice="persistent">
Art thou so bare and full of wretchedness,
And fear’st to die? …
</Romeo>
…
</Scene>
</Act>
Advantages of Attributes
- Attributes in an element are not ordered
– <Address Number="123" Street="Main St"/>
– <Address Street="Main St" Number="123"/>
- Attributes are more succinct
- Attributes can be declared to have unique values and can be used to enforce a limited kind of referential integrity

<Address>
<Number> 123 </Number>
<Street> Main St </Street>
</Address>
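A short sketch of the first two points, using Python's standard `xml.etree.ElementTree`. Comparing the attribute mappings as dicts is one reasonable way to express "not ordered"; it is an illustration, not a formal definition of XML equivalence.

```python
import xml.etree.ElementTree as ET

attr_a = '<Address Number="123" Street="Main St"/>'
attr_b = '<Address Street="Main St" Number="123"/>'
element_form = "<Address><Number>123</Number><Street>Main St</Street></Address>"

# Attribute order does not matter: both parse to the same name -> value mapping.
same_attributes = ET.fromstring(attr_a).attrib == ET.fromstring(attr_b).attrib
print(same_attributes)

# Succinctness: the attribute form needs fewer characters for the same data.
shorter = len(attr_a) < len(element_form)
print(len(attr_a), len(element_form), shorter)
```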
ID and IDREF – Cross-References
Well Formed XML Document
- It has a root element
- Every opening tag is followed by a matching closing tag, and the elements are properly nested inside each other
- Any attribute can occur at most once in a given opening tag, its value must be provided, and this value must be quoted
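These conditions can be checked by simply trying to parse a document. The sketch below uses Python's standard `xml.etree.ElementTree`; the helper name `is_well_formed` and the sample strings are illustrative.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if the parser accepts `text` as a well formed XML document."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


# Each sample is paired with the outcome expected under the three conditions above.
expected = {
    "<a><b></b></a>": True,      # one root, properly nested
    "<a><b></a></b>": False,     # improperly nested
    "<a></a><b></b>": False,     # no single root element
    '<a x="1" x="2"/>': False,   # attribute occurs twice in one tag
    "<a x/>": False,             # attribute value missing
    "<a x=1/>": False,           # attribute value not quoted
}

results = {sample: is_well_formed(sample) for sample in expected}
for sample, outcome in results.items():
    print(f"{sample!r}: {outcome}")
```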
Namespaces
- A term (tag) might have different meanings in different contexts
– <Name><First>John</First> <Last>Doe</Last></Name>
– <Name>Simon Fraser University</Name>
- Every XML tag must have two parts: namespace and local name
– General structure: namespace:local-name
– Namespace represented by URI (uniform resource identifier)
- An abstract identifier (a general unique string)
- URL (uniform resource locator)
Example – Namespace
- Namespaces are defined using the attribute xmlns
– All names xml* should be considered reserved
- Default namespace xmlns="…"
– Only one default namespace
- Other namespace xmlns:toy="…"
– Prefixes (e.g., toy) must be distinct

<item xmlns="http://www.acmeinc.com/jp#supplies"
      xmlns:toy="http://www.acmeinc.com/jp#toys">
  <name>backpack</name>
  <feature>
    <toy:item>
      <toy:name>cyberpet</toy:name>
    </toy:item>
  </feature>
</item>
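A sketch of how a namespace-aware parser resolves the example above, using Python's standard `xml.etree.ElementTree`, which writes resolved names as `{namespace-URI}local-name`. The prefix `s` in the lookup table is a local alias chosen for this example; it does not appear in the document.

```python
import xml.etree.ElementTree as ET

doc = """<item xmlns="http://www.acmeinc.com/jp#supplies"
      xmlns:toy="http://www.acmeinc.com/jp#toys">
  <name>backpack</name>
  <feature>
    <toy:item>
      <toy:name>cyberpet</toy:name>
    </toy:item>
  </feature>
</item>"""

root = ET.fromstring(doc)

# The unprefixed root tag is resolved against the default namespace.
print(root.tag)

# Prefixes used in queries are local aliases for the namespace URIs.
aliases = {
    "s": "http://www.acmeinc.com/jp#supplies",  # "s" is an arbitrary alias
    "toy": "http://www.acmeinc.com/jp#toys",
}
supply_name = root.find("s:name", aliases).text
toy_name = root.find("s:feature/toy:item/toy:name", aliases).text
print(supply_name, toy_name)
```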
Namespace Declarations
- Namespace as prefix
– E.g., toy:item, toy:name
– Tags without prefix belong to the default namespace
- Namespace declarations have scope
– Can be nested like a program block
Example – Scopes of Namespaces
<item xmlns="http://www.acmeinc.com/jp#supplies"
      xmlns:toy="http://www.acmeinc.com/jp#toys">
  <name>backpack</name>
  <feature>
    <toy:item>
      <toy:name>cyberpet</toy:name>
    </toy:item>
  </feature>
  <item xmlns="http://www.acmeinc.com/jp#supplies2"
        xmlns:toy="http://www.acmeinc.com/jp#toys2">
    <name>notebook</name>
    <toy:name>sticker</toy:name>
  </item>
</item>
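The scoping can be checked by listing the resolved name of every element in document order. This is a sketch with Python's standard `xml.etree.ElementTree`; the inner declarations override the outer ones only inside the nested item.

```python
import xml.etree.ElementTree as ET

doc = """<item xmlns="http://www.acmeinc.com/jp#supplies"
      xmlns:toy="http://www.acmeinc.com/jp#toys">
  <name>backpack</name>
  <feature>
    <toy:item>
      <toy:name>cyberpet</toy:name>
    </toy:item>
  </feature>
  <item xmlns="http://www.acmeinc.com/jp#supplies2"
        xmlns:toy="http://www.acmeinc.com/jp#toys2">
    <name>notebook</name>
    <toy:name>sticker</toy:name>
  </item>
</item>"""

root = ET.fromstring(doc)

# (resolved tag, text) for every element, in document order.
resolved = [(element.tag, (element.text or "").strip()) for element in root.iter()]
for tag, text in resolved:
    print(tag, text)
```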
More About Namespace
- The name of a namespace is just a string that happens to be a URL
- It is not necessarily a real address that contains some kind of schema describing the corresponding set of names
- Don’t be misled by the URL!
Summary
- HTML and XML: differences and applications
- Structure of XML
– Elements
– Attributes
– Well formed XML documents
- Namespace
To-Do-List
- Can every relational table be represented in XML? Can every XML document be represented in a relational table?
- RSS is an application of XML. Try to