XML Walking the Tree Modifying the Tree Generating XML Documents - - PowerPoint PPT Presentation

slide-1
SLIDE 1

SSC1: XML Volker Sorge Overview XML Format

Document Structure XML Components

Tree Parsing XML

Parsing Documents Walking the Tree Modifying the Tree

Generating XML Documents

Creating Documents Verifying Documents

Stream Parsing XML as a Data Exchange Format

Software Systems Components 1

XML

Volker Sorge

http://www.cs.bham.ac.uk/~vxs/teaching/ssc1

slide-2
SLIDE 2

Topic Overview

◮ The XML format: document structure and its interpretation.
◮ Tree parsing XML: JDOM and walking trees.
◮ Validating and generating XML.
◮ Stream parsing XML: SAX/StAX.
◮ Using XML to format data.

slide-3
SLIDE 3

What is XML

◮ XML = eXtensible Markup Language.
◮ Markup languages are structured representations of text (or data).
◮ They contain text, plus information about the structure of that text.
◮ XML is a descendant of SGML (Standard Generalized Markup Language), which was developed in the 70s to describe document structure.
◮ Therefore XML files are called documents, regardless of their content.
◮ Related languages include, for example, HTML.

slide-4
SLIDE 4

Example

<?xml version="1.0"?>
<!DOCTYPE Configuration SYSTEM "../../conf.dtd">
<configuration>
  <title>
    <font>
      <name>Helvetica</name>
      <size unit="pt">36</size>
    </font>
  </title>
  <body>
    <font>
      <name>Times Roman</name>
      <size unit="pt">12</size>
    </font>
  </body>
  <window>
    <width unit="px">400</width>
    <height unit="px">200</height>
  </window>
  <menu>
    <item>Times Roman</item>
    <item>Helvetica</item>
    <item>Goudy Old Style</item>
  </menu>
</configuration>
<!-- The end -->

Parts labelled on the slide: Header, Root Element, Tags, End Tags, Elements, Comment.

slide-5
SLIDE 5

Structure of XML Documents

◮ XML documents are structured as trees.
◮ The structure is given by tags that enclose child elements.
◮ The single root of the tree is given by the root element.
◮ Leaves consist of plain text enclosed by tags.

slide-7
SLIDE 7

Structure of Example Documents

Configuration
  Title
    Font
      Name: Helvetica
      Size: 36
  Body
    Font
      Name: Times Roman
      Size: 12
  Window
    Width: 400
    Height: 200
  Menu
    Item: Times Roman
    Item: Helvetica
    Item: Goudy. . .
slide-8
SLIDE 8

XML Main Components

The main components of an XML document are elements.

◮ They are enclosed by an opening and a closing tag. (Tags can be viewed as "named brackets".)
◮ They can contain ordinary text.
◮ Elements can in turn have child elements.
◮ They can have additional attribute assignments.

    <font>
      <name>Helvetica</name>
      <size unit="pt">36</size>
    </font>

is one element font containing two child elements, name and size. The latter has one attribute, unit="pt".
slide-9
SLIDE 9

Mixed Content

◮ It is legal for elements to contain both text and child elements.
◮ This is called mixed content.
◮ However, mixed content should be avoided as it
    ◮ obscures the structure of the document;
    ◮ makes parsing the document harder.
◮ Thus try not to have, for example:

    <font>
      Helvetica
      <size>36</size>
    </font>

slide-10
SLIDE 10

Elements vs. Attributes

◮ When designing an XML document, you often have to decide between using elements or attributes to represent information.
◮ General rule: use elements!
◮ Attributes should be used sparingly.
◮ Try to use attributes only for names used as identifiers.

slide-11
SLIDE 11

Other XML Components

Processing Instructions are delimited by <? and ?>.
Example: the header information <?xml version="1.0"?>

◮ They contain information for whatever program processes the document.
◮ The only one you need to know is the header above.
◮ You may also see xml-stylesheet, or php.
◮ Strictly speaking, the header is optional, but you should always include it.

slide-12
SLIDE 12

Other XML Components

Comments are delimited by <!-- and -->.
Example: <!-- The end -->

◮ Cannot contain '--'.
◮ Don't hide commands in comments!

slide-13
SLIDE 13

Other XML Components

Character References  Denote Unicode characters by decimal or hex code, e.g., &#x40;

Entity References  Denote special characters by name, e.g., &gt;

DTD  Document type definition, which offers a mechanism for validation. Example:

<!DOCTYPE Configuration SYSTEM "../../conf.dtd">

We will see some details on that later.
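
To make character and entity references concrete, here is a minimal sketch using the XML parser that ships with the JDK (not JDOM). The class name ReferenceDemo and the element name t are made up for illustration. The parser replaces each reference by the character it denotes.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo class; only JDK classes are used.
class ReferenceDemo {
    /** Parses the XML text and returns the text content of the root element. */
    static String rootText(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse: " + xml, e);
        }
    }

    public static void main(String[] args) {
        // &#x40; (hex) and &#64; (decimal) both denote '@'; &gt; denotes '>'.
        System.out.println(rootText("<t>&#x40; &#64; &gt;</t>")); // prints "@ @ >"
    }
}
```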

slide-15
SLIDE 15

XML vs HTML

Although XML and HTML are closely related, there are some significant differences:

◮ XML is case sensitive, HTML is not.
◮ HTML can have attributes without values. In XML each attribute has to have a value.
◮ HTML is forgiving, XML is not!
◮ In HTML one can omit end tags (e.g. </p>). In XML this is an error.
◮ Singular tags need a trailing slash: <img src="pic.jpg"/>
◮ Attribute values must be enclosed in quotation marks in XML.

slide-16
SLIDE 16

XML vs HTML

In general one could say:

◮ HTML render engines (e.g., a browser) will attempt to interpret what they can, ignoring malformed syntax.
◮ XML parsers validate the structure and produce errors.
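
The strictness of XML parsers can be tried out with a small sketch. It assumes the parser bundled with the JDK (javax.xml.parsers) rather than JDOM, and the class name WellFormedCheck is made up for illustration. Each of the HTML habits listed on the previous slide is rejected with an error. The JDK's default error handler may additionally print a diagnostic to standard error.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical demo class; only JDK classes are used.
class WellFormedCheck {
    /** Returns true if the JDK's XML parser accepts the text as well-formed XML. */
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // malformed input is reported as an error, not repaired
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isWellFormed("<p>closed</p>"));         // true
        System.out.println(isWellFormed("<p>missing end tag"));    // false
        System.out.println(isWellFormed("<P>case mismatch</p>"));  // false
        System.out.println(isWellFormed("<img src=pic.jpg/>"));    // false: unquoted value
    }
}
```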

slide-17
SLIDE 17

Parsing XML Documents

◮ There are several ways to parse XML documents.
    ◮ Tree parsing: Load the entire document into a tree structure.
    ◮ Stream parsing: Read a document sequentially.
◮ We will first discuss tree parsing and its parsers:

DOM  The original tree API for XML (a W3C standard; the Java implementation was originally provided by Sun). We only very briefly mention it for historical reasons.

JDOM  Current state of the art parser. A third party Java library provided by jdom.org.

slide-18
SLIDE 18

DOM

DOM is the original parser of XML documents, which has a substantial number of drawbacks. Here are just a few:

◮ Documents cannot be created directly but only via a DocumentBuilderFactory.
◮ Simple text is considered to be a node separate from its containing element.
◮ Elements contain child nodes that are not part of the logical document, in particular whitespace. E.g.

    <font>
      <name>Helvetica</name>
      <size>36</size>
    </font>

    Element Font
      Text Whitespace
      Element Name
        Text Helvetica
      Text Whitespace
      Element Size
        Text 36
      Text Whitespace
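
The whitespace nodes can be observed directly. The following is a minimal sketch using the DOM implementation in the JDK. The class name DomWhitespaceDemo and the labels it prints are made up for illustration. It also shows the index-based loop that DOM's NodeList requires.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo class; only JDK classes are used.
class DomWhitespaceDemo {
    /** Lists the kinds of the direct child nodes of the root element. */
    static List<String> childKinds(String xml) {
        try {
            Node root = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
            NodeList children = root.getChildNodes();
            List<String> kinds = new ArrayList<>();
            // NodeList is not a java.util collection, so we loop by index.
            for (int i = 0; i < children.getLength(); i++) {
                Node child = children.item(i);
                if (child.getNodeType() == Node.ELEMENT_NODE) {
                    kinds.add("Element " + child.getNodeName());
                } else if (child.getNodeType() == Node.TEXT_NODE) {
                    boolean blank = child.getTextContent().trim().isEmpty();
                    kinds.add(blank ? "Text (whitespace)" : "Text");
                } else {
                    kinds.add("Other");
                }
            }
            return kinds;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<font>\n  <name>Helvetica</name>\n  <size>36</size>\n</font>";
        // Prints three whitespace text nodes interleaved with the two elements.
        childKinds(xml).forEach(System.out::println);
    }
}
```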

slide-20
SLIDE 20

DOM

DOM is the original parser of XML documents, which has a substantial number of drawbacks. Here are just a few:

◮ Child nodes of an element are given in a special interface (NodeList), which implements a collection that does not lend itself to iteration.
◮ There is no distinction between different types of document elements. In particular, there is no reflection of the XML components, e.g., elements, comments, etc., in DOM's class structure.

DOM can still be useful when one needs to deal with the "physical" structure of an XML document, which includes whitespace, etc. For example, when formatting and cleaning up XML documents.

slide-21
SLIDE 21

JDOM

◮ Convenient interface to load and create XML documents.
◮ Standard facilities to navigate in the document tree.
◮ Straightforward library functions: get, set, add, remove.
◮ Node classes reflect the different components of XML documents.

slide-22
SLIDE 22

Loading an XML document

◮ Create a SAXBuilder to parse the XML document:

    SAXBuilder builder = new SAXBuilder(false);

  The constructor argument should be true only if you want to validate the XML file against a DTD. (We will discuss SAX and DTDs later.)
◮ Read the XML document into a corresponding variable:

    Document readDoc = builder.build(xmlFile);  // xmlFile is a java.io.File

◮ You should wrap this in a try block to handle IOException (and JDOMException for malformed input).
◮ You can get XML documents from other sources: URL, strings, etc.

slide-23
SLIDE 23

Accessing Content

◮ The content of an XML document is the actual document tree together with comments, processing instructions, etc.
◮ The content of the entire document can be accessed using the getContent() method. This returns a list of Content objects.
◮ The Content class is a superclass for all types of content, in particular, elements, comments, processing instructions, etc.

slide-24
SLIDE 24

JDOM Tree Structure

<?xml version="1.0"?>
<configuration>
  <title>
    <font>
      <name>Helvetica</name>
      <size>36</size>
    </font>
  </title>
  <body>
    <font>
      <name>Times Roman</name>
      <size>12</size>
    </font>
  </body>
</configuration>

Corresponding JDOM tree (figure):

    Doc
      Configuration
        Title
          Font
            Name: Helvetica
            Size: 36
        Body
          Font
            Name: Times Roman
            Size: 12

Labels used in the figure: Document, RootElement, Element, Content.

slide-25
SLIDE 25

Traversing the Tree

As the tree does not have a fixed arity, each node contains a list of children.

◮ First we get the root element:

    Element root = doc.getRootElement();

◮ then a list of all its children:

    List allChildren = root.getChildren();

◮ then an iterator for the list:

    Iterator it = allChildren.iterator();

◮ Now we can perform operations on the list:

    while (it.hasNext()) {
        Element elt = (Element) it.next();
        // Process element elt
    }

slide-26
SLIDE 26

Accessing Elements

In addition to accessing the children of an element, there are methods to access its content:

getName()  The name of the element, i.e. the tag name.

getAttributes()  Returns the complete set of attributes for this element, as a List of Attribute objects in no particular order, or an empty list if there are none.

getText()  The text that might be contained in the element.

There are analogous set methods: setName(), setAttributes(), setAttribute(), setText().

slide-27
SLIDE 27

Example

Here’s an example of a depth first traversal:

    public static void DepthFirst(Element node) {
        // Print this node, then recurse into each child in document order.
        System.out.printf("Name: %10s\t\tAttribute: %10s\nText: %s\n-------\n",
                          node.getName(), node.getAttributes(), node.getText());
        for (Iterator iter = node.getChildren().iterator(); iter.hasNext();) {
            DepthFirst((Element) iter.next());
        }
    }

Observe that we always have to cast the list elements to the right class.

slide-28
SLIDE 28

Accessing Attributes

There are two ways of accessing attributes.

  1. Directly from the element.
  2. By going over the list of attributes retrieved with getAttributes().

The following methods access attributes from elements:

getAttribute(String x)  Returns the single attribute with name x, or null if no such attribute exists.

getAttributeValue(String x)  Returns the value of the attribute with name x, or null if no such attribute exists.

slide-29
SLIDE 29

Accessing Attributes

Here are some methods dealing with objects of the Attribute class:

getName()  Returns the attribute name as a string.

setName(x)  Sets the attribute name to string x.

getValue()  Accesses the attribute value as a string.

setValue(x)  Sets the attribute value to the string x.

get<Type>Value()  Accesses the attribute value and tries to convert it to a particular data type. Examples are getIntValue(), getFloatValue(), etc. If the value cannot be converted, the method throws an exception (DataConversionException).

slide-30
SLIDE 30

Inserting Content

Insertion of content works on two levels, using the versatile addContent() method:

Document Level  addContent(...) adds one Content object (e.g. an element) or a collection of Content objects to the document. Given an index value, it adds the content at a particular position. Observe that you cannot add a second root element!

Element Level  Similarly, addContent() adds new Content objects as child nodes to the element. Again it is possible to give an index to specify the position where they should be added. When given a string only, it will add that string to the text of the element.

slide-31
SLIDE 31

Adding Attributes

We can also insert additional content by setting or resetting attributes of an element:

setAttribute(x)  Adds a new Attribute x. If it already exists, replaces it.

setAttribute(name, value)  Adds a new attribute with the given name and value. Both name and value are strings. If an attribute of that name already exists, its value is replaced by the new one.

setAttributes(list)  Sets the attributes of the element to the new list of attributes, thereby deleting all old ones. The supplied Collection should contain only objects of type Attribute.
slide-32
SLIDE 32

Deleting Content

removeContent()  Removes all child content from this element.

removeContent(x)  If x is of class Content, this child is removed. If x is an integer, it removes the child at this position, if it exists.

removeChild(x)  x is a string. The method removes the first child with name x.

removeChildren()  Deletes all children of that element.

removeAttribute(x)  Removes the attribute x, given either by its name or as an Attribute object.

slide-33
SLIDE 33

Constructing an XML Document

◮ We can create a document by simply declaring objects of the appropriate classes in JDOM:

    Document doc = new Document();
    Element e = new Element("DocumentRoot");
    e.setText("Main document starts here");
    Element f = new Element("Child");
    f.setText("This is a child element");
    doc.setRootElement(e);
    e.addContent(f);

◮ This produces the document:

    <?xml version="1.0" encoding="UTF-8"?>
    <DocumentRoot>
      Main document starts here
      <Child>
        This is a child element
      </Child>
    </DocumentRoot>

slide-34
SLIDE 34

Adding Contents

◮ Our example used setText to add text to an element, and addContent to add a child element. We could equally have used addContent to do both:

    e.addContent(new Text("Main document starts here"));
    e.addContent(f);

◮ The difference is that setText (and setContent) deletes the previous content of the element.
◮ Note that elements can be changed even after they are attached to the document.
◮ You can even change the names of elements:

    e.setName("SomethingElse");

    <?xml version="1.0" encoding="UTF-8"?>
    <SomethingElse>
      Main document starts here
      <Child>This is a child element</Child>
    </SomethingElse>

slide-35
SLIDE 35

Adding Attributes

◮ We can also create and set attributes.
◮ Note that setAttribute either changes the value of an existing attribute or creates a new one.

    e.setAttribute(new Attribute("x","0"));

    <?xml version="1.0" encoding="UTF-8"?>
    <DocumentRoot x="0">
      Main document starts here
      <Child>
        This is a child element
      </Child>
    </DocumentRoot>

slide-36
SLIDE 36

Saving Document to a File

◮ It is straightforward to write out a document from Java as well:

    new XMLOutputter().output(doc, System.out);

◮ This produces a raw text version of the document on the console. To write to a file instead, pass an OutputStream or Writer for that file in place of System.out.
◮ We can also pretty print it for easier readability:

    XMLOutputter outp = new XMLOutputter();
    outp.setFormat(Format.getPrettyFormat());
    outp.output(doc, System.out);

    <?xml version="1.0" encoding="UTF-8"?>
    <DocumentRoot>
      <Child>
        This is a child element
      </Child>
    </DocumentRoot>

slide-37
SLIDE 37

Validating XML Documents

◮ In practice it is often useful to ensure that an XML file has a desired format.
◮ This can be achieved by validating a document using a Document Type Definition (DTD).
◮ DTDs can be given either in the XML document itself or in a separate file.

    <!DOCTYPE web-app "http://....">

◮ With JDOM we can enforce validation of an XML file using:

    SAXBuilder builder = new SAXBuilder(true);

◮ As a useful side effect the parser can exploit the DTD information for more effective parsing.

slide-38
SLIDE 38

DTD Format

◮ The DTD format specifies rules for the different types of elements that occur in an XML document.
◮ Elements are given in terms of their allowed content, i.e. type of children or text.
◮ For example the rule

    <!ELEMENT font (name,size)>

  specifies that a font element must always have two children, which are name and size, in that order.

    <font>
      <name>Helvetica</name>
      <size>36</size>
    </font>

slide-39
SLIDE 39

Rules for Element Content

The rules for element content are essentially a regular expression language. In the following table E denotes the name of an element (e.g., font):

    Rule                  Meaning
    E*                    0 or more occurrences of E
    E+                    1 or more occurrences of E
    E?                    0 or 1 occurrences of E
    E1 | E2 | ... | En    One of E1, E2, ..., En
    E1, E2, ..., En       E1 followed by E2, ..., En
    #PCDATA               Text
    ANY                   Any children allowed
    EMPTY                 No children allowed

Any combination of the above is allowed, using brackets to determine precedence.
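
To see such rules enforced, here is a minimal sketch that validates a document against a DTD given inside the document itself. It uses the validating parser in the JDK rather than JDOM's SAXBuilder(true). The class name DtdValidationDemo and the small DTD are made up for illustration. Validation errors are not fatal by default, so the sketch installs an ErrorHandler that turns them into exceptions.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

// Hypothetical demo class; only JDK classes are used.
class DtdValidationDemo {
    // Internal DTD: a font must contain exactly a name followed by a size.
    static final String DTD =
            "<!DOCTYPE font ["
            + "<!ELEMENT font (name,size)>"
            + "<!ELEMENT name (#PCDATA)>"
            + "<!ELEMENT size (#PCDATA)>"
            + "]>";

    /** Returns true if the document is well-formed and valid against its DTD. */
    static boolean isValid(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(true); // check the document against its DTD
            DocumentBuilder builder = factory.newDocumentBuilder();
            builder.setErrorHandler(new ErrorHandler() {
                public void warning(SAXParseException e) { /* ignore warnings */ }
                public void error(SAXParseException e) throws SAXException { throw e; }
                public void fatalError(SAXParseException e) throws SAXException { throw e; }
            });
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // a rule of the DTD (or well-formedness) was violated
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isValid(DTD + "<font><name>Helvetica</name><size>36</size></font>")); // true
        System.out.println(isValid(DTD + "<font><size>36</size></font>"));                       // false: name missing
        System.out.println(isValid(DTD + "<font><size>36</size><name>Helvetica</name></font>")); // false: wrong order
    }
}
```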

slide-40
SLIDE 40

SAX vs. DOM

◮ SAX (Simple API for XML) parsers enable the stream parsing of XML documents.
◮ They generate events as they read XML files, which need to be handled explicitly.
◮ In fact, DOM (Document Object Model) parsers work as event handlers on SAX parsers, building a tree representation of the input.
◮ The primary stream parsers in Java are SAX and StAX.
◮ StAX is more modern and offers a convenient iterator facility on the input stream.
◮ We will have a brief look at document handling with StAX.

slide-41
SLIDE 41

Creating a StAX Parser

Creating a StAX parser consists of three steps:

  1. Create an input stream from which to read an XML file.
  2. Create a new instance of a StAX parser factory.
  3. Bind a new parser to the opened stream.

Suppose we want to read XML from a URL given in the variable url. We can use the following code:

    InputStream in = url.openStream();
    XMLInputFactory factory = XMLInputFactory.newInstance();
    XMLStreamReader parser = factory.createXMLStreamReader(in);

slide-42
SLIDE 42

Events

There are essentially three different events we have to consider:

START_ELEMENT  An opening tag. Returns the element name.

END_ELEMENT  A closing tag. Returns the element name.

CHARACTERS  Enclosed text. Returns the content.

    <font>
      <name>Helvetica</name>
      <size units="pt">36</size>
    </font>

  1. START_ELEMENT, element name: font
  2. CHARACTERS, content: Whitespace
  3. START_ELEMENT, element name: name
  4. CHARACTERS, content: Helvetica
  5. END_ELEMENT, element name: name

. . .

slide-43
SLIDE 43

Handling Events

Events have to be handled explicitly. Here is a loop that does this, assuming that we have created a parser as described above:

    while (parser.hasNext()) {
        int event = parser.next();
        if (event == XMLStreamConstants.START_ELEMENT) {
            // Process the event details
            ....
        }
    }

Observe that the same next() and hasNext() methods are available as when reading input with Scanner.

slide-44
SLIDE 44

Processing Events

◮ There is some functionality to process events and the details they contain.
◮ Methods to query the event type: isStartElement(), isEndElement(), isCharacters(), isWhiteSpace().
◮ Methods to access content of elements:

    getName()             Name of a START_ELEMENT or END_ELEMENT event.
    getText()             Text of a CHARACTERS event.
    getAttributeCount()   Number of attributes in the current element.
    getAttributeName(i)   Attribute name at index position i.
    getAttributeValue(i)  Attribute value at index position i.
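
Putting the parser creation, the event loop and the accessor methods together, a minimal sketch could look as follows. It reads from a string instead of a URL so that it is self-contained. The class name StaxEventsDemo and the log format are made up for illustration. It uses getLocalName() and getAttributeLocalName(i), which return plain strings, where getName() and getAttributeName(i) return QName objects.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

// Hypothetical demo class; only JDK classes are used.
class StaxEventsDemo {
    /** Returns a readable log of the element and text events in the document. */
    static List<String> events(String xml) {
        List<String> log = new ArrayList<>();
        try {
            XMLInputFactory factory = XMLInputFactory.newInstance();
            // Deliver adjacent character data as one CHARACTERS event.
            factory.setProperty(XMLInputFactory.IS_COALESCING, Boolean.TRUE);
            XMLStreamReader parser = factory.createXMLStreamReader(new StringReader(xml));
            while (parser.hasNext()) {
                int event = parser.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    StringBuilder line = new StringBuilder("START " + parser.getLocalName());
                    for (int i = 0; i < parser.getAttributeCount(); i++) {
                        line.append(' ').append(parser.getAttributeLocalName(i))
                            .append('=').append(parser.getAttributeValue(i));
                    }
                    log.add(line.toString());
                } else if (event == XMLStreamConstants.END_ELEMENT) {
                    log.add("END " + parser.getLocalName());
                } else if (event == XMLStreamConstants.CHARACTERS && !parser.isWhiteSpace()) {
                    log.add("TEXT " + parser.getText()); // whitespace-only text is skipped
                }
            }
            parser.close();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return log;
    }

    public static void main(String[] args) {
        String xml = "<font>\n  <name>Helvetica</name>\n  <size unit=\"pt\">36</size>\n</font>";
        events(xml).forEach(System.out::println);
    }
}
```

Unlike the tree parsers, nothing is kept in memory here beyond the log itself, so the same loop works for documents that are too large to load as a tree.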

slide-45
SLIDE 45

Data Exchange with XML Documents

◮ XML can serve as a standard basis to describe data.
◮ Any data structure that can be constructed in memory can, in theory, be serialized into XML output.
◮ In practice this is not always straightforward.
◮ Observe that objects and data are separate entities from their representation in XML trees. You have to copy data if necessary!
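
As a small illustration of copying data from memory into XML, here is a sketch that serializes two integers as the window element from the configuration example. It assumes the StAX writer in the JDK (XMLStreamWriter) rather than JDOM, and the class name WindowSizeWriter is made up for illustration.

```java
import java.io.StringWriter;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

// Hypothetical demo class; only JDK classes are used.
class WindowSizeWriter {
    /** Serializes a width/height pair (in pixels) as a <window> element. */
    static String toXml(int width, int height) {
        try {
            StringWriter out = new StringWriter();
            XMLStreamWriter writer = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
            writer.writeStartElement("window");
            writeDimension(writer, "width", width);
            writeDimension(writer, "height", height);
            writer.writeEndElement();
            writer.flush();
            writer.close();
            return out.toString();
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Writes one child element with a unit attribute and the value as text. */
    private static void writeDimension(XMLStreamWriter writer, String name, int value)
            throws XMLStreamException {
        writer.writeStartElement(name);
        writer.writeAttribute("unit", "px");
        writer.writeCharacters(Integer.toString(value)); // the int is copied into text
        writer.writeEndElement();
    }

    public static void main(String[] args) {
        System.out.println(toXml(400, 200));
    }
}
```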

slide-46
SLIDE 46

Translating between Objects and XML

In detail we have to:

◮ Design a mapping between objects and XML document trees.
◮ Create object relationships during input and output.
◮ Referencing or dereferencing often makes more than one pass necessary:
    ◮ Mutual references: If obj1 refers to obj2 and vice versa, the references can only be resolved once both objects are constructed.
    ◮ Loops: If we have an object relation that contains loops (e.g., in a graph), then references can only be constructed once all objects have been created.

slide-47
SLIDE 47

Example Problem

As an example problem we design a class used for recording information about a group of friends. The class has 4 fields with some accessors and mutators:

  1. name: the name of the friend; no two friends in the document have the same name.
  2. note: some information about the friend.
  3. likes: an ArrayList of the Friends that this Friend particularly likes (this list may be empty).
  4. bestFriend: the Friend who is the best friend of this Friend. If this Friend does not have a best friend, then this field will be null.

slide-48
SLIDE 48

XML data structure

The following XML document style should be used as data input/output format:

<?xml version="1.0" encoding="UTF-8"?>
<Friends>
  <Friend name="Alice" likes="Faye Chris" bestfriend="Danny"> Has 2 cats and a dog </Friend>
  <Friend name="Bill" likes="Chris" bestfriend="Alice"> Into snowboarding </Friend>
  <Friend name="Chris" likes="Bill Harry"> Great cook </Friend>
  <Friend name="Danny"> Reads a lot </Friend>
  <Friend name="Eunice" likes="Bill" bestfriend="Ginny"> Learning to paint </Friend>
  <Friend name="Faye" likes="Bill Chris" bestfriend="Chris"></Friend>
  <Friend name="Ginny"> Always gets a laugh </Friend>
  <Friend name="Harry" bestfriend="Faye"> Plays guitar </Friend>
</Friends>
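
One possible way to read this format is a two-pass approach: first create every Friend object, then resolve the likes and bestfriend names into object references. The following is only a sketch under stated assumptions. It uses the DOM parser in the JDK rather than JDOM; the class FriendsLoader, its lookup helper and the exact field layout of Friend are made up for illustration and use plain fields instead of accessors and mutators.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical loader; only JDK classes are used.
class FriendsLoader {
    /** Reads a <Friends> document and returns the friends indexed by name. */
    static Map<String, Friend> load(String xml) {
        try {
            NodeList nodes = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement().getElementsByTagName("Friend");
            // Pass 1: create all objects, so that every name can be resolved later.
            Map<String, Friend> byName = new LinkedHashMap<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                Element e = (Element) nodes.item(i);
                String name = e.getAttribute("name");
                byName.put(name, new Friend(name, e.getTextContent().trim()));
            }
            // Pass 2: turn the names in likes/bestfriend into object references.
            for (int i = 0; i < nodes.getLength(); i++) {
                Element e = (Element) nodes.item(i);
                Friend friend = byName.get(e.getAttribute("name"));
                String likes = e.getAttribute("likes").trim(); // "" if absent
                if (!likes.isEmpty()) {
                    for (String liked : likes.split("\\s+")) {
                        friend.likes.add(lookup(byName, liked));
                    }
                }
                String best = e.getAttribute("bestfriend");
                if (!best.isEmpty()) {
                    friend.bestFriend = lookup(byName, best);
                }
            }
            return byName;
        } catch (IllegalArgumentException e) {
            throw e;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    private static Friend lookup(Map<String, Friend> byName, String name) {
        Friend found = byName.get(name);
        if (found == null) {
            throw new IllegalArgumentException("unknown friend: " + name);
        }
        return found;
    }

    public static void main(String[] args) {
        String xml = "<Friends>"
                + "<Friend name=\"Alice\" likes=\"Bill\" bestfriend=\"Bill\"> Has 2 cats and a dog </Friend>"
                + "<Friend name=\"Bill\" bestfriend=\"Alice\"> Into snowboarding </Friend>"
                + "</Friends>";
        Friend alice = load(xml).get("Alice");
        System.out.println(alice.name + " -> " + alice.bestFriend.name); // prints "Alice -> Bill"
    }
}

class Friend {
    final String name;
    final String note;
    final List<Friend> likes = new ArrayList<>();
    Friend bestFriend; // null if this friend has no best friend

    Friend(String name, String note) {
        this.name = name;
        this.note = note;
    }
}
```

The mutual reference between Alice and Bill is why a single pass is not enough: when Alice is read, the Bill object does not exist yet.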