12. Application program interfaces (APIs)

SLIDE 1

XML-12 J. Teuhola 2013 209

12. Application program interfaces (APIs)

  • XML documents are text files, so in principle no special APIs are required.
  • However, tasks such as parsing and validation are needed in almost any application.
  • Predefined class libraries and standardized interfaces reduce the programmer's work and errors.
  • Main alternatives:
      – Document Object Model (DOM)
      – Simple API for XML (SAX)
      – Streaming API for XML (StAX)
  • Example implementation by Sun: JAXP (containing DOM, SAX, and XSLT)

SLIDE 2

12.1. Document Object Model (DOM)

  • W3C recommendation. A tree-based interface: the parser reads and parses the whole document and places the tree in memory for processing.
  • Not tied to any programming language; Java suits it well (platform-independent, like XML).
  • DOM Levels 1, 2, 3: successively wider support for various features of XML.
  • Interfaces are divided into modules, enabling varying degrees of support for the API.
  • Here: Level 2 Core (2000; Level 3: 2004)
SLIDE 3

About DOM specifications

  • Extensions have been defined for applications such as MathML, SVG, SMIL.
  • Alternatives for processing:
      – Using only generic interfaces, like manipulating the Nodes.
      – Using application-specific interfaces, e.g. HTML: paragraphs, images, etc.
  • Specification language: Interface Definition Language (IDL by OMG), independent of programming language and operating system.
  • Here: Java mapping (rather straightforward).
  • JDOM: a simplified, Java-specific alternative to DOM
SLIDE 4

Tentative DOM example (Xerces & Java)

    import java.io.*;
    import org.w3c.dom.*;
    import org.apache.xerces.parsers.DOMParser;
    import org.xml.sax.*;
    …
    // Print the root tag name of document "example.xml"
    DOMParser parser = new DOMParser();
    try {
        parser.parse("example.xml");      // builds the whole tree in memory
    } catch (SAXException saxe) { … }
    catch (IOException ioe) { … }
    Document d = parser.getDocument();    // the Document node of the parsed tree
    Element root = d.getDocumentElement();
    System.out.println("Root: " + root.getTagName());

SLIDE 5

Important interfaces in DOM

  • Node is the root of all component interfaces.
      – The whole document can be processed by the methods and properties defined for Node.
      – The in-memory document structure consists of nodes connected by parent, child and sibling links.
  • NodeList and NamedNodeMap for processing of node sets
  • DocumentTraversal, NodeIterator, TreeWalker for tree traversal and iteration
  • DOMImplementation for various purposes
  • … and many others
SLIDE 6

Node interface hierarchy

  • Node (root of the hierarchy), extended by:
      – DocumentFragment
      – Document
      – CharacterData, extended by Text (in turn extended by CDATASection) and Comment
      – Attr
      – Element
      – DocumentType
      – Notation
      – Entity
      – EntityReference
      – ProcessingInstruction

SLIDE 7

Node methods

  • 1. Node characteristics:
      getNodeType(), getNodeName(), getNodeValue(), setNodeValue(value),
      hasChildNodes(), getAttributes(), getOwnerDocument()
  • 2. Accessing relatives:
      getFirstChild(), getLastChild(), getChildNodes(),
      getNextSibling(), getPreviousSibling(), getParentNode()
  • 3. Node manipulation:
      removeChild(oldChild), insertBefore(newChild, refChild), appendChild(newChild),
      replaceChild(newChild, oldChild), cloneNode(deep), normalize()
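The "accessing relatives" methods are enough to walk a whole tree. The following is a minimal sketch, not part of the original slides: it assumes the JDK's built-in JAXP parser (instead of Xerces' DOMParser) and a made-up class name NodeWalk, and it counts element nodes using only getNodeType(), getFirstChild() and getNextSibling().

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class NodeWalk {
    // Parse an XML string into a DOM tree (assumption: JDK's default JAXP parser).
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    // Count the element nodes in the subtree rooted at n, using only Node methods.
    static int countElements(Node n) {
        int count = (n.getNodeType() == Node.ELEMENT_NODE) ? 1 : 0;
        for (Node c = n.getFirstChild(); c != null; c = c.getNextSibling()) {
            count += countElements(c);   // recurse into each child in document order
        }
        return count;
    }

    public static void main(String[] args) {
        Document d = parse("<course><student>Ann</student><student>Bob</student></course>");
        System.out.println("Elements: " + countElements(d));
    }
}
```

Starting the walk at the Document node works because Document itself extends Node; text nodes are visited too but do not add to the count.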

SLIDE 8

Access directions in the document tree

[Figure: a node and its links to its parent, first child, last child, previous sibling and next sibling; each sibling also links to the common parent.]

SLIDE 9

Document interface

  • Represents the whole document
      – Technically implemented as the root node of the document
      – Extends the Node interface.
      – Note: the root of the DOM tree is the parent of the actual document root element.
  • Accessing the document information:
      – getDoctype()
      – getImplementation()
      – getDocumentElement()
      – getElementsByTagName(tagName)
  • DOM Level 2:
      – getElementsByTagNameNS(URI, localName)
      – getElementById(elementID)
      – importNode(importedNode, deep)
      … and many others …

SLIDE 10

Document interface (cont.)

  • Factory methods for creating objects for a document:
      – createElement(tagName)
      – createTextNode(data)
      – createComment(data)
      – createCDATASection(data)
      – createProcessingInstruction(target, data)
      – createAttribute(name)
      – createEntityReference(name)
  • DOM Level 2:
      – createElementNS(URI, qualifiedName)
      – createAttributeNS(URI, qualifiedName)
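A short sketch of how the factory methods combine with appendChild() to build a document from scratch. The class name BuildDoc and the element names are invented for illustration, and the empty Document is obtained through JAXP, which is an assumption about the environment rather than part of the DOM specification.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class BuildDoc {
    // Build <course><!--first student--><student>Ann</student></course> in memory.
    static Document build() {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("No DOM implementation available", e);
        }
        Element root = doc.createElement("course");
        doc.appendChild(root);                          // root element goes under the Document node
        root.appendChild(doc.createComment("first student"));
        Element student = doc.createElement("student");
        student.appendChild(doc.createTextNode("Ann")); // text is a child node of the element
        root.appendChild(student);
        return doc;
    }

    public static void main(String[] args) {
        Element root = build().getDocumentElement();
        System.out.println(root.getTagName() + ": "
                + root.getChildNodes().getLength() + " child nodes");
    }
}
```

Nodes created by a factory method belong to that document but are not part of the tree until they are attached with appendChild() or insertBefore().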

SLIDE 11

DocumentType interface

  • General data about the document (DTD):
      – getName()       DOCTYPE name = root name
      – getEntities()   Internal and external entities as a list
      – getNotations()  Notations as a list
  • DOM Level 2:
      – getInternalSubset()
      – getPublicId()
      – getSystemId()

SLIDE 12

Element interface

  • Extends the Node interface with element-specific features:
      – getTagName()
      – getElementsByTagName(name)
      – normalize()   merges adjacent text nodes
  • Attribute-related methods:
      – getAttribute(name)
      – setAttribute(name, value)
      – removeAttribute(name)
      – getAttributeNode(name)
      – setAttributeNode(newAttr)
      – removeAttributeNode(oldAttr)

SLIDE 13

Element interface (cont.)

  • DOM Level 2 element-specific extension:
      – getElementsByTagNameNS(URI, localName)
  • Attribute-specific extensions:
      – hasAttribute(name)
      – hasAttributeNS(URI, localName)
      – getAttributeNS(URI, localName)
      – setAttributeNS(URI, qualName, value)
      – getAttributeNodeNS(URI, localName)
      – setAttributeNodeNS(newAttr)
      – removeAttributeNS(URI, localName)
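The attribute methods of Element can be tried out on a freshly created element. This is only a sketch; the class name AttrDemo and the attribute names are assumptions made for the example.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class AttrDemo {
    // Create a <student id="s1" year="2"/> element that is not yet attached to a tree.
    static Element student() {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            Element e = doc.createElement("student");
            e.setAttribute("id", "s1");
            e.setAttribute("year", "2");
            return e;
        } catch (ParserConfigurationException ex) {
            throw new IllegalStateException("No DOM implementation available", ex);
        }
    }

    public static void main(String[] args) {
        Element e = student();
        System.out.println("id = " + e.getAttribute("id"));
        e.removeAttribute("year");
        System.out.println("has year? " + e.hasAttribute("year"));
        // getAttribute() returns an empty string, not null, for a missing attribute
        System.out.println("year empty? " + e.getAttribute("year").isEmpty());
    }
}
```

Because getAttribute() cannot distinguish a missing attribute from an empty one, hasAttribute() is the reliable existence test.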

SLIDE 14

Attr interface

  • Information about attributes:
      – getName()
      – getValue()
      – setValue(value)
      – getSpecified()     false if the value is a default taken from the DTD
      – getOwnerElement()  DOM Level 2
  • Note that most attribute-accessing operations are part of the Element interface.

SLIDE 15

CharacterData interface

  • Adds text processing methods to the Node interface:
      – getData()
      – setData(data)
      – getLength()
      – appendData(arg)
      – substringData(offset, count)
      – insertData(offset, arg)
      – deleteData(offset, count)
      – replaceData(offset, count, arg)
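The offset-based editing methods are easiest to follow on a concrete string. A minimal sketch with an invented class name CharDataDemo; offsets are zero-based and count characters (16-bit units in the Java binding).

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Text;

public class CharDataDemo {
    // Apply a sequence of CharacterData edits to a text node and return the result.
    static String edit() {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("No DOM implementation available", e);
        }
        Text t = doc.createTextNode("Hello world");
        t.appendData("!");            // "Hello world!"
        t.insertData(5, ",");         // "Hello, world!"
        t.replaceData(7, 5, "DOM");   // "Hello, DOM!"  (5 chars from offset 7 replaced)
        t.deleteData(5, 1);           // "Hello DOM!"   (the comma is removed)
        return t.getData();
    }

    public static void main(String[] args) {
        System.out.println(edit());
    }
}
```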

SLIDE 16

Extensions (subtypes) of CharacterData

  • Text interface
      – One additional method: splitText(offset)
      – Creation by a factory method in Document: createTextNode(data)
  • CDATASection interface
      – No additional methods; just identifies CDATA nodes (reminder: <![CDATA[ ... ]]>)
      – Creation by a factory method in Document
  • Comment interface
      – No additional methods; identifies comments.
      – Creation by a factory method in Document

SLIDE 17

ProcessingInstruction interface

  • Name of node = name of target application
  • Methods:
      – getTarget()
      – getData()
      – setData(data)
  • Creation (by a factory method in Document):
      – createProcessingInstruction(target, data)

SLIDE 18

Entities and notations

  • Replacing entities by their values is parser-dependent. External binary (unparsed) data cannot be substituted; entity references must be created instead.
  • Entity interface:
      – getPublicId()
      – getSystemId()
      – getNotationName()
  • Notation interface:
      – getPublicId()
      – getSystemId()

SLIDE 19

Node lists and named node maps

  • Some DOM operations return a list of nodes; NodeList interface:
      – item(index), getLength()
  • Attribute and entity declarations have no specific order; access is based on their names; NamedNodeMap interface:
      – item(index), getLength(), getNamedItem(nodeName), setNamedItem(node), removeNamedItem(nodeName)
  • DOM Level 2:
      – getNamedItemNS(URI, localName), setNamedItemNS(node), removeNamedItemNS(URI, localName)
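A sketch of both collection interfaces in use: a NodeList is indexed by position, while a NamedNodeMap of attributes is looked up by name. The class name NodeListDemo and the sample element and attribute names are assumptions; parsing again goes through the JDK's JAXP parser.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class NodeListDemo {
    // Collect the "id" attribute of every <student> element, in document order.
    static List<String> studentIds(String xml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
        List<String> ids = new ArrayList<>();
        NodeList students = doc.getElementsByTagName("student");   // ordered list
        for (int i = 0; i < students.getLength(); i++) {
            NamedNodeMap atts = students.item(i).getAttributes();  // unordered, name-based
            ids.add(atts.getNamedItem("id").getNodeValue());
        }
        return ids;
    }

    public static void main(String[] args) {
        String xml = "<course><student id='s1' year='2'/><student id='s2' year='1'/></course>";
        System.out.println(studentIds(xml));
    }
}
```

The sketch assumes every student has an id attribute; getNamedItem() returns null otherwise.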

SLIDE 20

Testing the DOM implementation

  • DOMImplementation interface: hasFeature(feature, version), where
      – feature = module name:
          Core, XML, HTML (DOM Level 1)
          Views, Events, Style, Traversal, Range (Level 2)
          More modules appear in Level 3.
      – version = "1.0", "2.0", ...
  • Other methods:
      – createDocument(URI, qualifiedName, docType)
      – createDocumentType(qualifiedName, publicId, systemId)

SLIDE 21

Tree traversal interfaces

  • DOM Level 2: Optional package for sophisticated traversal of document trees.
  • DocumentTraversal interface: An iterator can be created to choose node types and filter the nodes further.
  • NodeIterator interface: Iteration steps: to the next/previous node
  • TreeWalker interface: Like NodeIterator, but more versatile: first/last child, next/previous sibling, parent
  • NodeFilter interface: accept/reject/skip nodes.
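The traversal interfaces can be sketched as follows. Since the package is optional, the cast to DocumentTraversal is an assumption that happens to hold for the Xerces-based DOM bundled with the JDK; a portable program would first check hasFeature("Traversal", "2.0"). The class name WalkDemo and the element names are invented.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.traversal.DocumentTraversal;
import org.w3c.dom.traversal.NodeFilter;
import org.w3c.dom.traversal.TreeWalker;
import org.xml.sax.InputSource;

public class WalkDemo {
    // List the names of all elements below the root, pruning every <note> subtree.
    static String visit(String xml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
        // FILTER_REJECT drops the node and its descendants; FILTER_SKIP would drop only the node.
        NodeFilter noNotes = n -> n.getNodeName().equals("note")
                ? NodeFilter.FILTER_REJECT : NodeFilter.FILTER_ACCEPT;
        // Assumption: this DOM implementation supports the optional Traversal module.
        TreeWalker walker = ((DocumentTraversal) doc).createTreeWalker(
                doc.getDocumentElement(), NodeFilter.SHOW_ELEMENT, noNotes, false);
        StringBuilder names = new StringBuilder();
        for (Node n = walker.nextNode(); n != null; n = walker.nextNode()) {
            if (names.length() > 0) names.append(' ');
            names.append(n.getNodeName());
        }
        return names.toString();
    }

    public static void main(String[] args) {
        System.out.println(visit("<course><student><name>Ann</name></student>"
                + "<note><name>x</name></note><student><name>Bob</name></student></course>"));
    }
}
```

SHOW_ELEMENT selects the node type, and the NodeFilter refines the selection further, matching the two-stage choice described for DocumentTraversal.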
SLIDE 22

Processing of ranges

  • Range is an optional module in DOM Level 2.
  • A range is a segment between a start point and an end point; each point is given as a container node and an offset within it.
  • Range interface: Methods e.g. for
      – setting the start and end point,
      – comparing two ranges,
      – copying the contents of the range,
      – inserting new items to the range,
      – collapsing the range,
      – etc.

SLIDE 23

12.2. Simple API for XML (SAX)

  • Straightforward, free, event-based API
  • Developed in 1998 outside of W3C, but has a standard-like status (http://www.saxproject.org).
  • Originally developed for Java, but has been ported to e.g. C++, Python, Eiffel, and Perl.
  • A collection of interfaces in org.xml.sax.
  • Two versions: SAX1 and SAX2; the latter includes namespace support. Old classes and methods are available, but deprecated.
  • Best-known implementation: Apache's Xerces
SLIDE 24

Principles of SAX

  • Event-based ('push') model: The parser scans the document text and informs the application program (by 'callback') about 'events' like start-tag, end-tag, character data, processing instruction, etc.
  • The application programmer has to write the required event handlers called by the parser: what to do in each specific situation.
  • Advantages: The whole document need not be stored in memory; processing can start right away, without waiting for complete loading.

SLIDE 25

Schematic view of SAX parsing

[Figure: the application program (main and event handlers, in the same or different classes) creates and calls the parser; the parser calls back the event handlers. The parser must be configured before the call (coupling to the event handlers).]

SLIDE 26

Tentative example: Count students

    import java.io.*;
    import org.xml.sax.*;
    import org.xml.sax.helpers.*;
    import org.apache.xerces.parsers.SAXParser;

    public class SaxCount extends DefaultHandler {
        int n = 0;   // number of <student> start tags seen so far

        public void startElement(String namespaceURI, String localName,
                                 String qname, Attributes atts) {
            if (localName.equals("student")) n++;
        } // end of startElement

SLIDE 27

Tentative example: main program

        public static void main(String[] args) {
            SaxCount c = new SaxCount();
            SAXParser p = new SAXParser();
            p.setContentHandler(c);      // couple the parser to the event handler
            try {
                p.parse(args[0]);        // URL or file name of the document
            } catch (Exception e) {
                System.out.println("Some error");
            }
            System.out.println("Number of students: " + c.n);
        } // end of main
    } // end of class SaxCount
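The same counter can be sketched without the Xerces-specific SAXParser class, using only JAXP from the JDK. This is an alternative under stated assumptions, not the slide author's code: the class name SaxCountJaxp is invented, and the input is an in-memory string so the example is self-contained. One difference matters: a JAXP SAXParserFactory is not namespace-aware by default, so localName may be empty and the sketch tests qName instead.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class SaxCountJaxp extends DefaultHandler {
    int n = 0;   // number of <student> start tags seen so far

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // qName is used because the default JAXP factory is not namespace-aware
        if (qName.equals("student")) n++;
    }

    // Parse the given XML text and return the number of student elements.
    static int count(String xml) {
        try {
            SaxCountJaxp handler = new SaxCountJaxp();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
            return handler.n;
        } catch (Exception e) {
            throw new IllegalStateException("Parsing failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println("Number of students: "
                + count("<course><student/><student/><teacher/></course>"));
    }
}
```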

SLIDE 28

SAX2 interfaces

  • XMLReader
  • ContentHandler
  • Attributes
  • XMLFilter
  • DTDHandler
  • EntityResolver
  • ErrorHandler
  • Locator
  • Plus some helper and exception classes
SLIDE 29

Parser creation methods

The parser is an instance of a concrete class implementing the XMLReader interface. Alternatives for creation:

1) Using a factory method:

    XMLReader parser = XMLReaderFactory.createXMLReader();

2) As above, but giving an explicit parser class:

    XMLReader parser = XMLReaderFactory.createXMLReader(
        "org.apache.xerces.parsers.SAXParser");

3) Using a constructor of the concrete parser class:

    SAXParser parser = new SAXParser();

SLIDE 30

Parser call

  • The absolute/relative URL of the document to be parsed is given as a parameter:

    try {
        parser.parse("test.xml");
    } catch (SAXParseException pe) { ... }
    catch (SAXException se) { ... }
    catch (IOException ie) { ... }

  • To do something with the parsed document, a ContentHandler must be specified to handle the events during the document processing.
  • A simpler class ('helper'): DefaultHandler: Empty default methods for all events; redefine the ones needing special processing.
SLIDE 31

Methods in the ContentHandler Interface

  • setDocumentLocator(...)      // Locator finds event position
  • startDocument()              // Called at start
  • endDocument()                // Called at end
  • startPrefixMapping(...)      // Before start tag with namespace declaration
  • endPrefixMapping(...)        // After end tag with namespace declaration
  • startElement(...)            // At start tag
  • endElement(...)              // At end tag
  • characters(...)              // Continuous character string
  • ignorableWhitespace(...)     // Space, tab, newline
  • processingInstruction(...)   // Application-specific
  • skippedEntity(...)           // Entity that the parser did not expand
SLIDE 32

Attributes interface

  • The startElement() method gets the attribute list, of type Attributes, from the parser as a parameter.
  • Methods for the list and its individual attributes, to be used within the startElement method:
      – getLength()
      – getQName(index)
      – getValue(index)
      – getValue(QName)
      – getType(index)
      – getType(QName)

SLIDE 33

Example: print start tags + attributes

    public void startElement(String namespaceURI, String localName,
                             String qname, Attributes atts) {
        // Strings must be compared by content, not with ==
        if (namespaceURI.isEmpty())
            System.out.print("<" + localName);
        else
            System.out.print("<" + qname);
        for (int i = 0; i < atts.getLength(); i++) {
            System.out.print(" " + atts.getQName(i));
            System.out.print("=\"" + atts.getValue(i) + "\"");
        }
        System.out.println(">");
    } // end of startElement

SLIDE 34

Setting features and properties

  • Variables for controlling/inspecting the SAX parser behavior.
  • Feature and property names are URIs.
  • Features are Boolean-valued; examples:
      – Validity check on/off (if 'on', an ErrorHandler must be registered).
      – Parsing of entities on/off
  • Properties are Object-valued, e.g.:
      – The current node
      – Source characters for the current event

SLIDE 35

Examples of using features and properties

  • Set validation on:

    parser.setFeature("http://xml.org/sax/features/validation", true);

  • Testing if validation is on or off:

    boolean b = parser.getFeature("http://xml.org/sax/features/validation");

  • Getting the property object representing the literal string that produced the current event:

    String tag = (String) parser.getProperty("http://xml.org/sax/properties/xml-string");
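Setting or reading a feature can fail, because a parser may not recognize or support a given URI; both calls throw checked SAX exceptions. A minimal sketch, assuming the JDK's default (Xerces-derived) SAX parser, which supports the validation feature; the class name FeatureDemo is invented.

```java
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.XMLReader;

public class FeatureDemo {
    static final String VALIDATION = "http://xml.org/sax/features/validation";

    // Switch validation on and read the feature back from the parser.
    static boolean turnValidationOn() {
        try {
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            reader.setFeature(VALIDATION, true);
            return reader.getFeature(VALIDATION);
        } catch (Exception e) {
            // SAXNotRecognizedException or SAXNotSupportedException end up here
            throw new IllegalStateException("Feature not available: " + VALIDATION, e);
        }
    }

    public static void main(String[] args) {
        System.out.println("Validation on: " + turnValidationOn());
    }
}
```

With validation switched on, an ErrorHandler should also be registered before parsing; otherwise validity errors may go unnoticed.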

SLIDE 36

SAX filters

  • Filters can modify the 'traffic' from the parser to the event handlers.
  • Messages can be modified, replaced, blocked, or left unchanged.
  • To the client, the filter behaves like a parser.
  • To the parser, the filter acts as a ContentHandler.
  • A filter needs to implement several interfaces; helper class XMLFilterImpl makes implementation simpler.

[Figure: Parser → event messages → Filter → filtered event messages → ContentHandler]
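A filter built on XMLFilterImpl only overrides the events it wants to change; everything else passes through unchanged. The sketch below renames one element on its way to the client handler. The class name RenameFilter and the element names pupil and student are made up for illustration, and the underlying parser comes from JAXP (not namespace-aware by default, hence the use of qName).

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

public class RenameFilter extends XMLFilterImpl {
    RenameFilter(XMLReader parent) { super(parent); }

    private static String rename(String qName) {
        return qName.equals("pupil") ? "student" : qName;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        super.startElement(uri, localName, rename(qName), atts);  // forward modified event
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        super.endElement(uri, localName, rename(qName));
    }

    // Parse xml through the filter and return the start tags the client handler saw.
    static String startTagsSeen(String xml) {
        StringBuilder seen = new StringBuilder();
        DefaultHandler client = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes a) {
                seen.append(qName).append(' ');
            }
        };
        try {
            XMLReader parser = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            RenameFilter filter = new RenameFilter(parser);
            filter.setContentHandler(client);   // the client treats the filter as its parser
            filter.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Parsing failed", e);
        }
        return seen.toString().trim();
    }

    public static void main(String[] args) {
        System.out.println(startTagsSeen("<course><pupil/><pupil/></course>"));
    }
}
```

Blocking an event amounts to not calling the super method; filters can also be chained by passing one filter as the parent of another.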

SLIDE 37

StAX (Streaming API for XML) for Java

  • Intermediate solution between DOM and SAX.
  • SAX works in a 'push' mode: The parser has the control to generate events, and the application program must capture them.
  • StAX works in a 'pull' mode ('XmlPull' was a predecessor of StAX): The application has control of a cursor, moved sequentially forward.
  • Processing documents in textual order (streaming) is much more efficient, especially in memory use, than building the whole document tree in memory.
  • Note: A large part of XSLT can be processed in streaming mode (STX processor)

SLIDE 38

StAX (cont.)

  • Some methods of interface XMLStreamReader:
      next(), hasNext(), getText(), getLocalName(), getNamespaceURI().
  • Creating a reader using a 'factory':

    XMLInputFactory f = XMLInputFactory.newInstance();
    XMLStreamReader r = f.createXMLStreamReader(... );

  • Some methods of interface XMLStreamWriter:
      writeStartElement(localName), writeEndElement(), writeCharacters(text)
  • Reference implementation: http://stax.codehaus.org/
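The pull model can be sketched with the javax.xml.stream classes bundled with the JDK: the application calls next() in a loop and decides what to do with each event. The class name StaxDemo and the element names are assumptions made for the example; the reader is fed from an in-memory string.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;
import javax.xml.stream.XMLStreamWriter;

public class StaxDemo {
    // Pull events one by one and count the <student> start tags.
    static int countStudents(String xml) {
        try {
            XMLStreamReader r = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            int n = 0;
            while (r.hasNext()) {
                if (r.next() == XMLStreamConstants.START_ELEMENT
                        && r.getLocalName().equals("student")) {
                    n++;
                }
            }
            r.close();
            return n;
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Reading failed", e);
        }
    }

    // Write a single element with text content and return the produced markup.
    static String writeStudent(String name) {
        try {
            StringWriter out = new StringWriter();
            XMLStreamWriter w = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
            w.writeStartElement("student");
            w.writeCharacters(name);
            w.writeEndElement();
            w.close();   // flushes buffered output to the StringWriter
            return out.toString();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Writing failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(countStudents("<course><student/><student/></course>"));
        System.out.println(writeStudent("Ann"));
    }
}
```

Unlike the SAX counter, no handler class is needed here: the loop itself holds the control flow, which is the practical difference between 'pull' and 'push'.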