Informatics 1: Data & Analysis
Lecture 11: Navigating XML using XPath
Ian Stark
School of Informatics, The University of Edinburgh
Tuesday 25 February 2014, Semester 2 Week 6
http://www.inf.ed.ac.uk/teaching/courses/inf1/da

Student Survey
• What: Edinburgh Student Experience Survey (ESES)
• Where: http://www.ed.ac.uk/students/surveys
• When: before 1 March 2014
• Why:
  • You will help influence what we do at Edinburgh, and through this your own future experience here
  • Generate a cash donation from the University to Edinburgh Student Charities Appeal / EUSA Academic Societies Fund
  • Entry into iPad prize draw

Lecture Plan

XML
We start with technologies for modelling and querying semistructured data.
• Semistructured Data: Trees and XML
• Schemas for structuring XML
• Navigating and querying XML with XPath

Corpora
One particular kind of semistructured data is large bodies of written or spoken text: each one a corpus, plural corpora.
• Corpora: What they are and how to build them
• Applications: corpus analysis and data extraction

Ian Stark Inf1-DA / Lecture 11 2013-02-25

Sample Semistructured Data

/
  Gazetteer
    Country
      Name: Slovenia
      Population: 2,020,000
      Capital: Ljubljana
      Region
        Name: Gorenjska
        Feature @type="Lake": Bohinj
        Feature @type="Mountain": Triglav
        Feature @type="Mountain": Spik
    (data for other countries)

Sample Semistructured Data in XML

<?xml version="1.0" encoding="UTF-8"?>
<Gazetteer>
  <Country>
    <Name>Slovenia</Name>
    <Population>2,020,000</Population>
    <Capital>Ljubljana</Capital>
    <Region>
      <Name>Gorenjska</Name>
      <Feature type="Lake">Bohinj</Feature>
      <Feature type="Mountain">Triglav</Feature>
      <Feature type="Mountain">Spik</Feature>
    </Region>
  </Country>
  <!-- data for other countries here -->
</Gazetteer>
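To see the tree behind the markup, the sample document can be loaded with Python's standard `xml.etree.ElementTree` module. This is only an illustrative sketch: the helper name `outline` is made up here, and the parser silently drops the XML comment.

```python
import xml.etree.ElementTree as ET

# The gazetteer sample, as bytes so the encoding declaration is honoured.
GAZETTEER_XML = b"""<?xml version="1.0" encoding="UTF-8"?>
<Gazetteer>
  <Country>
    <Name>Slovenia</Name>
    <Population>2,020,000</Population>
    <Capital>Ljubljana</Capital>
    <Region>
      <Name>Gorenjska</Name>
      <Feature type="Lake">Bohinj</Feature>
      <Feature type="Mountain">Triglav</Feature>
      <Feature type="Mountain">Spik</Feature>
    </Region>
  </Country>
  <!-- data for other countries here -->
</Gazetteer>
"""

def outline(element, depth=0):
    """Return one line per element: indentation, tag, attributes, direct text.

    Hypothetical helper for illustration; not part of ElementTree.
    """
    text = (element.text or "").strip()
    attrs = "".join(f' @{key}="{value}"' for key, value in element.attrib.items())
    lines = ["  " * depth + element.tag + attrs + (f": {text}" if text else "")]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines

root = ET.fromstring(GAZETTEER_XML)
print("\n".join(outline(root)))
```

The printed outline has one line per element node, indented by depth, which mirrors the tree diagram of the gazetteer.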

How to Extract Information from an XML Document?

Since an XML document is a text document, we could simply use conventional text search to look for data. However, this ignores all the document structure.

A more powerful approach is to use a dedicated language for forming queries based on the tree structure of an XML document. This is (yet another) domain-specific language.

With such a language we can, for example:
• Perform database-style queries on data published as XML;
• Extract annotated content from marked-up text documents;
• Identify information captured in the tree structure itself.
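The difference between text search and structure-aware querying can be sketched in Python. The example uses the standard `xml.etree.ElementTree` module, which supports only a small subset of XPath and evaluates paths relative to the root element; the variable names are illustrative.

```python
import xml.etree.ElementTree as ET

XML = """<Gazetteer>
  <Country>
    <Name>Slovenia</Name>
    <Population>2,020,000</Population>
    <Capital>Ljubljana</Capital>
    <Region>
      <Name>Gorenjska</Name>
      <Feature type="Lake">Bohinj</Feature>
      <Feature type="Mountain">Triglav</Feature>
      <Feature type="Mountain">Spik</Feature>
    </Region>
  </Country>
</Gazetteer>"""

root = ET.fromstring(XML)

# Plain text search: counts every occurrence of the string, tags included,
# and cannot tell which kind of name it has found.
text_hits = XML.count("Name")

# Structure-aware queries: a country's name and a region's name are
# reached by different paths through the tree.
country_names = [e.text for e in root.findall("./Country/Name")]
region_names = [e.text for e in root.findall("./Country/Region/Name")]

print(text_hits)      # 4
print(country_names)  # ['Slovenia']
print(region_names)   # ['Gorenjska']
```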

XQuery and XPath

XQuery is a powerful declarative query language for extracting information from XML documents. As well as using XML documents for its source data, XQuery can also produce XML documents as output, so we can view it as an XML transformation language.

Interesting as the full XQuery language is, here we shall focus instead on a particular fragment. XPath is a sublanguage of XQuery, used for navigating XML documents using path expressions.

XPath can be viewed as a query language in its own right. It is also an important component of other XML application languages (XML Schema, XSLT, XForms, ...).

XPath Path Expressions

An XPath path expression (or location path) identifies a set of nodes within an XML document tree.

The path expression describes a set of possible paths from the root of the tree. The set of nodes identified is all those reached as final destinations of these paths.

When using a path expression as a query on a document, this set of nodes is returned as a list (without duplicates) sorted in document order: the order in which the nodes appeared in the original XML document.
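The "no duplicates, document order" rule can be sketched in Python as follows. The function `in_document_order` is a hypothetical helper, and the sketch handles element nodes only, since `xml.etree.ElementTree` does not represent text or attribute nodes as separate objects.

```python
import xml.etree.ElementTree as ET

XML = """<Gazetteer>
  <Country>
    <Name>Slovenia</Name>
    <Population>2,020,000</Population>
    <Capital>Ljubljana</Capital>
    <Region>
      <Name>Gorenjska</Name>
      <Feature type="Lake">Bohinj</Feature>
      <Feature type="Mountain">Triglav</Feature>
      <Feature type="Mountain">Spik</Feature>
    </Region>
  </Country>
</Gazetteer>"""

def in_document_order(root, nodes):
    """Drop duplicate elements and sort the rest by position in the document."""
    # root.iter() visits elements in document order (depth-first, pre-order).
    position = {id(element): index for index, element in enumerate(root.iter())}
    unique = {id(node): node for node in nodes}
    return sorted(unique.values(), key=lambda node: position[id(node)])

root = ET.fromstring(XML)
features = root.findall(".//Feature")
names = root.findall(".//Name")

# A deliberately shuffled collection that mentions each Feature twice.
mixed = features[::-1] + names + features
result = [element.text for element in in_document_order(root, mixed)]
print(result)  # ['Slovenia', 'Gorenjska', 'Bohinj', 'Triglav', 'Spik']
```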

Family Tree Navigation

[Figure: tree diagrams illustrating document order, and the siblings, ancestors and descendants of a node A.]

Examples of Path Expressions

The next few slides illustrate a selection of path expressions applied to the gazetteer example. Each expression appears twice: once using a standard abbreviated syntax, and once using full XPath.

In each case, the nodes identified by the path are highlighted, and for a query would be retrieved in document order.

Paths are built up step-by-step as the path expression is read from left to right, with a context node that travels over the tree according to the components of the path expression.

The slash / at the start of a path expression indicates that the starting position for the context node is the document root.
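The step-by-step movement of the context node can be sketched with a tiny hand-written evaluator. It is an illustration only: it handles just child steps by name or `*`, the function name `evaluate` is made up, and the document root is modelled by a hypothetical wrapper element called `#document`.

```python
import xml.etree.ElementTree as ET

XML = """<Gazetteer>
  <Country>
    <Name>Slovenia</Name>
    <Population>2,020,000</Population>
    <Capital>Ljubljana</Capital>
    <Region>
      <Name>Gorenjska</Name>
      <Feature type="Lake">Bohinj</Feature>
      <Feature type="Mountain">Triglav</Feature>
      <Feature type="Mountain">Spik</Feature>
    </Region>
  </Country>
</Gazetteer>"""

def evaluate(path, root_element):
    """Evaluate an absolute path made only of child steps, e.g. '/Gazetteer/Country'."""
    # Hypothetical stand-in for the document root: a wrapper element whose
    # only child is the root element of the document.
    document = ET.Element("#document")
    document.append(root_element)
    context = [document]  # the leading / starts at the document root
    for name in path.strip("/").split("/"):
        # One location step: move every context node to its matching children.
        context = [child for node in context for child in node
                   if name == "*" or child.tag == name]
    return context

root = ET.fromstring(XML)
print([e.tag for e in evaluate("/Gazetteer", root)])            # ['Gazetteer']
print([e.tag for e in evaluate("/Gazetteer/Country", root)])    # ['Country']
print([e.tag for e in evaluate("/Gazetteer/Country/*", root)])  # ['Name', 'Population', 'Capital', 'Region']
```

Because each step starts from the nodes reached by the previous one, a path such as /Country selects nothing here: Country is not a child of the document root.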

One Step

[Figure: gazetteer tree with the matching nodes highlighted.]

Abbreviated: /Gazetteer
Full: /child::Gazetteer

Two Steps

[Figure: gazetteer tree with the matching nodes highlighted.]

Abbreviated: /Gazetteer/Country
Full: /child::Gazetteer/child::Country

Children

[Figure: gazetteer tree with the matching nodes highlighted.]

Abbreviated: /Gazetteer/Country/*
Full: /child::Gazetteer/child::Country/child::*

Many Steps

[Figure: gazetteer tree with the matching nodes highlighted.]

Abbreviated: //Name
Full: /descendant::Name

Matching Many Element Nodes

[Figure: gazetteer tree with the matching nodes highlighted.]

Abbreviated: /Gazetteer/Country//*
Full: /child::Gazetteer/child::Country/descendant::*
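The element-only paths from the last few slides can be tried out in Python. This is a sketch using `xml.etree.ElementTree`, which implements only a subset of XPath and evaluates paths relative to the root element rather than the document root, so /Gazetteer/Country is written ./Country below.

```python
import xml.etree.ElementTree as ET

XML = """<Gazetteer>
  <Country>
    <Name>Slovenia</Name>
    <Population>2,020,000</Population>
    <Capital>Ljubljana</Capital>
    <Region>
      <Name>Gorenjska</Name>
      <Feature type="Lake">Bohinj</Feature>
      <Feature type="Mountain">Triglav</Feature>
      <Feature type="Mountain">Spik</Feature>
    </Region>
  </Country>
</Gazetteer>"""

root = ET.fromstring(XML)  # the Gazetteer element

children = root.findall("./Country/*")      # /Gazetteer/Country/*
names = root.findall(".//Name")             # //Name
descendants = root.findall("./Country//*")  # /Gazetteer/Country//*

print([e.tag for e in children])  # ['Name', 'Population', 'Capital', 'Region']
print([e.text for e in names])    # ['Slovenia', 'Gorenjska']
print(len(descendants))           # 8
```

Each result list comes back in document order.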

Matching Element and Text Nodes

[Figure: gazetteer tree with the matching nodes highlighted.]

Abbreviated: //Region//node()
Full: /descendant::Region/descendant::node()

Matching Text Nodes

[Figure: gazetteer tree with the matching nodes highlighted.]

Abbreviated: //Region//text()
Full: /descendant::Region/descendant::text()

Matching Attribute Nodes

[Figure: gazetteer tree with the matching nodes highlighted.]

Abbreviated: //Feature/@type
Full: /descendant::Feature/attribute::type
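A rough Python counterpart of //Feature/@type is sketched below. The path subset in `xml.etree.ElementTree` does not return attribute nodes as query results, so the sketch first finds the Feature elements and then reads each one's `type` attribute directly.

```python
import xml.etree.ElementTree as ET

XML = """<Gazetteer>
  <Country>
    <Name>Slovenia</Name>
    <Region>
      <Name>Gorenjska</Name>
      <Feature type="Lake">Bohinj</Feature>
      <Feature type="Mountain">Triglav</Feature>
      <Feature type="Mountain">Spik</Feature>
    </Region>
  </Country>
</Gazetteer>"""

root = ET.fromstring(XML)

# Step 1 (descendant::Feature): every Feature element, in document order.
features = list(root.iter("Feature"))

# Step 2 (attribute::type): the type attribute of each, where present.
types = [f.get("type") for f in features if f.get("type") is not None]

print(types)  # ['Lake', 'Mountain', 'Mountain']
```

The two Mountain values are both kept because they belong to distinct attribute nodes on different elements.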

Syntax for Path Expressions

A path expression is a sequence of location steps separated by a / character. Each location step has the form

    ⟨axis⟩::⟨node-test⟩⟨predicate⟩*

• The axis indicates which way the context node moves.
• The node test selects nodes of an appropriate type.
• The optional predicates supply further conditions that need to be satisfied to continue with the path.

The examples so far used the child and descendant axes; node-tests node(), text(), *, and individual names; and no predicates.
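The shape of a location step can be made concrete with a small parser for the full (unabbreviated) syntax. This is a simplified sketch, not a complete XPath grammar: `parse_step` and `parse_path` are hypothetical names, predicates are assumed to be written in square brackets as in standard XPath, and the naive split on / would break on predicates that themselves contain a slash.

```python
import re

# axis::node-test followed by zero or more [predicate] groups.
STEP = re.compile(
    r"^(?P<axis>[a-z-]+)::(?P<test>[^\[\]]+)(?P<preds>(\[[^\[\]]*\])*)$"
)

def parse_step(step):
    """Split one full-syntax location step into (axis, node_test, predicates)."""
    match = STEP.match(step.strip())
    if match is None:
        raise ValueError(f"not a full-syntax location step: {step!r}")
    predicates = re.findall(r"\[([^\[\]]*)\]", match.group("preds"))
    return match.group("axis"), match.group("test"), predicates

def parse_path(path):
    """Split an absolute full-syntax path into its location steps."""
    return [parse_step(step) for step in path.lstrip("/").split("/")]

for step in parse_path("/child::Gazetteer/child::Country/descendant::*"):
    print(step)
# ('child', 'Gazetteer', [])
# ('child', 'Country', [])
# ('descendant', '*', [])

print(parse_step("child::Feature[@type='Mountain']"))
# ('child', 'Feature', ["@type='Mountain'"])
```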

Some Axes

Different axes point in different directions from the current context node.

• child: immediate children (attribute nodes don't count)
• descendant: any descendants (again, not attribute nodes)
• parent: the unique parent (root has no parent)
• attribute: all attribute nodes (context node must be an element node)
• self: the context node itself
• descendant-or-self: the context node together with its descendants.
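These axes can be sketched as Python functions over an `xml.etree.ElementTree` tree. The sketch makes several assumptions: it covers element nodes only, the name `make_axes` is made up, the attribute axis is represented by (name, value) pairs, and the document root is not modelled, so the root element is reported as having no parent.

```python
import xml.etree.ElementTree as ET

XML = """<Gazetteer>
  <Country>
    <Name>Slovenia</Name>
    <Region>
      <Name>Gorenjska</Name>
      <Feature type="Lake">Bohinj</Feature>
      <Feature type="Mountain">Triglav</Feature>
      <Feature type="Mountain">Spik</Feature>
    </Region>
  </Country>
</Gazetteer>"""

def make_axes(root):
    """Return a dict mapping axis names to functions from a node to a list."""
    # ElementTree elements do not know their parent, so build a lookup table.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    return {
        "child": lambda n: list(n),
        "descendant": lambda n: [d for d in n.iter() if d is not n],
        "parent": lambda n: [parent_of[n]] if n in parent_of else [],
        "attribute": lambda n: sorted(n.attrib.items()),
        "self": lambda n: [n],
        "descendant-or-self": lambda n: list(n.iter()),
    }

root = ET.fromstring(XML)
axes = make_axes(root)
region = root.find(".//Region")

print([e.tag for e in axes["child"](region)])    # ['Name', 'Feature', 'Feature', 'Feature']
print([e.tag for e in axes["parent"](region)])   # ['Country']
print(len(axes["descendant-or-self"](region)))   # 5
print(axes["attribute"](region[1]))              # [('type', 'Lake')]
```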

Some Node Tests

Node tests select among all nodes along the current axis.

• text(): nodes with character data.
• node(): all kinds of node.
• *: all nodes of the "principal" node type for this axis: for the attribute axis, this is attribute nodes; for any other axis, element nodes. Never text nodes.
• name: nodes of the principal node type with the given name: element nodes on most axes, attribute nodes on the attribute axis.

The names used for node tests in the earlier examples were: Gazetteer, Country, Region, Feature and type.
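The effect of each node test can be sketched on the descendants of the Region element, as in //Region//node() and //Region//text(). This is a simplified model with assumptions stated up front: nodes are represented as hypothetical (kind, value) pairs, only a non-attribute axis is considered, and whitespace-only text between elements is ignored, as it is in the tree diagrams.

```python
import xml.etree.ElementTree as ET

XML = """<Region>
  <Name>Gorenjska</Name>
  <Feature type="Lake">Bohinj</Feature>
  <Feature type="Mountain">Triglav</Feature>
  <Feature type="Mountain">Spik</Feature>
</Region>"""

def descendant_nodes(element):
    """Yield ('element', tag) and ('text', data) pairs in document order."""
    if element.text and element.text.strip():
        yield ("text", element.text.strip())
    for child in element:
        yield ("element", child.tag)
        yield from descendant_nodes(child)
        if child.tail and child.tail.strip():
            yield ("text", child.tail.strip())

def node_test(test, node):
    """Apply a node test on a non-attribute axis (principal type: element)."""
    kind, value = node
    if test == "node()":
        return True
    if test == "text()":
        return kind == "text"
    if test == "*":
        return kind == "element"
    return kind == "element" and value == test  # a name test

region = ET.fromstring(XML)
nodes = list(descendant_nodes(region))

for test in ["node()", "text()", "*", "Feature"]:
    selected = [value for kind, value in nodes if node_test(test, (kind, value))]
    print(test, len(selected))
# node() 8
# text() 4
# * 4
# Feature 3
```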
