XPT 2006 XML Instances and Grammars 1
1 Document Instances and Grammars
Fundamentals of hierarchical document structures, or Computer Scientist’s view of XML
1.1 XML and XML documents
1.2 Basics of document grammars
1.3 Basics of XML DTDs
1.4 XML Namespaces
1.1 XML and XML documents
- XML: Extensible Markup Language, W3C Recommendation, February 1998
  – not an official standard, but a stable industry standard
  – 2nd Ed 10/2000, 3rd Ed 2/2004
    » editorial revisions, not new versions of XML 1.0
- a simplified subset of SGML, Standard Generalized Markup Language, ISO 8879:1987
  – what is said later about valid XML documents applies to SGML documents, too
What is XML?
- Extensible Markup Language is not a markup language!
  – does not fix a tag set nor its semantics (as markup languages such as HTML do)
- XML documents have no inherent (processing or presentation) semantics
  – Implementing those semantics is the topic of this course!
What is XML (2)?
- XML is
  – a way to use markup to represent information
  – a metalanguage
    » supports definition of specific markup languages through XML DTDs (Document Type Definitions) or Schemas
    » E.g. XHTML, a reformulation of HTML using XML
- Often “XML” ≈ XML + XML technology
  – that is, processing models and languages we’re studying (and many others ...)
How does it look?

    <?xml version='1.0' encoding="iso-8859-1" ?>
    <invoice num="1234">
      <client clNum="00-01">
        <name>Pekka Kilpeläinen</name>
        <email>kilpelai@cs.uku.fi</email>
      </client>
      <item price="60" unit="EUR">XML Handbook</item>
      <item price="350" unit="FIM">XSLT Programmer's Ref</item>
    </invoice>
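To make the structure of the invoice concrete, here is a minimal sketch of reading it with Python's standard `xml.etree.ElementTree` module. The choice of Python and the helper name `summarize` are assumptions made for illustration; the slides do not prescribe a parser or a language. The document is passed as ISO-8859-1 bytes so that the parser picks the encoding up from the XML declaration.

```python
import xml.etree.ElementTree as ET

# The invoice from the slide, written out as well-formed XML.
INVOICE = """<?xml version='1.0' encoding="iso-8859-1" ?>
<invoice num="1234">
  <client clNum="00-01">
    <name>Pekka Kilpeläinen</name>
    <email>kilpelai@cs.uku.fi</email>
  </client>
  <item price="60" unit="EUR">XML Handbook</item>
  <item price="350" unit="FIM">XSLT Programmer's Ref</item>
</invoice>
"""


def summarize(xml_bytes):
    """Hypothetical helper: pull a few fields out of an invoice document."""
    root = ET.fromstring(xml_bytes)        # root is the <invoice> element
    client = root.find("client")           # first <client> child
    items = [
        (item.text.strip(), int(item.get("price")), item.get("unit"))
        for item in root.findall("item")   # all <item> children, in order
    ]
    return {
        "num": root.get("num"),            # attribute values are strings
        "client": client.findtext("name"),
        "items": items,
    }


# Encode as the declaration promises, so the parser decodes the bytes itself.
summary = summarize(INVOICE.encode("iso-8859-1"))
print(summary["num"], summary["items"])
```

The parser recovers only the element tree, the attributes and the text. Treating `price` as a number, as done here with `int`, is a decision of the application, not something the XML document itself states.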
Essential Features of XML
- Overview of XML essentials
  – many details skipped
  – Learn to consult original sources (specifications, documentation etc) for details!
    » The XML specification is easy to browse
- First of all, XML is a textual or character-based way to represent data
XML Document Characters
- XML documents are made of ISO-10646 (32-bit) characters; in practice, of their 16-bit Unicode subset (used, e.g., in Java)
  – Unicode 2.0 defines almost 39,000 distinct characters
- Characters have three different aspects:
  – their identification as numeric code points
  – their representation by bytes
  – their visual presentation
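The first two aspects can be illustrated with a short sketch. It assumes Python, and the function name `describe` is made up for this example. It shows that a character has one code point but different byte representations depending on the encoding chosen. The third aspect, visual presentation, depends on the font and is not something this code can show.

```python
def describe(ch):
    """Hypothetical helper: one character's code point and byte representations."""
    return {
        # Identification: the numeric code point, in the usual U+XXXX notation.
        "code_point": f"U+{ord(ch):04X}",
        # Representation: the same character as bytes in three encodings (hex).
        "utf-8": ch.encode("utf-8").hex(),
        "utf-16-be": ch.encode("utf-16-be").hex(),
        "iso-8859-1": ch.encode("iso-8859-1").hex(),
    }


for ch in ("A", "\u00e4"):   # "\u00e4" is the letter a with diaeresis
    print(describe(ch))
```

For `A` the UTF-8 and ISO-8859-1 bytes coincide with 7-bit ASCII, while the a with diaeresis takes two bytes in UTF-8 and one in ISO-8859-1.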
External Aspects of Characters
- Documents are stored/transmitted as a sequence of bytes (of 8 bits). An encoding determines how characters are represented by bytes.
  – UTF-8 (≈7-bit ASCII) is the XML default encoding
  – encoding="KOI8-R" should be OK for Cyrillic texts
    » (but I cannot comment on parser support)
- A font (collection of character images called glyphs) determines the visual presentation of