COMP6037 Semi-structured Data and the Web: Tree Grammars and Relax NG, week 3
Uli Sattler, University of Manchester
General Stuff
- Read Blackboard’s Announcements
- Read Blackboard’s Discussions
- Forward your Blackboard email to your email account
- Start early with your coursework...trying to figure out what to do on Sunday
night might be difficult!
Schema languages...they are sooo different!
- I hope you understand that
  – there are various issues involved in the validation of a document
  – validation and data quality are closely related
  – structure and content are different aspects of a document
  – people have built numerous formalisms & tools to help you validate your documents
    - with various applications in mind
    - with various goals re. generality/complexity/simplicity
    - with various expressive means/restrictions to describe
      – datatypes
      – structure
    - with various object-oriented mechanisms such as inheritance for ease of authoring & maintaining
  – which of them to choose (first/how to mix) depends on your application
- I invite you to compare this with other areas, e.g., parser generators
Schema languages...they are sooo different!
- So far, you know 2 XML schema languages:
  – DTDs
  – WXS (W3C XML Schema)
- There are many more for XML
  – SOX, SGML, RelaxNG, Schematron, …
- They differ in
  1. their style
  2. usability
  3. computability
     - how complex is it to validate documents?
  4. expressive power (which is closely related to (3))
     - data type support
     - structural expressiveness
       – what kind of trees can I describe?
     - uniqueness constraints
     – ...
A comparison: who can describe structure better?
- Obvious question:
  ➡ which of DTD, WXS, and RelaxNG has more powerful means to describe structure?
  – how to answer this?
  – perhaps they are orthogonal?
  – how to compare DTD (no types) with WXS and RelaxNG (later)?
  – how do we know how costly validation is?
    - e.g., is validating a document D against a DTD more expensive than validating a document against an XML Schema?
- “Ah, let’s build a DTD-validator and a WXS-validator and compare run-time and memory requirements for similar (?) inputs on the same document!”
  - ...how can we make sure that we have been equally clever for both validators?
  - ...what are “similar inputs”?
  - ...how many do we need to test for?
Interesting example where DTDs and WXS differ
- assume we want to define, in a DTD, an element whose content is described by the following WXS type NameType:
<xs:complexType name="NameType">
  <xs:all>
    <xs:element name="firstname" type="xs:string"/>
    <xs:element name="secondname" type="xs:string"/>
    <xs:element name="lastname" type="xs:string"/>
  </xs:all>
</xs:complexType>
- how?
- ….it’s unmanageable and long (how?), but possible
- are there other examples where something is not possible in DTD? Or in WXS?
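One way to see why the DTD version of NameType gets long: a DTD has no counterpart to xs:all, so every ordering of the children has to be spelled out. The sketch below generates such a content model in a factored form, so that no two alternatives start with the same element. The function name `all_orders` and the element name `name` are made up for illustration.

```python
def all_orders(names):
    """Build a DTD content model accepting every ordering of `names`,
    with each name occurring exactly once (assumes at least two names)."""
    if len(names) == 1:
        return names[0]
    alternatives = []
    for i, first in enumerate(names):
        rest = names[:i] + names[i + 1:]
        # pick `first` as the next child, then allow any order of the rest
        alternatives.append(f"({first},{all_orders(rest)})")
    return "(" + "|".join(alternatives) + ")"


model = all_orders(["firstname", "secondname", "lastname"])
print(f"<!ELEMENT name {model}>")
```

With n children the model mentions on the order of n! element names, which is the "unmanageable and long" part.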
A comparison: who can describe structure better?
- Use formal methods:
  1. view XML document D as a simplified DOM tree TD
     - capture a schema S in a grammar/automaton GS
     - such that TD ∈ L(GS) if and only if D validates against S
     - i.e., validation corresponds to acceptance by automaton
  2. schema languages of type X ⇋ grammar/automata of type CX
  3. check: for each GS1 of type CX capturing a schema S1 of type X, can we build GS2 of type CY capturing a schema S2 of type Y, such that L(GS1) = L(GS2)? If yes, then Y is at least as expressive as X.
- ...so, let’s do this: we will see definitions and lemmas/theorems
  ๏ definitions fix the meaning of terms in an unambiguous way (and you need to understand them to follow the rest)
  ★ lemmas and theorems state properties of the concepts introduced
- we will see and work on examples
Basics - regular expressions
๏ Given a set of symbols N, the set of regular expressions regexp(N) over N is the smallest set containing
  – the empty string ε and all symbols in N and
  – if e1 and e2 ∈ regexp(N), then so are
    - e1,e2 (concatenation)
    - e1|e2 (choice)
    - e1* (repetition)
๏ Given a regular expression e, a string w matches e
  – if w = ε = e or w = n = e for some n in N, or
  – if w = w1 w2 and e = (e1 , e2) and w1 matches e1 and w2 matches e2, or
  – if e = (e1 | e2) and w matches e1 or w matches e2, or
  – if w = ε and e = e1*, or
  – if w = w1 w2... wn and e = e1* and each wi matches e1
- Hence we can use
  – e+ as abbreviation for (e,e*)
  – e? as abbreviation for (e|ε)
- Test:
  – does ababa match (a,b)*?
  – does ababa match (a|b)*?
  – does abababa match (a,b)*, a?, b, a, b*?
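The matching definition can be turned into a small checker, which is handy for verifying answers to the test questions. This is only a sketch: the tuple encoding of expressions and the helper names (`ends`, `matches`, `seq`, `opt`) are assumptions made for this example, not notation from the lecture.

```python
# Regular expressions as nested tuples:
#   ("eps",)  ("sym", a)  ("seq", e1, e2)  ("alt", e1, e2)  ("star", e1)
def ends(e, w, i):
    """Return all positions j such that w[i:j] matches e."""
    kind = e[0]
    if kind == "eps":
        return {i}
    if kind == "sym":
        return {i + 1} if i < len(w) and w[i] == e[1] else set()
    if kind == "seq":
        return {k for j in ends(e[1], w, i) for k in ends(e[2], w, j)}
    if kind == "alt":
        return ends(e[1], w, i) | ends(e[2], w, i)
    if kind == "star":
        reached, frontier = {i}, {i}
        while frontier:  # keep appending matches of e1 until nothing new is reached
            frontier = {k for j in frontier for k in ends(e[1], w, j)} - reached
            reached |= frontier
        return reached
    raise ValueError(f"unknown expression kind: {kind}")


def matches(e, w):
    return len(w) in ends(e, w, 0)


def seq(first, *rest):
    """Concatenate several expressions: e1, e2, ..., en."""
    for e in rest:
        first = ("seq", first, e)
    return first


def opt(e):
    """e? as abbreviation for (e|ε)."""
    return ("alt", e, ("eps",))


a, b = ("sym", "a"), ("sym", "b")
ab_star = ("star", ("seq", a, b))
print(matches(ab_star, "ababa"))
print(matches(("star", ("alt", a, b)), "ababa"))
print(matches(seq(ab_star, opt(a), b, a, ("star", b)), "abababa"))
```

The same `ends` function works on any sequence, not only on Python strings, so it can also match sequences of non-terminals.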
Towards trees...
- A regular expression e describes a set of strings L(e):
  – L(e) := { w | w matches e }
  – L(e) can be finite, infinite,...can it be empty?
- A schema S describes a set of trees L(S):
  – L(S) := { t | t validates against S }
  – L(S) can be finite, infinite, empty,….can it be empty?
- As a simplification, we will look into Tree Grammars:
- A tree grammar G describes a set of trees L(G):
  – L(G) := { t | G accepts t }
Trees: nodes as strings!
[Figure: three drawings of the same tree: “A tree”, “A tree with nodes as strings” (nodes ε, 0, 1, 0,0, 0,1, 0,2, 1,0), and “A tree over {A,B,C}” (the same nodes carrying labels A and B)]
Trees: nodes as strings!
๏ We use ℕ for the non-negative integers (including 0)
๏ We use ℕ* for the set of all (finite) strings over ℕ
  - ε is used for the empty string
  - 0,1,0 is a string of length 3
  - each string stands for a node
๏ An alphabet is a finite set of symbols
๏ A tree T over an alphabet Σ is a mapping T: ℕ* → Σ with a domain that
  ๏ is finite (i.e., T(w) is defined for only finitely many strings w over ℕ),
  ๏ contains ε (i.e., T(ε) is defined), and
  ๏ is prefix-closed (i.e., if T(w,n) is defined, then T(w) is as well)
- Explanation:
  - the strings in the domain of T represent T’s nodes
  - (w,n) is a successor (child) of w
  - T(w) is the label of w (as shown in picture)
  - we use nodes(T) for the (finite) domain of/nodes in T
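The "tree as a mapping" view translates directly into a dictionary from node strings to labels. In the sketch below, nodes are Python tuples of integers (ε is the empty tuple). The labelling of the sample tree is made up for illustration, since the picture's exact labels are not recoverable here.

```python
def is_tree(t):
    """t maps nodes (tuples of non-negative ints) to labels.
    A dict is finite by construction, so only two conditions remain:
    the root ε = () is present and the domain is prefix-closed."""
    if () not in t:
        return False
    return all(w[:-1] in t for w in t if w)


def children(t, w):
    """Successors (w,n) of node w, in left-to-right order."""
    return sorted(v for v in t if len(v) == len(w) + 1 and v[:-1] == w)


# Hypothetical labelling of the seven nodes ε, 0, 1, 0,0, 0,1, 0,2, 1,0
tree = {(): "B", (0,): "A", (1,): "A",
        (0, 0): "A", (0, 1): "B", (0, 2): "B", (1, 0): "B"}

print(is_tree(tree))
print(children(tree, (0,)))
```

Checking only the direct parent `w[:-1]` is enough: if every node's parent is in the domain, then so are all of its prefixes.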
Tree Grammars: definition
๏ A tree grammar is a structure G = (N, Σ, S, P) where
  – N is a finite set of nonterminal symbols,
  – Σ is an alphabet,
  – S ⊆ N is a set of start symbols, and
  – P is a set of production rules, i.e., each R ∈ P is of the form
    - X → a e
    - where X ∈ N, a ∈ Σ, and e ∈ regexp(N)
- Example: G = (N, Σ, S, P) with
  N = {Book, Author, Editor, Affiliation, Paper, Country}
  Σ = {B, Name, A, P, C}
  S = {Book, Paper}
  P = { Book → B Editor,
        Paper → P Author,
        Editor → Name Country,
        Author → Name Affiliation,
        Country → C ε,
        Affiliation → A ε }
Tree Grammars: what do they do?
- a grammar runs on a tree...
- given a tree T over Σ, if G can run on T, then
  – G is said to accept T,
  – written T ∈ L(G),
  – i.e., T is in the language accepted by G
- remember: grammar G corresponds to schema SG and we want to build G such that T ∈ L(G) holds iff T validates against SG
- ...let’s see a grammar run...
- note: the example grammar uses only super simple regexps
Tree Grammars: definition of runs
๏ A run of G = (N, Σ, S, P) on a tree T is a mapping r: nodes(T) → N such that:
  – r(ε) ∈ S   % r labels the root node of T with a start symbol
  – for each w ∈ nodes(T) with children w1 w2... wn, there exists a rule X → a e ∈ P such that
    - r(w) = X,
    - T(w) = a, and
    - r(w1) r(w2)... r(wn) matches e.
- let’s see an example run of
  N = {Book, Author, Editor, Affilia, Paper, F, L}
  Σ = {B, P, Name, F, L, A}
  S = {Book, Paper}
  P = { Book → B Editor|Author,
        Paper → P Author,
        Editor → Name F,L,
        Author → Name L,Affilia,
        F → F ε,
        L → L ε,
        Affilia → A ε }
[Figure: the tree B(Name(F, L)) with nodes ε, 0, 0,0, 0,1; labelling the root with Paper fails (✘), whereas the run ε ↦ Book, 0 ↦ Editor, 0,0 ↦ F, 0,1 ↦ L succeeds]
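The definition of a run can be checked mechanically. The sketch below encodes the example grammar as a list of (X, a, e) triples, with e in a tuple encoding of regular expressions, and tests whether a given mapping r is a run. The encoding and all function names are assumptions made for this illustration.

```python
# Regular expressions over non-terminals, as nested tuples:
#   ("eps",)  ("sym", X)  ("seq", e1, e2)  ("alt", e1, e2)  ("star", e1)
def ends(e, w, i):
    """All positions j such that the sequence w[i:j] matches e."""
    kind = e[0]
    if kind == "eps":
        return {i}
    if kind == "sym":
        return {i + 1} if i < len(w) and w[i] == e[1] else set()
    if kind == "seq":
        return {k for j in ends(e[1], w, i) for k in ends(e[2], w, j)}
    if kind == "alt":
        return ends(e[1], w, i) | ends(e[2], w, i)
    reached, frontier = {i}, {i}  # kind == "star"
    while frontier:
        frontier = {k for j in frontier for k in ends(e[1], w, j)} - reached
        reached |= frontier
    return reached


def children(t, w):
    return sorted(v for v in t if len(v) == len(w) + 1 and v[:-1] == w)


def is_run(rules, start, t, r):
    """Check the two conditions from the definition of a run."""
    if set(r) != set(t) or r[()] not in start:
        return False
    for w in t:
        word = [r[c] for c in children(t, w)]  # r(w1) r(w2) ... r(wn)
        if not any(x == r[w] and a == t[w] and len(word) in ends(e, word, 0)
                   for x, a, e in rules):
            return False
    return True


def sym(x):
    return ("sym", x)


EPS = ("eps",)
RULES = [
    ("Book", "B", ("alt", sym("Editor"), sym("Author"))),
    ("Paper", "P", sym("Author")),
    ("Editor", "Name", ("seq", sym("F"), sym("L"))),
    ("Author", "Name", ("seq", sym("L"), sym("Affilia"))),
    ("F", "F", EPS), ("L", "L", EPS), ("Affilia", "A", EPS),
]
START = {"Book", "Paper"}

tree = {(): "B", (0,): "Name", (0, 0): "F", (0, 1): "L"}
good = {(): "Book", (0,): "Editor", (0, 0): "F", (0, 1): "L"}
bad = {**good, (): "Paper"}
print(is_run(RULES, START, tree, good), is_run(RULES, START, tree, bad))
```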
Tree Grammars: trees accepted
๏ A tree T is accepted by a grammar G if there is a run of G on T
  – we write T ∈ L(G).
- which of these trees are accepted by our grammar?
  N = {Book, Author, Editor, Affilia, Paper, F, L}
  Σ = {B, Name, F, L, A, P}
  S = {Book, Paper}
  P = { Book → B Editor|Author,
        Paper → P Author,
        Editor → Name F,L,
        Author → Name L,Affilia,
        F → F ε,
        L → L ε,
        Affilia → A ε }
[Figure: three trees, each with nodes ε, 0, 0,0, 0,1: P(Name(L, A)), B(Name(L, A)), and P(Name(F, L))]
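Acceptance asks whether some run exists, so a checker has to search rather than verify. A minimal sketch, assuming the same tuple encoding of rules as a list of (X, a, e): compute bottom-up, for every node, the set of non-terminals that some run could assign to it, and accept if the root's set contains a start symbol. This is the standard bottom-up evaluation of a tree automaton, not a procedure given on the slides.

```python
def ends(e, w, i):
    """Like string matching, but w is a list of SETS of candidate
    non-terminals; a symbol matches position i if it is in w[i]."""
    kind = e[0]
    if kind == "eps":
        return {i}
    if kind == "sym":
        return {i + 1} if i < len(w) and e[1] in w[i] else set()
    if kind == "seq":
        return {k for j in ends(e[1], w, i) for k in ends(e[2], w, j)}
    if kind == "alt":
        return ends(e[1], w, i) | ends(e[2], w, i)
    reached, frontier = {i}, {i}  # kind == "star"
    while frontier:
        frontier = {k for j in frontier for k in ends(e[1], w, j)} - reached
        reached |= frontier
    return reached


def candidates(rules, t, w=()):
    """Non-terminals X that some run could assign to node w of tree t."""
    kids = sorted(v for v in t if len(v) == len(w) + 1 and v[:-1] == w)
    word = [candidates(rules, t, c) for c in kids]
    return {x for x, a, e in rules
            if a == t[w] and len(word) in ends(e, word, 0)}


def accepts(rules, start, t):
    return bool(candidates(rules, t) & start)


def sym(x):
    return ("sym", x)


EPS = ("eps",)
RULES = [
    ("Book", "B", ("alt", sym("Editor"), sym("Author"))),
    ("Paper", "P", sym("Author")),
    ("Editor", "Name", ("seq", sym("F"), sym("L"))),
    ("Author", "Name", ("seq", sym("L"), sym("Affilia"))),
    ("F", "F", EPS), ("L", "L", EPS), ("Affilia", "A", EPS),
]
START = {"Book", "Paper"}

t1 = {(): "P", (0,): "Name", (0, 0): "L", (0, 1): "A"}
t2 = {(): "B", (0,): "Name", (0, 0): "L", (0, 1): "A"}
t3 = {(): "P", (0,): "Name", (0, 0): "F", (0, 1): "L"}
for t in (t1, t2, t3):
    print(accepts(RULES, START, t))
```

Because the choices made in different subtrees are independent, tracking one set of candidates per node is enough, and no guessing or backtracking over whole runs is needed.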
Tree Grammars: testing acceptance
- in the previous example, testing acceptance required guessing because, for a Book, we don’t know whether Name is the name of Editor or Author...
๏ two different non-terminals N1, N2 ∈ N are competing if there are rules
  – N1 → a e1 and
  – N2 → a e2 in P.
- in our example, Editor and Author are competing
๏ A grammar is local if no two of its non-terminals compete
- our grammar is not local: Editor and Author compete!
- clearly, non-local makes testing acceptance at least challenging...
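Competition and locality depend only on the left-hand sides and terminals of the rules, so they are easy to compute. In this sketch the content models are kept as plain strings because they are never inspected. The function names are illustrative.

```python
from itertools import combinations


def competing_pairs(rules):
    """rules: list of (X, a, e). Return all pairs of different
    non-terminals that have rules with the same terminal a."""
    return {frozenset((x1, x2))
            for (x1, a1, _), (x2, a2, _) in combinations(rules, 2)
            if a1 == a2 and x1 != x2}


def is_local(rules):
    return not competing_pairs(rules)


RULES = [
    ("Book", "B", "Editor|Author"),
    ("Paper", "P", "Author"),
    ("Editor", "Name", "F,L"),
    ("Author", "Name", "L,Affilia"),
    ("F", "F", "ε"), ("L", "L", "ε"), ("Affilia", "A", "ε"),
]
print(competing_pairs(RULES))
print(is_local(RULES))
```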
Tree Grammars: testing acceptance
- Does locality help to make testing acceptance “deterministic”?
- Is locality necessary to make testing acceptance “deterministic”?
- next, a notion that is weaker than locality:
๏ A grammar is single-type if
- none of its start symbols compete with each other and
- for each rule N → a e ∈ P,
no two non-terminals in e compete with each other
- does this sound familiar?
- our example grammar is not single-type: Editor and Author compete and occur together in the rule for Book
  Book → B Editor|Author,
  Editor → Name F,L,
  Author → Name L,Affilia,
- A bit like the ‘Element Declarations Consistent’ constraint in WXS? ...later more
Tree Grammars: testing acceptance
๏ A set of trees is called a tree language (like sets of strings are languages)
- A tree language can be empty, finite, or infinite
๏ A tree language TS is local (resp. single-type, regular) if there exists a local (resp. single-type, arbitrary) tree grammar G such that L(G) = TS.
- for one TS, there may be different tree grammars accepting exactly TS…
An Example
- G is not single-type
- G’ is single-type: Author and BA still compete, but don’t occur together in a rule!
- L(G’) = L(G)
- hence L(G) is single-type!
G = (N, Σ, S, P) with
  N = {Book, Author, Editor, Affilia, Paper, F, L}
  Σ = {B, P, Name, F, L, A}
  S = {Book, Paper}
  P = { Book → B Editor|Author,
        Paper → P Author,
        Editor → Name F,L,
        Author → Name L,Affilia,
        F → F ε,
        L → L ε,
        Affilia → A ε }
G’ = (N’, Σ’, S’, P’) with
  N’ = {Book, Author, BA, Affilia, Paper, F, L}
  Σ’ = {B, P, Name, F, L, A}
  S’ = {Book, Paper}
  P’ = { Book → B BA,
         Paper → P Author,
         BA → Name (F,L)|(L,Affilia),
         Author → Name L,Affilia,
         F → F ε,
         L → L ε,
         Affilia → A ε }
๏ A grammar is single-type if
- none of its start symbols compete with each other and
- for each rule N → a e ∈ P,
no two non-terminals in e compete with each other
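The single-type condition recalled here can also be checked mechanically for G and G’. The sketch assumes rules are given as (X, a, e) triples with e in a tuple encoding of regular expressions. The names `symbols`, `competing_pairs` and `is_single_type` are made up for this example.

```python
from itertools import combinations


def symbols(e):
    """Non-terminals occurring in a regular expression given as nested
    tuples ("eps",), ("sym", X), ("seq", ..), ("alt", ..), ("star", ..)."""
    if e[0] == "sym":
        return {e[1]}
    return set().union(*(symbols(sub) for sub in e[1:]))


def competing_pairs(rules):
    return {frozenset((x1, x2))
            for (x1, a1, _), (x2, a2, _) in combinations(rules, 2)
            if a1 == a2 and x1 != x2}


def is_single_type(rules, start):
    pairs = competing_pairs(rules)

    def clash(group):  # does the group contain two competing non-terminals?
        return any(pair <= group for pair in pairs)

    return not clash(set(start)) and not any(clash(symbols(e)) for _, _, e in rules)


def sym(x):
    return ("sym", x)


EPS = ("eps",)
F_L = ("seq", sym("F"), sym("L"))
L_AFF = ("seq", sym("L"), sym("Affilia"))
LEAVES = [("F", "F", EPS), ("L", "L", EPS), ("Affilia", "A", EPS)]

G = [("Book", "B", ("alt", sym("Editor"), sym("Author"))),
     ("Paper", "P", sym("Author")),
     ("Editor", "Name", F_L),
     ("Author", "Name", L_AFF)] + LEAVES
G_PRIME = [("Book", "B", sym("BA")),
           ("Paper", "P", sym("Author")),
           ("BA", "Name", ("alt", F_L, L_AFF)),
           ("Author", "Name", L_AFF)] + LEAVES
START = {"Book", "Paper"}

print(is_single_type(G, START), is_single_type(G_PRIME, START))
```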
Properties of Local and Single-Type Tree Languages
- the following lemma is an immediate consequence of the definitions of local and single-type tree languages
  ★ Every local tree language is single-type, and every single-type tree language is regular.
- the next lemma is a bit more tricky:
  ★ There are regular tree languages that are not single-type, and there are single-type tree languages that are not local.
Loc ⊊ ST ⊊ Reg
Properties of Local and Single-Type Tree Languages
- to prove Loc ⊊ ST ⊊ Reg, it remains to give suitable (counter) examples:
  – G’ is single-type, but not local (BA and Author compete), and there is no local G’’ with L(G’’) = L(G’). Hence L(G’) is not local, but single-type.
  – F is regular, but not single-type (N1 and N2 compete and occur in the S-rule), and there is no single-type F’ with L(F’) = L(F) (why?). Hence L(F) is not single-type, but regular.
G’ = (N’, Σ’, S’, P’) with
  N’ = {Book, Author, BA, Affilia, Paper, F, L}
  Σ’ = {B, P, Name, A, F, L}
  S’ = {Book, Paper}
  P’ = { Book → B BA,
         Paper → P Author,
         BA → Name (F,L)|(L,Affilia),
         Author → Name L,Affilia,
         F → F ε,
         L → L ε,
         Affilia → A ε }
F = (N, Σ, S, P) with
  N = {S, N1, N2, L}
  Σ = {T, N, text}
  S = {S}
  P = { S → T (N1,N2*),
        N1 → N L,L,
        N2 → N L,
        L → text ε }
Observation
- F is regular, but not single-type (N1 and N2 compete and occur in the S-rule), and there is no single-type F’ with L(F’) = L(F). Hence L(F) is not single-type, but regular.
- So the inclusion of single-type in regular is really strict?!
F = (N, Σ, S, P) with
  N = {S, N1, N2, L}
  Σ = {T, N, t}
  S = {S}
  P = { S → T (N1,N2*),
        N1 → N L,L,
        N2 → N L,
        L → t ε }
[Figure: a tree in L(F): the root ε is labelled T and has four children 0, 1, 2, 3, all labelled N; node 0 has two children labelled L (0,0 and 0,1), nodes 1, 2, 3 have one child labelled L each (1,0, 2,0, 3,0); every L node has a single child labelled t]
Tree Grammars: three more things
★ A single-type grammar can have no more than one run on a tree.
★ A regular grammar can have more than one run on a tree.
- BTW, w.l.o.g., we can assume that no two production rules have the same non-terminal on the left hand side and the same terminal, i.e., we never have both N → P Author and N → P (Editor,Editor*): such rules can be merged, e.g., into N → P (Author|(Editor,Editor*))
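The "w.l.o.g." step above is a simple normalisation: rules that share both left-hand side and terminal are merged into one rule whose content model is the choice of the originals. A minimal sketch, with content models kept as strings and a made-up function name:

```python
def merge_rules(rules):
    """rules: list of (X, a, e) with e a string. Merge all rules that
    share X and a into a single rule X → a (e1|e2|...)."""
    merged = {}
    for x, a, e in rules:
        merged.setdefault((x, a), []).append(e)
    return [(x, a, es[0] if len(es) == 1 else "(" + "|".join(es) + ")")
            for (x, a), es in merged.items()]


rules = [("N", "P", "Author"), ("N", "P", "(Editor,Editor*)")]
print(merge_rules(rules))
```

The merged grammar has the same runs as the original one, since a run only requires that some rule with the right left-hand side and terminal matches the children.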
- ...so, how did we get here? From DTDs and XML schemas!
Tree Grammars and DTDs
- tree grammars capture the basic, structural part of DTDs in a straightforward way (ignoring attributes)
- since DTDs don’t have “types”, just element names, they correspond to grammars of a peculiar, simple kind:
  ★ Tree grammars for DTDs are always local
- ...even if the DTD has a non-deterministic content model
  (such as <!ELEMENT N1 (M|(M,M))>, which is illegal and should be replaced with <!ELEMENT N1 (M,(M|ε))>, written (M,M?) in DTD syntax)
<!ELEMENT T (N1,N2*)>
<!ELEMENT N1 (M|(M,M))>
<!ELEMENT N2 (#PCDATA)>
<!ELEMENT M (#PCDATA)>

F = (N, Σ, S, P) with
  N = {T, N1, N2, M, pcdata}
  Σ = {T, N1, N2, M, pcdata}
  S = {T}
  P = { T → T (N1,N2*),
        N1 → N1 (M|(M,M)),
        N2 → N2 pcdata,
        M → M pcdata,
        pcdata → pcdata ε }
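The translation from a DTD to a tree grammar uses each element name both as non-terminal and as terminal, which is exactly why the result is local. A small sketch, assuming the DTD is already given as a dictionary from element names to content models (strings, with `pcdata` standing for #PCDATA). The function name and the dictionary format are assumptions for this example.

```python
def dtd_to_grammar(dtd, root):
    """dtd: element name -> content model. Returns (rules, start) where
    every rule has the form  name → name model."""
    rules = [(name, name, model) for name, model in dtd.items()]
    rules.append(("pcdata", "pcdata", "ε"))  # text nodes are leaves
    return rules, {root}


def is_local(rules):
    """Local: no two different non-terminals share a terminal."""
    seen = {}
    return all(seen.setdefault(a, x) == x for x, a, _ in rules)


dtd = {"T": "(N1,N2*)", "N1": "(M|(M,M))", "N2": "pcdata", "M": "pcdata"}
rules, start = dtd_to_grammar(dtd, "T")
for x, a, e in rules:
    print(f"{x} → {a} {e}")
print(is_local(rules))
```

The content model of N1 is non-deterministic, yet the grammar is still local: locality only looks at which non-terminals share a terminal, never at the content models.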
Content models in DTD and WXS - even more constraints!
- in DTDs and in WXS, content models are further restricted (for compatibility with SGML)
  – [DTD] deterministic (or 1-unambiguous),
    e.g., (M|(M,M)) is not deterministic, (M,(M|ε)) is.
    e.g., ((b, c) | (b, d)) is not deterministic, b,(c|d) is.
From http://www.w3.org/TR/REC-xml/:
As noted in 3.2.1 Element Content, it is required that content models in element type declarations be deterministic. This requirement is for compatibility with SGML (which calls deterministic content models "unambiguous"); XML processors built using SGML systems may flag non-deterministic content models as errors.
More formally: a finite state automaton may be constructed from the content model using the standard algorithms, e.g. algorithm 3.5 in section 3.9 of Aho, Sethi, and Ullman [Aho/Ullman]. In many such algorithms, a follow set is constructed for each position in the regular expression (i.e., each leaf node in the syntax tree for the regular expression); if any position has a follow set in which more than one following position is labeled with the same element type name, then the content model is in error and may be reported as an error.
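The follow-set test described in the quoted passage can be sketched in a few lines. The code below numbers the positions (leaves) of a content model, computes first and follow sets in the style of the Glushkov construction, and reports a model as non-deterministic if the first set or some follow set contains two positions with the same element name. The tuple encoding and function names are assumptions made for this example, and this is not the code of any actual XML processor.

```python
# Content models as nested tuples:
#   ("eps",)  ("sym", name)  ("seq", e1, e2)  ("alt", e1, e2)  ("star", e1)
def glushkov(e):
    names = []    # position number -> element name
    follow = {}   # position number -> set of positions that may follow it

    def walk(e):
        """Return (nullable, first positions, last positions) of e."""
        kind = e[0]
        if kind == "eps":
            return True, set(), set()
        if kind == "sym":
            p = len(names)
            names.append(e[1])
            follow[p] = set()
            return False, {p}, {p}
        if kind == "star":
            _, first, last = walk(e[1])
            for p in last:          # after a last position we may loop back
                follow[p] |= first
            return True, first, last
        n1, f1, l1 = walk(e[1])
        n2, f2, l2 = walk(e[2])
        if kind == "alt":
            return n1 or n2, f1 | f2, l1 | l2
        for p in l1:                # kind == "seq": e2 may follow the end of e1
            follow[p] |= f2
        return n1 and n2, (f1 | f2 if n1 else f1), (l1 | l2 if n2 else l2)

    _, first, _ = walk(e)
    return names, first, follow


def is_deterministic(e):
    names, first, follow = glushkov(e)
    for group in [first, *follow.values()]:
        labels = [names[p] for p in group]
        if len(labels) != len(set(labels)):
            return False
    return True


def sym(x):
    return ("sym", x)


M, b, c, d = sym("M"), sym("b"), sym("c"), sym("d")
print(is_deterministic(("alt", M, ("seq", M, M))))            # (M|(M,M))
print(is_deterministic(("seq", M, ("alt", M, ("eps",)))))     # (M,(M|ε))
print(is_deterministic(("alt", ("seq", b, c), ("seq", b, d))))
print(is_deterministic(("seq", b, ("alt", c, d))))
```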
Tree Grammars and DTDs
- so, DTDs are local (and thus single-type) because they don’t have types at all
  – and not because their content model is deterministic!
  – they are single-type even with a non-deterministic content model
- hence we could extend DTDs with types and still be single-type...provided we impose suitable restrictions
Tree Grammars and WXS
- tree grammars capture the basic, structural part of WXS:
  – types (complex and anonymous)
  - model groups (we ignore them)
  - derivation by extension and restriction (we ignore them)
  - substitution groups (we ignore them)
  - integrity constraints like keys (must be ignored, don’t fit into tree grammars)
- we only deal with simple XML schemas, but the general approach works for more
- to transform an XML schema S into a tree grammar G,
  1. we translate S into a generalized tree grammar
  2. then flatten the generalized tree grammar into a tree grammar G
- this will be done such that T validates against S iff T is accepted by G.
Translating WXS into Tree Grammars
- let S be a simple XML Schema
➡ for each top-level element in S of the form
  <xs:element name="mylist" type="Blist"></xs:element>
  - add the following production rule to your grammar
    – MYLIST → mylist BLIST^TYPE
    – add MYLIST, BLIST^TYPE to non-terminals, add mylist to terminals
➡ for each top-level element in S of the form
  <xs:element name="mylist">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="ename" type="Comp" maxOccurs="unbounded"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
  - add the following production rules to your grammar
    – MYLIST → mylist ENAME,ENAME*   (what is the default for minOccurs?)
    – ENAME → ename COMP^TYPE
    – add MYLIST, ENAME, COMP^TYPE to non-terminals, add mylist, ename to terminals
Translating WXS into Tree Grammars
➡ for each top-level complex type in S of the form
  <xs:complexType name="Blist">
    <xs:sequence>
      <xs:element name="friend" type="Person" minOccurs="1" maxOccurs="2"/>
    </xs:sequence>
  </xs:complexType>
  - add the following production rules to your grammar
    – BLIST^TYPE → (FRIEND|(FRIEND,FRIEND))   %% generalized rule: will be expanded!
    – FRIEND → friend PERSON^TYPE
    – add BLIST^TYPE, FRIEND, PERSON^TYPE to non-terminals, add friend to terminals
Translating WXS into Tree Grammars
➡ for each top-level complex type in S of the form
  <xs:complexType name="BBlist">
    <xs:choice>
      <xs:sequence>
        <xs:element name="A" type="xs:string"/>
        <xs:element name="B" type="xs:string"/>
      </xs:sequence>
      <xs:sequence>
        <xs:element name="A" type="xs:string"/>
        <xs:element name="C" type="xs:string"/>
      </xs:sequence>
    </xs:choice>
  </xs:complexType>
  %% UPA violation: Oxygen complains!
  - add the following production rules to your grammar
    – BBLIST^TYPE → (A,B)|(A,C)   %% generalized rule -- will be expanded!
    – A → A STRING^TYPE
    – B → B STRING^TYPE
    – C → C STRING^TYPE
    – add BBLIST^TYPE, A, B, C, STRING^TYPE to non-terminals, add A, B, C to terminals
Translating WXS into Tree Grammars - a complication
- Consider the following case:
  <xs:complexType name="AT">
    <xs:sequence>
      <xs:element name="N" type="Alist" minOccurs="0" maxOccurs="unbounded"/>
    </xs:sequence>
  </xs:complexType>
  <xs:complexType name="BT">
    <xs:sequence>
      <xs:element name="N" type="Blist" minOccurs="0" maxOccurs="unbounded"/>
    </xs:sequence>
  </xs:complexType>
- To handle cases like the one above we can’t always add rules
  – AT^TYPE → N*, BT^TYPE → N*
  – N → N ??LIST^TYPE
- Instead, we translate these as
  – AT^TYPE → N^AS^ALIST^TYPE*
  – BT^TYPE → N^AS^BLIST^TYPE*
  – N^AS^ALIST^TYPE → N ALIST^TYPE
  – N^AS^BLIST^TYPE → N BLIST^TYPE
Translating WXS into Tree Grammars - expanding rules
- our translation yields almost a tree grammar:
  - it produces illegal rules of the form X → e, i.e., without a terminal
    – e.g., BLIST^TYPE → (FRIEND|(FRIEND,FRIEND))
  - our grammar model doesn’t handle those (check the definition of a run)
๏ hence we expand these illegal rules:
  for each illegal rule X → e:
    – remove X → e from the rule set
    – replace all occurrences of X in the rule set with e
- e.g., MYLIST → mylist BLIST^TYPE would be transformed into
  – MYLIST → mylist (FRIEND|(FRIEND,FRIEND))
- ...and if we had <xs:element name="yourlist" type="Blist"/> then we would also have
  – YOURLIST → yourlist BLIST^TYPE and thus
  – YOURLIST → yourlist (FRIEND|(FRIEND,FRIEND))
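The expansion step is a textual substitution, which the following sketch spells out. It assumes that content models are strings, that legal rules are stored as X ↦ (a, e) and illegal ones as X ↦ e, and that the body of an illegal rule mentions only element non-terminals (which holds for the translation above, since a type is always reached through an element). The function name and data layout are made up for this illustration.

```python
import re


def expand(legal, illegal):
    """legal: {X: (a, e)}, illegal: {X: e}. Replace every occurrence of
    an illegal rule's left-hand side in a legal rule by its body."""
    result = {}
    for x, (a, e) in legal.items():
        for name, body in illegal.items():
            # match `name` only as a whole non-terminal, not inside a longer one
            pattern = re.compile(r"(?<![\w^])" + re.escape(name) + r"(?![\w^])")
            e = pattern.sub(lambda m, body=body: body, e)
        result[x] = (a, e)
    return result


legal = {"MYLIST": ("mylist", "BLIST^TYPE"),
         "YOURLIST": ("yourlist", "BLIST^TYPE"),
         "FRIEND": ("friend", "PERSON^TYPE")}
illegal = {"BLIST^TYPE": "(FRIEND|(FRIEND,FRIEND))"}

for x, (a, e) in expand(legal, illegal).items():
    print(f"{x} → {a} {e}")
```

PERSON^TYPE stays untouched here only because no rule for the type Person was given in the example.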
Translating WXS into Tree Grammars - expanding rules
- Expanding illegal rules even works with cyclic type definitions - try
  <xs:complexType name="NType">
    <xs:choice>
      <xs:element name="test2" type="AType"/>
      <xs:element name="EndElement" type="xs:string"/>
    </xs:choice>
  </xs:complexType>
  <xs:complexType name="AType">
    <xs:choice>
      <xs:element name="test1" type="NType"/>
      <xs:element name="EndElement" type="xs:string"/>
    </xs:choice>
  </xs:complexType>
- This gives you these rules, including 2 illegal rules
  NType^TYPE → (TEST2 | ENDELEMENT)
  TEST2 → test2 AType^TYPE
  ENDELEMENT → EndElement STRING^TYPE
  ...
  AType^TYPE → (TEST1 | ENDELEMENT)
  TEST1 → test1 NType^TYPE
  ENDELEMENT → EndElement STRING^TYPE
  ...
- that can be expanded as follows:
  TEST2 → test2 (TEST1 | ENDELEMENT)
  ENDELEMENT → EndElement STRING^TYPE
  ...
  TEST1 → test1 (TEST2 | ENDELEMENT)
  ENDELEMENT → EndElement STRING^TYPE
  ...
Translating WXS into Tree Grammars - all together
- So, to transform an XML schema S into a tree grammar G,
  1. we translate S into a generalized tree grammar G’
  2. then expand G’ into a tree grammar G
  ★ Then any tree T validates against S iff T is accepted by G.
- So, what are the tree grammars we get as results?
  – they are tree grammars
  – are they single-type?
  – are they local?
  ★ Tree grammars corresponding to WXS are, in general, not local.
- E.g., consider
  – N^AS^ALIST^TYPE → N ALIST^TYPE
  – N^AS^BLIST^TYPE → N BLIST^TYPE
  – ...N^AS^ALIST^TYPE and N^AS^BLIST^TYPE are competing!
Translating WXS into Tree Grammars - all together
★ Tree grammars corresponding to WXS are single-type.
- This is ensured by the Element Declarations Consistent constraint in WXS.
- Together with the fact that DTDs are local, we thus know that
  ★ DTDs are strictly less expressive than XML schemata.
- That is, there are tree languages that we can describe in WXS, but not in DTDs, e.g., the language of
  N = {Book, Author, Editor, Affilia, Paper, F, L}
  Σ = {B, N, F, L, A, P}
  S = {Book, Paper}
  P = { Book → B Editor|Author,
        Paper → P Author,
        Editor → N F,L,
        Author → N L,Affilia,
        F → F ε,
        L → L ε,
        Affilia → A ε }
[Figure: two trees accepted by this grammar, each with nodes ε, 0, 0,0, 0,1: B(N(F, L)) and P(N(L, A))]
Content models and types in DTD and WXS
- (we already know that) in WXS, we have a type hierarchy
  – an element of a type X derived by restriction or extension from Y can be used in place of an element of type Y
  – we call this ‘named’ typing:
    - sub-types are declared (restriction or extension), and not inferred (by comparing structure)
  – in DTDs, we don’t have types!
- In order to prevent difficulties in WXS as caused by types, the Element Declarations Consistent constraint (seen last week) is imposed:
  <xs:complexType>
    <xs:sequence>
      <xs:element name="person" type="NewPersonType" minOccurs="0" maxOccurs="1"/>
      <xs:element name="person" type="OldPersonType" minOccurs="0" maxOccurs="1"/>
    </xs:sequence>
  </xs:complexType>
Content models in DTD and WXS - even more constraints!
- in DTDs and in WXS, content models are further restricted (for compatibility with SGML)
  – [WXS] Unique Particle Attribution constraint: A content model must be formed such that
    - during validation of an element information item sequence (child node sequence, cns),
    - the particle component contained directly, indirectly or implicitly therein (in the content model)
    - with which to attempt to validate each item in the sequence (cns) in turn can be uniquely determined
    - without examining the content or attributes of that item (the element name suffices), and without any information about the items in the remainder of the sequence (no look-ahead into the rest of the cns).
- satisfying these restrictions is
  – tricky (especially when combined with types as in WXS) and
  – perhaps not necessary? For validation?...let’s see what others do...
Relax NG: yet another schema language for XML
- Relax NG was designed to be a simpler schema language
- (described in a readable on-line book by Eric Van der Vlist)
- and allows us to describe (valid) XML documents in terms of their tree abstractions:
  – no default attributes
  – no entity declarations
  – no key/uniqueness constraints
  – minimal datatypes: only “token” and “string” like DTDs (but a mechanism to use XSD datatypes)
- since it is so simple/flexible
  – it’s (claimed to be) easy to use
  – it doesn’t have complex constraints on the description of element content like determinism/1-unambiguity
  – it’s claimed to be reliable
  – but you need other tools to do other things (like datatypes and attributes)
Relax NG: another side of Determinism
- remember that DTDs and WXS required their content models to be
  – [DTD] deterministic (and thus look-ahead-free)
  – [WXS] deterministic (EDC, every matching child node sequence matches in exactly one way only)
  – [WXS] the UPA constraint expresses both, and other constraints even more
- determinism & single-typeness have a reason:
  – some tools annotate a (valid) document while parsing:
    - type information -- to be exploited, e.g., for concise queries (remember assignment?)
    - default attribute values
  – if your schema is not single-type, then
    - tools validating the same document against the same schema may construct different PSVIs
    - this can happen with different tools or different runs of the same tool
Relax NG: another side of Validation
Reasons why one would want to validate an XML document:
- ensure that structure is ok
- ensure that values in elements/attributes are of the correct type
- generate PSVI to work with
- check constraints on co-occurrence of elements/how they are related
- check other integrity constraints, e.g., a person’s age vs. their mother’s age
- check constraints on elements/their value against external data
  – postcode correctness
  – VAT/tax/other numeric constraints
  – spell checking
...only few of these checks can be carried out by validating against schemas...
Relax NG was designed to
  1. validate structure and
  2. link to datatype validators to type check values of elements/attributes
Relax NG: basic principles
- both DTDs and XSD allow the user to describe documents
  – by descriptions of their elements and attributes, e.g., an element “person” must have two element child nodes, name and address, and ....
- Relax NG is based on patterns (similar to XPath expressions):
  – a pattern is a description of a set of valid node sets
  – we can view our example as different combinations of different parts, and design patterns for each
  – enhanced flexibility

<?xml version="1.0" encoding="UTF-8"?>
<people>
  <person age="41">
    <name><first>Harry</first> <last>Potter</last></name>
    <address>4 Main Road</address>
    <project type="epsrc" id="1">DeCompO</project>
    <project type="eu" id="3">TONES</project>
  </person>
  <person> .....
</people>
Relax NG: good to know
Relax NG comes in 2 syntaxes
- the compact syntax
  – succinct
  – human readable
- the XML syntax
  – verbose
  – machine readable
Trang converts between the two, phew! (and also into/from other schema languages)
Trang can be used from Oxygen

grammar {
  start = element name {
    element first { text },
    element last { text } }}

<grammar xmlns="http:..." xmlns:a="http:.." datatypeLibrary="http:...">
  <start>
    <element name="name">
      <element name="first"><text/></element>
      <element name="last"><text/></element>
    </element>
  </start>
</grammar>
Relax NG - structure validation:
- 3 kinds of patterns, for the 3 “central” nodes:
  – text
      <text/>
  – attribute
      <attribute name="age"/>
      <attribute name="type"/>
  – element
      <element name="name">
        <element name="first"><text/></element>
        <element name="last"><text/></element>
      </element>
    or, in compact syntax,
      element name {
        element first { text },
        element last { text }}
- these can be combined
  – ordered groups
  – unordered groups
  – choices
- we can constrain cardinalities of patterns
- text nodes
  – can be marked as “data” and linked to datatype libraries
- we can specify libraries of patterns
Relax NG - structure validation: ordered groups
- we can name patterns
- in strange “chains”
- we can use ?, *, and + (use “?” if optional):

grammar {
  start = people-element
  people-element = element people { person-element+ }
  person-element = element person {
    attribute age { text },
    name-element,
    address-element+,
    project-element* }
  name-element = element name {
    element first { text },
    element middle { text }?,
    element last { text } }
  address-element = element address { text }
  project-element = element project {
    attribute type { text },
    attribute id { text },
    text }}
Relax NG - structure validation: ordered groups in XML syntax
<?xml version="1.0" encoding="UTF-8"?>
<grammar xmlns="http://relaxng.org/ns/structure/1.0">
  <start>
    <element name="people"><ref name="people-content"/></element>
  </start>
  <define name="people-content">
    <oneOrMore>
      <element name="person"><ref name="person-content"/></element>
    </oneOrMore>
  </define>
  <define name="person-content">
    <attribute name="age"/>
    <element name="name"><ref name="name-content"/></element>
    <oneOrMore>
      <element name="address"><text/></element>
    </oneOrMore>
    <zeroOrMore>
      <element name="project"><ref name="project-content"/></element>
    </zeroOrMore>
  </define>
  <define name="name-content">
    <element name="first"><text/></element>
    <optional><element name="middle"><text/></element></optional>
    <element name="last"><text/></element>
  </define>
  <define name="project-content">
    <attribute name="type"/><attribute name="id"/><text/>
  </define>
</grammar>
Relax NG - structure validation: different styles
- so far, we modelled ‘element centric’ (one named pattern per element, as in the grammar above)...we can also model ‘content centric’ (one named pattern per content model):

grammar {
  start = element people { people-content }
  people-content = element person { person-content }+
  person-content =
    attribute age { text },
    element name { name-content },
    element address { text }+,
    element project { project-content }*
  name-content =
    element first { text },
    element middle { text }?,
    element last { text }
  project-content =
    attribute type { text },
    attribute id { text },
    text }
Relax NG - structure validation: ordered groups
- we can combine patterns in fancy ways:

<?xml version="1.0" encoding="UTF-8"?>
<people>
  <person age="41" phone="12567">
    <name><first>Harry</first> <last>Potter</last></name>
    <address>4 Main Road</address>
    <project type="epsrc" id="1">DeCompO</project>
    <project type="eu" id="3">TONES</project>
  </person>
  <person> .....
</people>

grammar {
  start = element people { people-content }
  people-content = element person { person-content }+
  person-content = HR-stuff, contact-stuff
  HR-stuff = attribute age { text }, project-content
  contact-stuff =
    attribute phone { text },
    element name { name-content },
    element address { text }
  name-content =
    element first { text },
    element middle { text }?,
    element last { text }
  project-content = element project {
    attribute type { text },
    attribute id { text },
    text }+ }
Relax NG: structure validation summary
- Relax NG’s specification of structure differs from DTDs and XSD:
  – grammar oriented
  – 2 syntaxes with automatic translation
  – flexible: we can gather different aspects of elements into different patterns
  – unconstrained: no constraints regarding unambiguity/1-unambiguity/deterministic content models/Unique Particle Attribution/Element Declarations Consistent
  – like for XSD, we have an “ALL” construct for unordered groups, “interleave” &
- here, the patterns must appear in the specified order (except for attributes, which are allowed to appear in any order in the start tag):
  element person {
    attribute age { text },
    attribute phone { text },
    name-element,
    address-element+,
    project-element* }
- here, the patterns can appear in any order:
  element person {
    attribute age { text } &
    attribute phone { text } &
    name-element &
    address-element+ &
    project-element* }
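For the special case shown above, where every operand of & is a single element pattern with a cardinality and all element names are different, interleaving means that only the number of occurrences matters, not their order. The sketch below checks exactly this special case. It is not a general Relax NG validator, and the function name and the `bounds` table are assumptions made for illustration.

```python
from collections import Counter


def check_interleave(children, bounds):
    """children: list of child element names, in document order.
    bounds: name -> (min, max), with max None meaning unbounded."""
    counts = Counter(children)
    if set(counts) - set(bounds):  # an element that no pattern allows
        return False
    return all(lo <= counts[name] and (hi is None or counts[name] <= hi)
               for name, (lo, hi) in bounds.items())


# name-element & address-element+ & project-element*
person = {"name": (1, 1), "address": (1, None), "project": (0, None)}

print(check_interleave(["project", "address", "name", "address"], person))
print(check_interleave(["name", "project"], person))
```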
Relax NG: text value constraints
- Relax NG is mostly about a document’s structure
- it provides only 2 datatypes for text nodes and attributes, and they differ only in whitespace handling:
  – string: the usual string, without whitespace normalization
  – token: for strings separated by whitespace, with whitespace normalization

attribute coursetype1 { token "UG" | token "taught PG" | token "research PG" }
attribute coursetype2 { string "UG" | string "taught PG" | string "research PG" }

<student coursetype1="taught PG"/> is ok,
<student coursetype2="taught PG"/> is ok, but
<student coursetype2=" taught PG "/> is not
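The difference between the two built-in datatypes can be imitated in a few lines: a token value is compared after whitespace normalization (leading and trailing whitespace dropped, inner runs collapsed), a string value is compared as is. A rough sketch with made-up helper names:

```python
def token_equal(value, literal):
    """Compare as Relax NG `token`: whitespace is normalized first."""
    return value.split() == literal.split()


def string_equal(value, literal):
    """Compare as Relax NG `string`: no normalization at all."""
    return value == literal


allowed = ["UG", "taught PG", "research PG"]
for value in ["taught PG", " taught PG "]:
    as_token = any(token_equal(value, lit) for lit in allowed)
    as_string = any(string_equal(value, lit) for lit in allowed)
    print(repr(value), as_token, as_string)
```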
Relax NG: text value constraints
- as in the previous example, we can enumerate “legal” values
- but we can also specify “illegal values”
- and build lists of values
- ...but not much else...without datatype libraries
attribute coursetype1 { token "UG" | token "taught PG" | token "research PG" }
attribute coursetype2 { token - (string "UG" | string "MG") }
attribute courselist { list { token+ } }
attribute shortcourselist { list { token, token?, token?, token? } }
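As a rough reading of these patterns: `list` splits the attribute value at whitespace and matches the resulting tokens against the pattern inside, and `-` excludes the values that match the pattern after it. The sketch below imitates this for the attributes above. The function names are hypothetical and the code is an approximation, not a Relax NG implementation.

```python
def valid_courselist(value):
    """list { token+ }: at least one whitespace-separated token."""
    return len(value.split()) >= 1


def valid_shortcourselist(value):
    """list { token, token?, token?, token? }: between 1 and 4 tokens."""
    return 1 <= len(value.split()) <= 4


def valid_coursetype2(value, excluded=("UG", "MG")):
    """token - (string "UG" | string "MG"): anything except the values
    that are exactly "UG" or "MG" (string comparison, no normalization)."""
    return value not in excluded


print(valid_shortcourselist("COMP6037 COMP6012"))
print(valid_shortcourselist("a b c d e"))
print(valid_coursetype2("UG"), valid_coursetype2("PG"))
```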
Relax NG: linking to datatype libraries
- Relax NG provides a mechanism to plug in external libraries for the type validation of the content of text nodes and attributes
  – not limited to a specific set of libraries
  – WXS datatypes as a natural choice, used in our examples:
- you know them already and
- they are quite powerful
– there also exists a DTD compatibility library to check
- whether ID values are unique within a document
- that IDREF is a reference to ID value present in document
- that IDREFS is a list of references to ID values defined in document
- ...obviously, the latter requires far more than a type validation,
hence this library is more than a datatype library
datatypes dtd = "http://relaxng.org/ns/compatibility/datatypes/1.0"
element book {
  attribute id { dtd:ID },
  attribute date { xsd:date },
  attribute references { dtd:IDREFS } }
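The ID/IDREFS checks go beyond per-value type checking because they look at the whole document. A minimal sketch of what such a check involves, using Python's standard xml.etree.ElementTree. The attribute names `id` and `references` follow the book pattern above, while the wrapper element `library` and the function name are made up for this example.

```python
import xml.etree.ElementTree as ET


def check_ids(xml_text, id_attr="id", refs_attr="references"):
    """Return (ids_unique, dangling_refs) for the whole document."""
    root = ET.fromstring(xml_text)
    ids = [el.get(id_attr) for el in root.iter() if el.get(id_attr) is not None]
    # IDREFS is a whitespace-separated list of references
    refs = [r for el in root.iter() for r in el.get(refs_attr, "").split()]
    return len(ids) == len(set(ids)), sorted(set(refs) - set(ids))


doc = """<library>
  <book id="b1" date="2009-01-01" references="b2"/>
  <book id="b2" date="2009-01-02" references="b1 b3"/>
</library>"""
print(check_ids(doc))
```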
Translating Relax NG in Tree Grammars...
- we will look into this next week