 
COMP6037 Semi-structured Data and the Web
Tree Grammars and Relax NG, week 3
Uli Sattler
University of Manchester
General Stuff
• Read Blackboard’s Announcements
• Read Blackboard’s Discussions
• Forward your Blackboard email to your email account
• Start early with your coursework... trying to figure out what to do on Sunday night might be difficult!
Schema languages...they are sooo different!
• I hope you understand that
  – there are various issues involved in the validation of a document
  – validation and data quality are closely related
  – structure and content are different aspects of a document
  – people have built numerous formalisms & tools to help you validate your documents
    • with various applications in mind
    • with various goals re. generality/complexity/simplicity
    • with various expressive means/restrictions to describe
      – datatypes
      – structure
    • with various object-oriented mechanisms such as inheritance for ease of authoring & maintaining
  – which of them to choose (first/how to mix) depends on your application
• I invite you to compare this with other areas, e.g., parser generators
Schema languages...they are sooo different!
• So far, you know 2 XML schema languages:
  – DTDs
  – WXS (W3C XML Schema)
• There are many more for XML
  – SOX, SGML, RelaxNG, Schematron, …
• They differ in
  1. their style
  2. usability
  3. computability
     • how complex is it to validate documents?
  4. expressive power (which is closely related to (3))
     • data type support
     • structural expressiveness
       – what kind of trees can I describe?
     • uniqueness constraints
       – ...
A comparison: who can describe structure better?
• Obvious question:
  ➡ which of DTD, WXS, and RelaxNG has more powerful means to describe structure?
  – how to answer this?
  – perhaps they are orthogonal?
  – how to compare DTD (no types) with WXS and RelaxNG (later)?
  – how do we know how costly validation is?
    • e.g., is validating a document D against a DTD more expensive than validating a document against an XML Schema?
    • “Ah, let’s build a DTD validator and a WXS validator and compare run-time and memory requirements for similar (?) inputs on the same document!”
    • ...how can we make sure that we have been equally clever for both validators?
    • ...what are “similar inputs”?
    • ...how many do we need to test for?
Interesting example where DTDs and WXS differ
• assume we want to define an element NameType in a DTD….

    <xs:complexType name="NameType">
      <xs:all>
        <xs:element name="firstname" type="xs:string"/>
        <xs:element name="secondname" type="xs:string"/>
        <xs:element name="lastname" type="xs:string"/>
      </xs:all>
    </xs:complexType>

• how?
• ….it’s unmanageable and long (how?), but possible
• are there other examples where something is not possible in DTD? Or in WXS?
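One way to see why the DTD version gets long: xs:all lets the three children appear in any order, and a DTD content model has no interleaving operator, so every order has to be spelled out. Below is a minimal Python sketch that generates such a content model. The helper names all_as_dtd_model and element_decl are hypothetical, not from the slides. The alternatives are factored by their first child so that no two of them start with the same element, on the assumption that we want a deterministic content model.

```python
def all_as_dtd_model(names):
    """Content model that accepts each name exactly once, in any order."""
    if len(names) == 1:
        return names[0]
    branches = []
    for i, first in enumerate(names):
        rest = names[:i] + names[i + 1:]
        sub = all_as_dtd_model(rest)
        if len(rest) > 1:
            sub = f"({sub})"  # parenthesise the nested choice
        branches.append(f"({first},{sub})")
    return "|".join(branches)


def element_decl(element, names):
    """Wrap the generated content model in a DTD element declaration."""
    return f"<!ELEMENT {element} ({all_as_dtd_model(names)})>"


decl = element_decl("NameType", ["firstname", "secondname", "lastname"])
print(decl)
```

With n children the model has to cover all n! orders, which is where "unmanageable" comes from.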
A comparison: who can describe structure better?
• Use formal methods:
  1. view XML document D as a simplified DOM tree T_D
     • capture a schema S in a grammar/automaton G_S
     • such that T_D ∈ L(G_S) if and only if D validates against S
     • i.e., validation corresponds to acceptance by automaton
  2. schema languages of type X ⇋ grammars/automata of type C_X
  3. check: for each G_{S1} of type C_X capturing a schema S1 of type X, can we build G_{S2} of type C_Y capturing a schema S2 of type Y, such that L(G_{S1}) = L(G_{S2})? If yes, then Y is at least as expressive as X.
• ...so, let’s do this: we will see definitions and lemmas/theorems
  ๏ definitions fix the meaning of terms in an unambiguous way (and you need to understand them to follow the rest)
  ★ lemmas and theorems state properties of the concepts introduced
• we will see and work on examples
Basics - regular expressions
๏ Given a set of symbols N, the set of regular expressions regexp(N) over N is the smallest set containing
  – the empty string ε and all symbols in N and
  – if e_1 and e_2 ∈ regexp(N), then so are
    • e_1,e_2 (concatenation)
    • e_1|e_2 (choice)
    • e_1* (repetition)
๏ Given a regular expression e, a string w matches e,
  – if w = ε = e or w = n = e for some n in N, or
  – if w = w_1 w_2 and e = (e_1, e_2) and w_1 matches e_1 and w_2 matches e_2, or
  – if e = (e_1 | e_2) and w matches e_1 or w matches e_2, or
  – if w = ε and e = e_1*, or
  – if w = w_1 w_2 ... w_n and e = e_1* and each w_i matches e_1
Basics - regular expressions
• Hence we can use
  – e+ as abbreviation for (e,e*)
  – e? as abbreviation for (e|ε)
• Test:
  – does ababa match (a, b)*?
  – does ababa match (a | b)*?
  – does abababa match (a,b)*, a?, b, a, b*?
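The definition of "matches" can be turned into a small checker. This is a sketch only: the tuple encoding of expressions and the names ends and matches are assumptions made for illustration. It follows the cases of the definition one by one and lets you try the test questions yourself.

```python
EPS = ("eps",)                       # the empty string ε

def sym(n): return ("sym", n)        # a symbol n in N
def cat(e1, e2): return ("cat", e1, e2)   # e1,e2
def alt(e1, e2): return ("alt", e1, e2)   # e1|e2
def star(e): return ("star", e)      # e*
def plus(e): return cat(e, star(e))  # e+ abbreviates (e,e*)
def opt(e): return alt(e, EPS)       # e? abbreviates (e|ε)


def ends(e, w, i):
    """Set of positions j such that w[i:j] matches e."""
    kind = e[0]
    if kind == "eps":
        return {i}
    if kind == "sym":
        return {i + 1} if i < len(w) and w[i] == e[1] else set()
    if kind == "cat":
        return {k for j in ends(e[1], w, i) for k in ends(e[2], w, j)}
    if kind == "alt":
        return ends(e[1], w, i) | ends(e[2], w, i)
    if kind == "star":
        # zero or more repetitions: collect every reachable end position
        reached, frontier = {i}, {i}
        while frontier:
            frontier = {k for j in frontier for k in ends(e[1], w, j)} - reached
            reached |= frontier
        return reached
    raise ValueError(f"unknown expression: {e!r}")


def matches(w, e):
    """True if the whole sequence of symbols w matches e."""
    return len(w) in ends(e, w, 0)


a, b = sym("a"), sym("b")
tests = [
    ("ababa", star(cat(a, b))),
    ("ababa", star(alt(a, b))),
    ("abababa", cat(star(cat(a, b)), cat(opt(a), cat(b, cat(a, star(b)))))),
]
for w, e in tests:
    print(w, matches(list(w), e))
```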
Towards trees...
• A regular expression e describes a set of strings L(e):
  – L(e) := { w | w matches e }
  – L(e) can be finite, infinite,...can it be empty?
• A schema S describes a set of trees L(S):
  – L(S) := { t | t validates against S }
  – L(S) can be finite, infinite, empty,….can it be empty?
• As a simplification, we will look into Tree Grammars:
• A tree grammar G describes a set of trees L(G):
  – L(G) := { t | G accepts t }
Trees: nodes as strings!
[Figure: "A tree" / "A tree over {A,B,C}" / "A tree with nodes as strings". Nodes and labels: ε ↦ B; 0 ↦ A, 1 ↦ A; 0,0 ↦ B, 0,1 ↦ A, 0,2 ↦ B; 1,0 ↦ B]
Trees: nodes as strings!
[Figure: the example tree, each node shown with its string and its label: ε ↦ B; 0 ↦ A, 1 ↦ A; 0,0 ↦ B, 0,1 ↦ A, 0,2 ↦ B; 1,0 ↦ B]
๏ We use ℕ for the non-negative integers (including 0)
๏ we use ℕ* for the set of all (finite) strings over ℕ
  • ε is used for the empty string
  • 0,1,0 is a string of length 3
  • each string stands for a node
๏ An alphabet is a finite set of symbols
๏ A tree T over an alphabet Σ is a (partial) mapping T: ℕ* → Σ with a domain that
  ๏ is finite (i.e., T(w) is defined for only finitely many strings w over ℕ)
  ๏ contains ε (i.e., T(ε) is defined)
  ๏ is prefix-closed (i.e., if T(w,n) is defined, then T(w) is as well)
• Explanation:
  • the strings in the domain of T represent T’s nodes
  • (w,n) is a successor (child) of w,
  • T(w) is the label of w (as shown in picture)
  • we use nodes(T) for the (finite) domain of/nodes in T
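To make the definition concrete, here is a sketch of a tree as a Python dict from node strings (tuples of non-negative integers) to labels. The names is_tree and children are hypothetical. A dict is finite by construction, so only the other two conditions on the domain are checked, plus that the labels come from Σ.

```python
def is_tree(T, alphabet):
    """Check the conditions on a tree T over the given alphabet."""
    if () not in T:                        # domain contains ε
        return False
    for node, label in T.items():
        if label not in alphabet:          # T maps into Σ
            return False
        if any(not isinstance(n, int) or n < 0 for n in node):
            return False                   # nodes are strings over ℕ
        if node and node[:-1] not in T:    # prefix-closed: parent is defined
            return False
    return True


def children(T, w):
    """Successors (w,n) of node w, in increasing order of n."""
    return sorted(v for v in T if len(v) == len(w) + 1 and v[:len(w)] == w)


# The example tree from the picture: ε is (), the node 0,1 is (0, 1), etc.
T = {
    (): "B",
    (0,): "A", (1,): "A",
    (0, 0): "B", (0, 1): "A", (0, 2): "B",
    (1, 0): "B",
}
print(is_tree(T, {"A", "B", "C"}))
print(children(T, (0,)))
```

Checking only the immediate parent of each node is enough for prefix-closure, since the check is applied to every node in the domain.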
Tree Grammars: definition
๏ A tree grammar is a structure G = (N, Σ, S, P) where
  – N is a finite set of nonterminal symbols,
  – Σ is an alphabet,
  – S ⊆ N is a set of start symbols, and
  – P is a set of production rules, i.e., each R ∈ P is of the form
    • X → a e
    • where X ∈ N, a ∈ Σ, and e ∈ regexp(N)
• Example: G = (N, Σ, S, P) with
    N = {Book, Author, Editor, Affiliation, Paper, Country}
    Σ = {B, Name, A, P, C}
    S = {Book, Paper}
    P = { Book → B Editor,
          Paper → P Author,
          Editor → Name Country,
          Author → Name Affiliation,
          Country → C ε,
          Affiliation → A ε }
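The example grammar can be written down as plain data. This is a sketch under assumptions: the Rule type, the tuple encoding of regexps, and the function is_tree_grammar are names made up here. It only checks the side conditions of the definition (S ⊆ N, X ∈ N, a ∈ Σ, e ∈ regexp(N)).

```python
from typing import NamedTuple

EPS = ("eps",)                       # ε

def sym(n): return ("sym", n)        # a nonterminal used inside a regexp


class Rule(NamedTuple):
    lhs: str        # X in N
    label: str      # a in Σ
    content: tuple  # e in regexp(N)


def symbols_of(e):
    """All symbols occurring in the regexp e."""
    if e[0] == "sym":
        return {e[1]}
    return set().union(*(symbols_of(sub) for sub in e[1:]))


def is_tree_grammar(N, Sigma, S, P):
    """Check S ⊆ N and that every rule X → a e has X ∈ N, a ∈ Σ, e over N."""
    return S <= N and all(
        r.lhs in N and r.label in Sigma and symbols_of(r.content) <= N
        for r in P
    )


N = {"Book", "Author", "Editor", "Affiliation", "Paper", "Country"}
Sigma = {"B", "Name", "A", "P", "C"}
S = {"Book", "Paper"}
P = [
    Rule("Book", "B", sym("Editor")),
    Rule("Paper", "P", sym("Author")),
    Rule("Editor", "Name", sym("Country")),
    Rule("Author", "Name", sym("Affiliation")),
    Rule("Country", "C", EPS),
    Rule("Affiliation", "A", EPS),
]
print(is_tree_grammar(N, Sigma, S, P))
```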
Tree Grammars: what do they do?
    N = {Book, Author, Editor, Affiliation, Paper, Country}
    Σ = {B, Name, A, P, C}
    S = {Book, Paper}
    P = { Book → B Editor,
          Paper → P Author,
          Editor → Name Country,
          Author → Name Affiliation,
          Country → C ε,
          Affiliation → A ε }        (only super simple regexps)
• a grammar runs on a tree...
• given a tree T over Σ, if G can run on T, then
  – G is said to accept T,
  – written T ∈ L(G),
  – i.e., T is in the language accepted by G
• remember: grammar G corresponds to a schema S_G, and we want to build G such that T ∈ L(G) holds iff T validates against S_G
• ...let’s see a grammar run...
Tree Grammars: definition of runs
๏ A run of G = (N, Σ, S, P) on a tree T is a mapping r: nodes(T) → N such that:
  – r(ε) ∈ S   % r labels the root node of T with a start symbol
  – for each w ∈ nodes(T) with children w_1 w_2 ... w_n, there exists a rule X → a e ∈ P such that
    • r(w) = X,
    • T(w) = a, and
    • r(w_1) r(w_2) ... r(w_n) matches e.
• let’s see an example run of
    N = {Book, Author, Editor, Affilia, Paper, F, L}
    Σ = {B, P, Name, F, L, A}
    S = {Book, Paper}
    P = { Book → B Editor|Author,
          Paper → P Author,
          Editor → Name F,L,
          Author → Name L,Affilia,
          F → F ε,
          L → L ε,
          Affilia → A ε }
[Figure: example run on the tree ε ↦ B, 0 ↦ Name, 0,0 ↦ F, 0,1 ↦ L: r(ε) = Book (Paper is crossed out ✘), r(0) = Editor, r(0,0) = F, r(0,1) = L]
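The two conditions on a run can be checked mechanically. Below is a sketch, assuming trees are dicts from tuples to labels, rules are triples (X, a, e), and regexps are nested tuples. All function names are hypothetical, and a small matcher is included only so that the block runs on its own. The tree and the run are the ones from the example picture.

```python
EPS = ("eps",)
def sym(n): return ("sym", n)
def cat(e1, e2): return ("cat", e1, e2)
def alt(e1, e2): return ("alt", e1, e2)


def ends(e, w, i):
    """Positions j such that w[i:j] matches e (star is handled as well)."""
    if e[0] == "eps":
        return {i}
    if e[0] == "sym":
        return {i + 1} if i < len(w) and w[i] == e[1] else set()
    if e[0] == "cat":
        return {k for j in ends(e[1], w, i) for k in ends(e[2], w, j)}
    if e[0] == "alt":
        return ends(e[1], w, i) | ends(e[2], w, i)
    reached, frontier = {i}, {i}          # remaining case: ("star", e1)
    while frontier:
        frontier = {k for j in frontier for k in ends(e[1], w, j)} - reached
        reached |= frontier
    return reached


def matches(w, e):
    return len(w) in ends(e, w, 0)


def children(T, w):
    return sorted(v for v in T if len(v) == len(w) + 1 and v[:len(w)] == w)


def is_run(G, T, r):
    """Check that r: nodes(T) → N is a run of G on T."""
    N, Sigma, S, P = G
    if set(r) != set(T) or r[()] not in S:        # r(ε) ∈ S
        return False
    for w in T:
        word = [r[v] for v in children(T, w)]     # r(w_1) ... r(w_n)
        if not any(X == r[w] and a == T[w] and matches(word, e)
                   for (X, a, e) in P):
            return False
    return True


G = (
    {"Book", "Author", "Editor", "Affilia", "Paper", "F", "L"},
    {"B", "P", "Name", "F", "L", "A"},
    {"Book", "Paper"},
    [
        ("Book", "B", alt(sym("Editor"), sym("Author"))),
        ("Paper", "P", sym("Author")),
        ("Editor", "Name", cat(sym("F"), sym("L"))),
        ("Author", "Name", cat(sym("L"), sym("Affilia"))),
        ("F", "F", EPS), ("L", "L", EPS), ("Affilia", "A", EPS),
    ],
)
T = {(): "B", (0,): "Name", (0, 0): "F", (0, 1): "L"}
r = {(): "Book", (0,): "Editor", (0, 0): "F", (0, 1): "L"}
print(is_run(G, T, r))
print(is_run(G, T, {**r, (): "Paper"}))   # the crossed-out labelling
```

Leaves are covered by the same check: a leaf has no children, so the word is empty and only a rule with content ε (or another expression matching the empty string) applies.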
Tree Grammars: trees accepted
๏ A tree T is accepted by a grammar G if there is a run of G on T
  – we write T ∈ L(G).
• which of these trees is accepted by our grammar?
[Figure: three trees with roots labelled P, B, and P; in each, node 0 is labelled Name and has two leaf children 0,0 and 0,1, with leaf labels drawn from A, L, F]
    N = {Book, Author, Editor, Affilia, Paper, F, L}
    Σ = {B, Name, F, L, A, P}
    S = {Book, Paper}
    P = { Book → B Editor|Author,
          Paper → P Author,
          Editor → Name F,L,
          Author → Name L,Affilia,
          F → F ε,
          L → L ε,
          Affilia → A ε }
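Acceptance asks whether some run exists, so we do not want to guess r by hand. One possible approach, not taken from the slides, is to work bottom-up and compute for each node the set of nonterminals that some run could assign to it. The sketch below uses hypothetical names throughout. The three trees are illustrative assumptions, not necessarily the ones in the picture.

```python
EPS = ("eps",)
def sym(n): return ("sym", n)
def cat(e1, e2): return ("cat", e1, e2)
def alt(e1, e2): return ("alt", e1, e2)


def ends(e, sets, i):
    """Positions j such that some choice of one nonterminal from each of
    sets[i:j] gives a string matching e."""
    if e[0] == "eps":
        return {i}
    if e[0] == "sym":
        return {i + 1} if i < len(sets) and e[1] in sets[i] else set()
    if e[0] == "cat":
        return {k for j in ends(e[1], sets, i) for k in ends(e[2], sets, j)}
    if e[0] == "alt":
        return ends(e[1], sets, i) | ends(e[2], sets, i)
    reached, frontier = {i}, {i}          # remaining case: ("star", e1)
    while frontier:
        frontier = {k for j in frontier for k in ends(e[1], sets, j)} - reached
        reached |= frontier
    return reached


def children(T, w):
    return sorted(v for v in T if len(v) == len(w) + 1 and v[:len(w)] == w)


def possible(G, T, w=()):
    """Nonterminals X such that some run on the subtree at w has r(w) = X."""
    N, Sigma, S, P = G
    kid_sets = [possible(G, T, v) for v in children(T, w)]
    return {X for (X, a, e) in P
            if a == T[w] and len(kid_sets) in ends(e, kid_sets, 0)}


def accepts(G, T):
    """T ∈ L(G): the root can be labelled with a start symbol."""
    return bool(possible(G, T) & G[2])


G = (
    {"Book", "Author", "Editor", "Affilia", "Paper", "F", "L"},
    {"B", "Name", "F", "L", "A", "P"},
    {"Book", "Paper"},
    [
        ("Book", "B", alt(sym("Editor"), sym("Author"))),
        ("Paper", "P", sym("Author")),
        ("Editor", "Name", cat(sym("F"), sym("L"))),
        ("Author", "Name", cat(sym("L"), sym("Affilia"))),
        ("F", "F", EPS), ("L", "L", EPS), ("Affilia", "A", EPS),
    ],
)
# Illustrative trees (assumed for this sketch)
paper_ok = {(): "P", (0,): "Name", (0, 0): "L", (0, 1): "A"}
book_ok = {(): "B", (0,): "Name", (0, 0): "F", (0, 1): "L"}
paper_bad = {(): "P", (0,): "Name", (0, 0): "F", (0, 1): "L"}
for T in (paper_ok, book_ok, paper_bad):
    print(accepts(G, T))
```

The bottom-up sets are enough because the conditions on a run only relate a node to its own children, so the choices in different subtrees do not interfere.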