
COMP6037 Semi-structured Data and the Web Tree Grammars and Relax - PowerPoint PPT Presentation

COMP6037 Semi-structured Data and the Web Tree Grammars and Relax NG, week 3 Uli Sattler University of Manchester 1 General Stuff Read Blackboards Announcements Read Blackboards Discussions Forward your Blackboard email


  1. COMP6037 Semi-structured Data and the Web: Tree Grammars and Relax NG, week 3. Uli Sattler, University of Manchester

  2. General Stuff
     • Read Blackboard’s Announcements
     • Read Blackboard’s Discussions
     • Forward your Blackboard email to your email account
     • Start early with your coursework...trying to figure out what to do on Sunday night might be difficult!

  3. Schema languages...they are sooo different!
     • I hope you understand that
       – there are various issues involved in the validation of a document
       – validation and data quality are closely related
       – structure and content are different aspects of a document
       – people have built numerous formalisms & tools to help you validate your documents
         • with various applications in mind
         • with various goals re. generality/complexity/simplicity
         • with various expressive means/restrictions to describe
           – datatypes
           – structure
         • with various object-oriented mechanisms such as inheritance for ease of authoring & maintaining
       – which of them to choose (first/how to mix) depends on your application
     • I invite you to compare this with other areas, e.g., parser generators

  4. Schema languages...they are sooo different!
     • So far, you know 2 XML schema languages:
       – DTDs
       – WXS
     • There are many more for XML
       – SOX, SGML, RelaxNG, Schematron, …
     • They differ in
       1. their style
       2. usability
       3. computability
          • how complex is it to validate documents?
       4. expressive power (which is closely related to (3))
          • data type support
          • structural expressiveness
            – what kind of trees can I describe?
          • uniqueness constraints
          – ...

  5. A comparison: who can describe structure better?
     • Obvious question:
       ➡ which of DTD, WXS, and RelaxNG has more powerful means to describe structure?
       – how to answer this?
       – perhaps they are orthogonal?
       – how to compare DTD (no types) with WXS, and RelaxNG (later)?
       – how do we know how costly validation is?
         • e.g., is validating a document D against a DTD more expensive than validating a document against an XML Schema?
         • “Ah, let’s build a DTD-validator and a WXS-validator and compare run-time and memory requirements for similar (?) inputs on the same document!”
         • ...how can we make sure that we have been equally clever for both validators?
         • ...what are “similar inputs”?
         • ...how many do we need to test for?

  6. Interesting example where DTDs and WXS differ
     • assume we want to define an element NameType in a DTD….
         <xs:complexType name="NameType">
           <xs:all>
             <xs:element name="firstname" type="xs:string"/>
             <xs:element name="secondname" type="xs:string"/>
             <xs:element name="lastname" type="xs:string"/>
           </xs:all>
         </xs:complexType>
     • how?
     • ….it’s unmanageable and long (how?), but possible
     • are there other examples where something is not possible in DTD? Or in WXS?
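One way to see why the DTD version is "long, but possible": DTD content models have no counterpart to `xs:all`, so every admissible ordering has to be written out by hand. The sketch below generates such a content model in Python. The function name `all_group_as_dtd` is hypothetical, and using `NameType` as the DTD element name is an assumption (on the slide it is a WXS type name).

```python
from itertools import permutations

def all_group_as_dtd(parent, children):
    """Spell out an xs:all-style group as a DTD element declaration.

    Each ordering of `children` becomes one sequence, and the sequences are
    joined with `|`, so n children produce n! alternatives.
    """
    alternatives = ["(" + ", ".join(order) + ")" for order in permutations(children)]
    return "<!ELEMENT " + parent + " (" + " | ".join(alternatives) + ")>"

# Illustrative call using the three child elements from the slide.
decl = all_group_as_dtd("NameType", ["firstname", "secondname", "lastname"])
print(decl)
```

With three children this yields 3! = 6 alternatives, and each extra child multiplies the count again. The flat enumeration is only meant to show the blow-up: validators that insist on deterministic content models would presumably need the alternatives factored by common prefix.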

  7. A comparison: who can describe structure better?
     • Use formal methods:
       1. view an XML document D as a simplified DOM tree T_D
          • capture a schema S in a grammar/automaton G_S
          • such that T_D ∈ L(G_S) if and only if D validates against S
          • i.e., validation corresponds to acceptance by the automaton
       2. schema languages of type X ⇋ grammars/automata of type C_X
       3. check: for each G_S1 of type C_X capturing a schema S1 of type X, can we build G_S2 of type C_Y capturing a schema S2 of type Y, such that L(G_S1) = L(G_S2)? If yes, then Y is at least as expressive as X.
     • ...so, let’s do this: we will see definitions and lemmas/theorems
       ๏ definitions fix the meaning of terms in an unambiguous way (and you need to understand them to follow the rest)
       ★ lemmas and theorems state properties of the concepts introduced
     • we will see and work on examples

  8. Basics - regular expressions
     ๏ Given a set of symbols N, the set of regular expressions regexp(N) over N is the smallest set containing
       – the empty string ε and all symbols in N and
       – if e_1 and e_2 ∈ regexp(N), then so are
         • e_1,e_2 (concatenation)
         • e_1|e_2 (choice)
         • e_1* (repetition)
     ๏ Given a regular expression e, a string w matches e,
       – if w = ε = e or w = n = e for some n in N, or
       – if w = w_1 w_2 and e = (e_1, e_2) and w_1 matches e_1 and w_2 matches e_2, or
       – if e = (e_1 | e_2) and w matches e_1 or w matches e_2, or
       – if w = ε and e = e_1*, or
       – if w = w_1 w_2 ... w_n and e = e_1* and each w_i matches e_1

  9. Basics - regular expressions
     ๏ Given a regular expression e, a string w matches e,
       – if w = ε = e or w = n = e for some n in N, or
       – if w = w_1 w_2 and e = (e_1, e_2) and w_1 matches e_1 and w_2 matches e_2, or
       – if e = (e_1 | e_2) and w matches e_1 or w matches e_2, or
       – if w = ε and e = e_1*, or
       – if w = w_1 w_2 ... w_n and e = e_1* and each w_i matches e_1
     • Hence we can use
       – e+ as abbreviation for (e,e*)
       – e? as abbreviation for (e|ε)
     • Test:
       – does ababa match (a, b)* ?
       – does ababa match (a | b)* ?
       – does abababa match (a,b)* , a?, b, a, b* ?
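The matching definition can be read as a recursive procedure, one case per clause. The following Python sketch does exactly that. The nested-tuple encoding of expressions and the helper names `matches`, `seq`, and `opt` are assumptions made for illustration, and each symbol is assumed to be a single character. It favours closeness to the definition over speed: it tries every split point, which is exponential in the worst case.

```python
# Expressions are nested tuples (a hypothetical encoding, not from the slides):
#   ("eps",)   ("sym", n)   ("seq", e1, e2)   ("alt", e1, e2)   ("star", e1)

def matches(e, w):
    """Return True if the string w matches e, following the definition clause by clause."""
    kind = e[0]
    if kind == "eps":                       # w = ε = e
        return w == ""
    if kind == "sym":                       # w = n = e
        return w == e[1]
    if kind == "seq":                       # w = w1 w2 with w1 matching e1, w2 matching e2
        return any(matches(e[1], w[:i]) and matches(e[2], w[i:])
                   for i in range(len(w) + 1))
    if kind == "alt":                       # w matches e1 or w matches e2
        return matches(e[1], w) or matches(e[2], w)
    if kind == "star":                      # w = ε, or w = w1 ... wn with each wi matching e1
        # Peel off a non-empty first piece so the recursion always shrinks w.
        return w == "" or any(matches(e[1], w[:i]) and matches(e, w[i:])
                              for i in range(1, len(w) + 1))
    raise ValueError(f"unknown expression: {e!r}")

def seq(*es):
    """Right-nested concatenation e1, e2, ..., en."""
    return es[0] if len(es) == 1 else ("seq", es[0], seq(*es[1:]))

def opt(e):
    """e? as abbreviation for (e | ε)."""
    return ("alt", e, ("eps",))

a, b = ("sym", "a"), ("sym", "b")

# The three test questions from the slide.
print(matches(("star", seq(a, b)), "ababa"))
print(matches(("star", ("alt", a, b)), "ababa"))
print(matches(seq(("star", seq(a, b)), opt(a), b, a, ("star", b)), "abababa"))
```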

  10. Towards trees...
     • A regular expression e describes a set of strings L(e):
       – L(e) := { w | w matches e }
       – L(e) can be finite, infinite,...can it be empty?
     • A schema S describes a set of trees L(S):
       – L(S) := { t | t validates against S }
       – L(S) can be finite, infinite, empty,….can it be empty?
     • As a simplification, we will look into Tree Grammars:
     • A tree grammar G describes a set of trees L(G):
       – L(G) := { t | G accepts t }

  11. Trees: nodes as strings!
     [Figure: three views of the same tree: "A tree", "A tree over {A,B,C}", and "A tree with nodes as strings". Node names and labels: ε ↦ B; 0 ↦ A, 1 ↦ A; 0,0 ↦ B, 0,1 ↦ A, 0,2 ↦ B; 1,0 ↦ B]

  12. Trees: nodes as strings!
     ๏ We use ℕ for the non-negative integers (including 0)
     ๏ we use ℕ* for the set of all (finite) strings over ℕ
       • ε is used for the empty string
       • 0,1,0 is a string of length 3
       • each string stands for a node
     ๏ An alphabet is a finite set of symbols
     ๏ A tree T over an alphabet Σ is a (partial) mapping T: ℕ* → Σ with a domain that
       ๏ is finite (i.e., T(n) is defined for only finitely many strings over ℕ)
       ๏ contains ε (i.e., T(ε) is defined)
       ๏ is prefix-closed (i.e., if T(w,n) is defined, then T(w) is as well)
     • Explanation:
       • the strings in the domain of T represent T’s nodes
       • (w,n) is a successor of w,
       • T(w) is the label of w (as shown in picture)
       • we use nodes(T) for the (finite) domain of/nodes in T
     [Figure: the example tree with ε ↦ B; 0 ↦ A, 1 ↦ A; 0,0 ↦ B, 0,1 ↦ A, 0,2 ↦ B; 1,0 ↦ B]
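As a sketch of how this definition might look in code: a tree can be stored as a Python dict whose keys are tuples of non-negative integers (the node strings) and whose values are labels. The names `is_tree` and `children` are hypothetical, and the example tree is a reading of the flattened figure, so treat its labels as an assumption.

```python
def is_tree(T, alphabet):
    """Check the domain conditions for a dict T from tuples over ℕ to labels in `alphabet`."""
    if () not in T:                          # the domain must contain ε
        return False
    for node, label in T.items():
        if label not in alphabet:            # T maps into Σ
            return False
        if any(not isinstance(n, int) or n < 0 for n in node):
            return False                     # node names are strings over ℕ
        if node and node[:-1] not in T:      # prefix-closed: if (w,n) is a node, so is w
            return False
    return True                              # finiteness holds because a dict is finite

def children(T, w):
    """Successors (w,n) of node w, ordered by n."""
    return sorted(v for v in T if len(v) == len(w) + 1 and v[:len(w)] == w)

# The tree from the picture, as read from the extracted figure text.
T = {
    (): "B",
    (0,): "A", (1,): "A",
    (0, 0): "B", (0, 1): "A", (0, 2): "B",
    (1, 0): "B",
}

print(is_tree(T, {"A", "B", "C"}))
print(children(T, (0,)))
```

Checking only the immediate parent of each node is enough for prefix-closure, since the same check is applied to that parent in turn.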

  13. Tree Grammars: definition
     ๏ A tree grammar is a structure G = (N, Σ, S, P) where
       – N is a finite set of nonterminal symbols,
       – Σ is an alphabet,
       – S ⊆ N is a set of start symbols, and
       – P is a set of production rules, i.e., each R ∈ P is of the form
         • X → a e
         • where X ∈ N, a ∈ Σ, and e ∈ regexp(N)
     • Example: G = (N, Σ, S, P) with
       N = {Book, Author, Editor, Affiliation, Paper, Country}
       Σ = {B, Name, A, P, C}
       S = {Book, Paper}
       P = { Book → B Editor,
             Paper → P Author,
             Editor → Name Country,
             Author → Name Affiliation,
             Country → C ε,
             Affiliation → A ε }
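The four components of G translate directly into a small data structure. The sketch below encodes the example grammar and checks the side conditions of the definition (S ⊆ N, and for each rule X ∈ N, a ∈ Σ, e ∈ regexp(N)). The class names `Rule` and `TreeGrammar`, the helper `well_formed`, and the nested-tuple encoding of content expressions are illustrative assumptions, not part of the slides.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Rule:
    lhs: str         # X, a nonterminal
    label: str       # a, a symbol of the alphabet Σ
    content: tuple   # e, a nested tuple: ("eps",), ("sym", X), ("seq"|"alt", e1, e2), ("star", e1)

@dataclass(frozen=True)
class TreeGrammar:
    nonterminals: frozenset   # N
    alphabet: frozenset       # Σ
    start: frozenset          # S
    rules: tuple              # P

def symbols(e):
    """Nonterminals mentioned in a content expression."""
    if e[0] == "sym":
        return {e[1]}
    return set().union(*(symbols(sub) for sub in e[1:]))

def well_formed(g):
    """Check S ⊆ N and, for every rule X → a e, that X ∈ N, a ∈ Σ, and e only uses N."""
    return g.start <= g.nonterminals and all(
        r.lhs in g.nonterminals
        and r.label in g.alphabet
        and symbols(r.content) <= g.nonterminals
        for r in g.rules
    )

EPS = ("eps",)
G = TreeGrammar(
    nonterminals=frozenset({"Book", "Author", "Editor", "Affiliation", "Paper", "Country"}),
    alphabet=frozenset({"B", "Name", "A", "P", "C"}),
    start=frozenset({"Book", "Paper"}),
    rules=(
        Rule("Book", "B", ("sym", "Editor")),
        Rule("Paper", "P", ("sym", "Author")),
        Rule("Editor", "Name", ("sym", "Country")),
        Rule("Author", "Name", ("sym", "Affiliation")),
        Rule("Country", "C", EPS),
        Rule("Affiliation", "A", EPS),
    ),
)

print(well_formed(G))
```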

  14. Tree Grammars: what do they do?
     N = {Book, Author, Editor, Affiliation, Paper, Country}
     Σ = {B, Name, A, P, C}
     S = {Book, Paper}
     P = { Book → B Editor,
           Paper → P Author,
           Editor → Name Country,
           Author → Name Affiliation,
           Country → C ε,
           Affiliation → A ε }
     (note: this example uses only super simple regexps)
     • a grammar runs on a tree...
     • given a tree T over Σ, if G can run on T, then
       – G is said to accept T,
       – written T ∈ L(G),
       – i.e., T is in the language accepted by G
     • remember: grammar G corresponds to schema S_G, and we want to build G such that T ∈ L(G) holds iff T validates against S_G
     • ...let’s see a grammar run...

  15. Tree Grammars: definition of runs
     ๏ A run of G = (N, Σ, S, P) on a tree T is a mapping r: nodes(T) → N such that:
       – r(ε) ∈ S   % r labels the root node of T with a start symbol
       – for each w ∈ nodes(T) with children w_1 w_2 ... w_n, there exists a rule X → a e ∈ P such that
         • r(w) = X,
         • T(w) = a, and
         • r(w_1) r(w_2) ... r(w_n) matches e.
     • let’s see an example run of
       N = {Book, Author, Editor, Affilia, Paper, F, L}
       Σ = {B, P, Name, F, L, A}
       S = {Book, Paper}
       P = { Book → B Editor|Author,
             Paper → P Author,
             Editor → Name F,L,
             Author → Name L,Affilia,
             F → F ε,
             L → L ε,
             Affilia → A ε }
     [Figure: the tree ε ↦ B, 0 ↦ Name, 0,0 ↦ F, 0,1 ↦ L with the run r(ε) = Book (Paper is crossed out ✘), r(0) = Editor, r(0,0) = F, r(0,1) = L]

  16. Tree Grammars: trees accepted
     ๏ A tree T is accepted by a grammar G if there is a run of G on T
       – we write T ∈ L(G).
     • which of these trees is accepted by our grammar?
     [Figure: three candidate trees, each with a root ε, a single child 0 labelled Name, and two leaves 0,0 and 0,1. The roots are labelled P, B, and P. The leaf labels are drawn from A, L, and F, but their assignment to the individual trees is not recoverable from the extracted text]
       N = {Book, Author, Editor, Affilia, Paper, F, L}
       Σ = {B, Name, F, L, A, P}
       S = {Book, Paper}
       P = { Book → B Editor|Author,
             Paper → P Author,
             Editor → Name F,L,
             Author → Name L,Affilia,
             F → F ε,
             L → L ε,
             Affilia → A ε }
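Deciding acceptance means searching for a run. A minimal sketch, assuming trees are dicts from integer tuples to labels and rules are triples (X, a, e) with e as a nested tuple: compute bottom-up, for every node, the set of nonterminals that some run could assign to it, then test whether the root's set contains a start symbol. The function names and the three sample trees are hypothetical. The first tree is the one from the example run on the previous slide; the other two are made up here, since the leaf labels in the figure did not survive extraction.

```python
def seq_matches(e, cands):
    """True if some choice of one nonterminal per position of `cands` matches e.

    `cands` is a tuple of sets; cands[i] holds the nonterminals child i may receive.
    """
    kind = e[0]
    if kind == "eps":
        return len(cands) == 0
    if kind == "sym":
        return len(cands) == 1 and e[1] in cands[0]
    if kind == "alt":
        return seq_matches(e[1], cands) or seq_matches(e[2], cands)
    if kind == "seq":
        return any(seq_matches(e[1], cands[:i]) and seq_matches(e[2], cands[i:])
                   for i in range(len(cands) + 1))
    if kind == "star":
        return len(cands) == 0 or any(
            seq_matches(e[1], cands[:i]) and seq_matches(e, cands[i:])
            for i in range(1, len(cands) + 1))
    raise ValueError(f"unknown expression: {e!r}")

def candidates(T, w, rules):
    """Nonterminals X such that some run on the subtree at w assigns X to w."""
    kids = sorted(v for v in T if len(v) == len(w) + 1 and v[:len(w)] == w)
    below = tuple(candidates(T, v, rules) for v in kids)
    return {X for (X, a, e) in rules if T[w] == a and seq_matches(e, below)}

def accepts(start, rules, T):
    """T ∈ L(G) iff some run labels the root ε with a start symbol."""
    return bool(candidates(T, (), rules) & start)

def sym(n):
    return ("sym", n)

EPS = ("eps",)
START = {"Book", "Paper"}
RULES = [
    ("Book", "B", ("alt", sym("Editor"), sym("Author"))),
    ("Paper", "P", sym("Author")),
    ("Editor", "Name", ("seq", sym("F"), sym("L"))),
    ("Author", "Name", ("seq", sym("L"), sym("Affilia"))),
    ("F", "F", EPS),
    ("L", "L", EPS),
    ("Affilia", "A", EPS),
]

book = {(): "B", (0,): "Name", (0, 0): "F", (0, 1): "L"}    # tree from the example run
paper = {(): "P", (0,): "Name", (0, 0): "L", (0, 1): "A"}   # hypothetical sample
mixed = {(): "P", (0,): "Name", (0, 0): "F", (0, 1): "L"}   # hypothetical sample

for name, tree in [("book", book), ("paper", paper), ("mixed", mixed)]:
    print(name, accepts(START, RULES, tree))
```

Working with sets of candidates avoids enumerating all mappings r: nodes(T) → N. Each child position is consumed by exactly one symbol of e, so a successful match always corresponds to one consistent choice of nonterminals.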
