COMP60411 Semi-structured Data and the Web Validating Trees against - - PowerPoint PPT Presentation

SLIDE 1

COMP60411 Semi-structured Data and the Web

Validating Trees against Tree Grammars

The Essence of XML and Errors

Bijan Parsia and Uli Sattler

University of Manchester

1 Saturday, 29 October 2011

SLIDE 2

Last week...

...we designed our first “schema validator” algorithm

  • for local tree grammars first
  • that can be implemented by

– walking the DOM tree in a depth-first, left-to-right way, or
– using a SAX parser to do it in a streaming fashion

  • thus uses memory space linear in the depth of the input tree
  • that uses stacks

– to keep track of

  • each rule that a node’s validation needs to check against: R (written on the way down, checked on the way up; local ⇒ unique!)
  • the result of the child nodes’ validations: which non-terminal symbols did they validate with?

ValAlgo: given a tree T and a grammar G, answer “yes” if T ∈ L(G) and “no” otherwise
SLIDE 3

This week...

...we expand the algorithm

  • first to single-type tree grammars

– this automatically gives us a validator for the structural aspects of WXS
– will be rather straightforward

  • then to general tree grammars

– this automatically gives us a validator for Relax NG schemas
– will be more tricky: we’ll still use stacks to keep track of

  • all rules that a node’s validation needs to check against: R (written on the way down, checked on the way up)
  • the result of the child nodes’ validations: which non-terminal symbols did they validate with?

  • both can be implemented by

– walking the DOM tree in a depth-first, left-to-right way, or
– using a SAX parser to do it in a streaming fashion

  • thus both use memory space linear in the depth of the input tree

– which is quite impressive/surprising for general tree grammars/Relax NG

ValAlgo: given a tree T and a grammar G, answer “yes” if T ∈ L(G) and “no” otherwise
SLIDE 4

ValAlgo: given an XML doc/tree T and a single-type grammar G, answer “yes” if T ∈ L(G) and “no” otherwise

See the paper by Murata, Lee, Mani, Kawaguchi

Input: DOM Tree for T, single-type tree grammar G = (N, Σ, S, P)
NT is a stack of strings of non-terminals
R is a stack of production rules

Traverse T in a depth-first, left-to-right manner

When an element E is visited on way down,
  if there is a production rule N → a e in P with a = E’s tag name and (E is root and N in S, or N occurs in RHS of topmost rule in R)
  then push N → a e onto R (store rule for E’s content in R)
    and push ϵ onto NT (start remembering E’s child nodes)
  else report “not accepted” and stop

When an element E is visited on way up,
  pop a rule N → a e out of R (retrieve rule for E’s content from R)
  pop a string of non-terminals w out of NT (retrieve E’s child nodes)
  if w matches e
  then pop a string w’ of non-terminals out of NT and push w’N onto NT (add E’s non-terminal to those of its preceding siblings)
  else report “not accepted” and stop

report “accepted” and stop

single-type ⇒ unique rule!

nothing changed
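The single-type algorithm above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the lecturers' implementation: trees are hypothetical `(tag, [children])` tuples rather than DOM nodes, a rule N → a e is the tuple `(N, a, e)`, and each content model `e` is written as a Python regular expression over single-letter non-terminal names. `RULES` is one reading of the single-type example grammar used in this lecture.

```python
import re

# Assumed encoding: (non-terminal, tag, content model as a regex over
# single-letter non-terminal names).
RULES = [
    ("S", "a", "BB*D"),   # S -> a (B, B*, D)
    ("B", "b", "CC|C"),   # B -> b ((C, C) | C)
    ("C", "c", "C?"),     # C -> c (epsilon | C)
    ("D", "c", "CCC"),    # D -> c (C, C, C)
]
START = {"S"}


def validate_single_type(tree, rules, start):
    """Return True iff tree = (tag, [children]) is accepted by the grammar."""
    R = []    # stack of production rules
    NT = []   # stack of strings of non-terminals

    def visit(node, is_root):
        tag, children = node
        # Way down: find the rule for this element.
        if is_root:
            allowed = start
        else:
            allowed = {c for c in R[-1][2] if c.isupper()}
        candidates = [r for r in rules if r[1] == tag and r[0] in allowed]
        if not candidates:
            return False
        R.append(candidates[0])   # single-type: at most one candidate
        NT.append("")             # start remembering the child nodes
        for child in children:
            if not visit(child, False):
                return False
        # Way up: check the children against the stored content model.
        n, _, e = R.pop()
        w = NT.pop()
        if re.fullmatch(e, w) is None:
            return False
        if NT:
            NT[-1] += n           # append N to the parent's string
        return True

    return visit(tree, True)


TREE_OK = ("a", [("b", [("c", [])]),
                 ("c", [("c", []), ("c", []), ("c", [])])])
TREE_BAD = ("a", [("b", [("c", [])]), ("c", [])])
```

Here `validate_single_type(TREE_OK, RULES, START)` accepts, while `TREE_BAD` is rejected because its second `c` node must use rule D, which requires three children. The recursion stands in for the DOM traversal; the explicit stacks `R` and `NT` mirror the slide.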
SLIDE 5

When an element E is visited on way down,
  if there is a production rule N → a e in P with a = E’s tag name and (E is root and N in S, or N occurs in RHS of topmost rule in R)
  then push N → a e onto R and push ϵ onto NT
  else report “not accepted” and stop

[Figure: example tree with nodes labelled a, b and c]

ValAlgo: given an XML doc/tree T and a single-type grammar G, answer “yes” if T ∈ L(G) and “no” otherwise

  • Let’s see how the algorithm works:

– G = ({S,B,C,D},{a,b,c},{S},P) with P = { S → a B,B*,D, B → b (C,C)|C, C → c ϵ|C, D → c C,C,C }
– ...in order to know which production rule N → c ... to choose for nodes labelled c, I need to check the rule chosen for their predecessor and ensure that N occurs in the RHS of that rule...
SLIDE 6
  • Want to implement this algorithm? Again, as for local tree grammars,

– walk the DOM tree in a depth-first, left-to-right way, or
– use a SAX parser and do it in a streaming fashion

  • now, this was for single-type tree grammars; let’s see how this works for general tree grammars

– we can have competing non-terminal symbols in the RHS of rules
– how do we know with which to continue?
– try/guess one and, if failed, backtrack?
– or by keeping track of all possibilities

  • and, as long as we have some, everything is fine...
  • which means we need some more stacks for track keeping...

ValAlgo: given an XML doc/tree T and any tree grammar G, answer “yes” if T ∈ L(G) and “no” otherwise
SLIDE 7

ValAlgo: given an XML doc/tree T and any tree grammar G, answer “yes” if T ∈ L(G) and “no” otherwise

Input: DOM Tree for T, a tree grammar G = (N, Σ, S, P)
NT is a stack of strings of sets of non-terminals
R is a stack of sets of production rules (we don’t know which to use!)
NS is a stack of sets of non-terminals, initialised with S (stores the non-terminals from the RHS of possibly applicable rules)

Traverse T in a depth-first, left-to-right manner

When an element E is visited on way down,
  set RS to the set of production rules N → a e in P with a = E’s tag name and (N occurs in topmost set of NS)
  if RS is non-empty
  then push RS onto R, push ϵ onto NT, push the set of all non-terminals occurring in the RHS of a rule in RS onto NS
  else report “not accepted” and stop

When an element E is visited on way up,
  pop a rule set RS = {Ni → a ei | i = 1..k} out of R
SLIDE 8

ValAlgo: given an XML doc/tree T and any tree grammar G, answer “yes” if T ∈ L(G) and “no” otherwise

Input: DOM Tree for T, a tree grammar G = (N, Σ, S, P)
NT is a stack of strings of sets of non-terminals
R is a stack of sets of production rules
NS is a stack of sets of non-terminals, initialised with S (stores the non-terminals from the RHS of possibly applicable rules)

Traverse T in a depth-first, left-to-right manner

When an element E is visited on way down,
  set RS to the set of production rules N → a e in P with a = E’s tag name and (N occurs in topmost set of NS)
  if RS is non-empty
  then push RS onto R, push ϵ onto NT, push the set of all non-terminals occurring in the RHS of a rule in RS onto NS
  else report “not accepted” and stop

When an element E is visited on way up,
  pop a rule set RS = {Ni → a ei | i = 1..k} out of R
  pop a string of sets of non-terminals W1...Wn out of NT (one set per child of E)
  set W to the set of those Ni such that there is a string w1...wn, with each wj from Wj, that matches ei
  if W is non-empty
  then pop a string V1...Vm of sets of non-terminals out of NT, push V1...VmW onto NT, and pop NS
  else report “not accepted” and stop

report “accepted” and stop
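The general algorithm above can be sketched in Python along the same lines. The encoding is an assumption made for illustration: trees are `(tag, [children])` tuples, rules are `(N, a, e)` tuples, and content models are Python regular expressions over single-letter non-terminals. The test "is there a string w1...wn with each wj from Wj that matches ei" is done here by brute-force enumeration with `itertools.product`, which is simple but exponential in the number of children; a serious implementation would instead run an automaton for `e` over the sets.

```python
import re
from itertools import product

# Assumed encoding of a grammar with competing non-terminals:
# A -> a (B | epsilon), B -> a ((C, C) | B), C -> a ((A, A, A) | epsilon)
RULES = [("A", "a", "B?"), ("B", "a", "CC|B"), ("C", "a", "(AAA)?")]
START = {"A", "B", "C"}


def validate(tree, rules, start):
    """Return True iff tree = (tag, [children]) is accepted by the grammar."""
    R = []                # stack of sets (lists) of production rules
    NT = []               # stack of strings (lists) of sets of non-terminals
    NS = [set(start)]     # stack of sets of non-terminals, initialised with S

    def visit(node):
        tag, children = node
        # Way down: collect every rule that might apply.
        rs = [r for r in rules if r[1] == tag and r[0] in NS[-1]]
        if not rs:
            return False
        R.append(rs)
        NT.append([])
        NS.append({c for _, _, e in rs for c in e if c.isupper()})
        for child in children:
            if not visit(child):
                return False
        # Way up: keep the non-terminals whose content model is satisfiable.
        rs = R.pop()
        ws = NT.pop()
        NS.pop()
        w = {n for n, _, e in rs
             if any(re.fullmatch(e, "".join(choice))
                    for choice in product(*(sorted(s) for s in ws)))}
        if not w:
            return False
        if NT:
            NT[-1].append(w)   # push V1...VmW for the parent
        return True

    return visit(tree)


LEAF = ("a", [])
TREE = ("a", [("a", [("a", [LEAF, LEAF, LEAF]), LEAF])])
```

With this encoding, `validate(TREE, RULES, START)` accepts, whereas a root with a single leaf child is rejected because no content model matches a lone A or C.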
SLIDE 9
  • Let’s see how the algorithm works:

– G = ({A,B,C},{a},{A,B,C},P) with P = { A → a B|ϵ, B → a (C,C) | B, C → a (A,A,A)|ϵ } (➀, ➁, ➂ denote the three production rules of P)
– [Figure: input tree with seven nodes, all labelled a; the trace on the following slides is consistent with the shape a(a(a(a,a,a),a))]
– Initial state: R is empty, NT is empty, NS = {A,B,C}

ValAlgo: given an XML doc/tree T and any tree grammar G, answer “yes” if T ∈ L(G) and “no” otherwise
SLIDE 10
  • Let’s see how the algorithm works:

– G = ({A,B,C},{a},{A,B,C},P) with P = { A → a B|ϵ, B → a (C,C) | B, C → a (A,A,A)|ϵ }

  • Way down at the root: RS = {➀, ➁, ➂}

– R (bottom to top): {➀, ➁, ➂}
– NT (bottom to top): ϵ
– NS (bottom to top): {A,B,C} | {A,B,C}
SLIDE 11
  • Way down at the root’s child: RS = {➀, ➁, ➂}

– R (bottom to top): {➀, ➁, ➂} | {➀, ➁, ➂}
– NT (bottom to top): ϵ | ϵ
– NS (bottom to top): {A,B,C} | {A,B,C} | {A,B,C}
SLIDE 12
  • Way down at the third-level node: RS = {➀, ➁, ➂}

– R (bottom to top): {➀, ➁, ➂} | {➀, ➁, ➂} | {➀, ➁, ➂}
– NT (bottom to top): ϵ | ϵ | ϵ
– NS (bottom to top): {A,B,C} | {A,B,C} | {A,B,C} | {A,B,C}
SLIDE 13
  • Way down at the first leaf: RS = {➀, ➁, ➂}

– R (bottom to top): {➀, ➁, ➂} | {➀, ➁, ➂} | {➀, ➁, ➂} | {➀, ➁, ➂}
– NT (bottom to top): ϵ | ϵ | ϵ | ϵ
– NS (bottom to top): {A,B,C} | {A,B,C} | {A,B,C} | {A,B,C} | {A,B,C}
SLIDE 14
  • Way up at the first leaf: popped RS = {➀, ➁, ➂} and W1...Wk = ϵ, so W = {A,C}

– R (bottom to top): {➀, ➁, ➂} | {➀, ➁, ➂} | {➀, ➁, ➂}
– NT (bottom to top): ϵ | ϵ | ϵ
– NS (bottom to top, not yet popped): {A,B,C} | {A,B,C} | {A,B,C} | {A,B,C} | {A,B,C}
SLIDE 15
  • After pushing W = {A,C} and popping NS:

– R (bottom to top): {➀, ➁, ➂} | {➀, ➁, ➂} | {➀, ➁, ➂}
– NT (bottom to top): ϵ | ϵ | {A,C}
– NS (bottom to top): {A,B,C} | {A,B,C} | {A,B,C} | {A,B,C}
SLIDE 16
  • Way down at the second leaf: RS = {➀, ➁, ➂}

– R (bottom to top): {➀, ➁, ➂} | {➀, ➁, ➂} | {➀, ➁, ➂} | {➀, ➁, ➂}
– NT (bottom to top): ϵ | ϵ | {A,C} | ϵ
– NS (bottom to top): {A,B,C} | {A,B,C} | {A,B,C} | {A,B,C} | {A,B,C}
SLIDE 17
  • Way up at the second leaf: popped RS = {➀, ➁, ➂} and W1...Wk = ϵ, so W = {A,C}

– R (bottom to top): {➀, ➁, ➂} | {➀, ➁, ➂} | {➀, ➁, ➂}
– NT (bottom to top): ϵ | ϵ | {A,C}
– NS (bottom to top, not yet popped): {A,B,C} | {A,B,C} | {A,B,C} | {A,B,C} | {A,B,C}
SLIDE 18
  • After pushing W = {A,C} and popping NS:

– R (bottom to top): {➀, ➁, ➂} | {➀, ➁, ➂} | {➀, ➁, ➂}
– NT (bottom to top): ϵ | ϵ | {A,C},{A,C}
– NS (bottom to top): {A,B,C} | {A,B,C} | {A,B,C} | {A,B,C}
SLIDE 19
  • Way down at the third leaf: RS = {➀, ➁, ➂}

– R (bottom to top): {➀, ➁, ➂} | {➀, ➁, ➂} | {➀, ➁, ➂} | {➀, ➁, ➂}
– NT (bottom to top): ϵ | ϵ | {A,C},{A,C} | ϵ
– NS (bottom to top): {A,B,C} | {A,B,C} | {A,B,C} | {A,B,C} | {A,B,C}
SLIDE 20
  • Way up at the third leaf: popped RS = {➀, ➁, ➂} and W1...Wk = ϵ, so W = {A,C}

– R (bottom to top): {➀, ➁, ➂} | {➀, ➁, ➂} | {➀, ➁, ➂}
– NT (bottom to top): ϵ | ϵ | {A,C},{A,C}
– NS (bottom to top, not yet popped): {A,B,C} | {A,B,C} | {A,B,C} | {A,B,C} | {A,B,C}
SLIDE 21
  • After pushing W = {A,C} and popping NS:

– R (bottom to top): {➀, ➁, ➂} | {➀, ➁, ➂} | {➀, ➁, ➂}
– NT (bottom to top): ϵ | ϵ | {A,C},{A,C},{A,C}
– NS (bottom to top): {A,B,C} | {A,B,C} | {A,B,C} | {A,B,C}
SLIDE 22
  • Way up at the third-level node: popped RS = {➀, ➁, ➂} and W1...W3 = {A,C},{A,C},{A,C}, so W = {C}

– R (bottom to top): {➀, ➁, ➂} | {➀, ➁, ➂}
– NT (bottom to top): ϵ | ϵ
– NS (bottom to top, not yet popped): {A,B,C} | {A,B,C} | {A,B,C} | {A,B,C}
SLIDE 23
  • After pushing W = {C} and popping NS:

– R (bottom to top): {➀, ➁, ➂} | {➀, ➁, ➂}
– NT (bottom to top): ϵ | {C}
– NS (bottom to top): {A,B,C} | {A,B,C} | {A,B,C}
SLIDE 24
  • Way down at the last leaf (second child of the root’s child): RS = {➀, ➁, ➂}

– R (bottom to top): {➀, ➁, ➂} | {➀, ➁, ➂} | {➀, ➁, ➂}
– NT (bottom to top): ϵ | {C} | ϵ
– NS (bottom to top): {A,B,C} | {A,B,C} | {A,B,C} | {A,B,C}
SLIDE 25
  • Way up at the last leaf: popped RS = {➀, ➁, ➂} and W1...Wk = ϵ, so W = {A,C}

– R (bottom to top): {➀, ➁, ➂} | {➀, ➁, ➂}
– NT (bottom to top): ϵ | {C}
– NS (bottom to top, not yet popped): {A,B,C} | {A,B,C} | {A,B,C} | {A,B,C}
SLIDE 26
  • After pushing W = {A,C} and popping NS:

– R (bottom to top): {➀, ➁, ➂} | {➀, ➁, ➂}
– NT (bottom to top): ϵ | {C},{A,C}
– NS (bottom to top): {A,B,C} | {A,B,C} | {A,B,C}
SLIDE 27
  • Way up at the root’s child: popped RS = {➀, ➁, ➂} and W1...Wk = {C},{A,C}, so W = {B}

– R (bottom to top): {➀, ➁, ➂}
– NT (bottom to top): ϵ
– NS (bottom to top, not yet popped): {A,B,C} | {A,B,C} | {A,B,C}
SLIDE 28
  • After pushing W = {B} and popping NS:

– R (bottom to top): {➀, ➁, ➂}
– NT (bottom to top): {B}
– NS (bottom to top): {A,B,C} | {A,B,C}
SLIDE 29
  • Way up at the root: popped RS = {➀, ➁, ➂} and W1...Wk = {B}, so W = {A,B}

– R: empty
– NT: empty
– NS (bottom to top, not yet popped): {A,B,C} | {A,B,C}
SLIDE 30
  • Traversal finished:

– R: empty
– NT: empty
– NS: {A,B,C}
– “accepted”/“yes”, T is accepted by G

ValAlgo: given an XML doc/tree T and any tree grammar G, answer “yes” if T ∈ L(G) and “no” otherwise
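The W sets in the walkthrough can be recomputed with a short bottom-up sketch. It is a simplification, not the slide algorithm itself: it drops the top-down filter NS (harmless for this grammar, where NS is always {A,B,C}) and uses recursion instead of explicit stacks. The tuple encoding of trees and the regex encoding of content models are assumptions, and the tree shape is the one inferred from the trace.

```python
import re
from itertools import product

# Assumed encoding of the walkthrough grammar:
# A -> a (B | epsilon), B -> a ((C, C) | B), C -> a ((A, A, A) | epsilon)
RULES = [("A", "a", "B?"), ("B", "a", "CC|B"), ("C", "a", "(AAA)?")]


def w_set(node, rules=RULES):
    """Set of non-terminals that the subtree rooted at node validates with."""
    tag, children = node
    child_sets = [sorted(w_set(child, rules)) for child in children]
    return {n for n, a, e in rules
            if a == tag and any(re.fullmatch(e, "".join(choice))
                                for choice in product(*child_sets))}


LEAF = ("a", [])
THIRD_LEVEL = ("a", [LEAF, LEAF, LEAF])
ROOT_CHILD = ("a", [THIRD_LEVEL, LEAF])
ROOT = ("a", [ROOT_CHILD])

for name, node in [("leaf", LEAF), ("third-level node", THIRD_LEVEL),
                   ("root's child", ROOT_CHILD), ("root", ROOT)]:
    print(name, sorted(w_set(node)))
```

The printed sets are {A,C} for a leaf, {C} for the node with three leaves, {B} for the root's child and {A,B} for the root, matching the W values on the slides. The tree is accepted because the root's set contains a start symbol.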
SLIDE 31
  • Implementing this algorithm? Again, as for single-type tree grammars,

– walk the DOM tree in a depth-first, left-to-right way, or
– use a SAX parser and do it in a streaming fashion

  • Insights gained?
  • Validating against general tree grammars

– does not require guessing & backtracking
– can be implemented in a streaming way
– is a bit more tricky than validating against single-type grammars,
– but not really more complex (in terms of time/space): still only space linear in the depth of the input tree

  • so, for validation purposes, the restriction to single-type is not necessary

– feel free to describe structure in a powerful way!

  • but, for uniqueness of PSVI,

– we need single-type

ValAlgo: given an XML doc/tree T and any tree grammar G, answer “yes” if T ∈ L(G) and “no” otherwise
SLIDE 32

From Tree Grammars to Schema Languages

  • Different schema languages for different purposes

– testing structural constraints

  • do persons’ names have both a first and second name?

– testing type constraints

  • is age an integer? And DoB a date?

– describing a handy PSVI

  • adding default values or type information for easy/robust querying/manipulation

– …
– single-typedness useful for some, but not all purposes!
– locality?

  • Your applications might use different schemas for different purposes
  • ...and there are purposes none of our schema languages can serve:

– in CW4, not all valid input documents were really grammars
– checking whether non-terminals are mentioned correctly is beyond XSD’s abilities...we need an even more powerful schema language!

slide-33
SLIDE 33

Other interesting questions

...closely related to validation are

  • Schema emptiness:

– given a schema/grammar S, does there exist a document/tree d such that d is valid w.r.t. S?
– relevant as a basic consistency test for schemas

  • Schema containment:

– given schemas/grammars S1, S2, is S1 a specialization of S2?
– i.e., is every document that is valid w.r.t. S1 also valid w.r.t. S2?
– relevant to support tasks such as schema refinement:

  • if I say I want to refine S2,
  • then it would be nice if this intention could be later verified to ensure that I did what I wanted

– also solves schema equivalence: see your coursework!

  • ...a lot of research in both areas


slide-34
SLIDE 34

Bye for now! (I’ll be around) I have enjoyed working with you, and hope you learned loads and also enjoyed the experience!


slide-35
SLIDE 35

Internal to External

Or, spill your guts


slide-36
SLIDE 36

What the...?!?

(Pinkwashing obscured)

slide-37
SLIDE 37

JSON (1)

  • Javascript has a rich set of literals (ext. reps)

– Atomic (numbers, booleans, strings*)

  • 1, 2, true, “I’m a string”

– Composite

  • Arrays

– Ordered lists with random access
– [1, 2, “one”, “two”]

  • “Objects”

– Associative arrays/dictionary
– {“one”:1, “two”:2}

  • These can nest!

– [{“one”:1, “o1”:{“a1”: [1,2,3.0], “a2”:[]}}]

  • JSON == roughly this subset of Javascript

– The internal representation varies

  • In JS, 1 represents a 64 bit, IEEE floating point number
  • In Python’s json module, 1 represents an int (arbitrary precision in Python 3), not a float

*Strings can be thought of as a composite, i.e., an array of characters, but not here.
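To make the point about internal representations concrete, here is a small sketch using Python’s standard json module. It only shows how this one library maps JSON literals to Python types; other languages and libraries may choose differently.

```python
import json

# The nested example from above, parsed into Python's internal representation.
nested = json.loads('[{"one": 1, "o1": {"a1": [1, 2, 3.0], "a2": []}}]')

print(type(nested).__name__)      # JSON array  -> Python list
print(type(nested[0]).__name__)   # JSON object -> Python dict

# The same external literal family ("numbers") splits into two internal types.
integer_value = json.loads("1")
float_value = json.loads("3.0")
print(type(integer_value).__name__, type(float_value).__name__)
```

A JavaScript engine reading the same text would hold both 1 and 3.0 as IEEE doubles, so the two systems agree on the external form but not on the internal one.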


slide-38
SLIDE 38

JSON (2)

{"menu": {
  "id": "file",
  "value": "File",
  "popup": {
    "menuitem": [
      {"value": "New", "onclick": "CreateNewDoc()"},
      {"value": "Open", "onclick": "OpenDoc()"},
      {"value": "Close", "onclick": "CloseDoc()"}
    ]
  }
}}

<menu id="file" value="File">
  <popup>
    <menuitem value="New" onclick="CreateNewDoc()" />
    <menuitem value="Open" onclick="OpenDoc()" />
    <menuitem value="Close" onclick="CloseDoc()" />
  </popup>
</menu>

http://www.json.org/example.html

Slightly different!


slide-39
SLIDE 39

JSON (2.1)

{"menu": [{ "id": "file", "value": "File"},
  "popup": [
    "menuitem": {"value": "New", "onclick": "CreateNewDoc()"},
    "menuitem": {"value": "Open", "onclick": "OpenDoc()"},
    "menuitem": {"value": "Close", "onclick": "CloseDoc()"}
  ]
] }}

<menu id="file" value="File">
  <popup>
    <menuitem value="New" onclick="CreateNewDoc()" />
    <menuitem value="Open" onclick="OpenDoc()" />
    <menuitem value="Close" onclick="CloseDoc()" />
  </popup>
</menu>

http://www.json.org/example.html

Needed to preserve order!

Still not right!


slide-40
SLIDE 40

JSON (2.2)

{"menu": [{"id": "file", "value": "File"},
  [{"popup": [{},
    [{"menuitem": [{"value": "New", "onclick": "CreateNewDoc()"},[]]},
     {"menuitem": [{"value": "Open", "onclick": "OpenDoc()"},[]]},
     {"menuitem": [{"value": "Close", "onclick": "CloseDoc()"},[]]}
    ]
  ] } ]
] }

<menu id="file" value="File">
  <popup>
    <menuitem value="New" onclick="CreateNewDoc()" />
    <menuitem value="Open" onclick="OpenDoc()" />
    <menuitem value="Close" onclick="CloseDoc()" />
  </popup>
</menu>

http://www.json.org/example.html

slide-41
SLIDE 41

JSON (2.2) Recipe

  • Elements are mapped to “objects”

– With one pair

  • ElementName : contents
  • Contents are a list

– First item is an “object”, the attributes

  • Attributes are pairs of strings

– Second item is a list (of children)

  • Empty elements require an explicit empty list
  • No attributes requires an explicit empty object

Cumbersome!
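The recipe is mechanical enough to write down. The following is a minimal sketch using Python’s xml.etree.ElementTree; the function name element_to_json is made up for illustration, and the sketch ignores text content, which the recipe above does not cover.

```python
import json
import xml.etree.ElementTree as ET

MENU = """
<menu id="file" value="File">
  <popup>
    <menuitem value="New" onclick="CreateNewDoc()" />
    <menuitem value="Open" onclick="OpenDoc()" />
    <menuitem value="Close" onclick="CloseDoc()" />
  </popup>
</menu>
"""


def element_to_json(elem):
    """Map an element to {name: [attributes-object, list-of-children]}."""
    attributes = dict(elem.attrib)                          # explicit {} if none
    children = [element_to_json(child) for child in elem]   # explicit [] if empty
    return {elem.tag: [attributes, children]}


encoded = element_to_json(ET.fromstring(MENU.strip()))
print(json.dumps(encoded, indent=1))
```

Child order survives because children go into a list rather than into an object, which is exactly why the encoding ends up so verbose.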


slide-42
SLIDE 42

JSON vs. XML (expressivity)

  • Every XML WF DOM can be faithfully represented as a JSON object
  • Every JSON object can be faithfully represented as an XML WF DOM
  • Every WXS PSVI can be faithfully represented as a JSON object
  • Every JSON object can be faithfully represented as a WXS PSVI

slide-43
SLIDE 43

Considerations

  • For “same system”

– Roundtripping (both ways) should be exact
– Same program should behave the same in similar conditions

  • For homogeneous, distinct systems

– Roundtripping (both ways) should be exact
– Same program should behave the same in similar conditions
– Interop!

  • For heterogeneous systems

– Roundtripping should be reasonable
– Analogous programs should behave analogously in analogous conditions
– Weaker notion of interop

slide-44
SLIDE 44

What is an XML “Document”?

  • Layers

– A series of octets
– A series of unicode characters
– A series of “events”

  • SAX perspective
  • E.g., Start/End tags
  • Events are tokens

– A tree structure

  • A DOM/Infoset

– A tree of a certain shape

  • A Validated Infoset

– An adorned tree of a certain shape

  • A PSVI wrt a WXS

Errors here mean no XML!
SAX ErrorHandler
Yay! XPath! XSLT! Etc.
Types in play
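The lower layers can be seen side by side with Python’s standard library. This is only an illustrative sketch: the sample document and the EventLog class are made up, and the validated layers (Validated Infoset, PSVI) are not covered because the standard library has no schema validator.

```python
import xml.sax
import xml.etree.ElementTree as ET

# Layer 1: a series of octets.
octets = b'<a><b c="bar"/></a>'

# Layer 2: a series of unicode characters.
characters = octets.decode("utf-8")


# Layer 3: a series of events (the SAX perspective).
class EventLog(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        self.events.append(("start", name))

    def endElement(self, name):
        self.events.append(("end", name))


log = EventLog()
xml.sax.parseString(octets, log)

# Layer 4: a tree structure.
tree = ET.fromstring(octets)

print(characters)
print(log.events)
print(tree.tag, [child.tag for child in tree])
```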


slide-45
SLIDE 45

What is an XML “Document”?


validate erase


slide-46
SLIDE 46

What is an XML “Document”?


“Same” inputs can have different “meanings”! (external validation)


slide-47
SLIDE 47

What is an XML “Document”?


Generally looks like

<configuration xmlns="http://saxon.sf.net/ns/configuration" edition="EE"> <serialization method="xml" /> </configuration>

But can look otherwise!

element configuration { attribute edition {"ee"}, element serialization {attribute method {"xml"}}}

Same “meaning”, different spelling


slide-48
SLIDE 48

What is an XML “Document”?

  • Layers

– A series of octets – A series of unicode characters – A series of “events”

  • SAX perspective
  • E.g., Start/End tags
  • Events are tokens

– A tree structure

  • A DOM/Infoset

– A tree of a certain shape

  • A Validated Infoset

– An adorned tree of a certain shape

  • A PSVI wrt a WXS

– A picture (or document, or action, or…)

  • Application meaning

Can have many... ..for “the same” meaning


slide-49
SLIDE 49

The Essence of XML

  • Thesis:

– “XML is touted as an external format for representing data.”

  • Two properties

– Self-describing

  • Destroyed by external validation

– Round-tripping

  • Destroyed by defaults and union types

http://bit.ly/essenceOfXML2

slide-50
SLIDE 50

Self-description

  • As standard description

– A series of octets – A series of unicode characters – A series of “events”

  • SAX perspective
  • E.g., Start/End tags
  • Events are tokens

– A tree structure

  • A DOM/Infoset

– A tree of a certain shape

  • A Validated Infoset

– An adorned tree of a certain shape

  • A PSVI wrt a WXS

– A picture (or document, or action, or…)

  • Application meaning

Well-formed: only one way to parse it
Internal (DTD and doc are one)
External (Schema and doc are separate; out-of-band description)

slide-51
SLIDE 51

Roundtripping Fail: Defaults

Test.xml:
<a> <b/> <b c="bar"/> </a>

sparse.dtd:
<!ELEMENT a (b)+>
<!ELEMENT b EMPTY>
<!ATTLIST b c CDATA #IMPLIED>

full.dtd:
<!ELEMENT a (b)+>
<!ELEMENT b EMPTY>
<!ATTLIST b c CDATA 'foo'>

Validate, Serialize, Query:

Test-sparse.xml (Test.xml validated against sparse.dtd):
<a> <b/> <b c="bar"/> </a>
count(//@c) = 1

Test-full.xml (Test.xml validated against full.dtd):
<a> <b c="foo"/> <b c="bar"/> </a>
count(//@c) = 2

Can we think of Test-sparse and -full as “the same”?

Note: In oXygen, one needs to use internal validation.
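The effect can be reproduced without oXygen. The sketch below relies on one assumption worth stating: Python’s xml.etree.ElementTree uses the expat parser, which does not validate but does apply attribute defaults declared in an internal DTD subset. The two DTDs from the slide are therefore inlined as internal subsets, and count_c stands in for the XPath count(//@c).

```python
import xml.etree.ElementTree as ET

BODY = '<a> <b/> <b c="bar"/> </a>'

# The two DTDs from the slide, written as internal subsets.
SPARSE = ("<!DOCTYPE a [<!ELEMENT a (b)+> <!ELEMENT b EMPTY> "
          "<!ATTLIST b c CDATA #IMPLIED>]>")
FULL = ("<!DOCTYPE a [<!ELEMENT a (b)+> <!ELEMENT b EMPTY> "
        "<!ATTLIST b c CDATA 'foo'>]>")


def count_c(doctype):
    """Count c attributes in the parsed tree, like count(//@c)."""
    root = ET.fromstring(doctype + BODY)
    return sum(1 for elem in root.iter() if "c" in elem.attrib)


print(count_c(SPARSE))
print(count_c(FULL))
```

The element content is byte-for-byte identical in both runs; only the declarations differ, yet the same query gives different answers.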


slide-52
SLIDE 52

Not self-describing!

  • Under external validation
  • Not just legality, but content!

– The PSVIs have different information in them!


slide-53
SLIDE 53

Roundtripping “Success”: Types

Test.xml:
<a> <b/> <b/> </a>

bare.xsd:
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="a">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="b" maxOccurs="unbounded"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
  <xs:element name="b"/>
</xs:schema>

typed.xsd:
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="a" type="atype"/>
  <xs:complexType name="atype">
    <xs:sequence>
      <xs:element ref="b" maxOccurs="unbounded"/>
    </xs:sequence>
  </xs:complexType>
  <xs:element name="b" type="btype"/>
  <xs:complexType name="btype"/>
</xs:schema>

Validate, then Query:
count(//b) = 2 (bare.xsd)
count(//b) = 2 (typed.xsd)

Note: In oXygen, one needs to use internal validation.
Note: WXS can do default attributes as well.

slide-54
SLIDE 54

Roundtripping “Success”: Types

Test.xml, bare.xsd and typed.xsd as on the previous slide.

Validate, then Query:
count(//b) = 2 (bare.xsd)
count(//b) = 2 (typed.xsd)

Validate, then Query2:
count(//element(*,btype)) = ? (bare.xsd)
count(//element(*,btype)) = 2 (typed.xsd)

Note: In oXygen, one needs to use internal validation.
Note: WXS can do default attributes and elements as well.

slide-55
SLIDE 55

Roundtripping “Success”: Types


slide-56
SLIDE 56

Roundtripping “Success”: Types

Test.xml, bare.xsd and typed.xsd as on the previous slides.

Validate, then Query:
count(//b) = 2 (bare.xsd)
count(//b) = 2 (typed.xsd)

Validate, then Query2:
count(//element(*,btype)) = ? (bare.xsd)
count(//element(*,btype)) = 2 (typed.xsd)

Validate, then Serialize:
Test.xml:
<a> <b/> <b /> </a>

Does external through internal succeed?
Does internal through external succeed?

Note: In oXygen, one needs to use internal validation.
Note: WXS can do default attributes as well.

slide-57
SLIDE 57


  • Type

– Internal to external and back

  • Take an element, foo, with content {“one”, “2”, 3}
  • Its (simple) type is a list of union of integer and string
  • Serialize

– <foo>one 2 3</foo>

  • Parse and validate

– Content is {“one”, 2, 3}
– Key type info LOST
  » Silently
  » With only 1 schema

  • Spelling

– External to internal and back

  • “001” to 1 to “1”

– Whitespace and layout
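Both failures are easy to reproduce in miniature. The sketch below is a toy model, not a WXS implementation: serialize and parse are hypothetical helpers that mimic a list type whose items are a union of integer and string, with the integer member tried first.

```python
def serialize(items):
    """Internal to external: a whitespace-separated list, as in <foo>one 2 3</foo>."""
    return " ".join(str(item) for item in items)


def parse(text):
    """External to internal: each token becomes an int if it can, else a string."""
    def as_union(token):
        try:
            return int(token)
        except ValueError:
            return token
    return [as_union(token) for token in text.split()]


original = ["one", "2", 3]
round_tripped = parse(serialize(original))
print(original, "->", serialize(original), "->", round_tripped)

# Spelling: external to internal and back.
print("001", "->", parse("001")[0], "->", serialize(parse("001")))
```

The string “2” comes back as the integer 2 without any error being raised, and “001” comes back spelled “1”.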

http://bit.ly/essenceOfXML2

More Roundtripping Fail


slide-58
SLIDE 58

The Essence of XML

  • Conclusion:

– “So the essence of XML is this: the problem it solves is not hard, and it does not solve the problem well.”

  • It’s not obvious

– That the issues are serious (enough)
– That the problem solved is all that easy
– That there aren’t other, worse issues

http://bit.ly/essenceOfXML2

slide-59
SLIDE 59

The Essence of Error

Or, so wrong it’s right


slide-60
SLIDE 60

How to cope?

  • With which task?

– Authoring, aggregating, querying…

  • Settle on a core representation of the model

– Perhaps the Atom DOM

  • Coerce/transform/extract other models

– To the representative one
– Or build software that mediates the difference

  • Hope that there aren’t too many
  • Advocate standards!

– Or make them
– “The nice thing about standards is that there are so many of them to choose from.”

  • Kent Pitman and others

slide-61
SLIDE 61

Postel’s Law

  • Liberality

– Many DOMs, all expressing the same thing
– Many surface syntaxes (perhaps) for each DOM

  • Conservativity

– What should we send?

  • It depends on the receiver!

– Minimal standards?

  • Well formed XML?
  • Valid according to a popular schema/format?
  • HTML?

Be liberal in what you accept, and conservative in what you send.


slide-62
SLIDE 62


Structure and Presentation

  • We’ve called this “DOM” and “Application” Layer

– A very common application layer is “rendering”

  • Text, images
  • Like, y’know, the web
  • Standard vs. default renderings
  • Goes back to SGML

<sentence style="slanted">This sentence is false.</sentence>

Correct rendering: This sentence is false.
Fallback: This sentence is false.

(Still see this in XSLT!)


slide-63
SLIDE 63


Why Separate them?

  • Presentation is more fluid than structure

– The "look" may need updating

  • Presentation needs may vary

– What works for 21" screens doesn't for mobile phones

  • (Or maybe not!)
  • Accessibility

– (content should be perceivable by everyone)

  • Programmatic processing needs


slide-64
SLIDE 64


Another digression: CSS

  • The style language for the Web

– Strong separation of presentation

  • CSS is

– not an XML/angle brackets format

  • Oh NOES! Not another one!

– annotative, not transformative

  • Well, sorta

– mostly “formats” nodes
– ubiquitous on the Web, esp. client side
– works with arbitrary XML

  • But most clients work with (X)HTML
  • See the excellent PrinceXML formatter


slide-65
SLIDE 65


Basic Component

  • Rules

– Which consist of

  • Selectors

– Like XPath expressions
– But only forward, with some syntactic sugar

  • Declaration blocks

– Sets of property/value pairs

div.title { text-align:center; font-size: 24; }


slide-66
SLIDE 66

<html><head><title>A bit of style</title></head>
<body><style type="text/css">
.title { font-weight: bold }
div.title { text-align:center; font-size: 24; }
div.entry div.title { text-align: left; font-variant: normal}
span.date {font-style: italic}
span.date:after {content:" by"}
div.content {font-style: italic}
div.content i {font-style: normal;font-weight: bold}
#one {color: red}</style>
<div class="title">My Weblog</div>
<div class="entry">
  <div class="title">What I Did Today</div>
  <div class="byline">
    <span class="date">Feb. 09, 2009</span>
    <span class="author">Bijan Parsia</span>
  </div>
  <div class="content" id="one">
    <p>Taught a class and it went <i>very</i> well.</p>
  </div>
</div>
</body></html>

Try it in http://software.hixie.ch/utilities/js/live-dom-viewer/


slide-67
SLIDE 67


Media Types

  • Different sets of rules can be contextualized to media

– Screen, Print, Braille, Aural…

  • This is done with groupings called “@media rule”s

@media print {
  BODY { font-size: 10pt }
}
@media screen {
  BODY { font-size: 12pt }
}

Larger font size for screen


slide-68
SLIDE 68


Cascading

  • CSS Rules cascade

– That is, there is overriding (and non-overriding) inheritance

  • That is, rules combine in different ways

– http://www.w3.org/TR/CSS21/cascade.html#cascade

  • General principles

– Distance to the node is significant
– Precision of selectors is significant
– Order of appearance is significant


slide-69
SLIDE 69


Error Handling

  • XML has “draconian” error handling

– Well formedness error…BOOM

  • CSS has “forgiving” error handling

– “Rules for handling parsing errors”

http://www.w3.org/TR/CSS21/syndata.html#parsing-errors

  • That is, how to interpret illegal documents
  • Not reporting errors, but working around them

– E.g.,“User agents must ignore a declaration with an unknown property.”

  • Replace: “h1 { color: red; rotation: 70minutes }”
  • With: “h1 { color: red }”
  • Study the error handling rules!
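The “ignore what you do not understand” rule can be imitated in a few lines. This is a toy sketch, not a CSS parser: KNOWN_PROPERTIES is an assumed, tiny subset of real properties, and drop_unknown is a hypothetical helper that handles only a flat declaration list.

```python
# Assumed subset of properties that this imaginary user agent understands.
KNOWN_PROPERTIES = {"color", "font-size", "text-align"}


def drop_unknown(declarations):
    """Keep declarations with a known property, silently skip the rest."""
    kept = []
    for declaration in declarations.split(";"):
        if ":" not in declaration:
            continue  # empty or malformed piece: skipped, not reported
        name, value = declaration.split(":", 1)
        if name.strip().lower() in KNOWN_PROPERTIES:
            kept.append(f"{name.strip()}: {value.strip()}")
    return "; ".join(kept)


cleaned = drop_unknown("color: red; rotation: 70minutes")
print("h1 { " + cleaned + " }")
```

No exception is raised and nothing is reported; the rest of the rule still applies, which is the contrast with XML’s fatal errors.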


slide-70
SLIDE 70


CSS Robustness

  • Has to deal with Web conditions
  • 1. People borrowing
  • 2. People collaborating
  • 3. Different devices
  • 4. Different kinds of audiences (and authors)
  • 5. Maintainability
  • 6. Aesthetics
  • CSS is designed for this

– Cascading & Inheritance help with 1, 2, 5

  • And importing, of course

– @media rules help with 3-6
– Error handling helps with 1, 2, 4

slide-71
SLIDE 71

Errors!

  • One person’s error is another’s data
  • Errors may or may not be unusual
  • Errors are relative to a norm
  • Preventing errors

– Make errors hard or impossible to make

  • Make doing things hard or impossible

– Make doing the right thing easy and inevitable
– Make detecting errors easy
– Make correcting errors easy
– Correct errors
– Fail silently
– Fail randomly
– Fail differently (interop problem)

slide-72
SLIDE 72

(Perceived) Affordances

  • (Perceived) Affordance

– an available action that is salient to the actor

Donald Norman, The Design of Everyday Things


slide-73
SLIDE 73


slide-74
SLIDE 74

Attractive Nuisances

  • A dominant or attractive affordance

– with a bad or wrong action
– In law, “a hazardous object or condition on the land that is likely to attract children who are unable to appreciate the risk posed by the object or condition” -- ye olde Wikipedia
– We can reformulate

  • “a hazardous or misleading language or UI feature that is likely to be misused by (even) an educated user”
  • Contrast with “merely” hard to use

– An attractive nuisance is easy to attempt, hard to use (correctly), and has bad (to catastrophic) effects

slide-75
SLIDE 75

Typical Schema Languages

  • Grammar (and maybe type based)

– Recognize all or none

  • Though what the “all” is can be rather flexible

– Restrictive by default

  • Slogan: What is not permitted is forbidden

– Error detection and reporting

  • Is at the discretion of the system
  • “Not accepted” is the starting place
  • The point where an error is detected

– might not be the point where it occurred
– might not be the most helpful point to look at!

  • Programs!

– Null pointer deref
  » Is the right point the deref or the setting to null?
– Non-crashing errors

slide-76
SLIDE 76

The SSD Way

  • Explore before prescribe
  • Describe rather than define
  • Take what you can, when you can take it
  • Extra or missing stuff is (can be) OK

– Irregular structure!

  • Adhere to the task at hand
  • Adore Postel’s Law


slide-77
SLIDE 77

XML Error Handling

  • De facto XML motto

– Be strict about the well formedness of what you accept, and strict in what you send
– Draconian error handling
– Severe consequences on the Web

  • And other places
  • Fail early and fail hard
  • What about higher levels?

– Validity and other analysis?
– Most schema languages poor at error reporting

  • How about XQuery’s type error reporting?

slide-78
SLIDE 78

XML Error Handling

  • The spec:

– fatal error [Definition: An error which a conforming XML processor must detect and report to the application. After encountering a fatal error, the processor may continue processing the data to search for further errors and may report such errors to the application. In order to support correction of errors, the processor may make unprocessed data from the document (with intermingled character data and markup) available to the application. Once a fatal error is detected, however, the processor must not continue normal processing (i.e., it must not continue to pass character data and information about the document's logical structure to the application in the normal way).]

  • What should an application do?

– To or for its users


slide-79
SLIDE 79

XPath for Validation

  • What XPath is “equivalent” to the declaration of <b>?

valid.xml:
<a> <b/> <b/> <b/> </a>

simple.dtd:
<!ELEMENT a (b)+>
<!ELEMENT b EMPTY>

invalid.xml:
<a> <b/> <b>Foo</b> <b><b/></b> </a>

                      valid.xml   invalid.xml
count(//b)            = 3         = 4
count(//b/*)          = 0         = 1
count(//b/text())     = 0         = 1

Each of the last two queries misses one kind of error:

<a> <b/> <b>Foo</b> </a>        count(//b/*) = 0
<a> <b/> <b><b/></b> </a>       count(//b/text()) = 0

slide-80
SLIDE 80

XPath for Validation

  • What XPath is “equivalent” to the declaration of <b>?

valid.xml, simple.dtd and invalid.xml as on the previous slide.

count(//b/(* | text()))

valid.xml:                      = 0
invalid.xml:                    = 2
<a> <b/> <b>Foo</b> </a>        = 1
<a> <b/> <b><b/></b> </a>       = 1

slide-81
SLIDE 81

XPath for Validation

  • What XPath is “equivalent” to the declaration of <b>?

valid.xml:
<a> <b/> <b/> <b/> </a>

simple.dtd:
<!ELEMENT a (b)+>
<!ELEMENT b EMPTY>

invalid.xml:
<a> <b/> <b>Foo</b> <b><b/></b> </a>

if (count(//b/(* | text()))=0) then “valid” else “invalid”

valid.xml:                      = valid
invalid.xml:                    = invalid
<a> <b/> <b>Foo</b> </a>        = invalid
<a> <b/> <b><b/></b> </a>       = invalid

Can even “find” the errors!
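The same check, including “finding” the errors, can be sketched with Python’s xml.etree.ElementTree. ElementTree supports only a small XPath subset, so the test for “has element children or text” is written in plain Python here; offending_b is a made-up helper name.

```python
import xml.etree.ElementTree as ET

VALID = "<a> <b/> <b/> <b/> </a>"
INVALID = "<a> <b/> <b>Foo</b> <b><b/></b> </a>"


def offending_b(xml_text):
    """Return every b element that is not EMPTY (has children or text)."""
    root = ET.fromstring(xml_text)
    return [b for b in root.iter("b")
            if len(b) > 0 or (b.text or "") != ""]


for name, doc in (("valid.xml", VALID), ("invalid.xml", INVALID)):
    errors = offending_b(doc)
    verdict = "valid" if not errors else "invalid"
    print(name, verdict, [ET.tostring(b, encoding="unicode").strip() for b in errors])
```

This only covers the declaration of b; the content model of a would need its own check.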


slide-82
SLIDE 82


slide-83
SLIDE 83

XPath (etc) for Validation

  • We could have finer control

– Validate parts of a document
– A la wildcards

  • But with more control!
  • We could have greater expressivity

– Far reaching dependencies
– Computations

  • Essentially, code based validation!

– With XQuery and XSLT – But still a leetle declarative

  • We always need it

The essence of Schematron


slide-84
SLIDE 84
Schematron

  • A different sort of schema language

– Not grammar or object/type based
– Rule based
– Test oriented
– Complementary

  • Conceptually simple

– Patterns contain rules

  • Rules set a context and contain asserts and reports (A&Rs)
  • A&Rs contain

– Tests, which are XPath expressions, and
– Assertions, which are natural language descriptions

slide-85
SLIDE 85

DTDx Schematron

  • “Only 1 Element declaration with a given name”

– (Ok, could handle this with Keys in XML Schema!)

<rule context="element">
  <let name="n" value="@name"/>
  <assert test="count(//element/name[text()=$n]) = 1">
    There can be only one element declaration with a given name.
  </assert>
</rule>

  • “Every element reference must have a corresponding element declaration”

<rule context="elementref">
  <let name="r" value="/ref/text()"/>
  <assert test="count(//element/name[text()=$r]) = 1">
    There must be an element declaration (with the right name) for elementref to refer to.
  </assert>
</rule>

slide-86
SLIDE 86

From HTML5: Exclusions

  • HTML5 validator
  • http://hsivonen.iki.fi/thesis/

–Relax NG schema
–Schematron assertions
–Custom code

  • Often want contextual exclusions

–To break circles:

  • Paragraphs contain footnotes
  • Footnotes contain paragraphs
  • Footnote paragraphs may not contain footnotes
  • Without exclusions, would need many paragraph productions

slide-87
SLIDE 87

Exclusions Examples

<schema xmlns="http://purl.oclc.org/dsdl/schematron">
  <ns prefix="h" uri="http://www.w3.org/1999/xhtml"/>
  <pattern name='dfn cannot nest'>
    <rule context="h:dfn">
      <report test="ancestor::h:dfn">
        The "dfn" element cannot contain any nested "dfn" elements.</report>
    </rule>
  </pattern>
  <pattern name='noscript cannot nest'>
    <rule context="h:noscript">
      <report test="ancestor::h:noscript">
        The "noscript" element cannot contain any nested "noscript" elements.</report>
    </rule>
  </pattern>
</schema>
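To see what such a rule does operationally, here is a rough Python imitation of the “cannot nest” pattern. It is not a Schematron engine: nesting_reports is a hypothetical helper that hard-codes the rule context (an element name in the XHTML namespace) and the test (an ancestor with the same name), and returns the report messages that would fire.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"


def nesting_reports(root, local_name):
    """Fire one report per element that has an ancestor with the same name."""
    tag = f"{{{XHTML_NS}}}{local_name}"
    reports = []

    def walk(elem, inside):
        if elem.tag == tag and inside:  # plays the role of test="ancestor::h:..."
            reports.append(
                f'The "{local_name}" element cannot contain any '
                f'nested "{local_name}" elements.'
            )
        for child in elem:
            walk(child, inside or elem.tag == tag)

    walk(root, False)
    return reports


BAD = f'<p xmlns="{XHTML_NS}"><dfn>outer <dfn>inner</dfn></dfn></p>'
GOOD = f'<p xmlns="{XHTML_NS}"><dfn>one</dfn> and <dfn>two</dfn></p>'

print(nesting_reports(ET.fromstring(BAD), "dfn"))
print(nesting_reports(ET.fromstring(GOOD), "dfn"))
```

Everything the rule does not mention is simply left alone, which is the “what isn’t forbidden is permitted” stance in code form.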


slide-88
SLIDE 88

Tip of the iceberg

  • Computations

–Using XPath functions and variables

  • Dynamic checks

–Can pull stuff from other file

  • Elaborate reports

–diagnostics has (value-ofed) expressions
–“Generate paths” to errors

  • Sound familiar?
  • General case

–Thin shim over XSLT
–Closer to “arbitrary code”

slide-89
SLIDE 89

Interesting Points

  • DTDx has a WXS

– Schematron doesn’t care
– Two phase validation

  • RELAX NG has a way of embedding
  • WXS 1.1 incorporating similar rules
  • Arbitrary XPath for context and test

– Plus variables!

  • What isn’t forbidden is permitted

– Unlike all the other schema languages!
– We’re not performing runs

  • We’re firing rules

– Somewhat easy to use

  • If you know XPath
  • If you don’t need coverage

– What about analysis?


slide-90
SLIDE 90

Schematron Presumes…

  • …well formed XML

–As do all XML schema languages

  • Work on DOM!

–So can’t help with e.g., overlapping tags

  • Or tag soup in general
  • Namespace Analysis!?
  • …authorial repair

–At least, in the default case

  • Communicate errors to people
  • Thus, not the basis of a modern browser!

–Unlike CSS

  • Is this enough liberality?

–Or rather, does it support enough liberality?


slide-91
SLIDE 91

Take the following sample XHTML code:

<html>
  <head>
    <title>Hello!</title>
    <meta http-equiv="Content-Type" content="application/xhtml+xml" />
  </head>
  <body>
    <p>Hello to you!</p>
    <p>Can you spot the problem?
  </body>
</html>

91 Slide due to Iain Flynn
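The difference between the two treatments of this page can be tried out with Python’s standard library. This is a sketch under two assumptions: xml.etree.ElementTree stands in for a strict XML processor, and html.parser.HTMLParser stands in for a forgiving one (it is only a tokenizer and does not perform the tree repair that a browser does). The TagCollector class is made up for the example.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

PAGE = """<html>
  <head>
    <title>Hello!</title>
    <meta http-equiv="Content-Type" content="application/xhtml+xml" />
  </head>
  <body>
    <p>Hello to you!</p>
    <p>Can you spot the problem?
  </body>
</html>"""


def parse_as_xml(text):
    """Draconian: the unclosed <p> is a fatal well-formedness error."""
    try:
        ET.fromstring(text)
        return "well-formed"
    except ET.ParseError as error:
        return f"fatal error: {error}"


class TagCollector(HTMLParser):
    """Forgiving: just records the start tags it sees and carries on."""

    def __init__(self):
        super().__init__()
        self.start_tags = []

    def handle_starttag(self, tag, attrs):
        self.start_tags.append(tag)


collector = TagCollector()
collector.feed(PAGE)

print(parse_as_xml(PAGE))
print(collector.start_tags)
```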


slide-92
SLIDE 92

[Screenshots: the page as shown when served as HTML and as XHTML]

Slide due to Iain Flynn

slide-93
SLIDE 93

Validation In The Wild

  • HTML

– 1%-5% of web pages are valid
– Validation is very weak!
– All sorts of breakage

  • E.g., overlapping tags
  • <b>hi <i>there</b>, my good friend</i>
  • Syndication Formats

– 10% of feeds not well-formed
– Where do the problems come from?

  • Hand authoring
  • Generation bugs
  • String concat based generation
  • Composition from random sources


slide-94
SLIDE 94

More recently

In 2005, the developers of Google Reader (Google’s RSS and Atom feed parser) took a snapshot of the XML documents they parsed in one day.

  • Approximately 7% of these documents contained at least one well-formedness error.

  • Google Reader deals with millions of feeds per day.

– That’s a lot of broken documents

Source: http://googlereader.blogspot.com/2005/12/xml-errors-in-feeds.html

Slide due to Iain Flynn

slide-95
SLIDE 95

[Chart with the categories Encoding, Structure, Entity, Typo]

Slide due to Iain Flynn

slide-96
SLIDE 96

[Chart: Error Breakdowns, with the categories Encoding, Structure, Entity, Typo]

Slide due to Iain Flynn

slide-97
SLIDE 97

A Thought Experiment

  • “Imagine...that all web browsers use strict XML parsers”
  • “...that you were using a publishing tool that [was strict]”

– “All of its default templates were valid XHTML.”
– “It incorporated a nifty layout editor to ensure that you couldn’t introduce any invalid XHTML...”

  • “You click ‘Publish’”

– “the page that you...validly authored is now not well-formed”

  • Problem: “a trackback with some illegal characters”

– “...your publishing tool had a bug”
– “The administration page itself tries to display the trackbacks you’ve received, and you get an XML processing error.”

http://diveintomark.org/archives/2004/01/14/thought_experiment


slide-98
SLIDE 98

Real Life


slide-99
SLIDE 99

Lesson #1

  • We are dealing with socio-political (and economic) phenomena

– Complex ones!
– Many players; many sorts of player
– Lots of historical specifics
– Lots of interaction effects

  • Human factors critical

– What do people do (and why?)
– How to influence them?
– Affordances and incentives
– Dealing with “bozos”

  • “There’s just no nice way to say this: Anyone who can’t make a syndication feed that’s well-formed XML is an incompetent fool.”

slide-100
SLIDE 100

3 Error Handling Styles

  • Draconian

– Fail hard and fast

  • Ignore errors

– CSS, DTD ATTLISTs, HTML

  • Hard coded DWIM repair

– HTML, HTML5

  • Ultimately, (some) errors are propagated

– The key is to fail correctly

  • In the right way, at the right time, for the right reason

– With the right message!

  • Better is to make errors unlikely!

Every set of bytes has a corresponding (determinate) DOM


slide-101
SLIDE 101

Wrap-up

Or, goodbyes and farewells


slide-102
SLIDE 102

Semi-structured Data

  • There’s a tension between

– flexibility and stability
– flexibility and efficiency
– expressivity and efficiency
– usability and flexibility
– usability and rigidity
– etc. etc. etc.

  • It is important to

– understand trade-offs
– cultivate judgement

  • Most things can be made to work

– there is no silver bullet

  • Most things can fail


slide-103
SLIDE 103

Last coursework

  • There’s the usual line up

– CW5 is “easy” :)

  • XQuery version of the calculator
  • (slightly extended)

– M5 is a bit of Schematron

  • COURSEWORK DEADLINE IS DIFFERENT

– Due MONDAY, NOV 7TH!!!
– At 9:00AM

  • Some “make up” work available

– Due after period 2
– So as not to conflict
– Practice some Java!

slide-104
SLIDE 104

The Exam

  • Electronic/Online

– Basically, an extended version of Qs and SEs

  • Revision session

– After break

  • Blackboard discussion area

– For revision


slide-105
SLIDE 105

Thanks for playing

  • Uli and I enjoyed working with you
  • There are many possible projects that would build on things you’ve learned; see me if you’re interested