The XML Typechecking Problem, Dan Suciu, University of Washington



SLIDE 1

The XML Typechecking Problem

Dan Suciu, University of Washington
Presented by T.J. Green*, University of Pennsylvania
February 19, 2004

*with LaTeX slides!

SLIDE 2

XML Data Model

Subset of XQuery data model: XML documents are ordered trees with labels at nodes. More precisely, fix an alphabet Σ of tag names, attribute names, and atomic type names. Denote by TΣ the set of ordered trees where each node is labeled with an element from Σ.
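
As a concrete picture of this data model, here is a minimal Python sketch of an ordered labeled tree. The names `Node`, `size`, and `labels_preorder` are illustrative choices, not from the talk; attributes and atomic values are simply modeled as labeled nodes, following the slide's single alphabet Σ.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """One node of an ordered tree; `label` is drawn from the alphabet Sigma."""
    label: str
    children: List["Node"] = field(default_factory=list)  # sibling order matters


def size(t: Node) -> int:
    """Number of nodes in the tree rooted at t."""
    return 1 + sum(size(c) for c in t.children)


def labels_preorder(t: Node) -> List[str]:
    """Labels in document order (preorder traversal)."""
    out = [t.label]
    for c in t.children:
        out.extend(labels_preorder(c))
    return out


# A tiny catalog document: one product with a name and a color.
doc = Node("catalog", [
    Node("product", [Node("name"), Node("color")]),
])
```

Swapping the two children of `product` yields a different element of TΣ, since the trees are ordered.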

SLIDE 3

XML Types

A type is a subset of TΣ that is a regular tree language. Formally, a type is defined by a set of type identifiers T, together with a mapping that associates to each identifier a regular expression over Σ × T.

SLIDE 4

XML Types - an example

Here is an example, using XQuery’s syntax.

TYPE Catalog  = ELEMENT catalog(Products)
TYPE Products = (ELEMENT product(Product))*
TYPE Product  = (ATTRIBUTE name(STRING)?,
                 (ELEMENT mfr-price(INTEGER) | ELEMENT sale-price(INTEGER))*,
                 (ELEMENT color(STRING))*)

SLIDE 5

Expressiveness of type formalism

Of course, this formalism does not capture all the details of a real XML type system. But it is actually more powerful than XML Schema or DTD’s in one respect.

SLIDE 6

Expressiveness of type formalism

Consider the set of pairs (σ, t) ∈ Σ × T that occur in the regular expression for some type identifier. XML Schema requires that σ be a key in this collection. DTD requires that σ be a key in the entire collection of pairs in all regular expressions.
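
The two restrictions can be read as simple checks on the (σ, t) pairs. The sketch below is one interpretation of the slide, under the assumption that a type is summarized as a dictionary from each type identifier to the set of pairs occurring in its regular expression; the function names and the example identifiers (`ProdName`, `MfrName`, `NewCar`, `UsedCar`) are hypothetical.

```python
from typing import Dict, Set, Tuple

Pairs = Set[Tuple[str, str]]  # (tag sigma, type identifier t)


def is_key(pairs: Pairs) -> bool:
    """True if each tag sigma is paired with at most one type identifier."""
    seen: Dict[str, str] = {}
    for sigma, t in pairs:
        if seen.setdefault(sigma, t) != t:
            return False
    return True


def satisfies_xml_schema_restriction(types: Dict[str, Pairs]) -> bool:
    """sigma must be a key within each regular expression separately."""
    return all(is_key(p) for p in types.values())


def satisfies_dtd_restriction(types: Dict[str, Pairs]) -> bool:
    """sigma must be a key across the pairs of all regular expressions."""
    return is_key(set().union(*types.values()))


# <name> means different things under <product> and under <mfr>:
# fine for the XML Schema restriction, not for the DTD restriction.
by_context = {
    "Product": {("name", "ProdName")},
    "Mfr": {("name", "MfrName")},
}

# Two kinds of <car> inside the same expression: violates both restrictions,
# but is still expressible in the formalism of this talk.
same_regex = {
    "Dealer": {("car", "NewCar"), ("car", "UsedCar")},
}
```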

SLIDE 7

Regular tree languages

Regular tree languages have been extensively studied for ranked trees (i.e., where the number of children of a node is fixed). But the XML data model is unranked.

SLIDE 8

Modified regular tree languages

Various equivalent modifications can handle this (extending tree automata to unranked trees; using specialized DTD’s; mapping unranked trees into ranked binary trees; defining types as in XDuce or XQuery). Here we use the XQuery style regular types.
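
One of the listed options, mapping unranked trees into binary trees, is commonly done with a first-child/next-sibling encoding. The following Python sketch assumes that particular encoding (the slide does not say which one is meant) and uses `None` for a missing child rather than a padding leaf; all names are illustrative.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """Unranked ordered tree node."""
    label: str
    children: List["Node"] = field(default_factory=list)


@dataclass
class BinNode:
    """Binary tree node: left = first child, right = next sibling."""
    label: str
    left: Optional["BinNode"] = None
    right: Optional["BinNode"] = None


def encode_forest(forest: List[Node]) -> Optional[BinNode]:
    """Encode an ordered list of sibling trees as one binary tree."""
    if not forest:
        return None
    head, rest = forest[0], forest[1:]
    return BinNode(head.label, encode_forest(head.children), encode_forest(rest))


def decode_forest(b: Optional[BinNode]) -> List[Node]:
    """Inverse of encode_forest."""
    if b is None:
        return []
    return [Node(b.label, decode_forest(b.left))] + decode_forest(b.right)
```

Because the encoding is a bijection between forests and binary trees, results about ranked tree automata can be carried over to the unranked setting.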

SLIDE 9

Containment of regular tree languages

Key property of regular tree languages: given two types τ1, τ2, one can check whether τ1 ⊆ τ2. The complexity is high in general (EXPTIME-complete), but containment is in PTIME if τ2 corresponds to a deterministic tree automaton.

SLIDE 10

The Validation Problem

Given a tree t ∈ TΣ and a type τ, decide whether t ∈ τ. But what if instead of a document, we are given a program whose output is an XML document?
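
For a single document, validation can be sketched directly from the definition of types. The Python below is a minimal, unoptimized sketch, not the talk's algorithm: trees are `(label, children)` tuples, regular expressions over Σ × T are nested tuples, atomic types such as STRING are treated as "no children", and attributes are treated like elements. All names are assumptions made for illustration.

```python
from typing import Dict, List, Set, Tuple

Tree = Tuple[str, list]  # (label, ordered list of child Trees)

# Regular expressions over Sigma x T as nested tuples:
#   ("eps",)  ("sym", sigma, t)  ("seq", r1, ...)  ("alt", r1, ...)
#   ("star", r)  ("opt", r)


def ends(r, kids: List[Tree], i: int, types: Dict[str, tuple]) -> Set[int]:
    """All positions j such that kids[i:j] matches the expression r."""
    op = r[0]
    if op == "eps":
        return {i}
    if op == "sym":  # one child labeled sigma whose content has type t
        _, sigma, t = r
        if i < len(kids) and kids[i][0] == sigma and has_type(kids[i][1], t, types):
            return {i + 1}
        return set()
    if op == "seq":
        cur = {i}
        for part in r[1:]:
            cur = {j for k in cur for j in ends(part, kids, k, types)}
        return cur
    if op == "alt":
        return set().union(*(ends(part, kids, i, types) for part in r[1:]))
    if op == "opt":
        return {i} | ends(r[1], kids, i, types)
    if op == "star":
        reached, frontier = {i}, {i}
        while frontier:  # stops once no new end position is found
            nxt = {j for k in frontier for j in ends(r[1], kids, k, types)} - reached
            reached |= nxt
            frontier = nxt
        return reached
    raise ValueError(f"unknown operator {op!r}")


def has_type(kids: List[Tree], t: str, types: Dict[str, tuple]) -> bool:
    """True if the whole child sequence matches the expression bound to t."""
    return len(kids) in ends(types[t], kids, 0, types)


def validate(tree: Tree, t: str, types: Dict[str, tuple]) -> bool:
    """Decide whether a single document tree belongs to type t."""
    return has_type([tree], t, types)


# The Catalog type from the earlier example slide.
CATALOG_TYPES = {
    "Catalog": ("sym", "catalog", "Products"),
    "Products": ("star", ("sym", "product", "Product")),
    "Product": ("seq",
                ("opt", ("sym", "name", "STRING")),
                ("star", ("alt", ("sym", "mfr-price", "INTEGER"),
                          ("sym", "sale-price", "INTEGER"))),
                ("star", ("sym", "color", "STRING"))),
    "STRING": ("eps",),   # atomic values are not modeled in this sketch
    "INTEGER": ("eps",),
}
```

For example, a catalog whose product lists `color` before `name` is rejected, because the Product expression fixes the order.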

SLIDE 11

The Typechecking Problem

Given a program P defining a function P : D → TΣ, where D is the program’s input domain, and a type τ ⊆ TΣ, decide whether ∀x ∈ D, P(x) ∈ τ.

SLIDE 12

The Typechecking Problem

So the typechecker analyzes the program and decides whether all documents produced by the program are valid, and returns yes or no. If no, we would also like to know where in the program typechecking failed. (May be hard though.)

SLIDE 13

The Typechecking Problem

Typechecking may not even be possible, in which case we may need to settle for an incomplete typechecker, which may reject some programs that in fact do typecheck.

SLIDE 14

The Type Inference Problem

A kind of dual of the typechecking problem. Given a program P, compute the type P(D) = {P(x) | x ∈ D}. Again, perfect type inference may not be possible, and we may need to settle for incomplete type inference.

SLIDE 15

What kind of programs?

We consider two different kinds of programs, depending on the application.

SLIDE 16

Application 1 - XML Publishing

Here the XML document is a view over a relational database. The program’s domain is D = Inst(S), the set of all database instances of some schema, S. S may contain key and foreign key constraints. P may perform only simple select-project-join queries on the database, nest the results, and add appropriate XML tags.

SLIDE 17

Application 1 - XML Publishing

Consider some database whose schema S is defined as follows.

product(pid:STRING, name:STRING, mfrprice:INTEGER)
colors(cid:STRING, pid:STRING, color:STRING)
sale(sid:STRING, pid:STRING, price:INTEGER)

The first attribute of each relation is a key. Foreign key constraints are suggested by the attribute names.

SLIDE 18

Application 1 - XML Publishing

Now, here is an example of an XQuery program that produces an XML view of this database.

<catalog>
{ FOR $p in $db/product/tuple
  RETURN
    <product name = { data($p/name) }>
      <mfr-price> { data($p/mfrprice) } </mfr-price>
      { FOR $s in $db/sale/tuple
        WHERE $p/pid = $s/pid
        RETURN <sale-price> { data($s/price) } </sale-price> }
      { FOR $c in $db/colors/tuple
        WHERE $p/pid = $c/pid
        RETURN <color> { data($c/color) } </color> }
    </product> }
</catalog>

SLIDE 19

Application 2 - XML Transformations

The other class of applications we consider is those which require XML Transformations. Here, the program’s input is an XML document, that is, the program’s domain D is either TΣ or some XML type τ. The output is another XML document.

SLIDE 20

Application 2 - XML Transformations

We take as our programming language a restricted fragment of XSLT that includes:

  • recursive templates
  • modes
  • apply-templates can be called along any XPath axis
  • variables can be bound to nodes in the input tree, then passed as parameters
  • an equality test can be performed between node ID’s, but not between node values

SLIDE 21

Application 2 - XML Transformations

We can formalize this language in terms of k-pebble tree transducers. That formalism is beyond the scope of this talk.

SLIDE 22

Type Checking or Type Inference?

One way to perform typechecking is by using type inference: infer the output type τ1 of the program, and check for containment in the desired output type τ2, that is, τ1 ⊆ τ2. We’ll first consider type inference.

SLIDE 23

Type Inference

Consider the XQuery program shown a few slides back. We humans can infer its output type as

TYPE T1 = ELEMENT catalog(T2)
TYPE T2 = (ELEMENT product(T3))*
TYPE T3 = ATTRIBUTE name(STRING),
          ELEMENT mfr-price(INTEGER),
          (ELEMENT sale-price(INTEGER))*,
          (ELEMENT color(STRING))*

How? catalog tag at root is obvious (T1). Several product children (T2). Analyze RETURN clause: product has exactly one name attribute, one mfr-price child, and several sale-price and color children.

SLIDE 24

Type Inference

More programmatically, the general idea is that one infers the type of a RETURN expression from the types of its components. The XQuery formal semantics applies this to the entire language by providing type inference rules for each language construct. Type inference is used to perform typechecking in XQuery.

SLIDE 25

Type Inference

For the XML publishing application, we actually need an enhancement to make use of key and foreign key constraints in order to infer the correct output type.

For example, knowing that pid is also a key for sale (each product has at most one sale price) narrows T3 by replacing

(ELEMENT sale-price(INTEGER))* with (ELEMENT sale-price(INTEGER))?.

SLIDE 26

Limitations of Type Inference

Suppose the relational schema has a single table, R(x,y), and the XQuery program is:

<result>
{ FOR $x in $db/R/tuple RETURN <a/>,
  FOR $x in $db/R/tuple RETURN <b/> }
</result>

SLIDE 27

Limitations of Type Inference

XQuery infers its output type as

TYPE T = ELEMENT result((ELEMENT a)*, (ELEMENT b)*)

but the real output type is:

P(D) = {ELEMENT result((ELEMENT a)^n, (ELEMENT b)^n) | n ≥ 0}

since we have the same number of a’s and b’s. But obviously this is not a regular tree language, so we cannot hope to infer it, and must settle for T instead.

SLIDE 28

Limitations of Type Inference

But T is an ad-hoc choice, and now we incorrectly fail to typecheck with respect to the output type

T1 = ELEMENT result() | ELEMENT result(ELEMENT a, (ELEMENT a)*, ELEMENT b, (ELEMENT b)*)

The program in reality typechecks against this type, because T1 just rules out the cases of (0 a’s, 1+ b’s) or (1+ a’s, 0 b’s), which the program never produces. Yet the typechecker rejects it, because T ⊈ T1.
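
The gap can be seen on the child sequences of `result` alone. The Python sketch below abbreviates each child sequence as a string over {a, b} and uses ordinary regular expressions as stand-ins for the inferred type T and the desired type T1; `program_output` is a hypothetical model of the query, assuming it emits one `a` and one `b` per tuple of R.

```python
import re

# Child sequences of <result>, written as strings over {a, b}.
T = re.compile(r"a*b*")          # the inferred type: any a's, then any b's
T1 = re.compile(r"(?:a+b+)?")    # desired type: empty, or 1+ a's then 1+ b's


def program_output(n: int) -> str:
    """Children produced by the program when table R has n tuples."""
    return "a" * n + "b" * n


def in_type(regex, word: str) -> bool:
    return regex.fullmatch(word) is not None


# Every actual output conforms to T1 ...
outputs_ok = all(in_type(T1, program_output(n)) for n in range(10))

# ... but the inferred type T contains "a", which T1 excludes, so the
# containment of T in T1 fails and inference-based typechecking says no.
witness = "a"
containment_fails = in_type(T, witness) and not in_type(T1, witness)
```

The witness `a` is never produced by the program; it appears only because T over-approximates the true output set.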

SLIDE 29

Typechecking

Given these limitations, maybe we can do better by trying to do typechecking without type inference? Indeed, given certain restrictions on the programming language and output type, it is possible.

SLIDE 30

Typechecking for XML Publishing

Here is an algorithm for typechecking P against τ: enumerate all “small” input databases (up to a size which depends only on P and τ); run P on each; check that the output conforms to τ. Not the most efficient algorithm, but it works*!
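
The enumeration idea can be sketched in a few lines of Python. This is only an illustration under strong assumptions: the schema is a single binary table R with no key constraints, the program's output is abbreviated to the string of child labels under the root, and the size bound (`domain`, `max_tuples`) is a placeholder that the caller must supply. The slide says such a bound exists and depends only on P and τ, but does not state it, so the sketch is sound only if the bound is chosen accordingly.

```python
import re
from itertools import combinations, product


def all_instances(domain, max_tuples):
    """Every instance of a binary table R over `domain` with at most max_tuples rows."""
    rows = list(product(domain, repeat=2))
    for k in range(max_tuples + 1):
        for inst in combinations(rows, k):
            yield frozenset(inst)


def P(R):
    """Toy publishing program: one <a/> per tuple of R, then one <b/> per tuple."""
    return "a" * len(R) + "b" * len(R)


def typechecks(program, tau, domain, max_tuples):
    """Run the program on every small instance; return (ok, counterexample)."""
    for inst in all_instances(domain, max_tuples):
        if tau.fullmatch(program(inst)) is None:
            return False, inst
    return True, None


# Output type: empty, or at least one a followed by at least one b.
ok, cex = typechecks(P, re.compile(r"(?:a+b+)?"), [0, 1], 3)

# A stricter output type (at most one a and one b) fails, with a witness database.
ok_strict, cex_strict = typechecks(P, re.compile(r"a?b?"), [0, 1], 3)
```

Unlike the inference-based check, a "no" answer here comes with a concrete input database on which the program misbehaves.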

SLIDE 31

Typechecking for XML Publishing

*Actually, two restrictions on the output type τ are required:

  • τ must be a DTD type
  • all regular expressions in τ must be “star-free”

SLIDE 32

Aside: star-free regular expressions

Star-free means no Kleene closure, but can use the complement, compl, and the empty set, ∅. This gives something Kleene closure-like, which in fact can express all examples given so far in this talk. For example, if Σ = {a, b, c}, then compl(∅) denotes Σ∗, and compl(Σ∗.b.Σ∗ | Σ∗.c.Σ∗) denotes a∗. But, not all Kleene closure expressions can be expressed this way. An example that cannot: (a.a)∗.
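
The two identities in this example can be checked by brute force on short words. The Python sketch below uses a hypothetical tuple encoding of star-free expressions and a naive membership test; it confirms the identities only up to a chosen word length and is not a proof.

```python
from itertools import product

SIGMA = "abc"

# Star-free expressions as nested tuples:
#   ("empty",)  ("sym", x)  ("cat", r, s)  ("alt", r, s)  ("compl", r)


def matches(r, w: str) -> bool:
    """Naive membership test: does word w belong to the language of r?"""
    op = r[0]
    if op == "empty":
        return False
    if op == "sym":
        return w == r[1]
    if op == "alt":
        return matches(r[1], w) or matches(r[2], w)
    if op == "cat":  # try every split point of w
        return any(matches(r[1], w[:i]) and matches(r[2], w[i:])
                   for i in range(len(w) + 1))
    if op == "compl":
        return not matches(r[1], w)
    raise ValueError(op)


ANY = ("compl", ("empty",))  # compl(empty set) = Sigma*


def contains(x):
    """Sigma* . x . Sigma*: all words containing the letter x."""
    return ("cat", ANY, ("cat", ("sym", x), ANY))


# compl(Sigma*.b.Sigma* | Sigma*.c.Sigma*): words with no b and no c, i.e. a*
ONLY_A = ("compl", ("alt", contains("b"), contains("c")))


def words(max_len):
    """All words over SIGMA of length at most max_len."""
    for n in range(max_len + 1):
        for tup in product(SIGMA, repeat=n):
            yield "".join(tup)
```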

SLIDE 33

Limitations of Typechecking

Unfortunately, the restrictions we have given are critical. Allowing output types that are not DTD’s or increasing the expressive power of the language leads to undecidability.

SLIDE 34

Typechecking for XML Transformation

In this application we take as our programming language the fragment of XSLT described in an earlier slide. Here we can do unexpectedly well by exploiting inverse type inference. Let P : TΣ → TΣ be a transformation expressed in the language and let τ ⊆ TΣ be a type. Consider the inverse type P⁻¹(τ) = {x | P(x) ∈ τ}. Surprisingly, one can show that P⁻¹(τ) is also a regular tree language!

SLIDE 35

Typechecking for XML Transformation

Theorem: for any program P in the XSLT fragment defined earlier, and any regular tree language τ, P⁻¹(τ) is also a regular tree language. In consequence, typechecking for this language is decidable.
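
The step from the theorem to decidability can be spelled out as follows. Here the input type is written τ_in (a name introduced only for this note) and is assumed to be a regular tree language, possibly all of TΣ; the argument also assumes that a description of P⁻¹(τ) can be constructed effectively, which the slide does not state explicitly.

```latex
\[
  \bigl(\forall x \in \tau_{\mathrm{in}} :\; P(x) \in \tau \bigr)
  \iff
  \tau_{\mathrm{in}} \subseteq P^{-1}(\tau),
  \qquad\text{where}\qquad
  P^{-1}(\tau) = \{\, x \mid P(x) \in \tau \,\}.
\]
```

Both sides of the containment are regular tree languages, so the check reduces to the containment test for regular tree languages discussed earlier.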

SLIDE 36

Limitations of Typechecking

Again this result depends very much on the restrictions we have placed on the XSLT fragment. Adding joins to the language (by allowing comparisons between parameters’ values), for example, leads to undecidability.

SLIDE 37

Summary

Problem of great practical interest. Results a bit depressing so far. Typechecking approach seems limited in its potential. “The decidable cases seem to be more the exception than the rule. Moreover, even where typechecking is decidable, the complexity is high.” Type inference approach seems more promising. While in theory it is limited by its incompleteness, it seems to work fine in practice. Counterexample we saw earlier is arguably too contrived to arise in practice.

SLIDE 38

Open Problems

  • Can we develop a precise notion of “practical” types for which type inference is complete?
  • Can we issue warnings when type inference fails? (i.e., return unknown instead of false negatives)
  • Can we perhaps develop practical algorithms for typechecking using approximation or randomized techniques? (Approximation aside from type inference, that is.)

SLIDE 39

References

  • paper (“The XML Typechecking Problem”, Dan Suciu, SIGMOD Record, March 2002): http://portal.acm.org/citation.cfm?id=507360&dl=ACM&coll=portal

  • slides for this talk: http://www.cis.upenn.edu/~tjgreen
