An Incremental Correction Algorithm for XML Documents and Single Type Tree Grammars
Martin Svoboda, Irena Mlýnková
XML and Web Engineering Research Group
Charles University in Prague, The Czech Republic
24 April 2012, NDT 2012, Dubai, UAE
An Incremental Correction Algorithm for XML Documents 2 NDT 2012, Dubai, UAE 24 April 2012
Outline
- Introduction
- Motivation
- Objectives
- Approach
- Corrections
- Algorithms
- Experiments
- Conclusion
Introduction
- Motivation
- Incorrect XML documents
‒ Well-formedness
‒ Schema validity
‒ Data consistency
‒ …
- Strategies
‒ Adjusting algorithms
‒ Correcting data
Introduction
- Problem
- Input
‒ One XML document
- Well-formed but (potentially) invalid
‒ DTD or XML Schema
- Output
‒ All minimal repairs
- Structural corrections of elements
Definitions
- Document
- Trees
‒ Nodes for elements and texts
‒ Prefix numbering of nodes
- Example
<a> <x><d/></x> <d><d/><d/></d> </a>
[Figure: the corresponding tree with prefix-numbered nodes: a at ε; x at 0 and d at 1; d at 0.0; d at 1.0 and d at 1.1]
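The prefix numbering in the example can be reproduced with a short sketch. It assumes Python's standard `xml.etree.ElementTree` parser, and the helper names `prefix_positions` and `show` are hypothetical. Only element nodes are numbered here; text nodes, which the slides also treat as tree nodes, are left out for brevity.

```python
import xml.etree.ElementTree as ET

def prefix_positions(xml_text):
    """Map each element's position (a tuple of child indices) to its label."""
    positions = {}

    def visit(node, pos):
        positions[pos] = node.tag
        for i, child in enumerate(node):
            visit(child, pos + (i,))

    visit(ET.fromstring(xml_text), ())
    return positions

def show(pos):
    """Render a position as on the slide: ε for the root, else dotted indices."""
    return ".".join(map(str, pos)) or "ε"

doc = "<a> <x><d/></x> <d><d/><d/></d> </a>"
for pos, label in sorted(prefix_positions(doc).items()):
    print(show(pos), label)
```

Each position is the parent's position extended by the child's index, so a node's ancestors are exactly the prefixes of its position.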
Definitions
- Schema
- Grammars
‒ Terminal symbols for element names
‒ Nonterminal symbols for types
‒ Production rules based on regular expressions
- Classes
‒ Regular tree grammars
‒ Single type tree grammars (XML Schema)
‒ Local tree grammars (DTD)
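The three classes differ in how they treat competing nonterminals, that is, distinct types that share an element name. The sketch below is one way to tell them apart, under stated assumptions: a grammar is a dictionary from type name to an (element name, content model) pair, content models are plain strings over type names, and start symbols are ignored. The function `classify` and the two extra grammars are hypothetical illustrations; only `slide_grammar` comes from the talk.

```python
import re
from itertools import combinations

def classify(grammar):
    """Return the most restrictive class the grammar falls into."""
    # Competing nonterminals: two different types with the same element name.
    competing = [{s, t} for s, t in combinations(grammar, 2)
                 if grammar[s][0] == grammar[t][0]]
    if not competing:
        return "local"
    for _, model in grammar.values():
        used = set(re.findall(r"[A-Za-z_]\w*", model)) & set(grammar)
        # Competing types inside one content model rule out single type.
        if any(pair <= used for pair in competing):
            return "regular"
    return "single type"

slide_grammar = {"A": ("a", "C.D*"), "B": ("b", "D*"),
                 "C": ("c", ""), "D": ("d", "D*")}
single_type = {"A": ("a", "X.Y"), "X": ("x", "D1*"), "Y": ("y", "D2*"),
               "D1": ("d", ""), "D2": ("d", "D2*")}
regular = {"A": ("a", "D1.D2"), "D1": ("d", ""), "D2": ("d", "D2*")}

for g in (slide_grammar, single_type, regular):
    print(classify(g))
```

The classes are nested, so a grammar reported as local is also a single type tree grammar and a regular tree grammar.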
Model
- Edit operations
- ADD leaf, REMOVE leaf, RENAME label
- Update operations
- Sequences of edit operations
- INSERT, DELETE, REPAIR, RENAME
- Cost function
- Unit costs of edit operations
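Because edit operations only add or remove leaves, an update operation on a whole subtree expands into one edit per node. The sketch below illustrates this for DELETE and INSERT; the tuple encoding of trees and the names `delete_ops`, `insert_ops` and `cost` are assumptions made for illustration, not the authors' definitions.

```python
def delete_ops(tree, pos=()):
    """Expand DELETE of a subtree into REMOVE-leaf edits, deepest nodes first."""
    label, kids = tree
    ops = []
    for i in reversed(range(len(kids))):  # right to left keeps indices valid
        ops += delete_ops(kids[i], pos + (i,))
    ops.append(("REMOVE", pos))
    return ops

def insert_ops(tree, pos=()):
    """Expand INSERT of a subtree into ADD-leaf edits, parents first."""
    label, kids = tree
    ops = [("ADD", pos, label)]
    for i, kid in enumerate(kids):
        ops += insert_ops(kid, pos + (i,))
    return ops

def cost(ops):
    """Unit costs: every edit operation counts as 1."""
    return len(ops)

subtree = ("d", (("d", ()), ("d", ())))  # the node at position 1 in the example
print(delete_ops(subtree, (1,)))
print(cost(delete_ops(subtree, (1,))), cost(insert_ops(subtree, (1,))))
```

Under unit costs, deleting or inserting a subtree therefore costs as much as the number of nodes it contains, while a RENAME costs 1.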
Model
[Figure: the example document tree a(x(d), d(d, d)) alongside three corrected trees: a(c, d(d), d(d, d)), a(c, d(d, d)) and b(d(d), d(d, d))]
Grammar:
Type  Name  Model
A     a     C.D*
B     b     D*
C     c     empty
D     d     D*
Document:
<a> <x><d/></x> <d> <d/><d/> </d> </a>
Algorithm
- Naive algorithm
- Task
‒ At each level of top-down tree processing, find repairs for a sequence of sibling nodes
- Steps
‒ Construct a repairing multigraph
‒ Recursively repair subtrees
‒ Compose a repairing structure
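The steps above can be sketched as a shortest-path search. This is a simplified illustration, not the authors' implementation: it computes only the minimal repair cost rather than the full repairing structure with all minimal repairs. Assumptions: each content model is hand-encoded as a small automaton over types, multigraph vertices are pairs (number of siblings consumed, automaton state), and INSERT costs the size of the smallest valid tree of the inserted type. All names are hypothetical.

```python
import math
from functools import lru_cache
from heapq import heappop, heappush

# type -> (element name, start state, accepting states,
#          {(state, child type): next state}) for the content model
GRAMMAR = {
    "A": ("a", 0, frozenset({1}), {(0, "C"): 1, (1, "D"): 1}),  # C.D*
    "B": ("b", 0, frozenset({0}), {(0, "D"): 0}),               # D*
    "C": ("c", 0, frozenset({0}), {}),                          # empty
    "D": ("d", 0, frozenset({0}), {(0, "D"): 0}),               # D*
}

def min_tree_sizes(grammar):
    """Fewest nodes in any valid tree of each type (the cost of an INSERT)."""
    best = {t: math.inf for t in grammar}
    changed = True
    while changed:
        changed = False
        for t, (_, start, accept, delta) in grammar.items():
            dist = {start: 0}
            for _ in range(len(delta) + 1):  # Bellman-Ford style relaxation
                for (q, u), q2 in delta.items():
                    if q in dist and dist[q] + best[u] < dist.get(q2, math.inf):
                        dist[q2] = dist[q] + best[u]
            cand = 1 + min((dist[q] for q in accept if q in dist), default=math.inf)
            if cand < best[t]:
                best[t], changed = cand, True
    return best

MIN_SIZE = min_tree_sizes(GRAMMAR)

@lru_cache(maxsize=None)
def size(tree):
    return 1 + sum(size(kid) for kid in tree[1])

@lru_cache(maxsize=None)
def repair_cost(tree, t):
    """Minimal number of edit operations turning `tree` into a valid tree of type t."""
    label, kids = tree
    name, start, accept, delta = GRAMMAR[t]
    rename = 0 if label == name else 1
    heap, done = [(0, 0, start)], set()
    while heap:  # Dijkstra over vertices (siblings consumed, automaton state)
        cost, i, q = heappop(heap)
        if (i, q) in done:
            continue
        done.add((i, q))
        if i == len(kids) and q in accept:
            return rename + cost
        if i < len(kids):  # DELETE the next sibling with its whole subtree
            heappush(heap, (cost + size(kids[i]), i + 1, q))
        for (p, u), q2 in delta.items():
            if p != q:
                continue
            heappush(heap, (cost + MIN_SIZE[u], i, q2))  # INSERT a minimal subtree
            if i < len(kids):  # REPAIR the next sibling recursively (covers RENAME)
                heappush(heap, (cost + repair_cost(kids[i], u), i + 1, q2))
    return math.inf

doc = ("a", (("x", (("d", ()),)), ("d", (("d", ()), ("d", ())))))
print(repair_cost(doc, "A"), repair_cost(doc, "B"))  # prints: 2 2
```

The memoisation via `lru_cache` stands in loosely for the idea of not recomputing identical nested repairs; recording which edges lie on shortest paths, which is needed to return the repairs themselves, is omitted.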
Algorithm
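[Figure: the example tree a(x(d), d(d, d)) and the repairing multigraph for the sibling sequence x, d under its root, with vertices 00 to 22 and edges of four kinds: DELETE, RENAME, REPAIR and INSERT]
Type  Name  Model
A     a     C.D*
B     b     D*
C     c     empty
D     d     D*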
Algorithm
[Figure: the repairing multigraph for the sibling sequence x, d (vertices 00 to 22; RENAME, REPAIR and INSERT edges), shown with two corrected trees: a(c, d(d), d(d, d)) and a(c, d(d, d))]
Algorithms
- Naive
- Dynamic
- Follows Dijkstra’s algorithm directly, so only the required parts of the multigraph are explored
- Caching
- Avoids repeated recursive computations by detecting and caching identical repairs
- Incremental
- Evaluates repairing multigraphs step by step
Algorithms
- Incremental
- Task
‒ Structure encapsulating multigraph evaluation
- Multigraph structure
- Dijkstra’s variables
- Scheduler
‒ Processing of an activated task:
- Request further refinement of prospective edges
- Activate corresponding tasks for nested problems
Experiments
- Data
- Single type tree grammar
‒ 7 nonterminal symbols
‒ 6 terminal symbols
‒ Recursion, iteration
- XML data trees
‒ Maximal depth 5, fan-out 8
‒ Elements from 100 to 1,000
‒ 20 files for each particular size
‒ Average values from 20 repeats
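Test trees with these bounds could be produced by a generator along the following lines. This is an assumed sketch, not the generator used in the experiments: the function `random_tree`, its parameters, the label alphabet, and the convention that the root has depth 1 are all hypothetical.

```python
import random

def random_tree(n_elements, max_depth=5, max_fanout=8, labels="abcdef", seed=None):
    """Grow a random tree by attaching each new node to a random eligible parent."""
    rng = random.Random(seed)
    root = {"label": rng.choice(labels), "depth": 1, "kids": []}
    open_nodes = [root] if max_depth > 1 else []  # nodes that may still get children
    count = 1
    while count < n_elements and open_nodes:
        parent = rng.choice(open_nodes)
        child = {"label": rng.choice(labels), "depth": parent["depth"] + 1, "kids": []}
        parent["kids"].append(child)
        count += 1
        if len(parent["kids"]) == max_fanout:
            open_nodes.remove(parent)  # fan-out limit reached
        if child["depth"] < max_depth:
            open_nodes.append(child)   # child may have children of its own
    return root

def count_nodes(node):
    return 1 + sum(count_nodes(kid) for kid in node["kids"])

def depth(node):
    return 1 + max((depth(kid) for kid in node["kids"]), default=0)

def max_fanout_of(node):
    return max([len(node["kids"])] + [max_fanout_of(kid) for kid in node["kids"]])

tree = random_tree(500, seed=1)
print(count_nodes(tree), depth(tree), max_fanout_of(tree))
```

With depth 5 and fan-out 8 a tree can hold up to 4,681 elements, so every size from 100 to 1,000 fits within the bounds.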
Experiments
- Execution time in milliseconds
[Chart: execution time (axis ticks 10 to 40) against the number of elements (200 to 1000), one series each for Caching and Incremental]
Experiments
- Number of correction intents
- Equal to the number of distinct multigraphs
[Chart: number of correction intents (axis ticks 1000 to 4000) against the number of elements (200 to 1000), one series each for Caching and Incremental]
Conclusion
- Contributions
- Single type tree grammars
- Always finds all minimal repairs
- New incremental algorithm
- Advantages
- Compact repair structure
- Prototype implementation
Thank you for your attention…
XML and Web Engineering Research Group