SLIDE 1

Approximate Joins for Data-Centric XML

Nikolaus Augsten¹, Michael Böhlen¹, Curtis Dyreson², Johann Gamper¹

¹ Free University of Bozen-Bolzano, Bolzano, Italy
{augsten,boehlen,gamper}@inf.unibz.it

² Utah State University, Logan, UT, U.S.A.
curtis.dyreson@usu.edu

April 10, 2008, ICDE, Cancún, Mexico

Nikolaus Augsten (Bolzano, Italy) Approximate Joins for Data-Centric XML ICDE 2008 – Cancún, Mexico 1 / 33

SLIDE 2

Outline

1 Motivation
2 Windowed pq-Grams for Data-Centric XML
    Windowed pq-Grams
    Tree Sorting
    Forming Bases
3 Efficient Approximate Joins with Windowed pq-Grams
4 Experiments
5 Related Work
6 Conclusion and Future Work

SLIDE 4

Motivation

Approximate Join on Music CDs

Two sources: Song Lyric Store and CD Warehouse. Their album trees, written as label(children) with leaf values in quotes:

  album(track(title "So Far", artist "Mark"), track(artist "Roger", title "Breathe"), year "2000")
  album(track(artist "Neil", title "Alabama"), price "10")
  album(track(title "Alabama", artist "Neil"), title "Harvest")
  album(track(artist "Roger", title "Breathe"), price "15", track(artist "Mark", title "So Far"))

Query: Give me all album pairs that represent the same music CDs.

How similar are two XML items?

SLIDE 6

Motivation

How Similar Are these XMLs?

  album(track(title "So Far", artist "Mark"), track(artist "Roger", title "Breathe"), year "2000")
  album(track(artist "Roger", title "Breathe"), price "15", track(artist "Mark", title "So Far"))

Standard solution, O(n³): tree edit distance
  Minimum number of node edit operations (insert, delete, rename) that transforms one ordered tree into the other.

Problem: permuted subtrees are deleted/re-inserted node by node

SLIDE 9

Motivation

Ordered vs. Unordered Trees

Ordered Trees: sibling order matters
  a(c(e, d), b) ≠ a(b, c(d, e))

ignore order ↓

Unordered Trees = data-centric XML: sibling order ignored
  a(c(e, d), b) = a(b, c(d, e))

Edit distance between unordered trees: NP-complete
→ all sibling permutations must be considered!
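The ordered/unordered distinction can be sketched in a few lines of Python. The `(label, children)` tuple representation and the helper names `canonical` and `leaf` are assumptions made for this illustration, not part of the talk. The sketch only tests equality of unordered trees; it does not compute the (NP-complete) unordered edit distance.

```python
# Trees are (label, [children]) tuples; this representation is an assumption
# made for illustration only.

def canonical(tree):
    """Return a nested tuple that is the same for every sibling order."""
    label, children = tree
    # Sort the canonical forms of whole subtrees, not just the labels, so that
    # siblings with equal labels are also placed deterministically.
    return (label, tuple(sorted(canonical(c) for c in children)))


def leaf(label):
    return (label, [])


t1 = ("a", [("c", [leaf("e"), leaf("d")]), leaf("b")])
t2 = ("a", [leaf("b"), ("c", [leaf("d"), leaf("e")])])

ordered_equal = t1 == t2                          # sibling order matters
unordered_equal = canonical(t1) == canonical(t2)  # sibling order ignored

print(ordered_equal, unordered_equal)
```

Equality is cheap to decide this way; the hard part addressed in the talk is measuring how far apart two unordered trees are when they are not equal.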

SLIDE 11

Motivation

Problem Definition

Find an effective distance for the approximate matching of hierarchical data represented as unordered labeled trees that is efficient for approximate joins.

Naive approaches that fail:
  • unordered tree edit distance: NP-complete
  • allow subtree move: NP-hard
  • compute minimum distance between all permutations: O(n!)
  • sort by label and use ordered tree edit distance: error O(n)

SLIDE 15

Windowed pq-Grams for Data-Centric XML – Windowed pq-Grams

Our Solution: Windowed pq-Grams

Windowed pq-Gram: small subtree with stem and base

[Figure: a windowed pq-gram with a stem (p = 2) and a base (q = 3)]

Key Idea: split unordered tree into a set of windowed pq-grams that is
  • not sensitive to the sibling order
  • sensitive to any other change in the tree

Intuition: similar unordered trees have similar windowed pq-grams

Systematic computation of windowed pq-grams
  • 1. sort the children of each node by their label (works OK for pq-grams)
  • 2. simulate permutations with a window
  • 3. split tree into windowed pq-grams

SLIDE 18

Windowed pq-Grams for Data-Centric XML – Windowed pq-Grams

Implementation of Windowed pq-Grams

Set of windowed pq-grams of the tree a(b, c(d, e)):

  (*, a, b, c)  (*, a, b, *)  (*, a, c, *)  (*, a, c, b)  (*, a, *, b)  (*, a, *, c)
  (a, b, *, *)
  (a, c, d, e)  (a, c, d, *)  (a, c, e, *)  (a, c, e, d)  (a, c, *, d)  (a, c, *, e)
  (c, d, *, *)  (c, e, *, *)

Hashing: map pq-gram to integer:
  pq-gram → serialize → (*, a, b, c) → shorthand → *abc → hash → 0973

  label l | h(l)
  *       | 0
  a       | 9
  b       | 7
  c       | 3
  . . .   | . . .

Note: labels may be strings of arbitrary length!

pq-Gram index: bag of hashed pq-grams
  I(T) = {0973, 0970, 0930, 0937, 0907, 0903, 9700, 9316, 9310, 9360, 9361, 9301, 9306, 3100, 3600}

Tree is represented by a bag of integers!
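The hashing step can be replayed with a small Python sketch. The one-digit table `H` is the toy hash from the slide; the values for `*`, `d` and `e` are not in the visible table and are inferred here from the listed index values (for example 9316 for the pq-gram a, c, d, e). The function name `hash_pq_gram` is hypothetical.

```python
# Toy label hash from the slide; h(*) = 0, h(d) = 1 and h(e) = 6 are inferred
# from the index values listed for I(T).
H = {"*": 0, "a": 9, "b": 7, "c": 3, "d": 1, "e": 6}


def hash_pq_gram(gram, h=H):
    """Concatenate the one-digit label hashes, e.g. '*abc' -> '0973'."""
    return "".join(str(h[label]) for label in gram)


# The 15 windowed pq-grams of the example tree, in shorthand notation.
pq_grams = [
    "*abc", "*ab*", "*ac*", "*acb", "*a*b", "*a*c",
    "ab**",
    "acde", "acd*", "ace*", "aced", "ac*d", "ac*e",
    "cd**", "ce**",
]

index = [hash_pq_gram(g) for g in pq_grams]
print(index)
```

Since real labels may be strings of arbitrary length, a practical implementation would presumably replace the toy table with a fixed-width fingerprint of each label; which function the authors use is not stated on the slide.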

SLIDE 24

Windowed pq-Grams for Data-Centric XML – Windowed pq-Grams

The Windowed pq-Gram Distance

The windowed pq-gram distance between two trees, T and T′:

  dist_pq(T, T′) = |I(T) ⊎ I(T′)| − 2 |I(T) ∩ I(T′)|

[Figure: the bags I(T) and I(T′)]

Pseudo-metric properties hold:
  ✓ self-identity: x = y ⇒ dist_pq(x, y) = 0 (the converse ⇐ does not hold)
  ✓ symmetry: dist_pq(x, y) = dist_pq(y, x)
  ✓ triangle inequality: dist_pq(x, z) ≤ dist_pq(x, y) + dist_pq(y, z)

Different trees may be at distance zero. [Figure: two different trees in which every node is labeled b]

Runtime for the distance computation is O(n log n).
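The distance formula is a plain bag (multiset) computation. The sketch below uses Python's `collections.Counter`; the function name `pq_gram_distance` and the two small example bags are made up for illustration. A hash-based counter is used here for brevity, whereas the O(n log n) bound on the slide would match a sort-and-merge of the two bags.

```python
from collections import Counter


def pq_gram_distance(index_a, index_b):
    """dist = |I(T) bag-union I(T')| - 2 * |I(T) bag-intersection I(T')|."""
    a, b = Counter(index_a), Counter(index_b)
    union_size = sum(a.values()) + sum(b.values())   # bag union adds multiplicities
    intersection_size = sum((a & b).values())        # bag intersection takes minima
    return union_size - 2 * intersection_size


# Two made-up bags of hashed pq-grams (not taken from the slides).
i_t1 = ["0973", "0970", "0930", "9700"]
i_t2 = ["0973", "0970", "0931", "9700", "9700"]

print(pq_gram_distance(i_t1, i_t2))
```

Because the result depends only on the two bags, two different trees that happen to produce the same bag are at distance zero, which is why the distance is a pseudo-metric.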

SLIDE 27

Windowed pq-Grams for Data-Centric XML – Tree Sorting

Sorting the Tree?

Idea:
  • 1. sort the children of each node by their label
  • 2. apply an ordered tree distance

Example: sorting T1 yields T1^srt

  T1      = a(b(e, f, g, d), c(j, k), b(h, f, i))
  T1^srt  = a(b(d, e, f, g), b(f, h, i), c(j, k))

✘ Edit distance: tree sorting does not work

✓ Windowed pq-Grams: tree sorting works OK

SLIDE 30

Windowed pq-Grams for Data-Centric XML – Tree Sorting

✘ Edit Distance: Tree Sorting Does Not Work

  • 1. Non-unique sorting: edit distance O(n) for identical trees

Two sibling orders of the same unordered tree (unordered edit dist = 0):
  a(b(e, f, g, d), c(j, k), b(h, f, i))
  a(b(h, f, i), b(e, f, g, d), c(j, k))

After sorting the children of each node by label:
  a(b(d, e, f, g), b(f, h, i), c(j, k))
  a(b(f, h, i), b(d, e, f, g), c(j, k))

ordered edit dist = O(n)
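The non-uniqueness is easy to reproduce. In the sketch below, trees are `(label, children)` tuples and the helpers `t`, `leaves` and `sort_by_label` are hypothetical names; the exact child order of the unsorted trees is read off the flattened figure and should be treated as an assumption. Sorting by label alone leaves the relative order of the two b siblings to the input order, so two permutations of the same unordered tree end up as different ordered trees.

```python
def t(label, *kids):
    """Build a (label, children) tree node."""
    return (label, list(kids))


def leaves(labels):
    return [t(ch) for ch in labels]


def sort_by_label(tree):
    """Recursively sort children by label only (Python's sort is stable)."""
    label, children = tree
    ordered = sorted((sort_by_label(c) for c in children), key=lambda c: c[0])
    return (label, ordered)


# Two sibling orders of the same unordered tree.
t_a = t("a", t("b", *leaves("efgd")), t("c", *leaves("jk")), t("b", *leaves("hfi")))
t_b = t("a", t("b", *leaves("hfi")), t("b", *leaves("efgd")), t("c", *leaves("jk")))

s_a, s_b = sort_by_label(t_a), sort_by_label(t_b)
print(s_a == s_b)  # the two sorted trees differ as ordered trees
```

An ordered tree edit distance applied to `s_a` and `s_b` then has to pay for the swapped b subtrees node by node, even though the unordered trees are identical.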

SLIDE 34

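Windowed pq-Grams for Data-Centric XML – Tree Sorting

✘ Edit Distance: Tree Sorting Does Not Work

  • 2. Node renaming: edit distance depends on node label

Three versions of T2, each 1 rename away from the next (first child labeled a, b, x):
  a(a(e, f, g, d), c(j, k), b(h, f, i))
  a(b(e, f, g, d), c(j, k), b(h, f, i))
  a(x(e, f, g, d), c(j, k), b(h, f, i))

After sorting:
  a(a(d, e, f, g), b(f, h, i), c(j, k))
  a(b(d, e, f, g), b(f, h, i), c(j, k))
  a(b(f, h, i), c(j, k), x(d, e, f, g))

First and second sorted tree: dist = 1
Second and third sorted tree: dist = O(n)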

SLIDE 36

Windowed pq-Grams for Data-Centric XML – Tree Sorting

✓ Windowed pq-Grams: Tree Sorting Works OK

Theorem (Local Effect of Node Reordering)
If k children of a node are reordered, i.e., their subtrees are moved, only O(k) windowed pq-grams change.

Proof (idea):
  • pq-grams consist of a stem and a base
  • stems are invariant to the sibling order
  • bases: only the O(k) pq-grams with the reordered nodes in the bases change

[Figure: a pq-gram with its stem and base marked]

✓ Non-unique sortings are equivalent: distance is 0 for identical trees
✓ Node renaming is independent of the node label

SLIDE 40

Windowed pq-Grams for Data-Centric XML – Forming Bases

How To Form Bases?

Goal for windowed pq-grams:
  • not sensitive to the sibling order
  • sensitive to any other change in the tree

[Figure: a windowed pq-gram with a stem (p = 2) and a base (q = 3)]

Stems: ignore sibling order
  a(b, c(d, e)) → stems (*, a), (a, b), (a, c), (c, d), (c, e)

Bases: do not ignore sibling order!

SLIDE 42

Windowed pq-Grams for Data-Centric XML – Forming Bases

Requirements for Bases

Requirements for bases:
  • detection of node moves
  • robustness to different sortings
  • balanced node weight

Our solution:
  • windows: simulate all permutations within a window
  • wrapping: wrap windows that extend beyond the right border
  • dummies: extend small sibling sets with dummy nodes

SLIDE 52

Windowed pq-Grams for Data-Centric XML – Forming Bases

Solution: Windowed pq-Gram Bases

Algorithm 1: Form bases from a sorted sibling sequence
  if sibling sequence < window then extend with dummy nodes;
  initialize window: start with leftmost node;
  repeat
      form bases in window: all q-permutations that contain start node;
      shift window to the right by one node;
      if window extends the right border then wrap window;
  until processed all window positions

Example: stem (a, c), sorted sibling sequence d, e, *, window w = 3
  (a, c, d, e)  (a, c, d, *)
  (a, c, e, *)  (a, c, e, d)
  (a, c, *, d)  (a, c, *, e)
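A runnable reading of the algorithm, as a sketch under stated assumptions. The names `form_bases` and `pq_grams` are hypothetical. Two details are inferred from the slide examples rather than stated there: a base is taken to be the window's start node followed by q − 1 other window nodes in window order, and a leaf is taken to contribute a single pq-gram whose base consists only of dummy nodes. Stems are fixed at p = 2.

```python
from itertools import combinations

DUMMY = "*"


def form_bases(siblings, q, w):
    """Windowed bases for one sorted sibling sequence (assumes w >= q)."""
    seq = list(siblings)
    if len(seq) < w:                      # pad short sequences with dummies
        seq += [DUMMY] * (w - len(seq))
    n = len(seq)
    bases = []
    for start in range(n):                # one window per start position
        # w nodes beginning at `start`, wrapping past the right border
        window = [seq[(start + i) % n] for i in range(w)]
        # start node plus every choice of q-1 other window nodes, in order
        for rest in combinations(window[1:], q - 1):
            bases.append((window[0],) + rest)
    return bases


def pq_grams(tree, q=2, w=3, parent=DUMMY):
    """Windowed pq-grams (p = 2) of a (label, children) tree."""
    label, children = tree
    stem = (parent, label)
    if not children:                      # leaf: one all-dummy base (assumption)
        return [stem + (DUMMY,) * q]
    kids = sorted(children, key=lambda c: c[0])
    grams = [stem + b for b in form_bases([c[0] for c in kids], q, w)]
    for child in kids:
        grams += pq_grams(child, q, w, parent=label)
    return grams


tree = ("a", [("b", []), ("c", [("d", []), ("e", [])])])
profile = ["".join(g) for g in pq_grams(tree)]
print(profile)
```

With q = 2 and w = 3, the sibling sequence d, e yields the six bases of the slide's example, and the whole tree a(b, c(d, e)) yields 15 pq-grams.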

SLIDE 56

Windowed pq-Grams for Data-Centric XML – Forming Bases

Optimal Windowed pq-Grams

Theorem (Optimal Windowed pq-Grams)
For trees with fanout f, windowed pq-grams with base size q = 2 and window size w = (f + 1)/2 have the following properties:

  • 1. Detection of node moves:
       base recall ρ = 1 (all sibling pairs are encoded)
       base precision π = 1 (each pair is encoded only once)
  • 2. Robustness to different sortings (k edit operations):
       base error ε ≤ 2k/f
  • 3. Balanced node weight:
       Each non-root node appears in exactly 2w − 2 bases.
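Properties 1 and 3 can be checked numerically for a single sibling set. The sketch below assumes an odd fanout (so that (f + 1)/2 is an integer), distinct sibling labels, and bases of size q = 2 formed by pairing each node with the next w − 1 nodes of its wrapped window; `window_pairs` is a hypothetical helper name. It is a sanity check for one value of f, not a proof.

```python
from collections import Counter


def window_pairs(siblings, w):
    """Bases of size q = 2: each node with the next w-1 siblings, wrapping."""
    n = len(siblings)
    return [(siblings[i], siblings[(i + k) % n])
            for i in range(n) for k in range(1, w)]


f = 5                                  # odd fanout (assumption)
w = (f + 1) // 2                       # window size from the theorem
siblings = ["s%d" % i for i in range(f)]
bases = window_pairs(siblings, w)

# How often each unordered sibling pair is encoded, and how often each node occurs.
pair_counts = Counter(frozenset(b) for b in bases)
node_counts = Counter(node for b in bases for node in b)
all_pairs = f * (f - 1) // 2

print(len(pair_counts) == all_pairs)   # recall: every sibling pair is encoded
print(set(pair_counts.values()))       # precision: each pair encoded once
print(set(node_counts.values()))       # node weight: 2w - 2 for every node
```

The intuition: on a cycle of f nodes, any two nodes are at wrapped distance d in one direction and f − d in the other, and for odd f exactly one of the two is at most (f − 1)/2 = w − 1.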

SLIDE 62

Windowed pq-Grams for Data-Centric XML – Forming Bases

Illustration: Detection of Node Moves

  • Single Node: each node forms a base of size q = 1
  • Window: q ≥ 2 nodes of a window form a base

[Figure: two trees with node labels a b c d b e and a b c b d e that differ by 1 node move]

Goal: bases must change

✘ Single Node: bases c, d, e before the move and c, d, e after it: no bases change

✓ Window: bases cd, c*, d*, dc, *c, *d, e*, . . . before the move and c*, c*, **, *c, *c, **, de, . . . after it: 33% bases change

Windowed pq-grams detect node moves.

SLIDE 68

Windowed pq-Grams for Data-Centric XML Forming Bases

Illustration: Robustness to Different Sortings

Consecutive: consecutive siblings form a base (no permutation)
Window: all sibling permutations within the window form bases

Example: x(a, b, d) becomes x(a, c, d) by 1 rename (b → c).

Sorting A: x(a, b, d) and x(a, c, d)
Sorting B: x(a, d, b) and x(a, d, c)

Goal: Same number of bases change for both sortings.

✘ Consecutive:

Sort A: ab, bd → ac, cd (100% bases change)
Sort B: ad, db → ad, dc (50% bases change)

✓ Window:

Sort A: ad, ab, db, . . . → ad, ac, dc, . . . (33% bases change)
Sort B: ad, ab, db, . . . → ad, ac, dc, . . . (33% bases change)

Windowed pq-grams: Robust to different sortings.
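The comparison on this slide can be replayed with a small sketch. It is a simplification, not the paper's base construction: `consecutive_bases` pairs adjacent siblings, and the hypothetical `window_bases` treats every unordered pair of siblings that fit into one circular window of size `w` as a base, ignoring dummy nodes and stems. With three siblings and `w = 3` the window covers all of them. The absolute percentage for the window therefore differs from the 33% on the slide; the point is only that the windowed result does not depend on the sibling order.

```python
def consecutive_bases(siblings):
    """Adjacent sibling pairs, in the given order (no permutation)."""
    return {(siblings[i], siblings[i + 1]) for i in range(len(siblings) - 1)}


def window_bases(siblings, w):
    """Simplified window (an assumption): unordered pairs of siblings at most
    w-1 positions apart, with wrap-around. Assumes distinct sibling labels."""
    n = len(siblings)
    bases = set()
    for i in range(n):
        for k in range(1, w):
            j = (i + k) % n
            if j != i:
                bases.add(frozenset((siblings[i], siblings[j])))
    return bases


def changed_fraction(before, after):
    """Fraction of the bases in `before` that no longer occur in `after`."""
    return 1 - len(before & after) / len(before)


# One rename (b -> c) under two different sibling orders.
sort_a = (["a", "b", "d"], ["a", "c", "d"])
sort_b = (["a", "d", "b"], ["a", "d", "c"])

consecutive_a = changed_fraction(*map(consecutive_bases, sort_a))
consecutive_b = changed_fraction(*map(consecutive_bases, sort_b))
window_a = changed_fraction(*(window_bases(s, 3) for s in sort_a))
window_b = changed_fraction(*(window_bases(s, 3) for s in sort_b))

print(consecutive_a, consecutive_b)  # 1.0 0.5: depends on the sorting
print(window_a == window_b)          # True: same change for both sortings
```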

slide-74
SLIDE 74

Windowed pq-Grams for Data-Centric XML Forming Bases

Illustration: Balancing the Node Weight

Permutations: all permutations of size q form a base
Window: only permutations within window form a base

Original tree: a(b(d, e, f, g, h, i), c(m, n, o))
Rename 1 (d → x): a(b(x, e, f, g, h, i), c(m, n, o))
Rename 2 (m → x): a(b(d, e, f, g, h, i), c(x, n, o))

Goal: Same number of bases change for both renames.

✘ Permutations: 60/137 bases change (rename 1) vs. 6/137 bases change (rename 2)

✓ Window: 12/51 bases change (rename 1) vs. 12/51 bases change (rename 2)

Windowed pq-grams: Node weight is independent of sibling number.
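Why the window balances node weight can be checked by counting. The sketch below assumes q = 2, ignores stems and dummy nodes, and uses a hypothetical window rule (the first node of each circular window of size `w` is paired with every other node in it), so its counts are not the 60/137 and 12/51 from the slide. The sibling lists are read from the slide's flattened tree. It only shows the trend: with all permutations, the number of bases containing a given node grows with the number of siblings; with a window it stays fixed.

```python
from itertools import permutations


def permutation_bases(siblings):
    """All ordered pairs of distinct siblings (permutations of size q = 2)."""
    return list(permutations(siblings, 2))


def window_bases(siblings, w):
    """Hypothetical window rule: slide a circular window of size w over the
    siblings and pair its first node with each other node in the window.
    Assumes len(siblings) >= w, so no dummy nodes are needed."""
    n = len(siblings)
    return [(siblings[i], siblings[(i + k) % n])
            for i in range(n) for k in range(1, w)]


def weight(bases, node):
    """Number of bases that would change if `node` were renamed."""
    return sum(node in base for base in bases)


many = ["d", "e", "f", "g", "h", "i"]  # six siblings
few = ["m", "n", "o"]                  # three siblings

perm_weights = (weight(permutation_bases(many), "d"),
                weight(permutation_bases(few), "m"))
window_weights = (weight(window_bases(many, 3), "d"),
                  weight(window_bases(few, 3), "m"))

print(perm_weights)    # (10, 4): depends on the number of siblings
print(window_weights)  # (4, 4): independent of the number of siblings
```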

slide-79
SLIDE 79

Efficient Approximate Joins with Windowed pq-Gram

Approximate Join

Forest F (tid: tree node labels):
  T1: x y v z w
  T2: a b c b
  T3: a e b h

Forest F′ (tid: tree node labels):
  T′1: a b c d e
  T′2: d a h i
  T′3: x y w z w

Threshold = 2. Pairwise distances shown in the figure: 6, 5, 5, 4, 5, 5, 3, 1, 2

Simple approach: distance join

  • 1. compute distance between all pairs of trees
  • 2. return document pairs within threshold

Very expensive: N^2 distance computations!
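The simple approach can be sketched as a nested loop. The distance used here is a stand-in (the size of the symmetric difference of the node-label bags), not the tree distance behind the numbers in the figure, and the names `label_bag_distance` and `distance_join` are hypothetical. The sketch only illustrates that every pair is compared, so the distance function is called |F| · |F′| times.

```python
from collections import Counter


def label_bag_distance(labels1, labels2):
    """Stand-in distance: size of the symmetric difference of two label bags."""
    c1, c2 = Counter(labels1), Counter(labels2)
    return sum((c1 - c2).values()) + sum((c2 - c1).values())


def distance_join(forest1, forest2, threshold, dist=label_bag_distance):
    """Nested-loop join: evaluate dist on every pair, keep pairs within threshold."""
    calls = 0
    result = []
    for tid1, t1 in forest1.items():
        for tid2, t2 in forest2.items():
            calls += 1
            if dist(t1, t2) <= threshold:
                result.append((tid1, tid2))
    return result, calls


# Node labels of the trees on the slide (tree structure is ignored here).
F = {"T1": "xyvzw", "T2": "abcb", "T3": "aebh"}
F_prime = {"T1'": "abcde", "T2'": "dahi", "T3'": "xywzw"}

pairs, calls = distance_join(F, F_prime, threshold=2)
print(pairs)  # [('T1', "T3'")]
print(calls)  # 9
```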

slide-82
SLIDE 82

Efficient Approximate Joins with Windowed pq-Gram

Usual Join Optimization Does not Apply

Distance join: expensive
  nested loop join: evaluate distance function between every input pair

Equality join: efficient
  implementation as sort-merge or hash join

Sort-merge and hash join:
  first step: treat each join attribute in isolation (sort/hash)
  second step: evaluate equality function

Sort-merge and hash not applicable to distance join:
  there is no sorting that groups similar trees
  there is no hash function that partitions similar trees into buckets

Solution: reduce distance join to equality join on pq-grams
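For contrast, an equality join can hash one input in isolation and only then compare keys. A minimal hash-join sketch follows; the function and variable names are illustrative and not taken from the talk. Because the slide states that no hash function partitions similar trees into buckets, this trick works on pq-grams (compared by equality) but not on whole trees compared by distance.

```python
from collections import defaultdict


def hash_equi_join(left, right):
    """Join two lists of (key, payload) tuples on key equality."""
    # Step 1: treat the left join attribute in isolation and hash it.
    buckets = defaultdict(list)
    for key, payload in left:
        buckets[key].append(payload)
    # Step 2: probe with the right input; only equal keys are ever compared.
    return [(key, l_payload, r_payload)
            for key, r_payload in right
            for l_payload in buckets.get(key, [])]


left = [(1, "a"), (7, "a"), (1, "b")]
right = [(1, "d"), (5, "e"), (7, "d")]
joined = hash_equi_join(left, right)
print(joined)  # [(1, 'a', 'd'), (1, 'b', 'd'), (7, 'a', 'd')]
```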

slide-93
SLIDE 93

Efficient Approximate Joins with Windowed pq-Gram

Reducing a Distance Join to an Equality Join

Distance join between trees: N^2 intersections between integer bags

  Left bags:  a = {1, 7}, b = {1, 0}, c = {4, 6}
  Right bags: d = {1, 7}, e = {5, 5}, f = {0, 8}

  |a ∩ d| = 2   |a ∩ e| = 0   |a ∩ f| = 0
  |b ∩ d| = 1   |b ∩ e| = 0   |b ∩ f| = 1
  |c ∩ d| = 0   |c ∩ e| = 0   |c ∩ f| = 0

Optimized pq-gram join: empty intersections are never computed!

  • 1. union

    left:  {1a, 7a, 1b, 0b, 4c, 6c}
    right: {1d, 7d, 5e, 5e, 0f, 8f}

  • 2. sort

    left:  0b 1a 1b 4c 6c 7a
    right: 0f 1d 5e 5e 7d 8f

  • 3. merge-join

    non-empty intersections found: |b ∩ f| = 1, |a ∩ d| = 2, |b ∩ d| = 1
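The three steps (union, sort, merge-join) can be sketched as follows. This is a minimal illustration on the slide's integer bags, not the authors' implementation; `tagged_sorted_groups` and `bag_intersection_join` are hypothetical names. Each integer is tagged with the id of its bag, both sides are sorted, and a single merge pass accumulates intersection sizes only for bag pairs that share at least one value.

```python
from collections import Counter, defaultdict
from itertools import groupby


def tagged_sorted_groups(bags):
    """Union all bags into (value, tid) pairs, sort them, and group by value.
    Each group records how often the value occurs in each bag."""
    pairs = sorted((value, tid) for tid, bag in bags.items() for value in bag)
    return [(value, Counter(tid for _, tid in group))
            for value, group in groupby(pairs, key=lambda p: p[0])]


def bag_intersection_join(left_bags, right_bags):
    """Sizes of all non-empty bag intersections via a single merge pass."""
    left = tagged_sorted_groups(left_bags)
    right = tagged_sorted_groups(right_bags)
    sizes = defaultdict(int)
    i = j = 0
    while i < len(left) and j < len(right):
        (lv, lcounts), (rv, rcounts) = left[i], right[j]
        if lv < rv:
            i += 1
        elif lv > rv:
            j += 1
        else:
            # Equal values: each pair of bags shares min(count, count) copies.
            for ltid, ln in lcounts.items():
                for rtid, rn in rcounts.items():
                    sizes[(ltid, rtid)] += min(ln, rn)
            i += 1
            j += 1
    return dict(sizes)


left_bags = {"a": [1, 7], "b": [1, 0], "c": [4, 6]}
right_bags = {"d": [1, 7], "e": [5, 5], "f": [0, 8]}
sizes = bag_intersection_join(left_bags, right_bags)
print(sizes)  # {('b', 'f'): 1, ('a', 'd'): 2, ('b', 'd'): 1}
```

Pairs such as (a, e) or (c, f) never appear in the result because the merge pass never visits them.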

slide-97
SLIDE 97

Experiments

Effectiveness of the Windowed pq-Gram Join

[Figure: recall [%] and precision [%] vs. percentage of changed nodes, for threshold=0.3, threshold=0.5, and threshold=0.7]

Experiment: match DBLP articles
  add noise to articles (missing elements and spelling mistakes)
  approximate join between original and noisy data
  measure precision and recall for different thresholds

Datasets:
  DBLP: articles; depth 1.9, 15 nodes (max 1494 nodes)
  SwissProt: protein descriptions; depth 3.5, 104 nodes (max 2640 nodes)
  Treebank: tagged English sentences; depth 6.9 (max depth 30), 43 nodes

Windowed pq-grams are effective for data-centric XML
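How precision and recall of a join result are measured can be sketched in a few lines. The pairs below are toy data, not the experimental results, and `precision_recall` is a hypothetical helper: precision is the share of returned pairs that are correct, recall is the share of correct pairs that were returned.

```python
def precision_recall(returned_pairs, true_pairs):
    """Precision and recall of a join result against the known correct matches."""
    returned, truth = set(returned_pairs), set(true_pairs)
    hits = len(returned & truth)
    precision = hits / len(returned) if returned else 1.0
    recall = hits / len(truth) if truth else 1.0
    return precision, recall


# Toy data (not from the experiments): ids of original and noisy documents.
true_pairs = [(1, 1), (2, 2), (3, 3), (4, 4)]
returned_pairs = [(1, 1), (2, 2), (3, 5)]
precision, recall = precision_recall(returned_pairs, true_pairs)
print(precision, recall)
```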

slide-99
SLIDE 99

Experiments

Efficiency of the Optimized pq-Gram Join

Optimized pq-gram join: very efficient

[Figure: time [sec] vs. number of trees and number of nodes, for nested-loop join and optimized join]

  compute nested-loop join between trees
  compute optimized pq-gram join between trees
  measure wall-clock time

slide-101
SLIDE 101

Related Work

Distances between Unordered Trees

Edit Distances between Unordered Trees
  [Zhang et al., 1992]: proof for NP-completeness
  [Kailing et al., 2004]: lower bound for a restricted edit distance
  [Chawathe and Garcia-Molina, 1997]: O(n^3) heuristics
  Our solution: O(n log n) approximation

Approximate Join
  [Gravano et al., 2001]: efficient approximate join for strings

slide-103
SLIDE 103

Conclusion and Future Work

Conclusion and Future Work

Windowed pq-grams for unordered trees: O(n log n) approximation of NP-complete edit distance
Key problem: all permutations must be considered
Our approach: sort trees and simulate permutations with window
Sorting: works for pq-grams, but not for edit distance
Window technique guarantees core properties:
  detection of node moves
  robustness to different sortings
  balanced node weight
Efficient approximate join: reduces distance join to equality join

Future work:
  incremental updates of the windowed pq-gram index
  include approximate string matching into XML distance


slide-104
SLIDE 104

Conclusion and Future Work

Sudarshan S. Chawathe and Hector Garcia-Molina. Meaningful change detection in structured data. In Proceedings of the ACM SIGMOD International Conference on Management of Data, pages 26–37, Tucson, Arizona, United States, May 1997. ACM Press.

Luis Gravano, Panagiotis G. Ipeirotis, H. V. Jagadish, Nick Koudas, S. Muthukrishnan, and Divesh Srivastava. Approximate string joins in a database (almost) for free. In Proceedings of the International Conference on Very Large Databases (VLDB), pages 491–500, Roma, Italy, September 2001. Morgan Kaufmann Publishers Inc.

Karin Kailing, Hans-Peter Kriegel, Stefan Schönauer, and Thomas Seidl. Efficient similarity search for hierarchical data in large databases. In Proceedings of the International Conference on Extending Database Technology (EDBT), volume 2992 of Lecture Notes in Computer Science, pages 676–693, Heraklion, Crete, Greece, March 2004. Springer.

Kaizhong Zhang, Richard Statman, and Dennis Shasha. On the editing distance between unordered labeled trees. Information Processing Letters, 42(3):133–139, 1992.