slide-1
SLIDE 1

Approximate Tree Matching with pq-Grams

Nikolaus Augsten [a], Michael Böhlen, Johann Gamper

DIS - Center for Database and Information Systems Free University of Bozen-Bolzano, Italy

www.inf.unibz.it

1 – Motivation
2 – Related Work
3 – pq-Grams
4 – Properties
5 – Experiments
6 – Conclusion and Future Work

[a] Supported by the Municipality of Bozen-Bolzano.

slide-2
SLIDE 2

Motivation — Example Data Sources

☞ We want to link data items in different databases that correspond to the same real world object.

VLDB 2005, Trondheim. Nikolaus Augsten, Michael Böhlen, Johann Gamper

Page 2

slide-6
SLIDE 6

Motivation — Example Data Sources

☞ We want to link data items in different databases that correspond to the same real-world object.
☞ Example query: Who lives in Braun's apartment?

Land Register (LR, SLR) and Registration Office (RO, SRO):

LR
id   num  entr  apt  owner
91   1    -     1    Maier
91   1    -     2    Rossi
91   1    -     3    Maier
91   2    A     -    Braun
...
74   3    A     1    Spiro
74   3    A     2    Barducci
74   3    A     3    Costanzi
...

RO
resident  id   num  entr  apt
Pichler   30   1    -     1
Rieder    30   1    -     3
Fischer   30   2    A     -
Rossi     30   2    B     1
...
Spiro     120  3    A     1
Barducci  120  3    A     2
Costanzi  120  3    A     3
...

SLR
id   street
139  SIEGESPLATZ
109  GILMWEG
185  P. R. GIULIANI STR.
91   CESARE ABBA STRASSE
165  MUSTERPLATZ
115  ITALIENSTRASSE
259  TELSERDURCHGANG
207  SERNESIDURCHGANG
33   BOZNER BODENWEG
263  TRIESTER STRASSE
262  TRIENTER STRASSE
285  WALTHERPLATZ
266  TURINER STRASSE
...

SRO
id    street
30    Giuseppe-Cesare-Abba-Str.
5220  Bozner-Boden-Str.
3000  Hermann-von-Gilm-Str.
3030  Pater-Reginaldo-Giuliani-Str.
3540  Italienallee
4440  Musterplatzl
7180  Raffaello-Sernesi-Galerie
7590  Telsergalerie
7620  Friedensplatz
7650  Turiner Str.
7740  Trienter Str.
7860  Triester Str.
8580  Walther-v.-d.-Vogelweide-Pl.
...

(Annotations on the slide: "?" and "!".)

slide-8
SLIDE 8

Motivation — Address Trees

☞ Residential addresses are hierarchical → address tree
☞ Idea: corresponding streets ⇒ similar address trees

How similar are two address trees?

[Figure: address trees of CESARE ABBA STRASSE and Giuseppe-Cesare-Abba-Str.]

slide-13
SLIDE 13

Motivation — Standard Solution: The Edit Distance

☞ Edit distance: the minimum-cost sequence of edit operations (node insertion, node deletion, and label change) that transforms one tree into another.

[Figure: tree T becomes T′ by insert(k, e, 3); T′ becomes T″ by rename(a, x).]

Edit distance: $dist_{ed}(T, T'') = 2$

☞ Problem: the best algorithms take $O(n^2 \log^2 n)$ time ⇒ not scalable.

slide-14
SLIDE 14

Motivation — Problem Definition

☞ Our goal: find an efficient and effective approximation of the tree edit distance that
  ➠ is scalable for large trees,
  ➠ emphasizes structure.

slide-19
SLIDE 19

Related Work — Tree Distances

☞ n → number of tree nodes
☞ Tree edit distance:
  ➳ for balanced trees [Zhang and Shasha, 1989]: $O(n^2 \log^2 n)$
  ➳ for arbitrary trees [Klein, 1998]: $O(n^3 \log n)$
☞ Tree edit distance approximations:
  ➳ Restricted versions of the tree edit distance:
    ➠ Alignment [Jiang et al., 1995]: $O(n^2)$
    ➠ Isolated subtree [Tanaka and Tanaka, 1988]: $O(n^2)$
    ➠ Top-down [Selkow, 1977, Yang, 1991]: $O(n^2)$
    ➠ Bottom-up [Valiente, 2001]: $O(n)$ → only very specific domains
  ➳ XML versioning [Chawathe et al., 1996, Chawathe and Garcia-Molina, 1997, Lee et al., 2004]: $O(n^2)$ for very different trees
  ➳ Tree edit distance embedding [Garofalakis and Kumar, 2003, Garofalakis and Kumar, 2005]:
    ➠ $O(n \log n)$
    ➠ guaranteed distance distortion for tree edit distance with subtree move
☞ Related work for strings:
  ➳ Navarro [Navarro, 2001]: good overview of the edit distance for strings and its variants
  ➳ Ukkonen [Ukkonen, 1992]: q-grams as lower bound for string edit distance

slide-23
SLIDE 23

pq-Grams — Subtrees of the pq-Extended Tree

☞ Extended tree $T^{p,q}$: patch the boundaries of T by adding null nodes (*):
  ➳ p − 1 ancestors to the root,
  ➳ q − 1 nodes before the first and after the last child of each non-leaf node,
  ➳ q children to each leaf.

☞ pq-gram G: a subtree of $T^{p,q}$ consisting of
  ➳ an anchor node
  ➳ with p − 1 ancestors
  ➳ and q children.
  Contiguous siblings in G are contiguous siblings in $T^{p,q}$.

☞ pq-gram profile $P_{p,q}(T)$:
  ➳ the bag of all pq-grams of T.

[Figure: tree T (root a with children a, b, c; the first child a has children e and b) and its 2,3-extended tree $T^{2,3}$; the 2,3-gram pattern (an anchor with p − 1 ancestors above it and q children below it); example pq-grams of T (a 2,3-gram, a 1,2-gram, and a 3,2-gram); the 2,3-gram profile of T.]
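The three padding rules for the extended tree can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Node` class and the function names `extend_subtree`, `extended_tree`, and `size` are hypothetical helpers that do not appear in the slides, and the example tree T is transcribed from the pq-grams the slides list for it.

```python
from dataclasses import dataclass, field

NULL = "*"  # label of the padding (null) nodes


@dataclass
class Node:
    # Hypothetical minimal tree node: a label and an ordered child list.
    label: str
    children: list = field(default_factory=list)


def extend_subtree(node, q):
    """Copy the subtree rooted at `node`, adding the null nodes below it."""
    if not node.children:
        # Rule 3: a leaf receives q null children.
        return Node(node.label, [Node(NULL) for _ in range(q)])
    # Rule 2: a non-leaf receives q - 1 null nodes before its first child
    # and q - 1 null nodes after its last child.
    before = [Node(NULL) for _ in range(q - 1)]
    after = [Node(NULL) for _ in range(q - 1)]
    middle = [extend_subtree(child, q) for child in node.children]
    return Node(node.label, before + middle + after)


def extended_tree(root, p, q):
    """Build the pq-extended tree of the tree rooted at `root`."""
    top = extend_subtree(root, q)
    # Rule 1: the root receives a chain of p - 1 null ancestors.
    for _ in range(p - 1):
        top = Node(NULL, [top])
    return top


def size(node):
    """Number of nodes in the subtree rooted at `node`."""
    return 1 + sum(size(child) for child in node.children)


# Example tree T: root a with children a, b, c; the first child has children e, b.
T = Node("a", [Node("a", [Node("e"), Node("b")]), Node("b"), Node("c")])
T23 = extended_tree(T, p=2, q=3)
```

For this tree and p = 2, q = 3, the sketch adds 21 null nodes to the 6 original nodes: 1 ancestor, 2 · 4 siblings around the children of the two non-leaves, and 4 · 3 children under the four leaves.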

slide-46
SLIDE 46

pq-Grams — Algorithm for pq-Gram Profile

CREATEPROFILE(T, p, q, P, r, anc)
    anc := shift(anc, l(r))
    sib: shift register of size q (initialized with *)
    if r is a leaf then
        P := P ∪ (anc ◦ sib)
    else
        for each child c (from left to right) of r do
            sib := shift(sib, l(c))
            P := P ∪ (anc ◦ sib)
            P := CREATEPROFILE(T, p, q, P, c, anc)
        for k := 1 to q − 1
            sib := shift(sib, *)
            P := P ∪ (anc ◦ sib)
    return P

[Animation: the slide steps through the algorithm on the example tree T with p = 2, q = 3, showing the registers anc and sib at each step.]

Profile P after the last step, in the order in which the pq-grams are added:

(*,a,*,*,a) (a,a,*,*,e) (a,e,*,*,*) (a,a,*,e,b) (a,b,*,*,*) (a,a,e,b,*) (a,a,b,*,*) (*,a,*,a,b) (a,b,*,*,*) (*,a,a,b,c) (a,c,*,*,*) (*,a,b,c,*) (*,a,c,*,*)
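A possible Python rendering of the recursive profile algorithm is sketched below. It is not the authors' implementation: the `Node` class and the name `create_profile` are hypothetical, l(r) is assumed to be the label of node r, the shift registers are modelled as tuples, and the bag P is kept as a list so that duplicate pq-grams are preserved.

```python
NULL = "*"  # label of the null nodes


class Node:
    # Hypothetical minimal tree node: a label and an ordered child list.
    def __init__(self, label, children=()):
        self.label = label
        self.children = list(children)


def create_profile(r, p, q, profile=None, anc=None):
    """Append the pq-grams of the subtree rooted at r to `profile` (a bag)."""
    if profile is None:
        profile = []
    if anc is None:
        anc = (NULL,) * p              # ancestor register, initially all nulls
    anc = anc[1:] + (r.label,)         # shift(anc, l(r))
    sib = (NULL,) * q                  # sibling register, initially all nulls
    if not r.children:
        profile.append(anc + sib)      # a leaf contributes one pq-gram
    else:
        for c in r.children:           # children from left to right
            sib = sib[1:] + (c.label,)
            profile.append(anc + sib)
            create_profile(c, p, q, profile, anc)
        for _ in range(q - 1):         # flush the register with nulls
            sib = sib[1:] + (NULL,)
            profile.append(anc + sib)
    return profile


# Example tree T: root a with children a, b, c; the first child has children e, b.
T = Node("a", [Node("a", [Node("e"), Node("b")]), Node("b"), Node("c")])
P = create_profile(T, p=2, q=3)
```

Because tuples are immutable, each recursive call works on its own copy of `anc`, which mirrors the way the pseudocode passes the register down to the children. On the example tree the sketch produces 13 pq-grams, with (a,b,*,*,*) occurring twice.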

slide-49
SLIDE 49

pq-Grams — The pq-Gram Profile

The pq-gram profile is

☞ small → size $O(n)$
☞ easy to store
  ➳ represent the pq-grams by a fingerprint hash value
  ➳ store the profile in a single-attribute relation
☞ allows effective distance computation between trees

Theorem 1: For a tree T with l leaves and i non-leaves:

$|P_{p,q}(T)| = 2l + qi - 1$

[Figure: example tree T with node labels a, a, b, c, e, b.]

$P_{2,3}(T)$ as pq-grams and as fingerprint hash values:

pq-gram        hash
(*,a,*,*,a)    10AE
(a,a,*,*,e)    2F1E
(a,e,*,*,*)    1008
(a,a,*,e,b)    13E1
(a,b,*,*,*)    5F31
(a,a,e,b,*)    AE1D
(a,a,b,*,*)    13DF
(*,a,*,a,b)    F310
(a,b,*,*,*)    5F31
(*,a,a,b,c)    45A1
(a,c,*,*,*)    973F
(*,a,b,c,*)    3F1E
(*,a,c,*,*)    11EF
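Storing a profile as a bag of fingerprints could look roughly like the sketch below. The hash function is an assumption: the slides do not say which fingerprint function produces values such as 10AE, so a 2-byte BLAKE2b digest from Python's standard library stands in here, and the resulting values will differ from those on the slide. The names `fingerprint`, `profile`, and `bag` are hypothetical.

```python
import hashlib
from collections import Counter


def fingerprint(pq_gram):
    """Map a pq-gram (tuple of labels) to a 4-hex-digit fingerprint.

    Assumption: a 2-byte BLAKE2b digest, chosen only for illustration.
    """
    data = "\x1f".join(pq_gram).encode("utf-8")  # unit separator between labels
    return hashlib.blake2b(data, digest_size=2).hexdigest().upper()


# The 2,3-gram profile of the example tree T, as listed on the slides.
profile = [
    ("*", "a", "*", "*", "a"), ("a", "a", "*", "*", "e"),
    ("a", "e", "*", "*", "*"), ("a", "a", "*", "e", "b"),
    ("a", "b", "*", "*", "*"), ("a", "a", "e", "b", "*"),
    ("a", "a", "b", "*", "*"), ("*", "a", "*", "a", "b"),
    ("a", "b", "*", "*", "*"), ("*", "a", "a", "b", "c"),
    ("a", "c", "*", "*", "*"), ("*", "a", "b", "c", "*"),
    ("*", "a", "c", "*", "*"),
]

# A bag (multiset) of fingerprints: equal pq-grams map to equal fingerprints,
# so duplicates are counted rather than dropped.
bag = Counter(fingerprint(g) for g in profile)

# Theorem 1 for this tree: l = 4 leaves, i = 2 non-leaves, q = 3.
expected_size = 2 * 4 + 3 * 2 - 1
```

Such short fingerprints can collide, so two different pq-grams may occasionally share a value; a real system would pick the digest width with that trade-off in mind.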

slide-52
SLIDE 52

pq-Grams — pq-Gram Distance

☞ Definition 1 For two trees T1 and T2 the pq-gram distance is:

$\Delta_{p,q}(T_1, T_2) = 1 - 2\,\frac{|P_{p,q}(T_1) \cap P_{p,q}(T_2)|}{|P_{p,q}(T_1) \cup P_{p,q}(T_2)|}$

☞ can be computed in O(n log n) time and O(n) space (bag intersection of relations)
☞ other terms are constants for normalization:
  ➳ $\Delta_{p,q}(T_1, T_2) = 1$ if the trees have no pq-grams in common
  ➳ $\Delta_{p,q}(T_1, T_2) = 0$ if the trees have the same pq-gram profile
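Definition 1 can be sketched with Python's `collections.Counter` as the bag type. Two assumptions are made here and should be checked against the paper: profiles are passed as lists of hashable pq-grams, and the union in the denominator is read as a bag union that adds multiplicities, which is the reading under which identical profiles give distance 0. The function name `pq_gram_distance` and the handling of two empty profiles are choices of this sketch.

```python
from collections import Counter

def pq_gram_distance(profile1, profile2):
    """pq-gram distance between two profiles given as bags (lists) of pq-grams."""
    bag1, bag2 = Counter(profile1), Counter(profile2)
    # Bag intersection: each pq-gram counts min(multiplicity in bag1, multiplicity in bag2).
    common = sum((bag1 & bag2).values())
    # Bag union with summed multiplicities (assumption, see lead-in).
    total = sum(bag1.values()) + sum(bag2.values())
    if total == 0:
        return 0.0  # two empty profiles are treated as identical (assumption)
    return 1.0 - 2.0 * common / total

same = pq_gram_distance(["x", "x", "y"], ["x", "x", "y"])
disjoint = pq_gram_distance(["x", "y"], ["u", "v"])
partial = pq_gram_distance(["x", "x", "y"], ["x", "z"])
print(same, disjoint, partial)
```

The slide's O(n log n) bound corresponds to a sort-based bag intersection of relations; the hash-based `Counter` used here is a convenience for the example rather than a claim about the original implementation.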

slide-57
SLIDE 57

Properties — Sensitivity to Structure Change

☞ Intuition: Nodes with structural information → more significant
☞ Address application: Mismatch of houses (with subnumbers and apartment numbers) is more significant than mismatch of apartments.

Figure: tree T (nodes a, b, c, d, e, f, g, h, i, k) and two variants, each at tree edit distance 2 from T:
  ➳ T′ (T without the nodes g and k): $\Delta_{2,3} = 0.30$
  ➳ T′′ (T without the nodes c and e): $\Delta_{2,3} = 0.89$

slide-61
SLIDE 61

Properties — Sensitive to Structure Change

☞ node changes ⇒ pq-grams change
☞ pq-grams change ⇒ distance increases
☞ $\mathrm{cnt}_{pq}(T, v) \approx q + f^{p}$
  ➳ leaf change: cost depends only on q
  ➳ non-leaf change: p dominates
  ➳ p controls structure sensitivity

Theorem 2 For a complete tree T (fanout f, depth d) the number of pq-grams that contain a node v of level l is:

$$
\mathrm{cnt}_{pq}(T, v) = q \cdot \operatorname{sgn}(l) +
\begin{cases}
\dfrac{f^{p} - 1}{f - 1}\,(f + q - 1) & \text{if } p \le d - l \\[2ex]
\dfrac{f^{d-l} - 1}{f - 1}\,(f + q - 1) + f^{d-l} & \text{if } p > d - l
\end{cases}
$$
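Theorem 2 is easy to evaluate directly, which helps to see how p and q enter the count. The sketch below assumes the root sits at level 0 (so that sgn(l) = 0 only for the root), leaves sit at level d, and f ≥ 2; the function name `cnt_pq` and the sample parameters (fanout 3, depth 2) are chosen for illustration only.

```python
def cnt_pq(p, q, f, d, l):
    """Number of pq-grams containing a node at level l of a complete tree (Theorem 2).

    Assumes root level 0, leaf level d, and fanout f >= 2.
    """
    sgn = 1 if l > 0 else 0
    if p <= d - l:
        # (f**p - 1) / (f - 1) is a geometric sum, hence an integer.
        below = (f**p - 1) // (f - 1) * (f + q - 1)
    else:
        below = (f**(d - l) - 1) // (f - 1) * (f + q - 1) + f**(d - l)
    return q * sgn + below

# Hypothetical complete tree with fanout 3 and depth 2, using 2,3-grams.
leaf_count = cnt_pq(p=2, q=3, f=3, d=2, l=2)
inner_count = cnt_pq(p=2, q=3, f=3, d=2, l=1)
root_count = cnt_pq(p=2, q=3, f=3, d=2, l=0)
print(leaf_count, inner_count, root_count)
```

For a leaf the second case reduces to q + 1, independent of p and f, which is the "leaf change: cost depends only on q" statement on the slide. With these sample parameters the leaf and inner-node counts agree with the 4 and 11 pq-grams quoted for h, i, k and e in the deck's later subtree-deletion example.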

slide-66
SLIDE 66

Properties — Robust to Local Changes

☞ Intuition: Weight local changes less than distributed changes!
☞ Address application: missing house with apartment numbers → small difference, but many nodes change
☞ Local changes → fewer pq-grams change
  ➳ neighboring nodes “share” pq-grams
  ➳ a change to a shared pq-gram counts only once

Theorem 3 Delete or update all nodes of a subtree with l leaves and i non-leaves ⇒ only $2l + iq + q - 1$ pq-grams change

Example: Delete the subtree rooted in e of tree T (nodes a, b, c, d, e, f, g, h, i, k)

  • e → in 11 pq-grams
  • h, i, k → in 4 pq-grams each
  • actually changing: 11 pq-grams (vs. 3×4+11 = 23)
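As a quick check, Theorem 3 can be evaluated for this example. The worked line below assumes that the subtree rooted in e consists of e itself (one non-leaf) and the three leaves h, i, k, and that 2,3-grams are used (q = 3), as elsewhere in the deck. Note that i in the formula counts non-leaves and is unrelated to the node labelled i.

```latex
% Assumption: subtree rooted in e has l = 3 leaves (h, i, k) and i = 1 non-leaf (e); q = 3.
\[
2l + iq + q - 1 \;=\; 2 \cdot 3 + 1 \cdot 3 + 3 - 1 \;=\; 11
\]
```

This agrees with the 11 changing pq-grams stated in the example, compared with the 23 obtained by adding up the per-node counts.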

slide-69
SLIDE 69

Experiments — Sensitivity to Structure Changes

☞ Cost for leaf change → depends only on q
☞ Experiment:
  ➳ delete leaf nodes
  ➳ measure edit distance

Figure: pq-gram distance (y-axis) plotted against edit distance (x-axis), in two panels:
  ➳ vary p: 1,3-grams, 2,3-grams, 3,3-grams, 4,3-grams
  ➳ vary q: 2,1-grams, 2,2-grams, 2,3-grams, 2,4-grams

(Artificial tree with 144 nodes, 102 leaves, fanout 2–6 and depth 6. Average over 100 runs.)

slide-72
SLIDE 72

Experiments — Sensitivity to Structure Changes

☞ Cost for non-leaf change → controlled by p
☞ Experiment:
  ➳ delete non-leaf nodes
  ➳ measure edit distance

Figure: pq-gram distance (y-axis) plotted against edit distance (x-axis), in two panels:
  ➳ vary p: 1,3-grams, 2,3-grams, 3,3-grams, 4,3-grams
  ➳ vary q: 2,1-grams, 2,2-grams, 2,3-grams, 2,4-grams

(Artificial tree with 144 nodes, 102 leaves, fanout 2–6 and depth 6. Average over 100 runs.)

slide-75
SLIDE 75

Experiments — Robustness to Local Changes

☞ Subtree deletions → cheaper than distributed deletions
☞ Experiment:
  ➳ delete subtree
  ➳ randomly delete same number of distributed nodes
  ➳ compare edit distance

Figure: 2,3-gram distance (y-axis) plotted against edit distance (x-axis) for distributed changes and local changes.

(Artificial tree with 144 nodes, 102 leaves, fanout 2–6 and depth 6. Average over 100 runs.)

slide-78
SLIDE 78

Experiments — Scalability to Large Trees

☞ pq-gram distance → scalable to large trees
☞ compare with edit distance
☞ Experiment: For pair of trees
  ➳ compute tree edit distance (a) and pq-gram distance
  ➳ vary tree size: up to $2 \times 10^{6}$ nodes
  ➳ measure wall clock time

Figure: time [sec] plotted against number of nodes (n), both on logarithmic axes, for edit dist and 2,3-gram dist; the plot carries the annotation "27 hours".

(a) implementation by Zhang and Shasha (http://www.cs.nyu.edu/cs/faculty/shasha/papers/tree.html)

slide-81
SLIDE 81

Experiments — Influence of p and q on Scalability

☞ Scalability independent of p and q.
☞ Experiment: For pair of trees
  ➳ compute pq-gram distance for varying p and q
  ➳ vary tree size: up to $10^{6}$ nodes
  ➳ measure wall clock time

Figure: time [sec] plotted against number of nodes (n) for 3,4-gram dist, 2,3-gram dist and 1,2-gram dist.

slide-84
SLIDE 84

Experiments — Effectiveness for Real World Dataset

☞ pq-gram distance → effective approximation of tree edit distance
☞ test on real world data (address databases)
☞ Experiment: For two sets of address trees
  ➳ find matches (closest tree of other set)
  ➳ use different distance functions
  ➳ count correct matches

distance         accuracy   correct   false pos.   runtime
edit dist        82.7%      248       9            187,538s
1,2-grams        78.3%      235       5            181s
2,3-grams        77.3%      232       4            204s
3,2-grams        79.3%      238       2            180s
tree-embedding   69.0%      207       8            313s
bottom-up        50.0%      150       12           237s
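The matching step of this experiment (closest tree of the other set) can be sketched as a nearest-neighbour search with a pluggable distance function. Everything named below is hypothetical: the function `closest_matches`, the dictionary encoding of the two sets, and in particular `toy_distance`, which is a simple set-overlap stand-in and not the pq-gram distance or the tree edit distance.

```python
def closest_matches(set_a, set_b, distance):
    """Map each key of set_a to the key of set_b whose item has the smallest distance."""
    matches = {}
    for key_a, item_a in set_a.items():
        matches[key_a] = min(set_b, key=lambda key_b: distance(item_a, set_b[key_b]))
    return matches

def toy_distance(x, y):
    """Stand-in distance on Python sets (1 - overlap ratio); NOT the pq-gram distance."""
    union = x | y
    return 1.0 - len(x & y) / len(union) if union else 0.0

# Hypothetical miniature data; real inputs would be profiles of address trees.
land_register = {"LR-1": {"a", "b", "c"}, "LR-2": {"x", "y"}}
registration_office = {"RO-1": {"x", "y", "z"}, "RO-2": {"a", "b"}}

matches = closest_matches(land_register, registration_office, toy_distance)
print(matches)
```

Swapping `toy_distance` for another distance function changes only the third argument, which mirrors the "use different distance functions" step. The accuracy column in the table is consistent with dividing the number of correct matches by 300 (for example 248/300 ≈ 82.7%), although the slide does not state that total.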

slide-87
SLIDE 87

Experiments — Tree-Embedding vs. pq-Grams

Tree edit distance embedding:

☞ variable shapes
☞ elements can be
  ➳ single node (no structure)
  ➳ chains (only vertical structure)
  ➳ contiguous leaves (only horizontal structure)
  ➳ subtrees with vertical and horizontal structure
☞ guarantees with respect to tree edit distance

pq-grams:

☞ fixed shape
☞ horizontal and vertical structure
☞ emphasizes structure

Typical address tree (nearly complete tree):

element type   counts in phases 1–6   tot.   tree-embed   pq-gram
single nodes   29 8 4 2 1 -           44     65%          0%
chains         - 1 1 -                2      3%           0%
cont. leaves   - 7 2 -                9      13%          0%
subtrees       - 3 4 3 2 1            13     19%          100%

slide-96
SLIDE 96

Conclusion and Future Work

☞ pq-gram distance
  ➳ scalable to large trees
  ➳ emphasizes structure
  ➳ robust to local change
  ➳ effective approximation of tree edit distance
☞ Ongoing and future work:
  ➳ strict bounds for pq-gram approximations
  ➳ clustering of XML data
  ➳ incremental updates of pq-gram profiles

slide-97
SLIDE 97

References

[Chawathe and Garcia-Molina, 1997] Chawathe, S. S. and Garcia-Molina, H. (1997). Meaningful change detection in structured data. In Proceedings of the ACM SIGMOD International Conference on Management of Data, pages 26–37, Tucson, Arizona, United States. ACM Press.

[Chawathe et al., 1996] Chawathe, S. S., Rajaraman, A., Garcia-Molina, H., and Widom, J. (1996). Change detection in hierarchically structured information. In Proceedings of the ACM SIGMOD International Conference on Management of Data, pages 493–504, Montreal, Quebec, Canada. ACM Press.

[Garofalakis and Kumar, 2003] Garofalakis, M. and Kumar, A. (2003). Correlating XML data streams using tree-edit distance embeddings. In Proceedings of the Twenty-Second ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (PODS 2003), pages 143–154, San Diego, California. ACM Press.

[Garofalakis and Kumar, 2005] Garofalakis, M. and Kumar, A. (2005). XML stream processing using tree-edit distance embeddings. ACM Transactions on Database Systems, 30(1):279–332.

[Jiang et al., 1995] Jiang, T., Wang, L., and Zhang, K. (1995). Alignment of trees - an alternative to tree edit. Theoretical Computer Science, 143(1):137–148.

[Klein, 1998] Klein, P. N. (1998). Computing the edit-distance between unrooted ordered trees. In Proceedings of the 6th European Symposium on Algorithms, volume 1461 of Lecture Notes in Computer Science, pages 91–102, Venice, Italy. Springer.

slide-98
SLIDE 98

[Lee et al., 2004] Lee, K.-H., Choy, Y.-C., and Cho, S.-B. (2004). An efficient algorithm to compute differences between structured documents. IEEE Transactions on Knowledge and Data Engineering, 16(8):965–979.

[Navarro, 2001] Navarro, G. (2001). A guided tour to approximate string matching. ACM Computing Surveys, 33(1):31–88.

[Selkow, 1977] Selkow, S. M. (1977). The tree-to-tree editing problem. Information Processing Letters, 6(6):184–186.

[Tanaka and Tanaka, 1988] Tanaka, E. and Tanaka, K. (1988). The tree-to-tree editing problem. Int. Journal of Pattern Recognition and Artificial Intelligence (IJPRAI), 2(2):221–240.

[Ukkonen, 1992] Ukkonen, E. (1992). Approximate string-matching with q-grams and maximal matches. Theoretical Computer Science, 92(1):191–211.

[Valiente, 2001] Valiente, G. (2001). An efficient bottom-up distance between trees. In Proceedings of the 8th Symposium on String Processing and Information Retrieval, pages 212–219, Laguna de San Rafael, Chile. IEEE Computer Science Press.

[Yang, 1991] Yang, W. (1991). Identifying syntactic differences between two programs. Software: Practice & Experience, 21(7):739–755.

[Zhang and Shasha, 1989] Zhang, K. and Shasha, D. (1989). Simple fast algorithms for the editing distance between trees and related problems. SIAM Journal on Computing, 18(6):1245–1262.