SLIDE 1

Approximate Tree Matching with pq-Grams

Nikolaus Augsten (a), Michael Böhlen, Johann Gamper

DIS - Center for Database and Information Systems Free University of Bozen-Bolzano, Italy

www.inf.unibz.it

1 – Motivation ........ 2
2 – Related Work ........ 6
3 – pq-Grams ........ 7
4 – Properties ........ 11
5 – Experiments ........ 14
6 – Conclusion and Future Work ........ 21

(a) Supported by the Municipality of Bozen-Bolzano.

slide-2
SLIDE 2

Motivation — Example Data Sources

☞ We want to link data items in different databases that correspond to the same real-world object.
☞ Example query: Who lives in Braun's apartment?

Land Register, relation LR (a dash marks an empty field):

id   num  entr  apt  owner
91   1    -     1    Maier
91   1    -     2    Rossi
91   1    -     3    Maier
91   2    A     -    Braun
...
74   3    A     1    Spiro
74   3    A     2    Barducci
74   3    A     3    Costanzi
...

Registration Office, relation RO:

resident  id   num  entr  apt
Pichler   30   1    -     1
Rieder    30   1    -     3
Fischer   30   2    A     -
Rossi     30   2    B     1
...
Spiro     120  3    A     1
Barducci  120  3    A     2
Costanzi  120  3    A     3
...

Land Register streets, relation SLR:

id   street
139  SIEGESPLATZ
109  GILMWEG
185  P. R. GIULIANI STR.
91   CESARE ABBA STRASSE
165  MUSTERPLATZ
115  ITALIENSTRASSE
259  TELSERDURCHGANG
207  SERNESIDURCHGANG
33   BOZNER BODENWEG
263  TRIESTER STRASSE
262  TRIENTER STRASSE
285  WALTHERPLATZ
266  TURINER STRASSE
...

Registration Office streets, relation SRO:

id    street
30    Giuseppe-Cesare-Abba-Str.
5220  Bozner-Boden-Str.
3000  Hermann-von-Gilm-Str.
3030  Pater-Reginaldo-Giuliani-Str.
3540  Italienallee
4440  Musterplatzl
7180  Raffaello-Sernesi-Galerie
7590  Telsergalerie
7620  Friedensplatz
7650  Turiner Str.
7740  Trienter Str.
7860  Triester Str.
8580  Walther-v.-d.-Vogelweide-Pl.
...

VLDB 2005, Trondheim. Nikolaus Augsten, Michael Böhlen, Johann Gamper
SLIDE 3

Motivation — Address Trees

☞ Residential addresses are hierarchical → address tree
☞ Idea: corresponding streets ⇒ similar address trees

How similar are two address trees?

[Figure: the address trees of CESARE ABBA STRASSE and Giuseppe-Cesare-Abba-Str., with house numbers, entrances, and apartment numbers as node labels.]

SLIDE 4

Motivation — Standard Solution: The Edit Distance

☞ Edit distance: the minimum-cost sequence of edit operations (node insertion, node deletion, and label change) that transforms one tree into another.

[Figure: tree T is transformed into T′ by insert(k, e, 3), and T′ into T′′ by rename(a, x).]

Edit distance: $\mathrm{dist}_{ed}(T, T'') = 2$

☞ Problem: the best algorithms take $O(n^2 \log^2(n))$ time ⇒ not scalable.

SLIDE 5

Motivation — Problem Definition

☞ Our goal: Find an efficient and effective approximation of the tree edit distance that
  ➠ is scalable for large trees,
  ➠ emphasizes structure.

SLIDE 6

Related Work — Tree Distances

☞ n → number of tree nodes
☞ Tree edit distance:
  ➳ for balanced trees [Zhang and Shasha, 1989]: $O(n^2 \log^2(n))$
  ➳ for arbitrary trees [Klein, 1998]: $O(n^3 \log(n))$
☞ Tree edit distance approximations:
  ➳ Restricted versions of the tree edit distance:
    ➠ Alignment [Jiang et al., 1995]: $O(n^2)$
    ➠ Isolated subtree [Tanaka and Tanaka, 1988]: $O(n^2)$
    ➠ Top-down [Selkow, 1977, Yang, 1991]: $O(n^2)$
    ➠ Bottom-up [Valiente, 2001]: $O(n)$ → only very specific domains
  ➳ XML versioning [Chawathe et al., 1996, Chawathe and Garcia-Molina, 1997, Lee et al., 2004]: $O(n^2)$ for very different trees
  ➳ Tree-edit distance embedding [Garofalakis and Kumar, 2003, Garofalakis and Kumar, 2005]:
    ➠ $O(n \log n)$
    ➠ guaranteed distance distortion for tree edit distance with subtree move
☞ Related work for strings:
  ➳ Navarro [Navarro, 2001]: good overview of the edit distance for strings and its variants
  ➳ Ukkonen [Ukkonen, 1992]: q-grams as lower bound for string edit distance

SLIDE 7

pq-Grams — Subtrees of the pq-Extended Tree

☞ Extended tree $T^{p,q}$: patch the boundaries of T by adding null nodes (*):
  ➳ p − 1 ancestors to the root
  ➳ q − 1 nodes before the first and after the last child of each non-leaf node
  ➳ q children to each leaf
☞ pq-gram G: a subtree of $T^{p,q}$ consisting of
  ➳ an anchor node
  ➳ with p − 1 ancestors
  ➳ and q children.
  Contiguous siblings in G are contiguous siblings in $T^{p,q}$.
☞ pq-gram profile $P_{p,q}(T)$:
  ➳ Bag of all pq-grams of T.

[Figure: the example tree T (root a with children a, b, c; the inner a has children e and b) and its 2,3-extended tree $T^{2,3}$; the 2,3-gram pattern (an anchor with p − 1 ancestors and q children); example 2,3-, 1,2-, and 3,2-grams of T; the 2,3-gram profile of T.]

SLIDE 8

pq-Grams — Algorithm for pq-Gram Profile

CREATEPROFILE(T, p, q, P, r, anc)
    anc := shift(anc, l(r))
    sib: shift register of size q (initialized with *)
    if r is a leaf then
        P := P ∪ (anc ◦ sib)
    else
        for each child c (from left to right) of r do
            sib := shift(sib, l(c))
            P := P ∪ (anc ◦ sib)
            P := CREATEPROFILE(T, p, q, P, c, anc)
        for k := 1 to q − 1
            sib := shift(sib, *)
            P := P ∪ (anc ◦ sib)
    return P

[Animation: step-by-step trace of CREATEPROFILE on the example tree T with p = 2 and q = 3, showing the shift registers anc and sib at each node and the growing profile P.]
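The CREATEPROFILE recursion can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the `Node` class, the function name `pq_gram_profile`, the use of tuples as shift registers, and the use of a `Counter` as the bag P are all choices made for illustration.

```python
from collections import Counter

NULL = "*"  # label of the null nodes that pad the extended tree


class Node:
    """Hypothetical ordered, labeled tree node (not taken from the slides)."""

    def __init__(self, label, children=()):
        self.label = label
        self.children = list(children)


def pq_gram_profile(root, p=2, q=3):
    """Return the pq-gram profile of a tree as a bag of label tuples."""
    profile = Counter()

    def visit(node, anc):
        # anc := shift(anc, l(r)): drop the oldest ancestor, append this label.
        anc = anc[1:] + (node.label,)
        sib = (NULL,) * q  # shift register of size q, initialized with *
        if not node.children:
            profile[anc + sib] += 1  # a leaf anchors exactly one pq-gram
            return
        for child in node.children:  # from left to right
            sib = sib[1:] + (child.label,)
            profile[anc + sib] += 1
            visit(child, anc)
        for _ in range(q - 1):  # slide the window past the last child
            sib = sib[1:] + (NULL,)
            profile[anc + sib] += 1

    visit(root, (NULL,) * p)
    return profile


# Example tree from the slides: root a with children a, b, c;
# the inner a has the children e and b.
tree = Node("a", [Node("a", [Node("e"), Node("b")]), Node("b"), Node("c")])
profile = pq_gram_profile(tree, p=2, q=3)
print(sum(profile.values()))  # prints 13
```

The recursion depth equals the height of the tree, so a very deep tree would need an iterative variant with an explicit stack.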

SLIDE 9

pq-Grams — The pq-Gram Profile

The pq-gram profile is

☞ small → size $O(n)$
☞ easy to store
  ➳ represent the pq-grams by fingerprint hash values
  ➳ store the profile in a single-attribute relation
☞ allows effective distance computation between trees

Theorem 1. For a tree T with l leaves and i non-leaves: $|P_{p,q}(T)| = 2l + qi - 1$.

Example: the 2,3-gram profile $P_{2,3}(T)$ of the tree T (root a with children a, b, c; the inner a has children e and b), as label tuples and as fingerprint hashes:

pq-gram        hash
(*,a,*,*,a)    10AE
(a,a,*,*,e)    2F1E
(a,e,*,*,*)    1008
(a,a,*,e,b)    13E1
(a,b,*,*,*)    5F31
(a,a,e,b,*)    AE1D
(a,a,b,*,*)    13DF
(*,a,*,a,b)    F310
(a,b,*,*,*)    5F31
(*,a,a,b,c)    45A1
(a,c,*,*,*)    973F
(*,a,b,c,*)    3F1E
(*,a,c,*,*)    11EF
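Storing a profile as a single-attribute relation of fingerprints could look like the sketch below. The slides do not say which hash function produces the fingerprints; `hashlib.blake2b` with a 2-byte digest, the separator character, and the names `fingerprint` and `profile_relation` are assumptions for illustration, so the values will not match the four-digit hashes shown on the slide.

```python
import hashlib
from collections import Counter


def fingerprint(pq_gram, digest_bytes=2):
    """Map a pq-gram (tuple of labels) to a short hexadecimal fingerprint."""
    # Assumption: the unit separator character never occurs inside a label.
    data = "\x1f".join(pq_gram).encode("utf-8")
    return hashlib.blake2b(data, digest_size=digest_bytes).hexdigest().upper()


def profile_relation(pq_grams):
    """Return the profile as a bag of fingerprints, one value per pq-gram."""
    return Counter(fingerprint(gram) for gram in pq_grams)


grams = [
    ("*", "a", "*", "*", "a"),
    ("a", "b", "*", "*", "*"),
    ("a", "b", "*", "*", "*"),  # duplicates stay in the bag
]
relation = profile_relation(grams)
```

Equal pq-grams always receive equal fingerprints, which is what the bag intersection needs. A digest this short can also map different pq-grams to the same value, so a real system would likely choose a longer one.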
SLIDE 10

pq-Grams — pq-Gram Distance

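☞ Definition 1. For two trees $T_1$ and $T_2$ the pq-gram distance is:

$$\Delta_{p,q}(T_1, T_2) = 1 - 2\,\frac{|P_{p,q}(T_1) \cap P_{p,q}(T_2)|}{|P_{p,q}(T_1) \cup P_{p,q}(T_2)|}$$

  Here ∩ and ∪ are bag operations, and the bag union adds multiplicities, so $|P_{p,q}(T_1) \cup P_{p,q}(T_2)| = |P_{p,q}(T_1)| + |P_{p,q}(T_2)|$.
☞ can be computed in $O(n \log n)$ time and $O(n)$ space (bag intersection of relations)
☞ the other terms are constants for normalization:
  ➳ $\Delta_{p,q}(T_1, T_2) = 1$ if the trees have no pq-grams in common
  ➳ $\Delta_{p,q}(T_1, T_2) = 0$ if the trees have the same pq-gram profile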
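Definition 1 can be sketched with Python's `Counter` as the bag type. The function name `pq_gram_distance` is hypothetical, and the sketch assumes the union in the denominator is the additive bag union, which is the reading under which identical profiles give distance 0.

```python
from collections import Counter


def pq_gram_distance(profile1, profile2):
    """pq-gram distance between two profiles given as bags (Counters)."""
    # Counter & Counter keeps the minimum multiplicity: the bag intersection.
    common = sum((profile1 & profile2).values())
    # Additive bag union: its size is the sum of the two profile sizes.
    total = sum(profile1.values()) + sum(profile2.values())
    if total == 0:  # not reachable for real trees, which have >= 1 pq-gram
        return 0.0
    return 1.0 - 2.0 * common / total


p1 = Counter({("*", "a"): 1, ("a", "b"): 2})
p2 = Counter({("a", "b"): 1, ("a", "c"): 1})
print(pq_gram_distance(p1, p2))  # 1 - 2 * 1 / 5
```

This in-memory version relies on hashing. The $O(n \log n)$ bound on the slide refers to a bag intersection of relations, for example via sorting.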

SLIDE 11

Properties — Sensitivity to Structure Change

☞ Intuition: nodes with structural information → more significant
☞ Address application: a mismatch of houses (with subnumbers and apartment numbers) is more significant than a mismatch of apartments.

[Figure: tree T (nodes a, b, c, d, e, f, g, h, i, k) between T′ (T without k and g) and T′′ (T without e and c).]

T vs. T′: $\mathrm{dist}_{ed} = 2$, $\Delta_{2,3} = 0.30$
T vs. T′′: $\mathrm{dist}_{ed} = 2$, $\Delta_{2,3} = 0.89$

SLIDE 12

Properties — Sensitive to Structure Change

☞ node changes ⇒ pq-grams change
☞ pq-grams change ⇒ distance increases
☞ $\mathrm{cnt}_{pq}(T, v) \approx q + f^p$
  ➳ leaf change: cost depends only on q
  ➳ non-leaf change: p prevalent
  ➳ p controls structure sensitivity

Theorem 2. For a complete tree T (fanout f, depth d) the number of pq-grams that contain a node v of level l is:

$$\mathrm{cnt}_{pq}(T, v) = q\,\mathrm{sgn}(l) + \begin{cases} \dfrac{f^{p} - 1}{f - 1}\,(f + q - 1) & \text{if } p \le d - l \\[2ex] \dfrac{f^{d-l} - 1}{f - 1}\,(f + q - 1) + f^{d-l} & \text{if } p > d - l. \end{cases}$$
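Theorem 2 can be cross-checked by enumerating the pq-grams of a small complete tree. The sketch below assumes the root has level 0, the leaves have level d, and f ≥ 2; the names `cnt_formula` and `cnt_brute_force` are hypothetical, and nodes are identified by integer ids instead of labels so that every node is distinguishable.

```python
from collections import Counter
from itertools import count


def cnt_formula(f, d, l, p, q):
    """Theorem 2: number of pq-grams containing a node of level l (f >= 2)."""
    sgn = 1 if l > 0 else 0
    if p <= d - l:
        return q * sgn + (f**p - 1) // (f - 1) * (f + q - 1)
    return q * sgn + (f**(d - l) - 1) // (f - 1) * (f + q - 1) + f**(d - l)


def cnt_brute_force(f, d, p, q):
    """Enumerate all pq-grams of a complete tree; return {level: set of counts}."""
    ids = count()
    level_of = {}
    grams = []

    def visit(level, anc):
        node = next(ids)
        level_of[node] = level
        anc = anc[1:] + (node,)
        sib = (None,) * q  # None plays the role of the null node *
        if level == d:  # leaf
            grams.append(anc + sib)
            return node
        for _ in range(f):
            child = visit(level + 1, anc)
            sib = sib[1:] + (child,)
            grams.append(anc + sib)
        for _ in range(q - 1):
            sib = sib[1:] + (None,)
            grams.append(anc + sib)
        return node

    visit(0, (None,) * p)
    containing = Counter(n for gram in grams for n in gram if n is not None)
    per_level = {}
    for node, level in level_of.items():
        per_level.setdefault(level, set()).add(containing[node])
    return per_level


print(cnt_formula(2, 3, 1, 2, 3))  # prints 15
```

For a binary tree of depth 3 with p = 2 and q = 3, a node on level 1 lies in 3 + 3 · 4 = 15 pq-grams, and the enumeration yields the same count for every node of that level.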

SLIDE 13

Properties — Robust to Local Changes

☞ Intuition: weight local changes less than distributed changes!
☞ Address application: a missing house with apartment numbers → small difference, but many nodes change
☞ Local changes → fewer pq-grams change
  ➳ neighboring nodes "share" pq-grams
  ➳ a shared pq-gram counts only once.

Theorem 3. Deleting or updating all nodes of a subtree with l leaves and i non-leaves changes only $2l + iq + q - 1$ pq-grams.

Example: delete the subtree rooted in e

  • e → in 11 pq-grams
  • h, i, k → in 4 pq-grams each
  • actually changing: 11 pq-grams (vs. 3×4+11 = 23)

[Figure: tree T (nodes a, b, c, d, e, f, g, h, i, k) with the subtree rooted in e, whose children are h, i, k.]
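The 11 in the example follows directly from Theorem 3. The subtree rooted in e has l = 3 leaves (h, i, k) and i = 1 non-leaf (e); q = 3 is an assumption here, taken from the 2,3-grams used in the deck's other examples:

```latex
2l + iq + q - 1 \;=\; 2 \cdot 3 + 1 \cdot 3 + 3 - 1 \;=\; 11
```

Summing the per-node counts instead gives 3 × 4 + 11 = 23, because the pq-grams shared between e and its children are counted more than once.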

SLIDE 14

Experiments — Sensitivity to Structure Changes

☞ Cost for leaf change → depends only on q
☞ Experiment:
  ➳ delete leaf nodes
  ➳ measure edit distance

[Figure, two panels: pq-gram distance plotted against edit distance. Panel "vary p": 1,3-grams, 2,3-grams, 3,3-grams, 4,3-grams. Panel "vary q": 2,1-grams, 2,2-grams, 2,3-grams, 2,4-grams.]

(Artificial tree with 144 nodes, 102 leaves, fanout 2–6 and depth 6. Average over 100 runs.)
SLIDE 15

Experiments — Sensitivity to Structure Changes

☞ Cost for non-leaf change → controlled by p
☞ Experiment:
  ➳ delete non-leaf nodes
  ➳ measure edit distance

[Figure, two panels: pq-gram distance plotted against edit distance. Panel "vary p": 1,3-grams, 2,3-grams, 3,3-grams, 4,3-grams. Panel "vary q": 2,1-grams, 2,2-grams, 2,3-grams, 2,4-grams.]

(Artificial tree with 144 nodes, 102 leaves, fanout 2–6 and depth 6. Average over 100 runs.)
SLIDE 16

Experiments — Robustness to Local Changes

☞ Subtree deletions → cheaper than distributed deletions
☞ Experiment:
  ➳ delete subtree
  ➳ randomly delete same number of distributed nodes
  ➳ compare edit distance

[Figure: 2,3-gram distance plotted against edit distance for distributed changes and local changes.]

(Artificial tree with 144 nodes, 102 leaves, fanout 2–6 and depth 6. Average over 100 runs.)
SLIDE 17

Experiments — Scalability to Large Trees

☞ pq-gram distance → scalable to large trees
☞ compare with edit distance
☞ Experiment: for a pair of trees
  ➳ compute tree edit distance (a) and pq-gram distance
  ➳ vary tree size: up to $2 \times 10^6$ nodes
  ➳ measure wall clock time

[Figure: time [sec] plotted against number of nodes (n) on logarithmic axes, for edit dist and 2,3-gram dist; one point is annotated "27 hours".]

(a) Implementation by Zhang and Shasha (http://www.cs.nyu.edu/cs/faculty/shasha/papers/tree.html)

SLIDE 18

Experiments — Influence of p and q on Scalability

☞ Scalability is independent of p and q.
☞ Experiment: for a pair of trees
  ➳ compute pq-gram distance for varying p and q
  ➳ vary tree size: up to $10^6$ nodes
  ➳ measure wall clock time

[Figure: time [sec] plotted against number of nodes (n) for 3,4-gram dist, 2,3-gram dist, and 1,2-gram dist.]

SLIDE 19

Experiments — Effectiveness for Real World Dataset

☞ pq-gram distance → effective approximation of tree edit distance
☞ test on real world data (address databases)
☞ Experiment: for two sets of address trees
  ➳ find matches (closest tree of other set)
  ➳ use different distance functions
  ➳ count correct matches

distance        accuracy  correct  false pos.  runtime
edit dist       82.7%     248      9           187,538s
1,2-grams       78.3%     235      5           181s
2,3-grams       77.3%     232      4           204s
3,2-grams       79.3%     238      2           180s
tree-embedding  69.0%     207      8           313s
bottom-up       50.0%     150      12          237s
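The matching step of this experiment (closest tree of the other set) can be sketched as a nearest-neighbor search over precomputed profiles. This is an illustration only: the function names, the dictionary layout, and the toy profiles are assumptions, and the slides do not describe how ties or false positives were handled.

```python
from collections import Counter


def pq_gram_distance(profile1, profile2):
    """pq-gram distance between two non-empty profiles given as bags."""
    common = sum((profile1 & profile2).values())
    return 1.0 - 2.0 * common / (sum(profile1.values()) + sum(profile2.values()))


def closest_matches(set_a, set_b):
    """For every tree in set_a, return the key of its closest tree in set_b."""
    return {
        key_a: min(set_b, key=lambda key_b: pq_gram_distance(prof_a, set_b[key_b]))
        for key_a, prof_a in set_a.items()
    }


# Toy profiles with made-up pq-gram identifiers (not data from the slides).
set_a = {"A1": Counter(x=2, y=1), "A2": Counter(z=3)}
set_b = {"B1": Counter(z=2, w=1), "B2": Counter(x=2, y=1, w=1)}
matches = closest_matches(set_a, set_b)
print(matches)  # prints {'A1': 'B2', 'A2': 'B1'}
```

The search compares every pair of trees, so the cost of a single distance computation dominates the runtime, which is where the gap between the edit distance and the pq-gram rows of the table comes from.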

SLIDE 20

Experiments — Tree-Embedding vs. pq-Grams

Tree edit distance embedding:

☞ variable shapes
☞ elements can be
  ➳ single node (no structure)
  ➳ chains (only vertical structure)
  ➳ contiguous leaves (only horizontal structure)
  ➳ subtrees with vertical and horizontal structure
☞ guarantees with respect to tree edit distance

pq-grams:

☞ fixed shape
☞ horizontal and vertical structure
☞ emphasizes structure

Typical address tree (nearly complete tree), number of elements per phase and share of all elements:

Phase         1   2   3   4   5   6   tot.  tree-embed  pq-gram
single nodes  29  8   4   2   1   -   44    65%         0%
chains        -   1   1   -   -   -   2     3%          0%
cont. leaves  -   7   2   -   -   -   9     13%         0%
subtrees      -   3   4   3   2   1   13    19%         100%

SLIDE 21

Conclusion and Future Work

☞ pq-gram distance
  ➳ scalable to large trees
  ➳ emphasizes structure
  ➳ robust to local change
  ➳ effective approximation of tree edit distance
☞ Ongoing and future work:
  ➳ strict bounds for pq-gram approximations
  ➳ clustering of XML data
  ➳ incremental updates of pq-gram profiles

SLIDE 22

References

[Chawathe and Garcia-Molina, 1997] Chawathe, S. S. and Garcia-Molina, H. (1997). Meaningful change detection in structured data. In Proceedings of the ACM SIGMOD International Conference on Management of Data, pages 26–37, Tucson, Arizona, United States. ACM Press.

[Chawathe et al., 1996] Chawathe, S. S., Rajaraman, A., Garcia-Molina, H., and Widom, J. (1996). Change detection in hierarchically structured information. In Proceedings of the ACM SIGMOD International Conference on Management of Data, pages 493–504, Montreal, Quebec, Canada. ACM Press.

[Garofalakis and Kumar, 2003] Garofalakis, M. and Kumar, A. (2003). Correlating XML data streams using tree-edit distance embeddings. In Proceedings of the Twenty-Second ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (PODS 2003), pages 143–154, San Diego, California. ACM Press.

[Garofalakis and Kumar, 2005] Garofalakis, M. and Kumar, A. (2005). XML stream processing using tree-edit distance embeddings. ACM Transactions on Database Systems, 30(1):279–332.

[Jiang et al., 1995] Jiang, T., Wang, L., and Zhang, K. (1995). Alignment of trees - an alternative to tree edit. Theoretical Computer Science, 143(1):137–148.

[Klein, 1998] Klein, P. N. (1998). Computing the edit-distance between unrooted ordered trees. In Proceedings of the 6th European Symposium on Algorithms, volume 1461 of Lecture Notes in Computer Science, pages 91–102, Venice, Italy. Springer.

SLIDE 23

[Lee et al., 2004] Lee, K.-H., Choy, Y.-C., and Cho, S.-B. (2004). An efficient algorithm to compute differences between structured documents. IEEE Transactions on Knowledge and Data Engineering, 16(8):965–979.

[Navarro, 2001] Navarro, G. (2001). A guided tour to approximate string matching. ACM Computing Surveys, 33(1):31–88.

[Selkow, 1977] Selkow, S. M. (1977). The tree-to-tree editing problem. Information Processing Letters, 6(6):184–186.

[Tanaka and Tanaka, 1988] Tanaka, E. and Tanaka, K. (1988). The tree-to-tree editing problem. Int. Journal of Pattern Recognition and Artificial Intelligence (IJPRAI), 2(2):221–240.

[Ukkonen, 1992] Ukkonen, E. (1992). Approximate string-matching with q-grams and maximal matches. Theoretical Computer Science, 92(1):191–211.

[Valiente, 2001] Valiente, G. (2001). An efficient bottom-up distance between trees. In Proceedings of the 8th Symposium on String Processing and Information Retrieval, pages 212–219, Laguna de San Rafael, Chile. IEEE Computer Science Press.

[Yang, 1991] Yang, W. (1991). Identifying syntactic differences between two programs. Software: Practice & Experience, 21(7):739–755.

[Zhang and Shasha, 1989] Zhang, K. and Shasha, D. (1989). Simple fast algorithms for the editing distance between trees and related problems. SIAM Journal on Computing, 18(6):1245–1262.
