
Approximate Joins for Data-Centric XML

Nikolaus Augsten (1), Michael Böhlen (1), Curtis Dyreson (2), Johann Gamper (1)
(1) Free University of Bozen-Bolzano, Bolzano, Italy, {augsten,boehlen,gamper}@inf.unibz.it
(2) Utah State University, Logan, UT, U.S.A.


  4. The Windowed pq-Gram Distance
     The windowed pq-gram distance between two trees $T$ and $T'$, where $I(T)$ denotes the bag of windowed pq-grams of $T$:
     $\mathrm{dist}^{pq}(T, T') = |I(T) \uplus I(T')| - 2\,|I(T) \cap I(T')|$
     Pseudo-metric properties hold:
     ✓ self-identity: $x = y \Rightarrow \mathrm{dist}^{pq}(x, y) = 0$ (the converse does not hold)
     ✓ symmetry: $\mathrm{dist}^{pq}(x, y) = \mathrm{dist}^{pq}(y, x)$
     ✓ triangle inequality: $\mathrm{dist}^{pq}(x, z) \le \mathrm{dist}^{pq}(x, y) + \mathrm{dist}^{pq}(y, z)$
     Different trees may be at distance zero. [Figure: two different trees whose nodes are all labeled b.]
     Runtime for the distance computation is $O(n \log n)$.
     Nikolaus Augsten (Bolzano, Italy), Approximate Joins for Data-Centric XML, ICDE 2008, Cancún, Mexico, slide 10 / 33
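The distance formula above can be sketched in Python as follows. This is a minimal illustration, assuming that each bag $I(T)$ is already available as a list of hashable values; the function name `pq_gram_distance` is hypothetical and does not come from the slides. It uses hash-based counting and does not attempt to reproduce the $O(n \log n)$ procedure mentioned on the slide.

```python
from collections import Counter


def pq_gram_distance(index_a, index_b):
    """Bag distance |A ⊎ B| - 2|A ∩ B| between two pq-gram bags (hypothetical helper)."""
    a, b = Counter(index_a), Counter(index_b)
    # Bag union adds multiplicities, so its size is the sum of both bag sizes.
    union_size = sum(a.values()) + sum(b.values())
    # Bag intersection keeps the minimum multiplicity of each element.
    intersection_size = sum((a & b).values())
    return union_size - 2 * intersection_size


# Identical bags are at distance zero; the measure is symmetric.
print(pq_gram_distance([1, 7], [1, 7]))   # 0
print(pq_gram_distance([1, 0], [0, 8]))   # 2
print(pq_gram_distance([0, 8], [1, 0]))   # 2
```

Because the distance is defined on bags rather than on the trees themselves, two different trees that happen to produce the same bag are at distance zero, which is why only the pseudo-metric properties hold.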

  5. Outline
     1 Motivation
     2 Windowed pq-Grams for Data-Centric XML: Windowed pq-Grams, Tree Sorting (next), Forming Bases
     3 Efficient Approximate Joins with Windowed pq-Grams
     4 Experiments
     5 Related Work
     6 Conclusion and Future Work

  7. Sorting the Tree?
     Idea:
     1. sort the children of each node by their label
     2. apply an ordered tree distance
     [Figure: tree T_1 and its sorted version T_1^srt.]
     ✘ Edit distance: tree sorting does not work
     ✓ Windowed pq-grams: tree sorting works
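The sorting step of this idea can be sketched as below. The tree encoding (a `(label, children)` tuple) and the function name `sort_tree` are assumptions made for this illustration only. The sketch sorts by label alone, so siblings with equal labels keep their input order, and two drawings of the same unordered tree can end up as different ordered trees.

```python
def sort_tree(node):
    """Recursively sort the children of every node by label (hypothetical helper)."""
    label, children = node
    sorted_children = [sort_tree(child) for child in children]
    # list.sort is stable: siblings with the same label keep their relative order.
    sorted_children.sort(key=lambda child: child[0])
    return (label, sorted_children)


tree = ("a", [("c", []), ("b", [("f", []), ("e", [])])])
print(sort_tree(tree))
# ('a', [('b', [('e', []), ('f', [])]), ('c', [])])

# Two orderings of the same unordered tree with duplicate sibling labels:
t_left = ("a", [("b", [("e", [])]), ("b", [("f", [])])])
t_right = ("a", [("b", [("f", [])]), ("b", [("e", [])])])
print(sort_tree(t_left) == sort_tree(t_right))   # False: the sorting is not unique
```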

  10. ✘ Edit Distance: Tree Sorting Does Not Work
      1. Non-unique sorting: the edit distance can be O(n) for identical trees.
      [Figure: two drawings of the same unordered tree (unordered edit distance = 0) in which two siblings share the label b. Sorting each drawing by label yields two different ordered trees with ordered edit distance O(n).]

  14. ✘ Edit Distance: Tree Sorting Does Not Work
      2. Node renaming: the edit distance depends on the node label.
      [Figure: three versions of tree T_2 that differ by 1 rename each (the renamed node is labeled a, b, or x). After sorting, the ordered edit distance between the versions is 1 in one case and O(n) in the other.]

  16. ✓ Windowed pq-Grams: Tree Sorting Works
      Theorem (Local Effect of Node Reordering): If k children of a node are reordered, i.e., their subtrees are moved, only O(k) windowed pq-grams change.
      Proof (idea): pq-grams consist of a stem and a base.
        stems are invariant to the sibling order
        bases: only the O(k) pq-grams with the reordered nodes in their bases change
      ✓ Non-unique sortings are equivalent: the distance is 0 for identical trees
      ✓ Node renaming is independent of the node label

  17. Outline (next: 2 Windowed pq-Grams for Data-Centric XML, Forming Bases)

  20. How To Form Bases?
      Goal for windowed pq-grams:
        not sensitive to the sibling order
        sensitive to any other change in the tree
      [Figure: a pq-gram with a stem of size p = 2 and a base of size q = 3.]
      Stems: ignore sibling order. [Figure: a small tree and a stem extracted from it.]
      Bases: do not ignore sibling order!

  22. Requirements for Bases
      Requirements for bases:
        detection of node moves
        robustness to different sortings
        balanced node weight
      Our solution:
        windows: simulate all permutations within a window
        wrapping: wrap windows that extend beyond the right border
        dummies: extend small sibling sets with dummy nodes

  32. Solution: Windowed pq-Gram Bases
      Algorithm: Form bases from a sorted sibling sequence
        if the sibling sequence is shorter than the window then extend it with dummy nodes;
        initialize window: start with the leftmost node;
        repeat
          form bases in window: all q-permutations that contain the start node;
          shift window to the right by one node;
          if the window extends beyond the right border then wrap the window;
        until all window positions are processed
      Example: stem, sorted sibling sequence, window w = 3. [Figure: the pq-grams with stem (a, c) formed from the window over the children d, e and a dummy node *.]
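The base-forming loop can be sketched in Python as below. The function name `window_bases` and the list encoding are hypothetical. One interpretation is assumed and should be read as such: "all q-permutations that contain the start node" is taken to mean the start node followed by every (q − 1)-combination of the remaining window nodes, in window order. Under that assumption the sketch produces the lists cd, c*, d*, dc, *c, *d and c*, c*, **, *c, *c, ** that appear in the node-move illustration of this deck.

```python
from itertools import combinations

DUMMY = "*"


def window_bases(siblings, w, q):
    """Form bases of size q from a sorted sibling sequence with window size w."""
    seq = list(siblings)
    # Extend sibling sequences that are shorter than the window with dummy nodes.
    if len(seq) < w:
        seq += [DUMMY] * (w - len(seq))
    n = len(seq)
    bases = []
    # One window position per node; the modulo wraps the window at the right border.
    for start in range(n):
        window = [seq[(start + i) % n] for i in range(w)]
        # Assumed reading: start node first, then each (q-1)-combination of the rest.
        for rest in combinations(window[1:], q - 1):
            bases.append((window[0],) + rest)
    return bases


print(window_bases(["c", "d"], w=3, q=2))
# [('c', 'd'), ('c', '*'), ('d', '*'), ('d', 'c'), ('*', 'c'), ('*', 'd')]
print(window_bases(["c"], w=3, q=2))
# [('c', '*'), ('c', '*'), ('*', '*'), ('*', 'c'), ('*', 'c'), ('*', '*')]
```

A full windowed pq-gram would prepend the stem to each base; the sketch covers only the base part.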

  36. Optimal Windowed pq-Grams
      Theorem (Optimal Windowed pq-Grams): For trees with fanout $f$, windowed pq-grams with base size $q = 2$ and window size $w = \frac{f+1}{2}$ have the following properties:
      1. Detection of node moves: base recall $\rho = 1$ (all sibling pairs are encoded); base precision $\pi = 1$ (each pair is encoded only once)
      2. Robustness to different sortings: base error $\epsilon \le \frac{2k}{f}$ (for $k$ edit operations)
      3. Balanced node weight: each non-root node appears in exactly $2w - 2$ bases.
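Properties 1 and 3 can be checked on a small example. The sketch below assumes a sibling set of size f = 5 (so w = 3) and assumes that, for q = 2, each base pairs the window's start node with one of the w − 1 nodes that follow it, wrapping at the right border. The helper name `pair_bases` is hypothetical.

```python
from collections import Counter
from itertools import combinations


def pair_bases(siblings, w):
    """Bases of size q = 2: start node paired with each of the next w-1 siblings (wrapping)."""
    n = len(siblings)
    return [(siblings[s], siblings[(s + i) % n]) for s in range(n) for i in range(1, w)]


siblings = ["a", "b", "c", "d", "e"]   # fanout f = 5
f = len(siblings)
w = (f + 1) // 2                        # window size w = 3
bases = pair_bases(siblings, w)

# Recall and precision: every unordered sibling pair is encoded, and only once.
encoded = Counter(frozenset(base) for base in bases)
all_pairs = {frozenset(pair) for pair in combinations(siblings, 2)}
recall_ok = set(encoded) == all_pairs
precision_ok = all(count == 1 for count in encoded.values())

# Node weight: number of bases in which each sibling appears.
weight = Counter(label for base in bases for label in base)

print(recall_ok, precision_ok)          # True True
print(sorted(weight.values()))          # [4, 4, 4, 4, 4], i.e. 2w - 2 each
```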

  42. Illustration: Detection of Node Moves
      Single node: each node forms a base of size q = 1
      Window: q ≥ 2 nodes of a window form a base
      [Figure: a tree with root a and two children labeled b. Before the move, the first b has the children c and d and the second b has the child e; 1 node move places d under the second b.]
      Goal: bases must change
      ✘ Single node: bases c, d, e before and c, d, e after the move: no bases change
      ✓ Window: bases cd, c*, d*, dc, *c, *d, e*, ... before and c*, c*, **, *c, *c, **, de, ... after the move: 33% of the bases change
      Windowed pq-grams detect node moves.

  48. Illustration: Robustness to Different Sortings
      Consecutive: consecutive siblings form a base (no permutation)
      Window: all sibling permutations within the window form bases
      [Figure: a node x with the children a, c, d; 1 rename changes c to b. Sorting A orders the children as a, c, d (a, b, d after the rename); Sorting B orders them as a, d, c (a, d, b after the rename).]
      Goal: same number of bases change for both sortings.
      ✘ Consecutive:
          Sort A: ac, cd become ab, bd: 100% of the bases change
          Sort B: ad, dc become ad, db: 50% of the bases change
      ✓ Window:
          Sort A: ad, ac, dc, ... become ad, ab, db, ...: 33% of the bases change
          Sort B: ad, ac, dc, ... become ad, ab, db, ...: 33% of the bases change
      Windowed pq-grams: robust to different sortings.

  54. Illustration: Balancing the Node Weight
      Permutations: all permutations of size q form a base
      Window: only permutations within the window form a base
      [Figure: a tree with root a whose child c has the six children d, e, f, g, h, i and whose child b has the three children m, n, o. Two variants are derived by 1 rename each: one renames a node in the larger sibling set to x, the other renames a node in the smaller sibling set to x.]
      Goal: same number of bases change for both renames.
      ✘ Permutations: 60/137 bases change for one rename, 6/137 bases change for the other
      ✓ Window: 12/51 bases change for one rename, 12/51 bases change for the other
      Windowed pq-grams: node weight is independent of the number of siblings.

  55. Outline (next: 3 Efficient Approximate Joins with Windowed pq-Grams)

  59. Approximate Join
      [Figure: two forests F and F' shown as tables with the columns tid and tree, holding the trees T_1, T_2, T_3 and T'_1, T'_2, T'_3. Lines between the trees are labeled with the pairwise distances; threshold = 2.]
      Simple approach: distance join
      1. compute the distance between all pairs of trees
      2. return the document pairs within the threshold
      Very expensive: $N^2$ distance computations!
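The simple approach corresponds to a nested loop over both forests. The sketch below is an illustration under stated assumptions: each tree is represented only by a small integer bag (hypothetical example data), "within threshold" is read as distance ≤ threshold, and the names `bag_distance` and `nested_loop_join` are not from the slides.

```python
from collections import Counter


def bag_distance(a, b):
    """|A ⊎ B| - 2|A ∩ B| for two bags given as lists."""
    ca, cb = Counter(a), Counter(b)
    return sum(ca.values()) + sum(cb.values()) - 2 * sum((ca & cb).values())


def nested_loop_join(left, right, threshold):
    """Compare every pair of trees; return the matching id pairs and the number of comparisons."""
    pairs, comparisons = [], 0
    for left_id, left_bag in left.items():
        for right_id, right_bag in right.items():
            comparisons += 1
            # Assumption: "within threshold" means distance <= threshold.
            if bag_distance(left_bag, right_bag) <= threshold:
                pairs.append((left_id, right_id))
    return pairs, comparisons


left = {"a": [1, 7], "b": [1, 0], "c": [4, 6]}
right = {"d": [1, 7], "e": [5, 5], "f": [0, 8]}
pairs, comparisons = nested_loop_join(left, right, threshold=2)
print(pairs)         # [('a', 'd'), ('b', 'd'), ('b', 'f')]
print(comparisons)   # 9, one distance computation per pair of trees
```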

  62. Usual Join Optimization Does Not Apply
      Distance join: expensive nested loop join: evaluate the distance function between every input pair
      Equality join: efficient implementation as sort-merge or hash join
      Sort-merge and hash join:
        first step: treat each join attribute in isolation (sort/hash)
        second step: evaluate the equality function
      Sort-merge and hash are not applicable to the distance join:
        there is no sorting that groups similar trees
        there is no hash function that partitions similar trees into buckets
      Solution: reduce the distance join to an equality join on pq-grams

  70. Reducing a Distance Join to an Equality Join
      Distance join between trees: $N^2$ intersections between integer bags
      Bags of the first forest: a = {1, 7}, b = {1, 0}, c = {4, 6}
      Bags of the second forest: d = {1, 7}, e = {5, 5}, f = {0, 8}
      $|a \cap d| = 2$, $|a \cap e| = 0$, $|a \cap f| = 0$
      $|b \cap d| = 1$, $|b \cap e| = 0$, $|b \cap f| = 1$
      $|c \cap d| = 0$, $|c \cap e| = 0$, $|c \cap f| = 0$
      Optimized pq-gram join: empty intersections are never computed!
      1. union: {1a, 7a, 1b, 0b, 4c, 6c} and {1d, 7d, 5e, 5e, 0f, 8f}
      2. sort: 0b, 1a, 1b, 4c, 6c, 7a and 0f, 1d, 5e, 5e, 7d, 8f
      3. merge-join: the two sorted lists are merged; for example, the match of 0b with 0f contributes to $|b \cap f|$

  71. Efficient Approximate Joins with Windowed pq -Gram Reducing a Distance Join to an Equality Join Distance join between trees: N 2 intersections between integer bags { 1 , 7 } a { 1 , 7 } d | a ∩ d | = 2 | a ∩ e | = 0 | a ∩ f | = 0 : { 1 , 0 } b { 5 , 5 } e | b ∩ d | = 1 | b ∩ e | = 0 | b ∩ f | = 1 : | c ∩ d | = 0 | c ∩ e | = 0 | c ∩ f | = 0 { 4 , 6 } c { 0 , 8 } f Optimized pq -gram join : empty intersections are never computed! 1. union { 1 a , 7 a , 1 b , 0 b , 4 c , 6 c } { 1 d , 7 d , 5 e , 5 e , 0 f , 8 f } 2. sort 3. merge-join 0 b 0 f 1 a 1 d | b ∩ f | 1 b 5 e | a ∩ d | 4 c 5 e 6 c 7 d 7 a 8 f Nikolaus Augsten (Bolzano, Italy) Approximate Joins for Data-Centric XML ICDE 2008 – Canc´ un, Mexico 27 / 33

  72. Efficient Approximate Joins with Windowed pq -Gram Reducing a Distance Join to an Equality Join Distance join between trees: N 2 intersections between integer bags { 1 , 7 } a { 1 , 7 } d | a ∩ d | = 2 | a ∩ e | = 0 | a ∩ f | = 0 : { 1 , 0 } b { 5 , 5 } e | b ∩ d | = 1 | b ∩ e | = 0 | b ∩ f | = 1 : | c ∩ d | = 0 | c ∩ e | = 0 | c ∩ f | = 0 { 4 , 6 } c { 0 , 8 } f : Optimized pq -gram join : empty intersections are never computed! 1. union { 1 a , 7 a , 1 b , 0 b , 4 c , 6 c } { 1 d , 7 d , 5 e , 5 e , 0 f , 8 f } 2. sort 3. merge-join 0 b 0 f 1 a 1 d | b ∩ f | 1 b 5 e | a ∩ d | 4 c 5 e | b ∩ d | 6 c 7 d 7 a 8 f Nikolaus Augsten (Bolzano, Italy) Approximate Joins for Data-Centric XML ICDE 2008 – Canc´ un, Mexico 27 / 33

  73. Efficient Approximate Joins with Windowed pq-Gram: Reducing a Distance Join to an Equality Join
      Distance join between trees: N² intersections between integer bags.
      Left bags: a = {1, 7}, b = {1, 0}, c = {4, 6}. Right bags: d = {1, 7}, e = {5, 5}, f = {0, 8}.
      |a ∩ d| = 2   |a ∩ e| = 0   |a ∩ f| = 0
      |b ∩ d| = 1   |b ∩ e| = 0   |b ∩ f| = 1
      |c ∩ d| = 0   |c ∩ e| = 0   |c ∩ f| = 0
      Optimized pq-gram join: empty intersections are never computed!
      1. union: {1a, 7a, 1b, 0b, 4c, 6c} and {1d, 7d, 5e, 5e, 0f, 8f}
      2. sort: 0b, 1a, 1b, 4c, 6c, 7a and 0f, 1d, 5e, 5e, 7d, 8f
      3. merge-join: equal values contribute to |b ∩ f| (0b, 0f), |a ∩ d| (1a, 1d and 7a, 7d), and |b ∩ d| (1b, 1d)
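
The union, sort, and merge-join steps above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function name `optimized_overlaps` and the dictionary-of-lists representation of the integer bags are assumptions made for this example. It reports only the non-empty bag intersections, using the bags a to f from the slide.

```python
from collections import Counter


def optimized_overlaps(left, right):
    """Return {(left_id, right_id): bag-intersection size} for non-empty intersections.

    `left` and `right` map a tree id to its bag (list) of integers.
    Both the name and the input layout are hypothetical, chosen for this sketch.
    """
    # 1. union: tag every bag element with the id of the tree it belongs to
    # 2. sort: order each tagged list by element value
    l = sorted((v, tid) for tid, bag in left.items() for v in bag)
    r = sorted((v, tid) for tid, bag in right.items() for v in bag)

    # 3. merge-join: advance two cursors and only do work on equal values
    overlaps = Counter()
    i = j = 0
    while i < len(l) and j < len(r):
        if l[i][0] < r[j][0]:
            i += 1
        elif l[i][0] > r[j][0]:
            j += 1
        else:
            value = l[i][0]
            # find the end of the run of this value on each side
            i_end = i
            while i_end < len(l) and l[i_end][0] == value:
                i_end += 1
            j_end = j
            while j_end < len(r) and r[j_end][0] == value:
                j_end += 1
            # multiplicity of the value per tree on each side
            left_counts = Counter(tid for _, tid in l[i:i_end])
            right_counts = Counter(tid for _, tid in r[j:j_end])
            # a bag intersection keeps the minimum multiplicity of each value
            for a, ca in left_counts.items():
                for b, cb in right_counts.items():
                    overlaps[(a, b)] += min(ca, cb)
            i, j = i_end, j_end
    return dict(overlaps)


left = {"a": [1, 7], "b": [1, 0], "c": [4, 6]}
right = {"d": [1, 7], "e": [5, 5], "f": [0, 8]}
print(sorted(optimized_overlaps(left, right).items()))
# [(('a', 'd'), 2), (('b', 'd'), 1), (('b', 'f'), 1)]
```

With the distance formula from earlier in the talk, dist(T, T′) = |I(T) ⊎ I(T′)| − 2 |I(T) ∩ I(T′)|, each reported overlap yields the distance directly; for a and d this is 2 + 2 − 2 · 2 = 0. Pairs that are not reported have an empty intersection, so their distance is simply the sum of the two bag sizes.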

  74. Experiments: Outline
      1 Motivation
      2 Windowed pq-Grams for Data-Centric XML (Windowed pq-Grams, Tree Sorting, Forming Bases)
      3 Efficient Approximate Joins with Windowed pq-Gram
      4 Experiments
      5 Related Work
      6 Conclusion and Future Work

  77. Experiments: Effectiveness of the Windowed pq-Gram Join
      Experiment: match DBLP articles
      - add noise to articles (missing elements and spelling mistakes)
      - approximate join between original and noisy data
      - measure precision and recall for different thresholds
      Datasets:
      - DBLP: articles; depth 1.9, 15 nodes (max 1494 nodes)
      - SwissProt: protein descriptions; depth 3.5, 104 nodes (max 2640 nodes)
      - Treebank: tagged English sentences; depth 6.9 (max depth 30), 43 nodes
      [Figures: recall [%] and precision [%] versus percentage of changed nodes (0 to 0.6), for thresholds 0.3, 0.5, and 0.7]
      Windowed pq-grams are effective for data-centric XML
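
The precision and recall measured in this experiment can be computed from the set of pairs returned by the approximate join and the set of truly matching pairs. The sketch below uses the standard definitions; the function name `precision_recall`, the pair representation, and the convention of returning 1.0 for an empty set are assumptions made for this example, not details taken from the talk.

```python
def precision_recall(returned, correct):
    """Compute (precision, recall) for a join result.

    `returned`: pairs (original_id, noisy_id) reported by the approximate join.
    `correct`:  pairs that truly belong together (the ground truth).
    """
    returned, correct = set(returned), set(correct)
    true_positives = len(returned & correct)
    # Assumption: an empty result (or empty ground truth) counts as perfect.
    precision = true_positives / len(returned) if returned else 1.0
    recall = true_positives / len(correct) if correct else 1.0
    return precision, recall


# Hypothetical toy data: four articles, the join finds two of them plus one wrong pair.
correct = {(1, 1), (2, 2), (3, 3), (4, 4)}
returned = {(1, 1), (2, 2), (3, 4)}
precision, recall = precision_recall(returned, correct)
print(precision, recall)  # precision is 2/3, recall is 0.5
```

In the experiment described above, the ground truth is known because each noisy article is derived from one original article.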

  79. Experiments: Efficiency of the Optimized pq-Gram Join
      Optimized pq-gram join: very efficient
      - compute nested-loop join between trees
      - compute optimized pq-gram join between trees
      - measure wallclock time
      [Figure: time [sec] (0 to 2000) versus number of nodes (0 to 2e+06) and number of trees (250 to 1000), for the nested-loop join and the optimized join]
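
For comparison, the nested-loop baseline can be sketched as below. It is an illustration under assumptions, not the code used in the experiment: the function name `nested_loop_overlaps` and the dictionary-of-lists input are hypothetical. The sketch also counts how many bag intersections are evaluated, which is one per pair of trees regardless of whether the intersection is empty.

```python
from collections import Counter


def nested_loop_overlaps(left, right):
    """Intersect every left bag with every right bag.

    Returns (overlaps, computed): the non-empty intersection sizes and
    the number of bag intersections that had to be evaluated.
    """
    overlaps = {}
    computed = 0
    for a, bag_a in left.items():
        counts_a = Counter(bag_a)
        for b, bag_b in right.items():
            computed += 1
            # Counter & Counter keeps the minimum multiplicity (bag intersection)
            size = sum((counts_a & Counter(bag_b)).values())
            if size:
                overlaps[(a, b)] = size
    return overlaps, computed


# The six bags from the earlier join example in the talk
left = {"a": [1, 7], "b": [1, 0], "c": [4, 6]}
right = {"d": [1, 7], "e": [5, 5], "f": [0, 8]}
overlaps, computed = nested_loop_overlaps(left, right)
print(overlaps, computed)  # 3 non-empty overlaps, 9 intersections evaluated
```

On this example, nine intersections are evaluated to find three non-empty ones; the sort and merge-join approach avoids evaluating the six empty pairs.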

  80. Related Work: Outline
      1 Motivation
      2 Windowed pq-Grams for Data-Centric XML (Windowed pq-Grams, Tree Sorting, Forming Bases)
      3 Efficient Approximate Joins with Windowed pq-Gram
      4 Experiments
      5 Related Work
      6 Conclusion and Future Work
