The Windowed pq-Gram Distance

The windowed pq-gram distance between two trees, T and T′:

$$\operatorname{dist}^{pq}(T, T') = |I(T) \uplus I(T')| - 2\,|I(T) \cap I(T')|$$

[Figure: I(T) and I(T′).]

Pseudo-metric properties hold:
✓ self-identity: $x = y \Rightarrow \operatorname{dist}^{pq}(x, y) = 0$ (the converse does not hold)
✓ symmetry: $\operatorname{dist}^{pq}(x, y) = \operatorname{dist}^{pq}(y, x)$
✓ triangle inequality: $\operatorname{dist}^{pq}(x, z) \le \operatorname{dist}^{pq}(x, y) + \operatorname{dist}^{pq}(y, z)$

Different trees may be at distance zero.
[Figure: example of different trees whose nodes are all labeled b.]

Runtime for the distance computation is $O(n \log n)$.

Nikolaus Augsten (Bolzano, Italy), Approximate Joins for Data-Centric XML, ICDE 2008, Cancún, Mexico
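The distance formula can be tried out on small bags. The following is a minimal sketch, assuming each index I(T) is available as a list of hashable pq-gram identifiers; the function name `pq_gram_distance` and the integer sample bags are illustrative and not taken from the slides.

```python
from collections import Counter

def pq_gram_distance(index_a, index_b):
    """dist = |I(T) (+) I(T')| - 2 |I(T) n I(T')| for two bags of pq-grams."""
    bag_a, bag_b = Counter(index_a), Counter(index_b)
    # Bag union adds multiplicities; bag intersection takes the minimum.
    union_size = sum((bag_a + bag_b).values())
    intersection_size = sum((bag_a & bag_b).values())
    return union_size - 2 * intersection_size

d_ab = pq_gram_distance([1, 7], [1, 0])  # bags share one element
d_aa = pq_gram_distance([1, 7], [7, 1])  # identical bags
```

Two different trees that happen to produce the same bag would also get distance 0 here, which is why only the pseudo-metric properties hold.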
Outline (Windowed pq-Grams for Data-Centric XML: Tree Sorting)

1 Motivation
2 Windowed pq-Grams for Data-Centric XML
    Windowed pq-Grams
    Tree Sorting
    Forming Bases
3 Efficient Approximate Joins with Windowed pq-Gram
4 Experiments
5 Related Work
6 Conclusion and Future Work
Sorting the Tree?

Idea:
1. sort the children of each node by their label
2. apply an ordered tree distance

[Figure: a tree T1 and its sorted version, obtained by sorting the children of each node.]

✘ Edit distance: tree sorting does not work
✓ Windowed pq-grams: tree sorting works OK
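Step 1 of the idea can be sketched as a recursive sort. This is only an illustration, assuming a tree is represented as a `(label, children)` tuple; the representation, the function name `sort_tree`, and the sample tree are hypothetical.

```python
def sort_tree(tree):
    """Return a copy of (label, children) with every sibling list sorted by label."""
    label, children = tree
    # Sort the already-sorted subtrees by their root label only.
    sorted_children = sorted((sort_tree(c) for c in children), key=lambda c: c[0])
    return (label, sorted_children)

t1 = ("a", [("c", []), ("b", [("f", []), ("e", [])])])
t1_sorted = sort_tree(t1)
```

Because the sort key is only the label, siblings with equal labels keep their input order, so two trees that are equal as unordered trees may still end up with different sorted forms.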
✘ Edit Distance: Tree Sorting Does Not Work

1. Non-unique sorting: edit distance O(n) for identical trees

[Figure: two trees that are identical as unordered trees (unordered edit dist = 0). After sorting both, the ordered edit dist between the sorted trees is O(n).]
✘ Edit Distance: Tree Sorting Does Not Work

2. Node renaming: edit distance depends on node label

[Figure: one node of T2 is renamed in two different ways (1 rename each). After sorting, the ordered edit distance to the sorted original is dist = 1 for one rename and dist = O(n) for the other.]
✓ Windowed pq-Grams: Tree Sorting Works OK

Theorem (Local Effect of Node Reordering)
If k children of a node are reordered, i.e., their subtrees are moved, only O(k) windowed pq-grams change.

Proof (idea):
- pq-grams consist of a stem and a base
- stems are invariant to the sibling order
- bases: only the O(k) pq-grams with the reordered nodes in the bases change

[Figure: a pq-gram with its stem and base.]

✓ Non-unique sortings are equivalent: distance is 0 for identical trees
✓ Node renaming is independent of the node label
2 Windowed pq-Grams for Data-Centric XML: Forming Bases
How To Form Bases?

Goal for windowed pq-grams:
- not sensitive to the sibling order
- sensitive to any other change in the tree

[Figure: a pq-gram with stem (p = 2) and base (q = 3).]

Stems: ignore sibling order.
[Figure: an example tree and its stems.]

Bases: do not ignore sibling order!
Requirements for Bases

Requirements for bases:
- detection of node moves
- robustness to different sortings
- balanced node weight

Our solution:
- windows: simulate all permutations within a window
- wrapping: wrap windows that extend beyond the right border
- dummies: extend small sibling sets with dummy nodes
Solution: Windowed pq-Gram Bases

Algorithm: Form bases from a sorted sibling sequence
  if the sibling sequence is shorter than the window then extend with dummy nodes;
  initialize window: start with leftmost node;
  repeat
    form bases in window: all q-permutations that contain the start node;
    shift window to the right by one node;
    if window extends beyond the right border then wrap window;
  until all window positions are processed

Example: stem, sorted sibling sequence, window w = 3.
[Figure: a stem with its sorted sibling sequence and the pq-grams formed as the window is shifted and wrapped.]
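The algorithm can be sketched in Python as follows. This is a hedged sketch, not the authors' implementation: the names `window_bases` and `DUMMY` are hypothetical, leaf nodes are not handled, and it assumes that a base consists of the window's start node followed by q − 1 other window nodes in window order. That reading reproduces the base lists shown on the node-move illustration slide for q = 2, but the general case is an assumption.

```python
from collections import Counter
from itertools import combinations

DUMMY = "*"  # dummy node label, written as * in the slides' base lists

def window_bases(siblings, w=3, q=2):
    """Form the bases for one sibling set with a wrapping window of size w."""
    if not siblings:
        raise ValueError("leaf nodes are not handled in this sketch")
    seq = sorted(siblings)
    # Extend small sibling sets with dummy nodes.
    seq += [DUMMY] * max(0, w - len(seq))
    n = len(seq)
    bases = []
    # One window per start node, shifted to the right by one node each time.
    for start in range(n):
        # Modular indexing wraps windows that extend beyond the right border.
        window = [seq[(start + i) % n] for i in range(w)]
        # Assumption: start node first, then q-1 other window nodes in order.
        for rest in combinations(window[1:], q - 1):
            bases.append((window[0],) + rest)
    return bases

def as_strings(bases):
    """Render bases compactly, e.g. ('c', '*') -> 'c*'."""
    return ["".join(b) for b in bases]

# Node move: d moves from the sibling set {c, d} to the sibling set {e}.
before = window_bases(["c", "d"]) + window_bases(["e"])
after = window_bases(["c"]) + window_bases(["d", "e"])
unchanged = sum((Counter(before) & Counter(after)).values())
changed_fraction = 1 - unchanged / len(before)
```

With w = 3 and q = 2, the sibling set {c, d} yields the bases cd, c*, d*, dc, *c, *d, and a third of the twelve bases change under the node move.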
Optimal Windowed pq-Grams

Theorem (Optimal Windowed pq-Grams)
For trees with fanout f, windowed pq-grams with base size $q = 2$ and window size $w = \frac{f+1}{2}$ have the following properties:

1. Detection of node moves:
   base recall ρ = 1 (all sibling pairs are encoded)
   base precision π = 1 (each pair is encoded only once)
2. Robustness to different sortings (k edit operations):
   base error ε ≤ 2 k f
3. Balanced node weight:
   Each non-root node appears in exactly 2w − 2 bases.
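A short counting check makes properties 1 and 3 plausible. It is a sketch under two assumptions that the slides do not state explicitly: f is odd, so that w is an integer, and for q = 2 each window contributes w − 1 bases (the start node paired with each other node in its window), with one window per sibling.

```latex
\begin{align*}
\text{bases per sibling set} &= f\,(w-1) = f\cdot\frac{f-1}{2} = \binom{f}{2},\\
\text{node weight} &= 2w-2 = f-1 .
\end{align*}
```

For f = 5 this gives w = 3, ten bases for the ten sibling pairs, and a node weight of 4: every node is paired with each of its four siblings exactly once.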
Illustration: Detection of Node Moves

Single Node: each node forms a base of size q = 1
Window: q ≥ 2 nodes of a window form a base

[Figure: 1 node move. Before the move, c and d are siblings and e is an only child; after the move, c is an only child and d and e are siblings.]

Goal: bases must change

✘ Single Node: no bases change
    before: c, d, e
    after:  c, d, e
✓ Window: 33% of the bases change
    before: cd, c*, d*, dc, *c, *d, e*, ...
    after:  c*, c*, **, *c, *c, **, de, ...

Windowed pq-grams detect node moves.
Illustration: Robustness to Different Sortings

Consecutive: consecutive siblings form a base (no permutation)
Window: all sibling permutations within the window form bases

[Figure: 1 rename. A node x has children a, c, d; c is renamed to b. Sorting A orders the children a c d (after the rename: a b d); Sorting B orders them a d c (after the rename: a d b).]

Goal: the same number of bases change for both sortings.

✘ Consecutive:
    Sort A: ac, cd become ab, bd (100% of the bases change)
    Sort B: ad, dc become ad, db (50% of the bases change)
✓ Window:
    Sort A: ad, ac, dc, ... become ad, ab, db, ... (33% of the bases change)
    Sort B: ad, ac, dc, ... become ad, ab, db, ... (33% of the bases change)

Windowed pq-grams: robust to different sortings.
Illustration: Balancing the Node Weight

Permutations: all permutations of size q form a base
Window: only permutations within the window form a base

[Figure: a tree with root a, whose child c has the children d e f g h i and whose child b has the children m n o. Two variants with 1 rename each: in one, a child of the larger sibling set is renamed to x; in the other, a child of the smaller sibling set is renamed to x.]

Goal: the same number of bases change for both renames.

✘ Permutations: 60/137 bases change vs. 6/137 bases change
✓ Window: 12/51 bases change vs. 12/51 bases change

Windowed pq-grams: node weight is independent of the number of siblings.
3 Efficient Approximate Joins with Windowed pq-Gram
Approximate Join

[Figure: two forests, F and F′, each listed with the columns tid and tree (T1, T2, T3 and T′1, T′2, T′3). The pairs of trees are annotated with their distances; threshold = 2.]

Simple approach: distance join
1. compute distance between all pairs of trees
2. return document pairs within threshold

Very expensive: N² distance computations!
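The simple approach amounts to a nested-loop join. A minimal sketch, assuming each tree is already represented by a bag of integers and that "within threshold" means distance ≤ threshold; the names `bag_distance` and `nested_loop_join` and the sample forests are illustrative only.

```python
from collections import Counter
from itertools import product

def bag_distance(a, b):
    """Bag distance: size of the bag union minus twice the bag intersection."""
    ca, cb = Counter(a), Counter(b)
    return sum((ca + cb).values()) - 2 * sum((ca & cb).values())

def nested_loop_join(forest_a, forest_b, threshold):
    """Evaluate the distance for every pair and keep the pairs within threshold."""
    return [
        (ta, tb)
        for (ta, ia), (tb, ib) in product(forest_a.items(), forest_b.items())
        if bag_distance(ia, ib) <= threshold
    ]

F1 = {"a": [1, 7], "b": [1, 0], "c": [4, 6]}
F2 = {"d": [1, 7], "e": [5, 5], "f": [0, 8]}
matches = nested_loop_join(F1, F2, threshold=2)
```

Every one of the 3 × 3 pairs is evaluated here, including the pairs whose bags share nothing.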
Usual Join Optimization Does Not Apply

Distance join: expensive
- nested loop join: evaluate distance function between every input pair

Equality join: efficient
- implementation as sort-merge or hash join

Sort-merge and hash join:
- first step: treat each join attribute in isolation (sort/hash)
- second step: evaluate equality function

Sort-merge and hash are not applicable to the distance join:
- there is no sorting that groups similar trees
- there is no hash function that partitions similar trees into buckets

Solution: reduce distance join to equality join on pq-grams
Reducing a Distance Join to an Equality Join

Distance join between trees: N² intersections between integer bags

              d = {1, 7}     e = {5, 5}     f = {0, 8}
a = {1, 7}    |a ∩ d| = 2    |a ∩ e| = 0    |a ∩ f| = 0
b = {1, 0}    |b ∩ d| = 1    |b ∩ e| = 0    |b ∩ f| = 1
c = {4, 6}    |c ∩ d| = 0    |c ∩ e| = 0    |c ∩ f| = 0

Optimized pq-gram join: empty intersections are never computed!

1. union:
   {1a, 7a, 1b, 0b, 4c, 6c}   {1d, 7d, 5e, 5e, 0f, 8f}
2. sort:
   0b, 1a, 1b, 4c, 6c, 7a     0f, 1d, 5e, 5e, 7d, 8f
3. merge-join: matching entries contribute to the non-empty intersections, e.g. 0b and 0f to |b ∩ f|, 1a and 1d to |a ∩ d|
Efficient Approximate Joins with Windowed pq -Gram Reducing a Distance Join to an Equality Join Distance join between trees: N 2 intersections between integer bags { 1 , 7 } a { 1 , 7 } d | a ∩ d | = 2 | a ∩ e | = 0 | a ∩ f | = 0 : { 1 , 0 } b { 5 , 5 } e | b ∩ d | = 1 | b ∩ e | = 0 | b ∩ f | = 1 : | c ∩ d | = 0 | c ∩ e | = 0 | c ∩ f | = 0 { 4 , 6 } c { 0 , 8 } f : Optimized pq -gram join : empty intersections are never computed! 1. union { 1 a , 7 a , 1 b , 0 b , 4 c , 6 c } { 1 d , 7 d , 5 e , 5 e , 0 f , 8 f } 2. sort 3. merge-join 0 b 0 f 1 a 1 d | b ∩ f | 1 b 5 e | a ∩ d | 4 c 5 e | b ∩ d | 6 c 7 d 7 a 8 f Nikolaus Augsten (Bolzano, Italy) Approximate Joins for Data-Centric XML ICDE 2008 – Canc´ un, Mexico 27 / 33
Efficient Approximate Joins with Windowed pq -Gram Reducing a Distance Join to an Equality Join Distance join between trees: N 2 intersections between integer bags { 1 , 7 } a { 1 , 7 } d | a ∩ d | = 2 | a ∩ e | = 0 | a ∩ f | = 0 : { 1 , 0 } b { 5 , 5 } e | b ∩ d | = 1 | b ∩ e | = 0 | b ∩ f | = 1 : : | c ∩ d | = 0 | c ∩ e | = 0 | c ∩ f | = 0 { 4 , 6 } c { 0 , 8 } f : Optimized pq -gram join : empty intersections are never computed! 1. union { 1 a , 7 a , 1 b , 0 b , 4 c , 6 c } { 1 d , 7 d , 5 e , 5 e , 0 f , 8 f } 2. sort 3. merge-join 0 b 0 f 1 a 1 d | b ∩ f | 1 b 5 e | a ∩ d | 4 c 5 e | b ∩ d | 6 c 7 d 7 a 8 f Nikolaus Augsten (Bolzano, Italy) Approximate Joins for Data-Centric XML ICDE 2008 – Canc´ un, Mexico 27 / 33
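The three steps (union, sort, merge-join) can be sketched in Python on the toy bags from this slide. This is a minimal illustration, not the authors' implementation: the function names `tagged_sorted` and `merge_join` are hypothetical, and collapsing duplicate values such as 5e, 5e into a count per tree is an assumption made here to keep bag semantics simple.

```python
from collections import Counter, defaultdict

def tagged_sorted(bags):
    """Steps 1 and 2: union all bags into (value, tree_id, count) rows, then sort by value."""
    rows = []
    for tree_id, bag in bags.items():
        for value, count in Counter(bag).items():
            rows.append((value, tree_id, count))
    rows.sort()
    return rows

def merge_join(left_bags, right_bags):
    """Step 3: merge-join on the value; returns sizes of the non-empty bag intersections."""
    left = tagged_sorted(left_bags)
    right = tagged_sorted(right_bags)
    sizes = defaultdict(int)
    i = j = 0
    while i < len(left) and j < len(right):
        lv, rv = left[i][0], right[j][0]
        if lv < rv:
            i += 1
        elif lv > rv:
            j += 1
        else:
            # Find the run of rows sharing this value on each side.
            i_end = i
            while i_end < len(left) and left[i_end][0] == lv:
                i_end += 1
            j_end = j
            while j_end < len(right) and right[j_end][0] == lv:
                j_end += 1
            # A shared value contributes min(count_left, count_right) to the bag intersection.
            for _, lt, lc in left[i:i_end]:
                for _, rt, rc in right[j:j_end]:
                    sizes[(lt, rt)] += min(lc, rc)
            i, j = i_end, j_end
    return dict(sizes)

left = {"a": [1, 7], "b": [1, 0], "c": [4, 6]}
right = {"d": [1, 7], "e": [5, 5], "f": [0, 8]}
result = merge_join(left, right)
print(result)
```

Pairs such as (c, d) or (a, e) never appear in `result`, because no shared value ever brings them together during the merge.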
Experiments

Outline
1 Motivation
2 Windowed pq-Grams for Data-Centric XML
    Windowed pq-Grams
    Tree Sorting
    Forming Bases
3 Efficient Approximate Joins with Windowed pq-Gram
4 Experiments
5 Related Work
6 Conclusion and Future Work
Experiments
Effectiveness of the Windowed pq-Gram Join

Experiment: match DBLP articles
    add noise to articles (missing elements and spelling mistakes)
    approximate join between original and noisy data
    measure precision and recall for different thresholds

[Figure: recall [%] and precision [%] versus percentage of changed nodes (0 to 0.6), one curve each for threshold=0.3, threshold=0.5 and threshold=0.7]

Datasets:
    DBLP: articles; depth 1.9, 15 nodes (max 1494 nodes)
    SwissProt: protein descriptions; depth 3.5, 104 nodes (max 2640 nodes)
    Treebank: tagged English sentences; depth 6.9 (max depth 30), 43 nodes

Windowed pq-grams are effective for data-centric XML
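The slides do not spell out how precision and recall are computed, so the following sketch assumes the standard definitions: precision is the fraction of reported pairs that are correct, and recall is the fraction of correct pairs that are reported. The function name, the pair identifiers, and the convention of returning 1.0 for an empty set are all hypothetical choices for illustration.

```python
def precision_recall(reported, correct):
    """Standard precision and recall over sets of (original, noisy) pairs."""
    reported, correct = set(reported), set(correct)
    hits = len(reported & correct)
    # Convention assumed here: an empty denominator yields 1.0.
    precision = hits / len(reported) if reported else 1.0
    recall = hits / len(correct) if correct else 1.0
    return precision, recall

# Hypothetical join output and ground truth, not data from the experiment.
reported = {("o1", "n1"), ("o2", "n2"), ("o3", "n4")}
correct = {("o1", "n1"), ("o2", "n2"), ("o3", "n3"), ("o4", "n4")}
p, r = precision_recall(reported, correct)
print(p, r)
```

In this made-up example two of the three reported pairs are correct and two of the four correct pairs are found.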
Experiments
Efficiency of the Optimized pq-Gram Join

Optimized pq-gram join: very efficient
    compute nested-loop join between trees
    compute optimized pq-gram join between trees
    measure wallclock time

[Figure: time [sec] (0 to 2000) versus number of nodes (up to 2e+06; upper axis: number of trees, 250 to 1000), for the nested-loop join and the optimized join]
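For contrast, the nested-loop baseline can be sketched as follows, reusing the toy bags from the equality-join slide. It evaluates the distance formula given earlier in the talk, dist(T, T′) = |I(T) ⊎ I(T′)| − 2 |I(T) ∩ I(T′)|, for every pair of trees. The function names and the absolute threshold `max_dist` are assumptions for this toy example; the experiment's own thresholds are not modeled here.

```python
from collections import Counter

def bag_intersection_size(x, y):
    """Size of the bag (multiset) intersection of two lists."""
    return sum((Counter(x) & Counter(y)).values())

def nested_loop_join(left_bags, right_bags, max_dist):
    """Compare every left tree with every right tree; keep pairs within max_dist."""
    result = []
    comparisons = 0
    for lt, lb in left_bags.items():
        for rt, rb in right_bags.items():
            comparisons += 1
            # |I(T) ⊎ I(T')| is the sum of the two bag sizes.
            dist = len(lb) + len(rb) - 2 * bag_intersection_size(lb, rb)
            if dist <= max_dist:
                result.append((lt, rt, dist))
    return result, comparisons

left = {"a": [1, 7], "b": [1, 0], "c": [4, 6]}
right = {"d": [1, 7], "e": [5, 5], "f": [0, 8]}
pairs, comparisons = nested_loop_join(left, right, max_dist=2)
print(pairs, comparisons)
```

All nine intersections are computed even though six of them are empty, which is the work the optimized join avoids.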
Related Work

Outline
1 Motivation
2 Windowed pq-Grams for Data-Centric XML
    Windowed pq-Grams
    Tree Sorting
    Forming Bases
3 Efficient Approximate Joins with Windowed pq-Gram
4 Experiments
5 Related Work
6 Conclusion and Future Work