
Approximate Joins for Data-Centric XML

Nikolaus Augsten, Michael Böhlen, Curtis Dyreson, Johann Gamper. Free University of Bozen-Bolzano, Bolzano, Italy; Utah State University, Logan, UT, U.S.A.


  1. Approximate Joins for Data-Centric XML
     Nikolaus Augsten¹, Michael Böhlen¹, Curtis Dyreson², Johann Gamper¹
     ¹ Free University of Bozen-Bolzano, Bolzano, Italy, {augsten,boehlen,gamper}@inf.unibz.it
     ² Utah State University, Logan, UT, U.S.A., curtis.dyreson@usu.edu
     April 10, 2008, ICDE, Cancún, Mexico
     Footer: Nikolaus Augsten (Bolzano, Italy), Approximate Joins for Data-Centric XML, ICDE 2008 – Cancún, Mexico, 1 / 33

  2. Outline
     1. Motivation
     2. Windowed pq-Grams for Data-Centric XML: Windowed pq-Grams, Tree Sorting, Forming Bases
     3. Efficient Approximate Joins with Windowed pq-Grams
     4. Experiments
     5. Related Work
     6. Conclusion and Future Work

  3. Motivation: Approximate Join on Music CDs
     [Figure: album trees from two sources, Song Lyric Store and CD Warehouse. Each album has three track subtrees with title and artist children (titles So Far, Breathe, Alabama; artists Mark, Roger, Neil), listed in different sibling orders. One album also has year (2000) and price (10) children, the other title (Harvest) and price (15).]
     Query: Give me all album pairs that represent the same music CDs.
     How similar are two XML items?

  4. Motivation: How Similar Are These XMLs?
     [Figure: two album trees whose track subtrees (So Far / Mark, Breathe / Roger) appear in different sibling order; other labels shown are year, price, 15, and 2000.]
     Standard solution, $O(n^3)$: tree edit distance, the minimum number of node edit operations (insert, delete, rename) that transforms one ordered tree into the other.
     Problem: permuted subtrees are deleted and re-inserted node by node.

  5. Motivation: Ordered vs. Unordered Trees
     Ordered trees: sibling order matters. [Figure: two trees with root a and nodes b, c, d, e in different sibling order; as ordered trees they are not equal (≠).]
     Unordered trees (data-centric XML): sibling order is ignored. [Figure: the same two trees are equal (=) once sibling order is ignored.]
     Edit distance between unordered trees: NP-complete → all sibling permutations must be considered!

  6. Motivation: Problem Definition
     Find an effective distance for the approximate matching of hierarchical data represented as unordered labeled trees that is efficient for approximate joins.
     Naive approaches that fail:
     - unordered tree edit distance: NP-complete
     - allow subtree move: NP-hard
     - compute minimum distance between all permutations: $O(n!)$
     - sort by label and use ordered tree edit distance: error $O(n)$

  7. Outline (repeated from slide 2), now at section 2: Windowed pq-Grams for Data-Centric XML

  8. Windowed pq-Grams: Our Solution: Windowed pq-Grams
     Windowed pq-gram: small subtree with a stem and a base. [Figure: a pq-gram with stem p = 2 and base q = 3.]
     Key idea: split the unordered tree into a set of windowed pq-grams that is
     - not sensitive to the sibling order
     - sensitive to any other change in the tree
     Intuition: similar unordered trees have similar windowed pq-grams.
     Systematic computation of windowed pq-grams:
     1. sort the children of each node by their label (works OK for pq-grams)
     2. simulate permutations with a window
     3. split the tree into windowed pq-grams

  9. Windowed pq-Grams: Implementation of Windowed pq-Grams
     Set of windowed pq-grams: [Figure: an example tree with nodes a, b, c, d, e is split into its windowed pq-grams; * marks dummy nodes.]
     Hashing: map each pq-gram to an integer. Serialize the labels, e.g. (*, a, b, c) (shorthand: *abc), and hash each label l with h(l):
       label l | h(l)
       *       | 0
       a       | 9
       b       | 7
       c       | 3
       ...     | ...
     so that (*, a, b, c) → 0973.
     Note: labels may be strings of arbitrary length!
     pq-gram index: bag of hashed pq-grams
       I(T) = {0973, 0970, 0930, 0937, 0907, 0903, 9700, 9316, 9310, 9360, 9361, 9301, 9306, 3100, 3600}
     Tree is represented by a bag of integers!
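
The hashing step on slide 9 can be sketched in a few lines of Python. This is an illustration only: the names `LABEL_HASH`, `hash_pq_gram`, and `pq_gram_index` are hypothetical, and the one-character codes are just the four values listed in the slide's table. Since labels may be arbitrary-length strings, a real system would presumably use a fixed-width fingerprint function instead of a lookup table.

```python
from collections import Counter

# Label codes copied from the slide's table; other labels are elided there.
LABEL_HASH = {"*": "0", "a": "9", "b": "7", "c": "3"}

def hash_pq_gram(labels):
    """Serialize a pq-gram (a tuple of labels) and concatenate the label codes."""
    return "".join(LABEL_HASH[label] for label in labels)

def pq_gram_index(pq_grams):
    """Build the index as a bag (multiset) of hashed pq-grams."""
    return Counter(hash_pq_gram(g) for g in pq_grams)

# The pq-grams of the slide's index that use only the four listed labels.
grams = [
    ("*", "a", "b", "c"), ("*", "a", "b", "*"), ("*", "a", "c", "*"),
    ("*", "a", "c", "b"), ("*", "a", "*", "b"), ("*", "a", "*", "c"),
    ("a", "b", "*", "*"),
]
index = pq_gram_index(grams)
print(sorted(index))
```

The codes are kept as strings so that the leading zero of values such as 0973 survives; an integer representation would work equally well as long as every label code has a fixed width.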

  10. Windowed pq-Grams: The Windowed pq-Gram Distance
      The windowed pq-gram distance between two trees, T and T′:
        $\mathrm{dist}^{pq}(T, T') = |I(T) \uplus I(T')| - 2\,|I(T) \cap I(T')|$
      Pseudo-metric properties hold:
      ✓ self-identity: $x = y \Rightarrow \mathrm{dist}^{pq}(x, y) = 0$ (the converse does not hold)
      ✓ symmetry: $\mathrm{dist}^{pq}(x, y) = \mathrm{dist}^{pq}(y, x)$
      ✓ triangle inequality: $\mathrm{dist}^{pq}(x, z) \le \mathrm{dist}^{pq}(x, y) + \mathrm{dist}^{pq}(y, z)$
      Different trees may be at distance zero. [Figure: two different trees whose nodes are all labeled b.]
      Runtime for the distance computation is $O(n \log n)$.
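
As a rough sketch of the distance formula on slide 10, the bags can be modelled with Python's `collections.Counter`, where `+` is bag union and `&` is bag intersection. The function name `pq_gram_distance` is hypothetical. This version relies on hashing rather than on the sorting that the stated $O(n \log n)$ bound suggests, so it shows the formula, not necessarily the authors' implementation.

```python
from collections import Counter

def pq_gram_distance(index_a, index_b):
    """|I(T) bag-union I(T')| - 2 * |I(T) bag-intersection I(T')|."""
    union_size = sum((index_a + index_b).values())         # counts are added
    intersection_size = sum((index_a & index_b).values())  # minimum of counts
    return union_size - 2 * intersection_size

# Two small, made-up indexes (bags of hashed pq-grams).
left = Counter(["0973", "0970", "9700"])
right = Counter(["0973", "9700", "9700"])
print(pq_gram_distance(left, right))
```

Because the index is a bag, duplicates matter: the second copy of 9700 in `right` has no partner in `left` and therefore contributes to the distance.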

  11. Outline (repeated from slide 2), now at section 2, subsection Tree Sorting

  12. Tree Sorting: Sorting the Tree?
      Idea:
      1. sort the children of each node by their label
      2. apply an ordered tree distance
      [Figure: a tree T1 and its sorted version.]
      ✘ Edit distance: tree sorting does not work
      ✓ Windowed pq-grams: tree sorting works OK

  13. Tree Sorting: ✘ Edit Distance: Tree Sorting Does Not Work
      1. Non-unique sorting: edit distance $O(n)$ for identical trees
      [Figure: two trees that are identical as unordered trees (edit dist = 0). Sorting each by label yields two different ordered trees with edit dist = $O(n)$.]
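
The non-unique sorting problem can be reproduced with a small sketch. The tree encoding as `(label, children)` tuples and the function `sort_tree` are assumptions made for this illustration. Children are sorted by label only, so siblings that share a label keep their input order.

```python
def sort_tree(node):
    """Recursively sort the children of every node by label (stable sort)."""
    label, children = node
    sorted_children = sorted((sort_tree(c) for c in children), key=lambda c: c[0])
    return (label, sorted_children)

# Two trees that are identical when sibling order is ignored:
# the two b-children of a are simply swapped.
t1 = ("a", [("b", [("x", [])]), ("b", [("y", [])])])
t2 = ("a", [("b", [("y", [])]), ("b", [("x", [])])])

print(sort_tree(t1) == sort_tree(t2))
```

Both b-siblings compare equal under the sort key, so each tree keeps its own order and the sorted results differ. An ordered tree edit distance applied afterwards would therefore report a non-zero distance for trees that are equal as unordered trees.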

  14. Tree Sorting: ✘ Edit Distance: Tree Sorting Does Not Work
      2. Node renaming: edit distance depends on node label
      [Figure: one node of a tree T2 is renamed (1 rename) in two different ways, to a and to x. After sorting, one renamed tree is at dist = 1 and the other at dist = $O(n)$.]

  15. Tree Sorting: ✓ Windowed pq-Grams: Tree Sorting Works OK
      Theorem (Local Effect of Node Reordering): If k children of a node are reordered, i.e., their subtrees are moved, only $O(k)$ windowed pq-grams change.
      Proof (idea):
      - pq-grams consist of a stem and a base
      - stems are invariant to the sibling order
      - bases: only the $O(k)$ pq-grams with the reordered nodes in the bases change
      ✓ Non-unique sortings are equivalent: distance is 0 for identical trees
      ✓ Node renaming is independent of the node label

  16. Outline (repeated from slide 2), now at section 2, subsection Forming Bases

  17. Forming Bases: How To Form Bases?
      Goal for windowed pq-grams:
      - not sensitive to the sibling order
      - sensitive to any other change in the tree
      [Figure: a pq-gram with stem p = 2 and base q = 3.]
      Stems: ignore sibling order. [Figure: an example tree with nodes a, b, c, d, e and its stems.]
      Bases: do not ignore sibling order!

  18. Forming Bases: Requirements for Bases
      Requirements for bases:
      - detection of node moves
      - robustness to different sortings
      - balanced node weight
      Our solution:
      - windows: simulate all permutations within a window
      - wrapping: wrap windows that extend beyond the right border
      - dummies: extend small sibling sets with dummy nodes
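
The three ingredients (windows, wrapping, dummies) can be sketched as follows. The slides shown here do not spell out the exact rule for picking bases inside a window, so this sketch makes assumptions: each window contributes the q-element subsequences that start with its first node, sibling sets smaller than the window size w are padded with dummies up to w, and a leaf gets a single all-dummy base. With q = 2 and w = 3 this reproduces the six base patterns under stem (*, a) in the slide 9 index (bc, b*, c*, cb, *b, *c); behaviour in other cases is a guess. The name `window_bases` is hypothetical.

```python
from itertools import combinations

DUMMY = "*"

def window_bases(children, q=2, w=3):
    """Form the bases for one node from the labels of its children."""
    if not children:
        return [(DUMMY,) * q]            # leaf: one all-dummy base (assumed)
    siblings = sorted(children)          # sort the children by label
    # dummies: pad small sibling sets up to the window size
    siblings += [DUMMY] * max(0, w - len(siblings))
    n = len(siblings)
    bases = []
    for start in range(n):
        # wrapping: a window that runs past the right border continues at the left
        window = [siblings[(start + k) % n] for k in range(w)]
        first, rest = window[0], window[1:]
        # assumed rule: first node of the window plus q-1 of the remaining nodes
        for tail in combinations(rest, q - 1):
            bases.append((first,) + tail)
    return bases

print(window_bases(["c", "b"]))
```

Because the children are sorted before the windows are formed, `window_bases(["c", "b"])` and `window_bases(["b", "c"])` return the same bases, which is the insensitivity to sibling order the slides ask for.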
