SLIDE 1

TASM: Top-k Approximate Subtree Matching

Nikolaus Augsten¹, Denilson Barbosa², Michael Böhlen³, Themis Palpanas⁴

¹ Free University of Bozen-Bolzano, Italy, augsten@inf.unibz.it
² University of Alberta, Canada, denilson@cs.ualberta.ca
³ University of Zurich, Switzerland, boehlen@ifi.uzh.ch
⁴ University of Trento, Italy, themis@disi.unitn.eu

ICDE 2010, March 3, Long Beach, CA, USA

Nikolaus Augsten (Bolzano, Italy) TASM: Top-k Approx. Subtree Matching ICDE 2010 1 / 28

SLIDE 2

Outline

  • 1. Motivation and Problem Definition
  • 2. TASM-Postorder
      • Upper Bound on Subtree Size
      • Prefix Ring Buffer Pruning
  • 3. Experiments
  • 4. Conclusion and Future Work

SLIDE 4

Motivation and Problem Definition

Motivation

Query (XML fragment): article authors author Tim author John booktitle ICDE

Document (very large XML): DBLP, 28M nodes, 531 MB

Task: rank the top-k matches for the article query in the DBLP document!

Example answer for k = 3 (node labels of each matching subtree):

  • 1. inproceedings authors author Tim author John booktitle ICDE (1 error)
  • 2. article author Tim authors author John booktitle TKDE (2 errors)
  • 3. inproceedings authors author Tim author John author Peter booktitle ICDE (3 errors)

SLIDE 5

Motivation and Problem Definition

TASM: Top-k Approximate Subtree Matching

Definition (TASM: Top-k Approximate Subtree Matching)

Given: query tree $Q$, document tree $T$, size $k$ of the ranking.

Goal: compute a top-$k$ ranking $R = (T_1, T_2, \ldots, T_k)$ of all subtrees $T_i$ of document $T$ with respect to query $Q$, using the tree edit distance for the ranking.

Subtree $T_i$:

  • a node and all its descendants
  • the largest subtree is the document itself

Top-$k$ ranking $R = (T_1, T_2, \ldots, T_k)$:

  • subtrees are sorted by distance to the query
  • $R$ holds the best $k$ subtrees: $T_i \notin R \Rightarrow \mathrm{ted}(Q, T_k) \le \mathrm{ted}(Q, T_i)$

SLIDE 6

Motivation and Problem Definition

Ranking Function: Tree Edit Distance (TED)

Example (node labels of each tree):

  • start: article authors author Tim author John booktitle ICDE
  • after del(authors): article author Tim author John booktitle ICDE
  • after ren(ICDE): article author Tim author John booktitle TKDE

Tree Edit Distance: the minimum number of node edit operations (insert, rename, delete) that transform one tree into the other.

  • TASM computes the TED between the query and document subtrees.
  • The size and number of the computed subtrees define the complexity of TASM.
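The definition above can be made concrete with a small sketch. It assumes unit costs for insert, delete, and rename, and uses the textbook rightmost-root recursion over ordered forests with memoization. This is not the Zhang and Shasha or Demaine et al. algorithm and is only meant for tiny trees; the names `node`, `size`, `forest_distance`, and `ted` are hypothetical helpers introduced for illustration.

```python
from functools import lru_cache


def node(label, *children):
    """Build an ordered labeled tree as a hashable (label, children) tuple."""
    return (label, tuple(children))


def size(tree):
    """Number of nodes in a tree."""
    return 1 + sum(size(child) for child in tree[1])


@lru_cache(maxsize=None)
def forest_distance(f1, f2):
    """Unit-cost edit distance between two ordered forests (tuples of trees)."""
    if not f1:
        return sum(size(t) for t in f2)  # insert every remaining node of f2
    if not f2:
        return sum(size(t) for t in f1)  # delete every remaining node of f1
    (label1, kids1), (label2, kids2) = f1[-1], f2[-1]
    return min(
        # delete the root of the rightmost tree of f1
        forest_distance(f1[:-1] + kids1, f2) + 1,
        # insert the root of the rightmost tree of f2
        forest_distance(f1, f2[:-1] + kids2) + 1,
        # map the two rightmost roots onto each other (rename if labels differ)
        forest_distance(kids1, kids2)
        + forest_distance(f1[:-1], f2[:-1])
        + (label1 != label2),
    )


def ted(t1, t2):
    """Tree edit distance between two trees (assumption: unit costs)."""
    return forest_distance((t1,), (t2,))


query = node(
    "article",
    node("authors", node("author", node("Tim")), node("author", node("John"))),
    node("booktitle", node("ICDE")),
)
edited = node(
    "article",
    node("author", node("Tim")),
    node("author", node("John")),
    node("booktitle", node("TKDE")),
)
print(ted(query, edited))  # 2: del(authors) and ren(ICDE)
```

The two trees are the first and last tree of the example, so the result matches the two edit operations shown on the slide.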

SLIDE 7

Motivation and Problem Definition

State of the Art

TASM-Dynamic: dynamic programming solution [Zhang and Shasha 1989, Demaine et al. 2007]

  • computes the distance to every subtree of the document
  • uses smaller subtrees to compute larger ones
  • ranks subtrees by visiting the memoization table
  • space complexity: $O(mn)$, with $m$ the query size and $n$ the document size

Space complexity limits the application to databases:

  • in database applications $n$ is huge (database size!)
  • TASM-Dynamic maintains two $m \times n$ matrices in RAM
  • more than 6 GB of RAM for our tiny query ($m = 8$) on DBLP ($n = 28 \times 10^6$)

For database-sized documents, dynamic programming is too expensive. State-of-the-art algorithms do not scale!
SLIDE 8

Motivation and Problem Definition

Problem Definition

Find a solution for TASM (Top-k Approximate Subtree Matching) that

  • scales to very large documents
  • runs in small memory
  • ranks subtrees correctly (no heuristics!)

SLIDE 11

TASM-Postorder Upper Bound on Subtree Size

Subtree Size Upper Bound in Three Steps

  • 1. Rank the first $k$ subtrees of $T$ in postorder: $R' = (T'_1, T'_2, \ldots, T'_k)$.

The worst match $T'_k$ has size $|T'_k| \le k$. In the worst case, all nodes of $Q$ are deleted and all nodes of $T'_k$ are inserted:

(i) $\mathrm{ted}(Q, T'_k) \le |Q| + |T'_k|$

  • 2. Final ranking $R = (T_1, T_2, \ldots, T_k)$ (= TASM result).

The $T_i$ in $R$ are at least as good as the worst match $T'_k$ of $R'$:

(ii) $\mathrm{ted}(Q, T_i) \le \mathrm{ted}(Q, T'_k) \le |Q| + |T'_k|$

  • 3. Size upper bound for subtree $T_i$.

At least the missing nodes must be inserted, so $|T_i| - |Q| \le \mathrm{ted}(Q, T_i)$. Combined with (ii):

$|T_i| \le \mathrm{ted}(Q, T_i) + |Q| \le 2|Q| + |T'_k| \le 2|Q| + k$

SLIDE 12

TASM-Postorder Upper Bound on Subtree Size

Upper Bound on Subtree Size

Theorem (Upper Bound on Subtree Size). TASM needs to consider only small document subtrees of size $\tau$ or less:

$\tau = 2|Q| + k$

The upper bound is very powerful:

  • independent of document size and structure!
  • linear in query size and $k$

Example: top-10 with the example query ($|Q| = 8$) on DBLP (28M nodes)

  • with bound: maximum subtree size $\tau = 2 \cdot 8 + 10 = 26$
  • without bound: maximum subtree size is 28M (the whole document)!

Document-independent upper bound on subtree size!
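As a quick check of the arithmetic in the example, the bound can be written as a one-line function. The function name `subtree_size_bound` is a hypothetical helper, and the sketch assumes the unit-cost setting used on these slides.

```python
def subtree_size_bound(query_size: int, k: int) -> int:
    """Upper bound tau = 2|Q| + k on the size of subtrees TASM must consider."""
    return 2 * query_size + k


# Example from the slide: |Q| = 8, top-10 ranking.
print(subtree_size_bound(8, 10))  # 26
```

Note that the document size does not appear among the parameters, which is the point of the theorem.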

SLIDE 14

TASM-Postorder Prefix Ring Buffer Pruning

Document Format: Postorder Queue

Example document tree (node labels): dblp article auth John title X1 proceedings conf VLDB article auth Peter title X3 article auth Mike title X4 book title X2

Postorder queue of the example document:

(John, 1) (auth, 2) (X1, 1) (title, 2) (article, 5) (VLDB, 1) (conf, 2) (Peter, 1) (auth, 2) (X3, 1) (title, 2) (article, 5) (Mike, 1) (auth, 2) (X4, 1) (title, 2) (article, 5) (proc, 13) (X2, 1) (title, 2) (book, 3) (dblp, 22)

Postorder queue: queue of (label, size) pairs

  • dequeue removes the leftmost element, e.g., (John, 1)
  • no random access!

Relevant and state of the art for XML parsing

  • the full subtree is known only at the closing tag
  • closing tags appear in postorder

The implementation is efficient and heavily used for

  • XML streams
  • plain XML files (e.g., SAX)
  • XML in databases (Dewey, interval encoding, ...)
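A minimal sketch of how such a postorder queue could be produced from an in-memory tree. The tuple representation and the names `node` and `postorder_queue` are assumptions for illustration; a real implementation would typically emit the pairs from a streaming XML parser at each closing tag instead of building the tree first.

```python
def node(label, *children):
    """Build an ordered labeled tree as a (label, children) tuple."""
    return (label, tuple(children))


def postorder_queue(tree):
    """Yield (label, size) pairs of all nodes in postorder."""
    label, children = tree
    size = 1
    for child in children:
        pair = None
        for pair in postorder_queue(child):
            yield pair
        # The last pair of a child's traversal is the child itself,
        # so its second component is the size of the child's subtree.
        size += pair[1]
    yield (label, size)


def article(author, title):
    return node("article", node("auth", node(author)), node("title", node(title)))


# The example document from the slide.
dblp = node(
    "dblp",
    article("John", "X1"),
    node(
        "proceedings",
        node("conf", node("VLDB")),
        article("Peter", "X3"),
        article("Mike", "X4"),
    ),
    node("book", node("title", node("X2"))),
)

queue = list(postorder_queue(dblp))
print(queue[:5])   # starts with ('John', 1), ('auth', 2), ('X1', 1), ...
print(queue[-1])   # ('dblp', 22)
```

Because the generator yields a node only after all of its descendants, a consumer sees exactly the order in which closing tags would appear in the XML file.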

SLIDE 15

TASM-Postorder Prefix Ring Buffer Pruning

Candidate Subtrees

Candidate subtrees are all subtrees $T_i$ of the document for which

  • $|T_i| \le \tau$, and
  • $T_i$ is not contained in a larger subtree $T_j$ with $|T_j| \le \tau$.

Pruning: find the candidate subtrees.

SLIDE 16

TASM-Postorder Prefix Ring Buffer Pruning

Simple Pruning Approach

Example document, each node with its postorder number: dblp (22), article (5), auth (2), John (1), title (4), X1 (3), proceedings (18), conf (7), VLDB (6), article (12), auth (9), Peter (8), title (11), X3 (10), article (17), auth (14), Mike (13), title (16), X4 (15), book (21), title (20), X2 (19)

Simple pruning approach ($\tau = 6$ in the example above):

  • add nodes to a memory buffer until a non-candidate ($|T_i| > \tau$) is added
  • subtrees of the non-candidate with $|T_i| \le \tau$ are candidate subtrees

Problem: the memory buffer can grow very large!

  • subtrees must be kept in memory until a non-candidate ancestor is read
  • worst case: the memory buffer stores $O(n)$ nodes (frequent in data-centric XML!)

Example: DBLP, $\tau = 50$

  • 99% of the nodes are still in the buffer when the root node is read!

Simple pruning is not feasible for large documents!

SLIDE 17

TASM-Postorder Prefix Ring Buffer Pruning

Efficient Pruning is Tricky!

Problem: when can we remove a node from the buffer?

  • when we see $|T_i| \le \tau$, we don't yet know about the parent (postorder!)
  • the subtree of the parent might be smaller than $\tau$!

Our solution does not wait for the parent:

  • prefix ring buffer: fixed-size buffer
  • pruning rule: prune based on the following nodes

SLIDE 18

TASM-Postorder Prefix Ring Buffer Pruning

Pruning in Small Memory

prefix ring buffer ($\tau = 6$)

VLDB,1 e↑ s↑ John,1 auth,2 X1,1 title,2 article,5

Prefix ring buffer of size $\tau + 1$ (main memory)

  • stores a prefix ($\tau$ nodes in postorder) of the document
  • two operations: append a new node; remove the leftmost subtree/node

Pruning rule: if the leftmost node in the full ring buffer is a

  • leaf: the leftmost subtree is a candidate subtree
  • non-leaf: the leftmost node is a non-candidate node

SLIDE 19

TASM-Postorder Prefix Ring Buffer Pruning

Pruning Rule – Intuition

Candidate subtree: leftmost node is a leaf

  • $T_i$: leftmost subtree, starts with the leftmost node
  • $T_j$: smallest subtree that contains $T_i$
  • due to postorder: $T_j$ contains all nodes in the buffer
  • since $|T_i| \le \tau$ and $|T_j| > \tau$: $T_i$ is a candidate

Non-candidate node: leftmost node is a non-leaf

  • the leftmost non-leaf is the parent of previously removed nodes
  • we remove either candidate subtrees or non-candidate nodes
  • in both cases: the parent is a non-candidate

SLIDE 20

TASM-Postorder Prefix Ring Buffer Pruning

Prefix Ring Buffer Pruning – Example

Example document tree (node labels): dblp article auth John title X1 proceedings conf VLDB article auth Peter title X3 article auth Mike title X4 book title X2

Algorithm ($\tau = 6$):

  • 1. fill the ring buffer
  • 2. check the leftmost node
      • leaf: candidate subtree, append to the result
      • non-leaf: non-candidate, remove
  • 3. repeat until the queue and the buffer are empty

Snapshot of one step:

  • postorder queue (input): article,5 Mike,1 auth,2 · · ·
  • prefix ring buffer (main memory): Peter,1 auth,2 X3,1 title,2 e↑ s↑ VLDB,1 conf,2
  • candidate subtrees (output): article auth John title X1
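The pruning steps of this example could be sketched as follows. This is a simplified illustration, not the authors' implementation: it uses a `collections.deque` capped at $\tau$ nodes in place of a true ring buffer with start and end pointers, and the function name `prune_candidates` is hypothetical. The input is the postorder queue of (label, size) pairs of the example document.

```python
from collections import deque


def prune_candidates(postorder, tau):
    """Yield candidate subtrees as lists of (label, size) pairs in postorder."""
    stream = iter(postorder)
    buf = deque()          # holds at most tau consecutive postorder nodes
    exhausted = False
    while True:
        # 1. Fill the buffer up to tau nodes (or until the input ends).
        while not exhausted and len(buf) < tau:
            item = next(stream, None)
            if item is None:
                exhausted = True
            else:
                buf.append(item)
        if not buf:
            return
        # 2. Check the leftmost node.
        if buf[0][1] == 1:
            # Leaf: a node at offset j roots a subtree that starts at the
            # leftmost node exactly when its size is j + 1. Take the largest.
            end = max(j for j, (_, size) in enumerate(buf) if size == j + 1)
            yield [buf.popleft() for _ in range(end + 1)]
        else:
            # Non-leaf: all its descendants are gone, it is a non-candidate.
            buf.popleft()


doc = [
    ("John", 1), ("auth", 2), ("X1", 1), ("title", 2), ("article", 5),
    ("VLDB", 1), ("conf", 2),
    ("Peter", 1), ("auth", 2), ("X3", 1), ("title", 2), ("article", 5),
    ("Mike", 1), ("auth", 2), ("X4", 1), ("title", 2), ("article", 5),
    ("proc", 13),
    ("X2", 1), ("title", 2), ("book", 3),
    ("dblp", 22),
]

for subtree in prune_candidates(doc, tau=6):
    print(subtree[-1], [label for label, _ in subtree])
```

The first subtree emitted is the article by John, as in the snapshot above; at no point does the buffer hold more than $\tau$ nodes, regardless of the document size.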

SLIDE 21

TASM-Postorder Prefix Ring Buffer Pruning

TASM-Postorder

TASM-postorder

  • 1. empty ranking $R$, tightening upper bound $\tau' = \tau$
  • 2. for each candidate subtree $T_i$
      • a. if $|R| = k$: update $\tau' = \min(\tau, \max(R) + |Q|)$
      • b. compute the tree edit distance for all subtrees of $T_i$ within $\tau'$
      • c. update the ranking $R$

Theorem (TASM-Postorder). The space complexity of TASM-postorder is independent of the document size: $O(m^2 + mk)$

($m$: query size, $k$: result size)

TASM-postorder scales to very large documents!
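A rough sketch of the ranking loop above, under several assumptions. The candidate subtrees are given as in-memory (label, children) tuples, `distance` is a caller-supplied stand-in for the tree edit distance, and the distance of each subtree is computed separately, whereas a dynamic programming implementation would obtain the distances of all subtrees of a candidate in one pass. The names `tasm_postorder`, `subtrees`, and `size` are hypothetical.

```python
import heapq
from itertools import count


def size(tree):
    """Number of nodes in a (label, children) tree."""
    return 1 + sum(size(child) for child in tree[1])


def subtrees(tree):
    """Yield every subtree (a node with all its descendants) in postorder."""
    for child in tree[1]:
        yield from subtrees(child)
    yield tree


def tasm_postorder(candidates, query, k, distance):
    """Return the k best (distance, subtree) pairs, sorted by distance."""
    q = size(query)
    tau = 2 * q + k
    tau_prime = tau                 # step 1: tightening upper bound
    heap = []                       # max-heap on distance via negation
    tiebreak = count()              # avoids comparing trees on equal distance
    for candidate in candidates:    # step 2
        if len(heap) == k:          # step a: tighten with the worst ranked distance
            tau_prime = min(tau, -heap[0][0] + q)
        for sub in subtrees(candidate):
            if size(sub) > tau_prime:
                continue            # step b: only subtrees within tau'
            d = distance(query, sub)
            if len(heap) < k:       # step c: update the ranking
                heapq.heappush(heap, (-d, next(tiebreak), sub))
            elif d < -heap[0][0]:
                heapq.heapreplace(heap, (-d, next(tiebreak), sub))
    return sorted(((-nd, sub) for nd, _, sub in heap), key=lambda pair: pair[0])
```

Skipping subtrees larger than $\tau'$ is only safe if `distance` satisfies $|T_i| - |Q| \le \mathrm{distance}(Q, T_i)$, which holds for the unit-cost tree edit distance. The heap never holds more than $k$ entries, so the memory used by the ranking does not depend on the document size.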

SLIDE 23

Experiments

Pruning Effectiveness

Prefix ring buffer pruning is very effective!

Maximum subtree size reduced from 37M to 18 nodes.

  • Dataset: PSD protein sequences, 37M nodes, 683 MB
  • Compute TASM ($|Q| = 4$, $k = 1$)
  • TASM-dynamic (state of the art) vs. TASM-postorder (our solution)

[Figure: histograms of computed subtrees, number of subtrees vs. subtree size (nodes), both axes logarithmic from 1e0 to 1e7. TASM-Dynamic: largest subtree is the entire document (37M nodes). TASM-Postorder: largest subtree has 18 nodes.]

SLIDE 24

Experiments

Scalability: TASM-Postorder vs. TASM-Dynamic

TASM-postorder is much faster than TASM-dynamic.

  • Dataset: XMark (synthetic XML benchmark)
  • Vary query size and document size
  • Compute TASM ($k = 5$)
  • TASM-dynamic (state of the art) vs. TASM-postorder (our solution)
  • Measure wall-clock time

[Figure, two panels, time in seconds on a logarithmic axis from 1e0 to 1e3. Left: time vs. query size (4 to 64 nodes) for TASM-dynamic (dyn) and TASM-postorder (pos) on documents of 112 MB and 224 MB. Right: time vs. document size (112 to 1792 MB) for dyn and pos with |Q| = 4 and |Q| = 8.]

SLIDE 25

Experiments

Scalability with Result Size k

TASM-postorder scales well with k.

Increasing k by 4 orders of magnitude only doubles runtime.

[Figure: time (seconds, 50 to 300) vs. k (logarithmic axis, 1e0 to 1e4) for TASM-dynamic (dyn) and TASM-postorder (pos) on documents of 112 MB and 224 MB.]

  • Dataset: XMark (synthetic XML benchmark)
  • Vary $k$ (size of the ranking)
  • Compute TASM ($|Q| = 16$)
  • TASM-dynamic (state of the art) vs. TASM-postorder (our solution)
  • Measure wall-clock time

SLIDE 26

Experiments

Space complexity: TASM-Postorder vs. TASM-Dynamic

TASM-postorder: space independent of document!

[Figure: memory (MB, logarithmic axis from 1e0 to 4e3) vs. document size (112 to 1792 MB) for TASM-dynamic (dyn) and TASM-postorder (pos) with |Q| = 4 and |Q| = 16; annotated values: 3GB and 8MB.]

  • Dataset: XMark (synthetic XML benchmark)
  • Vary document size
  • Compute TASM ($k = 5$)
  • TASM-dynamic (state of the art) vs. TASM-postorder (our solution)
  • Measure main memory usage

SLIDE 28

Conclusion and Future Work

Conclusion

  • Prefix ring buffer for space-efficient pruning
  • Dynamic programming does not scale to database-sized documents.
  • Upper bound $\tau$: limits the maximum subtree size for TASM
  • TASM-postorder: highly scalable TASM algorithm

TASM-postorder makes TASM feasible.

Future work, new research opportunities:

  • tune the tree edit distance to different applications
  • index the document: can we avoid a document scan?
  • parallel TASM algorithm: where to split the document?

SLIDE 29

  • Erik D. Demaine, Shay Mozes, Benjamin Rossman, and Oren Weimann. An optimal decomposition algorithm for tree edit distance. In ICALP, volume 4596 of LNCS, pages 146–157, Wroclaw, Poland, July 2007. Springer.
  • K. Zhang and D. Shasha. Simple fast algorithms for the editing distance between trees and related problems. SIAM J. on Computing, 18(6):1245–1262, 1989.