Cross-Document Pattern Matching


  1. Cross-Document Pattern Matching. Gregory Kucherov (1), Yakov Nekrich (2), Tatiana Starikovskaya (3, 1). Affiliations: (1) Université Paris-Est & CNRS; (2) University of Chile; (3) Lomonosov Moscow State University. CPM 2012.

  2. Pattern Matching Problem. Given a text T and a pattern P, count all occurrences of P in T. In the suffix tree ST(T), find the node u = locus(P); this takes O(|P|) time and O(|T|) space.
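
As a small illustration of the counting idea, the sketch below uses a sorted suffix array as a stand-in for the suffix tree: the leaves below locus(P) correspond to a contiguous range of sorted suffixes, so counting reduces to two binary searches. This is not the structure from the slides. The names `suffix_array` and `count_occurrences` are hypothetical, and the construction is deliberately naive (quadratic space, no O(|P|) guarantee).

```python
from bisect import bisect_left, bisect_right


def suffix_array(text):
    """Start positions of all suffixes of text, in lexicographic order (naive)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def count_occurrences(text, pattern):
    """Count occurrences of pattern in text via the sorted suffixes."""
    sa = suffix_array(text)
    # Truncating sorted suffixes to |P| characters keeps them sorted,
    # so the suffixes that start with P form one contiguous range.
    prefixes = [text[i:i + len(pattern)] for i in sa]
    lo = bisect_left(prefixes, pattern)
    hi = bisect_right(prefixes, pattern)
    return hi - lo
```

For example, `count_occurrences("banana", "ana")` returns 2, matching the two overlapping occurrences at positions 1 and 3 (0-based).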

  3. Cross-Document Pattern Matching Problem. Given a set of documents T_1, T_2, ..., T_m and a pattern P = T_k[i..j], count all occurrences of P in T_ℓ. Example: the documents are genomic sequences and the pattern is a fragment of one of the sequences.

  4. Cross-Document Pattern Matching Problem (continued). Given a set of documents T_1, T_2, ..., T_m and a pattern P = T_k[i..j], count all occurrences of P in T_ℓ. Standard solution: find u = locus(P) in ST(T_ℓ), in O(|P|) time and O(|T_ℓ|) space. Faster solution? Yes.

  5. Variants: counting; reporting; document counting and reporting; dynamic counting and reporting.

  7. Counting. Given a set of documents T_1, T_2, ..., T_m and a pattern P = T_k[i..j], count all occurrences of P in T_ℓ. 1) Identify a position p of some occurrence of P in T_ℓ, so that P = T_ℓ[p..p+|P|−1]. 2) Find u = locus(T_ℓ[p..p+|P|−1]) in ST(T_ℓ) and retrieve the number of leaves in its subtree.

  8. Counting: step 1. Identify a position p of some occurrence of P in T_ℓ. Let GSA be the generalized suffix array of the documents, D the document array, and r the rank of T_k[i..] in GSA. Let p_1 and p_2 be the starting positions of the suffixes of T_ℓ closest to T_k[i..] in GSA, at ranks r_1 and r_2: r_1 = select(ℓ, rank(D[1..r−1], ℓ)) and r_2 = select(ℓ, rank(D[1..r−1], ℓ) + 1). [Golynski et al. 2006] Rank and select queries on D can be supported in O(1) and O(log log m) time respectively.

  9. Counting: step 1 (continued). Positions p_1 and p_2 can therefore be computed in O(log log m) time. P occurs at p_1 ⇔ lcp(T_ℓ[p_1..], T_k[i..]) ≥ |P|, and likewise for p_2. Step 1 takes O(log log m) time.
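
Step 1 can be sketched as follows, under simplifying assumptions: the generalized suffix array is built by naive sorting, rank and select on D are emulated with per-document sorted lists of GSA ranks, and the LCP is computed by direct comparison. None of this achieves the O(log log m) bound from the slides. The names `build_index`, `lcp`, and `find_occurrence` are hypothetical, and all indices are 0-based.

```python
from bisect import bisect_left


def build_index(docs):
    """Naive generalized suffix array plus helpers for rank/select on D."""
    gsa = sorted(((d, p) for d, t in enumerate(docs) for p in range(len(t))),
                 key=lambda dp: (docs[dp[0]][dp[1]:], dp))
    rank_of = {dp: r for r, dp in enumerate(gsa)}  # (doc, pos) -> rank in GSA
    occ = [[] for _ in docs]                       # occ[d] = GSA ranks with D[r] == d
    for r, (d, _) in enumerate(gsa):
        occ[d].append(r)
    return gsa, rank_of, occ


def lcp(a, b):
    """Length of the longest common prefix of strings a and b."""
    n = 0
    while n < len(a) and n < len(b) and a[n] == b[n]:
        n += 1
    return n


def find_occurrence(docs, index, k, i, j, l):
    """Return a start position of P = docs[k][i..j] in docs[l], or None."""
    gsa, rank_of, occ = index
    length = j - i + 1
    r = rank_of[(k, i)]
    before = bisect_left(occ[l], r)          # rank(D[0..r-1], l)
    candidates = []
    if before > 0:
        candidates.append(occ[l][before - 1])  # r_1: closest suffix of T_l before r
    if before < len(occ[l]):
        candidates.append(occ[l][before])      # r_2: closest suffix of T_l at or after r
    for rc in candidates:
        p = gsa[rc][1]
        if lcp(docs[l][p:], docs[k][i:]) >= length:
            return p
    return None
```

Checking only the two neighbours is enough because the suffixes that start with P occupy a contiguous range of the GSA containing T_k[i..]; if any suffix of T_ℓ lies in that range, so does the nearest one on at least one side.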

  10. Counting: step 2. Find the locus of T_ℓ[p..p+|P|−1] in ST(T_ℓ) and retrieve the number of leaves in its subtree. Let u = locus(P) with P = T_ℓ[p..p+|P|−1], and let v = locus(T_ℓ[p..]). Define weight(w) as the string depth of a node w, and w = wla(v, q) as the ancestor of v of minimal depth such that weight(w) ≥ q. Then u = wla(v, |P|).

  11. Weighted Level Ancestor Problem. [Farach et al. 1996, Amir et al. 2007] w = wla(v, q) can be found in O(log log W) time and linear space, where W is the maximal weight of a node in the tree. Theorem: w = wla(v, q) can be found in O(min{√(log n_w / log log n_w), log log q}) time and linear space, where n_w is the number of leaves in the subtree of w.
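
To make the wla(v, q) definition concrete, here is a naive sketch that simply walks parent pointers. It takes time proportional to the depth of v and is not one of the structures cited above. The tree representation (a `parent` list and a `weight` list of string depths) and the names `wla` and `leaf_counts` are assumptions made for illustration. The example tree is the hand-built "a" branch of the suffix tree of "banana$".

```python
def wla(parent, weight, v, q):
    """Ancestor of v (possibly v itself) of minimal depth with weight >= q.

    Assumes weight[v] >= q and that weights strictly increase from the root down.
    """
    while parent[v] is not None and weight[parent[v]] >= q:
        v = parent[v]
    return v


def leaf_counts(parent):
    """Number of leaves in the subtree of every node (naive upward walks)."""
    n = len(parent)
    has_child = [False] * n
    for p in parent:
        if p is not None:
            has_child[p] = True
    counts = [0] * n
    for v in range(n):
        if not has_child[v]:          # v is a leaf: credit it to all its ancestors
            w = v
            while w is not None:
                counts[w] += 1
                w = parent[w]
    return counts


# Nodes: 0 = root, 1 = "a", 2 = "ana", 3 = "anana$", 4 = "ana$", 5 = "a$".
parent = [None, 0, 1, 2, 2, 1]
weight = [0, 1, 3, 6, 4, 2]
counts = leaf_counts(parent)
u = wla(parent, weight, 3, 3)  # locus of "ana", reached from the leaf "anana$"
```

Here `u` is node 2 and `counts[u]` is 2, the number of occurrences of "ana" in "banana".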

  12. Counting: step 2 (continued). Since u = wla(v, |P|) and n_u = occ, u can be found in O(min{√(log occ / log log occ), log log |P|}) time.

  13. Counting: result. Theorem: Counting takes O(t + log log m) time and O(n) space, where t = min{√(log occ / log log occ), log log |P|}.

  15. Reporting. Given a set of documents T_1, T_2, ..., T_m and a pattern P = T_k[i..j], report all occurrences of P in T_ℓ. 1) Identify a position p of T_ℓ at which P occurs; this is step 1 of Counting and takes O(log log m) time. 2) Report all s such that lcp(T_ℓ[p..], T_ℓ[s..]) ≥ |P|: since P is a prefix of T_ℓ[p..], P occurs at s ⇔ lcp(T_ℓ[p..], T_ℓ[s..]) ≥ |P|.

  17. Reporting: step 2. Report all s such that lcp(T_ℓ[p..], T_ℓ[s..]) ≥ |P|. In the suffix array SA(T_ℓ), start from T_ℓ[p..] and visit the neighbouring suffixes T_ℓ[s..] on either side: while lcp(T_ℓ[s..], T_ℓ[p..]) ≥ |P|, report s. Theorem: Reporting takes O(log log m + occ) time and O(n) space.
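
The scan in step 2 can be sketched as below, assuming a known occurrence at position p (0-based) and a naively built suffix array. The lcp test is replaced by a direct comparison of the first |P| characters, which costs O(|P|) per suffix; the O(occ) bound on the slide presumably relies on constant-time LCP queries, which this sketch does not implement. The name `report_occurrences` is hypothetical.

```python
def report_occurrences(text, p, length):
    """Report all starts s with lcp(text[p:], text[s:]) >= length.

    Assumes P = text[p:p+length] fits inside text.
    """
    sa = sorted(range(len(text)), key=lambda s: text[s:])  # naive suffix array
    rank = {s: r for r, s in enumerate(sa)}
    target = text[p:p + length]

    def matches(s):
        # Equivalent to lcp(text[s:], text[p:]) >= length.
        return text[s:s + length] == target

    found = [p]
    r = rank[p] - 1
    while r >= 0 and matches(sa[r]):        # scan towards smaller suffixes
        found.append(sa[r])
        r -= 1
    r = rank[p] + 1
    while r < len(sa) and matches(sa[r]):   # scan towards larger suffixes
        found.append(sa[r])
        r += 1
    return sorted(found)
```

For instance, `report_occurrences("banana", 1, 3)` returns `[1, 3]`. The scan stops at the first non-matching neighbour on each side, because suffixes sharing the prefix P are contiguous in the suffix array.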

  19. Document counting and reporting. Given a set of documents T_1, T_2, ..., T_m and a pattern P = T_k[i..j], count or report all documents in which P occurs. 1) Find u = locus(P) in the generalized suffix tree (GST), where P = T_k[i..j] and v = locus(T_k[i..]); by reduction to the WLA problem this takes O(min{√(log docc / log log docc), log log |P|}) time. 2) Report or count the distinct documents in the subtree of u, using the document array D.

  21. Document counting and reporting: step 2. Reporting or counting distinct documents in the subtree of u ⇔ reporting or counting distinct documents in the corresponding segment of the document array D. [Muthukrishnan 2002] Reporting of distinct documents in a segment of D takes O(ndocs) time and O(n) space. Theorem: Document reporting takes O(t + ndocs) time and O(n) space, where t = min{√(log docc / log log docc), log log |P|}. [Bozanis et al. 1995] Counting of distinct documents in a segment of D takes O(log n) time and O(n) space. Theorem: Document counting takes O(log n) time and O(n) space.
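
The reporting half of this step can be illustrated with the well-known idea of storing, for each position of D, the previous position holding the same document, and recursing on range minima of that array. In the sketch below the range minimum is found by a linear scan, so it does not reach the O(ndocs) bound; a constant-time range-minimum structure would be needed for that. The names `prev_occurrence` and `distinct_documents` are hypothetical, and the segment [a, b] is 0-based and inclusive.

```python
def prev_occurrence(doc_array):
    """prev[r] = largest r' < r with doc_array[r'] == doc_array[r], or -1."""
    last = {}
    prev = []
    for r, d in enumerate(doc_array):
        prev.append(last.get(d, -1))
        last[d] = r
    return prev


def distinct_documents(doc_array, prev, a, b):
    """Distinct documents in doc_array[a..b], each reported exactly once."""
    found = []
    stack = [(a, b)]
    while stack:
        lo, hi = stack.pop()
        if lo > hi:
            continue
        # Naive range-minimum query over prev[lo..hi].
        r = min(range(lo, hi + 1), key=prev.__getitem__)
        if prev[r] >= a:
            # Every document in [lo, hi] already occurs earlier inside [a, b].
            continue
        found.append(doc_array[r])   # r is the leftmost occurrence in [a, b]
        stack.append((lo, r - 1))
        stack.append((r + 1, hi))
    return found
```

With `D = [0, 1, 0, 2, 1, 1, 0]`, the call `distinct_documents(D, prev_occurrence(D), 2, 5)` yields documents 0, 1, and 2 in some order. Taking the length of the result gives a count, but only in time proportional to ndocs, not the O(log n) counting bound cited on the slide.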

  23. Dynamic counting and reporting. Given a set of documents T_1, T_2, ..., T_m and a pattern P = T_k[i..j], count all occurrences of P in T_ℓ. Dynamic operation: adding a document. 1) Find a position p of some occurrence of P in T_ℓ. 2) Find the locus of T_ℓ[p..p+|P|−1] in ST(T_ℓ) and retrieve the number of leaves in its subtree.

  24. Dynamic counting and reporting (continued). Given a set of documents T_1, T_2, ..., T_m and a pattern P = T_k[i..j], count all occurrences of P in T_ℓ. Dynamic operation: adding a document. Step 1, finding a position p of some occurrence of P in T_ℓ, takes O(log n) time.
