
Full-fledged Real-Time Indexing for Constant Size Alphabets

Full-fledged Real-Time Indexing for Constant Size Alphabets. Gregory Kucherov (CNRS/LIGM, Marne-la-Vallée, France) and Yakov Nekrich (University of Kansas, USA). ICALP'13, July 11, 2013.


  1. Full-fledged Real-Time Indexing for Constant Size Alphabets
     Gregory Kucherov (CNRS/LIGM, Marne-la-Vallée, France) and Yakov Nekrich (University of Kansas, USA)
     ICALP'13, July 11, 2013

  2. Context and history Suffix Tree Supporting real-time
     String Matching and Indexing
     String matching: find all occurrences of a pattern P in a text T.
     In string matching, P is fixed (or given first).
     In indexing, T is fixed (or given first).
     Real-time processing: read the data online, spending $O(1)$ time on each character.
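To make the distinction on this slide concrete, here is a minimal sketch of the indexing interface: the text arrives one character at a time, and a pattern query may be asked at any moment. The class name `NaiveOnlineIndex` and its methods are hypothetical, and the linear scan is deliberately naive; it is neither real-time nor the structure described in the talk, it only fixes what "report all occurrences in the current text" means.

```python
class NaiveOnlineIndex:
    """Naive stand-in for an online index (hypothetical name, not from the talk)."""

    def __init__(self):
        self.text = ""

    def append(self, ch):
        # Read one more character of the text (O(1) amortized here, but no real index is built).
        self.text += ch

    def occurrences(self, pattern):
        # Report every starting position of pattern in the current text by brute force.
        last_start = len(self.text) - len(pattern)
        return [i for i in range(last_start + 1) if self.text.startswith(pattern, i)]


index = NaiveOnlineIndex()
for ch in "abba":
    index.append(ch)
early = index.occurrences("ba")   # query asked before the text is complete
for ch in "bac":
    index.append(ch)
late = index.occurrences("ba")    # same query against the longer text
```

A real-time index must answer the same queries while spending only constant worst-case time in `append`, which the brute-force scan above does not attempt.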

  3. History of the Problem and Related Work

  4. Real-time string matching vs. real-time indexing
     The language $\{P \# T : P \text{ occurs in } T\}$ can be recognized in real time by a Turing machine [Galil 81].
     The language $\{T \# P : P \text{ occurs in } T\}$ cannot be recognized in real time by a (multi-tape) Turing machine [Freidzon 68].

  6. Indexing under the RAM model
     The language $\{T \# P : P \text{ occurs in } T\}$ can be recognized in real time on a RAM [Slisenko 76-78]; the same result appears in [Kosaraju STOC 94].
     There is an index of T that can be updated in real time such that, for any pattern query P made at any moment, one can check whether P occurs in the current T in time $O(|P|)$ [Amir, Nor SODA 08]. The result assumes a constant-size alphabet.
     Our result: an index that can be updated in real time and that reports all occurrences of P in the current text in time $O(|P| + occ)$, where occ is the number of occurrences. The result assumes a constant-size alphabet.

  7. Updating a Suffix Tree

  8. Suffix Tree
     [Figure: suffix tree of abbabac]

  9. Suffix Tree
     Three classical linear-time algorithms for constructing a suffix tree:
     [Weiner 73]: right-to-left construction
     [McCreight 76]: left-to-right
     [Ukkonen 95]: left-to-right, online
     Weiner's algorithm is more suitable for real-time construction, as only a constant number of changes is made at each letter.

  11. Towards real-time construction of suffix tree
      [Amir, Kopelowitz, Lewenstein, Lewenstein SPIRE 05]: $O(\log n)$ worst-case per symbol, unbounded alphabet
      [Breslauer, Italiano SPIRE 11]: $O(\log\log n)$ worst-case per symbol, constant alphabet
      [Kopelowitz FOCS 12]: $O(\log\log n + \log\log\sigma)$ expected worst-case per symbol, unbounded alphabet
      [Fischer, Gawrychowski arXiv 13]: $O\left(\log\log n + \frac{\log^2\log\sigma}{\log\log\log\sigma}\right)$ worst-case per symbol, unbounded alphabet
      This work: $O(\log\log n)$ worst-case per symbol, log-size alphabet

  12. Weiner's algorithm: W-links
      W-links: for every node v and every letter a, $P_a(v) = av$, provided that node av exists.
      The target of a W-link can be an explicit or an implicit node; the W-link is called hard or soft, respectively.
      Lemma: A soft W-link $P_a(v)$ is defined iff there is a unique closest descendant u such that $P_a(u)$ is hard, and $P_a(v)$ points to edge $(w, P_a(u))$.
      [Figure: suffix tree with W-links]
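The hard/soft distinction can be checked by brute force on a small text. The sketch below identifies each suffix tree node with the substring it spells and assumes the text ends with a character that occurs nowhere else (as the final c does in abbabac). The function names `is_explicit` and `wlink_kind` are hypothetical helpers for illustration, and no suffix tree is actually built.

```python
def is_explicit(t, s):
    """True if substring s of t would be an explicit suffix tree node.

    Explicit nodes are the root, the leaves (suffixes of t) and the branching
    nodes (substrings followed by at least two different letters in t).
    Assumes t ends with a unique terminator.
    """
    if s == "" or t.endswith(s):
        return True
    followers = {t[i + len(s)] for i in range(len(t) - len(s)) if t.startswith(s, i)}
    return len(followers) >= 2


def wlink_kind(t, v, a):
    """Classify the W-link of node v by letter a as 'hard', 'soft' or None (undefined)."""
    target = a + v
    if target not in t:
        return None            # av does not occur in t, so there is no W-link
    return "hard" if is_explicit(t, target) else "soft"


text = "abbabac"
kinds = {
    ("a", "b"): wlink_kind(text, "a", "b"),    # target "ba" is a branching node
    ("ab", "b"): wlink_kind(text, "ab", "b"),  # target "bab" lies inside an edge
    ("ab", "a"): wlink_kind(text, "ab", "a"),  # "aab" does not occur
}
```

On abbabac this classifies the link from a by b as hard, the link from ab by b as soft, and the link from ab by a as undefined.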

  14. Main idea of Weiner's algorithm
      Transforming the suffix tree for t into the suffix tree for at:
      find the lowest ancestor u of t with a W-link $P_a(u)$;
      $P_a(u)$ is the branching point.
      Example: abbabac ⇒ babbabac
      [Figure: suffix trees of abbabac and babbabac]
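As a sanity check of the example on this slide, the branching point can be found by brute force: it is the longest prefix of the new suffix at that already occurs in t. The helper name `branching_point` is hypothetical, and the substring scan stands in for the W-link traversal; it illustrates what the step computes, not how Weiner's algorithm computes it in constant time per change.

```python
def branching_point(t, a):
    """Longest prefix of a + t that already occurs in t.

    Assumes t ends with a unique terminator, so a + t itself never occurs in t.
    """
    new_suffix = a + t
    k = 0
    while k < len(new_suffix) and new_suffix[:k + 1] in t:
        k += 1
    return new_suffix[:k]


t = "abbabac"
point = branching_point(t, "b")   # where the new leaf for "babbabac" hangs
u = point[1:]                     # the ancestor u of leaf t with P_b(u) = point
```

For abbabac and a = b, the branching point is bab and the corresponding ancestor u is ab; since bab is an implicit node, inserting the new suffix splits an edge there.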

  18. Our implementation of Weiner
      Main ideas:
      we store only hard W-links; soft W-links are computed "on the fly";
      we maintain a list $L_W$ corresponding to the Euler tour of the tree;
      each node u with a defined hard W-link $W_a(u)$ is "colored" by a in $L_W$.
      Lemma: To find the deepest ancestor u of t with a defined (possibly soft) W-link $W_a(u)$, let $v_1$ (resp. $v_2$) be the closest node colored with a that precedes (resp. follows) t in $L_W$. Then u is the deeper of $lca(t, v_1)$ and $lca(t, v_2)$.
      [Figure: nodes $v_1$, $v_2$, u and t in the tree]
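The lemma can be tried out on the suffix tree of abbabac with the color b. In the sketch below the tree (`CHILDREN`) and the set of nodes with a hard W-link by b (`COLORED_B`) are written out by hand, the Euler tour records each node on entry and exit, and both the colored neighbor search and the LCA are naive linear scans. All names are hypothetical, and the scans replace the colored-predecessor and dynamic-LCA structures that the talk relies on.

```python
# Suffix tree of "abbabac"; every node is named by the string it spells.
CHILDREN = {
    "": ["a", "b", "c"],
    "a": ["ab", "ac"],
    "ab": ["abbabac", "abac"],
    "b": ["ba", "bbabac"],
    "ba": ["babac", "bac"],
}
# Nodes v such that b + v is an explicit node (hard W-link by letter b).
COLORED_B = {"", "a", "ac", "abac", "babac"}


def euler_tour(children, root):
    tour = []

    def visit(v):
        tour.append(v)                  # entry
        for c in children.get(v, []):
            visit(c)
        tour.append(v)                  # exit

    visit(root)
    return tour


def naive_lca(parent, depth, x, y):
    # Walk the deeper node upwards until both pointers meet.
    while x != y:
        if depth[x] < depth[y]:
            x, y = y, x
        x = parent[x]
    return x


def deepest_ancestor_with_wlink(children, root, colored, t):
    parent, depth, stack = {root: None}, {root: 0}, [root]
    while stack:
        v = stack.pop()
        for c in children.get(v, []):
            parent[c], depth[c] = v, depth[v] + 1
            stack.append(c)
    tour = euler_tour(children, root)
    i = tour.index(t)
    v1 = next((v for v in reversed(tour[:i]) if v in colored), None)  # closest colored before t
    v2 = next((v for v in tour[i + 1:] if v in colored), None)        # closest colored after t
    candidates = [naive_lca(parent, depth, t, v) for v in (v1, v2) if v is not None]
    return max(candidates, key=depth.get) if candidates else None


u = deepest_ancestor_with_wlink(CHILDREN, "", COLORED_B, "abbabac")
```

Here the closest colored nodes around the leaf abbabac are a (before) and abac (after); their LCAs with the leaf are a and ab, and the deeper one, ab, is the node u from which the soft W-link by b is followed.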

  19. Tools that we use
      Colored Predecessor in a List
      Problem: Maintain a dynamic list $L$ (under insertions) whose elements are assigned natural numbers ("colors").
      Colored predecessor queries: given an element $e \in L$ and a color c, retrieve the closest element $e' \in L$ preceding e with color c.
      Theorem [Mortensen SODA 03; Giyora, Kaplan 09]: If the number of colors is smaller than $\log^{1/4} n$, then there exists an $O(|L|)$-space data structure that supports updates in $O(\log\log |L|)$ time and answers colored predecessor queries in $O(\log\log |L|)$ time.
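The interface of the colored predecessor problem is small enough to state as code. The class below is a naive baseline with hypothetical names: it addresses elements by list position rather than by handle, and both operations take linear time, so it shows only what the structure in the theorem has to compute, not how the $O(\log\log |L|)$ bounds are obtained.

```python
class NaiveColoredList:
    """Linear-time baseline for colored predecessor queries (illustrative only)."""

    def __init__(self):
        self.items = []   # (element, color) pairs in list order

    def insert(self, position, element, color):
        # Insert a new element so that it ends up at the given position.
        self.items.insert(position, (element, color))

    def colored_predecessor(self, position, color):
        # Scan left from the element at `position` for the closest one with `color`.
        for j in range(position - 1, -1, -1):
            if self.items[j][1] == color:
                return self.items[j][0]
        return None


lst = NaiveColoredList()
for pos, (name, color) in enumerate([("x1", 0), ("x2", 1), ("x3", 0), ("x4", 1)]):
    lst.insert(pos, name, color)
```

With this list, the predecessor of x4 with color 0 is x3, and x1 has no colored predecessor at all.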

  20. Tools that we use (cont.)
      Dynamic Lowest Common Ancestor (LCA)
      Problem: Maintain a dynamic tree (leaf insertion/deletion, leaf edge split, edge merge) supporting lowest common ancestor queries for two nodes.
      Theorem [Cole, Hariharan 05]: both updates and queries can be supported in worst-case $O(1)$ time.

  21. What we obtained so far
      Theorem: We can maintain a suffix tree of a right-to-left streaming text by spending $O(\log\log n)$ worst-case time on each symbol, assuming an alphabet of size $\le \log^{1/4} n$.
      This simplifies and (slightly) generalizes [Breslauer, Italiano 11].

  22. Our solution to real-time text indexing

  23. Fully real-time text indexing on constant-size alphabet
      Main idea: Maintain three distinct data structures for patterns of length
      $\ge \log^2\log n$ (long patterns),
      between $\log^2\log\log n$ and $\log^2\log n$ (medium-size patterns),
      $\le \log^2\log\log n$ (small patterns).
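The three length ranges can be turned into a small dispatch function to see where the thresholds fall for a given n. The sketch assumes base-2 logarithms and reads $\log^2\log n$ as $(\log\log n)^2$; the function name `pattern_class` and the tie-breaking at the boundaries are assumptions made for illustration.

```python
import math


def pattern_class(pattern_length, n):
    """Return which of the three structures would handle a pattern of this length."""
    loglog_n = math.log2(math.log2(n))
    long_threshold = loglog_n ** 2                # (log log n)^2
    small_threshold = math.log2(loglog_n) ** 2    # (log log log n)^2
    if pattern_length >= long_threshold:
        return "long"
    if pattern_length <= small_threshold:
        return "small"
    return "medium"


n = 2 ** 16   # log n = 16, log log n = 4, log log log n = 2
classes = [pattern_class(m, n) for m in (3, 4, 10, 16, 40)]
```

For $n = 2^{16}$ the thresholds are 16 and 4, so patterns of length 3 and 4 are small, length 10 is medium, and lengths 16 and 40 are long.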

  25. Data structure for long patterns (sketch)
      Group text symbols into meta-symbols of size $d = \log\log n / (4 \log\sigma)$. There are $\sigma^d = \log^{1/4} n$ meta-symbols.
      Updates are done using the suffix tree construction, spending $O(\log\log n)$ time on each meta-symbol (i.e. amortized $O(1)$ time on each symbol).
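The arithmetic behind the meta-symbol size can be checked numerically: since $\sigma^d = 2^{d \log\sigma} = 2^{(\log\log n)/4}$, the number of meta-symbols is $\log^{1/4} n$. The sketch below assumes base-2 logarithms, rounds d down to an integer, and cuts the text into consecutive blocks from the left; the function names and these choices are assumptions for illustration, not details taken from the talk.

```python
import math


def meta_symbol_size(n, sigma):
    """d = log log n / (4 log sigma), rounded down and kept at least 1."""
    d = math.log2(math.log2(n)) / (4 * math.log2(sigma))
    return max(1, math.floor(d))


def to_meta_symbols(text, d):
    # Cut the text into consecutive blocks of d symbols (the last block may be shorter).
    return [text[i:i + d] for i in range(0, len(text), d)]


n, sigma = 2 ** 256, 2                 # log n = 256, log log n = 8
d = meta_symbol_size(n, sigma)         # 8 / (4 * 1) = 2
alphabet_size = sigma ** d             # number of distinct meta-symbols
blocks = to_meta_symbols("abbaba", d)
```

With $n = 2^{256}$ and a binary alphabet, d is 2 and there are 4 meta-symbols, which matches $\log^{1/4} n = 256^{1/4} = 4$.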
