Full-fledged Real-Time Indexing for Constant Size Alphabets
Gregory Kucherov, CNRS/LIGM, Marne-la-Vallée, France
Yakov Nekrich, University of Kansas, USA
ICALP13, July 11, 2013
Outline : Context and history ; Suffix Tree ; Supporting real-time
String Matching and Indexing
string matching : find all occurrences of a pattern P in a text T
string matching : P is fixed (or given first)
indexing : T is fixed (or given first)
real-time processing : reading the data online and spending O(1) time on each character
History of the Problem and Related Work
Real-time string matching vs. Real-time indexing
language {P#T : P occurs in T} can be recognized in real time by a Turing machine [Galil 81]
language {T#P : P occurs in T} cannot be recognized in real time by a (multi-tape) TM [Freidzon 68]
Indexing under RAM model
{T#P : P occurs in T} can be recognized in real time on RAM [Slisenko 76-78]
same result in [Kosaraju STOC 94]
there is an index of T that can be updated in real time such that, for any pattern query P made at any moment, one can check whether P occurs in the current T in time O(|P|) [Amir, Nor SODA 08]. The result assumes a constant-size alphabet.
Our result : an index that can be updated in real time such that all occurrences of P in the current text are reported in time O(|P| + nb occ), where nb occ is the number of occurrences. The result assumes a constant-size alphabet.
Updating a Suffix Tree
Suffix Tree
abbabac
[Figure: suffix tree of abbabac]
Suffix Tree
Three classical linear-time algorithms for constructing a suffix tree
[Weiner 73] : right-to-left construction
[McCreight 76] : left-to-right
[Ukkonen 95] : left-to-right online
Weiner is more suitable for real-time as only a constant number of changes is made at each letter
Towards real-time construction of suffix tree
[Amir, Kopelowitz, Lewenstein, Lewenstein SPIRE 05] : O(log n) worst-case per symbol, unbounded alphabet
[Breslauer, Italiano SPIRE 11] : O(log log n) worst-case per symbol, constant alphabet
[Kopelowitz FOCS 12] : O(log log n + log log σ) expected worst-case per symbol, unbounded alphabet
[Fischer, Gawrychowski arxiv 13] : O(log log n + log^2 log σ / log log log σ) worst-case per symbol, unbounded alphabet
This work : O(log log n) worst-case per symbol, log-size alphabet
Weiner’s algorithm : W-links
W-links : for every node v and every letter a, P_a(v) = av, provided that node av exists
The target of a W-link can be an explicit or an implicit node. The W-link is called hard or soft, respectively
Lemma : A soft W-link P_a(v) is defined iff there is a unique closest descendant u such that P_a(u) is hard, and P_a(v) points to edge (w, P_a(u)), where w is the parent of P_a(u)
[Figure: suffix tree of abbabac illustrating W-links]
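The distinction between hard and soft W-links can be checked by brute force on the running example. The following is a minimal sketch, not the construction from the talk: it identifies each node with its path label and enumerates all substrings, which is quadratic at best. The function names `explicit_nodes` and `w_link` are hypothetical, and the sketch assumes the last character of the text is a unique terminator, as the letter c is in abbabac.

```python
def explicit_nodes(text):
    """Path labels of the explicit suffix tree nodes of text (brute force)."""
    # Assumption: the last character occurs only once, so every suffix is a leaf.
    assert text.count(text[-1]) == 1, "last character must be a unique terminator"
    n = len(text)
    followers = {}  # substring -> set of characters following it (None = end of text)
    for i in range(n + 1):
        for j in range(i, n + 1):
            nxt = text[j] if j < n else None
            followers.setdefault(text[i:j], set()).add(nxt)
    nodes = set()
    for label, nxt_chars in followers.items():
        is_root = label == ""
        is_leaf = None in nxt_chars          # label is a suffix of text
        is_branching = len(nxt_chars) >= 2   # label continues in two different ways
        if is_root or is_leaf or is_branching:
            nodes.add(label)
    return nodes


def w_link(text, nodes, v, a):
    """Classify the W-link of node v for letter a: None, ('hard', av) or ('soft', av)."""
    target = a + v
    if target not in text:       # av does not occur at all: no W-link
        return None
    kind = "hard" if target in nodes else "soft"
    return kind, target


text = "abbabac"
nodes = explicit_nodes(text)
print(sorted(nodes))
print(w_link(text, nodes, "b", "a"))    # target "ab" is a branching node
print(w_link(text, nodes, "ab", "b"))   # target "bab" lies inside an edge
```

In this example the soft link from ab for letter b lands inside the edge leading to babac, which is the hard W-link target of abac, the closest descendant of ab with a hard link for b.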
Main idea of Weiner’s algorithm
transforming the suffix tree for t into the suffix tree for at
find the lowest ancestor u of t with a W-link P_a(u)
P_a(u) is the branching point
abbabac ⇒ babbabac
[Figure: suffix tree of abbabac extended with the new suffix babbabac]
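The effect of one right-to-left step can be illustrated with an uncompressed suffix trie. This is a deliberately naive sketch under stated assumptions: it walks the new suffix from the root instead of following W-links, so each step costs time linear in the text, and the trie takes quadratic space. The class name `NaiveSuffixTrie` and its methods are hypothetical; the point is only to show where the branching point of the slide falls.

```python
class NaiveSuffixTrie:
    """Uncompressed trie of all suffixes of a text that grows to the left."""

    def __init__(self):
        self.root = {}   # nested dicts: character -> child node
        self.text = ""

    def prepend(self, a):
        """Turn the trie for t into the trie for at; return the branching point label."""
        self.text = a + self.text
        node, depth = self.root, 0
        # Follow the part of the new suffix that is already present in the trie.
        for ch in self.text:
            if ch not in node:
                break
            node = node[ch]
            depth += 1
        branching_point = self.text[:depth]
        # Hang the rest of the new suffix below the branching point.
        for ch in self.text[depth:]:
            node[ch] = {}
            node = node[ch]
        return branching_point

    def contains(self, pattern):
        """Check in O(|pattern|) dictionary steps whether pattern occurs in the text."""
        node = self.root
        for ch in pattern:
            if ch not in node:
                return False
            node = node[ch]
        return True


index = NaiveSuffixTrie()
for ch in reversed("abbabac"):   # the text arrives right to left
    index.prepend(ch)
print(index.prepend("b"))        # branching point for abbabac => babbabac
print(index.contains("babb"))
```

For abbabac ⇒ babbabac the walk leaves the existing trie after bab, which is the label a·u obtained from the lowest ancestor u = ab of the old leaf that has a W-link for b.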
Our implementation of Weiner
Main ideas :
we store only hard W-links; soft W-links are computed “on the fly”
we maintain a list L_W corresponding to the Euler tour of the tree
each node u with a defined hard W-link W_a(u) is “colored” by a in L_W
Lemma : To find the deepest ancestor u of t with a defined (possibly soft) W-link W_a(u), let v1 (resp. v2) be the closest node colored with a preceding (resp. following) t in L_W. Then u is the deeper of lca(t, v1) and lca(t, v2).
[Figure: leaf t, its colored neighbors v1 and v2 in L_W, and the ancestor u]
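The lemma can be tried out on the suffix tree of abbabac. The sketch below makes several simplifying assumptions that are not from the talk: nodes are identified with their path labels, the list L_W is replaced by a plain depth-first preorder of the explicit nodes (written out by hand in `NODES`), the colored neighbors are found by linear scans, and the lowest common ancestor is computed as a longest common prefix. All names are hypothetical.

```python
# Explicit nodes of the suffix tree of "abbabac", identified by their path
# labels and listed in depth-first preorder (children ordered by first letter).
NODES = ["", "a", "ab", "abac", "abbabac", "ac",
         "b", "ba", "babac", "bac", "bbabac", "c"]


def lca(x, y):
    """Lowest common ancestor of two nodes = longest common prefix of their labels."""
    k = 0
    while k < min(len(x), len(y)) and x[k] == y[k]:
        k += 1
    return x[:k]


def deepest_ancestor_with_w_link(nodes, t, a):
    """Deepest ancestor of leaf t whose W-link for letter a is defined (hard or soft)."""
    node_set = set(nodes)
    # A node v is colored by a when its hard W-link exists, i.e. av is an explicit node.
    colored = [a + v in node_set for v in nodes]
    pos = nodes.index(t)
    # Closest colored node before and after t in the list (naive linear scans).
    v1 = next((nodes[i] for i in range(pos - 1, -1, -1) if colored[i]), None)
    v2 = next((nodes[i] for i in range(pos + 1, len(nodes)) if colored[i]), None)
    candidates = [lca(t, v) for v in (v1, v2) if v is not None]
    return max(candidates, key=len) if candidates else None


print(deepest_ancestor_with_w_link(NODES, "abbabac", "b"))
```

For t = abbabac and a = b the colored neighbors are abac and ac, their lowest common ancestors with t are ab and a, and the deeper one, ab, is the node whose (soft) W-link gives the branching point bab.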
Tools that we use
Colored Predecessor in a List
Problem : Maintain a dynamic list L (under insertions) whose elements are assigned natural numbers (“colors”).
Colored predecessor queries : given an element e ∈ L and a color c, retrieve the closest element e′ ∈ L preceding e with color c
Theorem [Mortensen SODA 03 ; Giyora, Kaplan 09] : If the number of colors is smaller than log^{1/4} n, then there exists an O(|L|)-space data structure that supports updates in O(log log |L|) time and answers colored predecessor queries in O(log log |L|) time.
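To make the interface of the colored predecessor problem concrete, here is a naive reference sketch. It is not the data structure of the theorem: both operations scan the list, so they take linear time instead of O(log log |L|). The class name `NaiveColoredList` and its method names are hypothetical.

```python
class NaiveColoredList:
    """Insert-only list whose elements carry a color; every operation scans linearly."""

    def __init__(self):
        self.items = []  # (element, color) pairs in list order

    def _index(self, element):
        for i, (e, _) in enumerate(self.items):
            if e == element:
                return i
        raise KeyError(element)

    def insert_after(self, anchor, element, color):
        """Insert element right after anchor, or at the front if anchor is None."""
        pos = 0 if anchor is None else self._index(anchor) + 1
        self.items.insert(pos, (element, color))

    def colored_predecessor(self, element, color):
        """Closest element strictly before `element` that has the given color."""
        for e, c in reversed(self.items[:self._index(element)]):
            if c == color:
                return e
        return None


lst = NaiveColoredList()
lst.insert_after(None, "x", 1)
lst.insert_after("x", "y", 2)
lst.insert_after("y", "z", 1)
lst.insert_after("x", "w", 2)          # list order is now x, w, y, z
print(lst.colored_predecessor("z", 2))
```

A colored successor query, which the suffix tree update also needs, would be the symmetric scan to the right.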
Tools that we use (cont.)
Dynamic Lowest Common Ancestor (LCA)
Problem : Maintain a dynamic tree (leaf insertion/deletion, edge split, edge merge) supporting lowest common ancestor queries for two nodes
Theorem [Cole, Hariharan 05] : both updates and queries can be supported in worst-case O(1) time
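The operations of the dynamic LCA problem can be sketched with parent pointers only. This is a naive stand-in, not the structure of [Cole, Hariharan 05]: updates are O(1), but a query walks up to the root and so costs time proportional to the depth. The class name `NaiveDynamicTree` and its method names are hypothetical, and leaf deletion and edge merge are left out for brevity.

```python
class NaiveDynamicTree:
    """Rooted tree stored as parent pointers, with a walk-up LCA query."""

    def __init__(self, root):
        self.parent = {root: None}

    def add_leaf(self, parent, leaf):
        """Attach a new leaf below an existing node."""
        self.parent[leaf] = parent

    def split_edge(self, child, new_node):
        """Insert new_node on the edge between child and its current parent."""
        self.parent[new_node] = self.parent[child]
        self.parent[child] = new_node

    def lca(self, u, v):
        """Lowest common ancestor of u and v (a node counts as its own ancestor)."""
        ancestors = set()
        while u is not None:
            ancestors.add(u)
            u = self.parent[u]
        while v not in ancestors:
            v = self.parent[v]
        return v


tree = NaiveDynamicTree("r")
tree.add_leaf("r", "x")
tree.add_leaf("r", "y")
tree.split_edge("x", "m")   # r - m - x
tree.add_leaf("m", "z")
print(tree.lca("x", "z"))
```

Edge split followed by leaf insertion is the pattern a suffix tree update uses when the branching point falls inside an edge.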
What we obtained so far
Theorem : We can maintain a suffix tree of a right-to-left streaming text by spending O(log log n) worst-case time on each symbol, assuming an alphabet size ≤ log^{1/4} n.
Simplifies and (slightly) generalizes [Breslauer, Italiano 11]
Our solution to real-time text indexing
Fully real-time text indexing on constant-size alphabet
Main idea : Maintain three distinct data structures for patterns of length
≥ log^2 log n, i.e. (log log n)^2 (long patterns),
between log^2 log log n and log^2 log n (medium-size patterns),
≤ log^2 log log n (small patterns)
Data structure for long patterns (sketch)
Group text symbols into meta-symbols of size d = log log n/(4 log σ). There are σ^d = log^{1/4} n meta-symbols.
Updates are done using the suffix tree construction, spending O(log log n) time on each meta-symbol (i.e. amortized O(1) time on each symbol).
To match a long pattern P, consider all offsets δ, 0 ≤ δ ≤ d − 1. For each δ, P can be matched in time O(|P|/d + log log n + nb occ_δ) using colored range reporting (details left out).
Overall we obtain time O(d(|P|/d + log log n) + nb occ) = O(|P| + nb occ) as |P| ≥ log^2 log n
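The role of the offsets δ can be seen in a small brute-force sketch. It is only an illustration under stated assumptions: the block size d is passed in as a parameter rather than derived from n, the meta-symbol sequence is scanned linearly instead of being indexed by a suffix tree with colored range reporting, and the pattern is assumed to have length at least d. The function names are hypothetical.

```python
def split_into_blocks(s, d):
    """Cut s into consecutive meta-symbols of d characters; drop an incomplete last block."""
    return [s[i:i + d] for i in range(0, len(s) - d + 1, d)]


def match_with_offsets(text, pattern, d):
    """All occurrences of pattern in text, found separately for each offset delta."""
    assert len(pattern) >= d, "sketch assumes a pattern of at least d characters"
    meta_text = split_into_blocks(text, d)
    found = []
    for delta in range(d):
        # With offset delta, the first delta pattern characters end a text block,
        # the middle part is a sequence of whole meta-symbols, and the rest is a tail.
        head = pattern[:delta]
        body = split_into_blocks(pattern[delta:], d)
        tail = pattern[delta + d * len(body):]
        for j in range(len(meta_text) + 1):
            start = j * d - delta
            if start < 0:
                continue
            end_body = (j + len(body)) * d
            if (text[start:j * d] == head
                    and meta_text[j:j + len(body)] == body
                    and text[end_body:end_body + len(tail)] == tail):
                found.append(start)
    return sorted(found)


print(match_with_offsets("abbabacabbab", "abba", 3))
```

Each occurrence starting at position i is found for exactly one offset, the one with δ ≡ −i (mod d), which is why the d partial results can simply be concatenated.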
Medium-size and small patterns
Medium-size patterns : Similar to long patterns :
the suffix tree stores truncated suffixes of length log^2 log n over meta-symbols of log log log n text symbols
update takes O(log log log n) time per meta-symbol
when matching a pattern, the overhead is O(log^2 log log n), which is subsumed by the minimal pattern length
Small patterns : Tabulate all possible trees and all possible updates
Turning it fully real-time
Three more problems should be overcome to make this solution real-time
Problem 1 : Block processing should be deamortized.
Problem 2 : Most recent blocks should have a special treatment.
Problem 3 : Text length n is unknown.
[Figure: text lengths n/2, n, 2n]