Full-fledged Real-Time Indexing for Constant Size Alphabets




SLIDE 1

Full-fledged Real-Time Indexing for Constant Size Alphabets

Gregory Kucherov, CNRS/LIGM, Marne-la-Vallée, France
Yakov Nekrich, University of Kansas, USA
ICALP’13, July 11, 2013

SLIDE 2

Context and history Suffix Tree Supporting real-time

String Matching and Indexing

string matching: find all occurrences of a pattern P in a text T
string matching: P is fixed (or given first)
indexing: T is fixed (or given first)
real-time processing: reading the data online and spending O(1) time on each character

SLIDE 3

History of the Problem and Related Work

SLIDE 4

Real-time string matching vs. Real-time indexing

language {P#T : P occurs in T} can be recognized in real time by a Turing machine [Galil 81]
language {T#P : P occurs in T} cannot be recognized in real time by a (multi-tape) TM [Freidzon 68]

SLIDE 5

Indexing under RAM model

{T#P : P occurs in T} can be recognized in real time on RAM [Slisenko 76-78]
same result in [Kosaraju STOC 94]
there is an index of T that can be updated in real time such that for any pattern query P made at any moment, one can check whether P occurs in the current T in time O(|P|) [Amir, Nor SODA 08]. The result assumes a constant-size alphabet.

SLIDE 6

Indexing under RAM model

Our result: an index that can be updated in real time such that all occurrences of P in the current text are reported in time O(|P| + nb_occ), where nb_occ is the number of occurrences. The result assumes a constant-size alphabet.

SLIDE 7

Updating a Suffix Tree

SLIDE 8

Suffix Tree

abbabac

[Figure: suffix tree of the string abbabac]
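The tree for abbabac can be reproduced with a naive quadratic-time construction. This is only a sketch for checking small examples, not one of the linear-time algorithms discussed in the talk. The function names `suffix_trie` and `compact` are illustrative, and the code assumes the last letter of the text occurs nowhere else (as c does in abbabac), so that every suffix ends in a leaf.

```python
def suffix_trie(text):
    """Insert every suffix of text into a plain (uncompacted) trie of nested dicts."""
    root = {}
    for i in range(len(text)):
        node = root
        for ch in text[i:]:
            node = node.setdefault(ch, {})
    return root


def compact(node):
    """Merge chains of single-child nodes so that every edge carries a string label."""
    out = {}
    for ch, child in node.items():
        label = ch
        while len(child) == 1:
            (nxt, grandchild), = child.items()
            label += nxt
            child = grandchild
        out[label] = compact(child)
    return out


tree = compact(suffix_trie("abbabac"))
# Internal nodes are the root, a, ab, b and ba; the 7 leaves are the 7 suffixes.
```

The edge labels of `tree` (a, c, b, babac, ac, b, babac, a, bac, c, c) add up to 22 letters, the same count as the edge labels of the tree on the slide.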

SLIDE 9

Suffix Tree

Three classical linear-time algorithms for constructing a suffix tree:
[Weiner 73]: right-to-left construction
[McCreight 76]: left-to-right
[Ukkonen 95]: left-to-right online
Weiner’s algorithm is more suitable for real-time processing, as only a constant number of changes is made at each letter

SLIDE 10

Towards real-time construction of suffix tree

[Amir, Kopelowitz, Lewenstein, Lewenstein SPIRE 05]: O(log n) worst-case per symbol, unbounded alphabet
[Breslauer, Italiano SPIRE 11]: O(log log n) worst-case per symbol, constant alphabet
[Kopelowitz FOCS 12]: O(log log n + log log σ) expected worst-case per symbol, unbounded alphabet
[Fischer, Gawrychowski arXiv 13]: O(log log n + log^2 log σ / log log log σ) worst-case per symbol, unbounded alphabet

SLIDE 11

Towards real-time construction of suffix tree

This work: O(log log n) worst-case per symbol, log-size alphabet

SLIDE 12

Weiner’s algorithm: W-links

W-links: for every node v and every letter a, P_a(v) = av, provided that node av exists.
The target of a W-link can be an explicit or an implicit node; the W-link is called hard or soft, respectively.
Lemma: A soft W-link P_a(v) is defined iff there is a unique closest descendant u of v such that P_a(u) is hard; P_a(v) then points to the edge (w, P_a(u)), where w is the parent of P_a(u).

[Figure: suffix tree of abbabac]
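The hard/soft distinction can be checked by brute force on the running example. The sketch below represents each node by the string it spells, so the W-link for letter a from node v exists exactly when av occurs in the text, and it is hard when av is itself an explicit node. The helper names `explicit_nodes` and `w_links` are hypothetical, the running time is far from linear, and the code again assumes that the last letter of the text is unique.

```python
def explicit_nodes(text):
    """Explicit suffix-tree nodes as strings: the root, all suffixes (leaves),
    and every substring followed by at least two distinct letters."""
    n = len(text)
    nodes = {""} | {text[i:] for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            s = text[i:j]
            followers = {text[k + len(s)]
                         for k in range(n - len(s)) if text.startswith(s, k)}
            if len(followers) >= 2:
                nodes.add(s)
    return nodes


def w_links(text):
    """Map (letter a, node v) to 'hard' or 'soft' for every defined W-link."""
    nodes = explicit_nodes(text)
    links = {}
    for v in nodes:
        for a in set(text):
            target = a + v
            if target in nodes:
                links[(a, v)] = "hard"      # av is an explicit node
            elif target in text:
                links[(a, v)] = "soft"      # av exists only inside an edge
    return links


links = w_links("abbabac")
# For example, the b-link of node a is hard (ba is a branching node),
# while the b-link of node b is soft (bb is always followed by a).
```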

SLIDE 13

Main idea of Weiner’s algorithm

transforming the suffix tree for t into the suffix tree for at:
find the lowest ancestor u of t with a W-link P_a(u)
P_a(u) is the branching point

abbabac ⇒ babbabac

[Figure: suffix tree of abbabac before the new suffix babbabac is inserted]

SLIDE 14

Main idea of Weiner’s algorithm

[Figure: the same tree after the new suffix babbabac has been inserted]
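For the example abbabac ⇒ babbabac, the branching point can be found by brute force as the longest prefix of at that already occurs in t. This is only a sanity check of the idea, with a hypothetical function name; Weiner’s algorithm locates the same point through W-links instead of scanning.

```python
def branching_point(t, a):
    """Return (branching point, label of the new leaf edge) when a is prepended to t.

    The branching point is the longest prefix of a + t that occurs in t;
    the rest of a + t labels the edge leading to the new leaf.
    """
    new_suffix = a + t
    k = 0
    while k < len(new_suffix) and new_suffix[:k + 1] in t:
        k += 1
    return new_suffix[:k], new_suffix[k:]


point, leaf_label = branching_point("abbabac", "b")
# point == "bab", leaf_label == "babac"
```

Here the branching point bab is the target of the soft W-link for b from node ab, and ab is indeed the lowest ancestor of the leaf abbabac that has a W-link for b.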

SLIDE 15

Our implementation of Weiner

Main ideas:
we store only hard W-links; soft W-links are computed “on the fly”
we maintain a list L_W corresponding to the Euler tour of the tree
each node u with a defined hard W-link W_a(u) is “colored” by a in L_W

[Figure: leaf t with the nodes v1 and v2]

SLIDE 16

Our implementation of Weiner

Lemma: To find the deepest ancestor u of t with a defined (possibly soft) W-link W_a(u), let v1 (resp. v2) be the closest node colored with a preceding (resp. following) t in L_W. Then u is the deeper of the two nodes lca(t, v1) and lca(t, v2).
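The lemma can be tried out on abbabac with a deliberately naive stand-in for the data structures. In the sketch below, nodes are represented by their strings, the lexicographically sorted list of colored nodes plays the role of L_W (an assumption: for a leaf t, its neighbours in sorted order are used in place of its neighbours in the Euler tour), and the lca of two nodes is taken as their longest common prefix. All function names are hypothetical, and nothing here achieves the O(log log n) bounds.

```python
from bisect import bisect_left


def explicit_nodes(text):
    """Root, suffixes and branching substrings (assumes a unique last letter)."""
    n = len(text)
    nodes = {""} | {text[i:] for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            s = text[i:j]
            followers = {text[k + len(s)]
                         for k in range(n - len(s)) if text.startswith(s, k)}
            if len(followers) >= 2:
                nodes.add(s)
    return nodes


def lcp(x, y):
    """Longest common prefix; for two node strings this is their lca."""
    k = 0
    while k < min(len(x), len(y)) and x[k] == y[k]:
        k += 1
    return x[:k]


def deepest_ancestor_with_wlink(t, a):
    """Deepest ancestor u of leaf t with a (possibly soft) W-link for letter a."""
    nodes = explicit_nodes(t)
    colored = sorted(v for v in nodes if a + v in nodes)  # nodes with a hard a-link
    pos = bisect_left(colored, t)
    candidates = []
    if pos > 0:
        candidates.append(lcp(t, colored[pos - 1]))   # lca(t, v1)
    if pos < len(colored):
        candidates.append(lcp(t, colored[pos]))       # lca(t, v2)
    return max(candidates, key=len) if candidates else None


u = deepest_ancestor_with_wlink("abbabac", "b")
# u == "ab", so the branching point for babbabac is b + u == "bab".
```

For the letter b the colored nodes are the root, a, abac, ac and babac; the neighbours of t = abbabac are v1 = abac and v2 = ac, with lca(t, v1) = ab and lca(t, v2) = a, so u = ab.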

SLIDE 17

Our implementation of Weiner

[Figure: the tree with t, v1, v2 and the resulting node u marked]

SLIDE 19

Tools that we use

Colored Predecessor in a List
Problem: Maintain a dynamic list L (under insertions) whose elements are assigned natural numbers (“colors”).
Colored predecessor queries: given an element e ∈ L and a color c, retrieve the closest element e′ ∈ L preceding e with color c.
Theorem [Mortensen SODA 03; Giyora, Kaplan 09]: If the number of colors is smaller than log^{1/4} n, then there exists an O(|L|)-space data structure that supports updates in O(log log |L|) time and answers colored predecessor queries in O(log log |L|) time.
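The interface of the colored predecessor problem can be sketched with a plain Python list. This is a linear-time illustration of the operations only; it is not the structure of the theorem, and the class and method names are made up for this example.

```python
class ColoredList:
    """Naive dynamic list whose elements each carry one color."""

    def __init__(self):
        self.items = []  # (element, color) pairs in list order

    def insert(self, index, element, color):
        """Insert element with the given color at position index."""
        self.items.insert(index, (element, color))

    def colored_predecessor(self, element, color):
        """Closest element strictly before `element` that has `color`, or None."""
        pos = next(i for i, (e, _) in enumerate(self.items) if e == element)
        for e, c in reversed(self.items[:pos]):
            if c == color:
                return e
        return None


lst = ColoredList()
for i, (name, color) in enumerate([("p", 1), ("q", 2), ("r", 1), ("s", 2)]):
    lst.insert(i, name, color)
# lst.colored_predecessor("s", 1) returns "r"
```

A colored successor query, as needed for v2 in the lemma on the W-links, would scan forward in the same way.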

SLIDE 20

Tools that we use (cont.)

Dynamic Lowest Common Ancestor (LCA)
Problem: Maintain a dynamic tree (leaf insertion/deletion, edge split, edge merge) supporting lowest common ancestor queries for two nodes.
Theorem [Cole, Hariharan 05]: both updates and queries can be supported in worst-case O(1) time.
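The operations of the dynamic LCA problem can be illustrated with parent pointers. The sketch below answers a query by climbing to the root, so it takes time proportional to the depth rather than O(1); the class name and method names are assumptions made for this example, and leaf deletion and edge merge are left out.

```python
class DynamicTree:
    """Naive rooted tree with parent pointers."""

    def __init__(self, root):
        self.parent = {root: None}

    def add_leaf(self, leaf, parent):
        """Attach a new leaf below an existing node."""
        self.parent[leaf] = parent

    def split_edge(self, child, new_node):
        """Insert new_node in the middle of the edge (parent(child), child)."""
        self.parent[new_node] = self.parent[child]
        self.parent[child] = new_node

    def lca(self, x, y):
        """Lowest common ancestor of x and y, found by climbing to the root."""
        ancestors = set()
        while x is not None:
            ancestors.add(x)
            x = self.parent[x]
        while y not in ancestors:
            y = self.parent[y]
        return y


tree = DynamicTree("root")
tree.add_leaf("x", "root")
tree.add_leaf("y", "root")
tree.split_edge("x", "m")   # root -> m -> x
tree.add_leaf("z", "m")
# tree.lca("x", "z") returns "m"; tree.lca("z", "y") returns "root"
```

Edge split followed by leaf insertion is the pair of updates a Weiner step performs when the branching point lies inside an edge.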

SLIDE 21

What we obtained so far

Theorem: We can maintain a suffix tree of a right-to-left streaming text by spending O(log log n) worst-case time on each symbol, assuming an alphabet of size ≤ log^{1/4} n.
This simplifies and (slightly) generalizes [Breslauer, Italiano 11].

SLIDE 22

Our solution to real-time text indexing

SLIDE 23

Fully real-time text indexing on constant-size alphabet

Main idea: Maintain three distinct data structures for patterns of length
≥ log^2 log n (long patterns),
between log^2 log log n and log^2 log n (medium-size patterns),
≤ log^2 log log n (small patterns)
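The three length ranges can be made concrete with a small helper. This sketch assumes base-2 logarithms and reads log^2 log n as (log log n)^2; the slides fix neither the base nor which class the exact boundary lengths fall into, so both choices below are assumptions, as is the function name.

```python
import math


def pattern_class(m, n):
    """Classify a pattern of length m against a text of length n (n > 4)."""
    loglog = math.log2(math.log2(n))
    long_threshold = loglog ** 2               # log^2 log n
    small_threshold = math.log2(loglog) ** 2   # log^2 log log n
    if m >= long_threshold:
        return "long"
    if m <= small_threshold:
        return "small"
    return "medium"


# For n = 2**16: log log n = 4, so the thresholds are 16 and 4.
classes = [pattern_class(m, 2 ** 16) for m in (3, 10, 16)]
# classes == ["small", "medium", "long"]
```

The example shows how slowly the thresholds grow: even for a text of 65536 symbols, every pattern of 16 or more symbols already counts as long.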

SLIDE 24

Data structure for long patterns (sketch)

Group text symbols into meta-symbols of size d = log log n / (4 log σ). There are σ^d = log^{1/4} n meta-symbols.

SLIDE 25

Data structure for long patterns (sketch)

Updates are done using the suffix tree construction, spending O(log log n) time on each meta-symbol (i.e. amortized O(1) time on each symbol).

SLIDE 26

Data structure for long patterns (sketch)

To match a long pattern P, consider all offsets δ, 0 ≤ δ ≤ d − 1. For each δ, P can be matched in time O(|P|/d + log log n + nb_occ_δ) using colored range reporting (details left out).
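Why d offsets are enough can be seen with a brute-force sketch: an occurrence of P starts δ symbols before some meta-symbol boundary, so P splits into a head of δ symbols that ends a block, followed by whole blocks and possibly a partial last block. The code below only illustrates this alignment argument by scanning a left-to-right blocked text; it uses neither the suffix tree over meta-symbols nor colored range reporting, and the function names are hypothetical.

```python
def to_blocks(text, d):
    """Cut text into meta-symbols of d symbols; a shorter last block is kept as is."""
    return [text[i:i + d] for i in range(0, len(text), d)]


def find_with_offsets(text, pattern, d):
    """All occurrences of pattern, found separately for each offset delta."""
    assert len(pattern) >= d, "sketch assumes a pattern of at least d symbols"
    blocks = to_blocks(text, d)
    hits = []
    for delta in range(d):
        head = pattern[:delta]                 # must end the block before b
        body = to_blocks(pattern[delta:], d)   # aligned with blocks b, b+1, ...
        for b in range(1 if delta else 0, len(blocks)):
            if delta and not blocks[b - 1].endswith(head):
                continue
            window = blocks[b:b + len(body)]
            if len(window) != len(body):
                continue
            last = len(body) - 1
            # Inner blocks must match exactly; the last one only as a prefix.
            if all(w.startswith(p) if k == last else w == p
                   for k, (w, p) in enumerate(zip(window, body))):
                hits.append(b * d - delta)
    return sorted(hits)


positions = find_with_offsets("abbabacabbab", "bab", 2)
# positions == [2, 9]: offset 0 finds position 2, offset 1 finds position 9
```

Each occurrence is reported for exactly one δ, which is why the occurrence counts nb_occ_δ add up to nb_occ.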

SLIDE 27

Data structure for long patterns (sketch)

Overall we obtain time O(d(|P|/d + log log n) + nb_occ) = O(|P| + nb_occ), as |P| ≥ log^2 log n.
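The two identities used for long patterns can be spelled out as follows. This is a worked check under the assumptions that logarithms are base 2, that σ ≥ 2 (so log σ ≥ 1), and that d = log log n / (4 log σ) as defined for the meta-symbols.

```latex
% Number of distinct meta-symbols
\sigma^{d} \;=\; 2^{\,d \log \sigma}
           \;=\; 2^{\,\frac{\log\log n}{4}}
           \;=\; (\log n)^{1/4}

% Total query time over all d offsets, using |P| \ge \log^{2}\log n
d\left(\frac{|P|}{d} + \log\log n\right)
  \;=\; |P| + \frac{\log^{2}\log n}{4\log\sigma}
  \;\le\; |P| + \frac{|P|}{4\log\sigma}
  \;=\; O(|P|)
```

The additive log log n term per offset is therefore paid for by the minimal length of a long pattern.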

SLIDE 28

Medium-size and small patterns

Medium-size patterns: similar to long patterns:
the suffix tree stores truncated suffixes of length log^2 log n over meta-symbols of log log log n text symbols
update takes O(log log log n) time per meta-symbol
when matching a pattern, the overhead is O(log^2 log log n), which is subsumed by the minimal pattern length

SLIDE 29

Medium-size and small patterns

Small patterns: Tabulate all possible trees and all possible updates

SLIDE 30

Turning it fully real-time

Three more problems must be overcome to make this solution real-time.
Problem 1: Block processing should be deamortized.

[Figure: text positions i+d, i, i−d and pattern P]

SLIDE 31

Turning it fully real-time

Problem 2: Most recent blocks should have a special treatment.

SLIDE 32

Turning it fully real-time

Problem 3: Text length n is unknown.

[Figure: text lengths n/2, n, 2n]

SLIDE 33

Summary: Main result
For a streaming text over a constant-size alphabet, there exists a data structure that can be updated in real time such that at any moment, all positions of any pattern P in the current text can be reported in time O(|P| + nb_occ).
For full details see arXiv:1302.4016