String indexing in the Word RAM model, part 4
Paweł Gawrychowski
University of Wrocław & Max-Planck-Institut für Informatik
Paweł Gawrychowski String indexing in the Word RAM model IV 1 / 32
We consider a fundamental data structure question: how to represent a tree?
(Compacted) Trie
A trie is simply a tree with edges labeled by single characters. A compacted trie is created by replacing maximal chains of unary vertices with single edges labeled by (possibly long) words.
Navigation queries
Given a pattern p, we want to traverse the edges of a compacted trie to find the node corresponding to p. If there is no such node, we would like to find the longest prefix of p for which a corresponding node does exist.
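As a concrete toy illustration of these definitions, here is a minimal compacted trie in Python; the representation (a dict mapping the first character of each outgoing edge label to the label and the child) is my own choice, not from the lecture.

```python
# A minimal compacted trie: edges carry (possibly long) labels, and
# navigation reports the longest prefix of p that ends exactly at a node.

class Node:
    def __init__(self):
        self.edges = {}  # first character -> (label, child)

def insert(root, word):
    """Insert a word, splitting edges where labels diverge."""
    node, i = root, 0
    while i < len(word):
        c = word[i]
        if c not in node.edges:
            child = Node()
            node.edges[c] = (word[i:], child)
            return child
        label, child = node.edges[c]
        # longest common prefix of the label and the remaining word
        j = 0
        while j < len(label) and i + j < len(word) and label[j] == word[i + j]:
            j += 1
        if j == len(label):
            node, i = child, i + j
        else:
            mid = Node()                      # split the edge
            node.edges[c] = (label[:j], mid)
            mid.edges[label[j]] = (label[j:], child)
            node, i = mid, i + j
    return node

def navigate(root, p):
    """Return the length of the longest prefix of p ending at a node."""
    node, i, last = root, 0, 0
    while i < len(p) and p[i] in node.edges:
        label, child = node.edges[p[i]]
        if p[i:i + len(label)] != label:
            break
        i += len(label)
        node, last = child, i
    return last
```

For example, after inserting "abcde" and "abcxy", the nodes correspond to the prefixes "abc", "abcde", and "abcxy", and navigation with p = "abcd" stops at "abc".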
Consider p = wewpxcwrehyzrt and the following compacted trie.
Splitting an edge
Given an edge, we want to split it into two parts by (possibly) creating a node, and adding a new edge outgoing from this middle node.
Notice that this covers adding a new edge outgoing from an existing node.
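The edge-splitting operation can be sketched on a toy representation (a dict from the first character of an edge label to (label, child); all names here are illustrative, not from the lecture):

```python
# Split an edge at depth k of its label, (possibly) creating a middle node,
# and attach a new outgoing edge there. When k equals the label length, the
# "middle node" is an existing node, which covers adding a new edge to it.

class Node:
    def __init__(self):
        self.edges = {}  # first character -> (label, child)

def split_edge(parent, c, k, new_label):
    """Split parent --label--> child at depth k (0 < k <= len(label))."""
    label, child = parent.edges[c]
    if k == len(label):
        mid = child                        # split point is an existing node
    else:
        mid = Node()
        parent.edges[c] = (label[:k], mid)
        mid.edges[label[k]] = (label[k:], child)
    mid.edges[new_label[0]] = (new_label, Node())
    return mid
```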
Static case (yesterday)
Given a compacted trie, can we quickly construct a small structure which allows us to execute navigation queries efficiently?
Dynamic case
Can we maintain a compacted trie so that:
1. the resulting structure is small,
2. we can execute navigation queries efficiently,
3. we can split any edge efficiently?
There are three parameters: the number of nodes in the compacted trie n, the size of the alphabet σ, and the length of the pattern m. We aim to achieve good bounds in terms of n, σ, and m.
It seems reasonable to consider the scenario where σ is non-constant, yet (significantly) smaller than n. Hence we get the following question: what are the best possible time bounds in terms of σ?
Gawrychowski and Fischer
There exists a deterministic linear-size structure supporting navigation in O(m + log² log σ / log log log σ) time and splitting edges in O(log² log σ / log log log σ) time.
To make the above result useful, we develop a suffix tree oracle which can be used to locate the edge which should be split after prepending a letter to the current text in O(log log n + log² log σ / log log log σ) time.
Let us consider the dynamic case, and assume that n = O(σ). Here instead of the simple two-level scheme used in the static case we need to partition the nodes into more groups.
Levels of nodes
Let f(ℓ) = 2^((3/2)^ℓ). We say that a node v is of level ℓ when the number of leaves in its subtree belongs to [f(ℓ), 2f(ℓ + 1)]. We will maintain the invariant that the level of v does not exceed the level of its parent. A fragment is a part of the tree consisting of nodes at the same level.
Now, we classify the edges into two types:
1. from a node to a node of the same level,
2. from a node to a node of a smaller level.
Edges of type (1) are stored in a static dictionary with constant access time and close-to-linear construction time, and this turns out to be enough because of the way we defined the levels. More precisely, it cannot happen too often that the level of a node increases.
Edges of type (2) are stored in a dynamic dictionary structure. For this we develop a weighted variant of the exponential search trees of Andersson and Thorup, which we call the wexponential search trees.
Andersson and Thorup 2002
An exponential search tree is a dynamic predecessor structure storing a subset of [1, U] with O(log² log U / log log log U) time for insertions and predecessor queries.
Even without the modification, the query complexity is fairly decent, namely O(m + log³ log σ / log log log σ). This is because there are at most t = Θ(log log σ) edges of type (2) on any path descending from the root, and the subtree sizes w_t, w_{t−1}, w_{t−2}, … along such a path satisfy w_i ∈ [f(i), 2f(i + 1)].
We want to be faster though. The subsequent accesses to the dynamic dictionary structures are not completely independent, so there is hope!
Wexponential search trees
There exists a linear-size dynamic structure storing a collection of n weighted elements from [1, U] with the following bounds:
1. predecessor search takes O(log(log W / log w) · log log U / log log log U), where W is the current total weight, and w is the weight of the predecessor,
2. inserting a new element of weight 1 takes O(log log W),
3. increasing the weight of an element of weight w by 1 takes O(log(log W / log w)).
Now if we use this structure instead of the standard exponential search trees, the total complexity of all queries at nodes where we decrease the current level becomes:
Σ_i log(log w_{i+1} / log w_i) · log log U / log log log U
= (log log U / log log log U) · log log w_t
≤ (log log U / log log log U) · log log U
= log² log U / log log log U
(this clearly ignores all the details necessary to show that the structures can be efficiently updated, which is not obvious...)
Wexponential search trees are based on a fairly simple idea (but the details, again, are many). Imagine that each element of weight w is a fragment of such length, and draw all of them on a [1, W] segment. Then choose a set of roughly √ W evenly spaced splitters. Store them in a static predecessor structure, and recursively build a smaller wexponential search tree for each of the resulting roughly √ W subsets.
Beame and Fich STOC’99
A static predecessor search structure with O(log log σ / log log log σ) query time can be constructed in O(k^(1+ε)) time and space, where k is the number of elements.
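The splitter-choosing step above can be sketched as follows. This is a deliberate simplification (a real wexponential search tree recurses and rebuilds; here we only pick roughly √W evenly spaced splitters once), and all names are illustrative:

```python
# Draw each element as a segment of length equal to its weight on [1, W],
# then pick about sqrt(W) evenly spaced splitters. Heavy elements are
# likely to be hit by a splitter, so they surface near the root early.
import math
from itertools import accumulate

def choose_splitters(elements):
    """elements: sorted list of (key, weight). Returns the splitter keys."""
    W = sum(w for _, w in elements)
    k = max(1, math.isqrt(W))          # about sqrt(W) splitters
    step = W / k
    prefix = list(accumulate(w for _, w in elements))
    splitters, target = [], step
    for (key, _), p in zip(elements, prefix):
        if p >= target:                # this element's segment covers a mark
            splitters.append(key)
            while target <= p:         # skip all marks inside the segment
                target += step
    return splitters
```

An element of weight w covers about w / √W marks, so the heavier an element is, the shallower the recursion level at which it appears; this is the source of the log w terms in the bounds.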
What about the updates, i.e., splitting an edge? Then the number of leaves increases by one for many nodes, and we might need to increase the levels of some of them. More precisely, we look at the path from the new leaf to the root, and for every ℓ we might have the root r of a fragment of level ℓ such that its size was 2f(ℓ + 1), so after adding 1 its level must become ℓ + 1. We call this promoting at r. We start at r and go down as long as there is a child of the current node with at least f(ℓ + 1) leaves in its subtree.
We call the traversed path the tail. After increasing the levels of all nodes on the tail, the current fragment splits into multiple fragments, and the tail either creates a new fragment of level ℓ + 1, or gets attached to an already existing fragment.
There are O(f(ℓ + 1)) nodes in the subtree of r, so we can traverse it. Additionally, we might need to rebuild the static dictionary at the parent of the last node of the tail.
1. The static dictionary contains at most f(ℓ+2)/f(ℓ+1) elements, so rebuilding takes O((f(ℓ+2)/f(ℓ+1)) · log² log(f(ℓ+2)/f(ℓ+1))) = O(f²(ℓ+2)/f²(ℓ+1)).
2. We have at most 2f(ℓ+1)/f(ℓ) elements to insert into the wexponential search tree. More precisely, we insert elements of weight 1, and then repeatedly increase their weights (there is one technical detail here: if the target weight is w, we actually increase the weight to √w). Then the total time is O(f(ℓ+1)/f(ℓ) · …).
In total, the update time is O(f(ℓ+2)/f(ℓ+1) + f(ℓ + 1)), which is O(f(ℓ + 1)) by the choice of f.
Amortization
A fragment has max(0, w − f(ℓ + 1)) credits, where w is the number of leaves in the subtree of its root. So, we had f(ℓ + 1) credits before we started the whole process, and a closer inspection shows that the new fragments of level ℓ don’t need any credits, so we can spend all of them!
Now we move to indexing a compressed text.
Lempel-Ziv based compression methods
Text t[1..N] is partitioned into disjoint blocks b1 b2 … bn. Each block is defined in terms of the blocks on its left; what exactly we mean by “defined” depends on the exact version. The most common are the following two:
LZ77, LZ: the next block bi is a subword of the already processed prefix concatenated with exactly one new character (zip, gzip, PNG).
LZ78, LZW: the next block bi is a block on the left concatenated with exactly one new character (compress, GIF, TIFF, PDF).
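A toy LZ78-style parser illustrating the second definition (each block is a previous block plus one new character, output as (block index, character) pairs; this is the principle only, not the exact coding used by compress or GIF):

```python
# LZ78 parse: maintain a dictionary of all blocks seen so far; the next
# block is the longest known block extended by one new character.

def lz78_parse(text):
    dictionary = {"": 0}          # block string -> index; 0 is the empty block
    output, cur = [], ""
    for ch in text:
        if cur + ch in dictionary:
            cur += ch
        else:
            output.append((dictionary[cur], ch))
            dictionary[cur + ch] = len(dictionary)
            cur = ""
    if cur:                        # text may end in the middle of a block
        output.append((dictionary[cur[:-1]], cur[-1]))
    return output

def lz78_decode(pairs):
    blocks, out = [""], ""
    for idx, ch in pairs:
        b = blocks[idx] + ch
        blocks.append(b)
        out += b
    return out
```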
An example of LZW compression:
Even though n ∈ Ω(√N), compression and decompression are fast and simple, so the method is useful.
An example of LZ compression:
It is easy to construct an example where n = O(log N). Such an example will most probably not occur in practice, but this compression ratio is achieved for the Fibonacci words, which are often used as a “benchmark” for text algorithms. There is also the self-referential variant, where the new block can refer to itself.
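A toy, quadratic-time LZ77 parser (non-self-referential, matching the definition above: each phrase must occur in the already processed prefix) together with a Fibonacci word generator; on Fibonacci words the number of phrases indeed stays tiny compared to N:

```python
# Naive LZ77 parse: each phrase is the longest subword of the processed
# prefix, plus one new character. Quadratic time; illustration only.

def lz77_parse(text):
    phrases, i = [], 0
    while i < len(text):
        l = 0
        # longest prefix of text[i:] occurring inside text[:i]
        while i + l < len(text) and text[i:i + l + 1] in text[:i]:
            l += 1
        ch = text[i + l] if i + l < len(text) else ""
        phrases.append((text[i:i + l], ch))
        i += l + 1
    return phrases

def fibonacci_word(k):
    a, b = "b", "a"
    for _ in range(k):
        a, b = b, b + a
    return b
```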
The blocks are described by pairs (in LZW) or triples (in LZ):
We want to store repetitive texts (say, genomic databases) in compressed form, but such that we can search them quickly. In other words, given a text, build a small structure which allows fast pattern matching.
Pattern matching?
Given P[1..m] we want to find where it occurs exactly in a text S[1..n]. We might want the first occurrence, or all of them, or just a few... Such a structure is called an index. If it also allows retrieving the original text, it is called a self-index.
We are asked to build a self-index for a string S[1..n] whose LZ77 parse consists of z phrases.
Why LZ77?
The number of those phrases is believed to be the right measure of how repetitive the text is. We want to use space proportional to z, not n.
Straight-line program, or grammar representation
Simply a context-free grammar with exactly one production per nonterminal. It is known that given an LZ77 parse consisting of z phrases, we can construct such a program consisting of just r = O(z log(n/z)) rules. The program can be assumed to be balanced, meaning that for each production A → BC we have that |B| ≈ |C|. Extracting an arbitrary substring of length ℓ from a balanced SLP takes O(log n + ℓ) time.
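The extraction procedure can be sketched as a plain recursive descent (with a balanced SLP, locating the endpoints costs O(log n) and the middle is output in O(ℓ), which gives the stated bound; the class layout below is my own illustration):

```python
# An SLP: each symbol is either a terminal character or a pair (B, C).
# We store expansion lengths so that extraction descends only into the
# symbols that actually overlap the queried interval.

class SLP:
    def __init__(self):
        self.rules = {}    # symbol -> char (terminal) or (B, C)
        self.length = {}

    def add_terminal(self, sym, ch):
        self.rules[sym], self.length[sym] = ch, 1

    def add_rule(self, sym, b, c):
        self.rules[sym] = (b, c)
        self.length[sym] = self.length[b] + self.length[c]

    def extract(self, sym, i, j):
        """Characters i..j (0-based, inclusive) of the expansion of sym."""
        rule = self.rules[sym]
        if isinstance(rule, str):
            return rule
        b, c = rule
        lb = self.length[b]
        out = ""
        if i < lb:                         # interval overlaps the left child
            out += self.extract(b, i, min(j, lb - 1))
        if j >= lb:                        # interval overlaps the right child
            out += self.extract(c, max(0, i - lb), j - lb)
        return out
```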
Indexing LZ77: A LZ77 Self-Index
(Figure: an example text with its LZ77 parse and the components of the self-index.)
Primary occurrence
An occurrence is primary iff it crosses some phrase boundary.
Secondary occurrence
An occurrence is secondary iff it is completely contained in some phrase.
Observation (by Kärkkäinen and Ukkonen?)
If the pattern occurs in the text, there is at least one primary occurrence.
Assuming we have all primary occurrences, all secondary occurrences can be found via 2-sided 2D range reporting.
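The two definitions and the observation can be checked on a toy example by brute force (phrase_lengths and the helper below are my own illustration; "crossing" means the boundary falls strictly inside the occurrence):

```python
# Classify every occurrence of a pattern as primary (crosses a phrase
# boundary) or secondary (completely contained in some phrase).

def classify_occurrences(text, pattern, phrase_lengths):
    boundaries, pos = set(), 0
    for l in phrase_lengths:
        pos += l
        boundaries.add(pos)        # position right after each phrase
    primary, secondary = [], []
    start = text.find(pattern)
    while start != -1:
        # [start, start + m) crosses boundary b iff start < b < start + m
        if any(start < b < start + len(pattern) for b in boundaries):
            primary.append(start)
        else:
            secondary.append(start)
        start = text.find(pattern, start + 1)
    return primary, secondary
```

For text "abab" parsed as "a", "b", "ab", the occurrence of "ab" at position 0 crosses a boundary (primary), while the one at position 2 lies inside the last phrase (secondary).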
To find all primary occurrences of P[1..m], for each 1 ≤ i ≤ m, we:
1. search for P[i + 1..m] in the Patricia tree of the suffixes starting at phrase boundaries,
2. search for (P[1..i])^R in the Patricia tree of the reversed phrases,
3. check the results via random access,
4. use range reporting to find all boundaries preceded by P[1..i] and followed by P[i + 1..m].
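A brute-force stand-in for these steps (linear scans replace the Patricia trees and the range reporting; only splits with the boundary strictly inside the pattern are tried, consistent with the definition of a primary occurrence):

```python
# For every split P[:i] / P[i:], report the phrase boundaries preceded by
# P[:i] and followed by P[i:]; each hit is a primary occurrence.

def primary_occurrences(text, pattern, boundaries):
    """boundaries: start positions of phrases (excluding 0)."""
    found = set()
    m = len(pattern)
    for i in range(1, m):                 # boundary falls right after P[:i]
        left, right = pattern[:i], pattern[i:]
        for b in boundaries:
            if b - i < 0:
                continue
            if text[b - i:b] == left and text[b:b + m - i] == right:
                found.add(b - i)
    return sorted(found)
```

For text "ababab" with phrase starts [1, 2], the occurrence of "ba" at position 1 crosses the boundary at 2 and is reported, while the one at position 3 is contained in a phrase and is not.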
Because we know that we will extract characters from the phrase boundaries, we can replace O(log n + ℓ) with the following bound (we skip the proof):
Lemma
Given a balanced SLP for S with r rules and integers b and g, we can store 2 log r + O(log g) bits such that later, given ℓ ≤ g, we can extract S[b − ℓ..b + ℓ] in O(ℓ + log g) time.
Corollary
Given b, we can store O(log∗ z) words such that, given any ℓ, we can extract S[b − ℓ..b + ℓ] in O(ℓ) time.
Space usage:
Patricia trees: O(z)
bookmarks: O(z log* z)
1D range reporting: O(z log log z)
4-sided 2D range reporting: O(z log log z)
2-sided 2D range reporting: O(z)
total: O(z log log z)
Query time:
searching in Patricia trees (with perfect hashing if necessary): O(m²)
extracting from bookmarks: O(m²)
1D or 4-sided 2D range reporting: O(m²)
2-sided 2D range reporting: O(occ log log n)
total: O(m² + occ log log n)
Theorem
Given a balanced SLP for a string S[1..n] whose LZ77 parse consists of z phrases, then given a pattern P[1..m], we can find all occ occurrences of P in O(m² + occ log log n) time. Can we decrease the m²?
Theorem
We can store a string S[1..n] whose LZ77 parse consists of z phrases in O(z log n) space, so that later, given a pattern P[1..m], we can find all occ occurrences of P in S in O(m log m + occ log log n) time. We may report false positives with low probability. Idea: Karp-Rabin hashing instead of extracting.
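The Karp-Rabin idea can be sketched as follows (the base and modulus below are illustrative; in practice the base is chosen at random, which is what makes false positives unlikely):

```python
# Karp-Rabin fingerprints: compare substrings by a polynomial hash instead
# of extracting them; equality of fingerprints may be a false positive.

MOD = (1 << 61) - 1      # a Mersenne prime
BASE = 31                # in practice chosen at random from [2, MOD)

def fingerprint(s):
    h = 0
    for ch in s:
        h = (h * BASE + ord(ch)) % MOD
    return h

def kr_concat(hu, hv, lv):
    """Fingerprint of u + v from the fingerprints of u and v (|v| = lv)."""
    return (hu * pow(BASE, lv, MOD) + hv) % MOD
```

The concatenation rule is what lets the index combine precomputed fingerprints of phrase pieces instead of decompressing them.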