

SLIDE 1

String indexing in the Word RAM model, part 4

Paweł Gawrychowski

University of Wrocław & Max-Planck-Institut für Informatik

Paweł Gawrychowski String indexing in the Word RAM model IV 1 / 32

SLIDE 2

We consider a fundamental data structure question: how to represent a tree?

(Compacted) Trie

A trie is simply a tree with edges labeled by single characters. A compacted trie is created by replacing maximal chains of unary vertices with single edges labeled by (possibly long) words.

Navigation queries

Given a pattern p, we want to traverse the edges of a compacted trie to find the node corresponding to p. If there is no such node, we would like to compute the longest prefix of p for which the corresponding node does exist.
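As a concrete illustration of these two definitions, the following sketch (my own, not the lecture's structure) stores a compacted trie as nested dictionaries, where each node maps the first character of an outgoing edge to a (label, child) pair, and walks it to answer a navigation query:

```python
def navigate(root, p):
    """Return the longest prefix of p that ends exactly at a node."""
    node, matched = root, 0
    while matched < len(p):
        edge = node.get(p[matched])
        if edge is None:
            break
        label, child = edge
        # The whole (possibly long) edge label must match to reach the child.
        if p.startswith(label, matched):
            matched += len(label)
            node = child
        else:
            break
    return p[:matched]

# A tiny compacted trie storing the words "abra" and "abba".
trie = {"a": ("ab", {"r": ("ra", {}), "b": ("ba", {})})}
print(navigate(trie, "abra"))   # -> "abra"
print(navigate(trie, "abrak"))  # -> "abra" (no node below)
print(navigate(trie, "axe"))    # -> "" (mismatch inside the first edge)
```

Note that a prefix ending in the middle of an edge does not count as a node, matching the definition above.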

SLIDE 3

Consider p = wewpxcwrehyzrt and the following compacted trie.

[Figure: an example compacted trie whose edges carry long labels such as wewpxc, hyugfecvbx, tovndfed, qtkjdknewnbog, and povmnxd.]


SLIDE 7

Splitting an edge

Given an edge, we want to split it into two parts by (possibly) creating a middle node, and adding a new edge outgoing from this middle node.

[Figure: an edge labeled abrakadabra split into two shorter edges, with a new edge attached at the freshly created middle node.]

Notice that this covers adding a new edge outgoing from an existing node.
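A minimal sketch of the split operation (my own illustration, on a nested-dictionary trie where each node maps the first character of an outgoing edge to a (label, child) pair):

```python
# Split the edge starting with character c below `node` at offset k,
# creating a middle node, then hang a new edge `new_label` off it.

def split_edge(node, c, k, new_label, new_child):
    label, child = node[c]
    assert 0 < k < len(label) and new_label
    middle = {label[k]: (label[k:], child),          # lower half of the old edge
              new_label[0]: (new_label, new_child)}  # the new outgoing edge
    node[c] = (label[:k], middle)                    # upper half of the old edge

trie = {"a": ("abrakadabra", {})}
split_edge(trie, "a", 5, "xyz", {})
print(trie["a"][0])  # -> "abrak"
```

Splitting at offset 0 is the degenerate case of adding a new edge at an existing node, as noted on the slide (the sketch asserts 0 < k for simplicity).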


SLIDE 9

Static case (yesterday)

Given a compacted trie, can we quickly construct a small structure which allows us to execute navigation queries efficiently?

Dynamic case

Can we maintain a compacted trie so that:

1. the resulting structure is small,
2. we can execute navigation queries efficiently,
3. we can split any edge efficiently?

There are clearly three parameters: the number of nodes in the compacted trie n, the size of the alphabet σ, and the length of the pattern m. We aim to achieve good bounds in terms of n, σ, and m.


SLIDE 12

It seems reasonable to consider the scenario where σ is non-constant, yet (significantly) smaller than n. Hence we get the following question: what are the best possible time bounds in terms of σ?

Gawrychowski and Fischer

There exists a deterministic linear-size structure supporting navigation in O(m + log²log σ / log log log σ) time and splitting edges in O(log²log σ / log log log σ) time.

To make the above result useful, we develop a suffix tree oracle which can be used to locate the edge which should be split after prepending a letter to the current text in O(log log n + log²log σ / log log log σ) time.

SLIDE 13

Let us consider the dynamic case, and assume that n = O(σ). Here, instead of the simple two-level scheme used in the static case, we need to partition the nodes into more groups.

Levels of nodes

Let f(ℓ) = 2^((3/2)^ℓ). We say that a node v is of level ℓ when the number of leaves in its subtree belongs to [f(ℓ), 2f(ℓ + 1)]. We will maintain the invariant that the level of v doesn't exceed the level of its parent. A fragment is a part of the tree consisting of nodes of the same level.
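To get a feel for how quickly these levels thin out, this small sketch (my own illustration) computes f(ℓ) = 2^((3/2)^ℓ) and the level implied by a node's leaf count; the doubly exponential growth is what makes any root-to-leaf path cross only O(log log n) levels:

```python
def f(level):
    """f(ℓ) = 2^((3/2)^ℓ): leaf-count thresholds grow doubly exponentially."""
    return 2.0 ** (1.5 ** level)

def level_of(leaves):
    """Largest ℓ with f(ℓ) <= leaves."""
    l = 0
    while f(l + 1) <= leaves:
        l += 1
    return l

print([round(f(l)) for l in range(5)])  # -> [2, 3, 5, 10, 33]
print(level_of(10**6))                  # -> 7: a million leaves, 8 levels total
```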



SLIDE 17

Now, we classify the edges into two types:

1. from a node to a node of the same level,
2. from a node to a node of a smaller level.

Edges of the first type are stored in a static dictionary with constant access time. We already know that such a dictionary can be constructed in close-to-linear time, and this turns out to be enough because of the way we defined the levels. More precisely, it cannot happen too often that the level of a node increases.

SLIDE 18

Now, we classify the edges into two types:

1. from a node to a node of the same level,
2. from a node to a node of a smaller level.

Edges of the second type are stored in a dynamic dictionary structure. For this we develop a weighted variant of the exponential search trees of Andersson and Thorup, which we call wexponential search trees.

Andersson and Thorup 2002

An exponential search tree is a dynamic predecessor structure storing a subset of [1, U] with O(log²log U / log log log U) time for insertions and predecessor queries.

SLIDE 19

Even without the modification, the query complexity is fairly decent, namely O(m + log³log σ / log log log σ). This is because there are at most t = Θ(log log σ) edges of type (2) on any path descending from the root.

[Figure: a root-to-leaf path with subtree weights w_t, w_(t−1), w_(t−2), w_(t−3), ..., where w_i ∈ [f(i), 2f(i + 1)].]

SLIDE 20

We want to be faster though. The subsequent accesses to the dynamic dictionary structures are not completely independent, so there is hope!

Wexponential search trees

There exists a linear-size dynamic structure storing a collection of n weighted elements from [1, U] with the following bounds:

1. predecessor search takes O(log(log W / log w) · log log U / log log log U), where W is the current total weight, and w is the weight of the predecessor,
2. inserting a new element of weight 1 takes O(log log W),
3. increasing the weight of an element of weight w by 1 takes O(log(log W / log w)).

SLIDE 21

Now if we use this structure instead of the standard exponential search trees, the total complexity of all queries at nodes where we decrease the current level becomes:

Σ_(i=0)^(t−1) log(log w_(i+1) / log w_i) · (log log U / log log log U)
  = (log log U / log log log U) · log log w_t
  ≤ (log log U / log log log U) · log log U
  = log²log U / log log log U

(this clearly ignores all the details necessary to show that the structures can be efficiently updated, which is not obvious...)

SLIDE 22

Wexponential search trees are based on a fairly simple idea (but the details, again, are many). Imagine that each element of weight w is a segment of length w, and draw all of them on a [1, W] segment. Then choose a set of roughly √W evenly spaced splitters. Store them in a static predecessor structure, and recursively build a smaller wexponential search tree for each of the resulting roughly √W subsets.

Beame and Fich STOC'99

A static predecessor search structure with O(log log σ / log log log σ) query time can be constructed in O(k^(1+ε)) time and space, where k is the number of elements.
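A toy illustration of the splitter idea (my own sketch, not the paper's implementation): place the weighted elements as consecutive segments on [1, W], cut at roughly √W evenly spaced points, and recurse on the groups between consecutive cuts. Heavy elements span many cut points and therefore end up high in the tree, which is what makes searches for heavy predecessors cheap:

```python
import bisect
import math

def build(elements):
    """elements: sorted list of (key, weight). Returns a recursive splitter tree."""
    W = sum(w for _, w in elements)
    if len(elements) <= 2:
        return {"leaf": elements}
    # Prefix sums place each element as a segment on [1, W].
    ends, acc = [], 0
    for _, w in elements:
        acc += w
        ends.append(acc)
    step = max(1, math.isqrt(W))  # ~sqrt(W) evenly spaced cut points
    groups, start = [], 0
    for cut in range(step, W + 1, step):
        i = bisect.bisect_left(ends, cut) + 1
        if i > start:
            groups.append(elements[start:i])
            start = i
    if start < len(elements):
        groups.append(elements[start:])
    return {"splitters": [g[-1][0] for g in groups],   # goes into a static predecessor structure
            "children": [build(g) for g in groups]}    # recursive subtrees

tree = build([("a", 1), ("b", 8), ("c", 1), ("d", 1), ("e", 1)])
print(tree["splitters"])  # -> ['b', 'e']: the heavy element "b" surfaces immediately
```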


SLIDE 25

What about the updates, i.e., splitting an edge? Then the number of leaves increases by one for many nodes, and we might need to increase the levels of some of them. More precisely, we look at the path from the new leaf to the root, and for every ℓ we might have the root r of a fragment of level ℓ such that its size was 2f(ℓ + 1), so after adding 1 its level must become ℓ + 1. We call this promoting r. We start at r and go down as long as there is a child of the current node with at least f(ℓ + 1) leaves in its subtree. The level of such a node can be safely increased to ℓ + 1, too.

We call the traversed path the tail. After increasing the levels of all nodes on the tail, the current fragment splits into multiple fragments, and the tail either creates a new fragment of level ℓ + 1, or gets attached to an already existing fragment of level ℓ + 1.


SLIDE 28

[Figure: the fragment root r, of weight 2f(ℓ + 1), with the tail descending from it; each subtree hanging off the last node on the tail has ≤ f(ℓ + 1) leaves.]

There are O(f(ℓ + 1)) nodes in the subtree of r, so we can traverse it. Additionally, we might need to rebuild the static dictionary at the parent of r and insert some elements into the wexponential search tree at the last node of the tail.

SLIDE 29

1. The static dictionary contains at most f(ℓ + 2)/f(ℓ + 1) elements, so rebuilding takes

   O((f(ℓ + 2)/f(ℓ + 1)) · log²log(f(ℓ + 2)/f(ℓ + 1))) = O(f²(ℓ + 2)/f²(ℓ + 1)).

2. We have at most 2f(ℓ + 1)/f(ℓ) elements to insert into the wexponential search tree. More precisely, we insert elements of weight 1, and then repeatedly increase their weights (there is one technical detail here: if the target weight is w, we actually increase the weight only to √w). Then the total time is

   O((f(ℓ + 1)/f(ℓ)) · √f(ℓ) · log log f(ℓ + 1)).


SLIDE 31

In total, the update time is O(f(ℓ + 2)/f(ℓ + 1) + f(ℓ + 1)), which is O(f(ℓ + 1)) by the choice of f.

Amortization

A fragment has max(0, w − f(ℓ + 1)) credits, where w is the number of leaves in the subtree of its root. So, we had f(ℓ + 1) credits before we started the whole process, and a closer inspection shows that the new fragments of level ℓ don't need any credits, so we can spend all of them!


SLIDE 34

Now we move to indexing a compressed text.

SLIDE 35

Lempel-Ziv based compression methods

Text t[1..N] is partitioned into disjoint blocks b1b2...bn. Each block is defined in terms of the blocks on its left. What exactly we mean by "defined" depends on the exact version. The most common are the following two:

LZ77, LZ: the next block bi is a subword of the already processed prefix concatenated with exactly one new character (zip, gzip, PNG).

LZ78, LZW: the next block bi is a block on the left concatenated with exactly one new character (compress, GIF, TIFF, PDF).

SLIDE 36

An example of LZW compression:

ababbababababababababaabbbaa

Even though n ∈ Ω(√N), compression and decompression are fast and simple, so the method is useful.
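The LZ78/LZW-style parsing described above can be sketched as follows (my own illustration, without the pre-seeded alphabet dictionary that real LZW uses): each block is the shortest prefix of the remaining text that has not yet been produced, i.e. a previously seen block plus one new character.

```python
def lzw_parse(text):
    """Split text into blocks, each a previously seen block plus one character."""
    blocks, seen = [], set()
    i = 0
    while i < len(text):
        j = i + 1
        # Extend the block while it is still a previously produced block;
        # the final block is allowed to be a repeat (hence j < len(text)).
        while j < len(text) and text[i:j] in seen:
            j += 1
        block = text[i:j]
        seen.add(block)
        blocks.append(block)
        i = j
    return blocks

print(lzw_parse("ababbababababababababaabbbaa"))
# -> ['a', 'b', 'ab', 'ba', 'bab', 'aba', 'baba', 'babab', 'aa', 'bb', 'baa']
```

On this highly repetitive example the block lengths keep growing, which is exactly why the number of blocks n can be much smaller than N.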


SLIDE 39

An example of LZ compression:

ababbababaaabbababaabaabbbaa

It is easy to construct an example where n = O(log N). Such an example will most probably not occur in practice, but this compression ratio is achieved for the Fibonacci words, which are often used as a "benchmark" for text algorithms. There is also the self-referential variant, where the new block can refer to itself.


SLIDE 43

The blocks are described by pairs (in LZW) or triples (in LZ):

...ababbababaaabbababaabaabbbaaa...

..., a, b, (1,2,b), (1,4,a), (1,1,a), (4,8,b), (11,4,b), (10,2,a), ...

p = aaab
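Producing such triples can be sketched as follows (my own greedy, quadratic illustration for clarity; literal characters come out as (0, 0, c) rather than bare letters as on the slide, and the parse is the non-self-referential variant):

```python
def lz77_parse(text):
    """Greedy LZ77 factorization into (start, length, next_char) triples.
    `start` is the 1-based position of the copied subword in the prefix."""
    triples, i = [], 0
    while i < len(text):
        best_len, best_start = 0, 0
        # Longest match of text[i:...] inside the already processed prefix,
        # leaving at least one character to play the role of next_char.
        for j in range(i):
            k = 0
            while i + k + 1 < len(text) and j + k < i and text[j + k] == text[i + k]:
                k += 1
            if k > best_len:
                best_len, best_start = k, j + 1
        triples.append((best_start, best_len, text[i + best_len]))
        i += best_len + 1
    return triples

print(lz77_parse("ababbab"))
# -> [(0, 0, 'a'), (0, 0, 'b'), (1, 2, 'b'), (1, 1, 'b')]
```

The third triple copies "ab" from position 1 and appends "b", reproducing the block (1,2,b) from the slide's example.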


SLIDE 46

Motivation

We want to store repetitive texts (say, genomic databases) in compressed form, but in such a way that we can search them quickly. In other words, given a text, build a small structure which allows fast pattern matching.

Pattern matching?

Given P[1..m], we want to find where it occurs exactly in a text S[1..n]. We might want the first occurrence, or all of them, or just a few... Such a structure is called an index. If it also allows retrieving the original text, it is called a self-index.


SLIDE 48

Problem, more precisely

We are asked to build a self-index for a string S[1..n] whose LZ77 parse consists of z phrases.

Why LZ77?

The number of phrases z is believed to be the right measure of how repetitive the text is. We want to use space proportional to z, not n.


SLIDE 51

Solution?

Straight-line program, or grammar representation

Simply a context-free grammar with exactly one production per nonterminal. It is known that, given an LZ77 parse consisting of z phrases, we can construct such a program consisting of just r = O(z log(n/z)) rules. The program can be assumed to be balanced, meaning that for each production A → BC we have |B| ≈ |C|. Extracting an arbitrary substring of length ℓ from a balanced SLP takes O(log n + ℓ) time.

But how do we search?!
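Extraction from an SLP can be sketched as follows (my own illustration, assuming the grammar is given as A → BC productions plus terminal rules; each character here costs the depth of the grammar, and the O(log n + ℓ) bound needs the slightly more careful walk that descends once and then traverses consecutive leaves):

```python
# rules: nonterminal -> (B, C) pair, or the terminal character it derives.
# lengths: nonterminal -> length of the string it derives.

def extract(rules, lengths, sym, i, l):
    """Return l characters of the expansion of `sym`, starting at offset i."""

    def char_at(s, k):
        while isinstance(rules[s], tuple):  # descend until a terminal rule
            b, c = rules[s]
            if k < lengths[b]:
                s = b                       # the k-th character is inside B
            else:
                k -= lengths[b]             # ... or inside C, shifted by |B|
                s = c
        return rules[s]

    return "".join(char_at(sym, k) for k in range(i, i + l))

# S -> AB, B -> AA, A -> ab derives "ababab" (a toy, perfectly balanced SLP).
rules = {"a": "a", "b": "b", "A": ("a", "b"), "B": ("A", "A"), "S": ("A", "B")}
lengths = {"a": 1, "b": 1, "A": 2, "B": 4, "S": 6}
print(extract(rules, lengths, "S", 1, 4))  # -> "baba"
```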


SLIDE 54

Framework (of Navarro)

[Figure: an example LZ77 self-index, reproduced from G. Navarro's talk "Indexing LZ77": the parse of a short example text together with the tries and the grid linking them.]


SLIDE 57

Idea

Secondary occurrence

An occurrence is secondary iff it is completely contained in some phrase.

Observation (by Kärkkäinen and Ukkonen?)

If the pattern occurs in the text, there is at least one primary occurrence.

Assuming we have all primary occurrences, all secondary occurrences can be found via 2-sided 2D range reporting.

SLIDE 58

Idea

Primary occurrence

An occurrence is primary iff it crosses some phrase boundary.

Observation (by Kärkkäinen and Ukkonen?)

If the pattern occurs in the text, there is at least one primary occurrence.

Assuming we have all primary occurrences, all secondary occurrences can be found via 2-sided 2D range reporting.


SLIDE 61

Idea, continued

To find all primary occurrences of P[1..m], for each 1 ≤ i ≤ m, we:

1. search for P[i + 1..m] in the Patricia tree of the suffixes starting at phrase boundaries,
2. search for (P[1..i])^R in the Patricia tree of the reversed phrases,
3. check the results via random access,
4. use range reporting to find all boundaries preceded by P[1..i] and followed by P[i + 1..m].
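The enumeration above can be sketched naively as follows (my own illustration: brute-force string comparison at each boundary stands in for the Patricia trees and the range-reporting grid, so it is quadratic rather than efficient):

```python
def primary_occurrences(text, boundaries, p):
    """Report (boundary, start) pairs where p crosses a phrase boundary.
    boundaries: sorted 0-based positions where a new phrase starts."""
    hits = set()
    for i in range(1, len(p)):            # split p into p[:i] and p[i:]
        left, right = p[:i], p[i:]
        for b in boundaries:
            # boundary b must be preceded by `left` and followed by `right`
            if text[max(0, b - i):b] == left and text.startswith(right, b):
                hits.add((b, b - i))
    return sorted(hits)

text = "ababbababa"
boundaries = [1, 2, 5]                    # phrases: a | b | abb | ababa
print(primary_occurrences(text, boundaries, "bab"))
# -> [(2, 1), (5, 4)]
```

The occurrence of "bab" at position 6 is not reported: it lies entirely inside the last phrase, i.e. it is secondary and would be recovered by range reporting instead.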


SLIDE 64

Bookmarking

Because we know that we will extract characters from the phrase boundaries, we can replace O(log n + ℓ) with the following bound (we skip the proof):

Lemma

Given a balanced SLP for S with r rules and integers b and g, we can store 2 log r + O(log g) bits such that later, given ℓ ≤ g, we can extract S[b − ℓ..b + ℓ] in O(ℓ + log g) time.

Corollary

Given b, we can store O(log∗ z) words such that, given any ℓ, we can extract S[b − ℓ..b + ℓ] in O(ℓ) time.


SLIDE 67

Space bounds (in words)

Patricia trees                O(z)
bookmarks                     O(z log∗ z)
1D range reporting            O(z log log z)
4-sided 2D range reporting    O(z log log z)
2-sided 2D range reporting    O(z)
total                         O(z log log z)

SLIDE 68

Time bounds

searching in Patricia trees (with perfect hashing if necessary)   O(m²)
extracting from bookmarks                                         O(m²)
1D or 4-sided 2D range reporting                                  O(m²)
2-sided 2D range reporting                                        O(occ log log n)
total                                                             O(m² + occ log log n)

SLIDE 69

Final result

Theorem

Given a balanced SLP for a string S[1..n] whose LZ77 parse consists of z phrases, we can add O(z log log z) words such that, given a pattern P[1..m], we can find all occ occurrences of P in O(m² + occ log log n) time.

Can we decrease the m² term?

Theorem

We can store a string S[1..n] whose LZ77 parse consists of z phrases in O(z log n) space, so that later, given a pattern P[1..m], we can find all occ occurrences of P in S in O(m log m + occ log log n) time. We may report false positives with low probability.

Idea: Karp-Rabin hashing instead of extracting.
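The final idea replaces character extraction by fingerprint comparison. A minimal Karp-Rabin sketch (my own illustration; the parameters and the prefix-hash scheme are standard choices, not the paper's exact construction):

```python
import random

# Karp-Rabin fingerprints: equal strings always get equal fingerprints,
# and distinct strings collide (a "false positive") with probability O(n/Q).

Q = (1 << 61) - 1               # Mersenne prime modulus
X = random.randrange(2, Q - 1)  # random base, fixed for the whole index

def prefix_hashes(s):
    """h[k] = fingerprint of s[:k]; enables O(1) substring fingerprints."""
    h = [0]
    for ch in s:
        h.append((h[-1] * X + ord(ch)) % Q)
    return h

def sub_hash(h, i, j):
    """Fingerprint of s[i:j] from precomputed prefix hashes."""
    return (h[j] - h[i] * pow(X, j - i, Q)) % Q

s = "ababbababa"
h = prefix_hashes(s)
# s[1:4] == "bab" == s[6:9], so the fingerprints agree:
print(sub_hash(h, 1, 4) == sub_hash(h, 6, 9))  # -> True
print(sub_hash(h, 1, 4) == sub_hash(h, 2, 5))  # -> False with high probability
```

Comparing two fingerprints costs O(1) regardless of the substring length, which is how the m² extraction cost drops to O(m log m) at the price of a small false-positive probability.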


SLIDE 72

Questions?