
slide-1
SLIDE 1

String indexing in the Word RAM model, part 2

Paweł Gawrychowski

University of Wrocław & Max-Planck-Institut für Informatik

Paweł Gawrychowski String indexing in the Word RAM model II 1 / 29

slide-2
SLIDE 2

Even though we showed yesterday that storing just 2n values of lcp(i, j) allows us to execute the binary search efficiently, being able to answer any lcp(i, j) query would be great (we will see why during the exercises). Recall that we were able to reduce the question to the so-called RMQ problem.

RMQ

Given an array A[1..n], preprocess it so that the minimum of any fragment A[i], A[i + 1], . . . , A[j] can be computed efficiently. First observe that answering any query in O(1) time is trivial if we allow O(n^2) time and space preprocessing.
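As a concrete (if wasteful) baseline, the trivial O(n^2) scheme can be sketched as follows; this is an illustration only, with 0-based indices rather than the slides' 1-based ones:

```python
# Trivial RMQ: precompute the minimum of every fragment A[i..j] in O(n^2)
# time and space, then answer any query in O(1) by a table lookup.

def precompute_all_minima(A):
    n = len(A)
    M = [[0] * n for _ in range(n)]      # M[i][j] = min(A[i..j])
    for i in range(n):
        M[i][i] = A[i]
        for j in range(i + 1, n):        # extend the fragment one element at a time
            M[i][j] = min(M[i][j - 1], A[j])
    return M

A = [3, 1, 4, 1, 5, 9, 2, 6]
M = precompute_all_minima(A)
print(M[2][6])  # min of A[2..6] = min(4, 1, 5, 9, 2) → 1
```

The quadratic table is exactly what the following slides improve on.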

Paweł Gawrychowski String indexing in the Word RAM model II 2 / 29


slide-5
SLIDE 5

Lemma

RMQ can be solved in O(1) time after O(n log n) time and space preprocessing.

To prove the lemma, we will (again) apply the simple-yet-powerful doubling technique. For each k = 0, 1, . . . , log n construct a table Bk with

Bk[i] = min{A[i], A[i + 1], A[i + 2], . . . , A[i + 2^k − 1]}.

How? Well, B0[i] = A[i], and Bk+1[i] = min(Bk[i], Bk[i + 2^k]). Hence we can easily answer a query concerning a fragment whose length is a power of 2. But, unfortunately, not all numbers are powers of 2...

Paweł Gawrychowski String indexing in the Word RAM model II 3 / 29


slide-8
SLIDE 8

...or are they? Naively, any query can be split into at most log n power-of-two queries. But better: any query can be covered with just 2 (possibly overlapping) power-of-two queries.

Answering a query concerning a range [i, j]

To figure out which two power-of-two queries should be asked, compute k = ⌊log(j − i + 1)⌋. Then return min(Bk[i], Bk[j − 2^k + 1]).

Paweł Gawrychowski String indexing in the Word RAM model II 4 / 29
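The whole doubling structure, together with the two-overlapping-ranges query, might look like this in code (a sketch with 0-based indices; `build_sparse_table` and `rmq` are illustrative names, not from the slides):

```python
def build_sparse_table(A):
    # B[k][i] = min(A[i .. i + 2^k - 1]); doubling: B[k+1][i] = min(B[k][i], B[k][i + 2^k])
    n = len(A)
    B = [A[:]]
    k = 0
    while (1 << (k + 1)) <= n:
        prev, step = B[k], 1 << k
        B.append([min(prev[i], prev[i + step]) for i in range(n - 2 * step + 1)])
        k += 1
    return B

def rmq(B, i, j):
    # k = floor(log2(j - i + 1)); two overlapping ranges of length 2^k cover [i, j]
    k = (j - i + 1).bit_length() - 1
    return min(B[k][i], B[k][j - (1 << k) + 1])

A = [3, 1, 4, 1, 5, 9, 2, 6]
B = build_sparse_table(A)
print(rmq(B, 2, 6))  # → 1
```

The tables use O(n log n) words in total, matching the lemma.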

slide-13
SLIDE 13

Lemma

RMQ can be solved in O(log n) time after O(n) time and space preprocessing.

We apply another simple-yet-powerful technique: micro-macro decomposition. Chop the input array into blocks of length b = log n. Construct a new array A′, where A′[i] = min{A[ib + 1], A[ib + 2], . . . , A[(i + 1)b]}. Build the previously described structure for A′. Since A′ has only n / log n entries, that structure now takes just O(n) space.

Paweł Gawrychowski String indexing in the Word RAM model II 5 / 29


slide-17
SLIDE 17

For each block, precompute the minimum in each prefix and each suffix, which takes just O(n) time and space. Then, using the structure built for A′, we can answer any query in O(1) time. Unfortunately, life is not that simple. But the only case when we cannot answer a query in O(1) time is when the range lies strictly inside a single block. There, revert to the naive one-by-one computation!
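Putting the decomposition together, a sketch (0-based indices; the class layout and names are mine, and the macro structure is the doubling table from the previous lemma, rebuilt inline so the example is self-contained):

```python
class BlockRMQ:
    def __init__(self, A):
        self.A = A
        n = len(A)
        self.b = b = max(1, n.bit_length())            # block length ~ log2 n
        blocks = [A[s:s + b] for s in range(0, n, b)]
        macro = [min(blk) for blk in blocks]
        # doubling ("sparse table") structure over the macro array of block minima
        self.B = [macro[:]]
        k = 0
        while (1 << (k + 1)) <= len(macro):
            prev, step = self.B[k], 1 << k
            self.B.append([min(prev[t], prev[t + step])
                           for t in range(len(macro) - 2 * step + 1)])
            k += 1
        # per-block prefix and suffix minima
        self.pref, self.suf = [], []
        for blk in blocks:
            p, s = blk[:], blk[:]
            for t in range(1, len(blk)):
                p[t] = min(p[t - 1], blk[t])
            for t in range(len(blk) - 2, -1, -1):
                s[t] = min(s[t + 1], blk[t])
            self.pref.append(p)
            self.suf.append(s)

    def _macro_min(self, lo, hi):
        k = (hi - lo + 1).bit_length() - 1
        return min(self.B[k][lo], self.B[k][hi - (1 << k) + 1])

    def query(self, i, j):
        qi, qj = i // self.b, j // self.b
        if qi == qj:                                   # strictly inside one block:
            return min(self.A[i:j + 1])                # naive one-by-one scan, O(b)
        best = min(self.suf[qi][i - qi * self.b], self.pref[qj][j - qj * self.b])
        if qi + 1 <= qj - 1:                           # whole blocks in between
            best = min(best, self._macro_min(qi + 1, qj - 1))
        return best
```

Queries crossing a block boundary take O(1); the in-block fallback is what the next slides eliminate.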

Paweł Gawrychowski String indexing in the Word RAM model II 6 / 29


slide-21
SLIDE 21

OK, but we promised the best of both worlds: O(1) query and O(n) space.

Lemma

RMQ can be solved in O(1) time after O(n) time and space preprocessing. We “only” have to deal with the strictly-inside-a-block case. We will show how to do that for a very restricted case, when |A[i + 1] − A[i]| ≤ 1 (for the general case, wait for the exercises).

Paweł Gawrychowski String indexing in the Word RAM model II 7 / 29


slide-23
SLIDE 23

The exact values of the elements don't matter that much. So, for each block we compute its type, which is the sequence of differences A[i + 1] − A[i]. Additionally, for each such sequence we precompute the answers to all possible (b choose 2) queries. The answer is the position of the element with the smallest value.

How much space do we need for that?

3^(b−1) · (b choose 2) = O(3^b b^2).

As long as b ≤ 0.001 log n, this is small, i.e., o(n). Then to answer a query strictly inside a block, we look at its type, retrieve the precomputed answer, and then return the value at the corresponding position in A, all in O(1) time.
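The typing trick can be sketched as follows, with a tiny illustrative b = 4 (in the lemma b = Θ(log n), so 3^b · b^2 = o(n); here the point is only the mechanism):

```python
from itertools import product

# A block's type is its sequence of consecutive differences, each in {-1, 0, +1}
# under the restriction on the slide. Two blocks of the same type share the
# position of the minimum for every in-block query, so we precompute it once
# per type. Names (`answers`, `block_rmq`) are illustrative.

b = 4  # illustrative block length

answers = {}  # answers[type][(i, j)] = argmin position within any block of that type
for diffs in product((-1, 0, 1), repeat=b - 1):
    # a representative block starting at 0; only the differences matter
    vals = [0]
    for d in diffs:
        vals.append(vals[-1] + d)
    table = {}
    for i in range(b):
        for j in range(i, b):
            table[(i, j)] = min(range(i, j + 1), key=lambda t: vals[t])
    answers[diffs] = table

def block_rmq(block, i, j):
    # look up the precomputed argmin for this block's type, then read the value in A
    t = tuple(block[k + 1] - block[k] for k in range(len(block) - 1))
    return block[answers[t][(i, j)]]

print(block_rmq([5, 4, 4, 3], 1, 3))  # type (-1, 0, -1); → 3
```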

Paweł Gawrychowski String indexing in the Word RAM model II 8 / 29


slide-26
SLIDE 26

Compressing lcp array

Suffix array clearly takes linear space: we only need to store the arrays SA, SA−1, lcp, and the RMQ structure over lcp. Sounds great, but if we take a closer look, it might substantially exceed the size of the input. For example, if our string is binary, we need only n bits to represent it, and then the whole machinery adds O(n) words, which is O(n log n) bits. Maybe we could do better?

Succinct RMQ

Given an array A[1..n], we can preprocess it using 2n + o(n) additional bits, so that the minimum of any fragment A[i], A[i + 1], . . . , A[j] can be computed in O(1) time. (We will see why during the exercises.)

OK, but what about the lcp array?

Paweł Gawrychowski String indexing in the Word RAM model II 9 / 29


slide-29
SLIDE 29

Recall that we have a nice observation about the lcp array: lcp[SA−1[i]] − 1 ≤ lcp[SA−1[i + 1]]. Define a(i) = lcp[SA−1[i]] + i − 1. Then: a(1) ≤ a(2) ≤ a(3) ≤ . . . ≤ a(n − 1) ≤ a(n). Furthermore, a(i) ∈ [0, n], because the length of w[i..n] is n − i + 1.

Paweł Gawrychowski String indexing in the Word RAM model II 10 / 29


slide-32
SLIDE 32

New (simpler) problem

How many bits of space do we need to store a nondecreasing sequence of numbers from [0, n]? We store the differences between every two consecutive a(i). The differences a′(i) = a(i) − a(i − 1) (where a(0) = 0) have the property that a′(i) ≥ 0 and Σᵢ a′(i) = a(n) ≤ n. So, it makes sense to store them as:

0^a′(1) 1 0^a′(2) 1 0^a′(3) 1 . . . 0^a′(n−1) 1 0^a′(n) 1

Extracting a(i) reduces to counting zeroes before the i-th one. We will show that a sequence of 2n bits can be stored using 2n + o(n) bits so that such an operation can be performed in O(1) time.
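The encoding, and the (for now linear-time) extraction, can be sketched as:

```python
# Encode a nondecreasing sequence a(1..n) from [0, n] as
# 0^{a'(1)} 1 0^{a'(2)} 1 ... 0^{a'(n)} 1, with a'(i) = a(i) - a(i-1).
# The differences sum to a(n) <= n, so there are at most 2n bits in total.

def encode(a):
    bits, prev = [], 0
    for v in a:
        bits.extend([0] * (v - prev))   # a'(i) zeroes
        bits.append(1)                  # the "one" marking position i
        prev = v
    return bits

def decode(bits, i):
    # a(i) = number of zeroes strictly before the i-th one. Here a linear scan;
    # the rank/select structure on the next slides makes this O(1).
    ones = zeros = 0
    for bit in bits:
        if bit == 1:
            ones += 1
            if ones == i:
                return zeros
        else:
            zeros += 1

a = [0, 2, 2, 5, 7, 7]
bits = encode(a)
print(len(bits), decode(bits, 4))  # 13 bits; a(4) = 5
```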

Paweł Gawrychowski String indexing in the Word RAM model II 11 / 29


slide-37
SLIDE 37

Rank/select structure

Given an n-bit string, we want to add just o(n) bits of additional information, which allow us to find in O(1) time: rank(i) = the number of ones at or before position i, and select(i) = the position of the i-th one.
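To pin down the (1-indexed) semantics, here is a plain linear-time reference implementation; the point of the structures on the following slides is to answer both queries in O(1) with only o(n) extra bits:

```python
def rank(bits, i):
    # number of ones among positions 1..i
    return sum(bits[:i])

def select(bits, i):
    # position of the i-th one (None if there are fewer than i ones)
    count = 0
    for pos, bit in enumerate(bits, start=1):
        count += bit
        if bit and count == i:
            return pos

bits = [1, 0, 1, 1, 0, 0, 1]
print(rank(bits, 4), select(bits, 3))  # → 3 4
```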

Paweł Gawrychowski String indexing in the Word RAM model II 12 / 29

slide-38
SLIDE 38

Rank

Tabulation

Let k = (1/2) log n. There are just √n different binary strings of such size, so we can afford to precompute, for each such string, the answer for each possible rank query. The space required is just O(2^k k log k) = o(n). Now split the long string into fragments of length k. Store each such fragment in a single word, so that we can look up the precomputed information quickly. Then, for each boundary between two fragments, store the cumulative rank. Total space is (n/k) · log n = Θ(n) bits, too much.

Paweł Gawrychowski String indexing in the Word RAM model II 13 / 29


slide-41
SLIDE 41

But we can do better. Split the long string into fragments of length log^2 n. For each boundary between two fragments, store the cumulative rank. This takes just O(n / log n) bits.

Then split each fragment into sub-fragments of size k. For each sub-fragment, store the cumulative rank within the fragment. This takes just O(n log log n / log n) bits.

And we are done.
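The two-level scheme plus the tabulation trick for the tail might be sketched like this (the class name and the exact fragment sizes are illustrative choices, not fixed by the slides):

```python
class TwoLevelRank:
    def __init__(self, bits):
        n = len(bits)
        self.k = k = max(1, n.bit_length() // 2)   # sub-fragment length ~ (1/2) log n
        self.big = big = k * k                     # fragment length ~ log^2 n
        self.chunks = []                           # each sub-fragment packed into an int
        self.abs_cnt = []                          # ones before each fragment (absolute)
        self.rel_cnt = []                          # ones before each sub-fragment, within its fragment
        total = rel = 0
        for s in range(0, n, k):
            if s % big == 0:                       # fragment boundary: absolute count
                self.abs_cnt.append(total)
                rel = 0
            self.rel_cnt.append(rel)               # sub-fragment boundary: relative count
            chunk = bits[s:s + k]
            self.chunks.append(sum(b << t for t, b in enumerate(chunk)))
            rel += sum(chunk)
            total += sum(chunk)
        # tabulation: number of ones in every k-bit value restricted to each prefix length
        self.table = [[bin(v & ((1 << p) - 1)).count("1") for p in range(k + 1)]
                      for v in range(1 << k)]

    def rank(self, i):                             # ones in positions 1..i (1-indexed)
        q, r = divmod(i, self.k)
        if r == 0 and q > 0:                       # i lands exactly on a boundary
            q, r = q - 1, self.k
        ans = self.abs_cnt[(q * self.k) // self.big] + self.rel_cnt[q]
        return ans + self.table[self.chunks[q]][r]

bits = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]
r = TwoLevelRank(bits)
print(r.rank(7))  # → 4
```

Each query touches one absolute count, one relative count, and one table cell: O(1).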

Paweł Gawrychowski String indexing in the Word RAM model II 14 / 29


slide-44
SLIDE 44

Select

Similar, but more complicated. Because we are looking for the i-th one, we split into fragments with the same number of ones, instead of equal size. Let t1 = log n · log log n. We pick every t1-th one and store its index in the whole string. This takes O((n/t1) · log n) = o(n) bits. Then, given a query, we divide it by t1 to locate the desired fragment. Hence from now on we can focus on single fragments.
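A sketch of this first sampling level only (the scan inside a fragment is exactly what the further levels of the structure remove; `t1` is a small illustrative constant here, not log n · log log n):

```python
class SampledSelect:
    def __init__(self, bits, t1=4):
        self.bits, self.t1 = bits, t1
        self.samples = [0]                 # samples[q] = position of the (q*t1)-th one
        count = 0
        for pos, bit in enumerate(bits, start=1):
            count += bit
            if bit and count % t1 == 0:
                self.samples.append(pos)

    def select(self, i):                   # position of the i-th one, 1-indexed
        q = (i - 1) // self.t1             # fragment containing the i-th one
        pos = self.samples[q]              # q*t1 ones lie at or before this position
        need = i - q * self.t1
        while need:                        # scan forward inside the fragment
            pos += 1
            need -= self.bits[pos - 1]
        return pos

bits = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1]
s = SampledSelect(bits, t1=4)
print(s.select(5))  # → 9
```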

Paweł Gawrychowski String indexing in the Word RAM model II 15 / 29


slide-46
SLIDE 46

Let r be the total number of bits in a fragment.

If r > t1^2, things are sparse. There can be at most n / t1^2 such fragments, and we can afford to store the index of each one in such a fragment explicitly.

If r ≤ t1^2, we cannot repeat the above simple trick, but things are not very bad, either. The fragment is short and relative indices can be stored. More specifically, we repeat the reasoning, and split into subfragments containing t2 = (log log n)^2 ones. For each one we pick, we store its relative index, which takes O((n/t2) · log log n) bits in total.

Then, again, we consider the total number of bits r in a subfragment.

Paweł Gawrychowski String indexing in the Word RAM model II 16 / 29


slide-49
SLIDE 49

If r > t2^2, things are sparse, and we store the relative index of each one. There are at most n / t2^2 such subfragments, each contains t2 ones, and relative indices take log log n bits.

If r ≤ t2^2, then r ≤ (1/2) log n, and we use the tabulation trick.

Total space is O(n / log log n) bits.

It often happens in this area that o(n) means "something just a little bit below n", which is surely not what we would like if the results are to be of any relevance to the real world, but...

Pătrașcu 2008

For any constant c, rank/select can be implemented in n + O(n / log^c n) bits of space.

Paweł Gawrychowski String indexing in the Word RAM model II 17 / 29


slide-53
SLIDE 53

Now we can store the lcp array and the RMQ structure in 4n + o(n) bits. But we still need to store SA, so we need n log n bits (we might also need to store SA−1, which is another n log n bits). Now we will see how to decrease this bound!

Paweł Gawrychowski String indexing in the Word RAM model II 18 / 29

slide-54
SLIDE 54

Compressed suffix arrays

A text of length n over Σ can be stored in n log |Σ| bits. Now if Σ is small (think binary), the n log n bits taken by the suffix array is way too much.

Compressed suffix arrays

Represent SA in o(n log n) bits of space, so that we can efficiently implement lookup(i), which returns SA[i]. (We don't care about extracting SA−1.)

Grossi and Vitter 2000

For any constant ε > 0, SA can be represented using just (1 + 1/ε) n log |Σ| + o(n log |Σ|) bits, so that lookup(i) takes O(log^ε n) time.

Paweł Gawrychowski String indexing in the Word RAM model II 19 / 29


slide-57
SLIDE 57

Can we do even better?

The empirical entropy is the average number of bits per symbol needed to encode the text.

Entropy (or zeroth order empirical entropy)

H0(T) = Σ_{c∈Σ} (n_c / n) log(n / n_c), where n_c is the number of occurrences of character c in T.

k-th order empirical entropy

Hk(T) = (1/n) Σ_{s∈Σ^k} |T_s| H0(T_s), where T_s is the concatenation of all characters in T following an occurrence of s.

It is known that Lempel-Ziv compression methods approach the k-th order empirical entropy.
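The two definitions transcribe directly into code (a sketch; no claim about efficiency, and `H0`/`Hk` are just the formulas above applied to a Python string):

```python
from collections import Counter
from math import log2

def H0(T):
    # zeroth order empirical entropy: sum over characters of (n_c/n) * log(n/n_c)
    n = len(T)
    return sum((nc / n) * log2(n / nc) for nc in Counter(T).values())

def Hk(T, k):
    # T_s = concatenation of the characters following each occurrence of s in T
    n = len(T)
    contexts = {}
    for i in range(n - k):
        s = T[i:i + k]
        contexts[s] = contexts.get(s, "") + T[i + k]
    return sum(len(Ts) * H0(Ts) for Ts in contexts.values()) / n

print(H0("abracadabra"))     # about 2.04 bits per symbol
print(Hk("abracadabra", 1))  # 6/11: knowing the previous character helps a lot
```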

Paweł Gawrychowski String indexing in the Word RAM model II 20 / 29


slide-61
SLIDE 61

Can we do even better?

Now we would like to represent SA in space proportional to the k-th order empirical entropy of the text.

Sadakane 2003

For any constants ε, ε′ > 0, SA can be represented using H0(T) · n · (1 + ε′)/ε + n(2 log(1 + H0(T)) + 3) + o(n) bits, so that lookup(i) takes O((1/(εε′)) · log^ε n) time, assuming |Σ| = polylog(n).

Grossi, Gupta, Vitter 2003

SA can be represented using Hk(T) · n + O(n log |Σ| · (log log n)/(log n)) bits.

These bounds are painful to look at, so we will ignore them.

Paweł Gawrychowski String indexing in the Word RAM model II 21 / 29


slide-65
SLIDE 65

Grossi and Vitter

We will assume |Σ| = 2. SA can be represented in (1/2) · n log log n + 6n + O(n/log log n) bits, so that lookup(i) takes O(log log n) time.

Paweł Gawrychowski String indexing in the Word RAM model II 22 / 29

slide-66
SLIDE 66

SA0 is the suffix array of the original string w = w0. We create a new string w1 by chopping w0 into blocks of two characters w[2]w[3], w[4]w[5], . . . , and treating each such block as a single letter. In other words, we keep only the suffixes starting at even positions. SA1 is the suffix array constructed for w1. Is there any relation between SA0 and SA1? In other words, assuming that we can perform lookup(i) on SA1, can we implement lookup(i) on SA0 if we add just a little additional data?

Paweł Gawrychowski String indexing in the Word RAM model II 23 / 29
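The halving step can be checked with a naive sketch. Assumptions are mine: Python strings are 0-based, so the slide's 1-based even positions become odd indices, and I pick an odd-length w so every block has two letters:

```python
def suffix_array(s):
    # Naive O(n^2 log n) construction; fine for a demo.
    return sorted(range(len(s)), key=lambda i: s[i:])

w = "abracadabra"                                   # odd length: full blocks
w1 = [w[i:i + 2] for i in range(1, len(w) - 1, 2)]  # blocks of two letters
SA0 = suffix_array(w)
SA1 = sorted(range(len(w1)), key=lambda j: w1[j:])  # suffix array of w1

# The suffixes of w at (1-based) even positions, read off in SA0 order,
# appear in exactly the order given by SA1.
even_in_SA0 = [p for p in SA0 if p % 2 == 1]
print(even_in_SA0 == [2 * j + 1 for j in SA1])      # True
```

So SA1 is just SA0 restricted to the even suffixes, which is the relation the next slides exploit.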


slide-69
SLIDE 69


i:     1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32
T:     a  b  b  a  b  b  a  b  b  a  b  b  a  b  a  a  a  b  a  b  a  b  b  a  b  b  b  a  b  b  a  #
SA0:  15 16 31 13 17 19 28 10  7  4  1 21 24 32 14 30 12 18 27  9  6  3 20 23 29 11 26  8  5  2 22 25
B0:    0  1  0  0  0  0  1  1  0  1  0  0  1  1  1  1  1  1  0  0  1  0  1  0  0  0  1  1  0  1  1  0
rank0: 0  1  1  1  1  1  2  3  3  4  4  4  5  6  7  8  9 10 10 10 11 11 12 12 12 12 13 14 14 15 16 16
Ψ0:    2  2 14 15 18 23  7  8 28 10 30 31 13 14 15 16 17 18  7  8 21 10 23 13 16 17 27 28 21 30 31 27

i:     1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
SA1:   8 14  5  2 12 16  7 15  6  9  3 10 13  4  1 11

1. If SA0[i] is even, then we return 2 · SA1[i′], where i′ is the number of even suffixes in SA0[1..i].

2. If SA0[i] is odd, then we return 2 · SA1[i′] − 1, where i′ is the number of even suffixes in SA0[1..j] and j is chosen so that SA0[j] = SA0[i] + 1.

Ψ0(i) = i if SA0[i] is even, and Ψ0(i) = j with SA0[j] = SA0[i] + 1 if SA0[i] is odd.

In both cases, augmenting B0 with a rank structure reduces the problem to storing Ψ0 in small space.

Paweł Gawrychowski String indexing in the Word RAM model II 24 / 29
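Putting the two cases together, lookup on SA0 reduces to lookup on SA1 plus B0, rank0 and Ψ0. A naive 1-based sketch (all names and the brute-force construction are mine; in the real structure, SA1 would itself be stored recursively rather than as an array):

```python
def build(w):
    """Brute-force construction of the slide's components; len(w) even."""
    n = len(w)
    SA0 = [0] + sorted(range(1, n + 1), key=lambda p: w[p - 1:])
    inv = {SA0[i]: i for i in range(1, n + 1)}       # inverse of SA0
    B0 = [0] + [1 if SA0[i] % 2 == 0 else 0 for i in range(1, n + 1)]
    rank0 = [0] * (n + 1)
    for i in range(1, n + 1):                         # prefix sums of B0
        rank0[i] = rank0[i - 1] + B0[i]
    # Psi0[i] = i if SA0[i] is even, else the j with SA0[j] = SA0[i] + 1
    Psi0 = [0] + [i if B0[i] else inv[SA0[i] + 1] for i in range(1, n + 1)]
    # SA1[i'] = block number (position / 2) of the i'-th even suffix
    SA1 = [0] + [SA0[i] // 2 for i in range(1, n + 1) if B0[i]]
    return SA0, B0, rank0, Psi0, SA1

def lookup(i, B0, rank0, Psi0, SA1):
    if B0[i]:                          # even suffix: use SA1 directly
        return 2 * SA1[rank0[i]]
    j = Psi0[i]                        # position of the even successor
    return 2 * SA1[rank0[j]] - 1

w = "abbabbabbabbabaaab"               # any even-length binary-ish string
SA0, B0, rank0, Psi0, SA1 = build(w)
print(all(lookup(i, B0, rank0, Psi0, SA1) == SA0[i]
          for i in range(1, len(w) + 1)))            # True
```

Note the odd case costs one extra Ψ0 hop, which is why the recursive structure answers lookup in O(1) extra work per level.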


slide-74
SLIDE 74

Storing Ψ0

Ψ0[i] is the position of the even successor of SA0[i] in the suffix array. We need to compress all Ψ0[i] corresponding to odd suffixes. But the values don’t seem to have any special structure... Or do they? Let’s look at the Ψ0[i] such that B0[i] = 0 and T[SA0[i]] = a. The indices are 1, 3, 4, 5, 6, 9, 11, 12 and the values are 2, 14, 15, 18, 23, 28, 30, 31. So, all Ψ0[i] such that B0[i] = 0 can be decomposed into two increasing lists, one per character. If the alphabet is larger, we just have more lists!

Paweł Gawrychowski String indexing in the Word RAM model II 25 / 29
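The observation can be verified directly: for a fixed first character, the odd suffixes appear in SA0 in the order of their successors, so the corresponding Ψ0 values increase. A naive, 1-based sketch (function name mine; w must have even length so every odd suffix has an even successor):

```python
def psi_lists(w):
    """For each first character c, collect Psi0 values (positions of the
    even successors in SA0) of the odd suffixes starting with c."""
    n = len(w)
    SA0 = [0] + sorted(range(1, n + 1), key=lambda p: w[p - 1:])
    inv = {SA0[i]: i for i in range(1, n + 1)}
    groups = {}
    for i in range(1, n + 1):
        p = SA0[i]
        if p % 2 == 1:                        # odd suffix
            groups.setdefault(w[p - 1], []).append(inv[p + 1])
    return groups

g = psi_lists("abbabbabbabbabaaab")
print(all(v == sorted(v) for v in g.values()))   # each list is increasing
```

The reason each list is increasing: two odd suffixes with the same first character compare exactly as their successor suffixes do, so their Ψ0 values inherit the sorted order of SA0.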


slide-77
SLIDE 77

Storing Ψ0

We generate a list of pairs (T[SA0[i]], Ψ0[i]) for all i such that B0[i] = 0. So, to store all Ψ0[i] in small space, it is enough to show how to store an increasing list of numbers. This sounds promising, as storing an increasing list is easier than storing an arbitrary list! (We saw this a few slides ago.)

Recursion

We will recurse on SA0, SA1, SA2, SA3, . . . In SAk, our alphabet is of size 2^(2^k), because we are operating on blocks of 2^k characters from the original text. So storing Ψk reduces to storing an increasing list of nk/2 numbers consisting of 2^k + log nk bits each, where nk = n/2^k.

Paweł Gawrychowski String indexing in the Word RAM model II 26 / 29


slide-79
SLIDE 79

Lemma

A list of nk/2 numbers consisting of 2^k + log nk bits each can be stored in (1/2)n + (3/2)nk + O(nk/log log nk) bits of space.

We split every number into a prefix of length log nk and the remaining part. The suffixes are stored naively, taking 2^k bits each, so 2^k · nk/2 = n/2 bits in total. The prefixes are nondecreasing, so we store their differences, encoded in unary (as in the lcp representation), taking nk + (1/2)nk = (3/2)nk bits in total.

We augment the representation of the prefixes with a rank/select structure, so that we can extract any prefix in O(1) time. This adds O(nk/log log nk) bits.

Paweł Gawrychowski String indexing in the Word RAM model II 27 / 29
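The lemma's encoding can be sketched as follows. This is my own toy version: the fixed-width low bits are kept verbatim and the nondecreasing high parts become unary-coded differences, omitting the rank/select machinery that makes real decoding O(1):

```python
def encode(values, low_bits):
    """Split each number into a high 'prefix' and a fixed-width low part;
    store the lows verbatim and the nondecreasing highs in unary."""
    lows = [v & ((1 << low_bits) - 1) for v in values]
    unary, prev = [], 0
    for v in values:
        h = v >> low_bits
        unary.append("0" * (h - prev) + "1")  # diff zeros, then a terminator
        prev = h
    return lows, "".join(unary)

def decode(lows, unary, low_bits):
    out, h, i = [], 0, 0
    for bit in unary:
        if bit == "0":
            h += 1                            # advance the current high part
        else:
            out.append((h << low_bits) | lows[i])
            i += 1
    return out

vals = [2, 14, 15, 18, 23, 28, 30, 31]        # the 'a'-list from the example
lows, unary = encode(vals, 2)
print(decode(lows, unary, 2) == vals)         # True
```

The space matches the lemma's accounting: m low parts of low_bits each, plus a unary stream of m ones and at most max(high) zeros.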


slide-83
SLIDE 83

Final space bound

We use such an encoding at every level. When nk ≤ n/log n we terminate and switch to the naive representation, so there are log log n levels. Then the total space (in bits) is:

(n/log n) · log n + Σ_{k=0}^{log log n} [ (1/2)n + (3/2)nk + O(nk/log log nk) ]

and the query time is O(log log n).

Paweł Gawrychowski String indexing in the Word RAM model II 28 / 29

slide-84
SLIDE 84

Questions?

Paweł Gawrychowski String indexing in the Word RAM model II 29 / 29