SLIDE 1

String indexing in the Word RAM model, part 3

Paweł Gawrychowski

University of Wrocław & Max-Planck-Institut für Informatik

Paweł Gawrychowski String indexing in the Word RAM model III 1 / 30

SLIDE 2

We want to reduce the space usage. The goal is to construct a structure of size (1 + 1/ε)n + O(n / log log n) bits that answers any lookup(i) in O(log^ε n) time, for any ε ∈ (0, 1].

Idea

We had ℓ = log log n levels of recursion. Now we will try to simulate jumping εℓ levels at once, so that we only have to store 1/ε levels.


SLIDE 4

The first step is to replace Ψ_k with Φ_k:

Φ_k(i) = j such that SA_k[j] = SA_k[i] + 1, and Φ_k(i) = 1 if SA_k[i] = n_k.

So, for every i, we store the position of the successor of SA_k[i]. Now if we store all Φ_k(i) in a list, then computing Φ_k(i) is simply taking the ith element of the list, and the vectors B_k are no longer necessary.
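The definition can be prototyped directly. This is a hypothetical 0-indexed sketch (the slide uses 1-based indices), and the wrap-around case SA_k[i] = n_k is mapped here to the position of suffix 0, a cyclic convention chosen for the sketch:

```python
def phi_from_sa(sa):
    # inv[v] = position of value v in sa
    inv = {v: j for j, v in enumerate(sa)}
    n = len(sa)
    # phi[i] = position j with sa[j] = sa[i] + 1; the last suffix wraps to
    # the position of suffix 0 (cyclic convention, for this sketch only)
    return [inv[(sa[i] + 1) % n] for i in range(n)]

# Suffix array of "banana$" (0-indexed, $ smallest)
sa = [6, 5, 3, 1, 0, 4, 2]
phi = phi_from_sa(sa)

# Following phi from the position of suffix 0 visits the positions of
# suffixes 0, 1, 2, ..., so sa is recoverable without storing it explicitly.
i, order = sa.index(0), []
for _ in sa:
    order.append(i)
    i = phi[i]
```

Here `order[t]` is the suffix-array position of the suffix starting at text index t, so inverting `order` reproduces `sa`.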


SLIDE 6

Lemma

Φ_0 can be stored in n + O(n / log log n) bits, so that accessing any entry takes O(1) time.

Lemma

For k > 0, Φ_k can be stored in n(1 + 1/2^(k−1)) + O(n / (2^k log log n)) bits, so that accessing any entry takes O(1) time.


SLIDE 8

Now, to determine SA[i] = SA_0[i], we use Φ_0 to walk along indices i, i′, i′′, ... such that SA_0[i] + 1 = SA_0[i′], SA_0[i′] + 1 = SA_0[i′′], ..., until we reach an index stored in SA_1. But how do we detect this?

Succinct dictionary

A bit vector B[1..n] in which only n′ entries are ones can be stored in O(log (n choose n′)) bits, so that a lookup and a rank query take O(1) time.

So we store, in such a succinct dictionary, all i such that SA_0[i] is divisible by 2^(εℓ). The length of the walk is then at most 2^(εℓ) = O(log^ε n).
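A toy version of this lookup, hedged: a plain Python set-membership test stands in for the succinct dictionary, an explicit dict stands in for the sampled values stored at the next level, indices are 0-based, and the wrap-around uses the cyclic convention (the sampling step 4 stands in for 2^(εℓ)):

```python
def lookup(i, phi, sampled, n):
    # Walk phi until reaching a position whose SA value is sampled; each step
    # increases the underlying SA value by 1, so subtract the number of steps.
    steps = 0
    while i not in sampled:          # membership test: the succinct dictionary
        i = phi[i]
        steps += 1
    return (sampled[i] - steps) % n  # sampled[i]: value kept at the next level

# Toy data: suffix array of "banana$" and its phi, with every SA value
# divisible by 4 sampled.
sa = [6, 5, 3, 1, 0, 4, 2]
inv = {v: j for j, v in enumerate(sa)}
phi = [inv[(v + 1) % len(sa)] for v in sa]
sampled = {j: v for j, v in enumerate(sa) if v % 4 == 0}
```

Every `lookup(i, ...)` recovers `sa[i]` while storing only `phi` and the sampled entries.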


SLIDE 11

But what is the space bound? It is

n log n / 2^ℓ + n + O(n / log log n) + Σ_{k=iεℓ, 0<i<ε⁻¹} [ n(1 + 1/2^(k−1)) + O(n / (2^k log log n)) ],

plus the space taken by the succinct dictionaries, which is O(n_εℓ · ℓ) = O(n log log n / log^ε n), so we get the claimed space complexity. The space taken by the dictionaries is bounded as follows:

1. for k = 0, we need O(log (n choose n_εℓ)) bits,

2. generally, at the kth super level we need O(log (n_kεℓ choose n_(k+1)εℓ)) bits, which is O(n_kεℓ · εℓ).


SLIDE 16

Succinct dictionaries

Pagh 2001

A static dictionary storing a subset of [1, U] of size n can be stored in B + O(log log U) + o(n) bits of space, where B = ⌈log₂ (U choose n)⌉, so that a membership query can be answered in O(1) time.

We can also add O(1)-time rank queries, but that requires a little extra work. We will see a small fragment of a much weaker result:

Brodnik and Munro 1999

A static dictionary storing a subset of [1, U] of size n can be stored in O(B) bits of space, so that a membership query can be answered in O(1) time.
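The bound B is easy to evaluate numerically (a small sketch; the values U = 1024 and n = 32 are arbitrary examples, not from the slide):

```python
import math

# Information-theoretic minimum for storing an n-subset of a universe of
# size U: B = ceil(log2 (U choose n)) bits.
def info_bound(U, n):
    return math.ceil(math.log2(math.comb(U, n)))

# For U = 1024 and n = 32 this is a bit above the leading term
# n * log2(U / n) = 160, and far below the 1024 bits of a plain bitmap.
b = info_bound(1024, 32)
```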


SLIDE 18

We allow O(B) = O(n log(U/n)) bits of space. We can clearly encode the whole set in this much space; the question is whether we can also answer a membership query efficiently! Let r = U/n. We consider four cases:

very sparse: r ∈ [U^ε, ∞]; then we have O(n log U) bits of space, so we can explicitly list the elements and use some form of perfect hashing.
moderately sparse: r ∈ [log^λ U, U^ε]; see the next slide.
moderately dense: r ∈ [1/α, log^λ U]; complicated!
dense: r ∈ [2, 1/α]; then we can use O(U) bits of space, so we store a bitmap.
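The case split can be written down directly. The constants ε, λ, α are left unspecified on the slide; the defaults below are arbitrary stand-ins:

```python
import math

def representation(U, n, eps=0.5, lam=2.0, alpha=0.25):
    # Choose the representation by the sparsity ratio r = U/n, following the
    # four cases of the slide (eps, lam, alpha are hypothetical constants).
    r = U / n
    if r >= U ** eps:
        return "very sparse"        # explicit list + perfect hashing
    if r >= math.log2(U) ** lam:
        return "moderately sparse"  # the bucket scheme of the next slide
    if r >= 1 / alpha:
        return "moderately dense"   # the complicated case
    return "dense"                  # plain bitmap

choice = representation(2 ** 20, 4)
```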


SLIDE 20

Lemma

If n ≤ U / log^λ U, then we can store the set in O(B) bits of space, so that a membership query takes O(1) time.

We split the universe into p = n / log U buckets. We store pointers to all buckets, which takes p log U bits of space. In each bucket, we store the elements using log(U/p) bits per element. Additionally, we keep a perfect hashing function, separately for every bucket.

Fiat et al. 1988

A perfect hashing function for n elements taken from [1, U] can be implemented in O(log n + log log U + 1) additional bits of space.

It all adds up to B + B log log U / log r + o(B) bits of space, which is OK as long as r is at least log^λ U.


SLIDE 25

So, we have seen suffix arrays (and compressed suffix arrays). The annoying thing about suffix arrays is that we pay an additional penalty of log n (or even more) for every query. Is this necessary?

NO!

We can use suffix trees.


SLIDE 28

Suffix tree ST(w[1..n])

We append a special terminating character $ to our word w[1..n], and then arrange all suffixes of w[1..n]$ in a compacted trie. Take "banana". The suffixes are $, a$, na$, ana$, nana$, anana$, banana$.
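The slide's example can be reproduced in two lines ($ compares smallest in ASCII, matching its role as the terminating character):

```python
# The suffixes of "banana$" from the slide, in sorted (lexicographic) order.
s = "banana" + "$"
suffixes = sorted(s[i:] for i in range(len(s)))
```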

[Figure: the suffix tree of banana$, a compacted trie with edge labels a, na, $, banana$, etc.]

SLIDE 31

Why?

The resulting structure represents all subwords of w[1..n]. Each such subword corresponds to an explicit or implicit node of the suffix tree.


SLIDE 34

So, a suffix tree allows us to index the input word.

Text indexing

Given a word w[1..n], construct a small structure allowing us to answer queries of the form "where does p[1..m] occur in w[1..n]?".

We keep only the explicit nodes; there are O(n) of them. The labels of the edges are not kept explicitly; we just remember where they occur in w[1..n]. The total size of the structure is O(n), and a query can be answered in O(m + occ) time.
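A naive stand-in for such an index, hedged: this sketch sorts all suffixes explicitly, so it takes quadratic space and O(m log n) query time rather than the suffix tree's O(n) space and O(m + occ) time, but it answers the same queries:

```python
import bisect

def build_index(w):
    # Sort all suffixes of w$ together with their starting positions.
    s = w + "$"
    return sorted((s[i:], i) for i in range(len(s)))

def occurrences(index, p):
    # All suffixes with prefix p form a contiguous range in sorted order.
    lo = bisect.bisect_left(index, (p, -1))
    hi = bisect.bisect_left(index, (p + "\x7f", -1))
    return sorted(i for _, i in index[lo:hi])

idx = build_index("banana")
```

For example, `occurrences(idx, "ana")` reports the starting positions 1 and 3.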


SLIDE 38

We consider a fundamental data structure question: how to represent a tree?

(Compacted) trie

A trie is simply a tree with edges labeled by single characters. A compacted trie is created by replacing maximal chains of unary vertices with single edges labeled by (possibly long) words.

Navigation queries

Given a pattern p, we want to traverse the edges of a compacted trie to find the node corresponding to p. If there is no such node, we would like to compute the longest prefix of p for which the corresponding node does exist.

SLIDE 39

Consider p = wewpxcwrehyzrt and the following compacted trie.

[Figure: a compacted trie with long edge labels such as wewpxc and hyugfecvbx; the search for p descends along the edge labeled wewpxc and continues from there.]


SLIDE 43

Splitting an edge

Given an edge, we want to split it into two parts by (possibly) creating a middle node, and then adding a new edge outgoing from this middle node.

[Figure: an edge labeled abrakadabra being split into two parts.]

Notice that this covers adding a new edge outgoing from an existing node.

slide-45
SLIDE 45

Static case

Given a compacted trie, can we quickly construct a small structure which allows us to execute navigation queries efficiently?

Dynamic case

Can we maintain a compacted trie so that:

1

the resulting structure is small,

2

we can execute navigation queries efficiently,

3

we can split any edge efficiently? There are clearly three parameters: the number of nodes in the compacted trie n, the size of the alphabet σ, and the length of the pattern m. We aim to achieve good bounds in terms of those n, σ, m.

Paweł Gawrychowski String indexing in the Word RAM model III 17 / 30

SLIDE 46

So, what would be your first idea?

Hashing

For each node, store a hash table mapping characters to the corresponding outgoing edges. Randomized!

Table

Or, for each node, store a table of size σ mapping characters to the corresponding outgoing edges. Space usage is O(nσ)!

BST

Or, for each node, store a binary search tree mapping characters to the corresponding outgoing edges. A navigation query takes O(m log σ) time!
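The hash-table variant is what a dictionary-based trie gives for free. A sketch (non-compacted, one character per edge); `navigate` returns the length of the longest matching prefix, as the navigation queries require:

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # character -> child node (the hash-table option)

def insert(root, word):
    v = root
    for c in word:
        v = v.children.setdefault(c, TrieNode())
    return v

def navigate(root, p):
    # Follow p as far as possible; return the length of the longest prefix
    # of p that corresponds to a node of the trie.
    v, matched = root, 0
    for c in p:
        if c not in v.children:
            break
        v = v.children[c]
        matched += 1
    return matched

root = TrieNode()
for w in ("banana", "band", "bra"):
    insert(root, w)
```

Each step is O(1) expected time, but the structure is randomized, which is exactly the drawback the next slide rules out.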

SLIDE 47

To make life interesting, the rules of the game are as follows:

1. the solution must be deterministic,
2. the space usage must be linear in n, irrespective of σ,
3. the bound on the update time must be worst-case.

Then it seems that navigation queries must necessarily take O(m · f(σ)) time for some function f of σ, for instance f(σ) = log σ, or something better if we use a more sophisticated predecessor structure. Surprisingly, this is not true.

Suffix trays of Cole, Kopelowitz, and Lewenstein ICALP'06

There exists a deterministic linear-size structure supporting navigation in O(m + log σ) time, which can be constructed in linear time.

SLIDE 48

What about the updates?

Suffix trists of Cole, Kopelowitz, and Lewenstein ICALP'06

There exists a deterministic linear-size structure supporting navigation in O(m + log σ) time and splitting edges in O(log σ) time.

The above bound assumes that we are given a pointer to the edge that should be split. The most natural setting in which we could use the structure is maintaining a suffix tree for a text updated by prepending a letter. In that case it is rather easy to locate the relevant edge in amortized O(1) time, but getting a sublinear worst-case bound is not trivial!

SLIDE 49

Suffix tree oracle of Amir, Kopelowitz, Lewenstein, and Lewenstein SPIRE'05

There exists a suffix tree oracle which locates the edge in O(log n) time.

Suffix tree oracle of Breslauer and Italiano SPIRE'11

If σ = O(1), there exists a suffix tree oracle which locates the edge in O(log log n) time.

SLIDE 50

The natural question is whether the O(m + log σ) and O(log σ) bounds are the best possible. The answer is... no, they are not.

Andersson and Thorup SODA'01

There exists a deterministic linear-size structure supporting navigation in O(m + √(log n / log log n)) time and splitting edges in O(√(log n / log log n)) time.

Are these bounds the best possible?

Under some assumptions, yes. More specifically, they are the best possible if σ is unbounded in terms of n and we are interested in a stronger version of the navigation queries, which actually gives us the predecessor of the string we are searching for.


SLIDE 53

But it seems reasonable to consider the scenario where σ is non-constant, yet (significantly) smaller than n. Hence we get the following question: what are the best possible time bounds in terms of σ?

Gawrychowski and Fischer, wait till the next slide

There exists a static deterministic linear-size structure supporting navigation in O(m + log log σ) time, which can be constructed in linear time.

Gawrychowski and Fischer

There exists a deterministic linear-size structure supporting navigation in O(m + log² log σ / log log log σ) time and splitting edges in O(log² log σ / log log log σ) time.

SLIDE 54

To construct a static deterministic linear-size structure, we could simply try to find a perfect hashing function storing all pairs (node, character). It is well known that such functions can be found in polynomial time, but we need linear time.

Ružić ICALP'08

A static linear-size constant-access dictionary on a set of k keys can be deterministically constructed in O(k log² log k) time.

Hence we immediately get a static deterministic structure which can be constructed in close-to-linear time. Can we do better?

SLIDE 55

We store the edges outgoing from v in a few different ways, depending on the size of the subtree rooted at v.

Heavy nodes

A node is heavy if its subtree contains at least s = Θ(log² log σ) leaves, and light otherwise. Furthermore, a heavy node is branching if it has more than one heavy child.
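The classification is easy to prototype (a sketch on a hypothetical toy tree given as child lists; s = 2 stands in for Θ(log² log σ)):

```python
def classify(children, root, s):
    # leaves[v] = number of leaves in the subtree rooted at v
    leaves = {}
    def count(v):
        kids = children.get(v, [])
        leaves[v] = 1 if not kids else sum(count(c) for c in kids)
        return leaves[v]
    count(root)
    heavy = {v for v, c in leaves.items() if c >= s}   # at least s leaves
    branching = {v for v in heavy                       # > 1 heavy child
                 if sum(1 for c in children.get(v, []) if c in heavy) > 1}
    return heavy, branching

# Toy tree: 0 -> {1, 2}, 1 -> {3, 4}, 2 -> {5, 6}; nodes 3..6 are leaves.
children = {0: [1, 2], 1: [3, 4], 2: [5, 6]}
heavy, branching = classify(children, 0, s=2)
```

With s = 2, nodes 0, 1, 2 are heavy and only the root is branching.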

SLIDE 56

[Figure: a tree with nodes classified as heavy or light; heavy nodes are further marked as branching or nonbranching; a leaf, a node v, and its parent pv are labeled.]

SLIDE 57

We classify the edges outgoing from heavy nodes into three types, and deal with each type separately:

1. from (any) heavy node to a light node: we store all such edges in a predecessor structure. By combining the perfect hashing result with the classical x-fast tries of Willard, there exists a linear-size predecessor structure with O(log log σ) query time, which can be constructed in linear time.

2. from a nonbranching heavy node to (any) heavy node: there is at most one such edge per node, so it can be stored separately.

3. from a branching heavy node to (any) heavy node: the total number of such edges is just n/s, hence we can afford the super-linear construction time. More precisely, we compute a perfect hashing function for each such node separately in O(k log² log k) = O(k log² log σ) = O(ks) time, which takes O((n/s) · s) = O(n) time in total.

SLIDE 61

Observe that any navigation query traverses an edge of type (1) at most once, hence we pay the O(log log σ) just once (so far). But what happens when we reach a light node?

Each light node contains at most s leaves. We can execute a binary search over those leaves using the suffix array trick: in each step we achieve at least one of the following:

1. halve the current interval,
2. consume one character from the pattern.

Hence in O(m + log s) time we can locate the predecessor of the pattern among all leaves, and the search actually computes the longest prefix of the pattern which is a prefix of a string corresponding to some leaf.

SLIDE 62

The total time complexity of a query is O(m + log log σ + log s) = O(m + log log σ), and the total construction time is linear.

SLIDE 63

Questions?