An opportunistic text indexing structure based on run length encoding




slide-1
SLIDE 1

CIAC 2015

An opportunistic text indexing structure based on run length encoding

Yuya Tamakoshi, Keisuke Goto, Shunsuke Inenaga, Hideo Bannai, Masayuki Takeda Kyushu University, Japan

slide-2
SLIDE 2

Kyushu University, Japan

slide-3
SLIDE 3

Kyushu University, Japan

Kyushu U.

Kyushu U.

slide-4
SLIDE 4

Kyushu University, Japan

Kyushu U.

Itoshima Peninsula

糸島 (Itoshima), literally "String Island"

slide-5
SLIDE 5

String matching

Input: text string T and pattern string P
Output: all occurrences of P in T

slide-6
SLIDE 6

String matching

Input: text string T and pattern string P
Output: all occurrences of P in T

pattern P: compress

text T: We introduce a general framework which is suitable to capture an essence of compressed pattern matching according to various dictionary based compressions. The goal is to find all occurrences of a pattern in a text without decompression, which is one of the most active topics in string matching. Our framework includes such compression methods as Lempel-Ziv family, (LZ77, LZSS, LZ78, LZW), byte-pair encoding, and the static dictionary based method.

The four occurrences of P in T are highlighted on the slide: compressed, compressions, decompression, compression.

slide-7
SLIDE 7

String matching

Input: text string T and pattern string P Output: all occurrences of P in T

 String matching is fundamental to areas such as

  • Information Retrieval
  • Bioinformatics, etc.
slide-8
SLIDE 8

Indexed string matching

Preprocess: build an index on a fixed text T
Query: pattern string P
Answer: all occurrences of P in T

The goal is to construct a space-efficient index on T which quickly answers string matching queries.

  • Text T can be very long (e.g., DNA sequences).
  • We may receive many different query patterns.
slide-9
SLIDE 9

Classical text index: Suffix Array

The suffix array SA of text T is an array which stores the beginning positions of the suffixes of T in lexicographic order [Manber & Myers, 1991].

T = cococacao$

$ is an end-marker which appears only at the end of the string.
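
The definition above can be checked with a minimal sketch that builds SA by sorting the suffixes directly. The function name `suffix_array` is an assumption made for illustration, and naive sorting is used only for clarity; it is far slower than dedicated suffix array construction algorithms.

```python
def suffix_array(text):
    """Return the 1-based suffix array of text by sorting its suffixes."""
    # Each suffix is identified by its 1-based starting position i;
    # text[i - 1:] is the suffix itself. '$' sorts before the letters.
    return sorted(range(1, len(text) + 1), key=lambda i: text[i - 1:])


T = "cococacao$"
SA = suffix_array(T)
print(SA)  # [10, 6, 8, 5, 7, 3, 1, 9, 4, 2]
```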

slide-10
SLIDE 10

Classical text index: Suffix Array

The suffix array SA of text T is an array which stores the beginning positions of the suffixes of T in lexicographic order [Manber & Myers, 1991].

Suffixes of T = cococacao$ by beginning position:

 1  cococacao$
 2  ococacao$
 3  cocacao$
 4  ocacao$
 5  cacao$
 6  acao$
 7  cao$
 8  ao$
 9  o$
10  $

slide-11
SLIDE 11

Classical text index: Suffix Array

The suffix array SA of text T is an array which stores the beginning positions of the suffixes of T in lexicographic order [Manber & Myers, 1991].

Sorting the suffixes of T = cococacao$ in lexicographic order gives SA:

SA  suffix
10  $
 6  acao$
 8  ao$
 5  cacao$
 7  cao$
 3  cocacao$
 1  cococacao$
 9  o$
 4  ocacao$
 2  ococacao$

SA = 10 6 8 5 7 3 1 9 4 2

slide-12
SLIDE 12

String matching with suffix array

T = cococacao$
SA = 10 6 8 5 7 3 1 9 4 2
Sorted suffixes: $, acao$, ao$, cacao$, cao$, cocacao$, cococacao$, o$, ocacao$, ococacao$

Binary search for a given pattern P on SA.
P = coc

slide-13
SLIDE 13

String matching with suffix array

T = cococacao$
SA = 10 6 8 5 7 3 1 9 4 2
Sorted suffixes: $, acao$, ao$, cacao$, cao$, cocacao$, cococacao$, o$, ocacao$, ococacao$

Binary search for a given pattern P on SA.
P = coc
Compare with cao$: coc > cao$

slide-14
SLIDE 14

String matching with suffix array

T = cococacao$
SA = 10 6 8 5 7 3 1 9 4 2
Sorted suffixes: $, acao$, ao$, cacao$, cao$, cocacao$, cococacao$, o$, ocacao$, ococacao$

Binary search for a given pattern P on SA.
P = coc
Compare with o$: coc < o$

slide-15
SLIDE 15

String matching with suffix array

T = cococacao$
SA = 10 6 8 5 7 3 1 9 4 2
Sorted suffixes: $, acao$, ao$, cacao$, cao$, cocacao$, cococacao$, o$, ocacao$, ococacao$

Binary search for a given pattern P on SA.
P = coc
Compare with cocacao$: match (coc is a prefix of cocacao$)

slide-16
SLIDE 16

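String matching with suffix array

T = cococacao$
SA = 10 6 8 5 7 3 1 9 4 2
Sorted suffixes: $, acao$, ao$, cacao$, cao$, cocacao$, cococacao$, o$, ocacao$, ococacao$

Binary search for a given pattern P on SA.
P = coc
Compare with cococacao$: match (coc is a prefix of cococacao$)

✔ cocacao$
✔ cococacao$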

slide-17
SLIDE 17

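String matching with suffix array

T = cococacao$ (positions 1 2 3 4 5 6 7 8 9 10)
SA = 10 6 8 5 7 3 1 9 4 2
Sorted suffixes: $, acao$, ao$, cacao$, cao$, cocacao$, cococacao$, o$, ocacao$, ococacao$

Binary search for a given pattern P on SA.

✔ cocacao$ (SA value 3)
✔ cococacao$ (SA value 1)

P = coc occurs in T at positions 1 and 3.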
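
The binary search walked through on these slides can be sketched as follows. This is a minimal illustration, not the presenters' implementation: the function name `find_occurrences` is hypothetical, and the sketch uses two plain binary searches over the suffix array, without the LCP array.

```python
def find_occurrences(text, sa, pattern):
    """Return the sorted 1-based starting positions of pattern in text."""
    m = len(pattern)

    def prefix(rank):
        # First m characters of the suffix at the given 0-based rank in SA.
        start = sa[rank] - 1
        return text[start:start + m]

    # Lower bound: first rank whose length-m prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo

    # Upper bound: first rank whose length-m prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= pattern:
            lo = mid + 1
        else:
            hi = mid

    # All suffixes in ranks [first, lo) start with the pattern.
    return sorted(sa[first:lo])


T = "cococacao$"
SA = [10, 6, 8, 5, 7, 3, 1, 9, 4, 2]
print(find_occurrences(T, SA, "coc"))  # [1, 3]
```

Each comparison inspects at most m characters and each search takes O(log u) steps, which is consistent with the familiar O(m log u + occ) bound for plain suffix array search.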

slide-18
SLIDE 18

String matching with suffix array

T = cococacao$
SA = 10 6 8 5 7 3 1 9 4 2
Sorted suffixes: $, acao$, ao$, cacao$, cao$, cocacao$, cococacao$, o$, ocacao$, ococacao$

All occurrences of P in T can be found in O(m log u + occ) time using SA.
The search time can be improved to O(m + log u + occ) using the LCP array.

u = |T|, m = |P|, occ = number of occurrences of P in T
slide-19
SLIDE 19

SA+LCP

Theorem [Manber & Myers, 1991]
There is an index (SA+LCP) which reports all occ occurrences of P in T in O(m + log u + occ) time, and requires 2u log u + u log σ + O(u) bits of space.

  • 2u log u bits: SA & LCP
  • u log σ bits: text T
  • O(u) bits: auxiliary data structure

This can take too much space for a large text T (i.e., for large u).

u = |T|, m = |P|, σ = |Σ|

slide-20
SLIDE 20

Compressed index

There are a number of compressed indexes which occupy only the compressed size of the text:

  • FM-index [Ferragina & Manzini, 2000],
  • Compressed Suffix Array [Grossi & Vitter, 2000],
  • Lempel-Ziv index [Gagie et al., 2014], etc.

Most of them are slower than SA+LCP.

Our proposal: a new compressed index based on the run length encoding (RLE) of the text, which is smaller & faster than SA+LCP.

slide-21
SLIDE 21

Run Length Encoding (RLE)

The run length encoding of text T, denoted RLE(T), is a compressed representation of T in which each maximal run a…a of a character a is encoded by a^p, where p denotes the length of the maximal run. On these slides the exponent is written directly after the character, e.g., a4 for aaaa.

Applications of RLE include:

  • black-white fax messages
  • image format (PackBits, TIFF)
  • music format (MIDI)

T = aaaabbbaacccccccbbbbbaaaaa$
RLE(T) = a4b3a2c7b5a5$
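
As a small illustration of the definition, the following sketch computes RLE(T) as a list of (character, exponent) pairs. The function name `rle` and the pair representation are assumptions chosen for this example.

```python
from itertools import groupby


def rle(text):
    """Run length encode text as a list of (character, exponent) pairs."""
    # groupby yields one group per maximal run of equal characters.
    return [(ch, len(list(run))) for ch, run in groupby(text)]


T = "aaaabbbaacccccccbbbbbaaaaa$"
runs = rle(T)
print(runs)
# [('a', 4), ('b', 3), ('a', 2), ('c', 7), ('b', 5), ('a', 5), ('$', 1)]
print(len(runs))  # 7 runs, counting the end-marker
```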

slide-22
SLIDE 22

RLE suffixes

Let n = |RLE(T)|. For any 1 ≤ i ≤ n, RLEsuf(i) is the suffix of RLE(T) starting with the i-th run.

RLE(T): a4b3a2c7b5a5$ (n = 7)

RLEsuf(1): a4b3a2c7b5a5$
RLEsuf(2): b3a2c7b5a5$
RLEsuf(3): a2c7b5a5$
RLEsuf(4): c7b5a5$
RLEsuf(5): b5a5$
RLEsuf(6): a5$
RLEsuf(7): $

slide-23
SLIDE 23

Difficulty in indexing RLE suffixes

We want to index only the RLE suffixes of the text, but simply sorting the RLE suffixes does not work!

Sorted RLE suffixes of text:

a5b...
a5b...
a5c...
a4b...
a4c...
a4c...
a4c...
a3b...
a3b...

slide-24
SLIDE 24

Difficulty in indexing RLE suffixes

We want to index only the RLE suffixes of the text, but simply sorting the RLE suffixes does not work!

Sorted RLE suffixes of text (decompressed):

aaaaab...
aaaaab...
aaaaac...
aaaab...
aaaac...
aaaac...
aaaac...
aaab...
aaab...

slide-25
SLIDE 25

Difficulty in indexing RLE suffixes

We want to index only the RLE suffixes of the text, but simply sorting the RLE suffixes does not work!

RLE(P): a2b1

Sorted RLE suffixes of text (decompressed), with matches of P marked:

aaaaab... ✔
aaaaab... ✔
aaaaac...
aaaab... ✔
aaaac...
aaaac...
aaaac...
aaab... ✔
aaab... ✔

Pattern occurrences are spread out, so we cannot binary search!!

slide-26
SLIDE 26

Our ideas to index RLE suffixes

  • When sorting RLE suffixes, we “ignore” the exponents of the first runs of the RLE suffixes of text T.
  • To find occurrences of pattern P, we first “ignore” the exponent of the first run of RLE(P), and find its corresponding range.
  • We then pick up only the occurrences of RLE(P) from this range.

slide-27
SLIDE 27

Truncated RLE suffixes

tRLEsuf(i) is RLEsuf(i) with its first exponent pi truncated to 1.

i  RLEsuf(i)        tRLEsuf(i)
1  a4b3a2c7b5a5$    a1b3a2c7b5a5$
2  b3a2c7b5a5$      b1a2c7b5a5$
3  a2c7b5a5$        a1c7b5a5$
4  c7b5a5$          c1b5a5$
5  b5a5$            b1a5$
6  a5$              a1$
7  $                $

slide-28
SLIDE 28

Our index: Truncated RLE Suffix Array

The tRLE suffix array tRLESA of text T is an array which stores the beginning positions of the tRLE suffixes in lexicographical order.

Sorting tRLEsuf(1), ..., tRLEsuf(7) gives:

tRLESA  tRLE suffix
7       $
6       a1$
1       a1b3a2c7b5a5$
3       a1c7b5a5$
5       b1a5$
2       b1a2c7b5a5$
4       c1b5a5$

tRLESA = 7 6 1 3 5 2 4
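
The construction of tRLESA for the running example can be reproduced with a naive sketch. It sorts the decompressed truncated suffixes directly, which is an assumption made for brevity and not the construction method of the talk; the names `rle` and `trle_suffix_array` are hypothetical.

```python
from itertools import groupby


def rle(text):
    """Run length encode text as a list of (character, exponent) pairs."""
    return [(ch, len(list(run))) for ch, run in groupby(text)]


def trle_suffix_array(runs):
    """Return 1-based run indices sorted by their truncated RLE suffixes."""
    def truncated(i):
        # First run cut down to a single character, the rest decompressed.
        ch, _ = runs[i]
        return ch + "".join(c * p for c, p in runs[i + 1:])

    order = sorted(range(len(runs)), key=truncated)
    return [i + 1 for i in order]


runs = rle("aaaabbbaacccccccbbbbbaaaaa$")
print(trle_suffix_array(runs))  # [7, 6, 1, 3, 5, 2, 4]
```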

slide-29
SLIDE 29

Monotonicity on Truncated RLE Suffix Array

tRLESA  tRLE suffix (ignored exponents in parentheses)
...     ...
47      b(2)c5a2b2a6...
99      b(9)c5a2b3a1...
11      b(1)c5a2b5a4...
40      b(2)c5a2b7a3...
55      b(3)c5a2b8c2...
72      b(9)c5a2b6c3...
19      b(1)c5a2b6c3...
26      b(5)c5a2b4c7...
4       b(1)c5a2b1c8...
...     ...

slide-30
SLIDE 30

Monotonicity on Truncated RLE Suffix Array

RLE(P): b3c5a2b4

tRLESA  tRLE suffix (ignored exponents in parentheses)
...     ...
47      b(2)c5a2b2a6...
99      b(9)c5a2b3a1...
11      b(1)c5a2b5a4...
40      b(2)c5a2b7a3...
55      b(3)c5a2b8c2...
72      b(9)c5a2b6c3...
19      b(1)c5a2b6c3...
26      b(5)c5a2b4c7...
4       b(1)c5a2b1c8...
...     ...

We first look for bc5a2. The rows shown form the range that bc5a2 matches. This range can be found by a binary search.

slide-31
SLIDE 31

Monotonicity on Truncated RLE Suffix Array

RLE(P): b3c5a2b4

tRLESA  tRLE suffix (ignored exponents in parentheses)
...     ...
47      b(2)c5a2b2a6...
99      b(9)c5a2b3a1...
11      b(1)c5a2b5a4...
40      b(2)c5a2b7a3...
55      b(3)c5a2b8c2...
72      b(9)c5a2b6c3...
19      b(1)c5a2b6c3...
26      b(5)c5a2b4c7...
4       b(1)c5a2b1c8...
...     ...

The rows shown form the range that bc5a2 matches. We next look for bc5a2b4.

slide-32
SLIDE 32

Monotonicity on Truncated RLE Suffix Array

RLE(P): b3c5a2b4

tRLESA  tRLE suffix (ignored exponents in parentheses)
...     ...
47      b(2)c5a2b2a6...
99      b(9)c5a2b3a1...
11      b(1)c5a2b5a4...
40      b(2)c5a2b7a3...
55      b(3)c5a2b8c2...
72      b(9)c5a2b6c3...
19      b(1)c5a2b6c3...
26      b(5)c5a2b4c7...
4       b(1)c5a2b1c8...
...     ...

The rows shown form the range that bc5a2 matches. We next look for bc5a2b4.

Within this range, the exponents of the run of b's following bc5a2 are first monotonically non-decreasing (2, 3, 5, 7) and then monotonically non-increasing (8, 6, 6, 4, 1).

slide-33
SLIDE 33

Matching with Truncated RLE Suffix Array

RLE(P): b3c5a2b4

tRLESA  tRLE suffix (ignored exponents in parentheses)
...     ...
47      b(2)c5a2b2a6...
99      b(9)c5a2b3a1...
11      b(1)c5a2b5a4...
40      b(2)c5a2b7a3...
55      b(3)c5a2b8c2...
72      b(9)c5a2b6c3...
19      b(1)c5a2b6c3...
26      b(5)c5a2b4c7...
4       b(1)c5a2b1c8...
...     ...

We next look for bc5a2b4. The range that bc5a2b4 matches consists of the rows whose run of b's after bc5a2 has exponent at least 4 (tRLESA values 11, 40, 55, 72, 19, 26). Based on the monotonicity, this range can be found by a binary search.

slide-34
SLIDE 34

Matching with Truncated RLE Suffix Array

RLE(P): b3c5a2b4

tRLESA  tRLE suffix (ignored exponents in parentheses)
...     ...
47      b(2)c5a2b2a6...
99      b(9)c5a2b3a1...
11      b(1)c5a2b5a4...
40      b(2)c5a2b7a3...
55      b(3)c5a2b8c2...
72      b(9)c5a2b6c3...
19      b(1)c5a2b6c3...
26      b(5)c5a2b4c7...
4       b(1)c5a2b1c8...
...     ...

We finally look for b3c5a2b4 within the range that bc5a2b4 matches.

slide-35
SLIDE 35

Matching with Truncated RLE Suffix Array

RLE(P): b3c5a2b4

tRLESA  tRLE suffix (ignored exponents in parentheses)
...     ...
47      b(2)c5a2b2a6...
99      b(9)c5a2b3a1...
11      b(1)c5a2b5a4...
40      b(2)c5a2b7a3...
55      b(3)c5a2b8c2...
72      b(9)c5a2b6c3...
19      b(1)c5a2b6c3...
26      b(5)c5a2b4c7...
4       b(1)c5a2b1c8...
...     ...

We finally look for b3c5a2b4 within the range that bc5a2b4 matches. We want only those suffixes whose 1st exponents are at least 3.

slide-36
SLIDE 36

Matching with Truncated RLE Suffix Array

RLE(P): b3c5a2b4

tRLESA  exponent  tRLE suffix
...     ...       ...
47      2         b(2)c5a2b2a6...
99      9         b(9)c5a2b3a1...
11      1         b(1)c5a2b5a4...
40      2         b(2)c5a2b7a3...
55      3         b(3)c5a2b8c2...
72      9         b(9)c5a2b6c3...
19      1         b(1)c5a2b6c3...
26      5         b(5)c5a2b4c7...
4       1         b(1)c5a2b1c8...
...     ...       ...

We use an array of the ignored exponents of the first runs.
slide-37
SLIDE 37

Matching with Truncated RLE Suffix Array

RLE(P): b3c5a2b4

tRLESA  exponent  tRLE suffix
...     ...       ...
47      2         b(2)c5a2b2a6...
99      9         b(9)c5a2b3a1...
11      1         b(1)c5a2b5a4...
40      2         b(2)c5a2b7a3...
55      3         b(3)c5a2b8c2...
72      9         b(9)c5a2b6c3...
19      1         b(1)c5a2b6c3...
26      5         b(5)c5a2b4c7...
4       1         b(1)c5a2b1c8...
...     ...       ...

We finally look for b3c5a2b4 within the range that bc5a2b4 matches.

slide-38
SLIDE 38

Matching with Truncated RLE Suffix Array

RLE(P): b3c5a2b4

tRLESA  exponent  tRLE suffix
...     ...       ...
47      2         b(2)c5a2b2a6...
99      9         b(9)c5a2b3a1...
11      1         b(1)c5a2b5a4...
40      2         b(2)c5a2b7a3...
55      3         b(3)c5a2b8c2...
72      9         b(9)c5a2b6c3...
19      1         b(1)c5a2b6c3...
26      5         b(5)c5a2b4c7...
4       1         b(1)c5a2b1c8...
...     ...       ...

We finally look for b3c5a2b4, using a Range Maximum Query (RMQ) on the exponents array.

slide-39
SLIDE 39

Matching with Truncated RLE Suffix Array

RLE(P): b3c5a2b4

tRLESA  exponent  tRLE suffix
...     ...       ...
47      2         b(2)c5a2b2a6...
99      9         b(9)c5a2b3a1...
11      1         b(1)c5a2b5a4...
40      2         b(2)c5a2b7a3...
55      3         b(3)c5a2b8c2...
72      9         b(9)c5a2b6c3...
19      1         b(1)c5a2b6c3...
26      5         b(5)c5a2b4c7...
4       1         b(1)c5a2b1c8...
...     ...       ...

We finally look for b3c5a2b4. Each row found by an RMQ is marked ✔ on the slide.

We perform RMQs recursively, in the 1st & 2nd halves of the range.

slide-40
SLIDE 40

Matching with Truncated RLE Suffix Array

RLE(P): b3c5a2b4

tRLESA  exponent  tRLE suffix
...     ...       ...
47      2         b(2)c5a2b2a6...
99      9         b(9)c5a2b3a1...
11      1         b(1)c5a2b5a4...
40      2         b(2)c5a2b7a3...
55      3         b(3)c5a2b8c2...
72      9         b(9)c5a2b6c3...
19      1         b(1)c5a2b6c3...
26      5         b(5)c5a2b4c7...
4       1         b(1)c5a2b1c8...
...     ...       ...

We finally look for b3c5a2b4. Each row found by an RMQ is marked ✔ on the slide.

Recursion ends when the range maximum is less than 3.

slide-41
SLIDE 41

Matching with Truncated RLE Suffix Array

RLE(P): b3c5a2b4

tRLESA  exponent  tRLE suffix
...     ...       ...
47      2         b(2)c5a2b2a6...
99      9         b(9)c5a2b3a1...
11      1         b(1)c5a2b5a4...
40      2         b(2)c5a2b7a3...
55      3         b(3)c5a2b8c2... ✔
72      9         b(9)c5a2b6c3... ✔
19      1         b(1)c5a2b6c3...
26      5         b(5)c5a2b4c7... ✔
4       1         b(1)c5a2b1c8...
...     ...       ...

The number of RMQs we perform is O(occ). Each RMQ takes O(1) time [Fischer & Heun, 2011].
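
The recursive RMQ step can be sketched as follows. This is an illustration under stated assumptions: a sparse table stands in for the succinct O(1)-time RMQ structure cited on the slide, all function names are hypothetical, and the range used in the example (positions 3 to 8 of the exponents array 2 9 1 2 3 9 1 5 1) is read off the slide example as the range that bc5a2b4 matches.

```python
def build_sparse_table(values):
    """table[j][i] is the index of a maximum of values[i : i + 2**j]."""
    table = [list(range(len(values)))]
    j = 1
    while (1 << j) <= len(values):
        prev, half, row = table[-1], 1 << (j - 1), []
        for i in range(len(values) - (1 << j) + 1):
            a, b = prev[i], prev[i + half]
            row.append(a if values[a] >= values[b] else b)
        table.append(row)
        j += 1
    return table


def range_argmax(values, table, lo, hi):
    """Index of a maximum of values[lo..hi] (inclusive) in O(1) time."""
    j = (hi - lo + 1).bit_length() - 1
    a, b = table[j][lo], table[j][hi - (1 << j) + 1]
    return a if values[a] >= values[b] else b


def report_at_least(values, table, lo, hi, threshold):
    """All indices in [lo, hi] whose value is at least threshold."""
    found, stack = [], [(lo, hi)]
    while stack:
        lo, hi = stack.pop()
        if lo > hi:
            continue
        k = range_argmax(values, table, lo, hi)
        if values[k] < threshold:
            continue  # range maximum too small: this branch ends
        found.append(k)
        stack.append((lo, k - 1))  # recurse to the left of the maximum
        stack.append((k + 1, hi))  # recurse to the right of the maximum
    return sorted(found)


exponents = [2, 9, 1, 2, 3, 9, 1, 5, 1]
trlesa = [47, 99, 11, 40, 55, 72, 19, 26, 4]
table = build_sparse_table(exponents)
hits = report_at_least(exponents, table, 2, 7, 3)  # 0-based range, threshold 3
print([trlesa[k] for k in hits])  # [55, 72, 26]
```

Every reported index triggers at most two further queries, and every failed query stops immediately, so the number of RMQs is proportional to the number of reported rows.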

slide-42
SLIDE 42

Our results

Theorem 1 (RLE-index)
There is an index which, given RLE(P), reports all occ occurrences of P in T in O(q + log n + occ) time, and requires 2n log u + n log σ + n log n + O(n) bits of space.

u = |T|, n = |RLE(T)| (n ≤ u), q = |RLE(P)|, σ = |Σ|
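
The overall query procedure can be sketched end to end as below. This is a simplified illustration, not the data structure of Theorem 1: it stores decompressed truncated suffixes as sort keys, filters the first exponents with a linear scan instead of RMQs, and only handles patterns with at least two runs, so it does not meet the stated time or space bounds. The names `build_index` and `search` are hypothetical, and the text is assumed not to contain the character U+10FFFF, which is used as a search sentinel.

```python
from bisect import bisect_left, bisect_right
from itertools import groupby


def rle(s):
    """Run length encode s as a list of (character, exponent) pairs."""
    return [(c, len(list(g))) for c, g in groupby(s)]


def expand(runs):
    """Decompress a list of (character, exponent) pairs."""
    return "".join(c * p for c, p in runs)


def build_index(text):
    runs = rle(text)
    starts, pos = [], 1
    for _, p in runs:
        starts.append(pos)  # 1-based text position where each run begins
        pos += p
    # Truncated suffix of run i: first exponent replaced by 1.
    keyed = sorted((runs[i][0] + expand(runs[i + 1:]), i) for i in range(len(runs)))
    keys = [k for k, _ in keyed]
    order = [i for _, i in keyed]        # tRLESA as 0-based run indices
    exps = [runs[i][1] for i in order]   # ignored first exponents
    return runs, starts, keys, order, exps


def search(index, pattern):
    """Sorted 1-based positions of pattern (at least two runs) in the text."""
    runs, starts, keys, order, exps = index
    pruns = rle(pattern)
    if len(pruns) < 2:
        raise ValueError("this sketch only handles patterns with at least two runs")
    first_exp = pruns[0][1]
    # Pattern with the exponent of its first run ignored.
    tpat = pruns[0][0] + expand(pruns[1:])
    lo = bisect_left(keys, tpat)
    hi = bisect_right(keys, tpat + "\U0010ffff")
    hits = []
    for k in range(lo, hi):
        if exps[k] >= first_exp:  # keep rows whose first run is long enough
            i = order[k]
            hits.append(starts[i] + runs[i][1] - first_exp)
    return sorted(hits)


index = build_index("aaaabbbaacccccccbbbbbaaaaa$")
print(search(index, "bbaa"))  # [6, 20]
```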

slide-43
SLIDE 43

Our results

Theorem 1 (RLE-index)
There is an index which, given RLE(P), reports all occ occurrences of P in T in O(q + log n + occ) time, and requires 2n log u + n log σ + n log n + O(n) bits of space.

  • SA+LCP takes O(m + log u + occ) time for pattern matching (m = |P|).
  • Since q ≤ m and n ≤ u always hold, our index is faster than SA+LCP.

u = |T|, n = |RLE(T)| (n ≤ u), q = |RLE(P)|, σ = |Σ|

slide-44
SLIDE 44

Our results

Theorem 1 (RLE-index)
There is an index which, given RLE(P), reports all occ occurrences of P in T in O(q + log n + occ) time, and requires 2n log u + n log σ + n log n + O(n) bits of space.

  • SA+LCP requires 2u log u + u log σ + O(u) bits of space.
  • Our RLE-index is smaller when text T is compressible with RLE.

u = |T|, n = |RLE(T)| (n ≤ u), q = |RLE(P)|, σ = |Σ|
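
To get a feel for the two space bounds, the following sketch evaluates their leading terms for sample parameters. The O(u) and O(n) terms are dropped and the parameter values are illustrative assumptions, so the numbers are rough estimates only; the function names are hypothetical.

```python
from math import log2


def sa_lcp_bits(u, sigma):
    """Leading terms of the SA+LCP bound: 2u log u + u log sigma."""
    return 2 * u * log2(u) + u * log2(sigma)


def rle_index_bits(u, n, sigma):
    """Leading terms of the RLE-index bound: 2n log u + n log sigma + n log n."""
    return 2 * n * log2(u) + n * log2(sigma) + n * log2(n)


u, sigma = 10**6, 4
for n in (10**4, 10**5, 10**6):
    ratio = rle_index_bits(u, n, sigma) / sa_lcp_bits(u, sigma)
    print(f"n = {n:>7}: RLE-index / SA+LCP = {ratio:.3f}")
```

When n is much smaller than u the ratio is far below 1, while for n = u the extra n log n term makes the RLE-index bound larger than the SA+LCP bound.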

slide-45
SLIDE 45

Our results

Theorem 2 (Construction time & space)
Given RLE(T) of size n, the RLE-index of T can be constructed in O(n log n) time with O(n log u) bits of working space.

u = |T|, n = |RLE(T)|

  • We introduced new combinatorial properties of RLE suffixes.
  • We also use the idea of induced-sorting [Nong et al., 2011], which was originally designed for fast suffix array construction.

slide-46
SLIDE 46

Conclusions & Future work

  • Our RLE-index is always faster than SA+LCP.
  • Our RLE-index is smaller than SA+LCP when the text is compressible by RLE (i.e., when the n log n term is negligible).
  • Future work: comparisons to other compressed indexes (e.g., FM-index, compressed SA, LZ-index).

slide-47
SLIDE 47

FAQ

slide-48
SLIDE 48

LCP array and Range Minima

The LCP array of T stores the length of the longest common prefix of neighboring suffixes in SA of T.

SA  LCP  suffix
10  -    $
 6  0    acao$
 8  1    ao$
 5  0    cacao$
 7  2    cao$
 3  1    cocacao$
 1  3    cococacao$
 9  0    o$
 4  1    ocacao$
 2  2    ococacao$

slide-51
SLIDE 51

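LCP array and Range Minima

The length of the LCP of any two suffixes can also be computed by a range minimum query: it is the minimum of the LCP array entries between the two suffixes in sorted order.

LCP  suffix
-    $
0    acao$
1    ao$
0    cacao$
2    cao$
1    cocacao$
3    cococacao$
0    o$
1    ocacao$
2    ococacao$

Two suffixes are marked ✔ on the slide.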

slide-52
SLIDE 52

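LCP array and Range Minima

The length of the LCP of any two suffixes can also be computed by a range minimum query: it is the minimum of the LCP array entries between the two suffixes in sorted order.

LCP  suffix
-    $
0    acao$
1    ao$
0    cacao$
2    cao$
1    cocacao$
3    cococacao$
0    o$
1    ocacao$
2    ococacao$

Two suffixes are marked ✔ on the slide, and the LCP entries between them are covered by a range minimum query.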
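
The relation between the LCP array and range minima can be sketched as follows. The names `lcp_array` and `lcp_of_ranks` are hypothetical, the LCP array is built by direct character comparison, and Python's built-in `min` over a slice stands in for a constant-time range minimum query structure; both are assumptions made to keep the example short.

```python
def lcp_array(text, sa):
    """lcp[k] = LCP length of the suffixes at ranks k-1 and k (lcp[0] = 0)."""
    def lcp(i, j):
        # Compare the suffixes starting at 0-based positions i and j.
        n = 0
        while i + n < len(text) and j + n < len(text) and text[i + n] == text[j + n]:
            n += 1
        return n

    return [0] + [lcp(sa[k - 1] - 1, sa[k] - 1) for k in range(1, len(sa))]


def lcp_of_ranks(lcp, r1, r2):
    """LCP length of the suffixes at 0-based ranks r1 < r2 in SA."""
    # The answer is the minimum of the LCP entries strictly after r1 up to r2.
    return min(lcp[r1 + 1:r2 + 1])


T = "cococacao$"
SA = [10, 6, 8, 5, 7, 3, 1, 9, 4, 2]
LCP = lcp_array(T, SA)
print(LCP)                      # [0, 0, 1, 0, 2, 1, 3, 0, 1, 2]
print(lcp_of_ranks(LCP, 3, 6))  # cacao$ vs cococacao$ share "c": 1
```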

slide-53
SLIDE 53

LCP array and Range Minima

For any integer array of length k, there is a data structure which supports range minimum queries in O(1) time, and requires 2k + o(k) bits of extra space [Fischer & Heun, 2011].
slide-54
SLIDE 54

S-type and L-type RLE suffixes

RLEsuf(i) is S-type if RLEsuf(i) < RLEsuf(i+1).
RLEsuf(i) is L-type if RLEsuf(i) > RLEsuf(i+1).

  • a4b3a2c7b5a5$ is S-type, because a4b3a2c7b5a5$ < b3a2c7b5a5$.
  • b3a2c7b5a5$ is L-type, because b3a2c7b5a5$ > a2c7b5a5$.

Note: the lex. order < on RLE strings is the same as the lex. order < on decompressed strings.

slide-55
SLIDE 55

Properties of lex. order of RLE suffixes

Lemma
For any 1 ≤ i ≤ n, let ai, pi be the i-th character and exponent of RLE(T), respectively.
For any RLEsuf(i) and RLEsuf(j) with ai = aj,

  • 1. if RLEsuf(i) is L-type and RLEsuf(j) is S-type, then RLEsuf(i) < RLEsuf(j).
  • 2. if RLEsuf(i) and RLEsuf(j) are L-type and pi < pj, then RLEsuf(i) < RLEsuf(j).
  • 3. if RLEsuf(i) and RLEsuf(j) are S-type and pi > pj, then RLEsuf(i) < RLEsuf(j).
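
The three cases of the lemma can be checked by brute force on the running example. This sketch is only a sanity check under assumptions: the function names are hypothetical, the type of a suffix is decided by comparing the first characters of adjacent runs (which always differ), and the last suffix, which has no successor, is left untyped.

```python
from itertools import groupby


def rle(s):
    """Run length encode s as a list of (character, exponent) pairs."""
    return [(c, len(list(g))) for c, g in groupby(s)]


def suffix_types(runs):
    """types[i] is 'S' if RLEsuf(i) < RLEsuf(i+1), else 'L' (0-based i)."""
    # Adjacent runs start with different characters, so comparing those
    # characters already decides the order of the two suffixes.
    return ["S" if runs[i][0] < runs[i + 1][0] else "L" for i in range(len(runs) - 1)]


def lemma_holds(text):
    """Check the three cases of the lemma on every pair of RLE suffixes."""
    runs = rle(text)
    types = suffix_types(runs)
    suf = ["".join(c * p for c, p in runs[i:]) for i in range(len(runs))]
    for i in range(len(types)):
        for j in range(len(types)):
            if i == j or runs[i][0] != runs[j][0]:
                continue
            pi, pj = runs[i][1], runs[j][1]
            case1 = types[i] == "L" and types[j] == "S"
            case2 = types[i] == "L" and types[j] == "L" and pi < pj
            case3 = types[i] == "S" and types[j] == "S" and pi > pj
            if (case1 or case2 or case3) and not suf[i] < suf[j]:
                return False
    return True


T = "aaaabbbaacccccccbbbbbaaaaa$"
print(suffix_types(rle(T)))  # ['S', 'L', 'S', 'L', 'L', 'L']
print(lemma_holds(T))        # True
```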

slide-56
SLIDE 56

Properties of lex. order of RLE suffixes

Lemma (Case 1)
For any RLEsuf(i) and RLEsuf(j) with ai = aj,

  • 1. if RLEsuf(i) is L-type and RLEsuf(j) is S-type, then RLEsuf(i) < RLEsuf(j).

Example: a5$ < a4b3a2c7b5a5$

  • a5$ is L-type (a > $)
  • a4b3a2c7b5a5$ is S-type (a < b)

Decompressed: aaaaa$ < aaaabbbaacccccccbbbbbaaaaa$

slide-57
SLIDE 57

Properties of lex. order of RLE suffixes

Lemma (Case 2)
For any RLEsuf(i) and RLEsuf(j) with ai = aj,

  • 2. if RLEsuf(i) and RLEsuf(j) are L-type and pi < pj, then RLEsuf(i) < RLEsuf(j).

Example: b3a2c7b5a5$ < b5a5$

  • b3a2c7b5a5$ is L-type (b > a)
  • b5a5$ is L-type (b > a)

Decompressed: bbbaacccccccbbbbbaaaaa$ < bbbbbaaaaa$

slide-58
SLIDE 58

Properties of lex. order of RLE suffixes

Lemma (Case 3)
For any RLEsuf(i) and RLEsuf(j) with ai = aj,

  • 3. if RLEsuf(i) and RLEsuf(j) are S-type and pi > pj, then RLEsuf(i) < RLEsuf(j).

Example: a4b3a2c7b5a5$ < a2c7b5a5$

  • a4b3a2c7b5a5$ is S-type (a < b)
  • a2c7b5a5$ is S-type (a < c)

Decompressed: aaaabbbaacccccccbbbbbaaaaa$ < aacccccccbbbbbaaaaa$

slide-59
SLIDE 59

Our results

Theorem 3 (accessing SA)
There is an index which, given an integer 1 ≤ j ≤ u, answers SA[j] in O(log² n) time, and requires n(3 log u + log n + log σ) + 2σ log(u/σ) + O(n log log n) bits of space.

u = |T|, n = |RLE(T)|, σ = |Σ|

  • Use a wavelet tree [Grossi et al., 2003] in place of the RMQ data structure.
  • Then, we can access an arbitrary position of SA, using our RLE-index.