CIAC 2015
An opportunistic text indexing structure based on run length encoding
Yuya Tamakoshi, Keisuke Goto, Shunsuke Inenaga, Hideo Bannai, Masayuki Takeda Kyushu University, Japan
Kyushu U.
Itoshima Peninsula
Input: text string T and pattern string P
Output: all occurrences of P in T
Example: pattern P = compress, text T =
We introduce a general framework which is suitable to capture an essence of compressed pattern matching according to various dictionary based compressions. The goal is to find all occurrences of a pattern in a text without decompression, which is one of the most active topics in string matching. Our framework includes such compression methods as Lempel-Ziv family, (LZ77, LZSS, LZ78, LZW), byte-pair encoding, and the static dictionary based method.
String matching is fundamental to areas such as
Preprocess: build index on fixed text T
Query: pattern string P
Answer: all occurrences of P in T
The goal is to construct a space-efficient index on T which quickly answers string matching queries.
The suffix array SA of text T is an array which stores the beginning positions of the suffixes of T in lexicographic order [Manber & Myers, 1991].
T = cococacao$
$ is an end-marker
which appears only at the end of any string.
Suffixes of T = cococacao$ with their beginning positions:

  1  cococacao$
  2  ococacao$
  3  cocacao$
  4  ocacao$
  5  cacao$
  6  acao$
  7  cao$
  8  ao$
  9  o$
 10  $

Sorting the suffixes in lexicographic order gives SA:

  i  SA[i]  suffix
  1  10     $
  2   6     acao$
  3   8     ao$
  4   5     cacao$
  5   7     cao$
  6   3     cocacao$
  7   1     cococacao$
  8   9     o$
  9   4     ocacao$
 10   2     ococacao$
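The definition above can be tried directly. The following is a minimal sketch that builds SA by sorting all suffixes; it is a naive construction for illustration only, and the function name suffix_array is an assumption, not something from the talk.

```python
def suffix_array(text):
    """Return the 1-based suffix array of text (naive: sort all suffixes)."""
    n = len(text)
    # Sort the beginning positions 1..n by the suffix that starts there.
    # '$' sorts before the letters in ASCII, matching the end-marker convention.
    return sorted(range(1, n + 1), key=lambda i: text[i - 1:])


T = "cococacao$"
SA = suffix_array(T)
print(SA)
```

This takes far more than linear time on long texts; it is only meant to make the SA values on the slide reproducible.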
Binary search a given pattern P on SA. Example: P = coc

  P > cao$                       search continues to the right
  P < the next probed suffix     search continues to the left
  P = prefix of cocacao$    ✔    (SA entry 3)
  P = prefix of cococacao$  ✔    (SA entry 1)

T = cococacao$
    1 2 3 4 5 6 7 8 9 10
P = coc occurs at positions 1 and 3 of T.
All occurrences of P in T can be found in O(m log u + occ) time using SA. The search time can be improved to O(m + log u + occ) using the LCP array.
u = |T|, m = |P|, occ = number of occurrences of P in T
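The O(m log u + occ) search can be sketched as two binary searches over SA, one for each end of the range of suffixes that start with P. This is a plain illustration without the LCP speed-up; the names suffix_array and find_occurrences are assumptions.

```python
def suffix_array(text):
    """Naive 1-based suffix array (sort all suffixes)."""
    return sorted(range(1, len(text) + 1), key=lambda i: text[i - 1:])


def find_occurrences(text, sa, pattern):
    """Return the sorted 1-based positions where pattern occurs in text."""
    m = len(pattern)

    def prefix(rank):
        # First m characters of the suffix at the given SA rank.
        start = sa[rank] - 1
        return text[start:start + m]

    # Lower bound: first rank whose m-prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo

    # Upper bound: first rank whose m-prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= pattern:
            lo = mid + 1
        else:
            hi = mid

    return sorted(sa[first:lo])


T = "cococacao$"
SA = suffix_array(T)
print(find_occurrences(T, SA, "coc"))
```

Each comparison costs O(m) and there are O(log u) of them, which gives the first bound; sorting the reported positions is only done here for readable output.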
Theorem [Manber & Myers, 1991]
There is an index (SA+LCP) which reports all occ occurrences of P in T in O(m + log u + occ) time, and requires 2u log u + u log σ + O(u) bits of space.
u = |T|, m = |P|, σ = |Σ|
Space breakdown: SA & LCP (2u log u), text T (u log σ), auxiliary data structure (O(u)).
This can take too much space for large text T (i.e., for large u).
There are a number of compressed indexes which occupy only the compressed size of the text:
Compressed Suffix Array [Grossi & Vitter, 2000], Lempel-Ziv index [Gagie et al., 2014], etc.
Most of them are slower than SA+LCP.
Our proposal: a new compressed index based on run length encoding (RLE) of the text, which is smaller & faster than SA+LCP.
The run length encoding of text T, denoted RLE(T), is a compressed representation of T in which each maximal run a…a of characters is encoded by a^p (written below as the character followed by p, e.g. a4), where p denotes the length of the maximal run. Applications to RLE include:
T = aaaabbbaacccccccbbbbbaaaaa$ RLE(T) = a4b3a2c7b5a5$
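A minimal sketch of the encoding, reproducing the example above. The helper names rle and rle_to_string are assumptions, and the end-marker $ is printed without an exponent to match the slide notation.

```python
from itertools import groupby


def rle(text):
    """Return the run length encoding of text as a list of (char, exponent)."""
    return [(ch, len(list(run))) for ch, run in groupby(text)]


def rle_to_string(runs):
    """Format runs in the slide notation, e.g. a4b3a2c7b5a5$."""
    return "".join(ch if ch == "$" else f"{ch}{p}" for ch, p in runs)


T = "aaaabbbaacccccccbbbbbaaaaa$"
runs = rle(T)
print(rle_to_string(runs), len(runs))
```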
Let n = |RLE(T)|. For any 1 ≤ i ≤ n, RLEsuf(i) is the suffix of RLE(T) starting with the i-th run.

RLE(T): a4b3a2c7b5a5$   (n = 7)

RLEsuf(1): a4b3a2c7b5a5$
RLEsuf(2): b3a2c7b5a5$
RLEsuf(3): a2c7b5a5$
RLEsuf(4): c7b5a5$
RLEsuf(5): b5a5$
RLEsuf(6): a5$
RLEsuf(7): $
We want to index only RLE suffixes of the text, but simply sorted RLE suffixes don't work!

Sorted RLE suffixes of text (decompressed form on the right). ✔ marks the suffixes in which RLE(P): a2b1 occurs, i.e. those whose first run of a's has exponent at least 2 and is followed by b.

  a5b...   aaaaab...   ✔
  a5b...   aaaaab...   ✔
  a5c...   aaaaac...
  a4b...   aaaab...    ✔
  a4c...   aaaac...
  a4c...   aaaac...
  a4c...   aaaac...
  a3b...   aaab...     ✔
  a3b...   aaab...     ✔

Pattern occurrences are spread out, so we cannot binary search!!
When sorting RLE suffixes, we “ignore” the exponents of the first runs of RLE suffixes of text T. To find occurrences of pattern P, we first “ignore” the exponent of the first run of RLE(P), and find its corresponding range. We then pick up only the occurrences of RLE(P) from this range.
tRLEsuf(i) is RLEsuf(i) with its first exponent pi truncated to 1.

  i   RLEsuf(i)        tRLEsuf(i)
  1   a4b3a2c7b5a5$    a1b3a2c7b5a5$
  2   b3a2c7b5a5$      b1a2c7b5a5$
  3   a2c7b5a5$        a1c7b5a5$
  4   c7b5a5$          c1b5a5$
  5   b5a5$            b1a5$
  6   a5$              a1$
  7   $                $
Our index: Truncated RLE Suffix Array
The tRLE suffix array tRLESA of text T is an array which stores the beginning positions of the tRLE suffixes in lexicographical order.

  i   tRLESA[i]   tRLE suffix
  1   7           $
  2   6           a1$
  3   1           a1b3a2c7b5a5$
  4   3           a1c7b5a5$
  5   5           b1a5$
  6   2           b1a2c7b5a5$
  7   4           c1b5a5$
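To make the tRLESA values reproducible, here is a naive sketch that sorts the tRLE suffixes by their decompressed form. Decompressing defeats the purpose of the index and is done only for illustration (the talk's actual construction is different); the function name trle_suffix_array is an assumption.

```python
def trle_suffix_array(runs):
    """Return the 1-based tRLESA for runs given as (char, exponent) pairs."""

    def key(i):
        # Decompressed tRLEsuf(i): the first run is truncated to exponent 1,
        # the remaining runs are expanded in full.
        first_char, _ = runs[i - 1]
        rest = "".join(ch * p for ch, p in runs[i:])
        return first_char + rest

    return sorted(range(1, len(runs) + 1), key=key)


# RLE(T) = a4b3a2c7b5a5$ from the running example.
runs = [("a", 4), ("b", 3), ("a", 2), ("c", 7), ("b", 5), ("a", 5), ("$", 1)]
tRLESA = trle_suffix_array(runs)
print(tRLESA)
```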
Monotonicity on Truncated RLE Suffix Array
RLE(P): b3c5a2b4. Ignored exponents are shown in parentheses.

  tRLESA   tRLE suffix
  ...      ...
  47       b(2)c5a2b2a6...
  99       b(9)c5a2b3a1...
  11       b(1)c5a2b5a4...
  40       b(2)c5a2b7a3...
  55       b(3)c5a2b8c2...
  72       b(9)c5a2b6c3...
  19       b(1)c5a2b6c3...
  26       b(5)c5a2b4c7...
   4       b(1)c5a2b1c8...
  ...      ...

We first look for bc5a2. The range that bc5a2 matches (the nine rows shown) can be found by a binary search.
We next look for bc5a2b4. Within the range that bc5a2 matches, the exponents of the following run of b's are first monotonically non-decreasing (2, 3, 5, 7) and then monotonically non-increasing (8, 6, 6, 4, 1). The range that bc5a2b4 matches is therefore contiguous: the rows whose run of b's has exponent at least 4, i.e. tRLESA entries 11, 40, 55, 72, 19, 26.
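The monotonicity argument can be sketched as two binary searches, one on the non-decreasing part and one on the non-increasing part of the exponents of the run that follows bc5a2. The sketch assumes the boundary between the two parts (split) is already known; how the index locates it is not shown on the slides, and the function name is hypothetical.

```python
def range_with_exponent_at_least(exps, split, k):
    """Return (start, end) such that exps[start:end] are exactly the values >= k.

    Assumes exps[:split] is non-decreasing and exps[split:] is non-increasing.
    """
    # Non-decreasing part: find the first position whose value is >= k.
    lo, hi = 0, split
    while lo < hi:
        mid = (lo + hi) // 2
        if exps[mid] < k:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # Non-increasing part: find the first position whose value is < k.
    lo, hi = split, len(exps)
    while lo < hi:
        mid = (lo + hi) // 2
        if exps[mid] >= k:
            lo = mid + 1
        else:
            hi = mid
    return start, lo


# Exponents of the run of b's after bc5a2, in tRLESA order (example rows).
following = [2, 3, 5, 7, 8, 6, 6, 4, 1]
tRLESA = [47, 99, 11, 40, 55, 72, 19, 26, 4]
start, end = range_with_exponent_at_least(following, split=4, k=4)
print(tRLESA[start:end])
```

With k = 4 (the last exponent of RLE(P): b3c5a2b4) this selects the third to eighth rows of the example.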
Matching with Truncated RLE Suffix Array
RLE(P): b3c5a2b4
Based on the monotonicity, the range that bc5a2b4 matches can be found by a binary search.
We finally look for b3c5a2b4. From this range we want only those suffixes whose 1st exponents are at least 3. We use an array of ignored exponents:

  tRLESA   exponent   tRLE suffix
  ...      ...        ...
  47       2          b(2)c5a2b2a6...
  99       9          b(9)c5a2b3a1...
  11       1          b(1)c5a2b5a4...    in range
  40       2          b(2)c5a2b7a3...    in range
  55       3          b(3)c5a2b8c2...    in range ✔
  72       9          b(9)c5a2b6c3...    in range ✔
  19       1          b(1)c5a2b6c3...    in range
  26       5          b(5)c5a2b4c7...    in range ✔
   4       1          b(1)c5a2b1c8...
  ...      ...        ...

A Range Maximum Query (RMQ) on the range finds the maximum exponent 9 (tRLESA entry 72) ✔.
We perform RMQ's recursively, in the 1st & 2nd halves of the range: they find exponent 3 (entry 55) ✔ and exponent 5 (entry 26) ✔.
Recursion ends when the range maximum is less than 3.
The number of RMQ's we perform is O(occ). Each RMQ takes O(1) time [Fischer & Heun, 2011].
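The recursive reporting step can be sketched as follows. A linear scan stands in for the O(1)-time RMQ data structure, so this sketch does not achieve the stated bound; it only shows the recursion pattern. The function name report_at_least is an assumption.

```python
def report_at_least(exps, lo, hi, k):
    """Return all indices i in [lo, hi] (0-based, inclusive) with exps[i] >= k."""
    found = []

    def recurse(left, right):
        if left > right:
            return
        # Stand-in for an O(1) range maximum query: position of the maximum.
        best = max(range(left, right + 1), key=lambda i: exps[i])
        if exps[best] < k:
            return  # the range maximum is too small, so nothing here qualifies
        recurse(left, best - 1)   # part of the range before the maximum
        found.append(best)
        recurse(best + 1, right)  # part of the range after the maximum

    recurse(lo, hi)
    return found


tRLESA = [47, 99, 11, 40, 55, 72, 19, 26, 4]
exponents = [2, 9, 1, 2, 3, 9, 1, 5, 1]
# The range that bc5a2b4 matches is the third to eighth row (0-based 2..7).
hits = report_at_least(exponents, 2, 7, 3)
print([tRLESA[i] for i in hits])
```

Every reported position triggers at most two further queries, and a query that reports nothing stops immediately, which is why the number of RMQ's is O(occ).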
Theorem 1 (RLE-index)
There is an index which, given RLE(P), reports all occ occurrences of P in T in O(q + log n + occ) time, and requires 2n log u + n log σ + n log n + O(n) bits of space.
u = |T|, n = |RLE(T)| (n ≤ u), q = |RLE(P)|, σ = |Σ|
Time: SA+LCP takes O(m + log u + occ) time for pattern matching (m = |P|). Since q ≤ m and n ≤ u always hold, our RLE-index is always faster than SA+LCP.
Space: SA+LCP requires 2u log u + u log σ + O(u) bits of space. Our RLE-index is smaller when text T is compressible with RLE.
Theorem 2 (Construction time & space)
Given RLE(T) of size n, the RLE-index of T can be constructed in O(n log n) time with O(n log u) bits of working space.
u = |T|, n = |RLE(T)|
We introduced new combinatorial properties of RLE suffixes. We also use the idea of induced-sorting [Nong et al., 2011] which was originally designed for fast suffix array construction.
Our RLE-index is always faster than SA+LCP.
Our RLE-index is smaller than SA+LCP when the text is compressible by RLE (i.e. when the n log n term is negligible).
Comparisons to other compressed indexes (e.g., FM-index, compressed SA, LZ-index).
The LCP array of T stores the length of the longest common prefix of neighboring suffixes in SA of T. Below, LCP[i] is the value for the suffixes at SA[i-1] and SA[i].

  i   SA[i]   LCP[i]   suffix
  1   10      -        $
  2    6      0        acao$
  3    8      1        ao$
  4    5      0        cacao$
  5    7      2        cao$
  6    3      1        cocacao$
  7    1      3        cococacao$
  8    9      0        o$
  9    4      1        ocacao$
 10    2      2        ococacao$
The length of the LCP of any two suffixes can also be computed by a range minimum query on the LCP array.
Range minimum query
For any integer array of length k, there is a data structure which supports range minimum query in O(1) time, and requires 2k + o(k) bits.
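A small sketch of how a range minimum over the LCP array yields the LCP of two arbitrary suffixes. The minimum is computed by a plain scan here rather than by the O(1)-time structure mentioned above, and the convention that lcp[i] compares SA ranks i-1 and i (with lcp[0] = 0 as a placeholder) is an assumption of this sketch.

```python
def lcp_array(text, sa):
    """lcp[i] = LCP length of the suffixes at SA ranks i-1 and i; lcp[0] = 0."""

    def lcp(a, b):
        # a and b are 0-based starting offsets into text.
        length = 0
        while (a + length < len(text) and b + length < len(text)
               and text[a + length] == text[b + length]):
            length += 1
        return length

    return [0] + [lcp(sa[i - 1] - 1, sa[i] - 1) for i in range(1, len(sa))]


def lcp_of_ranks(lcp, i, j):
    """LCP length of the suffixes at SA ranks i < j (0-based): min of lcp[i+1..j]."""
    return min(lcp[i + 1:j + 1])


T = "cococacao$"
SA = [10, 6, 8, 5, 7, 3, 1, 9, 4, 2]
LCP = lcp_array(T, SA)
print(LCP)
# Ranks 3 and 6 hold cacao$ and cococacao$, which share only the prefix "c".
print(lcp_of_ranks(LCP, 3, 6))
```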
RLEsuf(i) is S-type if RLEsuf(i) < RLEsuf(i+1).
RLEsuf(i) is L-type if RLEsuf(i) > RLEsuf(i+1).
a4b3a2c7b5a5$ is S-type, because a4b3a2c7b5a5$ < b3a2c7b5a5$.
b3a2c7b5a5$ is L-type, because b3a2c7b5a5$ > a2c7b5a5$.
※ Lex. order < on RLE strings is the same as the lex. order < on decompressed strings.
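Because adjacent runs of RLE(T) always have different characters, the type of RLEsuf(i) is decided by comparing the i-th and (i+1)-th run characters alone. A minimal sketch, with the function name suffix_types as an assumption; the last suffix ($) has no successor and gets no type here.

```python
def suffix_types(runs):
    """Return 'S' or 'L' for RLEsuf(1..n-1), given runs as (char, exponent)."""
    types = []
    for (a, _), (b, _) in zip(runs, runs[1:]):
        # RLEsuf(i) starts with a, RLEsuf(i+1) starts with b, and a != b,
        # so the first characters already decide the comparison.
        types.append("S" if a < b else "L")
    return types


# RLE(T) = a4b3a2c7b5a5$
runs = [("a", 4), ("b", 3), ("a", 2), ("c", 7), ("b", 5), ("a", 5), ("$", 1)]
types = suffix_types(runs)
print(types)
```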
Properties of lex. order of RLE suffixes
For any 1 ≤ i ≤ n, let ai, pi be the ith character and exponent of RLE(T), respectively.
Lemma
For any RLEsuf(i) and RLEsuf(j) with ai = aj, RLEsuf(i) < RLEsuf(j) holds in each of the following cases.

Case 1: RLEsuf(i) is L-type and RLEsuf(j) is S-type.
  a5$ < a4b3a2c7b5a5$
  aaaaa$ < aaaabbbaacccccccbbbbbaaaaa$
  a5$ is L-type (a > $), a4b3a2c7b5a5$ is S-type (a < b)

Case 2: both are L-type and pi < pj.
  b3a2c7b5a5$ < b5a5$
  bbbaacccccccbbbbbaaaaa$ < bbbbbaaaaa$
  b3a2c7b5a5$ is L-type (b > a), b5a5$ is L-type (b > a)

Case 3: both are S-type and pi > pj.
  a4b3a2c7b5a5$ < a2c7b5a5$
  aaaabbbaacccccccbbbbbaaaaa$ < aacccccccbbbbbaaaaa$
  a4b3a2c7b5a5$ is S-type (a < b), a2c7b5a5$ is S-type (a < c)
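The three cases can be checked by brute force on the running example. The sketch below assumes the case conditions are the ones suggested by the examples (L-type versus S-type; both L-type with the smaller exponent first; both S-type with the larger exponent first), since the conditions themselves are not legible in the extracted slides. All names are hypothetical.

```python
def check_lemma(runs):
    """Count the (i, j) pairs covered by the three cases; assert the order holds."""

    def decompress(i):
        return "".join(ch * p for ch, p in runs[i:])

    def is_s_type(i):
        # Adjacent runs have different characters, so this decides the type.
        return runs[i][0] < runs[i + 1][0]

    checked = 0
    last = len(runs) - 1  # the final run ($) has no type and is skipped
    for i in range(last):
        for j in range(last):
            if i == j or runs[i][0] != runs[j][0]:
                continue
            pi, pj = runs[i][1], runs[j][1]
            case1 = (not is_s_type(i)) and is_s_type(j)
            case2 = (not is_s_type(i)) and (not is_s_type(j)) and pi < pj
            case3 = is_s_type(i) and is_s_type(j) and pi > pj
            if case1 or case2 or case3:
                assert decompress(i) < decompress(j)
                checked += 1
    return checked


# RLE(T) = a4b3a2c7b5a5$
runs = [("a", 4), ("b", 3), ("a", 2), ("c", 7), ("b", 5), ("a", 5), ("$", 1)]
print(check_lemma(runs))
```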
Theorem 3 (accessing SA)
There is an index which, given an integer 1 ≤ j ≤ u, answers SA[j] in O(log^2 n) time, and requires n(3 log u + log n + log σ) + 2σ log(u/σ) + O(n log log n) bits of space.
u = |T|, n = |RLE(T)|, σ = |Σ|
Use a wavelet tree [Grossi et al., 2003] in place of RMQ data structure. Then, we can access arbitrary position