Overview Monte Carlo LCE structure Deterministic data structure
In-place Longest Common Extensions
Nicola Prezza, University of Udine, Department of Computer Science
Dagstuhl Seminar 16431: "Computation over Compressed Structured
Longest Common Extension queries
Position: 0 1 2 3 4 5 6 7 8 9
T =       a a b a b a b a a b
LCE(1, 5) = 3: the suffixes starting at positions 1 and 5 share the prefix aba and then differ.
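The query in this example can be reproduced with a character-by-character scan. This is only a minimal sketch of the "store only T" baseline with O(ℓ) query time, not the structure presented in the talk; the function name `lce_naive` is made up for illustration.

```python
def lce_naive(text, i, j):
    """Length of the longest common prefix of text[i:] and text[j:]."""
    n = len(text)
    length = 0
    # Advance while both positions are inside the text and the characters agree.
    while i + length < n and j + length < n and text[i + length] == text[j + length]:
        length += 1
    return length


T = "aabababaab"  # the example text, positions 0..9
print(lce_naive(T, 1, 5))
```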
State of the art
Space (bits)            Query time   Build time        Reference
O(n log n)              O(1)         O(n)              ST + LCA
O(n log n)              O(1)         O(n)              RMQ + LCP
n⌈log_2 σ⌉ + O(nw/τ)    O(τ)         O(n^{2+ε})        [Bille2015]
n⌈log_2 σ⌉ + O(nw/τ)    O(τ)         O(n^{3/2}) exp.   [Bille2015]
n⌈log_2 σ⌉ + O(nw/τ)    O(τ log τ)   O(nτ)             [Tanimura2016]
n⌈log_2 σ⌉              O(ℓ)         -                 store only T
ℓ = LCE(i, j)
Result presented
Deterministic data structure of size n⌈log_2 σ⌉ bits supporting:
Optimal O(m log σ/w)-time extraction of T[i, ..., i + m − 1]
O(log^2 ℓ)-time LCE queries
Construction: O(n log n) expected time and O(n) words of space
In-place data structure: no little-o terms
LCE improvable to O(log ℓ) using O(log n) words of additional space
Applications
In-place algorithms to:
Compute the suffix array in O(n log^2 n) expected time (exact)
Compute the LCP array in O(n log^2 n) expected time (exact)
Sparse suffix sorting (Monte Carlo)
Steps
Replace the text with Karp-Rabin fingerprints of a subset of its prefixes
Choose the modulus q at random in such a way that the fingerprints can be statistically compressed down to n⌈log_2 σ⌉ bits
De-randomize
For simplicity, only the binary case σ = 2 is considered here. It is easy to extend to σ ∈ O(w).
1. Choose a block size τ ∈ Θ(w)
2. Choose a random τ-bit prime q (the modulus of the KR function)
3. Choose a uniform seed s̄ ∈ [0, q)
4. Left-pad T with s̄
5. Break the text into τ-bit blocks: array B[1, ..., n/τ] of τ-bit integers
Example: τ = 5, q = 10001 (= 17), s̄ = 00101
B = 00101 01011 11010 10101 11010 00001
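Steps 4 and 5 can be sketched as follows for a binary text given as a string of '0'/'1' characters. The helper name `to_blocks` is hypothetical, and the sketch assumes that the padded length is a multiple of τ; choosing τ, the prime q and the seed at random (steps 1 to 3) is left out.

```python
def to_blocks(text_bits, seed, tau):
    """Left-pad the binary text with the tau-bit seed and cut it into tau-bit integers."""
    padded = format(seed, "0{}b".format(tau)) + text_bits
    # Assumption of this sketch: no partial block at the end.
    if len(padded) % tau != 0:
        raise ValueError("padded text length must be a multiple of tau")
    return [int(padded[k:k + tau], 2) for k in range(0, len(padded), tau)]


# The example above: seed 00101 followed by the 25 text bits.
B = to_blocks("0101111010101011101000001", seed=0b00101, tau=5)
print([format(b, "05b") for b in B])
```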
Build the array P′ of Karp-Rabin fingerprints of the prefixes ending at block boundaries, and add a bitvector D[1, ..., n/τ] marking P′[i] ≥ q
Example: τ = 5, q = 10001 (= 17)
B = 00101 01011 11010 10101 11010 00001
P′ = 01101 10010 01110 10101 00101 01011
D = 1 1
Property 1
With P′ and D we can recover B (and therefore T):
1. If B[i] < q: B[i] = P′[i] − 2^τ · P′[i − 1] mod q
2. If B[i] ≥ q, the following holds: B[i] mod q = B[i] − q ⇒ add q to the value in (1)
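Property 1 can be checked with a small round trip. The sketch below makes two assumptions that are not spelled out on the slide: the fingerprints follow the recurrence P′[i] = (P′[i − 1] · 2^τ + B[i]) mod q with P′[0] = 0, and the bitvector records which blocks satisfy B[i] ≥ q, which is the information the two cases of Property 1 need. The function names are hypothetical, and the fingerprints it computes are not meant to reproduce the bit patterns shown for P′ in the example.

```python
def build_fingerprints(blocks, tau, q):
    """Prefix fingerprints at block boundaries plus a flag per block with value >= q."""
    shift = pow(2, tau, q)
    prefix_fp, overflow = [], []
    fp = 0
    for b in blocks:
        fp = (fp * shift + b) % q
        prefix_fp.append(fp)
        overflow.append(1 if b >= q else 0)
    return prefix_fp, overflow


def recover_blocks(prefix_fp, overflow, tau, q):
    """Invert build_fingerprints using the two cases of Property 1."""
    shift = pow(2, tau, q)
    blocks, prev = [], 0
    for fp, flag in zip(prefix_fp, overflow):
        b = (fp - shift * prev) % q        # case 1: equals B[i] mod q
        blocks.append(b + q if flag else b)  # case 2: add q when B[i] >= q
        prev = fp
    return blocks


B = [0b00101, 0b01011, 0b11010, 0b10101, 0b11010, 0b00001]
P_prime, D = build_fingerprints(B, tau=5, q=17)
print(recover_blocks(P_prime, D, tau=5, q=17) == B)
```

The recovery works because q is a τ-bit number, so every block value is below 2q and a single addition of q is enough.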
P′ and D take n + n/τ bits of space and support:
Optimal-time text extraction
Computation of the Karp-Rabin fingerprint of any text substring ⇒ LCE queries in O(log ℓ) steps of exponential + binary search (O(log^2 ℓ) total time, because we need to compute powers of 2 mod q)
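The exponential + binary search can be sketched as below. To keep it short, the sketch stores a fingerprint for every prefix of a binary string instead of only those at block boundaries, and it fixes the modulus to the Mersenne prime 2^61 − 1 rather than drawing a random τ-bit prime; both are assumptions of this illustration, as are the function names. Like the structure described here, it is Monte Carlo: a fingerprint collision would make it overestimate the answer.

```python
Q = (1 << 61) - 1  # fixed prime modulus, an assumption of this sketch


def prefix_fingerprints(bits, q=Q):
    """pre[k] is the fingerprint of bits[0:k], reading the bits as a base-2 number mod q."""
    pre = [0]
    for c in bits:
        pre.append((pre[-1] * 2 + int(c)) % q)
    return pre


def substring_fp(pre, i, length, q=Q):
    """Fingerprint of bits[i:i+length]; pow() is the 'powers of 2 mod q' cost."""
    return (pre[i + length] - pre[i] * pow(2, length, q)) % q


def lce(pre, i, j, q=Q):
    limit = len(pre) - 1 - max(i, j)  # longest extension that stays inside the text
    length, step = 0, 1
    # Exponential phase: extend by chunks of length 1, 2, 4, ... while they match.
    while length + step <= limit and \
            substring_fp(pre, i + length, step, q) == substring_fp(pre, j + length, step, q):
        length += step
        step *= 2
    # Binary phase: fewer than `step` matching characters remain; halve the chunk size.
    while step > 1:
        step //= 2
        if length + step <= limit and \
                substring_fp(pre, i + length, step, q) == substring_fp(pre, j + length, step, q):
            length += step
    return length


pre = prefix_fingerprints("0010101001")  # the text aabababaab with a = 0, b = 1
print(lce(pre, 1, 5))
```

Every comparison involves two substrings whose common length is a power of two, and there are O(log ℓ) comparisons, each paying for one modular exponentiation.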
Can we reduce space to n bits?
Idea
Pick q in such a way that few P′[i] start with a 1
Property: each P′[i] is a uniform number in [0, q) (thanks to the seed)
Example: the 4-bit values with τ = 4 and q = 1011 (= 11)
0000 0001 0010 0011 0100 0101 0110 0111 1000 1001 1010 1011 = q 1100 1101 1110 1111
P(P′[i] begins with 1) = red/(red + black) = (q − 2^{τ−1})/q
Here red counts the values below q that begin with 1 (1000, 1001, 1010) and black counts the values that begin with 0.
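The ratio can be confirmed by direct counting. This is a small sanity check rather than part of the construction, and the function name is made up for illustration.

```python
from fractions import Fraction


def prob_leading_one(q, tau):
    """Fraction of the values in [0, q) whose tau-bit representation starts with 1."""
    high = sum(1 for v in range(q) if (v >> (tau - 1)) == 1)
    return Fraction(high, q)


# tau = 4, q = 11: three of the eleven values start with 1.
print(prob_leading_one(11, 4), Fraction(11 - 2 ** 3, 11))
```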
Goal: few P′[i] starting with 1. Solve (q − 2^{τ−1})/q ≤ 1/n
Result
Pick q uniformly from Z = [2^{τ−1}, 2^{τ−1} · n/(n − 1)]
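The upper end of the interval follows from the inequality in two steps, assuming n > 1; the lower end only states that q is a τ-bit number.

```latex
\frac{q - 2^{\tau-1}}{q} \le \frac{1}{n}
\;\iff\; q\left(1 - \frac{1}{n}\right) \le 2^{\tau-1}
\;\iff\; q \le 2^{\tau-1}\cdot\frac{n}{n-1},
\qquad\text{and}\qquad q \ge 2^{\tau-1}.
```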
Final step: build the array P by removing the first bit from each P′[i], and store the ranks of the P′-blocks starting with 1 in an array S
Example: τ = 5
P = 1101 0010 1110 0101 0101 1011
D = 1 1
S = {2, 4}
E[|S| + |P| + |D|] = n + O(w) bits
Construction
Pick pairs q, s̄ until the overall size is n bits (+ O(1) words) ⇒ O(n) expected construction time
τ = (8 + c)w for any constant c (see the paper for why)
LCE failure probability ≤ n^{−c} (proof in the paper)
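The final step and its inverse can be sketched on the P′ blocks of the earlier example. The function names are hypothetical, and S is kept as a plain Python list of 1-based ranks rather than in any compact encoding.

```python
def strip_leading_bits(p_prime, tau):
    """Drop the top bit of every tau-bit block; remember the ranks where it was 1."""
    mask = (1 << (tau - 1)) - 1
    P = [v & mask for v in p_prime]
    S = [rank for rank, v in enumerate(p_prime, start=1) if (v >> (tau - 1)) == 1]
    return P, S


def restore_leading_bits(P, S, tau):
    """Rebuild the tau-bit blocks from the stripped blocks and the ranks in S."""
    marked = set(S)
    return [v | (1 << (tau - 1)) if rank in marked else v
            for rank, v in enumerate(P, start=1)]


P_prime = [0b01101, 0b10010, 0b01110, 0b10101, 0b00101, 0b01011]
P, S = strip_leading_bits(P_prime, tau=5)
print([format(v, "04b") for v in P], S)
```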
In-place construction
We can replace T with our structure in O(n) expected time while using only O(1) additional words of working space. The construction can be inverted in the same space and time (restoring the text).
Applications
Suffix sorting
It is easy to compare two text suffixes lexicographically using LCE queries
Result 1: in-place sparse suffix sorting
Any set S = {i_1, ..., i_b} of b suffixes of a text T ∈ Σ^n can be sorted correctly with high probability in O(n + b log b · log^2 n) expected time using O(1) words of space on top of T and S
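The comparison rule can be sketched as follows: skip the common prefix with one LCE query, then compare the first differing characters. The sketch plugs in a naive LCE so that it runs on its own, and it uses Python's `sorted`, so it is neither in-place nor within the stated time bound; all function names are made up for illustration.

```python
from functools import cmp_to_key


def lce_naive(text, i, j):
    """Stand-in LCE; any LCE structure with the same interface could replace it."""
    k = 0
    while i + k < len(text) and j + k < len(text) and text[i + k] == text[j + k]:
        k += 1
    return k


def compare_suffixes(text, i, j, lce=lce_naive):
    """Negative, zero or positive as text[i:] is smaller than, equal to or larger than text[j:]."""
    if i == j:
        return 0
    ell = lce(text, i, j)
    a, b = i + ell, j + ell
    if a == len(text):   # suffix i is a proper prefix of suffix j
        return -1
    if b == len(text):   # suffix j is a proper prefix of suffix i
        return 1
    return -1 if text[a] < text[b] else 1


def sort_suffixes(text, positions):
    return sorted(positions, key=cmp_to_key(lambda i, j: compare_suffixes(text, i, j)))


print(sort_suffixes("aabababaab", [1, 5, 8, 2]))
```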
Important: while computing LCE queries, in the exponential/binary searches we only compare (fingerprints of) text substrings of length 2^e
Theorem 1
In O(n log n) expected time and O(n) words of space we can check whether the KR function is collision-free over all pairs of substrings of T having the same length k = 2^e, for all 0 ≤ e ≤ log_2 n
Theorem 2
In O(n log^2 n) worst-case time and n words of space (on top of T) we can check whether the KR function is collision-free over all pairs of substrings of T having the same length k = 2^e, for all 0 ≤ e ≤ log_2 n
⇒ our deterministic structure can be built in O(n log n) expected time and linear space
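What the two theorems verify can be illustrated with a brute-force check. This is not the algorithm behind either theorem: it stores explicit substrings and is far slower and larger than the stated bounds. It only makes the property "collision-free over all equal-length substrings of length 2^e" concrete for a binary string; the function name and the base-2 fingerprint are assumptions of the sketch.

```python
def is_collision_free(bits, q):
    """True if no two different substrings of the same power-of-two length share a fingerprint."""
    n = len(bits)
    pre = [0]
    for c in bits:
        pre.append((pre[-1] * 2 + int(c)) % q)
    k = 1
    while k <= n:
        shift = pow(2, k, q)
        seen = {}  # fingerprint -> first substring observed with it
        for i in range(n - k + 1):
            fp = (pre[i + k] - pre[i] * shift) % q
            sub = bits[i:i + k]
            if seen.setdefault(fp, sub) != sub:
                return False  # two different substrings collide
        k *= 2
    return True


print(is_collision_free("0010101001", (1 << 61) - 1))
print(is_collision_free("0010101001", 3))
```

With the tiny modulus 3 the substrings 0010 and 0101 collide, which is the kind of event the de-randomization step rules out before the structure is declared deterministic.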