In-place Longest Common Extensions (Nicola Prezza)



SLIDE 1

Overview Monte Carlo LCE structure Deterministic data structure

In-place Longest Common Extensions

Nicola Prezza

University of Udine, department of Computer Science

Dagstuhl Seminar 16431: "Computation over Compressed Structured Data"

SLIDE 2

Longest Common Extension queries

      0 1 2 3 4 5 6 7 8 9
  T = a a b a b a b a a b

  LCE(1, 5) = 3
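The query in the example can be reproduced with a character-by-character scan. This is only a minimal baseline sketch (the function name `lce_naive` is an assumption made here, not something from the slides); it takes O(ℓ) time per query and stores only T.

```python
def lce_naive(text, i, j):
    """Length of the longest common prefix of text[i:] and text[j:]."""
    length = 0
    while (i + length < len(text) and j + length < len(text)
           and text[i + length] == text[j + length]):
        length += 1
    return length


T = "aabababaab"
print(lce_naive(T, 1, 5))  # 3, as in the example above
```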

SLIDE 3

State of the art

  Space (bits)             Query time    Build time             Reference
  O(n log n)               O(1)          O(n)                   ST + LCA
  O(n log n)               O(1)          O(n)                   RMQ + LCP
  n⌈log_2 σ⌉ + O(nw/τ)     O(τ)          O(n^{2+ε})             [Bille2015]
  n⌈log_2 σ⌉ + O(nw/τ)     O(τ)          O(n^{3/2}) expected    [Bille2015]
  n⌈log_2 σ⌉ + O(nw/τ)     O(τ log τ)    O(nτ)                  [Tanimura2016]
  n⌈log_2 σ⌉               O(ℓ)          -                      store only T

  ℓ = LCE(i, j)

SLIDE 4

Result presented

Deterministic data structure of size n⌈log_2 σ⌉ bits supporting

  • Optimal O(m log σ / w)-time extraction of T[i, ..., i + m − 1]
  • O(log^2 ℓ)-time LCE queries

Construction: O(n log n) expected time and O(n) words of space

  • In-place data structure: no little-o terms
  • LCE improvable to O(log ℓ) time using O(log n) words of additional space

SLIDE 5

Applications

In-place algorithms to

  • Compute the suffix array in O(n log^2 n) expected time (exact)
  • Compute the LCP array in O(n log^2 n) expected time (exact)
  • Sparse suffix sorting (Monte Carlo)

SLIDE 6

Steps

  • Replace the text with Karp-Rabin fingerprints of a subset of its prefixes
  • Choose the modulus q at random in such a way that the fingerprints can be statistically compressed down to n⌈log_2 σ⌉ bits
  • De-randomize

For simplicity, only the binary case σ = 2 is considered here. It is easy to extend to σ ∈ O(w).
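As a reminder of what a Karp-Rabin fingerprint is in the binary case, the sketch below reads a bit string as a binary number and reduces it modulo q with Horner's rule. The function name `kr_fingerprint` is chosen here for illustration, and the random choice of q is left out.

```python
def kr_fingerprint(bits, q):
    """Karp-Rabin fingerprint of a bit string: its binary value modulo q."""
    value = 0
    for b in bits:
        # Horner's rule: shift the running value left by one bit, add the new bit
        value = (2 * value + int(b)) % q
    return value


print(kr_fingerprint("00101", 17))  # 5
```

Appending a block to a prefix only needs the previous fingerprint, which is why storing fingerprints of prefixes is convenient.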

SLIDE 7

  1. Choose a block size τ ∈ Θ(w)
  2. Choose a random τ-bit prime q (the modulus of the KR function)
  3. Choose a uniform seed s̄ ∈ [0, q)
  4. Left-pad T with s̄
  5. Break the text into τ-bit blocks: array B[1, ..., n/τ] of τ-bit integers

Example: τ = 5, q = 10001 (= 17), s̄ = 00101

  B = 00101 01011 11010 10101 11010 00001
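Steps 4 and 5 can be sketched as follows. The helper name `to_blocks` is an assumption, the seed is passed as a τ-bit string, and the padded text length is assumed to be a multiple of τ; drawing q and the seed at random is not shown.

```python
def to_blocks(text_bits, seed_bits, tau):
    """Left-pad the text with the seed and cut it into tau-bit integers."""
    padded = seed_bits + text_bits
    if len(seed_bits) != tau or len(padded) % tau != 0:
        raise ValueError("seed must be tau bits and padded length a multiple of tau")
    return [int(padded[k:k + tau], 2) for k in range(0, len(padded), tau)]


# Text from the example, written block by block for readability
T = "01011" "11010" "10101" "11010" "00001"
B = to_blocks(T, seed_bits="00101", tau=5)
print([format(block, "05b") for block in B])
```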

SLIDE 8

Build array P′ of Karp-Rabin fingerprints of the prefixes ending at block boundaries, and add a bitvector D[1, ..., n/τ] marking the blocks with B[i] ≥ q

Example: τ = 5, q = 10001 (= 17)

  B  = 00101 01011 11010 10101 11010 00001
  P′ = 01101 10010 01110 10101 00101 01011
  D  = 1 1

Property 1: with P′ and D we can recover B (and therefore T):

  1. If B[i] < q: B[i] = (P′[i] − 2^τ · P′[i − 1]) mod q
  2. If B[i] ≥ q, then B[i] mod q = B[i] − q, so add q to the value in (1)
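Property 1 can be checked with a small round trip. The sketch below assumes that D[i] = 1 exactly when B[i] ≥ q (which is what the two recovery cases need), that P′[0] = 0, and that every block is below 2q, which holds when q is a τ-bit number. The function names are illustrative only.

```python
def build_fingerprints(B, q, tau):
    """Return (P', D): prefix fingerprints at block boundaries and the B[i] >= q marks."""
    shift = pow(2, tau, q)
    P, D, prev = [], [], 0
    for block in B:
        prev = (prev * shift + block) % q  # fingerprint of the prefix ending at this block
        P.append(prev)
        D.append(1 if block >= q else 0)
    return P, D


def recover_blocks(P, D, q, tau):
    """Invert build_fingerprints using the two cases of Property 1."""
    shift = pow(2, tau, q)
    B, prev = [], 0
    for fingerprint, at_least_q in zip(P, D):
        block = (fingerprint - shift * prev) % q  # case 1: this is B[i] mod q
        if at_least_q:
            block += q                            # case 2: B[i] = (B[i] mod q) + q
        B.append(block)
        prev = fingerprint
    return B


blocks = [0b00101, 0b01011, 0b11010, 0b10101, 0b11010, 0b00001]
P, D = build_fingerprints(blocks, q=17, tau=5)
assert recover_blocks(P, D, q=17, tau=5) == blocks
```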

SLIDE 9

P′ and D take n + n/τ bits of space and support:

  • Optimal-time text extraction
  • Computation of the Karp-Rabin fingerprint of any text substring
    ⇒ LCE queries in O(log ℓ) steps of exponential + binary search (O(log^2 ℓ) total time, because we need to compute powers of 2 mod q)
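The search can be sketched on a simplified structure. The class below stores one fingerprint per bit-prefix rather than one per block, so it is not the space-efficient structure of the talk. The class name, the method names and the modulus 2^61 − 1 are assumptions made for the illustration. Each step compares fingerprints of substrings whose length is a power of two, and each comparison calls `pow(2, length, q)`, which is where the extra logarithmic factor comes from.

```python
class FingerprintLCE:
    """Monte Carlo LCE over a bit string, using prefix fingerprints modulo q."""

    def __init__(self, bits, q):
        self.n = len(bits)
        self.q = q
        self.prefix = [0]  # prefix[k] = fingerprint of bits[:k]
        for b in bits:
            self.prefix.append((2 * self.prefix[-1] + int(b)) % q)

    def fingerprint(self, start, length):
        """Fingerprint of bits[start:start + length]."""
        shift = pow(2, length, self.q)
        return (self.prefix[start + length] - self.prefix[start] * shift) % self.q

    def _same(self, i, j, offset, length):
        return self.fingerprint(i + offset, length) == self.fingerprint(j + offset, length)

    def lce(self, i, j):
        limit = self.n - max(i, j)
        matched, step = 0, 1
        # Exponential phase: extend the match by 1, 2, 4, ... positions
        while matched + step <= limit and self._same(i, j, matched, step):
            matched += step
            step *= 2
        # Binary phase: the remaining match is shorter than step, so halve it
        step //= 2
        while step >= 1:
            if matched + step <= limit and self._same(i, j, matched, step):
                matched += step
            step //= 2
        return matched


# T = a a b a b a b a a b encoded with a = 0, b = 1
index = FingerprintLCE("0010101001", q=(1 << 61) - 1)
print(index.lce(1, 5))  # 3
```

With a modulus this large and a text this short no two distinct substrings collide, so the answers are exact here; in general the result is correct only with high probability.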

Can we reduce space to n bits?

SLIDE 10

Idea: pick q in such a way that few P′[i] start with a 1

Property: each P′[i] is a uniform number in [0, q) (thanks to the seed)

Combinations of block values with τ = 4, q = 1011 (= 11):

  0000 0001 0010 0011 0100 0101 0110 0111    (black: below q, leading bit 0)
  1000 1001 1010                             (red: below q, leading bit 1)
  1011 = q  1100 1101 1110 1111              (not below q)

P(P′[i] begins with 1) = red / (red + black) = (q − 2^{τ−1}) / q
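For the example with τ = 4 and q = 11, the probability can be obtained by direct enumeration. This is a small check under the stated uniformity property; the function name is made up for the illustration and q is assumed to be a τ-bit number.

```python
from fractions import Fraction


def prob_leading_one(q, tau):
    """Probability that a uniform value in [0, q) has its tau-th bit set."""
    half = 1 << (tau - 1)
    leading_one = sum(1 for value in range(q) if value >= half)
    return Fraction(leading_one, q)


print(prob_leading_one(q=11, tau=4))  # 3/11
```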

SLIDE 11

Goal: few P′[i] starting with 1. Solve (q − 2^{τ−1}) / q ≤ 1/n

Result: pick q uniformly from Z = [2^{τ−1}, 2^{τ−1} · n / (n − 1)]
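Rearranging the inequality gives q(n − 1) ≤ 2^{τ−1} · n, which is the upper end of the interval. The sketch below checks this numerically for small, arbitrarily chosen values of τ and n; it ignores the requirement that q be prime, and the function names are illustrative.

```python
from fractions import Fraction


def modulus_interval(tau, n):
    """Integer endpoints of [2^(tau-1), 2^(tau-1) * n / (n - 1)], for n >= 2."""
    low = 1 << (tau - 1)
    high = (low * n) // (n - 1)  # floor of the upper endpoint
    return low, high


def leading_one_probability(q, tau):
    return Fraction(q - (1 << (tau - 1)), q)


low, high = modulus_interval(tau=10, n=8)
# Every q in the interval meets the 1/n bound
assert all(leading_one_probability(q, 10) <= Fraction(1, 8)
           for q in range(low, high + 1))
# The first integer past the interval does not
assert leading_one_probability(high + 1, 10) > Fraction(1, 8)
```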

SLIDE 12

Final step: build array P by removing the first bit from each P′[i], and store the ranks of the P′-blocks starting with 1 in an array S

Example: τ = 5

  P = 1101 0010 1110 0101 0101 1011
  D = 1 1
  S = {2, 4}

E[|S| + |P| + |D|] = n + O(w) bits

Construction: pick pairs q, s̄ until the overall size is n bits (+ O(1) words) ⇒ O(n) expected construction time

  • τ = (8 + c)w for any constant c (see the paper for why)
  • LCE failure probability ≤ n^{−c} (proof in the paper)
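The final step and its inverse can be sketched on bit strings. The names `strip_leading_bits` and `restore_leading_bits` are assumptions, and the blocks are kept as Python strings for readability rather than packed into n bits.

```python
def strip_leading_bits(P_prime):
    """Drop the first bit of each block; record 1-based ranks of blocks that began with 1."""
    P = [block[1:] for block in P_prime]
    S = [rank for rank, block in enumerate(P_prime, start=1) if block[0] == "1"]
    return P, S


def restore_leading_bits(P, S):
    """Rebuild P' from P and S."""
    marked = set(S)
    return [("1" if rank in marked else "0") + block
            for rank, block in enumerate(P, start=1)]


P_prime = ["01101", "10010", "01110", "10101", "00101", "01011"]
P, S = strip_leading_bits(P_prime)
print(P, S)
assert restore_leading_bits(P, S) == P_prime
```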

SLIDE 13

In-place construction

We can replace T with our structure in O(n) expected time while using only O(1) additional words of working space. The construction can be inverted in the same space and time (restoring the text).

SLIDE 14

Applications

Suffix sorting

Easy to lexicographically compare two text suffixes using LCE queries

Result 1: in-place sparse suffix sorting

Any set S = {i_1, ..., i_b} of b suffixes of a text T ∈ Σ^n can be sorted correctly with high probability in O(n + b log b · log^2 n) expected time using O(1) words of space on top of T and S
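The comparison works by skipping the common prefix with one LCE query and then looking at the first mismatching character. The sketch below uses a naive LCE as a stand-in for the fingerprint structure and Python's built-in sort, so it is neither in-place nor within the stated time bound; all names are illustrative.

```python
from functools import cmp_to_key


def lce_naive(text, i, j):
    length = 0
    while (i + length < len(text) and j + length < len(text)
           and text[i + length] == text[j + length]):
        length += 1
    return length


def compare_suffixes(text, i, j):
    """Negative if text[i:] < text[j:], positive if greater, 0 if i == j."""
    if i == j:
        return 0
    skip = lce_naive(text, i, j)
    if i + skip == len(text):  # suffix i is a proper prefix of suffix j
        return -1
    if j + skip == len(text):  # suffix j is a proper prefix of suffix i
        return 1
    return -1 if text[i + skip] < text[j + skip] else 1


def sort_suffixes(text, positions):
    return sorted(positions, key=cmp_to_key(lambda a, b: compare_suffixes(text, a, b)))


print(sort_suffixes("aabababaab", [1, 5, 8]))  # [8, 5, 1]
```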

SLIDE 15

Important: while computing LCE queries, in the exponential/binary searches we only compare (fingerprints of) text substrings of length 2^e

Theorem 1: In O(n log n) expected time and O(n) words of space we can check whether the KR function is collision-free over all pairs of substrings of T having the same length k = 2^e, for all 0 ≤ e ≤ log_2 n

Theorem 2: In O(n log^2 n) worst-case time and n words of space (on top of T) we can check whether the KR function is collision-free over all pairs of substrings of T having the same length k = 2^e, for all 0 ≤ e ≤ log_2 n

⇒ our deterministic structure can be built in O(n log n) expected time and linear space
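What the theorems verify can be stated as a brute-force check. The sketch below is not the algorithm behind either theorem: it keeps explicit substrings in a dictionary, so it uses far more time and space than the stated bounds. It only illustrates the collision-free condition for power-of-two lengths, on a bit string, with an assumed function name.

```python
def collision_free(bits, q):
    """True if no two distinct substrings of equal length 2^e share a fingerprint mod q."""
    n = len(bits)
    prefix = [0]  # prefix[k] = fingerprint of bits[:k]
    for b in bits:
        prefix.append((2 * prefix[-1] + int(b)) % q)
    length = 1
    while length <= n:
        shift = pow(2, length, q)
        seen = {}  # fingerprint -> one substring that produced it
        for start in range(n - length + 1):
            fingerprint = (prefix[start + length] - prefix[start] * shift) % q
            substring = bits[start:start + length]
            if seen.setdefault(fingerprint, substring) != substring:
                return False  # two different substrings, same fingerprint
        length *= 2
    return True


print(collision_free("0010101001", (1 << 61) - 1))  # True
print(collision_free("0011", 3))                    # False: "00" and "11" collide mod 3
```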

SLIDE 16

Applications

In-place SA construction

The suffix array SA of T ∈ Σ^n can be computed in O(n log^2 n) expected time using O(1) words of space on top of T and SA.

The above does not improve the state of the art [Franceschini2007]. The following does:

In-place LCP construction

The Longest Common Prefix (LCP) array can be computed in O(n log^2 n) expected time using O(1) words of space on top of the text and the LCP array. The previous fastest in-place LCP array construction algorithm runs in quadratic time.
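The link between LCE queries and the LCP array is that LCP[r] is the LCE of the two suffixes at ranks r − 1 and r in the suffix array. The sketch below shows only that relationship: it builds SA by sorting suffix copies and uses a naive LCE, so it is not in-place and does not meet the stated bounds. The function names are illustrative.

```python
def lce_naive(text, i, j):
    length = 0
    while (i + length < len(text) and j + length < len(text)
           and text[i + length] == text[j + length]):
        length += 1
    return length


def suffix_array(text):
    """Suffix array by direct sorting of suffixes (simple, not space-efficient)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def lcp_array(text, sa):
    """LCP[r] = LCE(SA[r - 1], SA[r]) for r >= 1, and LCP[0] = 0."""
    lcp = [0] * len(sa)
    for rank in range(1, len(sa)):
        lcp[rank] = lce_naive(text, sa[rank - 1], sa[rank])
    return lcp


sa = suffix_array("banana")
print(sa, lcp_array("banana", sa))  # [5, 3, 1, 0, 4, 2] [0, 1, 3, 0, 0, 2]
```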