in place longest common extensions

In-place Longest Common Extensions (presentation by Nicola Prezza)


  1. In-place Longest Common Extensions. Nicola Prezza, University of Udine, Department of Computer Science. Dagstuhl Seminar 16431: "Computation over Compressed Structured Data". Sections: Overview; Monte Carlo LCE structure; Deterministic data structure.

  2. Longest Common Extension queries. LCE(i, j) is the length of the longest common prefix of the suffixes of T starting at positions i and j. Example (positions 0 to 9): T = a a b a b a b a a b, LCE(1, 5) = 3.
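
The definition can be checked with a character-by-character scan. This is only a minimal sketch of the `O(ℓ)` baseline that stores the text alone; the function name `naive_lce` is an assumption made for illustration.

```python
def naive_lce(text: str, i: int, j: int) -> int:
    """Length of the longest common prefix of text[i:] and text[j:]."""
    length = 0
    while (i + length < len(text) and j + length < len(text)
           and text[i + length] == text[j + length]):
        length += 1
    return length


T = "aabababaab"  # the slide's example text, positions 0..9
print(naive_lce(T, 1, 5))  # 3, the value given on the slide
```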

  3. State of the art

     | Space (bits) | Query time | Build time | Reference |
     |---|---|---|---|
     | $O(n \log n)$ | $O(1)$ | $O(n)$ | ST + LCA |
     | $O(n \log n)$ | $O(1)$ | $O(n)$ | RMQ + LCP |
     | $n \lceil \log_2 \sigma \rceil + O(nw/\tau)$ | $O(\tau)$ | $O(n^{2+\epsilon})$ | [Bille2015] |
     | $n \lceil \log_2 \sigma \rceil + O(nw/\tau)$ | $O(\tau)$ | $O(n^{3/2})$ expected | [Bille2015] |
     | $n \lceil \log_2 \sigma \rceil + O(nw/\tau)$ | $O(\tau \log \tau)$ | $O(n\tau)$ | [Tanimura2016] |
     | $n \lceil \log_2 \sigma \rceil$ | $O(\ell)$ | - | store only $T$ |

     Here $\ell = \mathrm{LCE}(i, j)$.

  4. Result presented. A deterministic data structure of size $n \lceil \log_2 \sigma \rceil$ bits supporting optimal $O(m \log \sigma / w)$-time extraction of $T[i, \dots, i+m-1]$ and $O(\log^2 \ell)$-time LCE queries. Construction: $O(n \log n)$ expected time and $O(n)$ words of space. The data structure is in-place: there are no little-o terms. LCE queries are improvable to $O(\log \ell)$ time using $O(\log n)$ words of additional space.

  5. Applications. In-place algorithms to: compute the suffix array in $O(n \log^2 n)$ expected time (exact); compute the LCP array in $O(n \log^2 n)$ expected time (exact); perform sparse suffix sorting (Monte Carlo).

  6. Steps. (1) Replace the text with the Karp-Rabin fingerprints of a subset of its prefixes. (2) Choose the modulus $q$ at random in such a way that the fingerprints can be statistically compressed down to $n \lceil \log_2 \sigma \rceil$ bits. (3) De-randomize. For simplicity, only the binary case $\sigma = 2$ is considered here; it is easy to extend to $\sigma \in O(w)$.

  7. (1) Choose a block size $\tau \in \Theta(w)$. (2) Choose a random $\tau$-bit prime $q$ (the modulus of the KR function). (3) Choose a uniform seed $\bar{s} \in [0, q)$. (4) Left-pad $T$ with $\bar{s}$. (5) Break the text into $\tau$-bit blocks: an array $B[1, \dots, n/\tau]$ of $\tau$-bit integers. Example with $\tau = 5$, $q = 10001$ (= 17), $\bar{s} = 00101$: B = 00101 01011 11010 10101 11010 00001.
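
The padding and blocking steps can be sketched as follows. The function name `to_blocks` and the bit-string representation are assumptions for illustration, and the sketch assumes the padded length is a multiple of $\tau$; the choice of the random prime $q$ is left out.

```python
def to_blocks(text_bits: str, seed_bits: str, tau: int) -> list[int]:
    """Left-pad the binary text with the seed and cut it into tau-bit integers."""
    padded = seed_bits + text_bits
    if len(padded) % tau != 0:
        raise ValueError("sketch assumes the padded length is a multiple of tau")
    return [int(padded[k:k + tau], 2) for k in range(0, len(padded), tau)]


# Text bits read off the slide's example B after removing the seed block 00101.
B = to_blocks("0101111010101011101000001", "00101", tau=5)
print([format(b, "05b") for b in B])
```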

  8. Build an array $P'$ of Karp-Rabin fingerprints of the prefixes ending at block boundaries, and add a bitvector $D[1, \dots, n/\tau]$ marking $P'[i] \ge q$. Example with $\tau = 5$, $q = 10001$ (= 17): B = 00101 01011 11010 10101 11010 00001; P' = 01101 10010 01110 10101 00101 01011; D = 0 1 0 1 0 0. Property 1: with $P'$ and $D$ we can recover $B$ (and therefore $T$). (1) If $B[i] < q$, then $B[i] = (P'[i] - 2^{\tau} \cdot P'[i-1]) \bmod q$. (2) If $B[i] \ge q$, then $B[i] \bmod q = B[i] - q$, so add $q$ to the value in (1).
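
Property 1 can be exercised with a small round trip. This is a sketch under two assumptions that are not stated on the slide: the fingerprint recurrence is taken to be $P'[i] = (2^{\tau} \cdot P'[i-1] + B[i]) \bmod q$ with $P'[0] = 0$, and $D[i]$ is taken to record whether $B[i] \ge q$, which is the information case (2) needs. The values it computes are therefore not meant to reproduce the slide's example rows for $P'$ and $D$. The names `build` and `recover` are hypothetical.

```python
def build(B, q, tau):
    """Return (P_prime, D): block-boundary fingerprints mod q and the B[i] >= q flags."""
    P_prime, D, prev = [], [], 0
    for b in B:
        prev = (prev * (1 << tau) + b) % q  # assumed Karp-Rabin recurrence
        P_prime.append(prev)
        D.append(1 if b >= q else 0)
    return P_prime, D


def recover(P_prime, D, q, tau):
    """Invert build(): case (1) gives B[i] mod q, case (2) adds q back when flagged."""
    B, prev = [], 0
    for p, d in zip(P_prime, D):
        b = (p - (1 << tau) * prev) % q
        B.append(b + q if d else b)
        prev = p
    return B


B = [0b00101, 0b01011, 0b11010, 0b10101, 0b11010, 0b00001]
P_prime, D = build(B, q=17, tau=5)
assert recover(P_prime, D, q=17, tau=5) == B
```

The round trip works because every block is below $2^{\tau}$ and $q$ has $\tau$ bits, so $B[i] - q < q$ whenever $B[i] \ge q$.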

  9. $P'$ and $D$ take $n + n/\tau$ bits of space and support optimal-time text extraction and the computation of the Karp-Rabin fingerprint of any text substring. This gives LCE queries in $O(\log \ell)$ steps of exponential plus binary search; the total time is $O(\log^2 \ell)$ because we need to compute powers of 2 mod $q$. Can we reduce the space to $n$ bits?
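
The exponential plus binary search can be sketched as below. This is a simplified illustration, not the structure from the talk: it stores a fingerprint for every prefix (not only at block boundaries), and it uses a fixed Mersenne prime `Q` instead of a random $\tau$-bit prime. All names are assumptions. Every comparison involves substrings whose length is a power of two, and the call to `pow` corresponds to the "powers of 2 mod q" cost.

```python
Q = (1 << 61) - 1  # assumption: fixed prime modulus, chosen only for this sketch


def prefix_fingerprints(bits):
    """P[k] is the fingerprint of bits[:k], read as a base-2 number mod Q."""
    P = [0]
    for b in bits:
        P.append((2 * P[-1] + b) % Q)
    return P


def fingerprint(P, start, length):
    """Fingerprint of bits[start:start + length], derived from two prefixes."""
    return (P[start + length] - P[start] * pow(2, length, Q)) % Q


def lce(P, i, j):
    n = len(P) - 1
    limit = n - max(i, j)  # the extension cannot run past the end of the text
    matched, step = 0, 1
    # Exponential phase: extend by blocks of length 1, 2, 4, ... while they agree.
    while (matched + step <= limit and
           fingerprint(P, i + matched, step) == fingerprint(P, j + matched, step)):
        matched += step
        step *= 2
    # Binary phase: halve the block length to locate the first mismatch.
    while step >= 1:
        if (matched + step <= limit and
                fingerprint(P, i + matched, step) == fingerprint(P, j + matched, step)):
            matched += step
        step //= 2
    return matched


bits = [0, 0, 1, 0, 1, 0, 1, 0, 0, 1]  # a a b a b a b a a b with a=0, b=1
P = prefix_fingerprints(bits)
print(lce(P, 1, 5))  # 3
```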

  10. Idea: pick $q$ in such a way that few $P'[i]$ start with a 1. Property: each $P'[i]$ is a uniform number in $[0, q)$ (thanks to the seed). Example with $\tau = 4$, $q = 1011$ (= 11): of the 16 block values 0000, ..., 1111, only those below $q$ can occur as fingerprints; among these, 1000, 1001 and 1010 begin with 1 (red on the slide) and 0000, ..., 0111 begin with 0 (black). Hence $P(P'[i] \text{ begins with } 1) = \text{red} / (\text{red} + \text{black}) = (q - 2^{\tau-1}) / q$.
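
The probability can be confirmed by enumerating the possible fingerprints. A minimal sketch, assuming (as the slide states) that fingerprints are uniform in $[0, q)$; the function name is hypothetical.

```python
from fractions import Fraction


def prob_leading_one(q: int, tau: int) -> Fraction:
    """Fraction of values in [0, q) whose tau-bit representation starts with 1."""
    red = sum(1 for v in range(q) if v >> (tau - 1) == 1)
    return Fraction(red, q)


print(prob_leading_one(11, 4))  # 3/11, i.e. (q - 2^(tau-1)) / q for the slide's example
```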

  11. Goal: few $P'[i]$ starting with 1. Solve $(q - 2^{\tau-1}) / q \le 1/n$. Result: pick $q$ uniformly from $Z = \left[ 2^{\tau-1},\ 2^{\tau-1} \cdot \frac{n}{n-1} \right]$.
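
The upper end of the interval $Z$ follows from solving the inequality for $q$. A short derivation, assuming $n > 1$; the lower end $2^{\tau-1}$ reflects that $q$ is a $\tau$-bit number.

```latex
\begin{align*}
\frac{q - 2^{\tau-1}}{q} \le \frac{1}{n}
  &\iff q \left( 1 - \frac{1}{n} \right) \le 2^{\tau-1} \\
  &\iff q \le 2^{\tau-1} \cdot \frac{n}{n-1}.
\end{align*}
```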

  12. Final step: build the array $P$ by removing the first bit from each $P'[i]$, and store the ranks of the $P'$-blocks starting with 1 in an array $S$. Example with $\tau = 5$: P = 1101 0010 1110 0101 0101 1011; D = 0 1 0 1 0 0; S = {2, 4}. $E[|S| + |P| + |D|] = n + O(w)$ bits. Construction: pick pairs $\langle q, \bar{s} \rangle$ until the overall size is $n$ bits (+ $O(1)$ words), which gives $O(n)$ expected construction time. $\tau = (8 + c)w$ for any constant $c$ (the paper explains why). LCE failure probability $\le n^{-c}$ (proof in the paper).
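
The bit-stripping step and its inverse can be sketched as follows, using the example values of $P'$ from the slides. The names `compress` and `decompress` are assumptions for illustration, and integers stand in for the packed bit arrays.

```python
def compress(P_prime, tau):
    """Drop the leading bit of every tau-bit fingerprint; remember which ranks had a 1."""
    mask = (1 << (tau - 1)) - 1
    P = [p & mask for p in P_prime]
    S = [rank for rank, p in enumerate(P_prime, start=1) if p >> (tau - 1)]
    return P, S


def decompress(P, S, tau):
    """Restore the leading 1 for the ranks listed in S."""
    marked = set(S)
    return [p | (1 << (tau - 1)) if rank in marked else p
            for rank, p in enumerate(P, start=1)]


P_prime = [0b01101, 0b10010, 0b01110, 0b10101, 0b00101, 0b01011]
P, S = compress(P_prime, tau=5)
print([format(p, "04b") for p in P], S)
```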

  13. In-place construction. We can replace $T$ with our structure in $O(n)$ expected time while using only $O(1)$ additional words of working space. The construction can be inverted in the same space and time, restoring the text.

  14. Applications: suffix sorting. It is easy to lexicographically compare two text suffixes using LCE queries. Result 1 (in-place sparse suffix sorting): any set $S = \{i_1, \dots, i_b\}$ of $b$ suffixes of a text $T \in \Sigma^n$ can be sorted correctly with high probability in $O(n + b \log b \cdot \log^2 n)$ expected time using $O(1)$ words of space on top of $T$ and $S$.
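
Comparing two suffixes with one LCE query, and sorting a sparse set of suffixes with that comparator, might look like the sketch below. It is neither in-place nor the algorithm from the talk: a naive character-scanning LCE stands in for the fingerprint-based query, and Python's built-in sort is used. All function names are assumptions.

```python
from functools import cmp_to_key


def naive_lce(text, i, j):
    """Stand-in LCE query: scan until the first mismatch or the end of the text."""
    k = 0
    while i + k < len(text) and j + k < len(text) and text[i + k] == text[j + k]:
        k += 1
    return k


def compare_suffixes(text, i, j):
    """Negative if text[i:] < text[j:], positive if greater, 0 if i == j."""
    if i == j:
        return 0
    k = naive_lce(text, i, j)
    if i + k == len(text):  # suffix i is a proper prefix of suffix j
        return -1
    if j + k == len(text):  # suffix j is a proper prefix of suffix i
        return 1
    return -1 if text[i + k] < text[j + k] else 1


def sparse_suffix_sort(text, positions):
    return sorted(positions, key=cmp_to_key(lambda i, j: compare_suffixes(text, i, j)))


print(sparse_suffix_sort("aabababaab", [1, 5, 8, 2]))  # [8, 5, 1, 2]
```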

  15. Important: while computing LCE queries, in the exponential and binary searches we only compare (fingerprints of) text substrings of length $2^e$. Theorem 1: in $O(n \log n)$ expected time and $O(n)$ words of space we can check whether the KR function is collision-free over all pairs of substrings of $T$ having the same length $k = 2^e$, for all $0 \le e \le \log_2 n$. Theorem 2: in $O(n \log^2 n)$ worst-case time and $n$ words of space (on top of $T$) we can check whether the KR function is collision-free over all pairs of substrings of $T$ having the same length $k = 2^e$, for all $0 \le e \le \log_2 n$. Consequently, our deterministic structure can be built in $O(n \log n)$ expected time and linear space.
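
What "collision-free over substrings of length $2^e$" means can be illustrated with a brute-force check. This sketch is not the algorithm behind Theorem 1 or Theorem 2 and makes no attempt to meet their bounds: it groups substrings by fingerprint with a dictionary and compares each substring with the first one seen for its fingerprint. The function name and the base-2 fingerprint are assumptions.

```python
def collision_free(bits, q):
    """True if no two distinct substrings of equal length 2^e share a fingerprint mod q."""
    n = len(bits)
    P = [0]
    for b in bits:
        P.append((2 * P[-1] + b) % q)
    k = 1
    while k <= n:
        first_seen = {}  # fingerprint -> start of the first substring with that fingerprint
        power = pow(2, k, q)
        for i in range(n - k + 1):
            f = (P[i + k] - P[i] * power) % q
            j = first_seen.setdefault(f, i)
            if bits[j:j + k] != bits[i:i + k]:
                return False  # same fingerprint, different substrings
        k *= 2
    return True


bits = [0, 0, 1, 0, 1, 0, 1, 0, 0, 1]
print(collision_free(bits, (1 << 61) - 1))  # True
print(collision_free(bits, 3))              # False: 0010 and 0101 are both 2 mod 3
```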

  16. Applications. In-place SA construction: the suffix array SA of $T \in \Sigma^n$ can be computed in $O(n \log^2 n)$ expected time using $O(1)$ words of space on top of $T$ and SA. This does not improve on the state of the art [Franceschini2007]; the following does. In-place LCP construction: the Longest Common Prefix (LCP) array can be computed in $O(n \log^2 n)$ expected time using $O(1)$ words of space on top of the text and the LCP array. The previous fastest in-place LCP array construction algorithm runs in quadratic time.
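
One way to see the link between LCE queries and the LCP array: each LCP entry is the LCE of two suffixes that are adjacent in the suffix array. The sketch below only illustrates that relation, assuming a naive suffix array builder and a naive LCE; it is not in-place and is not presented as the construction from the talk. All names are hypothetical.

```python
def naive_lce(text, i, j):
    """Stand-in LCE query: scan until the first mismatch or the end of the text."""
    k = 0
    while i + k < len(text) and j + k < len(text) and text[i + k] == text[j + k]:
        k += 1
    return k


def suffix_array(text):
    """Naive suffix array: sort start positions by the suffix they begin."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def lcp_array(text, sa):
    """LCP[0] = 0 and LCP[k] = LCE(SA[k-1], SA[k]) for k > 0."""
    return [0 if k == 0 else naive_lce(text, sa[k - 1], sa[k]) for k in range(len(sa))]


sa = suffix_array("banana")
print(sa, lcp_array("banana", sa))  # [5, 3, 1, 0, 4, 2] [0, 1, 3, 0, 0, 2]
```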
