
Faster Longest Common Extension on Compressed Strings and Applications (Shunsuke Inenaga, Kyushu University)



  1. PSC 2015. Faster Longest Common Extension on Compressed Strings and Applications. Shunsuke Inenaga, Kyushu University, Japan

  2. Collaborators. This work is a collaboration with: Hideo Bannai, Takaaki Nishimoto, Tomohiro I, Masayuki Takeda

  3. Longest common extension (LCE). The longest common extension (LCE) problem on a string T is: given two positions p and q, compute the length of the longest common substring of T starting at both p and q, i.e., the longest common prefix of the suffixes T[p..] and T[q..].


  6. Longest common extension (LCE), example. For T = "I argue string algorithms at Prague stringology", p = 6 and q = 34, both suffixes begin with "ue string" (9 characters, counting the space) and then differ, so LCE(6, 34) = 9.
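The LCE definition can be checked on the slide's example with a naive character-by-character scan. This is only a minimal sketch on the uncompressed string; the function name `naive_lce` and the 1-indexed convention for p and q are assumptions made for illustration (the positions p = 6 and q = 34 match the slide under that convention).

```python
def naive_lce(text: str, p: int, q: int) -> int:
    """Length of the longest common prefix of text[p..] and text[q..].

    Positions p and q are 1-indexed, as on the slide (an assumption).
    """
    i, j = p - 1, q - 1  # convert to Python's 0-indexed positions
    length = 0
    while i < len(text) and j < len(text) and text[i] == text[j]:
        i += 1
        j += 1
        length += 1
    return length


T = "I argue string algorithms at Prague stringology"
print(naive_lce(T, 6, 34))  # 9, the common extension is "ue string"
```

Each query costs time proportional to the answer, and the whole string must be stored, which is the cost the compressed setting tries to avoid.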

  7. Background & Motivation
      • LCE has numerous applications, e.g., approximate pattern matching, computing palindromes, computing approximate repeats.
      • A string T of length u can be preprocessed in O(u) time and space so that each LCE query can be answered in O(1) time [Demaine et al.].
      • However, the O(u) complexity can be prohibitive for large-scale text.
      • To save preprocessing time and space, we consider LCE on grammar-compressed text.

  8. Straight Line Program (SLP). Definition: An SLP is a sequence of n productions X_1 → expr_1, X_2 → expr_2, ..., X_n → expr_n, where each expr_i is either a single character a (a ∈ Σ) or a pair X_l X_r (l, r < i).
      • An SLP is a CFG in Chomsky normal form that derives a single string.
      • SLPs model outputs of grammar-based compression algorithms (e.g., Re-pair, LZ78, LZDF, OLCA).

  9. Straight Line Program (SLP), notation
      • n: size (number of productions) of a given SLP S
      • h: height of the derivation tree of S
      • u: length of the uncompressed string T represented by SLP S


  12. Example of SLP
      SLP S: X_1 → a, X_2 → b, X_3 → X_1 X_1, X_4 → X_1 X_2, X_5 → X_3 X_4, X_6 → X_5 X_4, X_7 → X_5 X_6
      Derived string: T = aaabaaabab (here n = 7 and u = 10)
      (Figure: derivation tree of SLP S, annotated with n, h and u.)
      • log_2 u ≤ h ≤ n always holds.
      • u can be exponential in n (e.g. consider the string a^u).
      • Hence, O(poly(n)) solutions are of significance.
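The quantities n, h and u for the example SLP can be computed bottom-up from the productions alone. The sketch below is illustrative only: the dictionary representation, the function names, and the convention that a terminal production has height 0 are assumptions, not part of the slides. The `expand` helper is included just to confirm the derived string on this tiny example.

```python
import math

# Example SLP from the slide: index i maps to a character or a pair (l, r).
slp = {1: "a", 2: "b", 3: (1, 1), 4: (1, 2), 5: (3, 4), 6: (5, 4), 7: (5, 6)}


def lengths_and_heights(slp):
    """Compute |val(X_i)| and the derivation-tree height of every X_i."""
    length, height = {}, {}
    for i in sorted(slp):  # l, r < i, so both children are already processed
        rule = slp[i]
        if isinstance(rule, str):
            length[i], height[i] = 1, 0  # assumed convention: leaves have height 0
        else:
            l, r = rule
            length[i] = length[l] + length[r]
            height[i] = 1 + max(height[l], height[r])
    return length, height


def expand(slp, i):
    """Fully expand X_i (exponential in general; for tiny examples only)."""
    rule = slp[i]
    if isinstance(rule, str):
        return rule
    return expand(slp, rule[0]) + expand(slp, rule[1])


length, height = lengths_and_heights(slp)
n, u, h = len(slp), length[7], height[7]
print(n, u, h)         # 7 10 4
print(expand(slp, 7))  # aaabaaabab
```

With this height convention the bound log_2 u ≤ h ≤ n reads 3.32... ≤ 4 ≤ 7 for the example.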

  13. Important Remarks
      (Figure: derivation tree of X_6, deriving aaabab.)
      • Derivation trees are only imaginary (used only for explanations) and are never constructed explicitly.


  15. Longest Common Extension on SLP. Problem 1 (grammar-compressed LCE): Preprocess an input SLP S = {X_i → expr_i}, i = 1, ..., n, so that subsequent longest common extension queries LCE(X_j, X_k, p, q) can be answered quickly.
      Example (figure): the two strings abba·bbabca and ac·bbabcbbbac, where · marks positions p and q. The query output is LCE length 5.

  16. What is the difficulty?
      • We are not allowed to expand the SLP (compressed text), since this takes O(2^n) time in the worst case.
      • But we want to know the length of the longest common extension!
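One simple way to answer a query without expanding the SLP is to store only the length of each variable and descend the (imaginary) derivation tree to fetch one character at a time. The sketch below is in the spirit of the folklore O(hL)-time approach; it is not the algorithm of this talk, and the names `compute_lengths`, `char_at` and `lce` are hypothetical.

```python
# Example SLP (X_7 derives aaabaaabab): index i maps to a character or a pair (l, r).
slp = {1: "a", 2: "b", 3: (1, 1), 4: (1, 2), 5: (3, 4), 6: (5, 4), 7: (5, 6)}


def compute_lengths(slp):
    """O(n) preprocessing: length of the string derived from each X_i."""
    length = {}
    for i in sorted(slp):  # children have smaller indices
        rule = slp[i]
        length[i] = 1 if isinstance(rule, str) else length[rule[0]] + length[rule[1]]
    return length


def char_at(slp, length, i, pos):
    """Return the pos-th character (1-indexed) of val(X_i) in O(h) time."""
    while not isinstance(slp[i], str):
        l, r = slp[i]
        if pos <= length[l]:
            i = l              # the character lies in the left child
        else:
            pos -= length[l]   # skip the left child and go right
            i = r
    return slp[i]


def lce(slp, length, j, k, p, q):
    """LCE(X_j, X_k, p, q) by comparing characters one by one: O(hL) time."""
    count = 0
    while (p <= length[j] and q <= length[k]
           and char_at(slp, length, j, p) == char_at(slp, length, k, q)):
        p += 1
        q += 1
        count += 1
    return count


length = compute_lengths(slp)
print(lce(slp, length, 7, 7, 1, 5))  # 5: "aaaba" is shared, then 'a' differs from 'b'
```

The query time still grows with the answer L, which is what the faster data structures compared later in the talk avoid.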


  18. LCE algorithms on SLPs

      | Algorithm                      | Query time               | Preprocessing time         | Space                   |
      |--------------------------------|--------------------------|----------------------------|-------------------------|
      | Folklore                       | O(hL)                    | O(n)                       | O(n)                    |
      | Miyazaki et al. '97 (extended) | O(hn^2)                  | O(n^4)                     | O(n^2)                  |
      | Lifshits '07 (extended)        | O(hn^2)                  | O(hn^2)                    | O(n^2)                  |
      | I et al. '15                   | O(h log u)               | O(hn^2)                    | O(n^2)                  |
      | Bille et al. '15 (randomized)  | O(log u + log^2 L)       | N/A                        | O(n)                    |
      | This work                      | O(log u + log* u log L)  | O(n loglog n log* u log u) | O(n + z log* u log u)   |

      Notation: n: size of SLP; u: length of uncompressed string T; h: height of SLP derivation tree; L: LCE length (output); z: size of LZ77 factorization of T.
      Relations: log u ≤ h ≤ n; L = O(u); log* u = o(log u); z ≤ n (due to Rytter '03).

  19. Logstar (iterated logarithm). Definition: The logstar of a positive integer u, denoted log* u, is the number of times the logarithm function needs to be iteratively applied to u until the result becomes less than or equal to 1.
      • The logstar is a very slowly growing function, e.g., log* 2^65536 = 5.
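The logstar definition translates directly into a loop. The sketch below assumes base-2 logarithms, which is the base under which the slide's example log* 2^65536 = 5 works out; the function name `log_star` is chosen for illustration.

```python
import math


def log_star(u: int) -> int:
    """Number of times log2 must be applied to u until the result is <= 1."""
    count = 0
    x = u
    while x > 1:
        x = math.log2(x)  # math.log2 also accepts arbitrarily large Python ints
        count += 1
    return count


print(log_star(2 ** 65536))  # 5: 2^65536 -> 65536 -> 16 -> 4 -> 2 -> 1
```

Even for a number with almost twenty thousand decimal digits the value is only 5, which is why a log* u factor is treated as nearly constant in the bounds.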

  20. LCE algorithms on SLPs (same table as slide 18, with the "This work" row highlighted): fastest queries, fastest deterministic preprocessing, and smallest space in many cases.

  21. Our strategy
      • All previous algorithms work on the SLP derivation trees of two query non-terminals.
      • Our new algorithm does NOT work on the SLP derivation trees.
      • Instead, we construct a different tree of logarithmic height, based on locally consistent parsing and signature encoding.

  22. Locally consistent parsing. Lemma 1 [Mehlhorn et al., Alstrup et al.]: For any integer string Y ∈ {1..m}* in which no adjacent elements are equal (i.e. Y[i] ≠ Y[i+1]), there is a bit string d of length |Y| such that
      1. no 1's appear consecutively;
      2. at most three 0's appear consecutively;
      3. each d[i] is determined locally, i.e., by Y[i − Δ_L .. i − 1] and Y[i .. i + Δ_R], where Δ_L ≤ log* m + 6 and Δ_R ≤ 4;
      4. d can be computed in O(|Y|) time.


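  24. Locally consistent parsing
      Y = 1,2,3,5,2,3,4,2,5,1,2,3,5,2,3,4,2,5
      d = 1,0,1,0,0,1,0,0,0,1,0,1,0,1,0,1,0,0
      Each bit of d (such as the one highlighted on the slide) depends only on the Δ_L elements of Y to its left and the Δ_R elements to its right, where Δ_L ≤ log* m + 6 and Δ_R ≤ 4.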

  25. Locally consistent parsing
      Y = 1,2,3,5,2,3,4,2,5,1,2,3,5,2,3,4,2,5
      d = 1,0,1,0,0,1,0,0,0,1,0,1,0,1,0,1,0,0
      • Using the bit string d, any such integer string Y (with no equal adjacent elements) can be uniquely decomposed in linear time into blocks of length 2 to 4.
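Given d, the decomposition into blocks is a single left-to-right pass. The sketch below assumes that a 1 in d marks the first element of a block, which is consistent with the example (d starts with 1) and with the stated block lengths; it does not compute d itself, and the function name `split_into_blocks` is hypothetical.

```python
def split_into_blocks(Y, d):
    """Cut Y into blocks, starting a new block wherever d has a 1 (assumed convention)."""
    blocks = []
    for y, bit in zip(Y, d):
        if bit == 1 or not blocks:
            blocks.append([y])    # a 1 opens a new block
        else:
            blocks[-1].append(y)  # a 0 extends the current block
    return blocks


Y = [1, 2, 3, 5, 2, 3, 4, 2, 5, 1, 2, 3, 5, 2, 3, 4, 2, 5]
d = [1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0]
blocks = split_into_blocks(Y, d)
print(blocks)
# [[1, 2], [3, 5, 2], [3, 4, 2, 5], [1, 2], [3, 5], [2, 3], [4, 2, 5]]
```

Because no two 1's are adjacent and at most three 0's are consecutive, every block in the example has between 2 and 4 elements.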

  26. Signature encoding [Mehlhorn et al. ’97]
      • Iteratively apply locally consistent parsing to input string T until a single integer is obtained.
      T = a b c a c a b b c a b a c c c a

  27. Signature encoding, step 1: Each character is assigned a unique integer called a signature.
      T          = a b c a c a b b c a b a c c c a
      signatures = 1 2 3 1 3 1 2 2 3 1 2 1 3 3 3 1

  28. Signature encoding, step 2: Each maximal run of the same signature is assigned a new signature (2 2 becomes 4, 3 3 3 becomes 5).
      signatures = 1 2 3 1 3 1 4 3 1 2 1 5 1

  29. Signature encoding, step 3: Apply locally consistent parsing to this string.

  30. Signature encoding, step 4: Each block is assigned a new signature (6, 7, 8 and 9 in the figure).
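The first two steps of the signature encoding (character signatures, then run compression) can be reproduced on the slide's string with a small dictionary that hands out the next unused integer. This is a simplified sketch: the helper names are hypothetical, single-element runs are assumed to keep their signature (as the slide's numbers suggest), and the locally consistent parsing step that forms the blocks is not implemented here.

```python
from itertools import groupby


def assign(table, key):
    """Return the signature of key, allocating the next unused integer if new."""
    if key not in table:
        table[key] = len(table) + 1
    return table[key]


def encode_characters(text, table):
    """Step 1: every distinct character gets its own signature."""
    return [assign(table, ch) for ch in text]


def compress_runs(sigs, table):
    """Step 2: each maximal run of length >= 2 is replaced by a new signature."""
    out = []
    for sig, run in groupby(sigs):
        k = len(list(run))
        out.append(sig if k == 1 else assign(table, (sig, k)))
    return out


table = {}
level0 = encode_characters("abcacabbcabaccca", table)
level0_runs = compress_runs(level0, table)
print(level0)       # [1, 2, 3, 1, 3, 1, 2, 2, 3, 1, 2, 1, 3, 3, 3, 1]
print(level0_runs)  # [1, 2, 3, 1, 3, 1, 4, 3, 1, 2, 1, 5, 1]
```

Because the same key always maps to the same integer, equal substrings that are parsed identically receive equal signatures, which is the property the encoding relies on.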
