
Time-Space Trade-Offs for the Longest Common Substring Problem



  1. Time-Space Trade-Offs for the Longest Common Substring Problem
     Tatiana Starikovskaya (1) and Hjalte Wedel Vildhøj (2)
     (1) Moscow State University, Department of Mechanics and Mathematics, tat.starikovskaya@gmail.com
     (2) Technical University of Denmark, DTU Compute, hwv@hwv.dk
     CPM 2013, Bad Herrenalb, Germany, June 17, 2013


  4. The Longest Common Substring Problem
     Definition
     Problem: Given strings \(T_1, T_2, \ldots, T_m\) of total length \(n\), compute the longest substring that appears in at least \(d\) of the strings, where \(2 \le d \le m\).
     Example (positions 1 to 13):
       \(T_1\) = a g g c t a g c t a c c t
       \(T_2\) = a c a c c t a c c c t a g
       \(T_3\) = a c t a g t a a t g c a t
     d = 3 ⇒ LCS = c t a g
     d = 2 ⇒ LCS = c t a c c
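The definition above can be checked on the slide's example with a brute-force sketch. This is not one of the algorithms from the talk: it enumerates every substring, so it uses far more than linear time and space. The function name `longest_common_substring` is made up for this illustration.

```python
def longest_common_substring(strings, d):
    """Naive reference: longest substring occurring in at least d of the strings."""
    owners = {}  # substring -> set of indices of the strings that contain it
    for idx, s in enumerate(strings):
        for i in range(len(s)):
            for j in range(i + 1, len(s) + 1):
                owners.setdefault(s[i:j], set()).add(idx)
    best = ""
    for sub, who in owners.items():
        # Keep the first substring of maximal length that has enough owners.
        if len(who) >= d and len(sub) > len(best):
            best = sub
    return best


T1, T2, T3 = "aggctagctacct", "acacctaccctag", "actagtaatgcat"
print(longest_common_substring([T1, T2, T3], 3))  # ctag
print(longest_common_substring([T1, T2, T3], 2))  # ctacc
```

Both outputs agree with the LCS values given on the slide for d = 3 and d = 2.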

  5. The Longest Common Substring Problem
     A patented solution

  8. A Textbook Solution
     1. Build Generalized Suffix Tree
     Space: \(\Theta(n)\)
     [Figure: generalized suffix tree of \(T_1\) = a g g c t a g c t a c c t and \(T_2\) = a c a c c t a c c c t a g]

  9. Our Results
     Question
     Can the LCS problem be solved (deterministically) in \(O(n^{1-\varepsilon})\) space and \(O(n^{1+\varepsilon})\) time for \(0 \le \varepsilon \le 1\)?
     Our Answer
     Yes if \(0 \le \varepsilon \le \frac{1}{3}\). More precisely:
     For two strings (\(d = m = 2\)), the problem can be solved in
       Time: \(O(n^{1+\varepsilon})\)
       Space: \(O(n^{1-\varepsilon})\)
     for any \(0 < \varepsilon \le \frac{1}{3}\).
     In the general case (\(2 \le d \le m\)), the problem can be solved in
       Time: \(O(n^{1+\varepsilon} \log^2 n \, (d \log^2 n + d^2))\)
       Space: \(O(n^{1-\varepsilon})\)
     for any \(0 \le \varepsilon < \frac{1}{3}\).


  11. A Solution for Two Strings
      When the LCS is long
      Idea: Preprocess a sparse sample of the n suffixes for LCP queries.
      T = a g g c t a g c t a c c t $₁ a c a c c t a c c c t a g $₂   (positions 1 to 28)
      [Figure: the pattern \(DC_\tau\) repeated along T, marking the sampled positions]
      Difference Covers
      A difference cover modulo τ is a set of integers \(DC_\tau \subseteq \{0, 1, \ldots, \tau - 1\}\) such that for any distance \(d \in \{0, 1, \ldots, \tau - 1\}\), \(DC_\tau\) contains two elements separated by distance d modulo τ.
      Ex: The set \(DC_\tau = \{1, 2, 4\}\) is a difference cover modulo 5.
        d      0      1      2      3      4
        i, j   1, 1   2, 1   1, 4   4, 1   1, 2
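The difference cover property is easy to test by brute force. The sketch below searches for a witness pair (i, j) for every distance d, mirroring the small table on the slide; the pairs it finds may differ from the slide's, since several pairs can realize the same distance. The helper names are hypothetical, and the sampling rule in the last line (position p is sampled when p mod τ is in the cover) is an assumption inferred from the positions that appear in the sparse suffix array later in the talk.

```python
def difference_cover_witnesses(dc, tau):
    """For each realizable distance d, store one pair (i, j) from dc with (i - j) % tau == d."""
    witnesses = {}
    for i in sorted(dc):
        for j in sorted(dc):
            witnesses.setdefault((i - j) % tau, (i, j))
    return witnesses


def is_difference_cover(dc, tau):
    """True if dc is a subset of {0, ..., tau-1} that realizes every distance modulo tau."""
    in_range = all(0 <= x < tau for x in dc)
    return in_range and len(difference_cover_witnesses(dc, tau)) == tau


print(is_difference_cover({1, 2, 4}, 5))  # True
print(is_difference_cover({1, 2}, 5))     # False: distances 2 and 3 are missing

# Assumed sampling rule for n = 28 and tau = 5: keep p when p mod tau is in the cover.
sampled = [p for p in range(1, 29) if p % 5 in {1, 2, 4}]
print(sampled)
```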


  16. A Solution for Two Strings
      When the LCS is long
      Idea: Preprocess a sparse sample of the n suffixes for LCP queries.
      T = a g g c t a g c t a c c t $₁ a c a c c t a c c c t a g $₂   (positions 1 to 28)
      RB(11) = (g c t a c)^R = c a t c g
      - Number of sampled suffixes: \(O\left(\frac{n}{\tau} |DC_\tau|\right) = O\left(\frac{n}{\sqrt{\tau}}\right)\).
      - The LCS is the LCP of two suffixes.
      - If |LCS| ≥ τ, one of the first τ characters of the LCS is sampled in both strings.
      - Hence the LCS corresponds to a pair \((p_1^*, p_2^*)\) maximizing
        \(\mathrm{lcp}(RB(p_1), RB(p_2)) + \mathrm{lcp}(T[p_1..], T[p_2..]) - 1\)
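The maximization in the last bullet can be sketched as a plain quadratic scan over all sampled pairs, without the speed-up the talk goes on to develop. Several details are assumptions: RB(p) is taken to be the reverse of the block T[p−τ+1..p], which matches the RB(11) example; a position p is sampled when p mod τ lies in the difference cover; and the function and variable names are hypothetical. The scan is only guaranteed to find the LCS when |LCS| ≥ τ.

```python
def lcp(a, b):
    """Length of the longest common prefix of two sequences."""
    k = 0
    while k < len(a) and k < len(b) and a[k] == b[k]:
        k += 1
    return k


def lcs_via_samples(t1, t2, tau, dc):
    """Quadratic scan over sampled pairs; exact only when |LCS| >= tau."""
    # T = T1 $1 T2 $2 as a token list, so the sentinels never match a letter.
    text = list(t1) + ["$1"] + list(t2) + ["$2"]
    n1 = len(t1)

    def suffix(p):  # T[p..], with 1-indexed p as on the slide
        return text[p - 1:]

    def rb(p):  # assumed definition: reverse of T[p-tau+1..p], clipped at position 1
        return text[max(0, p - tau):p][::-1]

    sampled = [p for p in range(1, len(text) + 1) if p % tau in dc]
    in_t1 = [p for p in sampled if p <= n1]
    in_t2 = [p for p in sampled if n1 + 1 < p < len(text)]

    best = ("", None, None)
    for p1 in in_t1:
        for p2 in in_t2:
            back = lcp(rb(p1), rb(p2))          # backward match, counts position p
            fwd = lcp(suffix(p1), suffix(p2))   # forward match, counts position p
            length = back + fwd - 1             # position p was counted twice
            if length > len(best[0]):
                start = p1 - back               # 0-indexed start of the match in T
                best = ("".join(text[start:start + length]), p1, p2)
    return best


print(lcs_via_samples("aggctagctacct", "acacctaccctag", 5, {1, 2, 4}))
# ('ctacc', 11, 22)
```

On the slide's example the best pair is (11, 22): the reversed blocks c a t c g and c a t c c share 4 characters, the suffixes share 2, and 4 + 2 − 1 = 5 = |ctacc|.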

  17. A Solution for Two Strings
      When the LCS is long
      How to compute the pair \((p_1^*, p_2^*)\) faster than \(O\left(\frac{n^2}{\tau}\right)\)?
      T = a g g c t a g c t a c c t $₁ a c a c c t a c c c t a g $₂   (positions 1 to 28)
      \(SA_\tau\) = [14, 21, 17, 26, 6, 1, 16, 22, 11, 12, 19, 24, 4, 27, 7, 2, 9]
      \(LCP_\tau\) = [0, 3, 1, 2, 2, 0, 1, 2, 1, 2, 3, 4, 0, 1, 1, 0]
      \(SA^R_\tau\) = [14, 1, 17, 21, 26, 6, 16, 22, 11, 19, 12, 24, 4, 2, 27, 7, 9]
      \(LCP^R_\tau\) = [0, 1, 1, 4, 3, 0, 2, 4, 1, 3, 2, 1, 0, 2, 4, 0]
      Main observation: \(\mathrm{lcp}(T[p_1^*..], T[p_2^*..]) \in [\ell_{\max} - \tau + 1;\ \ell_{\max}]\), so we can ignore all pairs with lcp values smaller than \(\ell_{\max} - \tau + 1\).
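The four arrays on this slide can be reproduced with a naive construction that sorts the sampled positions directly, which is nowhere near the efficiency the talk is after. The assumptions are: τ = 5 with cover {1, 2, 4}, sampling by p mod τ, RB(p) as the reversed block T[p−τ+1..p], and \(\ell_{\max}\) read as the largest lcp between a sampled suffix of T1 and a sampled suffix of T2 (the slide does not define it). Under that reading the main observation follows because the reversed-block term is at most τ, while the pair attaining \(\ell_{\max}\) already scores at least \(\ell_{\max}\).

```python
def lcp(a, b):
    """Length of the longest common prefix of two sequences."""
    k = 0
    while k < len(a) and k < len(b) and a[k] == b[k]:
        k += 1
    return k


def sparse_arrays(samples, key):
    """Sort sampled positions by key(p); LCP[i] is the lcp of the entries i and i+1."""
    sa = sorted(samples, key=key)
    lcps = [lcp(key(p), key(q)) for p, q in zip(sa, sa[1:])]
    return sa, lcps


TAU, DC = 5, {1, 2, 4}
text = list("aggctagctacct") + ["$1"] + list("acacctaccctag") + ["$2"]
samples = [p for p in range(1, len(text) + 1) if p % TAU in DC]


def suffix(p):  # T[p..], 1-indexed
    return text[p - 1:]


def rb(p):  # assumed: reverse of T[p-TAU+1..p], clipped at position 1
    return text[max(0, p - TAU):p][::-1]


sa, lcp_arr = sparse_arrays(samples, suffix)
sa_r, lcp_r = sparse_arrays(samples, rb)
print(sa, lcp_arr)
print(sa_r, lcp_r)

# Assumed meaning of l_max: best forward lcp over sampled pairs from T1 and T2.
l_max = max(lcp(suffix(p), suffix(q))
            for p in samples if p <= 13
            for q in samples if 14 < q < 28)
print(l_max, l_max - TAU + 1)  # 4 0
```

In this small example the threshold \(\ell_{\max} - \tau + 1\) is 0, so no pair is pruned; the bound only pays off when \(\ell_{\max}\) is large relative to τ.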

