
Fast and Linear-Time String Matching Algorithms Based on the Distances of q-Gram Occurrences
Satoshi Kobayashi, Diptarama Hendrian, Ryo Yoshinaka, Ayumi Shinohara
Graduate School of Information Sciences, Tohoku University, Japan


  1. Fast and Linear-Time String Matching Algorithms Based on the Distances of q-Gram Occurrences
     Satoshi Kobayashi, Diptarama Hendrian, Ryo Yoshinaka, Ayumi Shinohara
     Graduate School of Information Sciences, Tohoku University, Japan

  2. String matching problem
     Input: text $T$, pattern $P$
     Output: all positions $i$ in $T$ such that $T[i : i + |P| - 1] = P$
     Example:
       i : 1 2 3 4 5 6 7 8 9 10 11 12 13
       T : a b a a b a b b a b  b  a  b
       P : a b b a
       Output: 6, 9
     Naive solution: $O(nm)$, where $n = |T|$ and $m = |P|$
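The naive $O(nm)$ solution on this slide can be sketched as follows. This is a minimal illustration, not code from the presentation; the function name `naive_match` is an assumption, and positions are reported 1-based to agree with the slide's example.

```python
def naive_match(text: str, pattern: str) -> list[int]:
    """Return all 1-based positions i with text[i : i + m - 1] == pattern.

    Every alignment is compared character by character, so the worst
    case is O(nm) for n = len(text) and m = len(pattern).
    """
    n, m = len(text), len(pattern)
    positions = []
    for i in range(n - m + 1):          # 0-based start of the window
        if text[i:i + m] == pattern:    # up to m character comparisons
            positions.append(i + 1)     # report 1-based, as on the slide
    return positions


# The example from the slide: T = abaababbabbab, P = abba.
print(naive_match("abaababbabbab", "abba"))  # [6, 9]
```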


  5. String matching algorithms
     • Knuth-Morris-Pratt (KMP) algorithm [Knuth+, 1977]
       • Preprocessing time: $O(m)$
       • Searching time: $O(n)$
     • Boyer-Moore algorithm [Boyer & Moore, 1977]
       • Preprocessing time: $O(m + \sigma)$
       • Searching time: $O(nm)$
       • Runs fast in practice
     Notation: $n$ is the text length, $m$ the pattern length, $\sigma$ the alphabet size.

  6. Our contributions
     Notation: $n = |T|$, $m = |P|$, $\sigma = |\Sigma|$ (alphabet size), $\omega$ (word length), $q$ (length of a q-gram)
     • Propose two string matching algorithms based on the distances of the q-gram occurrences
     • Both algorithms work in linear time in the input string size
     [Figure: map of the fastest algorithm for each dataset (English text, genome sequence, Fibonacci string) against pattern length $m$ = 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024]
     The proposed algorithms are compared with 15 strong algorithms published from 1977 to 2019.
       Algorithm                           Preprocess       Search
       BNDMq [Navarro & Raffinot, 1998]    $O(m+\sigma)$    $O(nm\lceil m/\omega \rceil)$
       SBNDMq [Holub & Durian, 2005]       $O(m+\sigma)$    $O(nm\lceil m/\omega \rceil)$
       FJS [Franek+, 2005]                 $O(m+\sigma)$    $O(n)$
       HASHq [Lecroq, 2007]                $O(mq)$          $O(n(m+q))$
       BSDMq [Faro & Lecroq, 2012]         $O(m)$           $O(nm)$
       WFRq [Cantone+, 2017]               $O(m)$           $O(nm)$
       LWFRq [Cantone+, 2019]              $O(m)$           $O(n)$
       DISTq (new)                         $O(mq)$          $O(nq)$
       LDISTq (new)                        $O(m)$           $O(n)$
     Naive solution: $O(nm)$

  7. Existing algorithms

  8. KMP algorithm [Knuth+, 1977]
       T : a b a b a b b b c a a c a c
       P : a b a b c
     [Animation: P is slid along T; legend: match, mismatch, match without comparison]
     Preprocessing time: $O(m)$. Searching time: $O(n)$.
     Notation: $n$ is the text length, $m$ the pattern length.


  23. KMP algorithm [Knuth+, 1977]
        T : a b a b a b b b c a a c a c
        P : a b a b c
      [Animation legend: match, mismatch, match without comparison]
      Shift in the example: KMP_Shift[5] = 2
      Strong_Bord($j$)
        Input: a mismatch position $j$ in the pattern
        Output: the maximum value $k$ ($0 \le k < j$) that satisfies $P[1 : k] = P[j-k : j-1]$ and $P[k+1] \ne P[j]$ ($-1$ if no such $k$ exists)
      KMP_Shift[$j$]: the shift amount when there is a mismatch at the $j$-th pattern position
        $\mathrm{KMP\_Shift}[j] = j - \mathrm{Strong\_Bord}(j) - 1$
        j         : 1 2 3 4 5 6
        P         : a b a b c
        KMP_Shift : 1 1 3 3 2 5
      Preprocessing time: $O(m)$. Searching time: $O(n)$.
      Notation: $n$ is the text length, $m$ the pattern length.
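The definitions of Strong_Bord and KMP_Shift can be checked with a short sketch. The code below transcribes the definition directly, so its preprocessing is slower than the $O(m)$ bound stated on the slide; it is meant only to reproduce the table. The function names and the sentinel character standing in for $P[m+1]$ are assumptions made for illustration.

```python
def strong_bord(pattern: str, j: int) -> int:
    """Largest k (0 <= k < j) with P[1:k] == P[j-k : j-1] and P[k+1] != P[j].

    j is 1-based and ranges over 1..m+1. Returns -1 if no such k exists.
    Assumption: a sentinel "\0" (not occurring in the pattern) plays the
    role of P[m+1], so j = m + 1 covers the shift after a full match.
    """
    p = pattern + "\0"
    for k in range(j - 1, -1, -1):
        # 1-based P[1:k] is p[:k]; 1-based P[j-k : j-1] is p[j-k-1 : j-1].
        if p[:k] == p[j - k - 1:j - 1] and p[k] != p[j - 1]:
            return k
    return -1


def kmp_shift_table(pattern: str) -> list[int]:
    """KMP_Shift[j] = j - Strong_Bord(j) - 1 for j = 1..m+1 (index j-1)."""
    m = len(pattern)
    return [j - strong_bord(pattern, j) - 1 for j in range(1, m + 2)]


def kmp_search(text: str, pattern: str) -> list[int]:
    """Report 1-based occurrences, never re-reading already matched text."""
    n, m = len(text), len(pattern)
    shift = kmp_shift_table(pattern)
    occurrences = []
    start, matched = 0, 0               # window start (0-based), matched prefix length
    while start + m <= n:
        while matched < m and text[start + matched] == pattern[matched]:
            matched += 1
        if matched == m:
            occurrences.append(start + 1)
        s = shift[matched]              # mismatch at 1-based position matched + 1
        start += s
        matched = max(0, matched - s)   # characters known to match after the shift
    return occurrences


print(kmp_shift_table("ababc"))              # [1, 1, 3, 3, 2, 5]
print(kmp_search("abaababbabbab", "abba"))   # [6, 9]
```

The first printed list agrees with the KMP_Shift row of the table for P = ababc.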


  25. HASHq algorithm [Lecroq, 2007]
      • Determines the equivalence of q-grams using the hash values of the q-grams
      Hash function (characters are treated as their ASCII codes):
        $h(x) = \left(2^{q-1} \cdot x[1] + 2^{q-2} \cdot x[2] + \cdots + 2 \cdot x[q-1] + x[q]\right) \bmod 2^{8}$
      Shift table:
        $\mathrm{shift}[h(x)] = m - \max\left(\{\, j \mid h(P[j-q+1 : j]) = h(x),\ q \le j \le m \,\} \cup \{q-1\}\right)$
      Example with 3-grams:
        T : a a a b a b a a b b a b a b b
        P : a b a a b b a b
        x        h(x)    shift[h(x)]
        aba      681     5
        baa      683     4
        aab      680     3
        abb      682     2
        bba      685     1
        bab      684     0
        Others   -       6  ($= m - q + 1$)
      For example, shift[h(baa)] = 4.
      [Animation legend: match, mismatch, match without comparison]
      Preprocessing time: $O(mq)$. Searching time: $O(n(m + q))$.
      Notation: $n$ is the text length, $m$ the pattern length, $\sigma$ the alphabet size.
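The hash and shift-table definitions can be exercised on the slide's example with a short sketch. It is a simplified illustration, not Lecroq's implementation: after a candidate window is verified the window advances by one position, which is always safe but is an assumption made here to keep the code short. The names `qgram_hash`, `build_shift`, and `hashq_search` are hypothetical. Note that the table lists 681 for aba, which equals the weighted sum before reduction modulo $2^8$, so the sketch keeps the raw sum and reduces it only when indexing the table.

```python
TABLE_SIZE = 2 ** 8  # modulus from the slide's hash function


def qgram_hash(x: str) -> int:
    """Weighted sum 2^(q-1)*x[1] + ... + 2*x[q-1] + x[q] over ASCII codes.

    The value is returned before reduction; callers reduce it modulo
    TABLE_SIZE when they index the shift table.
    """
    h = 0
    for ch in x:
        h = 2 * h + ord(ch)
    return h


def build_shift(pattern: str, q: int) -> list[int]:
    """shift[h] = m - (largest end position j of a q-gram of P hashing to h).

    Hash values that no q-gram of P produces keep the default m - q + 1.
    """
    m = len(pattern)
    shift = [m - q + 1] * TABLE_SIZE
    for j in range(q, m + 1):                     # 1-based end position
        gram = pattern[j - q:j]
        shift[qgram_hash(gram) % TABLE_SIZE] = m - j  # later j overwrites earlier
    return shift


def hashq_search(text: str, pattern: str, q: int = 3) -> list[int]:
    """Report 1-based occurrences of pattern in text (requires m >= q)."""
    n, m = len(text), len(pattern)
    if m < q or n < m:
        return []
    shift = build_shift(pattern, q)
    occurrences = []
    end = m                                       # 1-based end of the window
    while end <= n:
        s = shift[qgram_hash(text[end - q:end]) % TABLE_SIZE]
        if s == 0:
            # The last q-gram hashes like P's last q-gram: verify fully.
            if text[end - m:end] == pattern:
                occurrences.append(end - m + 1)
            end += 1                              # simplification: advance by one
        else:
            end += s
    return occurrences


print(qgram_hash("aba"))                              # 681
print(hashq_search("aaababaabbababb", "abaabbab"))    # [5]
```

On the slide's text the first window ends with baa, so it is shifted by 4; the next window ends with bab (shift 0), is verified, and yields the occurrence at position 5.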

