Three strategies for the dead-zone string matching algorithm

Three strategies for the dead-zone string matching algorithm. J. Daykin, R. Groult, Y. Guesnet, T. Lecroq, A. Lefebvre, M. Léonard, L. Mouchard, É. Prieur-Gaston and B. Watson. SeqBio 2018, 19–20 November 2018, Rouen, France


  1. Three strategies for the dead-zone string matching algorithm
     J. Daykin, R. Groult, Y. Guesnet, T. Lecroq, A. Lefebvre, M. Léonard, L. Mouchard, É. Prieur-Gaston and B. Watson
     SeqBio 2018, 19–20 November 2018, Rouen, France

  2. Outline
     1. Introduction
     2. Right-to-left
     3. Right-to-left with memory
     4. Alternating searching: right – left

  4. Notations
     - finite alphabet $\Sigma$
     - string $x[0..m-1]$ on $\Sigma^*$, of length $|x| = m$
     - $\tilde{x}$ is the reverse of $x$: $\tilde{x} = x[m-1]\,x[m-2] \cdots x[1]\,x[0]$
     - $x[i..j]$ is a factor (substring) of $x$ from position $i$ to position $j$ (both inclusive)
     - $x[0..i]$ is a prefix
     - $x[i..m-1]$ is a suffix
     - $u$ is a border of $x$ if $u$ is both a prefix and a suffix of $x$ and $u \neq x$
     - $\mathrm{Border}(x)$ is the longest border of $x$
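
The border notation above can be made concrete with a short sketch. It uses the classical linear-time border table; the function names `border_lengths` and `longest_border` are illustrative choices, not names from the slides, and the convention `border[0] = -1` is an assumption borrowed from common textbook presentations.

```python
def border_lengths(x: str) -> list[int]:
    """border[i] is the length of the longest border of the prefix x[0..i-1].

    By convention (an assumption of this sketch) border[0] = -1.
    """
    m = len(x)
    border = [-1] * (m + 1)
    k = -1  # length of the border currently being extended
    for i in range(m):
        # Fall back through shorter and shorter borders until x[i] extends one.
        while k >= 0 and x[k] != x[i]:
            k = border[k]
        k += 1
        border[i + 1] = k
    return border


def longest_border(x: str) -> str:
    """Border(x): the longest string, different from x, that is both a prefix and a suffix of x."""
    if not x:
        return ""
    return x[: border_lengths(x)[len(x)]]


print(longest_border("abab"))  # ab
```

Iterating `border` on its own result enumerates all borders of a string from longest to shortest, which is what the iterated notation Border^l(u) on the Knuth-Morris-Pratt slide relies on.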

  5. Exact String Matching Problem
     Searching for all exact occurrences of a pattern $x$ ($|x| = m$) in a text $y$ ($|y| = n$).
     2 variants:
     - on-line (preprocessing of the pattern)
     - off-line (preprocessing of the text)

  6. Exact On-Line String Matching
     - Simone Faro and Thierry Lecroq. The Exact Online String Matching Problem: a Review of the Most Recent Results. ACM Computing Surveys 45 (2) (2013) 13. https://smart-tool.github.io/smart/
     - Christian Charras and Thierry Lecroq. Handbook of exact string matching algorithms. King's College Publications, 2004. http://www-igm.univ-mlv.fr/~lecroq/string/

  7. Sliding window
     Classical solutions (KMP, BM, ...): preprocessing of the pattern and use of a sliding window.

  8. Sliding window
     [Figure: the pattern $x$ (length $m$) aligned with a window on the text $y$ (length $n$), shown at three successive window positions.]

  9. Sliding window
     An on-line exact string matching algorithm can then be viewed as a succession of:
     - attempts (comparison of the window content and the pattern);
     - shifts (of the window to the right).
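
The attempt/shift view can be sketched with the simplest possible instance, where every shift has length 1. This is a baseline for illustration only, not one of the algorithms of the talk; the name `naive_search` is an arbitrary choice.

```python
def naive_search(x: str, y: str) -> list[int]:
    """Report every position j such that y[j..j+m-1] == x, using shifts of length 1."""
    m, n = len(x), len(y)
    occurrences = []
    j = 0  # left end of the window on the text
    while j <= n - m:
        # Attempt: compare the window content with the pattern.
        i = 0
        while i < m and x[i] == y[j + i]:
            i += 1
        if i == m:
            occurrences.append(j)
        # Shift: move the window to the right (always by 1 in this baseline).
        j += 1
    return occurrences


print(naive_search("aba", "ababa"))  # [0, 2]
```

Algorithms such as KMP and BM keep this skeleton and replace the constant shift with a precomputed, usually longer, safe shift.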

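  10. Knuth-Morris-Pratt algorithm (1977)
      [Figure: window at position $j$ of $y$; comparisons run left to right; a prefix $u$ of $x$ matches the text, then symbol $a$ of $x$ mismatches symbol $b$ of $y$; the shifted pattern aligns its prefix $z$ followed by a symbol $c \neq a$.]
      $k = \min\{\ell \mid x[|\mathrm{Border}^{\ell}(u)|] \neq a\}$ and $z = \mathrm{Border}^{k}(u)$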
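
A minimal sketch of the Knuth-Morris-Pratt shift follows, assuming the usual textbook formulation: for each mismatch position the table stores the length of the longest border of the matched prefix $u$ whose next symbol differs from the mismatching symbol $a$ (or -1 if none exists). The names `kmp_table` and `kmp_search` are illustrative.

```python
def kmp_table(x: str) -> list[int]:
    """table[i] = |z|, where z is the longest border of x[0..i-1] with x[|z|] != x[i]; -1 if none."""
    m = len(x)
    table = [0] * (m + 1)
    table[0] = -1
    i, j = 0, -1
    while i < m:
        while j > -1 and x[i] != x[j]:
            j = table[j]
        i += 1
        j += 1
        if i < m and x[i] == x[j]:
            # Same next symbol: this border would fail again, so reuse its own fallback.
            table[i] = table[j]
        else:
            table[i] = j
    return table


def kmp_search(x: str, y: str) -> list[int]:
    """Left-to-right scan of y reporting all occurrences of the non-empty pattern x."""
    m, n = len(x), len(y)
    table = kmp_table(x)
    occurrences = []
    i = 0  # current position in the pattern
    for j in range(n):  # current position in the text
        while i > -1 and x[i] != y[j]:
            i = table[i]
        i += 1
        if i == m:
            occurrences.append(j + 1 - m)
            i = table[i]
    return occurrences


print(kmp_search("aba", "ababa"))  # [0, 2]
```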

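  11. Boyer-Moore algorithm (1977)
      [Figure: comparisons run right to left; a suffix $v$ of $x$ matches the text, then symbol $a$ of $x$ mismatches symbol $b$ of $y$; the shifted pattern aligns another occurrence of $v$ preceded by a symbol $c$.]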

  12. Dead Zone strategy
      Bruce W. Watson and Richard E. Watson. A New Family of String Pattern Matching Algorithms. In: Jan Holub, editor, Proceedings of the Prague Stringology Club Workshop 1997, Prague, Czech Republic, July 7, 1997, Department of Computer Science and Engineering, Faculty of Electrical Engineering, Czech Technical University, 12–23.

  13. Dead Zone strategy
      [Figure: seven animation frames illustrating the dead-zone strategy; the images are not reproduced in this transcript.]
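
Since the figures are missing, here is a rough sketch of the dead-zone idea as it is commonly described: candidate window positions form a live zone, an attempt is made somewhere inside it (the middle, in this sketch), positions ruled out to the left and to the right of the attempt become dead, and the two remaining live zones are processed in turn. The function name, the explicit stack, the choice of the middle position and the `shifts` callback are all assumptions of this sketch; with the default shifts of 1 it only kills the attempted position itself.

```python
def dead_zone_search(x, y, shifts=None):
    """Report all occurrences of x in y by repeatedly splitting live zones.

    shifts(i) is a hypothetical callback returning (left_shift, right_shift)
    for a mismatch at pattern position i (i == -1 after a full match).
    Without it, both shifts default to 1.
    """
    m, n = len(x), len(y)
    occurrences = []
    if m == 0 or m > n:
        return occurrences
    live = [(0, n - m + 1)]  # half-open ranges of candidate window positions
    while live:
        lo, hi = live.pop()
        if lo >= hi:
            continue  # nothing left alive in this range
        j = (lo + hi) // 2
        # Attempt at position j, comparing right to left.
        i = m - 1
        while i >= 0 and x[i] == y[j + i]:
            i -= 1
        if i < 0:
            occurrences.append(j)
        left, right = shifts(i) if shifts else (1, 1)
        # Positions j-left+1 .. j+right-1 are now dead; keep the two live zones.
        live.append((lo, j - left + 1))
        live.append((j + right, hi))
    return sorted(occurrences)


print(dead_zone_search("aba", "ababa"))  # [0, 2]
```

Occurrences are found out of order, hence the final sort. Larger safe shifts in either direction enlarge the dead zone and reduce the number of attempts.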

  20. Our contributions
      Three strategies:
      - Right-to-left
      - Right-to-left with memory
      - Alternating searching: right – left

  22. Right-to-left
      [Figure: pattern $x$ with the mismatching symbol $a$ at position $i$ followed by the matched suffix $v$; after rshift, $x$ aligns a re-occurrence of $v$ preceded by $b$; after lshift, $x$ aligns its suffix $z$ preceded by $c$.]
      A suffix $v$ of the pattern matches in the text and a mismatch occurs with $a$ at position $i$ in the pattern.
      - The right shift (rshift) consists in finding a re-occurrence of $v$ in the pattern preceded by a symbol $b$ different from $a$.
      - The left shift (lshift) consists in finding the longest suffix $z$ of the pattern preceded by a symbol $c$ different from $a$.
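
The two shifts can be computed directly from their definitions by brute force. The sketch below is one reading of the slide, not the authors' preprocessing: a shift of length $d$ is accepted when the shifted pattern agrees with the matched suffix $v$ wherever they overlap and does not put the symbol $a$ back on the mismatch position. The function names and the convention `i == -1` for a full match are assumptions, and the running time is far from the linear preprocessing mentioned in the talk.

```python
def right_shift(x: str, i: int) -> int:
    """Smallest d > 0 by which x can move right after x[i+1..m-1] matched and x[i] mismatched."""
    m = len(x)
    for d in range(1, m + 1):
        # The shifted pattern must agree with the matched suffix where they overlap.
        agrees = all(x[p - d] == x[p] for p in range(max(i + 1, d), m))
        # The symbol now facing the mismatch position must differ from x[i].
        differs = i < 0 or i - d < 0 or x[i - d] != x[i]
        if agrees and differs:
            return d
    return m  # not reached: d == m is always accepted


def left_shift(x: str, i: int) -> int:
    """Smallest d > 0 by which x can move left after x[i+1..m-1] matched and x[i] mismatched."""
    m = len(x)
    for d in range(1, m + 1):
        # The suffix x[i+1+d..m-1] of the shifted pattern must match a prefix of v.
        agrees = all(x[p + d] == x[p] for p in range(i + 1, m - d))
        # If the shifted pattern still covers the mismatch position, its symbol must differ.
        differs = i < 0 or i + d > m - 1 or x[i + d] != x[i]
        if agrees and differs:
            return d
    return m  # not reached: d == m is always accepted


x = "aaaa"
print([right_shift(x, i) for i in range(len(x))])  # [1, 2, 3, 4]
print([left_shift(x, i) for i in range(len(x))])   # [4, 3, 2, 1]
```

For `"aaaa"` a mismatch means the text symbol is not `a`, so any shift that keeps the pattern over that position is useless, which is why both tables move the pattern entirely past it.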

  23. Right-to-left
      - right shift: similar to the Boyer-Moore algorithm
      - left shift: similar to the Knuth-Morris-Pratt algorithm (but from right to left)
      - preprocessing phase linear in time and space

  25. Right-to-left with memory
      [Figure: window at position $j$ of $y$; comparisons run right to left; a factor $z$ of $x$ ending at position $k$ matches the text, and symbol $a = x[i]$ mismatches the text symbol $b$.]
      When $x[i] \neq y[j+i]$ and $x[i+1..m-1] = y[j+i+1..j+m-1]$, then $\mathit{skip}_1[j+k] = k-i$ and $\mathit{skip}_2[j+k] = k$ for $i+1 \le k \le m-1$.

  26. Right-to-left with memory
      [Figure: the current attempt compares $x[i]$ with $y[i+j]$; a previous attempt left a match of length $\ell$ ending at position $k$ of $x$ and at position $i+j$ of $y$.]
      If $k = \mathit{skip}_2[i+j] > 0$, it means that $x[k-\ell+1..k] = y[i+j-\ell+1..i+j]$ with $\ell = \mathit{skip}_1[i+j]$, and furthermore $x[k-\ell] \neq y[i+j-\ell]$ if $k \ge \ell$.
      We need to know whether $y[i+j-\ell+1..i+j] = x[i-\ell+1..i]$, and thus whether $x[k-\ell+1..k] = x[i-\ell+1..i]$.

  30. Right-to-left with memory
      $x[k-\ell+1..k] \stackrel{?}{=} x[i-\ell+1..i]$
      - Longest common prefix of the suffixes of $\tilde{x}$ starting at positions $m-1-k$ and $m-1-i$
      - Can be answered in constant time after linear preprocessing: RMQ on LCP of $\tilde{x}$
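
The reduction to a longest-common-prefix query can be checked with a small sketch. It deliberately replaces the constant-time RMQ-on-LCP structure of the slide with a direct character scan, so each query costs time proportional to the answer; only the index translation between $x$ and its reverse is the point here. The function names are illustrative, and the sketch assumes $k-\ell+1 \ge 0$ and $i-\ell+1 \ge 0$.

```python
def lcp(s: str, a: int, b: int) -> int:
    """Length of the longest common prefix of the suffixes s[a..] and s[b..] (naive scan)."""
    n = len(s)
    length = 0
    while a + length < n and b + length < n and s[a + length] == s[b + length]:
        length += 1
    return length


def factors_equal(x: str, k: int, i: int, ell: int) -> bool:
    """Decide whether x[k-ell+1..k] == x[i-ell+1..i] through an LCP query on the reverse of x."""
    m = len(x)
    x_rev = x[::-1]
    # Position p of x becomes position m-1-p of the reverse, so the two factors,
    # read right to left, are prefixes of the suffixes starting at m-1-k and m-1-i.
    return lcp(x_rev, m - 1 - k, m - 1 - i) >= ell


print(factors_equal("abcab", 4, 1, 2))  # True: x[3..4] == x[0..1] == "ab"
print(factors_equal("abcab", 4, 2, 1))  # False: x[4] == "b", x[2] == "c"
```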

  31. Right-to-left with memory
      - $\mathit{skip}_1$ and $\mathit{skip}_2$ need a stack: the mismatch position is not known for all the matching positions
      - $O(n)$ space

  33. Alternating searching: right – left
      Order of comparisons: $x[m-1], x[0], x[m-2], x[1], x[m-3], \ldots$
      4 shift arrays:
      - right shift after a right mismatch, stored in array rsrm
      - left shift after a right mismatch, stored in array lsrm
      - right shift after a left mismatch, stored in array rslm
      - left shift after a left mismatch, stored in array lslm
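
The comparison order can be generated as follows. This is a small sketch of the ordering only; the names `alternating_order` and `first_mismatch` are illustrative, and the choice of returning -1 for a full match is an assumption.

```python
def alternating_order(m: int) -> list[int]:
    """Pattern positions in the order m-1, 0, m-2, 1, m-3, ..."""
    order = []
    left, right = 0, m - 1
    while left <= right:
        order.append(right)
        if left < right:
            order.append(left)
        right -= 1
        left += 1
    return order


def first_mismatch(x: str, y: str, j: int) -> int:
    """Position in x of the first mismatch for the window at j, or -1 if x occurs at j."""
    for i in alternating_order(len(x)):
        if x[i] != y[j + i]:
            return i
    return -1


print(alternating_order(5))                # [4, 0, 3, 1, 2]
print(first_mismatch("abcd", "abxd", 0))   # 2
```

A mismatch found at a position from the right half of the order is a right mismatch, and one found at a position from the left half is a left mismatch, which is what selects among the four shift arrays.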

  34. Alternating searching: right – left
      2 conditions:
      $$\mathrm{occCond}'_x(i, d) = (0 < d \le i \text{ and } x[i-d] \neq x[i]) \text{ or } (i < d)$$
      $$\begin{aligned}
      \mathrm{suffCond}'_x(i, d) = {} & (0 < d \le m-2-i \text{ and } x[d..m-2-i] \text{ is a prefix of } x \text{ and } x[i-d+1..m-d-1] \text{ is a suffix of } x) \\
      & \text{or } (m-2-i < d \le i+1 \text{ and } x[i-d+1..m-d-1] \text{ is a suffix of } x) \\
      & \text{or } (i+1 < d \text{ and } x[0..m-d-1] \text{ is a suffix of } x)
      \end{aligned}$$
      $$\mathit{rsrm}[i] = \min\{d \mid \mathrm{occCond}'_x(i, d) \text{ and } \mathrm{suffCond}'_x(i, d) \text{ are satisfied}\}$$
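
The two conditions translate almost literally into a brute-force computation of one entry of rsrm. This is a sketch under assumptions: it restricts $i$ to right-mismatch positions (those with $2i \ge m-1$, which follows from the comparison order), limits $d$ to $1..m$, and treats an empty factor as trivially being a prefix and a suffix. The helper names are illustrative, and no attempt is made at an efficient preprocessing.

```python
def factor(x: str, a: int, b: int) -> str:
    """x[a..b] with both ends inclusive; the empty string when b < a."""
    return x[a:b + 1] if a <= b else ""


def occ_cond(x: str, i: int, d: int) -> bool:
    return (0 < d <= i and x[i - d] != x[i]) or i < d


def suff_cond(x: str, i: int, d: int) -> bool:
    m = len(x)
    case1 = (0 < d <= m - 2 - i
             and x.startswith(factor(x, d, m - 2 - i))
             and x.endswith(factor(x, i - d + 1, m - d - 1)))
    case2 = (m - 2 - i < d <= i + 1
             and x.endswith(factor(x, i - d + 1, m - d - 1)))
    case3 = i + 1 < d and x.endswith(factor(x, 0, m - d - 1))
    return case1 or case2 or case3


def rsrm_entry(x: str, i: int) -> int:
    """Right shift after a right mismatch at position i (brute force over d)."""
    m = len(x)
    if 2 * i < m - 1 or i > m - 1:
        raise ValueError("i must be a right-mismatch position")
    # d == m always satisfies both conditions, so the minimum exists.
    return min(d for d in range(1, m + 1)
               if occ_cond(x, i, d) and suff_cond(x, i, d))


print(rsrm_entry("aaaa", 3))  # 4
print(rsrm_entry("abab", 3))  # 1
```

For `"abab"` with a mismatch at position 2, the matched prefix `a` and matched suffix `b` rule out shifts 1 and 3, and the mismatch itself rules out shift 2, so the entry is 4.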

  35. Right shift after a right mismatch
      A prefix $u$ and a suffix $v$ (of the same length) of $x$ match the text and a mismatch occurs with symbol $a$ at position $i$ of $x$:
      - the suffix $v$ of $x$ reoccurs preceded by a symbol $b$ different from $a$, and a prefix $u'$ of $x$ matches a suffix of $u$;
      - only a suffix $v$ of $x$ reoccurs preceded by a symbol $b$ different from $a$;
      [Figures: one diagram per case, showing $x$ with $u$, $a$ at position $i$ and $v$, above the pattern shifted by $d$.]

  36. Right shift after a right mismatch
      - only a prefix $v'$ of $x$ matches a suffix of $v$.
      [Figure: $x$ with $u$, $a$ at position $i$ and $v$, above the pattern shifted by $d$ so that its prefix $v'$ overlaps the end of $v$.]
