Three strategies for the dead-zone string matching algorithm. J. Daykin, R. Groult, Y. Guesnet, T. Lecroq, A. Lefebvre, M. Léonard, L. Mouchard, É. Prieur-Gaston and B. Watson. SeqBio 2018, 19-20 November 2018, Rouen, France.


slide-1
SLIDE 1

Three strategies for the dead-zone string matching algorithm

  • J. Daykin, R. Groult, Y. Guesnet, T. Lecroq, A. Lefebvre, M. Léonard, L. Mouchard, É. Prieur-Gaston and B. Watson

SeqBio 2018 19 – 20 November 2018 – Rouen, France

slide-2
SLIDE 2

Outline

  1. Introduction
  2. Right-to-left
  3. Right-to-left with memory
  4. Alternating searching: right – left

Daykin et al Dead-zone SeqBio’18 2 / 35

slide-4
SLIDE 4

Notations

  • finite alphabet Σ
  • string x[0 . . m − 1] in Σ∗ of length |x| = m
  • x̃ is the reverse of x: x[m − 1]x[m − 2] · · · x[1]x[0]
  • x[i . . j] is a factor (substring) of x from position i to position j (both inclusive)
  • x[0 . . i] is a prefix of x
  • x[i . . m − 1] is a suffix of x
  • u is a border of x if u is both a prefix and a suffix of x
  • Border(x) is the longest border of x
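
The notations above can be illustrated with a small Python sketch. The function names are chosen here for illustration, and the sketch assumes that a border is a proper one (shorter than x itself), which the slide does not state explicitly.

```python
def reverse(x: str) -> str:
    """Return the reverse of x, i.e. x[m-1] x[m-2] ... x[0]."""
    return x[::-1]


def factor(x: str, i: int, j: int) -> str:
    """Return the factor x[i . . j], both positions inclusive."""
    return x[i:j + 1]


def border(x: str) -> str:
    """Return the longest proper border of x (empty string if none)."""
    m = len(x)
    # Try candidate lengths from the longest proper one down to 1.
    for k in range(m - 1, 0, -1):
        if x[:k] == x[m - k:]:
            return x[:k]
    return ""
```

For instance, `border("abaab")` is `"ab"`, which is both a prefix and a suffix of the string.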

slide-5
SLIDE 5

Exact String Matching

Problem: searching for all exact occurrences of a pattern x (|x| = m) in a text y (|y| = n).

2 variants:

  • on-line (preprocessing of the pattern)
  • off-line (preprocessing of the text)

slide-6
SLIDE 6

Exact On-Line String Matching

  • Christian Charras and Thierry Lecroq, Handbook of exact string matching algorithms, King’s College Publications, 2004. http://www-igm.univ-mlv.fr/~lecroq/string/
  • Simone Faro and Thierry Lecroq, The Exact Online String Matching Problem: a Review of the Most Recent Results, ACM Computing Surveys 45(2) (2013) 13. https://smart-tool.github.io/smart/

slide-7
SLIDE 7

Sliding window

Classical solutions (KMP, BM, ...): preprocessing of the pattern and use of a sliding window.

slide-8
SLIDE 8

Sliding window

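[Figure: the window, holding the pattern x of length m, shown at three successive positions along the text y of length n.]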

slide-9
SLIDE 9

Sliding window

An on-line exact string matching algorithm can then be viewed as a succession of:

  • attempts (comparison of the window content and the pattern);
  • shifts (of the window to the right).
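
The attempt/shift view can be sketched with the most naive instance, where every shift has length 1. This is only an illustration of the sliding-window scheme, not one of the algorithms of the talk; `naive_search` is a name chosen here.

```python
def naive_search(x: str, y: str) -> list[int]:
    """Report every position j such that x occurs in y at position j."""
    m, n = len(x), len(y)
    occurrences = []
    j = 0                      # left end of the window in the text
    while j <= n - m:
        # Attempt: compare the window content y[j . . j+m-1] with x.
        i = 0
        while i < m and x[i] == y[j + i]:
            i += 1
        if i == m:
            occurrences.append(j)
        # Shift: move the window to the right (always by 1 here).
        j += 1
    return occurrences
```

KMP and BM keep the same outer loop but replace the last line with a longer, precomputed shift.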

slide-10
SLIDE 10

Knuth-Morris-Pratt algorithm (1977)

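[Figure: comparisons run from left to right. A prefix u of x matches the text y and a mismatch occurs between the next symbol a of x and the symbol b of y. After the shift, a border z of u, followed in x by a symbol c ≠ a, is aligned with the end of u.]

k = min{ℓ | x[|Border^ℓ(u)|] ≠ a} and z = Border^k(u)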
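
A minimal Python sketch of a search built on this shift rule, following the classical textbook formulation of Knuth-Morris-Pratt rather than anything specific to the slides. The names `kmp_next` and `kmp_search` are chosen here; `nxt[i]` is assumed to hold the length of the border z selected after a mismatch at position i (or -1 when no border qualifies).

```python
def kmp_next(x: str) -> list[int]:
    """nxt[i]: length of the longest border of x[0 . . i-1] whose following
    symbol differs from x[i]; -1 if there is none. nxt[m] is the longest border of x."""
    m = len(x)
    nxt = [0] * (m + 1)
    nxt[0] = -1
    i, j = 0, -1
    while i < m:
        while j > -1 and x[i] != x[j]:
            j = nxt[j]
        i += 1
        j += 1
        if i < m and x[i] == x[j]:
            nxt[i] = nxt[j]    # same symbol would mismatch again: skip this border
        else:
            nxt[i] = j
    return nxt


def kmp_search(x: str, y: str) -> list[int]:
    """Report all occurrences of a non-empty pattern x in y, scanning left to right."""
    m, n = len(x), len(y)
    if m == 0:
        return []
    nxt = kmp_next(x)
    occurrences = []
    i = 0                      # current position in the pattern
    for j in range(n):         # current position in the text
        while i > -1 and x[i] != y[j]:
            i = nxt[i]         # shift: realign the selected border
        i += 1
        if i == m:
            occurrences.append(j - m + 1)
            i = nxt[i]
    return occurrences
```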

slide-11
SLIDE 11

Boyer-Moore algorithm (1977)

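[Figure: comparisons run from right to left. A suffix v of x matches the text y and a mismatch occurs between the symbol a of x and the symbol b of y. After the shift, a re-occurrence of v in x, preceded by a symbol c, is aligned with v.]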

slide-12
SLIDE 12

Dead Zone strategy

Bruce W. Watson and Richard E. Watson A New Family of String Pattern Matching Algorithms In: Jan Holub editor, Proceedings of the Prague Stringology Club Workshop 1997, Prague, Czech Republic, July 7, 1997, Department of Computer Science and Engineering, Faculty of Electrical Engineering, Czech Technical University, 12–23.

slide-20
SLIDE 20

Our contributions

Three strategies:

  • Right-to-left
  • Right-to-left with memory
  • Alternating searching: right – left

slide-22
SLIDE 22

Right-to-left

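[Figure: the window x with the matched suffix v and the mismatch symbol a at position i; a copy of x moved to the right by rshift, where v reoccurs preceded by b ≠ a; a copy of x moved to the left by lshift, where the suffix z is preceded by c ≠ a.]

A suffix v of the pattern matches in the text and a mismatch occurs with a at position i in the pattern.

  • The right shift (rshift) consists in finding a re-occurrence of v in the pattern preceded by a symbol b different from a.
  • The left shift (lshift) consists in finding the longest suffix z of the pattern that is also a prefix of v and is preceded by a symbol c different from a.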

slide-23
SLIDE 23

Right-to-left

  • right shift: similar to the shift of the Boyer-Moore algorithm
  • left shift: similar to the shift of the Knuth-Morris-Pratt algorithm (but from right to left)
  • preprocessing phase linear in time and space
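
The right-to-left strategy can be sketched in Python under several assumptions that are not taken from the slides: the window is probed in the middle of each live zone, live zones are kept on an explicit stack, and both shift tables are computed by brute force from their definitions (cubic time) instead of the linear preprocessing mentioned above. The names `shift_tables` and `dead_zone_search` are chosen here.

```python
def shift_tables(x: str):
    """Brute-force lshift and rshift, indexed by the mismatch position i
    (i = -1 stands for a full match). A shift d is kept only if the moved
    window is compatible with what the attempt has learned about the text."""
    m = len(x)

    def right_ok(i, d):
        # Window moved d positions to the right.
        if 0 <= i - d and x[i - d] == x[i]:
            return False       # the same symbol a would face the same text symbol
        return all(x[p - d] == x[p] for p in range(max(i + 1, d), m))

    def left_ok(i, d):
        # Window moved d positions to the left.
        if i >= 0 and i + d < m and x[i + d] == x[i]:
            return False
        return all(x[p + d] == x[p] for p in range(i + 1, m - d))

    lshift, rshift = {}, {}
    for i in range(-1, m):
        rshift[i] = next((d for d in range(1, m + 1) if right_ok(i, d)), 1)
        lshift[i] = next((d for d in range(1, m + 1) if left_ok(i, d)), 1)
    return lshift, rshift


def dead_zone_search(x: str, y: str) -> list[int]:
    """Report all occurrences of x in y; window positions ruled out by a
    shift become dead and are never probed."""
    m, n = len(x), len(y)
    if m == 0 or m > n:
        return []
    lshift, rshift = shift_tables(x)
    occurrences = []
    live = [(0, n - m + 1)]            # half-open ranges of live window positions
    while live:
        lo, hi = live.pop()
        if lo >= hi:
            continue
        j = (lo + hi) // 2             # probe position (assumption: the middle)
        i = m - 1
        while i >= 0 and x[i] == y[j + i]:   # attempt, from right to left
            i -= 1
        if i < 0:
            occurrences.append(j)
        live.append((lo, j - lshift[i] + 1))  # live zone left of the dead zone
        live.append((j + rshift[i], hi))      # live zone right of the dead zone
    return sorted(occurrences)
```

The two appended ranges make the roles of the shifts visible: lshift extends the dead zone to the left of the probe and rshift extends it to the right.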

slide-25
SLIDE 25

Right-to-left with memory

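[Figure: comparisons run from right to left in the window x at position j of the text y. The suffix z of x matches and a mismatch occurs between the symbol a of x and the symbol b of y at position i of x; k is a position inside the matched suffix.]

When x[i] ≠ y[j + i] and x[i + 1 . . m − 1] = y[j + i + 1 . . j + m − 1], then skip1[j + k] = k and skip2[j + k] = k − i for i + 1 ≤ k ≤ m − 1.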

slide-26
SLIDE 26

Right-to-left with memory

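[Figure: the text y with the current window x at position j and the current pattern position i; an earlier window x whose position k was aligned with text position i + j, with a matched factor of length ℓ ending there. The two factors to compare are marked with question marks.]

If k = skip2[i + j] > 0, it means that x[k − ℓ + 1 . . k] = y[i + j − ℓ + 1 . . i + j] with ℓ = skip1[i + j], and furthermore x[k − ℓ] ≠ y[i + j − ℓ] if k ≥ ℓ.

We need to know whether y[i + j − ℓ + 1 . . i + j] = x[i − ℓ + 1 . . i], and thus we need to know whether x[k − ℓ + 1 . . k] = x[i − ℓ + 1 . . i].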

slide-30
SLIDE 30

Right-to-left with memory

x[k − ℓ + 1 . . k] =? x[i − ℓ + 1 . . i]

  • Longest common prefix of the suffixes of x̃ starting at positions m − 1 − k and m − 1 − i
  • Can be answered in constant time after linear preprocessing: RMQ on LCP of x̃
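
The reduction to a longest-common-prefix query on the reversed pattern can be checked with a short Python sketch. It deliberately swaps the constant-time RMQ structure for a naive character-by-character LCP, so it only illustrates the index arithmetic; `lcp` and `same_factor` are names chosen here.

```python
def lcp(s: str, a: int, b: int) -> int:
    """Length of the longest common prefix of the suffixes s[a:] and s[b:]
    (naive scan, standing in for an RMQ on an LCP structure)."""
    h = 0
    while a + h < len(s) and b + h < len(s) and s[a + h] == s[b + h]:
        h += 1
    return h


def same_factor(x: str, k: int, i: int, length: int) -> bool:
    """Decide whether x[k-length+1 . . k] == x[i-length+1 . . i], assuming
    1 <= length <= min(k, i) + 1, via one LCP query on the reverse of x."""
    m = len(x)
    x_rev = x[::-1]
    # Position p of x is position m - 1 - p of the reverse, so reading x
    # leftwards from k and i is reading the reverse rightwards.
    return lcp(x_rev, m - 1 - k, m - 1 - i) >= length
```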

slide-31
SLIDE 31

Right-to-left with memory

  • skip1 and skip2 need a stack: the mismatch position is not known for all the matching positions
  • O(n) space

slide-33
SLIDE 33

Alternating searching: right – left

Order of comparisons: x[m − 1], x[0], x[m − 2], x[1], x[m − 3], . . .

4 shift arrays:

  • right shift after a right mismatch, stored in array rsrm
  • left shift after a right mismatch, stored in array lsrm
  • right shift after a left mismatch, stored in array rslm
  • left shift after a left mismatch, stored in array lslm
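
The order of comparisons can be made concrete with a short Python sketch; `alternating_order` and `attempt` are names chosen here, and the returned mismatch position is what would index one of the four shift arrays.

```python
def alternating_order(m: int) -> list[int]:
    """Positions of a pattern of length m in the order m-1, 0, m-2, 1, ..."""
    order = []
    left, right = 0, m - 1
    while left <= right:
        order.append(right)        # right side first
        if left < right:
            order.append(left)     # then the left side
        right -= 1
        left += 1
    return order


def attempt(x: str, y: str, j: int) -> int:
    """Compare x with the window y[j . . j+m-1] in alternating order and
    return the first mismatching pattern position, or -1 on a full match."""
    for i in alternating_order(len(x)):
        if x[i] != y[j + i]:
            return i
    return -1
```

With m = 5 the positions are visited in the order 4, 0, 3, 1, 2.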

slide-34
SLIDE 34

Alternating searching: right – left

2 conditions:

occCond′x(i, d) = (0 < d ≤ i and x[i − d] ≠ x[i]) or (i < d)

suffCond′x(i, d) =
  (0 < d ≤ m − 2 − i and x[d . . m − 2 − i] is a prefix of x and x[i − d + 1 . . m − d − 1] is a suffix of x)
  or (m − 2 − i < d ≤ i + 1 and x[i − d + 1 . . m − d − 1] is a suffix of x)
  or (i + 1 < d and x[0 . . m − d − 1] is a suffix of x)

rsrm[i] = min{d | occCond′x(i, d) and suffCond′x(i, d) are satisfied}.
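
The definition of rsrm can be evaluated directly, as in the brute-force Python sketch below. It reads the occurrence condition with "≠" and assumes that right mismatches only occur at positions i ≥ ⌊m/2⌋ (which follows from the alternating order of comparisons); it makes no attempt at linear-time preprocessing. The function names are chosen here.

```python
def occ_cond(x: str, i: int, d: int) -> bool:
    """Occurrence condition: the symbol brought under position i differs from x[i]."""
    return (0 < d <= i and x[i - d] != x[i]) or i < d


def suff_cond(x: str, i: int, d: int) -> bool:
    """Suffix condition for a right mismatch at position i and a shift d <= m."""
    m = len(x)
    if 0 < d <= m - 2 - i:
        # Both the matched prefix and the matched suffix constrain the shift.
        return x.startswith(x[d:m - 1 - i]) and x.endswith(x[i - d + 1:m - d])
    if m - 2 - i < d <= i + 1:
        # Only the matched suffix constrains the shift.
        return x.endswith(x[i - d + 1:m - d])
    # i + 1 < d: only a prefix of x still overlaps the matched suffix.
    return i + 1 < d and x.endswith(x[:m - d])


def rsrm_table(x: str) -> dict[int, int]:
    """rsrm[i] for every right-mismatch position i, by exhaustive search on d."""
    m = len(x)
    return {
        i: min(d for d in range(1, m + 1) if occ_cond(x, i, d) and suff_cond(x, i, d))
        for i in range(m // 2, m)
    }
```

The shift d = m always satisfies both conditions, so the minimum is well defined.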

slide-35
SLIDE 35

Right shift after a right mismatch

A prefix u and a suffix v (of the same length) of x match the text and a mismatch occurs with symbol a at position i of x:

  • the suffix v of x reoccurs preceded by a symbol b different from a and a prefix u′ of x matches a suffix of u;
    [Figure: x and a copy of x shifted by d; u′ lies under the end of u, v under v, and b ≠ a at position i.]
  • only a suffix v of x reoccurs preceded by a symbol b different from a;
    [Figure: x and a copy of x shifted by d; v lies under v and b ≠ a at position i.]

slide-36
SLIDE 36

Right shift after a right mismatch

  • only a prefix v′ of x matches a suffix of v.
    [Figure: x and a copy of x shifted by d; the prefix v′ of the shifted copy lies under the end of v.]

slide-37
SLIDE 37

Right shift after a right mismatch

The suffix v of x reoccurs preceded by a symbol b different from a and a prefix u′ of x matches a suffix of u.

Lemma. For 0 ≤ i ≤ m − 1, rsrm[m − 1 − suff[i]] ≤ m − 1 − i if m − 1 − i < suff[i] and pref[m − 1 − i] ≥ suff[i] − m + 1 + i,

where suff[i] is the length of the longest common suffix between x[0 . . i] and x, and pref[i] is the length of the longest common prefix between x[i . . m − 1] and x (classical computation in linear time and space).
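
The two tables used in the lemma can be computed straight from their definitions. The Python sketch below is quadratic in the worst case and only meant to pin down the indexing; it is not the classical linear-time computation the slide refers to, and the function names are chosen here.

```python
def suff_table(x: str) -> list[int]:
    """suff[i]: length of the longest common suffix of x[0 . . i] and x."""
    m = len(x)
    suff = [0] * m
    for i in range(m):
        k = 0
        # Extend leftwards from position i and from the end of x.
        while k <= i and x[i - k] == x[m - 1 - k]:
            k += 1
        suff[i] = k
    return suff


def pref_table(x: str) -> list[int]:
    """pref[i]: length of the longest common prefix of x[i . . m-1] and x."""
    m = len(x)
    pref = [0] * m
    for i in range(m):
        k = 0
        # Extend rightwards from position i and from the start of x.
        while i + k < m and x[i + k] == x[k]:
            k += 1
        pref[i] = k
    return pref
```

For x = "gcagagag", `suff_table` gives [1, 0, 0, 2, 0, 4, 0, 8] and `pref_table` gives [8, 0, 0, 1, 0, 1, 0, 1].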

slide-38
SLIDE 38

Right shift after a right mismatch

  • only a suffix v of x reoccurs preceded by a symbol b different from a;

Lemma. For 0 ≤ i ≤ m − 1, rsrm[m − 1 − suff[i]] ≤ m − 1 − i if m − 1 − i ≥ suff[i].

slide-39
SLIDE 39

Right shift after a right mismatch

  • only a prefix v′ of x matches a suffix of v.

Lemma. For 0 ≤ i ≤ m − 1, rsrm[i] ≤ m − suff[j] where j = max{k | 0 ≤ k < m − 1 − i and suff[k] = k + 1}.

slide-40
SLIDE 40

Right shift after a right mismatch

Lemma. rsrm can be computed in linear time.

The 3 other arrays can be computed similarly ⟹ preprocessing in linear time.

slide-41
SLIDE 41

Reference

  • J. Daykin, R. Groult, Y. Guesnet, T. Lecroq, A. Lefebvre, M. Léonard, L. Mouchard, É. Prieur-Gaston and B. Watson, Three Strategies for the Dead-Zone String Matching Algorithm. In: J. Holub and J. Žďárek, editors, Proceedings of the Prague Stringology Conference 2018 (PSC 2018), Prague, Czech Republic, 2018, 117–128.

slide-42
SLIDE 42

Perspectives

It remains to:

  • do an exact complexity analysis
  • run experiments
  • write a parallel implementation
  • explore deterministic sampling (Vishkin, 1990)
  • compute a w-matching machine (Didier, 2018)

slide-43
SLIDE 43

Special Issue: String Matching and Its Applications

an Open Access Journal by MDPI

mdpi.com/si/17457

Guest Editors:

  • Prof. Dr. Thierry Lecroq, LITIS EA 4108, Normastic FR3638, IRIB, University of Rouen Normandie, Normandie University, Rouen, France, Thierry.Lecroq@univ-rouen.fr
  • Prof. Dr. Simone Faro, Department of Mathematics and Computer Science, University of Catania, I-95125 Catania, Italy, faro@dmi.unict.it

Deadline for manuscript submissions: 28 February 2019

Message from the Guest Editors

Dear Colleagues,

With the rapid growth of available data in almost all fields, there is large demand for efficient pattern-matching algorithms. We invite you to submit your latest research in the area of string matching (single or multiple, on-line or off-line, exact or approximate, in uncompressed or compressed form) describing new data structures and/or new algorithms. High-quality papers are solicited to address both theoretical and practical issues of string matching including, but not restricted to, natural language processing, text mining, bioinformatics (DNA, RNA, protein sequences), chemoinformatics, intrusion detection, security, plagiarism detection, digital forensics, video retrieval, and music analysis.

Prof. Dr. Thierry Lecroq and Prof. Dr. Simone Faro, Guest Editors

slide-44
SLIDE 44

Thank you for your attention!
