Fast and Linear-Time String Matching Algorithms Based on the Distances of q-Gram Occurrences

Satoshi Kobayashi, Diptarama Hendrian, Ryo Yoshinaka, Ayumi Shinohara

Graduate School of Information Sciences, Tohoku University, Japan


slide-1
SLIDE 1

Fast and Linear-Time String Matching Algorithms Based on the Distances of q-Gram Occurrences

Satoshi Kobayashi, Diptarama Hendrian, Ryo Yoshinaka, Ayumi Shinohara

Graduate School of Information Sciences, Tohoku University, Japan

slide-2
SLIDE 2

String matching problem

Input: Text T, Pattern P

Output: All positions i in T such that T[i : i + |P| − 1] = P

Example

i : 1 2 3 4 5 6 7 8 9 10 11 12 13

T : a b a a b a b b a b b a b

P : a b b a

Output: 6, 9

Naive solution: O(nm), where n = |T| and m = |P|
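The naive O(nm) solution can be sketched as follows. This is a minimal illustration, not code from the presentation; the function name `naive_search` is hypothetical, and positions are reported 1-indexed to match the slide.

```python
def naive_search(text: str, pattern: str) -> list[int]:
    """Return all 1-indexed positions i with text[i : i + |P| - 1] == pattern."""
    n, m = len(text), len(pattern)
    positions = []
    for i in range(n - m + 1):            # every candidate start (0-indexed)
        if text[i:i + m] == pattern:      # up to m character comparisons
            positions.append(i + 1)       # convert to the slide's 1-indexing
    return positions


print(naive_search("abaababbabbab", "abba"))  # the slide's example: [6, 9]
```

Each of the n − m + 1 candidate positions costs up to m comparisons, which is where the O(nm) bound comes from.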

slide-5
SLIDE 5

String matching algorithms

  • Knuth-Morris-Pratt (KMP) algorithm [Knuth+, 1977]
  • Preprocessing time: O(m)
  • Searching time: O(n)
  • Boyer-Moore algorithm [Boyer & Moore, 1977]
  • Preprocessing time: O(m + σ)
  • Searching time: O(nm)
  • Runs fast in practice

n: text length, m: pattern length, σ: alphabet size

slide-6
SLIDE 6

Our contributions

  • Propose two string matching algorithms based on the distances of the q-gram occurrences
  • Both algorithms work in linear time in the input string size

The proposed algorithms are compared with 15 strong algorithms published from 1977 to 2019.

[Figure: Fastest algorithm map for each dataset (English text, Genome sequence, Fibonacci string); horizontal axis: pattern length m = 2, 4, 8, ..., 1024]

Algorithm                           Preprocess   Search
BNDMq [Navarro & Raffinot, 1998]    O(m+σ)       O(nm⌈m/ω⌉)
SBNDMq [Holub & Durian, 2005]       O(m+σ)       O(nm⌈m/ω⌉)
FJS [Franek+, 2005]                 O(m+σ)       O(n)
HASHq [Lecroq, 2007]                O(mq)        O(n(m+q))
BSDMq [Faro & Lecroq, 2012]         O(m)         O(nm)
WFRq [Cantone+, 2017]               O(m)         O(nm)
LWFRq [Cantone+, 2019]              O(m)         O(n)
DISTq (new)                         O(mq)        O(nq)
LDISTq (new)                        O(m)         O(n)
Naive solution                      -            O(nm)

n = |T|, m = |P|, ω: word length, σ = |Σ|: alphabet size, q: q-gram length

slide-7
SLIDE 7

Existing algorithms

slide-8
SLIDE 8

KMP algorithm [Knuth+, 1977]

Preprocessing time: O(m)

Searching time: O(n)

n: text length, m: pattern length

T : a b a b a b b b c a a c a c

P : a b a b c (shown at two alignments under T)

Legend: Match without comparison / Mismatch / Match
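The O(m) preprocessing and O(n) searching bounds can be illustrated with the textbook failure-function form of KMP. This is a minimal sketch under that assumption, not the exact variant used in the slides; the name `kmp_search` is hypothetical.

```python
def kmp_search(text: str, pattern: str) -> list[int]:
    """Return all 1-indexed occurrence positions of pattern in text."""
    m = len(pattern)
    if m == 0:
        return []
    # Preprocessing, O(m): fail[i] = length of the longest proper border
    # of pattern[0 : i + 1].
    fail = [0] * m
    k = 0
    for i in range(1, m):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    # Searching, O(n): the text index never moves backwards.
    positions = []
    k = 0                                   # number of pattern characters matched
    for i, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = fail[k - 1]                 # reuse the already matched prefix
        if ch == pattern[k]:
            k += 1
        if k == m:
            positions.append(i - m + 2)     # 1-indexed start of the occurrence
            k = fail[k - 1]
    return positions


print(kmp_search("abababbbcaacac", "ababc"))
```

The prefix that is re-used after a mismatch corresponds to the "Match without comparison" cells in the slide's legend.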

slide-23
SLIDE 23

KMP algorithm [Knuth+, 1977]

Preprocessing time: O(m)

Searching time: O(n)

n: text length, m: pattern length

Strong_Bord(j)

Input: A mismatch position j in the pattern

Output: The maximum value k (0 ≤ k < j) that satisfies P[1 : k] = P[j − k : j − 1] and P[k + 1] ≠ P[j] (−1 if no such k exists)

KMP_Shift[j] = j − Strong_Bord(j) − 1

KMP_Shift[j] is the shift amount when a mismatch occurs at the j-th position of the pattern.

j         : 1 2 3 4 5 6
P         : a b a b c
KMP_Shift : 1 1 3 3 2 5

T : a b a b a b b b c a a c a c

P : a b a b c (shown at two alignments under T)

KMP_Shift[5] = 2

Legend: Match without comparison / Mismatch / Match
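The KMP_Shift row can be reproduced directly from the Strong_Bord definition. The sketch below is a brute-force transcription for checking the table, not the O(m) preprocessing itself. The names `strong_bord` and `kmp_shift` are hypothetical, and the handling of j = m + 1 (using the longest proper border, since P[m + 1] does not exist) is an assumption.

```python
def strong_bord(p: str, j: int) -> int:
    """Strong_Bord(j) for a 1-indexed mismatch position j, 1 <= j <= m + 1."""
    m = len(p)
    if j == m + 1:
        # Assumption: after a full match there is no P[j] to differ from,
        # so take the longest proper border of the whole pattern.
        return next(k for k in range(m - 1, -1, -1) if p[:k] == p[m - k:])
    for k in range(j - 1, -1, -1):          # try the largest k first
        # Slide condition (1-indexed): P[1:k] = P[j-k : j-1] and P[k+1] != P[j]
        if p[:k] == p[j - 1 - k:j - 1] and p[k] != p[j - 1]:
            return k
    return -1                                # no such k exists


def kmp_shift(p: str) -> list[int]:
    """KMP_Shift[j] = j - Strong_Bord(j) - 1 for j = 1 .. m + 1."""
    return [j - strong_bord(p, j) - 1 for j in range(1, len(p) + 2)]


print(kmp_shift("ababc"))  # [1, 1, 3, 3, 2, 5], matching the slide's table
```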

slide-25
SLIDE 25

HASHq algorithm [Lecroq, 2007]

Preprocessing time: O(mq)

Searching time: O(n(m + q))

n: text length, m: pattern length, σ: alphabet size

  • Determines the equivalence of q-grams using the hash value of q-grams

$$\mathrm{shift}[h(x)] = m - \max\bigl(\{\, j \mid h(P[j-q+1 : j]) = h(x),\ q \le j \le m \,\} \cup \{q-1\}\bigr)$$

$$h(x) = \bigl(2^{q-1} \cdot x[1] + 2^{q-2} \cdot x[2] + \cdots + 2 \cdot x[q-1] + x[q]\bigr) \bmod 2^{8}$$

x: a string of length q (characters are treated as their ASCII codes)

Example: P = a b a a b b a b (m = 8, q = 3). The h(x) column lists the values before the reduction mod 2^8.

x        h(x)   shift[h(x)]
aba      681    5
baa      683    4
aab      680    3
abb      682    2
bba      685    1
bab      684    0
Others   -      6 (= m − q + 1)

T : a a a b a b a a b b a b a b b

P : a b a a b b a b (shown at two alignments under T)

shift[h(baa)] = 4

Legend: Match without comparison / Mismatch / Match
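The shift table of HASHq can be rebuilt from the two formulas. The following is a minimal sketch for checking the example values, assuming a plain array of 2^8 entries; the helper names `poly`, `h` and `build_shift` are hypothetical and not taken from the presentation.

```python
def poly(x: str, base: int = 2) -> int:
    """Unreduced polynomial value: base^(q-1)*x[1] + ... + x[q] (ASCII codes)."""
    value = 0
    for ch in x:
        value = value * base + ord(ch)      # Horner evaluation
    return value


def h(x: str, base: int = 2, bits: int = 8) -> int:
    """Hash of a q-gram: the polynomial value reduced mod 2^bits."""
    return poly(x, base) % (1 << bits)


def build_shift(p: str, q: int) -> list[int]:
    """shift[h(x)] = m - (largest end position j of a q-gram hashing to h(x))."""
    m = len(p)
    shift = [m - q + 1] * (1 << 8)          # default for hashes not in P
    for j in range(q, m + 1):               # j = 1-indexed end of the q-gram
        shift[h(p[j - q:j])] = m - j        # later j overwrites, keeping max j
    return shift


shift = build_shift("abaabbab", 3)
for gram in ("aba", "baa", "aab", "abb", "bba", "bab"):
    print(gram, poly(gram), shift[h(gram)])
```

Because two different q-grams may share a hash value, a hash hit only suggests a candidate alignment; the characters still have to be compared.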

slide-33
SLIDE 33

Proposed 1: DISTq algorithm

slide-34
SLIDE 34

Idea of DISTq algorithm

(HASHq shift) + q-gram distance array + KMP algorithm: linear time and practically fast (proposed)

  • q-gram distance array dist
  • A hash value is used to determine the equivalence of q-grams

For q ≤ i ≤ m:

$$\mathrm{dist}[i] = \min\bigl(\{\, j \mid h(P[i-j-q+1 : i-j]) = h(P[i-q+1 : i]),\ 1 \le j \le i-q \,\} \cup \{i-q+1\}\bigr)$$

$$h(x) = \bigl(4^{q-1} \cdot x[1] + 4^{q-2} \cdot x[2] + \cdots + 4 \cdot x[q-1] + x[q]\bigr) \bmod 2^{16}$$

x: a string of length q

Example (q = 3):

i    : 1 2 3 4 5 6 7 8 9
P    : a b a a b b a a a
dist : - - 1 2 3 4 5 4 7

When q = 3, dist[8] = 4.

T : b b a b a b b a a b b a b a a

P : a b a a b b a a a (shown at two alignments under T)
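The q-gram distance array can be computed in one left-to-right pass by remembering where each hash value was last seen. This is a minimal sketch, assuming that valid distances j satisfy 1 ≤ j ≤ i − q (so the earlier q-gram lies inside the pattern) and that entries for i < q are undefined. The names `h` and `build_dist` are hypothetical.

```python
def h(x: str, bits: int = 16) -> int:
    """Base-4 polynomial hash of a q-gram, reduced mod 2^bits (ASCII codes)."""
    value = 0
    for ch in x:
        value = (value * 4 + ord(ch)) % (1 << bits)
    return value


def build_dist(p: str, q: int) -> list:
    """dist[i] for i = 1 .. m (returned 0-indexed); None where i < q."""
    m = len(p)
    dist = [None] * (m + 1)                 # index 0 is unused
    last_end = {}                           # hash value -> latest end position
    for i in range(q, m + 1):               # i = 1-indexed end of the q-gram
        key = h(p[i - q:i])
        if key in last_end:
            dist[i] = i - last_end[key]     # nearest earlier q-gram, same hash
        else:
            dist[i] = i - q + 1             # default: no earlier occurrence
        last_end[key] = i
    return dist[1:]


print(build_dist("abaabbaaa", 3))  # [None, None, 1, 2, 3, 4, 5, 4, 7]
```

The entry dist[8] = 4 arises because the q-gram "baa" ending at position 8 previously ended at position 4.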

slide-38
SLIDE 38

Array HQ_Shift (almost the same as shift of HASHq)

  • Also determines the equivalence of q-grams using the hash value of q-grams
  • Hash function: h(x), defined below

$$\mathrm{HQ\_Shift}[h(x)] = m - \max\bigl(\{\, j \mid h(P[j-q+1 : j]) = h(x),\ q \le j \le m \,\} \cup \{q-1\}\bigr)$$

$$h(x) = \bigl(4^{q-1} \cdot x[1] + 4^{q-2} \cdot x[2] + \cdots + 4 \cdot x[q-1] + x[q]\bigr) \bmod 2^{16}$$

Example: P = a b a a b b a b (m = 8, q = 3)

x        h(x)   HQ_Shift[h(x)]
aba      2041   5
baa      2053   4
aab      2038   3
abb      2042   2
bba      2057   1
bab      2054   0
Others   -      6 (= m − q + 1)

Use this shift to align the q-gram in the pattern with the q-gram in the text that has the same hash value.

T : b b a a a b a b a a b a b b a

P : a b a a b b a b (shown at two alignments under T)

HQ_Shift[h(baa)] = 4
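The HQ_Shift values in the table can be checked with a few lines of code. This sketch stores the table in a dictionary rather than an array of 2^16 entries, which is a simplification chosen here for readability; the names `h4`, `build_hq_shift` and `hq_shift` are hypothetical.

```python
def h4(x: str) -> int:
    """Base-4 polynomial hash of a q-gram, reduced mod 2^16 (ASCII codes)."""
    value = 0
    for ch in x:
        value = (value * 4 + ord(ch)) % (1 << 16)
    return value


def build_hq_shift(p: str, q: int) -> dict[int, int]:
    """Map each hash value occurring in P to m - (largest end position j)."""
    m = len(p)
    table = {}
    for j in range(q, m + 1):               # j = 1-indexed end of the q-gram
        table[h4(p[j - q:j])] = m - j       # later j overwrites, keeping max j
    return table


def hq_shift(table: dict[int, int], p: str, q: int, gram: str) -> int:
    """Look up the shift for a text q-gram; unseen hashes shift by m - q + 1."""
    return table.get(h4(gram), len(p) - q + 1)


P, Q = "abaabbab", 3
table = build_hq_shift(P, Q)
for gram in ("aba", "baa", "aab", "abb", "bba", "bab", "aaa"):
    print(gram, h4(gram), hq_shift(table, P, Q, gram))
```

A shift of 0 means the last q-gram of the current window already has the same hash as the last q-gram of the pattern.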

slide-42
SLIDE 42

Searching

Example: P = a b a a b b a a a (m = 9, q = 3)

x        h(x)   HQ_Shift[h(x)]
aba      2041   6
baa      2053   1
aab      2038   4
abb      2042   3
bba      2057   2
aaa      2037   0
Others   -      7 (= m − q + 1)

j         : 1 2 3 4 5 6 7 8 9 10
P         : a b a a b b a a a
dist      : - - 1 2 3 4 5 4 7 -
KMP_Shift : 1 1 3 2 4 3 7 6 7 8

T : a b b a a b b a a b a b b a b b a a a b a a b a a b b a a a

P is shown at successive alignments under T.

Legend: Match without comparison / Mismatch / Match

slide-43
SLIDE 43

Searching: Alignment-Phase

Align "baa" by shifting with the HQ_Shift array until P[1] matches the corresponding text character.

slide-47
SLIDE 47

Searching: Alignment-Phase

Align "baa" by shifting with the HQ_Shift array until P[1] matches the corresponding text character.

If the first letter does not match, shift the pattern using the distance array dist.

slide-51
SLIDE 51

Searching: Alignment-Phase

Align "bba" by shifting with the HQ_Shift array until P[1] matches the corresponding text character.

slide-55
SLIDE 55

Searching: Comparison-Phase

If the first letter matches, compare P[2 : m] with the text from left to right.

slide-58
SLIDE 58

Searching: Comparison-Phase

Select the shift for which the resumption position of the character comparison goes further to the right.

KMP_Shift[2] = 1, dist[7] = 5

slide-62
SLIDE 62

Searching: Alignment-Phase

Align "aba" by shifting with the HQ_Shift array until P[1] matches the corresponding text character.

slide-67
SLIDE 67

/ 20

x h(x)

HQ_Shift[h(x)]

aba

2041 6

baa

2053 1

aab

2038 4

abb

2042 3

bba

2054 2

aaa

2037

Others

  • 7

Searching

11

T : a b b a a b b a a b a b b a b b a a a b a a b a a b b a a a P : a b a a b b a a a a a b a a b b a a a a a b a a b b a a a a a b a a b b a a a a a b a a b b a a a a a b a a b b a a a a a b a a b b a a a a

j 1 2 3 4 5 6 7 8 9 10 P

a b a a b b a a a

dist

  • 1

2 3 4 5 4 7

  • KMP_Shift

1 1 3 2 4 3 7 6 7 8

Match without comparison Mismatch Match

slide-68
SLIDE 68

/ 20

x h(x)

HQ_Shift[h(x)]

aba

2041 6

baa

2053 1

aab

2038 4

abb

2042 3

bba

2054 2

aaa

2037

Others

  • 7

Searching

11

T : a b b a a b b a a b a b b a b b a a a b a a b a a b b a a a P : a b a a b b a a a a a b a a b b a a a a a b a a b b a a a a a b a a b b a a a a a b a a b b a a a a a b a a b b a a a a a b a a b b a a a a

j 1 2 3 4 5 6 7 8 9 10 P

a b a a b b a a a

dist

  • 1

2 3 4 5 4 7

  • KMP_Shift

1 1 3 2 4 3 7 6 7 8

Match without comparison Mismatch Match

slide-69
SLIDE 69

/ 20

x h(x)

HQ_Shift[h(x)]

aba

2041 6

baa

2053 1

aab

2038 4

abb

2042 3

bba

2054 2

aaa

2037

Others

  • 7

Searching

11

T : a b b a a b b a a b a b b a b b a a a b a a b a a b b a a a P : a b a a b b a a a a a b a a b b a a a a a b a a b b a a a a a b a a b b a a a a a b a a b b a a a a a b a a b b a a a a a b a a b b a a a a

j 1 2 3 4 5 6 7 8 9 10 P

a b a a b b a a a

dist

  • 1

2 3 4 5 4 7

  • KMP_Shift

1 1 3 2 4 3 7 6 7 8

Match without comparison Mismatch Match

slide-70
SLIDE 70

/ 20

x h(x)

HQ_Shift[h(x)]

aba

2041 6

baa

2053 1

aab

2038 4

abb

2042 3

bba

2054 2

aaa

2037

Others

  • 7

Searching

11

T : a b b a a b b a a b a b b a b b a a a b a a b a a b b a a a P : a b a a b b a a a a a b a a b b a a a a a b a a b b a a a a a b a a b b a a a a a b a a b b a a a a a b a a b b a a a a a b a a b b a a a a

j 1 2 3 4 5 6 7 8 9 10 P

a b a a b b a a a

dist

  • 1

2 3 4 5 4 7

  • KMP_Shift

1 1 3 2 4 3 7 6 7 8

Match without comparison Mismatch Match

slide-71
SLIDE 71

/ 20

x h(x)

HQ_Shift[h(x)]

aba

2041 6

baa

2053 1

aab

2038 4

abb

2042 3

bba

2054 2

aaa

2037

Others

  • 7

Searching

11

T : a b b a a b b a a b a b b a b b a a a b a a b a a b b a a a P : a b a a b b a a a a a b a a b b a a a a a b a a b b a a a a a b a a b b a a a a a b a a b b a a a a a b a a b b a a a a a b a a b b a a a a

j 1 2 3 4 5 6 7 8 9 10 P

a b a a b b a a a

dist

  • 1

2 3 4 5 4 7

  • KMP_Shift

1 1 3 2 4 3 7 6 7 8

Match without comparison Mismatch Match

slide-72
SLIDE 72

Searching

Comparison-Phase: select the shift whose resumption position for character comparison lies further to the right

KMP_Shift[6] = 3, dist[3] = 1
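The two values compared here, KMP_Shift[6] = 3 and dist[3] = 1, come from arrays built in preprocessing. The following is a minimal sketch, assuming dist[j] is the distance from the q-gram ending at position j back to its previous occurrence in P (or j − q + 1 when there is none), and assuming KMP_Shift is derived from the strong failure function, with entry m + 1 giving the shift after a full match. Both definitions are inferred from the table values, and the function names are illustrative.

```python
def build_dist(pattern, q=3):
    """dist[j] for 1-based j in q..m; None where no q-gram ends at j."""
    m = len(pattern)
    dist = [None] * (m + 1)                 # index 0 is unused
    last_end = {}                           # q-gram -> most recent end position
    for j in range(q, m + 1):
        gram = pattern[j - q:j]
        dist[j] = j - last_end[gram] if gram in last_end else j - q + 1
        last_end[gram] = j
    return dist


def build_kmp_shift(pattern):
    """shift[j] for a mismatch at 1-based position j (j = m + 1: full match)."""
    m = len(pattern)
    border = [0] * (m + 1)                  # border[k]: longest proper border of pattern[:k]
    border[0] = -1
    k = -1
    for i in range(1, m + 1):
        while k >= 0 and pattern[k] != pattern[i - 1]:
            k = border[k]
        k += 1
        border[i] = k
    shift = [None] * (m + 2)                # index 0 is unused
    for j in range(1, m + 2):
        t = border[j - 1]                   # the matched prefix has length j - 1
        # Skip borders that would repeat the same mismatching character.
        while j <= m and t >= 0 and pattern[t] == pattern[j - 1]:
            t = border[t]
        shift[j] = (j - 1) - t
    return shift


P = "abaabbaaa"
print(build_dist(P)[3:])        # entries for j = 3..9
print(build_kmp_shift(P)[1:])   # entries for j = 1..10
```

This sketch keys dist on the q-gram string itself rather than on its hash, which keeps the example short; how the two shifts are combined into a resumption position is not covered here.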

slide-82
SLIDE 82

Time complexity of DISTq algorithm

  • Preprocessing : O(mq)
    • Array KMP_Shift : O(m)
    • Array HQ_Shift : O(mq)
      • Calculate hash value that takes O(q) time at m − q + 1 positions
    • Array dist : O(mq)
  • Searching : O(nq)
    • Number of character comparisons : O(n)
    • Calculate hash value at maximum n − m + 1 positions : O(nq)

T : a a a a a a a a a
P : b a a a a a
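Summing the items on this slide gives the O(mq) and O(nq) totals. The worked bound below only restates the slide's own counts; the grouping of terms is an illustration, not taken from the paper.

```latex
\begin{align*}
\text{Preprocessing} &:\;
  \underbrace{O(m)}_{\text{KMP\_Shift}}
  + \underbrace{(m-q+1)\cdot O(q)}_{\text{HQ\_Shift}}
  + \underbrace{O(mq)}_{\text{dist}}
  = O(mq),\\
\text{Searching} &:\;
  \underbrace{O(n)}_{\text{character comparisons}}
  + \underbrace{(n-m+1)\cdot O(q)}_{\text{hash computations}}
  = O(nq).
\end{align*}
```

In the example on this slide, n = 9 and m = 6, so there are at most n − m + 1 = 4 window positions at which a hash value is computed.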

slide-83
SLIDE 83

Proposed 2: Linear-DISTq (LDISTq) algorithm

slide-84
SLIDE 84

LDISTq algorithm

  • Worst-case time complexity of the search phase of DISTq algorithm is Θ(nq)
  • Since the hash function h used in DISTq algorithm is a rolling hash, when h(T[i : j]) has already been obtained, the hash value h(T[i + 1 : j + 1]) can be computed in O(1) time (1 ≤ i ≤ j < |T|)

h(x[1 : 3]) = (16 ⋅ x[1] + 4 ⋅ x[2] + x[3]) mod 2^16

h(T[i + 1 : j + 1]) = (4 ⋅ (h(T[i : j]) − 16 ⋅ T[i]) + T[j + 1]) mod 2^16

  • Searching time can be reduced to O(n)
    • by incrementally calculating the hash value of the q-gram using the previously calculated value of the other q-gram
  • Preprocessing time is also reduced to O(m)

T : a a a a a a a a a
P : b a a a a a
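The rolling update can be checked against the direct hash with a small sketch. It assumes the slide's q = 3 formula generalizes to weights 4^(q−k), so the outgoing character carries weight 4^(q−1) (16 for q = 3); the function names are illustrative.

```python
MOD = 1 << 16


def qgram_hash(s, q):
    """Direct hash: sum of 4^(q-k) * code(s[k]) for k = 1..q, modulo 2^16."""
    h = 0
    for ch in s[:q]:
        h = (4 * h + ord(ch)) % MOD
    return h


def rolling_hashes(text, q):
    """Yield the hash of every q-gram of text with an O(1) update per step."""
    if len(text) < q:
        return
    top = pow(4, q - 1, MOD)                # weight of the outgoing character
    h = qgram_hash(text, q)
    yield h
    for i in range(len(text) - q):
        # Remove text[i], scale the rest by 4, then append text[i + q].
        h = (4 * (h - top * ord(text[i])) + ord(text[i + q])) % MOD
        yield h


T = "abbaabbaababbabbaaabaabaabbaaa"
direct = [qgram_hash(T[i:i + 3], 3) for i in range(len(T) - 2)]
print(list(rolling_hashes(T, 3)) == direct)
```

Each update costs a constant number of arithmetic operations regardless of q, which is what removes the factor q from the search bound.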

slide-85
SLIDE 85

Experiments

  • Implemented in C
  • Compiled with GCC 9.2.0
  • MacBook Pro (13-inch, 2018), macOS Catalina, Intel Core i7 2.7 GHz quad core, 16 GB memory

Datasets

  • English text
  • Genome sequence
  • Fibonacci string
  • Texts with frequent pattern occurrences
slide-86
SLIDE 86

English text

  • Use the King James version of the Bible as text
  • Patterns are randomly extracted from text

n = 4017009, σ = 62

Text sample: In the beginning,

The value of q or w giving the best performance is shown in round brackets

Unit : millisecond

algorithms that run in linear time in the input string size are marked with

n = |T|, m = |P|, σ = |Σ| : Alphabet size

slide-87
SLIDE 87

Genome sequence

  • Genome sequence of E. coli
  • Patterns are randomly extracted from text

n = 4641652, σ = 4

Text sample: ATCGGTAGAGTAGATAG

The value of q or w giving the best performance is shown in round brackets

Unit : millisecond

algorithms that run in linear time in the input string size are marked with

n = |T|, m = |P|, σ = |Σ| : Alphabet size

slide-88
SLIDE 88

Fibonacci string

  • Definition: Fib1 = b, Fib2 = a, Fibn = Fibn−1 ⋅ Fibn−2 for n > 2
  • Use Fib32 as text
  • Patterns are randomly extracted from text

n = 2178309, σ = 2

Text sample: abaababaabaababaababaab

The value of q or w giving the best performance is shown in round brackets

Unit : millisecond

algorithms that run in linear time in the input string size are marked with

n = |T|, m = |P|, σ = |Σ| : Alphabet size
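The Fibonacci text can be generated directly from the recurrence. This is a minimal sketch assuming the base cases Fib1 = b and Fib2 = a, which is the reading consistent with the sample string starting with "abaab"; the function name is illustrative.

```python
def fib_string(n):
    """Return Fib_n with Fib_1 = 'b', Fib_2 = 'a', Fib_n = Fib_{n-1} + Fib_{n-2}."""
    if n < 1:
        raise ValueError("n must be >= 1")
    prev, cur = "b", "a"                    # Fib_1 and Fib_2
    if n == 1:
        return prev
    for _ in range(n - 2):
        prev, cur = cur, cur + prev         # step from (Fib_{k-1}, Fib_k) to (Fib_k, Fib_{k+1})
    return cur


print(fib_string(5))
print(len(fib_string(32)))
```

The length of Fib32 produced this way equals the n = 2178309 stated above, since the lengths follow the Fibonacci numbers.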

slide-89
SLIDE 89

In experiment of Fibonacci string

  • There are many repeating structures
  • The pattern is extracted from the text, so the number of pattern occurrences is very large

Fib10 = abaababaabaababaababaabaababaabaababaababaabaababaababa
P = baaba

Fib10 = abaababaabaababaababaabaababaabaababaababaabaababaababa
P = aababaab

Hypothesis: the efficiency of the proposed algorithms does not decrease when the number of pattern occurrences is large
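How often a pattern taken from a Fibonacci string recurs can be checked with a few lines. The sketch below regenerates Fib10 (assuming Fib1 = b and Fib2 = a) and counts possibly overlapping occurrences of the two example patterns; the helper names are illustrative.

```python
def fib_string(n):
    """Fib_1 = 'b', Fib_2 = 'a', Fib_n = Fib_{n-1} + Fib_{n-2} (n >= 1)."""
    prev, cur = "b", "a"
    if n == 1:
        return prev
    for _ in range(n - 2):
        prev, cur = cur, cur + prev
    return cur


def count_occurrences(text, pattern):
    """Count occurrences of pattern in text, allowing overlaps."""
    count = 0
    pos = text.find(pattern)
    while pos != -1:
        count += 1
        pos = text.find(pattern, pos + 1)   # restart one position to the right
    return count


fib10 = fib_string(10)
for p in ("baaba", "aababaab"):
    print(p, count_occurrences(fib10, p))
```

Running the same count on Fib32 with patterns of different lengths gives a rough picture of how dense the occurrences are in this dataset.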

slide-90
SLIDE 90

Texts with frequent pattern occurrences

  • Generated by intentionally embedding a lot of randomly generated patterns without overlapping
  • Fixed pattern length: m = 8, σ = 4

n = 4000000

occ : Pattern occurrences

n = |T|, m = |P|, σ = |Σ| : Alphabet size

The value of q or w giving the best performance is shown in round brackets

Unit : millisecond

algorithms that run in linear time in the input string size are marked with
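One way such a text could be generated is sketched below. The procedure is hypothetical, since the slides do not describe the generator: it draws a random pattern, fills the text with random characters, splits it into disjoint blocks of length m, and overwrites occ of those blocks with the pattern so the embedded copies cannot overlap. All names and the seeding are assumptions.

```python
import random


def make_text(n, m, sigma, occ, seed=0):
    """Return (text, pattern) with at least occ non-overlapping pattern copies."""
    if occ > n // m:
        raise ValueError("too many occurrences for the text length")
    rng = random.Random(seed)
    alphabet = [chr(ord("a") + i) for i in range(sigma)]
    pattern = "".join(rng.choice(alphabet) for _ in range(m))
    text = [rng.choice(alphabet) for _ in range(n)]
    # Choose occ distinct blocks of length m and write the pattern into each.
    for block in rng.sample(range(n // m), occ):
        text[block * m:(block + 1) * m] = pattern
    return "".join(text), pattern


text, pattern = make_text(n=4000, m=8, sigma=4, occ=100, seed=1)
print(len(text), pattern, text.count(pattern))
```

The usage line scales n down from the 4000000 used in the experiment to keep the example quick. Random background characters can create extra occurrences by chance, so occ is a lower bound on the number of matches.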

slide-91
SLIDE 91

Conclusion

  • Proposed two string matching algorithms based on the distances of the q-gram occurrences
  • Both algorithms run in linear time in the input string size

Figure: Fastest algorithm map for each dataset (English text, Genome sequence, Fibonacci string) over pattern length m = 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024

Comparing 15 powerful algorithms announced from 1977 to 2019 with the proposed algorithms

Algorithm                           Preprocess   Search
BNDMq [Navarro & Raffinot, 1998]    O(m+σ)       O(nm⌈m/ω⌉)
SBNDMq [Holub & Durian, 2005]       O(m+σ)       O(nm⌈m/ω⌉)
FJS [Franek+, 2005]                 O(m+σ)       O(n)
HASHq [Lecroq, 2007]                O(mq)        O(n(m+q))
BSDMq [Faro & Lecroq, 2012]         O(m)         O(nm)
WFRq [Cantone+, 2017]               O(m)         O(nm)
LWFRq [Cantone+, 2019]              O(m)         O(n)
DISTq (New)                         O(mq)        O(nq)
LDISTq (New)                        O(m)         O(n)

Naive solution: O(nm)

n = |T|, m = |P|, ω : word length, σ = |Σ| : alphabet size, q : q-gram