SLIDE 1

Practical and Optimal String Matching

Kimmo Fredriksson
Department of Computer Science, University of Joensuu, Finland

Szymon Grabowski
Technical University of Łódź, Computer Engineering Department

SPIRE’05 – p.1/25

SLIDE 2

Problem Setting

  • The classic string matching problem: given a text $T[1..n]$ and a pattern $P[1..m]$ over some finite alphabet $\Sigma$ of size $\sigma$, find the occurrences of $P$ in $T$.

  • We focus on the case where $m$ is relatively small ⇒ bit-parallelism.

SLIDE 3

Previous work

A vast number of algorithms exist. Some of the most well-known are:

(classics:)

Knuth-Morris-Pratt: The first $O(n)$ worst case time algorithm.

Boyer-Moore(-Horspool) family: Numerous variants, sublinear on average.

(bit-parallel:)

Shift-or: $O(n)$ for $m \le w$, where $w$ is the number of bits in a computer word (Baeza-Yates & Gonnet, 1992).

BNDM family: $O(n \log_\sigma(m)/m)$ on average for $m \le w$. SBNDM (Navarro, 2001; Peltola & Tarhio, 2003), LNDM (He & Fang, 2004), FNDM (Holub & Durian, 2005).

SLIDE 4

Previous work

  • In practice, the best current algorithms for short patterns are the BNDM family of algorithms (Navarro & Raffinot, 2000).

SLIDE 5

This work

  • We develop a novel pattern partitioning technique that allows us to use shift-or while skipping text characters.

  • The algorithm has optimal $O(n \log_\sigma(m)/m)$ average case running time if $m \le w$.

  • Very simple to implement, simple inner loop (comparable to plain shift-or) ⇒ very efficient in practice.

  • $O(mn)$ worst case, but can be improved to $O(n)$ without destroying the simplicity of the search algorithm.

SLIDE 6

Our algorithm: the idea

The algorithm is based on the preprocessing / filtering / verification paradigm.

  • The preprocessing phase generates $q$ different alignments of the pattern, each containing only every $q$th pattern character. I.e. we partition the pattern into $q$ pieces.
  • The filtering phase searches all the $q$ pieces in parallel using the shift-or algorithm, reading only every $q$th text character.
  • If any of the $q$ pieces match, then we invoke a verification algorithm.

SLIDE 7

Preprocessing

  • Given a pattern $P$, generate a set $\mathcal{P} = \{P^0, \ldots, P^{q-1}\}$ of $q$ patterns as follows (pattern positions counted from 0 here):

      $P^j[i] = P[j + iq], \quad j = 0 \ldots q-1, \quad i = 0 \ldots \lfloor m/q \rfloor - 1.$

  • I.e. we generate $q$ different alignments of the original pattern $P$, each alignment containing only every $q$th character.
  • Each new pattern has length $m' = \lfloor m/q \rfloor$.
  • The total length of the patterns is $q \lfloor m/q \rfloor \le m$.
  • For example, if $P = \texttt{abcdef}$ and $q = 3$, then $P^0 = \texttt{ad}$, $P^1 = \texttt{be}$ and $P^2 = \texttt{cf}$.
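The partitioning step can be sketched in a few lines of Python. This is only an illustration under the assumption of 0-based string indices; the function name `partition_pattern` is hypothetical and does not come from the slides.

```python
def partition_pattern(pattern, q):
    """Return the q pieces P^0 .. P^(q-1), where P^j[i] = P[j + i*q]."""
    piece_len = len(pattern) // q  # m' = floor(m / q)
    # Slicing with step q picks characters j, j+q, j+2q, ... and the stop
    # bound limits every piece to exactly piece_len characters.
    return [pattern[j : j + piece_len * q : q] for j in range(q)]


if __name__ == "__main__":
    print(partition_pattern("abcdef", 3))
```

With the pattern and $q$ of the example above, the call yields the three pieces ad, be and cf. When $q$ does not divide $m$, the trailing $m - q\lfloor m/q \rfloor$ pattern characters simply take no part in the filter.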

SLIDE 8

Preprocessing: the rationale

Assume that $P$ occurs at $T[i..i+m-1]$.

  ⇒ $P^j[h] = T[i + j + hq], \quad j = i \bmod q, \quad h = 0 \ldots m'-1.$

(1) We can use the set $\mathcal{P}$ as a filter for the pattern $P$.

(2) The filter needs to scan only every $q$th character of $T$.

SLIDE 9

Preprocessing: the rationale

[Figure: the pattern P = abcdef aligned against a text T, together with its pieces ad, be and cf and their concatenation adbecf.]

SLIDE 10

Prelude to filtering: Shift-or algorithm

  • The algorithm is based on a non-deterministic automaton. The automaton for $P = \texttt{abcdef}$ is:

[Figure: a chain of states 1 to 7 with transitions labelled a, b, c, d, e, f between consecutive states, and a self-loop labelled Σ on the initial state.]

  • The transitions are encoded in a table $B$ of bit-masks: for $1 \le i \le m$, the mask $B[c]$ has the $i$th bit set to 0 iff $P[i] = c$.

  • The bit-vector $D$ has one bit per state in the automaton; the $i$th bit of the vector is set to 0 iff the state $i$ is active (initially all bits are 1).

  • It can be shown that the automaton can be simulated as:

      D ← (D << 1) | B[T[i]]

SLIDE 11

Prelude to filtering: Shift-or algorithm

  • If after the simulation step the $m$th bit of $D$ is zero, then $P$ occurs at $T[i-m+1 \ldots i]$. This can be detected as (D & mm) ≠ mm, where mm has only the $m$th bit set.

  • Clearly each step of the automaton is simulated in time $O(\lceil m/w \rceil)$, which leads to $O(n \lceil m/w \rceil)$ total time.
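As a rough illustration of the simulation described above, here is a minimal Python sketch of plain shift-or. It assumes 0-based indices (bit i of a mask corresponds to pattern position i), uses Python's unbounded integers in place of a w-bit machine word, and the name `shift_or_search` is hypothetical.

```python
def shift_or_search(text, pattern):
    """Return the 0-based start positions of pattern in text (shift-or)."""
    m = len(pattern)
    all_ones = (1 << m) - 1

    # B[c] has bit i set to 0 iff pattern[i] == c; all other bits are 1.
    B = {}
    for i, c in enumerate(pattern):
        B[c] = B.get(c, all_ones) & ~(1 << i)

    mm = 1 << (m - 1)      # only the highest of the m bits is set
    D = all_ones           # a 0 bit marks an active state; none is active yet
    matches = []
    for i, c in enumerate(text):
        # One automaton step; characters absent from the pattern get all ones.
        D = ((D << 1) | B.get(c, all_ones)) & all_ones
        if (D & mm) != mm:  # final state active: occurrence ends at i
            matches.append(i - m + 1)
    return matches


if __name__ == "__main__":
    print(shift_or_search("xxabcdefxx", "abcdef"))
```

The masking with `all_ones` plays the role of the fixed word width; in a C implementation the bits shifted past the word boundary would simply fall off.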

SLIDE 12

Filtering

  • The whole set $\mathcal{P}$ of patterns can be searched simultaneously using the Shift-or algorithm (Baeza-Yates & Gonnet, 1992).

  • All the patterns are preprocessed together, as if they were concatenated: for $P = \texttt{abcdef}$, we effectively preprocess a pattern $P' = P^0 P^1 P^2 = \texttt{adbecf}$.

  • If the pattern $P^j$ matches, then the $(j+1)m'$th bit in $D$ is zero. This can be detected as (D & mm) ≠ mm, where mm has every $(j+1)m'$th bit set to 1.

SLIDE 13

Filtering: the simplicity illustrated

  • Plain shift-or search:

      D ← ~0; i ← 0
      while i < n do
          D ← (D << 1) | B[T[i]]
          if (D & mm) ≠ mm then report match
          i ← i + 1

  • Our shift-or search:

      D ← ~0; i ← 0
      while i < n do
          D ← ((D & ~mm) << 1) | B[T[i]]
          if (D & mm) ≠ mm then Verify
          i ← i + q
SLIDE 14

Verification

  • If any of the pattern pieces in $\mathcal{P}$ match, we verify if the original pattern $P$ matches (with the corresponding alignment).

  • Can be done by a brute force algorithm, with $O(m)$ worst case cost.
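Putting preprocessing, filtering and verification together, the whole search might look roughly like the Python sketch below. It is not the authors' C implementation: it assumes 0-based indices, uses Python integers instead of a machine word, verifies by brute force with `str.startswith`, and all names (`aoso_search`, `mp`, `all_ones`) are illustrative.

```python
def aoso_search(text, pattern, q):
    """Filter with q interleaved pattern pieces, then verify candidates."""
    n, m = len(text), len(pattern)
    assert 1 <= q <= m
    mp = m // q                         # piece length m' = floor(m / q)
    all_ones = (1 << (q * mp)) - 1      # one bit per character of P'

    # Masks for P' = P^0 P^1 ... P^(q-1): bit j*mp + h is 0 iff P^j[h] == c.
    B = {}
    for j in range(q):
        for h in range(mp):
            c = pattern[j + h * q]
            B[c] = B.get(c, all_ones) & ~(1 << (j * mp + h))

    # mm marks the last bit of every piece.
    mm = sum(1 << ((j + 1) * mp - 1) for j in range(q))

    matches = []
    D = all_ones
    for i in range(0, n, q):            # read only every q-th text character
        # Clearing the mm bits before shifting keeps one piece's last bit
        # from leaking into the first bit of the next piece.
        D = (((D & ~mm) << 1) | B.get(text[i], all_ones)) & all_ones
        if (D & mm) != mm:              # at least one piece matched
            for j in range(q):
                if (D & (1 << ((j + 1) * mp - 1))) == 0:
                    # Piece j ends at text position i, so P would start here.
                    start = i - (mp - 1) * q - j
                    if 0 <= start <= n - m and text.startswith(pattern, start):
                        matches.append(start)
    return sorted(matches)


if __name__ == "__main__":
    print(aoso_search("xxabcdefxx", "abcdef", 3))
```

For the text xxabcdefxx and $q = 3$ the filter reads only positions 0, 3, 6 and 9 (x, b, e, x); the piece be matches at position 6 and the verification confirms the occurrence starting at position 2. With $q = 1$ the sketch degenerates to plain shift-or followed by a redundant verification.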

SLIDE 15

Complexity

  • The filtering time is $O(n/q)$.

  • Assuming that each character occurs with probability $1/\sigma$, the probability that $P^j$ occurs in a given text position is $(1/\sigma)^{\lfloor m/q \rfloor}$.

    ⇒ The verification cost is on average at most $O(mn/\sigma^{\lfloor m/q \rfloor})$.

  • We select $q$ so that $mn/\sigma^{\lfloor m/q \rfloor} = O(n/q)$, i.e. $q = O(m/\log_\sigma m)$.

    ⇒ Total average time is $O(n \log_\sigma(m)/m)$, which is optimal.
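To make the balance between filtering and verification concrete, here is a short worked derivation. It assumes the specific choice $q = m/(2\log_\sigma m)$, which is one value consistent with the condition above but is not stated on the slide, and it ignores floors for readability.

```latex
\begin{align*}
\sigma^{m/q} &= \sigma^{2\log_\sigma m} = m^2, \\
\text{verification:}\quad O\!\left(\frac{mn}{\sigma^{m/q}}\right)
  &= O\!\left(\frac{mn}{m^2}\right) = O\!\left(\frac{n}{m}\right), \\
\text{filtering:}\quad O\!\left(\frac{n}{q}\right)
  &= O\!\left(\frac{2n\log_\sigma m}{m}\right)
   = O\!\left(\frac{n\log_\sigma m}{m}\right).
\end{align*}
```

Under this choice the verification term is dominated by the filtering term, so the sum is $O(n \log_\sigma(m)/m)$.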

SLIDE 16

Long patterns

  • If $m > w$, we must use several computer words ⇒ asymptotic running time becomes $O(n \log_\sigma(m)/w)$.

  • The trick in (Peltola & Tarhio, 2003) to make BNDM work with $m > w$ can be applied to our algorithm too. Omitting the details, we obtain $O(n \log_{\sigma/u}(m)/m)$ average time, where $u = \lfloor (m-1)/w \rfloor + 1$.

  • Not optimal anymore.

SLIDE 17

Linear worst case time

  • The worst case running time is $O(mn)$.

  • Use any $O(n)$ worst case time algorithm for the verifications, and do the verifications incrementally, saving the search state of the worst case algorithm after each verification.

    ⇒ 'Standard trick', worst case becomes $O(n)$.

  • Not a real problem: if verification time is a problem, then the filter does not work well, and we can use the linear time algorithm instead.

SLIDE 18

Implementation

  • In modern pipelined CPUs branching is costly.

    ⇒ Unroll the code D ← (D << 1) | B[T[i]], i.e. repeat it inline a fixed number of times.

    ⇒ The bit positions indicating the occurrences will overflow.

    ⇒ Reserve extra bits per pattern to avoid interference, so that each piece occupies its $\lfloor m/q \rfloor$ bits plus the reserved ones.

  • Verification is done only once per unrolled block of steps, for those alignments that could match.
  • Much faster in practice.
SLIDE 19

Experimental results

  • Implementation in C, compiled using icc 8.1 with full optimizations, run on a 2.4 GHz Pentium 4 ($w = 32$) with 512 MB RAM, running Linux 2.4.20-8.
  • 100 patterns were randomly extracted from the text.
  • Each pattern was then searched for separately.
  • We report the average speed in megabytes per second.
  • Our data: real DNA and protein data, English natural language and random ASCII text ($\sigma = \ldots$).

SLIDE 20

Experimental results

We compared against:

BNDM: (Navarro & Raffinot, 2000), competitive only for random ASCII.

SBNDM: Simplified version of BNDM (Peltola & Tarhio, 2003), competitive only for random ASCII.

BMH, BMHS: Boyer-Moore-Horspool, and the Sunday variant of BMH. Not competitive on any data (results omitted).

Our algorithms:

AOSO: Our basic algorithm...

FAOSO: ...with loop-unrolling.

SLIDE 21

Experiments: DNA

(speeds in MB/s)

 m   q   AOSO   FAOSO   BNDM   SBNDM
 4   2    321     503    181     210
 8   2    539     763    312     357
12   3    702     941    438     492
16   3   1029    1229    567     598
20   4   1079    1341    750     804
24   4   1229    1525   1106    1164
28   5   1427    1638   1106    1164

SLIDE 22

Experiments: proteins

(speeds in MB/s)

 m   q   AOSO   FAOSO   BNDM   SBNDM
 4   2    580     909    415     512
 8   4    944    1267    642     678
12   4   1120    1376    816     926
16   4   1120    1459    963    1025
20   4   1235    1376   1175    1204
24   5   1267    1338   1235    1302
28   6   1302    1302   1302    1302

SLIDE 23

Experiments: natural language

(speeds in MB/s)

 m   q   AOSO   FAOSO   BNDM   SBNDM
 4   2    579     884    368     476
 8   4   1034    1262    685     778
12   4   1144    1279    797     845
16   5   1200    1389    831     944
20   6   1279    1389   1013    1092

SLIDE 24

Experiments: random ASCII

(speeds in MB/s)

 m   q   AOSO   FAOSO   BNDM   SBNDM
 4   2    599     952    633    1053
 8   4   1124    1333   1064    1220
12   4   1250    1389   1299    1282
16   4   1351    1389   1351    1389
20   6   1449    1471   1370    1429

SLIDE 25

Conclusions

  • Very simple to implement.
  • Very efficient in practice.
  • Optimal for short patterns ($m \le w$).

  • The techniques can be adapted for several other algorithms as well, e.g.

  • Shift-add (for Hamming distance, with $k$ mismatches allowed): $O(n(k + \log_\sigma(m))/m)$ average time.

  • Any algorithm for multiple string matching can be used in place of Shift-or.
