Practical and Optimal String Matching
Kimmo Fredriksson, Department of Computer Science, University of Joensuu, Finland
Szymon Grabowski, Technical University of Łódź, Computer Engineering Department
SPIRE’05 – p.1/25
Problem Setting
Given text T[1..n] and pattern P[1..m] over alphabet Σ of size σ, find the occurrences of P in T.
The pattern length m is relatively small.
⇒ Bit-parallelism.
Previous work
A vast number of algorithms exist. Some of the best known are:
(classics:)
Knuth-Morris-Pratt: the first O(n) worst case time algorithm.
Boyer-Moore(-Horspool) family: numerous variants, sublinear on average.
(bit-parallel:)
Shift-or: O(n) time for m ≤ w, where w is the number of bits in a computer word (Baeza-Yates & Gonnet, 1992).
BNDM family: O(n log_σ m / m) average time for m ≤ w. SBNDM (Navarro, 2001; Peltola & Tarhio, 2003), LNDM (He & Fang, 2004), FNDM (Holub & Durian, 2005).
Previous work
In practice, the fastest algorithms for short patterns are the BNDM family of algorithms (Navarro & Raffinot, 2000).
This work
A pattern partitioning technique that allows us to use shift-or while skipping text characters.
Optimal O(n log_σ m / m) average case running time if m ≤ w.
Simple implementation (comparable to plain shift-or).
⇒ Very efficient in practice.
O(nm) worst case, but this can be improved to O(n) without destroying the simplicity of the search algorithm.
Our algorithm: the idea
The algorithm is based on the preprocessing / filtering / verification paradigm.
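Preprocessing: generate q alignments of the pattern, each containing only every q-th character. I.e. we partition the pattern into q pieces.
Filtering: search all q pieces simultaneously using the shift-or algorithm, reading only every q-th text character.
Verification: check each candidate reported by the filter with a verification algorithm.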
Preprocessing
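Given a parameter q, generate a set of q alignments of the pattern P, each alignment containing only every q-th character.
Alignment P^j, for j = 0, ..., q-1, starts from the (j+1)-th character of P and takes every q-th character from there on.
All alignments have the same length m′ = ⌊m/q⌋.
For example, if P = abcdef and q = 3, then P^0 = ad, P^1 = be and P^2 = cf.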
Preprocessing: the rationale
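Assume that the text has an occurrence of P starting at position i.
⇒ The alignment P^j then matches the text characters at positions i + j, i + j + q, i + j + 2q, ... For exactly one j these are positions that a scan of every q-th text character visits; which j it is depends on i mod q.
(1) We can use the set of alignments as a filter for the occurrences of P.
(2) The filter needs to scan only every q-th text character.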
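The partitioning and its rationale can be sketched in Python. This is only an illustration under assumptions not fixed by the slides: indices are 0-based, the scan visits text positions 0, q, 2q, ..., and the function names `alignments` and `sampled_window` are made up for this example.

```python
def alignments(pattern: str, q: int) -> list[str]:
    """Split `pattern` into q alignments; alignment j keeps characters j, j+q, j+2q, ..."""
    m_prime = len(pattern) // q          # common alignment length, floor(m / q)
    return [pattern[j::q][:m_prime] for j in range(q)]


def sampled_window(text: str, m: int, q: int, start: int) -> tuple[int, str]:
    """For an occurrence of a length-m pattern at `start` (0-based), return the
    alignment index j and the characters that a scan of text positions
    0, q, 2q, ... reads inside that occurrence."""
    j = (-start) % q                     # distance from `start` to the next multiple of q
    m_prime = m // q
    first = start + j                    # first scanned position inside the occurrence
    return j, text[first:first + m_prime * q:q]


pieces = alignments("abcdef", 3)
j, seen = sampled_window("xxabcdefxx", 6, 3, 2)
print(pieces, j, seen)
```

For the occurrence of abcdef at position 2, the scan reads positions 3 and 6, i.e. the characters b and e, which is exactly the alignment with index 1.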
Prelude to filtering: Shift-or algorithm
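The nondeterministic automaton for the pattern P = abcdef is:
[Figure: automaton with states 1 to 7, transitions labelled a, b, c, d, e, f, and a self-loop labelled Σ on the initial state.]
For 1 ≤ i ≤ m, the mask B[c] has the i-th bit set to 0, iff P[i] = c.
The state vector D has one bit per state in the automaton; the i-th bit of the vector is set to 0, iff the state i is active (initially all bits are 1).
For the text character T[i], the automaton is simulated as:
D ← (D << 1) | B[T[i]]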
Prelude to filtering: Shift-or algorithm
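If the m-th bit of D is zero, then the text has an occurrence of P ending at the current position. This can be detected as (D & mm) ≠ mm, where mm has only the m-th bit set.
Each text character is processed in O(1) time if m ≤ w, which leads to O(n) total time.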
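A minimal Python sketch of plain shift-or follows, to make the update rule and the match test concrete. It assumes a non-empty pattern and uses 0-based bits and positions (the slides count from 1); the function name `shift_or` is chosen for this example. Python integers are unbounded, so the sketch masks D to m bits instead of relying on a machine word.

```python
def shift_or(text: str, pattern: str) -> list[int]:
    """Return the 0-based start positions of all occurrences of `pattern`."""
    m = len(pattern)
    all_ones = (1 << m) - 1
    # B[c] has bit i cleared iff pattern[i] == c.
    B: dict[str, int] = {}
    for i, c in enumerate(pattern):
        B[c] = B.get(c, all_ones) & ~(1 << i)
    mm = 1 << (m - 1)                    # only the last state's bit set
    D = all_ones                         # all states inactive
    occurrences = []
    for i, c in enumerate(text):
        # The shift brings in a 0, i.e. the initial state is always active.
        D = ((D << 1) | B.get(c, all_ones)) & all_ones
        if (D & mm) != mm:               # last state active: match ends at i
            occurrences.append(i - m + 1)
    return occurrences


print(shift_or("abracadabra", "abra"))
```

Every text character costs a constant number of word operations here, which is the O(n) behaviour stated above for m ≤ w.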
Filtering
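Search all the q alignments P^0, ..., P^(q-1) simultaneously using the Shift-or algorithm (Baeza-Yates & Gonnet, 1992).
The masks are preprocessed as if the alignments were concatenated: for P = abcdef and q = 3, we effectively preprocess a pattern P′ = P^0 P^1 P^2 = adbecf.
If P^j matches, then the (j + 1)m′-th bit of D is zero, where m′ = ⌊m/q⌋ is the length of each alignment.
This can be detected as (D & mm) ≠ mm, where mm has every (j + 1)m′-th bit set, for j = 0, ..., q-1.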
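The preprocessing for the filter can be sketched as follows. This is an illustration only: bits are numbered from 0 (so the slide's (j + 1)m′-th bit becomes bit (j + 1)m′ - 1), and the name `build_filter_masks` is an assumption of this example.

```python
def build_filter_masks(pattern: str, q: int) -> tuple[str, dict[str, int], int]:
    """Build the shift-or masks for the concatenation of the q alignments."""
    m_prime = len(pattern) // q                      # length of each alignment
    concatenated = "".join(pattern[j::q][:m_prime] for j in range(q))
    all_ones = (1 << len(concatenated)) - 1
    # B[c] has bit i cleared iff the i-th character of the concatenation is c.
    B: dict[str, int] = {}
    for i, c in enumerate(concatenated):
        B[c] = B.get(c, all_ones) & ~(1 << i)
    # mm marks the last bit of every alignment.
    mm = 0
    for j in range(q):
        mm |= 1 << ((j + 1) * m_prime - 1)
    return concatenated, B, mm


concatenated, B, mm = build_filter_masks("abcdef", 3)
print(concatenated, bin(mm), bin(B["e"]))
```

For abcdef and q = 3 the concatenation is adbecf, and mm has bits 1, 3 and 5 set, one per alignment.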
Filtering: the simplicity illustrated
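Plain shift-or:
1  D ← ~0; i ← 1
2  while i ≤ n do
3      D ← (D << 1) | B[T[i]]
4      if (D & mm) ≠ mm then report match
5      i ← i + 1
Our filter, reading only every q-th text character:
1  D ← ~0; i ← 1
2  while i ≤ n do
3      D ← ((D & ~mm) << 1) | B[T[i]]
4      if (D & mm) ≠ mm then Verify
5      i ← i + q
In the filter, line 3 clears the mm bits before the shift, so that no bit carries over from one alignment to the next.
Verification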
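The filter reports a candidate whenever some alignment P^j matches; the whole pattern is then verified at the text position corresponding to that alignment.
A brute-force check has O(m) worst case cost per verification.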
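Putting the filter and the verification together gives the following Python sketch. It is not the authors' implementation: it assumes 0-based positions, a scan of text positions 0, q, 2q, ..., brute-force verification by direct comparison, and that the mm bits are cleared before each shift so that alignments do not interfere. The name `aoso_search` is chosen for this example.

```python
def aoso_search(text: str, pattern: str, q: int) -> list[int]:
    """Return the 0-based start positions of all occurrences of `pattern`,
    reading only every q-th text character in the filtering phase."""
    n, m = len(text), len(pattern)
    m_prime = m // q                                  # length of each alignment
    if m_prime == 0:
        raise ValueError("q must not exceed the pattern length")
    all_ones = (1 << (q * m_prime)) - 1
    B: dict[str, int] = {}
    mm = 0
    for j in range(q):
        for k in range(m_prime):
            c = pattern[j + k * q]                    # k-th character of alignment j
            B[c] = B.get(c, all_ones) & ~(1 << (j * m_prime + k))
        mm |= 1 << ((j + 1) * m_prime - 1)            # last bit of alignment j
    D = all_ones
    occurrences = []
    for i in range(0, n, q):
        # Clear the end bits first so zeros, not stale bits, enter each alignment.
        D = (((D & ~mm) << 1) | B.get(text[i], all_ones)) & all_ones
        if (D & mm) != mm:                            # some alignment matched
            for j in range(q):
                if not D & (1 << ((j + 1) * m_prime - 1)):
                    start = i - (m_prime - 1) * q - j # candidate start for alignment j
                    if start >= 0 and text[start:start + m] == pattern:
                        occurrences.append(start)
    return sorted(occurrences)


print(aoso_search("xxabcdefxxabcdef", "abcdef", 3))
```

With q = 1 the sketch degenerates to plain shift-or followed by a redundant check; larger q reads fewer text characters but triggers more verifications.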
Complexity
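Filtering takes O(n/q) time, as only every q-th text character is read.
Assuming uniformly random text, the probability that an alignment P^j matches at a given position is σ^(-m′), where m′ = ⌊m/q⌋.
⇒ The verification cost is on average at most O(n/q), if q = O(m / log_σ m) is chosen suitably.
⇒ Total average time is O(n log_σ m / m), which is optimal.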
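One way to see where the bound comes from is the following back-of-the-envelope derivation. It is a sketch under assumptions of this note, not taken from the slides: the text is uniformly random over an alphabet of size $\sigma$, each verification costs $O(m)$, and $m' = \lfloor m/q \rfloor$ is the alignment length.

```latex
\begin{align*}
\text{filtering cost} &= O\!\left(\frac{n}{q}\right), \\
\Pr[\text{some } P^j \text{ matches at a scanned position}] &\le q\,\sigma^{-m'}, \\
\text{expected verification cost} &\le \frac{n}{q}\cdot q\,\sigma^{-m'}\cdot O(m)
  = O\!\left(n\,m\,\sigma^{-m'}\right).
\end{align*}
If $q$ is chosen so that $m' \ge 2\log_\sigma m$, then $\sigma^{-m'} \le 1/m^2$ and
\[
O\!\left(n\,m\,\sigma^{-m'}\right) = O\!\left(\frac{n}{m}\right) \subseteq O\!\left(\frac{n}{q}\right),
\qquad
q = \Theta\!\left(\frac{m}{\log_\sigma m}\right)
\;\Longrightarrow\;
O\!\left(\frac{n}{q}\right) = O\!\left(\frac{n\log_\sigma m}{m}\right).
\]
```

The constant 2 in the condition on $m'$ is just one convenient choice; any choice that keeps the expected verification cost within $O(n/q)$ gives the same total.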
Long patterns
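If m > w, we must use several computer words to hold the bit vectors.
⇒ Each shift-or step then costs O(⌈m/w⌉) word operations, and the asymptotic running time grows accordingly.
Techniques that make other bit-parallel algorithms work with m > w can be applied to our algorithm too.
⇒ Omitting the details, we obtain an improved average time bound.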
Linear worst case time
The basic algorithm has O(nm) worst case time.
Use a linear worst case time algorithm for the verifications, and do the verifications incrementally, saving the search state of the worst case algorithm after each verification.
⇒ A 'standard trick'; the worst case becomes O(n).
In cases where the filter does not work well, we can use the linear time algorithm instead.
Implementation
Unroll the filtering loop.
⇒ The bit positions indicating the occurrences will overflow.
⇒ Reserve extra bits per alignment to avoid interference.
⇒ The reserved bits add to the total number of bits needed.
Experimental results
The experiments were run on a machine with 512MB RAM, running Linux 2.4.20-8.
The test texts were DNA, proteins, natural language and random ASCII text.
Experimental results
We compared against:
BNDM (Navarro & Raffinot, 2000): competitive only for random ASCII.
SBNDM: simplified version of BNDM (Peltola & Tarhio, 2003); competitive only for random ASCII.
BMH, BMHS: Boyer-Moore-Horspool, and the Sunday variant of BMH. Not competitive on any data (results omitted).
Our algorithms:
AOSO: our basic algorithm.
FAOSO: AOSO with loop unrolling.
Experiments: DNA
m / q     AOSO   FAOSO   BNDM   SBNDM
 4 / 2     321     503    181     210
 8 / 2     539     763    312     357
12 / 3     702     941    438     492
16 / 3    1029    1229    567     598
20 / 4    1079    1341    750     804
24 / 4    1229    1525   1106    1164
28 / 5    1427    1638   1106    1164
Experiments: proteins
m / q     AOSO   FAOSO   BNDM   SBNDM
 4 / 2     580     909    415     512
 8 / 4     944    1267    642     678
12 / 4    1120    1376    816     926
16 / 4    1120    1459    963    1025
20 / 4    1235    1376   1175    1204
24 / 5    1267    1338   1235    1302
28 / 6    1302    1302   1302    1302
Experiments: natural language
m / q     AOSO   FAOSO   BNDM   SBNDM
 4 / 2     579     884    368     476
 8 / 4    1034    1262    685     778
12 / 4    1144    1279    797     845
16 / 5    1200    1389    831     944
20 / 6    1279    1389   1013    1092
Experiments: random ASCII
m / q     AOSO   FAOSO   BNDM   SBNDM
 4 / 2     599     952    633    1053
 8 / 4    1124    1333   1064    1220
12 / 4    1250    1389   1299    1282
16 / 4    1351    1389   1351    1389
20 / 6    1449    1471   1370    1429
Conclusions
).
algorithms as well, e.g.
average time.
in place of Shift-or.