Simpler and efficient LZW-compressed multiple pattern matching
Paweł Gawrychowski July 4, 2012
Paweł Gawrychowski LZW-compressed multiple pattern matching July 4, 2012 1 / 20
We consider the standard pattern matching problem.
Pattern matching
Given a text t and a pattern p, does p occur in t? If it does, where is the leftmost occurrence?
Find kjfdkasl in rokjfdkjncbvdkojsdlkjsldskjxlkalkjfslakjlkxxcv epofikflskdjflskjvnlmnapodierpereporipojdpdaja kjrtrgdkjfdkaslkdjoretieodflkgjsnlgkjdslgkjldf riudkxdjwoisdoiwlkmssoiwoiosdkjwoixkcjksjdkjws wjnswoislkcxlkqpodskjzlapoqlksdxcmdfepowepofde zirpotdpoitgiouyoewpoiewlkjdklnkjfdkaslldkjgrp
eopripowedkljskljwekljsdldkjsxmcnweioiewdlskjd rotirlekdlsdfdwmcslkcsdpowkdwpodkwpoekwpoporer eporjmkjfdkaslpwiowjsklncxmncsldkwpoeiwpoikwed
And move to its natural generalization.
Compressed pattern matching
Given a compressed representation of a text t and a pattern p, does p occur in t?
As the title suggests, we will actually consider the multiple pattern version.
Compressed multiple pattern matching
Given a compressed representation of a text t and a collection of patterns p1, p2, . . . , pℓ, does any pi occur in t?
Lempel-Ziv-Welch-like (or LZ78-like) compression methods
Text t[1..N] is split into disjoint blocks b1b2 . . . bn. Each block is either a single letter or a previously defined block concatenated with a single letter.
Used in compress, GIF, TIFF, PDF!
You can see that n ∈ Ω(√N), so the best possible compression ratio is […]
[…] compression and decompression.
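The block structure described above can be illustrated with a small LZ78-style parser. This is only a sketch under assumptions: the function name `lz78_parse` and the `(previous block index, letter)` encoding are hypothetical choices made for illustration (index 0 stands for the empty block), and real LZW encoders differ in detail.

```python
def lz78_parse(text):
    """Split text into blocks; each block is (index of an earlier block, letter).

    Index 0 denotes the empty block, so (0, 'a') is the single letter 'a'.
    """
    index = {"": 0}          # phrase -> block number
    blocks = []
    cur = ""
    for ch in text:
        if cur + ch in index:
            cur += ch        # keep extending a known phrase
        else:
            blocks.append((index[cur], ch))
            index[cur + ch] = len(index)
            cur = ""
    if cur:
        # Leftover repeats an earlier phrase; the dictionary is prefix-closed,
        # so cur[:-1] is always a known phrase.
        blocks.append((index[cur[:-1]], cur[-1]))
    return blocks


if __name__ == "__main__":
    print(lz78_parse("abababa"))
    print(len(lz78_parse("a" * 10)))
```

On the text consisting of ten copies of `a`, the blocks have lengths 1, 2, 3, 4, which is the extreme case behind the n ∈ Ω(√N) bound: the i-th block has length at most i, so N ≤ n(n+1)/2.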
t[1..N]: text, which after compression consists of n blocks
p1, p2, . . . , pℓ: patterns of total length M
LZW-compressed multiple pattern matching
Input: p1, p2, . . . , pℓ and a sequence of n blocks defining text t
Output: does any pi occur in t?
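As a point of comparison for the problem statement above, here is a naive baseline that decompresses the blocks and then searches the plain text. It is a sketch only: the names `decode_blocks` and `naive_search` and the `(previous block index, letter)` block encoding are assumptions for illustration. Its running time depends on the uncompressed length N, which is exactly what the algorithms in this talk avoid.

```python
def decode_blocks(blocks):
    """Rebuild the text from blocks of the form (earlier block index, letter).

    Index 0 denotes the empty block.
    """
    phrases = [""]
    out = []
    for prev, letter in blocks:
        phrase = phrases[prev] + letter
        phrases.append(phrase)
        out.append(phrase)
    return "".join(out)


def naive_search(patterns, blocks):
    """Return True if any pattern occurs in the decompressed text."""
    text = decode_blocks(blocks)
    return any(p in text for p in patterns)


if __name__ == "__main__":
    blocks = [(0, "a"), (0, "b"), (1, "b"), (3, "a")]
    print(decode_blocks(blocks))
    print(naive_search(["bab"], blocks))
```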
The first solutions for the single pattern version were given in 1994 by Amir, Benson, and Farach. They developed two algorithms with time complexities O(n log M + M) and O(n + M^2).
A year later the second algorithm was improved by Kosaraju, who developed an O(n + M^{1+ε}) time solution.
Gawrychowski SODA 2011
Single pattern version can be solved in O(n + M) time.
If we consider more than one pattern, the situation seems significantly more challenging.
Kida, Takeda, Shinohara, Miyazaki, Arikawa DCC 1998
Multiple pattern version can be solved in O(n + M^2) time.
Is it possible to narrow the gap between single and multiple pattern versions?
This paper
Multiple pattern version can be solved in O(n log M + M) or O(n + M^{1+ε}) time.
1. Matches the bounds of Amir et al. and Kosaraju.
2. DOES NOT use any combinatorics on words; reduces the question to simple-to-state data structure problems.
3. Uses the same high-level idea in both algorithms. So, in a certain sense, it is more uniform than the previously known solutions for a single pattern.
Snippets
A snippet is simply a substring of (any) pattern. Assume that each block is a snippet.
The simplest way to detect an […]
[Figure: the red segment is the current longest prefix of (any) pattern.]
Algorithm 1 MULTIPLE-PATTERN-MATCHING(s1, s2, . . . , sn′)
  c ← s1
  for k = 2, 3, . . . , n′ do
    add (c, sk) to P
    c ← prefixer(c, sk)
  end for
  for all (s, s′) ∈ P do
    detector(s, s′)
  end for

detector(s1, s2)
Given two snippets, check if any pattern occurs in their concatenation.

prefixer(s1, s2)
Find the longest suffix of the concatenation which is a prefix of some pattern.
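The control flow of Algorithm 1 can be sketched in Python as follows. This is an illustration under assumptions, not the talk's implementation: `prefixer` and `detector` are answered here by brute force on explicit strings, whereas the talk reduces them to data structure queries, and the function names and the extra `patterns` argument are hypothetical.

```python
def prefixer(patterns, s1, s2):
    """Longest suffix of s1 + s2 that is a prefix of some pattern (brute force)."""
    w = s1 + s2
    for start in range(len(w) + 1):      # longest suffix first
        suffix = w[start:]
        if any(p.startswith(suffix) for p in patterns):
            return suffix
    return ""


def detector(patterns, s1, s2):
    """Does any pattern occur in the concatenation s1 + s2? (brute force)"""
    w = s1 + s2
    return any(p in w for p in patterns)


def multiple_pattern_matching(patterns, snippets):
    """Follow Algorithm 1: collect the pairs (c, s_k), then run detector on each."""
    pairs = []
    c = snippets[0]
    for s in snippets[1:]:
        pairs.append((c, s))
        c = prefixer(patterns, c, s)
    return any(detector(patterns, a, b) for a, b in pairs)


if __name__ == "__main__":
    patterns = ["abcd", "cdab"]
    print(multiple_pattern_matching(patterns, ["cd", "cd", "ab"]))
    print(multiple_pattern_matching(patterns, ["bc", "bc", "da"]))
```

Collecting all pairs first and answering them afterwards mirrors the two loops of the pseudocode, which allows the detector queries to be processed together.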
Consider detector(s1, s2). Let P = p1$p2$ . . . $pℓ.
[Figure labels: s1, s2, pi[1..j], pi[j+1..|pi|], $, $]
Consider the situation in the prefix tree T^r and the suffix tree T.
By computing the pre- and post-order numbers, this reduces to preprocessing a collection of M rectilinear rectangles so that given a point we can quickly retrieve (any) rectangle containing it.
Similarly, for prefixer(s1, s2) we need to preprocess a collection of weighted horizontal segments so that given a vertical segment we can quickly retrieve the heaviest segment it intersects.
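To make the query concrete, here is a brute-force version of the geometric problem stated above. It only pins down what the answer should be and is not an efficient structure. The tuple layout `(x1, x2, y, weight)` for a horizontal segment and the function name `heaviest_intersected` are assumptions for illustration.

```python
def heaviest_intersected(segments, x, y_lo, y_hi):
    """Return the heaviest horizontal segment crossed by a vertical segment.

    segments: list of (x1, x2, y, weight), a horizontal segment from x1 to x2 at height y.
    The query is the vertical segment at abscissa x spanning [y_lo, y_hi].
    Returns None if nothing is intersected.
    """
    best = None
    for x1, x2, y, w in segments:
        if x1 <= x <= x2 and y_lo <= y <= y_hi:
            if best is None or w > best[3]:
                best = (x1, x2, y, w)
    return best


if __name__ == "__main__":
    segs = [(0, 5, 1, 3), (2, 8, 2, 7), (6, 9, 3, 10)]
    print(heaviest_intersected(segs, 4, 0, 3))
```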
Now we can completely forget about the words and focus on those simple data structure problems! We use the usual left-to-right sweeping technique. Whenever a new rectangle appears, we insert an interval into the structure S, and whenever it ends, we remove its corresponding interval from S. A query asks for (any) interval a given point belongs to.
Trivial solution
Implement S as any balanced search tree to get O(n log M + M log M) total time.
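The sweep can be sketched offline as below. This is a simplified illustration under assumptions: rectangles are closed and given as `(x1, y1, x2, y2)`, the name `stab_queries` is hypothetical, and S is a plain dictionary scanned linearly rather than a balanced search tree, so the sketch shows the order of events and not the stated time bound.

```python
def stab_queries(rectangles, points):
    """For each point, return the index of some rectangle containing it, or None.

    rectangles: list of (x1, y1, x2, y2) with x1 <= x2 and y1 <= y2 (closed).
    points: list of (x, y).
    """
    INSERT, QUERY, REMOVE = 0, 1, 2   # at equal x: insert, then query, then remove
    events = []
    for rid, (x1, _, x2, _) in enumerate(rectangles):
        events.append((x1, INSERT, rid))
        events.append((x2, REMOVE, rid))
    for qid, (x, _) in enumerate(points):
        events.append((x, QUERY, qid))
    events.sort()

    active = {}                        # the structure S: rectangle id -> (y1, y2)
    answers = [None] * len(points)
    for _, kind, idx in events:
        if kind == INSERT:
            _, y1, _, y2 = rectangles[idx]
            active[idx] = (y1, y2)
        elif kind == REMOVE:
            del active[idx]
        else:
            y = points[idx][1]
            for rid, (lo, hi) in active.items():
                if lo <= y <= hi:
                    answers[idx] = rid
                    break
    return answers


if __name__ == "__main__":
    rects = [(0, 0, 4, 4), (5, 5, 9, 9)]
    print(stab_queries(rects, [(2, 2), (7, 7), (4, 5), (10, 10)]))
```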
We want better bounds, though. More precisely, we would like to be linear in either n or M.
O(n log M + M)
The intervals do not cross each other, and we can use this property to replace the balanced search tree with a perfect binary tree, where each update touches just one vertex. Additionally, we exploit the fact that we can process all queries at once.
O(n + M^{1+ε})
We increase the out-degree of the tree to M^ε. Then the updates become more expensive, but the depth (and so the query time) becomes constant.
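The trade-off in the second bound can be checked with a short calculation. It assumes, for illustration only, that the tree has about M leaves, that ε is a fixed constant, and that a single update costs time proportional to the out-degree M^ε; the slide itself only says that updates become more expensive.

```latex
\[
\text{depth} \;=\; \log_{M^{\epsilon}} M \;=\; \frac{\log M}{\epsilon \log M} \;=\; \frac{1}{\epsilon} \;=\; O(1),
\qquad
\underbrace{n \cdot O(1/\epsilon)}_{\text{queries}} \;+\; \underbrace{M \cdot O(M^{\epsilon})}_{\text{updates}}
\;=\; O\!\left(n + M^{1+\epsilon}\right).
\]
```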
Similar ideas work for the second problem, too. To get the whole solution we must fill in some details (for example, we need an efficient way of retrieving the vertices corresponding to the snippets, and, if we do not assume a constant alphabet, a fast implementation of the Aho-Corasick automaton). Nevertheless, all those details boil down to the same ideas as above.
1. Is it possible to achieve O(n + M) time for multiple patterns?
2. What about approximate pattern matching? For example, given k, can we detect an occurrence with at most k mismatches faster than in O(nmk)?