pattern matching

Pattern Matching - PowerPoint PPT Presentation



  1. Pattern Matching
     [Figure: the pattern abacab aligned under a text beginning abacaab, with character comparisons numbered 1 to 4.]

  2. Outline and Reading
     - Strings (§11.1)
     - Pattern matching algorithms
       - Brute-force algorithm (§11.2.1)
       - Boyer-Moore algorithm (§11.2.2)
       - Knuth-Morris-Pratt algorithm (§11.2.3)

  3. Strings
     A string is a sequence of characters.
     Examples of strings:
     - C++ program
     - HTML document
     - DNA sequence
     - Digitized image
     An alphabet Σ is the set of possible characters for a family of strings.
     Examples of alphabets:
     - ASCII (used by C and C++)
     - Unicode (used by Java)
     - {0, 1}
     - {A, C, G, T}
     Let P be a string of size m.
     - A substring P[i..j] of P is the subsequence of P consisting of the characters with ranks between i and j
     - A prefix of P is a substring of the type P[0..i]
     - A suffix of P is a substring of the type P[i..m − 1]
     Given strings T (text) and P (pattern), the pattern matching problem consists of finding a substring of T equal to P.
     Applications:
     - Text editors
     - Search engines
     - Biological research
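The substring, prefix, and suffix notation above can be tried out directly. The following is a minimal sketch, assuming that P[i..j] includes both endpoints (as the forms P[0..i] and P[i..m − 1] suggest); the helper names `substring`, `prefixes`, and `suffixes` are illustrative and do not come from the slides.

```python
def substring(s: str, i: int, j: int) -> str:
    """Return s[i..j] in the slide notation: ranks i through j, both inclusive."""
    return s[i:j + 1]  # Python slices exclude the end index, hence j + 1


def prefixes(s: str) -> list[str]:
    """All substrings of the type s[0..i]."""
    return [substring(s, 0, i) for i in range(len(s))]


def suffixes(s: str) -> list[str]:
    """All substrings of the type s[i..m - 1], where m = len(s)."""
    m = len(s)
    return [substring(s, i, m - 1) for i in range(m)]


if __name__ == "__main__":
    P = "abacab"
    print(substring(P, 1, 3))
    print(prefixes(P))
    print(suffixes(P))
```

The only subtlety is the off-by-one between the inclusive notation P[i..j] and Python's half-open slices.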

  4. Brute-Force Algorithm
     The brute-force pattern matching algorithm compares the pattern P with the text T for each possible shift of P relative to T, until either
     - a match is found, or
     - all placements of the pattern have been tried
     Brute-force pattern matching runs in time O(nm).
     Example of worst case:
     - T = aaa…ah
     - P = aaah
     - may occur in images and DNA sequences
     - unlikely in English text

     Algorithm BruteForceMatch(T, P)
       Input text T of size n and pattern P of size m
       Output starting index of a substring of T equal to P or −1 if no such substring exists
       for i ← 0 to n − m
         {test shift i of the pattern}
         j ← 0
         while j < m ∧ T[i + j] = P[j]
           j ← j + 1
         if j = m
           return i {match at i}
         {otherwise mismatch: continue with the next shift}
       return −1 {no match anywhere}
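The BruteForceMatch pseudocode translates almost line for line into Python. This is a minimal sketch rather than a tuned implementation; the function name `brute_force_match` is an illustrative choice.

```python
def brute_force_match(text: str, pattern: str) -> int:
    """Return the first index at which pattern occurs in text, or -1."""
    n, m = len(text), len(pattern)
    for i in range(n - m + 1):  # test shift i of the pattern
        j = 0
        while j < m and text[i + j] == pattern[j]:
            j += 1
        if j == m:
            return i  # match at i
        # mismatch: fall through to the next shift
    return -1  # no match anywhere


if __name__ == "__main__":
    print(brute_force_match("abacaabadcabacabaabb", "abacab"))
    print(brute_force_match("aaaaaaah", "aaah"))
```

On the worst-case input from the slide (a long run of `a` followed by `h`), the inner loop runs almost to the end of the pattern at every shift, which is where the O(nm) bound comes from.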

  5. Boyer-Moore Heuristics
     The Boyer-Moore pattern matching algorithm is based on two heuristics.
     Looking-glass heuristic: Compare P with a subsequence of T moving backwards.
     Character-jump heuristic: When a mismatch occurs at T[i] = c
     - If P contains c, shift P to align the last occurrence of c in P with T[i]
     - Else, shift P to align P[0] with T[i + 1]
     Example: T = "a pattern matching algorithm", P = "rithm"
     [Figure: successive alignments of rithm under the text, with character comparisons numbered 1 to 11.]

  6. Last-Occurrence Function
     Boyer-Moore's algorithm preprocesses the pattern P and the alphabet Σ to build the last-occurrence function L mapping Σ to integers, where L(c) is defined as
     - the largest index i such that P[i] = c, or
     - −1 if no such index exists
     Example:
     - Σ = {a, b, c, d}
     - P = abacab

       c      a   b   c   d
       L(c)   4   5   3   −1

     The last-occurrence function can be represented by an array indexed by the numeric codes of the characters.
     The last-occurrence function can be computed in time O(m + s), where m is the size of P and s is the size of Σ.
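The last-occurrence function for the example P = abacab can be computed with one pass over the alphabet and one over the pattern, which matches the O(m + s) bound. This sketch uses a dictionary instead of an array indexed by character codes; that substitution and the name `last_occurrence` are assumptions made for readability.

```python
def last_occurrence(pattern: str, alphabet: str) -> dict[str, int]:
    """Map each character c of the alphabet to the largest i with pattern[i] == c, or -1."""
    last = {c: -1 for c in alphabet}  # O(s): default for characters absent from the pattern
    for i, c in enumerate(pattern):   # O(m): later occurrences overwrite earlier ones
        last[c] = i
    return last


if __name__ == "__main__":
    print(last_occurrence("abacab", "abcd"))
```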

  7. The Boyer-Moore Algorithm

     Algorithm BoyerMooreMatch(T, P, Σ)
       L ← lastOccurenceFunction(P, Σ)
       i ← m − 1
       j ← m − 1
       repeat
         if T[i] = P[j]
           if j = 0
             return i {match at i}
           else
             i ← i − 1
             j ← j − 1
         else
           {character-jump}
           l ← L[T[i]]
           i ← i + m − min(j, 1 + l)
           j ← m − 1
       until i > n − 1
       return −1 {no match}

     Case 1: j ≤ 1 + l, so i advances by m − j.
     Case 2: 1 + l ≤ j, so i advances by m − (1 + l).
     [Figure: for each case, the pattern before and after the jump, with i, j, and l marked.]
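The BoyerMooreMatch pseudocode can be sketched in Python as follows. Two departures from the slide are assumptions of this sketch: the last-occurrence function is a dictionary built from the pattern alone (characters not in the pattern default to −1), and the repeat/until loop is written as a while loop so that a pattern longer than the text is rejected before any character is read.

```python
def boyer_moore_match(text: str, pattern: str) -> int:
    """Return the first index at which pattern occurs in text, or -1."""
    n, m = len(text), len(pattern)
    if m == 0:
        return 0  # convention chosen here: the empty pattern matches at index 0
    last = {c: k for k, c in enumerate(pattern)}  # last occurrence of each pattern character
    i = j = m - 1
    while i <= n - 1:
        if text[i] == pattern[j]:
            if j == 0:
                return i  # match at i
            i -= 1  # looking-glass heuristic: keep comparing right to left
            j -= 1
        else:
            l = last.get(text[i], -1)  # character-jump heuristic
            i += m - min(j, 1 + l)
            j = m - 1
    return -1  # no match


if __name__ == "__main__":
    print(boyer_moore_match("a pattern matching algorithm", "rithm"))
    print(boyer_moore_match("abacaabadcabacabaabb", "abacab"))
```

The `min(j, 1 + l)` term covers both cases on the slide: when the last occurrence of the mismatched character lies to the right of position j, the pattern still moves forward by one position instead of sliding backwards.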

  8. Example
     T = abacaabadcabacabaabb, P = abacab
     [Figure: successive alignments of P under T for the Boyer-Moore algorithm, with character comparisons numbered 1 to 13.]

  9. Analysis
     Boyer-Moore's algorithm runs in time O(nm + s).
     Example of worst case:
     - T = aaa…a
     - P = baaa
     The worst case may occur in images and DNA sequences but is unlikely in English text.
     Boyer-Moore's algorithm is significantly faster than the brute-force algorithm on English text.
     [Figure: the pattern baaaaa at four successive alignments under a text of a's; each alignment is compared right to left, with comparisons numbered 1 to 24.]

  10. The KMP Algorithm - Motivation
     Knuth-Morris-Pratt's algorithm compares the pattern to the text from left to right, but shifts the pattern more intelligently than the brute-force algorithm.
     When a mismatch occurs, what is the most we can shift the pattern so as to avoid redundant comparisons?
     Answer: the largest prefix of P[0..j] that is a suffix of P[1..j]
     [Figure: the pattern abaaba under a text beginning abaabx, mismatching at position j; after the shift, the matched prefix is labelled "No need to repeat these comparisons" and the next position "Resume comparing here".]

  11. KMP Failure Function
     Knuth-Morris-Pratt's algorithm preprocesses the pattern to find matches of prefixes of the pattern with the pattern itself.
     The failure function F(j) is defined as the size of the largest prefix of P[0..j] that is also a suffix of P[1..j].
     Knuth-Morris-Pratt's algorithm modifies the brute-force algorithm so that if a mismatch occurs at P[j] ≠ T[i] we set j ← F(j − 1).

       j      0   1   2   3   4   5
       P[j]   a   b   a   a   b   a
       F(j)   0   0   1   1   2   3

     [Figure: the pattern abaaba under a text beginning abaabx, mismatching at position j, then shifted so that its first F(j − 1) characters stay aligned.]

  12. The KMP Algorithm
     The failure function can be represented by an array and can be computed in O(m) time.
     At each iteration of the while-loop, either
     - i increases by one, or
     - the shift amount i − j increases by at least one (observe that F(j − 1) < j)
     Hence, there are no more than 2n iterations of the while-loop.
     Thus, KMP's algorithm runs in optimal time O(m + n).

     Algorithm KMPMatch(T, P)
       F ← failureFunction(P)
       i ← 0
       j ← 0
       while i < n
         if T[i] = P[j]
           if j = m − 1
             return i − j {match}
           else
             i ← i + 1
             j ← j + 1
         else
           if j > 0
             j ← F[j − 1]
           else
             i ← i + 1
       return −1 {no match}

  13. Computing the Failure Function
     The failure function can be represented by an array and can be computed in O(m) time.
     The construction is similar to the KMP algorithm itself.
     At each iteration of the while-loop, either
     - i increases by one, or
     - the shift amount i − j increases by at least one (observe that F(j − 1) < j)
     Hence, there are no more than 2m iterations of the while-loop.

     Algorithm failureFunction(P)
       F[0] ← 0
       i ← 1
       j ← 0
       while i < m
         if P[i] = P[j]
           {we have matched j + 1 chars}
           F[i] ← j + 1
           i ← i + 1
           j ← j + 1
         else if j > 0 then
           {use failure function to shift P}
           j ← F[j − 1]
         else
           F[i] ← 0 {no match}
           i ← i + 1
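The failureFunction and KMPMatch pseudocode can be sketched together in Python as follows. The function names `failure_function` and `kmp_match` and the handling of the empty pattern are assumptions of this sketch; the loop bodies follow the pseudocode.

```python
def failure_function(pattern: str) -> list[int]:
    """F[k] = size of the largest prefix of pattern[0..k] that is also a suffix of pattern[1..k]."""
    m = len(pattern)
    f = [0] * m
    i, j = 1, 0
    while i < m:
        if pattern[i] == pattern[j]:
            f[i] = j + 1  # we have matched j + 1 characters
            i += 1
            j += 1
        elif j > 0:
            j = f[j - 1]  # use the failure function to shift the pattern
        else:
            f[i] = 0  # no match
            i += 1
    return f


def kmp_match(text: str, pattern: str) -> int:
    """Return the first index at which pattern occurs in text, or -1."""
    n, m = len(text), len(pattern)
    if m == 0:
        return 0  # convention chosen here: the empty pattern matches at index 0
    f = failure_function(pattern)
    i = j = 0
    while i < n:
        if text[i] == pattern[j]:
            if j == m - 1:
                return i - j  # match
            i += 1
            j += 1
        elif j > 0:
            j = f[j - 1]  # i stays put, so no text character is re-read
        else:
            i += 1
    return -1  # no match


if __name__ == "__main__":
    print(failure_function("abaaba"))
    print(kmp_match("abacaabaccabacabaabb", "abacab"))
```

Because `i` never decreases in `kmp_match`, the text is scanned strictly left to right, which is what makes the 2n bound on loop iterations possible.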

  14. Example
     T = abacaabaccabacabaabb, P = abacab

       j      0   1   2   3   4   5
       P[j]   a   b   a   c   a   b
       F(j)   0   0   1   0   1   2

     [Figure: successive alignments of P under T for the KMP algorithm, with character comparisons numbered 1 to 19.]
