string matching



  1. String Matching (Inge Li Gørtz; CLRS 32)
     String matching problem:
     • String T (text) and string P (pattern) over an alphabet Σ.
     • |T| = n, |P| = m.
     • Report all starting positions of occurrences of P in T.
     • P = a b a b a c a, T = b a c b a b a b a b a b a c a b.
     String Matching
     • Knuth-Morris-Pratt (KMP)
     • Finite automaton
     Strings
     • ε: the empty string.
     • Prefix/suffix: let v = xy.
       • x is a prefix of v; if y ≠ ε, x is a proper prefix of v.
       • y is a suffix of v; if x ≠ ε, y is a proper suffix of v.
     • Example: S = aabca.
       • The suffixes of S are: aabca, abca, bca, ca and a.
       • The strings abca, bca, ca and a are proper suffixes of S.

  2. A naive string matching algorithm
     [Figure: the pattern a b a b a c a aligned under T = b a c b a b a b a b a b a c a b at successive shifts.]
     Improving the naive algorithm
     [Figure: P = a a a b a b a aligned under T = a a a b a a a a b a b a a c a b b at successive shifts.]
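The naive approach named on this slide can be sketched in a few lines of Python. This is an illustrative sketch, not code from the lecture: the function name `naive_search` is hypothetical, and positions are reported 0-based (Python convention), whereas the slides index strings from 1.

```python
def naive_search(text, pattern):
    """Return all 0-based start positions where pattern occurs in text."""
    n, m = len(text), len(pattern)
    hits = []
    # Try every alignment of the pattern against the text.
    for s in range(n - m + 1):
        k = 0
        # Compare character by character until a mismatch or a full match.
        while k < m and text[s + k] == pattern[k]:
            k += 1
        if k == m:
            hits.append(s)
    return hits


if __name__ == "__main__":
    # Text and pattern from the first slide.
    print(naive_search("bacbababababacab", "ababaca"))  # [8]
```

Each alignment may compare up to m characters, so the worst case is O(nm) comparisons; that is the cost the following slides set out to reduce.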

  3. Improving the naive algorithm
     P = a a a b a b a, T = a a a b a a a a b a b a a c a b b
     • If we matched 5 characters from P and then fail: compare the failed character to the 2nd character in P.
     • If we matched 3 characters from P and then fail: compare the failed character to the 3rd character in P.
     • If we matched all characters from P: compare the next character to the 2nd character in P.
     #matched:             3  5  7
     if fail, compare to:  3  2  2
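The three "compare to" positions above can be checked by brute force. After matching i characters, the sketch below finds the longest prefix of P that is also a proper suffix of the matched part P[1..i]; the next comparison then uses the character one past that prefix. The function name `fallback_table` is hypothetical, and this version is meant for checking small examples by hand rather than for efficiency.

```python
def fallback_table(pattern):
    """For each i in 0..m, length of the longest prefix of pattern
    that is a proper suffix of pattern[:i] (brute force)."""
    m = len(pattern)
    table = []
    for i in range(m + 1):
        matched = pattern[:i]
        # A proper suffix is strictly shorter than the matched part.
        k = max(i - 1, 0)
        while k > 0 and not matched.endswith(pattern[:k]):
            k -= 1
        table.append(k)
    return table


if __name__ == "__main__":
    go_to = fallback_table("aaababa")
    # 1-based position in P to compare the failed character against.
    compare_to = [k + 1 for k in go_to]
    print(compare_to[3], compare_to[5], compare_to[7])  # 3 2 2
```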

  4. Improving the naive algorithm
     • KMP: P = aaababa.
     #matched:             0  1  2  3  4  5  6  7
     if fail, compare to:  1  1  2  3  1  2  1  2
     if fail, go to:       0  0  1  2  0  1  0  1
     • State 0 is the starting state; state 7 is the accepting state.
     [Figure: a string S above P; the overlap is the longest prefix of P that is a suffix of S.]
     KMP
     • KMP can be seen as a finite automaton with failure links.
     • Failure link: longest prefix of P that is a proper suffix of what we have matched until now.
     • In state i after reading T[j]: P[1..i] is the longest prefix of P that is a suffix of T[1..j].
     • Can follow several failure links when matching one character: P = a b a b a c a, T = a b a b a a.
     • Matching: P = a a a b a b a, T = a a a b a a a b a b a a.
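A minimal Python sketch of matching with failure links, assuming the "go to" table is already given (here it is copied from the slide for P = aaababa). The name `kmp_match` and the 0-based reporting of positions are choices made for this sketch; the pattern is assumed to be non-empty.

```python
def kmp_match(text, pattern, fail):
    """Return 0-based start positions of pattern in text.

    fail[i] is the state to go to after a mismatch in state i,
    i.e. after i characters of the pattern have been matched.
    """
    m = len(pattern)
    q = 0  # current state = number of pattern characters matched
    hits = []
    for j, c in enumerate(text):
        # Follow failure links until c extends the match or we reach state 0.
        while q > 0 and pattern[q] != c:
            q = fail[q]
        if pattern[q] == c:
            q += 1  # forward edge
        if q == m:
            hits.append(j - m + 1)
            q = fail[q]  # leave the accepting state via its failure link
    return hits


if __name__ == "__main__":
    fail = [0, 0, 1, 2, 0, 1, 0, 1]  # "if fail, go to" row for P = aaababa
    print(kmp_match("aaabaaababaa", "aaababa", fail))  # [4]
```

With P = ababaca and T = ababaa, the last character of T triggers three failure links in a row (5 to 3 to 1 to 0) before it finally matches the first character of P.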

  5. KMP Analysis
     • Analysis. |T| = n, |P| = m.
     • Lemma. The running time of KMP matching is O(n).
     • How many times can we follow a forward edge? Each time we follow a forward edge we read a new character of T, so at most n times.
     • How many backward edges can we follow (compared to forward edges)? #backward edges followed ≤ #forward edges followed ≤ n.
     • What else do we use time for? If we are in the start state and the character read in T does not match the forward edge, we stay there.
     • Total time = #non-matched characters in start state + #forward edges followed + #backward edges followed ≤ 2n, since the first two terms together count each character of T at most once.
     Computation of failure links
     • Failure link: longest prefix of P that is a proper suffix of what we have matched until now.
     • Computing failure links: use the KMP matching algorithm.
     • Example (P = a b a b a c a): the longest prefix of P that is a proper suffix of 'abab' is the longest prefix of P that is a suffix of 'bab'.
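The accounting in the analysis can be observed directly by counting the three kinds of steps while matching. The sketch below is illustrative only: `kmp_edge_counts` is a hypothetical name, the failure table is passed in rather than computed, and the pattern is assumed to be non-empty.

```python
def kmp_edge_counts(text, pattern, fail):
    """Run KMP matching and return (forward, backward, start_misses)."""
    m = len(pattern)
    q = 0
    forward = backward = start_misses = 0
    for c in text:
        while True:
            if q == m:
                # The accepting state has no forward edge: fall back first.
                q = fail[q]
                backward += 1
                continue
            if pattern[q] == c:
                q += 1
                forward += 1
                break
            if q == 0:
                # Mismatch in the start state: stay there, c is consumed.
                start_misses += 1
                break
            q = fail[q]
            backward += 1
    return forward, backward, start_misses


if __name__ == "__main__":
    fail = [0, 0, 0, 1, 2, 3, 0, 1]  # failure links for P = ababaca
    print(kmp_edge_counts("ababaa", "ababaca", fail))  # (6, 3, 0)
```

Every character of the text ends in exactly one forward edge or one start-state miss, and each backward edge undoes at least one earlier forward edge, so the three counts sum to at most 2n.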

  6. Computation of failure links
     • Failure link: longest prefix of P that is a proper suffix of what we have matched until now.
     • Computing failure links: as in the KMP matching algorithm (only need failure links that are already computed).
     • Example (P = a b a b a c a): the longest prefix of P that is a suffix of 'bab' can be found by using KMP to match 'bab' against P.
     • Need to match: a, ab, aba, abab, ababa, ababac.
     KMP
     • Computing π (the failure links): as in the KMP matching algorithm (only need π values that are already computed).
     • Running time: O(n + m).
     • Lemma. Total number of comparisons of characters in KMP is at most 2n.
     • Corollary. Total number of comparisons of characters in the preprocessing of KMP is at most 2m.
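A sketch of the preprocessing step in Python, in the spirit of "match P against itself using only the failure links already computed". The name `failure_links` and the layout of the table (index i holds the link for the state reached after matching i characters, so it has m + 1 entries) are assumptions of this sketch.

```python
def failure_links(pattern):
    """fail[i] = length of the longest prefix of pattern that is a
    proper suffix of pattern[:i], for i = 0..len(pattern)."""
    m = len(pattern)
    fail = [0] * (m + 1)
    k = 0  # state reached after matching pattern[1:i] against pattern
    for i in range(1, m):
        # Same loop as in matching, with pattern[1:] playing the role of the text.
        while k > 0 and pattern[k] != pattern[i]:
            k = fail[k]  # only uses entries fail[1..i], already filled in
        if pattern[k] == pattern[i]:
            k += 1
        fail[i + 1] = k
    return fail


if __name__ == "__main__":
    print(failure_links("aaababa"))  # [0, 0, 1, 2, 0, 1, 0, 1]
    print(failure_links("ababaca"))  # [0, 0, 0, 1, 2, 3, 0, 1]
```

The loop has the same shape as the matching loop, which is why the 2n bound on comparisons carries over as a 2m bound for preprocessing.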

  7. Finite Automaton
     • Finite automaton: alphabet Σ = {a,b,c}. P = ababaca.
     [Figure: chain of states from the starting state to the accepting state with forward edges a b a b a c a, plus backward edges labelled a and b.]
     • Matched until now: a. Read 'a'? Longest prefix of P that is a proper suffix of 'aa' = 'a'.
     • Matched until now: aba. Read 'a'? Longest prefix of P that is a proper suffix of 'abaa' = 'a'.

  8. Finite Automaton
     • Finite automaton: alphabet Σ = {a,b,c}. P = ababaca.
     • Matched until now: a. Read 'c'? Longest prefix of P that is a proper suffix of 'ac' = ε (the empty string).
     • Matched until now: ab. Read 'b'? Longest prefix of P that is a proper suffix of 'abb' = ε.
     • Matched until now: ab. Read 'c'? Longest prefix of P that is a proper suffix of 'abc' = ε.
     • Matched until now: aba. Read 'a'? Longest prefix of P that is a proper suffix of 'abaa' = 'a'.

  9. Finite Automaton
     • Finite automaton: alphabet Σ = {a,b,c}. P = ababaca.
     • Matched until now: aba. Read 'c'? Longest prefix of P that is a proper suffix of 'abac' = ε (the empty string).
     • T = b a c b a b a b a b a b a c a b
     Finite Automaton
     • Finite automaton:
       • Q: finite set of states
       • q₀ ∈ Q: start state
       • A ⊆ Q: set of accepting states
       • Σ: finite input alphabet
       • δ: transition function
     • Matching time: O(n).
     • Preprocessing time: O(m³|Σ|). Can be done in O(m|Σ|) using KMP.
     • Total time: O(n + m|Σ|).
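The transition function δ can be built straight from its definition: from state q on character c, go to the length of the longest prefix of P that is a suffix of P[1..q] followed by c. The Python sketch below does this by brute force, which corresponds to the O(m³|Σ|) preprocessing bound rather than the faster KMP-based construction. The names `build_automaton` and `automaton_search` are hypothetical; the sketch assumes a non-empty pattern and that every text character belongs to the alphabet.

```python
def build_automaton(pattern, alphabet):
    """delta[q][c] = next state when c is read in state q (brute force)."""
    m = len(pattern)
    delta = []
    for q in range(m + 1):
        row = {}
        for c in alphabet:
            seen = pattern[:q] + c
            # Longest prefix of the pattern that is a suffix of `seen`.
            k = min(m, q + 1)
            while k > 0 and not seen.endswith(pattern[:k]):
                k -= 1
            row[c] = k
        delta.append(row)
    return delta


def automaton_search(text, pattern, alphabet):
    """Return 0-based start positions of pattern in text."""
    delta = build_automaton(pattern, alphabet)
    m = len(pattern)
    q = 0  # start state
    hits = []
    for j, c in enumerate(text):
        q = delta[q][c]  # exactly one table lookup per text character
        if q == m:  # accepting state
            hits.append(j - m + 1)
    return hits


if __name__ == "__main__":
    print(automaton_search("bacbababababacab", "ababaca", "abc"))  # [8]
```

Unlike KMP with failure links, matching never loops on a single character, at the price of a table with (m + 1)·|Σ| entries.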
