
String Matching (Inge Li Gørtz, CLRS 32)



  1. String Matching Inge Li Gørtz CLRS 32

  2. String Matching • String matching problem: • String T (text) and string P (pattern) over an alphabet Σ. • |T| = n, |P| = m. • Report all starting positions of occurrences of P in T. • Example: P = a b a b a c a, T = b a c b a b a b a b a b a c a b

  3. Strings • ε: empty string • Prefix/suffix: v = xy: • x is a prefix of v; if y ≠ ε, x is a proper prefix of v. • y is a suffix of v; if x ≠ ε, y is a proper suffix of v. • Example: S = aabca • The suffixes of S are: aabca, abca, bca, ca and a. • The strings abca, bca, ca and a are proper suffixes of S.
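
The definitions above can be checked on the example S = aabca with a minimal Python sketch. The helper names `suffixes` and `proper_suffixes` are illustrative, and the empty string ε is left out to mirror the list on the slide.

```python
def suffixes(s):
    """All non-empty suffixes of s, longest first."""
    return [s[i:] for i in range(len(s))]


def proper_suffixes(s):
    """Non-empty suffixes of s other than s itself."""
    return suffixes(s)[1:]


print(suffixes("aabca"))         # ['aabca', 'abca', 'bca', 'ca', 'a']
print(proper_suffixes("aabca"))  # ['abca', 'bca', 'ca', 'a']
```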

  4. String Matching • Finite automaton • Knuth-Morris-Pratt (KMP)

  5. A naive string matching algorithm • T = b a c b a b a b a b a b a c a b • P = a b a b a c a • [Figure: P aligned below T at each shift in turn]
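
A minimal sketch of the naive approach, assuming it tries every shift of P against T and compares the aligned characters. The function name `naive_match` is illustrative, and positions are reported 0-based (add 1 for the 1-based numbering used later in the slides).

```python
def naive_match(text, pattern):
    """Return all 0-based starting positions of pattern in text."""
    n, m = len(text), len(pattern)
    positions = []
    for shift in range(n - m + 1):
        # Compare the pattern with the window of text at this shift.
        if text[shift:shift + m] == pattern:
            positions.append(shift)
    return positions


T = "bacbababababacab"
P = "ababaca"
print(naive_match(T, P))  # [8]
```

Each of the n - m + 1 shifts can cost up to m character comparisons, so the worst case is O(nm); this is the cost the finite automaton and KMP avoid.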

  6. Improving the naive algorithm • P = a a a b a b a • T = a a a b a a a b a b a b a c a b b • [Figure: P aligned below the start of T]

  7. Improving the naive algorithm • P = a a a b a b a • T = a a a b a a a b a b a b a c a b b • [Figure: copies and partial copies of P aligned below T at successive shifts]

  8. Exploiting what we know from pattern • P = a b a b a c a • T = a b a b a a, with P aligned below: What character in the pattern should we check next? • T = a b a b a b, with P aligned below: What character in the pattern should we check next? • T = a b a b a c, with P aligned below: What character in the pattern should we check next?

  9. Exploiting what we know from pattern • P = a b a b a c a • T = x a b a b a a x: What character in the pattern should we compare x to? The 2nd. • T = x a b a b a b x: What character in the pattern should we compare x to? The 5th. • T = x a b a b a c x: What character in the pattern should we compare x to? The 7th.
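
The three answers above can be reproduced with a brute-force sketch: find the longest prefix of P that is a suffix of the text read so far, and the next character to compare is the one right after that prefix. The helper name `matched_length` is illustrative.

```python
def matched_length(read, pattern):
    """Length of the longest prefix of pattern that is a suffix of read."""
    for k in range(min(len(read), len(pattern)), -1, -1):
        if read.endswith(pattern[:k]):  # k = 0 always succeeds
            return k


P = "ababaca"
for read in ("ababaa", "ababab", "ababac"):
    k = matched_length(read, P)
    print(read, "-> compare x to character", k + 1, "of P")
# ababaa -> compare x to character 2 of P
# ababab -> compare x to character 5 of P
# ababac -> compare x to character 7 of P
```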

  10. Finite Automaton

  11. Finite Automaton • Finite automaton: alphabet Σ = {a,b,c}. P = ababaca. • [Figure: automaton for P with the starting state and the accepting state marked; forward edges spell a b a b a c a, backward edges are labelled a or b]

  12. Finite Automaton • Finite automaton: alphabet Σ = {a,b,c}. P = ababaca. • Matched until now: a b a a • P: a b a b a c a • Longest prefix of P that is a proper suffix of ‘abaa’. • [Figure: automaton for P with the starting state and the accepting state marked]

  13. Finite Automaton • Finite automaton: alphabet Σ = {a,b,c}. P = ababaca. • T = b a c b a b a b a b a b a c a b • [Figure: automaton for P run on T]

  14. Finite Automaton • Finite automaton: alphabet Σ = {a,b,c}. P = ababaca. • Matched until now: a a • P: a b a b a c a • Read ‘a’? Longest prefix of P that is a proper suffix of ‘aa’ = ‘a’.

  15. Finite Automaton • Finite automaton: alphabet Σ = {a,b,c}. P = ababaca. • Matched until now: a c • P: a b a b a c a • Read ‘c’? Longest prefix of P that is a proper suffix of ‘ac’ = ε.

  16. Finite Automaton • Finite automaton: alphabet Σ = {a,b,c}. P = ababaca. • Matched until now: a b b • P: a b a b a c a • Read ‘b’? Longest prefix of P that is a proper suffix of ‘abb’ = ε.

  17. Finite Automaton • Finite automaton: alphabet Σ = {a,b,c}. P = ababaca. • Matched until now: a b c • P: a b a b a c a • Read ‘c’? Longest prefix of P that is a proper suffix of ‘abc’ = ε.

  18. Finite Automaton • Finite automaton: alphabet Σ = {a,b,c}. P = ababaca. • Matched until now: a b a a • P: a b a b a c a • Read ‘a’? Longest prefix of P that is a proper suffix of ‘abaa’ = ‘a’.

  19. Finite Automaton • Finite automaton: alphabet Σ = {a,b,c}. P = ababaca. • Matched until now: a b a c • P: a b a b a c a • Read ‘c’? Longest prefix of P that is a proper suffix of ‘abac’ = ε.

  20. Finite Automaton • Finite automaton: alphabet Σ = {a,b,c}. P = ababaca. • Matched until now: a b a b b • P: a b a b a c a • Read ‘b’? Longest prefix of P that is a proper suffix of ‘ababb’ = ε.

  21. Finite Automaton • Finite automaton: alphabet Σ = {a,b,c}. P = ababaca. • Matched until now: a b a b c • P: a b a b a c a • Read ‘c’? Longest prefix of P that is a proper suffix of ‘ababc’ = ε.

  22. Finite Automaton • Finite automaton: alphabet Σ = {a,b,c}. P = ababaca. • Matched until now: a b a b a a • P: a b a b a c a • Read ‘a’? Longest prefix of P that is a proper suffix of ‘ababaa’ = ‘a’.

  23. Finite Automaton • Finite automaton: alphabet Σ = {a,b,c}. P = ababaca. • Matched until now: a b a b a b • P: a b a b a c a • Read ‘b’? Longest prefix of P that is a proper suffix of ‘ababab’ = ‘abab’.

  24. Finite Automaton • Finite automaton: alphabet Σ = {a,b,c}. P = ababaca. • Read ‘b’? Longest prefix of P that is a proper suffix of ‘ababacb’ = ε.

  25. Finite Automaton • Finite automaton: alphabet Σ = {a,b,c}. P = ababaca. • Read ‘c’? Longest prefix of P that is a proper suffix of ‘ababacc’ = ε.

  26. Finite Automaton • Finite automaton: alphabet Σ = {a,b,c}. P = ababaca. • Read ‘a’? Longest prefix of P that is a proper suffix of ‘ababacaa’ = ‘a’.

  27. Finite Automaton • Finite automaton: alphabet Σ = {a,b,c}. P = ababaca. • Read ‘b’? Longest prefix of P that is a proper suffix of ‘ababacab’ = ‘ab’.

  28. Finite Automaton • Finite automaton: alphabet Σ = {a,b,c}. P = ababaca. • Read ‘c’? Longest prefix of P that is a proper suffix of ‘ababacac’ = ε.
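
The case-by-case construction in the preceding slides can be sketched as a brute-force computation of the transition function: from state q on character c, go to the length of the longest prefix of P that is a suffix of the first q characters of P followed by c (when c is the next pattern character, that is the whole string, which gives the forward edge). The names `transition` and `build_table` are illustrative, not taken from the slides.

```python
def transition(pattern, q, c):
    """Next state from state q on character c, straight from the definition."""
    read = pattern[:q] + c
    for k in range(min(len(pattern), len(read)), -1, -1):
        if read.endswith(pattern[:k]):  # k = 0 always succeeds
            return k


def build_table(pattern, alphabet):
    """One row per state 0..m, mapping each character to the next state."""
    return [{c: transition(pattern, q, c) for c in alphabet}
            for q in range(len(pattern) + 1)]


table = build_table("ababaca", "abc")
print(table[5])  # {'a': 1, 'b': 4, 'c': 6}
print(table[7])  # {'a': 1, 'b': 2, 'c': 0}
```

With m + 1 states, |Σ| characters, up to m + 1 candidate prefixes and O(m) work per candidate, this direct approach costs O(m³|Σ|).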

  29. Finite Automaton • Finite automaton: • Q: finite set of states • q₀ ∈ Q: start state • A ⊆ Q: set of accepting states • Σ: finite input alphabet • δ: transition function • Matching time: O(n) • Preprocessing time: O(m³|Σ|). Can be done in O(m|Σ|). • Total time: O(n + m|Σ|)
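
One way to reach the O(m|Σ|) preprocessing bound is sketched below. It is a standard construction that copies each row from a fallback state `x` (the state reached on the pattern without its first character); the slides do not say which method they have in mind, so treat this as an assumption. The names `build_automaton` and `automaton_match` are illustrative, the pattern is assumed non-empty, and positions are 0-based.

```python
def build_automaton(pattern, alphabet):
    """Transition table delta[q][c] for states 0..m, built in O(m * |alphabet|)."""
    m = len(pattern)
    delta = [{c: 0 for c in alphabet} for _ in range(m + 1)]
    delta[0][pattern[0]] = 1
    x = 0  # state the automaton is in after reading pattern[1:q]
    for q in range(1, m + 1):
        for c in alphabet:
            delta[q][c] = delta[x][c]      # mismatch: behave like the fallback state
        if q < m:
            delta[q][pattern[q]] = q + 1   # match: forward edge
            x = delta[x][pattern[q]]
    return delta


def automaton_match(text, pattern, alphabet):
    """Return all 0-based starting positions of pattern in text in O(n) steps."""
    delta = build_automaton(pattern, alphabet)
    m, q, positions = len(pattern), 0, []
    for i, c in enumerate(text):
        q = delta[q].get(c, 0)  # characters outside the alphabet reset to state 0
        if q == m:              # accepting state: an occurrence ends at i
            positions.append(i - m + 1)
    return positions


print(automaton_match("bacbababababacab", "ababaca", "abc"))  # [8]
```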

  30. KMP

  31. KMP • Finite automaton: alphabet Σ = {a,b,c}. P = ababaca. • KMP: Can be seen as finite automaton with failure links. • [Figure: chain of states with forward edges a b a b a c a, states numbered 1 to 6, and failure links pointing backward]

  32. KMP • KMP: Can be seen as finite automaton with failure links: • Failure link: longest prefix of P that is a proper suffix of what we have matched until now (ignore the mismatched character). • Example: longest prefix of P that is a proper suffix of ‘aba’. • [Figure: failure-link automaton for P = a b a b a c a, states numbered 1 to 6]

  33. KMP matching • KMP: Can be seen as finite automaton with failure links: • Failure link: longest prefix of P that is a proper suffix of what we have matched until now. • T = b a c b a b a b a b a b a c a b • [Figure: failure-link automaton for P = a b a b a c a, states numbered 1 to 6]

  34. KMP • KMP: Can be seen as finite automaton with failure links: • Failure link: longest prefix of P that is a proper suffix of what we have matched until now. • Can follow several failure links when matching one character. • T = a b a b a a • [Figure: failure-link automaton for P = a b a b a c a, states numbered 1 to 6]
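
A minimal sketch of matching with failure links, assuming the links are stored as a 0-based list `pi` where `pi[q - 1]` is the failure link of state q. The values used here are the π values for P = ababaca given at the end of the slides; the function name `kmp_match` is illustrative and the pattern is assumed non-empty.

```python
def kmp_match(text, pattern, pi):
    """Return all 0-based starting positions of pattern in text."""
    m = len(pattern)
    q = 0  # current state = number of pattern characters matched
    positions = []
    for i, c in enumerate(text):
        # Follow failure links until c fits or we are back in the start state.
        while q > 0 and pattern[q] != c:
            q = pi[q - 1]
        if pattern[q] == c:
            q += 1  # forward edge
        if q == m:
            positions.append(i - m + 1)
            q = pi[q - 1]  # keep going to find later occurrences
    return positions


PI = [0, 0, 1, 2, 3, 0, 1]  # failure links for P = ababaca
print(kmp_match("bacbababababacab", "ababaca", PI))  # [8]
```

On T = ababaa the inner loop runs three times for the last character (state 5 to 3 to 1 to 0), which is the "several failure links" case.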

  35. KMP Analysis • Lemma. The running time of KMP matching is O(n). • Each time we follow a forward edge we read a new character of T. • #backward edges followed ≤ #forward edges followed ≤ n. • If in the start state and the character read in T does not match the forward edge, we stay there. • Total time = #non-matched characters in start state + #forward edges followed + #backward edges followed ≤ 2n.
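
The counting argument can be checked empirically with an instrumented sketch of the matching loop. The function name `kmp_edge_counts` is illustrative, and the failure links are again the π values for P = ababaca given at the end of the slides.

```python
def kmp_edge_counts(text, pattern, pi):
    """Count forward edges, backward edges and start-state misses while matching."""
    m = len(pattern)
    q = 0
    forward = backward = start_misses = 0
    for c in text:
        while q > 0 and pattern[q] != c:
            q = pi[q - 1]
            backward += 1
        if pattern[q] == c:
            q += 1
            forward += 1
        else:
            start_misses += 1  # in the start state and c does not match
        if q == m:
            q = pi[q - 1]      # leaving the accepting state is also a backward edge
            backward += 1
    return forward, backward, start_misses


PI = [0, 0, 1, 2, 3, 0, 1]  # failure links for P = ababaca
print(kmp_edge_counts("ababaa", "ababaca", PI))  # (6, 3, 0)
```

Every text character causes exactly one forward edge or one start-state miss, so those two counts sum to n, and the backward edges never exceed the forward edges; together that gives the 2n bound.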

  36. Computation of failure links • Failure link: longest prefix of P that is a proper suffix of what we have matched until now. • Computing failure links: Use KMP matching algorithm. • Example: the longest prefix of P that is a proper suffix of ‘abab’ can be found by using KMP to match ‘bab’. • [Figure: failure-link automaton for P = a b a b a c a, states numbered 1 to 6]

  37. Computation of failure links • Computing failure links: As KMP matching algorithm (only need failure links that are already computed). • Failure link: longest prefix of P that is a proper suffix of what we have matched until now. • P = a b a b a c a (positions 1 to 7) • [Figure: failure-link automaton for P, states numbered 1 to 6]

  38. Computation of failure links • Computing failure links: As KMP matching algorithm (only need failure links that are already computed). • Failure link: longest prefix of P that is a proper suffix of what we have matched until now. • P = a b c a a b c (positions 1 to 7) • [Figure: failure-link automaton for P, states numbered 1 to 6]

  39. Computation of failure links • Computing failure links: As KMP matching algorithm (only need failure links that are already computed). • Failure link: longest prefix of P that is a proper suffix of what we have matched until now. • P = a b c a a b c (positions 1 to 7) • [Figure: failure-link automaton for P, states numbered 1 to 6]
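
A minimal sketch of computing the failure links by running the matching loop on the pattern itself, so that each step only uses links already computed. The function name `failure_links` is illustrative; the result is a 0-based list whose entry i is the failure link of state i + 1.

```python
def failure_links(pattern):
    """pi[i] = length of the longest prefix of pattern that is a proper suffix of pattern[:i + 1]."""
    m = len(pattern)
    pi = [0] * m
    k = 0  # state reached by matching pattern[1:i] against pattern
    for i in range(1, m):
        while k > 0 and pattern[k] != pattern[i]:
            k = pi[k - 1]  # follow an already computed failure link
        if pattern[k] == pattern[i]:
            k += 1
        pi[i] = k
    return pi


print(failure_links("ababaca"))  # [0, 0, 1, 2, 3, 0, 1]
print(failure_links("abcaabc"))  # [0, 0, 0, 1, 1, 2, 3]
```

Starting the scan at the second character is what makes the suffix proper: the pattern is matched against itself shifted by at least one position.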

  40. KMP • Computing π : As KMP matching algorithm (only need π values that are already computed). • Running time: O(n + m): • Lemma. Total number of comparisons of characters in KMP is at most 2n. • Corollary. Total number of comparisons of characters in the preprocessing of KMP is at most 2m.

  41. KMP: the π array • π array: A representation of the failure links. • Takes up less space than pointers. • [Figure: failure-link automaton for P = a b a b a c a, states numbered 1 to 6]

      i      1 2 3 4 5 6 7
      P[i]   a b a b a c a
      π[i]   0 0 1 2 3 0 1
