
String Matching Algorithm: Design & Analysis [18]



  1. String Matching Algorithm: Design & Analysis [18]

  2. In the last class…
     - Optimal Binary Search Tree
     - Separating a Sequence of Words
     - Dynamic Programming Algorithms

  3. String Matching
     - Simple String Matching
     - KMP Flowchart Construction
     - Jump at Fail
     - KMP Scan

  4. String Matching: Problem Description
     - Search the text T, a string of characters of length n,
     - for the pattern P, a string of characters of length m (usually m << n).
     - The result:
       - if T contains P as a substring, return the index in T at which the substring starts;
       - otherwise, fail.

  5. Straightforward Solution
     [Figure: P = $p_1 … p_{k-1}\, p_k … p_m$ aligned under T = $t_1 … t_i … t_{i+k-2}\, t_{i+k-1} … t_{i+m-1} … t_n$. The matched window starts at the first matched character $t_i$ and expands to the right; the next comparison is $p_k$ against $t_{i+k-1}$.]
     Note: if $p_k$ fails to match $t_{i+k-1}$, backtracking occurs, and a new cycle of character matching starts from $t_{i+1}$. In the worst case, nearly n backtrackings occur and there are nearly m-1 comparisons in one cycle, so the cost is $\Theta(mn)$.
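
The straightforward solution above can be sketched in Python as follows. This is a minimal illustration, not code from the slides: the name `brute_force_match` is hypothetical, and it uses 0-based indices (the slides count from 1).

```python
def brute_force_match(text: str, pattern: str) -> int:
    """Return the 0-based start of the first occurrence of pattern in text, or -1."""
    n, m = len(text), len(pattern)
    for i in range(n - m + 1):           # each candidate window start
        k = 0
        while k < m and text[i + k] == pattern[k]:
            k += 1                       # expand the matched window to the right
        if k == m:
            return i                     # the whole pattern matched in this window
        # mismatch: backtrack, the next cycle starts one position further right
    return -1


if __name__ == "__main__":
    print(brute_force_match("ACABAABABA", "ABAB"))
```

Each failed window discards everything it has learned about the text, which is the backtracking cost discussed in the note.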

  6. Brute-Force, Not So Bad as It Looks
     [Figure: the pattern P as a sliding window over the text T, with n-m+1 window positions.]
     Worst case: $m(n-m+1)$ comparisons.
     Average case (characters of P and T randomly chosen from Σ, |Σ| = d ≥ 2): for a specific window, the expected number of comparisons combines two cases.
     - Matched: all m characters match, with probability $\left(\frac{1}{d}\right)^{m}$, costing m comparisons.
     - Unmatched: the first unmatched character is the i-th in the window, with probability $\left(\frac{1}{d}\right)^{i-1}\left(1-\frac{1}{d}\right)$, costing i comparisons.
     So,
     $$m\left(\frac{1}{d}\right)^{m} + \sum_{i=1}^{m} i\left(\frac{1}{d}\right)^{i-1}\left(1-\frac{1}{d}\right) = \sum_{i=1}^{m}\left(\frac{1}{d}\right)^{i-1} = \frac{1-d^{-m}}{1-d^{-1}} \le \frac{d}{d-1} \le 2$$
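
As a quick numerical check of the average-case bound, the sketch below evaluates the expectation term by term and compares it with the geometric-series closed form $(1-d^{-m})/(1-d^{-1})$. The function names are hypothetical, and uniformly random characters are assumed, as on the slide.

```python
def expected_comparisons(m: int, d: int) -> float:
    """Expected comparisons in one window, summed case by case."""
    x = 1.0 / d
    matched = m * x ** m                          # all m characters match
    unmatched = sum(i * x ** (i - 1) * (1 - x)    # first mismatch at position i
                    for i in range(1, m + 1))
    return matched + unmatched


def closed_form(m: int, d: int) -> float:
    """Closed form (1 - d^-m) / (1 - d^-1) of the same expectation."""
    return (1 - d ** -m) / (1 - 1 / d)


if __name__ == "__main__":
    for d in (2, 4, 26):
        print(d, expected_comparisons(8, d), closed_form(8, d))
```

The bound is tightest for d = 2, where the value approaches 2 as m grows; larger alphabets give values closer to 1.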

  7. Disadvantages of Backtracking
     - More comparisons are needed.
     - Up to m-1 of the most recently matched characters have to be readily available for re-examination (consider texts that are too long to be loaded in their entirety).

  8. An Intuitive Finite Automaton for Matching a Given Pattern
     [Figure: automaton for pattern "AABC" over the alphabet {A, B, C}: a start node, nodes 1 to 4 joined by the success edges A, A, B, C, a stop node * (matched!), and further edges labeled A, B, C leading back to earlier nodes.]
     Why no backtracking? Memorize the prefix.
     Advantage: each character in the text is checked only once.
     Difficulty: construction of the automaton; too many edges (for a large alphabet) to be defined and stored.

  9. Looking at the Automaton Again
     [Figure: the automaton for pattern "AABC" over the alphabet {A, B, C}, with its start node, nodes 1 to 4, and stop node *.]
     There is only one path to success; however, many paths lead to failure.
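
To make the storage difficulty concrete, here is a minimal sketch of such an automaton as a transition table with one entry per (state, character) pair. The names `build_automaton` and `automaton_match` are hypothetical, states are numbered 0 to m instead of the slide's 1 to 4 and *, and the table is built naively rather than efficiently.

```python
def build_automaton(pattern: str, alphabet: str) -> list[dict[str, int]]:
    """delta[state][ch] = length of the longest pattern prefix that is a
    suffix of (the already matched prefix + ch)."""
    m = len(pattern)
    delta = []
    for state in range(m + 1):
        row = {}
        for ch in alphabet:
            candidate = pattern[:state] + ch
            k = min(m, len(candidate))
            while k > 0 and candidate[-k:] != pattern[:k]:
                k -= 1                    # try a shorter memorized prefix
            row[ch] = k
        delta.append(row)
    return delta


def automaton_match(text: str, pattern: str, alphabet: str) -> int:
    """Scan text once; return the 0-based match start, or -1.
    Assumes every text character belongs to alphabet."""
    delta = build_automaton(pattern, alphabet)
    state = 0
    for j, ch in enumerate(text):         # each text character is checked once
        state = delta[state][ch]
        if state == len(pattern):
            return j - len(pattern) + 1
    return -1


if __name__ == "__main__":
    print(automaton_match("BAAABCA", "AABC", "ABC"))
```

The table holds (m+1)·|Σ| entries, which is the edge count that makes this approach costly for a large alphabet.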

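  10. The Knuth-Morris-Pratt Flowchart
      [Figure: KMP flowchart for P = "ABABCB": a "get next text char." node followed by cells 1 to 6 labeled A, B, A, B, C, B and a stop node *; each cell has a success link and a failure link.]
      An example: T = "ACABAABABA", P = "ABABCB". Each row is one step of the scan (s = success, f = failure, F = final failure at the end of the text).

      | KMP cell number | Text index being scanned | Text character | Success or failure |
      |---|---|---|---|
      | 1 | 1 | A | s |
      | 2 | 2 | C | f |
      | 1 | 2 | C | f |
      | 0 | 2 | C | get next char. |
      | 1 | 3 | A | s |
      | 2 | 4 | B | s |
      | 3 | 5 | A | s |
      | 4 | 6 | A | f |
      | 2 | 6 | A | f |
      | 1 | 6 | A | s |
      | 2 | 7 | B | s |
      | 3 | 8 | A | s |
      | 4 | 9 | B | s |
      | 5 | 10 | A | f |
      | 3 | 10 | A | s |
      | 4 | 11 | - | F |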

  11. Matched Frame
      P:     ABABABCB
      T: ... ABABAB x …
      The matched frame is ABABAB; x is to be compared next.
      If x is not C, moving P by 4 characters may result in error, because an occurrence can be skipped:
      P:       ABABABCB
      T: ... ABABABABCB …
      Instead, the matched frame moves to the right by 2 characters, which is equal to moving the pointer for P backward:
      P:       ABAB ABCB
      T: ... ABABAB x …

  12. Sliding the Matched Frame
      When a mismatch occurs: $p_1 … p_{k-1}$ has matched $t_i … t_{j-1}$ (the matched frame), and $p_k$ differs from $t_j$.
      The matched frame then slides, with its breadth changed as well: the new matched frame $p_1 … p_{r-1}$ is aligned with $t_{j-r+1} … t_{j-1}$, which equals $p_{k-r+1} … p_{k-1}$, and it is as large as possible.
      The next comparison is $p_r$ against $t_j$.

  13. Fail Links
      Which means: when a comparison fails at node k, the next comparison uses $p_r$ in place of $p_k$.
      - Out of each node of the KMP flowchart is a fail link, leading to node r, where r is the largest non-negative integer satisfying r < k such that $p_1, …, p_{r-1}$ matches $p_{k-r+1}, …, p_{k-1}$ (stored in fail[k]).
      - Note: r is independent of T.
      [Figure: P slides forward by k-r, which is the same as moving the pointer for P backward from k to r; the pointer for T is shown alongside.]
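
The definition of the fail link can be evaluated directly, which is slow but useful for checking a faster method. The sketch below is an illustration only: `fail_from_definition` is a hypothetical name, the pattern is padded so that indices start at 1 as on the slides, and fail[1] = 0 is taken as the convention used in the later slides.

```python
def fail_from_definition(pattern: str) -> list[int]:
    """Return fail[0..m]; fail[0] is unused so that indices match the slides."""
    m = len(pattern)
    p = " " + pattern                        # p[1] is the first pattern character
    fail = [0] * (m + 1)                     # fail[1] = 0 by convention
    for k in range(2, m + 1):
        for r in range(k - 1, 0, -1):        # try the largest r < k first
            # compare p_1..p_{r-1} with p_{k-r+1}..p_{k-1}
            if p[1:r] == p[k - r + 1:k]:
                fail[k] = r
                break
    return fail


if __name__ == "__main__":
    print(fail_from_definition("ABABABCB")[1:])
```

Only the pattern appears in the computation, which reflects the note that r is independent of T.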

  14. Computing the Fail Links
      Thinking recursively, let fail[k-1] = s, so that $p_1 … p_{s-1}$ matches $p_{k-s} … p_{k-2}$; the characters to be compared are $p_s$ and $p_{k-1}$.
      - Case 1: $p_s = p_{k-1}$. The match extends by one character, so fail[k] = s+1.
      - Case 2: $p_s \ne p_{k-1}$. Think recursively again: $p_1 … p_{\mathrm{fail}[s]-1}$ is aligned next, and $p_{\mathrm{fail}[s]}$ is compared with $p_{k-1}$.

  15. Recursion on Node fail[s]
      Thinking recursively, at the beginning s = fail[k-1].
      Case 2 ($p_s \ne p_{k-1}$): $p_s$ is replaced by $p_{\mathrm{fail}[s]}$, that is, s takes the new value fail[s].
      Then proceed with the new s:
      - if case 1 applies ($p_s = p_{k-1}$), fail[k] = s+1; or
      - if case 2 applies ($p_s \ne p_{k-1}$), take another new s, until s reaches 0, in which case fail[k] = 0+1 = 1.

  16. Computing Fail Links: an Example
      Constructing the KMP flowchart for P = "ABABABCB", assuming that fail[1] to fail[6] have been computed.
      [Figure: flowchart with nodes numbered 0 to 9: a "get next text char." node, cells labeled A, B, A, B, A, B, C, B, and a stop node *.]
      - fail[7]: since fail[6] = 4 and $p_6 = p_4$, fail[7] = fail[6]+1 = 5 (case 1).
      - fail[8]: fail[7] = 5, but $p_7 \ne p_5$, so let s = fail[5] = 3; but $p_7 \ne p_3$, so going back further, let s = fail[3] = 1. Still $p_7 \ne p_1$. Further, let s = fail[1] = 0, so fail[8] = 0+1 = 1 (case 2).

  17. Constructing KMP Flowchart
      Input: P, a string of characters; m, the length of P.
      Output: fail, the array of failure links, filled.

          // P and fail are indexed from 1
          void kmpSetup(char[] P, int m, int[] fail)
              int k, s;
              fail[1] = 0;
              for (k = 2; k ≤ m; k++)
                  s = fail[k-1];
                  while (s ≥ 1)
                      if (P[s] == P[k-1])
                          break;
                      s = fail[s];
                  fail[k] = s + 1;

      The for loop executes m-1 times, and the while loop executes at most m times for each k, since fail[s] is always less than s. So, the complexity is roughly $O(m^2)$.
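
A runnable Python rendering of this setup procedure might look as follows. It is a sketch under stated assumptions: `kmp_setup` is a hypothetical name, and the pattern is padded with a dummy character so that the indices stay 1-based as in the pseudocode.

```python
def kmp_setup(pattern: str) -> list[int]:
    """Return the failure links fail[0..m]; fail[0] is unused padding."""
    m = len(pattern)
    p = " " + pattern              # p[1] is the first pattern character
    fail = [0] * (m + 1)           # fail[1] = 0
    for k in range(2, m + 1):
        s = fail[k - 1]
        while s >= 1:
            if p[s] == p[k - 1]:   # case 1: the matched prefix extends
                break
            s = fail[s]            # case 2: follow the fail link of s
        fail[k] = s + 1
    return fail


if __name__ == "__main__":
    print(kmp_setup("ABABABCB")[1:])
```

For P = "ABABABCB" this gives fail[7] = 5 and fail[8] = 1, the values worked out by hand in the example slide.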

  18. Number of Character Comparisons
      Total: at most 2m-3 character comparisons in kmpSetup.

          fail[1] = 0;
          for (k = 2; k ≤ m; k++)
              s = fail[k-1];
              while (s ≥ 1)
                  if (P[s] == P[k-1])
                      break;
                  s = fail[s];
              fail[k] = s + 1;

      - Successful comparisons: at most once for a specified k, totaling at most m-1.
      - Unsuccessful comparisons: each is always followed by a decrease of s. Since s is initialized as 0, s increases by one each time, and s is never negative, the number of decreases cannot be larger than the number of increases. The two lines fail[k] = s+1 and s = fail[k-1] combine to increase s by 1, and this is done m-2 times, so there are at most m-2 unsuccessful comparisons.
      Together: (m-1) + (m-2) = 2m-3.
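
The bound can be checked empirically by instrumenting the setup loop with a counter. The sketch below is an illustration with a hypothetical function name; it repeats the loop from the pseudocode with 1-based indices and counts every character comparison, successful or not.

```python
def count_setup_comparisons(pattern: str) -> int:
    """Run the fail-link construction and return the number of character comparisons."""
    m = len(pattern)
    p = " " + pattern                  # p[1] is the first pattern character
    fail = [0] * (m + 1)
    comparisons = 0
    for k in range(2, m + 1):
        s = fail[k - 1]
        while s >= 1:
            comparisons += 1           # one comparison of p_s with p_{k-1}
            if p[s] == p[k - 1]:
                break                  # successful: at most once per k
            s = fail[s]                # unsuccessful: s decreases
        fail[k] = s + 1
    return comparisons


if __name__ == "__main__":
    for pat in ("ABABABCB", "AAAAAB", "ABCDEF"):
        print(pat, count_setup_comparisons(pat), 2 * len(pat) - 3)
```

For these patterns the count stays well below 2m-3, so the construction is linear in m rather than quadratic.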

  19. KMP Scan: the Algorithm
      Input: P and T, the pattern and text; m, the length of P; fail, the array of failure links for P.
      Output: index in T where a copy of P begins, or -1 if no match.

          // P, T and fail are indexed from 1
          int kmpScan(char[] P, char[] T, int m, int[] fail)
              int match, j, k;   // j indexes T, and k indexes P
              match = -1; j = 1; k = 1;
              // each time a new cycle begins, p_1 … p_{k-1} are matched
              while (endText(T, j) == false)
                  if (k > m)
                      match = j - m; break;   // matched entirely
                  if (k == 0)
                      j++; k = 1;
                  else if (T[j] == P[k])
                      j++; k++;               // one character matched
                  else
                      k = fail[k];            // following the failure link
              // added check, assuming endText(T, j) is true as soon as j > n:
              // a match that ends on the last character of T is detected here
              if (match == -1 && k > m)
                  match = j - m;
              return match;

      The while loop is executed at most 2n times. Why?
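
A runnable Python sketch of the scan is given below. It is an illustration rather than the slide's exact code: the names `kmp_setup` and `kmp_scan` are hypothetical, the strings are padded so that internal indices start at 1, the full-match test is done right after a successful comparison, and the returned index is 0-based.

```python
def kmp_setup(pattern: str) -> list[int]:
    """Failure links fail[0..m] with 1-based indices; fail[0] is unused."""
    m = len(pattern)
    p = " " + pattern
    fail = [0] * (m + 1)
    for k in range(2, m + 1):
        s = fail[k - 1]
        while s >= 1 and p[s] != p[k - 1]:
            s = fail[s]
        fail[k] = s + 1
    return fail


def kmp_scan(text: str, pattern: str) -> int:
    """Return the 0-based start of the first occurrence of pattern, or -1."""
    n, m = len(text), len(pattern)
    if m == 0:
        return 0
    fail = kmp_setup(pattern)
    t, p = " " + text, " " + pattern   # t[1], p[1] are the first characters
    j = k = 1                          # j indexes the text, k indexes the pattern
    while j <= n:
        if k == 0:                     # fell off the flowchart: get next text char
            j += 1
            k = 1
        elif t[j] == p[k]:             # one character matched
            j += 1
            k += 1
            if k > m:                  # matched entirely
                return j - m - 1
        else:
            k = fail[k]                # follow the failure link, j stays put
    return -1


if __name__ == "__main__":
    print(kmp_scan("ACABAABABA", "ABAB"))
    print(kmp_scan("ACABAABABA", "ABABCB"))
```

The text pointer j never moves backward, so the text can be read as a stream without keeping earlier characters available.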

  20. Home Assignment
      - pp.508-
      - 11.4
      - 11.8
      - 11.9
      - 11.13
