String Matching Algorithm : Design & Analysis [18]

In the last class…
• Optimal Binary Search Tree
• Separating a Sequence of Words
• Dynamic Programming Algorithms

String Matching
• Simple String Matching
• KMP Flowchart Construction
• Jump at Fail
• KMP Scan

String Matching: Problem Description
• Search the text T, a string of characters of length n
• For the pattern P, a string of characters of length m (usually, m << n)
• The result
  - If T contains P as a substring, return the index in T at which the substring starts
  - Otherwise: fail

Straightforward Solution
[Figure: the pattern P = p_1 … p_{k-1} p_k … p_m is aligned under the text window t_i … t_{i+m-1} of T = t_1 … t_n. The matched window starts at the first matched character t_i and expands to the right up to t_{i+k-2}; the next comparison is p_k against t_{i+k-1}.]
Note: If p_k fails to match t_{i+k-1}, backtracking occurs: a new cycle of character matching starts from t_{i+1}. In the worst case, backtracking occurs nearly n times and there are nearly m-1 comparisons in one cycle, so the cost is Θ(mn).
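The straightforward solution can be sketched in Python as follows. This is a minimal illustration, not code from the slides: the function name `simple_match` is hypothetical, indices are 0-based (the slides count from 1), and -1 stands for "fail".

```python
def simple_match(text, pattern):
    """Return the 0-based index of the first occurrence of pattern in text, or -1."""
    n, m = len(text), len(pattern)
    for i in range(n - m + 1):          # n-m+1 possible window positions
        k = 0
        while k < m and text[i + k] == pattern[k]:
            k += 1                      # expand the matched window to the right
        if k == m:
            return i                    # the whole pattern matched
        # otherwise backtrack: the next cycle starts one position to the right
    return -1


print(simple_match("ACABAABABA", "ABAB"))    # 5
print(simple_match("ACABAABABA", "ABABCB"))  # -1
```

The inner loop may re-read text characters that an earlier cycle has already examined, which is exactly the backtracking described in the note.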

Brute-Force, Not So Bad as It Looks
[Figure: the pattern P slides along the text T as a window; there are n-m+1 window positions.]
Worst case: m(n-m+1) comparisons.
Average case (characters of P and T randomly chosen from Σ, |Σ| = d ≥ 2): for a specific window, the expected number of comparisons is obtained from two cases.
• Matched: m comparisons, with probability (1/d)^m.
• Unmatched: if the first unmatched character is the i-th in the window, i comparisons, with probability (1/d)^{i-1} (1 - 1/d).
So,

$$
m\left(\frac{1}{d}\right)^{m} + \sum_{i=1}^{m} i\left(\frac{1}{d}\right)^{i-1}\left(1-\frac{1}{d}\right)
= m\left(\frac{1}{d}\right)^{m} + \sum_{i=1}^{m}\left[\, i\left(\frac{1}{d}\right)^{i-1} - i\left(\frac{1}{d}\right)^{i} \right]
= \sum_{i=0}^{m-1}\left(\frac{1}{d}\right)^{i}
= \frac{1-d^{-m}}{1-d^{-1}} \le 2
$$
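As a sanity check on the average-case bound, the expectation can be evaluated exactly with rational arithmetic and compared with the closed form (1 - d^{-m}) / (1 - d^{-1}). This is only a sketch; the function name `expected_comparisons` and the sampled (m, d) pairs are chosen here for illustration.

```python
from fractions import Fraction


def expected_comparisons(m, d):
    """Expected comparisons in one window: pattern length m, alphabet size d."""
    x = Fraction(1, d)                       # probability that two characters agree
    matched = m * x**m                       # all m characters match
    unmatched = sum(i * x**(i - 1) * (1 - x) for i in range(1, m + 1))
    return matched + unmatched


for m, d in [(1, 2), (5, 2), (8, 4), (20, 26)]:
    value = expected_comparisons(m, d)
    closed_form = (1 - Fraction(1, d)**m) / (1 - Fraction(1, d))
    assert value == closed_form and value <= 2
    print(m, d, float(value))
```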

Disadvantages of Backtracking
• More comparisons are needed
• Up to m-1 of the most recently matched characters have to be readily available for re-examination (consider a text that is too long to be loaded in its entirety)

An Intuitive Finite Automaton for Matching a Given Pattern
Alphabet = {A, B, C}
[Figure: automaton for pattern "AABC" with start node 1, nodes 2, 3, 4, and stop node * (matched!). The forward edges 1 → 2 → 3 → 4 → * are labeled A, A, B, C; the edges for the other characters lead back to the same or an earlier node.]
Why no backtracking? The automaton memorizes the matched prefix.
Advantage: each character in the text is checked only once.
Difficulty: construction of the automaton; too many edges (for a large alphabet) have to be defined and stored.
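To make the automaton idea concrete, here is a hedged Python sketch. It is not the slides' construction: the table is built by brute force (for each state and character, find the longest pattern prefix that is a suffix of what has been read), which is slow but easy to check. States are numbered 0..m, where state q means "the first q characters matched", so the slides' nodes 1, 2, 3, 4, * correspond to states 0..4 here. The names `build_automaton` and `automaton_match` are hypothetical, and the text is assumed to use only characters from `alphabet`.

```python
def build_automaton(pattern, alphabet):
    """delta[q][c] = next state, where state q means 'the first q characters matched'."""
    m = len(pattern)
    delta = []
    for q in range(m + 1):
        row = {}
        for c in alphabet:
            seen = pattern[:q] + c
            # longest prefix of pattern that is a suffix of what has been read
            k = min(m, q + 1)
            while not seen.endswith(pattern[:k]):
                k -= 1
            row[c] = k
        delta.append(row)
    return delta


def automaton_match(text, pattern, alphabet):
    """Return the 0-based start of the first match, or -1; each text character is read once."""
    delta = build_automaton(pattern, alphabet)
    state = 0
    for j, c in enumerate(text):
        state = delta[state][c]
        if state == len(pattern):
            return j - len(pattern) + 1
    return -1


delta = build_automaton("AABC", "ABC")
print(delta[2])                                    # {'A': 2, 'B': 3, 'C': 0}
print(automaton_match("ABAAABCA", "AABC", "ABC"))  # 3
```

The table has (m+1)·|Σ| entries, which illustrates the storage difficulty noted on the slide.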

Looking at the Automaton Again
Alphabet = {A, B, C}
[Figure: the automaton for pattern "AABC", with start node 1, nodes 2, 3, 4, and stop node * (matched!).]
There is only one path to success; however, many paths lead to failure.

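The Knuth-Morris-Pratt Flowchart
[Figure: KMP flowchart for P = "ABABCB": a "get next text char." node, cells 1-6 labeled A, B, A, B, C, B, and a stop node *. Success links lead forward; failure links lead backward.]
An example: T = "ACABAABABA" (text positions 1-10; position 11 is the end of the text), P = "ABABCB"

KMP cell number       1  2  1  0  1  2  3  4  2  1  2  3  4  5  3  4
Text index scanned    1  2  2  2  3  4  5  6  6  6  7  8  9 10 10 11
Text character        A  C  C  C  A  B  A  A  A  A  B  A  B  A  A  -
Success or failure    s  f  f  g  s  s  s  f  f  s  s  s  s  f  s  F

(s = success, f = failure, g = get next char. at cell 0, F = the text is exhausted and the scan fails)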
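The trace in this example can be reproduced with a short Python sketch. The function name `kmp_trace` and the outcome letters are choices made here, and the fail values for "ABABCB" are an assumption worked out by hand from the fail-link definition given on a later slide.

```python
def kmp_trace(text, pattern, fail):
    """List (cell, text index, text char, outcome) per step, using 1-based positions."""
    n, m = len(text), len(pattern)
    j, k = 1, 1
    steps = []
    while True:
        if k > m:                                   # reached the stop node
            steps.append((k, j, "", "matched"))
            break
        if j > n:                                   # text exhausted
            steps.append((k, j, "-", "F"))
            break
        c = text[j - 1]
        if k == 0:                                  # cell 0: get next text char
            steps.append((k, j, c, "g"))
            j, k = j + 1, 1
        elif c == pattern[k - 1]:                   # success link
            steps.append((k, j, c, "s"))
            j, k = j + 1, k + 1
        else:                                       # failure link
            steps.append((k, j, c, "f"))
            k = fail[k]
    return steps


fail = [None, 0, 1, 1, 2, 3, 1]                     # fail[1..6] for "ABABCB" (hand-computed)
steps = kmp_trace("ACABAABABA", "ABABCB", fail)
print([cell for cell, _, _, _ in steps])
print("".join(outcome for _, _, _, outcome in steps))
```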

Matched Frame
P:     ABABABCB
T: ... ABABABx ...        (ABABAB is the matched frame; x is to be compared next)
If x is not C, moving the pattern by 4 chars may result in an error, for example:
T: ... ABABABABCB ...
P:       ABABABCB         (a match that a move of 4 chars would skip)
The matched frame moves to the right by 2 chars, which is equal to moving the pattern pointer backward:
T: ... ABABABx ...
P:       ABABABCB         (the new matched frame is ABAB)

Sliding the Matched Frame
When a mismatch occurs: p_1 … p_{k-1} have matched t_i … t_{j-1} (the matched frame), and p_k fails to match t_j.
The matched frame slides, with its breadth changed as well: P moves to the right so that p_1 … p_{r-1} lines up with p_{k-r+1} … p_{k-1}, that is, with t_{j-r+1} … t_{j-1}. This new matched frame is chosen as large as possible. The next comparison is p_r against t_j.

Fail Links
• Out of each node of the KMP flowchart is a fail link, leading to node r, where r is the largest non-negative integer satisfying r < k and p_1, …, p_{r-1} matches p_{k-r+1}, …, p_{k-1} (stored in fail[k]).
Which means: when the match fails at node k, the next comparison is the current text character against p_r (instead of p_k).
[Figure: the pointer for T only moves forward; the pointer for P moves backward by k - r, from k to r.]
• Note: r is independent of T.
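The definition can be turned directly into a deliberately naive Python sketch: for each k, try candidates r from largest to smallest and take the first that satisfies the condition. The function name `fail_by_definition` is hypothetical, and this is not the efficient construction; it only restates the definition in executable form.

```python
def fail_by_definition(pattern):
    """fail[k] for k = 1..m straight from the definition (index 0 is unused)."""
    m = len(pattern)
    p = " " + pattern                       # pad so that p[1] is the first character
    fail = [None] * (m + 1)
    for k in range(1, m + 1):
        r = k - 1                           # try the largest candidate first
        # need p_1..p_{r-1} == p_{k-r+1}..p_{k-1}
        while r >= 1 and p[1:r] != p[k - r + 1:k]:
            r -= 1
        fail[k] = r
    return fail


print(fail_by_definition("ABABCB")[1:])     # [0, 1, 1, 2, 3, 1]
print(fail_by_definition("ABABABCB")[1:])   # [0, 1, 1, 2, 3, 4, 5, 1]
```

Only the pattern appears in the computation, which reflects the note that r is independent of T.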

Computing the Fail Links
Thinking recursively, let fail[k-1] = s: p_1 … p_{s-1} is already matched against p_{k-s} … p_{k-2}, and p_s is to be compared with p_{k-1}.
[Figure: p_1 … p_{s-1} p_s p_{s+1} … aligned under p_1 … p_{k-2} p_{k-1} p_k … p_m, with p_s under p_{k-1}.]
• Case 1: p_s = p_{k-1}. Then fail[k] = s+1.
• Case 2: p_s ≠ p_{k-1}. Then p_1 … p_{fail[s]-1} p_{fail[s]} is aligned instead, so that p_{fail[s]} is compared with p_{k-1}, thinking recursively.

Recursion on Node fail[s]
Thinking recursively, at the beginning, s = fail[k-1].
Case 2: p_s ≠ p_{k-1}. p_s is replaced by p_{fail[s]}, that is, s takes the new value fail[s].
[Figure: p_1 … p_{fail[s]-1} p_{fail[s]} aligned under p_1 … p_{k-2} p_{k-1} p_k … p_m, with p_{fail[s]} under p_{k-1}.]
Then proceed with the new s, that is:
• If case 1 applies (p_s = p_{k-1}): fail[k] = s+1, or
• If case 2 applies (p_s ≠ p_{k-1}): take another new s.

Computing Fail Links: an Example
Constructing the KMP flowchart for P = "ABABABCB", assuming that fail[1] to fail[6] have been computed.
[Figure: flowchart with node 0 (get next text char.), nodes 1-8 labeled A, B, A, B, A, B, C, B, and stop node 9 (*).]
• fail[7]: fail[6] = 4 and p_6 = p_4, so fail[7] = fail[6] + 1 = 5 (case 1).
• fail[8]: fail[7] = 5, but p_7 ≠ p_5, so let s = fail[5] = 3; but p_7 ≠ p_3, so going back further, let s = fail[3] = 1. Still p_7 ≠ p_1. Further, let s = fail[1] = 0, so fail[8] = 0 + 1 = 1 (case 2).

Constructing KMP Flowchart
Input: P, a string of characters; m, the length of P
Output: fail, the array of failure links, filled

    void kmpSetup(char[] P, int m, int[] fail) {
        // P and fail are indexed from 1, as on the slides
        int k, s;
        fail[1] = 0;
        for (k = 2; k <= m; k++) {
            s = fail[k-1];
            while (s >= 1) {
                if (P[s] == P[k-1])
                    break;
                s = fail[s];
            }
            fail[k] = s + 1;
        }
    }

The for loop executes m-1 times, and the while loop executes at most m times, since fail[s] is always less than s. So, a crude bound on the complexity is O(m^2).
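A runnable Python rendering of the same construction, as a sketch: the 1-based indexing of the slides is kept by padding the pattern with a dummy character and leaving index 0 of the array unused. The name `kmp_setup` and the choice to return the array (rather than fill one passed in) are adaptations made here.

```python
def kmp_setup(pattern):
    """Return fail[0..m] with fail[k] as on the slides (1-based; index 0 unused)."""
    m = len(pattern)
    p = " " + pattern                 # pad so that p[k] is the k-th character
    fail = [None] * (m + 1)
    if m >= 1:
        fail[1] = 0
    for k in range(2, m + 1):
        s = fail[k - 1]
        while s >= 1:
            if p[s] == p[k - 1]:      # case 1: the frame extends by one character
                break
            s = fail[s]               # case 2: fall back along the fail links
        fail[k] = s + 1
    return fail


print(kmp_setup("ABABABCB")[1:])      # [0, 1, 1, 2, 3, 4, 5, 1]
```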

Number of Character Comparisons
Total: ≤ 2m-3

    fail[1] = 0;
    for (k = 2; k <= m; k++) {
        s = fail[k-1];
        while (s >= 1) {
            if (P[s] == P[k-1])
                break;
            s = fail[s];
        }
        fail[k] = s + 1;
    }

Successful comparisons: at most once for a specified k, totaling at most m-1.
Unsuccessful comparisons: each is always followed by a decrease of s. Since s is initialized as 0, s increases by one each time (the 2 lines fail[k] = s+1 and s = fail[k-1] combine to increase s by 1, done m-2 times), and s is never negative, the number of decreases cannot be larger than the number of increases, that is, at most m-2.
Adding the two counts gives at most (m-1) + (m-2) = 2m-3 comparisons.
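The 2m-3 bound can be checked empirically with an instrumented version of the construction. This is a sketch only: the name `setup_comparisons`, the two-letter alphabet, and the range of lengths are arbitrary choices for the experiment, and passing it does not replace the counting argument.

```python
from itertools import product


def setup_comparisons(pattern):
    """Run the fail-link construction and return the number of character comparisons."""
    m = len(pattern)
    p = " " + pattern                 # pad so that p[k] is the k-th character
    fail = [None] * (m + 1)
    if m >= 1:
        fail[1] = 0
    comparisons = 0
    for k in range(2, m + 1):
        s = fail[k - 1]
        while s >= 1:
            comparisons += 1          # one comparison of p_s with p_{k-1}
            if p[s] == p[k - 1]:
                break
            s = fail[s]
        fail[k] = s + 1
    return comparisons


# Exhaustive check over all patterns of length 2..8 on a two-letter alphabet.
for m in range(2, 9):
    worst = max(setup_comparisons("".join(w)) for w in product("AB", repeat=m))
    assert worst <= 2 * m - 3
    print(m, worst, 2 * m - 3)
```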

KMP Scan: the Algorithm
Input: P and T, the pattern and text; m, the length of P; fail, the array of failure links for P.
Output: index in T where a copy of P begins, or -1 if no match

    int kmpScan(char[] P, char[] T, int m, int[] fail) {
        int match, j, k;              // j indexes T, and k indexes P
        match = -1; j = 1; k = 1;
        while (endText(T, j) == false) {
            // each time a new cycle begins, p_1, …, p_{k-1} are matched
            if (k > m) {
                match = j - m;        // matched entirely
                break;
            }
            if (k == 0) {
                j++; k = 1;
            } else if (T[j] == P[k]) {
                j++; k++;             // one character matched
            } else {
                k = fail[k];          // following the failure link
            }
        }
        if (k > m)
            match = j - m;            // also covers a match that ends exactly at the end of T
        return match;
    }

The body of the while loop is executed at most 2n times. Why?
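A self-contained Python sketch of the scan, keeping the slides' 1-based positions for j and k. Assumptions made here: the helper `kmp_fail` recomputes the failure links instead of receiving them as a parameter, `endText` is replaced by a length check, and the match test is done after the loop so that a match ending at the last text character is reported.

```python
def kmp_fail(pattern):
    """1-based fail links for pattern (index 0 unused)."""
    m = len(pattern)
    fail = [None] * (m + 1)
    if m >= 1:
        fail[1] = 0
    for k in range(2, m + 1):
        s = fail[k - 1]
        while s >= 1 and pattern[s - 1] != pattern[k - 2]:
            s = fail[s]
        fail[k] = s + 1
    return fail


def kmp_scan(text, pattern):
    """Return the 1-based index in text where a copy of pattern begins, or -1."""
    n, m = len(text), len(pattern)
    fail = kmp_fail(pattern)
    j, k = 1, 1                        # j indexes text, k indexes pattern
    while j <= n and k <= m:
        if k == 0:                     # fell off the front: get the next text char
            j, k = j + 1, 1
        elif text[j - 1] == pattern[k - 1]:
            j, k = j + 1, k + 1        # one character matched
        else:
            k = fail[k]                # follow the failure link
    # Checking after the loop also catches a match ending at the last text char.
    return j - m if k > m else -1


print(kmp_scan("ACABAABABA", "ABABCB"))    # -1
print(kmp_scan("ABABABABCB", "ABABABCB"))  # 3
print(kmp_scan("AB", "AB"))                # 1
```

The text pointer j never moves backward, so the scan does not need to keep earlier text characters available for re-examination.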

Home Assignment
• pp.508-
• 11.4
• 11.8
• 11.9
• 11.13
