Pattern Matching

[Title figure: the pattern abacab aligned under the text abacaab at successive shifts, with character comparisons numbered 1 to 4.]

Outline and Reading
Strings (§11.1)
Pattern matching algorithms
  - Brute-force algorithm (§11.2.1)
  - Boyer-Moore algorithm (§11.2.2)
  - Knuth-Morris-Pratt algorithm (§11.2.3)

Strings
A string is a sequence of characters.
Examples of strings:
  - C++ program
  - HTML document
  - DNA sequence
  - Digitized image
An alphabet Σ is the set of possible characters for a family of strings.
Examples of alphabets:
  - ASCII (used by C and C++)
  - Unicode (used by Java)
  - {0, 1}
  - {A, C, G, T}
Let P be a string of size m.
  - A substring P[i..j] of P is the subsequence of P consisting of the characters with ranks between i and j, inclusive
  - A prefix of P is a substring of the type P[0..i]
  - A suffix of P is a substring of the type P[i..m − 1]
Given strings T (text) and P (pattern), the pattern matching problem consists of finding a substring of T equal to P.
Applications:
  - Text editors
  - Search engines
  - Biological research
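
The definitions above can be illustrated with a few lines of Python. This is only a sketch: the helper name substring is hypothetical, and the one point it is meant to show is that the slide's P[i..j] notation includes both endpoints, whereas Python slices exclude the end index.

```python
def substring(P: str, i: int, j: int) -> str:
    """Return P[i..j] in the slide's inclusive notation (hypothetical helper)."""
    return P[i:j + 1]          # Python slices stop before the end index, hence j + 1


P = "abacab"
m = len(P)

prefix = substring(P, 0, 2)        # P[0..2], a prefix of P
suffix = substring(P, 3, m - 1)    # P[3..m-1], a suffix of P

# The pattern matching problem: find a substring of T equal to P.
# Python's built-in str.find returns the lowest such index, or -1 if there is none.
T = "abacaabadcabacabaabb"
index = T.find(P)
```

Here prefix is "aba", suffix is "cab", and index is 10, since T[10..15] equals P.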

Brute-Force Algorithm
The brute-force pattern matching algorithm compares the pattern P with the text T for each possible shift of P relative to T, until either
  - a match is found, or
  - all placements of the pattern have been tried
Brute-force pattern matching runs in time O(nm).
Example of worst case:
  - T = aaa … ah
  - P = aaah
  - may occur in images and DNA sequences
  - unlikely in English text

Algorithm BruteForceMatch(T, P)
  Input: text T of size n and pattern P of size m
  Output: starting index of a substring of T equal to P, or −1 if no such substring exists
  for i ← 0 to n − m
    { test shift i of the pattern }
    j ← 0
    while j < m ∧ T[i + j] = P[j]
      j ← j + 1
    if j = m
      return i { match at i }
    { otherwise a mismatch occurred: continue with the next shift }
  return −1 { no match anywhere }
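
A minimal Python sketch of the BruteForceMatch pseudocode follows. The function name brute_force_match is an assumption made for illustration; the loop structure mirrors the slide.

```python
def brute_force_match(T: str, P: str) -> int:
    """Return the first index i such that T[i..i+m-1] == P, or -1 if none exists."""
    n, m = len(T), len(P)
    for i in range(n - m + 1):              # test every shift i = 0 .. n - m
        j = 0
        while j < m and T[i + j] == P[j]:   # compare left to right
            j += 1
        if j == m:                          # all m characters matched
            return i
        # otherwise a mismatch occurred; fall through to the next shift
    return -1                               # no match anywhere


print(brute_force_match("abacaabadcabacabaabb", "abacab"))  # prints 10
```

On the worst-case input from the slide (T = aaa…ah, P = aaah), every shift but the last compares all m characters before failing, which is where the O(nm) bound comes from.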

Boyer-Moore Heuristics
The Boyer-Moore pattern matching algorithm is based on two heuristics.
Looking-glass heuristic: compare P with a subsequence of T moving backwards.
Character-jump heuristic: when a mismatch occurs at T[i] = c
  - If P contains c, shift P to align the last occurrence of c in P with T[i]
  - Else, shift P to align P[0] with T[i + 1]
Example (figure): the pattern "rithm" is matched against the text "a pattern matching algorithm"; the figure numbers the character comparisons from 1 to 11.

Last-Occurrence Function
Boyer-Moore's algorithm preprocesses the pattern P and the alphabet Σ to build the last-occurrence function L mapping Σ to integers, where L(c) is defined as
  - the largest index i such that P[i] = c, or
  - −1 if no such index exists
Example: Σ = {a, b, c, d}, P = abacab

  c      a    b    c    d
  L(c)   4    5    3    −1

The last-occurrence function can be represented by an array indexed by the numeric codes of the characters.
The last-occurrence function can be computed in time O(m + s), where m is the size of P and s is the size of Σ.
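
The last-occurrence function can be sketched in Python as below. This is an illustration under assumptions: the name last_occurrence is hypothetical, and a dictionary stands in for the array indexed by character codes that the slide describes.

```python
def last_occurrence(P: str, alphabet: str) -> dict:
    """Map each character c of the alphabet to the largest i with P[i] == c, or -1."""
    L = {c: -1 for c in alphabet}    # O(s): start with "no occurrence" everywhere
    for i, c in enumerate(P):        # O(m): later indices overwrite earlier ones
        L[c] = i
    return L


print(last_occurrence("abacab", "abcd"))  # prints {'a': 4, 'b': 5, 'c': 3, 'd': -1}
```

The two loops correspond to the two terms of the O(m + s) bound.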

The Boyer-Moore Algorithm

Algorithm BoyerMooreMatch(T, P, Σ)
  L ← lastOccurrenceFunction(P, Σ)
  i ← m − 1
  j ← m − 1
  repeat
    if T[i] = P[j]
      if j = 0
        return i { match at i }
      else
        i ← i − 1
        j ← j − 1
    else
      { character-jump }
      l ← L[T[i]]
      i ← i + m − min(j, 1 + l)
      j ← m − 1
  until i > n − 1
  return −1 { no match }

Case 1: j ≤ 1 + l, so i advances by m − j.
Case 2: 1 + l ≤ j, so i advances by m − (1 + l).
[Figures: for each case, the pattern before and after the character jump, with i, j, and l marked.]
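
The pseudocode above translates to Python roughly as follows. This is a sketch, not a reference implementation: the name boyer_moore_match is an assumption, the last-occurrence table is a dictionary built from P alone (characters absent from P default to −1), and the repeat/until loop is written as a while loop so that patterns longer than the text are handled without an out-of-range access.

```python
def boyer_moore_match(T: str, P: str) -> int:
    """Simplified Boyer-Moore (looking-glass + character-jump heuristics only)."""
    n, m = len(T), len(P)
    if m == 0:
        return 0                                # assumption: empty pattern matches at 0
    last = {c: i for i, c in enumerate(P)}      # last-occurrence table for chars in P
    i = j = m - 1                               # start at the right end of the pattern
    while i <= n - 1:
        if T[i] == P[j]:
            if j == 0:
                return i                        # match at i
            i -= 1                              # looking-glass: move backwards
            j -= 1
        else:
            l = last.get(T[i], -1)              # -1 when T[i] does not occur in P
            i += m - min(j, 1 + l)              # character jump
            j = m - 1
    return -1                                   # no match


print(boyer_moore_match("abacaabadcabacabaabb", "abacab"))  # prints 10
```

The min(j, 1 + l) term covers both cases on the slide: when the last occurrence of the mismatched character lies to the right of position j, the pattern still moves forward by one position rather than backwards.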

Example
[Figure: Boyer-Moore matching of P = abacab against T = abacaabadcabacabaabb; the figure numbers the character comparisons from 1 to 13.]

Analysis
Boyer-Moore's algorithm runs in time O(nm + s).
Example of worst case:
  - T = aaa … a
  - P = baaa
The worst case may occur in images and DNA sequences but is unlikely in English text.
Boyer-Moore's algorithm is significantly faster than the brute-force algorithm on English text.
[Figure: P = baaaaa aligned under T = aaaaaaaaa at four successive shifts, six comparisons per shift, numbered 1 to 24.]
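
To make the worst case concrete, the sketch below counts character comparisons for the simplified Boyer-Moore matcher. The function name boyer_moore_comparisons and the dictionary-based last-occurrence table are assumptions made for illustration.

```python
def boyer_moore_comparisons(T: str, P: str):
    """Return (match index or -1, number of character comparisons performed)."""
    n, m = len(T), len(P)
    if m == 0:
        return 0, 0                             # assumption: empty pattern matches at 0
    last = {c: i for i, c in enumerate(P)}      # last-occurrence table for chars in P
    comparisons = 0
    i = j = m - 1
    while i <= n - 1:
        comparisons += 1                        # one comparison of T[i] with P[j]
        if T[i] == P[j]:
            if j == 0:
                return i, comparisons
            i -= 1
            j -= 1
        else:
            l = last.get(T[i], -1)
            i += m - min(j, 1 + l)              # character jump
            j = m - 1
    return -1, comparisons


# Input from the figure: nine a's against baaaaa.
print(boyer_moore_comparisons("a" * 9, "baaaaa"))  # prints (-1, 24)
```

On this kind of input each alignment matches m − 1 characters before failing on the leading b, and the character jump then moves the pattern by only one position, so the count is m(n − m + 1) comparisons.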

The KMP Algorithm - Motivation
Knuth-Morris-Pratt's algorithm compares the pattern to the text from left to right, but shifts the pattern more intelligently than the brute-force algorithm.
When a mismatch occurs, what is the most we can shift the pattern so as to avoid redundant comparisons?
Answer: the largest prefix of P[0..j] that is a suffix of P[1..j].
[Figure: P = abaaba aligned under the text segment abaabx, with a mismatch at position j. After the shift, the comparisons on the overlapping prefix need not be repeated, and comparing resumes at the mismatched text character.]

KMP Failure Function
Knuth-Morris-Pratt's algorithm preprocesses the pattern to find matches of prefixes of the pattern with the pattern itself.
The failure function F(j) is defined as the size of the largest prefix of P[0..j] that is also a suffix of P[1..j].
Knuth-Morris-Pratt's algorithm modifies the brute-force algorithm so that if a mismatch occurs at P[j] ≠ T[i] we set j ← F(j − 1).

  j      0    1    2    3    4    5
  P[j]   a    b    a    a    b    a
  F(j)   0    0    1    1    2    3

[Figure: P = abaaba aligned under the text segment abaabx, with a mismatch at position j; after the shift, comparison resumes at pattern index F(j − 1).]

The KMP Algorithm
The failure function can be represented by an array and can be computed in O(m) time.
At each iteration of the while-loop, either
  - i increases by one, or
  - the shift amount i − j increases by at least one (observe that F(j − 1) < j)
Hence, there are no more than 2n iterations of the while-loop.
Thus, KMP's algorithm runs in optimal time O(m + n).

Algorithm KMPMatch(T, P)
  F ← failureFunction(P)
  i ← 0
  j ← 0
  while i < n
    if T[i] = P[j]
      if j = m − 1
        return i − j { match }
      else
        i ← i + 1
        j ← j + 1
    else
      if j > 0
        j ← F[j − 1]
      else
        i ← i + 1
  return −1 { no match }
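
The following Python sketch mirrors KMPMatch and also counts while-loop iterations, so the 2n bound can be checked on small inputs. The names kmp_match and failure_function are assumptions; the failure function is included only to keep the block self-contained.

```python
def failure_function(P: str) -> list:
    """F[i] = size of the largest prefix of P[0..i] that is also a suffix of P[1..i]."""
    m = len(P)
    F = [0] * m
    i, j = 1, 0
    while i < m:
        if P[i] == P[j]:
            F[i] = j + 1
            i += 1
            j += 1
        elif j > 0:
            j = F[j - 1]
        else:
            i += 1                     # F[i] stays 0
    return F


def kmp_match(T: str, P: str):
    """Return (match index or -1, number of while-loop iterations)."""
    n, m = len(T), len(P)
    if m == 0:
        return 0, 0                    # assumption: empty pattern matches at 0
    F = failure_function(P)
    i = j = 0
    iterations = 0
    while i < n:
        iterations += 1
        if T[i] == P[j]:
            if j == m - 1:
                return i - j, iterations   # match
            i += 1
            j += 1
        elif j > 0:
            j = F[j - 1]               # shift the pattern, keep i
        else:
            i += 1
    return -1, iterations


print(kmp_match("abacaabaccabacabaabb", "abacab"))  # prints (10, 19)
```

Each iteration performs exactly one character comparison, so the iteration count is also the comparison count; on this 20-character text it stays well under 2n = 40.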

Computing the Failure Function
The failure function can be represented by an array and can be computed in O(m) time.
The construction is similar to the KMP algorithm itself.
At each iteration of the while-loop, either
  - i increases by one, or
  - the shift amount i − j increases by at least one (observe that F(j − 1) < j)
Hence, there are no more than 2m iterations of the while-loop.

Algorithm failureFunction(P)
  F[0] ← 0
  i ← 1
  j ← 0
  while i < m
    if P[i] = P[j]
      { we have matched j + 1 chars }
      F[i] ← j + 1
      i ← i + 1
      j ← j + 1
    else if j > 0 then
      { use failure function to shift P }
      j ← F[j − 1]
    else
      F[i] ← 0 { no match }
      i ← i + 1
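
As a sanity check on the construction, the sketch below places the O(m) algorithm next to a slow version written directly from the definition of F(j), so the two can be compared on small patterns. Both function names are hypothetical.

```python
def failure_function(P: str) -> list:
    """O(m) construction following the failureFunction pseudocode."""
    m = len(P)
    F = [0] * m
    i, j = 1, 0
    while i < m:
        if P[i] == P[j]:
            F[i] = j + 1               # we have matched j + 1 characters
            i += 1
            j += 1
        elif j > 0:
            j = F[j - 1]               # use the failure function to shift P
        else:
            F[i] = 0                   # no match
            i += 1
    return F


def failure_by_definition(P: str) -> list:
    """Slow check: largest k such that P[0..k-1] is a suffix of P[1..j]."""
    F = []
    for j in range(len(P)):
        best = 0
        for k in range(1, j + 1):      # k <= j keeps the suffix inside P[1..j]
            if P[:k] == P[j - k + 1:j + 1]:
                best = k
        F.append(best)
    return F


print(failure_function("abaaba"))  # prints [0, 0, 1, 1, 2, 3]
```

The printed values agree with the F(j) row of the failure function table for P = abaaba.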

Example
[Figure: KMP matching of P = abacab against T = abacaabaccabacabaabb; the figure numbers the character comparisons from 1 to 19.]

  j      0    1    2    3    4    5
  P[j]   a    b    a    c    a    b
  F(j)   0    0    1    0    1    2
