Chapter 9: Text Processing 10/16/2015 3:40 PM Text Processing 1

Outline and Reading
- Strings and Pattern Matching (§9.1)
- Tries (§9.2)
- Text Compression (§9.3)
- Optional: Text Similarity (§9.4), no slides

Texts & Pattern Matching
[Figure: the pattern "abacab" shown at two alignments against the text "abacaab", with character comparisons numbered 1 to 4]

Strings
- A string is a sequence of characters
- Examples of strings: Java program, HTML document, DNA sequence, digitized image
- An alphabet Σ is the set of possible characters for a family of strings
- Examples of alphabets: ASCII, Unicode, {0, 1}, {A, C, G, T}
- Let P be a string of size m
  - A substring P[i..j] of P is the subsequence of P consisting of the characters with ranks between i and j
  - A prefix of P is a substring of the type P[0..i]
  - A suffix of P is a substring of the type P[i..m − 1]
- Given strings T (text) and P (pattern), the pattern matching problem consists of finding a substring of T equal to P
- Applications: text editors, search engines, biological research
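The substring, prefix, and suffix notation maps directly onto Python slices. The following is a minimal sketch, assuming ranks are 0-based and both ends of P[i..j] are inclusive; the helper names are hypothetical and do not come from the slides.

```python
def substring(P, i, j):
    """Return P[i..j], with both ranks inclusive as in the slide notation."""
    return P[i:j + 1]


def prefix(P, i):
    """Return the prefix P[0..i]."""
    return substring(P, 0, i)


def suffix(P, i):
    """Return the suffix P[i..m-1], where m = len(P)."""
    return substring(P, i, len(P) - 1)


P = "abacab"
print(substring(P, 1, 3))  # bac
print(prefix(P, 2))        # aba
print(suffix(P, 4))        # ab
```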

Brute-Force Algorithm
- The brute-force pattern matching algorithm compares the pattern P with the text T for each possible shift of P relative to T, until either
  - a match is found, or
  - all placements of the pattern have been tried
- Brute-force pattern matching runs in time O(nm)
- Example of worst case:
  - T = aaa…ah
  - P = aaah
  - may occur in images and DNA sequences
  - unlikely in English text

Algorithm BruteForceMatch(T, P)
  Input: text T of size n and pattern P of size m
  Output: starting index of a substring of T equal to P, or −1 if no such substring exists
  for i ← 0 to n − m
    { test shift i of the pattern }
    j ← 0
    while j < m ∧ T[i + j] = P[j]
      j ← j + 1
    if j = m
      return i { match at i }
    { otherwise mismatch: continue with the next shift }
  return −1 { no match anywhere }
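The BruteForceMatch pseudocode can be transcribed almost line for line. This is a minimal Python sketch, assuming T and P are ordinary Python strings; the function name is an illustrative choice.

```python
def brute_force_match(T, P):
    """Return the first index at which P occurs in T, or -1 if it never does."""
    n, m = len(T), len(P)
    for i in range(n - m + 1):              # test every shift i = 0 .. n - m
        j = 0
        while j < m and T[i + j] == P[j]:   # extend the match character by character
            j += 1
        if j == m:                          # all m characters matched
            return i
    return -1                               # no match anywhere


print(brute_force_match("abacaabadcabacabaabb", "abacab"))  # 10
```

In the worst case from the slide (T = aaa…ah, P = aaah) the inner loop runs almost m times for nearly every shift, which is where the O(nm) bound comes from.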

Boyer-Moore Heuristics
The Boyer-Moore pattern matching algorithm is based on two heuristics:
- Looking-glass heuristic: compare P with a subsequence of T moving backwards
- Character-jump heuristic: when a mismatch occurs at T[i] = c
  - If P contains c, shift P to align the last occurrence of c in P with T[i]
  - Else, shift P to align P[0] with T[i + 1]
Example:
[Figure: the pattern "rithm" shifted along the text "a pattern matching algorithm"; the match is found after 11 character comparisons]

The Boyer-Moore Algorithm

Algorithm BoyerMooreMatch(T, P, Σ)
  L ← lastOccurenceFunction(P, Σ)
  i ← m − 1
  j ← m − 1
  repeat
    if T[i] = P[j]
      if j = 0
        return i { match at i }
      else
        i ← i − 1
        j ← j − 1
    else
      { character-jump }
      l ← L[T[i]]
      i ← i + m − min(j, 1 + l)
      j ← m − 1
  until i > n − 1
  return −1 { no match }
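The BoyerMooreMatch pseudocode can be sketched in Python as follows. The slides do not define the last-occurrence function at this point, so the sketch assumes L(c) is the index of the last occurrence of c in P, or −1 if c does not occur in P. It stores L in a dictionary instead of an array indexed by the alphabet Σ. The function names are illustrative.

```python
def last_occurrence(P):
    """Map each character of P to the index of its last occurrence in P.

    Assumption: characters absent from P are treated as having index -1
    (handled by the .get(..., -1) lookup in boyer_moore_match).
    """
    return {c: k for k, c in enumerate(P)}


def boyer_moore_match(T, P):
    """Return the first index at which P occurs in T, or -1 (character-jump heuristic only)."""
    n, m = len(T), len(P)
    if m == 0:
        return 0
    L = last_occurrence(P)
    i = j = m - 1                       # start comparing at the end of the pattern
    while i <= n - 1:
        if T[i] == P[j]:
            if j == 0:
                return i                # match at i
            i -= 1                      # looking-glass: move backwards
            j -= 1
        else:
            l = L.get(T[i], -1)         # character-jump
            i += m - min(j, 1 + l)
            j = m - 1
    return -1                           # no match


print(boyer_moore_match("abacaabadcabacabaabb", "abacab"))  # 10
```

The `while` guard replaces the `repeat … until` of the pseudocode so that a pattern longer than the text never indexes past the end of T.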

Example
[Figure: Boyer-Moore matching of the pattern "abacab" against the text "abacaabadcabacabaabb"; the match is found after 13 character comparisons]

Analysis
- Boyer-Moore's algorithm runs in time O(nm + s), where s is the size of the alphabet Σ
- Example of worst case:
  - T = aaa…a
  - P = baaa
- The worst case may occur in images and DNA sequences but is unlikely in English text
- Boyer-Moore's algorithm is significantly faster than the brute-force algorithm on English text
[Figure: the pattern "baaaaa" against the text "aaaaaaaaa"; four alignments of six comparisons each, 24 comparisons in total]

The KMP Algorithm - Motivation
- Knuth-Morris-Pratt's algorithm compares the pattern to the text from left to right, but shifts the pattern more intelligently than the brute-force algorithm
- When a mismatch occurs, what is the most we can shift the pattern so as to avoid redundant comparisons?
- Answer: the largest prefix of P[0..j] that is a suffix of P[1..j]
[Figure: the pattern "abaaba" under a text beginning "abaabx"; a mismatch occurs at index j. After the shift there is no need to repeat the comparisons on the overlapping "ab", and comparing resumes at the next pattern character]

KMP Failure Function
- Knuth-Morris-Pratt's algorithm preprocesses the pattern to find matches of prefixes of the pattern with the pattern itself
- The failure function F(j) is defined as the size of the largest prefix of P[0..j] that is also a suffix of P[1..j]
- Knuth-Morris-Pratt's algorithm modifies the brute-force algorithm so that if a mismatch occurs at P[j] ≠ T[i] we set j ← F(j − 1)

  j     0  1  2  3  4  5
  P[j]  a  b  a  a  b  a
  F(j)  0  0  1  1  2  3

[Figure: the pattern "abaaba" under a text beginning "abaabx"; after the mismatch at index j the pattern is shifted so that comparison resumes at index F(j − 1)]

The KMP Algorithm
- The failure function can be represented by an array and can be computed in O(m) time
- At each iteration of the while-loop, either
  - i increases by one, or
  - the shift amount i − j increases by at least one (observe that F(j − 1) < j)
- Hence, there are no more than 2n iterations of the while-loop
- Thus, KMP's algorithm runs in optimal time O(m + n)

Algorithm KMPMatch(T, P)
  F ← failureFunction(P)
  i ← 0
  j ← 0
  while i < n
    if T[i] = P[j]
      if j = m − 1
        return i − j { match }
      else
        i ← i + 1
        j ← j + 1
    else
      if j > 0
        j ← F[j − 1]
      else
        i ← i + 1
  return −1 { no match }

Computing the Failure Function
- The failure function can be represented by an array and can be computed in O(m) time
- The construction is similar to the KMP algorithm itself
- At each iteration of the while-loop, either
  - i increases by one, or
  - the shift amount i − j increases by at least one (observe that F(j − 1) < j)
- Hence, there are no more than 2m iterations of the while-loop

Algorithm failureFunction(P)
  F[0] ← 0
  i ← 1
  j ← 0
  while i < m
    if P[i] = P[j]
      { we have matched j + 1 chars }
      F[i] ← j + 1
      i ← i + 1
      j ← j + 1
    else if j > 0 then
      { use failure function to shift P }
      j ← F[j − 1]
    else
      F[i] ← 0 { no match }
      i ← i + 1
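The failureFunction and KMPMatch pseudocode can be combined into one short Python sketch. It assumes T and P are Python strings and stores F as a list; the function names are illustrative.

```python
def failure_function(P):
    """Return F, where F[j] is the size of the largest prefix of P[0..j]
    that is also a suffix of P[1..j]."""
    m = len(P)
    F = [0] * m
    i, j = 1, 0
    while i < m:
        if P[i] == P[j]:        # we have matched j + 1 characters
            F[i] = j + 1
            i += 1
            j += 1
        elif j > 0:             # use the failure function to shift P
            j = F[j - 1]
        else:                   # no match
            F[i] = 0
            i += 1
    return F


def kmp_match(T, P):
    """Return the first index at which P occurs in T, or -1."""
    n, m = len(T), len(P)
    if m == 0:
        return 0
    F = failure_function(P)
    i = j = 0
    while i < n:
        if T[i] == P[j]:
            if j == m - 1:
                return i - j    # match
            i += 1
            j += 1
        elif j > 0:
            j = F[j - 1]        # shift the pattern, keep i in place
        else:
            i += 1
    return -1                   # no match


print(failure_function("abaaba"))                   # [0, 0, 1, 1, 2, 3]
print(kmp_match("abacaabaccabacabaabb", "abacab"))  # 10
```

The index i never moves backwards, so the text is scanned once, giving the O(m + n) total.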

Example
[Figure: KMP matching of the pattern "abacab" against the text "abacaabaccabacabaabb"; the match is found after 19 character comparisons]

  j     0  1  2  3  4  5
  P[j]  a  b  a  c  a  b
  F(j)  0  0  1  0  1  2

Tries
[Figure: a compressed trie with edge labels e, i, mi, mize, nimize, ze]

Preprocessing Strings
- Preprocessing the pattern speeds up pattern matching queries
  - After preprocessing the pattern, KMP's algorithm performs pattern matching in time proportional to the text size
- If the text is large, immutable and searched for often (e.g., works by Shakespeare), we may want to preprocess the text instead of the pattern
- A trie is a compact data structure for representing a set of strings, such as all the words in a text
  - A trie supports pattern matching queries in time proportional to the pattern size

Standard Trie (1)
- The standard trie for a set of strings S is an ordered tree such that:
  - Each node but the root is labeled with a character
  - The children of a node are alphabetically ordered
  - The paths from the root to the external nodes yield the strings of S
- Example: standard trie for the set of strings S = { bear, bell, bid, bull, buy, sell, stock, stop }
[Figure: standard trie for S; the root has children b and s, and each root-to-leaf path spells one string of S]

Standard Trie (2)
- A standard trie uses O(n) space and supports searches, insertions and deletions in time O(dm), where:
  - n is the total size of the strings in S
  - m is the size of the string parameter of the operation
  - d is the size of the alphabet
[Figure: the standard trie for S = { bear, bell, bid, bull, buy, sell, stock, stop }, as on the previous slide]
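A standard trie can be sketched in a few lines of Python. The sketch makes two assumptions that are not in the slides: children are kept in a dictionary, so a child lookup takes expected constant time (no O(d) scan) and is sorted only when enumerating; and each node carries an `is_word` flag, so a string that is a prefix of another can still be stored. The class and method names are illustrative.

```python
class TrieNode:
    def __init__(self):
        self.children = {}      # character -> TrieNode
        self.is_word = False    # True if a string of S ends at this node


class Trie:
    def __init__(self, words=()):
        self.root = TrieNode()
        for w in words:
            self.insert(w)

    def insert(self, word):
        """Walk down from the root, creating missing nodes along the way."""
        node = self.root
        for c in word:
            if c not in node.children:
                node.children[c] = TrieNode()
            node = node.children[c]
        node.is_word = True

    def contains(self, word):
        """Return True if word was inserted; visits at most len(word) nodes."""
        node = self.root
        for c in word:
            node = node.children.get(c)
            if node is None:
                return False
        return node.is_word

    def words(self, node=None, prefix=""):
        """Yield the stored strings in alphabetical order."""
        node = self.root if node is None else node
        if node.is_word:
            yield prefix
        for c in sorted(node.children):
            yield from self.words(node.children[c], prefix + c)


S = ["bear", "bell", "bid", "bull", "buy", "sell", "stock", "stop"]
trie = Trie(S)
print(trie.contains("bell"))   # True
print(trie.contains("be"))     # False
```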

Word Matching with a Trie
- We insert the words of the text into a trie
- Each leaf stores the occurrences of the associated word in the text

Text (character positions 0 to 88, shown in four rows):
  positions 0-23:   see a bear? sell stock!
  positions 24-46:  see a bull? buy stock!
  positions 47-68:  bid stock! bid stock!
  positions 69-88:  hear the bell? stop!

Occurrences stored at the leaves:
  bear: 6
  bell: 78
  bid: 47, 58
  bull: 30
  buy: 36
  hear: 69
  see: 0, 24
  sell: 12
  stock: 17, 40, 51, 62
  stop: 84

[Figure: standard trie over the words above; the root has children b, h and s, and each leaf is annotated with its list of positions]
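The word-matching idea can be sketched in Python by storing, at the node where each word ends, the list of positions at which the word starts. The sketch assumes words are maximal runs of letters and that consecutive words are separated by a single space after any punctuation, which reproduces the positions listed on the slide. Unlike the slide's figure, it also indexes the short words "a" and "the". The function names and the dictionary layout are illustrative.

```python
import re


def build_word_trie(text):
    """Insert every word of text into a dict-based trie.

    Each node is a dict with "children" (char -> node) and "positions",
    the start indices of the word that ends at that node.
    """
    root = {"children": {}, "positions": []}
    for match in re.finditer(r"[A-Za-z]+", text):
        node = root
        for c in match.group().lower():
            node = node["children"].setdefault(c, {"children": {}, "positions": []})
        node["positions"].append(match.start())
    return root


def occurrences(root, word):
    """Return the start positions of word in the indexed text (empty list if absent)."""
    node = root
    for c in word:
        node = node["children"].get(c)
        if node is None:
            return []
    return node["positions"]


text = ("see a bear? sell stock! see a bull? buy stock! "
        "bid stock! bid stock! hear the bell? stop!")
index = build_word_trie(text)
print(occurrences(index, "stock"))  # [17, 40, 51, 62]
print(occurrences(index, "see"))    # [0, 24]
```

A query walks one node per character of the word, so its cost depends on the word length and not on the length of the text.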

Compressed Trie
- A compressed trie has internal nodes of degree at least two
- It is obtained from a standard trie by compressing chains of "redundant" nodes
[Figure: the standard trie for S = { bear, bell, bid, bull, buy, sell, stock, stop } and its compressed form, whose edges are labeled b, s; e, id, u, ell, to; ar, ll, ll, y, ck, p]
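Compressing chains of redundant nodes can be sketched as a recursive pass over a standard trie. The sketch assumes the trie is a plain nested dictionary (character -> subtrie, with an empty dictionary as a leaf) and that no string of S is a proper prefix of another, which holds for the example set; the function names are illustrative.

```python
def build_trie(words):
    """Build a standard trie as nested dicts; a leaf is an empty dict."""
    root = {}
    for w in words:
        node = root
        for c in w:
            node = node.setdefault(c, {})
    return root


def compress(node):
    """Merge every chain of single-child nodes into one string-labeled edge."""
    out = {}
    for label, child in node.items():
        while len(child) == 1:                      # redundant node: exactly one child
            c, grandchild = next(iter(child.items()))
            label += c                              # absorb it into the edge label
            child = grandchild
        out[label] = compress(child)
    return out


S = ["bear", "bell", "bid", "bull", "buy", "sell", "stock", "stop"]
compressed = compress(build_trie(S))
print(sorted(compressed))        # ['b', 's']
print(sorted(compressed["s"]))   # ['ell', 'to']
```

Every internal node of the result has at least two children, matching the definition on the slide.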
