CS 61A/CS 98-52
Mehrdad Niknami
University of California, Berkeley

Motivation

How would you find a substring inside a string? Something like this? (Is this good?)

    def find(string, pattern):
        n = len(string)
        m = len(pattern)
        for i in range(n - m + 1):      # try every starting offset
            is_match = True
            for j in range(m):          # compare character by character
                if pattern[j] != string[i + j]:
                    is_match = False
                    break
            if is_match:
                return i

What if you were looking for a pattern? Like an email address?

Background

Text processing has been at the heart of computer science since the 1950s:

- Regular languages: 1950s (Kleene)
- Context-free languages (CFLs): 1950s (Chomsky)
- Regular expressions (regexes) & automata: 1960s (Thompson)
- LR parsing (left-to-right, rightmost-derivation): 1960s (Knuth)
- Context-free parsers: 1960s (Earley)
- String searching (Knuth-Morris-Pratt, Boyer-Moore, etc.): 1970s
- Periods & critical factorizations: 1970s (Cesari-Vincent)
- [...]
- Critical factorizations in linear complexity: 2016 (Kosolobov)

Research is still ongoing... apparently more in Europe?

Most of you will probably graduate without learning string processing. Instead, you’ll learn how to process images and Big Data™. Which makes me sad. :(

You should know how to solve solved problems! Learn & use 100%-accurate algorithms before 85%-accurate ones!

O(mn)-time str.find(substring) is bad! You can do much better:

- Good algorithms finish in O(m + n) time & space (e.g. Z algorithm)
- The best/coolest finish in O(m + n) time but O(1) space!!!

So, today, I’ll teach a bit about string processing. :) You can learn more in CS 164, CS 176, etc. (Have fun!)

Formal Languages

In formal language theory:

- Alphabet: any set (usually a character set, like English or ASCII)
  → Often denoted by Σ
- Letter: an element in the given alphabet, e.g. “x”
- String (or word): finite sequence of letters, e.g. “hi”
- Language: a set of strings, e.g. {“a”, “aa”, “aaa”, ...}
  → Often denoted by L

We might omit the quotes/braces, so we’ll use the following denotations:

- ε: empty string (i.e., “”)
- ∅: empty language (i.e., empty set {})

Formal Grammars

Languages can be infinite, so we can’t always list all the strings in them. We therefore use grammars to describe languages.

For example, this grammar describes L = {“”, “hi”, “hihi”, ...}:

    S → T
    T → ε
    T → T "h" "i"

We call S a nonterminal symbol and “h” a terminal symbol (i.e., letter). Each line is a production rule, producing a sentential form on the right.

To make life easier, we’ll denote nonterminals and terminals by uppercase and lowercase respectively, omitting quotes and spaces when convenient. We then merge and simplify rules via the pipe (OR) symbol:

    S → S hi | ε

Regular Languages

The following are regular languages over the alphabet Σ:

- ∅
- {ε}
- {σ} ∀ σ ∈ Σ
- The union A ∪ B of any regular languages A and B over Σ
- The concatenation AB of any regular languages A and B over Σ
- The repetition (Kleene star) A∗ of any regular language A over Σ:
  A∗ = {ε} ∪ A ∪ AA ∪ AAA ∪ ...

Notice that all finite languages are regular, but not all infinite languages are. Regular languages do not allow arbitrary “nesting” (e.g. parens).

Regular Grammars

A regular grammar is a grammar in which every production has at most one nonterminal symbol on its right-hand side, and these nonterminals all appear at the left end or all appear at the right end.

In other words, this is a regular grammar:

    S → A b c
    A → S a | ε

This is not a regular grammar (but it is linear and context-free):

    S → A b c
    A → a S | ε

and neither is this (it is context-sensitive):

    S → S s | ε
    S s → S t

A language is regular iff it can be described by a regular grammar.
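As a rough illustration of the grammar S → S hi | ε from the Formal Grammars slide, the sketch below applies the production a bounded number of times and checks that every derived string also matches the regex (hi)*, which describes the same regular language. The helper name `derive` and the step bound are assumptions made for this example; they are not part of the lecture.

```python
import re

def derive(max_steps):
    """Return the strings derivable from S -> S hi | ε using at most
    max_steps applications of S -> S hi (hypothetical helper)."""
    strings = []
    suffix = ""                  # terminals accumulated to the right of S
    for _ in range(max_steps + 1):
        strings.append(suffix)   # finish the derivation with S -> ε
        suffix = "hi" + suffix   # or apply S -> S hi once more
    return strings

words = derive(3)
print(words)  # ['', 'hi', 'hihi', 'hihihi']

# Every derived word should also match the regular expression (hi)*.
assert all(re.fullmatch(r"(hi)*", w) for w in words)
```

Because the only nonterminal always sits at the left end of the right-hand side, each derivation step just prepends terminals to a growing suffix, which is why a simple loop is enough here.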

Regular Expressions

A regular expression is an easier way to describe a regular language. It’s essentially a pattern for describing a regular language.

For example, in [abcw-z]*(1+2|3)?4\?, we have:

- [abcw-z] (a character set) means “either a, b, c, w, x, y, or z”.
- Asterisk (a.k.a. “Kleene star”, a quantifier) means “zero or more”
- Plus (another quantifier) means “one or more”
- Question mark (another quantifier) means “at most one”
- Backslash (“escape”) before a special character means that character
- Pipe (the OR symbol |) means “either”, and parentheses group

So this matches zero or more of a, b, c, w, x, y, z, followed by either nothing, or 3, or one or more 1’s followed by 2, and then by 4 and a question mark.

Regular expressions (regexes) are equivalent to regular grammars [1], e.g. [abcw-z]*(1+2|3)?4\? is equivalent to

    S → Z 4 ?
    Z → Y 2 | X 3 | X
    Y → Y 1 | X 1
    X → X a | X b | X c | X w | X x | X y | X z | ε

where X derives [abcw-z]*, Y derives [abcw-z]*1+, and Z derives [abcw-z]*(1+2|3)?.

Here, the regex is more compact. Sometimes, the grammar is smaller.

[1] If you’ve seen backreferences: those are not technically valid in regexes.

Python has a regex engine to find text matching a regex:

    >>> import re
    >>> m = re.match('.* ([a-z0-9._-]+)@([a-z0-9._-]+)',
    ...              'hello cs61a@berkeley.edu cs98-52')
    >>> m
    <re.Match object; span=(0, 24), match='hello cs61a@berkeley.edu'>
    >>> m.groups()
    ('cs61a', 'berkeley.edu')

Notice that these could all be handled by re.match:

- Substring search (str.find)
- Subsequence search (re.match(".*b.*b", "abbc"))

The grep tool (from ed’s g/re/p = global/regex/print) does this for files.

Million-dollar question: How do you find text matching a regex?

Two steps:

1. Parse the regex (pattern) to “understand” its structure
2. Use the regex to parse the actual text (corpus)

It turns out that:

1. Step 1 is theoretically harder, but practically easier. (This can be done similarly to how you parsed Scheme.)
2. Step 2 is theoretically easier, but practically harder. This is because we need parsing the corpus to be fast.

How do you solve each step?

Both steps are often done using “recursive-descent”, similarly to how your Scheme parser parsed its input. Basically: try every possibility recursively. “Backtrack” on failure to try something else.

Problem: Recursive-descent can take exponential time! Example (where “a{3}” is shorthand for “aaa”):

    >>> re.match("(a?){25}a{25}", "a" * 25)

Can we hope to parse corpora in time linear in their lengths? Yes, using finite automata.

Finite Automata

A finite automaton (FA) consists of the following (example below) [2]:

- An input alphabet Σ ({0, 1} here)
- A finite set of states S ({s0, s1, s2} here)
- An initial state s0 ∈ S (s0 here)
- A set of accepting (or final) states F ⊆ S ({s2} here)
- A transition function δ : S × Σ → 2^S (the arrows here)

[Figure: state diagram with states s0, s1, s2 and transitions labeled 0 and 1]

[2] Note that an FA is not quite the same thing as a finite-state machine (FSM).

Finite automata are language recognizers: you feed a string as an input, and if the automaton accepts the input string, the string is in its language. [3]

⟹ Finite automata recognize regular languages, and nothing else!

Therefore, we can:

1. Convert regex pattern to FA
2. Feed corpus to FA in linear time!
3. ...
4. Profit!

But how can we do this?

[3] Pumping lemma: A long-enough input must contain a repeatable substring. (Why?)

Notice the transition function δ outputs a subset of states. In particular:

- In a deterministic finite automaton (DFA), the transition function always outputs a set with exactly one state (a singleton). I.e., in a DFA, the next state is determined by the input & current state. (I.e., every state has exactly 1 arrow leaving it for each possible input.)
- In a nondeterministic finite automaton (NFA), the above is not true.
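To make the definition of δ : S × Σ → 2^S concrete, here is a minimal sketch of feeding a string to an NFA by tracking the set of states it could currently be in. The automaton used below (it accepts binary strings ending in "11") is a hypothetical example chosen for illustration, not the one drawn on the slide, and the names `accepts`, `DELTA`, and `ACCEPTING` are likewise assumptions.

```python
def accepts(delta, start, accepting, text):
    """Simulate an NFA by tracking the set of states it could be in."""
    current = {start}
    for symbol in text:
        # Follow every arrow labeled `symbol` out of every current state;
        # a missing entry means "no arrow", i.e. the empty set of states.
        current = {nxt
                   for state in current
                   for nxt in delta.get((state, symbol), ())}
    # Accept if at least one reachable state is an accepting state.
    return bool(current & accepting)

# Hypothetical NFA over {0, 1} that accepts strings ending in "11".
DELTA = {
    ("s0", "0"): {"s0"},
    ("s0", "1"): {"s0", "s1"},   # nondeterministic: two possible next states
    ("s1", "1"): {"s2"},
}
ACCEPTING = {"s2"}

print(accepts(DELTA, "s0", ACCEPTING, "0111"))  # True
print(accepts(DELTA, "s0", ACCEPTING, "110"))   # False
```

Each input symbol is processed once, and the work per symbol is bounded by the size of the automaton rather than the length of the text, so for a fixed automaton the running time grows linearly with the input. No backtracking is involved.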
