CS 61A/CS 98-52, Mehrdad Niknami, University of California, Berkeley (PowerPoint presentation)



SLIDE 1

CS 61A/CS 98-52

Mehrdad Niknami

University of California, Berkeley

Mehrdad Niknami (UC Berkeley) CS 61A/CS 98-52 1 / 23


SLIDE 5

Motivation

How would you find a substring inside a string? Something like this? (Is this good?)

    def find(string, pattern):
        n = len(string)
        m = len(pattern)
        for i in range(n - m + 1):
            is_match = True
            for j in range(m):
                if pattern[j] != string[i + j]:
                    is_match = False
                    break
            if is_match:
                return i

What if you were looking for a pattern? Like an email address?

Mehrdad Niknami (UC Berkeley) CS 61A/CS 98-52 2 / 23
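A quick sanity check of the naive find above (note that, unlike str.find, this version returns None rather than -1 when nothing matches):

```python
def find(string, pattern):
    """Naive substring search: try every alignment of pattern (O(mn) time)."""
    n = len(string)
    m = len(pattern)
    for i in range(n - m + 1):
        is_match = True
        for j in range(m):
            if pattern[j] != string[i + j]:
                is_match = False
                break
        if is_match:
            return i  # index of the first occurrence

assert find("hello world", "world") == 6   # "world" starts at index 6
assert find("abc", "xyz") is None          # no match anywhere
assert find("abc", "") == 0                # empty pattern matches at 0
```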


SLIDE 17

Background

Text processing has been at the heart of computer science since the 1950s:

  • Regular languages: 1950s (Kleene)
  • Context-free languages (CFLs): 1950s (Chomsky)
  • Regular expressions (regexes) & automata: 1960s (Thompson)
  • LR parsing (left-to-right, rightmost-derivation): 1960s (Knuth)
  • Context-free parsers: 1960s (Earley)
  • String searching (Knuth-Morris-Pratt, Boyer-Moore, etc.): 1970s
  • Periods & critical factorizations: 1970s (Cesari-Vincent)
  • [...]
  • Critical factorizations in linear complexity: 2016 (Kosolobov)

Research is still ongoing... apparently more in Europe?

Mehrdad Niknami (UC Berkeley) CS 61A/CS 98-52 3 / 23


SLIDE 27

Background

Most of you will probably graduate without learning string processing. Instead, you’ll learn how to process images and Big Data™. Which makes me sad. :( You should know how to solve solved problems! Learn & use 100%-accurate algorithms before 85%-accurate ones!

O(mn)-time str.find(substring) is bad! You can do much better:

  • Good algorithms finish in O(m + n) time & space (e.g. the Z algorithm)
  • The best/coolest finish in O(m + n) time but O(1) space!!!

So, today, I’ll teach a bit about string processing. :) You can learn more in CS 164, CS 176, etc. (Have fun!)

Mehrdad Niknami (UC Berkeley) CS 61A/CS 98-52 4 / 23
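The O(m + n) option mentioned above can be sketched with the Z algorithm: compute, for every position of pattern + separator + text, the length of the longest common prefix with the whole string; any position where that length reaches len(pattern) marks a match. A minimal sketch (the "\x00" separator is an assumption: any character absent from both strings works):

```python
def z_array(s):
    """z[i] = length of the longest common prefix of s and s[i:]."""
    n = len(s)
    z = [0] * n
    if n:
        z[0] = n
    l = r = 0  # [l, r) is the rightmost prefix-match window found so far
    for i in range(1, n):
        if i < r:
            z[i] = min(r - i, z[i - l])  # reuse what the window already tells us
        while i + z[i] < n and s[z[i]] == s[i + z[i]]:
            z[i] += 1                    # extend by direct comparison
        if i + z[i] > r:
            l, r = i, i + z[i]
    return z

def z_find(string, pattern):
    """O(m + n) substring search; returns the first match index or None."""
    m = len(pattern)
    if m == 0:
        return 0
    z = z_array(pattern + "\x00" + string)
    for i in range(m + 1, len(z)):
        if z[i] >= m:
            return i - m - 1  # convert back to an index into string
    return None

assert z_find("hello world", "world") == 6
```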


SLIDE 35

Formal Languages

In formal language theory:

  • Alphabet: any set (usually a character set, like English or ASCII) → often denoted by Σ
  • Letter: an element of the given alphabet, e.g. “x”
  • String (or word): a finite sequence of letters, e.g. “hi”
  • Language: a set of strings, e.g. {“a”, “aa”, “aaa”, ...} → often denoted by L

We might omit the quotes/braces, and we’ll use the following notation:

  • ε: the empty string (i.e., “”)
  • ∅: the empty language (i.e., the empty set {})

Mehrdad Niknami (UC Berkeley) CS 61A/CS 98-52 5 / 23
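These definitions map directly onto Python values, which makes the ε-versus-∅ distinction concrete. A small illustration with finite sets (the variable names are mine, not standard notation):

```python
# An alphabet is just a set of letters
SIGMA = {"a", "b"}

# Strings are ordinary Python strings; "" plays the role of epsilon
EPSILON = ""

# Languages are sets of strings
EMPTY_LANGUAGE = set()     # the empty language: contains no strings at all
L = {"a", "aa", "aaa"}     # a finite slice of {"a", "aa", "aaa", ...}

# {epsilon} is NOT the empty language: it contains exactly one string
assert EPSILON in {EPSILON}
assert EPSILON not in EMPTY_LANGUAGE
assert len({EPSILON}) == 1 and len(EMPTY_LANGUAGE) == 0
```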


SLIDE 42

Formal Grammars

Languages can be infinite, so we can’t always list all the strings in them. We therefore use grammars to describe languages. For example, this grammar describes L = {“”, “hi”, “hihi”, ...}:

    S → T
    T → ε
    T → T "h" "i"

We call S a nonterminal symbol and “h” a terminal symbol (i.e., letter). Each line is a production rule, producing a sentential form on the right. To make life easier, we’ll denote nonterminals by uppercase and terminals by lowercase, omitting quotes and spaces when convenient. We then merge and simplify rules via the pipe (OR) symbol:

    S → S hi | ε

Mehrdad Niknami (UC Berkeley) CS 61A/CS 98-52 6 / 23
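The grammar S → S hi | ε can be turned into a tiny recognizer by unrolling its recursion: a string is in L exactly when it is zero or more copies of "hi". A minimal sketch (the function name is mine):

```python
def in_hihi_language(s):
    """Recognize L = {"", "hi", "hihi", ...} described by S -> S hi | epsilon."""
    # Repeatedly undo one application of S -> S hi from the right end;
    # success means we reduced the string all the way down to epsilon.
    while s.endswith("hi"):
        s = s[:-2]
    return s == ""

assert in_hihi_language("hihi")
assert not in_hihi_language("hih")
```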


SLIDE 52

Regular Languages

The following are regular languages over the alphabet Σ:

  • ∅
  • {ε}
  • {σ} ∀ σ ∈ Σ
  • The union A ∪ B of any regular languages A and B over Σ
  • The concatenation AB of any regular languages A and B over Σ
  • The repetition (Kleene star) A∗ of any regular language A over Σ:
    A∗ = {ε} ∪ A ∪ AA ∪ AAA ∪ ...

Notice that all finite languages are regular, but not all infinite languages are. Regular languages do not allow arbitrary “nesting” (e.g. parentheses).

Mehrdad Niknami (UC Berkeley) CS 61A/CS 98-52 7 / 23
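Since languages are just sets of strings, the closure operations above can be played with directly in Python. The Kleene star is infinite, so this sketch truncates it at a fixed repetition count (the helper names are mine):

```python
def concat(A, B):
    """Concatenation AB = { a + b : a in A, b in B }."""
    return {a + b for a in A for b in B}

def star(A, max_reps):
    """Finite approximation of the Kleene star A* = {eps} u A u AA u ...,
    truncated after max_reps repetitions (A* itself is usually infinite)."""
    result = {""}            # start with {epsilon}
    power = {""}             # power holds A concatenated with itself k times
    for _ in range(max_reps):
        power = concat(power, A)
        result |= power
    return result

A, B = {"a"}, {"b"}
assert A | B == {"a", "b"}                   # union
assert concat(A, B) == {"ab"}                # concatenation
assert star(A, 3) == {"", "a", "aa", "aaa"}  # truncated Kleene star
```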


SLIDE 58

Regular Grammars

A regular grammar is a grammar in which all productions have at most one nonterminal symbol, all of which appear on either the left or the right.

In other words, this is a regular grammar:

    S → A b c
    A → S a | ε

This is not a regular grammar (but it is linear and context-free):

    S → A b c
    A → a S | ε

and neither is this (it is context-sensitive):

    S → S s | ε
    S s → S t

A language is regular iff it can be described by a regular grammar.

Mehrdad Niknami (UC Berkeley) CS 61A/CS 98-52 8 / 23
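Expanding the first grammar above by hand (S ⇒ A b c ⇒ bc, or S ⇒ A b c ⇒ S a b c ⇒ ... ) suggests it generates exactly the regular language bc(abc)*. That reading can be cross-checked against Python's regex engine (the derive helper and the bc(abc)* characterization are my own expansion, not from the slides):

```python
import re

def derive(rounds):
    """Derive a string from S -> A b c, A -> S a | epsilon, applying
    A -> S a the given number of times before ending with A -> epsilon."""
    s = "bc"                  # base case: S => A bc => bc
    for _ in range(rounds):
        s = s + "abc"         # each A -> S a round turns S into S + "a" + "bc"
    return s

# Every derivable string should match the regex for bc(abc)*
for rounds in range(5):
    assert re.fullmatch(r"bc(abc)*", derive(rounds))
```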


SLIDE 69

Regular Expressions

A regular expression is an easier way to describe a regular language: essentially, a pattern. For example, in [abcw-z]*(1+2|3)?4\?, we have:

  • [abcw-z] (a character set) means “either a, b, c, w, x, y, or z”
  • Asterisk (a.k.a. “Kleene star”, a quantifier) means “zero or more”
  • Plus (another quantifier) means “one or more”
  • Question mark (another quantifier) means “at most one”
  • Backslash (“escape”) before a special character means that literal character
  • Pipe (the OR symbol |) means “either”, and parentheses group

So this matches zero or more of a, b, c, w, x, y, z; followed by either nothing, or 3, or one or more 1’s followed by 2; followed by 4 and a question mark.

Mehrdad Niknami (UC Berkeley) CS 61A/CS 98-52 9 / 23
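That reading can be checked directly with Python's re module (fullmatch requires the whole string to match; the sample strings are mine):

```python
import re

# The example regex from the slide: zero or more of [abcw-z], then
# optionally either "3" or one-or-more "1"s followed by "2", then
# a literal "4" and a literal "?".
PATTERN = re.compile(r"[abcw-z]*(1+2|3)?4\?")

assert PATTERN.fullmatch("4?")          # every quantified part can be empty
assert PATTERN.fullmatch("ab4?")        # [abcw-z]* matched "ab"
assert PATTERN.fullmatch("xyz34?")      # the optional group matched "3"
assert PATTERN.fullmatch("1124?")       # the optional group matched "112"
assert not PATTERN.fullmatch("24?")     # "2" alone needs at least one "1" first
assert not PATTERN.fullmatch("ab4")     # the trailing "?" is a literal here
```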


SLIDE 73

Regular Expressions

Regular expressions (regexes) are equivalent to regular grammars¹. For example, [abcw-z]*(1+2|3)?4\? is equivalent to:

    S → Z 4 ?
    Z → Y 2 | X 3 | ε
    Y → Y 1 | X 1
    X → X a | X b | X c | X w | X x | X y | X z | ε

(In the slide, the subexpressions are labeled X, Y, and Z: X generates [abcw-z]*, Y generates [abcw-z]*1+, and Z generates [abcw-z]*(1+2|3)?.) Here, the regex is more compact. Sometimes, the grammar is smaller.

¹If you’ve seen backreferences: those are not technically valid in regexes.

Mehrdad Niknami (UC Berkeley) CS 61A/CS 98-52 10 / 23


SLIDE 79

Regular Expressions

Python has a regex engine to find text matching a regex:

    >>> import re
    >>> m = re.match('.* ([a-z0-9._-]+)@([a-z0-9._-]+)',
    ...              'hello cs61a@berkeley.edu cs98-52')
    >>> m
    <re.Match object; span=(0, 24), match='hello cs61a@berkeley.edu'>
    >>> m.groups()
    ('cs61a', 'berkeley.edu')

Notice that these could all be handled by re.match:

  • Substring search (str.find)
  • Subsequence search (re.match(".*b.*b", "abbc"))

The grep tool (from ed’s g/re/p = global/regex/print) does this for files.

Mehrdad Niknami (UC Berkeley) CS 61A/CS 98-52 11 / 23
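The grep idea above is small enough to sketch in a few lines of Python (the function name and output shape are illustrative, not grep's actual interface):

```python
import re

def grep(pattern, lines):
    """Yield (line_number, line) for every line containing a regex match,
    like a bare-bones grep -n over any iterable of lines."""
    compiled = re.compile(pattern)
    for number, line in enumerate(lines, start=1):
        if compiled.search(line):   # search scans the whole line, unlike match
            yield number, line

lines = ["spam", "write cs61a@berkeley.edu", "eggs"]
assert list(grep(r"[a-z0-9._-]+@[a-z0-9._-]+", lines)) == [
    (2, "write cs61a@berkeley.edu"),
]
```

Reading from a file is then just `grep(pattern, open(path))`, since file objects iterate over their lines.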


SLIDE 89

Regular Expressions

Million-dollar question: How do you find text matching a regex? Two steps:

  1. Parse the regex (pattern) to “understand” its structure
  2. Use the regex to parse the actual text (corpus)

It turns out that:

  1. Step 1 is theoretically harder, but practically easier. (This can be done similarly to how you parsed Scheme.)
  2. Step 2 is theoretically easier, but practically harder. This is because we need parsing the corpus to be fast.

Mehrdad Niknami (UC Berkeley) CS 61A/CS 98-52 12 / 23
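Step 1 can be sketched as a recursive-descent parser, much like the Scheme parser mentioned above, over a deliberately tiny regex dialect (just literals, |, *, and parentheses; real engines handle far more, and the nested-tuple AST shape here is my own choice):

```python
def parse_regex(pattern):
    """Recursive-descent parser for a mini regex language, producing an AST
    of nested tuples: ("or", l, r), ("cat", l, r), ("star", x), ("lit", c),
    and ("empty",) for the empty string."""
    pos = 0

    def peek():
        return pattern[pos] if pos < len(pattern) else None

    def eat(ch):
        nonlocal pos
        assert peek() == ch, f"expected {ch!r} at position {pos}"
        pos += 1

    def parse_alt():                      # alt -> cat ("|" cat)*
        node = parse_cat()
        while peek() == "|":
            eat("|")
            node = ("or", node, parse_cat())
        return node

    def parse_cat():                      # cat -> rep*
        node = ("empty",)                 # matches the empty string
        while peek() not in (None, "|", ")"):
            node = ("cat", node, parse_rep())
        return node

    def parse_rep():                      # rep -> atom "*"*
        node = parse_atom()
        while peek() == "*":
            eat("*")
            node = ("star", node)
        return node

    def parse_atom():                     # atom -> "(" alt ")" | letter
        if peek() == "(":
            eat("(")
            node = parse_alt()
            eat(")")
            return node
        ch = peek()
        eat(ch)
        return ("lit", ch)

    tree = parse_alt()
    assert pos == len(pattern), "trailing characters after the regex"
    return tree

assert parse_regex("a") == ("cat", ("empty",), ("lit", "a"))
assert parse_regex("ab|c*")[0] == "or"
```

Step 2 then walks this AST over the corpus, which is exactly where the speed problems discussed next come from.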


slide-98
SLIDE 98

Regular Expressions

How do you solve each step? Both steps are often done using “recursive descent”, similarly to how your Scheme parser parsed its input. Basically: try every possibility recursively, and “backtrack” on failure to try something else.

Problem: recursive descent can take exponential time! Example (where “a{3}” is shorthand for “aaa”):

>>> re.match("(a?){25}a{25}", "a" * 25)

Can we hope to parse corpora in time linear in their lengths? Yes, using finite automata.

Mehrdad Niknami (UC Berkeley) CS 61A/CS 98-52 13 / 23
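To see the blow-up concretely, here is a sketch of the pathological pattern scaled down to a parameter n; in CPython's backtracking `re` engine, each extra character roughly doubles the running time.

```python
import re
import time

def pathological(n):
    """Match "(a?){n}a{n}" against "a" * n; return (matched?, seconds).

    The string IS in the language (let every "a?" match empty), but a
    backtracking engine may try exponentially many ways to split the a's
    before finding the one that works.
    """
    pattern = "(a?){%d}a{%d}" % (n, n)
    start = time.perf_counter()
    m = re.match(pattern, "a" * n)
    return m is not None, time.perf_counter() - start

ok, seconds = pathological(10)   # small n: fast; try n = 24, 26, 28 and watch it grow
```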

slide-105
SLIDE 105

Finite Automata

A finite automaton (FA) consists of the following (example below)²:

• An input alphabet Σ ({0, 1} here)
• A finite set of states S ({s0, s1, s2} here)
• An initial state s0 ∈ S (s0 here)
• A set of accepting (or final) states F ⊆ S ({s2} here)
• A transition function δ : S × Σ → 2^S (the arrows here)

(Diagram: states s0, s1, s2, with arrows labeled 1 between them.)

²Note that an FA is not quite the same thing as a finite-state machine (FSM).

Mehrdad Niknami (UC Berkeley) CS 61A/CS 98-52 14 / 23
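The components above can be written down directly as Python data. The concrete transitions below are an illustrative assumption loosely following the slide's diagram (states s0, s1, s2 with arrows on input "1"), not a faithful copy of it.

```python
# The FA components as plain Python data (transitions are assumed for illustration).
alphabet = {"0", "1"}
states = {"s0", "s1", "s2"}
start = "s0"
accepting = {"s2"}

# delta : S x Sigma -> 2^S, i.e. set-valued: this is the general (NFA) form.
delta = {
    ("s0", "1"): {"s1"},
    ("s1", "1"): {"s2"},
    ("s2", "1"): {"s2"},
}

def step(current, symbol):
    """One transition: the union of delta over every state we might be in."""
    result = set()
    for s in current:
        result |= delta.get((s, symbol), set())
    return result
```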

slide-110
SLIDE 110

Finite Automata

Notice the transition function δ outputs a subset of states.

In a deterministic finite automaton (DFA), the transition function always outputs a set with exactly one state (a singleton).

i.e., in a DFA, the next state is determined by the input & current state. (i.e., every state has exactly 1 arrow leaving it for each possible input.)

In a nondeterministic finite automaton (NFA), the above is not true.

Mehrdad Niknami (UC Berkeley) CS 61A/CS 98-52 15 / 23

slide-119
SLIDE 119

Finite Automata

Finite automata are language recognizers: you feed a string as input, and if the automaton accepts the string, the string is in its language.³

In particular: finite automata recognize regular languages, and nothing else!

Therefore, we can:

1. Convert the regex pattern to an FA
2. Feed the corpus to the FA in linear time!
3. ...
4. Profit!

But how can we do this?

³Pumping lemma: a long-enough input must contain a repeatable substring. (Why?)

Mehrdad Niknami (UC Berkeley) CS 61A/CS 98-52 16 / 23
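Step 2 is then a single left-to-right pass: one table lookup per input character. A minimal sketch, assuming a hypothetical two-state DFA over {0, 1} that accepts binary strings ending in "1":

```python
# Hypothetical DFA accepting binary strings that end in "1".
dfa = {
    ("no", "0"): "no",  ("no", "1"): "yes",
    ("yes", "0"): "no", ("yes", "1"): "yes",
}

def accepts(string, start="no", accepting=frozenset({"yes"})):
    """Run the corpus through the DFA: O(n) for an input of length n."""
    state = start
    for ch in string:              # one dict lookup per character
        state = dfa[(state, ch)]
    return state in accepting
```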

slide-133
SLIDE 133

Finite Automata from Regular Expressions

Consider: (a|b)*(1+2|3). Ask: where in the pattern can we be?

1. s0 = •(a|b)*(1+2|3)
      = •(•a|•b)*(1+2|3)
      = •(•a|•b)*•(1+2|3)
      = •(•a|•b)*•(•1+2|•3)
2. s1 = (a|b)*(•1+•2|3)
3. s2 = (a|b)*(1+2•|3)
      = (a|b)*(1+2•|3)•
   s2 = (a|b)*(1+2|3•)
      = (a|b)*(1+2|3•)•
   s2 = (a|b)*(1+2•|3•)••

(Diagram: states s0, s1, s2, with transitions labeled a, b, 1, 2, 3.)

(Expanding a state to its equivalents is a mathematical closure operation.)

Mehrdad Niknami (UC Berkeley) CS 61A/CS 98-52 17 / 23
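That closure operation can be sketched generically: from a set of NFA states, keep following ε (empty) moves until the set stops growing, and re-close after every input symbol. The toy ε-moves and transitions below are hypothetical, not the slide's automaton.

```python
eps = {0: {1, 2}}                        # epsilon moves: hypothetical toy NFA
trans = {(1, "a"): {1}, (2, "b"): {2}}   # labeled moves

def closure(states):
    """Smallest superset of `states` closed under epsilon moves."""
    stack, result = list(states), set(states)
    while stack:
        for t in eps.get(stack.pop(), set()):
            if t not in result:
                result.add(t)
                stack.append(t)
    return result

def step(states, symbol):
    """Follow one input symbol from every current state, then re-close."""
    moved = set()
    for s in states:
        moved |= trans.get((s, symbol), set())
    return closure(moved)
```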

slide-138
SLIDE 138

Finite Automata from Regular Expressions

We created a deterministic finite automaton (DFA) from a regex! It can find regular patterns (substrings, subsequences, etc.) in linear time.

However, there is no such thing as a free lunch. What is the caveat?

Mehrdad Niknami (UC Berkeley) CS 61A/CS 98-52 18 / 23

slide-150
SLIDE 150

Finite Automata from Regular Expressions

Caveat: the number of states |S| can be exponential in the size m of the pattern. (This is sometimes referred to as the size of the state space.)

Why? Because we compute subsets of locations in the pattern, and we could encounter around 2^m subsets for a pattern of length m.

Solution?

• DFA minimization: always merge states that behave identically.
  → We already did this. It often works well.
• Boring: use an NFA instead of a DFA. (Track state items separately.)
  → Guarantees O(m) memory usage, but running time is O(mn).
• Clever: build the DFA lazily, as needed by the input.
  → Memory usage becomes O(m + n), but running time approaches O(n).

Mehrdad Niknami (UC Berkeley) CS 61A/CS 98-52 19 / 23
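The “clever” option can be sketched by memoizing DFA states (frozensets of NFA states) only when the input first reaches them, so only the subsets the corpus actually visits are ever built. The toy NFA below, which accepts strings ending in "ab", is an assumption for illustration.

```python
# Hypothetical NFA over {"a", "b"} accepting strings that end in "ab".
nfa_trans = {(0, "a"): {0, 1}, (0, "b"): {0}, (1, "b"): {2}}
nfa_start, nfa_accept = frozenset({0}), {2}

dfa_cache = {}   # frozenset of NFA states -> {symbol: next frozenset}

def dfa_step(subset, symbol):
    """Look up the DFA transition for this subset, building it on first use."""
    row = dfa_cache.setdefault(subset, {})
    if symbol not in row:
        nxt = set()
        for s in subset:
            nxt |= nfa_trans.get((s, symbol), set())
        row[symbol] = frozenset(nxt)   # memoized: built at most once per (subset, symbol)
    return row[symbol]

def accepts(string):
    state = nfa_start
    for ch in string:
        state = dfa_step(state, ch)
    return bool(state & nfa_accept)
```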

slide-159
SLIDE 159

Conclusion

Automata are extremely powerful! They can do many other cool things:

• Levenshtein automata can recognize corpora that are k “edit distances” (insertions, deletions, or mutations) away from a pattern.
• When given a stack, LR automata can parse context-free languages (like many programming languages) in linear time (CS 164).
• Büchi automata, which allow infinite-length input strings, are used for formal verification of computer programs.
• Finite-state machines (very similar) are widely used in digital design:
  • Used in engineering to prove digital systems work as intended
  • Used to optimize power consumption, logic circuitry, etc.

Mehrdad Niknami (UC Berkeley) CS 61A/CS 98-52 20 / 23
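The “edit distance” in the first bullet is the classic Levenshtein distance. A Levenshtein automaton itself is more involved, but the distance it is built around can be sketched with the standard dynamic program:

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum number of insertions, deletions,
    or mutations needed to turn string a into string b."""
    prev = list(range(len(b) + 1))      # distances from "" to each prefix of b
    for i, ca in enumerate(a, 1):
        cur = [i]                       # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # delete ca
                           cur[j - 1] + 1,               # insert cb
                           prev[j - 1] + (ca != cb)))    # mutate (or keep)
        prev = cur
    return prev[-1]
```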

slide-166
SLIDE 166

Conclusion

This is just the tip of the iceberg for string algorithms (and automata). Languages, grammars, and automata are also used in computational linguistics, computational biology/genomics (DNA alignment/matching)...

It is extremely easy to graduate and avoid languages & automata. But they provide the keys for solving many otherwise difficult problems. You can see more in EE/CS 144, 149, 151, 164, 172, 176, 219C, 291E...

∼ Related Words of Wisdom ∼

Minimizing the number of states in your design (e.g. factoring out duplicate data) helps keep designs clean & bug-free. One reason: single source of truth. If it can’t be wrong, it won’t be.

Kleeneliness is next to Gödeliness.

Mehrdad Niknami (UC Berkeley) CS 61A/CS 98-52 21 / 23

slide-177
SLIDE 177

...Calculus?

Bonus: What language does S describe? S → S a | ε

Hmm, union and concatenation sure look like addition & multiplication...

S = Sa + ε
Sε = Sa + ε
S(ε − a) = ε
S = ε / (ε − a)

...wait, what? Oh, right: Taylor series...

S = ε + a + a² + ...

:-) Please don’t ask...

Mehrdad Niknami (UC Berkeley) CS 61A/CS 98-52 22 / 23

slide-178
SLIDE 178

Conclusion

Thank you!

Mehrdad Niknami (UC Berkeley) CS 61A/CS 98-52 23 / 23