Speech and Language Processing: Lecture 2, Chapter 2 of SLP

SLIDE 1

Speech and Language Processing

Lecture 2 Chapter 2 of SLP

SLIDE 2

7/29/08

Speech and Language Processing - Jurafsky and Martin

Today

  • Finite-state methods
SLIDE 3

Regular Expressions and Text Searching

  • Everybody does it
      • Emacs, vi, perl, grep, etc.
  • Regular expressions are a compact textual representation of a set of strings representing a language.

SLIDE 4

Example

  • Find all the instances of the word “the” in a text.
      • /the/
      • /[tT]he/
      • /\b[tT]he\b/
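The three patterns above can be tried out directly. The following is a minimal sketch using Python's `re` module; the sample sentence and the names `text`, `patterns`, and `counts` are made up for illustration and are not from the lecture.

```python
import re

# Toy sentence chosen so that each pattern behaves differently (an assumption of this sketch).
text = "The other day the cat sat there; then the dog left."

patterns = [r"the", r"[tT]he", r"\b[tT]he\b"]

# Count the non-overlapping matches each pattern finds in the same text.
counts = {p: len(re.findall(p, text)) for p in patterns}

for p, n in counts.items():
    print(f"/{p}/ matched {n} time(s)")
```

The first pattern misses the capitalized "The" and also fires inside "other", "there", and "then"; the second picks up "The" but still fires inside longer words; the third uses word boundaries to keep only the standalone word.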

SLIDE 5

Errors

  • The process we just went through was based on fixing two kinds of errors
      • Matching strings that we should not have matched (there, then, other)
          • False positives (Type I)
      • Not matching things that we should have matched (The)
          • False negatives (Type II)
SLIDE 6

Errors

  • We’ll be telling the same story for many tasks, all semester. Reducing the error rate for an application often involves two antagonistic efforts:
      • Increasing accuracy, or precision (minimizing false positives)
      • Increasing coverage, or recall (minimizing false negatives)
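To make precision and recall concrete, here is a small sketch that scores the three "the" patterns against a gold standard. The sample sentence, the gold-standard rule (whole words equal to "the", ignoring case), and the helper name `precision_recall` are all assumptions made for this illustration.

```python
import re

text = "The other day the cat sat there; then the dog left."

# Gold standard (an assumption of this sketch): start offsets of whole
# words that equal "the", ignoring case.
gold = {m.start() for m in re.finditer(r"[A-Za-z]+", text)
        if m.group().lower() == "the"}

def precision_recall(pattern):
    """Return (precision, recall) of a pattern's match offsets against gold."""
    predicted = {m.start() for m in re.finditer(pattern, text)}
    true_pos = len(predicted & gold)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    return precision, recall

scores = {p: precision_recall(p) for p in (r"the", r"[tT]he", r"\b[tT]he\b")}

for p, (prec, rec) in scores.items():
    print(f"/{p}/ precision={prec:.2f} recall={rec:.2f}")
```

On this toy text, widening the pattern to `[tT]he` raises recall, and adding word boundaries then raises precision. On real tasks the two usually pull against each other, which is the point of the slide.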

SLIDE 7

Finite State Automata

  • Regular expressions can be viewed as a textual way of specifying the structure of finite-state automata.
  • FSAs and their probabilistic relatives are at the core of much of what we’ll be doing all semester.
  • They also capture significant aspects of what linguists say we need for morphology and parts of syntax.

SLIDE 8

FSAs as Graphs

  • Let’s start with the sheep language from Chapter 2
      • /baa+!/

SLIDE 9

Sheep FSA

  • We can say the following things about this machine
      • It has 5 states
      • b, a, and ! are in its alphabet
      • q0 is the start state
      • q4 is an accept state
      • It has 5 transitions

SLIDE 10

But Note

  • There are other machines that correspond to this same language
  • More on this one later
SLIDE 11

More Formally

  • You can specify an FSA by enumerating the following things.
      • The set of states: Q
      • A finite alphabet: Σ
      • A start state
      • A set of accept/final states
      • A transition function that maps Q × Σ to Q
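The five components listed above map naturally onto a small record type. The sketch below writes the sheep machine for /baa+!/ this way; the class name `FSA`, its field names, and the state labels as strings are choices made for this illustration, not notation from the lecture.

```python
from dataclasses import dataclass

@dataclass
class FSA:
    states: set      # Q
    alphabet: set    # Sigma
    start: str       # the start state
    accept: set      # the accept/final states
    delta: dict      # transition function: (state, symbol) -> state

# The deterministic sheep machine: b, then a, a, any number of further a's, then !
sheep = FSA(
    states={"q0", "q1", "q2", "q3", "q4"},
    alphabet={"b", "a", "!"},
    start="q0",
    accept={"q4"},
    delta={
        ("q0", "b"): "q1",
        ("q1", "a"): "q2",
        ("q2", "a"): "q3",
        ("q3", "a"): "q3",   # self-loop for the extra a's
        ("q3", "!"): "q4",
    },
)

print(len(sheep.states), "states,", len(sheep.delta), "transitions")
```

Storing the transition function as a dictionary keyed on (state, symbol) leaves undefined moves implicit: a missing key simply means there is no legal transition.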

SLIDE 12

About Alphabets

  • Don’t take the term alphabet too narrowly; it just means we need a finite set of symbols in the input.
  • These symbols can and will stand for bigger objects that can have internal structure.

SLIDE 13

Dollars and Cents

SLIDE 14

Yet Another View

  • The guts of FSAs can ultimately be represented as tables

      State   b   a     !   ε
      0       1
      1           2
      2           2,3
      3                 4
      4

If you’re in state 1 and you’re looking at an a, go to state 2

SLIDE 15

Recognition

  • Recognition is the process of determining if a string should be accepted by a machine
  • Or… it’s the process of determining if a string is in the language we’re defining with the machine
  • Or… it’s the process of determining if a regular expression matches a string
  • Those all amount to the same thing in the end
SLIDE 16

Recognition

  • Traditionally (Turing’s notion), this process is depicted with a tape.

SLIDE 17

Recognition

  • Simply a process of starting in the start state
  • Examining the current input
  • Consulting the table
  • Going to a new state and updating the tape pointer.
  • Until you run out of tape.
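The steps above can be sketched as a short table-driven loop. This is one possible rendering in Python, not the textbook's pseudocode; the function name `d_recognize`, its argument order, and the dictionary encoding of the table are assumptions of this sketch.

```python
def d_recognize(tape, table, start, accept):
    """Return True if the deterministic machine in `table` accepts `tape`."""
    state = start                            # start in the start state
    for symbol in tape:                      # examine the current input
        nxt = table.get((state, symbol))     # consult the table
        if nxt is None:                      # no legal move: reject
            return False
        state = nxt                          # go to the new state; the loop advances the tape
    return state in accept                   # out of tape: accept only in an accept state

# Table for the deterministic sheep machine /baa+!/
SHEEP = {
    ("q0", "b"): "q1",
    ("q1", "a"): "q2",
    ("q2", "a"): "q3",
    ("q3", "a"): "q3",
    ("q3", "!"): "q4",
}

for s in ["baa!", "baaaa!", "ba!", "baa"]:
    print(s, d_recognize(s, SHEEP, "q0", {"q4"}))
```

To recognize a different language with the same function, only the table, start state, and accept set need to change.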
SLIDE 18

D-Recognize

SLIDE 19

Key Points

  • Deterministic means that at each point in processing there is always one unique thing to do (no choices).
  • D-recognize is a simple table-driven interpreter
  • The algorithm is universal for all unambiguous regular languages.
      • To change the machine, you simply change the table.

SLIDE 20

Key Points

  • Crudely therefore… matching strings with regular expressions (à la Perl, grep, etc.) is a matter of
      • translating the regular expression into a machine (a table) and
      • passing the table and the string to an interpreter

SLIDE 21

Recognition as Search

  • You can view this algorithm as a trivial kind of state-space search.
  • States are pairings of tape positions and state numbers.
  • Operators are compiled into the table
  • Goal state is a pairing with the end of tape position and a final accept state
  • It is trivial because?
SLIDE 22

Generative Formalisms

  • Formal languages are sets of strings composed of symbols from a finite set of symbols.
  • Finite-state automata define formal languages (without having to enumerate all the strings in the language)
  • The term generative is based on the view that you can run the machine as a generator to get strings from the language.

SLIDE 23

Generative Formalisms

  • FSAs can be viewed from two perspectives:
      • Acceptors that can tell you if a string is in the language
      • Generators to produce all and only the strings in the language
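The generator view can be illustrated by walking the same transition table breadth-first and emitting a string each time an accept state is reached. This is only a sketch under assumptions: the function name `generate` and the dictionary encoding of the sheep machine are invented here, and because the language is infinite the caller has to cut the stream off.

```python
from collections import deque
from itertools import islice

# Deterministic sheep machine /baa+!/ as (state, symbol) -> state
SHEEP = {
    ("q0", "b"): "q1",
    ("q1", "a"): "q2",
    ("q2", "a"): "q3",
    ("q3", "a"): "q3",
    ("q3", "!"): "q4",
}

def generate(table, start, accept):
    """Yield strings of the machine's language, shortest first."""
    queue = deque([(start, "")])             # (current state, string emitted so far)
    while queue:
        state, prefix = queue.popleft()
        if state in accept:
            yield prefix
        for (src, symbol), dst in table.items():
            if src == state:
                queue.append((dst, prefix + symbol))

# Take only the first few strings of the infinite language.
first_three = list(islice(generate(SHEEP, "q0", {"q4"}), 3))
print(first_three)
```

The table is the same one an acceptor would use; only the direction of use changes.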

SLIDE 24

Non-Determinism

SLIDE 25

Non-Determinism cont.

  • Yet another technique
      • Epsilon transitions
      • Key point: these transitions do not examine or advance the tape during recognition

SLIDE 26

Equivalence

  • Non-deterministic machines can be converted to deterministic ones with a fairly simple construction
  • That means that they have the same power; non-deterministic machines are not more powerful than deterministic ones in terms of the languages they can accept
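One standard way to do the conversion is the subset construction, in which each state of the deterministic machine stands for a set of states of the non-deterministic one. The sketch below assumes a machine without epsilon transitions; the function name `determinize` and the dictionary encoding are choices made for this illustration.

```python
def determinize(nfa, start, accept, alphabet):
    """Subset construction for an NFA with no epsilon transitions.

    `nfa` maps (state, symbol) -> set of next states.
    Returns (dfa_table, dfa_start, dfa_accept), where DFA states are frozensets.
    """
    start_set = frozenset([start])
    dfa = {}
    seen = {start_set}
    todo = [start_set]
    while todo:
        current = todo.pop()
        for symbol in alphabet:
            # Union of everywhere any member state can go on this symbol.
            target = frozenset(t for s in current for t in nfa.get((s, symbol), ()))
            if not target:
                continue                     # no move: leave the entry undefined
            dfa[(current, symbol)] = target
            if target not in seen:
                seen.add(target)
                todo.append(target)
    # A set-state accepts if it contains at least one original accept state.
    dfa_accept = {s for s in seen if s & accept}
    return dfa, start_set, dfa_accept

# Non-deterministic sheep machine: in q2, an 'a' may loop or move on to q3.
ND_SHEEP = {
    ("q0", "b"): {"q1"},
    ("q1", "a"): {"q2"},
    ("q2", "a"): {"q2", "q3"},
    ("q3", "!"): {"q4"},
}

dfa, dfa_start, dfa_accept = determinize(ND_SHEEP, "q0", {"q4"}, "ba!")
for (src, sym), dst in dfa.items():
    print(sorted(src), sym, "->", sorted(dst))
```

For this machine the result again has five transitions; in general the deterministic machine can have exponentially more states than the original.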

SLIDE 27

ND Recognition

  • Two basic approaches (used in all major implementations of regular expressions, see Friedl 2006)
  • 1. Either take a ND machine and convert it to a D machine and then do recognition with that.
  • 2. Or explicitly manage the process of recognition as a state-space search (leaving the machine as is).

SLIDE 28

Non-Deterministic Recognition: Search

  • In a ND FSA there exists at least one path through the machine for a string that is in the language defined by the machine.
  • But not all paths directed through the machine for an accept string lead to an accept state.
  • No paths through the machine lead to an accept state for a string not in the language.

SLIDE 29

Non-Deterministic Recognition

  • So success in non-deterministic recognition occurs when a path is found through the machine that ends in an accept.
  • Failure occurs when all of the possible paths for a given string lead to failure.

SLIDE 30

Example

b a a a ! \

q0 q1 q2 q2 q3 q4

SLIDE 31

Example

SLIDE 32

Example

SLIDE 33

Example

SLIDE 34

Example

SLIDE 35

Example

SLIDE 36

Example

SLIDE 37

Example

SLIDE 38

Example

SLIDE 39

Key Points

  • States in the search space are pairings of tape positions and states in the machine.
  • By keeping track of as yet unexplored states, a recognizer can systematically explore all the paths through the machine given an input.
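The idea of keeping track of unexplored search states can be sketched with an explicit agenda of (machine state, tape position) pairs. This is an illustrative version, not the textbook's pseudocode: it assumes no epsilon transitions, and the name `nd_recognize` and the dictionary encoding are invented here.

```python
def nd_recognize(tape, nfa, start, accept):
    """Search-based recognition for an NFA with no epsilon transitions.

    `nfa` maps (state, symbol) -> set of possible next states.
    """
    agenda = [(start, 0)]                    # unexplored search states
    while agenda:
        state, pos = agenda.pop()            # LIFO pop gives depth-first search
        if pos == len(tape):
            if state in accept:
                return True                  # one accepting path is enough
            continue                         # dead end: try another agenda item
        for nxt in nfa.get((state, tape[pos]), ()):
            agenda.append((nxt, pos + 1))    # remember every alternative
    return False                             # every path has failed

# Non-deterministic sheep machine: in q2, an 'a' may loop or move on to q3.
ND_SHEEP = {
    ("q0", "b"): {"q1"},
    ("q1", "a"): {"q2"},
    ("q2", "a"): {"q2", "q3"},
    ("q3", "!"): {"q4"},
}

for s in ["baaa!", "baa!", "ba!", "baa"]:
    print(s, nd_recognize(s, ND_SHEEP, "q0", {"q4"}))
```

Swapping the list for a queue (popping from the front) would turn the same loop into breadth-first search; the answer is the same either way.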

SLIDE 40

Why Bother?

  • Non-determinism doesn’t get us more formal power and it causes headaches, so why bother?
      • More natural (understandable) solutions

SLIDE 41

Compositional Machines

  • Formal languages are just sets of strings
  • Therefore, we can talk about various set operations (intersection, union, concatenation)
  • This turns out to be a useful exercise
SLIDE 42

Union

SLIDE 43

Concatenation

SLIDE 44

Negation

  • Construct a machine M2 to accept all strings not accepted by machine M1 and reject all the strings accepted by M1
      • Invert all the accept and non-accept states in M1
  • Does that work for non-deterministic machines?
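A sketch of the inversion for the deterministic case follows. It assumes that the machine is first made total by sending every missing transition to a non-accepting sink state, since otherwise strings that fall off the table would be rejected by both M1 and M2. The names `complement`, `accepts`, and `sink` are invented for this illustration, and the sketch does not handle non-deterministic machines.

```python
def complement(table, states, alphabet, accept, sink="sink"):
    """Complement a deterministic machine by completing it, then flipping accept states."""
    total = dict(table)
    all_states = set(states) | {sink}
    for s in all_states:
        for a in alphabet:
            total.setdefault((s, a), sink)   # missing moves go to the sink
    return total, all_states, all_states - set(accept)

def accepts(tape, table, start, accept):
    """Run a total deterministic machine; symbols must come from its alphabet."""
    state = start
    for symbol in tape:
        state = table[(state, symbol)]
    return state in accept

# M1: the deterministic sheep machine /baa+!/
SHEEP = {
    ("q0", "b"): "q1",
    ("q1", "a"): "q2",
    ("q2", "a"): "q3",
    ("q3", "a"): "q3",
    ("q3", "!"): "q4",
}
STATES = {"q0", "q1", "q2", "q3", "q4"}

# M2: accepts exactly the strings over {b, a, !} that M1 rejects
m2_table, m2_states, m2_accept = complement(SHEEP, STATES, "ba!", {"q4"})

for s in ["baa!", "ba!", ""]:
    print(repr(s), accepts(s, m2_table, "q0", m2_accept))
```

Because the sink state becomes an accept state of M2, strings such as "ba!" that M1 abandons halfway are accepted by M2, as the definition of negation requires.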

SLIDE 45

Intersection

  • Accept a string that is in both of two specified languages
  • An indirect construction…
      • A^B = ~(~A or ~B)
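The identity behind the indirect construction can be checked on a finite slice of two languages. The sketch below is a sanity check with plain Python sets rather than a machine construction. The choice of languages (A is the sheep language, B is the even-length strings) and the length bound of 6 are assumptions made for this illustration.

```python
import re
from itertools import product

ALPHABET = "ba!"

# Finite universe: every string over the alphabet of length 0 through 6.
universe = {"".join(p) for n in range(7) for p in product(ALPHABET, repeat=n)}

A = {s for s in universe if re.fullmatch(r"baa+!", s)}   # sheep language
B = {s for s in universe if len(s) % 2 == 0}             # even-length strings

direct = A & B
# De Morgan: complement of (complement of A, union, complement of B)
indirect = universe - ((universe - A) | (universe - B))

print(sorted(direct))
print(direct == indirect)
```

With machines, the same recipe would chain the negation and union constructions: negate each machine, take the union of the results, and negate once more.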