

SLIDE 1

Hidden Markov Models

COSI 114 – Computational Linguistics
James Pustejovsky
March 7, 2017
Brandeis University

Slides thanks to David Blei

SLIDE 2

Markov Models

  • Set of states: {s1, s2, ..., sN}
  • Process moves from one state to another, generating a sequence of states: si1, si2, ..., sik, ...
  • Markov chain property: the probability of each subsequent state depends only on the previous state: P(sik | si1, si2, ..., sik-1) = P(sik | sik-1)
  • To define a Markov model, the following probabilities have to be specified: transition probabilities aij = P(si | sj) and initial probabilities πi = P(si)

SLIDE 3

Example of Markov Model

(Diagram: a two-state chain over ‘Rain’ and ‘Dry’ with the transition probabilities listed below.)

  • Two states: ‘Rain’ and ‘Dry’.
  • Transition probabilities: P(‘Rain’|‘Rain’)=0.3, P(‘Dry’|‘Rain’)=0.7, P(‘Rain’|‘Dry’)=0.2, P(‘Dry’|‘Dry’)=0.8
  • Initial probabilities: say P(‘Rain’)=0.4, P(‘Dry’)=0.6.

SLIDE 4

Calculation of sequence probability

  • By the Markov chain property, the probability of a state sequence can be found by the formula:

P(si1, si2, ..., sik) = P(sik | si1, ..., sik-1) P(si1, ..., sik-1)
                      = P(sik | sik-1) P(si1, ..., sik-1) = ...
                      = P(sik | sik-1) P(sik-1 | sik-2) ... P(si2 | si1) P(si1)

  • Suppose we want to calculate the probability of the state sequence {‘Dry’,‘Dry’,‘Rain’,‘Rain’} in our example:

P({‘Dry’,‘Dry’,‘Rain’,‘Rain’}) = P(‘Rain’|‘Rain’) P(‘Rain’|‘Dry’) P(‘Dry’|‘Dry’) P(‘Dry’) = 0.3*0.2*0.8*0.6
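A quick sketch of this calculation in code (illustrative, not part of the original slides; the dictionaries simply encode the Rain/Dry model above):

```python
# Rain/Dry Markov chain from the example above.
initial = {'Rain': 0.4, 'Dry': 0.6}                 # pi_i = P(s_i)
transition = {'Rain': {'Rain': 0.3, 'Dry': 0.7},    # transition[prev][next] = P(next | prev)
              'Dry':  {'Rain': 0.2, 'Dry': 0.8}}

def sequence_probability(seq):
    """P(s1, ..., sk) = P(s1) * product over t of P(s_t | s_{t-1})."""
    p = initial[seq[0]]
    for prev, cur in zip(seq, seq[1:]):
        p *= transition[prev][cur]
    return p

print(sequence_probability(['Dry', 'Dry', 'Rain', 'Rain']))  # 0.6*0.8*0.2*0.3 = 0.0288
```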

SLIDE 5

Hidden Markov models

  • Set of states: {s1, s2, ..., sN}
  • Process moves from one state to another, generating a sequence of states: si1, si2, ..., sik, ...
  • Markov chain property: the probability of each subsequent state depends only on the previous state: P(sik | si1, si2, ..., sik-1) = P(sik | sik-1)
  • States are not visible, but each state randomly generates one of M observations (or visible states): {v1, v2, ..., vM}
  • To define a hidden Markov model, the following probabilities have to be specified: the matrix of transition probabilities A=(aij), aij = P(si | sj), the matrix of observation probabilities B=(bi(vm)), bi(vm) = P(vm | si), and the vector of initial probabilities π=(πi), πi = P(si). The model is represented by M=(A, B, π).

SLIDE 6

Example of Hidden Markov Model

(Diagram: hidden states ‘Low’ and ‘High’ with transition probabilities 0.3/0.7 and 0.2/0.8, each emitting the observations ‘Rain’ and ‘Dry’ with probabilities 0.6/0.4 and 0.4/0.6.)


SLIDE 10

Example of Hidden Markov Model

  • Two (hidden) states: ‘Low’ and ‘High’ atmospheric pressure.
  • Two observations: ‘Rain’ and ‘Dry’.
  • Transition probabilities: P(‘Low’|‘Low’)=0.3, P(‘High’|‘Low’)=0.7, P(‘Low’|‘High’)=0.2, P(‘High’|‘High’)=0.8
  • Observation probabilities: P(‘Rain’|‘Low’)=0.6, P(‘Dry’|‘Low’)=0.4, P(‘Rain’|‘High’)=0.4, P(‘Dry’|‘High’)=0.6.
  • Initial probabilities: say P(‘Low’)=0.4, P(‘High’)=0.6.
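This model can be written down directly for the algorithm sketches later in the deck (again illustrative; these dictionaries re-bind the names used in the Slide 4 sketch to the hidden-state model):

```python
states = ['Low', 'High']
observations = ['Rain', 'Dry']

initial = {'Low': 0.4, 'High': 0.6}                  # pi_i = P(s_i)
transition = {'Low':  {'Low': 0.3, 'High': 0.7},     # transition[prev][next] = P(next | prev)
              'High': {'Low': 0.2, 'High': 0.8}}
emission = {'Low':  {'Rain': 0.6, 'Dry': 0.4},       # b_i(v) = P(v | s_i)
            'High': {'Rain': 0.4, 'Dry': 0.6}}
```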

SLIDE 11

What is an HMM?

— Graphical Model
— Circles indicate states
— Arrows indicate probabilistic dependencies between states

SLIDE 12

What is an HMM?

— Green circles are hidden states
— Dependent only on the previous state
— “The past is independent of the future given the present.”

SLIDE 13

What is an HMM?

— Purple nodes are observed states
— Dependent only on their corresponding hidden state

SLIDE 14

HMM Formalism

— {S, K, Π, Α, Β}
— S : {s1…sN} are the values for the hidden states
— K : {k1…kM} are the values for the observations

(Diagram: a trellis of hidden states S, each emitting an observation K.)

SLIDE 15

HMM Formalism

— {S, K, Π, Α, Β}
— Π = {πi} are the initial state probabilities
— A = {aij} are the state transition probabilities
— B = {bik} are the observation state probabilities

(Diagram: the trellis with transition arcs A between hidden states S and emission arcs B to observations K.)

SLIDE 16

Inference in an HMM

— Compute the probability of a given observation sequence
— Given an observation sequence, compute the most likely hidden state sequence
— Given an observation sequence and set of possible models, which model most closely fits the data?

SLIDE 17

Decoding

Given an observation sequence O = (o1 ... oT) and a model µ = (A, B, Π), compute P(O | µ), the probability of the observation sequence given the model.

(Trellis: hidden states x1 ... xt-1, xt, xt+1 ... xT emitting observations o1 ... ot-1, ot, ot+1 ... oT.)

SLIDE 18

Decoding

P(O | X, µ) = bx1o1 bx2o2 ... bxToT

SLIDE 19

Decoding

P(O | X, µ) = bx1o1 bx2o2 ... bxToT

P(X | µ) = πx1 ax1x2 ax2x3 ... axT-1xT

SLIDE 20

Decoding

P(O, X | µ) = P(O | X, µ) P(X | µ)

P(O | X, µ) = bx1o1 bx2o2 ... bxToT

P(X | µ) = πx1 ax1x2 ax2x3 ... axT-1xT

SLIDE 21

Decoding

P(O, X | µ) = P(O | X, µ) P(X | µ)

P(O | X, µ) = bx1o1 bx2o2 ... bxToT

P(X | µ) = πx1 ax1x2 ax2x3 ... axT-1xT

P(O | µ) = ΣX P(O | X, µ) P(X | µ)

SLIDE 22

Decoding

P(O | µ) = Σ{x1...xT} πx1 bx1o1 Πt=1..T-1 axtxt+1 bxt+1ot+1

SLIDE 23

Forward Procedure

  • Special structure gives us an efficient solution using dynamic programming.
  • Intuition: Probability of the first t observations is the same for all possible t+1 length state sequences.
  • Define: αi(t) = P(o1 ... ot, xt = i | µ)
SLIDE 24

Forward Procedure

αj(t+1) = P(o1 ... ot+1, xt+1 = j)
        = P(o1 ... ot+1 | xt+1 = j) P(xt+1 = j)
        = P(o1 ... ot | xt+1 = j) P(ot+1 | xt+1 = j) P(xt+1 = j)
        = P(o1 ... ot, xt+1 = j) P(ot+1 | xt+1 = j)


SLIDE 28

Forward Procedure

αj(t+1) = P(o1 ... ot+1, xt+1 = j)
        = Σi=1..N P(o1 ... ot+1, xt = i, xt+1 = j)
        = Σi=1..N P(o1 ... ot, xt = i) P(xt+1 = j | xt = i) P(ot+1 | xt+1 = j)
        = Σi=1..N αi(t) aij bjot+1


SLIDE 32

Backward Procedure

Probability of the rest of the observations given the current state:

βi(t) = P(ot ... oT | xt = i)

βi(T+1) = 1

βi(t) = Σj=1..N aij biot βj(t+1)

SLIDE 33

Decoding Solution

Forward Procedure:  P(O | µ) = Σi=1..N αi(T)

Backward Procedure: P(O | µ) = Σi=1..N πi βi(1)

Combination:        P(O | µ) = Σi=1..N αi(t) βi(t+1)

SLIDE 34

Best State Sequence

— Find the state sequence that best explains the observations
— Viterbi algorithm:

argmaxX P(X | O)

SLIDE 35

Viterbi Algorithm

δj(t) = max{x1...xt-1} P(x1 ... xt-1, o1 ... ot-1, xt = j, ot)

The state sequence which maximizes the probability of seeing the observations to time t-1, landing in state j, and seeing the observation at time t.

(Diagram: a path x1 ... xt-1 ending in state j.)

SLIDE 36

Viterbi Algorithm

Recursive computation:

δj(t+1) = maxi δi(t) aij bjot+1

ψj(t+1) = argmaxi δi(t) aij bjot+1

SLIDE 37

Viterbi Algorithm

Compute the most likely state sequence by working backwards:

X̂T = argmaxi δi(T)

X̂t = ψ_{X̂t+1}(t+1)

P(X̂) = maxi δi(T)

SLIDE 38

Parameter Estimation

  • Given an observation sequence, find the model that is most likely to produce that sequence.
  • No analytic method
  • Given a model and observation sequence, update the model parameters to better fit the observations.

SLIDE 39

Parameter Estimation

Probability of traversing an arc:

pt(i, j) = αi(t) aij bjot+1 βj(t+1) / Σm=1..N αm(t) βm(t)

Probability of being in state i:

γi(t) = Σj=1..N pt(i, j)

SLIDE 40

Parameter Estimation

Now we can compute the new estimates of the model parameters:

π̂i = γi(1)

âij = Σt=1..T pt(i, j) / Σt=1..T γi(t)

b̂ik = Σ{t: ot = k} γi(t) / Σt=1..T γi(t)

SLIDE 41

HMM Applications

— Generating parameters for n-gram models
— Tagging speech
— Speech recognition

SLIDE 42

The Most Important Thing

We can use the special structure of this model to do a lot of neat math and solve problems that are otherwise not solvable.

SLIDE 43

Example of Hidden Markov Model

(Diagram: hidden states ‘Low’ and ‘High’ with transition probabilities 0.3/0.7 and 0.2/0.8, each emitting the observations ‘Rain’ and ‘Dry’ with probabilities 0.6/0.4 and 0.4/0.6.)

SLIDE 44

Example of Markov Model

(Diagram: a two-state chain over ‘Rain’ and ‘Dry’ with the transition probabilities listed below.)

  • Two states: ‘Rain’ and ‘Dry’.
  • Transition probabilities: P(‘Rain’|‘Rain’)=0.3, P(‘Dry’|‘Rain’)=0.7, P(‘Rain’|‘Dry’)=0.2, P(‘Dry’|‘Dry’)=0.8
  • Initial probabilities: say P(‘Rain’)=0.4, P(‘Dry’)=0.6.

SLIDE 45

Calculation of sequence probability

  • By the Markov chain property, the probability of a state sequence can be found by the formula:

P(si1, si2, ..., sik) = P(sik | si1, ..., sik-1) P(si1, ..., sik-1)
                      = P(sik | sik-1) P(si1, ..., sik-1) = ...
                      = P(sik | sik-1) P(sik-1 | sik-2) ... P(si2 | si1) P(si1)

  • Suppose we want to calculate the probability of the state sequence {‘Dry’,‘Dry’,‘Rain’,‘Rain’} in our example:

P({‘Dry’,‘Dry’,‘Rain’,‘Rain’}) = P(‘Rain’|‘Rain’) P(‘Rain’|‘Dry’) P(‘Dry’|‘Dry’) P(‘Dry’) = 0.3*0.2*0.8*0.6

SLIDE 46

Hidden Markov models

  • Set of states: {s1, s2, ..., sN}
  • Process moves from one state to another, generating a sequence of states: si1, si2, ..., sik, ...
  • Markov chain property: the probability of each subsequent state depends only on the previous state: P(sik | si1, si2, ..., sik-1) = P(sik | sik-1)
  • States are not visible, but each state randomly generates one of M observations (or visible states): {v1, v2, ..., vM}
  • To define a hidden Markov model, the following probabilities have to be specified: the matrix of transition probabilities A=(aij), aij = P(si | sj), the matrix of observation probabilities B=(bi(vm)), bi(vm) = P(vm | si), and the vector of initial probabilities π=(πi), πi = P(si). The model is represented by M=(A, B, π).

SLIDE 47

Example of Hidden Markov Model

  • Two (hidden) states: ‘Low’ and ‘High’ atmospheric pressure.
  • Two observations: ‘Rain’ and ‘Dry’.
  • Transition probabilities: P(‘Low’|‘Low’)=0.3, P(‘High’|‘Low’)=0.7, P(‘Low’|‘High’)=0.2, P(‘High’|‘High’)=0.8
  • Observation probabilities: P(‘Rain’|‘Low’)=0.6, P(‘Dry’|‘Low’)=0.4, P(‘Rain’|‘High’)=0.4, P(‘Dry’|‘High’)=0.6.
  • Initial probabilities: say P(‘Low’)=0.4, P(‘High’)=0.6.

SLIDE 48

Calculation of observation sequence probability

  • Suppose we want to calculate the probability of the observation sequence {‘Dry’,‘Rain’} in our example.
  • Consider all possible hidden state sequences:

P({‘Dry’,‘Rain’}) = P({‘Dry’,‘Rain’}, {‘Low’,‘Low’}) + P({‘Dry’,‘Rain’}, {‘Low’,‘High’}) + P({‘Dry’,‘Rain’}, {‘High’,‘Low’}) + P({‘Dry’,‘Rain’}, {‘High’,‘High’})

where the first term is:

P({‘Dry’,‘Rain’}, {‘Low’,‘Low’}) = P({‘Dry’,‘Rain’} | {‘Low’,‘Low’}) P({‘Low’,‘Low’})
= P(‘Dry’|‘Low’) P(‘Rain’|‘Low’) P(‘Low’) P(‘Low’|‘Low’) = 0.4*0.6*0.4*0.3
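A brute-force sketch of this sum (illustrative; it reuses the `states`, `initial`, `transition`, and `emission` dictionaries from the Slide 10 sketch):

```python
from itertools import product

def joint_probability(obs, hidden):
    """P(O, X) = P(O | X) P(X) for one hidden state sequence X."""
    p = initial[hidden[0]] * emission[hidden[0]][obs[0]]
    for t in range(1, len(obs)):
        p *= transition[hidden[t - 1]][hidden[t]] * emission[hidden[t]][obs[t]]
    return p

def observation_probability(obs):
    """Sum P(O, X) over all N**K hidden state sequences."""
    return sum(joint_probability(obs, x) for x in product(states, repeat=len(obs)))

print(observation_probability(['Dry', 'Rain']))  # 0.0288 + 0.0448 + 0.0432 + 0.1152 = 0.232
```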

SLIDE 49

Main issues using HMMs:

  • Evaluation problem. Given the HMM M=(A, B, π) and the observation sequence O=o1 o2 ... oK, calculate the probability that model M has generated sequence O.
  • Decoding problem. Given the HMM M=(A, B, π) and the observation sequence O=o1 o2 ... oK, calculate the most likely sequence of hidden states si that produced this observation sequence O.
  • Learning problem. Given some training observation sequences O=o1 o2 ... oK and the general structure of the HMM (numbers of hidden and visible states), determine the HMM parameters M=(A, B, π) that best fit the training data.

O=o1...oK denotes a sequence of observations ok∈{v1,…,vM}.

SLIDE 50

Word recognition example (1).

  • Typed word recognition, assume all characters are separated.
  • Character recognizer outputs the probability of the image being a particular character, P(image|character).

(Figure: an example character image with recognizer scores for several candidate characters, including 0.5, 0.31, 0.03, 0.005. Hidden state = character; observation = image.)

SLIDE 51

Word recognition example (2).

  • Hidden states of HMM = characters.
  • Observations = typed images of characters segmented from the image. Note that there is an infinite number of observations.
  • Observation probabilities = character recognizer scores: B = (bi(vα)), bi(vα) = P(vα | si)
  • Transition probabilities will be defined differently in two subsequent models.

SLIDE 52

Word recognition example (3).

  • If a lexicon is given, we can construct separate HMM models for each lexicon word.

(Figure: left-to-right character HMMs for the lexicon words ‘Amherst’ (a→m→h→e→r→s→t) and ‘Buffalo’ (b→u→f→f→a→l→o), with recognizer scores such as 0.5 and 0.03 and transition probabilities such as 0.4 and 0.6.)

  • Here recognition of the word image is equivalent to the problem of evaluating a few HMM models.
  • This is an application of the Evaluation problem.

SLIDE 53

Word recognition example (4).

  • We can construct a single HMM for all words.
  • Hidden states = all characters in the alphabet.
  • Transition probabilities and initial probabilities are calculated from the language model.
  • Observations and observation probabilities are as before.

(Figure: a connected character network over letters such as a, m, h, e, r, s, t, b, v, f.)

  • Here we have to determine the best sequence of hidden states, the one that most likely produced the word image.
  • This is an application of the Decoding problem.

SLIDE 54

Character recognition with HMM example.

  • The structure of hidden states is chosen.
  • Observations are feature vectors extracted from vertical slices.
  • Probabilistic mapping from hidden state to feature vectors:
    1. use mixture of Gaussian models
    2. Quantize feature vector space.

SLIDE 55

Exercise: character recognition with HMM (1)

  • The structure of hidden states: s1 → s2 → s3
  • Observation = number of islands in the vertical slice.

  • HMM for character ‘A’:
    Transition probabilities {aij} =
      ( .8  .2  0 )
      ( 0   .8  .2 )
      ( 0   0   1 )
    Observation probabilities {bjk} =
      ( .9  .1  0 )
      ( .1  .8  .1 )
      ( .9  .1  0 )

  • HMM for character ‘B’:
    Transition probabilities {aij} =
      ( .8  .2  0 )
      ( 0   .8  .2 )
      ( 0   0   1 )
    Observation probabilities {bjk} =
      ( .9  .1  0 )
      ( 0   .2  .8 )
      ( .6  .4  0 )

SLIDE 56

Exercise: character recognition with HMM (2)

  • Suppose that after character image segmentation the following sequence of island numbers in 4 slices was observed: {1, 3, 2, 1}
  • Which HMM is more likely to generate this observation sequence, the HMM for ‘A’ or the HMM for ‘B’?

SLIDE 57

Exercise: character recognition with HMM (3)

Consider the likelihood of generating the given observation for each possible sequence of hidden states:

  • HMM for character ‘A’:

Hidden state sequence   Transition probabilities   Observation probabilities
s1→ s1→ s2→ s3          .8 * .2 * .2            *  .9 * 0 * .8 * .9   = 0
s1→ s2→ s2→ s3          .2 * .8 * .2            *  .9 * .1 * .8 * .9  = 0.0020736
s1→ s2→ s3→ s3          .2 * .2 * 1             *  .9 * .1 * .1 * .9  = 0.000324
                                                   Total = 0.0023976

  • HMM for character ‘B’:

Hidden state sequence   Transition probabilities   Observation probabilities
s1→ s1→ s2→ s3          .8 * .2 * .2            *  .9 * 0 * .2 * .6   = 0
s1→ s2→ s2→ s3          .2 * .8 * .2            *  .9 * .8 * .2 * .6  = 0.0027648
s1→ s2→ s3→ s3          .2 * .2 * 1             *  .9 * .8 * .4 * .6  = 0.006912
                                                   Total = 0.0096768
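The totals can be checked mechanically (a small illustrative script; matrices are 0-indexed, and the island counts 1, 3, 2, 1 are mapped to column indices):

```python
trans = [[.8, .2, 0], [0, .8, .2], [0, 0, 1]]      # shared {aij} for 'A' and 'B'
emit_A = [[.9, .1, 0], [.1, .8, .1], [.9, .1, 0]]  # {bjk} for 'A'
emit_B = [[.9, .1, 0], [0, .2, .8], [.6, .4, 0]]   # {bjk} for 'B'

paths = [(0, 0, 1, 2), (0, 1, 1, 2), (0, 1, 2, 2)] # the three s1 ... s3 paths
obs = [0, 2, 1, 0]                                 # island counts 1, 3, 2, 1

def path_score(emit, path):
    p = emit[path[0]][obs[0]]                      # the process starts in s1
    for t in range(1, len(path)):
        p *= trans[path[t - 1]][path[t]] * emit[path[t]][obs[t]]
    return p

print(sum(path_score(emit_A, p) for p in paths))   # 0.0023976
print(sum(path_score(emit_B, p) for p in paths))   # 0.0096768
```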

SLIDE 58

Evaluation Problem.

  • Evaluation problem. Given the HMM M=(A, B, π) and the observation sequence O=o1 o2 ... oK, calculate the probability that model M has generated sequence O.
  • Trying to find the probability of observations O=o1 o2 ... oK by considering all hidden state sequences (as was done in the example) is impractical: there are N^K hidden state sequences - exponential complexity.
  • Use Forward-Backward HMM algorithms for efficient calculations.
  • Define the forward variable αk(i) as the joint probability of the partial observation sequence o1 o2 ... ok and of the hidden state at time k being si: αk(i) = P(o1 o2 ... ok, qk= si)

SLIDE 59

Trellis representation of an HMM

(Trellis diagram: at each time 1, ..., k, k+1, ..., K there is a column of states s1, s2, ..., si, ..., sN; arcs a1j, a2j, ..., aij, ..., aNj lead into state sj at time k+1; the observations o1 ... ok ok+1 ... oK are emitted along the way.)

SLIDE 60

Forward recursion for HMM

  • Initialization:
    α1(i) = P(o1, q1= si) = πi bi(o1) , 1<=i<=N.
  • Forward recursion:
    αk+1(j) = P(o1 o2 ... ok+1, qk+1= sj) =
    Σi P(o1 o2 ... ok+1, qk= si, qk+1= sj) =
    Σi P(o1 o2 ... ok, qk= si) aij bj(ok+1) =
    [Σi αk(i) aij] bj(ok+1) , 1<=j<=N, 1<=k<=K-1.
  • Termination:
    P(o1 o2 ... oK) = Σi P(o1 o2 ... oK, qK= si) = Σi αK(i)
  • Complexity: N^2·K operations.
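A direct transcription of this recursion (a sketch; it reuses the weather-HMM dictionaries from the Slide 10 sketch, with 0-based list indices standing in for k):

```python
def forward(obs):
    """alpha[k][j] = P(o1 ... o_{k+1}, q_{k+1} = j), with 0-based k."""
    alpha = [{i: initial[i] * emission[i][obs[0]] for i in states}]
    for o in obs[1:]:
        prev = alpha[-1]
        alpha.append({j: sum(prev[i] * transition[i][j] for i in states) * emission[j][o]
                      for j in states})
    return alpha

def likelihood(obs):
    """Termination: P(O) = sum_i alpha_K(i) -- N^2*K work instead of N^K paths."""
    return sum(forward(obs)[-1].values())

print(likelihood(['Dry', 'Rain']))  # 0.232, matching the brute-force sum above
```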

SLIDE 61

Backward recursion for HMM

  • Define the backward variable βk(i) as the conditional probability of the partial observation sequence ok+1 ok+2 ... oK given that the hidden state at time k is si: βk(i) = P(ok+1 ok+2 ... oK | qk= si)
  • Initialization:
    βK(i) = 1 , 1<=i<=N.
  • Backward recursion:
    βk(j) = P(ok+1 ok+2 ... oK | qk= sj) =
    Σi P(ok+1 ok+2 ... oK, qk+1= si | qk= sj) =
    Σi P(ok+2 ok+3 ... oK | qk+1= si) aji bi(ok+1) =
    Σi βk+1(i) aji bi(ok+1) , 1<=j<=N, 1<=k<=K-1.
  • Termination:
    P(o1 o2 ... oK) = Σi P(o1 o2 ... oK, q1= si) =
    Σi P(o1 o2 ... oK | q1= si) P(q1= si) = Σi β1(i) bi(o1) πi
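The mirror-image sketch (same conventions as `forward()` above):

```python
def backward(obs):
    """beta[k][j] = P(o_{k+2} ... o_K | q_{k+1} = j), with 0-based k."""
    beta = [{i: 1.0 for i in states}]              # initialization at k = K
    for o in reversed(obs[1:]):
        nxt = beta[0]
        beta.insert(0, {j: sum(transition[j][i] * emission[i][o] * nxt[i] for i in states)
                        for j in states})
    return beta

obs = ['Dry', 'Rain']
beta = backward(obs)
# Termination: P(O) = sum_i beta_1(i) b_i(o1) pi_i
print(sum(beta[0][i] * emission[i][obs[0]] * initial[i] for i in states))  # 0.232
```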

SLIDE 62

Decoding problem

  • Decoding problem. Given the HMM M=(A, B, π) and the observation sequence O=o1 o2 ... oK, calculate the most likely sequence of hidden states si that produced this observation sequence.
  • We want to find the state sequence Q= q1…qK which maximizes P(Q | o1 o2 ... oK), or equivalently P(Q, o1 o2 ... oK).
  • Brute force consideration of all paths takes exponential time. Use the efficient Viterbi algorithm instead.
  • Define the variable δk(i) as the maximum probability of producing observation sequence o1 o2 ... ok when moving along any hidden state sequence q1… qk-1 and getting into qk= si:

δk(i) = max P(q1… qk-1, qk= si, o1 o2 ... ok)

where max is taken over all possible paths q1… qk-1.

SLIDE 63

Viterbi algorithm (1)

  • General idea: if the best path ending in qk= sj goes through qk-1= si, then it should coincide with the best path ending in qk-1= si.

(Diagram: states s1, ..., si, ..., sN at time k-1 with arcs a1j, ..., aij, ..., aNj into state sj at time k.)

  • δk(j) = max P(q1… qk-1, qk= sj, o1 o2 ... ok) =
    maxi [ aij bj(ok) max P(q1… qk-1= si, o1 o2 ... ok-1) ]
  • To backtrack the best path, keep the info that the predecessor of sj was si.

SLIDE 64

Viterbi algorithm (2)

  • Initialization:
    δ1(i) = max P(q1= si, o1) = πi bi(o1) , 1<=i<=N.
  • Forward recursion:
    δk(j) = max P(q1… qk-1, qk= sj, o1 o2 ... ok) =
    maxi [ aij bj(ok) δk-1(i) ] , 1<=j<=N, 2<=k<=K.
  • Termination: choose the best path ending at time K:
    maxi [ δK(i) ]
  • Backtrack the best path.

This algorithm is similar to the forward recursion of the evaluation problem, with Σ replaced by max and additional backtracking.
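A sketch of the recursion with backtracking (same weather-HMM dictionaries; `psi` records each cell's best predecessor):

```python
def viterbi(obs):
    delta = [{i: initial[i] * emission[i][obs[0]] for i in states}]
    psi = []
    for o in obs[1:]:
        prev = delta[-1]
        best_prev = {j: max(states, key=lambda i: prev[i] * transition[i][j])
                     for j in states}
        psi.append(best_prev)
        delta.append({j: prev[best_prev[j]] * transition[best_prev[j]][j] * emission[j][o]
                      for j in states})
    # Termination: pick the best final state, then backtrack through psi.
    best = max(states, key=lambda i: delta[-1][i])
    path = [best]
    for back in reversed(psi):
        path.insert(0, back[path[0]])
    return path, delta[-1][best]

print(viterbi(['Dry', 'Rain']))  # (['High', 'High'], 0.1152)
```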

SLIDE 65

Learning problem (1)

  • Learning problem. Given some training observation sequences O=o1 o2 ... oK and the general structure of the HMM (numbers of hidden and visible states), determine the HMM parameters M=(A, B, π) that best fit the training data, that is, maximize P(O | M).
  • There is no algorithm producing optimal parameter values.
  • Use an iterative expectation-maximization algorithm to find a local maximum of P(O | M) - the Baum-Welch algorithm.

SLIDE 66

Learning problem (2)

  • If training data has information about the sequence of hidden states (as in the word recognition example), then use maximum likelihood estimation of parameters:

aij = P(si | sj) = (Number of transitions from state sj to state si) / (Number of transitions out of state sj)

bi(vm) = P(vm | si) = (Number of times observation vm occurs in state si) / (Number of times in state si)
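When the hidden states are observed in training, the estimates above are plain relative frequencies (an illustrative sketch; `data` is a list of (hidden state sequence, observation sequence) pairs):

```python
from collections import Counter

def estimate(data):
    trans, trans_out = Counter(), Counter()
    emit, occupancy = Counter(), Counter()
    for hidden, obs in data:
        for prev, cur in zip(hidden, hidden[1:]):
            trans[(prev, cur)] += 1                # transitions s_prev -> s_cur
            trans_out[prev] += 1                   # transitions out of s_prev
        for s, o in zip(hidden, obs):
            emit[(s, o)] += 1                      # observation o seen in state s
            occupancy[s] += 1                      # times in state s
    a = {pair: n / trans_out[pair[0]] for pair, n in trans.items()}
    b = {pair: n / occupancy[pair[0]] for pair, n in emit.items()}
    return a, b

a, b = estimate([(['Low', 'High', 'High'], ['Dry', 'Rain', 'Dry'])])
print(a[('High', 'High')], b[('High', 'Rain')])    # 1.0 0.5
```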

SLIDE 67

Baum-Welch algorithm

General idea:

aij = P(si | sj) = (Expected number of transitions from state sj to state si) / (Expected number of transitions out of state sj)

bi(vm) = P(vm | si) = (Expected number of times observation vm occurs in state si) / (Expected number of times in state si)

πi = P(si) = Expected frequency in state si at time k=1.

SLIDE 68

Baum-Welch algorithm: expectation step (1)

  • Define the variable ξk(i,j) as the probability of being in state si at time k and in state sj at time k+1, given the observation sequence o1 o2 ... oK:

ξk(i,j) = P(qk= si, qk+1= sj | o1 o2 ... oK)

ξk(i,j) = P(qk= si, qk+1= sj, o1 o2 ... oK) / P(o1 o2 ... oK)
        = P(qk= si, o1 o2 ... ok) aij bj(ok+1) P(ok+2 ... oK | qk+1= sj) / P(o1 o2 ... oK)
        = αk(i) aij bj(ok+1) βk+1(j) / Σi Σj αk(i) aij bj(ok+1) βk+1(j)

SLIDE 69

Baum-Welch algorithm: expectation step (2)

  • Define the variable γk(i) as the probability of being in state si at time k, given the observation sequence o1 o2 ... oK:

γk(i) = P(qk= si | o1 o2 ... oK)

γk(i) = P(qk= si, o1 o2 ... oK) / P(o1 o2 ... oK) = αk(i) βk(i) / Σi αk(i) βk(i)

SLIDE 70

Baum-Welch algorithm: expectation step (3)

  • We calculated ξk(i,j) = P(qk= si, qk+1= sj | o1 o2 ... oK) and γk(i) = P(qk= si | o1 o2 ... oK)
  • Expected number of transitions from state si to state sj = Σk ξk(i,j)
  • Expected number of transitions out of state si = Σk γk(i)
  • Expected number of times observation vm occurs in state si = Σk γk(i), where k is such that ok= vm
  • Expected frequency in state si at time k=1: γ1(i).

SLIDE 71

Baum-Welch algorithm: maximization step

aij = (Expected number of transitions from state sj to state si) / (Expected number of transitions out of state sj) = Σk ξk(i,j) / Σk γk(i)

bi(vm) = (Expected number of times observation vm occurs in state si) / (Expected number of times in state si) = Σ{k: ok= vm} γk(i) / Σk γk(i)

πi = (Expected frequency in state si at time k=1) = γ1(i).
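Putting the expectation and maximization steps together for a single observation sequence (a sketch only; it reuses the `forward()` and `backward()` sketches above, whose 0-based `beta[k+1]` plays the role of βk+1 here):

```python
def baum_welch_step(obs):
    alpha, beta = forward(obs), backward(obs)
    total = sum(alpha[-1].values())                # P(O | current model)
    K = len(obs)
    # Expectation: xi and gamma as defined on Slides 68-69.
    xi = [{(i, j): alpha[k][i] * transition[i][j] * emission[j][obs[k + 1]]
                   * beta[k + 1][j] / total
           for i in states for j in states}
          for k in range(K - 1)]
    gamma = [{i: alpha[k][i] * beta[k][i] / total for i in states} for k in range(K)]
    # Maximization: re-estimate pi, A, B from the expected counts.
    pi = {i: gamma[0][i] for i in states}
    a = {i: {j: sum(x[(i, j)] for x in xi) / sum(g[i] for g in gamma[:-1])
             for j in states} for i in states}
    b = {i: {v: sum(g[i] for g, o in zip(gamma, obs) if o == v) / sum(g[i] for g in gamma)
             for v in observations} for i in states}
    return pi, a, b
```

Iterating this step (and, in practice, summing the expected counts over many training sequences) is the expectation-maximization procedure described above.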

SLIDE 72

The Noisy Channel Model

— Search through the space of all possible sentences.
— Pick the one that is most probable given the waveform.

SLIDE 73

The Noisy Channel Model (II)

— What is the most likely sentence out of all sentences in the language L given some acoustic input O?
— Treat acoustic input O as a sequence of individual observations: O = o1,o2,o3,…,ot
— Define a sentence as a sequence of words: W = w1,w2,w3,…,wn
SLIDE 74

Noisy Channel Model (III)

— Probabilistic implication: pick the highest-probability sentence:

  Ŵ = argmaxW∈L P(W | O)

— We can use Bayes rule to rewrite this:

  Ŵ = argmaxW∈L P(O | W) P(W) / P(O)

— Since the denominator is the same for each candidate sentence W, we can ignore it for the argmax:

  Ŵ = argmaxW∈L P(O | W) P(W)

SLIDE 75

Noisy channel model

Ŵ = argmaxW∈L P(O | W) P(W)

where P(O | W) is the likelihood and P(W) is the prior.

SLIDE 76

The noisy channel model

— Ignoring the denominator leaves us with two factors: P(Source) and P(Signal|Source)

SLIDE 77

Speech Architecture meets Noisy Channel

SLIDE 78

HMMs for speech

SLIDE 79

Phones are not homogeneous!

(Spectrogram of the phones “ay k”, roughly 0.48152 s to 0.937203 s, frequency axis up to 5000 Hz.)

SLIDE 80

Each phone has 3 subphones

SLIDE 81

Resulting HMM word model for “six”

SLIDE 82

HMMs more formally

— Markov chains
— A kind of weighted finite-state automaton

SLIDE 83

HMMs more formally

— Markov chains
— A kind of weighted finite-state automaton

SLIDE 84

Another Markov chain

SLIDE 85

Another view of Markov chains

SLIDE 86

An example with numbers:

— What is the probability of:
  • Hot hot hot hot
  • Cold hot cold hot
SLIDE 87

Hidden Markov Models

SLIDE 88

Hidden Markov Models

SLIDE 89

Hidden Markov Models

— Bakis network
— Ergodic (fully-connected) network
— Left-to-right network

SLIDE 90

The Jason Eisner task

— You are a climatologist in 2799 studying the history of global warming
— You can’t find records of the weather in Baltimore for summer 2006
— But you do find Jason Eisner’s diary
— Which records how many ice creams he ate each day.
— Can we use this to figure out the weather?
  • Given a sequence of observations O, each observation an integer = number of ice creams eaten
  • Figure out the correct hidden sequence Q of weather states (H or C) which caused Jason to eat the ice cream
SLIDE 91

SLIDE 92

HMMs more formally

— Three fundamental problems (Jack Ferguson at IDA in the 1960s):
  1) Given a specific HMM, determine the likelihood of an observation sequence.
  2) Given an observation sequence and an HMM, discover the best (most probable) hidden state sequence
  3) Given only an observation sequence, learn the HMM parameters (A, B matrix)

SLIDE 93

The Three Basic Problems for HMMs

— Problem 1 (Evaluation): Given the observation sequence O=(o1o2…oT), and an HMM model Φ = (A,B), how do we efficiently compute P(O|Φ), the probability of the observation sequence, given the model?
— Problem 2 (Decoding): Given the observation sequence O=(o1o2…oT), and an HMM model Φ = (A,B), how do we choose a corresponding state sequence Q=(q1q2…qT) that is optimal in some sense (i.e., best explains the observations)?
— Problem 3 (Learning): How do we adjust the model parameters Φ = (A,B) to maximize P(O|Φ)?

SLIDE 94

Problem 1: computing the observation likelihood

— Given the following HMM:
— How likely is the sequence 3 1 3?

SLIDE 95

How to compute likelihood

— For a Markov chain, we just follow the states 3 1 3 and multiply the probabilities
— But for an HMM, we don’t know what the states are!
— So let’s start with a simpler situation: computing the observation likelihood for a given hidden state sequence
  • Suppose we knew the weather and wanted to predict how much ice cream Jason would eat.
  • I.e. P( 3 1 3 | H H C)
SLIDE 96

Computing likelihood for 1 given hidden state sequence

SLIDE 97

Computing total likelihood of 3 1 3

— We would need to sum over
  • Hot hot cold
  • Hot hot hot
  • Hot cold hot
  • ….
— How many possible hidden state sequences are there for this sequence?
— How about in general for an HMM with N hidden states and a sequence of T observations?
  • N^T
— So we can’t just do a separate computation for each hidden state sequence.

SLIDE 98

Instead: the Forward algorithm

— A kind of dynamic programming algorithm
  • Uses a table to store intermediate values
— Idea:
  • Compute the likelihood of the observation sequence
  • By summing over all possible hidden state sequences
  • But doing this efficiently, by folding all the sequences into a single trellis

SLIDE 99

The Forward Trellis

SLIDE 100

The forward algorithm

— Each cell of the forward algorithm trellis, αt(j):
  • Represents the probability of being in state j
  • After seeing the first t observations
  • Given the automaton
— Each cell thus expresses the following probability:
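In symbols (the standard definition of the forward cell, matching the three bullets above; λ is the model):

αt(j) = P(o1 o2 ... ot, qt = j | λ)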

SLIDE 101

We update each cell

SLIDE 102

The Forward Recursion

SLIDE 103

The Forward Algorithm

SLIDE 104

Decoding

— Given an observation sequence: 3 1 3
— And an HMM
— The task of the decoder: to find the best hidden state sequence
— Given the observation sequence O=(o1o2…oT), and an HMM model Φ = (A,B), how do we choose a corresponding state sequence Q=(q1q2…qT) that is optimal in some sense (i.e., best explains the observations)?

SLIDE 105

Decoding

— One possibility:
  • For each hidden state sequence (HHH, HHC, HCH, …), run the forward algorithm to compute P(Φ | O)
— Why not? N^T
— Instead: the Viterbi algorithm
  • Is again a dynamic programming algorithm
  • Uses a similar trellis to the Forward algorithm
SLIDE 106

The Viterbi trellis

SLIDE 107

Viterbi intuition

— Process the observation sequence left to right
— Filling out the trellis
— Each cell:

SLIDE 108

Viterbi Algorithm

SLIDE 109

Viterbi backtrace

SLIDE 110

Viterbi Recursion

SLIDE 111

Reminder: a word looks like this:

SLIDE 112

HMM for digit recognition task

SLIDE 113

The Evaluation (forward) problem for speech

— The observation sequence O is a series of MFCC vectors
— The hidden states W are the phones and words
— For a given phone/word string W, our job is to evaluate P(O|W)
— Intuition: how likely is the input to have been generated by just that word string W

SLIDE 114

Evaluation for speech: Summing over all different paths!

— f ay ay ay ay v v v v
— f f ay ay ay ay v v v
— f f f f ay ay ay ay v
— f f ay ay ay ay ay ay v
— f f ay ay ay ay ay ay ay ay v
— f f ay v v v v v v v

SLIDE 115

The forward lattice for “five”

SLIDE 116

The forward trellis for “five”

SLIDE 117

Viterbi trellis for “five”

SLIDE 118

Viterbi trellis for “five”

SLIDE 119

Search space with bigrams

SLIDE 120

Viterbi trellis with 2 words and uniform LM

SLIDE 121

Viterbi backtrace

SLIDE 122

Part-of-speech tagging

SLIDE 123

Parts of Speech

— Perhaps starting with Aristotle in the West (384–322 BCE), the idea of having parts of speech: lexical categories, word classes, “tags”, POS
— Dionysius Thrax of Alexandria (c. 100 BCE): 8 parts of speech
  • Still with us! But his 8 aren’t exactly the ones we are taught today
    – Thrax: noun, verb, article, adverb, preposition, conjunction, participle, pronoun
    – School grammar: noun, verb, adjective, adverb, preposition, conjunction, pronoun, interjection

SLIDE 124

Open class (lexical) words:
  • Nouns: Proper (IBM, Italy); Common (cat/cats, snow)
  • Verbs: Main (see, registered)
  • Adjectives (old, older, oldest)
  • Adverbs (slowly)
  • Numbers (122,312; one)
  • … more

Closed class (functional) words:
  • Modals (can, had)
  • Prepositions (to, with)
  • Particles (off, up)
  • Determiners (the, some)
  • Conjunctions (and, or)
  • Pronouns (he, its)
  • Interjections (Ow, Eh)
  • … more

SLIDE 125

Open vs. Closed classes

— Closed:
  – determiners: a, an, the
  – pronouns: she, he, I
  – prepositions: on, under, over, near, by, …
  – Why “closed”?
— Open:
  – Nouns, Verbs, Adjectives, Adverbs.

SLIDE 126

POS Tagging

— Words often have more than one POS: back
  • The back door = JJ
  • On my back = NN
  • Win the voters back = RB
  • Promised to back the bill = VB
— The POS tagging problem is to determine the POS tag for a particular instance of a word.

SLIDE 127

POS Tagging

— Input: Plays well with others
— Ambiguity: NNS/VBZ UH/JJ/NN/RB IN NNS
— Output: Plays/VBZ well/RB with/IN others/NNS
— Uses:
  • MT: reordering of adjectives and nouns (say from Spanish to English)
  • Text-to-speech (how do we pronounce “lead”?)
  • Can write regexps like (Det) Adj* N+ over the output for phrases, etc.
  • Input to a syntactic parser

Penn Treebank POS tags

SLIDE 128

The Penn TreeBank Tagset

SLIDE 129

Penn Treebank tags

SLIDE 130

POS tagging performance

— How many tags are correct? (Tag accuracy)
  • About 97% currently
  • But the baseline is already 90%
    – Baseline is performance of the stupidest possible method
    – Tag every word with its most frequent tag
    – Tag unknown words as nouns
  • Partly easy because
    – Many words are unambiguous
    – You get points for them (the, a, etc.) and for punctuation marks!
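That baseline is a few lines of code (an illustrative sketch, not a tagger from the slides):

```python
from collections import Counter, defaultdict

def train_most_frequent_tag(tagged_corpus):
    """tagged_corpus: iterable of (word, tag) pairs."""
    counts = defaultdict(Counter)
    for word, tag in tagged_corpus:
        counts[word.lower()][tag] += 1
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}

def baseline_tag(words, lexicon):
    return [(w, lexicon.get(w.lower(), 'NN')) for w in words]  # unknown words -> NN

lexicon = train_most_frequent_tag(
    [('the', 'DT'), ('back', 'NN'), ('back', 'NN'), ('back', 'VB'), ('door', 'NN')])
print(baseline_tag(['the', 'back', 'door', 'creaks'], lexicon))
# [('the', 'DT'), ('back', 'NN'), ('door', 'NN'), ('creaks', 'NN')]
```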

SLIDE 131

Deciding on the correct part of speech can be difficult even for people

— Mrs/NNP Shaefer/NNP never/RB got/VBD around/RP to/TO joining/VBG
— All/DT we/PRP gotta/VBN do/VB is/VBZ go/VB around/IN the/DT corner/NN
— Chateau/NNP Petrus/NNP costs/VBZ around/RB 250/CD

SLIDE 132

How difficult is POS tagging?

— About 11% of the word types in the Brown corpus are ambiguous with regard to part of speech
— But they tend to be very common words. E.g., that
  • I know that he is honest = IN
  • Yes, that play was nice = DT
  • You can’t go that far = RB
— 40% of the word tokens are ambiguous

SLIDE 133

Sources of information

— What are the main sources of information for POS tagging?
  • Knowledge of neighboring words:
      Bill  saw    that  man  yesterday
      NNP   NN     DT    NN   NN
      VB    VB(D)  IN    VB   NN
  • Knowledge of word probabilities:
      man is rarely used as a verb….
— The latter proves the most useful, but the former also helps

SLIDE 134

More and Better Features → Feature-based tagger

— Can do surprisingly well just looking at a word by itself:
  • Word: the → DT
  • Lowercased word: Importantly: importantly → RB
  • Prefixes: unfathomable: un- → JJ
  • Suffixes: Importantly: -ly → RB
  • Capitalization: Meridian: CAP → NNP
  • Word shapes: 35-year: d-x → JJ
— Then build a classifier to predict tag
  • Maxent P(t|w): 93.7% overall / 82.6% unknown
SLIDE 135

Overview: POS Tagging Accuracies

— Rough accuracies (overall / unknown words):
  • Most freq tag:              ~90% / ~50%
  • Trigram HMM:                ~95% / ~55%
  • Maxent P(t|w):              93.7% / 82.6%
  • TnT (HMM++):                96.2% / 86.0%
  • MEMM tagger:                96.9% / 86.9%
  • Bidirectional dependencies: 97.2% / 90.0%
  • Upper bound:                ~98% (human agreement)

Most errors are on unknown words.

SLIDE 136

POS tagging as a sequence classification task

— We are given a sentence (an “observation” or “sequence of observations”)
  • Secretariat is expected to race tomorrow
  • She promised to back the bill
— What is the best sequence of tags which corresponds to this sequence of observations?
— Probabilistic view:
  • Consider all possible sequences of tags
  • Out of this universe of sequences, choose the tag sequence which is most probable given the observation sequence of n words w1…wn.
SLIDE 137

How do we apply classification to sequences?

SLIDE 138

Sequence Labeling as Classification

— Classify each token independently but use as input features information about the surrounding tokens (sliding window).

John saw the saw and decided to take it to the table.
classifier → NNP

Slide from Ray Mooney
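A sketch of what "sliding window" means concretely (illustrative; `classify` stands in for any trained classifier that maps a feature dict to a tag):

```python
def window_features(tokens, i, width=2):
    """Features for token i: the token plus its neighbors in a +/- width window."""
    feats = {'word': tokens[i].lower()}
    for d in range(-width, width + 1):
        j = i + d
        feats['word@%+d' % d] = tokens[j].lower() if 0 <= j < len(tokens) else '<pad>'
    return feats

def tag_independently(tokens, classify):
    # Each token is classified on its own; neighboring *tags* are not used.
    return [classify(window_features(tokens, i)) for i in range(len(tokens))]
```

Forward and backward classification (later slides) differ only in also feeding the previously predicted tags into the feature dict.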

SLIDE 139

Sequence Labeling as Classification

— Classify each token independently but use as input features information about the surrounding tokens (sliding window).

John saw the saw and decided to take it to the table.
classifier → VBD

Slide from Ray Mooney

SLIDE 140

Sequence Labeling as Classification

— Classify each token independently but use as input features information about the surrounding tokens (sliding window).

John saw the saw and decided to take it to the table.
classifier → DT

Slide from Ray Mooney

SLIDE 141

Sequence Labeling as Classification

— Classify each token independently but use as input features information about the surrounding tokens (sliding window).

John saw the saw and decided to take it to the table.
classifier → NN

Slide from Ray Mooney

SLIDE 142

Sequence Labeling as Classification

— Classify each token independently but use as input features information about the surrounding tokens (sliding window).

John saw the saw and decided to take it to the table.
classifier → CC

Slide from Ray Mooney

SLIDE 143

Sequence Labeling as Classification

— Classify each token independently but use as input features information about the surrounding tokens (sliding window).

John saw the saw and decided to take it to the table.
classifier → VBD

Slide from Ray Mooney

SLIDE 144

Sequence Labeling as Classification

— Classify each token independently but use as input features information about the surrounding tokens (sliding window).

John saw the saw and decided to take it to the table.
classifier → TO

Slide from Ray Mooney

SLIDE 145

Sequence Labeling as Classification

— Classify each token independently but use as input features information about the surrounding tokens (sliding window).

John saw the saw and decided to take it to the table.
classifier → VB

Slide from Ray Mooney

SLIDE 146

Sequence Labeling as Classification

— Classify each token independently but use as input features information about the surrounding tokens (sliding window).

John saw the saw and decided to take it to the table.
classifier → PRP

Slide from Ray Mooney

SLIDE 147

Sequence Labeling as Classification

— Classify each token independently but use as input features information about the surrounding tokens (sliding window).

John saw the saw and decided to take it to the table.
classifier → IN

Slide from Ray Mooney

SLIDE 148

Sequence Labeling as Classification

— Classify each token independently but use as input features information about the surrounding tokens (sliding window).

John saw the saw and decided to take it to the table.
classifier → DT

Slide from Ray Mooney

SLIDE 149

Sequence Labeling as Classification

— Classify each token independently but use as input features information about the surrounding tokens (sliding window).

John saw the saw and decided to take it to the table.
classifier → NN

Slide from Ray Mooney

SLIDE 150

Sequence Labeling as Classification Using Outputs as Inputs

— Better input features are usually the categories of the surrounding tokens, but these are not available yet.
— Can use the category of either the preceding or succeeding tokens by going forward or back and using previous output.

Slide from Ray Mooney

SLIDE 151

Forward Classification

John saw the saw and decided to take it to the table.
classifier → NNP

Slide from Ray Mooney

SLIDE 152

Forward Classification

NNP
John saw the saw and decided to take it to the table.
classifier → VBD

Slide from Ray Mooney

SLIDE 153

Forward Classification

NNP VBD
John saw the saw and decided to take it to the table.
classifier → DT

Slide from Ray Mooney

SLIDE 154

Forward Classification

NNP VBD DT
John saw the saw and decided to take it to the table.
classifier → NN

Slide from Ray Mooney

SLIDE 155

Forward Classification

NNP VBD DT NN
John saw the saw and decided to take it to the table.
classifier → CC

Slide from Ray Mooney

SLIDE 156

Forward Classification

NNP VBD DT NN CC
John saw the saw and decided to take it to the table.
classifier → VBD

Slide from Ray Mooney

SLIDE 157

Forward Classification

NNP VBD DT NN CC VBD
John saw the saw and decided to take it to the table.
classifier → TO

Slide from Ray Mooney

SLIDE 158

Forward Classification

NNP VBD DT NN CC VBD TO
John saw the saw and decided to take it to the table.
classifier → VB

Slide from Ray Mooney

SLIDE 159

Backward Classification

— Disambiguating “to” in this case would be even easier backward.

DT NN
John saw the saw and decided to take it to the table.
classifier → IN

Slide from Ray Mooney

SLIDE 160

Backward Classification

— Disambiguating “to” in this case would be even easier backward.

IN DT NN
John saw the saw and decided to take it to the table.
classifier → PRP

Slide from Ray Mooney

SLIDE 161

Backward Classification

— Disambiguating “to” in this case would be even easier backward.

PRP IN DT NN
John saw the saw and decided to take it to the table.
classifier → VB

SLIDE 162

Backward Classification

— Disambiguating “to” in this case would be even easier backward.

VB PRP IN DT NN
John saw the saw and decided to take it to the table.
classifier → TO

Slide from Ray Mooney

SLIDE 163

Backward Classification

— Disambiguating “to” in this case would be even easier backward.

TO VB PRP IN DT NN
John saw the saw and decided to take it to the table.
classifier → VBD

Slide from Ray Mooney

SLIDE 164

Backward Classification

— Disambiguating “to” in this case would be even easier backward.

VBD TO VB PRP IN DT NN
John saw the saw and decided to take it to the table.
classifier → CC

Slide from Ray Mooney

SLIDE 165

Backward Classification

— Disambiguating “to” in this case would be even easier backward.

CC VBD TO VB PRP IN DT NN
John saw the saw and decided to take it to the table.
classifier → VBD

Slide from Ray Mooney

SLIDE 166

Backward Classification

— Disambiguating “to” in this case would be even easier backward.

VBD CC VBD TO VB PRP IN DT NN
John saw the saw and decided to take it to the table.
classifier → DT

Slide from Ray Mooney

SLIDE 167

Backward Classification

— Disambiguating “to” in this case would be even easier backward.

DT VBD CC VBD TO VB PRP IN DT NN
John saw the saw and decided to take it to the table.
classifier → VBD

Slide from Ray Mooney

SLIDE 168

Backward Classification

— Disambiguating “to” in this case would be even easier backward.

VBD DT VBD CC VBD TO VB PRP IN DT NN
John saw the saw and decided to take it to the table.
classifier → NNP

Slide from Ray Mooney

SLIDE 169

The Maximum Entropy Markov Model (MEMM)

— A sequence version of the logistic regression (also called maximum entropy) classifier.
— Find the best series of tags:
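In symbols, the MEMM picks the tag sequence directly from the conditional model (this is the standard MEMM decoding criterion; ti are tags, wi words):

T̂ = argmaxT P(T | W) = argmaxT Πi P(ti | wi, ti-1)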

SLIDE 170

The Maximum Entropy Markov Model (MEMM)

(Diagram: tagging “Janet will back the bill”; at the current word the classifier conditions on the features wi-1, wi, wi+1, ti-1, ti-2, with <s> at the start and NNP, MD already assigned to “Janet” and “will”, VB a candidate for “back”.)

SLIDE 171

Features for the classifier at each tag

(Same diagram as the previous slide: the feature window wi-1, wi, wi+1, ti-1, ti-2 around the current word of “Janet will back the bill”.)

SLIDE 172

More features

SLIDE 173

MEMM computes the best tag sequence

SLIDE 174

MEMM Decoding

— Simplest algorithm:
— What we use in practice: the Viterbi algorithm
— A version of the same dynamic programming algorithm we used to compute minimum edit distance.

SLIDE 175

The Stanford Tagger

— Is a bidirectional version of the MEMM called a cyclic dependency network
— Stanford tagger: http://nlp.stanford.edu/software/tagger.shtml