Hidden Markov Models: COSI 114 Computational Linguistics



slide-1
SLIDE 1

Hidden Markov Models

COSI 114 – Computational Linguistics
James Pustejovsky
February 10, 2015
Brandeis University

Slides thanks to David Blei

slide-2
SLIDE 2
Markov Models

  • Set of states: $\{s_1, s_2, \dots, s_N\}$
  • Process moves from one state to another, generating a sequence of states: $s_{i_1}, s_{i_2}, \dots, s_{i_k}, \dots$
  • Markov chain property: the probability of each subsequent state depends only on the previous state:
    $P(s_{i_k} \mid s_{i_1}, s_{i_2}, \dots, s_{i_{k-1}}) = P(s_{i_k} \mid s_{i_{k-1}})$
  • To define a Markov model, the following probabilities have to be specified: transition probabilities $a_{ij} = P(s_i \mid s_j)$ and initial probabilities $\pi_i = P(s_i)$

slide-3
SLIDE 3

Example of Markov Model

[Figure: state diagram with two states, ‘Rain’ and ‘Dry’, and arcs labeled 0.3, 0.7, 0.2, 0.8]

  • Two states: ‘Rain’ and ‘Dry’.
  • Transition probabilities: P(‘Rain’|‘Rain’)=0.3, P(‘Dry’|‘Rain’)=0.7, P(‘Rain’|‘Dry’)=0.2, P(‘Dry’|‘Dry’)=0.8
  • Initial probabilities: say P(‘Rain’)=0.4, P(‘Dry’)=0.6.

slide-4
SLIDE 4
Calculation of sequence probability

  • By the Markov chain property, the probability of a state sequence can be found by the formula:

    $P(s_{i_1}, s_{i_2}, \dots, s_{i_k}) = P(s_{i_k} \mid s_{i_1}, s_{i_2}, \dots, s_{i_{k-1}}) \, P(s_{i_1}, s_{i_2}, \dots, s_{i_{k-1}})$
    $= P(s_{i_k} \mid s_{i_{k-1}}) \, P(s_{i_1}, s_{i_2}, \dots, s_{i_{k-1}}) = \dots$
    $= P(s_{i_k} \mid s_{i_{k-1}}) \, P(s_{i_{k-1}} \mid s_{i_{k-2}}) \cdots P(s_{i_2} \mid s_{i_1}) \, P(s_{i_1})$

  • Suppose we want to calculate the probability of a sequence of states in our example, {‘Dry’, ‘Dry’, ‘Rain’, ‘Rain’}:

    P({‘Dry’, ‘Dry’, ‘Rain’, ‘Rain’}) = P(‘Rain’|‘Rain’) P(‘Rain’|‘Dry’) P(‘Dry’|‘Dry’) P(‘Dry’) = 0.3 * 0.2 * 0.8 * 0.6 = 0.0288
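
The chain-rule product for a state sequence can be checked with a few lines of Python. This is a minimal sketch rather than code from the lecture: the names `init`, `trans`, and `sequence_probability` are illustrative assumptions, and `trans[prev][nxt]` is read as P(nxt | prev), using the Rain/Dry numbers from the example.

```python
# Rain/Dry Markov chain from the example slide.
# init[s] = P(s) at the first step; trans[prev][nxt] = P(nxt | prev).
init = {"Rain": 0.4, "Dry": 0.6}
trans = {
    "Rain": {"Rain": 0.3, "Dry": 0.7},
    "Dry": {"Rain": 0.2, "Dry": 0.8},
}


def sequence_probability(states):
    """Multiply the initial probability by one transition factor per step."""
    prob = init[states[0]]
    for prev, nxt in zip(states, states[1:]):
        prob *= trans[prev][nxt]
    return prob


p = sequence_probability(["Dry", "Dry", "Rain", "Rain"])
print(p)  # 0.6 * 0.8 * 0.2 * 0.3, approximately 0.0288
```

The loop visits the transitions in time order, which is the same product as on the slide read right to left.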

slide-5
SLIDE 5

Hidden Markov models

  • Set of states: $\{s_1, s_2, \dots, s_N\}$
  • Process moves from one state to another, generating a sequence of states: $s_{i_1}, s_{i_2}, \dots, s_{i_k}, \dots$
  • Markov chain property: the probability of each subsequent state depends only on the previous state:
    $P(s_{i_k} \mid s_{i_1}, s_{i_2}, \dots, s_{i_{k-1}}) = P(s_{i_k} \mid s_{i_{k-1}})$
  • States are not visible, but each state randomly generates one of M observations (or visible states): $\{v_1, v_2, \dots, v_M\}$
  • To define a hidden Markov model, the following probabilities have to be specified: the matrix of transition probabilities $A = (a_{ij})$, $a_{ij} = P(s_i \mid s_j)$; the matrix of observation probabilities $B = (b_i(v_m))$, $b_i(v_m) = P(v_m \mid s_i)$; and a vector of initial probabilities $\pi = (\pi_i)$, $\pi_i = P(s_i)$. The model is represented by $M = (A, B, \pi)$.

slide-6
SLIDE 6

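Example of Hidden Markov Model

[Figure: hidden states ‘Low’ and ‘High’ with transition arcs labeled 0.3, 0.7, 0.2, 0.8, and emission arcs to the observations ‘Rain’ and ‘Dry’ labeled 0.6, 0.6, 0.4, 0.4]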

slide-10
SLIDE 10
Example of Hidden Markov Model

  • Two states: ‘Low’ and ‘High’ atmospheric pressure.
  • Two observations: ‘Rain’ and ‘Dry’.
  • Transition probabilities: P(‘Low’|‘Low’)=0.3, P(‘High’|‘Low’)=0.7, P(‘Low’|‘High’)=0.2, P(‘High’|‘High’)=0.8
  • Observation probabilities: P(‘Rain’|‘Low’)=0.6, P(‘Dry’|‘Low’)=0.4, P(‘Rain’|‘High’)=0.4, P(‘Dry’|‘High’)=0.6.
  • Initial probabilities: say P(‘Low’)=0.4, P(‘High’)=0.6.

slide-11
SLIDE 11

What is an HMM?

  • Graphical model
  • Circles indicate states
  • Arrows indicate probabilistic dependencies between states

slide-12
SLIDE 12

What is an HMM?

  • Green circles are hidden states
  • Dependent only on the previous state
  • “The past is independent of the future given the present.”

slide-13
SLIDE 13

What is an HMM?

  • Purple nodes are observed states
  • Dependent only on their corresponding hidden state

slide-14
SLIDE 14

HMM Formalism

  • $\{S, K, \Pi, A, B\}$
  • $S : \{s_1 \dots s_N\}$ are the values for the hidden states
  • $K : \{k_1 \dots k_M\}$ are the values for the observations

[Figure: chain of hidden states S, each emitting an observation K]

slide-15
SLIDE 15

HMM Formalism

  • $\{S, K, \Pi, A, B\}$
  • $\Pi = \{\pi_i\}$ are the initial state probabilities
  • $A = \{a_{ij}\}$ are the state transition probabilities
  • $B = \{b_{ik}\}$ are the observation state probabilities

[Figure: chain of hidden states S linked by transitions A, each emitting an observation K through B]

slide-16
SLIDE 16

Inference in an HMM

  • Compute the probability of a given observation sequence
  • Given an observation sequence, compute the most likely hidden state sequence
  • Given an observation sequence and a set of possible models, which model most closely fits the data?

slide-17
SLIDE 17

Decoding

Given an observation sequence and a model, compute the probability of the observation sequence:

$O = (o_1 \dots o_T)$, $\mu = (A, B, \Pi)$

Compute $P(O \mid \mu)$

[Figure: observations $o_1, \dots, o_{t-1}, o_t, o_{t+1}, \dots, o_T$]

slide-18
SLIDE 18

Decoding

$P(O \mid X, \mu) = b_{x_1 o_1} b_{x_2 o_2} \cdots b_{x_T o_T}$

[Figure: hidden states $x_1, \dots, x_{t-1}, x_t, x_{t+1}, \dots, x_T$ emitting observations $o_1, \dots, o_T$]

slide-19
SLIDE 19

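Decoding

$P(O \mid X, \mu) = b_{x_1 o_1} b_{x_2 o_2} \cdots b_{x_T o_T}$

$P(X \mid \mu) = \pi_{x_1} a_{x_1 x_2} a_{x_2 x_3} \cdots a_{x_{T-1} x_T}$

[Figure: hidden states $x_1, \dots, x_{t-1}, x_t, x_{t+1}, \dots, x_T$ emitting observations $o_1, \dots, o_T$]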

slide-20
SLIDE 20

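Decoding

$P(O \mid X, \mu) = b_{x_1 o_1} b_{x_2 o_2} \cdots b_{x_T o_T}$

$P(X \mid \mu) = \pi_{x_1} a_{x_1 x_2} a_{x_2 x_3} \cdots a_{x_{T-1} x_T}$

$P(O, X \mid \mu) = P(O \mid X, \mu) \, P(X \mid \mu)$

[Figure: hidden states $x_1, \dots, x_{t-1}, x_t, x_{t+1}, \dots, x_T$ emitting observations $o_1, \dots, o_T$]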

slide-21
SLIDE 21

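Decoding

$P(O \mid X, \mu) = b_{x_1 o_1} b_{x_2 o_2} \cdots b_{x_T o_T}$

$P(X \mid \mu) = \pi_{x_1} a_{x_1 x_2} a_{x_2 x_3} \cdots a_{x_{T-1} x_T}$

$P(O, X \mid \mu) = P(O \mid X, \mu) \, P(X \mid \mu)$

$P(O \mid \mu) = \sum_{X} P(O \mid X, \mu) \, P(X \mid \mu)$

[Figure: hidden states $x_1, \dots, x_{t-1}, x_t, x_{t+1}, \dots, x_T$ emitting observations $o_1, \dots, o_T$]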

slide-22
SLIDE 22

Decoding

$P(O \mid \mu) = \sum_{\{x_1 \dots x_T\}} \pi_{x_1} b_{x_1 o_1} \prod_{t=1}^{T-1} a_{x_t x_{t+1}} b_{x_{t+1} o_{t+1}}$

[Figure: hidden states $x_1, \dots, x_{t-1}, x_t, x_{t+1}, \dots, x_T$ emitting observations $o_1, \dots, o_T$]

slide-23
SLIDE 23

Forward Procedure

  • Special structure gives us an efficient solution using dynamic programming.
  • Intuition: the probability of the first t observations is the same for all possible state sequences of length t+1.
  • Define: $\alpha_i(t) = P(o_1 \dots o_t, x_t = i \mid \mu)$

[Figure: hidden states $x_1, \dots, x_{t-1}, x_t, x_{t+1}, \dots, x_T$ emitting observations $o_1, \dots, o_T$]
slide-24
SLIDE 24

Forward Procedure

$\alpha_j(t+1) = P(o_1 \dots o_{t+1}, x_{t+1} = j)$
$= P(o_1 \dots o_{t+1} \mid x_{t+1} = j) \, P(x_{t+1} = j)$
$= P(o_1 \dots o_t \mid x_{t+1} = j) \, P(o_{t+1} \mid x_{t+1} = j) \, P(x_{t+1} = j)$
$= P(o_1 \dots o_t, x_{t+1} = j) \, P(o_{t+1} \mid x_{t+1} = j)$

[Figure: hidden states $x_1, \dots, x_{t-1}, x_t, x_{t+1}, \dots, x_T$ emitting observations $o_1, \dots, o_T$]

slide-28
SLIDE 28

Forward Procedure

$\alpha_j(t+1) = \sum_{i=1 \dots N} P(o_1 \dots o_t, x_t = i, x_{t+1} = j) \, P(o_{t+1} \mid x_{t+1} = j)$
$= \sum_{i=1 \dots N} P(o_1 \dots o_t, x_{t+1} = j \mid x_t = i) \, P(x_t = i) \, P(o_{t+1} \mid x_{t+1} = j)$
$= \sum_{i=1 \dots N} P(o_1 \dots o_t, x_t = i) \, P(x_{t+1} = j \mid x_t = i) \, P(o_{t+1} \mid x_{t+1} = j)$
$= \sum_{i=1 \dots N} \alpha_i(t) \, a_{ij} \, b_{j o_{t+1}}$

[Figure: hidden states $x_1, \dots, x_{t-1}, x_t, x_{t+1}, \dots, x_T$ emitting observations $o_1, \dots, o_T$]

slide-32
SLIDE 32

Backward Procedure

$\beta_i(t) = P(o_t \dots o_T \mid x_t = i)$

$\beta_i(T+1) = 1$

$\beta_i(t) = \sum_{j=1 \dots N} a_{ij} \, b_{i o_t} \, \beta_j(t+1)$

Probability of the rest of the states given the first state.

[Figure: hidden states $x_1, \dots, x_{t-1}, x_t, x_{t+1}, \dots, x_T$ emitting observations $o_1, \dots, o_T$]

slide-33
SLIDE 33
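Decoding Solution

  • Forward Procedure: $P(O \mid \mu) = \sum_{i=1}^{N} \alpha_i(T)$
  • Backward Procedure: $P(O \mid \mu) = \sum_{i=1}^{N} \pi_i \, \beta_i(1)$
  • Combination: $P(O \mid \mu) = \sum_{i=1}^{N} \alpha_i(t) \, \beta_i(t)$

[Figure: hidden states $x_1, \dots, x_{t-1}, x_t, x_{t+1}, \dots, x_T$ emitting observations $o_1, \dots, o_T$]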

slide-34
SLIDE 34
Best State Sequence

  • Find the state sequence that best explains the observations
  • Viterbi algorithm
  • $\operatorname{argmax}_X P(X \mid O)$

slide-35
SLIDE 35
Viterbi Algorithm

$\delta_j(t) = \max_{x_1 \dots x_{t-1}} P(x_1 \dots x_{t-1}, o_1 \dots o_{t-1}, x_t = j, o_t)$

The state sequence which maximizes the probability of seeing the observations to time t-1, landing in state j, and seeing the observation at time t.

[Figure: hidden states $x_1, \dots, x_{t-1}$ leading into state j, above observations $o_1, \dots, o_T$]

slide-36
SLIDE 36
Viterbi Algorithm

$\delta_j(t) = \max_{x_1 \dots x_{t-1}} P(x_1 \dots x_{t-1}, o_1 \dots o_{t-1}, x_t = j, o_t)$

Recursive Computation

$\delta_j(t+1) = \max_i \delta_i(t) \, a_{ij} \, b_{j o_{t+1}}$

$\psi_j(t+1) = \operatorname{argmax}_i \delta_i(t) \, a_{ij} \, b_{j o_{t+1}}$

[Figure: hidden states $x_1, \dots, x_{t-1}, x_t, x_{t+1}$ above observations $o_1, \dots, o_T$]

slide-37
SLIDE 37
Viterbi Algorithm

Compute the most likely state sequence by working backwards:

$\hat{X}_T = \operatorname{argmax}_i \delta_i(T)$

$\hat{X}_t = \psi_{\hat{X}_{t+1}}(t+1)$

$P(\hat{X}) = \max_i \delta_i(T)$

[Figure: hidden states $x_1, \dots, x_{t-1}, x_t, x_{t+1}, \dots, x_T$ above observations $o_1, \dots, o_T$]

slide-38
SLIDE 38
Parameter Estimation

  • Given an observation sequence, find the model that is most likely to produce that sequence.
  • No analytic method.
  • Given a model and observation sequence, update the model parameters to better fit the observations.

[Figure: hidden state chain with transitions A and emissions B to observations $o_1, \dots, o_T$]

slide-39
SLIDE 39
Parameter Estimation

Probability of traversing an arc:

$p_t(i, j) = \dfrac{\alpha_i(t) \, a_{ij} \, b_{j o_{t+1}} \, \beta_j(t+1)}{\sum_{m=1 \dots N} \alpha_m(t) \, \beta_m(t)}$

Probability of being in state i:

$\gamma_i(t) = \sum_{j=1 \dots N} p_t(i, j)$

[Figure: hidden state chain with transitions A and emissions B to observations $o_1, \dots, o_T$]

slide-40
SLIDE 40
Parameter Estimation

Now we can compute the new estimates of the model parameters.

$\hat{\pi}_i = \gamma_i(1)$

$\hat{a}_{ij} = \dfrac{\sum_{t=1}^{T} p_t(i, j)}{\sum_{t=1}^{T} \gamma_i(t)}$

$\hat{b}_{ik} = \dfrac{\sum_{\{t : o_t = k\}} \gamma_i(t)}{\sum_{t=1}^{T} \gamma_i(t)}$

[Figure: hidden state chain with transitions A and emissions B to observations $o_1, \dots, o_T$]

slide-41
SLIDE 41

HMM Applications

  • Generating parameters for n-gram models
  • Tagging speech
  • Speech recognition

slide-42
SLIDE 42
The Most Important Thing

We can use the special structure of this model to do a lot of neat math and solve problems that would otherwise be intractable.

[Figure: hidden state chain with transitions A and emissions B to observations $o_1, \dots, o_T$]

slide-48
SLIDE 48
Calculation of observation sequence probability

  • Suppose we want to calculate the probability of a sequence of observations in our example, {‘Dry’, ‘Rain’}.
  • Consider all possible hidden state sequences:

    P({‘Dry’, ‘Rain’}) = P({‘Dry’, ‘Rain’}, {‘Low’, ‘Low’}) + P({‘Dry’, ‘Rain’}, {‘Low’, ‘High’}) + P({‘Dry’, ‘Rain’}, {‘High’, ‘Low’}) + P({‘Dry’, ‘Rain’}, {‘High’, ‘High’})

    where the first term is:

    P({‘Dry’, ‘Rain’}, {‘Low’, ‘Low’}) = P({‘Dry’, ‘Rain’} | {‘Low’, ‘Low’}) P({‘Low’, ‘Low’}) = P(‘Dry’|‘Low’) P(‘Rain’|‘Low’) P(‘Low’) P(‘Low’|‘Low’) = 0.4 * 0.6 * 0.4 * 0.3 = 0.0288
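
The sum over all hidden state sequences can be written out directly for a short observation sequence. The following is a minimal brute-force sketch, not code from the lecture: the function names are illustrative assumptions, `trans[i][j]` is read as the probability of moving from state i to state j, and P(‘Dry’|‘High’) is assumed to be 0.6 so that each emission row sums to 1.

```python
from itertools import product

# Low/High pressure HMM from the example (emission for 'High' assumed 0.4/0.6).
states = ["Low", "High"]
init = {"Low": 0.4, "High": 0.6}
trans = {"Low": {"Low": 0.3, "High": 0.7}, "High": {"Low": 0.2, "High": 0.8}}
emit = {"Low": {"Rain": 0.6, "Dry": 0.4}, "High": {"Rain": 0.4, "Dry": 0.6}}


def joint_probability(path, obs):
    """P(obs, path): initial * emission, then transition * emission per step."""
    prob = init[path[0]] * emit[path[0]][obs[0]]
    for k in range(1, len(obs)):
        prob *= trans[path[k - 1]][path[k]] * emit[path[k]][obs[k]]
    return prob


def brute_force_likelihood(obs):
    """Sum the joint probability over every possible hidden state sequence."""
    return sum(
        joint_probability(path, obs)
        for path in product(states, repeat=len(obs))
    )


first_term = joint_probability(("Low", "Low"), ["Dry", "Rain"])
total = brute_force_likelihood(["Dry", "Rain"])
print(first_term, total)
```

The enumeration visits N^K paths for N states and K observations, so it is only practical for very short sequences.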

slide-49
SLIDE 49

Main issues using HMMs:

  • Evaluation problem. Given the HMM M=(A, B, π) and the observation sequence O=o1 o2 ... oK, calculate the probability that model M has generated sequence O.
  • Decoding problem. Given the HMM M=(A, B, π) and the observation sequence O=o1 o2 ... oK, calculate the most likely sequence of hidden states si that produced this observation sequence O.
  • Learning problem. Given some training observation sequences O=o1 o2 ... oK and the general structure of the HMM (numbers of hidden and visible states), determine the HMM parameters M=(A, B, π) that best fit the training data.

O=o1...oK denotes a sequence of observations ok∈{v1,…,vM}.

slide-50
SLIDE 50
Word recognition example(1).

  • Typed word recognition; assume all characters are separated.
  • A character recognizer outputs the probability of the image being a particular character, P(image|character).

[Figure: a character image (observation) scored against candidate characters (hidden states); scores 0.5, 0.03, 0.005, 0.31 are listed for the characters a, b, c, z]

slide-51
SLIDE 51
Word recognition example(2).

  • Hidden states of HMM = characters.
  • Observations = typed images of characters segmented from the image, $v_\alpha$. Note that there is an infinite number of observations.
  • Observation probabilities = character recognizer scores:
    $B = (b_i(v_\alpha)) = (P(v_\alpha \mid s_i))$
  • Transition probabilities will be defined differently in the two subsequent models.

slide-52
SLIDE 52
Word recognition example(3).

  • If a lexicon is given, we can construct a separate HMM for each lexicon word.

[Figure: one left-to-right HMM per lexicon word, e.g. ‘Amherst’ (a m h e r s t) and ‘Buffalo’ (b u f f a l o), with example values 0.5, 0.03, 0.4, 0.6]

  • Here recognition of a word image is equivalent to the problem of evaluating a few HMMs.
  • This is an application of the Evaluation problem.

slide-53
SLIDE 53
Word recognition example(4).

  • We can construct a single HMM for all words.
  • Hidden states = all characters in the alphabet.
  • Transition probabilities and initial probabilities are calculated from a language model.
  • Observations and observation probabilities are as before.

[Figure: a single HMM whose states are characters, e.g. a, m, h, e, r, s, t, b, v, f]

  • Here we have to determine the best sequence of hidden states, the one that most likely produced the word image.
  • This is an application of the Decoding problem.

slide-54
SLIDE 54
Character recognition with HMM example.

  • The structure of hidden states is chosen.
  • Observations are feature vectors extracted from vertical slices.
  • Probabilistic mapping from hidden state to feature vectors:
      1. Use a mixture of Gaussian models.
      2. Quantize the feature vector space.

slide-55
SLIDE 55
Exercise: character recognition with HMM(1)

  • The structure of hidden states:

[Figure: left-to-right chain of states s1, s2, s3]

  • Observation = number of islands in the vertical slice.
  • HMM for character ‘A’:

    Transition probabilities: $\{a_{ij}\} = \begin{pmatrix} .8 & .2 & 0 \\ 0 & .8 & .2 \\ 0 & 0 & 1 \end{pmatrix}$

    Observation probabilities: $\{b_{jk}\} = \begin{pmatrix} .9 & .1 & 0 \\ .1 & .8 & .1 \\ .9 & .1 & 0 \end{pmatrix}$

  • HMM for character ‘B’:

    Transition probabilities: $\{a_{ij}\} = \begin{pmatrix} .8 & .2 & 0 \\ 0 & .8 & .2 \\ 0 & 0 & 1 \end{pmatrix}$

    Observation probabilities: $\{b_{jk}\} = \begin{pmatrix} .9 & .1 & 0 \\ 0 & .2 & .8 \\ .6 & .4 & 0 \end{pmatrix}$

slide-56
SLIDE 56
Exercise: character recognition with HMM(2)

  • Suppose that after character image segmentation the following sequence of island numbers in 4 slices was observed: {1, 3, 2, 1}
  • Which HMM is more likely to generate this observation sequence, the HMM for ‘A’ or the HMM for ‘B’?

slide-57
SLIDE 57

Exercise: character recognition with HMM(3)

Consider the likelihood of generating the given observation sequence for each possible sequence of hidden states:

  • HMM for character ‘A’:

    Hidden state sequence   Transition probabilities   Observation probabilities   Product
    s1 → s1 → s2 → s3       .8 * .2 * .2               .9 * 0 * .8 * .9            0
    s1 → s2 → s2 → s3       .2 * .8 * .2               .9 * .1 * .8 * .9           0.0020736
    s1 → s2 → s3 → s3       .2 * .2 * 1                .9 * .1 * .1 * .9           0.000324
    Total                                                                          0.0023976

  • HMM for character ‘B’:

    Hidden state sequence   Transition probabilities   Observation probabilities   Product
    s1 → s1 → s2 → s3       .8 * .2 * .2               .9 * 0 * .2 * .6            0
    s1 → s2 → s2 → s3       .2 * .8 * .2               .9 * .8 * .2 * .6           0.0027648
    s1 → s2 → s3 → s3       .2 * .2 * 1                .9 * .8 * .4 * .6           0.006912
    Total                                                                          0.0096768

Since 0.0096768 > 0.0023976, the HMM for ‘B’ is more likely to have generated the observation sequence.
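
The two totals in the exercise can be reproduced by enumerating paths. This is a sketch under stated assumptions, not the lecture's own code: the tables list only paths that start in s1 and end in s3, so the hypothetical helper `likelihood` applies the same restriction through its `start` and `end` arguments, and matrix rows are indexed by the from-state (or emitting state).

```python
from itertools import product

# Shared left-to-right transition matrix: A[i][j] = P(move from s_{i+1} to s_{j+1}).
A = [[0.8, 0.2, 0.0], [0.0, 0.8, 0.2], [0.0, 0.0, 1.0]]
# Observation matrices: row = state, column = number of islands (1, 2, 3).
B_FOR_A = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.1], [0.9, 0.1, 0.0]]
B_FOR_B = [[0.9, 0.1, 0.0], [0.0, 0.2, 0.8], [0.6, 0.4, 0.0]]


def likelihood(B, obs, start=0, end=2):
    """Sum path probabilities over paths from `start` to `end` (assumed constraint)."""
    total = 0.0
    for path in product(range(3), repeat=len(obs)):
        if path[0] != start or path[-1] != end:
            continue
        prob = B[path[0]][obs[0] - 1]  # emission in the first state
        for k in range(1, len(obs)):
            prob *= A[path[k - 1]][path[k]] * B[path[k]][obs[k] - 1]
        total += prob
    return total


obs = [1, 3, 2, 1]
like_a = likelihood(B_FOR_A, obs)
like_b = likelihood(B_FOR_B, obs)
print(like_a, like_b)
```

Paths that use a zero-probability transition contribute nothing, which is why only the three listed sequences matter.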

slide-58
SLIDE 58
Evaluation Problem.

  • Evaluation problem. Given the HMM M=(A, B, π) and the observation sequence O=o1 o2 ... oK, calculate the probability that model M has generated sequence O.
  • Trying to find the probability of observations O=o1 o2 ... oK by considering all hidden state sequences (as was done in the example) is impractical: there are $N^K$ hidden state sequences, so the complexity is exponential.
  • Use the Forward-Backward HMM algorithms for efficient calculations.
  • Define the forward variable $\alpha_k(i)$ as the joint probability of the partial observation sequence $o_1 o_2 \dots o_k$ and the hidden state at time k being $s_i$: $\alpha_k(i) = P(o_1 o_2 \dots o_k, q_k = s_i)$

slide-59
SLIDE 59

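Trellis representation of an HMM

[Figure: trellis with a column of states $s_1, s_2, \dots, s_i, \dots, s_N$ at each time step 1, …, k, k+1, …, K; arcs $a_{1j}, a_{2j}, a_{ij}, a_{Nj}$ lead from the states at time k into state $s_j$ at time k+1; the observations $o_1, \dots, o_k, o_{k+1}, \dots, o_K$ run along the bottom]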

slide-60
SLIDE 60
Forward recursion for HMM

  • Initialization:
    $\alpha_1(i) = P(o_1, q_1 = s_i) = \pi_i \, b_i(o_1)$, $1 \le i \le N$.
  • Forward recursion:
    $\alpha_{k+1}(j) = P(o_1 o_2 \dots o_{k+1}, q_{k+1} = s_j)$
    $= \sum_i P(o_1 o_2 \dots o_{k+1}, q_k = s_i, q_{k+1} = s_j)$
    $= \sum_i P(o_1 o_2 \dots o_k, q_k = s_i) \, a_{ij} \, b_j(o_{k+1})$
    $= \left[ \sum_i \alpha_k(i) \, a_{ij} \right] b_j(o_{k+1})$, $1 \le j \le N$, $1 \le k \le K-1$.
  • Termination:
    $P(o_1 o_2 \dots o_K) = \sum_i P(o_1 o_2 \dots o_K, q_K = s_i) = \sum_i \alpha_K(i)$
  • Complexity: $N^2 K$ operations.
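
The initialization, recursion, and termination steps can be sketched in Python. This is a minimal illustration rather than the lecture's implementation: the function name `forward` and the dictionary layout are assumptions, `trans[i][j]` is read as the probability of moving from state i to state j (the way a_ij is used in the recursion), and the Low/High parameters come from the earlier example with P(‘Dry’|‘High’) taken as 0.6 so that each emission row sums to 1.

```python
states = ["Low", "High"]
init = {"Low": 0.4, "High": 0.6}
trans = {"Low": {"Low": 0.3, "High": 0.7}, "High": {"Low": 0.2, "High": 0.8}}
emit = {"Low": {"Rain": 0.6, "Dry": 0.4}, "High": {"Rain": 0.4, "Dry": 0.6}}


def forward(obs, states, init, trans, emit):
    """Return (alpha, likelihood); alpha[k][s] covers observations obs[0..k]."""
    # Initialization: alpha_1(i) = pi_i * b_i(o_1)
    alpha = [{s: init[s] * emit[s][obs[0]] for s in states}]
    # Recursion: alpha_{k+1}(j) = [sum_i alpha_k(i) * a_ij] * b_j(o_{k+1})
    for o in obs[1:]:
        prev = alpha[-1]
        alpha.append({
            j: sum(prev[i] * trans[i][j] for i in states) * emit[j][o]
            for j in states
        })
    # Termination: P(O) = sum_i alpha_K(i)
    return alpha, sum(alpha[-1].values())


alpha, likelihood = forward(["Dry", "Rain"], states, init, trans, emit)
print(likelihood)
```

Each time step costs N^2 multiplications, which gives the N^2 K total noted above instead of the N^K cost of enumerating paths.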

slide-61
SLIDE 61
Backward recursion for HMM

  • Define the backward variable $\beta_k(i)$ as the conditional probability of the partial observation sequence $o_{k+1} o_{k+2} \dots o_K$ given that the hidden state at time k is $s_i$: $\beta_k(i) = P(o_{k+1} o_{k+2} \dots o_K \mid q_k = s_i)$
  • Initialization:
    $\beta_K(i) = 1$, $1 \le i \le N$.
  • Backward recursion:
    $\beta_k(j) = P(o_{k+1} o_{k+2} \dots o_K \mid q_k = s_j)$
    $= \sum_i P(o_{k+1} o_{k+2} \dots o_K, q_{k+1} = s_i \mid q_k = s_j)$
    $= \sum_i P(o_{k+2} o_{k+3} \dots o_K \mid q_{k+1} = s_i) \, a_{ji} \, b_i(o_{k+1})$
    $= \sum_i \beta_{k+1}(i) \, a_{ji} \, b_i(o_{k+1})$, $1 \le j \le N$, $1 \le k \le K-1$.
  • Termination:
    $P(o_1 o_2 \dots o_K) = \sum_i P(o_1 o_2 \dots o_K, q_1 = s_i)$
    $= \sum_i P(o_1 o_2 \dots o_K \mid q_1 = s_i) \, P(q_1 = s_i) = \sum_i \beta_1(i) \, b_i(o_1) \, \pi_i$
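
The backward pass can be sketched the same way. This is an illustrative sketch, not the lecture's code: the name `backward` and the dictionary layout are assumptions, `trans[j][i]` is read as the probability of moving from state j to state i, and the Low/High parameters are reused with P(‘Dry’|‘High’) assumed to be 0.6.

```python
states = ["Low", "High"]
init = {"Low": 0.4, "High": 0.6}
trans = {"Low": {"Low": 0.3, "High": 0.7}, "High": {"Low": 0.2, "High": 0.8}}
emit = {"Low": {"Rain": 0.6, "Dry": 0.4}, "High": {"Rain": 0.4, "Dry": 0.6}}


def backward(obs, states, init, trans, emit):
    """Return (beta, likelihood); beta[k][s] conditions on state s at position k."""
    # Initialization: beta_K(i) = 1
    beta = [{s: 1.0 for s in states}]
    # Recursion, from the last observation back to the second one:
    # beta_k(j) = sum_i a_ji * b_i(o_{k+1}) * beta_{k+1}(i)
    for o in reversed(obs[1:]):
        nxt = beta[0]
        beta.insert(0, {
            j: sum(trans[j][i] * emit[i][o] * nxt[i] for i in states)
            for j in states
        })
    # Termination: P(O) = sum_i pi_i * b_i(o_1) * beta_1(i)
    likelihood = sum(init[s] * emit[s][obs[0]] * beta[0][s] for s in states)
    return beta, likelihood


beta, likelihood = backward(["Dry", "Rain"], states, init, trans, emit)
print(likelihood)
```

For this two-step example the result agrees with the forward pass, which is a useful sanity check when implementing both.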

slide-62
SLIDE 62
Decoding problem

  • Decoding problem. Given the HMM M=(A, B, π) and the observation sequence O=o1 o2 ... oK, calculate the most likely sequence of hidden states si that produced this observation sequence.
  • We want to find the state sequence $Q = q_1 \dots q_K$ which maximizes $P(Q \mid o_1 o_2 \dots o_K)$, or equivalently $P(Q, o_1 o_2 \dots o_K)$.
  • Brute-force consideration of all paths takes exponential time. Use the efficient Viterbi algorithm instead.
  • Define the variable $\delta_k(i)$ as the maximum probability of producing the observation sequence $o_1 o_2 \dots o_k$ when moving along any hidden state sequence $q_1 \dots q_{k-1}$ and getting into $q_k = s_i$:

    $\delta_k(i) = \max P(q_1 \dots q_{k-1}, q_k = s_i, o_1 o_2 \dots o_k)$

    where the max is taken over all possible paths $q_1 \dots q_{k-1}$.

slide-63
SLIDE 63
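Viterbi algorithm (1)

  • General idea: if the best path ending in $q_k = s_j$ goes through $q_{k-1} = s_i$, then it should coincide with the best path ending in $q_{k-1} = s_i$.

[Figure: states $s_1, s_i, s_N$ at time k-1 connected to state $s_j$ at time k by arcs $a_{1j}, a_{ij}, a_{Nj}$]

  • $\delta_k(j) = \max P(q_1 \dots q_{k-1}, q_k = s_j, o_1 o_2 \dots o_k) = \max_i \left[ a_{ij} \, b_j(o_k) \, \max P(q_1 \dots q_{k-1} = s_i, o_1 o_2 \dots o_{k-1}) \right]$
  • To backtrack the best path, keep the information that the predecessor of $s_j$ was $s_i$.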

slide-64
SLIDE 64
Viterbi algorithm (2)

  • Initialization:
    $\delta_1(i) = \max P(q_1 = s_i, o_1) = \pi_i \, b_i(o_1)$, $1 \le i \le N$.
  • Forward recursion:
    $\delta_k(j) = \max P(q_1 \dots q_{k-1}, q_k = s_j, o_1 o_2 \dots o_k)$
    $= \max_i \left[ a_{ij} \, b_j(o_k) \, \max P(q_1 \dots q_{k-1} = s_i, o_1 o_2 \dots o_{k-1}) \right]$
    $= \max_i \left[ a_{ij} \, b_j(o_k) \, \delta_{k-1}(i) \right]$, $1 \le j \le N$, $2 \le k \le K$.
  • Termination: choose the best path ending at time K: $\max_i \left[ \delta_K(i) \right]$
  • Backtrack the best path.

This algorithm is similar to the forward recursion of the evaluation problem, with Σ replaced by max and additional backtracking.
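
A compact Python sketch of these steps follows. It is illustrative only: the name `viterbi`, the back-pointer list, and the dictionary layout are assumptions rather than the lecture's code, `trans[i][j]` is read as the probability of moving from state i to state j, and the Low/High parameters are reused with P(‘Dry’|‘High’) assumed to be 0.6.

```python
states = ["Low", "High"]
init = {"Low": 0.4, "High": 0.6}
trans = {"Low": {"Low": 0.3, "High": 0.7}, "High": {"Low": 0.2, "High": 0.8}}
emit = {"Low": {"Rain": 0.6, "Dry": 0.4}, "High": {"Rain": 0.4, "Dry": 0.6}}


def viterbi(obs, states, init, trans, emit):
    """Return (best_path, probability of that path jointly with obs)."""
    # Initialization: delta_1(i) = pi_i * b_i(o_1)
    delta = [{s: init[s] * emit[s][obs[0]] for s in states}]
    backptr = []  # backptr[k][j] = best predecessor of state j at step k + 1
    # Recursion: delta_k(j) = max_i [delta_{k-1}(i) * a_ij] * b_j(o_k)
    for o in obs[1:]:
        prev = delta[-1]
        step, ptr = {}, {}
        for j in states:
            best_i = max(states, key=lambda i: prev[i] * trans[i][j])
            step[j] = prev[best_i] * trans[best_i][j] * emit[j][o]
            ptr[j] = best_i
        delta.append(step)
        backptr.append(ptr)
    # Termination: pick the best final state, then follow the back-pointers.
    last = max(states, key=lambda s: delta[-1][s])
    path = [last]
    for ptr in reversed(backptr):
        path.insert(0, ptr[path[0]])
    return path, delta[-1][last]


best_path, best_prob = viterbi(["Dry", "Rain"], states, init, trans, emit)
print(best_path, best_prob)
```

The only structural differences from the forward pass are the `max` in place of the sum and the stored predecessors used for backtracking.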

slide-65
SLIDE 65
Learning problem (1)

  • Learning problem. Given some training observation sequences O=o1 o2 ... oK and the general structure of the HMM (numbers of hidden and visible states), determine the HMM parameters M=(A, B, π) that best fit the training data, that is, that maximize P(O | M).
  • There is no known algorithm that produces globally optimal parameter values.
  • Use an iterative expectation-maximization algorithm, the Baum-Welch algorithm, to find a local maximum of P(O | M).

slide-66
SLIDE 66
Learning problem (2)

  • If the training data has information about the sequence of hidden states (as in the word recognition example), then use maximum likelihood estimation of the parameters:

    $a_{ij} = P(s_i \mid s_j) = \dfrac{\text{Number of transitions from state } s_j \text{ to state } s_i}{\text{Number of transitions out of state } s_j}$

    $b_i(v_m) = P(v_m \mid s_i) = \dfrac{\text{Number of times observation } v_m \text{ occurs in state } s_i}{\text{Number of times in state } s_i}$

slide-67
SLIDE 67

Baum-Welch algorithm

General idea:

$a_{ij} = P(s_i \mid s_j) = \dfrac{\text{Expected number of transitions from state } s_j \text{ to state } s_i}{\text{Expected number of transitions out of state } s_j}$

$b_i(v_m) = P(v_m \mid s_i) = \dfrac{\text{Expected number of times observation } v_m \text{ occurs in state } s_i}{\text{Expected number of times in state } s_i}$

$\pi_i = P(s_i) = \text{Expected frequency in state } s_i \text{ at time } k = 1$.

slide-68
SLIDE 68
Baum-Welch algorithm: expectation step(1)

  • Define the variable $\xi_k(i,j)$ as the probability of being in state $s_i$ at time k and in state $s_j$ at time k+1, given the observation sequence $o_1 o_2 \dots o_K$:

    $\xi_k(i,j) = P(q_k = s_i, q_{k+1} = s_j \mid o_1 o_2 \dots o_K)$

    $\xi_k(i,j) = \dfrac{P(q_k = s_i, q_{k+1} = s_j, o_1 o_2 \dots o_K)}{P(o_1 o_2 \dots o_K)}$

    $= \dfrac{P(q_k = s_i, o_1 o_2 \dots o_k) \, a_{ij} \, b_j(o_{k+1}) \, P(o_{k+2} \dots o_K \mid q_{k+1} = s_j)}{P(o_1 o_2 \dots o_K)}$

    $= \dfrac{\alpha_k(i) \, a_{ij} \, b_j(o_{k+1}) \, \beta_{k+1}(j)}{\sum_i \sum_j \alpha_k(i) \, a_{ij} \, b_j(o_{k+1}) \, \beta_{k+1}(j)}$

slide-69
SLIDE 69
Baum-Welch algorithm: expectation step(2)

  • Define the variable $\gamma_k(i)$ as the probability of being in state $s_i$ at time k, given the observation sequence $o_1 o_2 \dots o_K$:

    $\gamma_k(i) = P(q_k = s_i \mid o_1 o_2 \dots o_K)$

    $\gamma_k(i) = \dfrac{P(q_k = s_i, o_1 o_2 \dots o_K)}{P(o_1 o_2 \dots o_K)} = \dfrac{\alpha_k(i) \, \beta_k(i)}{\sum_i \alpha_k(i) \, \beta_k(i)}$

slide-70
SLIDE 70
Baum-Welch algorithm: expectation step(3)

  • We calculated $\xi_k(i,j) = P(q_k = s_i, q_{k+1} = s_j \mid o_1 o_2 \dots o_K)$ and $\gamma_k(i) = P(q_k = s_i \mid o_1 o_2 \dots o_K)$
  • Expected number of transitions from state $s_i$ to state $s_j$ = $\sum_k \xi_k(i,j)$
  • Expected number of transitions out of state $s_i$ = $\sum_k \gamma_k(i)$
  • Expected number of times observation $v_m$ occurs in state $s_i$ = $\sum_k \gamma_k(i)$, where k is such that $o_k = v_m$
  • Expected frequency in state $s_i$ at time k=1: $\gamma_1(i)$.

slide-71
SLIDE 71

Baum-Welch algorithm: maximization step

$a_{ij} = \dfrac{\text{Expected number of transitions from state } s_i \text{ to state } s_j}{\text{Expected number of transitions out of state } s_i} = \dfrac{\sum_k \xi_k(i,j)}{\sum_k \gamma_k(i)}$

$b_i(v_m) = \dfrac{\text{Expected number of times observation } v_m \text{ occurs in state } s_i}{\text{Expected number of times in state } s_i} = \dfrac{\sum_{k:\, o_k = v_m} \gamma_k(i)}{\sum_k \gamma_k(i)}$

$\pi_i = (\text{Expected frequency in state } s_i \text{ at time } k = 1) = \gamma_1(i)$.
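
One expectation-maximization update can be sketched as follows for a single observation sequence. This is an illustrative sketch, not the lecture's implementation: all function names are assumptions, `trans[i][j]` is read as the probability of moving from state i to state j, the transition denominator sums γ over the first K-1 time steps (the steps that have an outgoing transition), and the starting parameters are the Low/High example with P(‘Dry’|‘High’) assumed to be 0.6.

```python
def forward(obs, states, init, trans, emit):
    alpha = [{s: init[s] * emit[s][obs[0]] for s in states}]
    for o in obs[1:]:
        prev = alpha[-1]
        alpha.append({j: sum(prev[i] * trans[i][j] for i in states) * emit[j][o]
                      for j in states})
    return alpha


def backward(obs, states, trans, emit):
    beta = [{s: 1.0 for s in states}]
    for o in reversed(obs[1:]):
        nxt = beta[0]
        beta.insert(0, {i: sum(trans[i][j] * emit[j][o] * nxt[j] for j in states)
                        for i in states})
    return beta


def baum_welch_step(obs, states, symbols, init, trans, emit):
    """One EM update; returns (new_init, new_trans, new_emit, old_likelihood)."""
    K = len(obs)
    alpha = forward(obs, states, init, trans, emit)
    beta = backward(obs, states, trans, emit)
    likelihood = sum(alpha[-1].values())
    # E-step: state posteriors gamma_k(i) and transition posteriors xi_k(i, j).
    gamma = [{i: alpha[k][i] * beta[k][i] / likelihood for i in states}
             for k in range(K)]
    xi = [{(i, j): alpha[k][i] * trans[i][j] * emit[j][obs[k + 1]]
           * beta[k + 1][j] / likelihood
           for i in states for j in states}
          for k in range(K - 1)]
    # M-step: ratios of expected counts.
    new_init = dict(gamma[0])
    new_trans = {i: {j: sum(x[(i, j)] for x in xi) / sum(g[i] for g in gamma[:-1])
                     for j in states} for i in states}
    new_emit = {i: {v: sum(g[i] for g, o in zip(gamma, obs) if o == v)
                    / sum(g[i] for g in gamma)
                    for v in symbols} for i in states}
    return new_init, new_trans, new_emit, likelihood


states, symbols = ["Low", "High"], ["Rain", "Dry"]
init = {"Low": 0.4, "High": 0.6}
trans = {"Low": {"Low": 0.3, "High": 0.7}, "High": {"Low": 0.2, "High": 0.8}}
emit = {"Low": {"Rain": 0.6, "Dry": 0.4}, "High": {"Rain": 0.4, "Dry": 0.6}}
obs = ["Dry", "Rain", "Rain", "Dry", "Dry"]

init1, trans1, emit1, like0 = baum_welch_step(obs, states, symbols, init, trans, emit)
_, _, _, like1 = baum_welch_step(obs, states, symbols, init1, trans1, emit1)
print(like0, like1)  # the likelihood should not decrease after an update
```

In practice the update is repeated until the likelihood stops improving, and long sequences need scaling or log-space arithmetic to avoid underflow, which this sketch omits.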

slide-72
SLIDE 72

The Noisy Channel Model

  • Search through the space of all possible sentences.
  • Pick the one that is most probable given the waveform.

slide-73
SLIDE 73

The Noisy Channel Model (II)

  • What is the most likely sentence out of all sentences in the language L given some acoustic input O?
  • Treat acoustic input O as a sequence of individual observations:
      – $O = o_1, o_2, o_3, \dots, o_t$
  • Define a sentence as a sequence of words:
      – $W = w_1, w_2, w_3, \dots, w_n$
slide-74
SLIDE 74

Noisy Channel Model (III)

  • Probabilistic implication: pick the most probable sentence W:
    $\hat{W} = \operatorname{argmax}_{W \in L} P(W \mid O)$
  • We can use Bayes' rule to rewrite this:
    $\hat{W} = \operatorname{argmax}_{W \in L} \dfrac{P(O \mid W) \, P(W)}{P(O)}$
  • Since the denominator is the same for each candidate sentence W, we can ignore it for the argmax:
    $\hat{W} = \operatorname{argmax}_{W \in L} P(O \mid W) \, P(W)$

slide-75
SLIDE 75

Noisy channel model

$\hat{W} = \operatorname{argmax}_{W \in L} \underbrace{P(O \mid W)}_{\text{likelihood}} \, \underbrace{P(W)}_{\text{prior}}$

slide-76
SLIDE 76

The noisy channel model

  • Ignoring the denominator leaves us with two factors: P(Source) and P(Signal | Source)

slide-77
SLIDE 77

Speech Architecture meets Noisy Channel

slide-78
SLIDE 78

HMMs for speech

slide-79
SLIDE 79

Phones are not homogeneous!

[Figure: spectrogram of the phones ‘ay’ and ‘k’; time axis in seconds from 0.48152 to 0.937203, frequency axis up to 5000]

slide-80
SLIDE 80

Each phone has 3 subphones

slide-81
SLIDE 81

Resulting HMM word model for “six”

slide-82
SLIDE 82

HMMs more formally

  • Markov chains
  • A kind of weighted finite-state automaton

slide-84
SLIDE 84

Another Markov chain

slide-85
SLIDE 85

Another view of Markov chains

slide-86
SLIDE 86

An example with numbers:

  • What is the probability of:
      – Hot hot hot hot
      – Cold hot cold hot
slide-87
SLIDE 87

Hidden Markov Models

slide-88
SLIDE 88

Hidden Markov Models

slide-89
SLIDE 89

Hidden Markov Models

  • Bakis network: left-to-right network
  • Ergodic network: fully connected network

slide-90
SLIDE 90

The Jason Eisner task

  • You are a climatologist in 2799 studying the history of global warming.
  • You can’t find records of the weather in Baltimore for summer 2006.
  • But you do find Jason Eisner’s diary,
  • which records how many ice creams he ate each day.
  • Can we use this to figure out the weather?
      – Given a sequence of observations O, each observation an integer = number of ice creams eaten,
      – figure out the correct hidden sequence Q of weather states (H or C) which caused Jason to eat the ice cream.
slide-91
SLIDE 91
slide-92
SLIDE 92

HMMs more formally

  • Three fundamental problems
      – Jack Ferguson at IDA in the 1960s

1) Given a specific HMM, determine the likelihood of an observation sequence.
2) Given an observation sequence and an HMM, discover the best (most probable) hidden state sequence.
3) Given only an observation sequence, learn the HMM parameters (A, B matrix).

slide-93
SLIDE 93

The Three Basic Problems for HMMs

  • Problem 1 (Evaluation): Given the observation sequence O=(o1o2…oT) and an HMM model Φ = (A,B), how do we efficiently compute P(O | Φ), the probability of the observation sequence given the model?
  • Problem 2 (Decoding): Given the observation sequence O=(o1o2…oT) and an HMM model Φ = (A,B), how do we choose a corresponding state sequence Q=(q1q2…qT) that is optimal in some sense (i.e., best explains the observations)?
  • Problem 3 (Learning): How do we adjust the model parameters Φ = (A,B) to maximize P(O | Φ)?

slide-94
SLIDE 94

Problem 1: computing the observation likelihood

  • Given the following HMM:
  • How likely is the sequence 3 1 3?

slide-95
SLIDE 95

How to compute likelihood

  • For a Markov chain, we just follow the states 3 1 3 and multiply the probabilities.
  • But for an HMM, we don’t know what the states are!
  • So let’s start with a simpler situation.
  • Computing the observation likelihood for a given hidden state sequence:
      – Suppose we knew the weather and wanted to predict how much ice cream Jason would eat.
      – I.e., P(3 1 3 | H H C)
slide-96
SLIDE 96

Computing likelihood for 1 given hidden state sequence

slide-97
SLIDE 97

Computing total likelihood of 3 1 3

— We would need to sum over

  • Hot hot cold
  • Hot hot hot
  • Hot cold hot
  • ….

— How many possible hidden state sequences are there for this sequence?

— How about in general for an HMM with N hidden states and a sequence of T observations?

  • N^T

— So we can’t just do separate computation for each hidden state sequence.
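
To make the cost concrete, here is a minimal brute-force sketch that enumerates every hidden state sequence and sums the joint probabilities P(O, Q). For two states and three observations that is only 2^3 = 8 sequences, but the count grows as N^T. All parameter values (`PI`, `A`, `B`) are illustrative assumptions, and the model has no explicit end state.

```python
from itertools import product

# ASSUMPTION: illustrative parameters, not taken from the slide's figure.
PI = {"H": 0.8, "C": 0.2}                      # initial probabilities
A = {"H": {"H": 0.7, "C": 0.3},                # A[i][j] = P(next = j | current = i)
     "C": {"H": 0.4, "C": 0.6}}
B = {"H": {1: 0.2, 2: 0.4, 3: 0.4},            # B[j][o] = P(observation = o | state = j)
     "C": {1: 0.5, 2: 0.4, 3: 0.1}}


def joint_probability(observations, states):
    """Return P(O, Q) for one specific hidden state sequence Q."""
    prob = PI[states[0]] * B[states[0]][observations[0]]
    for t in range(1, len(observations)):
        prob *= A[states[t - 1]][states[t]] * B[states[t]][observations[t]]
    return prob


def brute_force_likelihood(observations):
    """Return P(O) by summing P(O, Q) over all N**T state sequences."""
    all_sequences = product(PI, repeat=len(observations))
    return sum(joint_probability(observations, q) for q in all_sequences)


total = brute_force_likelihood([3, 1, 3])
num_sequences = len(PI) ** 3
print(num_sequences, total)
```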

slide-98
SLIDE 98

Instead: the Forward algorithm

— A kind of dynamic programming algorithm

  • Uses a table to store intermediate values

— Idea:

  • Compute the likelihood of the observation sequence
  • By summing over all possible hidden state sequences
  • But doing this efficiently

– By folding all the sequences into a single trellis

slide-99
SLIDE 99

The Forward Trellis

slide-100
SLIDE 100

The forward algorithm

— Each cell of the forward algorithm trellis α_t(j)

  • Represents the probability of being in state j
  • After seeing the first t observations
  • Given the automaton

— Each cell thus expresses the following probability: α_t(j) = P(o1 o2 … ot, qt = j | Φ)

slide-101
SLIDE 101

We update each cell

slide-102
SLIDE 102

The Forward Recursion
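
The slide's equations are not reproduced in this text. A standard statement of the recursion, written here for a model without an explicit end state, is given below. The symbols π_j (initial probability), a_ij (transition from state i to state j) and b_j(o_t) (probability of emitting o_t in state j) are notation assumed for this sketch and may differ from the slide.

```latex
\begin{align*}
\text{Initialization:}\quad & \alpha_1(j) = \pi_j \, b_j(o_1), && 1 \le j \le N \\
\text{Recursion:}\quad & \alpha_t(j) = \Big[ \sum_{i=1}^{N} \alpha_{t-1}(i)\, a_{ij} \Big]\, b_j(o_t), && 1 \le j \le N,\ 2 \le t \le T \\
\text{Termination:}\quad & P(O \mid \Phi) = \sum_{j=1}^{N} \alpha_T(j)
\end{align*}
```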

slide-103
SLIDE 103

The Forward Algorithm
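
The following is a minimal sketch of the forward algorithm, not the slide's own pseudocode. It fills the trellis column by column, so the work is O(N²T) instead of the N^T needed to enumerate every state sequence. The parameter values are illustrative assumptions, and the model has no explicit end state.

```python
# ASSUMPTION: illustrative parameters, not taken from the slide's figure.
STATES = ["H", "C"]
PI = {"H": 0.8, "C": 0.2}
A = {"H": {"H": 0.7, "C": 0.3},
     "C": {"H": 0.4, "C": 0.6}}
B = {"H": {1: 0.2, 2: 0.4, 3: 0.4},
     "C": {1: 0.5, 2: 0.4, 3: 0.1}}


def forward(observations, states, pi, trans, emit):
    """Return (P(O), trellis), where trellis[t][j] = P(o_1..o_{t+1}, state j)."""
    # Initialization: first column of the trellis.
    trellis = [{j: pi[j] * emit[j][observations[0]] for j in states}]
    # Recursion: each new cell sums over all cells in the previous column.
    for obs in observations[1:]:
        prev = trellis[-1]
        column = {
            j: sum(prev[i] * trans[i][j] for i in states) * emit[j][obs]
            for j in states
        }
        trellis.append(column)
    # Termination: sum the last column.
    return sum(trellis[-1].values()), trellis


likelihood, trellis = forward([3, 1, 3], STATES, PI, A, B)
print(likelihood)
```

Storing the whole trellis is not needed for the likelihood alone (the previous column suffices), but it mirrors the table shown on the trellis slides.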

slide-104
SLIDE 104

Decoding

— Given an observation sequence

  • 3 1 3

— And an HMM — The task of the decoder

  • To find the best hidden state sequence

— Given the observation sequence O=(o1o2…

  • T), and an HMM model Φ = (A,B), how do

we choose a corresponding state sequence Q=(q1q2…qT) that is optimal in some sense (i.e., best explains the observations)

slide-105
SLIDE 105

Decoding

— One possibility:

  • For each hidden state sequence

– HHH, HHC, HCH,

  • Run the forward algorithm to compute P(O | Q), the likelihood of the observations given that state sequence

— Why not?

  • N^T

— Instead:

  • The Viterbi algorithm

  • Is again a dynamic programming algorithm
  • Uses a similar trellis to the Forward algorithm
slide-106
SLIDE 106

The Viterbi trellis

slide-107
SLIDE 107

Viterbi intuition

— Process observation sequence left to right

— Filling out the trellis

— Each cell:

slide-108
SLIDE 108

Viterbi Algorithm
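
The following is a minimal sketch of Viterbi decoding with a backtrace, not the slide's own pseudocode. It has the same shape as the forward algorithm, but takes a max over previous states instead of a sum and records which previous state won. The parameter values are illustrative assumptions, and the model has no explicit end state.

```python
# ASSUMPTION: illustrative parameters, not taken from the slide's figure.
STATES = ["H", "C"]
PI = {"H": 0.8, "C": 0.2}
A = {"H": {"H": 0.7, "C": 0.3},
     "C": {"H": 0.4, "C": 0.6}}
B = {"H": {1: 0.2, 2: 0.4, 3: 0.4},
     "C": {1: 0.5, 2: 0.4, 3: 0.1}}


def viterbi(observations, states, pi, trans, emit):
    """Return (best_path, probability of that path together with O)."""
    # Initialization: first column of the trellis.
    scores = [{j: pi[j] * emit[j][observations[0]] for j in states}]
    backpointers = []
    # Recursion: keep only the best way of reaching each state.
    for obs in observations[1:]:
        prev = scores[-1]
        column, pointers = {}, {}
        for j in states:
            best_prev = max(states, key=lambda i: prev[i] * trans[i][j])
            column[j] = prev[best_prev] * trans[best_prev][j] * emit[j][obs]
            pointers[j] = best_prev
        scores.append(column)
        backpointers.append(pointers)
    # Termination: pick the best final state, then follow the backpointers.
    last = max(states, key=lambda j: scores[-1][j])
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, scores[-1][last]


best_path, best_prob = viterbi([3, 1, 3], STATES, PI, A, B)
print(best_path, best_prob)
```

With these assumed parameters the best path for 3 1 3 is H H H.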

slide-109
SLIDE 109

Viterbi backtrace

slide-110
SLIDE 110

Viterbi Recursion
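
The slide's equations are not reproduced in this text. A standard statement of the Viterbi recursion, for a model without an explicit end state, is given below. It matches the forward recursion with the sum replaced by a max, plus a backpointer for each cell. The symbols v_t(j), bt_t(j), π_j, a_ij and b_j(o_t) are notation assumed for this sketch.

```latex
\begin{align*}
\text{Initialization:}\quad & v_1(j) = \pi_j \, b_j(o_1), \qquad bt_1(j) = 0 \\
\text{Recursion:}\quad & v_t(j) = \max_{1 \le i \le N} v_{t-1}(i)\, a_{ij}\, b_j(o_t) \\
 & bt_t(j) = \operatorname*{arg\,max}_{1 \le i \le N} v_{t-1}(i)\, a_{ij} \\
\text{Termination:}\quad & P^{*} = \max_{1 \le j \le N} v_T(j), \qquad q_T^{*} = \operatorname*{arg\,max}_{1 \le j \le N} v_T(j)
\end{align*}
```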

slide-111
SLIDE 111

Why “Dynamic Programming”

“I spent the Fall quarter (of 1950) at RAND. My first task was to find a name for multistage decision processes. An interesting question is, Where did the name, dynamic programming, come from?

The 1950s were not good years for mathematical research. We had a very interesting gentleman in Washington named Wilson. He was Secretary of Defense, and he actually had a pathological fear and hatred of the word, research. I’m not using the term lightly; I’m using it precisely. His face would suffuse, he would turn red, and he would get violent if people used the term, research, in his presence. You can imagine how he felt, then, about the term, mathematical. The RAND Corporation was employed by the Air Force, and the Air Force had Wilson as its boss, essentially. Hence, I felt I had to do something to shield Wilson and the Air Force from the fact that I was really doing mathematics inside the RAND Corporation. What title, what name, could I choose? In the first place I was interested in planning, in decision making, in thinking. But planning, is not a good word for various reasons. I decided therefore to use the word, “programming”. I wanted to get across the idea that this was dynamic, this was multistage, this was time-varying. I thought, let’s kill two birds with one stone. Let’s take a word that has an absolutely precise meaning, namely dynamic, in the classical physical sense. It also has a very interesting property as an adjective, and that is it’s impossible to use the word, dynamic, in a pejorative sense. Try thinking of some combination that will possibly give it a pejorative meaning. It’s impossible. Thus, I thought dynamic programming was a good name. It was something not even a Congressman could object to. So I used it as an umbrella for my activities.”

Richard Bellman, “Eye of the Hurricane: An Autobiography”, 1984.

Thanks to Chen, Picheny, Eide, Nock

slide-112
SLIDE 112

HMMs for Speech

— We haven’t yet shown how to learn the A and B matrices for HMMs; we’ll do that later today or possibly on Monday

— But let’s return to think about speech

slide-113
SLIDE 113

Reminder: a word looks like this:

slide-114
SLIDE 114

HMM for digit recognition task

slide-115
SLIDE 115

The Evaluation (forward) problem for speech

— The observation sequence O is a series of MFCC vectors

— The hidden states W are the phones and words

— For a given phone/word string W, our job is to evaluate P(O|W)

— Intuition: how likely is the input to have been generated by just that word string W

slide-116
SLIDE 116

Evaluation for speech: Summing over all different paths!

— f ay ay ay ay v v v v
— f f ay ay ay ay v v v
— f f f f ay ay ay ay v
— f f ay ay ay ay ay ay v
— f f ay ay ay ay ay ay ay ay v
— f f ay v v v v v v v
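
To see where paths like these come from, the sketch below enumerates every way to stretch the phones of “five” over a fixed number of frames. It assumes a simplified model in which each phone is a single left-to-right state with a self-loop, so every phone covers at least one frame and the phones stay in order. Real recognizers typically use several subphone states per phone. The function name is hypothetical.

```python
from itertools import combinations


def alignments(phones, num_frames):
    """Yield each in-order assignment of phones to frames (>= 1 frame per phone)."""
    # Choose the frame indices at which the path switches to the next phone.
    for cuts in combinations(range(1, num_frames), len(phones) - 1):
        bounds = (0, *cuts, num_frames)
        path = []
        for phone, start, end in zip(phones, bounds, bounds[1:]):
            path.extend([phone] * (end - start))
        yield path


paths = list(alignments(["f", "ay", "v"], 9))
print(len(paths))
for path in paths[:3]:
    print(" ".join(path))
```

Evaluation sums the probability of every such path, which is the sum the forward trellis computes without listing the paths one by one.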

slide-117
SLIDE 117

The forward lattice for “five”

slide-118
SLIDE 118

The forward trellis for “five”

slide-119
SLIDE 119

Viterbi trellis for “five”

slide-120
SLIDE 120

Viterbi trellis for “five”

slide-121
SLIDE 121

Search space with bigrams

slide-122
SLIDE 122

Viterbi trellis with 2 words and uniform LM

slide-123
SLIDE 123

Viterbi backtrace