The Hidden Markov Model (HMM) - PowerPoint PPT Presentation


SLIDE 1

Digital Speech Processing — Lecture 20

The Hidden Markov Model (HMM)

SLIDE 2

Lecture Outline

  • Theory of Markov Models
    – discrete Markov processes
    – hidden Markov processes
  • Solutions to the Three Basic Problems of HMM’s
    – computation of observation probability
    – determination of optimal state sequence
    – optimal training of model
  • Variations of elements of the HMM
    – model types
    – densities
  • Implementation Issues
    – scaling
    – multiple observation sequences
    – initial parameter estimates
    – insufficient training data
  • Implementation of Isolated Word Recognizer Using HMM’s

SLIDE 3

Stochastic Signal Modeling

  • Reasons for Interest:
    – basis for theoretical description of signal processing algorithms
    – can learn about signal source properties
    – models work well in practice in real-world applications
  • Types of Signal Models
    – deterministic, parametric models
    – stochastic models

SLIDE 4

Discrete Markov Processes

System of N distinct states, S_1, S_2, ..., S_N

Time t:  1   2   3   4   5  ...
State:   q_1 q_2 q_3 q_4 q_5 ...

Markov Property:
P[q_t = S_j | q_{t-1} = S_i, q_{t-2} = S_k, ...] = P[q_t = S_j | q_{t-1} = S_i]

SLIDE 5

Properties of State Transition Coefficients

Consider processes where the state transitions are time independent, i.e.,
a_ij = P[q_t = S_j | q_{t-1} = S_i], 1 ≤ i, j ≤ N
with the properties
a_ij ≥ 0, ∀ i, j
Σ_{j=1}^N a_ij = 1, ∀ i

SLIDE 6

Example of Discrete Markov Process

Once each day (e.g., at noon), the weather is observed and classified as being one of the following:
– State 1—Rain (or Snow; e.g., precipitation)
– State 2—Cloudy
– State 3—Sunny
with state transition probabilities:

A = {a_ij} = | 0.4  0.3  0.3 |
             | 0.2  0.6  0.2 |
             | 0.1  0.1  0.8 |

SLIDE 7

Discrete Markov Process

Problem: Given that the weather on day 1 is sunny, what is the probability (according to the model) that the weather for the next 7 days will be “sunny-sunny-rain-rain-sunny-cloudy-sunny”?

Solution: We define the observation sequence, O, as:
O = {S_3, S_3, S_3, S_1, S_1, S_3, S_2, S_3}
and we want to calculate P(O|Model). That is:
P(O|Model) = P[S_3, S_3, S_3, S_1, S_1, S_3, S_2, S_3 | Model]

SLIDE 8

Discrete Markov Process

P(O|Model) = P[S_3, S_3, S_3, S_1, S_1, S_3, S_2, S_3 | Model]
  = P[S_3] · P[S_3|S_3] · P[S_3|S_3] · P[S_1|S_3] · P[S_1|S_1] · P[S_3|S_1] · P[S_2|S_3] · P[S_3|S_2]
  = π_3 · a_33 · a_33 · a_31 · a_11 · a_13 · a_32 · a_23
  = (1)(0.8)(0.8)(0.1)(0.4)(0.3)(0.1)(0.2)
  = 1.536 × 10⁻⁴
where π_i = P[q_1 = S_i], 1 ≤ i ≤ N
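The chain of factors above can be checked numerically; a minimal sketch in plain Python, with states labeled 1–3 as on the slides:

```python
# Weather model from the example: state 1 = rain, 2 = cloudy, 3 = sunny.
A = [[0.4, 0.3, 0.3],
     [0.2, 0.6, 0.2],
     [0.1, 0.1, 0.8]]

def sequence_probability(states, A, pi_first=1.0):
    """P(O|Model): initial probability of the first state times the chain of a_ij."""
    p = pi_first
    for prev, cur in zip(states, states[1:]):
        p *= A[prev - 1][cur - 1]
    return p

O = [3, 3, 3, 1, 1, 3, 2, 3]  # sunny-sunny-sunny-rain-rain-sunny-cloudy-sunny
print(sequence_probability(O, A))  # ≈ 1.536e-4
```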

SLIDE 9

Discrete Markov Process

Problem: Given that the model is in a known state, what is the probability it stays in that state for exactly d days?

Solution:
O = {S_i, S_i, S_i, ..., S_i, S_j ≠ S_i}   (d occurrences of S_i, days 1, 2, ..., d)
p_i(d) = P(O | Model, q_1 = S_i) = (a_ii)^{d-1} (1 - a_ii)
The expected duration in state i is:
d̄_i = Σ_{d=1}^∞ d · p_i(d) = 1 / (1 - a_ii)
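The geometric duration density and its mean can be verified with a short sketch (the infinite sum is truncated; a_ii = 0.8 is the sunny-state self-transition from the weather example):

```python
# p_i(d) = a_ii^(d-1) * (1 - a_ii); expected duration is 1 / (1 - a_ii).
def duration_pmf(a_ii, d):
    return a_ii ** (d - 1) * (1 - a_ii)

a_ii = 0.8
expected = sum(d * duration_pmf(a_ii, d) for d in range(1, 2000))  # truncated sum
print(expected)  # ≈ 5.0 = 1 / (1 - 0.8)
```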

SLIDE 10

Exercise

Given a single fair coin, i.e., P(H=Heads) = P(T=Tails) = 0.5, which you toss once and observe Tails:
a) what is the probability that the next 10 tosses will provide the sequence {H H T H T T H T T H}?

SOLUTION: For a fair coin, with independent coin tosses, the probability of any specific observation sequence of length 10 (10 tosses) is (1/2)^10, since there are 2^10 such sequences and all are equally probable. Thus:
P(H H T H T T H T T H) = (1/2)^10

SLIDE 11

Exercise

b) what is the probability that the next 10 tosses will produce the sequence {H H H H H H H H H H}?

SOLUTION: Similarly:
P(H H H H H H H H H H) = (1/2)^10
Thus a specified run of length 10 is equally as likely as a specified run of interlaced H’s and T’s.

SLIDE 12

Exercise

c) what is the probability that 5 of the next 10 tosses will be tails? What is the expected number of tails over the next 10 tosses?

SOLUTION: The probability of 5 tails in the next 10 tosses is just the number of observation sequences with 5 tails and 5 heads (in any order) times the probability of each sequence, i.e.:
P(5H, 5T) = C(10,5) (1/2)^10 = 252/1024 ≈ 0.25
since there are C(10,5) combinations (ways of getting 5H and 5T) in 10 coin tosses, and each sequence has probability (1/2)^10. The expected number of tails in 10 tosses is:
E(number of T in 10 coin tosses) = Σ_{d=0}^{10} d · C(10,d) (1/2)^10 = 5
Thus, on average, there will be 5H and 5T in 10 tosses, but the probability of exactly 5H and 5T is only about 0.25.
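Both numbers are easy to reproduce with the standard library:

```python
from math import comb

# P(exactly 5 tails in 10 fair tosses) and the expected number of tails.
p_5T = comb(10, 5) * (1 / 2) ** 10
expected_T = sum(d * comb(10, d) * (1 / 2) ** 10 for d in range(11))
print(p_5T, expected_T)  # 0.24609375 5.0
```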

SLIDE 13

Coin Toss Models

A series of coin tossing experiments is performed. The number of coins is unknown; only the results of each coin toss are revealed. Thus a typical observation sequence is:
O = O_1 O_2 O_3 ... O_T = H H T T T H T T H ... H

Problem: Build an HMM to explain the observation sequence.
Issues:
  • 1. What are the states in the model?
  • 2. How many states should be used?
  • 3. What are the state transition probabilities?

SLIDE 14

Coin Toss Models

SLIDE 15

Coin Toss Models

SLIDE 16

Coin Toss Models

Problem: Consider an HMM representation (model λ) of a coin tossing experiment. Assume a 3-state model (corresponding to 3 different coins) with probabilities:

        State 1   State 2   State 3
P(H)    0.5       0.75      0.25
P(T)    0.5       0.25      0.75

and with all state transition probabilities equal to 1/3. (Assume initial state probabilities of 1/3.)
a) You observe the sequence O = H H H H T H T T T T. What state sequence is most likely? What is the probability of the observation sequence and this most likely state sequence?

SLIDE 17

Coin Toss Problem Solution

SOLUTION:
Given O = H H H H T H T T T T, the most likely state sequence is the one for which the probability of each individual observation is maximum. Thus for each H the most likely state is S_2, and for each T the most likely state is S_3. Thus the most likely state sequence is:
S = S_2 S_2 S_2 S_2 S_3 S_2 S_3 S_3 S_3 S_3
The probability of O and S (given the model) is:
P(O, S | λ) = (0.75)^10 (1/3)^10
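This claim can be checked by brute force, enumerating all 3^10 state sequences under the stated model (uniform transition and initial probabilities of 1/3):

```python
from itertools import product

p_heads = [0.5, 0.75, 0.25]   # P(H) in states 1, 2, 3
O = "HHHHTHTTTT"

def joint_prob(states, obs):
    """P(O, S | lambda) with all transition/initial probabilities = 1/3."""
    p = 1.0
    for s, o in zip(states, obs):
        p *= (1 / 3) * (p_heads[s] if o == "H" else 1 - p_heads[s])
    return p

best = max(product(range(3), repeat=10), key=lambda s: joint_prob(s, O))
print([s + 1 for s in best])   # [2, 2, 2, 2, 3, 2, 3, 3, 3, 3]
print(joint_prob(best, O))     # (0.75)**10 * (1/3)**10
```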

SLIDE 18

Coin Toss Models

b) What is the probability that the observation sequence came entirely from state 1?

SOLUTION:
The probability of O given that Ŝ = S_1 S_1 S_1 S_1 S_1 S_1 S_1 S_1 S_1 S_1 is:
P(O, Ŝ | λ) = (0.50)^10 (1/3)^10
The ratio of P(O, S | λ) to P(O, Ŝ | λ) is:
R = P(O, S | λ) / P(O, Ŝ | λ) = (3/2)^10 ≈ 57.67

SLIDE 19

Coin Toss Models

c) Consider the observation sequence:
O = H T T H T H H T T H
How would your answers to parts a and b change?

SOLUTION: Given O, which has the same number of H’s and T’s, the answers to parts a and b would remain the same, as the most likely states occur the same number of times in both cases.

SLIDE 20

Coin Toss Models

d) If the state transition probabilities were of the form:
a_11 = 0.9,  a_21 = 0.45, a_31 = 0.45
a_12 = 0.05, a_22 = 0.1,  a_32 = 0.45
a_13 = 0.05, a_23 = 0.45, a_33 = 0.1
i.e., a new model λ′, how would your answers to parts a–c change? What does this suggest about the type of sequences generated by the models?

SLIDE 21

Coin Toss Problem Solution

SOLUTION: The new probability of O and S becomes:
P(O, S | λ′) = (1/3) (0.75)^10 (0.1)^6 (0.45)^3
The new probability of O and Ŝ becomes:
P(O, Ŝ | λ′) = (1/3) (0.50)^10 (0.9)^9
The ratio is:
R = (3/2)^10 (1/9)^6 (1/2)^3 ≈ 1.36 × 10⁻⁵
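These numbers can be reproduced directly from the modified transition matrix (assuming, as before, initial state probabilities of 1/3):

```python
# Modified model lambda' from part (d); emissions as in the 3-coin table.
A = [[0.9, 0.05, 0.05],
     [0.45, 0.1, 0.45],
     [0.45, 0.45, 0.1]]
p_heads = [0.5, 0.75, 0.25]
O = "HHHHTHTTTT"

def joint_prob(states, obs):
    """P(O, S | lambda') with initial state probabilities of 1/3."""
    p = 1 / 3
    for t, (s, o) in enumerate(zip(states, obs)):
        if t > 0:
            p *= A[states[t - 1]][s]
        p *= p_heads[s] if o == "H" else 1 - p_heads[s]
    return p

S = [1, 1, 1, 1, 2, 1, 2, 2, 2, 2]   # S2 S2 S2 S2 S3 S2 S3 S3 S3 S3 (0-indexed)
S_hat = [0] * 10                      # all state 1
R = joint_prob(S, O) / joint_prob(S_hat, O)
print(R)  # ≈ 1.36e-5
```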

SLIDE 22

Coin Toss Problem Solution

Now the probability of O and S is not the same as the probability of O and Ŝ. We now have:
P(O, S | λ′) = (1/3) (0.75)^10 (0.45)^6 (0.1)^3
P(O, Ŝ | λ′) = (1/3) (0.50)^10 (0.9)^9
with the ratio:
R = (3/2)^10 (1/2)^6 (1/9)^3 ≈ 1.24 × 10⁻³
Model λ, the initial model, clearly favors long runs of H’s or T’s, whereas model λ′, the new model, clearly favors random sequences of H’s and T’s. Thus even a long run of H’s or T’s is more likely to occur in state 1 for model λ′, and a random sequence of H’s and T’s is more likely to occur in states 2 and 3 for model λ.

SLIDE 23

Balls in Urns Model

SLIDE 24

Elements of an HMM

  • 1. N, the number of states in the model
    – states S = {S_1, S_2, ..., S_N}; state at time t is q_t ∈ S
  • 2. M, the number of distinct observation symbols per state
    – observation symbols V = {v_1, v_2, ..., v_M}; observation at time t is O_t ∈ V
  • 3. State transition probability distribution, A = {a_ij}, with
    a_ij = P(q_{t+1} = S_j | q_t = S_i), 1 ≤ i, j ≤ N
  • 4. Observation symbol probability distribution in state j, B = {b_j(k)}, with
    b_j(k) = P(v_k at t | q_t = S_j), 1 ≤ j ≤ N, 1 ≤ k ≤ M
  • 5. Initial state distribution, π = {π_i}, with
    π_i = P[q_1 = S_i], 1 ≤ i ≤ N

SLIDE 25

HMM Generator of Observations

  • 1. Choose an initial state, q_1 = S_i, according to the initial state distribution, π.
  • 2. Set t = 1.
  • 3. Choose O_t = v_k according to the symbol probability distribution in state S_i, namely b_i(k).
  • 4. Transit to a new state, q_{t+1} = S_j, according to the state transition probability distribution for state S_i, namely a_ij.
  • 5. Set t = t + 1; return to step 3 if t ≤ T; otherwise terminate the procedure.

Notation: λ = (A, B, π) denotes the HMM.

t            1   2   3   4   5   6  ...  T
state        q_1 q_2 q_3 q_4 q_5 q_6 ... q_T
observation  O_1 O_2 O_3 O_4 O_5 O_6 ... O_T
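The five steps above can be sketched as a small sampler (a minimal sketch; states and symbols are integer indices, and the coin-toss model from the earlier slides is used as the example):

```python
import random

def generate(A, B, pi, T, seed=0):
    """Generate (states, observations) of length T from an HMM lambda = (A, B, pi)."""
    rng = random.Random(seed)
    q = rng.choices(range(len(pi)), weights=pi)[0]      # step 1: initial state
    states, obs = [], []
    for _ in range(T):                                  # steps 2 and 5: loop over t
        states.append(q)
        obs.append(rng.choices(range(len(B[q])), weights=B[q])[0])  # step 3: emit
        q = rng.choices(range(len(A[q])), weights=A[q])[0]          # step 4: transit
    return states, obs

# Coin-toss model: 3 states, symbols 0 = H, 1 = T.
A = [[1 / 3] * 3] * 3
B = [[0.5, 0.5], [0.75, 0.25], [0.25, 0.75]]
pi = [1 / 3] * 3
states, obs = generate(A, B, pi, T=10)
print(states, obs)
```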

SLIDE 26

Three Basic HMM Problems

  • Problem 1—Given the observation sequence, O = O_1 O_2 ... O_T, and a model λ = (A, B, π), how do we (efficiently) compute P(O|λ), the probability of the observation sequence?
  • Problem 2—Given the observation sequence, O = O_1 O_2 ... O_T, how do we choose a state sequence Q = q_1 q_2 ... q_T which is optimal in some meaningful sense?
  • Problem 3—How do we adjust the model parameters λ = (A, B, π) to maximize P(O|λ)?

Interpretation:
  • Problem 1—Evaluation or scoring problem.
  • Problem 2—Learn structure problem.
  • Problem 3—Training problem.

SLIDE 27

Solution to Problem 1 — P(O|λ)

Consider the fixed state sequence (there are N^T such sequences):
Q = q_1 q_2 ... q_T
Then
P(O | Q, λ) = b_{q_1}(O_1) b_{q_2}(O_2) ... b_{q_T}(O_T)
P(Q | λ) = π_{q_1} a_{q_1 q_2} a_{q_2 q_3} ... a_{q_{T-1} q_T}
and
P(O, Q | λ) = P(O | Q, λ) P(Q | λ)
Finally
P(O | λ) = Σ_{all Q} P(O | Q, λ) P(Q | λ)
         = Σ_{q_1, q_2, ..., q_T} π_{q_1} b_{q_1}(O_1) a_{q_1 q_2} b_{q_2}(O_2) ... a_{q_{T-1} q_T} b_{q_T}(O_T)
Calculations required: ≈ 2T · N^T; for N = 5, T = 100: 2 · 100 · 5^100 ≈ 10^72 computations!

SLIDE 28

The “Forward” Procedure

Consider the forward variable, α_t(i), defined as the probability of the partial observation sequence (until time t) and state S_i at time t, given the model, i.e.,
α_t(i) = P(O_1 O_2 ... O_t, q_t = S_i | λ)
Inductively solve for α_t(i) as:
1. Initialization: α_1(i) = π_i b_i(O_1), 1 ≤ i ≤ N
2. Induction: α_{t+1}(j) = [Σ_{i=1}^N α_t(i) a_ij] b_j(O_{t+1}), 1 ≤ t ≤ T-1, 1 ≤ j ≤ N
3. Termination: P(O | λ) = P(O_1 O_2 ... O_T | λ) = Σ_{i=1}^N α_T(i)
Computation: ≈ N²T versus 2T · N^T; for N = 5, T = 100: 2500 versus 10^72
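The three steps above translate directly into code; a minimal sketch on the coin-toss model (symbols 0 = H, 1 = T):

```python
def forward(A, B, pi, O):
    """P(O | lambda) for a discrete HMM, in O(N^2 T) time."""
    N = len(pi)
    # Initialization: alpha_1(i) = pi_i * b_i(O_1)
    alpha = [pi[i] * B[i][O[0]] for i in range(N)]
    # Induction: alpha_{t+1}(j) = [sum_i alpha_t(i) a_ij] * b_j(O_{t+1})
    for obs in O[1:]:
        alpha = [sum(alpha[i] * A[i][j] for i in range(N)) * B[j][obs]
                 for j in range(N)]
    # Termination: P(O | lambda) = sum_i alpha_T(i)
    return sum(alpha)

A = [[1 / 3] * 3] * 3
B = [[0.5, 0.5], [0.75, 0.25], [0.25, 0.75]]
pi = [1 / 3] * 3
O = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]  # H H H H T H T T T T
print(forward(A, B, pi, O))  # (1/2)**10, since this model emits H/T with prob 1/2 overall
```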

SLIDE 29

The “Forward” Procedure

SLIDE 30

The “Backward” Algorithm

Consider the backward variable, β_t(i), defined as the probability of the partial observation sequence from t+1 to the end, given state S_i at time t and the model, i.e.,
β_t(i) = P(O_{t+1} O_{t+2} ... O_T | q_t = S_i, λ)
Inductive Solution:
1. Initialization: β_T(i) = 1, 1 ≤ i ≤ N
2. Induction: β_t(i) = Σ_{j=1}^N a_ij b_j(O_{t+1}) β_{t+1}(j), t = T-1, T-2, ..., 1, 1 ≤ i ≤ N
N²T calculations, same as in forward case
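A sketch of the backward recursion, with the standard consistency check P(O|λ) = Σ_i π_i b_i(O_1) β_1(i):

```python
def backward(A, B, pi, O):
    """Return beta_1(i) for a discrete HMM lambda = (A, B, pi)."""
    N, T = len(pi), len(O)
    beta = [1.0] * N                       # Initialization: beta_T(i) = 1
    for t in range(T - 2, -1, -1):         # Induction, t = T-1, ..., 1
        beta = [sum(A[i][j] * B[j][O[t + 1]] * beta[j] for j in range(N))
                for i in range(N)]
    return beta

A = [[1 / 3] * 3] * 3
B = [[0.5, 0.5], [0.75, 0.25], [0.25, 0.75]]
pi = [1 / 3] * 3
O = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]
beta1 = backward(A, B, pi, O)
p = sum(pi[i] * B[i][O[0]] * beta1[i] for i in range(3))
print(p)  # same value the forward procedure gives: (1/2)**10
```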

SLIDE 31

Solution to Problem 2 — Optimal State Sequence

  • 1. Choose states, q_t, which are individually most likely ⇒ maximize expected number of correct individual states.
  • 2. Choose states, q_t, which are pair-wise most likely ⇒ maximize expected number of correct state pairs.
  • 3. Choose states, q_t, which are triple-wise most likely ⇒ maximize expected number of correct state triples.
  • 4. Choose states, q_t, which are T-wise most likely ⇒ find the single best state sequence which maximizes P(Q, O | λ). This solution is often called the Viterbi state sequence because it is found using the Viterbi algorithm.

SLIDE 32

Maximize Individual States

We define γ_t(i) as the probability of being in state S_i at time t, given the observation sequence and the model, i.e.,
γ_t(i) = P(q_t = S_i | O, λ) = P(q_t = S_i, O | λ) / P(O | λ)
       = α_t(i) β_t(i) / P(O | λ) = α_t(i) β_t(i) / Σ_{i=1}^N α_t(i) β_t(i)
with Σ_{i=1}^N γ_t(i) = 1, ∀ t. Then the individually most likely states are
q_t* = argmax_{1≤i≤N} [γ_t(i)], 1 ≤ t ≤ T
Problem: q_t* need not obey state transition constraints.
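The γ_t(i) posteriors can be computed from full forward/backward tables; a minimal sketch on the coin-toss model (0 = H, 1 = T):

```python
def posteriors(A, B, pi, O):
    """gamma_t(i) = alpha_t(i) beta_t(i) / sum_i alpha_t(i) beta_t(i)."""
    N, T = len(pi), len(O)
    al = [[pi[i] * B[i][O[0]] for i in range(N)]]          # forward table
    for t in range(1, T):
        al.append([sum(al[-1][i] * A[i][j] for i in range(N)) * B[j][O[t]]
                   for j in range(N)])
    be = [[1.0] * N for _ in range(T)]                     # backward table
    for t in range(T - 2, -1, -1):
        be[t] = [sum(A[i][j] * B[j][O[t + 1]] * be[t + 1][j] for j in range(N))
                 for i in range(N)]
    gam = []
    for a_t, b_t in zip(al, be):
        w = [a * b for a, b in zip(a_t, b_t)]
        total = sum(w)
        gam.append([x / total for x in w])
    return gam

A = [[1 / 3] * 3] * 3
B = [[0.5, 0.5], [0.75, 0.25], [0.25, 0.75]]
pi = [1 / 3] * 3
g = posteriors(A, B, pi, [0, 0, 1, 1])   # H H T T
print(g[0])  # S2 is individually most likely when O_1 = H
```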

SLIDE 33

Best State Sequence — The Viterbi Algorithm

Define δ_t(i) as the highest probability along a single path, at time t, which accounts for the first t observations and ends in state S_i, i.e.,
δ_t(i) = max_{q_1, q_2, ..., q_{t-1}} P[q_1 q_2 ... q_{t-1}, q_t = S_i, O_1 O_2 ... O_t | λ]
We must keep track of the state sequence which gave the best path, at time t, to state S_i. We do this in the array ψ_t(i).

SLIDE 34

The Viterbi Algorithm

Step 1—Initialization:
δ_1(i) = π_i b_i(O_1), 1 ≤ i ≤ N
ψ_1(i) = 0, 1 ≤ i ≤ N
Step 2—Recursion:
δ_t(j) = max_{1≤i≤N} [δ_{t-1}(i) a_ij] b_j(O_t), 2 ≤ t ≤ T, 1 ≤ j ≤ N
ψ_t(j) = argmax_{1≤i≤N} [δ_{t-1}(i) a_ij], 2 ≤ t ≤ T, 1 ≤ j ≤ N
Step 3—Termination:
P* = max_{1≤i≤N} [δ_T(i)]
q_T* = argmax_{1≤i≤N} [δ_T(i)]
Step 4—Path (State Sequence) Backtracking:
q_t* = ψ_{t+1}(q_{t+1}*), t = T-1, T-2, ..., 1
Calculation: ≈ N²T operations (×, +)
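The four steps above can be sketched directly; the coin-toss model recovers the path found by inspection earlier:

```python
def viterbi(A, B, pi, O):
    """Best state path and its probability P* for a discrete HMM."""
    N = len(pi)
    delta = [pi[i] * B[i][O[0]] for i in range(N)]        # Step 1: initialization
    psi = []
    for obs in O[1:]:                                     # Step 2: recursion
        back, new = [], []
        for j in range(N):
            best_i = max(range(N), key=lambda i: delta[i] * A[i][j])
            back.append(best_i)
            new.append(delta[best_i] * A[best_i][j] * B[j][obs])
        psi.append(back)
        delta = new
    q = max(range(N), key=lambda i: delta[i])             # Step 3: termination
    p_star = delta[q]
    path = [q]
    for back in reversed(psi):                            # Step 4: backtracking
        q = back[q]
        path.append(q)
    return p_star, path[::-1]

A = [[1 / 3] * 3] * 3
B = [[0.5, 0.5], [0.75, 0.25], [0.25, 0.75]]
pi = [1 / 3] * 3
p_star, path = viterbi(A, B, pi, [0, 0, 0, 0, 1, 0, 1, 1, 1, 1])
print([s + 1 for s in path])  # [2, 2, 2, 2, 3, 2, 3, 3, 3, 3]
```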

SLIDE 35

Alternative Viterbi Implementation

Preprocessing:
π̃_i = log(π_i), 1 ≤ i ≤ N
b̃_i(O_t) = log(b_i(O_t)), 1 ≤ i ≤ N, 1 ≤ t ≤ T
ã_ij = log(a_ij), 1 ≤ i, j ≤ N
Step 1—Initialization:
δ̃_1(i) = log(δ_1(i)) = π̃_i + b̃_i(O_1), 1 ≤ i ≤ N
ψ_1(i) = 0, 1 ≤ i ≤ N
Step 2—Recursion:
δ̃_t(j) = log(δ_t(j)) = max_{1≤i≤N} [δ̃_{t-1}(i) + ã_ij] + b̃_j(O_t), 2 ≤ t ≤ T, 1 ≤ j ≤ N
ψ_t(j) = argmax_{1≤i≤N} [δ̃_{t-1}(i) + ã_ij], 2 ≤ t ≤ T, 1 ≤ j ≤ N
Step 3—Termination:
P̃* = max_{1≤i≤N} [δ̃_T(i)]
q_T* = argmax_{1≤i≤N} [δ̃_T(i)]
Step 4—Backtracking:
q_t* = ψ_{t+1}(q_{t+1}*), t = T-1, T-2, ..., 1
Calculation: ≈ N²T additions
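In the log domain, products become sums and underflow is avoided for long sequences; a minimal sketch (assumes no zero probabilities, since log(0) is undefined):

```python
import math

def viterbi_log(A, B, pi, O):
    """Log-domain Viterbi: returns (log P*, best state path)."""
    N = len(pi)
    la = [[math.log(a) for a in row] for row in A]        # preprocessing
    delta = [math.log(pi[i]) + math.log(B[i][O[0]]) for i in range(N)]
    psi = []
    for obs in O[1:]:                                     # recursion (additions only)
        back = [max(range(N), key=lambda i: delta[i] + la[i][j]) for j in range(N)]
        delta = [delta[back[j]] + la[back[j]][j] + math.log(B[j][obs])
                 for j in range(N)]
        psi.append(back)
    q = max(range(N), key=lambda i: delta[i])             # termination
    log_p = delta[q]
    path = [q]
    for back in reversed(psi):                            # backtracking
        q = back[q]
        path.append(q)
    return log_p, path[::-1]

A = [[1 / 3] * 3] * 3
B = [[0.5, 0.5], [0.75, 0.25], [0.25, 0.75]]
pi = [1 / 3] * 3
log_p, path = viterbi_log(A, B, pi, [0, 0, 0, 0, 1, 0, 1, 1, 1, 1])
print([s + 1 for s in path], log_p)
```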
SLIDE 36

Problem

Given the model of the coin toss experiment used earlier (i.e., 3 different coins) with probabilities:

        State 1   State 2   State 3
P(H)    0.5       0.75      0.25
P(T)    0.5       0.25      0.75

with all state transition probabilities equal to 1/3, and with initial state probabilities equal to 1/3. For the observation sequence O = H H H H T H T T T T, find the Viterbi path of maximum likelihood.

SLIDE 37

Problem Solution

SOLUTION: Since all a_ij terms are equal to 1/3, we can omit these terms (as well as the initial state probability term), giving:
δ_1(1) = 0.5, δ_1(2) = 0.75, δ_1(3) = 0.25
The recursion for δ_t(j), 2 ≤ t ≤ 10, gives:
δ_2(1) = (0.75)(0.5),   δ_2(2) = (0.75)²,        δ_2(3) = (0.75)(0.25)
δ_3(1) = (0.75)²(0.5),  δ_3(2) = (0.75)³,        δ_3(3) = (0.75)²(0.25)
δ_4(1) = (0.75)³(0.5),  δ_4(2) = (0.75)⁴,        δ_4(3) = (0.75)³(0.25)
δ_5(1) = (0.75)⁴(0.5),  δ_5(2) = (0.75)⁴(0.25),  δ_5(3) = (0.75)⁵
δ_6(1) = (0.75)⁵(0.5),  δ_6(2) = (0.75)⁶,        δ_6(3) = (0.75)⁵(0.25)
δ_7(1) = (0.75)⁶(0.5),  δ_7(2) = (0.75)⁶(0.25),  δ_7(3) = (0.75)⁷
δ_8(1) = (0.75)⁷(0.5),  δ_8(2) = (0.75)⁷(0.25),  δ_8(3) = (0.75)⁸
δ_9(1) = (0.75)⁸(0.5),  δ_9(2) = (0.75)⁸(0.25),  δ_9(3) = (0.75)⁹
δ_10(1) = (0.75)⁹(0.5), δ_10(2) = (0.75)⁹(0.25), δ_10(3) = (0.75)¹⁰
This leads to a diagram (trellis) of the form:

SLIDE 38

Solution to Problem 3 — the Training Problem

  • no globally optimum solution is known
  • all solutions yield local optima
    – can get solution via gradient techniques
    – can use a re-estimation procedure such as the Baum-Welch or EM method
  • consider re-estimation procedures
    – basic idea: given a current model estimate, λ, compute expected values of model events, then refine the model based on the computed values
    λ⁽⁰⁾ → E[Model Events] → λ⁽¹⁾ → E[Model Events] → λ⁽²⁾ → ···

Define ξ_t(i,j), the probability of being in state S_i at time t, and state S_j at time t+1, given the model and the observation sequence, i.e.,
ξ_t(i,j) = P[q_t = S_i, q_{t+1} = S_j | O, λ]

SLIDE 39

The Training Problem

ξ_t(i,j) = P[q_t = S_i, q_{t+1} = S_j | O, λ]

SLIDE 40

The Training Problem

ξ_t(i,j) = P[q_t = S_i, q_{t+1} = S_j | O, λ]
         = P[q_t = S_i, q_{t+1} = S_j, O | λ] / P(O | λ)
         = α_t(i) a_ij b_j(O_{t+1}) β_{t+1}(j) / P(O | λ)
         = α_t(i) a_ij b_j(O_{t+1}) β_{t+1}(j) / Σ_{i=1}^N Σ_{j=1}^N α_t(i) a_ij b_j(O_{t+1}) β_{t+1}(j)
γ_t(i) = Σ_{j=1}^N ξ_t(i,j)
Σ_{t=1}^{T-1} γ_t(i) = expected number of transitions from S_i
Σ_{t=1}^{T-1} ξ_t(i,j) = expected number of transitions from S_i to S_j

SLIDE 41

Re-estimation Formulas

π̄_i = expected number of times in state S_i at t = 1 = γ_1(i)

ā_ij = expected number of transitions from state S_i to state S_j
       / expected number of transitions from state S_i
     = Σ_{t=1}^{T-1} ξ_t(i,j) / Σ_{t=1}^{T-1} γ_t(i)

b̄_j(k) = expected number of times in state S_j with symbol v_k
         / expected number of times in state S_j
       = Σ_{t=1, O_t = v_k}^{T} γ_t(j) / Σ_{t=1}^{T} γ_t(j)
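One Baum-Welch re-estimation pass using the three formulas above can be sketched as follows (single observation sequence; the small 2-state model at the bottom is a hypothetical example, not from the slides):

```python
def reestimate(A, B, pi, O):
    """One Baum-Welch pass: returns re-estimated (A, B, pi)."""
    N, M, T = len(pi), len(B[0]), len(O)
    al = [[pi[i] * B[i][O[0]] for i in range(N)]]          # forward table
    for t in range(1, T):
        al.append([sum(al[t - 1][i] * A[i][j] for i in range(N)) * B[j][O[t]]
                   for j in range(N)])
    be = [[1.0] * N for _ in range(T)]                     # backward table
    for t in range(T - 2, -1, -1):
        be[t] = [sum(A[i][j] * B[j][O[t + 1]] * be[t + 1][j] for j in range(N))
                 for i in range(N)]
    PO = sum(al[T - 1])
    gamma = [[al[t][i] * be[t][i] / PO for i in range(N)] for t in range(T)]
    xi = [[[al[t][i] * A[i][j] * B[j][O[t + 1]] * be[t + 1][j] / PO
            for j in range(N)] for i in range(N)] for t in range(T - 1)]
    new_pi = gamma[0][:]                                   # pi_bar_i = gamma_1(i)
    new_A = [[sum(xi[t][i][j] for t in range(T - 1))
              / sum(gamma[t][i] for t in range(T - 1)) for j in range(N)]
             for i in range(N)]
    new_B = [[sum(gamma[t][j] for t in range(T) if O[t] == k)
              / sum(gamma[t][j] for t in range(T)) for k in range(M)]
             for j in range(N)]
    return new_A, new_B, new_pi

A = [[0.6, 0.4], [0.3, 0.7]]
B = [[0.7, 0.3], [0.2, 0.8]]
pi = [0.5, 0.5]
A2, B2, pi2 = reestimate(A, B, pi, [0, 0, 1, 1, 0])
print(pi2)  # re-estimated initial state distribution (sums to 1)
```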

SLIDE 42

Re-estimation Formulas

If λ = (A, B, π) is the initial model, and λ̄ = (Ā, B̄, π̄) is the re-estimated model, then it can be proven that either:
  • 1. the initial model, λ, defines a critical point of the likelihood function, in which case λ̄ = λ, or
  • 2. model λ̄ is more likely than model λ in the sense that P(O | λ̄) > P(O | λ), i.e., we have found a new model from which the observation sequence is more likely to have been produced.
Conclusion: Iteratively use λ̄ in place of λ, and repeat the re-estimation until some limiting point is reached. The resulting model is called the maximum likelihood (ML) HMM.

SLIDE 43

Re-estimation Formulas

  • 1. The re-estimation formulas can be derived by maximizing the auxiliary function Q(λ, λ̄) over λ̄, i.e.,
    Q(λ, λ̄) = Σ_q P(O, q | λ) log P(O, q | λ̄)
    It can be proved that:
    max_{λ̄} [Q(λ, λ̄)] ⇒ P(O | λ̄) ≥ P(O | λ)
    Eventually the likelihood function converges to a critical point.
  • 2. Relation to EM algorithm:
    – E (Expectation) step is the calculation of the auxiliary function, Q(λ, λ̄)
    – M (Modification) step is the maximization over λ̄

SLIDE 44

Notes on Re-estimation

  • 1. Stochastic constraints on π̄_i, ā_ij, b̄_j(k) are automatically met, i.e.,
    Σ_{i=1}^N π̄_i = 1,  Σ_{j=1}^N ā_ij = 1,  Σ_{k=1}^M b̄_j(k) = 1
  • 2. At the critical points of P = P(O | λ), then
    π̄_i = π_i (∂P/∂π_i) / Σ_{k=1}^N π_k (∂P/∂π_k)
    ā_ij = a_ij (∂P/∂a_ij) / Σ_{k=1}^N a_ik (∂P/∂a_ik)
    b̄_j(k) = b_j(k) (∂P/∂b_j(k)) / Σ_{l=1}^M b_j(l) (∂P/∂b_j(l))
    ⇒ at critical points, the re-estimation formulas are exactly correct.

SLIDE 45

Variations on HMM’s

  • 1. Types of HMM—model structures
  • 2. Continuous observation density models—mixtures
  • 3. Autoregressive HMM’s—LPC links
  • 4. Null transitions and tied states
  • 5. Inclusion of explicit state duration density in HMM’s
  • 6. Optimization criterion—ML, MMI, MDI

SLIDE 46

Types of HMM

  • 1. Ergodic models--no transient states
  • 2. Left-right models--all transient states (except the last state), with the constraints:
    π_i = 1, i = 1;  π_i = 0, i ≠ 1
    a_ij = 0, j < i
    Controlled transitions imply:
    a_ij = 0, j > i + Δ (Δ = 1 or 2, typically)
  • 3. Mixed forms of ergodic and left-right models (e.g., parallel branches)
Note: Constraints of left-right models don't affect the re-estimation formulas (i.e., a parameter initially set to 0 remains at 0 during re-estimation).

SLIDE 47

Types of HMM

Ergodic Model / Left-Right Model / Mixed Model

SLIDE 48

Continuous Observation Density HMM’s

Most general form of pdf with a valid re-estimation procedure is:
b_j(x) = Σ_{m=1}^M c_jm N(x, μ_jm, U_jm), 1 ≤ j ≤ N
where
x = observation vector = (x_1, x_2, ..., x_D)
M = number of mixture densities
c_jm = gain of m-th mixture in state j
N = any log-concave or elliptically symmetric density (e.g., a Gaussian)
μ_jm = mean vector for mixture m, state j
U_jm = covariance matrix for mixture m, state j
c_jm ≥ 0, 1 ≤ j ≤ N, 1 ≤ m ≤ M
Σ_{m=1}^M c_jm = 1, 1 ≤ j ≤ N
∫_{-∞}^{∞} b_j(x) dx = 1, 1 ≤ j ≤ N
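A one-dimensional Gaussian-mixture instance of b_j(x) can be sketched as follows (the mixture parameters are hypothetical; the numerical integral checks that b_j integrates to 1):

```python
import math

def b_j(x, c, mu, var):
    """b_j(x) = sum_m c_jm * N(x; mu_jm, U_jm), scalar Gaussian case."""
    total = 0.0
    for c_m, mu_m, v_m in zip(c, mu, var):
        total += (c_m * math.exp(-((x - mu_m) ** 2) / (2 * v_m))
                  / math.sqrt(2 * math.pi * v_m))
    return total

c, mu, var = [0.6, 0.4], [0.0, 3.0], [1.0, 2.0]   # hypothetical parameters
# Normalization check by Riemann sum over a wide grid:
dx = 0.01
area = sum(b_j(-20 + k * dx, c, mu, var) for k in range(4000)) * dx
print(area)  # ≈ 1.0
```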

SLIDE 49

State Equivalence Chart

Equivalence of a state with mixture density to the multi-state, single-mixture case

SLIDE 50

Re-estimation for Mixture Densities

c̄_jk = Σ_{t=1}^T γ_t(j,k) / Σ_{t=1}^T Σ_{k=1}^M γ_t(j,k)
μ̄_jk = Σ_{t=1}^T γ_t(j,k) · O_t / Σ_{t=1}^T γ_t(j,k)
Ū_jk = Σ_{t=1}^T γ_t(j,k) · (O_t − μ_jk)(O_t − μ_jk)′ / Σ_{t=1}^T γ_t(j,k)
where γ_t(j,k) is the probability of being in state j at time t with the k-th mixture component accounting for O_t:
γ_t(j,k) = [α_t(j) β_t(j) / Σ_{j=1}^N α_t(j) β_t(j)] · [c_jk N(O_t, μ_jk, U_jk) / Σ_{m=1}^M c_jm N(O_t, μ_jm, U_jm)]

SLIDE 51

Autoregressive HMM

Consider an observation vector O = (x_0, x_1, ..., x_{K-1}) where each x_k is a waveform sample, and O represents a frame of the signal (e.g., K = 256 samples). We assume x_k is related to the p previous samples of O by a Gaussian autoregressive process of order p, i.e.,
x_k = −Σ_{i=1}^p a_i x_{k−i} + e_k, 0 ≤ k ≤ K−1
where the e_k are Gaussian, independent, identically distributed random variables with zero mean and variance σ², and a_i, 1 ≤ i ≤ p, are the autoregressive or predictor coefficients. As K → ∞, then
f(O) = (2πσ²)^{−K/2} exp{ −(1/2σ²) δ(O, a) }
where
δ(O, a) = r_a(0) r(0) + 2 Σ_{i=1}^p r_a(i) r(i)

SLIDE 52

Autoregressive HMM

r_a(i) = Σ_{n=0}^{p−i} a_n a_{n+i} (a_0 = 1), 1 ≤ i ≤ p (autocorrelation of the predictor coefficients)
r(i) = Σ_{n=0}^{K−i−1} x_n x_{n+i}, 0 ≤ i ≤ p (autocorrelation of the observation samples)
a = (1, a_1, a_2, ..., a_p)′
The prediction residual is:
E = E[ Σ_k (e_k)² ] = K σ²
Consider the normalized observation vector:
Ô = O / √E = O / √(K σ²)
f(Ô) = (2π/K)^{−K/2} exp{ −(K/2) δ(Ô, a) }
In practice, K is replaced by K̂, the effective frame length, e.g., K̂ = K/3 for frame overlap of 3 to 1.

SLIDE 53

Application of Autoregressive HMM

b_j(O) = Σ_{m=1}^M c_jm b_jm(O)
b_jm(O) = (2π)^{−K/2} exp{ −(K/2) δ(O, a_jm) }
Each mixture is characterized by a predictor vector, or by an autocorrelation vector from which the predictor vector can be derived. The re-estimation formulas for r̄_jk are:
r̄_jk = Σ_{t=1}^T γ_t(j,k) · r_t / Σ_{t=1}^T γ_t(j,k)
γ_t(j,k) = [α_t(j) β_t(j) / Σ_{j=1}^N α_t(j) β_t(j)] · [c_jk b_jk(O_t) / Σ_{m=1}^M c_jm b_jm(O_t)]

SLIDE 54

Null Transitions and Tied States

Null Transitions: transitions which produce no output, and take no time, denoted by φ

Tied States: sets up an equivalence relation between HMM parameters in different states
– number of independent parameters of the model reduced
– parameter estimation becomes simpler
– useful in cases where there is insufficient training data for reliable estimation of all model parameters

SLIDE 55

Null Transitions

SLIDE 56

Inclusion of Explicit State Duration Density

For standard HMM's, the duration density is:
p_i(d) = probability of exactly d observations in state S_i = (a_ii)^{d−1} (1 − a_ii)
With an arbitrary state duration density, p_i(d), observations are generated as follows:
  • 1. an initial state, q_1 = S_i, is chosen according to the initial state distribution, π_i
  • 2. a duration d_1 is chosen according to the state duration density p_{q_1}(d_1)
  • 3. observations O_1 O_2 ... O_{d_1} are chosen according to the joint density b_{q_1}(O_1 O_2 ... O_{d_1}); generally we assume independence, so b_{q_1}(O_1 O_2 ... O_{d_1}) = Π_{t=1}^{d_1} b_{q_1}(O_t)
  • 4. the next state, q_2 = S_j, is chosen according to the state transition probabilities, a_{q_1 q_2}, with the constraint that a_{q_1 q_1} = 0, i.e., no transition back to the same state can occur.

SLIDE 57

Explicit State Duration Density

Standard HMM vs. HMM with explicit state duration density

SLIDE 58

Explicit State Duration Density

state durations:  d_1              d_2                        d_3
states:           q_1              q_2                        q_3
observations:     O_1 ... O_{d_1}  O_{d_1+1} ... O_{d_1+d_2}  O_{d_1+d_2+1} ... O_{d_1+d_2+d_3}

Assume:
  • 1. the first state, q_1, begins at t = 1
  • 2. the last state, q_r, ends at t = T, i.e., entire duration intervals are included within the observation sequence O_1 O_2 ... O_T
Modified α: α_t(i) = P(O_1 O_2 ... O_t, S_i ending at t | λ)
Assume r states in the first t observations, i.e., Q = q_1 q_2 ... q_r with q_r = S_i, with durations D = d_1 d_2 ... d_r such that Σ_{s=1}^r d_s = t

SLIDE 59

Explicit State Duration Density

Then we have
α_t(i) = Σ_Q Σ_D π_{q_1} p_{q_1}(d_1) P(O_1 ... O_{d_1} | q_1) · a_{q_1 q_2} p_{q_2}(d_2) P(O_{d_1+1} ... O_{d_1+d_2} | q_2) · ... · a_{q_{r−1} q_r} p_{q_r}(d_r) P(O_{d_1+...+d_{r−1}+1} ... O_t | q_r)
By induction:
α_t(j) = Σ_{i=1}^N Σ_{d=1}^D α_{t−d}(i) a_ij p_j(d) Π_{s=t−d+1}^t b_j(O_s)
Initialization of α_t(i):
α_1(i) = π_i p_i(1) b_i(O_1)
α_2(i) = π_i p_i(2) Π_{s=1}^2 b_i(O_s) + Σ_{j=1, j≠i}^N α_1(j) a_ji p_i(1) b_i(O_2)
α_3(i) = π_i p_i(3) Π_{s=1}^3 b_i(O_s) + Σ_{j=1, j≠i}^N Σ_{d=1}^2 α_{3−d}(j) a_ji p_i(d) Π_{s=4−d}^3 b_i(O_s)
Finally:
P(O | λ) = Σ_{i=1}^N α_T(i)

SLIDE 60

Explicit State Duration Density

re-estimation formulas for ā_ij, b̄_i(k), and p̄_i(d) can be formulated and appropriately interpreted; modifications to Viterbi scoring are also required, i.e.,
δ_t(i) = P(O_1 O_2 ... O_t, q_1 q_2 ... q_r ending at t in S_i | λ)
Basic Recursion:
δ_t(i) = max_{1≤j≤N, j≠i} max_{1≤d≤D} [ δ_{t−d}(j) a_ji p_i(d) Π_{s=t−d+1}^t b_i(O_s) ]
⇒ storage required for δ_{t−1} ... δ_{t−D}: N·D locations
⇒ maximization involves all D·N terms--not just the old N δ's and a_ji as in the previous case
⇒ significantly larger computational load: ≈ (D²/2) N²T computations involving b_j(O)
Example: N = 5, D = 20, T = 100:
                implicit duration    explicit duration
storage         5                    100
computation     2500                 500,000

SLIDE 61

Issues with Explicit State Duration Density

  • 1. quality of signal modeling is often improved significantly
  • 2. significant increase in the number of parameters per state (D duration estimates)
  • 3. significant increase in the computation associated with the probability calculation (≈ D²/2)
  • 4. insufficient data to give good p_i(d) estimates
Alternatives:
  • 1. use a parametric state duration density, e.g.,
    p_i(d) = N(d, μ_i, σ_i²) -- Gaussian
    p_i(d) = η_i^{ν_i} d^{ν_i−1} e^{−η_i d} / Γ(ν_i) -- Gamma
  • 2. incorporate state duration information after the probability calculation, e.g., in a post-processor

SLIDE 62

Alternatives to ML Estimation

Assume we wish to design V different HMM’s, λ_1, λ_2, ..., λ_V. Normally we design each HMM, λ_v, based on a training set of observations, O^v, using a maximum likelihood (ML) criterion, i.e.,
P_v* = max_{λ_v} P(O^v | λ_v)
Consider the mutual information, I_v, between the observation sequence, O^v, and the complete set of models λ = (λ_1, λ_2, ..., λ_V):
I_v = log P(O^v | λ_v) − log Σ_{w=1}^V P(O^v | λ_w)
Consider maximizing I_v over λ, giving
I_v* = max_λ [ log P(O^v | λ_v) − log Σ_{w=1}^V P(O^v | λ_w) ]
i.e., choose λ so as to separate the correct model, λ_v, from all other models, as much as possible, for the training set, O^v.

SLIDE 63

Alternatives to ML Estimation

Sum over all training sets to give models designed according to an MMI criterion, i.e.,
I* = max_λ Σ_{v=1}^V { log P(O^v | λ_v) − log Σ_{w=1}^V P(O^v | λ_w) }
with solution via steepest descent methods.

SLIDE 64

Comparison of Comparison of HMM’s HMM’s

λ λ

1 2

: given two HMM's, and , is it possible to give a measure of how similar the two models are Problem Example :

( ) ( )

ν ν ⇔ = + − − = + − − − − = = = + − − = − = =

1 1 2 2

For , , we require ( ) to be the same for both models and for all symbols . Thus we require (1 )(1 ) (1 )(1 ) 2 2 1 2 1 2 Let 0.6, 0.7,

equivalent t k k

A B A B P O pq p q rs r s pq p q rs r s p pq r s r p q r = =

  • 0.2, then

13/30 0.433 s
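The algebra in this example is easy to check numerically. A small sketch (Python; mine, not the lecture's) solving pq + (1-p)(1-q) = rs + (1-r)(1-s) for s:

```python
def equivalent_s(p, q, r):
    """Solve pq + (1-p)(1-q) = rs + (1-r)(1-s) for s:
    s = (p + q - 2pq - r) / (1 - 2r)."""
    return (p + q - 2 * p * q - r) / (1 - 2 * r)

s = equivalent_s(0.6, 0.7, 0.2)
# s = 13/30 = 0.4333..., matching the slide's value
```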

SLIDE 65

Comparison of HMM's

Thus the two models have very different A and B matrices, but are equivalent in the sense that all symbol probabilities (averaged over time) are the same. We generalize the concept of model distance (dissimilarity) by defining a distance measure, D(\lambda_1, \lambda_2), between two Markov sources, \lambda_1 and \lambda_2, as

D(\lambda_1, \lambda_2) = \frac{1}{T} \left[ \log P(O^{(2)} \mid \lambda_1) - \log P(O^{(2)} \mid \lambda_2) \right]

where O^{(2)} is a sequence of observations generated by model \lambda_2 and scored by both models. We symmetrize D by using the relation:

D_S(\lambda_1, \lambda_2) = \frac{1}{2} \left[ D(\lambda_1, \lambda_2) + D(\lambda_2, \lambda_1) \right]
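Given log-likelihood scores of each model on sequences generated by the other, the distance and its symmetrized form are direct to compute. A sketch (Python; the example scores are invented placeholders):

```python
def model_distance(log_p_cross, log_p_self, T):
    """D(lam1, lam2) = (1/T) * [log P(O2|lam1) - log P(O2|lam2)],
    where O2 (length T) was generated by lam2 and scored by both models."""
    return (log_p_cross - log_p_self) / T

def symmetric_distance(d12, d21):
    """D_S(lam1, lam2) = [D(lam1, lam2) + D(lam2, lam1)] / 2."""
    return 0.5 * (d12 + d21)

# invented example scores: each model assigns lower likelihood to the
# other model's data than to its own
d = symmetric_distance(model_distance(-110.0, -100.0, 100),
                       model_distance(-105.0, -100.0, 100))
```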

SLIDE 66

Implementation Issues for HMM's

  • 1. Scaling—to prevent underflow and/or overflow.
  • 2. Multiple Observation Sequences—to train left-right models.
  • 3. Initial Estimates of HMM Parameters—to provide robust models.
  • 4. Effects of Insufficient Training Data
SLIDE 67

Scaling

\alpha_t(i) is a sum of a large number of terms, each of the form:

\left[ \prod_{s=1}^{t-1} a_{q_s q_{s+1}} \right] \left[ \prod_{s=1}^{t} b_{q_s}(O_s) \right]

Since each a and b term is less than 1, as t gets larger, \alpha_t(i) exponentially heads to 0; thus scaling is required to prevent underflow.

Consider scaling \alpha_t(i) by the factor

c_t = \frac{1}{\sum_{i=1}^{N} \alpha_t(i)}

independent of i. We denote the scaled \alpha's as:

\hat{\alpha}_t(i) = c_t \, \alpha_t(i) = \frac{\alpha_t(i)}{\sum_{i=1}^{N} \alpha_t(i)}, \qquad \sum_{i=1}^{N} \hat{\alpha}_t(i) = 1

SLIDE 68

Scaling

For fixed t, we compute

\hat{\alpha}_t(i) = \frac{\sum_{j=1}^{N} \hat{\alpha}_{t-1}(j) \, a_{ji} \, b_i(O_t)}{\sum_{i=1}^{N} \sum_{j=1}^{N} \hat{\alpha}_{t-1}(j) \, a_{ji} \, b_i(O_t)}

By induction we get

\hat{\alpha}_{t-1}(j) = \left[ \prod_{\tau=1}^{t-1} c_\tau \right] \alpha_{t-1}(j)

giving

\hat{\alpha}_t(i) = \frac{\sum_{j=1}^{N} \alpha_{t-1}(j) \, a_{ji} \, b_i(O_t)}{\sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_{t-1}(j) \, a_{ji} \, b_i(O_t)} = \frac{\alpha_t(i)}{\sum_{i=1}^{N} \alpha_t(i)}

SLIDE 69

Scaling

For scaling the \beta_t(i) terms we use the same scale factors as for the \alpha_t(i) terms, i.e.,

\hat{\beta}_t(i) = c_t \, \beta_t(i)

since the magnitudes of the \alpha and \beta terms are comparable. The re-estimation formula for a_{ij} in terms of the scaled \alpha's and \beta's is:

\bar{a}_{ij} = \frac{\sum_{t=1}^{T-1} \hat{\alpha}_t(i) \, a_{ij} \, b_j(O_{t+1}) \, \hat{\beta}_{t+1}(j)}{\sum_{j=1}^{N} \sum_{t=1}^{T-1} \hat{\alpha}_t(i) \, a_{ij} \, b_j(O_{t+1}) \, \hat{\beta}_{t+1}(j)}

We have

\hat{\alpha}_t(i) = \left[ \prod_{\tau=1}^{t} c_\tau \right] \alpha_t(i) = C_t \, \alpha_t(i)

\hat{\beta}_{t+1}(j) = \left[ \prod_{\tau=t+1}^{T} c_\tau \right] \beta_{t+1}(j) = D_{t+1} \, \beta_{t+1}(j)

SLIDE 70

Scaling

giving

\bar{a}_{ij} = \frac{\sum_{t=1}^{T-1} C_t \, \alpha_t(i) \, a_{ij} \, b_j(O_{t+1}) \, D_{t+1} \, \beta_{t+1}(j)}{\sum_{j=1}^{N} \sum_{t=1}^{T-1} C_t \, \alpha_t(i) \, a_{ij} \, b_j(O_{t+1}) \, D_{t+1} \, \beta_{t+1}(j)}

Since C_t D_{t+1} = \prod_{\tau=1}^{T} c_\tau, independent of t, the scale factors cancel out of the re-estimation formula.

Notes on Scaling:

  • 1. the scaling procedure works equally well on the \pi or B coefficients
  • 2. scaling need not be performed each iteration; set c_t = 1 whenever scaling is skipped
  • 3. we can solve for P(O \mid \lambda) from the scaled coefficients: since

\sum_{i=1}^{N} \hat{\alpha}_T(i) = C_T \sum_{i=1}^{N} \alpha_T(i) = \left[ \prod_{t=1}^{T} c_t \right] P(O \mid \lambda) = 1

we get

P(O \mid \lambda) = \frac{1}{\prod_{t=1}^{T} c_t}, \qquad \log P(O \mid \lambda) = -\sum_{t=1}^{T} \log c_t
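The scaling recursion and the log-probability identity can be combined in a single forward pass. A minimal sketch (Python; mine, not the lecture's, with a_{ji} stored as A[j][i] and b_i(k) as B[i][k]):

```python
import math

def scaled_forward(pi, A, B, obs):
    """Scaled forward pass.  pi[i] = initial probs, A[j][i] = a_ji,
    B[i][k] = b_i(k), obs = list of symbol indices.
    At each t the alpha's are normalized to sum to 1, and
    log P(O|lambda) is accumulated as -sum_t log c_t."""
    N = len(pi)
    alpha = [pi[i] * B[i][obs[0]] for i in range(N)]
    log_prob = 0.0
    for t in range(1, len(obs) + 1):
        c = 1.0 / sum(alpha)                 # scale factor c_t
        alpha = [c * a for a in alpha]       # alpha-hat_t sums to 1
        log_prob -= math.log(c)              # accumulate -log c_t
        if t < len(obs):                     # induction step to t+1
            alpha = [sum(alpha[j] * A[j][i] for j in range(N)) * B[i][obs[t]]
                     for i in range(N)]
    return log_prob
```

On sequences short enough that the unscaled forward pass does not underflow, exp(log_prob) matches the unscaled forward probability exactly.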

SLIDE 71

Multiple Observation Sequences

For left-right models, we need to use multiple sequences of observations for training. Assume a set of K observation sequences (i.e., training utterances):

O = [O^{(1)}, O^{(2)}, ..., O^{(K)}], where O^{(k)} = [O_1^{(k)} O_2^{(k)} ... O_{T_k}^{(k)}]

We wish to maximize the probability

P(O \mid \lambda) = \prod_{k=1}^{K} P(O^{(k)} \mid \lambda) = \prod_{k=1}^{K} P_k

The re-estimation formula becomes

\bar{a}_{ij} = \frac{\sum_{k=1}^{K} \frac{1}{P_k} \sum_{t=1}^{T_k - 1} \alpha_t^{(k)}(i) \, a_{ij} \, b_j(O_{t+1}^{(k)}) \, \beta_{t+1}^{(k)}(j)}{\sum_{k=1}^{K} \frac{1}{P_k} \sum_{t=1}^{T_k - 1} \alpha_t^{(k)}(i) \, \beta_t^{(k)}(i)}

Scaling requires:

\bar{a}_{ij} = \frac{\sum_{k=1}^{K} \sum_{t=1}^{T_k - 1} \hat{\alpha}_t^{(k)}(i) \, a_{ij} \, b_j(O_{t+1}^{(k)}) \, \hat{\beta}_{t+1}^{(k)}(j)}{\sum_{k=1}^{K} \sum_{t=1}^{T_k - 1} \hat{\alpha}_t^{(k)}(i) \, \hat{\beta}_t^{(k)}(i)}

where all scaling factors cancel out.

SLIDE 72

Initial Estimates of HMM Parameters

  • N -- choose based on physical considerations
  • M -- choose based on model fits
  • \pi_i -- random or uniform (\pi_i \neq 0)
  • a_{ij} -- random or uniform (a_{ij} \neq 0)
  • b_j(k) -- random or uniform (b_j(k) \geq \epsilon)
  • b_j(O) (continuous densities) -- need good initial estimates of the mean vectors; need reasonable estimates of the covariance matrices

SLIDE 73

Effects of Insufficient Training Data

Insufficient training data leads to poor estimates of model parameters. Possible solutions:

  • 1. use more training data--often this is impractical
  • 2. reduce the size of the model--often there are physical reasons for keeping a chosen model size
  • 3. add extra constraints to model parameters: b_j(k) \geq \epsilon, U_{jk}(r, r) \geq \delta; often the model performance is relatively insensitive to the exact choice of \epsilon, \delta
  • 4. method of deleted interpolation: \bar{\lambda} = \epsilon \lambda + (1 - \epsilon) \lambda', combining a full model \lambda with a reduced model \lambda'
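Item 4 can be sketched for a single discrete density row, where the models being interpolated are just probability vectors. A minimal illustration (Python; mine, with a fixed illustrative ε, whereas in practice ε is chosen on the deleted, i.e. held-out, data):

```python
def deleted_interpolation(b_full, b_reduced, eps):
    """b_bar(k) = eps * b_full(k) + (1 - eps) * b_reduced(k).
    Both inputs are probability vectors over the same symbol set."""
    return [eps * x + (1.0 - eps) * y for x, y in zip(b_full, b_reduced)]

# interpolating a sparse ML estimate with a uniform fallback removes zeros
b = deleted_interpolation([0.8, 0.2, 0.0], [1/3, 1/3, 1/3], eps=0.9)
```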

SLIDE 74

Methods for Insufficient Data

Performance insensitivity to ε

SLIDE 75

Deleted Interpolation

SLIDE 76

Isolated Word Recognition Using HMM's

Assume a vocabulary of V words, with K occurrences of each spoken word in a training set. Observation vectors are spectral characterizations of the word. For isolated word recognition, we do the following:

  • 1. for each word, v, in the vocabulary, we must build an HMM, \lambda_v, i.e., we must re-estimate model parameters (A, B, \Pi) that optimize the likelihood of the training set observation vectors for the v-th word. (TRAINING)
  • 2. for each unknown word which is to be recognized, we do the following:
  • a. measure the observation sequence O = [O_1 O_2 ... O_T]
  • b. calculate model likelihoods, P(O \mid \lambda_v), 1 \leq v \leq V
  • c. select the word whose model likelihood score is highest:

v^* = \arg\max_{1 \leq v \leq V} P(O \mid \lambda_v)

Computation on the order of V \cdot N^2 \cdot T is required; for V = 100, N = 5, T = 40, this is about 10^5 computations.
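Steps 2a-2c amount to an argmax over per-word model scores. A sketch (Python; mine, with a toy scoring function standing in for e.g. a scaled forward pass over real word HMMs):

```python
def recognize(obs, word_models, score):
    """Return v* = argmax_v score(obs, lambda_v), where word_models maps
    word -> model and score returns a (log-)likelihood."""
    return max(word_models, key=lambda v: score(obs, word_models[v]))

# toy stand-in: each "model" is a favorite symbol; score counts matches
models = {"yes": 0, "no": 1}
score = lambda obs, m: sum(1 for o in obs if o == m)
best = recognize([0, 0, 1, 0], models, score)
# best == "yes"
```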

SLIDE 77

Isolated Word HMM Recognizer

SLIDE 78

Choice of Model Parameters

1. Left-right model preferable to ergodic model (speech is a left-right process)
2. Number of states in range 2-40 (from sounds to frames)

  • Order of number of distinct sounds in the word
  • Order of average number of observations in word

3. Observation vectors

  • Cepstral coefficients (and their second and third order derivatives) derived from LPC (1-9 mixtures), diagonal covariance matrices
  • Vector quantized discrete symbols (16-256 codebook sizes)

4. Constraints on b_j(O) densities

  • b_j(k) > ε for discrete densities
  • c_{jm} > δ, U_{jm}(r,r) > δ for continuous densities
SLIDE 79

Performance vs. Number of States in Model

SLIDE 80

HMM Feature Vector Densities

SLIDE 81

Segmental K-Means Segmentation into States

Motivation: derive good estimates of the b_j(O) densities as required for rapid convergence of the re-estimation procedure.

Initially: training set of multiple sequences of observations; initial model estimate.

Procedure: segment each observation sequence into states using a Viterbi procedure. For discrete observation densities, code all observations in state j using the M-codeword codebook, giving

b_j(k) = (number of vectors with codebook index k in state j) / (number of vectors in state j)

For continuous observation densities, cluster the observations in state j into a set of M clusters, giving

SLIDE 82

Segmental K-Means Segmentation into States

c_{jm} = (number of vectors assigned to cluster m of state j) / (number of vectors in state j)

μ_{jm} = sample mean of the vectors assigned to cluster m of state j

U_{jm} = sample covariance of the vectors assigned to cluster m of state j

Use as the estimate of the state transition probabilities:

a_{ii} = (number of vectors in state i minus the number of observation sequences for the training word) / (number of vectors in state i)

a_{i,i+1} = 1 - a_{ii}

The segmenting HMM is updated and the procedure is iterated until a converged model is obtained.
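The b_j(k) estimate above is a simple relative-frequency count over the Viterbi state labels. A sketch (Python; mine, with invented variable names):

```python
def estimate_discrete_b(state_labels, codebook_indices, N, M):
    """b_j(k) = (# vectors in state j with codebook index k)
              / (# vectors in state j).
    state_labels[n] is the Viterbi state of vector n; codebook_indices[n]
    is its VQ codeword; N states, M codewords."""
    counts = [[0] * M for _ in range(N)]
    totals = [0] * N
    for j, k in zip(state_labels, codebook_indices):
        counts[j][k] += 1
        totals[j] += 1
    return [[counts[j][k] / totals[j] for k in range(M)] for j in range(N)]

b = estimate_discrete_b([0, 0, 1, 1], [0, 1, 1, 1], N=2, M=2)
# b = [[0.5, 0.5], [0.0, 1.0]]
```

The zero entry in b illustrates why the b_j(k) ≥ ε constraint (slide 73) matters when training data are sparse.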

SLIDE 83

Segmental K-Means Training

SLIDE 84

HMM Segmentation for /SIX/

SLIDE 85

Digit Recognition Using HMM's

[Figure: unknown digit utterance -- log energy, frame likelihood scores, frame cumulative scores, and state segmentation against the model for "nine"]

SLIDE 86

Digit Recognition Using HMM's

[Figure: unknown digit utterance -- log energy, frame likelihood scores, frame cumulative scores, and state segmentation against the models for "seven" and "six"]

SLIDE 87