SLIDE 1

Hidden Markov Models

Hsin-min Wang

References:
1. L. R. Rabiner and B. H. Juang (1993), Fundamentals of Speech Recognition, Chapter 6
2. X. Huang et al. (2001), Spoken Language Processing, Chapter 8
3. L. R. Rabiner (1989), "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition," Proceedings of the IEEE, vol. 77, no. 2, February 1989

SLIDE 2

Speech Recognition - Acoustic Processing

Signal processing: the speech waveform is framed and converted into a feature vector sequence O = o1 o2 o3 o4 ... oT.

Hidden Markov model: the feature vector sequence is modeled by states s=1, s=2, s=3 with transition probabilities a11, a12, a22, a23, a33 and output distributions b1(o), b2(o), b3(o), where

  aij = P(st=j|st-1=i)
  bi(ot) = Σk=1..M cik N(ot; µik, Σik)

Decoding finds the best state sequence S* and word sequence W*:

  S* = argmaxS P(O|S)
  W* = argmaxW P(O|W)

SLIDE 3

Hidden Markov Model (HMM)

History
– Published in Baum's papers in the late 1960s and early 1970s
– Introduced to speech processing by Baker (CMU) and Jelinek (IBM) in the 1970s

Assumption
– The speech signal can be characterized as a parametric random process
– The parameters can be estimated in a precise, well-defined manner

Three fundamental problems
– Evaluation of the probability (likelihood) of a sequence of observations given a specific HMM
– Determination of a best sequence of model states
– Adjustment of the model parameters so as to best account for the observed signal
SLIDE 4

Several Useful Formulas

Bayes' Rule:
  P(A|B) = P(A,B)/P(B) = P(B|A)P(A)/P(B)
  P(A|B,λ) = P(A,B|λ)/P(B|λ) = P(B|A,λ)P(A|λ)/P(B|λ),  λ: model describing the probability

Total probability:
  P(A) = ΣB P(A,B) = ΣB P(A|B)P(B), if B is discrete
  P(A) = ∫ f(A,B)dB = ∫ f(A|B)f(B)dB, if B is continuous

Independence:
  P(x1,x2,…,xn) = P(x1)P(x2)…P(xn), if x1,x2,…,xn are independent

Expectation:
  E[q(z)] = Σk q(k)P(z=k), if z is discrete
  E[q(z)] = ∫ q(z)f(z)dz, if z is continuous

SLIDE 5

The Markov Chain

An Observable Markov Model
– A Markov chain with N states labeled {1,…,N}; with the state at time t denoted qt, the parameters of the chain are
  aij = P(qt=j|qt-1=i), 1≤i,j≤N, with Σj=1..N aij = 1 (all i)
  πi = P(q1=i), 1≤i≤N, with Σi=1..N πi = 1
– The output of the process is the set of states at each time instant t, where each state corresponds to an observable event Xi
– There is a one-to-one correspondence between the observable sequence and the Markov chain state sequence (the observation is deterministic!)

(Rabiner 1989)

By the chain rule,
  P(X1,X2,…,Xn) = P(X1)P(X2|X1)…P(Xn|X1,X2,…,Xn-1)

For a first-order Markov chain,
  P(X1,X2,…,Xn) = P(X1) Πi=2..n P(Xi|Xi-1)

SLIDE 6

The Markov Chain - Ex 1

Example 1: a 3-state Markov chain λ
– State 1 generates symbol A only, state 2 generates symbol B only, and state 3 generates symbol C only

  π = [0.4 0.5 0.1]
      | 0.6 0.3 0.1 |
  A = | 0.1 0.7 0.2 |
      | 0.3 0.2 0.5 |

– Given a sequence of observed symbols O={CABBCABC}, the only corresponding state sequence is Q={S3 S1 S2 S2 S3 S1 S2 S3}, and the corresponding probability is
  P(O|λ) = P(CABBCABC|λ) = P(Q|λ) = P(S3 S1 S2 S2 S3 S1 S2 S3|λ)
  = π(S3)P(S1|S3)P(S2|S1)P(S2|S2)P(S3|S2)P(S1|S3)P(S2|S1)P(S3|S2)
  = 0.1×0.3×0.3×0.7×0.2×0.3×0.3×0.2 = 0.00002268
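The computation above can be sketched in a few lines of Python (a minimal sketch; the matrix values are the ones on this slide, with 0-based state indices):

```python
# Probability of the single state path Q = S3 S1 S2 S2 S3 S1 S2 S3 that can
# generate O = CABBCABC in the observable Markov model of Slide 6.
pi = [0.4, 0.5, 0.1]                 # initial state probabilities
A = [[0.6, 0.3, 0.1],                # a_ij = P(q_t = j | q_t-1 = i)
     [0.1, 0.7, 0.2],
     [0.3, 0.2, 0.5]]

def path_probability(states, pi, A):
    """P(Q|lambda) = pi(q1) * product of the transition probabilities."""
    p = pi[states[0]]
    for prev, cur in zip(states, states[1:]):
        p *= A[prev][cur]
    return p

# S3 S1 S2 S2 S3 S1 S2 S3 with 0-based indices:
p = path_probability([2, 0, 1, 1, 2, 0, 1, 2], pi, A)   # ≈ 0.00002268
```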

SLIDE 7

The Markov Chain - Ex 2

Example 2: a three-state Markov chain for the Dow Jones Industrial average

  π = (πi) = (0.5, 0.2, 0.3)t

The probability of 5 consecutive up days:
  P(5 consecutive up days) = P(1,1,1,1,1) = π1 a11 a11 a11 a11 = 0.5×(0.6)^4 = 0.0648

(Huang et al., 2001)

SLIDE 8

Extension to Hidden Markov Models

HMM: an extended version of the Observable Markov Model
– The observation is a probabilistic function (discrete or continuous) of a state, instead of being in one-to-one correspondence with the state
– The model is a doubly embedded stochastic process with an underlying stochastic process that is not directly observable (hidden)

What is hidden? The state sequence! Given the observation sequence, we are not sure which state sequence generated it!

SLIDE 9

Hidden Markov Models - Ex 1

Example: a 3-state discrete HMM λ

  π = [0.4 0.5 0.1]
      | 0.6 0.3 0.1 |
  A = | 0.1 0.7 0.2 |
      | 0.3 0.2 0.5 |
  b1: {A:.3, B:.2, C:.5}, b2: {A:.7, B:.1, C:.2}, b3: {A:.3, B:.6, C:.1}

– Given a sequence of observations O={ABC}, there are 27 possible corresponding state sequences, and therefore the corresponding probability is
  P(O|λ) = Σi=1..27 P(O, Qi|λ) = Σi=1..27 P(O|Qi,λ)P(Qi|λ)
  e.g. when Qi = {S2, S2, S3}:
  P(O|Qi,λ) = P(A|S2)P(B|S2)P(C|S3) = 0.7×0.1×0.1 = 0.007
  P(Qi|λ) = π(S2)P(S2|S2)P(S3|S2) = 0.5×0.7×0.2 = 0.07
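A brute-force check of this sum, enumerating all 27 state sequences, can be sketched as follows (model values read off the slide; the total is not stated on the slide, so the code only sums the terms):

```python
from itertools import product

# Brute-force evaluation of P(O|lambda) for O = ABC by summing P(O,Q|lambda)
# over all 3^3 = 27 state sequences, as described on Slide 9.
pi = [0.4, 0.5, 0.1]
A = [[0.6, 0.3, 0.1],
     [0.1, 0.7, 0.2],
     [0.3, 0.2, 0.5]]
B = [{'A': 0.3, 'B': 0.2, 'C': 0.5},   # b1
     {'A': 0.7, 'B': 0.1, 'C': 0.2},   # b2
     {'A': 0.3, 'B': 0.6, 'C': 0.1}]   # b3

O = ['A', 'B', 'C']

def joint(O, Q):
    """P(O, Q | lambda) = P(Q|lambda) * P(O|Q,lambda)."""
    p = pi[Q[0]] * B[Q[0]][O[0]]
    for t in range(1, len(O)):
        p *= A[Q[t - 1]][Q[t]] * B[Q[t]][O[t]]
    return p

total = sum(joint(O, Q) for Q in product(range(3), repeat=3))
# The slide's example path Q = (S2, S2, S3):
example = joint(O, (1, 1, 2))   # 0.07 * 0.007 = 4.9e-04
```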

SLIDE 10

Hidden Markov Models – Ex 2

Given a three-state Hidden Markov Model for the Dow Jones Industrial average as follows:

(Huang et al., 2001)

How to find the probability P(up, up, up, up, up|λ)? How to find the optimal state sequence of the model which generates the observation sequence “up, up, up, up, up”?

SLIDE 11

Elements of an HMM

An HMM is characterized by the following:

  • 1. N, the number of states in the model
  • 2. M, the number of distinct observation symbols per state
  • 3. The state transition probability distribution A={aij}, where

aij=P[qt+1=j|qt=i], 1≤i,j≤N

  • 4. The observation symbol probability distribution in state j,

B={bj(vk)} , where bj(vk)=P[ot=vk|qt=j], 1≤j≤N, 1≤k≤M

  • 5. The initial state distribution π={πi}, where πi=P[q1=i], 1≤i≤N

For convenience, we usually use a compact notation λ=(A,B,π) to indicate the complete parameter set of an HMM

– Requires specification of two model parameters (N and M)
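As a minimal sketch, the parameter set λ=(A,B,π) can be held in a small container that checks the stochastic constraints. The Dow Jones numbers used to populate it come from Huang et al. (2001); only some of them (π, b(up), the first column of A) appear later in this deck, so treat the remaining entries as assumptions:

```python
# Minimal container for lambda = (A, B, pi), mirroring items 1-5 on this slide.
class HMM:
    def __init__(self, A, B, pi):
        self.A, self.B, self.pi = A, B, pi
        self.N = len(pi)       # number of states
        self.M = len(B[0])     # number of distinct observation symbols
        # Stochastic constraints: each row of A and B, and pi, sums to 1.
        assert all(abs(sum(row) - 1.0) < 1e-9 for row in A)
        assert all(abs(sum(row) - 1.0) < 1e-9 for row in B)
        assert abs(sum(pi) - 1.0) < 1e-9

# Dow Jones model (Huang et al., 2001); entries not shown in this deck are assumed.
# Observation symbols: 0=up, 1=down, 2=unchanged.
dow = HMM(A=[[0.6, 0.3, 0.1], [0.5, 0.2, 0.3], [0.4, 0.1, 0.5]],
          B=[[0.7, 0.1, 0.2], [0.1, 0.6, 0.3], [0.3, 0.3, 0.4]],
          pi=[0.5, 0.2, 0.3])
```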

SLIDE 12

Two Major Assumptions for HMMs

First-order Markov assumption
– The state transition depends only on the origin and destination states
– The state transition probability is time invariant:
  aij = P[qt+1=j|qt=i], 1≤i,j≤N

Output-independent assumption
– The observation depends only on the state that generates it, not on its neighboring observations

SLIDE 13

Three Basic Problems for HMMs

Given an observation sequence O=(o1,o2,…,oT) and an HMM λ=(A,B,π):
– Problem 1: How to efficiently compute P(O|λ)? → Evaluation problem
– Problem 2: How to choose an optimal state sequence Q=(q1,q2,…,qT) which best explains the observations? → Decoding problem
  Q* = argmaxQ P(Q, O|λ)
– Problem 3: How to adjust the model parameters λ=(A,B,π) to maximize P(O|λ)? → Learning/Training problem

SLIDE 14

Solution to Problem 1 - Direct Evaluation

Given O and λ, find P(O|λ) = Pr{observing O given λ} by evaluating all possible state sequences Q of length T that generate the observation sequence O:

  P(O|λ) = Σall Q P(O, Q|λ) = Σall Q P(Q|λ)P(O|Q,λ)

P(Q|λ): the probability of the path Q
– By the first-order Markov assumption,
  P(Q|λ) = P(q1|λ) Πt=2..T P(qt|qt-1,λ) = πq1 aq1q2 aq2q3 … aqT-1qT

P(O|Q,λ): the joint output probability along the path Q
– By the output-independent assumption,
  P(O|Q,λ) = Πt=1..T P(ot|qt,λ) = Πt=1..T bqt(ot)

SLIDE 15

Solution to Problem 1 - Direct Evaluation (cont.)

(Trellis diagram: states s1, s2, s3 at times 1, 2, 3, …, T-1, T over observations O1, O2, O3, …, OT-1, OT; a shaded state si means bi(ot) has been computed, and a marked arc means aij has been computed.)

SLIDE 16

Solution to Problem 1 - Direct Evaluation (cont.)

– Huge computation requirements: O(N^T) (there are N^T state sequences)

  P(O|λ) = Σq1,q2,…,qT πq1 bq1(o1) aq1q2 bq2(o2) … aqT-1qT bqT(oT)

  Complexity: MUL: (2T-1)N^T, ADD: N^T - 1, so roughly 2T·N^T operations

  • Exponential computational complexity

A more efficient algorithm can be used to evaluate P(O|λ)
– The Forward Procedure/Algorithm

SLIDE 17

Solution to Problem 1 - The Forward Procedure

Based on the HMM assumptions, the calculation of P(qt|qt-1,λ) and P(ot|qt,λ) involves only qt-1, qt, and ot, so it is possible to compute the likelihood P(O|λ) with a recursion on t.

Forward variable:
  αt(i) = P(o1,o2,…,ot, qt=i|λ)
– The probability of the joint event that o1,o2,…,ot are observed and the state at time t is i, given the model λ

Recursion:
  αt+1(j) = P(o1,o2,…,ot+1, qt+1=j|λ) = [Σi=1..N αt(i)aij] bj(ot+1)

SLIDE 18

Solution to Problem 1 - The Forward Procedure (cont.)

  αt+1(j) = P(o1,o2,…,ot+1, qt+1=j|λ)
  = P(o1,…,ot, qt+1=j|λ) P(ot+1|o1,…,ot, qt+1=j, λ)
  = P(o1,…,ot, qt+1=j|λ) P(ot+1|qt+1=j, λ)                      (output-independent assumption)
  = [Σi=1..N P(o1,…,ot, qt=i, qt+1=j|λ)] bj(ot+1)
  = [Σi=1..N P(o1,…,ot, qt=i|λ) P(qt+1=j|qt=i, λ)] bj(ot+1)     (first-order Markov assumption)
  = [Σi=1..N αt(i)aij] bj(ot+1)

Useful identities:
  P(A,B|λ) = P(A|B,λ)P(B|λ), so P(A|B,λ) = P(A,B|λ)/P(B|λ)
  P(A|λ) = Σall B P(A,B|λ)

SLIDE 19

Solution to Problem 1 - The Forward Procedure (cont.)

  α3(3) = P(o1,o2,o3, q3=3|λ) = [α2(1)a13 + α2(2)a23 + α2(3)a33] b3(o3)

(Trellis diagram as on Slide 15.)

SLIDE 20

Solution to Problem 1 - The Forward Procedure (cont.)

Algorithm
1. Initialization: α1(i) = πi bi(o1), 1≤i≤N
2. Induction: αt+1(j) = [Σi=1..N αt(i)aij] bj(ot+1), 1≤t≤T-1, 1≤j≤N
3. Termination: P(O|λ) = Σi=1..N αT(i)

– Complexity: O(N²T); MUL: N(N+1)(T-1)+N ≈ N²T, ADD: (N-1)N(T-1)+(N-1) ≈ N²T

Based on the lattice (trellis) structure
– Computed in a time-synchronous fashion from left to right, where each cell for time t is completely computed before proceeding to time t+1
– All state sequences, regardless of how long previously, merge to N nodes (states) at each time instance t

SLIDE 21

Solution to Problem 1 - The Forward Procedure (cont.)

A three-state Hidden Markov Model for the Dow Jones Industrial average (Huang et al., 2001)

  π1=0.5, π2=0.2, π3=0.3
  b1(up)=0.7, b2(up)=0.1, b3(up)=0.3
  a11=0.6, a21=0.5, a31=0.4

Initialization: α1(1)=0.5×0.7=0.35, α1(2)=0.2×0.1=0.02, α1(3)=0.3×0.3=0.09
Induction: α2(1) = (0.35×0.6 + 0.02×0.5 + 0.09×0.4)×0.7 = 0.1792
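A sketch of the forward procedure that reproduces these numbers. Only π, b(up), and the first column of A appear on the slide; the remaining entries of A and B are filled in from Huang et al. (2001) and should be treated as assumptions:

```python
# Forward procedure for the Dow Jones HMM; symbols: 0=up, 1=down, 2=unchanged.
pi = [0.5, 0.2, 0.3]
A = [[0.6, 0.3, 0.1],          # only column 1 (0.6, 0.5, 0.4) is on the slide
     [0.5, 0.2, 0.3],
     [0.4, 0.1, 0.5]]
B = [[0.7, 0.1, 0.2],          # only b(up) = (0.7, 0.1, 0.3) is on the slide
     [0.1, 0.6, 0.3],
     [0.3, 0.3, 0.4]]
UP = 0

def forward(obs, A, B, pi):
    """Return the alpha trellis; P(O|lambda) = sum(alpha[-1])."""
    N = len(pi)
    alpha = [[pi[i] * B[i][obs[0]] for i in range(N)]]            # initialization
    for o in obs[1:]:                                             # induction
        alpha.append([sum(alpha[-1][i] * A[i][j] for i in range(N)) * B[j][o]
                      for j in range(N)])
    return alpha

alpha = forward([UP, UP], A, B, pi)
# alpha[0] matches the slide's (0.35, 0.02, 0.09); alpha[1][0] matches 0.1792.
```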

SLIDE 22

Solution to Problem 2 - The Viterbi Algorithm

The Viterbi algorithm can be regarded as the dynamic programming algorithm applied to the HMM or as a modified forward algorithm

– Instead of summing up probabilities from different paths coming to the same destination state, the Viterbi algorithm picks and remembers the best path

  • Find a single optimal state sequence Q=(q1,q2,……, qT)

– The Viterbi algorithm also can be illustrated in a trellis framework similar to the one for the forward algorithm

SLIDE 23

Solution to Problem 2 - The Viterbi Algorithm (cont.)

(Trellis diagram: at each state and time, only the best incoming path is kept.)

SLIDE 24

Solution to Problem 2 - The Viterbi Algorithm (cont.)

1. Initialization: δ1(i) = πi bi(o1), Ψ1(i) = 0, 1≤i≤N
2. Induction:
   δt+1(j) = max1≤i≤N [δt(i)aij] bj(ot+1), 1≤t≤T-1, 1≤j≤N
   Ψt+1(j) = argmax1≤i≤N [δt(i)aij], 1≤t≤T-1, 1≤j≤N
3. Termination:
   P* = max1≤i≤N δT(i), qT* = argmax1≤i≤N δT(i)
4. Backtracking: qt* = Ψt+1(qt+1*), t = T-1, T-2, …, 1
   Q* = (q1*, q2*, …, qT*) is the best state sequence

Complexity: O(N²T)

SLIDE 25

Solution to Problem 2 - The Viterbi Algorithm (cont.)

A three-state Hidden Markov Model for the Dow Jones Industrial average (Huang et al., 2001)

  π1=0.5, π2=0.2, π3=0.3
  b1(up)=0.7, b2(up)=0.1, b3(up)=0.3
  a11=0.6, a21=0.5, a31=0.4

  δ1(1)=0.5×0.7=0.35, δ1(2)=0.2×0.1=0.02, δ1(3)=0.3×0.3=0.09
  δ2(1) = max(0.35×0.6, 0.02×0.5, 0.09×0.4)×0.7 = 0.21×0.7 = 0.147
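A sketch of Viterbi decoding for the same model. As before, only part of the model appears on the slide; the missing entries of A and B are assumed from Huang et al. (2001):

```python
# Viterbi decoding for the Dow Jones HMM; symbols: 0=up, 1=down, 2=unchanged.
pi = [0.5, 0.2, 0.3]
A = [[0.6, 0.3, 0.1],
     [0.5, 0.2, 0.3],
     [0.4, 0.1, 0.5]]
B = [[0.7, 0.1, 0.2],
     [0.1, 0.6, 0.3],
     [0.3, 0.3, 0.4]]
UP = 0

def viterbi(obs, A, B, pi):
    """Return (best state path, its probability) via the max/argmax recursion."""
    N = len(pi)
    delta = [[pi[i] * B[i][obs[0]] for i in range(N)]]    # initialization
    psi = [[0] * N]
    for o in obs[1:]:                                     # induction
        prev = delta[-1]
        row, back = [], []
        for j in range(N):
            i_best = max(range(N), key=lambda i: prev[i] * A[i][j])
            back.append(i_best)
            row.append(prev[i_best] * A[i_best][j] * B[j][o])
        delta.append(row)
        psi.append(back)
    q = [max(range(N), key=lambda i: delta[-1][i])]       # termination
    for back in reversed(psi[1:]):                        # backtracking
        q.append(back[q[-1]])
    return list(reversed(q)), max(delta[-1])

path, p = viterbi([UP, UP, UP, UP, UP], A, B, pi)
# For two up days, delta_2(1) = max(0.35*0.6, 0.02*0.5, 0.09*0.4)*0.7 = 0.147.
```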

SLIDE 26

Homework #1 (due March 29)

Given a three-state Hidden Markov Model for the Dow Jones Industrial average as follows:
P1. Please find P(up, up, unchanged, down, unchanged, down, up|λ) using the forward and backward algorithms, respectively.
P2. Please find the optimal state sequence of the model which generates the observation sequence "up, up, unchanged, down, unchanged, down, up" using the Viterbi algorithm.

SLIDE 27

Solution to Problem 3 – The Baum-Welch Algorithm

How to adjust (re-estimate) the model parameters λ=(A,B,π) to maximize P(O|λ)?
– The most difficult of the three problems, because there is no known analytical method that maximizes the joint probability of the training data in closed form
  • The data is incomplete because of the hidden state sequence
– The problem can be solved by the iterative Baum-Welch algorithm, also known as the forward-backward algorithm
  • The EM (Expectation Maximization) algorithm is perfectly suitable for this problem

SLIDE 28

Solution to Problem 3 – The Backward Procedure

Backward variable:
  βt(i) = P(ot+1,ot+2,…,oT|qt=i,λ)
– The probability of the partial observation sequence ot+1,ot+2,…,oT, given state i at time t and the model λ
– e.g. β2(3) = P(o3,o4,…,oT|q2=3,λ)
  = a31 b1(o3) β3(1) + a32 b2(o3) β3(2) + a33 b3(o3) β3(3)

(Trellis diagram as on Slide 15.)

SLIDE 29

Solution to Problem 3 – The Backward Procedure (cont.)

  βt(i) = P(ot+1,ot+2,…,oT|qt=i,λ)

Algorithm
1. Initialization: βT(i) = 1, 1≤i≤N
2. Induction: βt(i) = Σj=1..N aij bj(ot+1) βt+1(j), t = T-1, …, 1, 1≤i≤N

– Complexity: MUL ≈ 2N²T; ADD ≈ N²T

The likelihood can then be computed as
  P(O|λ) = Σi=1..N P(o1,o2,…,oT, q1=i|λ)
  = Σi=1..N P(o1,o2,…,oT|q1=i,λ)P(q1=i|λ)
  = Σi=1..N πi P(o1|q1=i,λ) P(o2,o3,…,oT|q1=i,λ)
  = Σi=1..N πi bi(o1) β1(i)
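A sketch of the backward procedure; terminating it as above gives the same P(O|λ) as the forward pass, which makes a convenient sanity check. Model values are the ones used earlier (partly from the deck, the rest assumed from Huang et al., 2001):

```python
# Backward procedure for the Dow Jones HMM; symbols: 0=up, 1=down, 2=unchanged.
pi = [0.5, 0.2, 0.3]
A = [[0.6, 0.3, 0.1],
     [0.5, 0.2, 0.3],
     [0.4, 0.1, 0.5]]
B = [[0.7, 0.1, 0.2],
     [0.1, 0.6, 0.3],
     [0.3, 0.3, 0.4]]
UP = 0

def backward(obs, A, B, pi):
    """Return the beta trellis, built from t=T back to t=1."""
    N = len(pi)
    beta = [[1.0] * N]                       # initialization: beta_T(i) = 1
    for o in reversed(obs[1:]):              # induction, t = T-1, ..., 1
        beta.insert(0, [sum(A[i][j] * B[j][o] * beta[0][j] for j in range(N))
                        for i in range(N)])
    return beta

obs = [UP, UP, UP]
beta = backward(obs, A, B, pi)
# Termination: P(O|lambda) = sum_i pi_i * b_i(o1) * beta_1(i)
p_backward = sum(pi[i] * B[i][obs[0]] * beta[0][i] for i in range(len(pi)))
```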

SLIDE 30

Solution to Problem 3 – The Forward-Backward Algorithm

Relation between the forward and backward variables:

  αt(i) = P(o1,…,ot, qt=i|λ) = [Σj=1..N αt-1(j)aji] bi(ot)
  βt(i) = P(ot+1,…,oT|qt=i,λ) = Σj=1..N aij bj(ot+1) βt+1(j)

  αt(i)βt(i) = P(O, qt=i|λ)
  P(O|λ) = Σi=1..N αt(i)βt(i), for any t

(Huang et al., 2001)

SLIDE 31

Solution to Problem 3 – The Forward-Backward Algorithm (cont.)

  αt(i)βt(i) = P(o1,…,ot, qt=i|λ) × P(ot+1,…,oT|qt=i,λ)
  = P(o1,…,ot|qt=i,λ)P(qt=i|λ) × P(ot+1,…,oT|qt=i,λ)
  = P(o1,…,ot, ot+1,…,oT|qt=i,λ)P(qt=i|λ)      (output-independent assumption)
  = P(O|qt=i,λ)P(qt=i|λ)
  = P(O, qt=i|λ)

  P(O|λ) = Σi=1..N P(O, qt=i|λ) = Σi=1..N αt(i)βt(i)

SLIDE 32

Solution to Problem 3 – The Intuitive View

Define two new variables:

γt(i) = P(qt=i|O,λ)
– Probability of being in state i at time t, given O and λ
  γt(i) = P(qt=i, O|λ)/P(O|λ) = αt(i)βt(i)/P(O|λ) = αt(i)βt(i) / Σi=1..N αt(i)βt(i)

ξt(i,j) = P(qt=i, qt+1=j|O,λ)
– Probability of being in state i at time t and state j at time t+1, given O and λ
  ξt(i,j) = P(qt=i, qt+1=j, O|λ)/P(O|λ)
  = αt(i) aij bj(ot+1) βt+1(j) / Σm=1..N Σn=1..N αt(m) amn bn(ot+1) βt+1(n)

  γt(i) = Σj=1..N ξt(i,j)

where αt(i)βt(i) = P(qt=i, O|λ) and P(O|λ) = Σi=1..N αt(i)βt(i)
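Computing γ and ξ from the forward and backward trellises can be sketched as below, again with the partly assumed Dow Jones model (missing entries taken from Huang et al., 2001) and O = (up, up, up):

```python
# gamma_t(i) and xi_t(i,j) from alpha and beta, as defined on Slide 32.
pi = [0.5, 0.2, 0.3]
A = [[0.6, 0.3, 0.1], [0.5, 0.2, 0.3], [0.4, 0.1, 0.5]]
B = [[0.7, 0.1, 0.2], [0.1, 0.6, 0.3], [0.3, 0.3, 0.4]]
obs = [0, 0, 0]                  # up, up, up
N, T = 3, len(obs)

# Forward pass.
alpha = [[pi[i] * B[i][obs[0]] for i in range(N)]]
for o in obs[1:]:
    alpha.append([sum(alpha[-1][i] * A[i][j] for i in range(N)) * B[j][o]
                  for j in range(N)])
# Backward pass.
beta = [[1.0] * N]
for o in reversed(obs[1:]):
    beta.insert(0, [sum(A[i][j] * B[j][o] * beta[0][j] for j in range(N))
                    for i in range(N)])

pO = sum(alpha[-1])              # P(O|lambda)
gamma = [[alpha[t][i] * beta[t][i] / pO for i in range(N)] for t in range(T)]
xi = [[[alpha[t][i] * A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] / pO
        for j in range(N)] for i in range(N)] for t in range(T - 1)]
```

By construction, each γt sums to 1 over the states, and γt(i) = Σj ξt(i,j) for t < T.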

SLIDE 33

Solution to Problem 3 – The Intuitive View (cont.)

  P(q3=3, O|λ) = α3(3)β3(3)

(Trellis diagram.)

SLIDE 34

Solution to Problem 3 – The Intuitive View (cont.)

  P(q3=3, q4=1, O|λ) = α3(3) a31 b1(o4) β4(1)

(Trellis diagram.)

SLIDE 35

Solution to Problem 3 – The Intuitive View (cont.)

  γt(i) = P(qt=i|O,λ), ξt(i,j) = P(qt=i, qt+1=j|O,λ)

  Σt=1..T-1 ξt(i,j) = expected number of transitions from state i to state j in O
  Σt=1..T-1 γt(i) = expected number of transitions from state i in O

SLIDE 36

Solution to Problem 3 – The Intuitive View (cont.)

Reestimation formulas for π, A, and B (′ denotes the re-estimated parameters):

  π′i = expected frequency (number of times) in state i at time t=1 = γ1(i)

  a′ij = (expected number of transitions from state i to state j) / (expected number of transitions from state i)
       = Σt=1..T-1 ξt(i,j) / Σt=1..T-1 γt(i)

  b′j(vk) = (expected number of times in state j and observing symbol vk) / (expected number of times in state j)
          = Σt=1..T, s.t. ot=vk γt(j) / Σt=1..T γt(j)

SLIDE 37

Homework #2 (due April 5)

Given an initial model λ=(A,B,π):

  π = [0.34 0.33 0.33]
      | 0.34 0.33 0.33 |
  A = | 0.33 0.34 0.33 |
      | 0.33 0.33 0.34 |
  b1: {A:.34, B:.33, C:.33}, b2: {A:.33, B:.34, C:.33}, b3: {A:.33, B:.33, C:.34}

train HMMs for the following two classes using their training data respectively.

Training set for class 1:
  • 1. ABBCABCAABC
  • 2. ABCABC
  • 3. ABCAABC
  • 4. BBABCAB
  • 5. BCAABCCAB
  • 6. CACCABCA
  • 7. CABCABCA
  • 8. CABCA
  • 9. CABCA

Training set for class 2:
  • 1. BBBCCBC
  • 2. CCBABB
  • 3. AACCBBB
  • 4. BBABBAC
  • 5. CCAABBAB
  • 6. BBBCCBAA
  • 7. ABBBBABA
  • 8. CCCCC
  • 9. BBAAA
SLIDE 38

Homework #2 (cont.)

P1. Please specify the model parameters after the first and 50th iterations of Baum-Welch training.
P2. Please show the recognition results by using the above training sequences as the testing data (the so-called inside testing). Note that you have to perform the recognition task with the HMMs trained from the first and 50th iterations of Baum-Welch training, respectively.
P3. Which class do the following testing sequences belong to?
  ABCABCCAB
  AABABCCCCBBB
Note that you have to perform the recognition task with the HMMs trained from the first and 50th iterations of Baum-Welch training, respectively.

SLIDE 39

The EM Algorithm - A Simple Example

(Figure: two bottles A and B of colored balls; the output is the observed ball sequence.)

Observed data: O → "ball sequence"
Latent data: Q → "bottle sequence"
Parameters to be estimated to maximize logP(O|λ) (ML criterion):
  λ = {P(A), P(B), P(A|A), P(B|A), P(A|B), P(B|B), P(R|A), P(G|A), P(R|B), P(G|B)}

SLIDE 40

The EM Algorithm

EM: Expectation Maximization
– Why EM?
  • Simple optimization algorithms for likelihood functions rely on intermediate variables, called latent (hidden) data; for HMM, the state sequence is the latent data
  • Direct access to the data necessary to estimate the parameters is impossible or difficult; for HMM, it is almost impossible to estimate {A, B, π} without consideration of the state sequence
– Two major steps:
  • E step: calculate the expectation with respect to the latent data, given the current estimate of the parameters and the observations
  • M step: estimate a new set of parameters according to the Maximum Likelihood (ML) or Maximum A Posteriori (MAP) criterion

SLIDE 41

The EM Algorithm (cont.)

The EM algorithm is important to HMMs and many other model learning techniques.

Basic idea
– Assume we have λ and the probability that each Q=q occurred in the generation of O=o, i.e., we have in fact observed a complete data pair (o,q) with frequency proportional to the probability P(O=o, Q=q|λ)
– We then find a new λ′ that maximizes the expectation
  Σq P(Q=q|O=o,λ) logP(O=o, Q=q|λ′)
– It can be guaranteed that P(O=o|λ′) ≥ P(O=o|λ)

EM can discover parameters of the model λ to maximize the log-likelihood of the incomplete data, logP(O=o|λ), by iteratively maximizing the expectation of the log-likelihood of the complete data, logP(O=o, Q=q|λ)

SLIDE 42

The EM Algorithm (cont.)

Our goal is to maximize the log-likelihood of the observable data o generated by model λ′, i.e., logP(O=o|λ′). Since

  P(Q=q|O=o,λ′) = P(O=o, Q=q|λ′)/P(O=o|λ′)

we have

  logP(O=o|λ′) = logP(O=o, Q=q|λ′) − logP(Q=q|O=o,λ′)

Taking the expectation over q with respect to P(Q=q|O=o,λ):

  logP(O=o|λ′) = Σq P(Q=q|O=o,λ) logP(O=o, Q=q|λ′) − Σq P(Q=q|O=o,λ) logP(Q=q|O=o,λ′)
               = Q(λ,λ′) − H(λ,λ′)

We want logP(O=o|λ′) ≥ logP(O=o|λ), i.e.

  logP(O=o|λ′) − logP(O=o|λ) = [Q(λ,λ′) − Q(λ,λ)] − [H(λ,λ′) − H(λ,λ)] ≥ 0 ?

SLIDE 43

The EM Algorithm (cont.)

  H(λ,λ′) − H(λ,λ) = Σq P(Q=q|O=o,λ) log[P(Q=q|O=o,λ′)/P(Q=q|O=o,λ)]
  ≤ log Σq P(Q=q|O=o,λ) [P(Q=q|O=o,λ′)/P(Q=q|O=o,λ)]      (Jensen's inequality)
  = log Σq P(Q=q|O=o,λ′) = log 1 = 0

  ∴ −[H(λ,λ′) − H(λ,λ)] ≥ 0

1. Jensen's inequality: if f is a concave function and X is a r.v., then E[f(X)] ≤ f(E[X])
2. log x ≤ x − 1

The Q-function (auxiliary function):
  Q(λ,λ′) = Σq P(Q=q|O=o,λ) logP(O=o, Q=q|λ′)

If we choose λ′ so that Q(λ,λ′) ≥ Q(λ,λ), then

  logP(O=o|λ′) ≥ logP(O=o|λ)

SLIDE 44

Solution to Problem 3 - The EM Algorithm

The auxiliary function

  Q(λ,λ′) = ΣQ [P(O,Q|λ)/P(O|λ)] logP(O,Q|λ′)

where P(O,Q|λ) and logP(O,Q|λ′) can be expressed as

  P(O,Q|λ) = πq1 bq1(o1) Πt=2..T aqt-1qt bqt(ot)
  logP(O,Q|λ′) = logπ′q1 + Σt=2..T loga′qt-1qt + Σt=1..T logb′qt(ot)

SLIDE 45

Solution to Problem 3 - The EM Algorithm (cont.)

Rewrite the auxiliary function as

  Q(λ,λ′) = ΣQ [P(O,Q|λ)/P(O|λ)] [logπ′q1 + Σt=2..T loga′qt-1qt + Σt=1..T logb′qt(ot)]
          = Qπ(λ,π′) + Qa(λ,a′) + Qb(λ,b′)

where

  Qπ(λ,π′) = Σi=1..N [P(O, q1=i|λ)/P(O|λ)] logπ′i          (weights wi → γ1(i), variables yi → π′i)
  Qa(λ,a′) = Σi=1..N Σj=1..N Σt=2..T [P(O, qt-1=i, qt=j|λ)/P(O|λ)] loga′ij   (weights wj → ξt-1(i,j))
  Qb(λ,b′) = Σj=1..N Σk=1..M Σt: ot=vk [P(O, qt=j|λ)/P(O|λ)] logb′j(vk)      (weights wk → γt(j))

SLIDE 46

A Simple Example

(Trellis: two states s1, s2 over times 1, 2, 3 with observations x1, x2, x3; each node carries the product αt(i)βt(i).)

The Forward/Backward Procedure gives

  γt(i) = P(qt=i, O|λ)/P(O|λ) = αt(i)βt(i) / Σj=1..N αt(j)βt(j)
  ξt(i,j) = P(qt=i, qt+1=j, O|λ)/P(O|λ)
          = αt(i) aij bj(ot+1) βt+1(j) / Σi=1..N Σj=1..N αt(i) aij bj(ot+1) βt+1(j)

SLIDE 47

A Simple Example (cont.)

A two-state model (initial probabilities π1, π2, transitions a11, a12, a21, a22) observing the sequence (v4, v7, v4). There are 8 paths in total (bi,k denotes bi(vk); paths are numbered 1 to 8):

  q        P(O,q|λ)                      logP(O,q|λ)
  1 1 1    π1·b1,4·a11·b1,7·a11·b1,4     logπ1 + logb1,4 + loga11 + logb1,7 + loga11 + logb1,4
  1 1 2    π1·b1,4·a11·b1,7·a12·b2,4     logπ1 + logb1,4 + loga11 + logb1,7 + loga12 + logb2,4
  1 2 1    π1·b1,4·a12·b2,7·a21·b1,4     logπ1 + logb1,4 + loga12 + logb2,7 + loga21 + logb1,4
  1 2 2    π1·b1,4·a12·b2,7·a22·b2,4     logπ1 + logb1,4 + loga12 + logb2,7 + loga22 + logb2,4
  2 1 1    π2·b2,4·a21·b1,7·a11·b1,4     logπ2 + logb2,4 + loga21 + logb1,7 + loga11 + logb1,4
  2 1 2    π2·b2,4·a21·b1,7·a12·b2,4     logπ2 + logb2,4 + loga21 + logb1,7 + loga12 + logb2,4
  2 2 1    π2·b2,4·a22·b2,7·a21·b1,4     logπ2 + logb2,4 + loga22 + logb2,7 + loga21 + logb1,4
  2 2 2    π2·b2,4·a22·b2,7·a22·b2,4     logπ2 + logb2,4 + loga22 + logb2,7 + loga22 + logb2,4

SLIDE 48

A Simple Example (cont.)

Writing pi = P(O, q(i)|λ) for the 8 paths and all = p1+p2+…+p8 = P(O|λ):

  Qπ = [(p1+p2+p3+p4)/all] logπ′1 + [(p5+p6+p7+p8)/all] logπ′2
     = γ1(1) logπ′1 + γ1(2) logπ′2

  Qa = [(p1+p2+p1+p5)/all] loga′11 + [(p3+p4+p2+p6)/all] loga′12
     + [(p5+p6+p3+p7)/all] loga′21 + [(p7+p8+p4+p8)/all] loga′22

(in each coefficient, the first pair of paths contributes the transition at t=1 and the second pair the transition at t=2; i is the origin state, j the destination state)

SLIDE 49

Solution to Problem 3 - The EM Algorithm (cont.)

The auxiliary function is separated into three independent terms, corresponding respectively to πi, aij, and bj(k)
– The maximization of Q(λ,λ′) can be done by maximizing the individual terms separately, subject to the probability constraints
  Σi=1..N πi = 1, Σj=1..N aij = 1 ∀i, Σk=1..M bj(k) = 1 ∀j
– All these terms have the following form:
  F(y1,y2,…,yN) = Σj=1..N wj log yj, where Σj=1..N yj = 1
  and F has its maximum value when yj = wj / Σn=1..N wn

SLIDE 50

Solution to Problem 3 - The EM Algorithm (cont.)

Proof: apply a Lagrange multiplier ℓ.

  Suppose F = Σj=1..N wj log yj + ℓ(Σj=1..N yj − 1)
  ∂F/∂yj = wj/yj + ℓ = 0 ⇒ yj = −wj/ℓ, ∀j
  Constraint Σj yj = 1 ⇒ Σj (−wj/ℓ) = 1 ⇒ ℓ = −Σj=1..N wj
  ∴ yj = wj / Σn=1..N wn

(Side note: e ≡ limh→∞ (1+1/h)^h = 2.71828…, and d(ln x)/dx = limh→0 [ln(x+h) − ln x]/h = limh→0 ln[(1+h/x)^(1/h)] = ln e^(1/x) = 1/x.)

SLIDE 51

Solution to Problem 3 - The EM Algorithm (cont.)

  Qπ(λ,π′) = Σi=1..N [P(O, q1=i|λ)/P(O|λ)] logπ′i       (wi, yi = π′i)

Applying yi = wi / Σn=1..N wn, with

  Σn=1..N wn = Σn=1..N P(O, q1=n|λ)/P(O|λ) = 1

gives

  π′i = P(O, q1=i|λ)/P(O|λ) = P(q1=i|O,λ) = γ1(i)

where γt(i) = P(qt=i|O,λ)

SLIDE 52

Solution to Problem 3 - The EM Algorithm (cont.)

  Qa(λ,a′) = Σi=1..N Σj=1..N Σt=2..T [P(O, qt-1=i, qt=j|λ)/P(O|λ)] loga′ij       (wj, yj = a′ij)

Applying yj = wj / Σn=1..N wn gives

  a′ij = Σt=2..T P(qt-1=i, qt=j|O,λ) / Σt=2..T P(qt-1=i|O,λ)
       = Σt=1..T-1 ξt(i,j) / Σt=1..T-1 γt(i)

SLIDE 53

Solution to Problem 3 - The EM Algorithm (cont.)

  Qb(λ,b′) = Σj=1..N Σk=1..M Σt: ot=vk [P(O, qt=j|λ)/P(O|λ)] logb′j(vk)       (wk, yk = b′j(vk))

Applying yk = wk / Σn wn gives

  b′j(k) = Σt=1..T, s.t. ot=vk P(qt=j|O,λ) / Σt=1..T P(qt=j|O,λ)
         = Σt=1..T, s.t. ot=vk γt(j) / Σt=1..T γt(j)

SLIDE 54

Solution to Problem 3 - The EM Algorithm (cont.)

The new model parameter set λ′ = (A′, B′, π′) can be expressed as:

  π′i = P(q1=i|O,λ) = γ1(i)
  a′ij = Σt=1..T-1 ξt(i,j) / Σt=1..T-1 γt(i)
  b′j(k) = Σt=1..T, s.t. ot=vk γt(j) / Σt=1..T γt(j)

where γt(i) = P(qt=i|O,λ) and ξt(i,j) = P(qt=i, qt+1=j|O,λ)
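One full Baum-Welch reestimation step implementing these formulas can be sketched as follows. The model values are the partly assumed Dow Jones parameters used earlier (Huang et al., 2001), and the observation sequence is illustrative; by the EM guarantee, the likelihood does not decrease after the update:

```python
# One Baum-Welch reestimation iteration for a discrete HMM (single sequence).
def forward(obs, A, B, pi):
    N = len(pi)
    alpha = [[pi[i] * B[i][obs[0]] for i in range(N)]]
    for o in obs[1:]:
        alpha.append([sum(alpha[-1][i] * A[i][j] for i in range(N)) * B[j][o]
                      for j in range(N)])
    return alpha

def backward(obs, A, B, pi):
    N = len(pi)
    beta = [[1.0] * N]
    for o in reversed(obs[1:]):
        beta.insert(0, [sum(A[i][j] * B[j][o] * beta[0][j] for j in range(N))
                        for i in range(N)])
    return beta

def reestimate(obs, A, B, pi):
    """Return (A', B', pi') from the gamma/xi statistics of one sequence."""
    N, T = len(pi), len(obs)
    alpha, beta = forward(obs, A, B, pi), backward(obs, A, B, pi)
    pO = sum(alpha[-1])
    gamma = [[alpha[t][i] * beta[t][i] / pO for i in range(N)] for t in range(T)]
    xi = [[[alpha[t][i] * A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] / pO
            for j in range(N)] for i in range(N)] for t in range(T - 1)]
    new_pi = gamma[0][:]
    new_A = [[sum(xi[t][i][j] for t in range(T - 1)) /
              sum(gamma[t][i] for t in range(T - 1)) for j in range(N)]
             for i in range(N)]
    M = len(B[0])
    new_B = [[sum(gamma[t][j] for t in range(T) if obs[t] == k) /
              sum(gamma[t][j] for t in range(T)) for k in range(M)]
             for j in range(N)]
    return new_A, new_B, new_pi

A = [[0.6, 0.3, 0.1], [0.5, 0.2, 0.3], [0.4, 0.1, 0.5]]
B = [[0.7, 0.1, 0.2], [0.1, 0.6, 0.3], [0.3, 0.3, 0.4]]
pi = [0.5, 0.2, 0.3]
obs = [0, 0, 2, 1, 2, 1, 0]     # up, up, unchanged, down, unchanged, down, up
A2, B2, pi2 = reestimate(obs, A, B, pi)
```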

SLIDE 55

Discrete vs. Continuous Density HMMs

Two major types of HMMs according to the observations

– Discrete and finite observations:
  • The observations that all distinct states generate are finite in number, i.e., V={v1, v2, v3, …, vM}, vk∈R^L
  • In this case, the observation symbol probability distribution in state j, B={bj(k)}, is defined as bj(k)=P(ot=vk|qt=j), 1≤k≤M, 1≤j≤N (ot: observation at time t, qt: state at time t), and bj(k) consists of only M probability values

– Continuous and infinite observations:
  • The observations that all distinct states generate are infinite and continuous, i.e., V={v | v∈R^L}
  • In this case, the observation probability distribution in state j, B={bj(v)}, is defined as bj(v)=f(ot=v|qt=j), 1≤j≤N (ot: observation at time t, qt: state at time t)
  • bj(v) is a continuous probability density function (pdf) and is often a mixture of multivariate Gaussian (normal) distributions
SLIDE 56

Gaussian Distribution

A continuous random variable X is said to have a Gaussian distribution with mean μ and variance σ² (σ>0) if X has a continuous pdf of the following form:

$$f_X(x \mid \mu, \sigma^2) = \frac{1}{(2\pi\sigma^2)^{1/2}} \exp\left( -\frac{(x-\mu)^2}{2\sigma^2} \right)$$
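A one-line transcription of this density (the function name is illustrative):

```python
import math

def gaussian_pdf(x, mu, sigma2):
    """Univariate Gaussian density f(x | mu, sigma^2)."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)
```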

SLIDE 57

Multivariate Gaussian Distribution

If X=(X1,X2,X3,…,XL) is an L-dimensional random vector with a multivariate Gaussian distribution with mean vector μ and covariance matrix Σ, then the pdf can be expressed as

$$f_{\mathbf{X}}(\mathbf{x}) = N(\mathbf{x}; \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \frac{1}{(2\pi)^{L/2} |\boldsymbol{\Sigma}|^{1/2}} \exp\left( -\frac{1}{2} (\mathbf{x}-\boldsymbol{\mu})^{\mathrm{T}} \boldsymbol{\Sigma}^{-1} (\mathbf{x}-\boldsymbol{\mu}) \right)$$

where $|\boldsymbol{\Sigma}|$ is the determinant of $\boldsymbol{\Sigma}$, and

$$\boldsymbol{\mu} = E[\mathbf{x}], \qquad \boldsymbol{\Sigma} = E[(\mathbf{x}-\boldsymbol{\mu})(\mathbf{x}-\boldsymbol{\mu})^{\mathrm{T}}] = E[\mathbf{x}\mathbf{x}^{\mathrm{T}}] - \boldsymbol{\mu}\boldsymbol{\mu}^{\mathrm{T}}, \qquad \sigma_{ij} = E[(x_i-\mu_i)(x_j-\mu_j)]$$

If X1,X2,X3,…,XL are independent random variables, the covariance matrix is reduced to a diagonal one, i.e., $\sigma_{ij} = 0,\ \forall i \neq j$, and the pdf factorizes:

$$f_{\mathbf{X}}(\mathbf{x} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \prod_{i=1}^{L} \frac{1}{(2\pi\sigma_{ii}^2)^{1/2}} \exp\left( -\frac{(x_i-\mu_i)^2}{2\sigma_{ii}^2} \right)$$

SLIDE 58

Multivariate Mixture Gaussian Distribution

An L-dimensional random vector X=(X1,X2,X3,…,XL) has a multivariate mixture Gaussian distribution if

$$f_{\mathbf{X}}(\mathbf{x}) = \sum_{k=1}^{M} w_k\, N(\mathbf{x}; \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k), \qquad \sum_{k=1}^{M} w_k = 1$$

In a CDHMM, bj(v) is a continuous probability density function (pdf) and is often a mixture of multivariate Gaussian distributions:

$$b_j(\mathbf{v}) = \sum_{k=1}^{M} c_{jk}\, \frac{1}{(2\pi)^{L/2} |\boldsymbol{\Sigma}_{jk}|^{1/2}} \exp\left( -\frac{1}{2} (\mathbf{v}-\boldsymbol{\mu}_{jk})^{\mathrm{T}} \boldsymbol{\Sigma}_{jk}^{-1} (\mathbf{v}-\boldsymbol{\mu}_{jk}) \right), \qquad c_{jk} \geq 0,\ \sum_{k=1}^{M} c_{jk} = 1$$

where $\mathbf{v}$ is the observation vector, and $\boldsymbol{\mu}_{jk}$ and $\boldsymbol{\Sigma}_{jk}$ are the mean vector and covariance matrix of the k-th mixture of the j-th state.
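These two densities transcribe directly into NumPy (function names are my own). For diagonal Σ the multivariate density factors into a product of univariate Gaussians, which makes a convenient sanity check:

```python
import numpy as np

def mvn_pdf(x, mu, Sigma):
    """Density of an L-dimensional Gaussian N(x; mu, Sigma)."""
    L = len(mu)
    diff = x - mu
    quad = diff @ np.linalg.solve(Sigma, diff)          # (x-mu)^T Sigma^-1 (x-mu)
    norm = (2 * np.pi) ** (L / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * quad) / norm

def mixture_pdf(x, weights, means, covs):
    """f(x) = sum_k w_k N(x; mu_k, Sigma_k)."""
    return sum(w * mvn_pdf(x, m, S) for w, m, S in zip(weights, means, covs))
```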

SLIDE 59

Solution to Problem 3 – The Intuitive View (CDHMM)

Define a new variable γt(j,k)

– probability of being in state j at time t with the k-th mixture component accounting for ot:

$$\gamma_t(j,k) = P(q_t = j, m_t = k \mid \mathbf{O}, \lambda) = P(q_t = j \mid \mathbf{O}, \lambda)\, P(m_t = k \mid q_t = j, \mathbf{O}, \lambda)$$

By the observation-independence assumption, $P(m_t = k \mid q_t = j, \mathbf{O}, \lambda) = P(m_t = k \mid q_t = j, o_t, \lambda)$, so

$$\gamma_t(j,k) = \gamma_t(j) \cdot \frac{c_{jk}\, N(o_t; \boldsymbol{\mu}_{jk}, \boldsymbol{\Sigma}_{jk})}{\sum_{m=1}^{M} c_{jm}\, N(o_t; \boldsymbol{\mu}_{jm}, \boldsymbol{\Sigma}_{jm})}, \qquad \gamma_t(j) = \frac{\alpha_t(j)\beta_t(j)}{\sum_{s=1}^{N} \alpha_t(s)\beta_t(s)}$$

SLIDE 60

Solution to Problem 3 – The Intuitive View (CDHMM) (cont.)

Reestimation formulas for $c_{jk}$, $\boldsymbol{\mu}_{jk}$, $\boldsymbol{\Sigma}_{jk}$ are:

$$\bar{c}_{jk} = \frac{\text{expected number of times in state } j \text{ and mixture } k}{\text{expected number of times in state } j} = \frac{\sum_{t=1}^{T} \gamma_t(j,k)}{\sum_{t=1}^{T} \sum_{m=1}^{M} \gamma_t(j,m)}$$

$$\bar{\boldsymbol{\mu}}_{jk} = \text{weighted average (mean) of observations at state } j \text{ and mixture } k = \frac{\sum_{t=1}^{T} \gamma_t(j,k)\, o_t}{\sum_{t=1}^{T} \gamma_t(j,k)}$$

$$\bar{\boldsymbol{\Sigma}}_{jk} = \text{weighted covariance of observations at state } j \text{ and mixture } k = \frac{\sum_{t=1}^{T} \gamma_t(j,k)\, (o_t - \bar{\boldsymbol{\mu}}_{jk})(o_t - \bar{\boldsymbol{\mu}}_{jk})^{\mathrm{T}}}{\sum_{t=1}^{T} \gamma_t(j,k)}$$
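For a single state j, the split of γt(j) across mixtures and the resulting weight and mean updates can be sketched as follows (NumPy; names are my own, and the covariance update is omitted for brevity):

```python
import numpy as np

def mixture_posteriors(gamma_j, c_j, obs, means, covs):
    """gamma_t(j,k) = gamma_t(j) * c_jk N(o_t; mu_jk, Sigma_jk) / sum_m c_jm N(o_t; mu_jm, Sigma_jm).
    gamma_j: (T,) state posteriors for one state j; c_j: (M,) mixture weights;
    obs: (T, L) observations; means/covs: per-mixture Gaussian parameters."""
    T, M = len(obs), len(c_j)
    L = obs.shape[1]
    dens = np.zeros((T, M))
    for k in range(M):
        diff = obs - means[k]
        quad = np.einsum('ti,ij,tj->t', diff, np.linalg.inv(covs[k]), diff)
        dens[:, k] = c_j[k] * np.exp(-0.5 * quad) / (
            (2 * np.pi) ** (L / 2) * np.sqrt(np.linalg.det(covs[k])))
    gamma_jk = gamma_j[:, None] * dens / dens.sum(1, keepdims=True)
    # Reestimates, following the weighted-average formulas above
    c_new = gamma_jk.sum(0) / gamma_jk.sum()
    mu_new = (gamma_jk.T @ obs) / gamma_jk.sum(0)[:, None]
    return gamma_jk, c_new, mu_new
```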
SLIDE 61

Solution to Problem 3 - The EM Algorithm (CDHMM)

Express $b_j(o_t)$ with respect to each single mixture component $b_{jk}(o_t)$:

$$P(\mathbf{O}, \mathbf{Q} \mid \lambda) = \left[ \pi_{q_1} \prod_{t=2}^{T} a_{q_{t-1} q_t} \right] \prod_{t=1}^{T} b_{q_t}(o_t) = \left[ \pi_{q_1} \prod_{t=2}^{T} a_{q_{t-1} q_t} \right] \prod_{t=1}^{T} \sum_{k=1}^{M} c_{q_t k}\, b_{q_t k}(o_t)$$

$$P(\mathbf{O}, \mathbf{Q}, \mathbf{K} \mid \lambda) = \left[ \pi_{q_1} \prod_{t=2}^{T} a_{q_{t-1} q_t} \right] \prod_{t=1}^{T} c_{q_t k_t}\, b_{q_t k_t}(o_t)$$

$$P(\mathbf{O} \mid \lambda) = \sum_{\mathbf{Q}} \sum_{\mathbf{K}} P(\mathbf{O}, \mathbf{Q}, \mathbf{K} \mid \lambda)$$

K: one of the possible mixture component sequences along with the state sequence Q

SLIDE 62

Solution to Problem 3 - The EM Algorithm (CDHMM) (cont.)

The auxiliary function can be written as:

$$Q(\lambda, \bar{\lambda}) = \sum_{\mathbf{Q}} \sum_{\mathbf{K}} P(\mathbf{O}, \mathbf{Q}, \mathbf{K} \mid \lambda) \log P(\mathbf{O}, \mathbf{Q}, \mathbf{K} \mid \bar{\lambda})$$

$$\log P(\mathbf{O}, \mathbf{Q}, \mathbf{K} \mid \bar{\lambda}) = \log \bar{\pi}_{q_1} + \sum_{t=2}^{T} \log \bar{a}_{q_{t-1} q_t} + \sum_{t=1}^{T} \log \bar{b}_{q_t k_t}(o_t) + \sum_{t=1}^{T} \log \bar{c}_{q_t k_t}$$

$$Q(\lambda, \bar{\lambda}) = Q_\pi(\lambda, \bar{\pi}) + Q_a(\lambda, \bar{a}) + Q_b(\lambda, \bar{b}) + Q_c(\lambda, \bar{c})$$

Compared to the DHMM case, we need to further solve

$$Q_b(\lambda, \bar{b}) = \sum_{t=1}^{T} \sum_{j=1}^{N} \sum_{k=1}^{M} P(q_t = j, k_t = k, \mathbf{O} \mid \lambda) \log \bar{b}_{jk}(o_t)$$

$$Q_c(\lambda, \bar{c}) = \sum_{t=1}^{T} \sum_{j=1}^{N} \sum_{k=1}^{M} P(q_t = j, k_t = k, \mathbf{O} \mid \lambda) \log \bar{c}_{jk}$$

SLIDE 63

Solution to Problem 3 - The EM Algorithm (CDHMM) (cont.)

The new mixture weights $\bar{c}_{jk}$ can be derived from

$$Q_c(\lambda, \bar{c}) = \sum_{t=1}^{T} \sum_{j=1}^{N} \sum_{k=1}^{M} P(q_t = j, k_t = k, \mathbf{O} \mid \lambda) \log \bar{c}_{jk} = \sum_{j=1}^{N} \sum_{k=1}^{M} \left[ \sum_{t=1}^{T} P(q_t = j, k_t = k, \mathbf{O} \mid \lambda) \right] \log \bar{c}_{jk}$$

which again has the $\sum_k w_k \log \bar{y}_k$ form, so

$$\bar{c}_{jk} = \frac{\sum_{t=1}^{T} \gamma_t(j,k)}{\sum_{m=1}^{M} \sum_{t=1}^{T} \gamma_t(j,m)}, \qquad \gamma_t(j,k) = P(q_t = j, k_t = k \mid \mathbf{O}, \lambda)$$

SLIDE 64

Solution to Problem 3 - The EM Algorithm (CDHMM) (cont.)

The new mean vectors and covariance matrices $\bar{\boldsymbol{\mu}}_{jk}, \bar{\boldsymbol{\Sigma}}_{jk}$ can be derived as

$$\bar{\boldsymbol{\mu}}_{jk} = \frac{\sum_{t=1}^{T} P(q_t = j, k_t = k \mid \mathbf{O}, \lambda)\, o_t}{\sum_{t=1}^{T} P(q_t = j, k_t = k \mid \mathbf{O}, \lambda)} = \frac{\sum_{t=1}^{T} \gamma_t(j,k)\, o_t}{\sum_{t=1}^{T} \gamma_t(j,k)}$$

$$\bar{\boldsymbol{\Sigma}}_{jk} = \frac{\sum_{t=1}^{T} P(q_t = j, k_t = k \mid \mathbf{O}, \lambda)\, (o_t - \bar{\boldsymbol{\mu}}_{jk})(o_t - \bar{\boldsymbol{\mu}}_{jk})^{\mathrm{T}}}{\sum_{t=1}^{T} P(q_t = j, k_t = k \mid \mathbf{O}, \lambda)} = \frac{\sum_{t=1}^{T} \gamma_t(j,k)\, (o_t - \bar{\boldsymbol{\mu}}_{jk})(o_t - \bar{\boldsymbol{\mu}}_{jk})^{\mathrm{T}}}{\sum_{t=1}^{T} \gamma_t(j,k)}$$

SLIDE 65

Solution to Problem 3 - The EM Algorithm (CDHMM) (cont.)

We want to find $\bar{\boldsymbol{\mu}}_{jk}, \bar{\boldsymbol{\Sigma}}_{jk}$ to maximize

$$Q_b(\lambda, \bar{b}) = \sum_{t=1}^{T} \sum_{j=1}^{N} \sum_{k=1}^{M} P(q_t = j, k_t = k, \mathbf{O} \mid \lambda) \log \bar{b}_{jk}(o_t)$$

Since

$$\log \bar{b}_{jk}(o_t) = -\frac{L}{2}\log(2\pi) - \frac{1}{2}\log|\bar{\boldsymbol{\Sigma}}_{jk}| - \frac{1}{2}(o_t - \bar{\boldsymbol{\mu}}_{jk})^{\mathrm{T}} \bar{\boldsymbol{\Sigma}}_{jk}^{-1} (o_t - \bar{\boldsymbol{\mu}}_{jk})$$

we thus solve

$$\nabla_{\bar{\boldsymbol{\mu}}_{jk}, \bar{\boldsymbol{\Sigma}}_{jk}} \sum_{t=1}^{T} P(q_t = j, k_t = k, \mathbf{O} \mid \lambda) \left[ -\frac{1}{2}\log|\bar{\boldsymbol{\Sigma}}_{jk}| - \frac{1}{2}(o_t - \bar{\boldsymbol{\mu}}_{jk})^{\mathrm{T}} \bar{\boldsymbol{\Sigma}}_{jk}^{-1} (o_t - \bar{\boldsymbol{\mu}}_{jk}) \right] = 0$$

SLIDE 66

Solution to Problem 3 - The EM Algorithm (CDHMM) (cont.)

Using $\frac{d(\mathbf{x}^{\mathrm{T}} C \mathbf{x})}{d\mathbf{x}} = (C + C^{\mathrm{T}})\mathbf{x}$ and the symmetry of $\bar{\boldsymbol{\Sigma}}_{jk}$:

$$\nabla_{\bar{\boldsymbol{\mu}}_{jk}} \sum_{t=1}^{T} P(q_t = j, k_t = k, \mathbf{O} \mid \lambda) \left[ -\frac{1}{2}(o_t - \bar{\boldsymbol{\mu}}_{jk})^{\mathrm{T}} \bar{\boldsymbol{\Sigma}}_{jk}^{-1} (o_t - \bar{\boldsymbol{\mu}}_{jk}) \right] = \sum_{t=1}^{T} P(q_t = j, k_t = k, \mathbf{O} \mid \lambda)\, \bar{\boldsymbol{\Sigma}}_{jk}^{-1} (o_t - \bar{\boldsymbol{\mu}}_{jk}) = 0$$

$$\Rightarrow \sum_{t=1}^{T} P(q_t = j, k_t = k, \mathbf{O} \mid \lambda)(o_t - \bar{\boldsymbol{\mu}}_{jk}) = 0 \;\Rightarrow\; \bar{\boldsymbol{\mu}}_{jk} = \frac{\sum_{t=1}^{T} P(q_t = j, k_t = k, \mathbf{O} \mid \lambda)\, o_t}{\sum_{t=1}^{T} P(q_t = j, k_t = k, \mathbf{O} \mid \lambda)}$$

SLIDE 67

Solution to Problem 3 - The EM Algorithm (CDHMM) (cont.)

Using $\frac{d \log|X|}{dX} = (X^{-1})^{\mathrm{T}}$, $\frac{d(\mathbf{a}^{\mathrm{T}} X^{-1} \mathbf{b})}{dX} = -X^{-\mathrm{T}} \mathbf{a} \mathbf{b}^{\mathrm{T}} X^{-\mathrm{T}}$, and $\bar{\boldsymbol{\Sigma}}_{jk} = \bar{\boldsymbol{\Sigma}}_{jk}^{\mathrm{T}}$:

$$\nabla_{\bar{\boldsymbol{\Sigma}}_{jk}} \sum_{t=1}^{T} P(q_t = j, k_t = k, \mathbf{O} \mid \lambda) \left[ -\frac{1}{2}\log|\bar{\boldsymbol{\Sigma}}_{jk}| - \frac{1}{2}(o_t - \bar{\boldsymbol{\mu}}_{jk})^{\mathrm{T}} \bar{\boldsymbol{\Sigma}}_{jk}^{-1} (o_t - \bar{\boldsymbol{\mu}}_{jk}) \right]$$
$$= \sum_{t=1}^{T} P(q_t = j, k_t = k, \mathbf{O} \mid \lambda) \left[ -\frac{1}{2}\bar{\boldsymbol{\Sigma}}_{jk}^{-1} + \frac{1}{2}\bar{\boldsymbol{\Sigma}}_{jk}^{-1} (o_t - \bar{\boldsymbol{\mu}}_{jk})(o_t - \bar{\boldsymbol{\mu}}_{jk})^{\mathrm{T}} \bar{\boldsymbol{\Sigma}}_{jk}^{-1} \right] = 0$$

Multiplying by $\bar{\boldsymbol{\Sigma}}_{jk}$ on both sides gives

$$\bar{\boldsymbol{\Sigma}}_{jk} = \frac{\sum_{t=1}^{T} P(q_t = j, k_t = k, \mathbf{O} \mid \lambda)\, (o_t - \bar{\boldsymbol{\mu}}_{jk})(o_t - \bar{\boldsymbol{\mu}}_{jk})^{\mathrm{T}}}{\sum_{t=1}^{T} P(q_t = j, k_t = k, \mathbf{O} \mid \lambda)}$$

SLIDE 68

Semicontinuous HMMs

The HMM state mixture density functions are tied together across all the models to form a set of shared kernels

– The semicontinuous or tied-mixture HMM
– A combination of the discrete HMM and the continuous HMM
  • A combination of discrete model-dependent weights with the continuous codebook probability density functions

$$b_j(\mathbf{v}) = \sum_{k=1}^{M} b_j(k)\, f_k(\mathbf{v}) = \sum_{k=1}^{M} b_j(k)\, N(\mathbf{v}; \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$$

where $b_j(k)$ is the output probability in the discrete HMM, playing the role of the mixture weight in the continuous HMM, and $f_k(\mathbf{v})$ is the k-th mixture density function (the k-th codeword), shared among all HMM states; M is very large. Compare the continuous case:

$$b_j(\mathbf{v}) = \sum_{k=1}^{M} c_{jk}\, b_{jk}(\mathbf{v}) = \sum_{k=1}^{M} c_{jk}\, N(\mathbf{v}; \boldsymbol{\mu}_{jk}, \boldsymbol{\Sigma}_{jk})$$

– Because M is large, we can simply use the L most significant values of $f_k(\mathbf{v})$
  • Experience showed that an L of 1~3% of M is adequate
– Partial tying of $f_k(\mathbf{v})$ for different phonetic classes
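The tied-mixture output probability, including the "top-L codewords" truncation mentioned above, can be sketched as follows (the names and the masking strategy are illustrative assumptions):

```python
import numpy as np

def semicontinuous_output_prob(weights_j, codebook_densities, top_l=None):
    """b_j(v) = sum_k b_j(k) f_k(v) over a codebook of M shared Gaussians.
    weights_j: (M,) discrete weights for state j;
    codebook_densities: (M,) precomputed f_k(v) for the current frame.
    If top_l is given, keep only the L most significant codebook densities."""
    if top_l is not None:
        keep = np.argsort(codebook_densities)[-top_l:]
        masked = np.zeros_like(codebook_densities)
        masked[keep] = codebook_densities[keep]
        codebook_densities = masked
    return float(weights_j @ codebook_densities)
```

The truncated value can only underestimate the full sum, and with L large enough it converges to it.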

SLIDE 69

Initialization of HMMs

A good initialization of HMM training: Segmental K-Means Segmentation into States

– Assume that we have a training set of observations and an initial estimate of all model parameters
– Step 1: The set of training observation sequences is segmented into states, based on the current model (finding the optimal state sequence by the Viterbi algorithm)
– Step 2: Reestimate the output distributions from the segmentation.
  • For a discrete density HMM (using an M-codeword codebook):

$$\bar{b}_j(k) = \frac{\text{number of vectors with codebook index } k \text{ in state } j}{\text{number of vectors in state } j}$$

  • For a continuous density HMM (M Gaussian mixtures per state): cluster the observation vectors within each state j into a set of M clusters; then
    – $c_{jm}$ = number of vectors classified in cluster m of state j divided by the number of vectors in state j
    – $\boldsymbol{\mu}_{jm}$ = sample mean of the vectors classified in cluster m of state j
    – $\boldsymbol{\Sigma}_{jm}$ = sample covariance matrix of the vectors classified in cluster m of state j
– Step 3: Evaluate the model. If the difference between the previous and current model scores exceeds a threshold, go back to Step 1; otherwise stop, and the current model is saved
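Step 2 for the discrete case is just relative-frequency counting over the Viterbi alignment; a minimal sketch (helper name is my own, and it assumes every state receives at least one frame):

```python
import numpy as np

def init_discrete_b(state_seq, codeword_seq, N, M):
    """b_j(k) = count of frames assigned to state j with codeword k,
    divided by the count of frames in state j."""
    counts = np.zeros((N, M))
    for s, k in zip(state_seq, codeword_seq):
        counts[s, k] += 1
    return counts / counts.sum(axis=1, keepdims=True)
```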

SLIDE 70

Initialization of HMMs (cont.)

[Flow chart: Training Data + Initial Model → State Sequence Segmentation → Estimate Parameters of Observation via Segmental K-means → Model Reestimation → Model Convergence? If NO, loop back to segmentation; if YES, output Model Parameters]

SLIDE 71

Initialization of HMMs (cont.)

An example for a discrete HMM

– 3 states and 2 codewords (v1, v2); 10 training frames O1–O10 aligned to states

  • b1(v1)=3/4, b1(v2)=1/4
  • b2(v1)=1/3, b2(v2)=2/3
  • b3(v1)=2/3, b3(v2)=1/3
  • a11=3/4, a12=1/4
  • a22=2/3, a23=1/3
  • a33=1
  • π1=1, π2=π3=0

[Figure: a 3-state trellis over frames O1–O10 showing the state segmentation from which the counts above are derived]

SLIDE 72

Initialization of HMMs (cont.)

An example for a continuous HMM

– 3 states and 4 Gaussian mixtures per state

[Figure: observations O1–ON segmented into states s1–s3; within a state, K-means splits the vectors around the global mean into cluster means, giving per-cluster parameters {µ11,Σ11,c11}, {µ12,Σ12,c12}, {µ13,Σ13,c13}, {µ14,Σ14,c14} for state s1]

SLIDE 73

HMM Topology

Speech is a time-evolving non-stationary signal

– Each HMM state has the ability to capture some quasi-stationary segment in the non-stationary speech signal
– A left-to-right topology is a natural candidate to model the speech signal
– Each state has a state-dependent output probability distribution that can be used to interpret the observable speech signal
– It is common to represent a phone using 3~5 states (English) and a syllable using 6~8 states (Mandarin Chinese)

SLIDE 74

HMM Limitations

HMMs have proved themselves to be a good model of speech variability in time and feature space simultaneously, but there are a number of limitations in the conventional HMMs:

  • The state duration follows an exponential (geometric) distribution

$$d_i(t) = a_{ii}^{\,t-1}(1 - a_{ii})$$

    – This does not provide an adequate representation of the temporal structure of speech
  • First-order (Markov) assumption: the state transition depends only on the previous state
  • Output-independence assumption: each observation frame depends only on the state that generated it, not on neighboring observation frames

HMMs are well defined only for processes that are a function of a single independent variable, such as time or one-dimensional position. Although speech recognition remains the dominant field in which HMMs are applied, their use has been spreading steadily to other fields.
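The implied geometric duration model is easy to check numerically: d_i(t) sums to 1 over t ≥ 1 and has mean 1/(1 − a_ii). A sketch (function name is my own):

```python
def duration_pmf(a_ii, t):
    """d_i(t) = a_ii**(t-1) * (1 - a_ii): probability of staying
    exactly t frames in state i under a conventional HMM."""
    return a_ii ** (t - 1) * (1 - a_ii)
```

For example, a self-loop probability a_ii = 0.8 gives an expected stay of 5 frames, with the most likely duration still being a single frame, which is one sense in which the model fits speech durations poorly.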

SLIDE 75

ML vs. MAP

Estimation principle based on observations O=[o1, o2, ……, oN]

– The Maximum Likelihood (ML) principle: find the model parameter λ so that the likelihood P(O|λ) is maximum

$$\lambda^{*} = \arg\max_{\lambda} P(\mathbf{O} \mid \lambda)$$

  • For example, if λ={µ,Σ} parameterizes a multivariate normal distribution and O is i.i.d. (independent, identically distributed), then the ML estimate of λ={µ,Σ} is

$$\boldsymbol{\mu}_{ML} = \frac{1}{N}\sum_{i=1}^{N} o_i, \qquad \boldsymbol{\Sigma}_{ML} = \frac{1}{N}\sum_{i=1}^{N} (o_i - \boldsymbol{\mu}_{ML})(o_i - \boldsymbol{\mu}_{ML})^{\mathrm{T}}$$

– The Maximum a Posteriori (MAP) principle: find the model parameter λ so that the posterior P(λ|O) is maximum

$$\lambda^{*} = \arg\max_{\lambda} P(\lambda \mid \mathbf{O}) = \arg\max_{\lambda} \frac{P(\mathbf{O} \mid \lambda) P(\lambda)}{P(\mathbf{O})} = \arg\max_{\lambda} P(\mathbf{O} \mid \lambda) P(\lambda)$$
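The ML estimates above (note the 1/N normalization rather than 1/(N−1)) can be written directly (NumPy; function name is my own):

```python
import numpy as np

def ml_gaussian(O):
    """ML estimates for an i.i.d. sample O of shape (N, L):
    mu = sample mean, Sigma = sample covariance with 1/N normalization."""
    mu = O.mean(axis=0)
    diff = O - mu
    Sigma = diff.T @ diff / len(O)
    return mu, Sigma
```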

SLIDE 76

Appendix - Common Mathematical Symbols

SLIDE 77

Appendix - Matrix Calculus

Notation: a, b, x are vectors of dimension n ($x_i$ is the i-th element of x); B, X are n×n matrices ($B_{ij}$ is the element in the i-th row and j-th column):

$$\mathbf{x} = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}, \qquad X = \begin{bmatrix} X_{11} & X_{12} & \cdots & X_{1n} \\ X_{21} & X_{22} & \cdots & X_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ X_{n1} & X_{n2} & \cdots & X_{nn} \end{bmatrix}$$

– For scalar y and vector x, $dy/d\mathbf{x}$ is a vector whose i-th element is $dy/dx_i$:

$$\frac{dy}{d\mathbf{x}} = \begin{bmatrix} dy/dx_1 \\ dy/dx_2 \\ \vdots \\ dy/dx_n \end{bmatrix}$$

– For vector y and scalar x, $d\mathbf{y}/dx$ is a vector whose i-th element is $dy_i/dx$
– For vectors y and x, $d\mathbf{y}/d\mathbf{x}$ is a matrix whose (i,j) element is $dy_i/dx_j$
– For matrix X and scalar x, $dX/dx$ is a matrix whose (i,j) element is $dX_{ij}/dx$
– For scalar y and matrix X, $dy/dX$ is a matrix whose (i,j) element is $dy/dX_{ij}$:

$$\frac{dy}{dX} = \begin{bmatrix} dy/dX_{11} & dy/dX_{12} & \cdots & dy/dX_{1n} \\ dy/dX_{21} & dy/dX_{22} & \cdots & dy/dX_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ dy/dX_{n1} & dy/dX_{n2} & \cdots & dy/dX_{nn} \end{bmatrix}$$

SLIDE 78

Appendix - Matrix Calculus (cont.)

Property 1: $\dfrac{d(\mathbf{a}^{\mathrm{T}} X \mathbf{b})}{dX} = \mathbf{a}\mathbf{b}^{\mathrm{T}}$

– proof: $\mathbf{a}^{\mathrm{T}} X \mathbf{b} = \sum_{i=1}^{n} \sum_{j=1}^{n} a_i X_{ij} b_j$, so $\dfrac{d(\mathbf{a}^{\mathrm{T}} X \mathbf{b})}{dX_{kt}} = a_k b_t$, i.e., $\dfrac{d(\mathbf{a}^{\mathrm{T}} X \mathbf{b})}{dX} = \mathbf{a}\mathbf{b}^{\mathrm{T}}$.

SLIDE 79

Appendix - Matrix Calculus (cont.)

Property 1 - Extension: $\dfrac{d(\mathbf{a}^{\mathrm{T}} X^{\mathrm{T}} \mathbf{b})}{dX} = \mathbf{b}\mathbf{a}^{\mathrm{T}}$

– proof: $\mathbf{a}^{\mathrm{T}} X^{\mathrm{T}} \mathbf{b} = \sum_{i=1}^{n} \sum_{j=1}^{n} a_i X_{ji} b_j$, so $\dfrac{d(\mathbf{a}^{\mathrm{T}} X^{\mathrm{T}} \mathbf{b})}{dX_{kt}} = b_k a_t$, i.e., $\dfrac{d(\mathbf{a}^{\mathrm{T}} X^{\mathrm{T}} \mathbf{b})}{dX} = \mathbf{b}\mathbf{a}^{\mathrm{T}}$.

SLIDE 80

Appendix - Matrix Calculus (cont.)

Property 2: $\dfrac{d(\mathbf{x}^{\mathrm{T}} C \mathbf{x})}{d\mathbf{x}} = (C + C^{\mathrm{T}})\mathbf{x}$

– proof: $\mathbf{x}^{\mathrm{T}} C \mathbf{x} = \sum_{i=1}^{n} \sum_{j=1}^{n} x_i C_{ij} x_j$, so

$$\frac{d(\mathbf{x}^{\mathrm{T}} C \mathbf{x})}{dx_k} = \sum_{t=1}^{n} C_{kt} x_t + \sum_{t=1}^{n} x_t C_{tk} = (C\mathbf{x})_k + (C^{\mathrm{T}}\mathbf{x})_k$$

(the diagonal term $C_{kk} x_k^2$ contributes $2C_{kk}x_k$, split across the two sums). If C is symmetric, the derivative reduces to $2C\mathbf{x}$.
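Identities like Property 2 are easy to verify numerically with central differences (a sketch; `numeric_grad` is a hypothetical helper):

```python
import numpy as np

def numeric_grad(f, x, eps=1e-6):
    """Central-difference gradient of a scalar function f at the vector x."""
    g = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = eps
        g[i] = (f(x + e) - f(x - e)) / (2 * eps)
    return g

rng = np.random.default_rng(0)
C = rng.normal(size=(4, 4))
x = rng.normal(size=4)
analytic = (C + C.T) @ x                        # Property 2
numeric = numeric_grad(lambda v: v @ C @ v, x)  # finite differences
print(np.max(np.abs(analytic - numeric)))       # should be tiny
```

The same finite-difference check applies to the matrix-valued identities by flattening X and perturbing one entry at a time.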

slide-81
SLIDE 81

81

Appendix - Matrix Calculus (cont.)

Property 3:

– proof

( )

[ ]

B X BX = d tr d

T

( ) ( ) ( ) ( )

[ ]

( )

[ ]

B X BX BX BX BX X X B B X B BX = ∴ = ⇒ = = ∴ = =

∑∑ ∑ ∑

= = = =

d tr d B dX tr d X B tr X B

ij ij n k n t kt kt n k kk j i n t jt it j i ij T T 1 1 1 T T 1 T T

)

  • f

row th

  • j

the : ,

  • f

row th

  • i

the : ( Q

                       

nn n n n n nn n n n n

X X X X X X X X X B B B B B B B B B K M O M M K K K M O M M K K

2 1 2 22 12 1 21 11 2 1 2 22 21 1 12 11

SLIDE 82

Appendix - Matrix Calculus (cont.)

Property 4: $\dfrac{d \det(X)}{dX} = \det(X)\,(X^{-1})^{\mathrm{T}}$

– proof: Expanding the determinant along row u, $\det(X) = \sum_{k=1}^{n} X_{uk} W_{uk}$, where the cofactor $W_{uk}$ does not depend on any entry of row u; hence $\dfrac{d \det(X)}{dX_{uv}} = W_{uv}$. Since $X^{-1} = W^{\mathrm{T}} / \det(X)$ (W the cofactor matrix), it follows that $\dfrac{d \det(X)}{dX} = W = \det(X)\,(X^{-1})^{\mathrm{T}}$.

SLIDE 83

Appendix - Matrix Calculus (cont.)

Property 5: $\dfrac{d(\mathbf{a}^{\mathrm{T}} X^{\mathrm{T}} C X \mathbf{b})}{dX} = C^{\mathrm{T}} X \mathbf{a} \mathbf{b}^{\mathrm{T}} + C X \mathbf{b} \mathbf{a}^{\mathrm{T}}$

– proof: Let $\mathbf{u} = X\mathbf{a}$ and $\mathbf{v} = X\mathbf{b}$, so $\mathbf{a}^{\mathrm{T}} X^{\mathrm{T}} C X \mathbf{b} = \mathbf{u}^{\mathrm{T}} C \mathbf{v} = \sum_i \sum_j u_i C_{ij} v_j$. Since $\dfrac{\partial u_i}{\partial X_{pq}} = \delta_{ip} a_q$ and $\dfrac{\partial v_j}{\partial X_{pq}} = \delta_{jp} b_q$,

$$\frac{\partial(\mathbf{u}^{\mathrm{T}} C \mathbf{v})}{\partial X_{pq}} = a_q \sum_j C_{pj} v_j + b_q \sum_i u_i C_{ip} = (C X \mathbf{b})_p\, a_q + (C^{\mathrm{T}} X \mathbf{a})_p\, b_q$$

i.e., $\dfrac{d(\mathbf{a}^{\mathrm{T}} X^{\mathrm{T}} C X \mathbf{b})}{dX} = C X \mathbf{b} \mathbf{a}^{\mathrm{T}} + C^{\mathrm{T}} X \mathbf{a} \mathbf{b}^{\mathrm{T}}$.

SLIDE 84

Appendix - Matrix Calculus (cont.)

Property 6: $\mathbf{x}^{\mathrm{T}} A \mathbf{x} = \mathrm{tr}(A \mathbf{x} \mathbf{x}^{\mathrm{T}})$

– proof: $\mathbf{x}^{\mathrm{T}} A \mathbf{x} = \sum_{i=1}^{n} \sum_{j=1}^{n} x_i A_{ij} x_j$, and $\mathrm{tr}(A\mathbf{x}\mathbf{x}^{\mathrm{T}}) = \sum_{i=1}^{n} (A\mathbf{x}\mathbf{x}^{\mathrm{T}})_{ii} = \sum_{i=1}^{n} \sum_{j=1}^{n} A_{ij} x_j x_i$, which is the same double sum. The same argument gives $\mathbf{x}^{\mathrm{T}} A \mathbf{x} = \mathrm{tr}(\mathbf{x}\mathbf{x}^{\mathrm{T}} A)$.

SLIDE 85

Appendix - Matrix Calculus (cont.)

Property 7: $\dfrac{d(\mathbf{a}^{\mathrm{T}} \mathbf{x})}{d\mathbf{x}} = \dfrac{d(\mathbf{x}^{\mathrm{T}} \mathbf{a})}{d\mathbf{x}} = \mathbf{a}$

– proof: $\mathbf{a}^{\mathrm{T}} \mathbf{x} = \mathbf{x}^{\mathrm{T}} \mathbf{a} = \sum_{k=1}^{n} a_k x_k$, so the i-th element of the derivative is $\dfrac{d}{dx_i} \sum_{k=1}^{n} a_k x_k = a_i$, i.e., the derivative is the vector $\mathbf{a}$.

SLIDE 86

Appendix - Matrix Calculus (cont.)

Property 8: $\dfrac{d\,\mathrm{tr}(X^{\mathrm{T}} A X B)}{dX} = A X B + A^{\mathrm{T}} X B^{\mathrm{T}}$

– proof: $\mathrm{tr}(X^{\mathrm{T}} A X B) = \sum_{i,j,k,l} X_{ji} A_{jk} X_{kl} B_{li}$. Differentiating the two occurrences of X separately,

$$\frac{\partial\,\mathrm{tr}(X^{\mathrm{T}} A X B)}{\partial X_{uv}} = \sum_{k,l} A_{uk} X_{kl} B_{lv} + \sum_{i,j} X_{ji} A_{ju} B_{vi} = (A X B)_{uv} + (A^{\mathrm{T}} X B^{\mathrm{T}})_{uv}$$

i.e., $\dfrac{d\,\mathrm{tr}(X^{\mathrm{T}} A X B)}{dX} = A X B + A^{\mathrm{T}} X B^{\mathrm{T}}$.

SLIDE 87

Appendix - Matrix Calculus (cont.)

Property 9: $\dfrac{d(\mathbf{a}^{\mathrm{T}} X^{-1} \mathbf{b})}{dX} = -X^{-\mathrm{T}} \mathbf{a} \mathbf{b}^{\mathrm{T}} X^{-\mathrm{T}}$

– proof: From $X X^{-1} = I$, $\dfrac{dX}{dX_{uv}} X^{-1} + X \dfrac{dX^{-1}}{dX_{uv}} = 0$. Note that $\dfrac{dX}{dX_{uv}} = \mathbf{e}_u \mathbf{e}_v^{\mathrm{T}}$ is a matrix containing a 1 in position (u,v) and zeros elsewhere, so $\dfrac{dX^{-1}}{dX_{uv}} = -X^{-1} \mathbf{e}_u \mathbf{e}_v^{\mathrm{T}} X^{-1}$. Hence

$$\frac{d(\mathbf{a}^{\mathrm{T}} X^{-1} \mathbf{b})}{dX_{uv}} = -\mathbf{a}^{\mathrm{T}} X^{-1} \mathbf{e}_u \mathbf{e}_v^{\mathrm{T}} X^{-1} \mathbf{b} = -(X^{-\mathrm{T}} \mathbf{a})_u (X^{-1} \mathbf{b})_v$$

i.e., $\dfrac{d(\mathbf{a}^{\mathrm{T}} X^{-1} \mathbf{b})}{dX} = -X^{-\mathrm{T}} \mathbf{a} \mathbf{b}^{\mathrm{T}} X^{-\mathrm{T}}$.