

SLIDE 1

HMMs for Acoustic Modeling (Part II)
Lecture 3, CS 753
Instructor: Preethi Jyothi

SLIDE 2

Recap: HMMs for Acoustic Modeling

What are (first-order) HMMs? What are the simplifying assumptions governing HMMs? What are the three fundamental problems related to HMMs?

  • 1. What is the forward algorithm? What is it used to compute?

Computing Likelihood: Given an HMM λ = (A, B) and an observation sequence O, determine the likelihood P(O|λ).

  • 2. What is the Viterbi algorithm? What is it used to compute?

Decoding: Given as input an HMM λ = (A, B) and a sequence of observations O = o1, o2, ..., oT, find the most probable sequence of states Q = q1 q2 q3 ... qT.
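To make the recapped decoding problem concrete, here is a minimal NumPy sketch of a Viterbi decoder (an illustration, not taken from the slides): pi is an assumed initial state distribution, A the N x N transition matrix, B the N x |V| observation probability matrix, and obs an integer-coded observation sequence.

```python
import numpy as np

def viterbi(pi, A, B, obs):
    T, N = len(obs), A.shape[0]
    delta = np.zeros((T, N))            # best path score ending in state j at time t
    psi = np.zeros((T, N), dtype=int)   # backpointers
    delta[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] * A        # scores[i, j] = delta_{t-1}(i) * a_ij
        psi[t] = scores.argmax(axis=0)            # best predecessor for each state j
        delta[t] = scores.max(axis=0) * B[:, obs[t]]
    path = [int(delta[T - 1].argmax())]           # best final state
    for t in range(T - 1, 0, -1):                 # backtrace q_T, ..., q_1
        path.append(int(psi[t, path[-1]]))
    return list(reversed(path)), float(delta[T - 1].max())
```

In practice log probabilities are used instead of raw products to avoid numerical underflow on long sequences.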

SLIDE 3

Problem 3: Learning in HMMs

Problem 1 (Likelihood): Given an HMM λ = (A, B) and an observation sequence O, determine the likelihood P(O|λ).

Problem 2 (Decoding): Given an observation sequence O and an HMM λ = (A, B), discover the best hidden state sequence Q.

Problem 3 (Learning): Given an observation sequence O and the set of possible states in the HMM, learn the HMM parameters A and B.

Standard algorithm for HMM training: Forward-backward or Baum-Welch algorithm

SLIDE 4

Forward and Backward Probabilities

The Baum-Welch algorithm iteratively estimates the transition & observation probabilities and uses these values to derive even better estimates. Two probabilities are required to compute estimates for the transition and observation probabilities:

  • 1. Forward probability (recall): αt(j) = P(o1, o2, ..., ot, qt = j | λ)
  • 2. Backward probability: βt(i) = P(ot+1, ot+2, ..., oT | qt = i, λ)

SLIDE 5

Backward probability

  • 1. Initialization:

βT(i) = 1, 1 ≤ i ≤ N

  • 2. Recursion

βt(i) = Σj=1..N aij bj(ot+1) βt+1(j),   1 ≤ i ≤ N, 1 ≤ t < T

  • 3. Termination:

P(O|λ) = Σj=1..N πj bj(o1) β1(j)
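A short NumPy sketch of this backward computation (illustrative, with the same assumed array layout as in the Viterbi sketch above: A is N x N, B is N x |V|, obs is integer-coded, and pi holds the initial probabilities πj):

```python
import numpy as np

def backward(A, B, obs, pi):
    T, N = len(obs), A.shape[0]
    beta = np.zeros((T, N))
    beta[T - 1] = 1.0                                   # 1. Initialization: beta_T(i) = 1
    for t in range(T - 2, -1, -1):                      # 2. Recursion for t = T-1, ..., 1
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])  # sum_j a_ij b_j(o_{t+1}) beta_{t+1}(j)
    likelihood = float((pi * B[:, obs[0]] * beta[0]).sum())  # 3. Termination: P(O | lambda)
    return beta, likelihood
```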

SLIDE 6

Visualising backward probability computation

[Figure A.11: the computation of βt(i) by summing all the successive values βt+1(j), weighted by aij and bj(ot+1); i.e., βt(i) = Σj βt+1(j) aij bj(ot+1).]

SLIDE 7
  • 1. Baum-Welch: Estimating âij

We first need to define ξt(i, j) to estimate aij, where

ξt(i, j) = P(qt = i, qt+1 = j | O, λ)

which works out to be

ξt(i, j) = αt(i) aij bj(ot+1) βt+1(j) / (Σj=1..N αt(j) βt(j))

[Figure: trellis fragment showing the terms of ξt(i, j): the forward probability αt(i) into state si at time t, the arc aij bj(ot+1) from si to sj, and the backward probability βt+1(j) out of state sj at time t+1.]

Then,

âij = Σt=1..T−1 ξt(i, j) / (Σt=1..T−1 Σk=1..N ξt(i, k))
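For concreteness, a small sketch that computes ξt(i, j) exactly as in the formula above; the array names (alpha and beta of shape (T, N), A of shape (N, N), B of shape (N, |V|), integer-coded obs) are assumptions, not from the slides.

```python
import numpy as np

def xi_at(t, alpha, beta, A, B, obs):
    N = A.shape[0]
    denom = (alpha[t] * beta[t]).sum()          # sum_j alpha_t(j) beta_t(j) = P(O | lambda)
    xi_t = np.zeros((N, N))
    for i in range(N):
        for j in range(N):
            xi_t[i, j] = alpha[t, i] * A[i, j] * B[j, obs[t + 1]] * beta[t + 1, j] / denom
    return xi_t
```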

SLIDE 8

  • 2. Baum-Welch: Estimating b̂j(vk)

We need to define γt(j), the state occupancy probability, to estimate bj(vk), where

γt(j) = P(qt = j | O, λ)

which works out to be

γt(j) = αt(j) βt(j) / P(O|λ)

[Figure: trellis fragment showing the terms of γt(j): the forward probability αt(j) into state sj at time t and the backward probability βt(j) out of it.]

Then, for discrete outputs,

b̂j(vk) = Σt=1..T s.t. ot=vk γt(j) / Σt=1..T γt(j)
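A minimal sketch of this re-estimation of b̂j(vk), assuming gamma is a (T, N) array of γt(j) values and obs is an integer-coded observation sequence over a vocabulary of size V:

```python
import numpy as np

def reestimate_B(gamma, obs, V):
    obs = np.asarray(obs)
    T, N = gamma.shape
    B_hat = np.zeros((N, V))
    for v in range(V):
        B_hat[:, v] = gamma[obs == v].sum(axis=0)   # sum over t such that o_t = v_k
    return B_hat / gamma.sum(axis=0)[:, None]       # divide by total occupancy of state j
```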

SLIDE 9

Bringing it all together: Baum-Welch

Estimate HMM parameters iteratively using the EM algorithm. For each iteration, do:

  E step: For all time-state pairs, compute the state occupation probabilities γt(j) and ξt(i, j)

  M step: Re-estimate the HMM parameters, i.e. the transition probabilities and observation probabilities, based on the estimates derived in the E step
SLIDE 10

Baum-Welch algorithm (pseudocode)

function FORWARD-BACKWARD(observations of length T, output vocabulary V, hidden state set Q) returns HMM = (A, B)

  initialize A and B
  iterate until convergence:

    E-step:
      γt(j) = αt(j) βt(j) / αT(qF)                       ∀ t and j
      ξt(i, j) = αt(i) aij bj(ot+1) βt+1(j) / αT(qF)      ∀ t, i, and j

    M-step:
      âij = Σt=1..T−1 ξt(i, j) / (Σt=1..T−1 Σk=1..N ξt(i, k))
      b̂j(vk) = Σt=1..T s.t. ot=vk γt(j) / Σt=1..T γt(j)

  return A, B
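Putting the pseudocode together, here is a compact NumPy sketch of Baum-Welch for a discrete-output HMM. It is an illustration under assumptions not stated on the slide: a single observation sequence, an explicit initial distribution pi in place of the start/end states, random initialization, and a fixed number of iterations rather than a convergence test.

```python
import numpy as np

def forward(pi, A, B, obs):
    T, N = len(obs), A.shape[0]
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    return alpha

def backward(A, B, obs):
    T, N = len(obs), A.shape[0]
    beta = np.zeros((T, N))
    beta[T - 1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    return beta

def baum_welch(obs, N, V, n_iter=20, seed=0):
    obs = np.asarray(obs)
    rng = np.random.default_rng(seed)
    pi = np.full(N, 1.0 / N)
    A = rng.random((N, N))
    A /= A.sum(axis=1, keepdims=True)
    B = rng.random((N, V))
    B /= B.sum(axis=1, keepdims=True)
    T = len(obs)
    for _ in range(n_iter):
        alpha, beta = forward(pi, A, B, obs), backward(A, B, obs)
        likelihood = alpha[T - 1].sum()                  # P(O | lambda)
        # E-step: gamma_t(j) and xi_t(i, j) for all t, i, j
        gamma = alpha * beta / likelihood                # shape (T, N)
        xi = (alpha[:-1, :, None] * A[None, :, :]
              * B[:, obs[1:]].T[:, None, :]
              * beta[1:, None, :]) / likelihood          # shape (T-1, N, N)
        # M-step: re-estimate parameters from expected counts
        pi = gamma[0]
        A = xi.sum(axis=0) / xi.sum(axis=(0, 2))[:, None]
        B = np.stack([gamma[obs == v].sum(axis=0) for v in range(V)], axis=1)
        B /= gamma.sum(axis=0)[:, None]
    return pi, A, B
```

A practical implementation would train on multiple sequences and work in log space (or with per-frame scaling of α and β) to avoid underflow.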

SLIDE 11

Discrete to continuous outputs

We derived Baum-Welch updates for discrete outputs. However, HMMs in acoustic models emit real-valued vectors as observations. Before we understand how Baum-Welch works for acoustic modelling using HMMs, let’s look at an overview of the Expectation Maximization (EM) algorithm and establish some notation.

SLIDE 12

EM Algorithm: Fitting Parameters to Data

Observed data: i.i.d. samples xi, i = 1, …, N (x is observed and z is hidden)
Goal: Find arg maxθ L(θ), where L(θ) = Σi=1..N log Pr(xi; θ)
Initial parameters: θ0

Iteratively compute θl as follows:

Q(θ, θl−1) = Σi=1..N Σz Pr(z | xi; θl−1) log Pr(xi, z; θ)

θl = arg maxθ Q(θ, θl−1)

The estimate θl cannot get worse over iterations because, for all θ:

L(θ) − L(θl−1) ≥ Q(θ, θl−1) − Q(θl−1, θl−1)

EM is guaranteed to converge to a local optimum or saddle point [Wu83].

SLIDE 13

Coin example to illustrate EM

There are three coins, with ρ1 = Pr(H) for coin 1, ρ2 = Pr(H) for coin 2, and ρ3 = Pr(H) for coin 3.

Repeat:
  Toss coin 1 privately
  if it shows H:
    Toss coin 2 twice
  else:
    Toss coin 3 twice

The following sequence of (visible) tosses is observed: "HH, TT, HH, TT, HH"
How do you estimate ρ1, ρ2 and ρ3?

SLIDE 14

Coin example to illustrate EM

Recall, for partially observed data, the log likelihood is given by:

L(θ) = Σi=1..N log Pr(xi; θ) = Σi=1..N log Σz Pr(xi, z; θ)

where, for the coin example:
  • each observation xi ∈ X = {HH, HT, TH, TT}
  • the hidden variable z ∈ Z = {H, T}

SLIDE 15

Coin example to illustrate EM

Recall, for partially observed data, the log likelihood is given by:

L(θ) = Σi=1..N log Pr(xi; θ) = Σi=1..N log Σz Pr(xi, z; θ)

The joint probability factorizes as Pr(x, z; θ) = Pr(x|z; θ) Pr(z; θ), where (ρ1 = Pr(H) for coin 1, ρ2 = Pr(H) for coin 2, ρ3 = Pr(H) for coin 3):

Pr(z; θ) = ρ1 if z = H,   1 − ρ1 if z = T

Pr(x|z; θ) = ρ2^h (1 − ρ2)^t if z = H,   ρ3^h (1 − ρ3)^t if z = T

(h: number of heads in x, t: number of tails in x)

SLIDE 16

Coin example to illustrate EM

Our observed data is: {HH, TT, HH, TT, HH}
Let's use EM to estimate θ = (ρ1, ρ2, ρ3)

[EM Iteration, E-step]
Compute quantities involved in

Q(θ, θl−1) = Σi=1..N Σz γ(z, xi) log Pr(xi, z; θ)

where γ(z, x) = Pr(z | x; θl−1), i.e., compute γ(z, xi) for all z and all i.

Suppose θl−1 is ρ1 = 0.3, ρ2 = 0.4, ρ3 = 0.6:
  What is γ(H, HH)?  = 0.16
  What is γ(H, TT)?  ≈ 0.49

SLIDE 17

Coin example to illustrate EM

Our observed data is: {HH, TT, HH, TT, HH}
Let's use EM to estimate θ = (ρ1, ρ2, ρ3)

[EM Iteration, M-step]
Find θ which maximises

Q(θ, θl−1) = Σi=1..N Σz γ(z, xi) log Pr(xi, z; θ)

which gives (hi: number of heads in xi, ti: number of tails):

ρ1 = Σi=1..N γ(H, xi) / N

ρ2 = Σi=1..N γ(H, xi) hi / Σi=1..N γ(H, xi)(hi + ti)

ρ3 = Σi=1..N γ(T, xi) hi / Σi=1..N γ(T, xi)(hi + ti)
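A self-contained sketch of this coin example (function and variable names are illustrative): the E-step computes γ(H, xi) under the current (ρ1, ρ2, ρ3) and the M-step applies the closed-form updates above. Starting from θl−1 = (0.3, 0.4, 0.6), its first E-step reproduces γ(H, HH) = 0.16 and γ(H, TT) ≈ 0.49 from the previous slide.

```python
def coin_em(observations, rho1, rho2, rho3, n_iter=10):
    for _ in range(n_iter):
        # E-step: gamma(H, x) = Pr(z = H | x; theta) for each observed pair x
        stats = []
        for x in observations:
            h, t = x.count("H"), x.count("T")
            p_h = rho1 * rho2**h * (1 - rho2)**t        # joint Pr(x, z = H)
            p_t = (1 - rho1) * rho3**h * (1 - rho3)**t  # joint Pr(x, z = T)
            stats.append((p_h / (p_h + p_t), h, t))     # (gamma(H, x), h, t)
        # M-step: closed-form updates from the expected counts
        rho1 = sum(g for g, _, _ in stats) / len(stats)
        rho2 = sum(g * h for g, h, _ in stats) / sum(g * (h + t) for g, h, t in stats)
        rho3 = sum((1 - g) * h for g, h, _ in stats) / sum((1 - g) * (h + t) for g, h, t in stats)
    return rho1, rho2, rho3

# Starting from theta^(l-1) = (0.3, 0.4, 0.6) on the observed data from the slide:
print(coin_em(["HH", "TT", "HH", "TT", "HH"], 0.3, 0.4, 0.6))
```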

SLIDE 18

Coin example to illustrate EM

This was a very simple HMM (with observations from 2 states):
  • The state remains the same after the first transition
  • γ estimated the distribution of this state
  • More generally, we will need the distribution of the state at each time step

EM for general HMMs: the Baum-Welch algorithm (1972), which predates the general formulation of EM (1977)

[Figure: two-state HMM over states H and T, with initial probabilities ρ1 and 1−ρ1, self-loop probability 1 on each state, and emission probabilities H/ρ2, T/1−ρ2 in state H and H/ρ3, T/1−ρ3 in state T.]

SLIDE 19

Baum-Welch Algorithm as EM

Observed data: N sequences xi = (xi1, …, xiTi), i = 1…N, where xit ∈ V
Parameters θ: transition matrix A, observation probabilities B

[EM Iteration, E-step]
Compute quantities involved in Q(θ, θl−1):

γi,t(j) = Pr(zt = j | xi; θl−1)
ξi,t(j, k) = Pr(zt = j, zt+1 = k | xi; θl−1)

SLIDE 20

Baum-Welch Algorithm as EM

Observed data: N sequences xi = (xi1, …, xiTi), i = 1…N, where xit ∈ V
Parameters θ: transition matrix A, observation probabilities B

[EM Iteration, M-step]
Find θ which maximises Q(θ, θl−1):

Aj,k = Σi=1..N Σt=1..Ti−1 ξi,t(j, k) / (Σi=1..N Σt=1..Ti−1 Σk′ ξi,t(j, k′))

Bj,v = Σi=1..N Σt: xit=v γi,t(j) / (Σi=1..N Σt=1..Ti γi,t(j))

SLIDE 21

Discrete to continuous outputs

We derived Baum-Welch updates for discrete outputs. However, HMMs in acoustic models emit real-valued vectors as observations. Use probability density functions to define observation probabilities

If x were 1D values, the HMM observation probabilities would be bj(x) = 𝒩(x | μj, σj²), where μj is the mean associated with state j and σj² is its variance.

If x ∈ ℝd, then we use multivariate Gaussians, bj(x) = 𝒩(x | μj, Σj), where Σj is the covariance matrix associated with state j.
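A minimal sketch of these observation models using scipy.stats; the per-state parameter arrays (mu, var, Sigma) and function names are assumptions for illustration.

```python
import numpy as np
from scipy.stats import norm, multivariate_normal

def observation_prob_1d(x, mu, var, j):
    # b_j(x) = N(x | mu_j, sigma_j^2) for scalar observations
    return norm.pdf(x, loc=mu[j], scale=np.sqrt(var[j]))

def observation_prob(x, mu, Sigma, j):
    # b_j(x) = N(x | mu_j, Sigma_j) for x in R^d
    return multivariate_normal.pdf(x, mean=mu[j], cov=Sigma[j])
```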

SLIDE 22

BW for Gaussian Observation Model

Observed data: N sequences xi = (xi1, …, xiTi), i = 1…N, where xit ∈ ℝd
Parameters θ: transition matrix A, observation prob. B = {(μj, Σj)} for all j

[EM Iteration, M-step]
Find θ which maximises Q(θ, θl−1):

μj = Σi=1..N Σt=1..Ti γi,t(j) xit / (Σi=1..N Σt=1..Ti γi,t(j))

Σj = Σi=1..N Σt=1..Ti γi,t(j) (xit − μj)(xit − μj)ᵀ / (Σi=1..N Σt=1..Ti γi,t(j))

A: same update as with discrete outputs
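A sketch of these two M-step updates, assuming gammas[i] is a (Ti, N) array holding γi,t(j) for the i-th sequence and xs[i] is the matching (Ti, d) array of observation vectors; the helper name and array layout are assumptions.

```python
import numpy as np

def gaussian_m_step(gammas, xs, N, d):
    mu = np.zeros((N, d))
    Sigma = np.zeros((N, d, d))
    occ = sum(g.sum(axis=0) for g in gammas)            # total occupancy per state j
    for g, x in zip(gammas, xs):
        mu += g.T @ x                                   # sum_t gamma_t(j) x_t
    mu /= occ[:, None]
    for g, x in zip(gammas, xs):
        for j in range(N):
            diff = x - mu[j]                            # (T_i, d)
            Sigma[j] += (g[:, j, None] * diff).T @ diff # sum_t gamma_t(j) (x_t - mu_j)(x_t - mu_j)^T
    Sigma /= occ[:, None, None]
    return mu, Sigma
```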

SLIDE 23

Gaussian Mixture Model

  • Assuming that observations associated with a state follow a single Gaussian distribution is too simplistic.
  • More generally, we use a "mixture of Gaussians" to allow for acoustic vectors associated with a state to be non-Gaussian.
  • Instead of bj(x) = 𝒩(x | μj, Σj) as in the single Gaussian case, bj(x) can be an M-component mixture model:

bj(x) = Σm=1..M cjm 𝒩(x | μjm, Σjm)

where cjm is the mixing probability for Gaussian component m of state j, with Σm=1..M cjm = 1 and cjm ≥ 0.
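A small sketch of this mixture density; the parameter layout (c of shape (N, M), mu of shape (N, M, d), Sigma of shape (N, M, d, d)) is an assumption for illustration.

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_observation_prob(x, j, c, mu, Sigma):
    # b_j(x) = sum_m c_jm N(x | mu_jm, Sigma_jm)
    M = c.shape[1]
    return sum(c[j, m] * multivariate_normal.pdf(x, mean=mu[j, m], cov=Sigma[j, m])
               for m in range(M))
```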

SLIDE 24

BW for Gaussian Mixture Model

Observed data: N sequences xi = (xi1, …, xiTi), i = 1…N, where xit ∈ ℝd
Parameters θ: transition matrix A, observation prob. B = {(μjm, Σjm, cjm)} for all j, m

[EM Iteration, M-step]
Find θ which maximises Q(θ, θl−1). With γi,t(j, m) the probability of component m of state j at time t:

μjm = Σi=1..N Σt=1..Ti γi,t(j, m) xit / (Σi=1..N Σt=1..Ti γi,t(j, m))

Σjm = Σi=1..N Σt=1..Ti γi,t(j, m) (xit − μjm)(xit − μjm)ᵀ / (Σi=1..N Σt=1..Ti γi,t(j, m))

cjm = Σi=1..N Σt=1..Ti γi,t(j, m) / (Σi=1..N Σt=1..Ti Σm′=1..M γi,t(j, m′))
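A sketch of this mixture M-step, assuming gammas[i] is a (Ti, N, M) array of the per-component occupancies γi,t(j, m) and xs[i] the matching (Ti, d) observations; names and shapes are illustrative.

```python
import numpy as np

def gmm_m_step(gammas, xs, N, M, d):
    occ = sum(g.sum(axis=0) for g in gammas)             # (N, M): sum_i sum_t gamma_{i,t}(j, m)
    mu = np.zeros((N, M, d))
    for g, x in zip(gammas, xs):
        mu += np.einsum("tjm,td->jmd", g, x)              # sum_t gamma_t(j, m) x_t
    mu /= occ[..., None]
    Sigma = np.zeros((N, M, d, d))
    for g, x in zip(gammas, xs):
        for j in range(N):
            for m in range(M):
                diff = x - mu[j, m]                       # (T_i, d)
                Sigma[j, m] += (g[:, j, m, None] * diff).T @ diff
    Sigma /= occ[..., None, None]
    c = occ / occ.sum(axis=1, keepdims=True)              # mixture weights c_jm
    return mu, Sigma, c
```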

SLIDE 25

Baum-Welch: In summary

[Every EM Iteration]
Compute θ = { Aj,k, (μjm, Σjm, cjm) } for all j, k, m:

Aj,k = Σi=1..N Σt=1..Ti−1 ξi,t(j, k) / (Σi=1..N Σt=1..Ti−1 Σk′ ξi,t(j, k′))

μjm = Σi=1..N Σt=1..Ti γi,t(j, m) xit / (Σi=1..N Σt=1..Ti γi,t(j, m))

Σjm = Σi=1..N Σt=1..Ti γi,t(j, m) (xit − μjm)(xit − μjm)ᵀ / (Σi=1..N Σt=1..Ti γi,t(j, m))

cjm = Σi=1..N Σt=1..Ti γi,t(j, m) / (Σi=1..N Σt=1..Ti Σm′=1..M γi,t(j, m′))