HMMs for Acoustic Modeling (Part II)
Lecture 3, CS 753
Instructor: Preethi Jyothi
Recap: HMMs for Acoustic Modeling
- What are (first-order) HMMs?
- What are the simplifying assumptions governing HMMs?
- What are the three fundamental problems related to HMMs?
- 1. What is the forward algorithm? What is it used to compute?
Computing Likelihood: Given an HMM λ = (A, B) and an observation sequence O, determine the likelihood P(O|λ).
- 2. What is the Viterbi algorithm? What is it used to compute?
Decoding: Given as input an HMM λ = (A, B) and a sequence of observations O = o_1, o_2, ..., o_T, find the most probable sequence of states Q = q_1 q_2 q_3 ... q_T.
Problem 3: Learning in HMMs
Problem 1 (Likelihood): Given an HMM λ = (A, B) and an observation sequence O, determine the likelihood P(O|λ).
Problem 2 (Decoding): Given an observation sequence O and an HMM λ = (A, B), discover the best hidden state sequence Q.
Problem 3 (Learning): Given an observation sequence O and the set of possible states in the HMM, learn the HMM parameters A and B.
Standard algorithm for HMM training: Forward-backward or Baum-Welch algorithm
Forward and Backward Probabilities
The Baum-Welch algorithm iteratively estimates the transition and observation probabilities, and uses these values to derive even better estimates. Two probabilities are required to compute estimates for the transition and observation probabilities:
- 1. Forward probability (recall): \alpha_t(j) = P(o_1, o_2, \ldots, o_t, q_t = j \mid \lambda)
- 2. Backward probability: \beta_t(i) = P(o_{t+1}, o_{t+2}, \ldots, o_T \mid q_t = i, \lambda)
Backward probability
- 1. Initialization:
\beta_T(i) = 1, \quad 1 \le i \le N
- 2. Recursion:
\beta_t(i) = \sum_{j=1}^{N} a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j), \quad 1 \le i \le N,\ 1 \le t < T
- 3. Termination:
P(O \mid \lambda) = \sum_{j=1}^{N} \pi_j\, b_j(o_1)\, \beta_1(j)
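As an illustration, here is a minimal NumPy sketch of this backward recursion. The names A, B, pi, obs and the indexing conventions are assumptions for illustration; the slides do not prescribe an implementation, and no scaling is applied, so long sequences will underflow.

```python
import numpy as np

def backward(A, B, pi, obs):
    """Backward probabilities for a discrete-output HMM.

    A:   (N, N) transition matrix, A[i, j] = a_ij
    B:   (N, V) observation matrix, B[j, k] = b_j(v_k)
    pi:  (N,)   initial state distribution
    obs: length-T sequence of observation indices o_1 .. o_T
    Returns beta of shape (T, N) and the likelihood P(O | lambda).
    """
    N, T = A.shape[0], len(obs)
    beta = np.zeros((T, N))
    beta[T - 1, :] = 1.0                        # initialization: beta_T(i) = 1
    for t in range(T - 2, -1, -1):              # recursion, t = T-1 .. 1
        beta[t, :] = A @ (B[:, obs[t + 1]] * beta[t + 1, :])
    likelihood = np.sum(pi * B[:, obs[0]] * beta[0, :])   # termination
    return beta, likelihood
```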
Visualising backward probability computation
[Figure A.11 (Jurafsky and Martin): trellis illustrating the computation of β_t(i) = Σ_j a_ij b_j(o_{t+1}) β_{t+1}(j), i.e. summing over all successor states j at time t+1, weighted by β_{t+1}(j).]
- 1. Baum-Welch: Estimating â_ij

To estimate a_ij, we first need to define ξ_t(i, j), the probability of being in state i at time t and state j at time t+1:
\xi_t(i, j) = P(q_t = i, q_{t+1} = j \mid O, \lambda)
which works out to be
\xi_t(i, j) = \frac{\alpha_t(i)\, a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)}{\sum_{j=1}^{N} \alpha_t(j)\, \beta_t(j)}
[Figure: trellis showing α_t(i) at state s_i, the arc a_ij b_j(o_{t+1}) from s_i at time t to s_j at time t+1, and β_{t+1}(j) at state s_j.]
Then,
\hat{a}_{ij} = \frac{\sum_{t=1}^{T-1} \xi_t(i, j)}{\sum_{t=1}^{T-1} \sum_{k=1}^{N} \xi_t(i, k)}
- 2. Baum-Welch: Estimating b̂_j(v_k)

To estimate b_j(v_k), we need to define γ_t(j), the state occupancy probability:
\gamma_t(j) = P(q_t = j \mid O, \lambda)
which works out to be
\gamma_t(j) = \frac{\alpha_t(j)\, \beta_t(j)}{P(O \mid \lambda)}
[Figure: trellis showing α_t(j) and β_t(j) meeting at state s_j at time t.]
Then, for discrete outputs,
\hat{b}_j(v_k) = \frac{\sum_{t=1 \text{ s.t. } o_t = v_k}^{T} \gamma_t(j)}{\sum_{t=1}^{T} \gamma_t(j)}
The quantities ξ_t(i, j) and γ_t(j) are thus used to re-estimate a_ij and b_j(v_k), respectively.
Bringing it all together: Baum-Welch
Estimate the HMM parameters iteratively using the EM algorithm. In each iteration:
E-step: For all time-state pairs, compute the state occupation probabilities γ_t(j) and ξ_t(i, j).
M-step: Re-estimate the HMM parameters, i.e. the transition and observation probabilities, based on the estimates derived in the E-step.
Baum-Welch algorithm (pseudocode)
function FORWARD-BACKWARD(observations of length T, output vocabulary V, hidden state set Q) returns HMM = (A, B)
  initialize A and B
  iterate until convergence:
    E-step:
      \gamma_t(j) = \frac{\alpha_t(j)\, \beta_t(j)}{\alpha_T(q_F)}   for all t and j
      \xi_t(i, j) = \frac{\alpha_t(i)\, a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)}{\alpha_T(q_F)}   for all t, i, and j
    M-step:
      \hat{a}_{ij} = \frac{\sum_{t=1}^{T-1} \xi_t(i, j)}{\sum_{t=1}^{T-1} \sum_{k=1}^{N} \xi_t(i, k)}
      \hat{b}_j(v_k) = \frac{\sum_{t=1 \text{ s.t. } o_t = v_k}^{T} \gamma_t(j)}{\sum_{t=1}^{T} \gamma_t(j)}
  return A, B
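Below is a minimal single-sequence NumPy sketch of this pseudocode for a discrete-output HMM. It is only meant to mirror the update equations above, not to be a reference implementation: the function and variable names are mine, there is no dedicated final state q_F (P(O|λ) is taken as Σ_j α_T(j)), and no scaling or log-space arithmetic is used, so it will underflow on long sequences.

```python
import numpy as np

def forward(A, B, pi, obs):
    N, T = A.shape[0], len(obs)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    return alpha

def backward(A, B, obs):
    N, T = A.shape[0], len(obs)
    beta = np.zeros((T, N))
    beta[T - 1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    return beta

def baum_welch(obs, N, V, n_iter=20, seed=0):
    """Baum-Welch for one observation sequence; obs holds indices in [0, V)."""
    obs = np.asarray(obs)
    rng = np.random.default_rng(seed)
    A = rng.random((N, N)); A /= A.sum(axis=1, keepdims=True)
    B = rng.random((N, V)); B /= B.sum(axis=1, keepdims=True)
    pi = np.full(N, 1.0 / N)
    for _ in range(n_iter):
        alpha, beta = forward(A, B, pi, obs), backward(A, B, obs)
        P_O = alpha[-1].sum()                       # P(O | lambda)
        # E-step: state and transition occupancies
        gamma = alpha * beta / P_O                  # gamma[t, j]
        xi = (alpha[:-1, :, None] * A[None, :, :]   # xi[t, i, j]
              * (B[:, obs[1:]].T * beta[1:])[:, None, :]) / P_O
        # M-step: re-estimate A and B (pi re-estimated as gamma_1 in this sketch)
        A = xi.sum(axis=0) / xi.sum(axis=(0, 2))[:, None]
        for k in range(V):
            B[:, k] = gamma[obs == k].sum(axis=0)
        B /= gamma.sum(axis=0)[:, None]
        pi = gamma[0]
    return A, B, pi
```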
Discrete to continuous outputs
We derived Baum-Welch updates for discrete outputs. However, HMMs in acoustic models emit real-valued vectors as observations. Before we understand how Baum-Welch works for acoustic modelling using HMMs, let’s look at an overview of the Expectation Maximization (EM) algorithm and establish some notation.
EM Algorithm: Fitting Parameters to Data

Observed data: i.i.d. samples x_i, i = 1, ..., N (x is observed and z is hidden)
Goal: Find \arg\max_{\theta} L(\theta), where
L(\theta) = \sum_{i=1}^{N} \log \Pr(x_i; \theta)
Initial parameters: θ^0. Iteratively compute θ^ℓ as follows:
Q(\theta, \theta^{\ell-1}) = \sum_{i=1}^{N} \sum_{z} \Pr(z \mid x_i; \theta^{\ell-1}) \log \Pr(x_i, z; \theta)
\theta^{\ell} = \arg\max_{\theta} Q(\theta, \theta^{\ell-1})
The estimate θ^ℓ cannot get worse over iterations, because for all θ:
L(\theta) - L(\theta^{\ell-1}) \ge Q(\theta, \theta^{\ell-1}) - Q(\theta^{\ell-1}, \theta^{\ell-1})
EM is guaranteed to converge to a local optimum or a saddle point [Wu83].
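A compact justification of this bound (the standard EM argument, not spelled out on the slides) follows from Jensen's inequality applied to the posterior Pr(z | x_i; θ^{ℓ−1}):

\begin{aligned}
L(\theta) - L(\theta^{\ell-1})
&= \sum_{i=1}^{N} \log \sum_{z} \Pr(z \mid x_i; \theta^{\ell-1})\,
   \frac{\Pr(x_i, z; \theta)}{\Pr(z \mid x_i; \theta^{\ell-1})\,\Pr(x_i; \theta^{\ell-1})} \\
&\ge \sum_{i=1}^{N} \sum_{z} \Pr(z \mid x_i; \theta^{\ell-1})\,
   \log \frac{\Pr(x_i, z; \theta)}{\Pr(z \mid x_i; \theta^{\ell-1})\,\Pr(x_i; \theta^{\ell-1})} \\
&= Q(\theta, \theta^{\ell-1}) - Q(\theta^{\ell-1}, \theta^{\ell-1})
\end{aligned}

where the last step uses Pr(x_i, z; θ^{ℓ−1}) = Pr(z | x_i; θ^{ℓ−1}) Pr(x_i; θ^{ℓ−1}). Choosing θ^ℓ to maximise Q makes the right-hand side nonnegative, so L(θ^ℓ) ≥ L(θ^{ℓ−1}).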
Coin example to illustrate EM
ρ_1 = Pr(H) for coin 1, ρ_2 = Pr(H) for coin 2, ρ_3 = Pr(H) for coin 3.
Repeat: Toss coin 1 privately; if it shows H, toss coin 2 twice, else toss coin 3 twice.
The following sequence is observed: "HH, TT, HH, TT, HH". How do you estimate ρ_1, ρ_2 and ρ_3?
Coin example to illustrate EM
Recall, for partially observed data, the log likelihood is given by:
L(\theta) = \sum_{i=1}^{N} \log \Pr(x_i; \theta) = \sum_{i=1}^{N} \log \sum_{z} \Pr(x_i, z; \theta)
where, for the coin example, each observation x_i ∈ X = {HH, HT, TH, TT} and the hidden variable z ∈ Z = {H, T}.
Coin example to illustrate EM
Recall, for partially observed data, the log likelihood is given by:
L(\theta) = \sum_{i=1}^{N} \log \Pr(x_i; \theta) = \sum_{i=1}^{N} \log \sum_{z} \Pr(x_i, z; \theta)
where \Pr(x, z; \theta) = \Pr(x \mid z; \theta)\, \Pr(z; \theta). For the coin example (ρ_1 = Pr(H) for coin 1, ρ_2 = Pr(H) for coin 2, ρ_3 = Pr(H) for coin 3):
\Pr(z; \theta) = \begin{cases} \rho_1 & \text{if } z = H \\ 1 - \rho_1 & \text{if } z = T \end{cases}
\qquad
\Pr(x \mid z; \theta) = \begin{cases} \rho_2^{h} (1 - \rho_2)^{t} & \text{if } z = H \\ \rho_3^{h} (1 - \rho_3)^{t} & \text{if } z = T \end{cases}
where h is the number of heads and t the number of tails in x.
Coin example to illustrate EM

Our observed data is: {HH, TT, HH, TT, HH}. Let's use EM to estimate θ = (ρ_1, ρ_2, ρ_3).
Suppose θ^{ℓ−1} is ρ_1 = 0.3, ρ_2 = 0.4, ρ_3 = 0.6.
[EM Iteration, E-step] Compute the quantities involved in
Q(\theta, \theta^{\ell-1}) = \sum_{i=1}^{N} \sum_{z} \gamma(z, x_i) \log \Pr(x_i, z; \theta)
where γ(z, x) = Pr(z | x; θ^{ℓ−1}), i.e., compute γ(z, x_i) for all z and all i.
What is γ(H, HH)? It equals 0.16. What is γ(H, TT)? It is approximately 0.49.
Coin example to illustrate EM

Our observed data is: {HH, TT, HH, TT, HH}. Let's use EM to estimate θ = (ρ_1, ρ_2, ρ_3).
[EM Iteration, M-step] Find θ which maximises
Q(\theta, \theta^{\ell-1}) = \sum_{i=1}^{N} \sum_{z} \gamma(z, x_i) \log \Pr(x_i, z; \theta)
which gives (with h_i, t_i the number of heads and tails in x_i):
\rho_1 = \frac{\sum_{i=1}^{N} \gamma(H, x_i)}{N}
\qquad
\rho_2 = \frac{\sum_{i=1}^{N} \gamma(H, x_i)\, h_i}{\sum_{i=1}^{N} \gamma(H, x_i)\,(h_i + t_i)}
\qquad
\rho_3 = \frac{\sum_{i=1}^{N} \gamma(T, x_i)\, h_i}{\sum_{i=1}^{N} \gamma(T, x_i)\,(h_i + t_i)}
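As a sanity check, here is a small Python sketch of one EM iteration for this coin example. The data, initial parameters, and update formulas are those on the slides; the function and variable names are mine.

```python
def coin_em_step(data, rho1, rho2, rho3):
    """One EM iteration for the three-coin example.

    data: list of observed two-toss strings, e.g. ["HH", "TT", ...]
    Returns updated (rho1, rho2, rho3).
    """
    # E-step: gamma(H, x) = Pr(z = H | x; theta) for each observation x
    gammas = []
    for x in data:
        h, t = x.count("H"), x.count("T")
        p_h = rho1 * (rho2 ** h) * ((1 - rho2) ** t)        # Pr(x, z = H)
        p_t = (1 - rho1) * (rho3 ** h) * ((1 - rho3) ** t)  # Pr(x, z = T)
        gammas.append(p_h / (p_h + p_t))

    # M-step: closed-form updates from the slides
    N = len(data)
    new_rho1 = sum(gammas) / N
    num2 = sum(g * x.count("H") for g, x in zip(gammas, data))
    den2 = sum(g * len(x) for g, x in zip(gammas, data))
    num3 = sum((1 - g) * x.count("H") for g, x in zip(gammas, data))
    den3 = sum((1 - g) * len(x) for g, x in zip(gammas, data))
    return new_rho1, num2 / den2, num3 / den3

data = ["HH", "TT", "HH", "TT", "HH"]
print(coin_em_step(data, 0.3, 0.4, 0.6))
# In the E-step, gamma(H, "HH") = 0.16 and gamma(H, "TT") is about 0.49, as above.
```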
Coin example to illustrate EM
- This was a very simple HMM (with observations from 2 states): the state remains the same after the first transition.
- γ estimated the distribution of this single state. More generally, we will need the distribution of the state at each time step.
- EM for general HMMs: the Baum-Welch algorithm (1972), which predates the general formulation of EM (1977).
[Figure: two-state HMM with start probabilities ρ_1 and 1−ρ_1, emissions H/ρ_2 and T/1−ρ_2 from one state, H/ρ_3 and T/1−ρ_3 from the other, and self-loop probability 1 on each state.]
Baum-Welch Algorithm as EM

Observed data: N sequences x_i, i = 1, ..., N, where each x_it ∈ V.
Parameters θ: transition matrix A, observation probabilities B.
[EM Iteration, E-step] Compute the quantities involved in Q(θ, θ^{ℓ−1}):
\gamma_{i,t}(j) = \Pr(z_t = j \mid x_i; \theta^{\ell-1})
\qquad
\xi_{i,t}(j, k) = \Pr(z_t = j, z_{t+1} = k \mid x_i; \theta^{\ell-1})

Baum-Welch Algorithm as EM

Observed data: N sequences x_i, i = 1, ..., N, where each x_it ∈ V.
Parameters θ: transition matrix A, observation probabilities B.
[EM Iteration, M-step] Find θ which maximises Q(θ, θ^{ℓ−1}):
A_{j,k} = \frac{\sum_{i=1}^{N} \sum_{t=1}^{T_i - 1} \xi_{i,t}(j, k)}{\sum_{i=1}^{N} \sum_{t=1}^{T_i - 1} \sum_{k'} \xi_{i,t}(j, k')}
\qquad
B_{j,v} = \frac{\sum_{i=1}^{N} \sum_{t:\, x_{it} = v} \gamma_{i,t}(j)}{\sum_{i=1}^{N} \sum_{t=1}^{T_i} \gamma_{i,t}(j)}
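To make the double sums over sequences concrete, here is a small NumPy sketch of the pooled M-step. It assumes the per-sequence posteriors γ_{i,t}(j) and ξ_{i,t}(j, k) have already been computed (for example with the forward-backward sketch earlier); the function and argument names are illustrative, not from the slides.

```python
import numpy as np

def m_step_multi(gammas, xis, obs_seqs, V):
    """Pooled M-step over multiple observation sequences.

    gammas:   list of (T_i, N) arrays, gamma_{i,t}(j) for each sequence
    xis:      list of (T_i - 1, N, N) arrays, xi_{i,t}(j, k) for each sequence
    obs_seqs: list of integer observation sequences (values in [0, V))
    Returns re-estimated (A, B).
    """
    N = gammas[0].shape[1]
    A_num = np.zeros((N, N))
    B_num = np.zeros((N, V))
    occ = np.zeros(N)
    for gamma, xi, obs in zip(gammas, xis, obs_seqs):
        A_num += xi.sum(axis=0)                   # sum over t of xi_{i,t}(j, k)
        occ += gamma.sum(axis=0)                  # sum over t of gamma_{i,t}(j)
        for t, o in enumerate(obs):
            B_num[:, o] += gamma[t]               # sum over t with x_{it} = v
    A = A_num / A_num.sum(axis=1, keepdims=True)  # normalize over k'
    B = B_num / occ[:, None]                      # normalize by state occupancy
    return A, B
```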
Discrete to continuous outputs
We derived Baum-Welch updates for discrete outputs. However, HMMs in acoustic models emit real-valued vectors as observations. Use probability density functions to define observation probabilities
If x were a 1D value, the HMM observation probabilities would be
b_j(x) = \mathcal{N}(x \mid \mu_j, \sigma_j^2)
where μ_j is the mean associated with state j and σ_j^2 is its variance.
If x ∈ ℝ^d, then we use multivariate Gaussians,
b_j(x) = \mathcal{N}(x \mid \mu_j, \Sigma_j)
where Σ_j is the covariance matrix associated with state j.
BW for Gaussian Observation Model
Observed data: N sequences x_i = (x_{i1}, ..., x_{iT_i}), i = 1, ..., N, where each x_it ∈ ℝ^d.
Parameters θ: transition matrix A, observation probabilities B = {(μ_j, Σ_j)} for all j.
[EM Iteration, M-step] Find θ which maximises Q(θ, θ^{ℓ−1}):
\mu_j = \frac{\sum_{i=1}^{N} \sum_{t=1}^{T_i} \gamma_{i,t}(j)\, x_{it}}{\sum_{i=1}^{N} \sum_{t=1}^{T_i} \gamma_{i,t}(j)}
\qquad
\Sigma_j = \frac{\sum_{i=1}^{N} \sum_{t=1}^{T_i} \gamma_{i,t}(j)\,(x_{it} - \mu_j)(x_{it} - \mu_j)^{\top}}{\sum_{i=1}^{N} \sum_{t=1}^{T_i} \gamma_{i,t}(j)}
The update for A is the same as with discrete outputs.
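A minimal NumPy sketch of these Gaussian updates for a single sequence is shown below. The names X and gamma are illustrative assumptions, and numerical safeguards such as variance flooring are omitted.

```python
import numpy as np

def gaussian_m_step(X, gamma):
    """Re-estimate per-state Gaussian parameters from state occupancies.

    X:     (T, d) observation vectors for one sequence
    gamma: (T, N) state occupation probabilities gamma_t(j)
    Returns means mu of shape (N, d) and covariances Sigma of shape (N, d, d).
    """
    occ = gamma.sum(axis=0)                          # (N,) total occupancy per state
    mu = (gamma.T @ X) / occ[:, None]                # occupancy-weighted means
    N, d = gamma.shape[1], X.shape[1]
    Sigma = np.zeros((N, d, d))
    for j in range(N):
        diff = X - mu[j]                             # (T, d) deviations from the state mean
        Sigma[j] = (gamma[:, j, None] * diff).T @ diff / occ[j]
    return mu, Sigma
```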
Gaussian Mixture Model
- Assuming that observations associated with a state follow a single Gaussian distribution is too simplistic.
- More generally, we use a "mixture of Gaussians" to allow for acoustic vectors associated with a state to be non-Gaussian.
- Instead of b_j(x) = 𝒩(x | μ_j, Σ_j) as in the single Gaussian case, b_j(x) can be an M-component mixture model:
b_j(x) = \sum_{m=1}^{M} c_{jm}\, \mathcal{N}(x \mid \mu_{jm}, \Sigma_{jm}), \quad \sum_{m=1}^{M} c_{jm} = 1,\ c_{jm} \ge 0
where c_jm is the mixing probability for Gaussian component m of state j.
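For concreteness, a tiny sketch of evaluating this mixture observation density with SciPy is below; the argument names (c_j, mu_j, Sigma_j) are hypothetical.

```python
from scipy.stats import multivariate_normal

def gmm_obs_prob(x, c_j, mu_j, Sigma_j):
    """b_j(x) for an M-component GMM: c_j (M,), mu_j (M, d), Sigma_j (M, d, d)."""
    return sum(c_j[m] * multivariate_normal.pdf(x, mean=mu_j[m], cov=Sigma_j[m])
               for m in range(len(c_j)))
```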
BW for Gaussian Mixture Model
Observed data: N sequences x_i = (x_{i1}, ..., x_{iT_i}), i = 1, ..., N, where each x_it ∈ ℝ^d.
Parameters θ: transition matrix A, observation probabilities B = {(μ_jm, Σ_jm, c_jm)} for all j, m.
[EM Iteration, M-step] Find θ which maximises Q(θ, θ^{ℓ−1}). Here γ_{i,t}(j, m) is the probability of component m of state j at time t:
\mu_{jm} = \frac{\sum_{i=1}^{N} \sum_{t=1}^{T_i} \gamma_{i,t}(j, m)\, x_{it}}{\sum_{i=1}^{N} \sum_{t=1}^{T_i} \gamma_{i,t}(j, m)}
\qquad
\Sigma_{jm} = \frac{\sum_{i=1}^{N} \sum_{t=1}^{T_i} \gamma_{i,t}(j, m)\,(x_{it} - \mu_{jm})(x_{it} - \mu_{jm})^{\top}}{\sum_{i=1}^{N} \sum_{t=1}^{T_i} \gamma_{i,t}(j, m)}
\qquad
c_{jm} = \frac{\sum_{i=1}^{N} \sum_{t=1}^{T_i} \gamma_{i,t}(j, m)}{\sum_{i=1}^{N} \sum_{t=1}^{T_i} \sum_{m'=1}^{M} \gamma_{i,t}(j, m')}
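The slides use γ_{i,t}(j, m) without spelling out how it is obtained; in the standard formulation (an addition here, not from the slides) it splits the state occupancy γ_{i,t}(j) across mixture components according to their posterior responsibility for the observation:

\gamma_{i,t}(j, m) = \gamma_{i,t}(j)\,
\frac{c_{jm}\, \mathcal{N}(x_{it} \mid \mu_{jm}, \Sigma_{jm})}
     {\sum_{m'=1}^{M} c_{jm'}\, \mathcal{N}(x_{it} \mid \mu_{jm'}, \Sigma_{jm'})}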
Baum-Welch: In summary
[Every EM Iteration] Compute θ = {A_{j,k}, (μ_{jm}, Σ_{jm}, c_{jm})} for all j, k, m:
A_{j,k} = \frac{\sum_{i=1}^{N} \sum_{t=1}^{T_i - 1} \xi_{i,t}(j, k)}{\sum_{i=1}^{N} \sum_{t=1}^{T_i - 1} \sum_{k'} \xi_{i,t}(j, k')}
\qquad
\mu_{jm} = \frac{\sum_{i=1}^{N} \sum_{t=1}^{T_i} \gamma_{i,t}(j, m)\, x_{it}}{\sum_{i=1}^{N} \sum_{t=1}^{T_i} \gamma_{i,t}(j, m)}
\qquad
\Sigma_{jm} = \frac{\sum_{i=1}^{N} \sum_{t=1}^{T_i} \gamma_{i,t}(j, m)\,(x_{it} - \mu_{jm})(x_{it} - \mu_{jm})^{\top}}{\sum_{i=1}^{N} \sum_{t=1}^{T_i} \gamma_{i,t}(j, m)}
\qquad
c_{jm} = \frac{\sum_{i=1}^{N} \sum_{t=1}^{T_i} \gamma_{i,t}(j, m)}{\sum_{i=1}^{N} \sum_{t=1}^{T_i} \sum_{m'=1}^{M} \gamma_{i,t}(j, m')}