Automatic Speech Recognition (CS753)
Lecture 7: Hidden Markov Models (Part III)
Instructor: Preethi Jyothi, Aug 14, 2017

Recap: Learning HMM Parameters
Problem 1 (Likelihood): Given an HMM λ = (A, B) and an observation sequence O, determine the likelihood P(O | λ).
Problem 2 (Decoding): Given an observation sequence O and an HMM λ = (A, B), discover the best hidden state sequence Q.
Problem 3 (Learning): Given an observation sequence O and the set of states in the HMM, learn the HMM parameters A and B.
Standard algorithm for HMM training: Forward-backward or Baum-Welch algorithm
Baum-Welch: In summary
[Every EM iteration] Compute θ = {A_{jk}, (µ_{jm}, Σ_{jm}, c_{jm})} for all j, k, m:

$$\mu_{jm} = \frac{\sum_{i=1}^{N}\sum_{t=1}^{T_i} \gamma_{i,t}(j,m)\, x_{it}}{\sum_{i=1}^{N}\sum_{t=1}^{T_i} \gamma_{i,t}(j,m)}$$

$$\Sigma_{jm} = \frac{\sum_{i=1}^{N}\sum_{t=1}^{T_i} \gamma_{i,t}(j,m)\,(x_{it}-\mu_{jm})(x_{it}-\mu_{jm})^T}{\sum_{i=1}^{N}\sum_{t=1}^{T_i} \gamma_{i,t}(j,m)}$$

$$c_{jm} = \frac{\sum_{i=1}^{N}\sum_{t=1}^{T_i} \gamma_{i,t}(j,m)}{\sum_{i=1}^{N}\sum_{t=1}^{T_i} \gamma_{i,t}(j)}$$

$$A_{jk} = \frac{\sum_{i=1}^{N}\sum_{t=2}^{T_i} \xi_{i,t}(j,k)}{\sum_{i=1}^{N}\sum_{t=2}^{T_i}\sum_{k'} \xi_{i,t}(j,k')}$$
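As a concrete illustration, a minimal numpy sketch of these per-mixture updates for one HMM state j, assuming the occupation probabilities γ_{i,t}(j, m) have already been collected into an array (all names here are illustrative, not from the slides):

```python
import numpy as np

def gmm_mstep(gamma, X):
    """Re-estimate (mu, Sigma, c) for the GMM of one HMM state j.

    gamma : (T, M) occupation probs gamma_t(j, m) for this state,
            with all utterances concatenated along the time axis
    X     : (T, D) acoustic feature vectors x_t
    """
    T, M = gamma.shape
    D = X.shape[1]
    occ = gamma.sum(axis=0)                    # sum_t gamma_t(j, m), shape (M,)
    mu = (gamma.T @ X) / occ[:, None]          # occupation-weighted means
    Sigma = np.zeros((M, D, D))
    for m in range(M):
        diff = X - mu[m]                       # (T, D)
        Sigma[m] = (gamma[:, m, None] * diff).T @ diff / occ[m]
    c = occ / occ.sum()                        # mixture weights, sum to 1
    return mu, Sigma, c
```

The denominator of `c` sums over all mixture components, matching the γ_{i,t}(j) term in the slide's weight update.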
How do we efficiently compute γt(j) and ξt(i, j)?
Forward/Backward Probabilities
We require two probabilities to compute estimates for the transition and observation probabilities:

1. Forward probability (recall): $\alpha_t(j) = P(o_1, o_2 \ldots o_t, q_t = j \mid \lambda)$
2. Backward probability: $\beta_t(i) = P(o_{t+1}, o_{t+2} \ldots o_T \mid q_t = i, \lambda)$
Backward probability
1. Initialization: $\beta_T(i) = a_{iF}, \quad 1 \le i \le N$

2. Recursion (again, since states 0 and $q_F$ are non-emitting):
$$\beta_t(i) = \sum_{j=1}^{N} a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j), \quad 1 \le i \le N,\ 1 \le t < T$$

3. Termination:
$$P(O \mid \lambda) = \alpha_T(q_F) = \beta_1(q_0) = \sum_{j=1}^{N} a_{0j}\, b_j(o_1)\, \beta_1(j)$$
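A minimal numpy sketch of this backward recursion, with the non-emitting start and final states folded into vectors `a0` and `aF` (these argument names are illustrative):

```python
import numpy as np

def backward(A, B, a0, aF, obs):
    """Backward probabilities for an HMM with non-emitting start/final states.

    A   : (N, N) transition probs between emitting states
    B   : (N, V) observation probs, B[j, o] = b_j(o)
    a0  : (N,)   probs a_{0j} of entering state j from the start state
    aF  : (N,)   probs a_{iF} of exiting state i to the final state
    obs : list of observation symbol indices o_1 .. o_T
    """
    N, T = A.shape[0], len(obs)
    beta = np.zeros((T, N))
    beta[T - 1] = aF                       # initialization: beta_T(i) = a_{iF}
    for t in range(T - 2, -1, -1):         # recursion, t = T-1 down to 1
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    # termination: P(O | lambda) = sum_j a_{0j} b_j(o_1) beta_1(j)
    likelihood = np.sum(a0 * B[:, obs[0]] * beta[0])
    return beta, likelihood
```

The returned likelihood should match the one computed by the forward pass, which is a useful debugging check in practice.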
1. Baum-Welch: Estimating $a_{ij}$

To estimate $a_{ij}$, we first define

$$\xi_t(i, j) = P(q_t = i, q_{t+1} = j \mid O, \lambda)$$

which works out to be

$$\xi_t(i, j) = \frac{\alpha_t(i)\, a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)}{\alpha_T(q_F)}$$

[Figure: trellis fragment showing $\alpha_t(i)$ ending at state $s_i$ at time $t$, the arc $a_{ij}\,b_j(o_{t+1})$ into state $s_j$, and $\beta_{t+1}(j)$ continuing from time $t+1$]

Then,

$$\hat{a}_{ij} = \frac{\sum_{t=1}^{T-1} \xi_t(i, j)}{\sum_{t=1}^{T-1}\sum_{k=1}^{N} \xi_t(i, k)}$$
2. Baum-Welch: Estimating $b_j(o_t)$

To estimate $b_j(o_t)$, we define

$$\gamma_t(j) = P(q_t = j \mid O, \lambda)$$

which works out to be

$$\gamma_t(j) = \frac{\alpha_t(j)\, \beta_t(j)}{P(O \mid \lambda)}$$

[Figure: trellis fragment showing $\alpha_t(j)$ and $\beta_t(j)$ meeting at state $s_j$ at time $t$]

Then, for discrete outputs,

$$\hat{b}_j(v_k) = \frac{\sum_{t=1\ \mathrm{s.t.}\ o_t = v_k}^{T} \gamma_t(j)}{\sum_{t=1}^{T} \gamma_t(j)}$$
Baum-Welch algorithm (pseudocode)
function FORWARD-BACKWARD(observations of length T, output vocabulary V, hidden state set Q) returns HMM = (A, B)
    initialize A and B
    iterate until convergence:
        E-step:
            $\gamma_t(j) = \frac{\alpha_t(j)\,\beta_t(j)}{\alpha_T(q_F)}$  ∀ t, j
            $\xi_t(i, j) = \frac{\alpha_t(i)\, a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)}{\alpha_T(q_F)}$  ∀ t, i, j
        M-step:
            $\hat{a}_{ij} = \frac{\sum_{t=1}^{T-1} \xi_t(i, j)}{\sum_{t=1}^{T-1}\sum_{k=1}^{N} \xi_t(i, k)}$
            $\hat{b}_j(v_k) = \frac{\sum_{t=1\ \mathrm{s.t.}\ o_t = v_k}^{T} \gamma_t(j)}{\sum_{t=1}^{T} \gamma_t(j)}$
    return A, B
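Putting these pieces together, here is a compact numpy sketch of one Baum-Welch iteration for a discrete-output HMM. It uses the simplified variant with an initial-state vector `pi` in place of the non-emitting entry/exit states; all names are illustrative:

```python
import numpy as np

def baum_welch_step(pi, A, B, obs):
    """One EM iteration for a discrete-output HMM.

    pi : (N,) initial state probs; A : (N, N); B : (N, V); obs : o_1 .. o_T
    Returns re-estimated (pi, A, B).
    """
    N, T = A.shape[0], len(obs)

    # E-step: forward and backward passes
    alpha = np.zeros((T, N))
    beta = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    beta[T - 1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    p_obs = alpha[T - 1].sum()                 # P(O | lambda)

    gamma = alpha * beta / p_obs               # (T, N): gamma_t(j)
    xi = np.zeros((T - 1, N, N))               # xi_t(i, j)
    for t in range(T - 1):
        xi[t] = (alpha[t][:, None] * A
                 * (B[:, obs[t + 1]] * beta[t + 1])[None, :]) / p_obs

    # M-step: re-estimate using the expected counts
    new_pi = gamma[0]
    new_A = xi.sum(axis=0) / xi.sum(axis=(0, 2))[:, None]
    new_B = np.zeros_like(B)
    mask = np.array(obs)
    for k in range(B.shape[1]):
        new_B[:, k] = gamma[mask == k].sum(axis=0) / gamma.sum(axis=0)
    return new_pi, new_A, new_B
```

In practice the passes are run in log space or with per-frame scaling to avoid underflow on long utterances; that detail is omitted here for clarity.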
ASR Framework: Acoustic Models
[Figure: ASR pipeline — acoustic features are mapped by the acoustic models (H), a context transducer (triphones → monophones), and a pronunciation model (monophones → words) into a word sequence, scored with a language model]
- Acoustic models are estimated using training data {xi, yi}, i = 1…N, where xi is a sequence of acoustic feature vectors and yi is a sequence of words. E.g.:
  “Hello world” → “sil hh ah l ow w er l d sil” → “sil sil/hh/ah hh/ah/l ah/l/ow l/ow/w ow/w/er w/er/l er/l/d l/d/sil sil”
- For each (xi, yi), a composite HMM is constructed using the HMMs that correspond to the triphone sequence in yi.
- Parameters of these composite HMMs are the parameters of the constituent triphone HMMs.
- These parameters are fit to the acoustic data {xi}, i = 1…N, using the Baum-Welch algorithm (EM).
Triphone HMM Models
- Each phone is modelled in the context of its left and right neighbour phones, since the pronunciation of a phone is influenced by the preceding and succeeding phones. E.g., the phone [p] in the word “peek” (p iy k) vs. [p] in the word “pool” (p uw l).
- Number of triphones that appear in data ≈ 1000s or 10,000s
- If each triphone HMM has 3 states, each state generates from an m-component GMM (m ≈ 64) over d-dimensional acoustic feature vectors (d ≈ 40), and each Σ has d² parameters, the total runs to hundreds of millions of parameters!
- There is insufficient data to learn all triphone models reliably. What do we do?
Share parameters across triphone models!
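The arithmetic behind that estimate can be checked directly. Taking the low end of the triphone count with the slide's illustrative figures (m = 64, d = 40, full covariances):

```python
# Rough parameter count for untied triphone HMMs (illustrative figures).
n_triphones = 1_000       # low end of "1000s or 10,000s" of observed triphones
states_per_hmm = 3
m = 64                    # Gaussian components per state
d = 40                    # acoustic feature dimension

per_gaussian = d + d * d + 1          # mean + full covariance + mixture weight
per_state = m * per_gaussian
total = n_triphones * states_per_hmm * per_state
print(f"{total:,}")                   # hundreds of millions already at 1000 triphones
```

With 10,000 triphones the count reaches billions, so sharing parameters is unavoidable.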
Parameter Sharing
- Sharing of parameters (also referred to as “parameter tying”) can be done at any level.
- Parameters in HMMs corresponding to two triphones are said to be tied if they are identical.

[Figure: two triphone HMMs with transition probabilities t1…t5 and t′1…t′5 — the transition probs are tied (t′i = ti) and the state observation densities are tied]
- More parameter tying: Tying variances of all Gaussians within a state,
tying variances of all Gaussians in all states, tying individual Gaussians, etc.
- 1. Tied Mixture Models
  - All states share the same Gaussians (i.e., the same means and covariances).
  - Mixture weights are specific to each state.

[Figure: triphone HMMs with no sharing vs. tied mixture models]
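A tied mixture observation likelihood can be sketched as follows: every state evaluates the same shared Gaussian codebook once, then applies its own weights. This is a minimal sketch with diagonal covariances; the names are illustrative:

```python
import numpy as np

def tied_mixture_likelihood(x, means, variances, state_weights):
    """b_j(x) for all states j under a tied mixture model.

    means, variances : (M, D) shared Gaussian codebook (diagonal covariances)
    state_weights    : (J, M) per-state mixture weights c_{jm}
    Returns a length-J vector of observation likelihoods.
    """
    # Evaluate each shared Gaussian once (diagonal-covariance density)
    diff2 = (x - means) ** 2 / variances                       # (M, D)
    log_norm = -0.5 * np.sum(np.log(2 * np.pi * variances), axis=1)
    g = np.exp(log_norm - 0.5 * diff2.sum(axis=1))             # (M,)
    # Each state only differs in how it weights the shared Gaussians
    return state_weights @ g                                   # (J,)
```

The Gaussian evaluations, the dominant cost, are shared across all J states; only the cheap weighted sum is state-specific.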
- 2. State Tying
  - Observation probabilities are shared across states which generate acoustically similar data.

[Figure: triphone HMMs for p/a/k, b/a/g, b/a/k with no sharing vs. with state tying]
Four main steps in building a tied state HMM system:
- 1. Create and train 3-state monophone
HMMs with single Gaussian
- bservation probability densities
- 2. Clone these monophone distributions
to initialise a set of untied triphone
- models. Train them using Baum-
Welch estimation. Transition matrix remains common across all triphones
- f each phone.
- 3. For all triphones derived from the
same monophone, cluster states whose parameters should be tied together.
- 4. Number of mixture components in
each tied state is increased and models re-estimated using BW
Image from: Young et al., “Tree-based state tying for high accuracy acoustic modeling”, ACL-HLT, 1994
Which states should be tied together? Use decision trees.
Decision Trees
Classification using a decision tree begins at the root node: which property is satisfied? Depending on the answer, traverse to different branches.

[Figure: example decision tree classifying vegetables — splits on Shape (oval/leafy/cylindrical), then Taste (sour/neutral) and Color (green/white/purple), with leaves such as tomato, spinach, snakegourd, turnip, radish, and brinjal]
Decision Trees
- Given the data at a node, either declare the node to be a leaf or find another property to split the node into branches.
- Important questions to be addressed for DTs:
  1. How many splits at a node? Chosen by the user.
  2. Which property should be used at a node for splitting? One which decreases the “impurity” of the nodes as much as possible.
  3. When is a node a leaf? Set a threshold on the reduction in impurity.
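The split-selection criterion can be made concrete with a toy sketch. Entropy is used as the impurity measure here as one common choice (the slides do not fix a particular measure):

```python
import math
from collections import Counter

def entropy(labels):
    """Impurity of a node: entropy of its label distribution."""
    if not labels:
        return 0.0
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def impurity_reduction(labels, split):
    """Drop in impurity from splitting `labels` by the boolean mask `split`.

    The property chosen at a node is the one maximising this quantity;
    the node becomes a leaf when no split exceeds the chosen threshold.
    """
    left = [y for y, s in zip(labels, split) if s]
    right = [y for y, s in zip(labels, split) if not s]
    n = len(labels)
    return (entropy(labels)
            - (len(left) / n) * entropy(left)
            - (len(right) / n) * entropy(right))
```

A perfect split of two evenly mixed classes yields a reduction of 1 bit, while a split that leaves both children as mixed as the parent yields 0.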