Automatic Speech Recognition (CS753)
Lecture 7: Hidden Markov Models (Part III)
Instructor: Preethi Jyothi, Aug 14, 2017

Recap: Learning HMM Parameters
Problem 1 (Likelihood): Given an HMM λ = (A, B) and an observation sequence O, determine the likelihood P(O | λ).
Problem 2 (Decoding): Given an observation sequence O and an HMM λ = (A, B), discover the best hidden state sequence Q.
Problem 3 (Learning): Given an observation sequence O and the set of states in the HMM, learn the HMM parameters A and B.
Standard algorithm for HMM training: Forward-backward or Baum-Welch algorithm
Baum-Welch: In summary
[Every EM iteration] Compute θ = {A_{jk}, (µ_{jm}, Σ_{jm}, c_{jm})} for all j, k, m:

$$\mu_{jm} = \frac{\sum_{i=1}^{N}\sum_{t=1}^{T_i} \gamma_{i,t}(j,m)\, x_{it}}{\sum_{i=1}^{N}\sum_{t=1}^{T_i} \gamma_{i,t}(j,m)}$$

$$\Sigma_{jm} = \frac{\sum_{i=1}^{N}\sum_{t=1}^{T_i} \gamma_{i,t}(j,m)\,(x_{it}-\mu_{jm})(x_{it}-\mu_{jm})^T}{\sum_{i=1}^{N}\sum_{t=1}^{T_i} \gamma_{i,t}(j,m)}$$

$$c_{jm} = \frac{\sum_{i=1}^{N}\sum_{t=1}^{T_i} \gamma_{i,t}(j,m)}{\sum_{i=1}^{N}\sum_{t=1}^{T_i} \gamma_{i,t}(j)}$$

$$A_{jk} = \frac{\sum_{i=1}^{N}\sum_{t=2}^{T_i} \xi_{i,t}(j,k)}{\sum_{i=1}^{N}\sum_{t=2}^{T_i}\sum_{k'} \xi_{i,t}(j,k')}$$
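As a concrete illustration, a minimal numpy sketch of these per-mixture updates for one HMM state j, assuming the occupation probabilities γ_{i,t}(j, m) have already been collected into an array (all names here are illustrative, not from the slides):

```python
import numpy as np

def gmm_mstep(gamma, X):
    """Re-estimate (mu, Sigma, c) for the GMM of one HMM state j.

    gamma : (T, M) occupation probs gamma_t(j, m) for this state,
            with all utterances concatenated along the time axis
    X     : (T, D) acoustic feature vectors x_t
    """
    T, M = gamma.shape
    D = X.shape[1]
    occ = gamma.sum(axis=0)                    # sum_t gamma_t(j, m), shape (M,)
    mu = (gamma.T @ X) / occ[:, None]          # occupation-weighted means
    Sigma = np.zeros((M, D, D))
    for m in range(M):
        diff = X - mu[m]                       # (T, D)
        Sigma[m] = (gamma[:, m, None] * diff).T @ diff / occ[m]
    c = occ / occ.sum()                        # mixture weights, sum to 1
    return mu, Sigma, c
```

The denominator of `c` sums over all mixture components, matching the γ_{i,t}(j) term in the slide's weight update.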
How do we efficiently compute γt(j) and ξt(i, j)?
Forward/Backward Probabilities
We require two probabilities to compute estimates for the transition and observation probabilities:

1. Forward probability (recall): $\alpha_t(j) = P(o_1, o_2 \ldots o_t, q_t = j \mid \lambda)$
2. Backward probability: $\beta_t(i) = P(o_{t+1}, o_{t+2} \ldots o_T \mid q_t = i, \lambda)$
Backward probability
1. Initialization: $\beta_T(i) = a_{iF}, \quad 1 \le i \le N$

2. Recursion (again, since states 0 and $q_F$ are non-emitting):
$$\beta_t(i) = \sum_{j=1}^{N} a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j), \quad 1 \le i \le N,\ 1 \le t < T$$

3. Termination:
$$P(O \mid \lambda) = \alpha_T(q_F) = \beta_1(q_0) = \sum_{j=1}^{N} a_{0j}\, b_j(o_1)\, \beta_1(j)$$
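A minimal numpy sketch of this backward recursion, with the non-emitting start and final states folded into vectors `a0` and `aF` (these argument names are illustrative):

```python
import numpy as np

def backward(A, B, a0, aF, obs):
    """Backward probabilities for an HMM with non-emitting start/final states.

    A   : (N, N) transition probs between emitting states
    B   : (N, V) observation probs, B[j, o] = b_j(o)
    a0  : (N,)   probs a_{0j} of entering state j from the start state
    aF  : (N,)   probs a_{iF} of exiting state i to the final state
    obs : list of observation symbol indices o_1 .. o_T
    """
    N, T = A.shape[0], len(obs)
    beta = np.zeros((T, N))
    beta[T - 1] = aF                       # initialization: beta_T(i) = a_{iF}
    for t in range(T - 2, -1, -1):         # recursion, t = T-1 down to 1
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    # termination: P(O | lambda) = sum_j a_{0j} b_j(o_1) beta_1(j)
    likelihood = np.sum(a0 * B[:, obs[0]] * beta[0])
    return beta, likelihood
```

The returned likelihood should match the one computed by the forward pass, which is a useful debugging check in practice.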
1. Baum-Welch: Estimating $a_{ij}$

To estimate $a_{ij}$, we first define

$$\xi_t(i, j) = P(q_t = i, q_{t+1} = j \mid O, \lambda)$$

which works out to be

$$\xi_t(i, j) = \frac{\alpha_t(i)\, a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)}{\alpha_T(q_F)}$$

[Figure: trellis fragment showing $\alpha_t(i)$ ending at state $s_i$ at time $t$, the arc $a_{ij}\,b_j(o_{t+1})$ into state $s_j$, and $\beta_{t+1}(j)$ continuing from time $t+1$]

Then,

$$\hat{a}_{ij} = \frac{\sum_{t=1}^{T-1} \xi_t(i, j)}{\sum_{t=1}^{T-1}\sum_{k=1}^{N} \xi_t(i, k)}$$
2. Baum-Welch: Estimating $b_j(o_t)$

To estimate $b_j(o_t)$, we define

$$\gamma_t(j) = P(q_t = j \mid O, \lambda)$$

which works out to be

$$\gamma_t(j) = \frac{\alpha_t(j)\, \beta_t(j)}{P(O \mid \lambda)}$$

[Figure: trellis fragment showing $\alpha_t(j)$ and $\beta_t(j)$ meeting at state $s_j$ at time $t$]

Then, for discrete outputs,

$$\hat{b}_j(v_k) = \frac{\sum_{t=1\ \mathrm{s.t.}\ o_t = v_k}^{T} \gamma_t(j)}{\sum_{t=1}^{T} \gamma_t(j)}$$
Baum-Welch algorithm (pseudocode)
function FORWARD-BACKWARD(observations of length T, output vocabulary V, hidden state set Q) returns HMM = (A, B)
    initialize A and B
    iterate until convergence:
        E-step:
            $\gamma_t(j) = \frac{\alpha_t(j)\,\beta_t(j)}{\alpha_T(q_F)}$  ∀ t, j
            $\xi_t(i, j) = \frac{\alpha_t(i)\, a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)}{\alpha_T(q_F)}$  ∀ t, i, j
        M-step:
            $\hat{a}_{ij} = \frac{\sum_{t=1}^{T-1} \xi_t(i, j)}{\sum_{t=1}^{T-1}\sum_{k=1}^{N} \xi_t(i, k)}$
            $\hat{b}_j(v_k) = \frac{\sum_{t=1\ \mathrm{s.t.}\ o_t = v_k}^{T} \gamma_t(j)}{\sum_{t=1}^{T} \gamma_t(j)}$
    return A, B
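Putting these pieces together, here is a compact numpy sketch of one Baum-Welch iteration for a discrete-output HMM. It uses the simplified variant with an initial-state vector `pi` in place of the non-emitting entry/exit states; all names are illustrative:

```python
import numpy as np

def baum_welch_step(pi, A, B, obs):
    """One EM iteration for a discrete-output HMM.

    pi : (N,) initial state probs; A : (N, N); B : (N, V); obs : o_1 .. o_T
    Returns re-estimated (pi, A, B).
    """
    N, T = A.shape[0], len(obs)

    # E-step: forward and backward passes
    alpha = np.zeros((T, N))
    beta = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    beta[T - 1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    p_obs = alpha[T - 1].sum()                 # P(O | lambda)

    gamma = alpha * beta / p_obs               # (T, N): gamma_t(j)
    xi = np.zeros((T - 1, N, N))               # xi_t(i, j)
    for t in range(T - 1):
        xi[t] = (alpha[t][:, None] * A
                 * (B[:, obs[t + 1]] * beta[t + 1])[None, :]) / p_obs

    # M-step: re-estimate using the expected counts
    new_pi = gamma[0]
    new_A = xi.sum(axis=0) / xi.sum(axis=(0, 2))[:, None]
    new_B = np.zeros_like(B)
    mask = np.array(obs)
    for k in range(B.shape[1]):
        new_B[:, k] = gamma[mask == k].sum(axis=0) / gamma.sum(axis=0)
    return new_pi, new_A, new_B
```

In practice the passes are run in log space or with per-frame scaling to avoid underflow on long utterances; that detail is omitted here for clarity.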
ASR Framework: Acoustic Models
[Figure: ASR pipeline — acoustic features are mapped by the acoustic models (H), a context transducer (triphones → monophones), and a pronunciation model (monophones → words) into a word sequence, scored with a language model]
- Acoustic models are estimated using training data {xi, yi}, i = 1…N, where xi is a sequence of acoustic feature vectors and yi is a sequence of words. E.g.:
  “Hello world” → “sil hh ah l ow w er l d sil” → “sil sil/hh/ah hh/ah/l ah/l/ow l/ow/w ow/w/er w/er/l er/l/d l/d/sil sil”
- For each (xi, yi), a composite HMM is constructed using the HMMs that correspond to the triphone sequence in yi.
- Parameters of these composite HMMs are the parameters of the constituent triphone HMMs.
- These parameters are fit to the acoustic data {xi}, i = 1…N, using the Baum-Welch algorithm (EM).
Triphone HMM Models
- Each phone is modelled in the context of its left and right neighbour phones, since the pronunciation of a phone is influenced by the preceding and succeeding phones. E.g., the phone [p] in the word “peek” (p iy k) vs. [p] in the word “pool” (p uw l).
- Number of triphones that appear in data ≈ 1000s or 10,000s
- If each triphone HMM has 3 states, each state generates from an m-component GMM (m ≈ 64) over d-dimensional acoustic feature vectors (d ≈ 40), and each Σ has d² parameters, the total runs to hundreds of millions of parameters!
- There is insufficient data to learn all triphone models reliably. What do we do?
Share parameters across triphone models!
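The arithmetic behind that estimate can be checked directly. Taking the low end of the triphone count with the slide's illustrative figures (m = 64, d = 40, full covariances):

```python
# Rough parameter count for untied triphone HMMs (illustrative figures).
n_triphones = 1_000       # low end of "1000s or 10,000s" of observed triphones
states_per_hmm = 3
m = 64                    # Gaussian components per state
d = 40                    # acoustic feature dimension

per_gaussian = d + d * d + 1          # mean + full covariance + mixture weight
per_state = m * per_gaussian
total = n_triphones * states_per_hmm * per_state
print(f"{total:,}")                   # hundreds of millions already at 1000 triphones
```

With 10,000 triphones the count reaches billions, so sharing parameters is unavoidable.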
Parameter Sharing
- Sharing of parameters (also referred to as “parameter tying”) can be done at any level.
- Parameters in HMMs corresponding to two triphones are said to be tied if they are identical.

[Figure: two triphone HMMs with transition probabilities t1…t5 and t′1…t′5 — the transition probs are tied (t′i = ti) and the state observation densities are tied]
- More parameter tying: Tying variances of all Gaussians within a state,
tying variances of all Gaussians in all states, tying individual Gaussians, etc.
- 1. Tied Mixture Models
  - All states share the same Gaussians (i.e., the same means and covariances).
  - Mixture weights are specific to each state.

[Figure: triphone HMMs with no sharing vs. tied mixture models]
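A tied mixture observation likelihood can be sketched as follows: every state evaluates the same shared Gaussian codebook once, then applies its own weights. This is a minimal sketch with diagonal covariances; the names are illustrative:

```python
import numpy as np

def tied_mixture_likelihood(x, means, variances, state_weights):
    """b_j(x) for all states j under a tied mixture model.

    means, variances : (M, D) shared Gaussian codebook (diagonal covariances)
    state_weights    : (J, M) per-state mixture weights c_{jm}
    Returns a length-J vector of observation likelihoods.
    """
    # Evaluate each shared Gaussian once (diagonal-covariance density)
    diff2 = (x - means) ** 2 / variances                       # (M, D)
    log_norm = -0.5 * np.sum(np.log(2 * np.pi * variances), axis=1)
    g = np.exp(log_norm - 0.5 * diff2.sum(axis=1))             # (M,)
    # Each state only differs in how it weights the shared Gaussians
    return state_weights @ g                                   # (J,)
```

The Gaussian evaluations, the dominant cost, are shared across all J states; only the cheap weighted sum is state-specific.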
- 2. State Tying
  - Observation probabilities are shared across states which generate acoustically similar data.

[Figure: triphone HMMs for p/a/k, b/a/g, b/a/k with no sharing vs. with state tying]
Four main steps in building a tied state HMM system:
- 1. Create and train 3-state monophone
HMMs with single Gaussian
- bservation probability densities
- 2. Clone these monophone distributions
to initialise a set of untied triphone
- models. Train them using Baum-
Welch estimation. Transition matrix remains common across all triphones
- f each phone.
- 3. For all triphones derived from the
same monophone, cluster states whose parameters should be tied together.
- 4. Number of mixture components in
each tied state is increased and models re-estimated using BW
Image from: Young et al., “Tree-based state tying for high accuracy acoustic modeling”, ACL-HLT, 1994
Which states should be tied together? Use decision trees.
Decision Trees
Classification using a decision tree begins at the root node: which property is satisfied? Depending on the answer, traverse to different branches.

[Figure: example decision tree classifying vegetables — splits on Shape (oval/leafy/cylindrical), then Taste (sour/neutral) and Color (green/white/purple), with leaves such as tomato, spinach, snakegourd, turnip, radish, and brinjal]
Decision Trees
- Given the data at a node, either declare the node to be a leaf or find another property to split the node into branches.
- Important questions to be addressed for DTs:
  1. How many splits at a node? Chosen by the user.
  2. Which property should be used at a node for splitting? One which decreases the “impurity” of the nodes as much as possible.
  3. When is a node a leaf? Set a threshold on the reduction in impurity.
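The split-selection criterion can be made concrete with a toy sketch. Entropy is used as the impurity measure here as one common choice (the slides do not fix a particular measure):

```python
import math
from collections import Counter

def entropy(labels):
    """Impurity of a node: entropy of its label distribution."""
    if not labels:
        return 0.0
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def impurity_reduction(labels, split):
    """Drop in impurity from splitting `labels` by the boolean mask `split`.

    The property chosen at a node is the one maximising this quantity;
    the node becomes a leaf when no split exceeds the chosen threshold.
    """
    left = [y for y, s in zip(labels, split) if s]
    right = [y for y, s in zip(labels, split) if not s]
    n = len(labels)
    return (entropy(labels)
            - (len(left) / n) * entropy(left)
            - (len(right) / n) * entropy(right))
```

A perfect split of two evenly mixed classes yields a reduction of 1 bit, while a split that leaves both children as mixed as the parent yields 0.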