

SLIDE 1

Automatic Speech Recognition (CS753)

Lecture 7: Hidden Markov Models (Part III)

Instructor: Preethi Jyothi
Aug 14, 2017

SLIDE 2

Recap: Learning HMM Parameters

Problem 1 (Likelihood): Given an HMM λ = (A, B) and an observation sequence O, determine the likelihood P(O|λ).
Problem 2 (Decoding): Given an observation sequence O and an HMM λ = (A, B), discover the best hidden state sequence Q.
Problem 3 (Learning): Given an observation sequence O and the set of states in the HMM, learn the HMM parameters A and B.

Standard algorithm for HMM training: Forward-backward or Baum-Welch algorithm

SLIDE 3

Baum-Welch: In summary

[Every EM Iteration]
Compute θ = { Ajk, (µjm, Σjm, cjm) } for all j, k, m

µjm = [ Σ_{i=1..N} Σ_{t=1..Ti} γi,t(j, m) xit ] / [ Σ_{i=1..N} Σ_{t=1..Ti} γi,t(j, m) ]

Σjm = [ Σ_{i=1..N} Σ_{t=1..Ti} γi,t(j, m) (xit − µjm)(xit − µjm)ᵀ ] / [ Σ_{i=1..N} Σ_{t=1..Ti} γi,t(j, m) ]

cjm = [ Σ_{i=1..N} Σ_{t=1..Ti} γi,t(j, m) ] / [ Σ_{i=1..N} Σ_{t=1..Ti} γi,t(j) ]

Aj,k = [ Σ_{i=1..N} Σ_{t=2..Ti} ξi,t(j, k) ] / [ Σ_{i=1..N} Σ_{t=2..Ti} Σ_{k′} ξi,t(j, k′) ]
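A minimal NumPy sketch of these M-step updates, assuming the E-step posteriors have already been collected per training utterance; the array names and shapes are my own illustration, not from the lecture:

import numpy as np

# M-step updates for a GMM-HMM, given per-utterance E-step posteriors:
#   gamma[i]: array (T_i, J, M), gamma[i][t, j, m] = γi,t(j, m)
#   xi[i]:    array (T_i - 1, J, J), xi[i][t, j, k] = ξi,t(j, k)
#   x[i]:     array (T_i, d) of acoustic feature vectors
def m_step(gamma, xi, x, J, M, d):
    occ = np.zeros((J, M))                        # Σi Σt γi,t(j, m)
    num_mu = np.zeros((J, M, d))
    for g, feats in zip(gamma, x):
        occ += g.sum(axis=0)
        num_mu += np.einsum('tjm,td->jmd', g, feats)
    mu = num_mu / occ[..., None]                  # µjm
    num_sig = np.zeros((J, M, d, d))
    for g, feats in zip(gamma, x):
        diff = feats[:, None, None, :] - mu       # (T, J, M, d)
        num_sig += np.einsum('tjm,tjmd,tjme->jmde', g, diff, diff)
    Sigma = num_sig / occ[..., None, None]        # Σjm
    c = occ / occ.sum(axis=1, keepdims=True)      # cjm: normalize over mixtures
    num_A = sum(s.sum(axis=0) for s in xi)        # Σi Σt ξi,t(j, k)
    A = num_A / num_A.sum(axis=1, keepdims=True)  # Ajk
    return A, mu, Sigma, c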

How do we efficiently compute γt(j) and ξt(i, j)?

SLIDE 4

Forward/Backward Probabilities

Require two probabilities to compute estimates for the transition and observation probabilities

  • 1. Forward probability (recall): αt(j) = P(o1, o2 … ot, qt = j | λ)
  • 2. Backward probability: βt(i) = P(ot+1, ot+2 … oT | qt = i, λ)

SLIDE 5

Backward probability

  • 1. Initialization:

βT(i) = aiF,  1 ≤ i ≤ N

  • 2. Recursion (again since states 0 and qF are non-emitting):

βt(i) = Σ_{j=1..N} aij bj(ot+1) βt+1(j),  1 ≤ i ≤ N, 1 ≤ t < T

  • 3. Termination:

P(O|λ) = αT(qF) = β1(q0) = Σ_{j=1..N} a0j bj(o1) β1(j)
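A small NumPy sketch of this backward pass for a discrete-output HMM, with the non-emitting start and final states represented as probability vectors pi (the a0j) and a_F (the aiF); the names are illustrative:

import numpy as np

# Backward pass in the convention above:
#   A: (N, N) transitions among emitting states, B: (N, V) emission probs,
#   pi: (N,) start probs a0j, a_F: (N,) final probs aiF, obs: symbol indices.
def backward(A, B, a_F, pi, obs):
    T, N = len(obs), A.shape[0]
    beta = np.zeros((T, N))
    beta[T - 1] = a_F                                   # 1. initialization
    for t in range(T - 2, -1, -1):                      # 2. recursion
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    P_O = pi @ (B[:, obs[0]] * beta[0])                 # 3. termination
    return beta, P_O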

SLIDE 6
1. Baum-Welch: Estimating aij

To estimate aij, we first need ξt(i, j), the probability of being in state i at time t and state j at time t+1:

ξt(i, j) = P(qt = i, qt+1 = j | O, λ)

which works out to be

ξt(i, j) = αt(i) aij bj(ot+1) βt+1(j) / αT(qF)

[Trellis diagram: αt(i) covers the path up to state si at time t; the arc aij bj(ot+1) crosses to state sj; βt+1(j) covers the path from sj onwards.]

Then,

âij = Σ_{t=1..T−1} ξt(i, j) / [ Σ_{t=1..T−1} Σ_{k=1..N} ξt(i, k) ]
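Continuing the sketch above, ξ and the transition re-estimate, given forward probabilities alpha (from the analogous forward pass) and beta, P_O from backward():

def xi_posteriors(A, B, alpha, beta, obs, P_O):
    # xi[t, i, j] = αt(i) aij bj(o_{t+1}) βt+1(j) / P(O|λ)
    T, N = len(obs), A.shape[0]
    xi = np.zeros((T - 1, N, N))
    for t in range(T - 1):
        xi[t] = alpha[t][:, None] * A * (B[:, obs[t + 1]] * beta[t + 1])
    return xi / P_O

def reestimate_A(xi):
    # âij = Σt ξt(i, j) / Σt Σk ξt(i, k)
    num = xi.sum(axis=0)
    return num / num.sum(axis=1, keepdims=True)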

SLIDE 7

2. Baum-Welch: Estimating bj(ot)

To estimate bj(ot), we also need γt(j), the probability of being in state j at time t:

γt(j) = P(qt = j | O, λ)

which works out to be

γt(j) = αt(j) βt(j) / P(O|λ)

[Trellis diagram: αt(j) covers the path up to state sj at time t; βt(j) covers the path from sj onwards.]

Then, for discrete outputs,

b̂j(vk) = Σ_{t=1..T s.t. ot=vk} γt(j) / Σ_{t=1..T} γt(j)
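And the matching sketch for γ and the discrete emission re-estimate, continuing from the helpers above:

def gamma_posteriors(alpha, beta, P_O):
    # γt(j) = αt(j) βt(j) / P(O|λ)
    return alpha * beta / P_O

def reestimate_B(gamma, obs, V):
    # b̂j(vk): γt(j) accumulated only at times t where ot = vk, then normalized
    N = gamma.shape[1]
    B = np.zeros((N, V))
    for t, o in enumerate(obs):
        B[:, o] += gamma[t]
    return B / gamma.sum(axis=0)[:, None]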

SLIDE 8

Baum-Welch algorithm (pseudocode)

function FORWARD-BACKWARD(observations of len T, output vocabulary V, hidden state set Q) returns HMM = (A, B)
  initialize A and B
  iterate until convergence
    E-step
      γt(j) = αt(j) βt(j) / αT(qF)  ∀ t and j
      ξt(i, j) = αt(i) aij bj(ot+1) βt+1(j) / αT(qF)  ∀ t, i, and j
    M-step
      âij = Σ_{t=1..T−1} ξt(i, j) / [ Σ_{t=1..T−1} Σ_{k=1..N} ξt(i, k) ]
      b̂j(vk) = Σ_{t=1..T s.t. Ot=vk} γt(j) / Σ_{t=1..T} γt(j)
  return A, B
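Putting the pieces together, a compact sketch of the whole loop for a discrete-output HMM, reusing the helper sketches from the previous slides; random initialization and a fixed iteration count stand in for a real initialization scheme and convergence test, and π and the final-state probabilities are held fixed here for brevity:

import numpy as np

def forward(A, B, a_F, pi, obs):
    # Mirror image of backward(): αt(j) = Σi αt-1(i) aij bj(ot)
    T, N = len(obs), A.shape[0]
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    return alpha, alpha[-1] @ a_F                  # P(O|λ) = αT(qF)

def baum_welch(obs, N, V, n_iter=20, seed=0):
    rng = np.random.default_rng(seed)
    A = rng.random((N, N)); A /= A.sum(axis=1, keepdims=True)
    B = rng.random((N, V)); B /= B.sum(axis=1, keepdims=True)
    pi = np.full(N, 1.0 / N); a_F = np.full(N, 1.0 / N)
    for _ in range(n_iter):
        alpha, P_O = forward(A, B, a_F, pi, obs)   # E-step
        beta, _ = backward(A, B, a_F, pi, obs)
        gamma = gamma_posteriors(alpha, beta, P_O)
        xi = xi_posteriors(A, B, alpha, beta, obs, P_O)
        A = reestimate_A(xi)                       # M-step
        B = reestimate_B(gamma, obs, V)
    return A, B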

SLIDE 9

ASR Framework: Acoustic Models

[Pipeline diagram: acoustic features → acoustic models (H) → triphones → context transducer → monophones → pronunciation model → words → language model → word sequence]

  • Acoustic models are estimated using training data: {xi, yi}, i=1…N


where xi corresponds to a sequence of acoustic feature vectors
 and yi corresponds to a sequence of words

“Hello world” → “sil hh ah l ow w er l d sil” → “sil sil/hh/ah hh/ah/l ah/l/ow l/ow/w ow/w/er w/er/l er/l/d l/d/sil sil”

  • For each xi, yi, a composite HMM is constructed using the HMMs that correspond to the triphone sequence in yi (expansion sketched below)
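As a small illustration of the monophone-to-triphone expansion above, a hypothetical helper; the function and its treatment of sil boundaries are my own sketch, not the lecture's code:

def to_triphones(phones):
    # Expand a phone sequence into left/phone/right triphone labels,
    # leaving the boundary sil symbols unexpanded (as on the slide).
    # Assumes every non-sil phone has neighbours on both sides.
    out = []
    for i, p in enumerate(phones):
        if p == "sil":
            out.append(p)
        else:
            out.append(f"{phones[i - 1]}/{p}/{phones[i + 1]}")
    return out

phones = "sil hh ah l ow w er l d sil".split()
print(" ".join(to_triphones(phones)))
# sil sil/hh/ah hh/ah/l ah/l/ow l/ow/w ow/w/er w/er/l er/l/d l/d/sil sil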

SLIDE 10

ASR Framework: Acoustic Models

[Pipeline diagram, repeated: acoustic features → acoustic models (H) → triphones → context transducer → monophones → pronunciation model → words → language model → word sequence]

  • Acoustic models are estimated using training data: {xi, yi}, i=1…N


where xi corresponds to a sequence of acoustic feature vectors
 and yi corresponds to a sequence of words

  • For each xi, yi, a composite HMM is constructed using the HMMs that correspond to the triphone sequence in yi

  • Parameters of these composite HMMs are the parameters of the constituent triphone HMMs

  • These parameters are fit to the acoustic data {xi}, i=1…N using the Baum-Welch algorithm (EM)

SLIDE 11

Triphone HMM Models

  • Each phone is modelled in the context of its left and right neighbour phones

  • Pronunciation of a phone is influenced by the preceding and succeeding phones, e.g. the [p] in “peek” (p iy k) vs. the [p] in “pool” (p uw l)

  • Number of triphones that appear in data ≈ 1000s or 10,000s

  • If each triphone HMM has 3 states and each state generates m-component GMMs (m ≈ 64) over d-dimensional acoustic feature vectors (d ≈ 40), with each Σ having d² parameters:

  • Hundreds of millions of parameters! (see the quick check below)

  • Insufficient data to learn all triphone models reliably. What do we do?

Share parameters across triphone models!
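A quick back-of-the-envelope check of that count, using the slide's rough figures (1000s of triphones, m ≈ 64, d ≈ 40):

# Parameters per Gaussian component: mean (d) + full covariance (d^2) + weight (1)
n_triphones, n_states, m, d = 1000, 3, 64, 40
per_component = d + d * d + 1                   # 1641
total = n_triphones * n_states * m * per_component
print(f"{total:,}")                             # 315,072,000 (hundreds of millions)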

SLIDE 12

Parameter Sharing

  • Sharing of parameters (also referred to as “parameter tying”) can be done at any level

  • Parameters in HMMs corresponding to two triphones are said to be tied if they are identical

[Diagram: two triphone HMMs with states t1…t5 and t′1…t′5; their transition probs are tied (t′i = ti) and their state observation densities are tied]

  • More parameter tying: tying variances of all Gaussians within a state, tying variances of all Gaussians in all states, tying individual Gaussians, etc. (a reference-sharing sketch follows below)
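In code, tying is often realized by having tied models literally share the same parameter objects, so a single Baum-Welch update touches every HMM tied to them; a minimal sketch (the class and layout are my own, not from the lecture):

import numpy as np

class TriphoneHMM:
    def __init__(self, trans, state_pdfs):
        self.trans = trans              # transition matrix (possibly shared)
        self.state_pdfs = state_pdfs    # one observation density per state

def new_pdf(d=40):
    return {"mean": np.zeros(d), "var": np.ones(d)}

# Two triphones with tied transition probs (t'i = ti) and one tied density:
shared_trans = np.array([[0.6, 0.4, 0.0],
                         [0.0, 0.7, 0.3],
                         [0.0, 0.0, 1.0]])
shared_pdf = new_pdf()
hmm_a = TriphoneHMM(shared_trans, [shared_pdf, new_pdf(), new_pdf()])
hmm_b = TriphoneHMM(shared_trans, [shared_pdf, new_pdf(), new_pdf()])
assert hmm_a.trans is hmm_b.trans                    # updating one updates both
assert hmm_a.state_pdfs[0] is hmm_b.state_pdfs[0]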

SLIDE 13
1. Tied Mixture Models

  • All states share the same Gaussians (i.e. same means and covariances)

  • Mixture weights are specific to each state (see the sketch below)

[Diagram: triphone HMMs with no sharing vs. tied mixture models drawing on a shared pool of Gaussians]
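A sketch of how a tied-mixture state likelihood factors: the shared Gaussian pool is scored once per frame, and states differ only in their mixture weights (names are illustrative; scipy provides the Gaussian density):

import numpy as np
from scipy.stats import multivariate_normal

def tied_mixture_logliks(x, shared_means, shared_covs, state_weights):
    # Score the shared Gaussian pool once for this frame...
    comp = np.array([multivariate_normal.pdf(x, mean=m, cov=c)
                     for m, c in zip(shared_means, shared_covs)])
    # ...then each state contributes only its own mixture weights cjm.
    return np.log(state_weights @ comp)     # one log-likelihood per state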

SLIDE 14
2. State Tying

  • Observation probabilities are shared across states which generate acoustically similar data (a toy tying map follows below)

[Diagram: triphone HMMs for p/a/k, b/a/g, and b/a/k, shown with no sharing vs. with acoustically similar states tied]
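In code, state tying often reduces to a lookup from (triphone, state index) to a shared density ID (a “senone”); a toy sketch with a hand-made map, whereas real systems derive the map from the decision trees discussed next:

# Hypothetical tied-state map: the middle states of p/a/k and b/a/k share a
# density, since both have the same central phone [a] with right context [k].
tied_states = {
    ("p/a/k", 1): "senone_42",
    ("b/a/k", 1): "senone_42",   # tied with p/a/k, state 1
    ("b/a/g", 1): "senone_17",
}

def density_id(triphone, state):
    return tied_states[(triphone, state)]

assert density_id("p/a/k", 1) == density_id("b/a/k", 1)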

SLIDE 15

Tied state HMMs

Four main steps in building a tied state HMM system:

  • 1. Create and train 3-state monophone HMMs with single-Gaussian observation probability densities

  • 2. Clone these monophone distributions to initialise a set of untied triphone models. Train them using Baum-Welch estimation. The transition matrix remains common across all triphones of each phone

  • 3. For all triphones derived from the same monophone, cluster states whose parameters should be tied together

  • 4. The number of mixture components in each tied state is increased and the models are re-estimated using Baum-Welch

Image from: Young et al., “Tree-based state tying for high accuracy acoustic modeling”, ACL-HLT, 1994

SLIDE 16

Tied state HMMs


Which states should be tied together? Use decision trees.

SLIDE 17

Decision Trees

Classification using a decision tree begins at the root node: is some property satisfied? Depending on the answer, traverse to different branches, until a leaf is reached.

[Example decision tree for vegetables. Root: Shape? Oval → Taste? (sour → tomato; neutral → Color? green → snakegourd, white → turnip). Leafy → spinach. Cylindrical → Color? (white → radish; purple → brinjal).]

SLIDE 18

Decision Trees

  • Given the data at a node, either declare the node to be a leaf or find another property to split the node into branches

  • Important questions to be addressed for DTs:

  • 1. How many splits at a node? Chosen by the user.

  • 2. Which property should be used at a node for splitting? One which decreases the “impurity” of the nodes as much as possible (see the sketch after this list).

  • 3. When is a node a leaf? Set a threshold on the reduction in impurity.
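A small sketch of the impurity-reduction criterion, using entropy as the impurity measure (any concave impurity, e.g. Gini, works the same way):

import numpy as np

def entropy(labels):
    # Impurity of a node, as the entropy of its label distribution.
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def impurity_reduction(labels, split_mask):
    # Parent impurity minus size-weighted child impurities.
    # Assumes the split sends at least one item each way.
    left, right = labels[split_mask], labels[~split_mask]
    w = len(left) / len(labels)
    return entropy(labels) - (w * entropy(left) + (1 - w) * entropy(right))

labels = np.array(["a", "a", "a", "b", "b", "b"])
mask = np.array([1, 1, 1, 0, 0, 0], dtype=bool)
print(impurity_reduction(labels, mask))   # 1.0: a perfect split on this data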

SLIDE 19

Tied state HMMs


Which states should be tied together? Use decision trees.