EE E6820: Speech & Audio Processing & Recognition

Lecture 10: ASR: Sequence Recognition

1. Signal template matching
2. Statistical sequence recognition
3. Acoustic modeling
4. The Hidden Markov Model (HMM)

Dan Ellis <dpwe@ee.columbia.edu>
http://www.ee.columbia.edu/~dpwe/e6820/


Signal template matching

  • Framewise comparison of unknown word and stored templates:

  • distance metric?
  • comparison between templates?
  • constraints?

[Figure: frame-by-frame distance matrix between the test utterance and stored reference templates ONE, TWO, THREE, FOUR, FIVE; both axes in time/frames]


Dynamic Time Warp (DTW)

  • Find lowest-cost constrained path:
  • matrix d(i,j) of distances between input frame f_i and reference frame r_j
  • allowable predecessors & transition costs T_xy

  • Best path via traceback from final state
  • have to store predecessors for (almost) every (i,j)

Lowest cost D(i,j) to reach (i,j) = local match cost d(i,j) + best predecessor (including transition cost):

    D(i,j) = d(i,j) + \min\{\, D(i-1,j) + T_{10},\; D(i,j-1) + T_{01},\; D(i-1,j-1) + T_{11} \,\}
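To make the recursion concrete, here is a minimal Python/NumPy sketch. The Euclidean local distance and the default transition costs T10 = T01 = 1, T11 = 0 are illustrative assumptions; the lecture leaves both open.

```python
import numpy as np

def dtw(test, ref, T10=1.0, T01=1.0, T11=0.0):
    """Minimal DTW between test (I x d) and ref (J x d) feature frames.

    T10/T01/T11 are illustrative transition costs for horizontal,
    vertical, and diagonal moves. Returns the total path cost and
    the warp path recovered by traceback.
    """
    I, J = len(test), len(ref)
    # local match cost d(i,j): Euclidean distance between frame pairs
    d = np.linalg.norm(test[:, None, :] - ref[None, :, :], axis=-1)

    D = np.full((I, J), np.inf)              # lowest cost to reach (i,j)
    pred = np.zeros((I, J, 2), dtype=int)    # predecessor of (almost) every (i,j)
    D[0, 0] = d[0, 0]
    for i in range(I):
        for j in range(J):
            if i == 0 and j == 0:
                continue
            cands = []
            if i > 0:
                cands.append((D[i - 1, j] + T10, (i - 1, j)))
            if j > 0:
                cands.append((D[i, j - 1] + T01, (i, j - 1)))
            if i > 0 and j > 0:
                cands.append((D[i - 1, j - 1] + T11, (i - 1, j - 1)))
            cost, pred[i, j] = min(cands)    # best predecessor incl. transition cost
            D[i, j] = d[i, j] + cost

    path, (i, j) = [], (I - 1, J - 1)        # traceback from the final state
    while (i, j) != (0, 0):
        path.append((i, j))
        i, j = pred[i, j]
    return D[-1, -1], [(0, 0)] + path[::-1]
```

For isolated-word recognition this would be run against each stored template, keeping the lowest-cost match.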


DTW-based recognition

  • Reference templates for each possible word
  • Isolated word:
  • mark endpoints of input word
  • calculate scores through each template (+prune)
  • choose best
  • Continuous speech:
  • one matrix of template slices; special-case constraints at word ends

[Figure: continuous-speech DTW matrix stacking the reference templates ONE, TWO, THREE, FOUR against the input frames]


DTW-based recognition (2)

+ Successfully handles timing variation
+ Able to recognize speech at reasonable cost

  • Distance metric?
  • pseudo-Euclidean space?
  • Warp penalties?
  • How to choose templates?
  • several templates per word?
  • choose ‘most representative’?
  • align and average?

→ need a rigorous foundation...


Outline

1. Signal template matching
2. Statistical sequence recognition
  • state-based modeling
3. Acoustic modeling
4. The Hidden Markov Model (HMM)


Statistical sequence recognition

  • DTW limited because it’s hard to optimize
  • interpretation of distance, transition costs?
  • Need a theoretical foundation: Probability
  • Formulate as MAP choice among models:

    M^* = \arg\max_{M_j} p(M_j \mid X, \Theta)

  • X = observed features
  • M_j = word-sequence models
  • Θ = all current parameters


Statistical formulation (2)

  • Can rearrange via Bayes’ rule (& drop p(X)):

    M^* = \arg\max_{M_j} p(M_j \mid X, \Theta) = \arg\max_{M_j} p(X \mid M_j, \Theta_A) \, p(M_j \mid \Theta_L)

  • p(X | M_j, Θ_A) = likelihood of observations under model
  • p(M_j | Θ_L) = prior probability of model
  • Θ_A = acoustics-related model parameters
  • Θ_L = language-related model parameters
  • Questions:
  • what form of model to use for p(X | M_j, Θ_A)?
  • how to find Θ_A (training)?
  • how to solve for M_j (decoding)?


State-based modeling

  • Assume a discrete-state model for the speech:
  • observations are divided up into time frames
  • model → states → observations

[Figure: model Mj generates a state sequence Qk = q1 q2 q3 q4 q5 q6 ... which in turn generates the observed feature vectors X1..N = x1 x2 x3 x4 x5 x6 ... over time]

  • Probability of observations given model is a sum over all possible state sequences Qk:

    p(X \mid M_j) = \sum_{\text{all } Q_k} p(X_1^N \mid Q_k, M_j) \cdot p(Q_k \mid M_j)

  • How do observations depend on states?
    How do state sequences depend on model?


The speech recognition chain

  • After classification, still have the problem of classifying the sequences of frames:

  • Questions
  • what to use for the acoustic classifier?
  • how to represent ‘model’ sequences?
  • how to score matches?

[Figure: the recognition chain: sound → feature calculation → feature vectors → acoustic classifier (network weights) → phone probabilities → HMM decoder (word models, language model) → phone & word labeling]


Outline

1. Signal template matching
2. Statistical sequence recognition
3. Acoustic modeling
  • defining targets
  • neural networks & Gaussian models
4. The Hidden Markov Model (HMM)


Acoustic Modeling

  • Goal: convert features into probabilities of particular labels, i.e. find

    p(q_n^i \mid X_n)

    over some state set {qi}
  • a conventional statistical classification problem
  • Classifier construction is data-driven
  • assume we can get examples of known good Xs for each of the qis
  • calculate model parameters by a standard training scheme
  • Various classifiers can be used
  • GMMs model the distribution under each state
  • Neural nets directly estimate posteriors
  • Different classifiers have different properties
  • features, labels limit ultimate performance


Defining classifier targets

  • Choice of {qi} can make a big difference
  • must support recognition task
  • must be a practical classification task
  • Hand-labeling is one source...
  • ‘experts’ mark spectrogram boundaries
  • ...Forced alignment is another
  • ‘best guess’ with existing classifiers, given words
  • Result is targets for each training frame:

[Figure: feature vectors over time with per-frame training targets, e.g. the phone labels g, w, eh, n]


Forced alignment

  • Best labeling given existing classifier, constrained by known word sequence

[Figure: forced-alignment loop: feature vectors → existing classifier → phone posterior probabilities → alignment constrained, via the dictionary, to the known word sequence (phones such as "w ... th r iy ...") → per-frame training targets → classifier (re)training]


Gaussian Mixture Models vs. Neural Nets

  • GMMs fit distribution of features under states:
  • separate ‘likelihood’ model for each state qi
  • match any distribution given enough data
  • Neural nets estimate posteriors directly
  • parameters set to discriminate classes
  • Posteriors & likelihoods related by Bayes’ rule:

    Gaussian (mixture-component) likelihood:

    p(x \mid q_k) = \frac{1}{(2\pi)^{d/2} \, |\Sigma_k|^{1/2}} \exp\!\left(-\tfrac{1}{2}(x-\mu_k)^T \Sigma_k^{-1} (x-\mu_k)\right)

    Neural-net posterior (layered nonlinearities F):

    p(q_k \mid x) = F\!\left[\sum_j w_{jk} \cdot F\!\left[\sum_i w_{ij} x_i\right]\right]

    Bayes’ rule linking them:

    p(q_k \mid x) = \frac{p(x \mid q_k) \cdot \Pr(q_k)}{\sum_j p(x \mid q_j) \cdot \Pr(q_j)}
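A small numeric illustration of this Bayes'-rule link may help; the two-state setup below (means, covariances, priors, test vector) is entirely invented, with scipy's multivariate normal standing in for a single mixture component.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Hypothetical 2-state acoustic model: one Gaussian 'likelihood' per state
means  = [np.array([0.0, 0.0]), np.array([2.0, 1.0])]
covs   = [np.eye(2), np.diag([0.5, 1.5])]
priors = np.array([0.7, 0.3])                 # Pr(q_k)

x = np.array([1.0, 0.5])                      # one feature vector

# likelihoods p(x | q_k), as in the Gaussian formula above
lik = np.array([multivariate_normal.pdf(x, mean=m, cov=C)
                for m, C in zip(means, covs)])

# Bayes' rule: posteriors p(q_k | x)
post = lik * priors / np.sum(lik * priors)

# The reverse direction, as used with neural-net outputs: dividing
# posteriors by priors gives 'scaled likelihoods' p(x|q_k) / p(x),
# usable in place of b_i(x) in an HMM since p(x) is path-independent.
scaled_lik = post / priors
print(post, scaled_lik)
```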

Outline

1. Signal template matching
2. Statistical sequence recognition
3. Acoustic classification
4. The Hidden Markov Model (HMM)
  • generative Markov models
  • hidden Markov models
  • model fit likelihood
  • HMM examples


Markov models

  • A (first-order) Markov model is a finite-state system whose behavior depends only on the current state
  • E.g. a generative Markov model with states S, A, B, C, E and transition probabilities p(q_{n+1} | q_n):

    q_n \ q_{n+1}    S     A     B     C     E
    S                0    1.0    0     0     0
    A                0    .8    .1    .1     0
    B                0    .1    .8    .1     0
    C                0    .1    .1    .7    .1
    E                0     0     0     0    1.0

  • e.g. a sampled state sequence:
    S A A A A A A A A B B B B B B B B B C C C C B B B B B B C E
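A short generative sketch of this chain, using the transition table above (the random seed and helper name are arbitrary choices for the example):

```python
import numpy as np

rng = np.random.default_rng(0)                # arbitrary seed
states = ["S", "A", "B", "C", "E"]
# rows = current state q_n, columns = next state q_{n+1}, from the table
P = np.array([[0, 1.0, 0,   0,   0  ],        # S
              [0, 0.8, 0.1, 0.1, 0  ],        # A
              [0, 0.1, 0.8, 0.1, 0  ],        # B
              [0, 0.1, 0.1, 0.7, 0.1],        # C
              [0, 0,   0,   0,   1.0]])       # E

def sample_sequence(max_len=100):
    """One random walk from S; the next state depends only on the current one."""
    seq, q = ["S"], 0
    while seq[-1] != "E" and len(seq) < max_len:
        q = rng.choice(5, p=P[q])
        seq.append(states[q])
    return seq

print(" ".join(sample_sequence()))
# one possible output: S A A B B B C C E
```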


Hidden Markov models

  • Markov models where the state sequence Q = {qn} is not directly observable (= ‘hidden’)
  • But, observations X do depend on Q:
  • xn is a random variable that depends on the current state: p(x | q)
  • can still tell something about the state sequence...

[Figure: a hidden state sequence (AAAAAAAABBBBBBBBBBBCCCCBBBBBBBC), the per-state emission distributions p(x|q) for q = A, B, C, and the resulting observation sequence xn over time steps n]


(Generative) Markov models (2)

  • HMM is specified by:
  • states qi
  • transition probabilities: p(q_n^j \mid q_{n-1}^i) \equiv a_{ij}
  • (initial state probabilities: p(q_1^i) \equiv \pi_i)
  • emission distributions: p(x \mid q^i) \equiv b_i(x)

[Figure: example left-to-right model over the states “k a t”, with initial distribution (1.0, 0.0, 0.0, 0.0), transition rows (0.9 0.1 0.0 0.0), (0.0 0.9 0.1 0.0), (0.0 0.0 0.9 0.1), and an emission distribution p(x|q) sketched for each state]

Markov models for speech

  • Speech models Mj:
  • typically left-to-right HMMs (sequence constraint)
  • observation & evolution are conditionally independent of the rest given the (hidden) state qn
  • self-loops for time dilation

[Figure: left-to-right HMM S → ae1 → ae2 → ae3 → E with self-loops, and the corresponding graphical model q1 → q2 → ... → q5 with each qn emitting xn]


Markov models for sequence recognition

  • Independence of observations:
  • observation xn depends only on the current state qn

    p(X \mid Q) = p(x_1, x_2, \ldots, x_N \mid q_1, q_2, \ldots, q_N)
                = p(x_1 \mid q_1) \cdot p(x_2 \mid q_2) \cdots p(x_N \mid q_N)
                = \prod_{n=1}^{N} p(x_n \mid q_n) = \prod_{n=1}^{N} b_{q_n}(x_n)

  • Markov transitions:
  • transition to the next state qn+1 depends only on qn

    p(Q \mid M) = p(q_1, q_2, \ldots, q_N \mid M)
                = p(q_N \mid q_1 \ldots q_{N-1}) \, p(q_{N-1} \mid q_1 \ldots q_{N-2}) \cdots p(q_2 \mid q_1) \, p(q_1)
                = p(q_N \mid q_{N-1}) \, p(q_{N-1} \mid q_{N-2}) \cdots p(q_2 \mid q_1) \, p(q_1)
                = p(q_1) \prod_{n=2}^{N} p(q_n \mid q_{n-1}) = \pi_{q_1} \prod_{n=2}^{N} a_{q_{n-1} q_n}
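Both factorizations are straightforward to evaluate for any single path; the toy discrete-emission HMM below is invented purely to exercise the formulas.

```python
import numpy as np

# A toy HMM: 2 emitting states, discrete emissions over 3 symbols
pi = np.array([0.9, 0.1])                 # initial state probs pi_i
a  = np.array([[0.7, 0.3],                # transition probs a_ij
               [0.0, 1.0]])
b  = np.array([[0.6, 0.3, 0.1],           # emission probs b_i(x)
               [0.1, 0.3, 0.6]])

Q = [0, 0, 1, 1]      # one particular state sequence
X = [0, 1, 2, 2]      # one particular observation sequence

# p(X | Q) = prod_n b_{q_n}(x_n)
p_X_given_Q = np.prod([b[q, x] for q, x in zip(Q, X)])

# p(Q | M) = pi_{q_1} * prod_n a_{q_{n-1} q_n}
p_Q_given_M = pi[Q[0]] * np.prod([a[q0, q1] for q0, q1 in zip(Q[:-1], Q[1:])])

print(p_X_given_Q * p_Q_given_M)   # joint p(X, Q | M) for this single path
```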


Model fit calculation

  • From ‘state-based modeling’:

    p(X \mid M_j) = \sum_{\text{all } Q_k} p(X_1^N \mid Q_k, M_j) \cdot p(Q_k \mid M_j)

  • For HMMs:

    p(X \mid Q) = \prod_{n=1}^{N} b_{q_n}(x_n), \qquad p(Q \mid M) = \pi_{q_1} \prod_{n=2}^{N} a_{q_{n-1} q_n}

  • Hence, solve for M*:
  • calculate p(X | Mj) · p(Mj) for each available model, scale by the prior → p(Mj | X)
  • Sum over all Qk ???


Summing over all paths

All possible 3-emission paths Qk from S to E, with p(Q | M) = Πn p(qn|qn-1) and p(X | Q,M) = Πn p(xn|qn).

Model M1: transitions S→A 0.9, S→B 0.1; A→A 0.7, A→B 0.2, A→E 0.1; B→B 0.8, B→E 0.2

Observation likelihoods p(x|q) for the three frames:

           x1      x2      x3
    A     2.5     0.2     0.1
    B     0.1     2.2     2.3

    Path Qk      p(Qk | M)                       p(X | Qk, M)                 p(X, Qk | M)
    S A A A E   .9 × .7 × .7 × .1 = 0.0441      2.5 × 0.2 × 0.1  = 0.05      0.0022
    S A A B E   .9 × .7 × .2 × .2 = 0.0252      2.5 × 0.2 × 2.3  = 1.15      0.0290
    S A B B E   .9 × .2 × .8 × .2 = 0.0288      2.5 × 2.2 × 2.3  = 12.65     0.3643
    S B B B E   .1 × .8 × .8 × .2 = 0.0128      0.1 × 2.2 × 2.3  = 0.506     0.0065
                Σ = 0.1109                                         Σ = p(X | M) = 0.4020
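The sum is small enough here to check by brute force; this sketch reproduces the table's numbers from the model and likelihoods above.

```python
import numpy as np
from itertools import product

# Model M1 from the worked example
trans = {('S','A'): 0.9, ('S','B'): 0.1,
         ('A','A'): 0.7, ('A','B'): 0.2, ('A','E'): 0.1,
         ('B','B'): 0.8, ('B','E'): 0.2}
obs_lik = {'A': [2.5, 0.2, 0.1],    # p(x1|A), p(x2|A), p(x3|A)
           'B': [0.1, 2.2, 2.3]}    # p(x1|B), p(x2|B), p(x3|B)

total = 0.0
for states in product('AB', repeat=3):            # all 3-emission paths
    path = ('S',) + states + ('E',)
    p_Q = np.prod([trans.get(t, 0.0) for t in zip(path[:-1], path[1:])])
    p_X_given_Q = np.prod([obs_lik[q][n] for n, q in enumerate(states)])
    if p_Q > 0:                                   # print only the legal paths
        print(' '.join(path), f"p(Q|M)={p_Q:.4f}",
              f"p(X,Q|M)={p_Q * p_X_given_Q:.4f}")
    total += p_Q * p_X_given_Q

print(f"p(X|M) = {total:.4f}")   # 0.4020, matching the slide
```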


The ‘forward recursion’

  • Dynamic-programming-like technique to calculate the sum over all Qk
  • Define αn(i) as the probability of getting to state qi at time step n (by any path):

    \alpha_n(i) = p(x_1, x_2, \ldots, x_n, q_n = q^i) \equiv p(X_1^n, q_n^i)

  • Then αn+1(j) can be calculated recursively:

    \alpha_{n+1}(j) = \left[\sum_{i=1}^{S} \alpha_n(i) \, a_{ij}\right] \cdot b_j(x_{n+1})

[Figure: trellis over time steps n and model states qi, showing αn(i) and αn(i+1) combining via aij and bj(xn+1) into αn+1(j)]


Forward recursion (2)

  • Initialize:

    \alpha_1(i) = \pi_i \cdot b_i(x_1)

  • Then total probability is:

    p(X_1^N \mid M) = \sum_{i=1}^{S} \alpha_N(i)

→ Practical way to solve for p(X | Mj) and hence perform recognition: compute p(X | M1)·p(M1), p(X | M2)·p(M2), ... for the observations X, then choose the best.
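A sketch of the recursion applied to the same toy model M1 from the path-summing slide; representing the transition into the non-emitting end state E as an explicit exit-probability vector is an implementation choice, not part of the lecture.

```python
import numpy as np

# Forward recursion on toy model M1 (emitting states A, B)
pi = np.array([0.9, 0.1])            # p(q1): S -> A, S -> B
a  = np.array([[0.7, 0.2],           # a_ij between emitting states
               [0.0, 0.8]])
a_exit = np.array([0.1, 0.2])        # p(E | A), p(E | B)
B = np.array([[2.5, 0.2, 0.1],       # b_A(x1), b_A(x2), b_A(x3)
              [0.1, 2.2, 2.3]])      # b_B(x1), b_B(x2), b_B(x3)

alpha = pi * B[:, 0]                 # alpha_1(i) = pi_i * b_i(x1)
for n in range(1, B.shape[1]):
    # alpha_{n+1}(j) = [sum_i alpha_n(i) a_ij] * b_j(x_{n+1})
    alpha = (alpha @ a) * B[:, n]

print(np.dot(alpha, a_exit))  # p(X|M) = 0.4020, matching the path-sum slide
```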


Optimal path

  • May be interested in actual qn assignments
  • which state was ‘active’ at each time frame
  • e.g. phone labelling (for training?)
  • Total probability is over all paths...
  • ... but can also solve for single best path

= “Viterbi” state sequence

  • Probability along best path to state qj at time n+1:

    \alpha^*_{n+1}(j) = \max_i \{ \alpha^*_n(i) \, a_{ij} \} \cdot b_j(x_{n+1})

  • backtrack from final state to get best path
  • final probability is a product only (no sum)

→ log-domain calculation is just summation

  • Total probability is often dominated by the best path:

    p(X, Q^* \mid M) \approx p(X \mid M)
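The corresponding best-path computation, in the log domain so the products become sums; again this reuses the toy model, and the 1e-30 entry is just a stand-in for the impossible B→A transition, keeping the logs finite.

```python
import numpy as np

# Viterbi on the same toy model: best single path instead of the sum
log_pi   = np.log([0.9, 0.1])
log_a    = np.log(np.array([[0.7, 0.2],
                            [1e-30, 0.8]]))   # tiny value stands in for p=0
log_exit = np.log([0.1, 0.2])
log_B    = np.log([[2.5, 0.2, 0.1],
                   [0.1, 2.2, 2.3]])

N, S = 3, 2
delta = log_pi + log_B[:, 0]                  # best log-prob ending in each state
back = np.zeros((N, S), dtype=int)            # best predecessors for traceback
for n in range(1, N):
    scores = delta[:, None] + log_a           # scores[i, j] = delta(i) + log a_ij
    back[n] = np.argmax(scores, axis=0)
    delta = scores[back[n], np.arange(S)] + log_B[:, n]

final = delta + log_exit
q = [int(np.argmax(final))]
for n in range(N - 1, 0, -1):                 # backtrack from the final state
    q.append(back[n][q[-1]])
states = ['A', 'B']
print([states[i] for i in reversed(q)], np.exp(final.max()))
# best path A B B with p(X,Q*|M) = 0.3643, dominating p(X|M) = 0.4020
```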


Interpreting the Viterbi path

  • Viterbi path assigns each xn to a state qi
  • performing classification based on bi(x)
  • ... at the same time as applying transition constraints aij
  • Can be used for segmentation
  • train an HMM with ‘garbage’ and ‘target’ states
  • decode on new data to find ‘targets’, boundaries
  • Can use for (heuristic) training
  • e.g. train classifiers based on labels...

[Figure: observation sequence xn with Viterbi labels AAAAAAAABBBBBBBBBBBCCCCBBBBBBBC shown as the inferred classification]


Recognition with HMMs

  • Isolated word: choose the best model, since

    p(M \mid X) \propto p(X \mid M) \, p(M)

    i.e. evaluate p(X | M1)·p(M1), p(X | M2)·p(M2), p(X | M3)·p(M3), ... for the input and pick the largest
  • Continuous speech: Viterbi decoding of one large HMM gives the words

[Figure: isolated-word recognition scores the input against each word model M1, M2, M3 separately; continuous recognition joins the word models (phone strings such as “w ah n”, “th r iy”, “t uw”) with silence and model priors p(Mj) into one large HMM]


HMM example: Different state sequences

[Figure: Model M1 is a left-to-right HMM S → K → A → T → E with self-loops (K 0.8, A 0.9, T 0.8) and advance probabilities (K→A 0.2, A→T 0.1, T→E 0.2); Model M2 is identical but with state O in place of A; a final panel shows the emission distributions p(x|K), p(x|T), p(x|A), p(x|O) over x]


Model inference: Emission probabilities

[Figure: an observation sequence xn over 18 time steps, decoded by Model M1 (K A T) and Model M2 (K O T); each panel shows the Viterbi state alignment, the log transition probability, and the log observation likelihood per frame]

    Model M1 (K A T):  log p(X,Q* | M) = -33.5,  log p(X | M) = -32.1,
                       log p(Q* | M)  = -7.5,    log p(X | Q*,M) = -26.0
    Model M2 (K O T):  log p(X,Q* | M) = -47.5,  log p(X | M) = -47.0,
                       log p(Q* | M)  = -8.3,    log p(X | Q*,M) = -39.2

→ M1 fits these observations far better, and the difference comes mostly from the observation likelihoods (-26.0 vs -39.2).


Model inference: Transition probabilities

[Figure: Models M'1 and M'2 both allow K → {A or O} → T and differ only in transition probabilities (M'1: K→A 0.18, K→O 0.02; M'2: K→A 0.05, K→O 0.15; self-loops K 0.8, A/O 0.9, T 0.8; T→E 0.2), decoded against the same 18-step observation sequence]

    Model M'1:  log p(X,Q* | M) = -33.6,  log p(X | M) = -32.2,
                log p(Q* | M)  = -7.6,    log p(X | Q*,M) = -26.0
    Model M'2:  log p(X,Q* | M) = -34.9,  log p(X | M) = -33.5,
                log p(Q* | M)  = -8.9,    log p(X | Q*,M) = -26.0

→ both models pick the same alignment (the same -26.0 observation term), so their scores differ only through the transition probabilities.


Validity of HMM assumptions

  • Key assumption is conditional independence:
    given qi, future evolution & observation distribution are independent of previous events
  • duration behavior: self-loops imply an exponential (geometric) duration distribution: with self-loop probability γ, p(N = n) = γ^{n-1}(1 - γ)
  • independence of successive xns? within a state, p(X) = Πn p(xn | qi), i.e. successive frames are treated as independent draws
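To spell out the duration claim, a short derivation (a sketch; γ denotes the self-loop probability shown in the slide's p(N = n) plot):

```latex
% Staying in state q_i for exactly n frames = taking the self-loop
% (prob. \gamma) n-1 times, then leaving (prob. 1-\gamma):
\begin{align*}
  p(N = n) &= \gamma^{\,n-1}(1-\gamma), \qquad n = 1, 2, \ldots \\
  E[N]     &= \sum_{n \ge 1} n \, \gamma^{\,n-1}(1-\gamma) = \frac{1}{1-\gamma}
\end{align*}
```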


Recap: Recognizer Structure

  • Know how to execute each stage
  • ... training HMMs?
  • ... language/word models?

[Figure: the recognition chain again: sound → feature calculation → feature vectors → acoustic classifier (network weights) → phone probabilities → HMM decoder (word models, language model) → phone & word labeling]


Summary

  • Speech is modeled as a sequence of features
  • need temporal aspect to recognition
  • best time-alignment of templates = DTW
  • Hidden Markov models are a rigorous solution
  • self-loops allow temporal dilation
  • exact, efficient likelihood calculations