Robustness Techniques for Speech Recognition (Berlin Chen, 2004) - PowerPoint presentation transcript

SLIDE 1

Robustness Techniques for Speech Recognition

Berlin Chen, 2004

References:

  • 1. X. Huang et al. Spoken Language Processing (2001). Chapter 10
  • 2. J. C. Junqua and J. P. Haton. Robustness in Automatic Speech Recognition (1996), Chapters 5, 8-9
  • 3. T. F. Quatieri, Discrete-Time Speech Signal Processing (2002), Chapter 13
SLIDE 2

2004 Speech - Berlin Chen 2

Introduction

  • Classification of Speech Variability in Five Categories

  – Linguistic variability
  – Intra-speaker variability
  – Inter-speaker variability
  – Variability caused by the context
  – Variability caused by the environment

  • Corresponding modeling techniques (from the slide diagram): robustness enhancement, speaker independency, speaker adaptation, speaker dependency, context-dependent acoustic modeling, pronunciation-variation modeling

SLIDE 3

Introduction (cont.)

  • The Diagram for Speech Recognition
  • Importance of the robustness in speech recognition

  – Speech recognition systems must operate in situations with uncontrollable acoustic environments
  – The recognition performance is often degraded by the mismatch between the training and testing conditions
      • Varying environmental noises, different speaker characteristics (sex, age, dialects), different speaking modes (stylistic, Lombard effect), etc.

  (Diagram: speech signal → acoustic processing (feature extraction, likelihood computation with the acoustic model) → linguistic processing (linguistic network decoding with the lexicon and language model) → recognition results)

SLIDE 4

Introduction (cont.)

  • If a speech recognition system’s accuracy does not degrade very much under mismatch conditions, the system is called robust

  – ASR performance is rather uniform for SNRs greater than 25 dB, but degrades very steeply as the noise level increases

      SNR = 10 log₁₀(E_S/E_N) dB;  SNR = 25 dB ⇒ E_S/E_N = 10^2.5 ≈ 316

  • Various noises exist in real-world environments
  – periodic, impulsive, or wide/narrow band
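The dB arithmetic in the example above can be checked with a couple of lines of Python (a minimal sketch; the 25 dB figure is the slide's example):

```python
import math

# SNR in dB for a signal-to-noise energy ratio Es/En:
#   SNR_dB = 10 * log10(Es / En)
# The slide's example: SNR = 25 dB corresponds to Es/En = 10^2.5, about 316.
ratio = 10 ** (25 / 10)          # invert the dB formula
snr_db = 10 * math.log10(ratio)  # back to dB as a consistency check

print(round(ratio))
print(snr_db)
```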

SLIDE 5

Introduction (cont.)

  • Therefore, several possible robustness approaches have been developed to enhance the speech signal, its spectrum, and the acoustic models as well

  – Environment compensation processing (feature-based)
  – Environment model adaptation (model-based)
  – Inherently robust acoustic features (both model- and feature-based)
      • Discriminative acoustic features
SLIDE 6

The Noise Types

  • A model of the environment:

      x[m] = s[m] * h[m] + n[m]                  (time domain; * denotes convolution)
      ⇔ X(ω) = S(ω) H(ω) + N(ω)                  (spectrum)

  • Power spectrum:

      |X(ω)|² = |S(ω) H(ω) + N(ω)|²
              = |S(ω)|² |H(ω)|² + |N(ω)|² + 2 Re{ S(ω) H(ω) N*(ω) }
              = |S(ω)|² |H(ω)|² + |N(ω)|² + 2 |S(ω)| |H(ω)| |N(ω)| cos θ(ω)
              ≈ |S(ω)|² |H(ω)|² + |N(ω)|²

      P_X(ω) ≈ P_S(ω) |H(ω)|² + P_N(ω)
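The environment model above can be simulated directly; the sketch below uses toy random signals in place of real speech (the signal lengths and channel taps are arbitrary assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# x[m] = s[m] * h[m] + n[m]: clean signal s, channel impulse response h,
# additive noise n. (Toy signals; a real system would use speech frames.)
s = rng.standard_normal(512)                     # stand-in for clean speech
h = np.array([1.0, 0.5, 0.25])                   # short channel impulse response
n = 0.1 * rng.standard_normal(512 + len(h) - 1)  # additive noise

x = np.convolve(s, h) + n                        # the environment model

# In the frequency domain X = S*H + N, so the noisy power spectrum is
# approximately |S|^2 |H|^2 + |N|^2 when the cross term is neglected.
X = np.fft.rfft(x)
print(x.shape, X.shape)
```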

SLIDE 7

Additive Noises

  • Additive noises can be stationary or non-stationary

  – Stationary noises
      • Such as computer fan, air conditioning, or car noise: the power spectral density does not change over time (the above noises are also narrow-band noises)
  – Non-stationary noises
      • Machine gun, door slams, keyboard clicks, radio/TV, and other speakers’ voices (babble noise, wide-band noise, the most difficult): the statistical properties change over time

SLIDE 8

Additive Noises (cont.)

SLIDE 9

Convolutional Noises

  • Convolutional noises mainly result from channel distortion (they are sometimes called “channel noises”) and are stationary in most cases

  – Reverberation, the frequency response of the microphone, transmission lines, etc.

SLIDE 10

Noise Characteristics

  • White Noise

  – The power spectrum is flat, a condition equivalent to different samples being uncorrelated:

      S_nn(ω) = q,    R_nn[m] = q δ[m]

  – White noise has a zero mean, but can have different distributions
  – We are often interested in white Gaussian noise, as it better resembles the noise that tends to occur in practice

  • Colored Noise

  – The spectrum is not flat (like the noise captured by a microphone)
  – Pink noise
      • A particular type of colored noise that has a low-pass nature: it has more energy at low frequencies and rolls off at high frequencies
      • E.g., the noise generated by a computer fan, an air conditioner, or an automobile
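The white-noise properties above (flat spectrum, delta autocorrelation) are easy to verify empirically; the following sketch estimates R_nn at lag 0 and at a nonzero lag (the sample size and lag are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)
n = rng.standard_normal(200_000)   # zero-mean white Gaussian noise, q = 1

# Autocorrelation R_nn[m] = q * delta[m]: full power at lag 0,
# (near) zero at any other lag for a long enough sample.
r0 = np.mean(n * n)                # lag 0 -> the variance q
r5 = np.mean(n[:-5] * n[5:])       # lag 5 -> approximately 0

print(round(r0, 2), round(r5, 2))
```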

SLIDE 11

Noise Characteristics (cont.)

  • Musical Noise

  – Musical noise consists of short sinusoids (tones) randomly distributed over time and frequency
      • It occurs due to, e.g., the drawback of the original spectral subtraction technique and statistical inaccuracy in estimating the noise magnitude spectrum

  • Lombard Effect

  – A phenomenon by which a speaker increases his vocal effort in the presence of background noise (the additive noise)
  – When a large amount of noise is present, the speaker tends to shout, which entails not only a higher amplitude, but also often a higher pitch, slightly different formants, and a different coloring (shape) of the spectrum
  – The vowel portions of the words will be overemphasized by the speakers

SLIDE 12

Robustness Approaches

SLIDE 13

Three Basic Categories of Approaches

  • Speech Enhancement Techniques

  – Eliminate or reduce the noise effect on the speech signals, thus achieving better accuracy with the originally trained models (restore the clean speech signals or compensate for the distortions) – The feature part is modified while the model part remains unchanged

  • Model-based Noise Compensation Techniques

  – Adjust (change) the recognition model parameters (means and variances) to better match the noisy testing conditions – The model part is modified while the feature part remains unchanged

  • Inherently Robust Parameters for Speech

  – Find robust representations of the speech signal that are less influenced by additive or channel noise – Both the feature and model parts are changed

SLIDE 14

Assumptions & Evaluations

  • General Assumptions for the Noise

  – The noise is uncorrelated with the speech signal
  – The noise characteristics are fixed during the speech utterance, or vary very slowly (the noise is said to be stationary)
      • The estimates of the noise characteristics can be obtained during non-speech activity
  – The noise is supposed to be additive or convolutional

  • Performance Evaluations

  – Intelligibility, quality (subjective assessment)
  – Distortion between the clean and recovered speech (objective assessment)
  – Speech recognition accuracy

SLIDE 15

Spectral Subtraction (SS) S. F. Boll, 1979

  • A Speech Enhancement Technique
  • Estimate the magnitude (or power) spectrum of the clean speech by explicitly subtracting the noise magnitude (or power) spectrum from the noisy magnitude (or power) spectrum

  • Basic Assumptions of Spectral Subtraction

  – The clean speech s[m] is corrupted by additive noise n[m]:  x[m] = s[m] + n[m]
  – Different frequencies are uncorrelated with each other, and s[m] and n[m] are statistically independent, so the power spectrum of the noisy speech can be expressed as:

      P_X(ω) = P_S(ω) + P_N(ω)

  – To eliminate the additive noise:

      P̂_S(ω) = P_X(ω) − P̂_N(ω)

  – We can obtain the estimate P̂_N(ω) by averaging over M frames known to be just noise:

      P̂_N(ω) = (1/M) Σ_{i=0}^{M−1} P_{N,i}(ω)
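A minimal sketch of power spectral subtraction along the lines above (the function name, frame layout, and flooring constant are illustrative assumptions, not part of the original method description):

```python
import numpy as np

def spectral_subtraction(noisy_frames, noise_frames, floor=1e-3):
    """Basic power spectral subtraction sketch.

    noisy_frames, noise_frames: 2-D arrays of windowed time-domain frames.
    Returns the enhanced magnitude spectra of the noisy frames.
    """
    # Noise power spectrum estimate: average over M frames known to be noise.
    Pn = np.mean(np.abs(np.fft.rfft(noise_frames, axis=1)) ** 2, axis=0)

    X = np.fft.rfft(noisy_frames, axis=1)
    Ps = np.abs(X) ** 2 - Pn           # P^_S = P_X - P^_N
    Ps = np.maximum(Ps, floor * Pn)    # flooring: P^_S can go negative
    return np.sqrt(Ps)                 # enhanced magnitude spectrum

# Toy usage: a sinusoid in white noise, with noise-only frames for the estimate.
rng = np.random.default_rng(0)
t = np.arange(256) / 8000.0
clean = np.sin(2 * np.pi * 1000 * t)
noisy = np.stack([clean + 0.3 * rng.standard_normal(256) for _ in range(8)])
noise = 0.3 * rng.standard_normal((20, 256))
mag = spectral_subtraction(noisy, noise)
print(mag.shape)
```

The flooring step is exactly where "musical noise" originates: isolated bins that survive the subtraction show up as random short tones.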

SLIDE 16

Spectral Subtraction (cont.)

  • Problems of Spectral Subtraction

  – s[m] and n[m] are not statistically independent, so the cross term in the power spectrum cannot be eliminated
  – P̂_S(ω) can possibly be less than zero
  – It introduces “musical noise” when P_X(ω) ≈ P_N(ω)
  – It needs a robust endpoint (speech/noise/silence) detector

SLIDE 17

Spectral Subtraction (cont.)

  • Modification: Nonlinear Spectral Subtraction (NSS)

      P̂_S(ω) = P_X(ω) − φ(ω),   if P_X(ω) > φ(ω) + β·P̄_N(ω)
      P̂_S(ω) = β·P_X(ω),        otherwise

      P̄_X(ω) and P̄_N(ω): smoothed noisy and noise spectra
      φ(ω): a non-linear function of the SNR

  or

      P̂_S(ω) = P̄_X(ω) − P̄_N(ω),   if P̄_X(ω) ≥ P̄_N(ω)
      P̂_S(ω) = P̄_N(ω),            otherwise
SLIDE 18

Spectral Subtraction (cont.)

  • Spectral subtraction can be viewed as a filtering operation

      P̂_S(ω) = P_X(ω) − P_N(ω) = P_X(ω) [ 1 − P_N(ω)/P_X(ω) ]

      Supposing P_X(ω) ≈ P_S(ω) + P_N(ω), and defining the instantaneous SNR R(ω) = P_S(ω)/P_N(ω):

      P_N(ω)/P_X(ω) = [ 1 + R(ω) ]⁻¹
      ⇒ P̂_S(ω) = P_X(ω) [ 1 + 1/R(ω) ]⁻¹

      The time-varying suppression filter (in the spectrum domain, as opposed to the power spectrum domain) is approximately given by:

      H(ω) = [ 1 + 1/R(ω) ]^(−1/2)

SLIDE 19

Wiener Filtering

  • A Speech Enhancement Technique
  • From the Statistical Point of View

  – The process x[m] is the sum of the random process s[m] and the additive noise process n[m]:

      x[m] = s[m] + n[m]

  – Find a linear estimate ŝ[m] of s[m] in terms of the process x[m]:

      ŝ[m] = x[m] * h[m] = Σ_{l=−∞}^{∞} h[l] x[m−l]

      • That is, find a linear filter h[m] such that the sequence ŝ[m] = x[m] * h[m] minimizes the expected value of (ŝ[m] − s[m])²

  (Diagram: noisy speech x[m] → linear filter h[m] → clean-speech estimate ŝ[m])

SLIDE 20

Wiener Filtering (cont.)

  • Minimize the expectation of the squared error (MMSE estimate):

      F = E{ ( s[m] − Σ_{l=−∞}^{∞} h[l] x[m−l] )² }

      ∂F/∂h[k] = 0,  ∀k
      ⇒ E{ ( s[m] − Σ_l h[l] x[m−l] ) x[m−k] } = 0,  ∀k
      ⇒ E{ s[m] x[m−k] } = Σ_l h[l] E{ x[m−l] x[m−k] }
      ⇒ R_s[k] = Σ_l h[l] R_x[k−l]        (s[m] and n[m] are statistically independent!)
      ⇒ R_s[k] = h[k] * R_x[k]            (the summation over l is a convolution)
      ⇒ S_ss(ω) = H(ω) S_xx(ω)            (take the Fourier transform)

      R_s[n] and R_x[n]: the autocorrelation sequences of s[n] and x[n], respectively

SLIDE 21

Wiener Filtering (cont.)

  • Minimize the expectation of the squared error (MMSE estimate):

      S_ss(ω) = H(ω) S_xx(ω)
      ⇒ H(ω) = S_ss(ω) / S_xx(ω) = S_ss(ω) / ( S_ss(ω) + S_nn(ω) )      (where S_xx(ω) = S_ss(ω) + S_nn(ω))

      H(ω) is called the noncausal Wiener filter

SLIDE 22

Wiener Filtering (cont.)

      H(ω) = S_ss(ω) / ( S_ss(ω) + S_nn(ω) ) = P_S(ω) / ( P_S(ω) + P_N(ω) ) = [ 1 + 1/R(ω) ]⁻¹

      R(ω) = P_S(ω) / P_N(ω)      (instantaneous SNR)

  • The time-varying Wiener filter can also be expressed in a form similar to spectral subtraction

  SS vs. Wiener filter:
  • 1. The Wiener filter has stronger attenuation in the low-SNR region
  • 2. The Wiener filter does not invoke an absolute thresholding

  (Plot: attenuation of the two filters versus the SNR 10 log₁₀ P_S(ω)/P_N(ω))
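The two suppression rules can be compared numerically. Below, H_ss is the magnitude-domain spectral subtraction gain [1 + 1/R]^(−1/2) and H_wiener is the power-domain Wiener gain R/(1+R); the chosen SNR grid is arbitrary:

```python
import numpy as np

# Compare the two suppression gains as a function of instantaneous SNR R.
snr_db = np.array([-10.0, 0.0, 10.0, 30.0])
R = 10 ** (snr_db / 10)

H_ss = (1 + 1 / R) ** -0.5       # spectral subtraction (magnitude domain)
H_wiener = R / (1 + R)           # Wiener filter (power domain)

# The Wiener gain attenuates more at low SNR; both approach 1 at high SNR.
for db, a, b in zip(snr_db, H_ss, H_wiener):
    print(f"{db:6.1f} dB  SS={a:.3f}  Wiener={b:.3f}")
```

Since R/(1+R) lies in (0, 1), its square root is always larger, which is exactly the "stronger attenuation at low SNR" point made on the slide.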

SLIDE 23

Wiener Filtering (cont.)

  • Wiener filtering can be realized only if we know the power spectra of both the noise and the signal

  – A chicken-and-egg problem

  • Approach I: Ephraim (1992) proposed the use of an HMM where, if we know the state the current frame falls under, we can use that state’s mean spectrum as P_S(ω) (or S_ss(ω))

  – In practice, we do not know which state each frame falls into either
      • Weight the filters for each state by the posterior probability that the frame falls into that state
SLIDE 24

Wiener Filtering (cont.)

  • Approach II:

  – The background noise is stationary, and its power spectrum can be estimated by averaging spectra over a known background region
  – For the non-stationary speech signal, the time-varying power spectrum can be estimated using the Wiener filter of the previous frame:

      P̂_S(t, ω) = H(t−1, ω) · P_X(t, ω)      (t: frame index; H: Wiener filter)
      H(t, ω) = P̂_S(t, ω) / ( P̂_S(t, ω) + P_N(ω) )

      • The initial estimate of the speech spectrum can be derived from spectral subtraction

  – This sometimes introduces musical noise

SLIDE 25

Wiener Filtering (cont.)

  • Approach III:

  – Slow down the rapid frame-to-frame movement of the speech power spectrum estimate by applying temporal smoothing:

      P̃_S(t, ω) = α · P̃_S(t−1, ω) + (1 − α) · P̂_S(t, ω)

      Then use P̃_S(t, ω) to replace P̂_S(t, ω) in H(t, ω):

      H(t, ω) = P̃_S(t, ω) / ( P̃_S(t, ω) + P_N(ω) )
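Approaches II and III can be combined into a small frame-recursive sketch (the function name, initialization, and exact update order are assumptions made for illustration):

```python
import numpy as np

def iterative_wiener(Px, Pn, alpha=0.85):
    """Frame-recursive Wiener filtering sketch (Approaches II + III).

    Px: (T, K) noisy power spectra per frame; Pn: (K,) noise power spectrum.
    alpha: temporal smoothing weight for the speech power estimate.
    """
    # Initial speech estimate from spectral subtraction (floored at zero).
    Ps_smooth = np.maximum(Px[0] - Pn, 0.0)
    out = np.empty_like(Px)
    for t in range(Px.shape[0]):
        H = Ps_smooth / (Ps_smooth + Pn)     # Wiener gain for this frame
        Ps_hat = H * Px[t]                   # new raw speech power estimate
        # Temporal smoothing: P~_S(t) = a*P~_S(t-1) + (1-a)*P^_S(t)
        Ps_smooth = alpha * Ps_smooth + (1 - alpha) * Ps_hat
        out[t] = H * Px[t]                   # enhanced power spectrum
    return out

rng = np.random.default_rng(0)
Pn = np.full(64, 0.1)                    # toy stationary noise spectrum
Px = 1.0 + 0.1 * rng.random((50, 64))    # toy noisy power spectra
enhanced = iterative_wiener(Px, Pn)
print(enhanced.shape)
```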

SLIDE 26

Wiener Filtering (cont.)

  (Spectrogram examples: clean speech, noisy speech, and noisy speech enhanced using Approach III, with τ = 0.85)

  Other, more complicated Wiener filters also exist.

SLIDE 27

The Effectiveness of Active Noise

SLIDE 28

Cepstral Mean Normalization (CMN)

  • A Speech Enhancement Technique, sometimes called Cepstral Mean Subtraction (CMS)
  • CMN is a powerful and simple technique designed to handle convolutional (time-invariant linear filtering) distortions

  – Time domain:               x[n] = s[n] * h[n]
  – Spectral domain:           X(ω) = S(ω) H(ω)
  – Log power spectral domain: log|X|² = log|S|² + log|H|²
  – Cepstral domain:           Cx = C(S·H) = Cs + Ch

  – The cepstral mean over an utterance of T frames:

      C̄x = (1/T) Σ_{t=1}^{T} Cx_t = [ (1/T) Σ_{t=1}^{T} Cs_t ] + Ch = C̄s + Ch

      (C̄s can be eliminated under the assumption of a zero-mean speech contribution!)

  – If the training and testing speech materials were recorded over two different channels:

      Training: Cx(1) = C( S·H(1) ) = Cs + Ch(1)
      Testing:  Cx(2) = C( S·H(2) ) = Cs + Ch(2)

      Cx(1) − C̄x(1) = Cs − C̄s
      Cx(2) − C̄x(2) = Cs − C̄s

  The spectral characteristics of the microphone and room acoustics can thus be removed!
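The channel-cancellation argument above can be demonstrated in a few lines; the toy cepstra below stand in for real MFCC streams:

```python
import numpy as np

def cmn(cepstra):
    """Cepstral mean normalization: subtract the per-utterance cepstral mean.

    cepstra: (T, D) array of cepstral vectors; returns the normalized (T, D).
    A time-invariant channel adds a constant Ch to every frame, so it
    cancels in Cx_t - mean_t(Cx_t).
    """
    return cepstra - cepstra.mean(axis=0, keepdims=True)

rng = np.random.default_rng(0)
Cs = rng.standard_normal((100, 13))   # stand-in for clean cepstra
Ch1 = rng.standard_normal(13)         # channel 1 (training)
Ch2 = rng.standard_normal(13)         # channel 2 (testing)

# After CMN, the two channel versions of the same speech coincide.
a = cmn(Cs + Ch1)
b = cmn(Cs + Ch2)
print(np.allclose(a, b))
```

The same array identity also shows why CMN fails when the frames in the averaging window are nearly identical: the mean then removes the speech itself, not just the channel.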

SLIDE 29

Cepstral Mean Normalization (cont.)

  • Some Findings

  – Interestingly, CMN has been found effective even when the testing and training utterances come from the same microphone and environment
      • There are variations in the distance between the mouth and the microphone across utterances and speakers
  – Be careful with the duration/period used to estimate the cepstral mean of noisy speech
      • Why?
  – It is problematic when the acoustic feature vectors are almost identical within the selected time period

SLIDE 30

Cepstral Mean Normalization (cont.)

  • Performance

  – For telephone recordings, where each call has a different frequency response, the use of CMN has been shown to provide as much as a 30% relative decrease in error rate
  – When a system is trained on one microphone and tested on another, CMN can provide significant robustness

SLIDE 31

Cepstral Mean Normalization (cont.)

  • CMN has been shown to improve the robustness not only to varying channels but also to noise

  – White noise added at different SNRs
  – System trained with speech at the same SNR (matched condition)

  Cepstral delta and delta-delta features are computed prior to the CMN operation, so they are unaffected.

SLIDE 32

Cepstral Mean Normalization (cont.)

  • From another perspective

  – We can interpret CMN as subtracting the output of a low-pass temporal filter of length T whose coefficients are all identical and equal to 1/T; the overall operation is therefore a high-pass temporal filter (in the temporal/modulation frequency domain)
  – This alleviates the effect of convolutional noise introduced in the channel

SLIDE 33

Cepstral Mean Normalization (cont.)

  • Real-time Cepstral Normalization

  – CMN requires the complete utterance to compute the cepstral mean; thus, it cannot be used in a real-time system, and an approximation needs to be used
  – Based on the above perspective, we can implement other types of high-pass filters, e.g. a running estimate of the cepstral mean:

      C̄x_t = α · Cx_t + (1 − α) · C̄x_{t−1}      (C̄x_t: cepstral mean at frame t)

SLIDE 34

RASTA Temporal Filter (Hynek Hermansky, 1991)

  • A Speech Enhancement Technique
  • RASTA (RelAtive SpecTrA)

  Assumption

  – The linguistic message is coded into movements of the vocal tract (i.e., the change of spectral characteristics)
  – The rate of change of non-linguistic components in speech often lies outside the typical rate of change of the vocal tract shape
      • E.g., fixed or slowly time-varying linear communication channels
  – Human hearing is more sensitive to modulation frequencies around 4 Hz than to lower or higher modulation frequencies

  Effect

  – RASTA suppresses the spectral components that change more slowly or more quickly than the typical rate of change of speech
SLIDE 35

RASTA Temporal Filter (cont.)

  • The IIR transfer function:

      H(z) = C̃(z)/C(z) = 0.1 · z⁴ · (2 + z⁻¹ − z⁻³ − 2z⁻⁴) / (1 − 0.98 z⁻¹)

  • Another (causal) version:

      H(z) = 0.1 · (2 + z⁻¹ − z⁻³ − 2z⁻⁴) / (1 − 0.98 z⁻¹)

      c̃[t] = 0.98 · c̃[t−1] + 0.2 · c[t] + 0.1 · c[t−1] − 0.1 · c[t−3] − 0.2 · c[t−4]

  – The filter H(z) is applied to each MFCC stream c[t] along the frame index, producing the new MFCC stream c̃[t]
  – RASTA has a peak at about 4 Hz in modulation frequency (at a frame rate of 100 Hz)
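The causal difference equation above translates directly into code; below is a plain-Python sketch that also illustrates the high-pass behavior (a constant, channel-like input is driven toward zero because the numerator coefficients sum to zero):

```python
import numpy as np

def rasta_filter(c):
    """Causal RASTA filtering of one cepstral-coefficient stream c[t].

    Implements the difference equation
      c~[t] = 0.98*c~[t-1] + 0.2*c[t] + 0.1*c[t-1] - 0.1*c[t-3] - 0.2*c[t-4],
    i.e. H(z) = 0.1*(2 + z^-1 - z^-3 - 2z^-4) / (1 - 0.98 z^-1).
    """
    c = np.asarray(c, dtype=float)
    out = np.zeros_like(c)
    for t in range(len(c)):
        x = (0.2 * c[t]
             + 0.1 * (c[t - 1] if t >= 1 else 0.0)
             - 0.1 * (c[t - 3] if t >= 3 else 0.0)
             - 0.2 * (c[t - 4] if t >= 4 else 0.0))
        out[t] = x + (0.98 * out[t - 1] if t >= 1 else 0.0)
    return out

# A constant (DC) input models a time-invariant convolutional distortion:
# the output decays toward zero, so the channel component is suppressed.
y = rasta_filter(np.ones(400))
print(abs(y[-1]))
```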

SLIDE 36

Retraining on Corrupted Speech

  • A Model-based Noise Compensation Technique
  • Matched-Conditions Training

– Take a noise waveform from the new environment, add it to all the utterances in the training database, and retrain the system – If the noise characteristics are known ahead of time, this method allows us to adapt the model to the new environment with a relatively small amount of data from the new environment, yet use a large amount of training data

SLIDE 37

Retraining on Corrupted Speech (cont.)

  • Multi-style Training

– Create a number of artificial acoustical environments by corrupting the clean training database with noise samples of varying levels (30dB, 20dB, etc.) and types (white, babble, etc.), as well as varying the channels – All those waveforms (copies of training database) from multiple acoustical environments can be used in training

SLIDE 38

Model Adaptation

  • A Model-based Noise Compensation Technique
  • The standard adaptation methods for speaker adaptation can be used to adapt speech recognizers to noisy environments

  – MAP (Maximum a Posteriori) adaptation can offer results similar to those of matched conditions, but it requires a significant amount of adaptation data
  – MLLR (Maximum Likelihood Linear Regression) can achieve reasonable performance with about a minute of speech for minor mismatches. For severe mismatches, MLLR also requires a larger amount of adaptation data

SLIDE 39

Signal Decomposition Using HMMs

  • A Model-based Noise Compensation Technique
  • Recognize concurrent signals (speech and noise) simultaneously

  – Parallel HMMs (a clean-speech HMM and a noise HMM) are used to model the concurrent signals, and the composite signal is modeled as a function of their combined outputs
      • Three-dimensional Viterbi search (especially useful for non-stationary noise)

  Computationally expensive for both training and decoding!

SLIDE 40

Parallel Model Combination (PMC)

  • A Model-based Noise Compensation Technique
  • By using the clean-speech models and a noise model, we can approximate the distributions that would be obtained by training an HMM on corrupted speech

SLIDE 41

Parallel Model Combination (cont.)

  • The steps of Standard Parallel Model Combination (log-normal approximation), applied to the clean-speech HMMs (μᶜ, Σᶜ) and the noise HMMs (μ̃, Σ̃):

  1. Cepstral domain → log-spectral domain:

      μˡ = C⁻¹ μᶜ
      Σˡ = C⁻¹ Σᶜ (C⁻¹)ᵀ

  2. Log-spectral domain → linear spectral domain (in the linear spectral domain the distribution is lognormal):

      μᵢ = exp( μᵢˡ + Σˡᵢᵢ / 2 )
      Σᵢⱼ = μᵢ μⱼ [ exp(Σˡᵢⱼ) − 1 ]

  3. Combine the speech and noise statistics, because speech and noise are independent and additive in the linear spectral domain (g is a gain term, and the new distribution is assumed to be lognormal; this is the log-normal approximation):

      μ̂ = μ + g μ̃
      Σ̂ = Σ + g² Σ̃

  4. Linear spectral domain → log-spectral domain:

      μ̂ˡᵢ = log μ̂ᵢ − (1/2) log( Σ̂ᵢᵢ / μ̂ᵢ² + 1 )
      Σ̂ˡᵢⱼ = log( Σ̂ᵢⱼ / (μ̂ᵢ μ̂ⱼ) + 1 )

  5. Log-spectral domain → cepstral domain (noisy-speech HMMs):

      μ̂ᶜ = C μ̂ˡ
      Σ̂ᶜ = C Σ̂ˡ Cᵀ

  Constraint: the estimate of the variance must be positive
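A sketch of the five steps for a single diagonal Gaussian (the identity "cepstral" matrix and the g = 1 gain are simplifying assumptions for the demo; a real system would use the actual DCT cepstral transform and full covariances):

```python
import numpy as np

def lognormal_pmc(mu_c, var_c, mu_c_noise, var_c_noise, C, g=1.0):
    """Standard PMC with the log-normal approximation for one diagonal
    Gaussian (per-dimension cepstral means/variances). C: cepstral matrix."""
    Cinv = np.linalg.inv(C)

    def to_linear(mu_cep, var_cep):
        mu_l = Cinv @ mu_cep                        # cepstral -> log-spectral
        var_l = np.diag(Cinv @ np.diag(var_cep) @ Cinv.T)
        mu = np.exp(mu_l + var_l / 2)               # log-spectral -> linear
        var = mu ** 2 * (np.exp(var_l) - 1)         # (diagonal terms only)
        return mu, var

    mu_s, var_s = to_linear(mu_c, var_c)
    mu_n, var_n = to_linear(mu_c_noise, var_c_noise)

    mu_hat = mu_s + g * mu_n                        # combine in the linear domain
    var_hat = var_s + g ** 2 * var_n

    mu_l = np.log(mu_hat) - 0.5 * np.log(var_hat / mu_hat ** 2 + 1)
    var_l = np.log(var_hat / mu_hat ** 2 + 1)       # back to log-spectral
    return C @ mu_l, np.diag(C @ np.diag(var_l) @ C.T)  # back to cepstral

D = 4
C = np.eye(D)   # identity "cepstral" transform, just for the demo
mu, var = lognormal_pmc(np.zeros(D), 0.1 * np.ones(D),
                        -1.0 * np.ones(D), 0.05 * np.ones(D), C)
print(mu.shape, var.shape)
```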

SLIDE 42

Parallel Model Combination (cont.)

  • Modification I: perform the model combination in the log-spectral domain (the simplest approximation)

  – Log-Add approximation (without compensation of the variances):

      μ̂ˡ = log( exp(μˡ) + exp(μ̃ˡ) )

      • The variances are assumed to be small
  – A simplified version of the log-normal approximation
      • Reduction in computational load

  • Modification II: perform the model combination in the linear spectral domain (Data-Driven PMC, DPMC, or Iterative PMC)

  – Use the speech models to generate noisy samples (corrupted-speech observations) and then compute a maximum-likelihood estimate from these noisy samples
  – This method is less computationally expensive than standard PMC, with comparable performance

SLIDE 43

Parallel Model Combination (cont.)

  • Modification II: perform the model combination in the linear spectral domain (Data-Driven PMC, DPMC)

  (Diagram: clean-speech HMM (cepstral domain) → generate samples → domain transform → combine with the noise HMM in the linear spectral domain → noisy-speech HMM)

  Apply Monte Carlo simulation to draw random cepstral vectors (for example, at least 100 for each distribution)

SLIDE 44

Parallel Model Combination (cont.)

  • Data-Driven PMC
SLIDE 45

Vector Taylor Series (VTS) (P. J. Moreno, 1995)

  • A Model-based Noise Compensation Technique
  • VTS Approach

  – Similar to PMC, the noisy-speech-like models are generated by combining the clean-speech HMMs and the noise HMM
  – Unlike PMC, the VTS approach combines the parameters of the clean-speech HMMs and the noise HMM in the log-spectral domain, via a non-linear function

  – From the power spectrum to the log power spectrum:

      P_X(ω) = P_S(ω) P_H(ω) + P_N(ω)
      ⇒ Xˡ = log( exp(Sˡ + Hˡ) + exp(Nˡ) )
           = Sˡ + Hˡ + log( 1 + exp(Nˡ − Sˡ − Hˡ) )
           = Sˡ + Hˡ + f(Sˡ, Hˡ, Nˡ)

      where f(Sˡ, Hˡ, Nˡ) = log( 1 + exp(Nˡ − Sˡ − Hˡ) ) is a non-linear vector function
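The log-domain identity above can be checked numerically (the particular values of Sˡ, Hˡ, Nˡ are arbitrary):

```python
import math

# The log-domain environment relation used by VTS:
#   X^l = log(exp(S^l + H^l) + exp(N^l)) = S^l + H^l + f(S^l, H^l, N^l)
# with f(s, h, n) = log(1 + exp(n - s - h)).
def f(s, h, n):
    return math.log1p(math.exp(n - s - h))

s, h, n = 2.0, 0.5, 1.0    # arbitrary log-power values
lhs = math.log(math.exp(s + h) + math.exp(n))
rhs = s + h + f(s, h, n)
print(abs(lhs - rhs))
```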

SLIDE 46

Vector Taylor Series (cont.)

  • The Taylor series provides a polynomial representation of a function in terms of the function and its derivatives at a point

  – Applications often arise when nonlinear functions are employed and we desire to obtain a linear approximation
  – The function is represented as an offset and a linear term (plus higher-order terms):

      f: R → R
      f(x) = f(x₀) + f′(x₀)(x − x₀) + (1/2!) f″(x₀)(x − x₀)² + … + (1/n!) f⁽ⁿ⁾(x₀)(x − x₀)ⁿ + …

SLIDE 47

Vector Taylor Series (cont.)

  • Apply the Taylor Series Approximation

  – VTS-0: use only the 0th-order term of the Taylor series
  – VTS-1: use the 0th- and 1st-order terms of the Taylor series
  – f(S₀ˡ, H₀ˡ, N₀ˡ) is the vector function evaluated at a particular vector point (S₀ˡ, H₀ˡ, N₀ˡ):

      f(Sˡ, Hˡ, Nˡ) ≅ f(S₀ˡ, H₀ˡ, N₀ˡ)
                      + (∂f/∂Sˡ)(S₀ˡ, H₀ˡ, N₀ˡ) · (Sˡ − S₀ˡ)
                      + (∂f/∂Hˡ)(S₀ˡ, H₀ˡ, N₀ˡ) · (Hˡ − H₀ˡ)
                      + (∂f/∂Nˡ)(S₀ˡ, H₀ˡ, N₀ˡ) · (Nˡ − N₀ˡ) + …

  • If VTS-0 is used (0th-order VTS):

      E[Xˡ] = E[ Sˡ + Hˡ + f(Sˡ, Hˡ, Nˡ) ] ≅ μₛˡ + μₕˡ + f(μₛˡ, μₕˡ, μₙˡ)      (Xˡ is also Gaussian)
      Σₓˡ ≅ Σₛˡ + Σₕˡ      (if Sˡ and Hˡ are independent)

  – If the channel filter is linear and time-invariant, we can regard it as a constant bias gˡ in the log power spectrum domain:

      μₓˡ ≅ μₛˡ + gˡ + f(μₛˡ, gˡ, μₙˡ),      Σₓˡ ≅ Σₛˡ

  – These relations can then be inverted to get the clean-speech statistics

SLIDE 48

Vector Taylor Series (cont.)

SLIDE 49

Retraining on Compensated Features

  • A Model-based Noise Compensation Technique that also uses enhanced features (processed by SS, CMN, etc.)

  – Combine speech enhancement and model compensation

SLIDE 50

Principal Component Analysis

  • Principal Component Analysis (PCA):

  – Widely applied for data analysis and dimensionality reduction, in order to derive the most “expressive” features
  – Criterion: for a zero-mean r.v. x ∈ Rᴺ, find k (k ≤ N) orthonormal vectors {e₁, e₂, …, e_k} so that
      (1) var(e₁ᵀx) is maximized
      (2) var(eᵢᵀx) is maximized, subject to eᵢ ⊥ eᵢ₋₁ ⊥ … ⊥ e₁, 1 ≤ i ≤ k
  – {e₁, e₂, …, e_k} are in fact the eigenvectors of the covariance matrix Σx of x corresponding to the largest k eigenvalues
  – The final r.v. y ∈ Rᵏ is the linear transform (projection) of the original r.v.: y = Aᵀx, A = [e₁ e₂ … e_k]

  (Figure: data scattered about the principal axis)
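A minimal PCA sketch following the criterion above (eigendecomposition of the sample covariance; the toy data and its variance profile are assumptions for the demo):

```python
import numpy as np

def pca(x, k):
    """PCA for zero-mean data x (T samples x N dims): return the N x k
    matrix A whose columns are the top-k eigenvectors of the covariance."""
    cov = np.cov(x, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)    # ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:k]     # take the largest k
    return eigvecs[:, order]

rng = np.random.default_rng(0)
# Anisotropic toy data: one dominant direction of variance.
x = rng.standard_normal((1000, 5)) * np.array([3.0, 1.0, 0.5, 0.2, 0.1])
x -= x.mean(axis=0)

A = pca(x, k=2)
y = x @ A                          # projected features y = A^T x, per sample
# Components of y are mutually uncorrelated: covariance of y is diagonal.
cov_y = np.cov(y, rowvar=False)
print(np.round(cov_y, 2))
```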

SLIDE 51

Principal Component Analysis (cont.)

SLIDE 52

Principal Component Analysis (cont.)

  • Properties of PCA

  – The components of y are mutually uncorrelated:

      E{yᵢyⱼ} = E{(eᵢᵀx)(eⱼᵀx)ᵀ} = E{(eᵢᵀx)(xᵀeⱼ)} = eᵢᵀ E{xxᵀ} eⱼ = eᵢᵀ Σx eⱼ = λⱼ eᵢᵀeⱼ = 0,  if i ≠ j

      ∴ the covariance of y is diagonal

  – The error power (mean-squared error) between the original vector x and the projected vector x′ is minimum:

      x  = (e₁ᵀx)e₁ + (e₂ᵀx)e₂ + … + (e_kᵀx)e_k + … + (e_Nᵀx)e_N
      x′ = (e₁ᵀx)e₁ + (e₂ᵀx)e₂ + … + (e_kᵀx)e_k      (note: x′ ∈ Rᴺ)

      error r.v.: x − x′ = (e_{k+1}ᵀx)e_{k+1} + (e_{k+2}ᵀx)e_{k+2} + … + (e_Nᵀx)e_N

      E( (x−x′)ᵀ(x−x′) ) = var(e_{k+1}ᵀx) + var(e_{k+2}ᵀx) + … + var(e_Nᵀx)
                         = λ_{k+1} + λ_{k+2} + … + λ_N,  which is the minimum achievable for k components

SLIDE 53

PCA Applied in Inherently Robust Features

  • Application 1: the linear transform of the original features (in the spatial domain)

      z_t = Aᵀ x_t

  – The columns of A are the “first k” eigenvectors of Σx; the transform Aᵀ is applied to each frame of the original feature stream x_t, producing the transformed feature stream z_t

SLIDE 54

PCA Applied in Inherently Robust Features (cont.)

  • Application 2: PCA-derived temporal filter (in the temporal domain)

  – The effect of the temporal filter is equivalent to the weighted sum of a length-L sequence of a specific MFCC coefficient, slid along the frame index
  – For the k-th cepstral-coefficient stream y_k(n) of an utterance of N frames, stack the windowed vectors

      z_k(n) = [ y_k(n)  y_k(n+1)  y_k(n+2)  …  y_k(n+L−1) ]ᵀ

    estimate their mean and covariance,

      μ_{z_k} = ( 1/(N−L+1) ) Σ_{n=1}^{N−L+1} z_k(n)
      Σ_{z_k} = ( 1/(N−L+1) ) Σ_{n=1}^{N−L+1} ( z_k(n) − μ_{z_k} )( z_k(n) − μ_{z_k} )ᵀ

    and take the impulse response of the temporal filter B_k(z) to be an eigenvector e_{k,1} of Σ_{z_k}; the element of the new feature vector is

      x̂_k(n) = e_{k,1}ᵀ z_k(n)

  From Dr. Jei-wei Hung
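The windowing-plus-eigenvector recipe above can be sketched as follows (the toy 4 Hz modulated stream and the window length are illustrative assumptions):

```python
import numpy as np

def pca_temporal_filter(y, L=10):
    """Derive a temporal filter for one cepstral-coefficient stream y[n]:
    stack length-L windows z(n), estimate their covariance, and take the
    eigenvector of the largest eigenvalue as the filter impulse response."""
    N = len(y)
    Z = np.stack([y[n:n + L] for n in range(N - L + 1)])   # (N-L+1, L)
    Z = Z - Z.mean(axis=0)
    cov = Z.T @ Z / Z.shape[0]
    eigvals, eigvecs = np.linalg.eigh(cov)
    return eigvecs[:, -1]          # eigenvector of the largest eigenvalue

rng = np.random.default_rng(0)
# Toy stream: a slow 4 Hz modulation (speech-like) plus frame-level noise.
t = np.arange(500) / 100.0         # 100 frames per second
y = np.sin(2 * np.pi * 4 * t) + 0.3 * rng.standard_normal(500)

h = pca_temporal_filter(y, L=10)
filtered = np.convolve(y, h, mode="same")   # apply B_k(z) along the frames
print(h.shape)
```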

SLIDE 55

PCA Applied in Inherently Robust Features (cont.)

The frequency responses of the 15 PCA-derived temporal filters

From Dr. Jei-wei Hung

SLIDE 56

PCA Applied in Inherently Robust Features (cont.)

  • Application 2 : PCA-derived temporal filter

  (Recognition results for mismatched and matched conditions, filter length L = 10; from Dr. Jei-wei Hung)

SLIDE 57

PCA Applied in Inherently Robust Features (cont.)

  • Application 3: PCA-derived filter bank

  – The filters h₁, h₂, h₃, … are applied to the power spectrum obtained by the DFT; h_k is one of the eigenvectors of the covariance for x_k, the spectral components falling in band k

  From Dr. Jei-wei Hung

SLIDE 58

PCA Applied in Inherently Robust Features (cont.)

  • Application 3 : PCA-derived filter bank

From Dr. Jei-wei Hung

SLIDE 59

Linear Discriminant Analysis

  • Linear Discriminant Analysis (LDA)

  – Widely applied for pattern classification
  – In order to derive the most “discriminative” features
  – Criterion: assume wⱼ, μⱼ and Σⱼ are the weight, mean and covariance of class j, j = 1…N. Two matrices are defined as:

      Between-class covariance: S_b = Σ_{j=1}^{N} wⱼ (μⱼ − μ)(μⱼ − μ)ᵀ
      Within-class covariance:  S_w = Σ_{j=1}^{N} wⱼ Σⱼ

      Find W = [w₁ w₂ … w_k] such that

      Ŵ = argmax_W |Wᵀ S_b W| / |Wᵀ S_w W|

  – The columns wⱼ of W are the eigenvectors of S_w⁻¹ S_b having the largest eigenvalues
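A small sketch of the LDA criterion above (solving the eigenproblem of S_w⁻¹ S_b directly; the two-class toy data is an assumption for the demo):

```python
import numpy as np

def lda_directions(X, labels, k=1):
    """LDA sketch: eigenvectors of Sw^-1 Sb with the largest eigenvalues.
    X: (T, D) samples; labels: (T,) class ids; returns a D x k matrix W."""
    classes, counts = np.unique(labels, return_counts=True)
    mu = X.mean(axis=0)
    D = X.shape[1]
    Sb = np.zeros((D, D))
    Sw = np.zeros((D, D))
    for c, n in zip(classes, counts):
        Xc = X[labels == c]
        mc = Xc.mean(axis=0)
        w = n / len(X)                           # class weight
        Sb += w * np.outer(mc - mu, mc - mu)     # between-class covariance
        Sw += w * np.cov(Xc, rowvar=False)       # within-class covariance
    eigvals, eigvecs = np.linalg.eig(np.linalg.inv(Sw) @ Sb)
    order = np.argsort(eigvals.real)[::-1][:k]
    return eigvecs[:, order].real

rng = np.random.default_rng(0)
# Two classes separated along the first dimension only.
X0 = rng.standard_normal((200, 3)) + np.array([3.0, 0.0, 0.0])
X1 = rng.standard_normal((200, 3)) - np.array([3.0, 0.0, 0.0])
X = np.vstack([X0, X1])
labels = np.array([0] * 200 + [1] * 200)

W = lda_directions(X, labels, k=1)
print(np.round(np.abs(W[:, 0]), 2))   # dominated by the first dimension
```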

SLIDE 60

Linear Discriminant Analysis (cont.)

The frequency responses of the 15 LDA-derived temporal filters

From Dr. Jei-wei Hung

SLIDE 61

Minimum Classification Error

  • Minimum Classification Error (MCE):

  – General objective: find an optimal feature representation or an optimal recognition model to minimize the expected error of classification
  – The recognizer is often operated under the following decision rule:

      C(X) = C_i  if  g_i(X, Λ) = max_j g_j(X, Λ),   Λ = {λ(i)}, i = 1…M

      (M models/classes; X: observations; g_i(X, Λ): class-conditioned likelihood function, for example g_i(X, Λ) = P(X|λ(i)))

  – Traditional training criterion: find λ(i) such that P(X|λ(i)) is maximum (maximum likelihood) if X ∈ C_i
      • This criterion does not always lead to minimum classification error, since it doesn’t consider the mutual relationship between different classes
      • For example, it’s possible that P(X|λ(i)) is maximum but X ∉ C_i
SLIDE 62

Minimum Classification Error (cont.)

  (Figure: histograms of the likelihood ratio LR(k) of a keyword verifier for utterances where the keyword KW_k belongs to class C_k and where it does not, separated by a threshold τ_k)

  Type I error (false rejection):                P( LR(k) < τ_k | KW_k ∈ C_k )
  Type II error (false alarm/false acceptance):  P( LR(k) ≥ τ_k | KW_k ∉ C_k )

SLIDE 63

Minimum Classification Error (cont.)

  • Minimum Classification Error (MCE) (cont.):

  – One form of the class misclassification measure:

      d_i(X) = −g_i(X, Λ) + log [ ( 1/(M−1) ) Σ_{j≠i} exp( α g_j(X, Λ) ) ]^{1/α}

      d_i(X) ≥ 0 implies a misclassification (error = 1); d_i(X) < 0 implies a correct classification (error = 0), for X ∈ C_i

  – A continuous loss function is defined as follows:

      l_i(X, Λ) = l( d_i(X) ),  where l is the sigmoid function  l(d) = 1 / ( 1 + exp(−γ d + θ) )

  – Classifier performance measure:

      L(Λ) = E_X[ L(X, Λ) ] = Σ_X Σ_{i=1}^{M} l_i(X, Λ) δ(X ∈ C_i)
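The misclassification measure and sigmoid loss above can be sketched directly (the α, γ, θ values and the toy class likelihoods are arbitrary):

```python
import numpy as np

def misclassification_measure(g, i, alpha=2.0):
    """d_i(X) = -g_i + log[(1/(M-1)) * sum_{j != i} exp(alpha*g_j)]^(1/alpha).
    g: vector of class log-likelihoods g_j(X, Lambda)."""
    others = np.delete(g, i)
    return -g[i] + np.log(np.mean(np.exp(alpha * others))) / alpha

def sigmoid_loss(d, gamma=1.0, theta=0.0):
    """Smooth 0/1 loss l(d) = 1 / (1 + exp(-gamma*d + theta))."""
    return 1.0 / (1.0 + np.exp(-gamma * d + theta))

g = np.array([2.0, -1.0, -3.0])                 # log-likelihoods for one token
d_correct = misclassification_measure(g, i=0)   # true class scores highest
d_wrong = misclassification_measure(g, i=2)     # true class scores lowest
print(d_correct < 0, d_wrong > 0)
print(round(sigmoid_loss(d_correct), 3), round(sigmoid_loss(d_wrong), 3))
```

Because the loss is smooth in d (and d is smooth in the model parameters), it can be pushed through gradient descent, which is exactly the training scheme on the next slide.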

SLIDE 64

Minimum Classification Error (cont.)

  • Using MCE in model training:

  – Find  Λ̂ = argmin_Λ E_X[ L(X, Λ) ] = argmin_Λ L(Λ)
  – The above objective function in general cannot be minimized directly, but a local minimum can be reached using the gradient descent algorithm:

      w_{t+1} = w_t − ε ∂L(Λ)/∂w,      w: an arbitrary parameter of Λ

  • Using MCE in robust feature representation:

      f̂ = argmin_f E_X[ L(f(X), Λ) ],      f: a transform of the original features

      Note: while the feature representation is changed, the model is also changed accordingly