Feature-based Robust Techniques For Speech Recognition (PowerPoint PPT Presentation)



SLIDE 1

Feature-based Robust Techniques For Speech Recognition

presented by Nguyen Duc Hoang Ha

Supervisors:
  • Assoc. Prof. Chng Eng Siong
  • Prof. Li Haizhou

08-Mar-2017
SLIDE 2

Outline

➢An Introduction to Robust ASR
➢The 1st proposed method (Ch5) – the major contribution:

Feature Adaptation Using Spectro-Temporal Information

➢The 2nd proposed method (Ch3):

Combination of Feature Enhancement and VTS Model Compensation for Non-stationary Noisy Environments

➢The 3rd proposed method (Ch4):

A Particle Filter Compensation Approach to Robust LVCSR

➢Conclusions and Future Directions

SLIDE 3

Introduction ST Transform NN-VTS PFC Conclusion

Automatic Speech Recognition (ASR) [Huang2001]

The aim is to decode the speech signal into text.

[Diagram: speech signal "hello" /h e l o/ → acoustic model (AM) + language model (LM) → decoded word sequence w]

SLIDE 4

Applications of the ASR system

➢Siri (http://www.apple.com/ios/siri/) ➢Amazon Echo

(https://en.wikipedia.org/wiki/Amazon_Echo)

➢Google Speech Recognition API

(https://cloud.google.com/speech/) ...

SLIDE 5

Challenges of the ASR system [Chelba2010, Li2014]

➢Non-native speakers
➢Dialect variations
➢Disfluencies
➢Out-of-vocabulary words
➢Language modeling
➢Noise robustness

SLIDE 6

ASR in Noisy Environments [Xiao2009, Li2014]

[Diagram: mismatch between the clean speech model and the noisy speech features]

SLIDE 7

Feature/Model Compensation [Xiao2009, Li2014]

Two major approaches:
(A) Feature-based approach
(B) Model-based approach

SLIDE 8

Feature/Model Compensation

➢Feature-based approach (A). Examples: spectral subtraction [Boll1979], MMSE [Ephraim1984], fMLLR [Digalakis1995, Gales1998], ...
➢Model-based approach (B). Examples: MAP model adaptation [Gauvain1994], MLLR/CMLLR model adaptation [Leggetter1995, Gales1998], Vector Taylor series model adaptation [Acero2000, Li2009]

SLIDE 9

Multi-condition training approach

[Ng2016]

Noisy data collection / simulation

SLIDE 10

Robust ASR approaches:
➢Data collection / simulation
➢Feature-based approach:
  • Clean feature estimation (e.g. SS [Boll1979], MMSE [Ephraim1984], ...)
  • Feature transformation (e.g. fMLLR [Digalakis1995, Gales1998])
  • Filtering approach (e.g. RASTA [Hermansky1994], ...)
➢Model-based approach:
  • MAP model adaptation [Gauvain1994]
  • MLLR, CMLLR model adaptation [Leggetter1995, Gales1998]
  • VTS model compensation [Acero2000, Li2009]
➢Deep learning approaches (e.g. DNN AM [Hinton2012]), ...

SLIDE 11

Contributions – Three Proposed Methods

(A1) ST-Transform (for background noise and reverberation)
(A2) NN – (B2) VTS (for non-stationary noise)
(A3) PFC-LVCSR (for background noise)

SLIDE 12

Contributions – Three Proposed Methods

1) Spectro-Temporal Transformation (ST-Transform)

  • D. H. H. Nguyen, X. Xiao, E. S. Chng, and H. Li. Generalization of temporal filter and linear transformation for robust speech recognition. In ICASSP, Italy, 2014.
  • D. H. H. Nguyen, X. Xiao, E. S. Chng, and H. Li. Feature adaptation using linear spectro-temporal transform for robust speech recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing, PP(99):1–1, 2016. (Contributed to success at the REVERB 2014 Challenge for the clean-condition scheme)

2) Noise Normalization (NN) – Vector Taylor Series Model Compensation (VTS)

  • D. H. H. Nguyen, X. Xiao, E. S. Chng, and H. Li. An analysis of Vector Taylor series model compensation for non-stationary noise in speech recognition. In ISCSLP, Hong Kong, 2012.

3) Particle Filter Compensation (PFC) for LVCSR

  • D. H. H. Nguyen, A. Mushtaq, X. Xiao, E. S. Chng, H. Li, and C.-H. Lee. A particle filter compensation approach to robust LVCSR. In APSIPA ASC, Taiwan, 2013.

SLIDE 13

Contributions of ST Transform

http://reverb2014.dereverberation.com/introduction.html

SLIDE 14

Outline

➢An Introduction to Robust ASR
➢The 1st proposed method (Ch5):

Feature Adaptation Using Spectro-Temporal Information

➢The 2nd proposed method (Ch3):

Combination of Feature Enhancement and VTS Model Compensation for Non-stationary Noisy Environments

➢The 3rd proposed method (Ch4):

A Particle Filter Compensation Approach to Robust LVCSR

➢Conclusions and Future Directions

SLIDE 15

Feature Adaptation Using Spectro-Temporal Information


(A1) ST-Transform

SLIDE 16

Feature Adaptation Using Spectro-Temporal Information

[Diagram: noisy features y_{1:T} → ST transform W → transformed features x̂_{1:T}]

The ST transform W is estimated to minimize the Kullback–Leibler (KL) divergence between the distribution of the transformed features and the reference distribution of the training features.
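The KL-matching idea can be sketched in code. This is a hedged illustration, not the thesis' method: the actual ST transform is estimated with an EM algorithm against a GMM reference model, whereas here a single-Gaussian reference admits a closed-form solution (all function and variable names are illustrative):

```python
import numpy as np

def estimate_feature_transform(Y, mu_ref, cov_ref):
    """Sketch: a global linear feature transform x_hat = A @ y + b chosen so
    the Gaussian statistics of the transformed features match a reference
    Gaussian (mu_ref, cov_ref), which drives the KL divergence between the
    two Gaussians to zero. Y: adaptation features, shape (T, D)."""
    mu_y = Y.mean(axis=0)
    cov_y = np.cov(Y, rowvar=False)

    # Symmetric matrix square roots via eigendecomposition (both SPD).
    def sqrtm(C):
        w, V = np.linalg.eigh(C)
        return (V * np.sqrt(w)) @ V.T

    def inv_sqrtm(C):
        w, V = np.linalg.eigh(C)
        return (V / np.sqrt(w)) @ V.T

    A = sqrtm(cov_ref) @ inv_sqrtm(cov_y)  # whiten, then re-color
    b = mu_ref - A @ mu_y
    return A, b
```

Applying `X = Y @ A.T + b` then gives features whose sample mean and covariance equal the reference statistics exactly, illustrating why matching moments minimizes the Gaussian KL divergence.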

SLIDE 17

Changing Notations for Generalization of the Feature Transformation

[Diagram: input features x_{1:T} → feature transformation y = f(x) → transformed features y_{1:T}; the KL divergence is measured between the distribution of the transformed features and the distribution of the training features]

x denotes the input feature and y denotes the output feature; writing the transformation as "y = f(x)" is more natural.

SLIDE 18

ST Transform: Generalized Linear Transform

A) e.g. CMN [Atal1974], MVN [Viikki1998]
B) e.g. fMLLR [Digalakis1995, Gales1998]
C) e.g. RASTA [Hermansky1994], TSN [Xiao2009]

SLIDE 19

ST Transform: Generalized Linear Transform

Input: a window of input feature vectors
Output: one output feature vector
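The generalized linear transform maps a stacked window of input frames to each output frame. A hedged sketch (window handling and names are illustrative, not taken from the thesis):

```python
import numpy as np

def st_transform(X, W, b, tau):
    """Apply a spectro-temporal linear transform.
    X: input features, shape (T, D).
    W: transform matrix, shape (D, D * (2*tau + 1)).
    b: bias vector, shape (D,).
    Each output frame is y_t = W @ stack(x_{t-tau}, ..., x_{t+tau}) + b,
    so temporal filters (bands of W) and per-frame linear transforms
    such as fMLLR (the center block of W) arise as special cases."""
    T, D = X.shape
    # Pad by repeating edge frames so every window is complete.
    Xp = np.pad(X, ((tau, tau), (0, 0)), mode="edge")
    Y = np.empty_like(X)
    for t in range(T):
        window = Xp[t : t + 2 * tau + 1].reshape(-1)  # stacked frames
        Y[t] = W @ window + b
    return Y
```

For example, choosing W as an identity matrix in the center block (and zeros elsewhere) recovers the input unchanged, while a banded W realizes a pure temporal filter.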

SLIDE 20

ST Transform: Generalized Linear Transform


SLIDE 21

ST Transform: Generalized Linear Transform


Matrix form of W

SLIDE 22

EM Algorithm for Parameter Estimation

[Equation: the estimation objective combines a term from the L2 norm and a term from the KL-divergence criterion, involving the reference model, the output features, and the covariance matrix of the output features]

SLIDE 23

Insufficient Adaptation Data Issue

Issues:
  • Unreliable statistics
  • Too many degrees of freedom in the ST transform

Solutions:
  + Statistics smoothing approach
  + Sparse ST transform

SLIDE 24

Statistics Smoothing Approach

The idea of statistics smoothing is to interpolate the statistics computed from the adaptation (test) data with the statistics computed from some prior data (training or other prior sources).
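The interpolation can be sketched as follows. This is a hedged illustration with illustrative names; the exact quantities interpolated in the thesis are the sufficient statistics of the EM estimation, and the weighting scheme here is a generic MAP-style smoothing in the spirit of [Gauvain1994]:

```python
import numpy as np

def smooth_statistics(stat_test, stat_prior, n_test, r):
    """Interpolate statistics from scarce adaptation (test) data with
    statistics from training/prior data. r acts like a prior strength:
    the more adaptation frames n_test we have, the more the test
    statistics dominate the smoothed estimate."""
    w = n_test / (n_test + r)
    return w * stat_test + (1.0 - w) * stat_prior
```

With no adaptation frames the prior statistics are returned unchanged; with abundant adaptation data the smoothed statistics approach the test statistics.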

SLIDE 25

Sparse ST Transformation – Cross Transform

A) e.g. CMN, MVN, HEQ
B) e.g. fMLLR
C) e.g. RASTA, ARMA, TSN

SLIDE 26

ST Transform: Generalized Linear Transform


Matrix form of W

SLIDE 27


Matrix form of W

SLIDE 28

Experimental Settings

➢REVERB Challenge 2014 benchmark task for noisy and reverberant speech recognition
➢Clean-condition training scheme:
  ➢ Training data: 7861 clean utterances from the WSJCAM0 database (about 17.5 hours from 92 speakers)
  ➢ Speech features: 13 MFCCs + 13 ∆ + 13 ∆∆, with MVN post-processing
  ➢ Acoustic model: 3115 tied states, 10 mixtures/state
➢The development (dev) and evaluation (eval) data sets:
  ➢ Actual meeting-room recordings from the MC-WSJ-AV corpus
  ➢ Near setting: 100 cm distance between the microphone and the speaker
  ➢ Far setting: 250 cm distance between the microphone and the speaker

SLIDE 29

An Analysis of Window Length on Dev Set

We fix the window length to 21 for the temporal filter, the cross transform, and the full ST transform in the experiments on the Eval set.

SLIDE 30

Three different adaptation schemes

Notes:
  • Full batch mode: 1 transform for each subset (near and far)
  • Speaker mode: 1 transform for each speaker
  • Utterance mode: 1 transform for each utterance

SLIDE 31

Experiments for Cascaded Transforms

+ Cascading transforms in tandem is an effective way of using spectro-temporal information without a significant increase in the number of free parameters.
+ The best result comes from the cascaded transform of the cross transform and fMLLR.

[Chart: % average WER (roughly 58–67%) in full batch, speaker, and utterance modes for Cross Transform, Temporal Filter ◦ fMLLR, fMLLR ◦ Temporal Filter, Cross Transform ◦ fMLLR, and fMLLR ◦ Cross Transform]

SLIDE 32

Hybrid Cascaded Transforms

+ Full batch mode (fb): deals with session-wise reverberation and noise distortions.
+ Utterance mode (utt): removes speaker variations and other sentence-wise variations, e.g. due to speaker movement and background noise change.
+ Statistics smoothing (smooth): reference statistics provided from the batch mode.


[Diagram: utt1, utt2, …, uttN → Transform 1 in full batch mode → utt1a, utt2a, …, uttNa → Transform 2 in utterance mode (one per utterance) → utt1b, utt2b, …, uttNb]
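A hedged sketch of this hybrid scheme (function names and the toy mean-normalization transform in the test are illustrative, not the thesis' estimators): transform 1 is estimated once on all utterances pooled, transform 2 is re-estimated per utterance.

```python
import numpy as np

def hybrid_cascade(utterances, estimate_transform, apply_transform):
    """Hybrid cascaded transforms sketch.
    utterances: list of (T_i, D) feature arrays.
    estimate_transform(features) -> transform parameters,
    apply_transform(params, features) -> transformed features.
    Transform 1 is estimated in full batch mode on the pooled data,
    capturing session-wise distortions; transform 2 is estimated per
    utterance, capturing speaker- and sentence-wise variations."""
    pooled = np.vstack(utterances)
    t1 = estimate_transform(pooled)              # full batch mode
    stage1 = [apply_transform(t1, u) for u in utterances]
    out = []
    for u in stage1:
        t2 = estimate_transform(u)               # utterance mode
        out.append(apply_transform(t2, u))
    return out
```

With mean normalization as the toy transform, the batch stage removes the global offset and the utterance stage removes each utterance's residual offset.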

SLIDE 33

Cascaded Transforms vs. Hybrid Cascaded Transforms vs. Hybrid Cascaded Transforms + Stats. Smoothing

+ Full batch mode (fb): deals with session-wise reverberation and noise distortions.
+ Utterance mode (utt): removes speaker variations and other sentence-wise variations, e.g. due to speaker movement and background noise change.
+ Statistics smoothing (smooth): reference statistics provided from the batch mode.

Observations:
+ The combination of batch- and utterance-mode transforms performs the best.
+ (1) vs (2): 3% absolute reduction in WER.
+ (3): the best result.

SLIDE 34

Outline

➢An Introduction to Robust ASR
➢The 1st proposed method:

Feature Adaptation Using Spectro-Temporal Information

➢The 2nd proposed method (Chapter 3):

Combination of Feature Enhancement and VTS Model Compensation for Non-stationary Noisy Environments

➢The 3rd proposed method:

A Particle Filter Compensation Approach to Robust LVCSR

➢Conclusions and Future Directions

SLIDE 35

Combination of Feature Enhancement and VTS Model Compensation for Non-stationary Noisy Environments (NN-VTS method)

[Diagram: noisy features → (A) noise normalization → Y_NN → (B) VTS model compensation → Λ_y^NN]

(A) Noise Normalization: reduce the non-stationary characteristics of additive noise (B) VTS Model Compensation: handle the residual noise

SLIDE 36

Step 1: Noise Normalization

A fraction α of the instantaneous noise estimate n̂_t (from a noise estimator) is subtracted from the observed noisy feature y_t, and the average noise estimate n̄ = μ_n is added back (correspondingly scaled) to reduce musical noise, yielding the NN feature ŷ_t. The hyper-parameter α controls the degree of removal of the instantaneous noise; a DCT matrix maps the operations into the feature domain.
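A minimal sketch of this step. It assumes, as an illustrative simplification flagged here, that the same fraction α of the average noise is added back, and it works directly on feature arrays rather than going through the DCT matrix; names and shapes are illustrative:

```python
import numpy as np

def noise_normalize(Y, N_inst, mu_n, alpha=0.5):
    """Noise normalization (NN) sketch.
    Y: noisy features, shape (T, D).
    N_inst: per-frame instantaneous noise estimates, shape (T, D).
    mu_n: average noise estimate, shape (D,).
    Removes a fraction alpha of the instantaneous noise and adds back
    the same fraction of the average noise, so the residual noise is
    pushed toward the stationary average (reducing musical noise)."""
    return Y - alpha * N_inst + alpha * mu_n
```

At α = 1 (and a perfect instantaneous estimate) the residual noise collapses to the stationary average μ_n, which the subsequent VTS stage can then compensate; at α = 0 the features are left untouched.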

SLIDE 37

Step 2: Back-end Compensation

Approximations of the noisy acoustic models (the hyper-parameter α carries over from noise normalization):

Given the noise estimate λ_n = {μ_n, σ_n} obtained from the noisy features y_t, VTS model compensation [Li2009] maps the clean model λ_x to an approximated noisy model λ_ŷ through the mismatch function y = g(x, n) and its Jacobian matrix.
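As a hedged sketch of the standard VTS formulation (textbook form, assuming additive noise only and ignoring the channel term; not copied from the slide), the cepstral-domain mismatch function and its first-order compensation are:

```latex
% C is the DCT matrix; x, n, y are clean, noise, and noisy cepstral features.
y = g(x, n) = x + C \log\!\left(1 + \exp\!\left(C^{-1}(n - x)\right)\right)
% First-order Taylor expansion around the Gaussian means (\mu_x, \mu_n):
\mu_y \approx g(\mu_x, \mu_n), \qquad
\Sigma_y \approx J \Sigma_x J^{\top} + (I - J)\,\Sigma_n (I - J)^{\top},
\qquad J = \left.\frac{\partial g}{\partial x}\right|_{(\mu_x, \mu_n)}
```

Here J is the Jacobian matrix referred to on the slide; each Gaussian of the clean model λ_x is compensated this way to obtain the noisy model λ_ŷ.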

SLIDE 38

Step 2: Back-end Compensation

[Plot: residual noise variance as a function of α, with a minimal point] We expect that α = 0.5 is the best setting.

SLIDE 39

Experimental Settings

SLIDE 40

Results

Word accuracies evaluated on test sets A and B of AURORA2 database

SLIDE 41

Outline

➢An Introduction to Robust ASR
➢The 1st proposed method (Ch5):

Feature Adaptation Using Spectro-Temporal Information

➢The 2nd proposed method (Ch3):

Combination of Feature Enhancement and VTS Model Compensation for Non-stationary Noisy Environments

➢The 3rd proposed method (Ch4):

A Particle Filter Compensation Approach to Robust LVCSR

➢Conclusions and Future Directions

SLIDE 42

Particle Filter Compensation (PFC) Approach to Robust LVCSR

PFC (for background noise)

SLIDE 43

PFC Framework

Input speech features → Decoder 1 → phone sequence (aligned with the input features) → PFC feature enhancement → enhanced speech features → Decoder 2 → text

SLIDE 44

PFC for Clean Speech Feature Estimation

a) Using the Single Pass Retraining (SPR) technique
b) Using the particle filter algorithm

[Figure: the posterior density of the clean speech features for phone /a/]
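A hedged sketch of the particle-filter estimation step. This is a deliberately simplified model, not the thesis' implementation: scalar features, a random-walk state prior, and a Gaussian observation model, whereas the thesis draws state priors from phone-aligned HMM statistics obtained via SPR. All names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def pfc_enhance(noisy, noise_var=1.0, proc_var=0.25, n_particles=500):
    """Bootstrap particle filter estimating clean features from noisy ones.
    noisy: 1-D array of noisy feature values, one per frame."""
    particles = rng.normal(noisy[0], np.sqrt(noise_var), n_particles)
    estimates = []
    for y in noisy:
        # Predict: propagate particles through the state dynamics.
        particles = particles + rng.normal(0.0, np.sqrt(proc_var), n_particles)
        # Weight: likelihood of the noisy observation given each particle.
        w = np.exp(-0.5 * (y - particles) ** 2 / noise_var)
        w /= w.sum()
        # Estimate: posterior mean of the clean feature.
        estimates.append(np.dot(w, particles))
        # Resample: multinomial resampling to avoid particle degeneracy.
        particles = rng.choice(particles, size=n_particles, p=w)
    return np.array(estimates)
```

On a noisy version of a constant signal, the posterior-mean estimates typically sit closer to the clean values than the raw noisy observations do, which is the enhancement effect PFC exploits before the second decoding pass.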

slide-45
SLIDE 45

45

Experiments

➢ Experiments are conducted on the Aurora 4 data.
➢ The decoder is from the Hidden Markov Model Toolkit (HTK).
➢ A relative error reduction of only 5.3% is obtained (compared to the multi-condition-training GMM-HMM system).
➢ This work has been published in APSIPA ASC, Taiwan, 2013:
  • D. H. H. Nguyen, A. Mushtaq, X. Xiao, E. S. Chng, H. Li, and C.-H. Lee. A particle filter compensation approach to robust LVCSR. In APSIPA ASC, Taiwan, 2013.

slide-46
SLIDE 46

46

Introduction NN-VTS PFC for LVCSR ST Transform Conclusion

Outline

➢An Introduction to Robust ASR
➢The 1st proposed method (Ch5) – the major contribution:

Feature Adaptation Using Spectro-Temporal Information

➢The 2nd proposed method (Ch3):

Combination of Feature Enhancement and VTS Model Compensation for Non-stationary Noisy Environments

➢The 3rd proposed method (Ch4):

A Particle Filter Compensation Approach to Robust LVCSR

➢Conclusions and Future Directions

slide-47
SLIDE 47

47


Conclusions

➢ST Transform
  ➢ Proposed to use the EM algorithm to estimate a generalized linear transform that minimizes a KL-divergence-based cost function for feature adaptation
  ➢ Proposed a sparse ST transform, the cross transform; explored cascaded transforms and interpolation of statistics
➢NN-VTS
  ➢ Proposed the integration of noise normalization with VTS model compensation
➢PFC
  ➢ Extended the PFC framework to work on an LVCSR system

slide-48
SLIDE 48

48


Future Directions

➢Discover a sparse transform automatically by using sparsity constraints (e.g. applying the L1 norm)
➢Introduce nonlinear hidden nodes into the transform, similar to a multilayer perceptron or deep neural network
➢Investigate the proposed methods with existing state-of-the-art DNN acoustic models

slide-49
SLIDE 49

49


List of Publications

slide-50
SLIDE 50

50


References

slide-51
SLIDE 51

51


References

slide-52
SLIDE 52

52


References

slide-53
SLIDE 53

53


Thank you very much!

slide-54
SLIDE 54

54

Supplementary Slides