Feature-based Robust Techniques For Speech Recognition
Nguyen Duc Hoang Ha
08-Mar-2017
Supervisors:
- Assoc. Prof. Chng Eng Siong
- Prof. Li Haizhou
Outline
➢ An Introduction to Robust ASR
➢ The 1st proposed method (Ch5) – the major contribution
➢ The 2nd proposed method (Ch3)
➢ The 3rd proposed method (Ch4)
➢ Conclusions and Future Directions
Introduction ST Transform NN-VTS PFC Conclusion
The aim is to decode the speech signal into text.
[Diagram: speech signal → acoustic model (AM) + language model (LM) → word sequence w, e.g. "hello" /h e l o/]
➢ Siri (http://www.apple.com/ios/siri/)
➢ Amazon Echo
➢ Google Speech Recognition API
➢ Non-native speakers
➢ Dialect variations
➢ Disfluencies
➢ Out-of-vocabulary words
➢ Language modeling
➢ Noise robustness
[Diagram: mismatch between the clean speech model and the noisy speech features]
Two major approaches:
(A) Feature-based approach
(B) Model-based approach
(A) Feature-based approach. Examples: spectral subtraction [Boll1979], MMSE [Ephraim1984], fMLLR [Digalakis1995, Gales1998], ...
(B) Model-based approach. Examples: MAP model adaptation [Gauvain1994], MLLR/CMLLR model adaptation [Leggetter1995, Gales1998], vector Taylor series model adaptation [Acero2000, Li2009]
(C) Noisy data collection / simulation
Robust ASR approaches:
➢ Data collection / simulation
➢ Feature-based approach:
  - Clean feature estimation (e.g. SS [Boll1979], MMSE [Ephraim1984], ...)
  - Feature transformation (e.g. fMLLR [Digalakis1995, Gales1998]), ...
  - Filtering approach (e.g. RASTA [Hermansky1994], ...)
➢ Model-based approach:
  - MAP model adaptation [Gauvain1994]
  - MLLR, CMLLR model adaptation [Leggetter1995, Gales1998]
  - VTS model compensation [Acero2000, Li2009], ...
➢ Deep learning approaches (e.g. DNN AM [Hinton2012]), ...
Proposed methods:
➢ (A1) ST-Transform (for background noise and reverberation)
➢ (A2) NN – (B2) VTS (for non-stationary noise)
➢ (A3) PFC-LVCSR (for background noise)
1) Spectra-Temporal Transformation (ST-Transform)
   - ICASSP, Italy, 2014.
   - IEEE/ACM Transactions on Audio, Speech, and Language Processing, PP(99):1–1, 2016. (Contributed to success at the REVERB 2014 Challenge under the clean-condition training scheme.)
2) Noise Normalization (NN) – Vector Taylor Series Model Compensation (VTS)
3) Particle Filter Compensation (PFC) for LVCSR
   - APSIPA ASC, Taiwan, 2013.
http://reverb2014.dereverberation.com/introduction.html
Contributions of ST Transform
(A1) ST-Transform
Noisy features y_{1:T} → ST Transform → transformed features x̂_{1:T}.
The ST transform W is estimated to minimize the Kullback–Leibler (KL) divergence between the distribution of the transformed features and the reference distribution of the training features.
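As an illustrative sketch of this criterion (not the thesis's EM-based estimation), the snippet below measures the KL divergence between two Gaussians and shows that a linear transform mapping the noisy-feature statistics onto the training statistics drives the divergence toward zero. The whitening-based choice of W, the data, and all variable names are assumptions for illustration:

```python
import numpy as np

def gauss_kl(mu0, cov0, mu1, cov1):
    """KL( N(mu0, cov0) || N(mu1, cov1) ) for full-covariance Gaussians."""
    d = mu0.shape[0]
    inv1 = np.linalg.inv(cov1)
    diff = mu1 - mu0
    return 0.5 * (np.trace(inv1 @ cov0) + diff @ inv1 @ diff - d
                  + np.log(np.linalg.det(cov1) / np.linalg.det(cov0)))

rng = np.random.default_rng(0)
Y = rng.normal(1.0, 2.0, size=(500, 3))        # "noisy" features y_{1:T}
mu_ref, cov_ref = np.zeros(3), np.eye(3)       # training-feature statistics

# Whitening-style linear transform: it maps the noisy-feature statistics
# onto the reference statistics, driving the KL criterion toward zero.
mu_y, cov_y = Y.mean(axis=0), np.cov(Y.T)
W = np.linalg.cholesky(np.linalg.inv(cov_y)).T  # so W @ cov_y @ W.T == I
X_hat = (Y - mu_y) @ W.T + mu_ref              # transformed features

kl_before = gauss_kl(mu_y, cov_y, mu_ref, cov_ref)
kl_after = gauss_kl(X_hat.mean(axis=0), np.cov(X_hat.T), mu_ref, cov_ref)
```

Here the transform is solved in closed form because both distributions are single Gaussians; the thesis's EM estimation handles the general case.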
General feature transformation: input features x_{1:T} → y = f(x) → transformed features y_{1:T}; the transform is evaluated by the KL divergence between the distribution of the transformed features and the distribution of the training features.
A) e.g. CMN [Atal1974], MVN [Viikki1998]
B) e.g. fMLLR [Digalakis1995, Gales1998]
C) e.g. RASTA [Hermansky1994], TSN [Xiao2009]
Matrix form of W
[Equation: the estimation objective combines a KL-divergence criterion on the output features and their covariance matrix with an L2-norm regularization term.]
Statistics smoothing: interpolate the statistics computed from the test (adaptation) data with the statistics computed from training or prior data.
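The interpolation idea can be sketched as follows, assuming Gaussian (mean/covariance) statistics and a fixed interpolation weight `rho`; the actual weighting in the thesis may differ (e.g. depend on the amount of adaptation data):

```python
import numpy as np

def smooth_stats(mu_test, cov_test, mu_prior, cov_prior, rho=0.5):
    """Interpolate adaptation-data statistics with prior statistics.

    rho is an illustrative interpolation weight: rho=1 trusts the test
    data fully, rho=0 falls back entirely on the prior statistics.
    """
    mu = rho * mu_test + (1.0 - rho) * mu_prior
    cov = rho * cov_test + (1.0 - rho) * cov_prior
    return mu, cov

# Toy 1-D example: sparse test data pulled toward reliable prior stats.
mu_s, cov_s = smooth_stats(np.array([2.0]), np.array([[4.0]]),
                           np.array([0.0]), np.array([[1.0]]), rho=0.25)
```

With little adaptation data the smoothed statistics stay close to the prior, which stabilizes the transform estimation.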
A) e.g. CMN, MVN, HEQ
B) e.g. fMLLR
C) e.g. RASTA, ARMA, TSN
➢ REVERB Challenge 2014 benchmark task for noisy and reverberant speech
➢ Clean-condition training scheme:
  - Training data: 7861 clean utterances from the WSJCAM0 database (about 17.5 hours from 92 speakers)
  - Speech features: 13 MFCCs + 13 Δ + 13 ΔΔ, with MVN post-processing
  - Acoustic model: 3115 tied states, 10 mixtures/state
➢ The development (dev) and evaluation (eval) data sets:
  - Real meeting-room recordings from the MC-WSJ-AV corpus
  - Near setting: 100 cm distance between the microphone and the speaker
  - Far setting: 250 cm distance between the microphone and the speaker
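The MVN post-processing step above can be sketched as per-utterance mean and variance normalization; this is an illustrative implementation, and the variable names are assumptions:

```python
import numpy as np

def mvn(feats, eps=1e-8):
    """Mean and variance normalization of a (frames x dims) feature
    matrix: each dimension is shifted to zero mean and scaled to unit
    variance over the utterance."""
    mu = feats.mean(axis=0)
    sigma = feats.std(axis=0)
    return (feats - mu) / (sigma + eps)

# 200 frames of a 39-dim feature stream (13 MFCC + 13 delta + 13 delta-delta).
rng = np.random.default_rng(1)
F = mvn(rng.normal(3.0, 2.0, size=(200, 39)))
```

Per-utterance MVN removes constant channel offsets and gross scale differences before the transforms are estimated.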
The window length is fixed at 21 frames for the temporal filter, the cross transform, and the full ST transform in all experiments.
Notes on estimation modes:
- Full batch mode: one transform for each subset (near and far)
- Speaker mode: one transform for each speaker
- Utterance mode: one transform for each utterance
+ Cascading transforms in tandem is an effective way of using spectro-temporal information without a significant increase in the number of free parameters.
+ The best result is observed for the cascaded transform of the cross transform and fMLLR.
[Chart: % average WER (y-axis range 58–67) for Cross Transform, Temporal Filter ◦ fMLLR, fMLLR ◦ Temporal Filter, Cross Transform ◦ fMLLR, and fMLLR ◦ Cross Transform, in full batch, speaker, and utterance modes]
[Diagram: input features → Transform 1 → Transform 2 → output features]
Cascade pipeline: utterances utt1, utt2, ..., uttN are first processed by Transform 1 in full batch mode, giving utt1a, utt2a, ..., uttNa; each utterance is then processed by its own Transform 2 in utterance mode, giving utt1b, utt2b, ..., uttNb.
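The cascade can be illustrated with a toy example, using simple mean normalization as a stand-in for the actual ST transforms: a batch-mode stage removes the session-level offset shared by all utterances, then a per-utterance stage removes residual utterance-level offsets. The data and the mean-removal stand-in are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
# Four "utterances" sharing a session-level shift (5.0) plus per-utterance shifts.
utts = [rng.normal(5.0 + k, 1.0, size=(100, 3)) for k in range(4)]

# Transform 1, full batch mode: one correction estimated from all utterances.
batch_mu = np.concatenate(utts).mean(axis=0)
stage1 = [u - batch_mu for u in utts]

# Transform 2, utterance mode: one correction per utterance for what remains.
stage2 = [u - u.mean(axis=0) for u in stage1]

# After both stages, each utterance is centered.
residual = max(abs(u.mean()) for u in stage2)
```

The batch stage uses lots of data for a stable session-level estimate, while the utterance stage absorbs sentence-wise variation with very little data per transform.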
+ Full batch mode (fb): deals with session-wise reverberation and noise distortions.
+ Utterance mode (utt): removes speaker variations and other sentence-wise variations, e.g. due to speaker movement and background noise change.
+ Statistics smoothing (smooth): reference statistics are provided from the batch mode.
Observations:
+ The combination of batch-mode and utterance-mode transforms performs the best.
+ (1) vs (2): 3% absolute reduction in WER.
+ (3): the best result.
(A) Noise Normalization (NN): reduce the non-stationary characteristics of additive noise, producing the NN feature y_NN.
(B) VTS model compensation: handle the residual noise.
Noise normalization: a noise estimation module produces an instantaneous noise estimate n_t and an average noise estimate n̄ = μ_n from the observed noisy feature y_t. The hyper-parameter α controls the degree of removal of the instantaneous noise, and adding the average noise estimate back reduces musical noise; the operations are applied through the DCT matrix.
[Block diagram: y_t, the instantaneous noise estimate, the average noise estimate, and the weight α combine to produce the NN feature.]
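The role of α can be sketched as follows (an illustrative simplification: the exact NN formulation in the thesis operates through the DCT matrix and may differ in detail): a fraction α of the deviation of the instantaneous noise estimate from the average noise is subtracted, so the time-varying noise component shrinks while the average noise level is preserved for VTS to handle.

```python
import numpy as np

def noise_normalize(y, n_inst, mu_n, alpha=0.5):
    """Remove a fraction alpha of the instantaneous noise estimate and add
    back the same fraction of the average noise, so only the time-varying
    (non-stationary) part of the noise is attenuated."""
    return y - alpha * (n_inst - mu_n)

rng = np.random.default_rng(3)
T, D = 200, 13
y = rng.normal(0.0, 1.0, size=(T, D))        # observed noisy features y_t
n_inst = rng.normal(0.5, 0.3, size=(T, D))   # instantaneous noise estimates n_t
mu_n = n_inst.mean(axis=0)                   # average noise estimate

y_nn = noise_normalize(y, n_inst, mu_n, alpha=0.5)
```

With alpha=0 the features pass through unchanged; with alpha=1 the full instantaneous deviation is removed, which risks musical noise when the noise estimate is unreliable.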
Approximations of noisy acoustic models via VTS model compensation [Li2009]: given the clean model λ_x and the noise model λ_n = {μ_n, σ_n} from noise estimation (with the hyper-parameter α carried over from noise normalization), the compensated noisy model λ̂_y is derived from the mismatch function y = g(x, n) and its Jacobian matrix.
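A sketch of first-order VTS compensation in the log-spectral domain with diagonal covariances, using the standard mismatch function y = log(e^x + e^n). This is the textbook form from the VTS literature [Acero2000, Li2009], not necessarily the exact formulation used in the thesis:

```python
import numpy as np

def vts_compensate(mu_x, var_x, mu_n, var_n):
    """First-order VTS compensation of a diagonal Gaussian.

    Mismatch function: y = g(x, n) = log(exp(x) + exp(n)).
    The noisy mean is g evaluated at the means; the variances are
    propagated through the Jacobian of g at that point.
    """
    mu_y = np.logaddexp(mu_x, mu_n)                # g(mu_x, mu_n)
    G = 1.0 / (1.0 + np.exp(mu_n - mu_x))          # Jacobian dg/dx at the means
    var_y = G**2 * var_x + (1.0 - G)**2 * var_n    # propagated variances
    return mu_y, var_y

# When speech dominates the noise (x >> n), compensation barely moves the model.
mu_y, var_y = vts_compensate(np.array([10.0]), np.array([1.0]),
                             np.array([0.0]), np.array([1.0]))
```

When the noise mean approaches the speech mean, G shrinks and the noise variance term takes over, which is exactly the regime NN-VTS targets.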
The residual noise variance has a minimal point; we expect α = 0.5 to be the best setting.
[Table: word accuracies evaluated on test sets A and B of the AURORA2 database]
(A3) PFC (for background noise)
PFC pipeline: input speech features → Decoder 1 → phone sequence (aligned with the input features) → PFC feature enhancement → enhanced speech features → Decoder 2 → text.
a) Using the Single-Pass Retraining (SPR) technique
b) Using the particle filter algorithm to estimate the posterior density of the clean speech features (e.g. for phone /a/)
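A minimal bootstrap particle filter on a 1-D toy signal illustrates how a posterior over the clean features can be tracked from noisy observations. Everything here is illustrative: the actual PFC uses phone-dependent models obtained from the aligned phone sequence, not the random-walk dynamics and Gaussian observation model assumed below:

```python
import numpy as np

rng = np.random.default_rng(4)
T, N = 50, 500
x_true = np.cumsum(rng.normal(0.0, 0.1, T))   # slowly varying "clean" feature
y_obs = x_true + rng.normal(0.0, 0.5, T)      # noisy observations

particles = rng.normal(0.0, 1.0, N)
estimates = np.empty(T)
for t in range(T):
    # Propagate particles through the (assumed) random-walk dynamics.
    particles = particles + rng.normal(0.0, 0.1, N)
    # Weight by the Gaussian observation likelihood, then normalize.
    w = np.exp(-0.5 * ((y_obs[t] - particles) / 0.5) ** 2)
    w /= w.sum()
    # Posterior-mean estimate of the clean feature at time t.
    estimates[t] = np.dot(w, particles)
    # Resample to concentrate particles in high-likelihood regions.
    particles = particles[rng.choice(N, size=N, p=w)]

mse_pf = np.mean((estimates - x_true) ** 2)
mse_raw = np.mean((y_obs - x_true) ** 2)
```

The filtered estimate is closer to the clean trajectory than the raw noisy observations, which is the enhancement effect PFC exploits before the second decoding pass.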
➢ Experiments are conducted on the Aurora 4 data
➢ The decoder is from the hidden Markov model toolkit (HTK)
➢ A relative error reduction of only 5.3% is obtained (compared to the baseline)
➢ This work has been published at APSIPA ASC, Taiwan, 2013.
➢ ST Transform
  - Proposed to use the EM algorithm to estimate the generalized linear transform
  - Proposed a sparse ST transform, the Cross transform, and explored cascading transforms
➢ NN-VTS
  - Proposed the integration of noise normalization with VTS model compensation
➢ PFC
  - Extended the PFC framework to work on an LVCSR system
Future directions:
➢ Discover a sparse transform automatically
➢ Introduce nonlinear hidden nodes into the transform
➢ Investigate the proposed methods with ...