SNU Spoken Language Processing Lab / 서울대학교 음성언어처리연구실
ASR Chapter 9: Feature Representation Learning in Deep Learning Networks
Abstract for Chapter 9
- Deep neural networks that jointly learn the feature representation and the classifier.
- Through many layers of nonlinear processing, DNNs transform the raw input feature into a more invariant and discriminative representation that can be better classified by the log-linear model.
- DNNs learn a hierarchy of features.
- The lower-level features typically catch local patterns. These patterns are very sensitive to
changes in the raw feature.
- The higher-level features are built upon the low-level features and are more abstract and
invariant to the variations in the raw feature.
- We demonstrate that the learned high-level features are robust to speaker and
environment variations.
ASR Chapter 9:
Feature Representation Learning in Deep Learning Networks (Part 1)
조성재, Interdisciplinary Program in Cognitive Science
Contents
- 9.1 Joint Learning of Feature Representation and Classifier
- 9.2 Feature Hierarchy
- 9.3 Flexibility in Using Arbitrary Input Features
9.1 Joint Learning of Feature Representation and Classifier
- Deep vs. shallow models
- Deep models: DNNs
- Shallow models: GMM, SVM
- Comparing performance of the models in speech recognition
- DNN > GMM
- DNN > SVM
- Why?
- Because the DNNs are able to learn complicated feature representations and classifiers jointly.
9.1 Joint Learning of Feature Representation and Classifier
- Feature engineering
- In the conventional shallow models (GMMs, SVMs), feature engineering is the key to the success of the system.
- Practitioner’s main job is to construct features that perform well.
- Better features often come from someone who has great domain knowledge.
- Examples of feature sets from feature engineering
- SIFT: scale-invariant feature transform
- In image recognition
- MFCC: mel-frequency cepstrum coefficients
- In speech recognition
9.1 Joint Learning of Feature Representation and Classifier
- Deep models such as DNNs, however, do not require hand-crafted high-level features.
- Good raw features still help, though, since the existing DNN learning algorithms may otherwise produce an underperforming system.
- DNNs automatically learn the feature representations and classifiers jointly.
9.1 Joint Learning of Feature Representation and Classifier
- In the DNN, the combination of all hidden layers can be considered a feature learning module.
- The composition of simple nonlinear transformations results in a very complicated nonlinear transformation.
- The last layer
- = a softmax layer
- = a simple log-linear classifier
- = a maximum entropy (MaxEnt) model
- Fig. 9.2 DNN: a joint feature representation and classifier learning view
9.1 Joint Learning of Feature Representation and Classifier
- In the DNN, the estimation of the posterior probability $q(z = t \mid \mathbf{p})$ (where $t$: target class, $\mathbf{p}$: observation vector) can be considered (interpreted) as a two-step nonstochastic process:
  - Step 1: transformation $\mathbf{p} \rightarrow \mathbf{w}^{1} \rightarrow \mathbf{w}^{2} \rightarrow \cdots \rightarrow \mathbf{w}^{M-1}$ through the hidden layers
  - Step 2: $q(z = t \mid \mathbf{p})$ is estimated from $\mathbf{w}^{M-1}$ using the log-linear model.
- "Log-linear model" (https://en.wikipedia.org/wiki/Log-linear_model): $\exp\big(d + \sum_j x_j\, g_j(Y)\big)$; in the DNN this becomes $\exp\big(d + \sum_j x_j^{M} w_j^{M-1}\big)$, i.e., the last layer's weights play the role of the log-linear parameters applied to $\mathbf{w}^{M-1}$.
  - $Y$: variables
  - $g_j(Y)$: quantities that are functions of the variable $Y$
  - $d, x_j$: model parameters
9.1 Joint Learning of Feature Representation and Classifier
- A MaxEnt model [25] estimates the optimal distribution $q^{*}$ with the following scheme:
  - Maximization: $q^{*} = \arg\max_{q \in Q} I(q)$
  - Entropy: $I(q) = -\sum_{y} q(y)\log q(y)$
- The last layer of the DNN becomes a MaxEnt model because of the softmax layer (see the sketch below):
  - $q(y) \leftarrow q(z = t \mid \mathbf{p}) = \dfrac{\exp(A_{t})}{\sum_{l=1}^{L}\exp(A_{l})}$
- In the conventional MaxEnt model, features are manually designed.
- In the DNN, the features feeding its MaxEnt (softmax) layer are learned automatically.
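A minimal sketch (not from the book) of the last layer viewed as a log-linear / MaxEnt classifier on the learned features $\mathbf{w}^{M-1}$; the function name, dimensions, and weights below are illustrative only.

```python
import numpy as np

def softmax_output_layer(w_top, X_out, d):
    """Log-linear (MaxEnt-style) classifier on top of the learned features.

    w_top : (D,)  feature vector produced by the last hidden layer (w^{M-1})
    X_out : (L, D) output-layer weights (one row of parameters per class)
    d     : (L,)  output-layer biases
    Returns the posterior q(z = t | p) for every class t.
    """
    a = X_out @ w_top + d          # class activations A_t = d_t + sum_j x_tj * w_j
    a = a - a.max()                # numerical stabilization
    e = np.exp(a)
    return e / e.sum()             # exp(A_t) / sum_l exp(A_l)

# Toy usage with made-up sizes: 5 learned features, 3 classes.
rng = np.random.default_rng(0)
w_top = rng.normal(size=5)
X_out, d = rng.normal(size=(3, 5)), np.zeros(3)
print(softmax_output_layer(w_top, X_out, d))   # posterior over 3 classes, sums to 1
```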
9.1 Joint Learning of Feature Representation and Classifier
- Manual feature construction works fine
- for tasks (group 1) that people can easily inspect and know what feature to use
- but not for tasks (group 2) whose raw features are highly variable.
- Group 2: speech recognition
- In DNNs, however, the features
- are defined by the first 𝑀 − 1 layers and
- are jointly learned with the MaxEnt model from the data automatically.
- DNNs eliminate the tedious manual feature construction.
- DNNs have the potential to extract good (= invariant and discriminative) features that are impossible to construct manually.
9.1 Joint Learning of Feature Representation and Classifier – Questions (O)
Could you recommend a paper that theoretically shows that deep neural networks are good at feature representation learning? – 변석현
- The universal approximation theorem (Hornik et al. 1989; Cybenko 1989):
- "Regardless of what function we are trying to learn, we know that a large MLP will be able to represent this function."
- → DNNs are good at learning a complex function.
- A deep neural network builds a hierarchical composition of features.
- In a neural network, weighted sums and activation functions combine features to generate the features of the next layer.
- Using deeper models can reduce the number of units required to represent the desired function.
- [References]
  - Goodfellow, I., Bengio, Y., and Courville, A. C. (2016). "Universal Approximation Properties and Depth." Deep Learning. The MIT Press. pp. 197–200.
  - Hornik, K., Stinchcombe, M., and White, H. (1989). Multilayer feedforward networks are universal approximators. Neural Networks, 2, 359–366.
  - Cybenko, G. (1989). Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals, and Systems, 2, 303–314.
  - Yu, D., Seltzer, M.L., Li, J., Huang, J.T., Seide, F.: Feature learning in deep neural networks—studies on speech recognition tasks. In: Proceedings of the ICLR (2013)
9.2 Feature Hierarchy
- DNNs learn feature representations that are suitable for the classifier.
- DNNs learn a feature hierarchy.
- What is a feature hierarchy?
- Feature hierarchy: Raw input feature → low-level features → higher-level features
- Low-level features catch local patterns.
- Local patterns are very sensitive to changes in the input features.
- Higher-level features are built on the low-level features
- More abstract than the low-level features
- Invariant/robust to the input feature variations
9.2 Feature Hierarchy
- The feature hierarchy learned from the ImageNet dataset
9.2 Feature Hierarchy
- A neuron is considered saturated if its activation satisfies $w_j^m < 0.01$ or $w_j^m > 0.99$.
- The lower layers: small percentage of saturated neurons
- The higher layers: large percentage of saturated neurons (see the saturation-rate sketch below)
- [< 0.01] The training label is a one-hot vector → the training labels are sparse → the associated features are sparse → the majority of the saturated neurons are deactivated (activation near 0).
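As a small illustration (not from the slides), the saturation statistic of Fig. 9.4 can be computed as below; the activations here are randomly generated, so the printed percentage is only illustrative.

```python
import numpy as np

def saturation_rate(activations, low=0.01, high=0.99):
    """Fraction of sigmoid activations that are saturated (< low or > high)."""
    a = np.asarray(activations)
    return np.mean((a < low) | (a > high))

# Toy usage: random pre-activations pushed through the sigmoid.
rng = np.random.default_rng(0)
z = rng.normal(scale=3.0, size=10_000)      # pre-activations of one layer
a = 1.0 / (1.0 + np.exp(-z))                # sigmoid outputs in (0, 1)
print(f"saturated neurons: {saturation_rate(a):.1%}")
```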
9.2 Feature Hierarchy
- The magnitude of the majority of the weights is typically very small.
- The magnitude of 98% of the weights in all layers except the input layer (layer 1) is less than 0.5.
9.2 Feature Hierarchy
- Each element of $\mathbf{w}^{m+1} \bullet (1 - \mathbf{w}^{m+1})$ is at most 0.25 (the maximum of the sigmoid derivative).
- As shown in Fig. 9.4, a large percentage of hidden neurons are saturated; thus, for most elements, $\mathbf{w}^{m+1} \bullet (1 - \mathbf{w}^{m+1}) \ll 0.25$. – fact 1
- As shown in Fig. 9.5, the magnitude of the majority of the weights in $\mathbf{X}^{m+1}$ is very small; for 98% of the elements, $|x_j^{m+1}| < 0.5$. – fact 2
- As shown in Fig. 9.6, the average of $\big\lVert \operatorname{diag}\big(\mathbf{w}^{m+1} \bullet (1 - \mathbf{w}^{m+1})\big)\, (\mathbf{X}^{m+1})^{T} \big\rVert_2$ is around 0.5. – fact 3
- This means that, mostly, $\lVert \boldsymbol{\varepsilon}^{m} \rVert_2 > \lVert \boldsymbol{\varepsilon}^{m+1} \rVert_2$ (based on facts 1, 2, and 3): a small perturbation of the input shrinks as it propagates to higher layers.
- Features generated by higher layers are more invariant to variations than those represented by lower layers.
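A minimal numerical sketch of this argument, assuming a made-up sigmoid network with small random weights (layer sizes and weight scale are illustrative, not the network from the book): a small input perturbation is propagated through the layers and its norm typically shrinks with depth.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical sigmoid network: 6 layers of 512 units with small random weights,
# loosely mimicking the regime of Figs. 9.4-9.6 (mostly small weights).
rng = np.random.default_rng(1)
dims = [512] * 7
weights = [rng.normal(scale=0.1, size=(dims[i + 1], dims[i])) for i in range(6)]
biases = [rng.normal(scale=0.1, size=dims[i + 1]) for i in range(6)]

x = rng.normal(size=dims[0])
eps = 0.01 * rng.normal(size=dims[0])   # small perturbation of the raw input

a, a_pert = x, x + eps
for m, (W, b) in enumerate(zip(weights, biases), start=1):
    a, a_pert = sigmoid(W @ a + b), sigmoid(W @ a_pert + b)
    print(f"layer {m}: ||eps^{m}|| = {np.linalg.norm(a_pert - a):.5f}")
# The perturbation norm usually decreases with depth: higher-layer features are
# more invariant to small variations of the raw input.
```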
9.2 Feature Hierarchy
- The maximum norm over the same development set is larger than one.
- This large norm enlarges the differences around the class boundaries so that the model keeps its discrimination ability.
- Robust to noise, sensitive to signal.
- [?] These large-norm cases also cause noncontinual points on the objective function, as indicated by [29].
9.2 Feature Hierarchy
- A general framework of feature processing
- Normalization: average removal $Y \leftarrow Y - E[Y]$; local contrast normalization $Y_{\text{local}} \leftarrow \dfrac{Y_{\text{local}} - E[Y_{\text{local}}]}{\sigma_{\text{local}}}$; variance normalization (a computationally efficient normalization technique for robust speech recognition)
- Filter bank: projects the feature into a higher-dimensional space (input feature → filters for local patterns)
- Nonlinearity: a series of nonlinear transformations that make complicated decision boundaries possible
- Pooling: extracts invariant features and reduces the dimension (a toy sketch of the whole chain follows below)
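A toy sketch of this normalization → filter bank → nonlinearity → pooling chain; all names and dimensions are made up, and the random projection only stands in for a real or learned filter bank.

```python
import numpy as np

def mean_variance_normalize(Y, eps=1e-8):
    """Average removal and variance normalization along the time (frame) axis."""
    return (Y - Y.mean(axis=0)) / (Y.std(axis=0) + eps)

def feature_pipeline(Y, n_filters=64, pool=2, seed=0):
    """Toy normalize -> filter bank -> nonlinearity -> pooling chain.

    Y: (frames, dims) raw feature matrix. The random projection stands in for a
    hand-crafted or learned filter bank; ReLU stands in for the nonlinearity.
    """
    rng = np.random.default_rng(seed)
    Y = mean_variance_normalize(Y)
    F = rng.normal(scale=0.1, size=(Y.shape[1], n_filters))  # "filter bank"
    H = np.maximum(Y @ F, 0.0)                               # project + nonlinearity
    T = (H.shape[0] // pool) * pool
    return H[:T].reshape(-1, pool, n_filters).max(axis=1)    # max-pool over time

feats = feature_pipeline(np.random.default_rng(1).normal(size=(100, 40)))
print(feats.shape)   # (50, 64): fewer frames, higher-dimensional, more invariant
```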
9.2 Feature Hierarchy – Questions (O)
(1) What does Figure 9.6 show? (2) Could you also explain the equation written in the figure caption? (3) Section 9.3 mentions the PLP feature; what is it and how does it differ from MFCC? – 장한솔
- (1) For each layer $m$, it plots the coefficient in Equation 9.3 that determines how much $\boldsymbol{\varepsilon}^{m+1}$ changes when $\boldsymbol{\varepsilon}^{m}$ changes.
- (2) $\operatorname{diag}\big(\tau'(A^{m+1}(\mathbf{w}^{m}))\big)\, (\mathbf{X}^{m+1})^{T}$ approximates the difference between the two function values by the derivative (a first-order Taylor approximation):
  - $\tau\big(A^{m+1}(\mathbf{w}^{m} + \boldsymbol{\varepsilon}^{m})\big) - \tau\big(A^{m+1}(\mathbf{w}^{m})\big) \approx \operatorname{diag}\big(\tau'(A^{m+1}(\mathbf{w}^{m}))\big)\, (\mathbf{X}^{m+1})^{T} \boldsymbol{\varepsilon}^{m}$
  - $\dfrac{\tau\big(A^{m+1}(\mathbf{w}^{m}+\boldsymbol{\varepsilon}^{m})\big) - \tau\big(A^{m+1}(\mathbf{w}^{m})\big)}{\boldsymbol{\varepsilon}^{m}} \approx \dfrac{\partial\, \tau\big(A^{m+1}(\mathbf{w}^{m})\big)}{\partial\, \mathbf{w}^{m}} = \operatorname{diag}\big(\tau'(A^{m+1}(\mathbf{w}^{m}))\big)\, (\mathbf{X}^{m+1})^{T}$
9.2 Feature Hierarchy – Questions (O)
(3) Section 9.3 mentions the PLP feature; what is it and how does it differ from MFCC? – 장한솔 (part 3 of the question above)
PLP (Perceptual Linear Prediction) vs. MFCC (Mel-Frequency Cepstral Coefficients)
- In common
  - The two most often used raw features in speech recognition applications
  - Both are extracted from the raw input signal
- Differences
  - PLP: analysis procedure motivated by human speech perception
  - MFCC: cepstral analysis; heuristically designed; sensitive to noise
- Papers
  - H. Hermansky, 1990, "Perceptual linear predictive (PLP) analysis of speech," J. Acoust. Soc. Amer., 87(4), 1738–1752.
  - S.B. Davis and P. Mermelstein, 1980, "Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences," IEEE Trans. Acoust. Speech Signal Process., 28(4), 357–366.
9.2 Feature Hierarchy – Questions (O)
[Figure: the complete MFCC extraction pipeline; DCT: the Discrete Cosine Transform]
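For reference, a minimal sketch of MFCC and log Mel filter-bank extraction with the librosa toolkit (librosa is not mentioned in the slides; the audio file name is hypothetical, and n_mfcc=13 / n_mels=40 echo the feature counts discussed in Sect. 9.3).

```python
import librosa

# "example.wav" is a hypothetical file; 16 kHz corresponds to wideband speech.
y, sr = librosa.load("example.wav", sr=16000)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)              # (13, frames)
log_mel = librosa.power_to_db(
    librosa.feature.melspectrogram(y=y, sr=sr, n_mels=40))      # 40 log Mel filters
print(mfcc.shape, log_mel.shape)
```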
9.2 Feature Hierarchy – Questions (O)
On p.160, the claim that higher-level features are invariant to input variations is derived with equations that assume the sigmoid activation function. Can the same explanation be given for other activation functions? ReLU is widely used these days; if the activation function is replaced with ReLU, does the derivation go through in the same way? – 최성호
- $\lVert \boldsymbol{\varepsilon}^{m+1} \rVert \le \big\lVert \operatorname{diag}\big(\tau'(A^{m+1}(\mathbf{w}^{m}))\big)\, (\mathbf{X}^{m+1})^{T} \boldsymbol{\varepsilon}^{m} \big\rVert$
- As shown in Fig. 9.5, about 95% of the elements $x_j^{m+1}$ of $\mathbf{X}^{m+1}$ have magnitude smaller than 0.5.
- ReLU: $0 \le \tau' = \frac{d}{dy}\mathrm{ReLU}(y) \le 1$. Since the derivative of ReLU is either 0 or 1, each element of $\operatorname{diag}\big(\tau'(A^{m+1}(\mathbf{w}^{m}))\big)\, (\mathbf{X}^{m+1})^{T}$ is either 0 or smaller than 0.5 in magnitude.
- Therefore the elements of $\boldsymbol{\varepsilon}^{m+1}$ are on average smaller than those of $\boldsymbol{\varepsilon}^{m}$; the same explanation works for ReLU.
- The same reasoning applies to other commonly used activation functions such as tanh, ELU, and LeakyReLU, because their derivatives also lie in $[0, 1]$ (a numerical check of these derivative ranges follows below):
  - tanh: $0 \le \tau' = \frac{d}{dy}\tanh(y) \le 1$
  - $\mathrm{ELU}(y) = \begin{cases} y, & y > 0 \\ \beta(\exp(y) - 1), & y \le 0 \end{cases}$ with $0 < \beta < 1$, so $0 < \tau' = \frac{d}{dy}\mathrm{ELU}(y) \le 1$
  - $\mathrm{LeakyReLU}(y) = \begin{cases} y, & y > 0 \\ \beta y, & y \le 0 \end{cases}$ with $0 < \beta < 1$, so $0 < \tau' = \frac{d}{dy}\mathrm{LeakyReLU}(y) \le 1$
9.2 Feature Hierarchy – Questions (O)
On p.159 hierarchical features are introduced: lower-level features are said to be sensitive to the input features and higher-level features less so. Reading deep learning papers, one often sees a pre-trained network used by taking (1) not the output of the last layer but the features of some intermediate layer. (1) Is that because those features are not sensitive to the input features? (2) Does this also happen in speech recognition? – 김준호
- (1) The last layer is removed simply because it performs a specific classification task; it is removed so that the rest of the network can be reused for other tasks.
- (2) Formerly WaveNet and these days DeepSpeech are representative pre-trained models used for speech recognition. The paper below is an example of research on a good pre-trained model for speech recognition.
- G. E. Dahl, D. Yu, L. Deng and A. Acero, "Context-Dependent Pre-Trained Deep Neural Networks for Large-Vocabulary Speech Recognition," IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 1, pp. 30–42, Jan. 2012.
9.2 Feature Hierarchy – Questions (O)
On p.161 it says "These large norm cases also cause noncontinual points on the objective function." I do not quite understand why these are noncontinual points of the objective function. Could you explain it a bit more simply? – 육지은
- The presenter is not entirely sure about this part either; the following is the presenter's interpretation.
- Treat the problem here as a minimization problem.
- Noncontinual ≠ noncontinuous; "noncontinual points" is a phrase that does not turn up in searches.
- According to [26], weights with a large norm easily lead to a wrong label prediction even for a small deviation $\boldsymbol{\varepsilon}$ of the input $\mathbf{y}$. That is, in the objective-function space $K(\mathbf{y}, \mathbf{X})$, when $\mathbf{X}$ is large, a small change of $\mathbf{y}$ ($\mathbf{y} \leftarrow \mathbf{y} + \boldsymbol{\varepsilon}$) makes $K(\mathbf{y} + \boldsymbol{\varepsilon}, \mathbf{X})$ increase sharply. Such a point is what the presenter takes to be a "noncontinual point."
9.2 Feature Hierarchy – Questions (O)
On p.159, low-level and high-level features extracted from ImageNet images are compared, the former described as input-variant and the latter less so. If the same mechanism is used in ASR, what would be examples of the information captured as low-level vs. high-level features? – 김낙훈
- Fig. 9.3 on p.159 visualizes the first paragraph of Sect. 9.2. That low-level features are input-variant while high-level features are less so is described with ASR examples from the second paragraph of Sect. 9.2 onward.
- For more detail, see reference [33].
- [33] Yu, D., Seltzer, M.L., Li, J., Huang, J.T., Seide, F.: Feature learning in deep neural networks—studies on speech recognition tasks. In: Proceedings of the ICLR (2013)
9.3 Flexibility in Using Arbitrary Input Features
- MFCC: 13 features. Most often used raw features. Derived from the MS-LFB.
- fMPE, BMMI: training methods that make the MFCC features better
- MS-LFB: the Mel-scaled log filter-bank features; they contain more raw features than MFCC does.
- 24 MS-LFB: 24 features
- 29 MS-LFB: 29 features
- 40 MS-LFB: 40 features
[Table 9.1 annotations: comparing DNN with GMM; baseline model; more raw features → performance getting better]
9.3 Flexibility in Using Arbitrary Input Features
- rel. WERR = relative word error rate reduction
- The concept derives from relative change, defined as $\mathrm{RelativeChange}(y, y_{\mathrm{ref}}) = \dfrac{y - y_{\mathrm{ref}}}{y_{\mathrm{ref}}}$.
- This is the change of the target $y$ from the reference $y_{\mathrm{ref}}$, relative to the reference $y_{\mathrm{ref}}$.
- In this table, $y_{\mathrm{ref}} = 34.66\%$ and $y$ is the WER to compare; "reduction" is used instead of "change" because the WER decreases.
- Examples (see the sketch below)
  - $-8.7\% = (31.63 - 34.66) / 34.66$
  - $-13.1\% = (30.11 - 34.66) / 34.66$
  - $-13.1\% = (30.11 - 34.66) / 34.66$
  - $-13.8\% = (29.86 - 34.66) / 34.66$
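The computation can be reproduced with a few lines; the reference WER 34.66% is the baseline from the table above.

```python
def relative_werr(wer, wer_ref=34.66):
    """Relative word error rate reduction (%) with respect to the reference WER."""
    return 100.0 * (wer - wer_ref) / wer_ref

for wer in (31.63, 30.11, 29.86):
    print(f"WER {wer:.2f}% -> rel. WERR {relative_werr(wer):+.1f}%")
# WER 31.63% -> rel. WERR -8.7%
# WER 30.11% -> rel. WERR -13.1%
# WER 29.86% -> rel. WERR -13.8%
```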
9.3 Flexibility in Using Arbitrary Input Features
- The filter banks can be automatically learned by the DNNs, as reported in [26], in which the FFT spectrum is used as the input to the DNNs.
  - FFT: data preprocessing
- It is reported in [26] that, by learning the filter-bank parameters directly, a 5% relative WER reduction can be achieved over the baseline DNNs that use the manually designed Mel-scale filter banks (40 MS-LFB, as shown in [26]).
[26] Sainath et al. "Learning filter banks within a deep neural network framework." In Proceedings of the IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), 2013.
9.3 Flexibility in Using Arbitrary Input Features – Questions (O)
Could you briefly explain the HLDA transform mentioned in Sect. 9.3? Could you also briefly explain NAT, VTS, MLLR, etc. on p.166? – 김종인
- HLDA transform = heteroscedastic linear discriminant analysis transform
- What the HLDA transform means in this context:
  - When a probabilistic speech recognition model is used, applying the HLDA transform to MFCC improves performance.
- Reference [13] gives a detailed description of the HLDA transform.
- [13] Kumar, N., Andreou, A.G.: Heteroscedastic discriminant analysis and reduced rank HMMs for improved speech recognition. Speech Commun. 26(4), 283–297 (1998)
9.3 Flexibility in Using Arbitrary Input Features – Questions (O)
Table 9.1 compares results obtained with different raw input features. For now, hand-crafted features still seem to give better results. Is there also research on using completely raw data as-is, instead of features crafted by people? – 조석현
- The last paragraph of Sect. 9.3 introduces an experiment [26] that compares MS-LFB input with raw input to the DNN. In that experiment, the DNN performed better with the raw input.
- Here the raw input is the FFT-transformed input (the FFT spectrum).
- See [26] for details.
ASR Chapter 9:
Feature Representation Learning in Deep Learning Networks (Part 2)
강기천, Interdisciplinary Program in Cognitive Science
Contents
- 9.4 Robustness of Features
- 9.4.1 Robust to Speaker Variations
- 9.4.2 Robust to Environment Variations
- 9.5 Robustness Across All Conditions
- 9.5.1 Robustness Across Noise Levels
- 9.5.2 Robustness Across Speaking Rates
- 9.6 Lack of Generalization Over Large Distortions
9.4 Robustness of Features
- A key property of a good feature is its robustness to the variations.
- two main types of variations in speech signals
- speaker variation
- environment variation.
9.4.1 Robust to Speaker Variations
- Vocal tract length normalization (VTLN)
  - Different speakers have different vocal tract lengths (from the lips to the larynx).
  - The speech signal is transformed to the frequency domain and the frequency axis is warped using a speaker-dependent warping factor (roughly 0.8–1.18); a sketch of such a warp follows below.
  - Band-pass filter integration using the filter-bank method
https://www.lti.cs.cmu.edu/sites/default/files/CMU-LTI-97-150-T.pdf
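Below is a rough sketch of a piecewise-linear VTLN warp of the kind described above. It is illustrative only: the breakpoint choice and the function name are assumptions, not necessarily the exact warping used in the cited report.

```python
import numpy as np

def vtln_warp(freqs_hz, alpha, f_max=8000.0):
    """Piecewise-linear VTLN warp: scale by alpha below a breakpoint, then map
    linearly so that f_max stays fixed. alpha (~0.8-1.18) is speaker dependent.
    The breakpoint at 7/8 of the bandwidth is one common convention (assumed)."""
    f = np.asarray(freqs_hz, dtype=float)
    f0 = 0.875 * f_max * min(1.0, 1.0 / alpha)        # warp breakpoint
    return np.where(
        f <= f0,
        alpha * f,
        alpha * f0 + (f_max - alpha * f0) * (f - f0) / (f_max - f0),
    )

print(vtln_warp([1000.0, 4000.0, 7000.0], alpha=1.1))   # warped frequencies
```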
9.4.1 Robust to Speaker Variations
- Feature-Space Maximum Likelihood Linear Regression (fMLLR)
- Speaker adaptation technique
- fMLLR applies an affine transform to the feature vector so that the transformed feature better
matches the model
- For GMM-HMMs, fMLLR transforms are estimated to maximize the likelihood of the adaptation data
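A minimal sketch of applying an already-estimated fMLLR transform to a sequence of feature vectors (x ↦ Ax + b). Estimating A and b by maximizing the likelihood of the adaptation data is the hard part and is not shown; all names, shapes, and values here are illustrative.

```python
import numpy as np

def apply_fmllr(features, A, b):
    """Apply an affine fMLLR transform x -> A x + b to each frame.

    features: (frames, dim); A: (dim, dim); b: (dim,).
    """
    return features @ A.T + b

rng = np.random.default_rng(0)
dim = 39                                                # e.g., static MFCC + deltas
feats = rng.normal(size=(200, dim))                     # fake utterance
A = np.eye(dim) + 0.01 * rng.normal(size=(dim, dim))    # near-identity transform
b = rng.normal(scale=0.1, size=dim)
print(apply_fmllr(feats, A, b).shape)                   # (200, 39) adapted frames
```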
Table 9.2 Comparison of feature-transform-based speaker-adaptation techniques for GMM-HMMs, a shallow NN, and a deep NN
9.4.2 Robust to Environment Variations
- Similarly, GMM-based acoustic models are highly sensitive to environmental mismatch.
- To deal with this issue, several techniques that normalize the input features or adapt the model parameters have been developed, such as vector Taylor series (VTS) adaptation and maximum likelihood linear regression (MLLR).
- DNNs have the ability to generate internal representations that are robust to environmental
variability seen in the training data.
[Figure: noisy speech, clean speech, noise]
9.4.2 Robust to Environment Variations
- Table 9.3 indicates that DNN systems are more robust than GMM systems to certain variations.
- Four types of distortion; the Aurora 4 task (six types of noise)
- The “DNN (7 × 2K)” system is simply a direct application of the CD-DNN-HMM with 7 hidden
layers each with 2K neurons.
- Nevertheless, it outperforms all but the “NAT+JointMLLR/VTS” system.
- Finally, the “DNN+NaT+dropout” system that uses the noise-aware training and dropout has
the best performance.
- In addition, all the DNN-HMM results were obtained in the first pass, while the other three
systems required two or more recognition passes for noise, channel, or speaker adaptation.
Q&A
- Q: Could you explain NAT, NaT, and VTS?
- A:
  - NAT (noise adaptive training): directly estimates pseudo-clean model parameters without relying on a point estimate of the clean speech.
  - NaT (noise-aware training): instead of performing explicit adaptation as in NAT, a DNN is used to learn the relationship between noisy speech and clean speech.
  - VTS (vector Taylor series): an algorithm that estimates the pdf (probability density function) of noisy speech using a Taylor series; the estimated noise model is used to adapt the Gaussian parameters.
https://wiki.inf.ed.ac.uk/twiki/pub/CSTR/ListenSemester2201112/Ozlem_ICASSP09_final.pdf
https://pdfs.semanticscholar.org/b11e/4d980f46524e6838b8211afda081965dd25a.pdf
Q&A
- Q: The text explains that DNNs are more robust to noise than GMMs. In that case, would removing the noise first, e.g., with a denoising autoencoder, and then learning the feature representation give even better performance?
- A: The presenter could not find a paper that directly compares this against the task presented in the textbook.
https://groups.csail.mit.edu/sls/publications/2014/feng-icassp14.pdf
9.5 Robustness Across All Conditions
- The superior robustness results reported in Sect. 9.4 seem to suggest that DNNs
provide significantly higher error rate reduction over GMM systems for noisier speech than for cleaner speech.
- In fact, these results only indicate that the DNN systems are more robust than GMM
systems to speaker and environmental distortions.
- In this section, we show that DNNs provide similar gains over GMM systems across
different noise levels and speaking rates.
- Robustness Across Noise Levels
- Robustness Across Speaking rates
9.5.1 Robustness Across Noise Levels
- Figures 9.8 and 9.9, provided in [8], compare the error pattern of the GMM-HMM and
CD-DNN-HMM models under different signal-to-noise ratios (SNRs) for the VS(Voice Search) and SMD(Short Message Dictation) datasets respectively.
- As we can observe from these figures, the CD-DNN-HMM significantly outperforms the
GMM-HMM at all SNR levels, including both the clean and very noisy speech.
9.5.1 Robustness Across Noise Levels
- CD-DNN-HMM is more robust than GMM systems
- with less WER increment per 1 dB SNR drop on average and slightly more so at the low SNR
range as indicated by the flatter slope compared to GMM systems.
- However, the difference is very small.
- The speech recognition performance of the DNN still drops quite a lot as the noise level
increases within the normal range of the mobile speech applications.
- Noise robustness remains an important research area
- techniques such as speech enhancement, noise-robust acoustic features, or other multi-condition learning technologies need to be explored to bridge the performance gap and further improve the overall performance of the deep learning-based acoustic model.
9.5.2 Robustness Across Speaking Rates
- Speaking rate variation is another well known factor that would affect the speech
intelligibility and thus the speech recognition accuracy.
- speaking rate change can be due to different speakers, speaking modes, and speaking styles.
- several reasons that speaking rate change may result in speech recognition accuracy degradation.
- First, it may change the acoustic score dynamic range since the AM score of a phone is the sum of all the
frames in the same phone segment.
- Second, the fixed frame rate, frame length, and context window size may be inadequate to capture the
dynamics in transient speech events for fast or slow speech and therefore result in suboptimal modeling.
- Third, variable speaking rates may result in slight formant shift due to the human vocal instrumentation
limitation.
- Last, extremely fast speech may cause formant target missing and phone deletion.
9.5.2 Robustness Across Speaking Rates
- Fig. 9.10 Performance comparison of the GMM-HMM and the CD-DNN-HMM at different speaking rates for the VS task. (Figure from Huang et al., used by permission of ISCA.)
- Fig. 9.11 Performance comparison of the GMM-HMM and the CD-DNN-HMM at different speaking rates for the SMD task. (Figure from Huang et al., used by permission of ISCA.)
Q&A
- Q: What is the definition of the speaking rate used in Sect. 9.5.2, and what values does it typically take?
- A: The rate at which a speaker talks, measured as "the number of phones per second."
  Typically about 10–17 phones per second.
https://asa.scitation.org/doi/10.1121/1.2016208
https://www.sciencedirect.com/science/article/pii/S0167639398000697
9.6 Lack of Generalization Over Large Distortions
- In Sect. 9.2, we have shown that small perturbations in the input will be gradually
shrunk as we move to the internal representation in the higher layers.
- This property leads to robustness of the DNN systems to the speaker and environment
variations as shown in Sect. 9.4.
- In this section, we point out that the above result is only applicable to small
perturbations around the training samples.
- When the test samples deviate significantly from the training samples, DNNs cannot accurately
classify them.
- In other words, DNNs must see examples of representative variations in the data during
training in order to generalize to similar variations in the test data. This is no different from other machine learning models.
9.6 Lack of Generalization Over Large Distortions
- This behavior can be demonstrated using a mixed-bandwidth ASR study. Typical speech
recognizers are trained on either narrowband speech signals, recorded at 8 kHz, or wideband speech signals, recorded at 16 kHz.
- the input to the DNN is the 29 mel-scale log filter-bank outputs together with dynamic
features across an 11-frame context window.
9.6 Lack of Generalization Over Large Distortions
- The 29-dimensional filter-bank has two parts: the first 22 filters span 0–4kHz and the last 7
filters span 4–8 kHz, with the center frequency of the first filter in the higher filter-bank at 4 kHz.
- When the speech is wideband, all 29 filters have observed values.
- However, when the speech is narrowband, the high-frequency information was not captured so
the final 7 filters are set to 0.
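An illustrative sketch of this input convention; the function name and the fake frame are assumptions, and only the 22/7 filter split follows the text.

```python
import numpy as np

def make_mixed_bandwidth_frame(logfbank_29, narrowband):
    """29-dim Mel log filter-bank frame; for narrowband (8 kHz) speech the last
    7 filters (spanning 4-8 kHz) carry no observation and are zeroed out."""
    frame = np.array(logfbank_29, dtype=float)
    if narrowband:
        frame[22:] = 0.0       # filters 23-29 span 4-8 kHz: unobserved
    return frame

wb_frame = np.random.default_rng(0).normal(size=29)     # fake wideband frame
print(make_mixed_bandwidth_frame(wb_frame, narrowband=True)[20:])
```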
- Experiments were conducted on a mobile voice search (VS) corpus.
- This task consists of internet search queries made by voice on a smartphone
- There are two training sets, VS-1 and VS-2, consisting of 72 and 197 h of wideband audio data,
respectively.
- These sets were collected during different times of year. The test set, called VS-T, has 26,757 words in
9,562 utterances.
- The narrow band training and test data were obtained by downsampling the wideband data.
9.6 Lack of Generalization Over Large Distortions
- Table 9.4 summarizes the WER on the wideband and narrowband test sets when the DNN is trained with and without narrowband speech.
- From this table, we can observe that if all training data are wideband, the DNN performs well on the wideband test set (27.5% WER) but very poorly on the narrowband test set (53.5% WER).
- However, if we convert VS-2 to narrowband speech and train the same DNN using mixed-
bandwidth data (second row), the DNN performs very well on both wideband and narrowband speech.
Table 9.4 Word error rate (WER) on wideband (16 kHz) and narrowband (8 kHz) test sets with and without narrowband training data
9.6 Lack of Generalization Over Large Distortions
- To understand the difference between these two scenarios, we measure the Euclidean distance between the activation vectors at each layer for the wideband and narrowband input feature pairs, v(x_wb) and v(x_nb).
- For the top layer, whose output is the senone posterior probability, we calculate the KL divergence in nats between p(s_j | x_wb) and p(s_j | x_nb), where N_L is the number of senones and s_j is the senone id. (See the sketch below.)
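A small sketch of these two measurements for a single frame pair; the activation and posterior vectors are randomly generated, and the layer width and senone count are illustrative.

```python
import numpy as np

def euclidean_distance(v_wb, v_nb):
    """Distance between the hidden activation vectors v(x_wb) and v(x_nb)."""
    return float(np.linalg.norm(np.asarray(v_wb) - np.asarray(v_nb)))

def kl_divergence_nats(p_wb, p_nb, eps=1e-12):
    """KL( p(s|x_wb) || p(s|x_nb) ) in nats over the senone posteriors."""
    p = np.asarray(p_wb) + eps
    q = np.asarray(p_nb) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

rng = np.random.default_rng(0)
v_wb, v_nb = rng.random(2048), rng.random(2048)     # fake layer activations
p_wb, p_nb = rng.random(1000), rng.random(1000)     # fake posteriors over N_L senones
print(euclidean_distance(v_wb, v_nb), kl_divergence_nats(p_wb, p_nb))
```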
9.6 Lack of Generalization Over Large Distortions
- the average distances in the data-mixed DNN are consistently smaller than those in
the DNN trained on wideband speech only.
- The final representation is thus more invariant to this variation and yet still has the ability to
distinguish between different class labels.
Table 9.5 Euclidean distance for the activation vectors at each hidden layer (L1–L7) and the KL divergence (nats) for the posteriors at the softmax layer between the narrowband (8 kHz) and wideband (16 kHz) input features, measured using the wideband DNN or the mixed-bandwidth DNN
Q&A
- Q: Section 9.6 points out the lack of generalization over large distortions and uses mixed-bandwidth speech recognition as the example. Personally, just as people hardly notice the difference when the sampling rate changes, a bandwidth difference does not seem like a large distortion for ASR either. Rather, narrowband speech obtained by down-sampling wideband speech seems closer to the "small perturbations" mentioned in the text. Have I misunderstood something in the text? If not, is there a more appropriate example of the lack of generalization over large distortions?
Q&A
- A: In the paper's own words, the two bandwidths are variations unrelated to each other, and the narrowband inputs deviate significantly from the wideband training data.
  The point of Sect. 9.6 is that the training data must contain examples of the representative variations for the model to perform well. From this viewpoint, training on data of both bandwidths gives the data sufficient representational coverage and lets the DNN tell the different variations apart. However, the presenter could not find an explanation of why humans are robust to the bandwidth of speech.
<From the paper> "Perhaps what is more interesting is that the average distances and variances in the data-mixed DNN are consistently smaller than those in the DNN trained on wideband speech only. This indicates that by using mixed-bandwidth training data, the DNN learns to consider the differences in the wideband and narrowband input features as irrelevant variations. These variations are suppressed after many layers of nonlinear transformation."
https://arxiv.org/pdf/1301.3605.pdf
End
Thank you!