1
Augmented Data Training of Joint Acoustic/Phonotactic DNN i-vectors for NIST LRE 2015
Alan McCree, Greg Sell, and Daniel Garcia-Romero
JHU HLTCOE
Odyssey 2016
2
Overview
- JHU HLTCOE submission to LRE15
– One of top performers
- Approach
– DNN features and state labels
– Acoustic, phonotactic, and joint i-vectors
– Simple fusion: scale factor and duration
– Data augmentation
3
Outline
- System design
– i-vector LID system
– Improvement with DNNs
– Alternative i-vectors
- LRE15 task
– Data usage/augmentation
- Results and analysis
4
LID System
- Two-covariance model in i-vector space
- Discriminative refinement of Gaussians
– Within-class covariance scale factor
– Language class means
– Multiclass MMI
[Diagram: MFCC → compute stats → i-vector extractor (raw i-vector) → LDA → Gaussian scoring (scores); stats computation and i-vector extraction are unsupervised, LDA and scoring are supervised]
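The Gaussian scoring step of the pipeline above can be sketched as follows. This is an illustrative NumPy reconstruction with hypothetical names, not the authors' code; it assumes class-conditional Gaussians that share a single within-class covariance, as in the two-covariance model:

```python
import numpy as np

def gaussian_scores(x, means, within_cov):
    """Log-likelihood of i-vector x under each language's Gaussian.

    All languages share one within-class covariance (two-covariance
    model), so the normalization term is common to every class; it is
    kept here so the values are true log-densities.
    """
    d = x.shape[0]
    prec = np.linalg.inv(within_cov)
    _, logdet = np.linalg.slogdet(within_cov)
    diffs = x - means                          # (num_langs, d)
    maha = np.einsum('ld,de,le->l', diffs, prec, diffs)
    return -0.5 * (maha + logdet + d * np.log(2 * np.pi))
```

With an identity within-class covariance this reduces to the negative squared distance to each language mean, up to a constant.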
5
LID System
[Diagram: same pipeline, with the stats computation expanded: MFCC → GMM → frame posteriors, which feed the compute-stats block together with the MFCCs]
6
LID System
[Diagram: same pipeline, with the GMM replaced by a DNN: ASR features → DNN → frame posteriors, which feed the compute-stats block together with the MFCCs]
7
DNN Architecture
- Input: 9 spliced frames of 40-dim vectors from LDA+MLLT
- 5 hidden layers with p-norm pooling (p=2)
– Input/output ratio of 10:1
- Output targets are clustered phone states (senones)
- Trained on SWB-1 using Kaldi
[DNN diagram: 9184-dim senone output layer; bottleneck goes here]
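The p-norm pooling nonlinearity with its 10:1 input/output ratio can be sketched as below. This is a minimal NumPy illustration with a hypothetical function name, not the Kaldi implementation:

```python
import numpy as np

def pnorm(x, group_size=10, p=2):
    """p-norm nonlinearity: each output unit is the p-norm of a group of
    inputs. group_size=10 gives the 10:1 reduction from the slide."""
    assert x.shape[-1] % group_size == 0
    groups = x.reshape(*x.shape[:-1], -1, group_size)
    return np.linalg.norm(groups, ord=p, axis=-1)
```

Each output pools a group of 10 hidden units into a single 2-norm, so a 1000-dim layer activation becomes 100-dim.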
8
Types of i-vectors
- Acoustic:
– Gaussian probability model (given alignments)
– i-vector: analytical solution for MAP estimate
– EM estimation of T
- Phonotactic:
– Multinomial (categorical) probability model
– No closed form MAP solution: Newton's method
– Iterative approximate estimation of T
Acoustic i-vector:    m_i = m + T w_i
Phonotactic i-vector: p_i = softmax(log p + T w_i)
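The two mappings on this slide can be written directly in code. A small NumPy sketch with hypothetical names, assuming the background probabilities p sum to 1:

```python
import numpy as np

def acoustic_means(m, T, w):
    # Gaussian mean supervector shifted along the subspace: m_i = m + T w_i
    return m + T @ w

def phonotactic_probs(p, T, w):
    # Multinomial probabilities shifted in the log domain:
    # p_i = softmax(log p + T w_i)
    z = np.log(p) + T @ w
    e = np.exp(z - z.max())
    return e / e.sum()
```

At w = 0 both mappings reduce to the background model (m and p respectively), which is why the i-vector can be read as a low-dimensional offset from the UBM.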
9
How to Combine?
- Score fusion
- I-vector stacking
- Joint i-vector:
Joint i-vector (one shared w_i for both models):
m_i = m + T_m w_i
p_i = softmax(log p + T_p w_i)
10
Joint I-vector Details
- Based on subspace GMM approach
- Differences
– MAP instead of ML i-vector estimate
– Initialize i-vector with acoustic only (closed-form)
– Diagonal Hessian in Newton update
– Computation: similar to acoustic (was 10x)
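The estimation ideas above (a MAP estimate with a standard-normal prior on w, and Newton updates with a diagonal Hessian approximation) can be sketched for the multinomial stream. This is an illustrative NumPy version with hypothetical names, not the authors' implementation; it starts from w = 0 rather than the closed-form acoustic initialization:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def map_ivector_multinomial(counts, log_p, T, n_iter=20):
    """MAP estimate of w for pi = softmax(log_p + T w),
    counts ~ Multinomial(N, pi), prior w ~ N(0, I)."""
    N = counts.sum()
    w = np.zeros(T.shape[1])
    for _ in range(n_iter):
        pi = softmax(log_p + T @ w)
        grad = T.T @ (counts - N * pi) - w       # gradient of log-posterior
        # diagonal Hessian approximation (softmax cross terms dropped);
        # the -1.0 comes from the standard-normal prior
        hdiag = -(T ** 2).T @ (N * pi * (1 - pi)) - 1.0
        w = w - grad / hdiag                     # diagonal Newton step
    return w
```

Because the Hessian is kept diagonal, each update costs little more than a gradient step, which is one way the slide's "similar to acoustic" computation claim can come about.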
11
LRE15 Task
- Conversational speech
– Telephone and broadcast narrowband
- 20 languages, 6 confusable clusters
– Metric: average Bayes cost (Cavg)
- Limited training condition
– Use distributed material only
– Switchboard (English) + transcriptions
– Variable amount per language
12
LRE15 Systems
- All use DNNs, i-vectors, and an MMI-trained Gaussian classifier
- Variations:
– DNNs
- Bottleneck
- Clustered phone state (senones)
– i-vectors
- Acoustic
- Phonotactic
- Joint
13
Back-end and Fusion
- Systems are already calibrated
– MMI training of covariance scaling and means
- Duration modeling/scaling
- Fusion by averaging calibrated scores
– Learn overall scaling after average
LL_mn = c(t_n) + t_n S_mn
(raw score S_mn for language m on cut n, scaled by the cut duration t_n with a duration-dependent offset c)
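The fusion rule on this slide (average the already-calibrated scores, then apply one overall scale learned afterwards) can be sketched as below; the scale value itself is a placeholder parameter here, not a learned one:

```python
import numpy as np

def fuse(system_scores, overall_scale=1.0):
    """Fusion by averaging calibrated log-likelihood scores across
    systems, followed by a single overall scale factor (learned after
    the average in the real system; a plain parameter in this sketch).

    system_scores: array of shape (num_systems, num_classes)."""
    return overall_scale * np.mean(system_scores, axis=0)
```

Because each system is calibrated before fusion, no per-system weights need to be trained; averaging is the whole combination rule.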
14
LRE15 Cluster Scoring
- Task: closed set detection per cluster
– Use Bayes' rule for each
– ID posteriors sum to 1 per cluster (sum to 6 total)
– Convert to detection LLRs
- No cluster-specific systems or fusion
– This is a generic 20 language LID system!
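The per-cluster scoring above can be sketched as follows. A NumPy illustration with hypothetical names, assuming uniform priors within each cluster:

```python
import numpy as np

def cluster_detection_llrs(loglik, clusters):
    """Closed-set detection per cluster: Bayes' rule (uniform prior)
    turns the language log-likelihoods into posteriors that sum to 1
    within each cluster; each posterior is then converted to a
    detection log-likelihood ratio log(p / (1 - p)).

    clusters: list of index arrays, one per language cluster."""
    llrs = np.empty_like(loglik, dtype=float)
    for idx in clusters:
        z = loglik[idx] - loglik[idx].max()     # stabilized softmax
        post = np.exp(z) / np.exp(z).sum()      # sums to 1 per cluster
        llrs[idx] = np.log(post) - np.log1p(-post)
    return llrs
```

Nothing here is cluster-specific beyond the index lists, which is the point of the slide: the same generic 20-language scores serve every cluster.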
15
Training Data and Augmentation
[Diagram: training data flow.
Classifier: LRE training data → segment into cuts of 3-30 seconds of speech (uniform) → augment (reverb, noise, resample, encode).
UBM/T: kept simple, all provided data, full cuts (up to 120 sec), no augmentation.]
16
Augmentation types
- Sample rate perturbation
– Distorts pitch and speaking rate
- Additive noise
– Multiband modulated Gaussian noise
- Reverberation
– Long or short synthetic impulse response
- Multiband compression
– Dynamic range compression
- Cellular speech coding
– GSM-AMR at 4.75 or 6.7 kb/s
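Two of these augmentations are simple enough to sketch in NumPy. This is an illustration with hypothetical names: plain Gaussian noise stands in for the multiband modulated noise, and linear interpolation stands in for a proper resampler; the reverberation, compression, and GSM-AMR coding steps require external tools and are omitted:

```python
import numpy as np

def add_noise(speech, snr_db, rng=None):
    """Additive noise at a target SNR (plain Gaussian here; the actual
    system used multiband modulated Gaussian noise)."""
    if rng is None:
        rng = np.random.default_rng(0)
    sig_power = np.mean(speech ** 2)
    noise = rng.standard_normal(speech.shape)
    target_power = sig_power / (10 ** (snr_db / 10))
    noise *= np.sqrt(target_power / np.mean(noise ** 2))
    return speech + noise

def rate_perturb(speech, factor):
    """Sample-rate perturbation by naive resampling; as the slide notes,
    this distorts both pitch and speaking rate."""
    n = int(round(len(speech) / factor))
    t = np.linspace(0, len(speech) - 1, n)
    return np.interp(t, np.arange(len(speech)), speech)
```

Playing the resampled signal back at the original rate is what produces the pitch/rate distortion; the noise branch simply scales the noise so the measured SNR matches the target.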
17
Submission Performance
System                    Avg Cavg
[0] Bottleneck, joint     19.9
[1] Senone, acoustic      -
[2] Senone, phonotactic   -
[3] Senone, joint         19.7
[0,3] Fusion              18.8
[0,1,2] Fusion            18.5
18
Post-eval Improvements
- Classifier tuning
– ML initialization to MMI instead of single-cut Bayesian (PLDA)
- List usage
– No Switchboard for UBM/T
- No segmentation/augmentation
- Smallest possible number of cuts!
19
Post-eval Results
System                    Submission   Classifier   Class+lists
Acoustic baseline         -            23.8         22.2
[0] Bottleneck, joint     19.9         19.2         18.5
[1] Senone, acoustic      -            20.2         19.2
[2] Senone, phonotactic   -            20.9         20.3
[3] Senone, joint         19.7         19.2         18.4
[0,3] Fusion              18.8         18.1         17.3
[0,1,2] Fusion            18.5         18.0         17.3
20
Conclusion
- JHU HLTCOE strong performer in LRE15
- Key components