Tandem investigations - Dan Ellis 2001-01-25 - 1
Tandem modeling investigations
Dan Ellis, International Computer Science Institute, Berkeley CA <dpwe@icsi.berkeley.edu>

Outline
1  What makes Tandem successful?
2  Can we make Tandem better?
3  Does Tandem work with LVCSR tricks?
What makes Tandem work?
(with Manuel Reyes)
- Model diversity?
- try a phone-based GMM model
- try training the NN model to HTK state labels
- Discriminative network training?
- (try posteriors derived from GMMs via Bayes' rule)
[Figure: Tandem system block diagram — PLP and MSG feature streams from the input sound each feed a neural-net classifier; the classifiers' pre-nonlinearity outputs are combined, PCA-orthogonalized, and passed as features to Gaussian mixture models and the HTK decoder, which outputs words.]
Relative improvements:
- Tandem combo over HTK mfcc baseline: +53%
- Tandem over HTK: +35%
- Tandem over hybrid: +25%
- Combo over mfcc: +25%
- Combo over plp: +20%
- Combo over msg: +20%
- NN over HTK: +15%
- Combo-into-HTK over combo-into-noway: +15%
- Pre-nonlinearity over posteriors: +12%
- KLT over direct: +8%
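The "pre-nonlinearity over posteriors" and "KLT over direct" gains correspond to taking the net's pre-softmax activations and decorrelating them with a KLT (PCA) before handing them to the Gaussian mixtures. A minimal numpy sketch of that step, under assumptions: the 181-to-40 dimensions are illustrative (taken from a later slide), and the function name and synthetic data are hypothetical:

```python
import numpy as np

def tandem_features(pre_softmax, n_keep=None):
    """Decorrelate NN pre-nonlinearity (pre-softmax) outputs with a
    KLT/PCA learned from the data itself.

    pre_softmax: (n_frames, n_units) array of linear net outputs.
    n_keep: number of leading components to retain (None keeps all).
    """
    centered = pre_softmax - pre_softmax.mean(axis=0)
    cov = np.cov(centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)         # ascending eigenvalues
    basis = eigvecs[:, np.argsort(eigvals)[::-1]]  # descending variance
    if n_keep is not None:
        basis = basis[:, :n_keep]
    return centered @ basis

# Synthetic stand-in for 181 correlated net outputs over 400 frames,
# reduced to 40 decorrelated dimensions.
rng = np.random.default_rng(0)
acts = rng.normal(size=(400, 181)) @ rng.normal(size=(181, 181))
feats = tandem_features(acts, n_keep=40)
```

Projecting onto the covariance eigenvectors makes the retained feature dimensions uncorrelated, which suits the diagonal-covariance Gaussian mixtures downstream.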
Phone vs. word models
- Try a phone-based HTK model
(instead of whole-word models)
- Try training NN model to subword-state labels
- 181 net outputs; reduce to 40 in KLT
- Results (Aurora2k, HTK-baseline WER ratio):
- Diversity doesn't help
- Subword units may be good for the NN
System                   test A: matched   test B: var noise   test C: var chan
Tandem PLP baseline           63.5%             70.3%              59.5%
Phone-based HTK sys           63.6%             72.5%              61.5%
Subword-based NN sys          63.1%             62.8%              55.1%
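The table entries are WER ratios relative to the HTK baseline, so lower is better and 100% would exactly match the baseline. A trivial sketch of how such an entry is read (the example WER values here are hypothetical, not from the deck):

```python
def wer_ratio(system_wer, baseline_wer):
    """Express a system's word error rate as a percentage of the
    HTK-baseline WER (the convention used in these tables)."""
    return 100.0 * system_wer / baseline_wer

# A hypothetical system at 6.35% WER against a 10% HTK baseline
# would be reported as roughly 63.5 in the table.
ratio = wer_ratio(6.35, 10.0)
```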
[Figure: Tandem block diagram — PLP features from the input sound feed a neural-net classifier (trained either on phoneme targets or on subword states); its outputs are KLT-orthogonalized and passed to Gaussian mixture models and the HTK decoder, which outputs words.]
Enhancements to Tandem-Aurora
- More tandem-feature-domain processing:
- Results (HTK baseline WER ratio):
- delta-KLT-norm reaches 80% of the Tandem-baseline WER
System                     test A: matched   test B: var noise   test C: var chan
PLP: Tandem baseline            63.5%             70.3%              59.5%
PLP: norm - KLT                 72.6%             71.2%              63.6%
PLP: KLT - norm                 57.8%             58.8%              51.3%
PLP: KLT - delta                59.0%             60.2%              52.9%
PLP: KLT - delta - norm         58.1%             59.9%              48.9%
PLP: delta - KLT - norm         54.7%             53.6%              46.9%
[Figure: processing order — neural-net classifier pre-nonlinearity outputs, with optional norm/deltas applied before and/or after KLT orthogonalization, then Gaussian mixture models.]
Best effort Tandem system
- Deltas & norms help PLP: try them on the combo (PLP+MSG) system
- Deltas hurt for MSG: features too sluggish?
- Deltas help clean, norms help noisy:
System                  test A: matched   test B: var noise   test C: var chan
PLP+MSG: baseline            51.1%             52.0%              45.6%
PLP+MSG: dlt-KLT-nrm         50.9%             50.5%              43.6%
PLP+MSG: KLT-nrm             48.3%             49.5%              39.4%
[Figure: WER (%) vs. SNR (dB), from clean down to 5 dB, comparing the baseline against the KLT-delta (K-D) and KLT-norm (K-N) systems.]
Tandem for LVCSR: the SPINE task
(with Rita Singh/CMU & Sunil Sivadas/OGI)
- Noisy spontaneous speech, ~5000 word vocab
- Recognition:
- same tandem features
- NN training bootstrapped from Broadcast News, then iterated
- GMM-HMM adds context dependence and MLLR adaptation
[Figure: SPINE Tandem system — PLP and MSG feature calculation on the input sound each feed a neural-net classifier; the pre-nonlinearity outputs are combined and PCA-decorrelated, then passed to the SPHINX recognizer (GMM classifier producing subword likelihoods, HMM decoder with MLLR adaptation) to produce words.]
SPINE-Tandem results
- Evaluation WER results:
- much better for CI systems
- differences evaporate with CD, MLLR
- Not quite fair:
- CD senones optimized for MFCC
- worth 2-3% absolute?
- Not unexpected:
- NN confounds CD variants
- Tandem ‘space’ very nonlinear - bad for MLLR
- Any hope?
- more training data / train CD classes / ...
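The MLLR concern above can be made concrete: MLLR adapts a recognizer by applying a shared affine transform to the Gaussian means, which presumes that speaker/channel variation is roughly linear in the feature space; in a strongly nonlinear tandem feature space no single transform fits all classes well. A minimal sketch of the mean update (all dimensions and values here are hypothetical):

```python
import numpy as np

def mllr_adapt_means(means, A, b):
    """Apply one shared MLLR-style affine transform to all Gaussian
    means: mu' = A @ mu + b. A single (A, b) only adapts well when
    the mismatch is near-linear in feature space, which is the worry
    for the highly nonlinear tandem features."""
    return means @ A.T + b

means = np.zeros((8, 3))       # 8 Gaussians with 3-dim means (toy sizes)
A = np.eye(3) * 1.1            # hypothetical regression matrix
b = np.full(3, 0.5)            # hypothetical bias
adapted = mllr_adapt_means(means, A, b)
```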