The Sheffield language recognition system in NIST LRE 2015
Raymond Ng, Mauro Nicolao, Oscar Saz, Madina Hasan, Bhusan Chettri, Mortaza Doulaty, Tan Lee and Thomas Hain
University of Sheffield, UK &
The Chinese University of Hong Kong
Background:
- Classical approaches with acoustic-phonetic and phonotactic features [Zissman 1996; Ambikairajah et al. 2011; Li et al. 2013]
- Shifted-delta cepstral coefficients [Torres-Carrasquillo et al. 2002]
- I-vectors [Dehak, Torres-Carrasquillo, et al. 2011; Martinez et al. 2012], DNNs [Ferrer et al. 2014; Richardson et al. 2015] and their combination [Ferrer et al. 2016]

Sheffield LRE system: four LR component systems
- I-vector
- Phonotactic
- "Direct" DNN
- Bottleneck + i-vector
Training data: Switchboard 1, Switchboard Cellular Part 2, and the LRE 2015 training data

Target language clusters:
- Arabic: Egyptian (ara-arz), Iraqi (ara-acm), Levantine (ara-apc), Maghrebi (ara-ary), Modern Standard (ara-arb)
- English: British (eng-gbr), General American (eng-usg), Indian (eng-sas)
- French: West African (fre-waf), Haitian Creole (fre-hat)
- Slavic: Polish (qsl-pol), Russian (qsl-rus)
- Iberian: Caribbean Spanish (spa-car), European Spanish (spa-eur), Latin American Spanish (spa-lac), Brazilian Portuguese (por-brz)
- Chinese: Cantonese (zho-yue), Mandarin (zho-cmn), Min (zho-cdo), Wu (zho-wuu)
Training data for VAD:
- CTS data: CMLLR+BMMI Switchboard model → alignment → SIL vs non-SIL
- BNBS data: VAD reference from 1% of VOA2, VOA3 files
- Class balancing: add more non-speech data
DNN frame-based speech / non-speech classifier
- Features: filterbank (23D), ±15 frames of context, DCT across time → 368 dimensions
- Frame-wise classification: DNN 368-1000-1000-2, learning rate 0.001, newbob scheduling
- Sequence alignment: 2-state HMM, minimum state duration of 20 frames (200 ms)
- Smoothing: merging heuristics to bridge non-speech gaps < 2 seconds
- Results (collar: 10 ms)
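The 2-second gap-bridging heuristic can be sketched as follows (a minimal illustration, not the system's actual code; segments are assumed to be (start, end) times in seconds):

```python
def bridge_gaps(segments, max_gap=2.0):
    """Merge speech segments whose separating non-speech gap is
    shorter than max_gap seconds (the slide's 2-second heuristic)."""
    merged = []
    for start, end in sorted(segments):
        if merged and start - merged[-1][1] < max_gap:
            # bridge the short non-speech gap by extending the last segment
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged
```

For example, `bridge_gaps([(0.0, 1.0), (2.5, 4.0), (7.0, 8.0)])` merges the first two segments (gap of 1.5 s) and leaves the third alone.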
Segmentation: V1 (30s) and V3 (3s, 10s, 30s)
- V1 data: VAD, sequence alignment, smoothing; filtering (20s ≤ segment length < 45s); total 147.8h
- V3 data: phone decoding with the Switchboard tokeniser (and V1 segmentation); resegmentation: 320.5h (30s), 262.0h (10s), 308.4h (3s)
- Data partition: 80% train, 10% development, 10% internal test
[System overview diagram: VAD feeds three branches — I-vector (UBM/Tv training → SVM/LogReg → Gaussian backend), "direct" DNN (frame-based language DNN on bottleneck features → Gaussian backend), and phonotactic (Switchboard tokeniser → SVM classifier) — followed by system fusion.]
I-vector system
- Feature processing: VTLN normalisation; shifted delta cepstra 7 + 7-1-3-7 [Torres-Carrasquillo et al. 2002]; mean normalisation and frame-based VAD
- UBM and total variability: 2048-component full-covariance GMM; total variability matrix 114688 × 600 [Dehak, Kenny, et al. 2011]
- Language classifier: support vector machine or logistic regression
- Focus of study: data in UBM / total variability matrix training; language classifier; global vs within-cluster classifier
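The 7-1-3-7 shifted delta cepstra amount to stacking k = 7 time-shifted deltas onto the 7 static coefficients; a rough sketch of the standard N-d-P-k recipe (edge padding is an assumption of this illustration):

```python
import numpy as np

def sdc(cep, d=1, P=3, k=7):
    """Shifted delta cepstra, 7-1-3-7 style: for each of k blocks spaced
    P frames apart, take the delta c[t+iP+d] - c[t+iP-d] and stack the
    k deltas onto the static coefficients."""
    T, N = cep.shape
    pad = d + (k - 1) * P                # context needed beyond the ends
    padded = np.pad(cep, ((pad, pad), (0, 0)), mode="edge")
    feats = [cep]                        # start with the static coefficients
    for i in range(k):
        off = pad + i * P                # position of frame t shifted by i*P
        feats.append(padded[off + d : off + d + T] - padded[off - d : off - d + T])
    return np.concatenate(feats, axis=1)
```

With N = 7 static coefficients this yields 7 + 7×7 = 56 dimensions per frame, matching the "7 + 7-1-3-7" configuration.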
Configurations:
- A: UBM & total variability (Tv) matrix trained on 148h of selected data
- B: augmenting the UBM & Tv training data in A to the full training set (884h)
- C: using logistic regression (LogReg) instead of SVM as the LR classifier
- D: augmenting the LogReg training data in C to the full training set (884h)

Observations: the within-cluster classifier outperforms the global classifier; best training data for UBM and Tv: 887h.

[Bar chart: 10.75, 6.35, 6.00, 4.54, 4.42]
Configurations:
- B: augmenting the UBM & Tv training data in A to the full training set (884h)
- C: using logistic regression (LogReg) instead of SVM as the LR classifier
- D: augmenting the LogReg training data in C to the full training set (884h)

Observations: the within-cluster classifier outperforms the global classifier; best training data for the LR classifier: 332h; logistic regression outperforms SVM.

[Bar chart: 7.90, 7.74, 6.78, 6.09, 7.48]
Phonotactic system
- DNN phone tokeniser: LDA, speaker CMLLR; 400-2048(×6)-64-3815 DNN; phone-bigram LM (scale factor 0.5); (optional) sequence training on SWB data
- Language classifier: phone n-gram tf-idf statistics — phone bigrams / phone trigrams (5M dimensions)
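Phone-bigram tf-idf statistics of this kind might be computed as below (a minimal sketch using plain tf × log-idf weighting; the system's exact weighting and sparse-vector handling are not specified here):

```python
import math
from collections import Counter

def phone_bigram_tfidf(docs):
    """docs: list of phone-token sequences (one per utterance).
    Returns one {bigram: tf-idf weight} dict per utterance."""
    tfs = [Counter(zip(d, d[1:])) for d in docs]        # bigram counts
    df = Counter(b for tf in tfs for b in tf)           # document frequency
    n = len(docs)
    vectors = []
    for tf in tfs:
        total = sum(tf.values())
        # term frequency times log inverse document frequency
        vectors.append({b: (c / total) * math.log(n / df[b]) for b, c in tf.items()})
    return vectors
```

Bigrams that occur in every utterance get weight zero, so the classifier focuses on language-discriminative phone sequences.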
Test on V1 30-second internal test data: [bar chart — 9.0, 9.8, 9.8, 10.7]

Observations:
- 3-gram tf-idf outperforms 2-gram
- discriminatively trained DNN did not help
- test on V3 30s data → 11.3%
"Direct" DNN system
- Features: 64-dimensional bottleneck features from the Switchboard tokeniser; feature splicing ±4 frames
- Language recogniser: 576 - 750×4 - 20 DNN
- Prior normalisation: test probabilities multiplied by the inverse of the language priors
- Decision: frame-based language likelihood averaged over the whole utterance
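The prior normalisation and frame-averaging decision can be sketched as follows (assuming the DNN outputs per-frame language posteriors and that averaging is done in the log domain; both are assumptions of this illustration):

```python
import numpy as np

def utterance_scores(frame_post, priors):
    """frame_post: (T, L) per-frame language posteriors from the DNN.
    priors: (L,) language priors seen in training.
    Dividing posteriors by priors gives scaled likelihoods; averaging
    their logs over all frames yields one score per language."""
    log_lik = np.log(frame_post) - np.log(priors)   # prior normalisation
    return log_lik.mean(axis=0)                     # utterance-level average
```

The recognised language is then simply the argmax over the returned score vector.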
Test on V1 and V3 (internal test) data with different durations: [bar chart — 15.96, 18.74, 21.55, 18.07, 21.71]
[Updated system diagram: the VAD adds ASR-based silence detection; a bottleneck i-vector branch is added (bottleneck features → UBM/Tv training → LogReg → Gaussian backend); the i-vector branch uses LogReg in place of SVM; all systems feed the Gaussian backends and system fusion.]
Bottleneck i-vector system
- Feature processing: 64-dimensional bottleneck features from the Switchboard tokeniser; no VTLN, no SDC, no mean/variance normalisation; frame-based VAD
- UBM and total variability: 2048-component full-covariance GMM; total variability matrix 131072 × 600 [Dehak, Kenny, et al. 2011]
- Language classifier: logistic regression
i-vector vs bottleneck i-vector systems on internal test data:
- 30s: i-vector 6.09, bottleneck i-vector 5.13
- 10s: i-vector 12.23, bottleneck i-vector 9.06
- 3s: i-vector 17.2, bottleneck i-vector 13.69
Gaussian backend and fusion
- Gaussian backend applied to single-system output: GMM (4/8/16 components) trained on the score vectors from training data (30s); GMMs are target-language dependent
- Logistic regression: log-likelihood-ratio conversion; system combination weights trained on dev data (10%) [Brümmer et al. 2006]
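A stripped-down version of the language-dependent Gaussian backend (a single shared-covariance Gaussian per language here, whereas the system uses 4/8/16-component GMMs) might look like:

```python
import numpy as np

def train_gaussian_backend(scores, labels):
    """scores: (N, D) score vectors from a component system.
    labels: length-N list of language ids.
    Fits one Gaussian per language with a pooled covariance and
    returns a function mapping a score vector to per-language
    log-likelihoods (up to a shared constant)."""
    langs = sorted(set(labels))
    means = {l: scores[[y == l for y in labels]].mean(axis=0) for l in langs}
    # pool within-language deviations for a shared full covariance
    centered = np.vstack([scores[[y == l for y in labels]] - means[l] for l in langs])
    cov = np.cov(centered.T) + 1e-6 * np.eye(scores.shape[1])  # regularised
    prec = np.linalg.inv(cov)

    def loglik(s):
        return {l: -0.5 * (s - means[l]) @ prec @ (s - means[l]) for l in langs}

    return loglik
```

The per-language log-likelihoods can then be converted to log-likelihood ratios before fusion.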
Overall min DCF on internal test - 30s
[Two bar charts, min DCF (global threshold): first — Ivector 10.21, DNN 19.97, Phonotactic 19.21, 3-sys 9.42, bn-ivector 10.84, 4-sys 8.87; second — Ivector 22.83, DNN 21.81, Phonotactic 36.11, 3-sys 17.7, bn-ivector 18.47, 4-sys 15.53]
Overall eval system results (min DCF, global threshold):
- Ivector 32.92, DNN 40.16, Phonotactic 36.93, 3-sys 32.44, bn-ivector 29.56, 4-sys 29.2
Results shown on eval 30s data:
- System fusion always improves performance, except for fusion with the DNN
- For any given system, pairwise fusion with a better system improves performance
Conclusions
- Introduced the 3 LR component systems submitted to NIST LRE 2015
- Described the segmentation, data selection and classifier training
- An enhanced bottleneck i-vector system demonstrated good performance
- Future work: data selection and augmentation; multilingual NN bottlenecks; variability compensation
- Suggestions and collaborations welcome
Logistic regression (backup)
- Finds a linear transformation of the i-vector w that maximises the log posterior: the log odds are β₁ᵀw + β₀, i.e. p(L | w) = 1 / (1 + exp(−β₁ᵀw − β₀))
- Useful when the target variable is bounded or discrete
- Transforms an i-vector into a language recognition score
- Combines multiple system scores into one
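As a sketch (using the slide's weights β₁, β₀; the helper name is hypothetical), the trained transformation maps an i-vector to a bounded score via the sigmoid:

```python
import numpy as np

def logreg_score(w, beta1, beta0):
    """Map an i-vector w to a bounded language score:
    sigma(beta1 . w + beta0), the logistic-regression posterior."""
    z = float(np.dot(beta1, w) + beta0)
    return 1.0 / (1.0 + np.exp(-z))
```

A zero activation gives the neutral score 0.5; larger β₁ᵀw + β₀ pushes the score towards 1.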
System performance on LRE 2015 eval data: [bar chart over 30-sec, 10-sec and 3-sec conditions, comparing I-vector, DNN, Phonotactic, 3-sys, BN-IV and 4-sys]