The Sheffield language recognition system in NIST LRE 2015
Raymond Ng, Mauro Nicolao, Oscar Saz, Madina Hasan, Bhusan Chettri, Mortaza Doulaty, Tan Lee and Thomas Hain
University of Sheffield, UK &
The Chinese University of Hong Kong
Background:
- Classical approaches with acoustic-phonetic and phonotactic features [Zissman 1996; Ambikairajah et al. 2011; Li et al. 2013]
- Shifted-delta cepstral coefficients [Torres-Carrasquillo et al. 2002]
- I-vectors [Dehak, Torres-Carrasquillo, et al. 2011; Martinez et al. 2012], DNNs [Ferrer et al. 2014; Richardson et al. 2015] and their combination [Ferrer et al. 2016]

Sheffield LRE system: four LR component systems
- I-vector
- Phonotactic
- "Direct" DNN
- Bottleneck + i-vector
Training data: Switchboard 1, Switchboard Cellular Part 2, and the LRE 2015 training data

Target language clusters:
- Arabic: Egyptian (ara-arz), Iraqi (ara-acm), Levantine (ara-apc), Maghrebi (ara-ary), Modern Standard (ara-arb)
- English: British (eng-gbr), General American (eng-usg), Indian (eng-sas)
- French: West African (fre-waf), Haitian Creole (fre-hat)
- Slavic: Polish (qsl-pol), Russian (qsl-rus)
- Iberian: Caribbean Spanish (spa-car), European Spanish (spa-eur), Latin American Spanish (spa-lac), Brazilian Portuguese (por-brz)
- Chinese: Cantonese (zho-yue), Mandarin (zho-cmn), Min (zho-cdo), Wu (zho-wuu)
Training data for VAD:
- CTS data: CMLLR+BMMI Switchboard model → alignment → SIL vs non-SIL
- BNBS data: VAD reference from 1% of VOA2, VOA3 files
- Class balancing: add more non-speech data
DNN frame-based speech / non-speech classifier
- Features: filterbank (23D), ±15 frames of context, DCT across time → 368 dimensions
- Frame-wise classification: DNN 368-1000-1000-2, learning rate 0.001, newbob scheduling
- Sequence alignment: 2-state HMM, minimum state duration of 20 frames (200 ms)
- Smoothing: merging heuristics to bridge non-speech gaps < 2 seconds
- Results (collar: 10 ms)
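The 2-second gap-bridging heuristic can be sketched as follows (a minimal illustration, not the system's actual code; segments are assumed to be (start, end) times in seconds):

```python
def bridge_gaps(segments, max_gap=2.0):
    """Merge speech segments whose separating non-speech gap is
    shorter than max_gap seconds (the slide's 2-second heuristic)."""
    merged = []
    for start, end in sorted(segments):
        if merged and start - merged[-1][1] < max_gap:
            # bridge the short non-speech gap by extending the last segment
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged
```

For example, `bridge_gaps([(0.0, 1.0), (2.5, 4.0), (7.0, 8.0)])` merges the first two segments (gap of 1.5 s) and leaves the third alone.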
Segmentation: V1 (30s) and V3 (3s, 10s, 30s)
- V1 data: VAD, sequence alignment, smoothing; filtering (20s ≤ segment length < 45s); total 147.8h
- V3 data: phone decoding with the Switchboard tokeniser (and V1 segmentation); resegmentation: 320.5h (30s), 262.0h (10s), 308.4h (3s)
- Data partition: 80% train, 10% development, 10% internal test
[System overview diagram: VAD feeds three branches — I-vector (UBM/Tv training → SVM/LogReg → Gaussian backend), "direct" DNN (frame-based language DNN on bottleneck features → Gaussian backend), and phonotactic (Switchboard tokeniser → SVM classifier) — followed by system fusion.]
I-vector system
- Feature processing: VTLN normalisation; shifted delta cepstra 7 + 7-1-3-7 [Torres-Carrasquillo et al. 2002]; mean normalisation and frame-based VAD
- UBM and total variability: 2048-component full-covariance GMM; total variability matrix 114688 × 600 [Dehak, Kenny, et al. 2011]
- Language classifier: support vector machine or logistic regression
- Focus of study: data in UBM / total variability matrix training; language classifier; global vs within-cluster classifier
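The 7-1-3-7 shifted delta cepstra amount to stacking k = 7 time-shifted deltas onto the 7 static coefficients; a rough sketch of the standard N-d-P-k recipe (edge padding is an assumption of this illustration):

```python
import numpy as np

def sdc(cep, d=1, P=3, k=7):
    """Shifted delta cepstra, 7-1-3-7 style: for each of k blocks spaced
    P frames apart, take the delta c[t+iP+d] - c[t+iP-d] and stack the
    k deltas onto the static coefficients."""
    T, N = cep.shape
    pad = d + (k - 1) * P                # context needed beyond the ends
    padded = np.pad(cep, ((pad, pad), (0, 0)), mode="edge")
    feats = [cep]                        # start with the static coefficients
    for i in range(k):
        off = pad + i * P                # position of frame t shifted by i*P
        feats.append(padded[off + d : off + d + T] - padded[off - d : off - d + T])
    return np.concatenate(feats, axis=1)
```

With N = 7 static coefficients this yields 7 + 7×7 = 56 dimensions per frame, matching the "7 + 7-1-3-7" configuration.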
Configurations:
- A: UBM & total variability (Tv) matrix trained on 148h of selected data
- B: augmenting the UBM & Tv training data in A to the full training set (884h)
- C: using logistic regression (LogReg) instead of SVM as the LR classifier
- D: augmenting the LogReg training data in C to the full training set (884h)

Observations: the within-cluster classifier outperforms the global classifier; best training data for UBM and Tv: 887h.

[Bar chart: 10.75, 6.35, 6.00, 4.54, 4.42]
Configurations:
- B: augmenting the UBM & Tv training data in A to the full training set (884h)
- C: using logistic regression (LogReg) instead of SVM as the LR classifier
- D: augmenting the LogReg training data in C to the full training set (884h)

Observations: the within-cluster classifier outperforms the global classifier; best training data for the LR classifier: 332h; logistic regression outperforms SVM.

[Bar chart: 7.90, 7.74, 6.78, 6.09, 7.48]
Phonotactic system
- DNN phone tokeniser: LDA, speaker CMLLR; 400-2048(×6)-64-3815 DNN; phone-bigram LM (scale factor 0.5); (optional) sequence training on SWB data
- Language classifier: phone n-gram tf-idf statistics — phone bigrams / phone trigrams (5M dimensions)
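Phone-bigram tf-idf statistics of this kind might be computed as below (a minimal sketch using plain tf × log-idf weighting; the system's exact weighting and sparse-vector handling are not specified here):

```python
import math
from collections import Counter

def phone_bigram_tfidf(docs):
    """docs: list of phone-token sequences (one per utterance).
    Returns one {bigram: tf-idf weight} dict per utterance."""
    tfs = [Counter(zip(d, d[1:])) for d in docs]        # bigram counts
    df = Counter(b for tf in tfs for b in tf)           # document frequency
    n = len(docs)
    vectors = []
    for tf in tfs:
        total = sum(tf.values())
        # term frequency times log inverse document frequency
        vectors.append({b: (c / total) * math.log(n / df[b]) for b, c in tf.items()})
    return vectors
```

Bigrams that occur in every utterance get weight zero, so the classifier focuses on language-discriminative phone sequences.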
Test on V1 30-second internal test data: [bar chart — 9.0, 9.8, 9.8, 10.7]

Observations:
- 3-gram tf-idf outperforms 2-gram
- discriminatively trained DNN did not help
- test on V3 30s data → 11.3%
"Direct" DNN system
- Features: 64-dimensional bottleneck features from the Switchboard tokeniser; feature splicing ±4 frames
- Language recogniser: 576 - 750×4 - 20 DNN
- Prior normalisation: test probabilities multiplied by the inverse of the language priors
- Decision: frame-based language likelihood averaged over the whole utterance
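The prior normalisation and frame-averaging decision can be sketched as follows (assuming the DNN outputs per-frame language posteriors and that averaging is done in the log domain; both are assumptions of this illustration):

```python
import numpy as np

def utterance_scores(frame_post, priors):
    """frame_post: (T, L) per-frame language posteriors from the DNN.
    priors: (L,) language priors seen in training.
    Dividing posteriors by priors gives scaled likelihoods; averaging
    their logs over all frames yields one score per language."""
    log_lik = np.log(frame_post) - np.log(priors)   # prior normalisation
    return log_lik.mean(axis=0)                     # utterance-level average
```

The recognised language is then simply the argmax over the returned score vector.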
Test on V1 and V3 (internal test) data with different durations: [bar chart — 15.96, 18.74, 21.55, 18.07, 21.71]
[Updated system diagram: the VAD adds ASR-based silence detection; a bottleneck i-vector branch is added (bottleneck features → UBM/Tv training → LogReg → Gaussian backend); the i-vector branch uses LogReg in place of SVM; all systems feed the Gaussian backends and system fusion.]
Bottleneck i-vector system
- Feature processing: 64-dimensional bottleneck features from the Switchboard tokeniser; no VTLN, no SDC, no mean/variance normalisation; frame-based VAD
- UBM and total variability: 2048-component full-covariance GMM; total variability matrix 131072 × 600 [Dehak, Kenny, et al. 2011]
- Language classifier: logistic regression
i-vector vs bottleneck i-vector systems on internal test data:
- 30s: i-vector 6.09, bottleneck i-vector 5.13
- 10s: i-vector 12.23, bottleneck i-vector 9.06
- 3s: i-vector 17.2, bottleneck i-vector 13.69
Gaussian backend and fusion
- Gaussian backend applied to single-system output: GMM (4/8/16 components) trained on the score vectors from training data (30s); GMMs are target-language dependent
- Logistic regression: log-likelihood-ratio conversion; system combination weights trained on dev data (10%) [Brümmer et al. 2006]
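A stripped-down version of the language-dependent Gaussian backend (a single shared-covariance Gaussian per language here, whereas the system uses 4/8/16-component GMMs) might look like:

```python
import numpy as np

def train_gaussian_backend(scores, labels):
    """scores: (N, D) score vectors from a component system.
    labels: length-N list of language ids.
    Fits one Gaussian per language with a pooled covariance and
    returns a function mapping a score vector to per-language
    log-likelihoods (up to a shared constant)."""
    langs = sorted(set(labels))
    means = {l: scores[[y == l for y in labels]].mean(axis=0) for l in langs}
    # pool within-language deviations for a shared full covariance
    centered = np.vstack([scores[[y == l for y in labels]] - means[l] for l in langs])
    cov = np.cov(centered.T) + 1e-6 * np.eye(scores.shape[1])  # regularised
    prec = np.linalg.inv(cov)

    def loglik(s):
        return {l: -0.5 * (s - means[l]) @ prec @ (s - means[l]) for l in langs}

    return loglik
```

The per-language log-likelihoods can then be converted to log-likelihood ratios before fusion.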
Overall min DCF on internal test - 30s
[Two bar charts, min DCF (global threshold): first — Ivector 10.21, DNN 19.97, Phonotactic 19.21, 3-sys 9.42, bn-ivector 10.84, 4-sys 8.87; second — Ivector 22.83, DNN 21.81, Phonotactic 36.11, 3-sys 17.7, bn-ivector 18.47, 4-sys 15.53]
Overall eval system results (min DCF, global threshold):
- Ivector 32.92, DNN 40.16, Phonotactic 36.93, 3-sys 32.44, bn-ivector 29.56, 4-sys 29.2
Results shown on eval 30s data:
- System fusion always improves performance, except for fusion with the DNN
- For any given system, pairwise fusion with a better system improves performance
Conclusions
- Introduced the 3 LR component systems submitted to NIST LRE 2015
- Described the segmentation, data selection and classifier training
- An enhanced bottleneck i-vector system demonstrated good performance
- Future work: data selection and augmentation; multilingual NN bottlenecks; variability compensation
- Suggestions and collaborations welcome
Logistic regression (backup)
- Finds a linear transformation of the i-vector w that maximises the log posterior: the log odds are β₁ᵀw + β₀, i.e. p(L | w) = 1 / (1 + exp(−β₁ᵀw − β₀))
- Useful when the target variable is bounded or discrete
- Transforms an i-vector into a language recognition score
- Combines multiple system scores into one
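As a sketch (using the slide's weights β₁, β₀; the helper name is hypothetical), the trained transformation maps an i-vector to a bounded score via the sigmoid:

```python
import numpy as np

def logreg_score(w, beta1, beta0):
    """Map an i-vector w to a bounded language score:
    sigma(beta1 . w + beta0), the logistic-regression posterior."""
    z = float(np.dot(beta1, w) + beta0)
    return 1.0 / (1.0 + np.exp(-z))
```

A zero activation gives the neutral score 0.5; larger β₁ᵀw + β₀ pushes the score towards 1.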
System performance on LRE 2015 eval data: [bar chart over 30-sec, 10-sec and 3-sec conditions, comparing I-vector, DNN, Phonotactic, 3-sys, BN-IV and 4-sys]