 
IBM Research
The IBM 2016 Speaker Recognition System
Seyed Omid Sadjadi, Sriram Ganapathy, Jason Pelecanos
Outline
• Introduction
• Speaker Recognition System
• Experimental Setup
• Results
• Conclusions
Introduction
Recent Progress
• Major advancements over the past several years
• State-of-the-art (SOTA) i-vector systems use universal background models (UBMs) to estimate sufficient statistics
• Previous work:
  • Gaussian mixture model UBM [Reynolds 1997]
  • Phonetically-inspired UBM (PI-UBM) [Omar 2010]
  • DNN-based phonetically-aware UBM [Lei 2014]
  • TDNN-based UBM (full-covariance) [Snyder 2015]
• DNN bottleneck based features are also used in SOTA systems [Heck 1998; Richardson 2015; Matějka 2016]
Objectives
• To share state-of-the-art results on the NIST 2010 SRE
• To present the key system components that helped us achieve these results:
  • Speaker- and channel-adapted fMLLR based features
  • A DNN acoustic model with a large number of senones (10k)
  • A nearest-neighbor discriminant analysis (NDA) technique
• To quantify the contribution of each component
Speaker Recognition System
[Figure: system block diagram. Labels: Speech, SAD, Acoustic Feats., fMLLR, Suff. Stats, i-vector Extraction (T matrix), Dim. Reduc. (LDA/NDA), PLDA, Score]
• Our i-vector based speaker recognition system:
  • Speaker- and channel-normalized fMLLR based features
  • i-vectors are estimated using DNN senone posteriors (~10k)
  • LDA/NDA based intersession variability compensation
Feature space MLLR (fMLLR)
[Figure: system block diagram. Labels: Speech, SAD, Acoustic Feats., fMLLR, Suff. Stats, i-vector Extraction (T matrix), Dim. Reduc. (LDA/NDA), PLDA, Score]
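As a reminder of what the fMLLR stage does at run time: fMLLR (also known as constrained MLLR) applies a per-speaker affine transform to every feature frame. The following is only a minimal sketch of the application step, assuming the transform has already been estimated; the function name `apply_fmllr` and the variables `A` and `b` are hypothetical, and the estimation procedure is not shown.

```python
import numpy as np

def apply_fmllr(feats, A, b):
    """Apply an affine feature-space transform to every frame.

    feats: (T, D) array of acoustic feature frames
    A:     (D, D) transform matrix (assumed already estimated for this speaker)
    b:     (D,)   bias vector
    Returns a (T, D) array with rows A @ x_t + b.
    """
    feats = np.asarray(feats, dtype=float)
    return feats @ np.asarray(A, dtype=float).T + np.asarray(b, dtype=float)

# Hypothetical example: 100 frames of 40-dimensional features.
rng = np.random.default_rng(0)
frames = rng.normal(size=(100, 40))
normalized = apply_fmllr(frames, np.eye(40), np.zeros(40))
```

With the identity matrix and a zero bias the features pass through unchanged, which is a convenient sanity check.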
DNN Senone I-vectors
[Figure: system block diagram. Labels: Speech, SAD, Acoustic Feats., fMLLR, Suff. Stats, i-vector Extraction (T matrix), Dim. Reduc. (LDA/NDA), PLDA, Score. Additional labels: Senones (10k), Posteriors, B-W Statistics]
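The Baum-Welch (B-W) statistics step can be sketched as follows: given per-frame class posteriors (here assumed to come from the DNN senone outputs rather than from GMM components), the zero-order statistics are the posterior sums and the first-order statistics are the posterior-weighted feature sums. This is a minimal illustration with hypothetical names; centering around UBM means and any posterior pruning are omitted.

```python
import numpy as np

def baum_welch_stats(feats, post):
    """Accumulate zero- and first-order statistics.

    feats: (T, D) feature frames
    post:  (T, C) per-frame posteriors over C classes (each row sums to 1)
    Returns N with shape (C,) and F with shape (C, D).
    """
    feats = np.asarray(feats, dtype=float)
    post = np.asarray(post, dtype=float)
    N = post.sum(axis=0)   # zero-order: soft frame count per class
    F = post.T @ feats     # first-order: posterior-weighted feature sums
    return N, F

# Hypothetical toy sizes (the deck uses ~10k senones and 40-D features).
rng = np.random.default_rng(0)
x = rng.normal(size=(200, 40))
logits = rng.normal(size=(200, 16))
p = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
N, F = baum_welch_stats(x, p)
```

Because every row of the posterior matrix sums to one, the zero-order statistics sum to the number of frames.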
Linear Discriminant Analysis (LDA)
[Figure: system block diagram. Labels: Speech, SAD, Acoustic Feats., fMLLR, Suff. Stats, i-vector Extraction (T matrix), Dim. Reduc. (LDA/NDA), PLDA, Score]
• LDA assumes unimodal and Gaussian distributions
• It cannot effectively handle multimodal data
• It can be rank deficient
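One common reading of the rank-deficiency point is that the LDA between-class scatter matrix built from C global class means has rank at most C − 1, regardless of the feature dimension. The sketch below illustrates this numerically on synthetic data; the function name is hypothetical and the class weights are assumed to be the class proportions.

```python
import numpy as np

def lda_between_scatter(X, y):
    """Between-class scatter from global class means.

    X: (N, D) data matrix, y: (N,) integer class labels.
    Class weights are assumed to be class proportions N_i / N.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    mu = X.mean(axis=0)
    Sb = np.zeros((X.shape[1], X.shape[1]))
    for c in np.unique(y):
        Xc = X[y == c]
        diff = (Xc.mean(axis=0) - mu)[:, None]
        Sb += (len(Xc) / len(X)) * (diff @ diff.T)
    return Sb

# Synthetic example: 3 classes in 10 dimensions.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 10))
y = np.repeat(np.arange(3), 20)
Sb = lda_between_scatter(X, y)
# An explicit tolerance keeps round-off from inflating the numerical rank.
rank = np.linalg.matrix_rank(Sb, tol=1e-8)
```

Here `rank` cannot exceed 2 even though the data are 10-dimensional, so LDA can provide at most C − 1 discriminant directions.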
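Nearest Neighbor Discriminant Analysis (NDA)
[Figure: two-class illustration (Class 1, Class 2) comparing NDA and LDA]

LDA between-class scatter (global class means):
$$S_b = \sum_{i=1}^{C} p_i \,(\mu_i - \mu)(\mu_i - \mu)^T$$

NDA between-class scatter (local k-NN means):
$$S_b = \sum_{i=1}^{C} \sum_{\substack{j=1 \\ j \neq i}}^{C} \sum_{l=1}^{N_i} w_l^{ij} \,(x_l^i - M_l^{ij})(x_l^i - M_l^{ij})^T$$

Weights (emphasize samples near the boundary):
$$w_l^{ij} = \frac{\min\left\{ d^{\alpha}\!\left(x_l^i, NN_K(x_l^i, i)\right),\; d^{\alpha}\!\left(x_l^i, NN_K(x_l^i, j)\right) \right\}}{d^{\alpha}\!\left(x_l^i, NN_K(x_l^i, i)\right) + d^{\alpha}\!\left(x_l^i, NN_K(x_l^i, j)\right)}$$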
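The NDA between-class scatter can be sketched directly from its definition: for every sample, take the mean of its K nearest neighbors in each other class, and weight the resulting outer product by how close the sample is to the class boundary. This is a brute-force illustration under stated assumptions, not the system's implementation: Euclidean distance is assumed, NN_K is taken as the K-th nearest neighbor, the sample itself is excluded when searching its own class, and the function names and default values of `K` and `alpha` are hypothetical.

```python
import numpy as np

def knn_in_class(x, Xc, K, exclude_self=False):
    """Return the K nearest rows of Xc to x and their distances (ascending)."""
    d = np.linalg.norm(Xc - x, axis=1)
    order = np.argsort(d)
    if exclude_self:
        order = order[1:]  # assumes x is itself a row of Xc (distance 0)
    idx = order[:K]
    return Xc[idx], d[idx]

def nda_between_scatter(X, y, K=3, alpha=2.0):
    """Nonparametric between-class scatter using local K-NN means."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    Sb = np.zeros((X.shape[1], X.shape[1]))
    classes = np.unique(y)
    for i in classes:
        Xi = X[y == i]
        for j in classes:
            if j == i:
                continue
            Xj = X[y == j]
            for x in Xi:
                nn_j, dist_j = knn_in_class(x, Xj, K)
                _, dist_i = knn_in_class(x, Xi, K, exclude_self=True)
                M = nn_j.mean(axis=0)        # local K-NN mean in class j
                di = dist_i[-1] ** alpha     # distance to K-th NN in own class
                dj = dist_j[-1] ** alpha     # distance to K-th NN in class j
                w = min(di, dj) / (di + dj)  # near 0.5 at the class boundary
                diff = (x - M)[:, None]
                Sb += w * (diff @ diff.T)
    return Sb

# Synthetic two-class example.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (15, 4)), rng.normal(3.0, 1.0, (15, 4))])
y = np.repeat([0, 1], 15)
Sb_nda = nda_between_scatter(X, y, K=3)
```

Because each sample contributes its own local direction, the resulting matrix is generally not limited to rank C − 1 the way the scatter built from global class means is.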
Experimental Setup
Data
• Training data:
  • NIST 2004-2008 SRE (English telephony and microphone data)
  • Switchboard (SWB) cellular Parts I and II, SWB2 Phases II and III
  • Total of 60,178 recordings
• Evaluation data:
  • NIST 2010 SRE (extended evaluation set)

Cond.  Enroll     Test                        Mismatch  #Targets  #Impostors
C1     Int. mic.  Int. mic. (same type)       No           4,034     795,995
C2     Int. mic.  Int. mic. (different type)  Yes         15,084   2,789,534
C3     Int. mic.  Telephony                   Yes          3,989     637,850
C4     Int. mic.  Room microphone             Yes          3,637     756,775
C5     Telephony  Telephony (different type)  Yes          7,169     408,950
DNN System Configuration
• 6 fully connected hidden layers with 2048 units
• The bottleneck layer has 512 units
• Trained using 600 hours of speech from Fisher
• Input is a 9-frame context of 40-D fMLLR features
• Estimates posterior probabilities of 10k senones
• 2k and 4k posteriors are also explored
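A 9-frame context of 40-D features gives a 9 × 40 = 360-dimensional DNN input per frame. The sketch below shows one way to build such spliced inputs, assuming a symmetric window of 4 frames on each side and edge-frame replication at utterance boundaries; both choices, and the function name `splice`, are assumptions rather than details taken from the slides.

```python
import numpy as np

def splice(feats, left=4, right=4):
    """Stack each frame with its neighbors into one long input vector.

    feats: (T, D) feature frames
    Returns (T, (left + 1 + right) * D); boundary frames are replicated.
    """
    feats = np.asarray(feats, dtype=float)
    T = feats.shape[0]
    padded = np.pad(feats, ((left, right), (0, 0)), mode="edge")
    return np.hstack([padded[k:k + T] for k in range(left + right + 1)])

# Hypothetical utterance: 300 frames of 40-D features.
frames = np.random.default_rng(0).normal(size=(300, 40))
dnn_input = splice(frames)  # shape (300, 360)
```

The middle 40 columns of each row are the current frame; the columns on either side hold its past and future context.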
Speaker Recognition System Configuration
• 500-dimensional total variability subspace trained using a subset of 48,325 recordings from NIST SRE, SWBCELL, and SWB2
• Sufficient statistics are generated using posteriors from:
  • Gender-independent 2048-component GMM-UBM (21,207 recordings)
  • DNN with 7 hidden layers and 2k, 4k, or 10k senones
• MFCCs and fMLLR based features are evaluated
• LDA/NDA is applied to obtain 250-dimensional feature vectors
• Gaussian PLDA backend trained with 60,178 speech segments
• Evaluation metrics: equal error rate (EER), minDCF08, and minDCF10
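For readers less familiar with the EER metric, it is the operating point at which the miss rate on target trials equals the false-alarm rate on impostor trials. The following is a simple threshold-sweep approximation with a hypothetical function name; it is not the official NIST scoring tool, and minDCF is not covered.

```python
import numpy as np

def equal_error_rate(target_scores, impostor_scores):
    """Approximate the EER by sweeping thresholds over all observed scores."""
    tgt = np.sort(np.asarray(target_scores, dtype=float))
    imp = np.sort(np.asarray(impostor_scores, dtype=float))
    thresholds = np.concatenate([tgt, imp])
    # Miss rate: fraction of target scores strictly below the threshold.
    p_miss = np.searchsorted(tgt, thresholds, side="left") / tgt.size
    # False-alarm rate: fraction of impostor scores at or above the threshold.
    p_fa = 1.0 - np.searchsorted(imp, thresholds, side="left") / imp.size
    k = np.argmin(np.abs(p_miss - p_fa))
    return 0.5 * (p_miss[k] + p_fa[k])

# Hypothetical scores: well-separated targets and impostors.
rng = np.random.default_rng(0)
eer = equal_error_rate(rng.normal(3.0, 1.0, 1000), rng.normal(0.0, 1.0, 10000))
```

The function returns a fraction, so multiply by 100 to compare with the EER [%] columns in the results tables.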
Results
LDA vs NDA (MFCC, 2048-GMM, 10k DNN, C5)

System         EER [%]  minDCF08  minDCF10
GMM-MFCC-LDA   2.40     0.120     0.439
GMM-MFCC-NDA   1.55     0.076     0.286
DNN-MFCC-LDA   1.02     0.045     0.168
DNN-MFCC-NDA   0.76     0.036     0.147

• NDA outperforms LDA for both GMM and DNN based systems
MFCC vs fMLLR (10k DNN, C5)

System          EER [%]  minDCF08  minDCF10
DNN-MFCC-LDA    1.02     0.045     0.168
DNN-fMLLR-LDA   0.82     0.032     0.120
DNN-MFCC-NDA    0.76     0.036     0.147
DNN-fMLLR-NDA   0.67     0.028     0.092

• Speaker- and channel-normalized fMLLRs outperform MFCCs
Impact of #Senones (fMLLR, C5)

#Senones  System   EER [%]  minDCF08  minDCF10
2k        DNN-LDA  1.19     0.054     0.212
2k        DNN-NDA  0.95     0.043     0.166
4k        DNN-LDA  0.98     0.041     0.169
4k        DNN-NDA  0.86     0.033     0.116
10k       DNN-LDA  0.82     0.032     0.120
10k       DNN-NDA  0.67     0.028     0.092

• Using 10k senones gives the best performance
• NDA consistently outperforms LDA for 2k, 4k, and 10k senones
• Note: in contrast to DNNs, increasing the number of components in GMMs (beyond 2k, with diagonal covariance matrices) does not improve the results [Lei 2014; Snyder 2015]
DET Plot Performance (C5)
[Figure: DET plot for condition C5]
System Progression (C5)

System          EER [%]  minDCF08  minDCF10
GMM-MFCC-LDA    2.40     0.120     0.439
GMM-MFCC-NDA    1.55     0.076     0.286
DNN-MFCC-NDA    0.76     0.036     0.147
DNN-fMLLR-NDA   0.67     0.028     0.092

• Achieved the best published performance (EER = 0.67%) on NIST 2010 SRE (C5)
• Building upon previous best results:
  • EER = 1.09% [Snyder 2015], gender-dependent (both genders)
  • EER = 0.94% [Matějka 2016], gender-dependent (female trials)
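The size of each step in the progression is easier to judge as a relative EER reduction. The short calculation below uses only the EER values from the progression table; the system names for the last three rows are matched by value against the LDA vs NDA and MFCC vs fMLLR tables, and the helper name is hypothetical.

```python
# EER [%] on C5, in the order of the progression table.
progression = [
    ("GMM-MFCC-LDA", 2.40),
    ("GMM-MFCC-NDA", 1.55),
    ("DNN-MFCC-NDA", 0.76),
    ("DNN-fMLLR-NDA", 0.67),
]

def relative_reduction(before, after):
    """Relative reduction in percent when going from `before` to `after`."""
    return 100.0 * (before - after) / before

# Step-by-step reductions between consecutive systems.
steps = [
    (prev[0], cur[0], relative_reduction(prev[1], cur[1]))
    for prev, cur in zip(progression, progression[1:])
]
# Overall reduction from the first to the last system.
overall = relative_reduction(progression[0][1], progression[-1][1])

for src, dst, red in steps:
    print(f"{src} -> {dst}: {red:.1f}% relative EER reduction")
print(f"Overall: {overall:.1f}% relative EER reduction")
```

On these numbers the largest single step is the move from the GMM-UBM to the DNN (about 51% relative), and the overall reduction from the baseline to the final system is about 72% relative.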
Conclusions
• Presented the IBM i-vector speaker recognition system:
  • Speaker- and channel-normalized fMLLR based features may be more effective than raw MFCCs in matched conditions
  • Using a DNN-UBM with 10k senones to partition the acoustic space provides the best performance
  • NDA is more effective than LDA for channel compensation in the i-vector space (with multimodal data)
• Achieved the best published performance (EER = 0.67%) on NIST 2010 SRE (C5)
• For further progress on our system, see us at IS-2016