Joint Factor Analysis for Text-Dependent Speaker Verification
  1. Joint Factor Analysis for Text-Dependent Speaker Verification
     Patrick Kenny, Themos Stafylakis, Md. Jahangir Alam, Pierre Ouellet and Marcel Kockmann
     Odyssey Speaker and Language Recognition Workshop, June 2014
     Outline: Text-Dependent Speaker Recognition; Three JFA-Based Classifiers and a Bunch of Anomalies; Evaluating JFA likelihood ratios with UBM adaptation; Conclusion

  2. Text-dependent speaker recognition
     - Lexical constraints enable speaker verification with short utterances
     - The classes to be recognized are speaker-phrase combinations rather than speakers as such
     - Speaker-phrase variability cannot generally be modeled using subspace methods (i-vectors or speaker factors)
     - Achieving channel robustness is hard
     - Left-to-right structure could be exploited, but this would complicate channel modeling
     - There is no such thing as a "universal" UBM

  3. JFA without speaker factors
     - Given parallel recordings of a phrase by a speaker, indexed by r
     - Each recording is assumed to be characterized by a GMM whose mean vectors are of the form m_c + U_c x_r + D_c z_c, where the hidden variables x_r and z_c have standard normal priors (c indexes mixture components)
     - U_c x_r models channel variability; D_c z_c models speaker-phrase variability
     - Typically U is estimated by maximum likelihood II and D_c by relevance MAP
     - The prior on z is factorial in the sense that P(z) = ∏_c P(z_c)
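The generative model on this slide can be sketched numerically. This is a toy illustration, not the authors' code: the dimensions, the random parameters, and the use of a diagonal D_c are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
C, F = 4, 3          # mixture components, feature dimension (toy sizes)
R_u = 2              # rank of the channel subspace (assumption)

m = rng.normal(size=(C, F))               # UBM component means m_c
U = rng.normal(size=(C, F, R_u)) * 0.1    # channel loading matrices U_c
D = np.abs(rng.normal(size=(C, F))) * 0.1 # diagonals of D_c (relevance-MAP style)

x = rng.standard_normal(R_u)              # channel factors x_r ~ N(0, I)
z = rng.standard_normal((C, F))           # speaker-phrase factors z_c ~ N(0, I)

# GMM mean vectors for recording r: m_c + U_c x_r + D_c z_c
means_r = m + np.einsum('cfk,k->cf', U, x) + D * z
```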

  4. Vogt's algorithm [Vogt, 2008]
     - Starting from Baum-Welch statistics, the hidden variables can be estimated by alternating between x and z
     - This is a variational Bayes algorithm, so it produces a variational lower bound which can be used to do likelihood (or evidence) calculations
     - For example, speaker verification decisions can be made by Bayesian model selection, in the same way as in PLDA: given enrollment and test utterances, is the data better accounted for by positing one z-vector or two?
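The alternating updates can be sketched for the diagonal-covariance, diagonal-D_c case. All dimensions and statistics here are synthetic, and point (MAP) updates are used in place of the full variational posteriors, so this is only a shape-level illustration of the alternation, not the paper's algorithm.

```python
import numpy as np

rng = np.random.default_rng(1)
C, F, R = 4, 3, 2                              # toy sizes (assumption)
m   = rng.normal(size=(C, F))                  # component means m_c
U   = rng.normal(size=(C, F, R)) * 0.1         # channel loadings U_c
d   = np.abs(rng.normal(size=(C, F))) * 0.1    # diagonal of D_c
Sig = np.abs(rng.normal(size=(C, F))) + 0.5    # diagonal covariances Σ_c

# synthetic Baum-Welch statistics (zero- and first-order)
N  = rng.integers(5, 50, size=C).astype(float)
Fs = N[:, None] * (m + 0.2 * rng.normal(size=(C, F)))

x = np.zeros(R)
z = np.zeros((C, F))
for _ in range(10):  # alternate between x and z, Vogt-style
    # x given z: precision I + Σ_c N_c U_c^T Σ_c^{-1} U_c
    P   = np.eye(R) + sum(N[c] * U[c].T @ (U[c] / Sig[c][:, None]) for c in range(C))
    rhs = sum(U[c].T @ ((Fs[c] - N[c] * (m[c] + d[c] * z[c])) / Sig[c]) for c in range(C))
    x   = np.linalg.solve(P, rhs)
    # z_c given x: elementwise, precision 1 + N_c d_c^2 / Σ_c
    resid = Fs - N[:, None] * (m + np.einsum('cfk,k->cf', U, x))
    z = (d * resid / Sig) / (1.0 + N[:, None] * d**2 / Sig)
```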

  5. Zhao and Dong's algorithm [Zhao, 2010]
     - Baum-Welch statistics ought to be collected with GMMs adapted from the UBM (rather than with the UBM itself)
     - Introduce extra hidden variables to account for the alignment between frames and mixture components
     - Variational Bayes and Bayesian model selection by extending Vogt's algorithm
     - Caveat: we do not have the liberty to adapt the UBM using some of the hidden variables but not others; this is a problem

  6. JFA-Based Classifiers
     - z-vectors as features [Kenny, 2014]:
       - UBM adaptation results in a severe degradation
       - Maximum likelihood II misbehaves in the absence of UBM adaptation
     - Bayesian model selection:
       - UBM adaptation helps for small codebooks (64 Gaussians)
       - UBM adaptation hurts for large codebooks (512 Gaussians)
     - Traditional JFA likelihood ratios (JFA-LLR):
       - Work better than Bayesian model selection without UBM adaptation
       - Can be made to work much better (40% error rate reductions) with careful UBM adaptation

  7. Anomalous results I
     Bayesian model selection, Vogt vs. Zhao & Dong:

                      512 Gaussians       64 Gaussians
                      EER    2008 NDCF    EER    2008 NDCF
       Vogt           2.2%   0.085        3.6%   0.145
       Zhao & Dong    2.7%   0.096        3.4%   0.133

     - With 64 Gaussians, UBM adaptation is helpful
     - With 512 Gaussians, UBM adaptation is not helpful
     - Most experiments conducted on a "hard" subset of RSR2015 (generally with 64 Gaussians)

  8. Anomalous results II
     Bayesian model selection versus traditional JFA log likelihood ratios (JFA-LLRs):

                      512 Gaussians       64 Gaussians
                      EER    2008 NDCF    EER    2008 NDCF
       Vogt           2.2%   0.085        3.6%   0.145
       Zhao & Dong    2.7%   0.096        3.4%   0.133
       JFA-LLR        1.7%   0.065        2.7%   0.110

     - Traditional JFA-LLRs work better than Bayesian model selection for both 512 and 64 Gaussians
     - No UBM adaptation in the traditional JFA-LLR calculation

  9. The paper in a nutshell
     - JFA-LLR results can be substantially improved (40% reductions in error rates) with careful UBM adaptation
       - Adapting to lexical content, to speaker effects in enrollment utterances, and to channel effects in test utterances are all helpful
       - Do not adapt to speaker effects in test utterances
     - Maximum likelihood II estimation works properly if LLRs are evaluated with UBM adaptation
     - Even so, JFA works better as a feature extractor
       - In this case, UBM adaptation is not helpful
       - The factorial prior is too weak

  10. Phrase-dependent background modeling (PBM)

              a-b-c   EER    2008 NDCF
       UBM    0-1-1   2.7%   0.110
       PBM    1-1-1   2.1%   0.092

      - In traditional JFA, the numerator of the LLR is evaluated by comparing the test speaker to the "UBM speaker"
      - In text-dependent speaker recognition, lexical content introduces a mismatch with the UBM
      - Mean supervectors can be made phrase-dependent; the other JFA parameters are shared across phrases
      - a-b-c counts alignment iterations in phrase adaptation (a), in processing enrollment utterances (b) and in processing test utterances (c)
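Phrase-dependent mean supervectors can be sketched as relevance-MAP mean adaptation of the UBM toward the Baum-Welch statistics of one phrase. The relevance factor value and the synthetic statistics below are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
C, F = 4, 3                                      # toy sizes (assumption)
r = 16.0                                         # relevance factor (assumed value)

m = rng.normal(size=(C, F))                      # shared UBM means
N = rng.integers(1, 40, size=C).astype(float)    # zero-order stats for one phrase
Fs = N[:, None] * (m + 0.3)                      # first-order stats for that phrase

# one iteration of relevance-MAP mean adaptation toward the phrase:
# m_c^PBM = (F_c + r * m_c) / (N_c + r)
m_pbm = (Fs + r * m) / (N[:, None] + r)
```

As N_c grows the adapted mean moves from the UBM mean toward the phrase's data mean, which is what makes the background model phrase-dependent while leaving the other JFA parameters shared.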

  11. Adapting to the channel effects in the test utterance

                                        a-b-c   EER    2008 NDCF
       channel factors                  1-1-1   2.1%   0.092
       channel factors + adaptation    1-1-5   2.0%   0.086

      - Traditional JFA uses the UBM (or PBM) to align the test data and integrates over channel factors
      - Eigenchannel modeling as originally conceived involved multiple alignment iterations
      - Variational Bayes enables you to do both
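The multiple-alignment idea can be sketched as a loop that re-aligns frames with the channel-adapted GMM, re-collects Baum-Welch statistics, and re-estimates the channel factors. Everything here (the toy data, diagonal covariances, and point estimates of x in place of the variational posterior) is an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(3)
C, F, R, T = 3, 2, 2, 50                    # components, features, rank, frames
m   = rng.normal(size=(C, F))               # background-model means
Sig = np.ones((C, F))                       # diagonal covariances (assumption)
U   = rng.normal(size=(C, F, R)) * 0.1      # channel loadings U_c
w   = np.full(C, 1.0 / C)                   # mixture weights
X   = rng.normal(size=(T, F)) + m[rng.integers(0, C, T)]  # toy test frames

x = np.zeros(R)                             # channel factors
for _ in range(5):                          # alignment iterations (the 'c' in a-b-c)
    mu = m + np.einsum('cfk,k->cf', U, x)   # channel-adapted means
    ll = (-0.5 * (((X[:, None, :] - mu) ** 2) / Sig).sum(-1)
          - 0.5 * np.log(2 * np.pi * Sig).sum(-1) + np.log(w))
    g = np.exp(ll - ll.max(1, keepdims=True))
    g /= g.sum(1, keepdims=True)            # frame-to-component responsibilities
    N, Fs = g.sum(0), g.T @ X               # re-collected Baum-Welch stats
    # point estimate of x given the stats (standard normal prior):
    P   = np.eye(R) + sum(N[c] * U[c].T @ (U[c] / Sig[c][:, None]) for c in range(C))
    rhs = sum(U[c].T @ ((Fs[c] - N[c] * m[c]) / Sig[c]) for c in range(C))
    x   = np.linalg.solve(P, rhs)
```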

  12. Adapting to the speaker effects in the enrollment utterances

       a-b-c   EER    2008 NDCF
       1-1-1   2.1%   0.092
       1-5-1   2.0%   0.080

      - To evaluate the denominator of the likelihood ratio, collect Baum-Welch statistics with a GMM that has been adapted to the speaker effects in the enrollment utterances
      - Do not adapt to the speaker effects in the test utterance

  13. Phrase-dependent background modeling – again

       a-b-c   EER    2008 NDCF
       5-1-1   1.7%   0.076
       5-5-1   1.7%   0.070
       5-5-5   1.6%   0.066

      - Estimating PBMs with 5 iterations of relevance MAP rather than one gives another major improvement

  14. Maximum likelihood II works with UBM adaptation

                       a-b-c   EER    2008 NDCF
       relevance MAP   5-5-1   1.7%   0.070
       diagonal D_c    5-5-1   1.7%   0.069
       full D_c        5-5-1   1.7%   0.065

      - For relevance MAP, D_c^T Σ_c^(-1) D_c = (1/r) I (r = relevance factor)
      - Maximum likelihood II estimation of D only works if multiple alignment iterations are performed in JFA training (and at enrollment time)
      - If D_c is taken to be full, it turns out to be of low rank (compare [Hasan, 2013])
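The relevance-MAP constraint pins D_c down from the UBM covariance: in the diagonal case, D_c^T Σ_c^(-1) D_c = (1/r) I gives D_c = sqrt(Σ_c / r). A minimal check, where the relevance factor value and the covariance entries are arbitrary assumptions:

```python
import numpy as np

r = 16.0                                # relevance factor (assumed value)
Sigma_c = np.array([0.9, 1.2, 0.7])     # diagonal covariance of one component

# D_c^T Σ_c^{-1} D_c = (1/r) I  =>  d_i = sqrt(σ_i / r) for each dimension i
D_c = np.sqrt(Sigma_c / r)
```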
