The SRI NIST SRE08 Speaker Verification System
- M. Graciarena, S. Kajarekar, N. Scheffer, E. Shriberg, A. Stolcke (SRI International)
- L. Ferrer (Stanford U. & SRI)
- T. Bocklet (U. Erlangen & SRI)
Development data
- VAD choices: similar, comparable with other sites
- ASR choices
- Despite not training or tuning on interview data, performance was quite good
- A separate SRI study varying style, vocal effort, and microphone shows that cepstral systems do not suffer from style mismatch between interviews and conversations if the channel is constant (Interspeech 2008)
- SRE08 conditions 5-8 had dev data from SRE06; for conditions 1-4, we used altmic data as a surrogate for interview data
Submissions
Evaluation conditions (train × test):

| Train \ Test | Interview (mic) | Phonecall |
| Interview (mic) | 1convmic–1convmic (conditions 1, 2, 3) | 1convmic–1conv4w (condition 4) |
| Conversation (phn) | not evaluated in SRE08 | 1conv4w–1convmic, mic channel (condition 5); 1conv4w–1conv4w, phn channel (conditions 6, 7, 8) |
1. Lattice generation (MFCC+MLP features)
2. N-best generation (PLP features)
3. LM and prosodic model rescoring; confusion network decoding
Word error rates (%, transcripts from LDC and ICSI):

| Test set | SRE06 ASR system | SRE08 ASR system | Rel. improvement |
| Fisher 1 native | 23.3 | 17.0 | 27% |
| Mixer 1 native | 29.4 | 23.0 | 22% |
| Mixer 1 nonnative | 49.5 | 36.1 | 27% |
| SRE06 altmic | 35.3 | 28.8 | 18% |

Nativeness ID (using MLLR-SVM): 12.5% ⇒ 10.9% EER

Effect on ASR-based speaker verification (DCF / %EER):

| Condition | SRE06 ASR system | SRE08 ASR system | Rel. improvement (DCF) |
| MLLR tel | .156 / 3.47 | .147 / 2.82 | 5.8% |
| MLLR altmic | .250 / 6.46 | .228 / 6.25 | 8.8% |
| SNERF altmic | .645 / 16.46 | .613 / 15.79 | 5.0% |
| Word N-gram tel | .831 / 24.1 | .818 / 23.5 | 1.6% |
Front-end for GMM-based cepstral systems
GMM-LLR system
GMM-SVs system
ISVs for GMM-SVs:
- Concatenation of 50 EC from SRE04 + 50 EC from SWB2 phases 2, 3, 5 + 50 EC from SRE05 altmic
- Surprising results on altmic conditions (8conv)
- Concatenation of 80 EC from SRE04 + 80 EC from SRE05 altmic
Combination
Particularities
front-end)
instead of Eigenchannel MAP
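The GMM-SVs system above builds on MAP-adapted mean supervectors. The following is a minimal 1-D illustration of relevance-MAP adaptation of UBM means, not the actual SRI implementation: unit variances, a two-component UBM, and relevance factor r=16 are simplifying assumptions.

```python
import math

def gauss(x, m):
    # Unit-variance Gaussian density (a simplifying assumption)
    return math.exp(-0.5 * (x - m) ** 2) / math.sqrt(2.0 * math.pi)

def map_adapted_supervector(ubm_means, ubm_weights, frames, r=16.0):
    """Relevance-MAP adapt the UBM means to one utterance's frames and
    return the concatenated mean supervector (1-D features here)."""
    K = len(ubm_means)
    n = [0.0] * K  # zeroth-order (soft count) statistics
    s = [0.0] * K  # first-order statistics
    for x in frames:
        post = [w * gauss(x, m) for w, m in zip(ubm_weights, ubm_means)]
        z = sum(post)
        for k in range(K):
            g = post[k] / z
            n[k] += g
            s[k] += g * x
    # Interpolate between the data mean and the UBM prior mean
    return [
        (n[k] / (n[k] + r)) * (s[k] / n[k] if n[k] > 0 else ubm_means[k])
        + (r / (n[k] + r)) * ubm_means[k]
        for k in range(K)
    ]

sv = map_adapted_supervector([-2.0, 2.0], [0.5, 0.5], [2.0, 2.5, 3.0])
```

Components that see little data stay near the UBM prior, which is what makes the supervector robust to short utterances; the supervector then feeds the SVM.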
| Feature | Transforms | ASR? | SRE06 English* (DCF / %EER) | SRE06 All* (DCF / %EER) |
| PLP | 8+8 | yes | .111 / 2.22 | n/a |
| PLP | 8+8 | no | .138 / 2.87 | .260 / 5.23 |
| PLP | 2+2 | no | .154 / 3.36 | .266 / 5.42 |
| MFCC | 2+2 | no | .189 / 3.90 | .270 / 5.92 |

* No language calibration used
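The DCF and %EER figures reported throughout these slides can be computed from raw target/impostor score lists. A minimal sketch, assuming the NIST CDet cost parameters used in the SRE evaluations (C_miss=10, C_fa=1, P_target=0.01) and a simple threshold sweep:

```python
def eer_and_min_dcf(tar, non, c_miss=10.0, c_fa=1.0, p_tar=0.01):
    """Sweep a decision threshold over the pooled scores and return
    (EER, minimum normalized DCF) for target/impostor score lists."""
    norm = min(c_miss * p_tar, c_fa * (1.0 - p_tar))  # NIST normalization
    best_dcf, eer, best_gap = float("inf"), 1.0, float("inf")
    for t in sorted(set(tar) | set(non)):
        p_miss = sum(s < t for s in tar) / len(tar)   # targets rejected
        p_fa = sum(s >= t for s in non) / len(non)    # impostors accepted
        dcf = (c_miss * p_tar * p_miss + c_fa * (1.0 - p_tar) * p_fa) / norm
        best_dcf = min(best_dcf, dcf)
        if abs(p_miss - p_fa) < best_gap:             # closest to miss == fa
            best_gap = abs(p_miss - p_fa)
            eer = (p_miss + p_fa) / 2.0
    return eer, best_dcf
```

DCF(M) in the tables is this minimum over thresholds; DCF(A) is the actual cost at the submitted decision threshold, which the sweep above does not compute.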
New system for English, submitted for 1conv ("short") training only. Best among all SRI systems for the short2-short3 condition. Combines 8 subsystems that use frames matching 8 constraints:
Unlike previous word- or phone-conditioned cepstral systems:
Modeling:
Post-eval analyses show that across SRE08 conditions:
After the evaluation, we finished 8conv training and testing; this is the best single system.
Future Work:
Pitch and energy signals obtained with get_f0
ASR-independent systems
- Polynomial approximation of pitch and energy profiles over pseudo-syllables + region length (Dehak '07)
- Order-5 polynomial coefficients with mean-variance normalization applied
- Joint factor analysis on gender-dependent 256-mixture GMM models
- Eigenvoice (70 EV on Fisher2 + NIST SRE04 + NIST SRE05 altmic)
- Eigenchannel + diagonal model (50 EC on e04+e05; same for diagonal d)
- All polynomial orders from 0 to 5 used
- One GMM trained for each individual feature, certain subsets, and their combinations
- Transformed vectors are rank-normed; 16 NAP directions subtracted
- Model these features with SVM regression and perform TZ-norm
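The polynomial contour features above can be sketched as follows. This is an illustration only, assuming a unit-normalized time axis and a hand-rolled least-squares fit; the actual feature extraction details (pseudo-syllabification, pitch tracking via get_f0) are not reproduced here.

```python
def polyfit(ts, ys, order):
    """Least-squares polynomial fit via the normal equations
    (Vandermonde system solved with Gaussian elimination + pivoting)."""
    n = order + 1
    a = [[sum(t ** (i + j) for t in ts) for j in range(n)] for i in range(n)]
    b = [sum(y * t ** i for t, y in zip(ts, ys)) for i in range(n)]
    for col in range(n):                      # forward elimination
        piv = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[piv] = a[piv], a[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, n):
            f = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= f * a[col][c]
            b[r] -= f * b[col]
    coef = [0.0] * n
    for r in range(n - 1, -1, -1):            # back substitution
        coef[r] = (b[r] - sum(a[r][c] * coef[c]
                              for c in range(r + 1, n))) / a[r][r]
    return coef  # coef[i] multiplies t**i

def region_features(pitch, order=5):
    """One feature vector per pseudo-syllable region: polynomial
    coefficients of the (pitch or energy) contour plus region length."""
    ts = [i / max(len(pitch) - 1, 1) for i in range(len(pitch))]
    return polyfit(ts, pitch, order) + [float(len(pitch))]
```

Mean-variance normalization of each coefficient dimension, computed across regions, would then be applied before the GMM/SVM modeling stage.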
ASR-dependent system
- SNERFs (syllable NERFs): extracted from all (real) syllables
- GNERFs (grammar-constrained NERFs): extraction location constrained to specific "wordlists"
- Extract features over those regions
- Features reflect characteristics of the pitch, energy, and duration patterns
- Transform features and model them using the same method as the language-independent system (except that 32 NAP directions are used)
- Improvements in the feature transform
- Use of eval04 data
- Addition of polynomial features
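The rank-normalization and NAP steps shared by both prosodic systems can be sketched in a few lines. This toy version assumes the nuisance directions have already been estimated and are orthonormal; estimating them (from within-speaker session variability) is not shown.

```python
def rank_norm(vec, background):
    """Rank-normalize each dimension of `vec` against a background set of
    vectors: replace each value by its empirical rank in (0, 1]."""
    return [
        sum(b[d] <= vec[d] for b in background) / len(background)
        for d in range(len(vec))
    ]

def nap_project(vec, directions):
    """Nuisance attribute projection: remove the component of `vec` along
    each (orthonormal) nuisance direction, v <- v - (v . d) d."""
    out = list(vec)
    for d in directions:
        dot = sum(v * di for v, di in zip(out, d))
        out = [v - dot * di for v, di in zip(out, d)]
    return out
```

Rank-norming makes each feature dimension comparable regardless of its original scale; NAP then removes the 16 (ASR-independent) or 32 (ASR-dependent) directions that carry mostly channel/session variability before SVM modeling.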
Performance of prosodic systems
SRE06 results by condition (DCF / %EER; filled rows = ASR-dependent; systems grouped by approach):

| Systems | Tel–Tel | Tel–Altmic | Altmic–Tel | Altmic–Altmic |
| CEP | .095 / 1.90 | .100 / 2.19 | .149 / 4.05 | .259 / 3.87 |
| SV-MFCC | .089 / 1.84 | .083 / 1.90 | .136 / 3.13 | .193 / 3.20 |
| SV-PLP | .074 / 1.79 | .080 / 2.36 | .111 / 2.67 | .170 / 3.05 |
| MLLR | .108 / 2.38 | .140 / 4.01 | .167 / 4.55 | .204 / 4.84 |
| MLLR-PL | .136 / 2.76 | .199 / 5.84 | .240 / 6.11 | .279 / 6.95 |
| Constrained CEP | .075 / 1.30 | .111 / 2.48 | .150 / 3.31 | .392 / 5.76 |
| POLY-MFCC | .188 / 3.95 | .299 / 6.95 | .327 / 8.87 | .560 / 10.43 |
| POLY-PLP | .183 / 4.06 | .307 / 7.57 | .375 / 9.56 | .652 / 12.02 |
| PROSODIC | .350 / 7.64 | .444 / 10.72 | .547 / 12.41 | .604 / 13.31 |
| POLY-PROSODIC | .650 / 16.47 | .779 / 21.31 | .834 / 23.90 | .744 / 19.33 |
| SV-PROSODIC | .715 / 16.36 | .860 / 23.30 | .880 / 22.62 | .812 / 20.06 |
| STATE-DUR | .633 / 13.98 | .761 / 18.50 | .849 / 22.94 | .932 / 20.95 |
| WORD-DUR | .734 / 17.93 | .828 / 22.64 | .894 / 25.47 | .887 / 26.62 |
| WORD-NG | .803 / 23.35 | .845 / 25.29 | .901 / 26.62 | .845 / 24.68 |
Combination strategy
- Linear logistic regression with auxiliary information (ICASSP'08)
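A bare-bones sketch of linear logistic regression score fusion: learn one weight per subsystem plus a bias by gradient descent on the log-loss. This illustrates only the plain fusion; the auxiliary-information conditioning from the ICASSP'08 paper is not included, and the toy scores below are made up.

```python
import math

def train_fusion(scores, labels, lr=0.5, iters=2000):
    """Linear logistic regression fusion: learn weights w and bias b so
    that sigmoid(b + w . s) predicts target (1) vs impostor (0)."""
    n = len(scores[0])
    w, b = [0.0] * n, 0.0
    for _ in range(iters):
        gw, gb = [0.0] * n, 0.0
        for s, y in zip(scores, labels):
            z = b + sum(wi * si for wi, si in zip(w, s))
            p = 1.0 / (1.0 + math.exp(-z))
            err = p - y  # gradient of the log-loss w.r.t. z
            gb += err
            for i in range(n):
                gw[i] += err * s[i]
        b -= lr * gb / len(labels)
        for i in range(n):
            w[i] -= lr * gw[i] / len(labels)
    return w, b

# Subsystem 1 separates the classes; subsystem 2 is uninformative noise
trials = [[1.0, 0.2], [0.8, -0.3], [1.2, 0.1],
          [-1.0, 0.3], [-0.9, -0.2], [-1.1, 0.0]]
labels = [1, 1, 1, 0, 0, 0]
w, b = train_fusion(trials, labels)
```

The fused score b + w·s stays a well-calibrated log-likelihood ratio, which is why this style of fusion combines nicely with the DCF-based evaluation metric.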
Development data per condition (train × test):

| Train \ Test | Interview (mic) | Phonecall |
| Interview (no DEV data) | 1convmic–1convmic (conditions 1, 2, 3) | 1convmic–1conv4w (condition 4) |
| Phonecall (DEV data) | not evaluated in SRE08 | 1conv4w–1convmic, mic channel (condition 5); 1conv4w–1conv4w, phn channel (conditions 6, 7, 8) |
- Up to 4 times error reduction with 8conv training (*Constrained GMM)
- Ordering of systems is consistent
- 8conv-short3 has very few errors; detailed analysis is difficult
| Systems (filled rows = ASR-dep.) | Short2-short3 (17761) %EER / mDCF | 8conv-short3 (7408) %EER / mDCF |
| Constrained GMM | 2.769 / 0.1342 | 0.658* / 0.0396* |
| CEP GMM | 2.914 / 0.1395 | 1.277 / 0.0565 |
| SV-PLP | 3.419 / 0.1424 | 1.095 / 0.0500 |
| SV-MFCC | 3.683 / 0.1427 | 1.312 / 0.0633 |
| MLLR | 4.154 / 0.1887 | 1.312 / 0.0639 |
| MLLR_PL | 4.154 / 0.1808 | 1.972 / 0.0839 |
| POLY-MFCC | 6.194 / 0.2452 | 2.190 / 0.1024 |
| POLY-PLP | 6.351 / 0.2496 | 2.632 / 0.1060 |
| PROSODIC | 10.016 / 0.4321 | 3.502 / 0.1614 |
| STATE-DUR | 14.820 / 0.6984 | 9.208 / 0.5091 |
| POLY-PROSODIC | 17.180 / 0.6939 | 10.253 / 0.4070 |
| SV-PROSODIC | 17.765 / 0.7532 | 12.282 / 0.5120 |
| WORD-DUR | 19.626 / 0.7793 | 8.113 / 0.3725 |
| WORD-NG | 20.685 / 0.7622 | 7.714 / 0.3992 |
Short2-short3, English telephone. 4BEST = Constrained GMM + SV-PLP + PROS + MLLR (in order of importance).
- Combinations give different relative performance on SRE06 than on SRE08
- Nativeness calibration gives small but consistent improvements
| System / Combination | SRE06 DCF(M) / %EER | SRE08 DCF(M) / DCF(A) / %EER |
| 1BEST (Constrained GMM) | 0.072 / 1.192 | 0.132 / 0.134 / 2.769 |
| 4BEST | 0.048 / 0.921 | 0.101 / 0.104 / 1.954 |
| 4CEP | 0.059 / 1.083 | 0.103 / 0.106 / 2.199 |
| SRI_1 (14) | 0.052 / 0.867 | 0.102 / 0.108 / 2.199 |
| SRI_2 (8) | 0.063 / 1.192 | 0.107 / 0.113 / 2.199 |
| SRI_1 (14), with nativeness calibration | 0.048 / 0.867 | 0.100 / 0.106 / 2.117 |
| Systems | Short2-short3 (35896) %EER / DCF | 8conv-short3 (11849) %EER / DCF |
| CEP GMM | 7.178 / 0.3952 | 3.747 / 0.2490 |
| SV-MFCC | 8.029 / 0.4541 | 4.866 / 0.2997 |
| SV-PLP | 8.209 / 0.4644 | 5.176 / 0.2924 |
| POLY-MFCC | 9.559 / 0.4508 | 4.439 / 0.2461 |
| MLLR_PL | 9.410 / 0.5294 | 6.021 / 0.3767 |
| POLY-PLP | 9.934 / 0.4694 | 4.898 / 0.2475 |
| SV-PROSODIC | 20.545 / 0.8448 | 13.399 / 0.6252 |
| POLY-PROSODIC | 20.799 / 0.8947 | 12.248 / 0.6553 |
- No calibration: surprisingly, trials with English in either train or test
- In the submission, we compensated for language by splitting trials into classes
- Post submission, we compensate trials with 4 classes of train/test language combinations
- Does not affect English-English trials
(Submission)
Short2-short3 – telephone speech. Similar improvements as for the non-English results – better generalization.
| System / Combination | SRE06 DCF(M) / %EER | SRE08 DCF(M) / DCF(A) / %EER |
| 4CEP | 0.140 / 2.821 | 0.408 / 0.547 / 7.095 |
| SRI_1 (Nativeness) | 0.124 / 2.574 | 0.372 / 0.503 / 6.834 |
| SRI_2 | 0.137 / 2.738 | 0.397 / 0.538 / 6.871 |

| System / Combination | SRE06 DCF(M) / %EER | SRE08 DCF(M) / DCF(A) / %EER |
| 4CEP | 0.116 / 2.378 | 0.276 / 0.310 / 5.303 |
| SRI_1 (Nativeness) | 0.110 / 2.015 | 0.274 / 0.317 / 5.302 |
| SRI_2 | 0.113 / 2.185 | 0.279 / 0.309 / 5.228 |
- 12 non-English trials
- Ordering of systems is consistent
- More data reduces errors
- Very few errors in 8conv-short3; detailed analysis is difficult
| Systems (filled rows = ASR-dep.) | Short2-short3 (8442) %EER / DCF | 8conv-short3 (4308) %EER / DCF |
| SV-MFCC | 5.756 / 0.1914 | 2.110 / 0.0733 |
| Constrained GMM | 7.331 / 0.2549 | 4.083 / 0.0926 |
| SV-PLP | 7.345 / 0.2465 | 4.341 / 0.1345 |
| CEP GMM | 7.394 / 0.2422 | 2.612 / 0.1009 |
| MLLR_PL | 9.655 / 0.3494 | 6.315 / 0.2064 |
| MLLR | 9.929 / 0.3204 | 5.267 / 0.1350 |
| POLY-PLP | 12.316 / 0.4525 | 7.362 / 0.2624 |
| POLY-MFCC | 12.330 / 0.4207 | 5.920 / 0.2141 |
| PROSODIC | 13.891 / 0.5305 | 11.036 / 0.3733 |
| WORD-NG | 19.311 / 0.6359 | 12.629 / 0.4310 |
| POLY-PROSODIC | 25.550 / 0.8581 | 18.822 / 0.7278 |
| STATE-DUR | 25.675 / 0.9267 | 19.625 / 0.8002 |
| WORD-DUR | 25.697 / 0.8011 | 18.032 / 0.6750 |
| SV-PROSODIC | 28.287 / 0.8971 | 23.163 / 0.8577 |
Short2-short3 common condition 5: telephone training, altmic test. 4BEST = SV-MFCC + SV-PLP + MLLR + PROSODIC (in order of importance).
- Combinations give different relative performance on SRE06 than on SRE08
- Nativeness calibration gives small but consistent improvement
| System / Combination | SRE06 DCF(M) / %EER | SRE08 DCF(M) / DCF(A) / %EER |
| 1BEST (SV-MFCC) | 0.077 / 1.780 | 0.193 / 0.209 / 5.685 |
| 4BEST | 0.043 / 1.407 | 0.157 / 0.186 / 4.863 |
| 4CEP | 0.047 / 1.407 | 0.153 / 0.197 / 4.795 |
| SRI_1 (14) | 0.044 / 1.117 | 0.151 / 0.177 / 4.863 |
| SRI_2 (8) | 0.045 / 1.117 | 0.161 / 0.200 / 4.863 |
| SRI_1 (14), with nativeness calibration | 0.039 / 0.993 | 0.150 / 0.175 / 4.726 |
- Achieved highly competitive performance with a combination of subsystems
- ASR significantly improved, especially for nonnatives and altmic data
- Single best-performing subsystem: a novel cepstral GMM variant
- Newly developed and/or improved ASR-independent systems
- Performance on interview data relatively good, given that we used suboptimal VAD
- How to best use the sample data remains in question
- Four-system combination gives comparable performance to our full submission (constrained GMM, MLLR, prosody)
- Order of importance of systems is fairly consistent with more training data
- Found nativeness calibration for English speakers more important in SRE08
- Language calibration is critical for good performance
- Up to 3 times error reduction with 8conv training (*constrained cepstral)
- Very few errors in 8conv-short3; detailed analysis is difficult
| Systems (filled rows = ASR-dep.) | Short2-short3 (8489) %EER / DCF | 8conv-short3 (3993) %EER / DCF |
| Constrained GMM | 2.629 / 0.1156 | 1.129* / 0.0545* |
| CEP GMM | 2.629 / 0.1291 | 1.452 / 0.0616 |
| SV-MFCC | 3.453 / 0.1319 | 1.506 / 0.0583 |
| SV-PLP | 3.782 / 0.1453 | 1.559 / 0.0612 |
| MLLR | 4.441 / 0.1762 | 1.882 / 0.0597 |
| MLLR_PL | 4.606 / 0.1989 | 2.635 / 0.0696 |
| POLY-PLP | 5.923 / 0.2695 | 3.025 / 0.1111 |
| POLY-MFCC | 6.113 / 0.2423 | 1.882 / 0.1006 |
| PROSODIC | 10.694 / 0.4532 | 3.401 / 0.1482 |
| STATE-DUR | 16.281 / 0.7074 | 10.191 / 0.5242 |
| SV-PROSODIC | 18.752 / 0.8104 | 15.004 / 0.5923 |
| POLY-PROSODIC | 19.081 / 0.7256 | 10.957 / 0.4739 |
| WORD-DUR | 20.241 / 0.8027 | 8.685 / 0.3797 |
| WORD-NG | 22.205 / 0.7910 | 8.685 / 0.3709 |
Short2-short3 common condition 8 – native English in training and test.
- Although Constrained GMM is the best system on SRE08, the 1BEST selected on SRE06 was SV-PLP
- 4BEST = SV-PLP + Constrained GMM + Prosodic + Poly-PLP
| System / Combination | SRE06 DCF(M) / %EER | SRE08 DCF(M) / DCF(A) / %EER |
| 1BEST (SV-PLP)* | 0.074 / 1.788 | 0.145 / 0.166 / 3.783 |
| 4BEST | 0.050 / 0.975 | 0.095 / 0.104 / 1.809 |
| 4CEP | 0.064 / 1.246 | 0.106 / 0.116 / 2.126 |
| SRI_1 (14) | 0.052 / 0.867 | 0.099 / 0.105 / 1.809 |
| SRI_2 (8) | 0.063 / 1.192 | 0.111 / 0.123 / 2.138 |
| Systems (filled rows = ASR-dep.) | SRE06 alt-alt (132341) %EER / DCF | SRE08 short2-short3 (34181) %EER / DCF |
| SV-PLP | 3.054 / 0.170 | 8.622 / 0.358 |
| SV-MFCC | 3.204 / 0.196 | 6.387 / 0.271 |
| CEP GMM | 3.871 / 0.259 | 8.561 / 0.366 |
| MLLR | 4.839 / 0.204 | 12.929 / 0.446 |
| Constrained GMM | 5.763 / 0.392 | 12.868 / 0.529 |
| MLLR_PL | 6.946 / 0.271 | 12.730 / 0.453 |
| POLY-MFCC | 10.430 / 0.560 | 15.139 / 0.668 |
| POLY-PLP | 12.021 / 0.652 | 18.128 / 0.752 |
| PROSODIC | 13.312 / 0.604 | 21.543 / 0.772 |
| SV-PROSODIC | 20.064 / 0.812 | 25.329 / 0.926 |
| STATE-DUR | 20.946 / 0.932 | 37.461 / 0.999 |
| WORD-DUR | 24.172 / 0.887 | 35.797 / 1.000 |
| WORD-NG | 24.688 / 0.866 | 33.267 / 0.999 |
Short2-short3 – interview train and test. SV-PLP is the best min-DCF system based on SRE06.
- DCF values are well calibrated given the difference in performance
- 4BEST systems: SV-PLP, SV-MFCC, POLY-MFCC, MLLR
| System / Combination (w/o nativeness comp.) | SRE06 DCF(M) / %EER | SRE08 DCF(M) / DCF(A) / %EER |
| 1BEST (SV-PLP) | 0.170 / 3.054 | 0.358 / 0.369 / 8.622 |
| 4BEST | 0.121 / 2.193 | 0.278 / 0.285 / 7.036 |
| 4CEP | 0.153 / 2.495 | 0.278 / 0.279 / 6.542 |
| SRI_1 (13) | 0.099 / 1.871 | 0.254 / 0.271 / 6.482 |
| SRI_2 (8) | 0.113 / 2.129 | 0.264 / 0.275 / 6.516 |
| Systems (filled rows = ASR-dep.) | SRE06 alt-tel (19223) %EER / DCF | SRE08 short2-short3 (10719) %EER / DCF |
| SV-MFCC | 2.667 / 0.111 | 8.359 / 0.294 |
| SV-PLP | 3.126 / 0.136 | 8.461 / 0.294 |
| Constrained GMM | 3.310 / 0.150 | 9.582 / 0.363 |
| CEP GMM | 4.046 / 0.149 | 7.747 / 0.286 |
| MLLR | 4.552 / 0.167 | 11.417 / 0.399 |
| MLLR_PL | 6.115 / 0.240 | 13.761 / 0.445 |
| POLY-MFCC | 8.874 / 0.327 | 14.067 / 0.540 |
| POLY-PLP | 9.563 / 0.375 | 16.106 / 0.611 |
| PROSODIC | 12.414 / 0.547 | 21.407 / 0.806 |
| SV-PROSODIC | 22.621 / 0.880 | 29.154 / 0.972 |
| STATE-DUR | 22.942 / 0.849 | 30.479 / 1.001 |
| WORD-DUR | 25.471 / 0.894 | 31.702 / 0.951 |
| WORD-NG | 26.621 / 0.901 | 33.945 / 0.967 |
| System / Combination (w/o nativeness comp.) | SRE06 DCF(M) / %EER | SRE08 DCF(M) / DCF(A) / %EER |
| 1BEST (SV-MFCC) | 0.111 / 2.667 | 0.286 / 0.321 / 8.359 |
| 4BEST | 0.066 / 1.563 | 0.215 / 0.297 / 5.505 |
| 4CEP | 0.079 / 1.839 | 0.216 / 0.263 / 5.301 |
| SRI_1 (13) | 0.057 / 1.241 | 0.194 / 0.269 / 4.791 |
| SRI_2 (8) | 0.075 / 1.885 | 0.221 / 0.271 / 5.097 |
Short2-short3 – "Non-English telephone" subset. Overall, about 30% improvement with correct language calibration.
Without language calibration:

| System / Combination | SRE06 DCF(M) / %EER | SRE08 DCF(M) / DCF(A) / %EER |
| 1BEST (SV-PLP) | 0.247 / 5.254 | 0.639 / 1.121 / 13.034 |
| 4CEP | 0.209 / 4.294 | 0.596 / 0.998 / 11.655 |
| SRI_1, SRI_2 | 0.199 / 4.124 | 0.564 / 0.888 / 11.103 |

With language calibration:

| System / Combination | SRE06 DCF(M) / %EER | SRE08 DCF(M) / DCF(A) / %EER |
| 1BEST (SV-PLP) | 0.201 / 4.294 | 0.495 / 0.618 / 10.069 |
| 4CEP | 0.166 / 3.277 | 0.417 / 0.503 / 8.207 |
| SRI_1, SRI_2 | 0.160 / 3.051 | 0.420 / 0.471 / 8.000 |