Voice Conversion and Anti-spoofing of Speaker Verification
Haizhou Li
Acknowledgement:
Zhizheng Wu, Tomi Kinnunen, Nicholas Evans, Junichi Yamagishi, Xiaohai Tian
Agenda:
- Spoofing Attacks
- Voice Conversion
- Artifacts
- ASVspoof
Zhizheng Wu, Nicholas Evans, Tomi Kinnunen, Junichi Yamagishi, Federico Alegre, Haizhou Li, "Spoofing and countermeasures for speaker verification: a survey", Speech Communication, vol. 66, pp. 130–153, 2015.
Information and Engineering Systems. Springer, 2005, pp. 907–907.
Idiap-RR-61-2005), 2005.
verification systems against voice mimicry,” in Interspeech 2013
Zhizheng Wu, Sheng Gao, Eng Siong Chng, Haizhou Li, "A study on replay attack and anti-spoofing for text-dependent speaker verification", Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC) 2014.
[Figure: two spectrograms, frequency 0–8000 Hz vs. time 0–3.0 s]
2003, pp. 7–13.
Zhizheng Wu, Sheng Gao, Eng Siong Chng, Haizhou Li, "A study on replay attack and anti-spoofing for text-dependent speaker verification", Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC) 2014.
Zhizheng Wu, Nicholas Evans, Tomi Kinnunen, Junichi Yamagishi, Federico Alegre, Haizhou Li, "Spoofing and countermeasures for speaker verification: a survey", Speech Communication, vol. 66, pp. 130–153, 2015.
Tomi Kinnunen and Haizhou Li, "An overview of text-independent speaker recognition: from features to supervectors", Speech Communication, vol. 52, no. 1, pp. 12–40, January 2010.
Voice conversion pipeline: Analysis → Feature conversion → Synthesis
Analysis–synthesis (copy-synthesis) pipeline: Analysis → Synthesis
Feature     EER (%)
MFCC        10.98
MGDCC        1.25
MGDCC+PM     0.89
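The EER figures above are the operating point at which the false acceptance rate equals the false rejection rate. A minimal sketch of how an EER can be computed from detector scores, using synthetic Gaussian scores purely for illustration:

```python
import numpy as np

def compute_eer(genuine_scores, spoof_scores):
    """Equal error rate: the operating point where the false
    acceptance rate (spoof accepted) equals the false rejection
    rate (genuine rejected), found by sweeping the threshold
    over the sorted scores."""
    scores = np.concatenate([genuine_scores, spoof_scores])
    labels = np.concatenate([np.ones(len(genuine_scores)),
                             np.zeros(len(spoof_scores))])
    labels = labels[np.argsort(scores)]
    n_gen = labels.sum()
    n_spf = len(labels) - n_gen
    # After rejecting everything up to the i-th sorted score:
    frr = np.cumsum(labels) / n_gen             # genuine wrongly rejected
    far = 1.0 - np.cumsum(1 - labels) / n_spf   # spoof still accepted
    i = np.argmin(np.abs(far - frr))
    return (far[i] + frr[i]) / 2.0

# Illustrative scores only: two well-separated Gaussians.
rng = np.random.default_rng(0)
genuine = rng.normal(2.0, 1.0, 1000)
spoof = rng.normal(-2.0, 1.0, 1000)
eer = compute_eer(genuine, spoof)
print(f"EER = {100 * eer:.2f}%")
```

Perfectly separated score distributions give 0% EER; fully overlapping ones give 50%, so lower is better for a countermeasure.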
Voice conversion pipeline: Analysis → Feature conversion → Synthesis
Masanobu Abe, Satoshi Nakamura, Kiyohiro Shikano, Hisao Kuwabara, "Voice conversion through vector quantization", ICASSP 1988.
Transactions on Audio, Speech, and Language Processing, 18, no. 5 (2010): 922-931.
frequency warping based voice conversion", ICASSP 2015
conversion system based on frame selection." ICASSP 2007.
voice conversion utilizing temporal information", Interspeech 2013
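Several of the conversion functions cited here map source spectral features to target features with a joint-density GMM. A minimal sketch of the conditional-expectation (minimum mean-square error) mapping; all parameters are hand-set toy values standing in for ones that would normally be EM-trained on parallel, time-aligned recordings:

```python
import numpy as np

# Toy 2-component joint GMM over stacked [source; target] feature
# pairs (dimension d each). Made-up parameters for illustration only.
d = 2
weights  = np.array([0.5, 0.5])
mu_x     = np.array([[0.0, 0.0], [3.0, 3.0]])    # source means
mu_y     = np.array([[1.0, -1.0], [4.0, 2.0]])   # target means
Sigma_xx = np.stack([np.eye(d), np.eye(d)])      # source covariances
Sigma_yx = np.stack([0.5 * np.eye(d), 0.8 * np.eye(d)])  # cross-covariances

def gauss(x, mu, Sigma):
    """Multivariate normal density."""
    diff = x - mu
    return np.exp(-0.5 * diff @ np.linalg.inv(Sigma) @ diff) / \
        np.sqrt((2 * np.pi) ** len(x) * np.linalg.det(Sigma))

def convert(x):
    """MMSE mapping:
    y_hat = sum_m P(m|x) (mu_y[m] + Sigma_yx[m] Sigma_xx[m]^-1 (x - mu_x[m]))"""
    resp = np.array([w * gauss(x, m, S)
                     for w, m, S in zip(weights, mu_x, Sigma_xx)])
    resp /= resp.sum()
    y = np.zeros(d)
    for m in range(len(weights)):
        y += resp[m] * (mu_y[m]
                        + Sigma_yx[m] @ np.linalg.inv(Sigma_xx[m]) @ (x - mu_x[m]))
    return y

# A source frame near the first component's source mean is pulled
# toward that component's target mean.
print(convert(np.array([0.0, 0.0])))
```

The soft mixture weighting is what distinguishes this from the hard VQ codebook mapping of Abe et al.; it is also the source of the over-smoothing artifacts discussed later.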
Hideki Kawahara, STRAIGHT, exploitation of the other aspect of VOCODER: Perceptually isomorphic decomposition of speech sounds, Acoust. Sci. & Tech. 27, 6 (2006)
Keiichi Tokuda, Yoshihiko Nankaku, Tomoki Toda, Heiga Zen, Junichi Yamagishi, and Keiichiro Oura “Speech Synthesis Based on Hidden Markov Models” Proceedings of The IEEE, 2013
speech synthesis,” in ICASSP 1998.
Tian Xiaohai
Shiozaki, "Discrimination method of synthetic speech using pitch frequency against synthetic speech falsification", IEICE Trans. Fundamentals, vol. E88-A, no. 1, January 2005.
“Synthetic speech discrimination using pitch pattern statistics derived from image analysis,” in Proc. Interspeech, 2012.
“Performance of ivector speaker verification and the detection of synthetic speech,” in Proc. IEEE
Processing (ICASSP), 2014.
Imposture Using an HMM-based Speech Synthesis System”, EUROSPEECH 2001
recognition,” Acoustics, Speech and Signal Processing, IEEE Transactions on, vol. 36, no. 6, pp. 871–879, Jun 1988
INTERSPEECH-2015, 2087-2091.
Chng, Haizhou Li, "Synthetic speech detection using temporal modulation feature", ICASSP 2013.
Hermansky, “Robust feature extraction using modulation filtering of autoregressive models,” IEEE/ACM T- ASLP vol. 22, no. 8, pp. 1285–1295, Aug. 2014.
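The temporal modulation features cited above analyse how subband energies evolve over time; vocoded speech tends to have over-smooth trajectories, which shows up as missing energy at higher modulation frequencies. A toy sketch with illustrative frame and band settings, not the cited papers' exact configuration:

```python
import numpy as np

def modulation_spectrum(signal, sr, frame_len=0.025, hop=0.010, n_bands=8):
    """Toy temporal modulation feature: short-time power in a few
    linearly spaced subbands, then the magnitude spectrum of each
    band's (mean-removed) log-energy trajectory across frames."""
    n, h = int(frame_len * sr), int(hop * sr)
    frames = np.array([signal[i:i + n] for i in range(0, len(signal) - n, h)])
    spec = np.abs(np.fft.rfft(frames * np.hanning(n), axis=1)) ** 2
    edges = np.linspace(0, spec.shape[1], n_bands + 1, dtype=int)
    band_energy = np.stack([spec[:, a:b].mean(axis=1)
                            for a, b in zip(edges[:-1], edges[1:])], axis=1)
    traj = np.log(band_energy + 1e-10)   # (n_frames, n_bands)
    traj -= traj.mean(axis=0)            # remove per-band DC
    return np.abs(np.fft.rfft(traj, axis=0))  # rows index modulation frequency

# A 500 Hz tone amplitude-modulated at 4 Hz: the lowest subband's
# trajectory should peak near 4 Hz in the modulation spectrum
# (bin spacing ~1 Hz at the 100 Hz frame rate used here).
sr = 16000
t = np.arange(sr) / sr
x = (1 + 0.8 * np.sin(2 * np.pi * 4 * t)) * np.sin(2 * np.pi * 500 * t)
M = modulation_spectrum(x, sr)
```

Natural speech concentrates modulation energy around 2–8 Hz (the syllable rate), which is why statistics of this representation can separate it from over-smoothed synthetic trajectories.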
Oppenheim, Schafer & Buck, Discrete-Time Signal Processing, 2nd Edition, Prentice Hall.
spectrum in speech processing: A review and some experimental results,” Digital Signal Processing, 2007.
Murthy, "Significance of group delay functions in spectrum estimation", IEEE Transactions on Signal Processing, 1992.
Leigh D. Alsteris and Kuldip K. Paliwal, "Evaluation of the modified group delay feature for isolated word recognition", International Symposium on Signal Processing and Its Applications, 2005.
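The group delay function behind MGDCC can be computed without explicit phase unwrapping via a standard DFT identity (Oppenheim & Schafer). This sketch shows plain group delay; the modified variant additionally replaces the denominator with a cepstrally smoothed spectrum and applies compression exponents, which are omitted here:

```python
import numpy as np

def group_delay(frame, nfft=512):
    """Group delay tau(w) = -d arg(X(w))/dw, computed without phase
    unwrapping as (Xr*Yr + Xi*Yi)/|X|^2, where X = DFT(x[n]) and
    Y = DFT(n * x[n])."""
    n = np.arange(len(frame))
    X = np.fft.rfft(frame, nfft)
    Y = np.fft.rfft(n * frame, nfft)
    power = np.maximum(np.abs(X) ** 2, 1e-12)  # guard spectral nulls
    return (X.real * Y.real + X.imag * Y.imag) / power

# Sanity check: a pure delay x[n] = delta[n - 5] has phase -5w,
# so its group delay is 5 at every frequency.
frame = np.zeros(64)
frame[5] = 1.0
tau = group_delay(frame)
```

The |X|^2 denominator is what makes raw group delay spiky near spectral zeros; the cepstral smoothing in the modified version exists precisely to tame those spikes before cepstral coefficients are taken.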
Xiong Xiao, Xiaohai Tian, Steven Du, Haihua Xu, Eng Siong Chng, Haizhou Li, "Spoofing Speech Detection Using High Dimensional Magnitude and Phase Features: the NTU Approach for ASVspoof 2015 Challenge", Interspeech 2015
Organisers: Zhizheng Wu (University of Edinburgh, UK), Tomi Kinnunen (University of Eastern Finland, Finland), Nicholas Evans (EURECOM, France), Junichi Yamagishi (University of Edinburgh, UK)
Number of utterances per subset, with the spoofing algorithm and vocoder used:

Attack   Training             Development          Evaluation           Algorithm            Vocoder
         (10 male/15 female)  (15 male/20 female)  (20 male/26 female)
Genuine  3750                 3497                 9404                 None                 None
S1       2525                 9975                 18400                VC: frame selection  STRAIGHT
S2       2525                 9975                 18400                VC: slope shifting   STRAIGHT
S3       2525                 9975                 18400                SS: HMM              STRAIGHT
S4       2525                 9975                 18400                SS: HMM              STRAIGHT
S5       2525                 9975                 18400                VC: GMM              MLSA
S6       -                    -                    18400                VC: GMM              STRAIGHT
S7       -                    -                    18400                VC: GMM              STRAIGHT
S8       -                    -                    18400                VC: Tensor           STRAIGHT
S9       -                    -                    18400                VC: KPLS             STRAIGHT
S10      -                    -                    18400                SS: unit selection   None
Zhizheng Wu, Tomi Kinnunen, Nicholas Evans and Junichi Yamagishi, "ASVspoof 2015: the first automatic speaker verification spoofing and countermeasures challenge", IEEE Signal Processing Society Speech and Language Technical Committee Newsletter (SLTC Newsletter), 20 November 2015
On the unknown attack S10, detection error rates were roughly four times higher than on the known attacks.
Average EER (%) per team:

Team   Average (all)   Average (without S10)   S10
A       1.211           0.402                   8.490
B       1.965           0.008                  19.571
C       2.528           0.076                  24.601
D       2.617           0.003                  26.142
E       2.694           0.060                  26.393
F       3.218           0.400                  28.581
G       3.326           0.360                  30.021
H       3.726           0.021                  37.068
I       3.898           0.703                  32.651
J       4.097           0.029                  40.708
K       4.547           0.203                  43.638
L       6.719           3.478                  35.890
M      14.391          12.482                  31.574
N      14.568          11.299                  43.991
O      18.826          16.304                  41.519
P      21.518          18.786                  46.102
conditions”, IEEE Transactions on Audio, Speech and Language Processing, vol 19, no 6, pp. 1791-1801, 2011
Constant Q Cepstral Coefficients”, Odyssey 2016
A comparison of the time-frequency resolution of the STFT (top) and CQT (bottom).*
Constant Q Cepstral Coefficients”, Odyssey 2016
temporal response patterns recorded from auditory cortex,” Proc. Natl. Acad. Sci., vol. 98, no. 23, pp. 13 367–13 372, 2001
CQCC extraction pipeline: speech signal → Constant-Q Transform → power spectrum → log → uniform resampling → DCT → CQCC
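The pipeline above can be sketched end to end. This is a deliberately naive illustration (direct-summation CQT over a single segment, interpolation-based resampling, explicit DCT matrix), not the reference CQCC implementation; all settings here are made up for the example:

```python
import numpy as np

def naive_cqt(x, sr, fmin=50.0, bins_per_octave=12, n_bins=72):
    """Direct (slow) constant-Q magnitude spectrum of one segment:
    geometrically spaced centre frequencies, window length scaled so
    every filter spans the same number of cycles (constant Q)."""
    Q = 1.0 / (2 ** (1.0 / bins_per_octave) - 1.0)
    freqs = fmin * 2 ** (np.arange(n_bins) / bins_per_octave)
    mag = np.zeros(n_bins)
    for k, fk in enumerate(freqs):
        N = min(int(round(Q * sr / fk)), len(x))
        seg = x[:N] * np.hanning(N)
        kernel = np.exp(-2j * np.pi * fk * np.arange(N) / sr)
        mag[k] = np.abs(seg @ kernel) / N
    return mag, freqs

def cqcc(x, sr, n_coeff=20, n_linear=128):
    """CQT -> power spectrum -> log -> uniform resampling -> DCT."""
    mag, freqs = naive_cqt(x, sr)
    log_pow = np.log(mag ** 2 + 1e-10)
    # resample the geometrically spaced log spectrum to a uniform grid
    grid = np.linspace(freqs[0], freqs[-1], n_linear)
    uniform = np.interp(grid, freqs, log_pow)
    # type-II DCT written out explicitly to avoid extra dependencies
    m = np.arange(n_linear)
    basis = np.cos(np.pi * np.outer(np.arange(n_coeff), 2 * m + 1)
                   / (2 * n_linear))
    return basis @ uniform

sr = 16000
x = np.sin(2 * np.pi * 440 * np.arange(4096) / sr)  # a 440 Hz test tone
feat = cqcc(x, sr)
```

The uniform resampling step is what lets a conventional DCT be applied to the geometrically spaced CQT bins, turning the log power spectrum into cepstral coefficients.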
*K.K.Paliwal, et al,“Comparative Evaluation of Speech Enhancement Methods for Robust Automatic Speech Recognition,”
** Duc Hoang Ha Nguyen, Xiong Xiao, Eng Siong Chng, Haizhou Li, “Feature Adaptation Using Linear Spectro-Temporal Transform for Robust Speech Recognition”, IEEE/ACM Trans. Audio, Speech & Language Processing 24(6): 1006-1019 (2016)