SLIDE 1

Voice Conversion and Anti-spoofing of Speaker Verification

Haizhou Li

Acknowledgement:

Zhizheng Wu, Tomi Kinnunen, Nicholas Evans, Junichi Yamagishi, Xiaohai Tian

1

SLIDE 2

Agenda

  • Spoofing Attacks
  • Voice Conversion
  • Artifacts
  • ASVspoof 2015

2

SLIDE 3

Agenda

  • Spoofing Attacks
  • Voice Conversion
  • Artifacts
  • ASVspoof 2015

3

SLIDE 4

Speaker Verification

(Diagram: a claimant says "This is John!"; the verification system answers "Yes, John!" or "Reject!")

4

SLIDE 5

Speaker Verification

(Diagram: the same verification loop under spoofing attacks: impersonation, replay, speech synthesis, and voice conversion all target the claim "This is John!")

5

SLIDE 6

Spoofing attack  | Accessibility  | Effectiveness (risk)              | Countermeasure availability
                 |                | Text-independent | Text-dependent |
Impersonation    | Low            | Low/unknown      | Low/unknown    | N.A.
Replay           | High           | Low              | Low to high    | Low
Speech synthesis | Medium to high | High             | High           | Medium
Voice conversion | Medium to high | High             | High           | Medium

  • Z. Wu, N. Evans, T. Kinnunen, J. Yamagishi, F. Alegre, and H. Li, “Spoofing and countermeasures for speaker verification: a survey,” Speech Communication, vol. 66, pp. 130–153, 2015.

Spoofing Attacks

6

SLIDE 7

7

(Spoofing-attack comparison table repeated from Slide 6.)

  • Y. Lau, D. Tran, and M. Wagner, “Testing voice mimicry with the YOHO speaker verification corpus,” in Knowledge-Based Intelligent Information and Engineering Systems, Springer, 2005, pp. 907–907.
  • J. Mariethoz and S. Bengio, “Can a professional imitator fool a GMM-based speaker verification system?” IDIAP Research Report (No. Idiap-RR-61-2005), 2005.
  • R. G. Hautamaki, T. Kinnunen, V. Hautamaki, T. Leino, and A.-M. Laukkanen, “I-vectors meet imitators: on vulnerability of speaker verification systems against voice mimicry,” in Interspeech 2013

Impersonation

SLIDE 8

(Spoofing-attack comparison table repeated from Slide 6.)

Zhizheng Wu, Sheng Gao, Eng Siong Chng, Haizhou Li, "A study on replay attack and anti-spoofing for text-dependent speaker verification", Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC) 2014.

Replay

8

SLIDE 9

9

Traits of Replay

  • J. Villalba and E. Lleida, “Preventing replay attack on speaker verification systems,” IEEE ICCST 2011
  • L. Cuccovillo, P. Aichroth, “Open-set microphone classification via blind channel analysis,” ICASSP 2016
SLIDE 10

(Spectrograms, 0–8000 Hz over roughly 3 seconds: genuine speech vs. replay speech.)

10

  • 1. A. Wang, “An industrial strength audio search algorithm,” in Proc. Int. Symposium on Music Information Retrieval (ISMIR), 2003, pp. 7–13.
  • 2. Zhizheng Wu, Sheng Gao, Eng Siong Chng, Haizhou Li, "A study on replay attack and anti-spoofing for text-dependent speaker verification", Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC) 2014.

Audio Fingerprinting
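The landmark idea behind audio fingerprinting [1] can be sketched in a few lines: keep only the strongest spectral peaks per frame and hash pairs of nearby peaks. This is a toy illustration; the frame length, peak count, and fan-out below are invented parameters, not Wang's production settings:

```python
import numpy as np

def fingerprint(signal, frame_len=256, hop=128, n_peaks=3, fan_out=5):
    """Toy landmark fingerprint: keep the strongest spectral peaks per
    frame, then hash (f1, f2, time-delta) for pairs of nearby peaks."""
    frames = np.array([signal[i:i + frame_len] * np.hanning(frame_len)
                       for i in range(0, len(signal) - frame_len, hop)])
    mags = np.abs(np.fft.rfft(frames, axis=1))
    # per-frame peak picking: indices of the n_peaks largest bins
    peaks = [(t, int(f)) for t, row in enumerate(mags)
             for f in np.argsort(row)[-n_peaks:]]
    # pair each peak with peaks up to fan_out frames ahead
    hashes = set()
    for i, (t1, f1) in enumerate(peaks):
        for t2, f2 in peaks[i + 1:]:
            if 0 < t2 - t1 <= fan_out:
                hashes.add((f1, f2, t2 - t1))
    return hashes

# the hash set depends only on peak positions, so a level change
# (e.g. a replayed copy at lower volume) leaves it untouched
sr = 8000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 440 * t) + 0.5 * np.sin(2 * np.pi * 1200 * t)
```

Matching a test utterance's hash set against the enrolment recordings flags verbatim replays, which is the countermeasure direction studied in [2].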

SLIDE 11

(Spoofing-attack comparison table repeated from Slide 6.)

11

  • Z. Wu, N. Evans, T. Kinnunen, J. Yamagishi, F. Alegre, and H. Li, “Spoofing and countermeasures for speaker verification: a survey,” Speech Communication, vol. 66, pp. 130–153, 2015.

Spoofing Attacks

SLIDE 12

Tomi Kinnunen and Haizhou Li, “An Overview of Text-Independent Speaker Recognition: from Features to Supervectors”, Speech Communication 52(1): 12--40, January 2010

Speaker Verification: Robust Features

12

  • Modeling the human voice production system
  • Modeling the peripheral auditory system

SLIDE 13

More Robust = More Vulnerable

(Diagram: a synthetic speech detector screens the claim "This is John!" before speaker verification; detected synthetic input is rejected, genuine input proceeds to a "Yes, John!" decision.)

13

SLIDE 14

Agenda

14

  • Spoofing Attacks
  • Voice Conversion
  • Artifacts
  • ASVspoof 2015
SLIDE 15

15

Voice Conversion: Vocoder

Source Speaker → Analysis → Feature conversion → Synthesis → Target Speaker

SLIDE 16

16

Vocoder: Analysis - Synthesis

Source Speaker → Analysis → Feature conversion → Synthesis → Target Speaker

Analysis–synthesis path (no conversion): source → Analysis → Synthesis → target

SLIDE 17

  • Sinusoidal vocoders
    – Harmonic plus noise model (HNM) vocoder
    – Harmonic and stochastic vocoder
    – Adaptive harmonic vocoder
  • Source-filter model
    – Linear predictive vocoder
    – Mel-generalised cepstral vocoder
    – STRAIGHT
    – Glottal vocoder

Vocoder

17

SLIDE 18

18

Vocoder: Copy Synthesis

Source → Analysis → Synthesis → Target

  • Z. Wu, X. Xiao, E.S. Chng, H. Li, “Synthetic Speech Detection Using Temporal Modulation Feature”, ICASSP 2013

Feature  | EER (%)
MFCC     | 10.98
MGDCC    | 1.25
MGDCC+PM | 0.89

SLIDE 19

19

Voice Conversion: Feature Conversion

Source Speaker → Analysis → Feature conversion → Synthesis → Target Speaker

SLIDE 20

20

Speaker A Speaker B

  • Z. Wu, Spectral Mapping for Voice Conversion, Ph.D Thesis, Nanyang Technological University, 2015

Differences between Speakers

SLIDE 21

21

Training Conversion

Basics of Voice Conversion

SLIDE 22

22

  • Z. Wu et al, Tutorial Notes, APSIPA ASC 2015

Chronological Map of Voice Conversion

SLIDE 23

23

  • Z. Wu, Spectral mapping for voice conversion, Ph.D Thesis, Nanyang Technological University, 2015
  • Masanobu Abe, Satoshi Nakamura, Kiyohiro Shikano, and Hisao Kuwabara, "Voice conversion through vector quantization," ICASSP 1988

Voice Conversion: Codebook Mapping

source target
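Codebook mapping can be sketched with plain k-means: quantize source frames into a codebook, pair each source code vector with the mean of its aligned target frames, and convert by table lookup. A toy 2-D sketch; the data, codebook size, and linear source-target relation are all invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
# toy "parallel data": target frames are a scaled, shifted copy of source
src = rng.normal(size=(200, 2))
tgt = src * 1.5 + 2.0 + 0.05 * rng.normal(size=(200, 2))

# train a source codebook with plain k-means
K = 8
codes = src[rng.choice(len(src), K, replace=False)]
for _ in range(20):
    assign = np.argmin(((src[:, None] - codes[None]) ** 2).sum(-1), axis=1)
    codes = np.array([src[assign == k].mean(0) if np.any(assign == k)
                      else codes[k] for k in range(K)])
# pair each source code vector with the mean of its aligned target frames
tgt_codes = np.array([tgt[assign == k].mean(0) if np.any(assign == k)
                      else tgt.mean(0) for k in range(K)])

def convert(frame):
    """Conversion = nearest source code vector -> its paired target code."""
    k = np.argmin(((codes - frame) ** 2).sum(-1))
    return tgt_codes[k]
```

The hard one-of-K mapping is what produces audible discontinuities between frames, a limitation that later soft (GMM-based) mappings address.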

SLIDE 24

24

  • Alexander Kain, and Michael W. Macon. "Spectral voice conversion for text-to-speech synthesis." ICASSP 1998

Voice Conversion: Joint Density GMM

source target
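The joint-density GMM fits a mixture to stacked source-target vectors z = [x; y] and converts each frame with the conditional mean E[y|x]. A minimal sketch on synthetic 2-D data; scikit-learn's GaussianMixture stands in for the paper's EM training, and the toy linear source-target relation is invented:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# toy parallel features: target is a noisy linear map of the source
X = rng.normal(size=(500, 2))
Y = X @ np.array([[1.2, 0.3], [-0.1, 0.8]]) + 1.0 + 0.05 * rng.normal(size=(500, 2))

# fit a GMM on the joint vectors z = [x, y]
gmm = GaussianMixture(n_components=4, covariance_type='full',
                      random_state=0).fit(np.hstack([X, Y]))
d = X.shape[1]

def convert(x):
    """MMSE mapping: E[y|x] = sum_k p(k|x) (mu_y_k + Syx_k Sxx_k^-1 (x - mu_x_k))."""
    mu_x, mu_y = gmm.means_[:, :d], gmm.means_[:, d:]
    Sxx = gmm.covariances_[:, :d, :d]
    Syx = gmm.covariances_[:, d:, :d]
    # component responsibilities given x alone (marginal GMM over x)
    logp = np.array([np.log(gmm.weights_[k])
                     - 0.5 * np.log(np.linalg.det(Sxx[k]))
                     - 0.5 * (x - mu_x[k]) @ np.linalg.solve(Sxx[k], x - mu_x[k])
                     for k in range(gmm.n_components)])
    w = np.exp(logp - logp.max())
    w /= w.sum()
    return sum(w[k] * (mu_y[k] + Syx[k] @ np.linalg.solve(Sxx[k], x - mu_x[k]))
               for k in range(gmm.n_components))
```

Because every output frame is a weighted average of per-component regressions, the converted trajectory is over-smoothed, one of the artifacts revisited later in the deck.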

SLIDE 25

Source spectrum Target spectrum

25

  • Daniel Erro, Asunción Moreno, and Antonio Bonafonte, "Voice conversion based on weighted frequency warping," IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, no. 5, pp. 922–931, 2010.
  • Xiaohai Tian, Zhizheng Wu, Siu Wa Lee, Nguyen Quy Hy, Eng Siong Chng, Minghui Dong, "Sparse representation for frequency warping based voice conversion", ICASSP 2015

Voice Conversion: Frequency Warping

Partially reuses the source speaker's spectrum information

SLIDE 26

26

  • Thierry Dutoit, Andre Holzapfel, Matthieu Jottrand, Alexis Moinet, J. M. Perez, and Yannis Stylianou, "Towards a voice conversion system based on frame selection," ICASSP 2007.
  • Zhizheng Wu, Tuomas Virtanen, Tomi Kinnunen, Eng Siong Chng, Haizhou Li, "Exemplar-based unit selection for voice conversion utilizing temporal information", Interspeech 2013

Voice Conversion: Frame/Unit Selection

SLIDE 27

(Diagram: lattices of candidate units for the phone sequence "# dh ax c ae t s ae t", with several candidate segments per phone.)

Unit Selection Synthesis

  • Source symbol → target segment cost: suitability of a unit for the target
  • Target segment → target segment cost: acoustic continuity of two adjacent units
  • Z. Wu et al, Tutorial Notes, APSIPA ASC 2015
SLIDE 28

28

Subjective Analysis Objective Analysis “Spoofing Analysis”

  • 1. Spectral distortion
  • 2. Temporal (magnitude/phase) discontinuity
  • 3. Spectro-temporal artifacts
  • 4. Pitch pattern
  • 5. ASVspoof 2015 ?

Evaluation of Synthetic Voice

SLIDE 29

Agenda

29

  • Spoofing Attacks
  • Voice Conversion
  • Artifacts
  • ASVspoof 2015
SLIDE 30

Magnitude

  • Short-time Fourier transform (STFT)
  • Smoothing effect (local vs global optimization)
  • Temporal magnitude discontinuity

Phase

  • Minimum phase vocoding
  • Phase distortion
  • Temporal phase discontinuity

30

Artifacts

… that are common to synthetic speech … that are different from natural speech

SLIDE 31
  • Time-Frequency resolution
  • Spectral leakage
  • Windowing tradeoffs

31

Magnitude: STFT
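The time-frequency tradeoff is easy to demonstrate: with bin spacing sr/N, two tones 40 Hz apart merge under a short analysis window and separate under a long one. A small sketch; the tone frequencies and window lengths are arbitrary choices:

```python
import numpy as np

sr = 8000
t = np.arange(sr) / sr
# two tones only 40 Hz apart
x = np.sin(2 * np.pi * 1000 * t) + np.sin(2 * np.pi * 1040 * t)

def valley_depth(win_len):
    """Spectral magnitude midway between the tones, relative to the peak:
    near 0 when the window resolves them (sr/win_len well below 40 Hz),
    large when they merge into a single lobe."""
    frame = x[:win_len] * np.hanning(win_len)
    mag = np.abs(np.fft.rfft(frame))
    mid = int(round(1020 * win_len / sr))   # bin nearest the 1020 Hz midpoint
    return mag[mid] / mag.max()
```

Vocoders inherit this tradeoff: any fixed analysis window discards some spectro-temporal detail, and that loss is one source of synthesis artifacts.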

SLIDE 32

32

Hideki Kawahara, STRAIGHT, exploitation of the other aspect of VOCODER: Perceptually isomorphic decomposition of speech sounds, Acoust. Sci. & Tech. 27, 6 (2006)

Magnitude: Smoothing in Vocoder

SLIDE 33

33

Keiichi Tokuda, Yoshihiko Nankaku, Tomoki Toda, Heiga Zen, Junichi Yamagishi, and Keiichiro Oura “Speech Synthesis Based on Hidden Markov Models” Proceedings of The IEEE, 2013

Magnitude: Smoothing in Synthesized/Converted Speech

  • A. Kain and M. W. Macon, “Spectral voice conversion for text-to-speech synthesis,” in ICASSP 1998.

SLIDE 34

34

Tian Xiaohai

(Log-magnitude spectrograms: natural speech, copy-synthesis speech, and their absolute difference.)

  • X. Tian, Z. Wu, X. Xiao, E. S. Chng, H. Li, "Spoofing detection from a feature representation perspective", ICASSP 2016

Magnitude: Log Magnitude Spectrum

SLIDE 35

35

Magnitude: Pitch patterns in HMM-based synthesized speech

  • Akio Ogihara, Hitoshi Unno, and Akira Shiozaki, "Discrimination Method of Synthetic Speech Using Pitch Frequency against Synthetic Speech Falsification," IEICE Trans. Fundamentals, vol. E88-A, no. 1, January 2005
  • P. L. De Leon, B. Stewart, and J. Yamagishi, “Synthetic speech discrimination using pitch pattern statistics derived from image analysis,” in Proc. Interspeech, 2012.
  • R. D. McClanahan, B. Stewart, and P. L. De Leon, “Performance of i-vector speaker verification and the detection of synthetic speech,” in Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP), 2014.

SLIDE 36
  • Inter-Frame Difference of Log-Likelihood (IFDLL)

ΔLL(t) = | log p(o_t | λ) − log p(o_(t−1) | λ) |

  • Δ-Cepstrum and Δ2-Cepstrum

Magnitude: Difference of Log-Likelihood

  • Takayuki Satoh, Takashi Masuko, Takao Kobayashi, Keiichi Tokuda, “A Robust Speaker Verification System against Imposture Using an HMM-based Speech Synthesis System”, EUROSPEECH 2001
  • F. Soong and A. Rosenberg, “On the use of instantaneous and transitional spectral information in speaker recognition,” IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 36, no. 6, pp. 871–879, Jun 1988
  • Md. Sahidullah, Tomi Kinnunen, Cemal Hanilçi, "A comparison of features for synthetic speech detection", Interspeech 2015, pp. 2087–2091.

36

SLIDE 37

37

  • Zhizheng Wu, Xiong Xiao, Eng Siong Chng, Haizhou Li, "Synthetic speech detection using temporal modulation feature", ICASSP 2013.
  • S. Ganapathy, S. H. Mallidi, and H. Hermansky, “Robust feature extraction using modulation filtering of autoregressive models,” IEEE/ACM T-ASLP, vol. 22, no. 8, pp. 1285–1295, Aug. 2014.

Magnitude: Temporal Modulation Feature
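The temporal modulation idea: take a spectrogram, then analyze each frequency band's trajectory over time with a second FFT; the smoothing in synthetic speech suppresses fast modulations. This is an illustrative sketch only, not the exact recipe of the ICASSP 2013 paper; the frame sizes and 64-frame modulation window are invented:

```python
import numpy as np

def modulation_spectrum(x, frame=256, hop=128, n_frames=64):
    """Magnitude spectrogram -> log compression -> per-band mean removal
    -> FFT along the *time* axis of each frequency band."""
    frames = np.array([x[i:i + frame] * np.hanning(frame)
                       for i in range(0, len(x) - frame, hop)])
    spec = np.log(np.abs(np.fft.rfft(frames, axis=1)) + 1e-9)  # (T, F)
    spec -= spec.mean(axis=0)                  # remove per-band DC
    # modulation spectrum of the first n_frames trajectory samples
    return np.abs(np.fft.rfft(spec[:n_frames], axis=0))

mod = modulation_spectrum(np.sin(2 * np.pi * 440 * np.arange(16000) / 8000))
```

Each column of `mod` describes how one frequency band fluctuates over time, which is exactly the information a frame-by-frame magnitude feature throws away.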

SLIDE 38
  • Wrapping
  • Discontinuity
  • Distortion

38

Phase

Oppenheim, Schafer & Buck, Discrete time digital signal processing, 2nd Edition, Prentice Hall

SLIDE 39

Instantaneous Frequency

  • L. D. Alsteris and K. K. Paliwal, “Short-time phase spectrum in speech processing: A review and some experimental results,” Digital Signal Processing, 2007.

For a signal x(t) = a(t) e^(jφ(t)), the instantaneous frequency is the time-derivative of the phase:

f(t) = (1 / 2π) · dφ(t)/dt

39
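This definition can be checked numerically: for a linear chirp, the phase difference of consecutive analytic-signal samples recovers f(t). A minimal sketch with arbitrary chirp parameters:

```python
import numpy as np

sr = 8000
t = np.arange(sr) / sr
f0, rate = 500.0, 1000.0                      # linear chirp: f(t) = f0 + rate * t
phi = 2 * np.pi * (f0 * t + 0.5 * rate * t ** 2)
z = np.exp(1j * phi)                          # analytic signal a(t) e^{j phi(t)}, a(t)=1

# f(t) = (1/2pi) dphi/dt, estimated from consecutive phase differences;
# np.angle of the conjugate product handles phase wrapping automatically
inst_f = np.angle(z[1:] * np.conj(z[:-1])) * sr / (2 * np.pi)
```

The estimate tracks the chirp from about 500 Hz at the start to about 1500 Hz at the end.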

SLIDE 40

Group Delay Function: the negative frequency-derivative of the phase, τ(ω) = −dφ(ω)/dω

  • B. Yegnanarayana and H. A. Murthy, “Significance of group delay functions in spectrum estimation,” IEEE Transactions on Signal Processing, 1992
  • Leigh D. Alsteris and Kuldip K. Paliwal, “Evaluation of the modified group delay feature for isolated word recognition”, Int. Symposium on Signal Processing and Its Applications, 2005

40
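Group delay is usually computed without explicit phase unwrapping, via the textbook identity τ(ω) = (X_R Y_R + X_I Y_I) / |X|², where Y is the transform of n·x[n]. A small sketch:

```python
import numpy as np

def group_delay(x, n_fft=512):
    """tau(w) = -d(phase)/dw computed as Re(X conj(Y)) / |X|^2,
    with Y the FFT of n*x[n]; no phase unwrapping needed."""
    X = np.fft.rfft(x, n_fft)
    Y = np.fft.rfft(np.arange(len(x)) * x, n_fft)
    return (X.real * Y.real + X.imag * Y.imag) / (np.abs(X) ** 2 + 1e-12)

# sanity check: a pure delay of k samples has group delay k at every frequency
k = 5
x = np.zeros(64)
x[k] = 1.0
tau = group_delay(x)
```

The |X|² denominator explodes near spectral zeros; the modified group delay feature tames it, which is one reason MGDCC-style features work well in the Slide 18 comparison.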

SLIDE 41

High-dimensional features: phase & magnitude

41

Xiong Xiao, Xiaohai Tian, Steven Du, Haihua Xu, Eng Siong Chng, Haizhou Li, "Spoofing Speech Detection Using High Dimensional Magnitude and Phase Features: the NTU Approach for ASVspoof 2015 Challenge", Interspeech 2015

ASVspoof 2015: System D (NTU)

SLIDE 42
  • Spoofing Attacks
  • Voice Conversion
  • Artifacts
  • ASVspoof 2015

ASVspoof 2015: Speaker verification spoofing and countermeasures challenge

Organisers:
  • Zhizheng Wu, University of Edinburgh, UK
  • Tomi Kinnunen, University of Eastern Finland, Finland
  • Nicholas Evans, EURECOM, France
  • Junichi Yamagishi, University of Edinburgh, UK

42

SLIDE 43

# utterances | Training            | Development         | Evaluation          | Algorithm           | Vocoder
             | (10 male/15 female) | (15 male/20 female) | (20 male/26 female) |                     |
Genuine      | 3750                | 3497                | 9404                | None                | None
S1           | 2525                | 9975                | 18400               | VC: Frame-selection | STRAIGHT
S2           | 2525                | 9975                | 18400               | VC: Slope-shifting  | STRAIGHT
S3           | 2525                | 9975                | 18400               | SS: HMM             | STRAIGHT
S4           | 2525                | 9975                | 18400               | SS: HMM             | STRAIGHT
S5           | 2525                | 9975                | 18400               | VC: GMM             | MLSA
S6           | –                   | –                   | 18400               | VC: GMM             | STRAIGHT
S7           | –                   | –                   | 18400               | VC: GMM             | STRAIGHT
S8           | –                   | –                   | 18400               | VC: Tensor          | STRAIGHT
S9           | –                   | –                   | 18400               | VC: KPLS            | STRAIGHT
S10          | –                   | –                   | 18400               | SS: Unit-selection  | None

Voice Conversion Algorithms

43

SLIDE 44

44

16 Primary Submission Results (EER)

Zhizheng Wu, Tomi Kinnunen, Nicholas Evans and Junichi Yamagishi, "ASVspoof 2015: the first automatic speaker verification spoofing and countermeasures challenge", IEEE Signal Processing Society Speech and Language Technical Committee Newsletter (SLTC Newsletter), 20 November 2015

EERs on the unknown attacks were about four times higher than on the known attacks

SLIDE 45

Team | Average (all) | Average (without S10) | S10
A    | 1.211         | 0.402                 | 8.490
B    | 1.965         | 0.008                 | 19.571
C    | 2.528         | 0.076                 | 24.601
D    | 2.617         | 0.003                 | 26.142
E    | 2.694         | 0.060                 | 26.393
F    | 3.218         | 0.400                 | 28.581
G    | 3.326         | 0.360                 | 30.021
H    | 3.726         | 0.021                 | 37.068
I    | 3.898         | 0.703                 | 32.651
J    | 4.097         | 0.029                 | 40.708
K    | 4.547         | 0.203                 | 43.638
L    | 6.719         | 3.478                 | 35.890
M    | 14.391        | 12.482                | 31.574
N    | 14.568        | 11.299                | 43.991
O    | 18.826        | 16.304                | 41.519
P    | 21.518        | 18.786                | 46.102

45

Look at S10
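The EERs in these tables come from sweeping a decision threshold until the false-acceptance rate on spoofed trials equals the miss rate on genuine trials. A minimal computation on made-up score lists:

```python
import numpy as np

def eer(genuine, spoof):
    """Equal error rate (%): find the threshold where the false-acceptance
    rate on spoofed trials meets the miss rate on genuine trials."""
    thresholds = np.sort(np.concatenate([genuine, spoof]))
    far = np.array([(spoof >= th).mean() for th in thresholds])    # spoofed accepted
    frr = np.array([(genuine < th).mean() for th in thresholds])   # genuine rejected
    i = np.argmin(np.abs(far - frr))
    return 100.0 * (far[i] + frr[i]) / 2
```

With well-separated score distributions the EER approaches 0%; the large S10 column above shows how badly the distributions overlap for the unseen unit-selection attack.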

SLIDE 46

S1–S5 (appear in train/dev/eval sets):
  • S1: VC - frame selection
  • S2: VC - slope shifting
  • S3: TTS - HTS with 20 adaptation sentences
  • S4: TTS - HTS with 40 adaptation sentences
  • S5: VC - Festvox (http://festvox.org/)

S6–S10 (appear only in the eval set):
  • S6: VC - ML-GMM with GV enhancement
  • S7: VC - similar to S6 but using LSP features
  • S8: VC - tensor (eigenvoice)-based approach
  • S9: VC - nonlinear regression (KPLS)
  • S10: TTS - MARY TTS unit selection (http://mary.dfki.de/)

Listen to S10

46

SLIDE 47

47

  • X. Tian, Z. Wu, X. Xiao, E. S. Chng, H. Li, "Spoofing detection from a feature representation perspective", ICASSP 2016

Four Features, One-Hidden-Layer Neural Network

SLIDE 48

LMS, Temporal CNN over 100 frames

(Diagram: score distributions for natural speech and S1–S10 from a log-magnitude-spectrum (LMS) system trained on S1–S5; the network outputs spoofed: 0, natural: 1, and the unseen S10 scores fall close to the natural cluster.)

SLIDE 49

49

Fourier Transform vs Auditory Transform

  • Q. Li and Y. Huang, “An auditory-based feature extraction algorithm for robust speaker identification under mismatched conditions”, IEEE Transactions on Audio, Speech and Language Processing, vol. 19, no. 6, pp. 1791–1801, 2011
  • Massimiliano Todisco, Héctor Delgado and Nicholas Evans, “A New Feature for Automatic Speaker Verification Anti-Spoofing: Constant Q Cepstral Coefficients”, Odyssey 2016

SLIDE 50
  • In STFT, the time and frequency resolutions are constant.
  • CQT employs a variable time/frequency resolution:
    – greater time resolution for higher frequencies
    – greater frequency resolution for lower frequencies

A comparison of the time-frequency resolution of the STFT (top) and CQT (bottom).*

  • Massimiliano Todisco, Héctor Delgado and Nicholas Evans, “A New Feature for Automatic Speaker Verification Anti-Spoofing: Constant Q Cepstral Coefficients”, Odyssey 2016
  • E. Ahissar, S. Nagarajan, M. Ahissar, A. Protopapas, H. Mahncke, and M. M. Merzenich, “Speech comprehension is correlated with temporal response patterns recorded from auditory cortex,” Proc. Natl. Acad. Sci., vol. 98, no. 23, pp. 13367–13372, 2001

Constant Q Cepstral Coefficient

Centre frequencies are geometrically spaced, f_k = f_1 · 2^((k−1)/B), where B is the number of bins per octave, and the quality factor Q = f_k / δf_k is constant across all bins.

*Courtesy of Nick Evans

SLIDE 51

Constant Q Cepstral Coefficients (CQCC)

Block diagram of CQCC feature extraction:

speech signal → Constant-Q Transform → Power spectrum → LOG → Uniform resampling → DCT → CQCC

Front-end: CQCC-A (19+0th second derivative coefficients). Back-end: two GMMs (512 components, EM training), one for human speech and one for spoofed speech. Comparison of results (EER [%]) on the ASVspoof 2015 database. A Matlab implementation of CQCC extraction can be downloaded from http://audio.eurecom.fr/content/software

*Courtesy of Nick Evans
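The block diagram can be sketched end to end: a naive constant-Q transform (one geometrically spaced complex kernel per bin), log power, uniform resampling of the geometric frequency axis, then a DCT. Every parameter below is illustrative; the Matlab implementation from EURECOM is the authoritative version:

```python
import numpy as np

def cqcc_sketch(x, sr, fmin=20.0, bins_per_octave=12, n_cep=13):
    """Toy CQCC pipeline on one analysis window: constant-Q power
    spectrum -> log -> uniform resampling -> DCT. Illustration only."""
    Q = 1.0 / (2 ** (1.0 / bins_per_octave) - 1)          # constant Q factor
    n_bins = int(bins_per_octave * np.log2((sr / 2) / fmin))
    freqs = fmin * 2 ** (np.arange(n_bins) / bins_per_octave)
    power = np.empty(n_bins)
    for k, fk in enumerate(freqs):
        n_k = int(np.ceil(Q * sr / fk))                   # window shrinks as f grows
        kern = np.hanning(n_k) * np.exp(-2j * np.pi * fk * np.arange(n_k) / sr)
        power[k] = np.abs(np.dot(x[:n_k], kern) / n_k) ** 2
    logp = np.log(power + 1e-12)
    # uniform resampling: geometric frequency bins -> linear grid before the DCT
    lin = np.interp(np.linspace(freqs[0], freqs[-1], n_bins), freqs, logp)
    # DCT-II computed directly (small n_bins, fine for a sketch)
    n = np.arange(n_bins)
    return np.array([np.sum(lin * np.cos(np.pi * q * (2 * n + 1) / (2 * n_bins)))
                     for q in range(n_cep)])
```

The per-bin kernel length `Q * sr / f_k` is what gives the CQT its auditory-like resolution: long windows (fine frequency) at low frequencies, short windows (fine time) at high frequencies.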

SLIDE 52
  • Most systems assume natural speech inputs
  • More robust = more vulnerable
  • Better speech perceptual quality ≠ fewer artifacts*
  • Machines (frame-by-frame) and humans (spectro-temporal) listen in different ways**
  • Features are more important than classifiers

Spoofing: Challenges and Opportunities

52

* K. K. Paliwal, et al., “Comparative Evaluation of Speech Enhancement Methods for Robust Automatic Speech Recognition,” Int. Conf. Signal Processing and Communication Systems (ICSPCS), Gold Coast, Australia, Dec. 2010

** Duc Hoang Ha Nguyen, Xiong Xiao, Eng Siong Chng, Haizhou Li, “Feature Adaptation Using Linear Spectro-Temporal Transform for Robust Speech Recognition”, IEEE/ACM Trans. Audio, Speech & Language Processing 24(6): 1006-1019 (2016)

SLIDE 53

Thank You!

53