Residual-based Excitation with Continuous F0 Modeling in HMM-based Speech Synthesis

SLIDE 1

Residual-based Excitation with Continuous F0 Modeling in HMM-based Speech Synthesis

Tamás Gábor Csapó (1), Géza Németh (1), Milos Cernak (2)

csapot@tmit.bme.hu

(1) Budapest University of Technology and Economics
(2) Idiap Research Institute

SLSP 2015, Budapest, Nov 24, 2015

SLIDE 2

HMM-TTS Excitation model Evaluation Summary

1 HMM-based speech synthesis: Excitation models, Effect of creaky voice
2 Proposed residual-based excitation model: Analysis, Training, Synthesis
3 Evaluation: Listening test
4 Summary and conclusions

2 / 30 Tamás Gábor Csapó, Géza Németh, Milos Cernak Residual-based Excitation with Continuous F0 in HMM-TTS

SLIDE 3

HMM-based speech synthesis

SLIDE 4

HMM-based speech synthesis

State-of-the-art Text-To-Speech (TTS) synthesis technique [Zen et al., 2009]

Statistical

Generative models with maximum likelihood criterion
Hidden Markov models (HMM)

Parametric

Excitation and spectral modeling
Speech signal is encoded to parameters
Parameters suitable for statistical modeling
Parameters are decoded to speech

SLIDE 5

Excitation models in HMM-TTS

Goal: model human speech production
Source-filter separation [Fant, 1960]
Excitation model types [Hu et al., 2013]

Impulse-noise
Mixed excitation
Glottal source
Harmonic plus noise
Sinusoidal
Residual-based
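As an illustration of the simplest type in the list above, the impulse-noise model, here is a minimal sketch. It assumes a conventional F0 track in which 0 marks an unvoiced frame. The function name, the noise level, and the frame length are illustrative choices and are not taken from the slides.

```python
import random


def impulse_noise_excitation(f0_per_frame, frame_len, sample_rate, seed=0):
    """Pulse train for voiced frames (f0 > 0), white noise for unvoiced ones.

    Assumption: f0 == 0 marks an unvoiced frame, as in a traditional F0 track.
    """
    rng = random.Random(seed)
    excitation = []
    next_pulse = 0  # absolute sample index of the next pulse
    for i, f0 in enumerate(f0_per_frame):
        start = i * frame_len
        for n in range(start, start + frame_len):
            if f0 > 0:
                if n >= next_pulse:
                    excitation.append(1.0)
                    # One pulse per pitch period.
                    next_pulse = n + round(sample_rate / f0)
                else:
                    excitation.append(0.0)
            else:
                # Illustrative noise level, not a tuned value.
                excitation.append(rng.gauss(0.0, 0.1))
                next_pulse = n + 1
    return excitation


# Two voiced frames at 100 Hz followed by one unvoiced frame.
exc = impulse_noise_excitation([100.0, 100.0, 0.0], frame_len=160, sample_rate=16000)
```

In a source-filter synthesizer this signal would then be passed through the spectral filter. The other model types in the list replace the pulse train, the noise, or both with richer signals.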

SLIDE 6

Effect of creaky voice

Creaky voice

Irregular vibration of vocal folds
Abrupt changes in F0 (fundamental frequency, pitch) and/or amplitudes
Perceived as rough voice
Up to 15% of vowels of natural speech

Effect of creaky voice on HMM-TTS

Can cause problems for standard speech analysis methods (e.g. F0 tracking and spectral analysis)
Voiced / unvoiced error is learned during training
Audible distortions in synthesized sentences

SLIDE 7

Creaky voice sample

[Figure: 'Eggshell is not good to eat.' (sample). a) Regions of creaky voice, amplitude over time (s); b) standard F0 tracking, frequency (Hz) over time (s).]

SLIDE 9

Proposed residual-based excitation model

SLIDE 10

Block diagram of analysis

SLIDE 12

Analysis: PCA-based residual

Inverse filtered residual
Pitch synchronous framing
Earlier excitation models:

Store frames in a codebook
Select frames from codebook during synthesis

Proposed model:

Window and resample frames to fixed length
Apply Principal Component Analysis (PCA)
Keep the first PCA component for use in synthesis
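The three steps of the proposed model can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than the authors' implementation: the Hann window, linear-interpolation resampling, the fixed length of 64 samples, and power iteration as the PCA solver are all choices made here for brevity, and every function name is hypothetical.

```python
import math
import random


def resample_linear(frame, target_len):
    """Resample one pitch-synchronous frame to target_len samples (>= 2)."""
    if len(frame) == 1:
        return [frame[0]] * target_len
    scale = (len(frame) - 1) / (target_len - 1)
    out = []
    for i in range(target_len):
        pos = i * scale
        lo = int(pos)
        hi = min(lo + 1, len(frame) - 1)
        frac = pos - lo
        out.append((1.0 - frac) * frame[lo] + frac * frame[hi])
    return out


def hann(n):
    """Hann window of length n (assumed window type)."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / (n - 1)) for i in range(n)]


def first_principal_component(rows, iterations=100, seed=0):
    """First PCA component of mean-centred rows, found by power iteration."""
    dim = len(rows[0])
    mean = [sum(r[j] for r in rows) / len(rows) for j in range(dim)]
    centred = [[r[j] - mean[j] for j in range(dim)] for r in rows]
    rng = random.Random(seed)
    vec = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    for _ in range(iterations):
        # Multiply by X^T X without forming the covariance matrix.
        proj = [sum(c[j] * vec[j] for j in range(dim)) for c in centred]
        new = [sum(p * c[j] for p, c in zip(proj, centred)) for j in range(dim)]
        norm = math.sqrt(sum(v * v for v in new))
        if norm == 0.0:
            break
        vec = [v / norm for v in new]
    return vec


def pca_residual(frames, fixed_len=64):
    """Window and resample every frame, then return the first PCA component."""
    window = hann(fixed_len)
    rows = [[w * s for w, s in zip(window, resample_linear(f, fixed_len))]
            for f in frames]
    return first_principal_component(rows)
```

Unlike a codebook, which stores many residual frames and selects among them at synthesis time, this keeps a single fixed-length waveform per speaker.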

SLIDE 13

Analysis: PCA-based residual

[Figure: PCA residual, normalized amplitude over time (samples). a) PCA residual for EN-M-AWB; b) PCA residual for EN-F-SLT.]

SLIDE 14

Analysis: continuous F0 modeling

Traditional F0 trackers

F0 is discontinuous, jumps occur at voiced-unvoiced transitions
HMMs can model continuous functions efficiently
Multi-Space Distribution (MSD) necessary for traditional F0 [Tokuda et al., 2002]

Simple continuous pitch tracker 'F0cont' [Garner et al., 2013]

Standard autocorrelation
No voiced/unvoiced decision
Kalman smoothing-based interpolation
Interpolates F0 in regions of creaky voice
No need for MSD during training
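The idea behind a continuous pitch tracker can be sketched as follows. Every frame gets an autocorrelation-based F0 candidate and a confidence value, no frame is declared unvoiced, and a Kalman smoother bridges low-confidence regions. This is a simplified illustration and not the algorithm of Garner et al.: the random-walk state model, the variance constants, and the mapping from autocorrelation peak to observation variance are assumptions made here.

```python
import math


def autocorr_pitch(frame, sample_rate, f0_min=50.0, f0_max=400.0):
    """Return (f0 candidate in Hz, peak strength in [0, 1]) for one frame."""
    energy = sum(x * x for x in frame)
    if energy == 0.0:
        # Silent frame: arbitrary candidate with zero confidence.
        return 0.5 * (f0_min + f0_max), 0.0
    lag_min = int(sample_rate / f0_max)
    lag_max = min(int(sample_rate / f0_min), len(frame) - 1)
    best_lag, best_val = lag_min, -1.0
    for lag in range(lag_min, lag_max + 1):
        val = sum(frame[n] * frame[n + lag]
                  for n in range(len(frame) - lag)) / energy
        if val > best_val:
            best_lag, best_val = lag, val
    return sample_rate / best_lag, max(best_val, 0.0)


def kalman_smooth(observations, confidences, process_var=25.0, base_obs_var=4.0):
    """Scalar random-walk Kalman filter followed by an RTS backward pass.

    Assumption: a weak autocorrelation peak means a noisy observation, so the
    observation variance grows as the confidence drops.
    """
    means, variances, pred_means, pred_vars = [], [], [], []
    mean, var = observations[0], 1e4  # diffuse prior
    for z, c in zip(observations, confidences):
        pred_mean, pred_var = mean, var + process_var
        obs_var = base_obs_var / max(c, 1e-3) ** 2
        gain = pred_var / (pred_var + obs_var)
        mean = pred_mean + gain * (z - pred_mean)
        var = (1.0 - gain) * pred_var
        pred_means.append(pred_mean)
        pred_vars.append(pred_var)
        means.append(mean)
        variances.append(var)
    smoothed = means[:]
    for t in range(len(means) - 2, -1, -1):
        gain = variances[t] / pred_vars[t + 1]
        smoothed[t] = means[t] + gain * (smoothed[t + 1] - pred_means[t + 1])
    return smoothed


def continuous_f0(frames, sample_rate):
    """One F0 value per frame, with no voiced/unvoiced decision."""
    estimates = [autocorr_pitch(f, sample_rate) for f in frames]
    return kalman_smooth([e[0] for e in estimates], [e[1] for e in estimates])
```

Because every frame carries a value, the F0 stream can be modeled with ordinary continuous distributions, which is why MSD is not needed during training.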

SLIDE 15

Analysis: Maximum Voiced Frequency

Divide the spectrum into two frequency bands

Lower frequency band: voiced
Higher frequency band: unvoiced

Earlier excitation models:

Boundary between frequency bands fixed (at 6 kHz)

Proposed excitation model:

Boundary between frequency bands varying
Maximum Voiced Frequency (MVF) [Drugman and Stylianou, 2014]
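The difference between a fixed and a varying band boundary can be shown with per-bin masks over a magnitude spectrum. This is a minimal sketch. The hard 0/1 split and the function names are assumptions, and a real system may use filters with a smoother transition.

```python
FIXED_BOUNDARY_HZ = 6000.0  # fixed boundary of earlier models, as on the slide


def band_masks(boundary_hz, n_bins, sample_rate):
    """Return (voiced_mask, unvoiced_mask) over n_bins from 0 Hz to Nyquist."""
    nyquist = sample_rate / 2.0
    voiced, unvoiced = [], []
    for k in range(n_bins):
        freq = k * nyquist / (n_bins - 1)
        in_voiced_band = freq <= boundary_hz
        voiced.append(1.0 if in_voiced_band else 0.0)
        unvoiced.append(0.0 if in_voiced_band else 1.0)
    return voiced, unvoiced


def masks_per_frame(mvf_track, n_bins, sample_rate):
    """Varying boundary: one pair of masks per frame, driven by the MVF track."""
    return [band_masks(mvf, n_bins, sample_rate) for mvf in mvf_track]


# Fixed boundary versus a frame whose MVF is 4 kHz (hypothetical value).
fixed_voiced, _ = band_masks(FIXED_BOUNDARY_HZ, n_bins=9, sample_rate=16000)
mvf_voiced, _ = band_masks(4000.0, n_bins=9, sample_rate=16000)
```

With a per-frame MVF the voiced band shrinks or grows from frame to frame instead of always ending at the same frequency.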

SLIDE 16

Training with proposed model

Parameters calculated for each 25 ms frame

MGC: Mel-Generalized Cepstrum
F0cont: continuous pitch track
MVF: Maximum Voiced Frequency

Decision tree-based context clustering and context-dependent labeling [Zen et al., 2007]
Independent decision trees for all the parameters and duration using a maximum likelihood criterion

SLIDE 17

Block diagram of synthesis

SLIDE 19

Synthesis features

PCA residual overlap-added according to F0cont
Voiced and unvoiced excitation components added together according to MVF
MVF models voicing

for unvoiced sounds, the MVF is low (around 1 kHz)
for voiced sounds, the MVF is high (above 4 kHz)
for mixed excitation sounds, the MVF is in between (e.g. for voiced fricatives, MVF is around 2-3 kHz)

Spectral filtering according to MGC
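The first synthesis step, overlap-adding the PCA residual according to F0cont, might look like the sketch below. It covers only the voiced component. Mixing with noise according to MVF and the MGC-based spectral filter are left out. The function name, the frame shift, and the toy pulse are assumptions for illustration.

```python
def overlap_add_pulses(pulse, f0_track, frame_shift, sample_rate):
    """Place one copy of `pulse` per pitch period, following the F0 track.

    f0_track holds one continuous F0 value (Hz, always > 0) per frame.
    """
    total = len(f0_track) * frame_shift
    out = [0.0] * (total + len(pulse))
    pos = 0
    while pos < total:
        # Overlap-add: copies sum where they overlap.
        for i, p in enumerate(pulse):
            out[pos + i] += p
        f0 = f0_track[min(pos // frame_shift, len(f0_track) - 1)]
        # The next pulse starts one pitch period later.
        pos += max(1, round(sample_rate / f0))
    return out[:total]


# Toy two-sample pulse; a real PCA residual would be a longer waveform.
voiced = overlap_add_pulses([1.0, 0.5], [100.0, 200.0], frame_shift=160,
                            sample_rate=16000)
```

Since F0cont is defined for every frame, this loop never has to switch off. Voicing is instead controlled by how much of the spectrum the MVF assigns to the voiced component.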

SLIDE 20

Evaluation

SLIDE 21

Data

Two English speakers from CMU-ARCTIC database [Kominek and Black, 2003]

EN-M-AWB (Scottish English, male)
EN-F-SLT (American English, female)
Both produced irregular phonation frequently, mostly at the end of sentences

16 kHz sampling
1132 sentences from each speaker, single speaker training
Text processing using the Festival TTS front-end (e.g. phonetic transcription, labeling)

SLIDE 22

System A: HTS-F0std (baseline)

standard pitch tracking
voiced / unvoiced boundary fixed at 6 kHz

[Figure: 'Please Mom, is this New Zealand, or Australia?' (sample). Panels over time (s): a) HTS-F0std, F0; b) HTS-F0std, spectrogram and MVF; c) HTS-F0std+MVF, F0; d) HTS-F0std+MVF, spectrogram and MVF; e) HTS-F0cont+MVF, F0; f) HTS-F0cont+MVF, spectrogram and MVF. Frequency axes in Hz.]

SLIDE 23

System B: HTS-F0std+MVF

standard pitch tracking
voiced / unvoiced boundary according to MVF parameter

[Figure: same six-panel F0 and spectrogram/MVF comparison as on slide 22, sample 'Please Mom, is this New Zealand, or Australia?']

SLIDE 24

System C: HTS-F0cont+MVF

continuous pitch tracking
voiced / unvoiced boundary according to MVF parameter

[Figure: same six-panel F0 and spectrogram/MVF comparison as on slide 22, sample 'Please Mom, is this New Zealand, or Australia?' (samples: A, B, C)]

SLIDE 25

Listening test

Web-based paired comparison test with one CMOS-like question
3 systems, 10 sentences, 2 speakers
Which of the sentences is more natural?

1: first much more natural
2: first more natural
3: equal
4: second more natural
5: second much more natural

8 listeners, not native speakers of English

http://leszped.tmit.bme.hu/slsp2015_en/
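Answers on this five-point scale could be turned into a preference score per system pair as sketched below. Mapping the codes 1 to 5 onto -2 to +2 and pooling both presentation orders are assumptions made for illustration, and the example responses are invented. They are not the data of the study.

```python
from collections import defaultdict

# Answer codes from the slide, mapped to a signed preference for the SECOND
# sample (assumed mapping: 1 -> -2, 3 -> 0, 5 -> +2).
ANSWER_TO_SCORE = {1: -2, 2: -1, 3: 0, 4: 1, 5: 2}


def mean_preference(responses):
    """responses: iterable of (first_system, second_system, answer_code).

    Returns {(x, y): mean score}, with x < y alphabetically. A positive mean
    says y was judged more natural than x.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for first, second, answer in responses:
        score = ANSWER_TO_SCORE[answer]
        # Store each pair in sorted order so both presentation orders pool.
        if first > second:
            first, second, score = second, first, -score
        totals[(first, second)] += score
        counts[(first, second)] += 1
    return {pair: totals[pair] / counts[pair] for pair in totals}


# Invented responses, for illustration only.
example = [("A", "C", 5), ("C", "A", 1), ("A", "B", 3)]
prefs = mean_preference(example)
```

A significance test over the pooled scores would be the natural next step, which this sketch leaves out.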

SLIDE 26

Results of the listening test

Speaker SLT (female)

System A < System B < System C
Proposed excitation model preferred

Speaker AWB (male)

System C < System B = System A
Probably because of high background noise
Vocoding caused audible artifacts

SLIDE 27

Summary and conclusions

SLIDE 28

Summary and conclusions

Novel residual-based excitation model

PCA-based residual
Continuous F0 modeling
Maximum Voiced Frequency

Evaluation

Improvement in perceived naturalness (for the female speaker)
Effect of creaky voice eliminated
Disturbing artifacts caused by unwanted voicing

Possible application

TTS on smart devices (e.g. Android smartphones)
Personalized systems

SLIDE 29

Future directions

Improved modeling of the unvoiced sounds

Rule-based voiced/unvoiced decision
New parameter for voicing (e.g. Harmonics-to-Noise Ratio)

Vocoding

Application in low bitrate speech coding

SLIDE 30

Thank you for your attention!

Tamás Gábor Csapó, Géza Németh, Milos Cernak, "Residual-based Excitation with Continuous F0 Modeling in HMM-based Speech Synthesis"
csapot@tmit.bme.hu

Supported by the Swiss National Science Foundation via the joint research project (SCOPES scheme) SP2: SCOPES project on speech prosody (SNSF no IZ73Z0 152495-1) and by the EITKIC project (EITKIC 12-1-2012-001).

SLIDE 31

References I

Drugman, T. and Stylianou, Y. (2014). Maximum Voiced Frequency Estimation: Exploiting Amplitude and Phase Spectra. IEEE Signal Processing Letters, 21(10):1230–1234.
Fant, G. (1960). Acoustic theory of speech production. Mouton, The Hague.
Garner, P. N., Cernak, M., and Motlicek, P. (2013). A simple continuous pitch estimation algorithm. IEEE Signal Processing Letters, 20(1):102–105.
Hu, Q., Richmond, K., Yamagishi, J., and Latorre, J. (2013). An experimental comparison of multiple vocoder types. In Proc. ISCA SSW8, pages 155–160.
Kominek, J. and Black, A. W. (2003). CMU ARCTIC databases for speech synthesis. Technical report, Language Technologies Institute.
Tokuda, K., Masuko, T., Miyazaki, N., and Kobayashi, T. (2002). Multi-space probability distribution HMM. IEICE Transactions on Information and Systems, E85-D(3):455–464.

SLIDE 32

References II

Zen, H., Nose, T., Yamagishi, J., Sako, S., Masuko, T., and Black, A. (2007). The HMM-based speech synthesis system version 2.0. In Proc. ISCA SSW6, pages 294–299, Bonn, Germany.
Zen, H., Tokuda, K., and Black, A. W. (2009). Statistical parametric speech synthesis. Speech Communication, 51(11):1039–1064.