A novel irregular voice model for HMM-based speech synthesis

Tamás Gábor Csapó, Géza Németh. Budapest University of Technology and Economics, Hungary, Dept. of Telecommunications and Media Informatics. 8th Speech Synthesis Workshop 2013.


SLIDE 1

A novel irregular voice model for HMM-based speech synthesis

8th Speech Synthesis Workshop, 2 September 2013, Barcelona, Spain

Tamás Gábor Csapó, Géza Németh

Budapest University of Technology and Economics, Hungary

Dept. of Telecommunications and Media Informatics

SLIDE 2

Contents

  • Excitation models in HMM-TTS
  • Irregular voice and its models
  • Novel irregular voice model
  • Perceptual & acoustic evaluation

SLIDE 3

INTRODUCTION

SLIDE 4

Speech excitation models in HMM-TTS

  • Goal: model human speech production
  • Source-filter separation [Fant’60]
  • Types [Hu;’13] SSW8
    – Impulse-noise
    – Mixed excitation
    – Glottal source
    – Harmonic plus noise
    – Residual based

SLIDE 5

Linear Prediction residual of speech

SLIDE 6

Irregular voice: occurrence

  • Irregular vibration of vocal folds [Blomgren;’98] [Gobl&Chasaide’03]
    – Irregular F0 and/or amplitudes
  • Creaky voice, laryngealization, vocal fry, glottalization
  • Up to 15% of vowels of natural speech [Bőhm;’09]
  • Location [Dilley;’96]
    – Phrase boundaries
    – Sentence endings
    – Vowel-vowel transitions

SLIDE 7

Irregular voice: example

SLIDE 8

Irregular voice: acoustic properties

  • Differences compared to regular speech [Klatt&Klatt’90] [Bőhm;’09]
    – time between successive glottal pulses longer and more irregular
    – lower F0 and higher jitter
    – abrupt changes in the amplitude of the periods
    – lowered open quotient (proportion of the glottal cycle where the glottis is open)
    – increased first formant bandwidth because of more acoustic losses at the glottis
    – more abrupt closure of the vocal folds

SLIDE 9

Irregular voice: models in HMM-TTS

  • [Silén;’09] Interspeech
    – Robust F0 measure and two-band voicing
    – Not focusing on characteristics of irregular voice
  • [Drugman;’12] Interspeech
    – Extension of DSM model: secondary pulses in the residual excitation
  • [Drugman;’13] ICASSP
    – Prediction of creaky voice position
  • [Raitio;’13] Interspeech
    – Creaky voice integrated into HTS
  • Proposed method
    – Uses another excitation model
    – Improvement of previous regular-to-irregular transformation
    – 3 heuristics model irregular voice

SLIDE 10

[Bőhm;’09] regular-to-irregular transformation

SLIDE 11

OUR METHODS

SLIDE 12

Baseline: HTS-CDBK excitation model

  • HTS-CDBK [Csapó&Németh’12]
    – Residual based
    – MGC analysis
    – Codebook of pitch-synchronous residuals
    – White noise above 6 kHz
  • Parameters
    – MGC: Mel-Generalized Cepstrum
    – F0: of the frame
    – gain: RMS energy of the windowed frame
    – rt0 peak indices: the locations of peaks in the frame
    – HNR: Harmonics-to-Noise Ratio of the frame [de Krom’93]
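
The gain parameter is described as the RMS energy of the windowed frame. A minimal sketch of that computation follows. It assumes a Hann window, because the slides do not name the window type, and the function names `hann_window` and `frame_gain` are hypothetical.

```python
import math


def hann_window(length):
    """Return a symmetric Hann window of the given length."""
    if length == 1:
        return [1.0]
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * n / (length - 1))
            for n in range(length)]


def frame_gain(frame):
    """RMS energy of a frame after applying the (assumed) Hann window."""
    if not frame:
        return 0.0
    window = hann_window(len(frame))
    windowed = [s * w for s, w in zip(frame, window)]
    return math.sqrt(sum(x * x for x in windowed) / len(windowed))


# Usage: doubling the amplitude of a frame doubles its gain.
quiet = frame_gain([1.0] * 9)
loud = frame_gain([2.0] * 9)
```

Because the RMS is linear in amplitude, scaling a residual frame by a factor scales this gain by the same factor.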

SLIDE 13

Baseline: HTS-CDBK rt0 parameter

  • position of peaks (distance)
  • simple peak picking
  • suitable for machine learning

SLIDE 14

Baseline: HTS-CDBK analysis

SLIDE 15

Baseline: HTS-CDBK synthesis

SLIDE 16

Novel: HTS-CDBK+Irreg-Rule synthesis

SLIDE 17

Heuristic #1: F0 halving

  • Irregular speech: often significantly lower F0 than regular speech
  • Synthesis: half of the F0 of the generated parameter sequence is used
    – Residual frames are zero padded
    – Similar effect as removing every 2nd pitch cycle
    – Results in decreased open quotient
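
The F0 halving step can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the zeros are appended at the end of the cycle (the slides do not say where the padding goes), and the 16 kHz default matches the sampling rate given for the speech data.

```python
def period_in_samples(f0_hz, fs_hz=16000):
    """Length of one pitch period in samples at the given F0."""
    return int(round(fs_hz / f0_hz))


def zero_pad_to_half_f0(residual_cycle, f0_hz, fs_hz=16000):
    """Zero pad one pitch-synchronous residual cycle to the period of F0/2.

    Assumption: padding is appended after the cycle. A cycle that is
    already long enough is returned unchanged.
    """
    target_len = period_in_samples(f0_hz / 2.0, fs_hz)
    padding = max(0, target_len - len(residual_cycle))
    return list(residual_cycle) + [0.0] * padding


# Usage: a 100 Hz cycle (160 samples at 16 kHz) becomes a 50 Hz cycle.
cycle = [1.0] * 160
padded = zero_pad_to_half_f0(cycle, 100.0)
```

The excitation is then active for the same number of samples within a period twice as long, which is consistent with the decreased open quotient noted above.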

SLIDE 18

Heuristic #2: gain scaling

  • Irregular speech: often strong amplitude attenuations during the consecutive cycles
  • Synthesis: residual frames are multiplied by random scaling factors between 0 and 1
    – do not boost any of the periods, only attenuate or leave them unchanged
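
A minimal sketch of the gain scaling heuristic, assuming one uniformly distributed factor per residual cycle (the slides give only the range, not the distribution). The function name `attenuate_cycles` and the optional `rng` argument are hypothetical.

```python
import random


def attenuate_cycles(cycles, rng=None):
    """Scale each residual cycle by its own random factor in [0, 1].

    Factors never exceed 1, so a cycle is attenuated or left unchanged,
    never boosted. Uniform sampling is an assumption.
    """
    if rng is None:
        rng = random.Random()
    scaled = []
    for cycle in cycles:
        factor = rng.uniform(0.0, 1.0)
        scaled.append([factor * sample for sample in cycle])
    return scaled


# Usage with a seeded generator so that the result is reproducible.
cycles = [[1.0, -1.0], [0.5, 0.5]]
scaled = attenuate_cycles(cycles, rng=random.Random(0))
```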

SLIDE 19

Heuristic #3: Spectral distortion

  • Irregular speech: frame-by-frame MGC parameters are less smooth than those of regular speech
  • Synthesis: distort MGC parameters
    – parameter values are multiplied by random numbers between 0.995 and 1.005
    – yields less smooth parameter sequence
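
The spectral distortion heuristic can be illustrated as below. The sketch assumes an independent, uniformly distributed multiplier for every coefficient of every frame; the slides do not state whether the random numbers are drawn per coefficient or per frame. The name `distort_mgc` is hypothetical.

```python
import random


def distort_mgc(mgc_frames, low=0.995, high=1.005, rng=None):
    """Multiply every MGC coefficient by a random number in [low, high].

    Assumption: one independent uniform draw per coefficient and frame.
    """
    if rng is None:
        rng = random.Random()
    return [[coef * rng.uniform(low, high) for coef in frame]
            for frame in mgc_frames]


# Usage: two short frames of made-up coefficients.
frames = [[1.0, -2.0, 0.5], [0.3, 0.1, -0.7]]
distorted = distort_mgc(frames, rng=random.Random(1))
```

Each value changes by at most 0.5%, so the sequence becomes less smooth from frame to frame while the overall spectral envelope stays close to the generated one.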

SLIDE 20

Position of irregular speech

  • Irregular speech: often causes F0 detection errors in sentence-final vowels (F0=0)
  • Synthesis: F0=0 pattern of sentence-final vowels is modeled by machine learning
    – Irregular voice applied if 5 consecutive frames have F0=0
    – Indirect method for position of creaky voice
    – F0 interpolation between voiced parts
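
The position rule can be sketched as two small steps: find runs of at least 5 consecutive frames with F0=0, then fill those runs by interpolating F0 between the neighbouring voiced frames. The sketch assumes linear interpolation, assumes that the caller passes only the frames of a sentence-final vowel, and holds the nearest voiced value when a run touches the start or end of the sequence. None of these details are given on the slides, and both function names are hypothetical.

```python
def find_unvoiced_runs(f0, min_len=5):
    """Return (start, end) index pairs, end exclusive, of F0=0 runs
    that are at least min_len frames long."""
    runs = []
    start = None
    for i, value in enumerate(f0):
        if value == 0:
            if start is None:
                start = i
        elif start is not None:
            if i - start >= min_len:
                runs.append((start, i))
            start = None
    if start is not None and len(f0) - start >= min_len:
        runs.append((start, len(f0)))
    return runs


def interpolate_f0(f0, runs):
    """Fill the given runs by linear interpolation between voiced neighbours."""
    out = list(f0)
    for start, end in runs:
        left = f0[start - 1] if start > 0 else None
        right = f0[end] if end < len(f0) else None
        if left is None and right is None:
            continue  # no voiced frame available to interpolate from
        if left is None:
            left = right
        if right is None:
            right = left
        steps = end - start + 1
        for k in range(start, end):
            t = (k - start + 1) / steps
            out[k] = left + t * (right - left)
    return out


# Usage: one run of 5 zero frames is marked irregular and filled;
# the shorter run of 2 zero frames is left untouched.
f0 = [100.0, 100.0, 0, 0, 0, 0, 0, 120.0, 0, 0, 110.0]
runs = find_unvoiced_runs(f0)
filled = interpolate_f0(f0, runs)
```

The frames inside each returned run are the ones where the irregular voice heuristics would be applied.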

SLIDE 21

RESULTS

SLIDE 22

Waveforms: 3 heuristics

SLIDE 23

Residuals + speech: baseline vs. novel

SLIDE 24

Perceptual evaluation: speech data

  • 2 Hungarian male speakers with frequent irregular voice
    – About 2 hours of speech (1940 sentences)
    – 16 kHz, 16 bit waveforms + labels
    – Single speaker training with HTS-CDBK and HTS-CDBK+Irreg-Rule
    – 10 synthesized samples each from the baseline and novel systems
      • words from sentence endings with irregular voice

SLIDE 25

Perceptual evaluation: methods

  • Internet-based test
    – Paired comparison
  • Questions: Comparative MOS (CMOS)
    – 1: preference (‘Which version do you think is more pleasant?’)
    – 2: similarity to the original speaker (‘Which version is more similar to the original speaker?’)
  • Listeners
    – 11 students and professionals

SLIDE 26

Perceptual evaluation: results

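[Figure: stacked bar charts, 0% to 100%, for speakers FF3 and FF4; legend: Baseline, equal, Proposed #1 (HTS-CDBK+Irreg-Rule)]

  • Preference (baseline / equal / proposed): 14% / 34% / 53% and 15% / 39% / 45%
  • Similarity (baseline / equal / proposed): 18% / 28% / 54% and 15% / 31% / 55%
  • Significant differences (p<0.0005) for proposed model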

SLIDE 27

Acoustic evaluation: methods

  • Acoustic cues: irregular vs. regular speech [Klatt&Klatt’90] [Bőhm;’09]
    – lower open quotient (OQ)
    – increased first formant bandwidth (B1)
    – lower spectral tilt (TL)
  • Measurement in the frequency domain
    – OQ ~ H1-H2 (the difference of the amplitudes of the first two harmonics)
    – 1/B1 ~ H1-A1 (H1 relative to the first formant amplitude)
    – TL ~ H1-A3 (H1 relative to the third formant amplitude)
    – compensation of the first three formants
  • Samples
    – 10 original regular, 10 original irregular, 10 synthesized irregular
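
The three frequency-domain measures can be illustrated on a magnitude spectrum given in dB. This is a simplified sketch: it omits the formant compensation step listed above, it takes F0 and the formant frequencies as known inputs, and the search widths (10% of F0 around each harmonic, half of F0 around each formant) are assumptions. The function names and the synthetic spectrum in the usage part are hypothetical.

```python
def peak_db_near(spectrum_db, bin_hz, target_hz, half_width_hz):
    """Largest magnitude (dB) within target_hz +/- half_width_hz."""
    lo = max(0, int(round((target_hz - half_width_hz) / bin_hz)))
    hi = min(len(spectrum_db) - 1,
             int(round((target_hz + half_width_hz) / bin_hz)))
    if lo > hi:
        raise ValueError("target frequency lies outside the spectrum")
    return max(spectrum_db[lo:hi + 1])


def spectral_cues(spectrum_db, bin_hz, f0_hz, f1_hz, f3_hz):
    """Uncompensated H1-H2, H1-A1 and H1-A3 in dB."""
    h1 = peak_db_near(spectrum_db, bin_hz, f0_hz, 0.1 * f0_hz)
    h2 = peak_db_near(spectrum_db, bin_hz, 2.0 * f0_hz, 0.1 * f0_hz)
    a1 = peak_db_near(spectrum_db, bin_hz, f1_hz, 0.5 * f0_hz)
    a3 = peak_db_near(spectrum_db, bin_hz, f3_hz, 0.5 * f0_hz)
    return {"H1-H2": h1 - h2, "H1-A1": h1 - a1, "H1-A3": h1 - a3}


# Usage on a synthetic spectrum: 10 Hz bins up to 4 kHz, -60 dB floor,
# with made-up peaks at F0 = 100 Hz, 2*F0, F1 = 500 Hz and F3 = 2500 Hz.
spectrum = [-60.0] * 401
spectrum[10] = 0.0     # H1 at 100 Hz
spectrum[20] = -5.0    # H2 at 200 Hz
spectrum[50] = -3.0    # A1 near F1
spectrum[250] = -20.0  # A3 near F3
cues = spectral_cues(spectrum, bin_hz=10.0, f0_hz=100.0,
                     f1_hz=500.0, f3_hz=2500.0)
```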

SLIDE 28

Acoustic evaluation: measurements

[Figure: magnitude spectrum, Magnitude (dB) versus Frequency (Hz) from 0 to 4000 Hz, with the harmonics H1 (at fH1) and H2 (at fH2) and the formant amplitudes A1, A2, A3 (at F1, F2, F3) marked]

SLIDE 29

Acoustic evaluation: results

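[Figure: bar chart of parameter values [dB] for original regular, original irregular and synthesized irregular samples]

  • H1*-H2* (~ open quotient): original regular 2.1, original irregular -8.5, synthesized irregular -9.6
  • H1*-A1 (~ 1 / first formant bandwidth): original regular -4.6, original irregular -11.8, synthesized irregular -12.3
  • H1*-A3* (~ spectral tilt): original regular 22.4, original irregular 20.7, synthesized irregular 18.9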

SLIDE 30

SUMMARY

SLIDE 31

Discussion and conclusions

  • Irregular phonation: no strict definition
  • 3 heuristics to model in synthesis
    – Extremely low F0
    – Amplitude attenuations
    – Perturbations in spectrum
  • Perception & acoustic tests
    – More preferred and more similar to original speaker
    – Similar to original irregular samples
  • Possible applications
    – Expressive speech synthesis (e.g. sad)
    – Personalized systems

SLIDE 32

Future directions

  • Pre-defined stylized pulse patterns instead of random scaling [Bőhm;’09]
  • Data-driven irregular voice model
    – Csapó & Németh, “Modeling irregular voice in statistical parametric speech synthesis with residual codebook based excitation”, IEEE Journal of Selected Topics in Signal Processing, Oct 2013
  • Use parameters for irregular voice position [Drugman;’13]
  • Compare with other models [Drugman;’12] [Raitio;’13]

SLIDE 33

Tamás Gábor Csapó, Géza Németh: A novel irregular voice model for HMM-based speech synthesis csapot@tmit.bme.hu

This research is partially supported by the following projects:

  • Paelife (Grant No AAL-08-1-2011-0001)
  • CESAR (Grant No 271022)
  • EITKIC_12-1-2012-001
  • Campus Hungary

SLIDE 34

References

  • Blomgren, M. et al., 1998. Acoustic, aerodynamic, physiologic, and perceptual properties of modal and vocal fry registers. The Journal of the Acoustical Society of America, 103(5), pp. 2649–2658.
  • Bőhm, T. et al., 2008. Transforming modal voice into irregular voice by amplitude scaling of individual glottal cycles. In Acoustics’08. Paris, France, pp. 6141–6146.
  • Csapó, T.G. & Németh, G., 2012. A novel codebook-based excitation model for use in speech synthesis. In IEEE CogInfoCom. Kosice, Slovakia: IEEE, pp. 661–665.
  • Csapó, T.G. & Németh, G., 2013a. Statistical parametric speech synthesis with a novel codebook-based excitation model. Intelligent Decision Technologies.
  • Csapó, T.G. & Németh, G., 2013b. Transformation of irregular voice to regular voice by residual analysis and synthesis. IEEE Signal Processing Letters.
  • Dilley, L., Shattuck-Hufnagel, S. & Ostendorf, M., 1996. Glottalization of word-initial vowels as a function of prosodic structure. Journal of Phonetics, 24(4), pp. 423–444.
  • Drugman, T. et al., 2013. Prediction of Creaky Voice from Contextual Factors. In Proc. ICASSP. Vancouver, Canada.
  • Drugman, T., Kane, J. & Gobl, C., 2012. Modeling the Creaky Excitation for Parametric Speech Synthesis. In Proc. Interspeech. Portland, Oregon, USA, pp. 1424–1427.
  • Drugman, T., Wilfart, G. & Dutoit, T., 2009. A deterministic plus stochastic model of the residual signal for improved parametric speech synthesis. In Proc. Interspeech. Brighton, UK, pp. 1779–1782.
  • Fant, G., Liljencrants, J. & Lin, Q., 1985. A four-parameter model of glottal flow. STL-QPSR, 4, pp. 1–13.
  • Gobl, C. & Chasaide, A.N., 2003. The role of voice quality in communicating emotion, mood and attitude. Speech Communication, 40(1-2), pp. 189–212.
  • Klatt, D.H. & Klatt, L.C., 1990. Analysis, synthesis, and perception of voice quality variations among female and male talkers. The Journal of the Acoustical Society of America, 87(2), pp. 820–857.
  • De Krom, G., 1993. A cepstrum-based technique for determining a harmonics-to-noise ratio in speech signals. Journal of Speech and Hearing Research, 36(2), pp. 254–266.
  • Raitio, T. et al., 2013. HMM-based synthesis of creaky voice. In Proc. Interspeech.
  • Silén, H. et al., 2009. Parameterization of vocal fry in HMM-based speech synthesis. In Proc. Interspeech. Brighton, UK, pp. 1775–1778.
  • Zen, H., Toda, T., et al., 2007. Details of the Nitech HMM-based speech synthesis system for the Blizzard Challenge 2005. IEICE Transactions on Information and Systems, E90-D(1), pp. 325–333.
  • Zen, H., Nose, T., et al., 2007. The HMM-based speech synthesis system version 2.0. In Proc. ISCA SSW6. Bonn, Germany, pp. 294–299.

SLIDE 35

Samples

  • FF3_HTS-CDBK + Irreg-Rule
  • FF3_HTS-CDBK + Irreg-Rule
  • FF4_HTS-CDBK + Irreg-Rule
  • FF4_HTS-CDBK + Irreg-Rule
