Modeling Prosody Pattern of Chinese Expressive Speech: Application in Personalized Speech Conversion

SLIDE 1

Modeling Prosody Pattern of Chinese Expressive Speech

Application in Personalized Speech Conversion

Zhang Zhang, Wu Zhiyong, Jia Jia, Cai Lianhong

SLIDE 2

Outline

  • Background
  • Introduction
  • Modeling prosody pattern of expressive speech
  • Application in personalized speech conversion
  • Experiments
  • Conclusion

SLIDE 3

Background

  • Neutral and expressive speech

– Neutral vs. expressive

  • Personality of different speakers uttering the same expressive speech

– Speaker 1 vs. Speaker 2

  • Related research

– Meng, F. et al.: Synthesizing Expressive Speech to Convey Focus
– Li, K. et al.: Automatic Lexical and Pitch Accent Detection
– Yang, H. et al.: Modeling the Acoustic Correlates of Expressive Elements

SLIDE 4

Introduction

  • To model prosody pattern of expressive speech

– Focus on pitch, intensity, duration of the speech
– Identify the core and non-core syllables of a prosodic word
– Propose a double-layer perturbation model

  • To apply the model in personalized speech conversion

– Propose a two-step method to convert the speech

SLIDE 5

Modeling prosody pattern

  • Corpus: text prompts

– Text prompts are extracted from the Hong Kong Tourism Board
– Each text prompt introduces the attractive features of a scenic spot
– 25 utterances in total

  • 120 phrases, 416 prosodic words and 1231 syllables

<Name of tourist spot> 太平山顶 (English: Victoria Peak)
<Descriptive text> 太平山顶是香港最受欢迎的名胜景点之一,登临其间,可俯瞰山下鳞次栉比的摩天高楼和享誉全球的维多利亚港景色。
(Victoria Peak is the most popular scenic spot in Hong Kong. When you climb up, you can overlook the row upon row of skyscrapers and the world-famous Victoria Harbor.)

SLIDE 6

Modeling prosody pattern

  • Corpus: expressivity annotation

– Adopt the PAD model (Mehrabian 1995) to describe the expressivity
– Use the A (arousal-nonarousal) descriptor to measure the expressive degree (e.g. superlative, comparative, etc.)
– A = 0.2, 0.4, 0.6, 0.8, 1.0
– 272 prosodic words with A > 0

  • Corpus: contrastive speech recordings

– Four native Mandarin speakers (two males and two females)
– Record the text prompts twice: neutral and expressive
– 50 files of speech recordings for each speaker
– Saved in WAV format (16-bit mono, sampled at 16 kHz)

SLIDE 7

Modeling prosody pattern

  • Acoustic features

– Mean F0
– F0 Range
– Duration
– RMS Energy

  • Acoustic measurements
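
The four acoustic features listed on this slide could be computed per syllable roughly as in the sketch below. This is an illustration only, not the authors' measurement procedure: the function names, the convention that unvoiced frames carry F0 = 0, and the reading of R as an expressive-to-neutral ratio are all assumptions. The 16 kHz default matches the recording format stated for the corpus.

```python
import math


def acoustic_features(f0_hz, samples, sample_rate=16000):
    """Return mean F0, F0 range, duration and RMS energy for one syllable.

    f0_hz:   per-frame F0 values in Hz; 0 marks an unvoiced frame (assumed convention).
    samples: waveform samples of the syllable as floats.
    """
    voiced = [f for f in f0_hz if f > 0]
    if not voiced or not samples:
        raise ValueError("need at least one voiced frame and one sample")
    return {
        "mean_f0": sum(voiced) / len(voiced),
        "f0_range": max(voiced) - min(voiced),
        "duration": len(samples) / sample_rate,
        "rms_energy": math.sqrt(sum(s * s for s in samples) / len(samples)),
    }


def ratio(expressive, neutral):
    """Expressive-to-neutral ratio per feature (an assumed reading of R).

    Raises ZeroDivisionError if a neutral feature is 0 (e.g. a flat F0 contour).
    """
    return {name: expressive[name] / neutral[name] for name in neutral}


# Toy inputs (hypothetical, not corpus data): 0.2 s of a square-like signal.
neutral = acoustic_features([0, 200, 220, 210], [0.1, -0.1] * 1600)
expressive = acoustic_features([0, 220, 250, 235], [0.2, -0.2] * 1600)
r = ratio(expressive, neutral)
```

A ratio above 1 for a feature would mean that the expressive rendition raises that feature relative to the neutral one.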

SLIDE 8

Modeling prosody pattern

  • Classification of core and non-core syllables

– The acoustic variations (from neutral to expressive speech) of core syllables are more significant than those of non-core syllables.

  • Neutral and expressive speech

– Neutral vs. expressive

太平山顶是香港最受欢迎的名胜景点之一…… (Victoria Peak is the most popular scenic spot in Hong Kong……)

SLIDE 9

Modeling prosody pattern

  • Acoustic analysis of the core syllables

– Core syllables (272)
– R is positively correlated with A.

R for each acoustic feature (rows) at each A level (columns):

A            0.2   0.4   0.6   0.8   1.0
Mean F0      1.09  1.11  1.14  1.16  1.18
F0 Range     1.12  1.16  1.19  1.25  1.31
Duration     1.06  1.09  1.10  1.11  1.13
RMS Energy   1.20  1.38  1.54  1.71  1.94
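
The claim that R rises with A for the core syllables can be checked directly against the values on this slide. The sketch below computes a Pearson correlation between the five A levels and the R values of each feature; the helper `pearson` and the dictionary layout are illustrative choices, and the slide does not say which statistic the authors used.

```python
import math

A_LEVELS = [0.2, 0.4, 0.6, 0.8, 1.0]

# R values for core syllables, copied from the slide.
CORE_R = {
    "mean_f0": [1.09, 1.11, 1.14, 1.16, 1.18],
    "f0_range": [1.12, 1.16, 1.19, 1.25, 1.31],
    "duration": [1.06, 1.09, 1.10, 1.11, 1.13],
    "rms_energy": [1.20, 1.38, 1.54, 1.71, 1.94],
}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)


correlations = {name: pearson(A_LEVELS, values) for name, values in CORE_R.items()}
for name, value in correlations.items():
    print(f"{name}: {value:.3f}")
```

With only five points per feature this is a descriptive check rather than a significance test.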

SLIDE 10

Modeling prosody pattern

  • Acoustic analysis (R) of the non-core syllables

– Non-core syllables (615)
– R is larger for core syllables than for non-core syllables
– R is negatively correlated with the distance to the core syllable

RMS Energy: R by distance to the core syllable (rows) and A (columns)

Distance   0.2   0.4   0.6   0.8   1.0
0 (core)   1.20  1.38  1.54  1.71  1.94
1          1.11  1.30  1.48  1.68  1.88
2          1.03  1.17  1.31  1.42  1.58
3          0.97  1.07  1.16  1.27  1.55

SLIDE 11

Modeling prosody pattern

  • Acoustic analysis (R) of the non-core syllables

Mean F0: R by distance to the core syllable (rows) and A (columns)

Distance   0.2   0.4   0.6   0.8   1.0
0 (core)   1.09  1.11  1.14  1.16  1.18
1          1.07  1.08  1.08  1.09  1.09
2          1.04  1.05  1.05  1.06  1.06
3          1.02  1.02  1.03  1.03  1.04

F0 Range: R by distance to the core syllable (rows) and A (columns)

Distance   0.2   0.4   0.6   0.8   1.0
0 (core)   1.12  1.16  1.19  1.25  1.31
1          1.02  1.06  1.09  1.05  1.07
2          1.01  1.02  1.03  1.04  1.05
3          0.97  0.98  1.01  1.02  1.06

Duration: R by distance to the core syllable (rows) and A (columns)

Distance   0.2   0.4   0.6   0.8   1.0
0 (core)   1.06  1.09  1.10  1.11  1.18
1          1.05  1.06  1.06  1.07  1.07
2          1.01  1.01  0.99  0.99  0.99
3          0.93  0.92  0.90  0.89  0.88

SLIDE 12

Modeling prosody pattern

  • Double-layer perturbation model

– Core syllable
– Non-core syllable
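
The model's equations are not reproduced in this transcript, so the following is only a rough sketch of how a two-layer perturbation might be applied. It assumes that R acts as a multiplicative factor on the neutral feature value, that the first layer handles the core syllable (distance 0) and the second layer handles non-core syllables by their distance to it, and that distances beyond 3 reuse the distance-3 row. The table is the non-core R table whose first row matches the core-syllable RMS Energy values; the function names are hypothetical.

```python
A_LEVELS = (0.2, 0.4, 0.6, 0.8, 1.0)

# R by distance to the core syllable (0 = core syllable), values from the slides.
RMS_ENERGY_R = {
    0: (1.20, 1.38, 1.54, 1.71, 1.94),
    1: (1.11, 1.30, 1.48, 1.68, 1.88),
    2: (1.03, 1.17, 1.31, 1.42, 1.58),
    3: (0.97, 1.07, 1.16, 1.27, 1.55),
}


def perturbation_ratio(a, distance):
    """Look up R for an annotated A level and a distance to the core syllable."""
    col = round(a * 5) - 1
    if not 0 <= col < len(A_LEVELS) or abs(A_LEVELS[col] - a) > 1e-9:
        raise ValueError(f"A must be one of {A_LEVELS}, got {a}")
    if distance not in RMS_ENERGY_R:
        raise ValueError(f"distance must be 0..3, got {distance}")
    return RMS_ENERGY_R[distance][col]


def apply_double_layer(neutral_values, core_index, a):
    """Scale each syllable's neutral value by R(A, distance to the core syllable)."""
    expressive = []
    for index, value in enumerate(neutral_values):
        # Assumption: syllables further than 3 away reuse the distance-3 row.
        distance = min(abs(index - core_index), 3)
        expressive.append(value * perturbation_ratio(a, distance))
    return expressive


# Hypothetical three-syllable prosodic word, core syllable in the middle, A = 0.6.
result = apply_double_layer([1.0, 1.0, 1.0], core_index=1, a=0.6)
```

The same lookup structure would apply to Mean F0, F0 Range and Duration by swapping in the corresponding table.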

SLIDE 13

Application in personalized speech conversion

  • Acoustic analysis of different speakers

– Core syllables

SLIDE 14

Application in personalized speech conversion

  • Acoustic analysis of different speakers

– Non-core syllables

SLIDE 15

Application in personalized speech conversion

  • To generate speech with target speaker’s prosody characteristics

– Step 1: to convert neutral speech from speaker s to t
– Step 2: to generate expressive speech for speaker t

  • Acoustic features of the target expressive speech
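
The formula for the target features is not preserved in this transcript. As a purely illustrative sketch of the two steps, the code below assumes that step 1 rescales a neutral feature by the ratio of the two speakers' neutral averages and that step 2 multiplies by the target speaker's own ratio R. Both mappings, and all function and parameter names, are hypothetical.

```python
def convert_neutral(value_s, neutral_mean_s, neutral_mean_t):
    """Step 1 (assumed form): map a neutral feature of speaker s onto speaker t."""
    return value_s * neutral_mean_t / neutral_mean_s


def make_expressive(value_t_neutral, r_t):
    """Step 2 (assumed form): apply speaker t's expressive-to-neutral ratio R."""
    return value_t_neutral * r_t


def two_step_conversion(value_s, neutral_mean_s, neutral_mean_t, r_t):
    """Chain both steps to estimate speaker t's expressive feature value."""
    neutral_t = convert_neutral(value_s, neutral_mean_s, neutral_mean_t)
    return make_expressive(neutral_t, r_t)


# Hypothetical numbers: a 200 Hz mean F0 from speaker s, neutral speaker
# averages of 200 Hz (s) and 250 Hz (t), and R = 1.2 for speaker t.
target_f0 = two_step_conversion(200.0, 200.0, 250.0, 1.2)
```

Keeping the two steps as separate functions mirrors the slide's split between speaker conversion and expressivity generation, so either mapping can be replaced independently.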

SLIDE 16

Experiment 1

  • 10 text phrases were randomly selected from our corpus
  • 3 files were designed for each phrase

– a) The neutral speech recording of a speaker
– b) The expressive speech recording of the same speaker
– c) The transformed speech from a)

  • 15 native Mandarin speakers were invited as subjects

– To listen to files played in the order of a)-b)-c)-a)-b)-c)
– To judge whether c) sounds similar to its counterpart b)
– To give a MOS score from 1 to 5 indicating the level of similarity between c) and b)

SLIDE 17

Experiment 1

  • Average score is 3.94
  • In about 90% of cases, file c) is judged more similar to b) than to a)

SLIDE 18

Experiment 2

  • Another 10 phrases were selected from our corpus
  • 4 files were designed for each phrase

– d) the expressive speech recording of speaker 1
– e) the expressive speech recording of speaker 2
– f) the transformed speech from NEU using speaker 1’s model
– g) the transformed speech from NEU using speaker 2’s model
– NEU) the neutral speech recording of speaker 3

  • 15 native Mandarin speakers were invited as subjects

– To listen to the files in the order of d)-e)-x), where x) might be f) or g)
– To judge whether x) is imitating d) or e)

SLIDE 19

Experiment 2

  • Results

– The proposed model can reflect the personalized features of the prosody patterns of different speakers.
– The proposed method for personalized speech conversion is able to achieve good performance.

Speaker i    Accuracy
Speaker 1    74.6%
Speaker 2    72.7%

SLIDE 20

Conclusions

  • Proposed a double-layer perturbation model for modeling the prosody patterns of expressive speech

– Identify the core syllable and non-core syllable
– Use the Mean F0, F0 range, duration and RMS energy

  • Applied the above model in personalized speech conversion

– Propose a two-step method for generating personalized prosody patterns
