
SLIDE 1

Modeling Prosody Pattern of Chinese Expressive Speech

Application in Personalized Speech Conversion

Zhang Zhang, Wu Zhiyong, Jia Jia, Cai Lianhong

SLIDE 2

Outline

Background
Introduction
Modeling prosody pattern of expressive speech
Application in personalized speech conversion
Experiments
Conclusion


SLIDE 3

Background

Neutral and expressive speech

– Neutral vs. expressive

Personality of different speakers uttering the same expressive speech

– Speaker 1 vs. Speaker 2

Related research

– Meng, F. et al., Synthesizing Expressive Speech to Convey Focus
– Li, K. et al., Automatic Lexical and Pitch Accent Detection
– Yang, H. et al., Modeling the Acoustic Correlates of Expressive Elements


SLIDE 4

Introduction

To model prosody pattern of expressive speech

– Focus on pitch, intensity, duration of the speech
– Identify the core and non-core syllables of a prosodic word
– Propose a double-layer perturbation model

To apply the model in personalized speech conversion

– Propose a two-step method to convert the speech


SLIDE 5

Modeling prosody pattern

Corpus: Text prompts

– Text prompts are extracted from the Hong Kong Tourism Board
– Each text prompt introduces the attractive features of a scenic spot
– 25 utterances in total: 120 phrases, 416 prosodic words and 1231 syllables

<Name of tourist spot> 太平山顶 (English: Victoria Peak)
<Descriptive text> 太平山顶是香港最受欢迎的名胜景点之一，登临其间，可俯瞰山下鳞次栉比的摩天高楼和享誉全球的维多利亚港景色。
(Victoria Peak is one of the most popular scenic spots in Hong Kong. From the top, you can overlook the row upon row of skyscrapers below and the world-famous Victoria Harbor.)


SLIDE 6

Modeling prosody pattern

Corpus: Expressivity annotation

– Adopt the PAD model (Mehrabian 1995) to describe the expressivity
– Use the A (arousal-nonarousal) descriptor to measure the expressive degree (e.g. superlative, comparative, etc.)
– A = 0.2, 0.4, 0.6, 0.8, 1.0
– 272 prosodic words with A > 0

Corpus: Contrastive speech recordings

– Four native Mandarin speakers (two males and two females)
– Record the text prompts twice: neutral and expressive
– 50 files of speech recordings for each speaker, saved in wav format (16 bit mono, sampled at 16 kHz)


SLIDE 7

Modeling prosody pattern

Acoustic features

– Mean F0
– F0 Range
– Duration
– RMS Energy

Acoustic measurements
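The four features named on this slide can be illustrated with a minimal Python sketch. It assumes a syllable is already available as a list of waveform samples plus a list of voiced-frame F0 values in Hz from some pitch tracker. The function name `syllable_features`, the dictionary keys, and the definition of F0 range as maximum minus minimum are illustrative assumptions, not the authors' measurement procedure.

```python
import math

def syllable_features(samples, sample_rate, f0_hz):
    """Return mean F0, F0 range, duration and RMS energy for one syllable.

    samples: waveform samples of the syllable
    sample_rate: samples per second (the corpus is sampled at 16 kHz)
    f0_hz: F0 values in Hz for voiced frames (hypothetical pitch-tracker output)
    """
    if not samples or not f0_hz:
        raise ValueError("need at least one sample and one voiced F0 value")
    mean_f0 = sum(f0_hz) / len(f0_hz)
    f0_range = max(f0_hz) - min(f0_hz)      # assumed definition: max minus min
    duration = len(samples) / sample_rate    # in seconds
    rms_energy = math.sqrt(sum(x * x for x in samples) / len(samples))
    return {
        "mean_f0": mean_f0,
        "f0_range": f0_range,
        "duration": duration,
        "rms_energy": rms_energy,
    }

# Toy input: 0.1 s of a 200 Hz sine at 16 kHz and a made-up F0 track.
sr = 16000
signal = [0.5 * math.sin(2 * math.pi * 200 * n / sr) for n in range(1600)]
feats = syllable_features(signal, sr, [198.0, 200.0, 202.0])
```

Running the same function on the neutral and the expressive recording of a syllable gives the pair of values that the later slides compare.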


SLIDE 8

Modeling prosody pattern

Classification of core and non-core syllables

– The acoustic variations (from neutral to expressive speech) of core syllables are more significant than those of non-core syllables.

Neutral and expressive speech

– Neutral vs. expressive

太平山顶是香港最受欢迎的名胜景点之一…… (Victoria Peak is one of the most popular scenic spots in Hong Kong……)

SLIDE 9

Modeling prosody pattern

Acoustic analysis of the core syllables

– Core syllables (272)
– R is positively correlated with A.

R             A=0.2   A=0.4   A=0.6   A=0.8   A=1.0
Mean F0       1.09    1.11    1.14    1.16    1.18
F0 Range      1.12    1.16    1.19    1.25    1.31
Duration      1.06    1.09    1.10    1.11    1.13
RMS Energy    1.20    1.38    1.54    1.71    1.94
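The positive correlation between R and A stated on this slide can be checked directly from the listed values. The sketch below computes a Pearson coefficient per feature. It assumes R is the expressive-to-neutral ratio of each feature, since the definition does not survive in the slide text, and the names `R_CORE` and `pearson` are illustrative.

```python
import math

A = [0.2, 0.4, 0.6, 0.8, 1.0]
# Core-syllable R values copied from the slide, one list per feature.
R_CORE = {
    "Mean F0":    [1.09, 1.11, 1.14, 1.16, 1.18],
    "F0 Range":   [1.12, 1.16, 1.19, 1.25, 1.31],
    "Duration":   [1.06, 1.09, 1.10, 1.11, 1.13],
    "RMS Energy": [1.20, 1.38, 1.54, 1.71, 1.94],
}

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

correlations = {name: pearson(A, values) for name, values in R_CORE.items()}
for name, r in correlations.items():
    print(f"{name}: r = {r:.3f}")
```

With only five A levels per feature, the coefficients describe the trend in the table rather than a statistical test.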


SLIDE 10

Modeling prosody pattern

Acoustic analysis (R) of the non-core syllables

– Non-core syllables (615)
– R is larger for core syllables than for non-core syllables
– R is negatively correlated with the distance to the core syllable

RMS Energy

Distance    A=0.2   A=0.4   A=0.6   A=0.8   A=1.0
Core        1.20    1.38    1.54    1.71    1.94
1           1.11    1.30    1.48    1.68    1.88
2           1.03    1.17    1.31    1.42    1.58
3           0.97    1.07    1.16    1.27    1.55
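The two claims on this slide (core above non-core, and R falling with distance) can be verified against the listed numbers. The sketch below assumes the unlabeled first row is the core syllable, treated as distance 0. Its values match the core-syllable RMS Energy ratios, so the table is taken to be RMS Energy. The names `R_BY_DISTANCE` and `decreases_with_distance` are illustrative.

```python
A_LEVELS = [0.2, 0.4, 0.6, 0.8, 1.0]
# Rows keyed by distance to the core syllable; 0 is the core syllable itself (assumption).
R_BY_DISTANCE = {
    0: [1.20, 1.38, 1.54, 1.71, 1.94],
    1: [1.11, 1.30, 1.48, 1.68, 1.88],
    2: [1.03, 1.17, 1.31, 1.42, 1.58],
    3: [0.97, 1.07, 1.16, 1.27, 1.55],
}

def decreases_with_distance(table):
    """True if, at every A level, R strictly drops as the distance grows."""
    distances = sorted(table)
    return all(
        table[near][i] > table[far][i]
        for near, far in zip(distances, distances[1:])
        for i in range(len(table[near]))
    )

monotonic = decreases_with_distance(R_BY_DISTANCE)
print("R decreases with distance at every A level:", monotonic)
```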


SLIDE 11

Modeling prosody pattern

Acoustic analysis (R) of the non-core syllables

Mean F0

Distance    A=0.2   A=0.4   A=0.6   A=0.8   A=1.0
Core        1.09    1.11    1.14    1.16    1.18
1           1.07    1.08    1.08    1.09    1.09
2           1.04    1.05    1.05    1.06    1.06
3           1.02    1.02    1.03    1.03    1.04

F0 Range

Distance    A=0.2   A=0.4   A=0.6   A=0.8   A=1.0
Core        1.12    1.16    1.19    1.25    1.31
1           1.02    1.06    1.09    1.05    1.07
2           1.01    1.02    1.03    1.04    1.05
3           0.97    0.98    1.01    1.02    1.06

Duration

Distance    A=0.2   A=0.4   A=0.6   A=0.8   A=1.0
Core        1.06    1.09    1.10    1.11    1.18
1           1.05    1.06    1.06    1.07    1.07
2           1.01    1.01    0.99    0.99    0.99
3           0.93    0.92    0.90    0.89    0.88

SLIDE 12

Modeling prosody pattern

Double-layer perturbation model

– Core syllable
– Non-core syllable
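The model's formulas do not survive in the slide text, so the following is only one plausible reading, sketched in Python: the expressive value of a feature is the neutral value multiplied by a ratio R, chosen by layer (core or non-core) and, for non-core syllables, by distance to the core syllable. The function `perturb_word`, the clamping of distances beyond the table, and the neutral F0 values are all hypothetical. The ratios are the Mean F0 values for A = 0.6 from the preceding analysis.

```python
# Assumed form of the model: expressive = neutral * R(layer, distance).
# Ratios for Mean F0 at A = 0.6; key 0 is the core syllable.
MEAN_F0_RATIO_AT_A06 = {0: 1.14, 1: 1.08, 2: 1.05, 3: 1.03}

def perturb_word(neutral_values, core_index, ratio_by_distance):
    """Scale each syllable's neutral value by the ratio for its distance to the core."""
    max_distance = max(ratio_by_distance)
    expressive = []
    for i, value in enumerate(neutral_values):
        # Distances beyond the table reuse the last row (an assumption).
        distance = min(abs(i - core_index), max_distance)
        expressive.append(value * ratio_by_distance[distance])
    return expressive

# Hypothetical neutral mean F0 (Hz) of a four-syllable prosodic word, core at index 2.
neutral_f0 = [200.0, 210.0, 220.0, 205.0]
expressive_f0 = perturb_word(neutral_f0, core_index=2,
                             ratio_by_distance=MEAN_F0_RATIO_AT_A06)
```

Keeping the ratios in a plain mapping keyed by distance makes it easy to swap in the table for another feature or another A level.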


SLIDE 13

Application in personalized speech conversion

Acoustic analysis of different speakers

– Core syllables


SLIDE 14

Application in personalized speech conversion

Acoustic analysis of different speakers

– Non-core syllables


SLIDE 15

Application in personalized speech conversion

To generate speech with target speaker’s prosody characteristics

– Step 1: to convert neutral speech from speaker s to t
– Step 2: to generate expressive speech for speaker t

Acoustic features of the target expressive speech
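The two steps can be sketched for a single feature of a single syllable. The slide's own formula is not preserved, so the form of step 1 used here (rescaling by the ratio of the two speakers' neutral averages) is an assumption, as are the function name `convert` and all numbers. Step 2 multiplies by speaker t's expressive-to-neutral ratio R.

```python
def convert(neutral_s, mean_neutral_s, mean_neutral_t, ratio_t):
    """Two-step sketch for one feature value of one syllable.

    Step 1 (assumed form): rescale the source speaker's neutral value by the
    ratio of the speakers' neutral averages to approximate speaker t.
    Step 2: apply speaker t's expressive/neutral ratio R.
    """
    neutral_t = neutral_s * (mean_neutral_t / mean_neutral_s)  # step 1
    expressive_t = neutral_t * ratio_t                         # step 2
    return neutral_t, expressive_t

# Hypothetical values: source syllable at 120 Hz, neutral averages of
# 110 Hz (speaker s) and 220 Hz (speaker t), target ratio R = 1.14.
neutral_t, expressive_t = convert(120.0, 110.0, 220.0, 1.14)
```

Separating the steps keeps the speaker mapping independent of the expressive perturbation, so a different target speaker only needs a different average and ratio table.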


SLIDE 16

Experiment 1

10 text phrases were randomly selected from our corpus
3 files were designed for each phrase

– a) The neutral speech recording of a speaker
– b) The expressive speech recording of the same speaker
– c) The transformed speech from a)

15 native Mandarin speakers were invited as subjects

– To listen to files played in the order of a)-b)-c)-a)-b)-c)
– To judge whether c) sounds similar to its counterpart b)
– To give a MOS score from 1 to 5 indicating the level of similarity between c) and b)
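The listening-test scores from this protocol could be tallied as in the short Python sketch below. The function name `summarize` and the judgments are hypothetical and are not the study's data. The sketch only shows how a mean opinion score and the share of judgments favoring b) are computed.

```python
def summarize(ratings, closer_to_b):
    """Return (mean MOS, share of judgments where c) was closer to b) than to a)).

    ratings: MOS scores on the 1-5 scale, one per judgment
    closer_to_b: one boolean per judgment
    """
    if not ratings or not closer_to_b:
        raise ValueError("need at least one judgment")
    if any(not 1 <= r <= 5 for r in ratings):
        raise ValueError("MOS scores must lie between 1 and 5")
    mos = sum(ratings) / len(ratings)
    share = sum(closer_to_b) / len(closer_to_b)
    return mos, share

# Hypothetical judgments from five listeners for one phrase.
mos, share = summarize([4, 4, 3, 5, 4], [True, True, False, True, True])
```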


SLIDE 17

Experiment 1

Average score is 3.94
About 90% of the c) files are more similar to b) than to a)


SLIDE 18

Experiment 2

Another 10 phrases were selected from our corpus
4 files were designed for each phrase

– d) the expressive speech recording of speaker 1
– e) the expressive speech recording of speaker 2
– f) the transformed speech from NEU using speaker 1’s model
– g) the transformed speech from NEU using speaker 2’s model
– NEU) the neutral speech recording of speaker 3

15 native Mandarin speakers were invited as subjects

– To listen to the files in the order of d)-e)-x), where x) might be f) or g)
– To judge whether x) is imitating d) or e)


SLIDE 19

Experiment 2

Results

Speaker i    Accuracy
Speaker 1    74.6%
Speaker 2    72.7%

– The proposed model can reflect the personalized features of the prosody patterns of different speakers.
– The proposed method for personalized speech conversion is able to achieve good performance.


SLIDE 20

Conclusions

Proposed a double-layer perturbation model for modeling the prosody patterns of expressive speech

– Identify the core syllable and non-core syllable
– Use the Mean F0, F0 range, duration and RMS energy

Applied the above model in personalized speech conversion

– Propose a two-step method for generating personalized prosody patterns

