SLIDE 1

A perceptual investigation of wavelet-based decomposition of f0 for text-to-speech synthesis

M. Sam Ribeiro, Junichi Yamagishi, Robert A. J. Clark

School of Informatics, The University of Edinburgh
m.f.s.ribeiro@sms.ed.ac.uk
8 September 2015

1 / 29

SLIDE 2

Overview

Introduction
Motivation
Hypotheses
Experiments
Experiment 2
Experiment 3
Discussion
Summary
References

2 / 29

SLIDE 3

Introduction

Wavelets in Speech Processing [Farouk, 2014]
Annotation of prominence [Vainio et al, 2013]
Pre-processing step for f0 modeling [Suni et al, 2013], [Ribeiro and Clark, 2015]
Voice Conversion [Sanchez et al, 2014]
Modeling of f0
The general conclusion is that signal decomposition is beneficial for f0 modeling
But it is assumed that all components are equally relevant to the reconstructed signal

The individual importance of each wavelet scale to the overall signal is not fully understood

3 / 29

SLIDE 5

The Continuous Wavelet Transform

The CWT decomposes an input signal into various scales of selected frequency.
10-scale decomposition
Each scale approximately 1 octave apart.

f0 reconstruction:
[Suni et al, 2013]

$$f_0(x) = \sum_{i=1}^{10} C_i(x)\,(i + 2.5)^{-5/2}$$
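The reconstruction sum above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `reconstruct_f0`, the list-of-lists layout (one sequence per scale, `scales[0]` holding C_1) and the toy input are assumptions made for the example.

```python
def reconstruct_f0(scales):
    """Sum the scales C_i(x), each multiplied by (i + 2.5) ** (-5/2).

    `scales` is assumed to be a list of 10 equal-length sequences,
    where scales[0] holds C_1 and scales[9] holds C_10.
    """
    n_frames = len(scales[0])
    f0 = [0.0] * n_frames
    for i, c_i in enumerate(scales, start=1):
        factor = (i + 2.5) ** -2.5  # scale-dependent factor from the formula
        for x in range(n_frames):
            f0[x] += c_i[x] * factor
    return f0


# Toy input (hypothetical): 10 scales, 3 frames, every coefficient equal to 1.0.
toy_scales = [[1.0, 1.0, 1.0] for _ in range(10)]
toy_f0 = reconstruct_f0(toy_scales)
```

With all coefficients set to 1.0, every frame of `toy_f0` equals the sum of the ten factors, which makes the relative contribution of each scale index easy to inspect.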

5 / 29

SLIDE 6

Hypotheses

Middle frequencies (scales 5-8) are associated with higher levels of naturalness
Low frequencies (scales 1-4) don’t contain much information and are comparable to HMM-generated f0
High frequencies (scales 9-10) are mostly noise and do not contribute much to the perceived naturalness

6 / 29

SLIDE 7

Conditions and Reconstruction

Experimental conditions

Condition  Description                                                Freq. (Hz)
natural    Vocoded speech using natural parameters
all        All f0 frequencies.                                        0.1-50
1-2        Low frequencies. Scales indexed at 1 and 2.                0.1-0.2
3-4        Low frequencies. Scales indexed at 3 and 4.                0.4-0.8
1-4        All low frequencies. Scales indexed at 1, 2, 3, and 4.     0.1-0.8
5-6        Middle frequencies. Scales indexed at 5 and 6.             1.6-3.2
7-8        Middle frequencies. Scales indexed at 7 and 8.             6.3-13
5-8        All middle frequencies. Scales indexed at 5, 6, 7, and 8.  1.6-13
9-10       High frequencies. Scales indexed at 9 and 10.              25-50
MSD-HMM    f0 signal predicted from an MSD-HMM.

Table: Experimental conditions with approximate CWT frequency ranges.

F0 reconstruction

$$f_0(x) = \sum_{i=1}^{10} w_i\, C_i(x)\,(i + 2.5)^{-5/2}$$

where $w_i \in \{0, 1\}$ is the weight given to scale $i$.
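As a sketch of how the binary weights select scales for each wavelet-based condition listed above, the following Python maps a condition name to its weight vector and applies the weighted reconstruction. The names `CONDITION_SCALES`, `condition_weights` and `reconstruct_condition` are hypothetical, and the data layout (one sequence per scale) is an assumption for illustration.

```python
# Scale indices kept (w_i = 1) for each wavelet-based condition.
# The natural and MSD-HMM conditions do not use this reconstruction.
CONDITION_SCALES = {
    "all": range(1, 11),
    "1-2": (1, 2),
    "3-4": (3, 4),
    "1-4": (1, 2, 3, 4),
    "5-6": (5, 6),
    "7-8": (7, 8),
    "5-8": (5, 6, 7, 8),
    "9-10": (9, 10),
}


def condition_weights(condition):
    """Return the 10 binary weights w_1..w_10 for a condition name."""
    kept = set(CONDITION_SCALES[condition])
    return [1 if i in kept else 0 for i in range(1, 11)]


def reconstruct_condition(scales, condition):
    """Weighted sum of w_i * C_i(x) * (i + 2.5) ** (-5/2) over 10 scales."""
    weights = condition_weights(condition)
    n_frames = len(scales[0])
    f0 = [0.0] * n_frames
    for i, (w_i, c_i) in enumerate(zip(weights, scales), start=1):
        factor = (i + 2.5) ** -2.5
        for x in range(n_frames):
            f0[x] += w_i * c_i[x] * factor
    return f0


# Toy input (hypothetical): every coefficient equal to 1.0, 2 frames.
toy_scales = [[1.0, 1.0] for _ in range(10)]
mid_only = reconstruct_condition(toy_scales, "5-8")
```

Because the weights are binary, the "1-4", "5-8" and "9-10" reconstructions add up to the "all" reconstruction.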

7 / 29

SLIDE 8

Perceptual Experiments

Experiment 1 - Prominence Detection
Participants are asked to judge which word appears more prominent in an utterance.
Experiment 2 - Similarity Experiment
Participants are asked to judge whether utterances are similar in naturalness.
Experiment 3 - MUSHRA Test
Participants are asked to rate an utterance on a 100-point scale with respect to a reference and all other conditions.
Experiment 4 - MOS Test
Participants are asked to rate an utterance on a 5-point scale with no references.

8 / 29

SLIDE 13

Experiment 2 - Similarity

Data
Expressive audiobook data
mel-cepstral, aperiodicity, voicing parameters
HMM system trained on roughly 5000 utterances
duration is force-aligned
natural (vocoded) condition uses original parameters
remaining conditions use natural f0 processed accordingly
Design
20 utterances synthesized for each of the 10 conditions
10 native listeners, each rating 144 utterance pairs
Each pair consists of different utterances and different conditions
No repetitions (utterance or condition) within any three consecutive pairs
Participants asked to judge if the pair is similar or different in terms of naturalness

13 / 29

SLIDE 14

Experiment 2 - Similarity

45 distinct condition pairs, each pair judged at least 32 times
Create a 10x10 dissimilarity matrix and embed it into a 2-dimensional space with MDS
Kruskal’s normalized stress-1 value of 0.086
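A rough sketch of the two quantities on this slide follows. The slides do not say how dissimilarity was derived from the similar/different judgments, so taking the proportion of "different" responses per condition pair is an assumption. The stress function uses the raw dissimilarities as disparities, which is a simplification of non-metric MDS. The MDS embedding itself is left to an MDS implementation and is not shown, and all function names are hypothetical.

```python
from itertools import combinations
from math import dist, sqrt

CONDITIONS = ["natural", "all", "1-2", "3-4", "1-4",
              "5-6", "7-8", "5-8", "9-10", "MSD-HMM"]


def dissimilarity_matrix(judgments, conditions=CONDITIONS):
    """Build a symmetric matrix from (cond_a, cond_b, is_different) tuples.

    Assumption: dissimilarity = fraction of 'different' judgments for the pair.
    """
    index = {c: k for k, c in enumerate(conditions)}
    n = len(conditions)
    different = [[0] * n for _ in range(n)]
    total = [[0] * n for _ in range(n)]
    for a, b, is_different in judgments:
        i, j = index[a], index[b]
        for p, q in ((i, j), (j, i)):  # fill both halves of the matrix
            total[p][q] += 1
            different[p][q] += int(is_different)
    return [[different[i][j] / total[i][j] if total[i][j] else 0.0
             for j in range(n)] for i in range(n)]


def stress1(dissim, points):
    """Kruskal's stress-1 of an embedding, with raw dissimilarities
    standing in for the disparities (a simplification)."""
    num = den = 0.0
    for i, j in combinations(range(len(points)), 2):
        d = dist(points[i], points[j])  # distance in the embedded space
        num += (d - dissim[i][j]) ** 2
        den += d ** 2
    return sqrt(num / den)


# 10 conditions give 45 unordered condition pairs.
n_pairs = len(list(combinations(CONDITIONS, 2)))
```

The counts on the slides are consistent with this setup: 10 listeners times 144 pairs gives 1440 judgments, which is 32 judgments for each of the 45 condition pairs.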

14 / 29

SLIDE 15

Experiment 2 - Similarity

Listeners naturally clustered low, middle, and high frequencies
The all-frequencies condition seems to be similar to the middle frequencies
It is also farther from natural speech than the middle frequencies
Listeners tend to prefer the CWT middle frequencies
These have been previously associated with the word level

15 / 29

SLIDE 17

Experiment 3 - MUSHRA

Data
Expressive audiobook data (same as Similarity Experiment).
Design
Ask participants to judge all conditions simultaneously (1 to 100).
Reference is given as the natural condition.
10 participants rate 20 sets of 10 stimuli.
From the 200 expected sets, 48 were discarded as the hidden reference was not judged as natural.

152 sets were used for analysis.
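The screening step can be sketched as a simple filter over rating sets. The slides do not state the exact criterion for "judged as natural", so the `min_reference_score` threshold below is a hypothetical parameter, as are the function name and the dictionary layout of each set.

```python
def screen_mushra_sets(rating_sets, reference="natural", min_reference_score=100):
    """Keep only the sets whose hidden reference was rated at or above a threshold.

    Each set is assumed to be a dict mapping condition name to a 1-100 rating.
    The threshold is an assumption; the slides do not give the criterion.
    """
    return [s for s in rating_sets if s[reference] >= min_reference_score]


# Toy data (hypothetical): the second set fails the hidden-reference check.
toy_sets = [
    {"natural": 100, "5-8": 70, "9-10": 20},
    {"natural": 60, "5-8": 80, "9-10": 30},
    {"natural": 100, "5-8": 65, "9-10": 10},
]
kept_sets = screen_mushra_sets(toy_sets)
```

In the experiment reported here, 10 participants times 20 sets gives the 200 expected sets, and discarding 48 leaves the 152 used for analysis.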

17 / 29

SLIDE 18

Experiment 3 - MUSHRA

18 / 29

SLIDE 19

Main Conclusions

Mid-frequencies
Consistently achieve better results
Naturalness is almost comparable to all frequencies
Have been associated previously with the word-level [Suni et al, 2013], [Ribeiro and Clark, 2015]

Low-frequencies
Comparable to HMM generated f0 (Prominence, MUSHRA, MOS tests)

Although not really the same (similarity test)
Previously associated with phrase and utterance levels
High-frequencies
Consistently judged the most unnatural condition
Not really relevant to naturalness
Previously associated with the phone-level

19 / 29

SLIDE 20

Earlier assumptions

Earlier assumptions [Suni et al, 2013], [Ribeiro and Clark, 2015]

1. All wavelet components are equally relevant to the reconstructed signal
2. The association of wavelet components to linguistic levels is meaningful

First assumption shown not to be true.
Middle frequencies carry most of the information
Low and high frequencies not so relevant
How about their association with linguistic levels?

20 / 29

SLIDE 21

Unit and Peak Rates

Compute unit and peak rates at utterance level for 5000 utterances
Count units and peaks (local maxima) and divide by utterance duration in seconds
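The rate computation described above can be sketched as follows. This is a minimal illustration under assumptions: a peak is taken to be a sample strictly greater than both neighbours (so flat plateaus are not counted), and the function names and toy signal are hypothetical.

```python
def count_peaks(signal):
    """Count strict local maxima: samples larger than both neighbours."""
    return sum(
        1
        for k in range(1, len(signal) - 1)
        if signal[k - 1] < signal[k] > signal[k + 1]
    )


def rate_per_second(count, duration_s):
    """Divide a count (units or peaks) by the utterance duration in seconds."""
    return count / duration_s


# Toy example (hypothetical): one wavelet scale over a 2-second utterance.
toy_scale = [0.0, 1.0, 0.0, 2.0, 0.0, 1.0, 1.0, 0.0]
peak_rate = rate_per_second(count_peaks(toy_scale), 2.0)

# A unit rate works the same way, e.g. 6 syllables in 2 seconds.
syllable_rate = rate_per_second(6, 2.0)
```

Comparing the peak rate of a scale with the unit rate of a linguistic level (for example syllables or words per second) is one way to check how well the two line up.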

21 / 29

SLIDE 22

Summary

Main Findings
Wavelet components do not carry equal weights for the f0 signal

Middle frequencies convey most of the information
HMM-generated f0 is somewhat similar to low-frequencies
Association with linguistic levels is not very good
Speech Samples
http://homepages.inf.ed.ac.uk/s1250520/samples/interspeech15.html
Future Work
Associate each scale with meaningful linguistic levels
Use middle frequencies to learn relevant syllable and word-level features

22 / 29

SLIDE 23

A perceptual investigation of wavelet-based decomposition of f0 for text-to-speech synthesis

Thank you for listening

M. Sam Ribeiro, Junichi Yamagishi, Robert A. J. Clark

School of Informatics The University of Edinburgh m.f.s.ribeiro@sms.ed.ac.uk

8 September 2015

23 / 29

SLIDE 24

Extra Slides

24 / 29

SLIDE 26

Experiment 1 - Prominence

Data
Same sentence uttered in different contexts.
This encourages different prosody.
10 different utterances in 4 contexts. Total of 40 utterances

stimulus              response
...                   John won at Mary’s.
Paul won at Mary’s.   John won at Mary’s.
John lost at Mary’s.  John won at Mary’s.
John won at Kate’s.   John won at Mary’s.

26 / 29

SLIDE 27

Experiment 1 - Prominence

Experiment
Ask listeners to judge which word appears more prominent
Measure accuracy for each condition
Determine correctness from natural condition [Cole et al, 2010]
Hypothesis
Mid-frequencies contain most of the prominence-related information
High frequency and low frequency conditions will achieve lower accuracy

Data Preparation
copy-synthesis (STRAIGHT)
natural condition uses all original parameters
for remaining conditions:
mcep/bap/duration used from neutral stimulus
f0 extracted from each context
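The accuracy measure in the Experiment list above can be sketched as below. The slides only say that correctness is determined from the natural condition, so using the most frequent listener response in the natural condition as the reference word is an assumption, and the function names and response layout are hypothetical.

```python
from collections import Counter


def reference_labels(responses):
    """Most frequent word chosen per utterance in the natural condition.

    `responses` is assumed to be a list of (utterance, condition, word) tuples.
    """
    votes = {}
    for utt, cond, word in responses:
        if cond == "natural":
            votes.setdefault(utt, Counter())[word] += 1
    return {utt: counter.most_common(1)[0][0] for utt, counter in votes.items()}


def accuracy_by_condition(responses):
    """Fraction of responses per condition that match the natural reference."""
    refs = reference_labels(responses)
    correct, total = Counter(), Counter()
    for utt, cond, word in responses:
        if utt not in refs:
            continue  # no natural-condition judgments for this utterance
        total[cond] += 1
        correct[cond] += int(word == refs[utt])
    return {cond: correct[cond] / total[cond] for cond in total}


# Toy responses (hypothetical) for a single utterance.
toy_responses = [
    ("u1", "natural", "John"), ("u1", "natural", "John"), ("u1", "natural", "won"),
    ("u1", "5-8", "John"),
    ("u1", "9-10", "won"), ("u1", "9-10", "John"),
]
toy_accuracy = accuracy_by_condition(toy_responses)
```

Under this definition, the accuracy of the natural condition itself measures how often listeners agree with the majority response.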

27 / 29

SLIDE 28

Experiment 1 - Prominence

Experiment details
25 participants
Each utterance judged at least 5 times per condition
Each condition with roughly 200 judgments

28 / 29

SLIDE 29

References

[Ribeiro and Clark, 2015] Ribeiro, M. S., & Clark, R. (2015). A Multi-Level Representation of f0 using the Continuous Wavelet Transform and the Discrete Cosine Transform. Proc. ICASSP 2015.

[Suni et al, 2013] Suni, A. S., Aalto, D., Raitio, T., Alku, P., & Vainio, M. (2013). Wavelets for intonation modeling in HMM speech synthesis. 8th ISCA Workshop on Speech Synthesis.

[Farouk, 2014] Farouk, M. H. (2014). Application of Wavelets in Speech Processing. Springer, New York.

[Sanchez et al, 2014] Sanchez, G., Silen, H., Nurminen, J., & Gabbouj, M. (2014). Hierarchical modeling of F0 contours for voice conversion. Fifteenth Annual Conference of the International Speech Communication Association.

[Vainio et al, 2013] Vainio, M., Suni, A., & Aalto, D. (2013). Continuous wavelet transform for analysis of speech prosody. TRASP 2013-Tools and Resources for the Analysis of Speech Prosody, Aix-en-Provence, France.

[Cole et al, 2010] Cole, J., Mo, Y., & Hasegawa-Johnson, M. (2010). Signal-based and expectation-based factors in the perception of prosodic prominence. Laboratory Phonology, 1(2).

29 / 29