A perceptual investigation of wavelet-based decomposition of f0 for text-to-speech synthesis

SLIDE 1

A perceptual investigation of wavelet-based decomposition of f0 for text-to-speech synthesis

M. Sam Ribeiro, Junichi Yamagishi, Robert A. J. Clark

School of Informatics, The University of Edinburgh
m.f.s.ribeiro@sms.ed.ac.uk
8 September 2015

SLIDE 2

Overview

  • Introduction
  • Motivation
  • Hypotheses
  • Experiments
  • Experiment 2
  • Experiment 3
  • Discussion
  • Summary
  • References

SLIDE 3

Introduction

  • Wavelets in Speech Processing [Farouk, 2014]
  • Annotation of prominence [Vainio et al, 2013]
  • Pre-processing step for f0 modeling [Suni et al, 2013], [Ribeiro and Clark, 2015]
  • Voice Conversion [Sanchez et al, 2014]
  • Modeling of f0
  • The general conclusion is that signal decomposition is beneficial for f0 modeling
  • But it is assumed that all components are equally relevant to the reconstructed signal

The individual importance of each wavelet scale to the overall signal is not fully understood

SLIDE 5

The Continuous Wavelet Transform

  • The CWT decomposes an input signal into various scales of selected frequency.
  • 10-scale decomposition
  • Each scale approximately 1 octave apart.
  • f0 reconstruction [Suni et al, 2013]:

$$f_0(x) = \sum_{i=1}^{10} C_i(x)\,(i + 2.5)^{-5/2}$$
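The reconstruction sum can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names and the toy input are assumptions, and `components[i - 1][x]` is taken to hold the coefficient C_i(x) of scale i at frame x.

```python
def scale_gain(i: int) -> float:
    """Gain applied to scale i in the reconstruction sum: (i + 2.5)^(-5/2)."""
    return (i + 2.5) ** -2.5


def reconstruct_f0(components: list[list[float]]) -> list[float]:
    """Sum the scales frame by frame; components[i - 1][x] holds C_i(x)."""
    n_frames = len(components[0])
    return [
        sum(scale_gain(i) * c[x] for i, c in enumerate(components, start=1))
        for x in range(n_frames)
    ]


# Toy input (hypothetical): 10 scales, 3 frames, every coefficient set to 1.0.
# Real input would be the output of a 10-scale CWT of an f0 contour.
toy_components = [[1.0, 1.0, 1.0] for _ in range(10)]
f0 = reconstruct_f0(toy_components)
```

Because the gain shrinks as i grows, a coefficient of the same size contributes less at scale 10 than at scale 1.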

SLIDE 6

Hypotheses

  • Middle frequencies (scales 5-8) are associated with higher levels of naturalness.
  • Low frequencies (scales 1-4) don’t contain much information and are comparable to HMM-generated f0.
  • High frequencies (scales 9-10) are mostly noise and do not contribute much to the perceived naturalness.

SLIDE 7

Conditions and Reconstruction

Experimental conditions

Condition   Description                                                 Freq. (Hz)
natural     Vocoded speech using natural parameters.                    -
all         All f0 frequencies.                                         0.1-50
1-2         Low frequencies. Scales indexed at 1 and 2.                 0.1-0.2
3-4         Low frequencies. Scales indexed at 3 and 4.                 0.4-0.8
1-4         All low frequencies. Scales indexed at 1, 2, 3, and 4.      0.1-0.8
5-6         Middle frequencies. Scales indexed at 5 and 6.              1.6-3.2
7-8         Middle frequencies. Scales indexed at 7 and 8.              6.3-13
5-8         All middle frequencies. Scales indexed at 5, 6, 7, and 8.   1.6-13
9-10        High frequencies. Scales indexed at 9 and 10.               25-50
MSD-HMM     f0 signal predicted from an MSD-HMM.                        -

Table: Experimental conditions with approximate CWT frequency ranges.
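The frequency column is consistent with scales spaced one octave apart. As a rough check, the sketch below doubles the frequency from one scale to the next. The 0.1 Hz starting point is an assumption read off the table, and the function name is illustrative only.

```python
BASE_HZ = 0.1  # assumed frequency of scale 1, read off the table


def scale_frequency(i: int) -> float:
    """Approximate frequency of scale i when each scale is one octave above the last."""
    return BASE_HZ * 2 ** (i - 1)


# Scale index -> approximate frequency in Hz, rounded to one decimal place.
freqs = {i: round(scale_frequency(i), 1) for i in range(1, 11)}
```

Under this assumption scales 5 and 6 land at 1.6 and 3.2 Hz, and scales 9 and 10 at 25.6 and 51.2 Hz, which matches the table's approximate ranges.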

F0 reconstruction

$$f_0(x) = \sum_{i=1}^{10} w_i\, C_i(x)\,(i + 2.5)^{-5/2}$$

where $w_i \in \{0, 1\}$ is the weight given to scale $i$.
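Each CWT-based condition then amounts to a binary mask over the ten scales. The sketch below builds the weights w_i from the condition names in the table and applies them in the reconstruction sum. The dictionary, the function names and the toy input are assumptions made for illustration. The natural and MSD-HMM conditions are left out because they are not rebuilt from wavelet scales.

```python
# Scales kept (w_i = 1) in each CWT-based condition; all other scales get w_i = 0.
CONDITION_SCALES = {
    "all": range(1, 11),
    "1-2": (1, 2),
    "3-4": (3, 4),
    "1-4": range(1, 5),
    "5-6": (5, 6),
    "7-8": (7, 8),
    "5-8": range(5, 9),
    "9-10": (9, 10),
}


def weights_for(condition: str) -> list[int]:
    """Return [w_1, ..., w_10] for a condition name."""
    active = set(CONDITION_SCALES[condition])
    return [1 if i in active else 0 for i in range(1, 11)]


def reconstruct_condition(components: list[list[float]], condition: str) -> list[float]:
    """Weighted reconstruction; components[i - 1][x] holds C_i(x)."""
    w = weights_for(condition)
    n_frames = len(components[0])
    return [
        sum(w[i - 1] * components[i - 1][x] * (i + 2.5) ** -2.5 for i in range(1, 11))
        for x in range(n_frames)
    ]


# Toy input (hypothetical): 10 scales, 2 frames, all coefficients 1.0.
toy = [[1.0, 1.0] for _ in range(10)]
mid_only = reconstruct_condition(toy, "5-8")
```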

SLIDE 8

Perceptual Experiments

Experiment 1 - Prominence Detection
Participants are asked to judge which word appears more prominent in an utterance.

Experiment 2 - Similarity Experiment
Participants are asked to judge whether utterances are similar in terms of naturalness.

Experiment 3 - MUSHRA Test
Participants are asked to rate an utterance on a 100-point scale with respect to a reference and all other conditions.

Experiment 4 - MOS Test
Participants are asked to rate an utterance on a 5-point scale with no references.

SLIDE 13

Experiment 2 - Similarity

  • Data
  • Expressive audiobook data
  • mel-cepstral, aperiodicity, voicing parameters
  • HMM system trained on roughly 5000 utterances
  • duration is force-aligned
  • natural (vocoded) condition uses original parameters
  • remaining conditions use natural f0 processed accordingly
  • Design
  • 20 utterances synthesized for each of the 10 conditions
  • 10 native listeners, each rating 144 utterance pairs
  • Each pair consists of different utterances and different conditions
  • No repetitions (utterance or condition) within any three consecutive pairs
  • Participants asked to judge if the pair is similar or different in terms of naturalness
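The ordering constraints in the design can be expressed as a small validity check. This is a sketch under one reading of the slide: within any window of three consecutive pairs, no utterance and no condition may appear twice. The data layout and the function names are assumptions, not the authors' tooling.

```python
# A stimulus is (utterance_id, condition); a pair is two stimuli.

def pair_is_valid(pair) -> bool:
    """A pair must combine two different utterances and two different conditions."""
    (utt_a, cond_a), (utt_b, cond_b) = pair
    return utt_a != utt_b and cond_a != cond_b


def window_is_valid(window) -> bool:
    """No utterance and no condition may repeat inside the window."""
    utts = [u for pair in window for (u, _) in pair]
    conds = [c for pair in window for (_, c) in pair]
    return len(set(utts)) == len(utts) and len(set(conds)) == len(conds)


def sequence_is_valid(pairs, window_size: int = 3) -> bool:
    """Check every pair and every run of `window_size` consecutive pairs."""
    if not all(pair_is_valid(p) for p in pairs):
        return False
    n_windows = max(1, len(pairs) - window_size + 1)
    return all(window_is_valid(pairs[k:k + window_size]) for k in range(n_windows))


good = [
    ((1, "all"), (2, "5-8")),
    ((3, "1-4"), (4, "9-10")),
    ((5, "natural"), (6, "MSD-HMM")),
]
# Same start, but "all" comes back within three consecutive pairs.
bad = good[:2] + [((5, "all"), (6, "7-8"))]
```

Here `sequence_is_valid(good)` holds, while `sequence_is_valid(bad)` fails because the condition "all" repeats inside the window.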

SLIDE 14

Experiment 2 - Similarity

  • 45 distinct condition pairs, each pair judged at least 32 times
  • Create 10x10 dissimilarity matrix and embed it into a 2-dimensional space with MDS
  • Kruskal’s normalized stress-1 of the embedding is 0.086
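The counts follow from the design: 10 conditions give 45 unordered pairs, and 10 listeners rating 144 pairs each give 1440 judgments, or 32 per condition pair. The sketch below reproduces that arithmetic and shows one possible way to build the dissimilarity matrix and score an embedding. Two assumptions are made here and are not taken from the slides: dissimilarity is the proportion of "different" responses, and stress-1 is computed against the raw dissimilarities rather than the monotone-regression disparities that non-metric MDS would use.

```python
from itertools import combinations
from math import dist, sqrt

CONDITIONS = ["natural", "all", "1-2", "3-4", "1-4", "5-6", "7-8", "5-8", "9-10", "MSD-HMM"]

condition_pairs = list(combinations(CONDITIONS, 2))   # unordered condition pairs
judgments_per_pair = 10 * 144 / len(condition_pairs)  # 10 listeners x 144 pairs each


def dissimilarity_matrix(judgments):
    """judgments: iterable of (cond_a, cond_b, judged_different) triples."""
    index = {c: k for k, c in enumerate(CONDITIONS)}
    n = len(CONDITIONS)
    different = [[0] * n for _ in range(n)]
    total = [[0] * n for _ in range(n)]
    for a, b, judged_different in judgments:
        i, j = index[a], index[b]
        for r, c in ((i, j), (j, i)):  # keep the matrix symmetric
            total[r][c] += 1
            different[r][c] += int(judged_different)
    # Assumed definition: share of "different" responses; 0.0 where no data exists.
    return [
        [different[i][j] / total[i][j] if total[i][j] else 0.0 for j in range(n)]
        for i in range(n)
    ]


def stress1(dissim, points) -> float:
    """Stress-1: sqrt(sum (d_ij - target_ij)^2 / sum d_ij^2) over embedded distances d_ij."""
    num = den = 0.0
    for i, j in combinations(range(len(points)), 2):
        d = dist(points[i], points[j])
        num += (d - dissim[i][j]) ** 2
        den += d ** 2
    return sqrt(num / den)
```

A stress of 0 means the embedded distances reproduce the targets exactly. Values below roughly 0.1 are conventionally read as a fair to good fit.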

SLIDE 15

Experiment 2 - Similarity

  • Listeners naturally clustered low, middle, and high frequencies
  • The all-frequencies condition seems to be similar to the middle frequencies
  • It is also farther from natural speech than the middle frequencies are
  • Listeners tend to prefer the CWT middle frequencies
  • These have been previously associated with the word level

SLIDE 17

Experiment 3 - MUSHRA

  • Data
  • Expressive audiobook data (same as Similarity Experiment).
  • Design
  • Ask participants to judge all conditions simultaneously (1 to 100).
  • Reference is given as the natural condition.
  • 10 participants rate 20 sets of 10 stimuli.
  • From the 200 expected sets, 48 were discarded as the hidden reference was not judged as natural.

  • 152 sets were used for analysis.
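The set counts follow from 10 participants times 20 sets, minus the 48 discarded sets. The screening step can be sketched as a filter on the hidden reference's rating. The slides do not state the exact criterion, so the threshold of 100 used below is an assumption, as are the function name and the data layout.

```python
def keep_set(ratings: dict[str, float], reference: str = "natural",
             threshold: float = 100.0) -> bool:
    """Keep a MUSHRA set only if the hidden reference reached the threshold.

    The threshold value is an assumption; the slides only say that sets were
    discarded when the hidden reference was not judged as natural.
    """
    return ratings[reference] >= threshold


expected_sets = 10 * 20              # participants x sets per participant
analysed_sets = expected_sets - 48   # sets left after screening

# Two hypothetical sets: the first passes, the second is discarded.
sets = [
    {"natural": 100, "all": 82, "5-8": 78, "9-10": 20},
    {"natural": 70, "all": 85, "5-8": 80, "9-10": 25},
]
kept = [s for s in sets if keep_set(s)]
```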

SLIDE 18

Experiment 3 - MUSHRA

SLIDE 19

Main Conclusions

  • Mid-frequencies
  • Consistently achieve better results
  • Naturalness is almost comparable to all frequencies
  • Have been associated previously with the word level [Suni et al, 2013], [Ribeiro and Clark, 2015]
  • Low-frequencies
  • Comparable to HMM-generated f0 (Prominence, MUSHRA, MOS tests)

  • Although not really the same (similarity test)
  • Previously associated with phrase and utterance levels
  • High-frequencies
  • Consistently judged the most unnatural condition
  • Not really relevant to naturalness
  • Previously associated with the phone-level

SLIDE 20

Earlier assumptions

Earlier assumptions [Suni et al, 2013], [Ribeiro and Clark, 2015]

1. All wavelet components are equally relevant to the reconstructed signal
2. The association of wavelet components to linguistic levels is meaningful

  • The first assumption is shown not to be true.
  • Middle frequencies carry most of the information
  • Low and high frequencies are not so relevant
  • How about their association with linguistic levels?

SLIDE 21

Unit and Peak Rates

  • Compute unit and peak rates at utterance level for 5000 utterances
  • Count units and peaks (local maxima) and divide by utterance duration in seconds
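Both rates are counts divided by the utterance duration. A minimal sketch is given below, assuming that a peak is a strict local maximum (so plateaus are not counted) and that the unit count is already available from an alignment. The function names and the toy values are illustrative only.

```python
def count_peaks(signal: list[float]) -> int:
    """Count strict local maxima: samples larger than both neighbours."""
    return sum(
        1
        for k in range(1, len(signal) - 1)
        if signal[k - 1] < signal[k] > signal[k + 1]
    )


def rate(count: int, duration_s: float) -> float:
    """Events per second over the utterance."""
    return count / duration_s


# Toy wavelet scale for a 2-second utterance (hypothetical values).
scale = [0.0, 1.0, 0.0, 2.0, 0.5, 0.7, 0.1]
peak_rate = rate(count_peaks(scale), duration_s=2.0)

# Hypothetical unit count, e.g. 5 units of some linguistic level in the same utterance.
unit_rate = rate(5, duration_s=2.0)
```

Comparing the peak rate of a scale with the unit rate of a linguistic level gives a rough indication of whether that scale moves at the same pace as the level it is associated with.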

SLIDE 22

Summary

  • Main Findings
  • Wavelet components do not carry equal weights for the f0 signal

  • Middle frequencies convey most of the information
  • HMM-generated f0 is somewhat similar to low-frequencies
  • Association with linguistic levels is not very good
  • Speech Samples
  • http://homepages.inf.ed.ac.uk/s1250520/samples/interspeech15.html
  • Future Work
  • Associate each scale with meaningful linguistic levels
  • Use middle frequencies to learn relevant syllable and word-level features

SLIDE 23

A perceptual investigation of wavelet-based decomposition of f0 for text-to-speech synthesis

Thank you for listening

M. Sam Ribeiro, Junichi Yamagishi, Robert A. J. Clark

School of Informatics, The University of Edinburgh
m.f.s.ribeiro@sms.ed.ac.uk
8 September 2015

SLIDE 24

Extra Slides

SLIDE 26

Experiment 1 - Prominence

  • Data
  • Same sentence uttered in different contexts.
  • This encourages different prosody.
  • 10 different utterances in 4 contexts, for a total of 40 utterances.

stimulus                response
...                     John won at Mary’s.
Paul won at Mary’s.     John won at Mary’s.
John lost at Mary’s.    John won at Mary’s.
John won at Kate’s.     John won at Mary’s.

SLIDE 27

Experiment 1 - Prominence

  • Experiment
  • Ask listeners to judge which word appears more prominent
  • Measure accuracy for each condition
  • Determine correctness from natural condition [Cole et al, 2010]
  • Hypothesis
  • Mid-frequencies contain most of the prominence-related information
  • High frequency and low frequency conditions will achieve lower accuracy

  • Data Preparation
  • copy-synthesis (STRAIGHT)
  • natural condition uses all original parameters
  • for remaining conditions:
  • mcep/bap/duration used from neutral stimulus
  • f0 extracted from each context
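The accuracy measure in the Experiment section can be sketched as follows. The slides say only that correctness is determined from the natural condition. Taking the majority response to the natural stimulus as the correct answer is an assumption made here, and the data layout and function names are hypothetical.

```python
from collections import Counter


def majority(responses: list[str]) -> str:
    """Most frequent response; ties resolve to the response seen first."""
    return Counter(responses).most_common(1)[0][0]


def accuracy_by_condition(responses: dict[str, dict[str, list[str]]]) -> dict[str, float]:
    """responses: {condition: {utterance_id: [chosen word, ...]}}; must include 'natural'."""
    # Assumed ground truth: the word most listeners picked in the natural condition.
    reference = {utt: majority(words) for utt, words in responses["natural"].items()}
    accuracy = {}
    for condition, per_utt in responses.items():
        hits = total = 0
        for utt, words in per_utt.items():
            hits += sum(word == reference[utt] for word in words)
            total += len(words)
        accuracy[condition] = hits / total
    return accuracy


# Toy judgments for one utterance (hypothetical, not experimental data).
toy = {
    "natural": {"u1": ["John", "John", "won"]},
    "5-8": {"u1": ["John", "John", "John", "won"]},
    "9-10": {"u1": ["won", "Mary's", "John", "won"]},
}
acc = accuracy_by_condition(toy)
```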

SLIDE 28

Experiment 1 - Prominence

  • Experiment details
  • 25 participants
  • Each utterance judged at least 5 times per condition
  • Each condition with roughly 200 judgments

SLIDE 29

References

[Ribeiro and Clark, 2015] Ribeiro, M. S., & Clark, R. (2015). A Multi-Level Representation of f0 using the Continuous Wavelet Transform and the Discrete Cosine Transform. Proc. ICASSP 2015.

[Suni et al, 2013] Suni, A. S., Aalto, D., Raitio, T., Alku, P., & Vainio, M. (2013). Wavelets for intonation modeling in HMM speech synthesis. 8th ISCA Workshop on Speech Synthesis.

[Farouk, 2014] Farouk, M. H. (2014). Application of Wavelets in Speech Processing. Springer, New York.

[Sanchez et al, 2014] Sanchez, G., Silen, H., Nurminen, J., & Gabbouj, M. (2014). Hierarchical modeling of F0 contours for voice conversion. Fifteenth Annual Conference of the International Speech Communication Association.

[Vainio et al, 2013] Vainio, M., Suni, A., & Aalto, D. (2013). Continuous wavelet transform for analysis of speech prosody. TRASP 2013 - Tools and Resources for the Analysis of Speech Prosody, Aix-en-Provence, France.

[Cole et al, 2010] Cole, J., Mo, Y., & Hasegawa-Johnson, M. (2010). Signal-based and expectation-based factors in the perception of prosodic prominence. Laboratory Phonology, 1(2).