A perceptual investigation of wavelet-based decomposition of f0 for text-to-speech synthesis
M. Sam Ribeiro, Junichi Yamagishi, Robert A. J. Clark
School of Informatics, The University of Edinburgh
m.f.s.ribeiro@sms.ed.ac.uk
8 September 2015
Overview
Introduction, Motivation, Hypotheses, Experiments, Experiment 2, Experiment 3, Discussion, Summary, References
Introduction
[Ribeiro and Clark, 2015]
beneficial for f0 modeling
the reconstructed signal
The individual importance of each wavelet scale to the overall signal is not fully understood
The Continuous Wavelet Transform
an input signal into various scales of selected frequency.
1 octave apart.
$$f_0(x) = \sum_{i=1}^{10} C_i(x)\,(i + 2.5)^{-5/2}$$
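The reconstruction sum above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `reconstruct_f0` and the toy component values are assumptions, and each component is taken to be a list of per-frame values C_i(x) for scale i = 1..10.

```python
def reconstruct_f0(components):
    """Sum the scale components C_i(x), each scaled by (i + 2.5) ** (-5/2).

    `components` is a list of equal-length lists; components[0] holds
    scale i = 1, components[1] holds scale i = 2, and so on.
    """
    n_frames = len(components[0])
    f0 = [0.0] * n_frames
    for i, c_i in enumerate(components, start=1):
        scale_weight = (i + 2.5) ** (-5 / 2)
        for x in range(n_frames):
            f0[x] += c_i[x] * scale_weight
    return f0


# Toy input (hypothetical values): 10 scales, 3 frames each.
components = [[1.0, 0.0, -1.0] for _ in range(10)]
f0 = reconstruct_f0(components)
```

With these toy values every scale contributes the same shape, so the output is simply that shape multiplied by the sum of the ten scale factors.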
Hypotheses
Middle frequencies (scales 5-8) are associated with higher levels of naturalness
Low frequencies (scales 1-4) don’t contain much information and are comparable to HMM-generated f0
High frequencies (scales 9-10) are mostly noise and do not contribute much to the perceived naturalness
Conditions and Reconstruction
Experimental conditions
Condition   Description                                                    Frequency (Hz)
natural     Vocoded speech using natural parameters. All f0 frequencies.   0.1-50
1-2         Low frequencies. Scales indexed at 1 and 2.                    0.1-0.2
3-4         Low frequencies. Scales indexed at 3 and 4.                    0.4-0.8
1-4         All low frequencies. Scales indexed at 1, 2, 3, and 4.         0.1-0.8
5-6         Middle frequencies. Scales indexed at 5 and 6.                 1.6-3.2
7-8         Middle frequencies. Scales indexed at 7 and 8.                 6.3-13
5-8         All middle frequencies. Scales indexed at 5, 6, 7, and 8.      1.6-13
9-10        High frequencies. Scales indexed at 9 and 10.                  25-50
MSD-HMM     f0 signal predicted from an MSD-HMM.
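The frequency values listed for the conditions are roughly consistent with ten scales spaced one octave apart. The sketch below is only a plausibility check under stated assumptions: the values are taken to be in Hz, the top scale is taken to sit at 50 Hz, and each lower scale is taken to be half the frequency of the one above it. The function name `scale_frequencies` is hypothetical.

```python
def scale_frequencies(n_scales=10, top_hz=50.0):
    """Return one frequency per scale (index 1..n_scales), one octave apart.

    Assumes the highest-indexed scale is at `top_hz` and each lower scale
    halves the frequency.
    """
    return [top_hz / 2 ** (n_scales - i) for i in range(1, n_scales + 1)]


freqs = scale_frequencies()
for i, f in enumerate(freqs, start=1):
    print(f"scale {i:2d}: {f:.2f} Hz")
```

The computed values agree with the listed ones only up to rounding, which suggests the listed ranges are approximate.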
F0 reconstruction
$$f_0(x) = \sum_{i=1}^{10} w_i\,C_i(x)\,(i + 2.5)^{-5/2}$$
where $w_i \in \{0, 1\}$ is the weight given to scale $i$.
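The weighted reconstruction can be illustrated by mapping each condition name to a binary weight vector. This is a sketch under assumptions, not the authors' code: `CONDITION_SCALES`, `condition_weights`, and `reconstruct_weighted` are hypothetical names, and the component values are toy data.

```python
# Scales kept (w_i = 1) for each wavelet condition; all others get w_i = 0.
CONDITION_SCALES = {
    "1-2": (1, 2),
    "3-4": (3, 4),
    "1-4": (1, 2, 3, 4),
    "5-6": (5, 6),
    "7-8": (7, 8),
    "5-8": (5, 6, 7, 8),
    "9-10": (9, 10),
}


def condition_weights(condition, n_scales=10):
    """Binary weights w_1..w_n for the named condition."""
    active = CONDITION_SCALES[condition]
    return [1 if i in active else 0 for i in range(1, n_scales + 1)]


def reconstruct_weighted(components, weights):
    """Sum w_i * C_i(x) * (i + 2.5) ** (-5/2) over all scales i."""
    n_frames = len(components[0])
    f0 = [0.0] * n_frames
    for i, (c_i, w_i) in enumerate(zip(components, weights), start=1):
        scale_weight = (i + 2.5) ** (-5 / 2)
        for x in range(n_frames):
            f0[x] += w_i * c_i[x] * scale_weight
    return f0


# Toy components (hypothetical values): 10 scales, 4 frames each.
components = [[float(i * (x + 1)) for x in range(4)] for i in range(1, 11)]
middle = reconstruct_weighted(components, condition_weights("5-8"))
```

Because the weights are binary and the conditions 1-4, 5-8, and 9-10 partition the ten scales, their three reconstructions add up to the reconstruction that uses all scales.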
Perceptual Experiments
Experiment 1 - Prominence Detection: Participants are asked to judge which word appears more prominent in an utterance.
Experiment 2 - Similarity Experiment: Participants are asked to judge how similar utterances are in terms of naturalness.
Experiment 3 - MUSHRA Test: Participants are asked to rate an utterance on a 100-point scale with respect to a reference and all other conditions.
Experiment 4 - MOS Test: Participants are asked to rate an utterance on a 5-point scale with no references.
Experiment 2 - Similarity
conditions
consecutive pairs
terms of naturalness
2-dimensional space with MDS
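As an illustration of projecting pairwise dissimilarities into a 2-dimensional space, here is a minimal sketch of classical (metric) MDS in pure Python. It is an assumption that this variant resembles what was used in the experiment; the slides do not say which MDS variant or software was applied. The function name `classical_mds` and the toy distances are hypothetical, and the power-iteration eigen-solver is chosen only to keep the example dependency-free.

```python
import math


def _matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def classical_mds(dist, n_dims=2, n_iter=500):
    """Embed n items in n_dims dimensions from a symmetric distance matrix."""
    n = len(dist)
    sq = [[dist[i][j] ** 2 for j in range(n)] for i in range(n)]
    row_mean = [sum(row) / n for row in sq]
    total_mean = sum(row_mean) / n
    # Double centering: B = -1/2 * J * D^2 * J (D is symmetric).
    b = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + total_mean)
          for j in range(n)] for i in range(n)]
    coords = [[0.0] * n_dims for _ in range(n)]
    for d in range(n_dims):
        # Power iteration for the leading eigenvector of the current B.
        v = [float(i + 1) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
        for _ in range(n_iter):
            w = _matvec(b, v)
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:
                break
            v = [x / norm for x in w]
        eigval = sum(x * y for x, y in zip(v, _matvec(b, v)))
        scale = math.sqrt(max(eigval, 0.0))
        for i in range(n):
            coords[i][d] = scale * v[i]
        # Deflate so the next pass finds the next eigenpair.
        for i in range(n):
            for j in range(n):
                b[i][j] -= eigval * v[i] * v[j]
    return coords


# Toy dissimilarities (hypothetical, not data from the experiment):
# distances between four points that really do lie in a plane.
points = [(0.0, 0.0), (3.0, 0.0), (0.0, 4.0), (3.0, 4.0)]
dist = [[math.dist(p, q) for q in points] for p in points]
embedding = classical_mds(dist)
```

For real listener judgments the dissimilarities would not be exactly Euclidean, so the 2-dimensional layout would only approximate them.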
Experiment 3 - MUSHRA
100).
reference was not judged as natural.
Main Conclusions
[Suni et al, 2013], [Ribeiro and Clark, 2015]
MOS tests)
Earlier assumptions
[Suni et al, 2013], [Ribeiro and Clark, 2015]
1 All wavelet components are equally relevant to the reconstructed signal
2 The association of wavelet components to linguistic levels is meaningful
Unit and Peak Rates
utterances
duration in seconds
Summary
signal
features
Experiment 1 - Prominence
stimulus                 response
...                      John won at Mary’s.
Paul won at Mary’s.      John won at Mary’s.
John lost at Mary’s.     John won at Mary’s.
John won at Kate’s.      John won at Mary’s.
information
accuracy
References
[Ribeiro and Clark, 2015] Ribeiro, M. S., & Clark, R. (2015). A Multi-Level Representation of f0 using the Continuous Wavelet Transform and the Discrete Cosine Transform.
[Suni et al, 2013] Suni, A. S., Aalto, D., Raitio, T., Alku, P., & Vainio, M. (2013). Wavelets for intonation modeling in HMM speech synthesis. 8th ISCA Workshop on Speech Synthesis.
[Farouk, 2014] Farouk, M. H. (2014). Application of Wavelets in Speech Processing. Springer, New York.
[Sanchez et al, 2014] Sanchez, G., Silen, H., Nurminen, J., & Gabbouj, M. (2014). Hierarchical modeling of F0 contours for voice conversion. Fifteenth Annual Conference of the International Speech Communication Association.
[Vainio et al, 2013] Vainio, M., Suni, A., & Aalto, D. (2013). Continuous wavelet transform for analysis of speech prosody. TRASP 2013-Tools and Resources for the Analysis of Speech Prosody, Aix-en-Provence, France.
[Cole et al, 2010] Cole, J., Mo, Y., & Hasegawa-Johnson, M. (2010). Signal-based and expectation-based factors in the perception of prosodic prominence. Laboratory Phonology, 1(2).