Measuring the Perceptual Effects of Speech Synthesis Modelling Assumptions
Gustav Eje Henter, Thomas Merritt, Matt Shannon, Catherine Mayo, Simon King
1 of 29

Summary
“Hear the perceptual effects of modelling assumptions in statistical speech synthesis”
1. Through manipulating repeated natural speech
2. Identify which assumptions limit synthesiser naturalness
2 of 29

Overview
1. Background
2. Methodology
3. Experiments
4. Conclusions and outlook
3 of 29

Naturalness in speech synthesis
Output naturalness depends on many factors:
• Text processing
• Speech parameter representation (vocoder etc.)
• Probabilistic models
• Parameter generation method
4 of 29

Modelling assumptions
Acoustic models make many assumptions:
• High-level assumptions
  ◦ Different parameter streams are conditionally independent
  ◦ Filter parameter trajectories are conditionally independent
• Low-level assumptions
  ◦ A particular decision tree partitioning of linguistic contexts
  ◦ Leaf node distributions are Gaussian
Assumption adequacy affects output naturalness
5 of 29
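
As a sketch of what the first high-level assumption means, conditional independence of parameter streams can be written as a factorisation. The symbols below are illustrative assumptions, not notation from the talk, and the conditioning variable is likewise assumed to stand for whatever the model conditions on (for example linguistic context and timing).

```latex
% Assumed notation (not from the slides): c = filter (MCEP) stream,
% f = log-F0 stream, a = band-aperiodicity stream, l = conditioning information.
p(c, f, a \mid l) = p(c \mid l)\, p(f \mid l)\, p(a \mid l)
```

If the true distribution of natural speech does not factorise this way, a model built on the right-hand side cannot reproduce dependencies between streams, however well it is trained.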

Questions
1. Which high-level assumptions hurt naturalness?
2. How much might we gain if we could remove these assumptions?
→ Where should we direct our improvement efforts?
6 of 29

Traditional fault-finding
Investigate naturalness through trial-and-error:
1. Select an assumption and modify it
2. Compare output naturalness before and after
Problems:
• Impressions are coloured by other imperfections
  ◦ Low-level assumptions
  ◦ Estimation errors
• Does not compare the relative severity of different assumptions
7 of 29

Our insight
• Natural speech is a sample from the true acoustic model
• By manipulating repeated natural speech we can simulate output from
  ◦ highly accurate models
    • only incorporating certain high-level modelling assumptions
    • no low-level assumptions at all
  ◦ with a particular parameter representation
  ◦ and a particular output generation method
8 of 29

Why is this cool?
Nobody knows what these “nearly perfect” models are, yet we can listen to their output!
• Compare naturalness degradations due to different high-level assumptions in an otherwise perfect model
• Identified key naturalness bottlenecks in speech synthesis
9 of 29

Overview
1. Background
2. Methodology
3. Experiments
4. Conclusions and outlook
10 of 29

Repeated speech
Even when controlling for context, the same text can be realised acoustically in many different ways
“Rice is often served in round bowls”
11 of 29

REHASP 0.5 corpus
• “REpeated HArvard Sentence Prompts”
• Female British English talker “Lucy”
• 30 Harvard sentence prompts
• Each read aloud 40 times
  ◦ Presented in random order
• Recorded at 16 bit, 96 kHz
• Publicly available under a permissive license
  ◦ datashare.is.ed.ac.uk/handle/10283/561
12 of 29

In pictures 0. Start with natural speech repetitions: 13 of 29

In pictures 1. Extract parameters: (figure: Repetition 1) 13 of 29

Speech representation
Standard parametric speech representation used for experiments:
• 16 kHz operating point
• Matlab STRAIGHT for parameter extraction
• 46-dimensional parameter vector with three streams:
  ◦ 40 MCEPs (0–39), representing filter coefficients
  ◦ Log-F0
  ◦ 5 band aperiodicities (BAPs)
• 5 ms frame shift
14 of 29
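
The stream layout above can be sketched in a few lines of Python. This is only an illustration: the slide gives the stream sizes but not their order inside the vector, so the column order used here (MCEPs, then log-F0, then BAPs) and the names `split_streams` and `frames_for_duration` are assumptions.

```python
# Assumed column order: MCEPs first, then log-F0, then BAPs.
# The slide states the stream sizes only, not their order in the vector.
N_MCEP = 40   # mel-cepstral coefficients 0-39 (filter)
N_LF0 = 1     # log-F0
N_BAP = 5     # band aperiodicities
FRAME_DIM = N_MCEP + N_LF0 + N_BAP  # 46 dimensions in total
FRAME_SHIFT_MS = 5


def split_streams(frame):
    """Split one 46-dimensional frame into its three streams."""
    if len(frame) != FRAME_DIM:
        raise ValueError(f"expected {FRAME_DIM} values, got {len(frame)}")
    return {
        "mcep": frame[:N_MCEP],
        "lf0": frame[N_MCEP:N_MCEP + N_LF0],
        "bap": frame[N_MCEP + N_LF0:],
    }


def frames_for_duration(seconds):
    """Number of frames covering a duration at a 5 ms frame shift."""
    return int(round(seconds * 1000 / FRAME_SHIFT_MS))
```

At a 5 ms shift, each second of speech corresponds to 200 such frames.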

In pictures 1.b. Resynthesise (baseline “V”): (figure: Repetition 1) 15 of 29

In pictures 1. Extract parameters: (figure: Repetition 1, Repetition 2, Repetition 3) 15 of 29

Match timings 2.a. Match frames: (figure: Repetition 1, Repetition 2, Repetition 3) 16 of 29

Match timings 2.b. Warp timings: 16 of 29

Match timings 2.c. Resynthesise (baseline “D”): 16 of 29

Match timings 2.d. Remove reference: 16 of 29

Match timings We now have “LEGO pieces” of aligned repetitions 16 of 29
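
The slides do not say which algorithm matches frames and warps timings. One common choice for this kind of task is dynamic time warping (DTW), so the sketch below uses DTW purely as an assumption. The function names `dtw_path` and `warp_to_reference` are hypothetical, and frames are plain lists of floats.

```python
import math


def dtw_path(ref, other):
    """Return a DTW alignment path as (ref_index, other_index) pairs."""
    n, m = len(ref), len(other)
    inf = float("inf")
    # cost[i][j] = best cumulative distance aligning ref[:i] with other[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(ref[i - 1], other[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1],
                                 cost[i - 1][j - 1])
    # Backtrack from the end to recover the matched frame pairs.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        steps = [(cost[i - 1][j - 1], i - 1, j - 1),
                 (cost[i - 1][j], i - 1, j),
                 (cost[i][j - 1], i, j - 1)]
        _, i, j = min(steps)
    path.reverse()
    return path


def warp_to_reference(ref, other):
    """Resample `other` so that it has one frame per reference frame."""
    matched = {}
    for i, j in dtw_path(ref, other):
        matched.setdefault(i, j)  # keep the first match for each ref frame
    return [other[matched[i]] for i in range(len(ref))]
```

After warping every repetition to the same reference, all repetitions have the same number of frames and can be mixed frame by frame.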

Create chimeric speech 3.a. Combine parameters from independent repetitions: (figure: Filter 1 and Source 1, Filter 3 and Source 3, recombined as Filter 1 with Source 3) 17 of 29
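
The recombination step can be sketched as follows, assuming the repetitions have already been time-aligned to equal length. Treating log-F0 and the BAPs together as the "source" part, the column order, and the name `make_chimera` are assumptions for illustration; the slides only label the two parts as filter and source.

```python
N_MCEP = 40  # width of the filter (MCEP) stream; remaining columns are
             # assumed to be the source part (log-F0 and BAPs)


def make_chimera(filter_rep, source_rep):
    """Take filter columns from one aligned repetition and source columns
    from another, frame by frame."""
    if len(filter_rep) != len(source_rep):
        raise ValueError("repetitions must be time-aligned to equal length")
    return [f[:N_MCEP] + s[N_MCEP:] for f, s in zip(filter_rep, source_rep)]
```

Because the two repetitions are independent recordings of the same sentence, a frame sequence built this way behaves like output from a model in which the filter and source parts are independent of each other, without any further modelling error.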
