Measuring the Perceptual Effects of Speech Synthesis Modelling Assumptions
Gustav Eje Henter, Thomas Merritt, Matt Shannon, Catherine Mayo, Simon King
“Hear the perceptual effects of modelling assumptions in statistical speech synthesis”
Output naturalness depends on many factors:
Acoustic models make many assumptions:
Assumption adequacy affects output naturalness
→ Where should we direct our improvement efforts?
Investigate naturalness through trial-and-error:
Problems:
Nobody knows what these “nearly perfect” models are, yet we can listen to their output!
assumptions in an otherwise perfect model
Even when controlling for context, the same text can be realised acoustically in many different ways:
“Rice is often served in round bowls”
[Figure: Repetition 1]
Standard parametric speech representation used for experiments:
1.b. Resynthesise (baseline “V”):
[Figure: Repetition 1, Repetition 2, Repetition 3]
2.a. Match frames:
2.b. Warp timings:
2.c. Resynthesise (baseline “D”):
2.d. Remove reference:
We now have “LEGO pieces” of aligned repetitions
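Steps 2.a and 2.b can be sketched with dynamic time warping (DTW). The slides do not name the alignment method, so DTW itself, the Euclidean frame distance, and the function names `dtw_path` and `warp_to_reference` are all assumptions made for illustration, not the authors' implementation.

```python
import numpy as np

def dtw_path(ref, rep):
    """Return a DTW alignment path between two (frames x dims) arrays."""
    n, m = len(ref), len(rep)
    # Euclidean distance between every reference frame and every repetition frame.
    dist = np.linalg.norm(ref[:, None, :] - rep[None, :, :], axis=2)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i, j] = dist[i - 1, j - 1] + min(
                cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1])
    # Backtrack from the last cell to recover the matched frame pairs (step 2.a).
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]

def warp_to_reference(ref, rep):
    """Warp `rep` onto the reference timeline (step 2.b): one frame per reference frame."""
    warped = np.empty((len(ref), rep.shape[1]))
    for i, j in dtw_path(ref, rep):
        warped[i] = rep[j]  # if several frames match frame i, the last one is kept
    return warped
```

After warping, every repetition has the same number of frames as the reference, which is what makes frame-wise mixing and averaging possible in the later steps.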
3.a. Combine parameters from independent repetitions:
[Figure: streams Filter 3, Source 1, Filter 1, Source 3; final panel shows Filter 1 with Source 3]
3.a. Resynthesise chimeric speech (here condition “SF”):
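The combination in step 3.a can be sketched as below, assuming each time-aligned repetition is stored as a dictionary of per-frame parameter streams. The stream names (`filter`, `f0`, `aperiodicity`), the dictionary layout, the grouping of pitch and aperiodicity as "source", and the function `make_chimera` are hypothetical choices for illustration. Vocoder resynthesis is not shown.

```python
import numpy as np

def make_chimera(filter_rep, source_rep):
    """Take filter parameters from one aligned repetition and source parameters from another."""
    n_frames = len(filter_rep["filter"])
    # Frame-wise mixing only makes sense if both repetitions share one timeline.
    for name in ("f0", "aperiodicity"):
        if len(source_rep[name]) != n_frames:
            raise ValueError("repetitions must be time-aligned")
    return {
        "filter": filter_rep["filter"].copy(),
        "f0": source_rep["f0"].copy(),
        "aperiodicity": source_rep["aperiodicity"].copy(),
    }
```

Under these assumptions, `make_chimera(rep1, rep3)` would correspond to the "Filter 1, Source 3" combination.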
3.b. Take the mean of all repetitions:
[Figure: Repetition 3, Repetition 1, Mean]
3.b. Resynthesise mean speech (condition “M”):
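A minimal sketch of the averaging in step 3.b, assuming the repetitions are already time-aligned and stored as equally shaped (frames x dims) arrays. The function name `mean_of_repetitions` is hypothetical, and the sketch ignores details such as how unvoiced frames in the pitch stream would be handled.

```python
import numpy as np

def mean_of_repetitions(aligned_reps):
    """Average time-aligned repetitions frame by frame."""
    # np.stack raises ValueError if the repetitions do not share one shape.
    stacked = np.stack(aligned_reps, axis=0)  # (repetitions, frames, dims)
    return stacked.mean(axis=0)
```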
acoustic model
high-level assumptions but no low-level assumptions
1.1 Source and filter parameters independent
1.2 Filter, pitch, aperiodicities independent
= 12 conditions (4 baselines)
Sampling-based generation:
  Database examples: 3 7 26 32
  Baselines: N VU V D
  Stream independence: SF SI
  Filter coefficient independence: L1 L2 H1 H2 I
Mean-based generation:
  Averaging: M
(Also available online at homepages.inf.ed.ac.uk/ghenter)
MUSHRA test for parallel, fine-grained naturalness assessment
Box plot of 549 comparisons rating natural speech at 100:
[Box plot: rating (axis ticks 10 to 100) per condition: VU, V, D, SF, SI, I, M]
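The filtering behind such a plot can be sketched as follows, assuming each comparison is stored as a mapping from condition label to a rating and that `N` denotes natural speech. The data layout and the function name `box_stats` are hypothetical, and no ratings from the study are reproduced here.

```python
import numpy as np

def box_stats(comparisons, reference="N"):
    """Quartiles per condition, using only comparisons that rated the reference at 100."""
    kept = [c for c in comparisons if c.get(reference) == 100]
    conditions = sorted({k for c in kept for k in c if k != reference})
    stats = {}
    for cond in conditions:
        ratings = np.array([c[cond] for c in kept if cond in c], dtype=float)
        q1, median, q3 = np.percentile(ratings, [25, 50, 75])
        stats[cond] = {"q1": q1, "median": median, "q3": q3}
    return stats, len(kept)
```

The second return value is the number of comparisons that survive the filter, which plays the role of the count quoted in the plot title.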
reduces naturalness
Conclusions not applicable to: