Robust TTS duration modelling using DNNs
Gustav Eje Henter Srikanth Ronanki Oliver Watts Mirjam Wester Zhizheng Wu Simon King
Synopsis
1. Statistical parametric speech synthesis is sensitive to bad data and bad assumptions
from found audiobook data
2.1 MDN generation
2.2 β-estimation
3.1 Setup
3.2 Results
Generate some datapoints D
Fit a Gaussian using maximum likelihood
Add an unexpected datapoint
The maximum likelihood fit changes a lot!
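The effect can be reproduced with a few lines of Python. This is a minimal sketch on made-up toy data (the values in `clean` and the outlier at 10.0 are assumptions chosen for illustration, not the datapoints from the slides); the maximum-likelihood Gaussian fit is simply the sample mean and the population standard deviation.

```python
import statistics

def fit_gaussian_ml(data):
    """Maximum-likelihood Gaussian fit: sample mean and population std (divides by n)."""
    mu = statistics.fmean(data)
    sigma = statistics.pstdev(data, mu)
    return mu, sigma

# Hypothetical toy data, not taken from the presentation.
clean = [-1.0, -0.5, 0.0, 0.5, 1.0]
with_outlier = clean + [10.0]  # one unexpected datapoint

mu_clean, sigma_clean = fit_gaussian_ml(clean)
mu_out, sigma_out = fit_gaussian_ml(with_outlier)

print(f"clean data:   mu = {mu_clean:.3f}, sigma = {sigma_clean:.3f}")
print(f"with outlier: mu = {mu_out:.3f}, sigma = {sigma_out:.3f}")
```

A single added point moves the mean away from the bulk of the data and inflates the standard deviation several times over.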
The word “robust” can mean many things
from modelling assumptions
poorly-fitting datapoints
robust statistical techniques
phone basis
good/bad fit
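Assume phone durations are independent and follow a GMM
$f_D(d; \theta) = \sum_{k=1}^{K} \omega_k \cdot f_N(d; \mu_k, \mathrm{diag}(\sigma_k^2))$
The parameters $\theta = \{\omega_k, \mu_k, \sigma_k^2\}_{k=1}^{K}$ depend on $l$ through a DNN $\theta(l; W)$ with weights $W$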
The network is typically trained using maximum likelihood
$\hat{W}_{\mathrm{ML}} = \arg\max_W \sum_p \ln f_D(d_p; \theta(l_p; W))$
Output durations are typically generated from the mode of the predicted distribution
$\hat{d}(l) = \arg\max_d f_D(d; \theta(l; W))$
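The mixture density, the likelihood criterion and mode-based generation can be sketched as follows. This is an illustration under simplifying assumptions, not the authors' implementation: durations are treated as scalars, the mixture parameters in `theta` are fixed made-up numbers rather than the output of a DNN, and the mode is located by a plain grid search. All function names are hypothetical.

```python
import math

def gaussian_pdf(d, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-0.5 * (d - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def gmm_pdf(d, weights, means, variances):
    """Mixture density: sum over k of w_k * N(d; mu_k, sigma_k^2)."""
    return sum(w * gaussian_pdf(d, m, v)
               for w, m, v in zip(weights, means, variances))

def log_likelihood(durations, params_per_phone):
    """Sum over phones p of ln f_D(d_p; theta_p), the quantity ML training maximises."""
    return sum(math.log(gmm_pdf(d, *theta))
               for d, theta in zip(durations, params_per_phone))

def gmm_mode(weights, means, variances, lo, hi, steps=10000):
    """Approximate the mode of the mixture by evaluating it on a uniform grid."""
    grid = (lo + (hi - lo) * i / steps for i in range(steps + 1))
    return max(grid, key=lambda d: gmm_pdf(d, weights, means, variances))

# Hypothetical parameters (weights, means, variances) for one phone.
theta = ([0.6, 0.4], [5.0, 12.0], [1.0, 4.0])

mode = gmm_mode(*theta, lo=0.0, hi=20.0)
ll = log_likelihood([4.5, 5.5, 11.0], [theta] * 3)
print(f"mode = {mode:.3f}, log-likelihood = {ll:.3f}")
```

In a real mixture density network, `theta` would differ per phone because it is predicted from that phone's input `l`.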
We describe two methods to create speech with robust durations:
Additional components can absorb outlying datapoints
Only generate from a single component:
$k_{\max}(l) = \arg\max_k \omega_k(l)$
$\hat{d}(l) = \arg\max_d f_N(d; \mu_{k_{\max}(l)}, \mathrm{diag}(\sigma^2_{k_{\max}(l)}))$
the output
Zen and Senior (2014)
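Generating from the maximum-weight component reduces to a lookup, because the mode of a Gaussian is its mean. The sketch below assumes scalar durations and uses made-up mixture parameters; `generate_from_max_weight` is a hypothetical helper name.

```python
def generate_from_max_weight(weights, means):
    """Pick the component with the largest weight and return its index and mean.

    The mean is the mode of that Gaussian component, so no search is needed.
    """
    k_max = max(range(len(weights)), key=lambda k: weights[k])
    return k_max, means[k_max]

# Hypothetical three-component prediction for one phone.
weights = [0.7, 0.2, 0.1]
means = [5.0, 20.0, 1.0]

k_max, duration = generate_from_max_weight(weights, means)
print(k_max, duration)
```

The low-weight components, which may have absorbed outlying datapoints during training, do not influence the generated duration.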
By changing the estimation principle away from MLE, we can get robustness with mathematical guarantees
model
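In this work, we consider the estimation principle
$\hat{W}_\beta = \arg\max_W \sum_p \left[ \left(f_D(d_p; \theta(l_p; W))\right)^{\beta} - \frac{\beta}{1+\beta} \int \left(f_D(x; \theta(l_p; W))\right)^{1+\beta} \, dx \right]$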
so-called density power divergence or β-divergence
One can show that β-estimation is:
a certain estimation accuracy
MLE is recovered in the limit β → 0
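The β → 0 limit can be checked directly. The derivation below assumes the per-datapoint objective has the density power divergence form of Basu et al. (1998); the rescaled objective $J_\beta$ is notation introduced here for illustration only.

```latex
% Dividing the per-datapoint objective by beta > 0 and subtracting the
% constant 1/beta does not change its maximiser:
J_\beta(d; \theta)
  = \frac{1}{\beta}\left[ f_D(d; \theta)^{\beta} - 1 \right]
  - \frac{1}{1+\beta} \int f_D(x; \theta)^{1+\beta} \, dx

% As beta -> 0, the first term tends to the logarithm and the integral
% tends to the total probability mass:
\lim_{\beta \to 0} \frac{f_D(d; \theta)^{\beta} - 1}{\beta} = \ln f_D(d; \theta),
\qquad
\lim_{\beta \to 0} \frac{1}{1+\beta} \int f_D(x; \theta)^{1+\beta} \, dx
  = \int f_D(x; \theta) \, dx = 1

% Hence the limit is the log-likelihood up to an additive constant:
\lim_{\beta \to 0} J_\beta(d; \theta) = \ln f_D(d; \theta) - 1
```

Since the constant does not depend on θ, the maximiser in the limit coincides with the maximum-likelihood estimate.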
Gaussian distribution fit using β = 1
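A β = 1 fit of this kind can be sketched in pure Python. The example rests on several assumptions: the data are made-up toy values, the objective is the density power divergence criterion of Basu et al. (1998) for a single 1-D Gaussian (for which the integral term has a closed form), and a brute-force grid search stands in for gradient-based training. The function names are hypothetical.

```python
import math

def beta_objective(data, mu, sigma, beta):
    """Average of f(x)^beta - beta/(1+beta) * integral(f^(1+beta)) for N(mu, sigma^2)."""
    norm = 1.0 / (sigma * math.sqrt(2.0 * math.pi))  # peak height of the Gaussian
    # Closed form for a Gaussian: integral of f^(1+beta) dx = norm^beta / sqrt(1+beta)
    integral = norm ** beta / math.sqrt(1.0 + beta)
    fit = sum((norm * math.exp(-0.5 * ((x - mu) / sigma) ** 2)) ** beta
              for x in data) / len(data)
    return fit - beta / (1.0 + beta) * integral

def fit_gaussian_beta(data, beta, mu_grid, sigma_grid):
    """Return the (mu, sigma) grid point that maximises the beta-objective."""
    return max(((mu, s) for mu in mu_grid for s in sigma_grid),
               key=lambda p: beta_objective(data, p[0], p[1], beta))

# Hypothetical toy data: five inliers and one outlier.
data = [-1.0, -0.5, 0.0, 0.5, 1.0, 10.0]
mu_grid = [-2.0 + 0.05 * i for i in range(281)]    # roughly -2.0 to 12.0
sigma_grid = [0.1 + 0.05 * i for i in range(99)]   # roughly 0.1 to 5.0

# Maximum-likelihood fit for comparison.
mu_ml = sum(data) / len(data)
sigma_ml = math.sqrt(sum((x - mu_ml) ** 2 for x in data) / len(data))

mu_beta, sigma_beta = fit_gaussian_beta(data, 1.0, mu_grid, sigma_grid)
print(f"MLE:      mu = {mu_ml:.2f}, sigma = {sigma_ml:.2f}")
print(f"beta = 1: mu = {mu_beta:.2f}, sigma = {sigma_beta:.2f}")
```

With β = 1 the outlier contributes almost nothing to the objective, so the fitted Gaussian stays centred on the inliers, whereas the maximum-likelihood fit is pulled towards the outlier.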
found TTS data (≈ 3 hours)
(Theano)
VOC Vocoded held-out natural speech (top line)
Same acoustic DNN, but different duration models:
FRC Synthesised speech with oracle durations (forced-aligned to VOC)
BOT Mean monophone duration (bottom line)
MSE MMSE DNN (baseline)
MLE1 Single-component, deep MDN maximising likelihood
MLE3 Three-component (K = 3), deep MDN maximising likelihood; only the maximum-weight component is used for synthesis
B75 Single-component, deep MDN optimising β-divergence, set to include approximately 75% of datapoints (β = 0.358)
B50 Single-component, deep MDN optimising β-divergence, set to include approximately 50% of datapoints (β = 0.663)
RMSE with respect to FRC on test-data subsets:
Relative RMSE on test-data subsets (with BOT at 1.0):
[Figure: RMSE relative to bottom line (y-axis, 0.0 to 0.8) against percentage of least-residual datapoints retained (x-axis, 10% to 100%), for MSE, MLE1, MLE3, B75 and B50]
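One way to compute such a curve is sketched below. It assumes that, for each system, the retained subset consists of that system's own smallest absolute residuals against the reference durations; the slides do not spell out the exact procedure, so this is an assumption, and the duration values are made up. `trimmed_rmse` is a hypothetical helper name.

```python
import math

def trimmed_rmse(predicted, reference, keep_fraction):
    """RMSE over the keep_fraction of datapoints with the smallest absolute residuals."""
    residuals = sorted(abs(p - r) for p, r in zip(predicted, reference))
    n_keep = max(1, round(keep_fraction * len(residuals)))
    kept = residuals[:n_keep]
    return math.sqrt(sum(e * e for e in kept) / n_keep)

# Hypothetical durations: reference, a model with one gross error, and a constant baseline.
reference = [10.0, 12.0, 8.0, 15.0, 9.0]
model = [11.0, 12.0, 9.0, 15.0, 30.0]
baseline = [11.0, 11.0, 11.0, 11.0, 11.0]

for fraction in (0.8, 1.0):
    relative = (trimmed_rmse(model, reference, fraction)
                / trimmed_rmse(baseline, reference, fraction))
    print(f"{fraction:.0%} retained: relative RMSE = {relative:.3f}")
```

Discarding the worst-fitting datapoints shows how well a system predicts typical durations, separately from how it handles the hardest cases.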
Test results, after converting to ranks (higher is better):
[Figure: ranks (1 to 8) for systems VOC, FRC, BOT, MSE, MLE1, MLE3, B75 and B50]
majority of the datapoints
typical speech
methods
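H. Zen and A. Senior, “Deep mixture density networks for acoustic modeling in statistical parametric speech synthesis,” in Proc. ICASSP, 2014, pp. 3844–3848.
A. Basu, I. R. Harris, N. L. Hjort, and M. C. Jones, “Robust and efficient estimation by minimising a density power divergence,” Biometrika, vol. 85, no. 3, pp. 549–559, 1998.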
Example utterance from held-out chapter: VOC FRC BOT MSE MLE1 MLE3 B75 B50
Audiobooks are a classic source of found TTS data
[0.01, 0.99]
state durations d
feature distributions
trajectories
covariances
Deep MDN code courtesy of Zhizheng Wu
Current research directions:
Journal paper in preparation