Where do the improvements come from in sequence-to-sequence neural TTS?
Oliver Watts ◆ Gustav Eje Henter ◆ Jason Fong ◆ Cassia Valentini-Botinhao
[Figure: Tacotron architecture — character embeddings pass through a pre-net and CBHG encoder; an attention RNN and decoder RNN (each preceded by a pre-net) generate seq2seq targets with reduction factor r=3, starting from a <GO> frame, with attention applied to all decoder steps; a CBHG post-net produces a linear-scale spectrogram, and the waveform is recovered by Griffin-Lim reconstruction.]
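The diagram ends with Griffin-Lim reconstruction: the model predicts only an STFT magnitude, and the phase is estimated by iterating between the time and frequency domains while re-imposing the target magnitude. A minimal sketch using SciPy (function name and parameter values are illustrative, not from the paper):

```python
import numpy as np
from scipy.signal import stft, istft

def griffin_lim(magnitude, n_iter=50, nperseg=1024, noverlap=768, seed=0):
    """Estimate a waveform whose STFT magnitude matches `magnitude`.

    Start from random phase; each round, invert to a waveform, re-analyse,
    keep the resulting phase, and re-impose the target magnitude.
    """
    rng = np.random.default_rng(seed)
    phase = np.exp(2j * np.pi * rng.random(magnitude.shape))
    spec = magnitude * phase
    for _ in range(n_iter):
        _, x = istft(spec, nperseg=nperseg, noverlap=noverlap)
        _, _, rebuilt = stft(x, nperseg=nperseg, noverlap=noverlap)
        n = min(rebuilt.shape[1], magnitude.shape[1])  # guard frame-count drift
        spec = magnitude[:, :n] * np.exp(1j * np.angle(rebuilt[:, :n]))
    _, x = istft(spec, nperseg=nperseg, noverlap=noverlap)
    return x
```

Because phase is only estimated, Griffin-Lim output is audibly less natural than a learned vocoder, which is one of the contrasts examined later in this talk.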
TACOTRON: TOWARDS END-TO-END SPEECH SYNTHESIS
Yuxuan Wang∗, RJ Skerry-Ryan∗, Daisy Stanton, Yonghui Wu, Ron J. Weiss†, Navdeep Jaitly, Zongheng Yang, Ying Xiao∗, Zhifeng Chen, Samy Bengio†, Quoc Le, Yannis Agiomyrgiannakis, Rob Clark, Rif A. Saurous∗ Google, Inc. {yxwang,rjryan,rif}@google.com
ABSTRACT
A text-to-speech synthesis system typically consists of multiple stages, such as a text analysis frontend, an acoustic model and an audio synthesis module. Building these components often requires extensive domain expertise and may contain brittle design choices. In this paper, we present Tacotron, an end-to-end generative text-to-speech model that synthesizes speech directly from characters. Given <text, audio> pairs, the model can be trained completely from scratch with random initialization. We present several key techniques to make the sequence-to-sequence framework perform well for this challenging task. Tacotron achieves a 3.82 subjective 5-scale mean opinion score on US English, outperforming a production parametric system in terms of naturalness. In addition, since Tacotron […]
arXiv:1703.10135v2 [cs.CL] 6 Apr 2017
STATISTICAL PARAMETRIC SPEECH SYNTHESIS USING DEEP NEURAL NETWORKS Heiga Zen, Andrew Senior, Mike Schuster Google
{heigazen,andrewsenior,schuster}@google.com

ABSTRACT

Conventional approaches to statistical parametric speech synthesis typically use decision tree-clustered context-dependent hidden Markov models (HMMs) to represent probability densities of speech parameters given texts. Speech parameters are generated from the probability densities to maximize their output probabilities, then a speech waveform is reconstructed from the generated parameters. This approach is reasonably effective but has a couple of limitations, e.g. decision trees are inefficient to model complex context dependencies. This paper examines an alternative scheme that is based on a deep neural network (DNN). The relationship between input texts and their acoustic realizations is modeled by a DNN. The use of the DNN can address some limitations of the conventional approach. Experimental results show that the DNN-based systems outperformed the HMM-based systems with similar numbers of parameters.

Index Terms— Statistical parametric speech synthesis; Hidden Markov model; Deep neural network
1. INTRODUCTION
Statistical parametric speech synthesis based on hidden Markov models (HMMs) [1] has grown in popularity in the last decade. This approach has various advantages over the concatenative speech synthesis approach [2], such as the flexibility to change its voice characteristics [3–6], small footprint [7–9], and robustness [10]. However, its major limitation is the quality of the synthesized speech. […] HMM through a binary decision tree, where one context-related binary question is associated with each non-terminal node. The number of clusters, namely the number of terminal nodes, determines the model complexity. The decision tree is constructed by sequentially selecting the questions which yield the largest log likelihood gain of the training data. The size of the tree is controlled using a pre-determined threshold of log likelihood gain, a model complexity penalty [14,15], or cross validation [16,17]. With the use of context-related questions and state parameter sharing, the unseen contexts and data sparsity problems are effectively addressed. As the method has been successfully used in speech recognition, HMM-based statistical parametric speech synthesis naturally employs a similar approach to model very rich contexts.

Although the decision tree-clustered context-dependent HMMs work reasonably effectively in statistical parametric speech synthesis, there are some limitations. First, it is inefficient to express complex context dependencies such as XOR, parity or multiplex problems by decision trees [18]. To represent such cases, decision trees will be prohibitively large. Second, this approach divides the input space and uses separate parameters for each region, with each region associated with a terminal node of the decision tree. This results in fragmenting the training data and reducing the amount of the data that can be used in clustering the other contexts and estimating the distributions [19]. Having a prohibitively large tree and fragmenting training data will both lead to overfitting and degrade the quality of the synthesized speech.

To address these limitations, this paper examines an alternative scheme that is based on a deep architecture [20]. The decision trees […]
[Figure: DNN-based pipeline — TEXT undergoes text analysis and input feature extraction, yielding per-frame input features (binary & numeric) x_t for frames 1…T; a DNN (input layer, hidden layers h, output layer) maps each x_t to outputs y_t, the statistics (mean & variance) of the speech parameter vector sequence; parameter generation and waveform synthesis then produce SPEECH.]
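The pipeline in the figure predicts acoustic parameters one frame at a time, each frame independently of the others. A toy sketch of that frame-wise regression (all layer sizes, weights and names below are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions: 400 binary/numeric linguistic features per frame,
# two hidden layers, 60-dim acoustic parameter vector per frame.
d_in, d_h, d_out = 400, 256, 60
W1, b1 = rng.normal(0, 0.01, (d_in, d_h)), np.zeros(d_h)
W2, b2 = rng.normal(0, 0.01, (d_h, d_h)), np.zeros(d_h)
W3, b3 = rng.normal(0, 0.01, (d_h, d_out)), np.zeros(d_out)

def acoustic_model(frames):
    """Map frame-level linguistic features to acoustic parameters.

    Each frame is predicted independently: no attention, and no feedback
    from previously generated acoustic frames.
    """
    h = np.tanh(frames @ W1 + b1)
    h = np.tanh(h @ W2 + b2)
    return h @ W3 + b3

T = 100                       # frames in the utterance
X = rng.random((T, d_in))     # from text analysis + duration model
Y = acoustic_model(X)         # (T, 60) acoustic parameters
```

The frame independence of this design is exactly one of the contrasts with the new paradigm examined later in the talk.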
2013: ‘Old paradigm’ 2017: ‘New paradigm’
Old paradigm: Merlin (Wu et al. 2016) — github.com/CSTR-Edinburgh/merlin
New paradigm: DCTTS (Tachibana et al. 2018) — github.com/Kyubyong/dc_tts
[Figure: Naturalness rating — normalised subjective rating (10–100) for systems M, G1, G1H, G1TH, G1HA, G1THA]
Where do the improvements come from?
Old paradigm: Front end → Duration model/forced alignment → Acoustic model
New paradigm: Text encoder → attention → Audio decoder (with an Audio encoder feeding back the acoustic history) — replacing, respectively, the front end, the duration model/forced alignment, and the acoustic model of the old paradigm.
Differences investigated:
- Text encoder vs. front end
- Attention vs. precomputed alignment
- Autoregression vs. frame independence
- Vocoder: WORLD vs. Griffin-Lim
- Loss functions
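The alignment contrast can be made concrete: in the old paradigm a duration model fixes the alignment up front by repeating each unit's encoding, while in the new paradigm attention computes a soft alignment at every decoder step. A minimal numpy sketch (function names and toy values are illustrative):

```python
import numpy as np

def upsample_by_duration(encodings, durations):
    """Old paradigm: precomputed durations give a hard, fixed alignment."""
    return np.repeat(encodings, durations, axis=0)

def attention_step(query, keys, values):
    """New paradigm: one soft-attention step; the alignment is learned."""
    scores = keys @ query                 # similarity per encoder position
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()              # softmax over positions
    context = weights @ values            # weighted sum of encodings
    return context, weights

enc = np.eye(3)                               # 3 encoder positions, 3-dim
fixed = upsample_by_duration(enc, [2, 1, 3])  # 6 frames, hard alignment
ctx, w = attention_step(np.array([10., 0., 0.]), enc, enc)
```

With attention, a bad alignment at one step can derail the rest of the utterance, which is one way the mechanism can 'break' in the experiments below.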
[Figure: Naturalness rating — normalised subjective rating (10–100) for conditions M, G1, G1H, G1TH, G1HA, G1THA, labelled: Merlin; DCTTS; DCTTS minus text encoder; DCTTS minus attention; DCTTS minus attention and text encoder; DCTTS minus attention, text encoder and acoustic history]
Samples
- LJSpeech
- MUS(HRA) listening test
- 24 paid native speakers
- 60 Harvard sentences
Main findings
1. A learned front-end (text encoder) always improves quality (with or without attention, but more so with attention).
2. Acoustic feedback has a very strong beneficial impact on quality.
3. Attention 'breaks' without a learned front-end, and helps (but not significantly) with a learned front-end.
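The acoustic feedback in finding 2 is autoregression: each decoder step is conditioned on the frame(s) the model has already generated, unlike the frame-independent old-paradigm acoustic model. A toy sketch of that feedback loop (all weights and sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d_ctx, d_frame = 8, 4
W_ctx = rng.normal(0, 0.1, (d_ctx, d_frame))
W_prev = rng.normal(0, 0.1, (d_frame, d_frame))  # the acoustic-feedback path

def decode(contexts):
    """Generate acoustic frames one by one, feeding each output back in."""
    prev = np.zeros(d_frame)                       # <GO> frame
    frames = []
    for c in contexts:
        prev = np.tanh(c @ W_ctx + prev @ W_prev)  # depends on own past output
        frames.append(prev)
    return np.stack(frames)

out = decode(rng.random((10, d_ctx)))              # (10, 4) frames
```

Removing `W_prev` (zeroing the feedback) would recover a frame-independent model, which is the ablation labelled 'minus acoustic history' in the figures above.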
[Figure: Naturalness rating — normalised subjective rating (10–100) for the full set of systems: M, MM, W2, W2T, W2H, G2, G1, G1H, G1TH, G1HA, G1THA]
Text encoder: ✔   Attention: ❓   Autoregression: ✔
Oliver Watts ◆ Gustav Eje Henter ◆ Jason Fong ◆ Cassia Valentini-Botinhao