Where do the improvements come from in sequence-to-sequence neural TTS?
Oliver Watts ◆ Gustav Eje Henter ◆ Jason Fong ◆ Cassia Valentini-Botinhao
[Figure: Tacotron architecture — character embeddings pass through a pre-net and CBHG encoder; an attention RNN and decoder RNN (each preceded by a pre-net) generate seq2seq targets with reduction factor r=3, starting from a <GO> frame, with attention applied to all decoder steps; a CBHG post-net produces a linear-scale spectrogram, and the waveform is recovered by Griffin-Lim reconstruction.]
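The diagram ends with Griffin-Lim reconstruction: the model predicts only an STFT magnitude, and the phase is estimated by iterating between the time and frequency domains while re-imposing the target magnitude. A minimal sketch using SciPy (function name and parameter values are illustrative, not from the paper):

```python
import numpy as np
from scipy.signal import stft, istft

def griffin_lim(magnitude, n_iter=50, nperseg=1024, noverlap=768, seed=0):
    """Estimate a waveform whose STFT magnitude matches `magnitude`.

    Start from random phase; each round, invert to a waveform, re-analyse,
    keep the resulting phase, and re-impose the target magnitude.
    """
    rng = np.random.default_rng(seed)
    phase = np.exp(2j * np.pi * rng.random(magnitude.shape))
    spec = magnitude * phase
    for _ in range(n_iter):
        _, x = istft(spec, nperseg=nperseg, noverlap=noverlap)
        _, _, rebuilt = stft(x, nperseg=nperseg, noverlap=noverlap)
        n = min(rebuilt.shape[1], magnitude.shape[1])  # guard frame-count drift
        spec = magnitude[:, :n] * np.exp(1j * np.angle(rebuilt[:, :n]))
    _, x = istft(spec, nperseg=nperseg, noverlap=noverlap)
    return x
```

Because phase is only estimated, Griffin-Lim output is audibly less natural than a learned vocoder, which is one of the contrasts examined later in this talk.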
TACOTRON: TOWARDS END-TO-END SPEECH SYNTHESIS
Yuxuan Wang∗, RJ Skerry-Ryan∗, Daisy Stanton, Yonghui Wu, Ron J. Weiss†, Navdeep Jaitly, Zongheng Yang, Ying Xiao∗, Zhifeng Chen, Samy Bengio†, Quoc Le, Yannis Agiomyrgiannakis, Rob Clark, Rif A. Saurous∗ Google, Inc. {yxwang,rjryan,rif}@google.com
ABSTRACT
A text-to-speech synthesis system typically consists of multiple stages, such as a text analysis frontend, an acoustic model and an audio synthesis module. Building these components often requires extensive domain expertise and may contain brittle design choices. In this paper, we present Tacotron, an end-to-end generative text-to-speech model that synthesizes speech directly from characters. Given <text, audio> pairs, the model can be trained completely from scratch with random initialization. We present several key techniques to make the sequence-to-sequence framework perform well for this challenging task. Tacotron achieves a 3.82 subjective 5-scale mean opinion score on US English, outperforming a production parametric system in terms of naturalness. In addition, since Tacotron […]
arXiv:1703.10135v2 [cs.CL] 6 Apr 2017
STATISTICAL PARAMETRIC SPEECH SYNTHESIS USING DEEP NEURAL NETWORKS Heiga Zen, Andrew Senior, Mike Schuster Google
{heigazen,andrewsenior,schuster}@google.com

ABSTRACT

Conventional approaches to statistical parametric speech synthesis typically use decision tree-clustered context-dependent hidden Markov models (HMMs) to represent probability densities of speech parameters given texts. Speech parameters are generated from the probability densities to maximize their output probabilities, then a speech waveform is reconstructed from the generated parameters. This approach is reasonably effective but has a couple of limitations, e.g. decision trees are inefficient to model complex context dependencies. This paper examines an alternative scheme that is based on a deep neural network (DNN). The relationship between input texts and their acoustic realizations is modeled by a DNN. The use of the DNN can address some limitations of the conventional approach. Experimental results show that the DNN-based systems outperformed the HMM-based systems with similar numbers of parameters.

Index Terms— Statistical parametric speech synthesis; Hidden Markov model; Deep neural network
1. INTRODUCTION
Statistical parametric speech synthesis based on hidden Markov models (HMMs) [1] has grown in popularity in the last decade. This approach has various advantages over the concatenative speech synthesis approach [2], such as the flexibility to change its voice characteristics [3–6], small footprint [7–9], and robustness [10]. However, its major limitation is the quality of the synthesized speech. […] HMM through a binary decision tree, where one context-related binary question is associated with each non-terminal node. The number of clusters, namely the number of terminal nodes, determines the model complexity. The decision tree is constructed by sequentially selecting the questions which yield the largest log likelihood gain of the training data. The size of the tree is controlled using a pre-determined threshold of log likelihood gain, a model complexity penalty [14,15], or cross validation [16,17]. With the use of context-related questions and state parameter sharing, the unseen contexts and data sparsity problems are effectively addressed. As the method has been successfully used in speech recognition, HMM-based statistical parametric speech synthesis naturally employs a similar approach to model very rich contexts.

Although the decision tree-clustered context-dependent HMMs work reasonably effectively in statistical parametric speech synthesis, there are some limitations. First, it is inefficient to express complex context dependencies such as XOR, parity or multiplex problems by decision trees [18]. To represent such cases, decision trees will be prohibitively large. Second, this approach divides the input space and uses separate parameters for each region, with each region associated with a terminal node of the decision tree. This results in fragmenting the training data and reducing the amount of the data that can be used in clustering the other contexts and estimating the distributions [19]. Having a prohibitively large tree and fragmenting training data will both lead to overfitting and degrade the quality of the synthesized speech.

To address these limitations, this paper examines an alternative scheme that is based on a deep architecture [20]. The decision trees […]
[Figure: DNN-based pipeline — TEXT undergoes text analysis and input feature extraction, yielding per-frame input features (binary & numeric) x_t for frames 1…T; a DNN (input layer, hidden layers h, output layer) maps each x_t to outputs y_t, the statistics (mean & variance) of the speech parameter vector sequence; parameter generation and waveform synthesis then produce SPEECH.]
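The pipeline in the figure predicts acoustic parameters one frame at a time, each frame independently of the others. A toy sketch of that frame-wise regression (all layer sizes, weights and names below are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions: 400 binary/numeric linguistic features per frame,
# two hidden layers, 60-dim acoustic parameter vector per frame.
d_in, d_h, d_out = 400, 256, 60
W1, b1 = rng.normal(0, 0.01, (d_in, d_h)), np.zeros(d_h)
W2, b2 = rng.normal(0, 0.01, (d_h, d_h)), np.zeros(d_h)
W3, b3 = rng.normal(0, 0.01, (d_h, d_out)), np.zeros(d_out)

def acoustic_model(frames):
    """Map frame-level linguistic features to acoustic parameters.

    Each frame is predicted independently: no attention, and no feedback
    from previously generated acoustic frames.
    """
    h = np.tanh(frames @ W1 + b1)
    h = np.tanh(h @ W2 + b2)
    return h @ W3 + b3

T = 100                       # frames in the utterance
X = rng.random((T, d_in))     # from text analysis + duration model
Y = acoustic_model(X)         # (T, 60) acoustic parameters
```

The frame independence of this design is exactly one of the contrasts with the new paradigm examined later in the talk.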
2013: ‘Old paradigm’ 2017: ‘New paradigm’
Old paradigm: Merlin (Wu et al. 2016) — github.com/CSTR-Edinburgh/merlin
New paradigm: DCTTS (Tachibana et al. 2018) — github.com/Kyubyong/dc_tts
[Figure: Naturalness rating — normalised subjective rating (10–100) for systems M, G1, G1H, G1TH, G1HA, G1THA]
Where do the improvements come from?
Old paradigm: Front end → Duration model/forced alignment → Acoustic model
New paradigm: Text encoder → attention → Audio decoder (with an Audio encoder feeding back the acoustic history) — replacing, respectively, the front end, the duration model/forced alignment, and the acoustic model of the old paradigm.
Differences investigated:
- Text encoder vs. front end
- Attention vs. precomputed alignment
- Autoregression vs. frame independence
- Vocoder: WORLD vs. Griffin-Lim
- Loss functions
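The alignment contrast can be made concrete: in the old paradigm a duration model fixes the alignment up front by repeating each unit's encoding, while in the new paradigm attention computes a soft alignment at every decoder step. A minimal numpy sketch (function names and toy values are illustrative):

```python
import numpy as np

def upsample_by_duration(encodings, durations):
    """Old paradigm: precomputed durations give a hard, fixed alignment."""
    return np.repeat(encodings, durations, axis=0)

def attention_step(query, keys, values):
    """New paradigm: one soft-attention step; the alignment is learned."""
    scores = keys @ query                 # similarity per encoder position
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()              # softmax over positions
    context = weights @ values            # weighted sum of encodings
    return context, weights

enc = np.eye(3)                               # 3 encoder positions, 3-dim
fixed = upsample_by_duration(enc, [2, 1, 3])  # 6 frames, hard alignment
ctx, w = attention_step(np.array([10., 0., 0.]), enc, enc)
```

With attention, a bad alignment at one step can derail the rest of the utterance, which is one way the mechanism can 'break' in the experiments below.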
[Figure: Naturalness rating — normalised subjective rating (10–100) for conditions M, G1, G1H, G1TH, G1HA, G1THA, labelled: Merlin; DCTTS; DCTTS minus text encoder; DCTTS minus attention; DCTTS minus attention and text encoder; DCTTS minus attention, text encoder and acoustic history]
Samples
- LJSpeech
- MUS(HRA) listening test
- 24 paid native speakers
- 60 Harvard sentences
Main findings
1. A learned front-end (text encoder) always improves quality (with or without attention, but more so with attention).
2. Acoustic feedback has a very strong beneficial impact on quality.
3. Attention 'breaks' without a learned front-end, and helps (but not significantly) with a learned front-end.
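The acoustic feedback in finding 2 is autoregression: each decoder step is conditioned on the frame(s) the model has already generated, unlike the frame-independent old-paradigm acoustic model. A toy sketch of that feedback loop (all weights and sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d_ctx, d_frame = 8, 4
W_ctx = rng.normal(0, 0.1, (d_ctx, d_frame))
W_prev = rng.normal(0, 0.1, (d_frame, d_frame))  # the acoustic-feedback path

def decode(contexts):
    """Generate acoustic frames one by one, feeding each output back in."""
    prev = np.zeros(d_frame)                       # <GO> frame
    frames = []
    for c in contexts:
        prev = np.tanh(c @ W_ctx + prev @ W_prev)  # depends on own past output
        frames.append(prev)
    return np.stack(frames)

out = decode(rng.random((10, d_ctx)))              # (10, 4) frames
```

Removing `W_prev` (zeroing the feedback) would recover a frame-independent model, which is the ablation labelled 'minus acoustic history' in the figures above.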
[Figure: Naturalness rating — normalised subjective rating (10–100) for the full set of systems: M, MM, W2, W2T, W2H, G2, G1, G1H, G1TH, G1HA, G1THA]
Text encoder: ✔   Attention: ❓   Autoregression: ✔
Oliver Watts ◆ Gustav Eje Henter ◆ Jason Fong ◆ Cassia Valentini-Botinhao