

SLIDE 1

Where do the improvements come from in sequence-to-sequence neural TTS?

Oliver Watts ◆ Gustav Eje Henter ◆ Jason Fong ◆ Cassia Valentini-Botinhao

SLIDE 2

[Figure: Tacotron architecture (Wang et al. 2017). Character embeddings pass through a pre-net and a CBHG module to form the text encoder. A decoder built from an attention RNN and a decoder RNN, each step fed through a pre-net and starting from a <GO> frame, predicts the seq2seq target with r=3 frames per step; attention is applied to all decoder steps. A post-processing CBHG produces a linear-scale spectrogram, and the waveform is obtained by Griffin-Lim reconstruction.]
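The final block in the figure, Griffin-Lim reconstruction, recovers a waveform from the predicted magnitude spectrogram by iteratively re-estimating phase. Below is a minimal sketch of that algorithm, assuming librosa is available; the FFT settings are illustrative, and librosa.griffinlim provides a ready-made equivalent.

```python
import numpy as np
import librosa

def griffin_lim(mag, n_iter=60, n_fft=2048, hop=512):
    """Recover a waveform from a magnitude spectrogram `mag`
    of shape (1 + n_fft // 2, n_frames) by iterative phase estimation."""
    # Start from a random phase estimate.
    angles = np.exp(2j * np.pi * np.random.rand(*mag.shape))
    stft = mag * angles
    for _ in range(n_iter):
        # Invert to a waveform, re-analyse, and keep only the new phase.
        wav = librosa.istft(stft, hop_length=hop)
        rebuilt = librosa.stft(wav, n_fft=n_fft, hop_length=hop)
        angles = np.exp(1j * np.angle(rebuilt))
        stft = mag * angles
    return librosa.istft(stft, hop_length=hop)
```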

TACOTRON: TOWARDS END-TO-END SPEECH SYNTHESIS

Yuxuan Wang∗, RJ Skerry-Ryan∗, Daisy Stanton, Yonghui Wu, Ron J. Weiss†, Navdeep Jaitly, Zongheng Yang, Ying Xiao∗, Zhifeng Chen, Samy Bengio†, Quoc Le, Yannis Agiomyrgiannakis, Rob Clark, Rif A. Saurous∗ Google, Inc. {yxwang,rjryan,rif}@google.com

ABSTRACT

A text-to-speech synthesis system typically consists of multiple stages, such as a text analysis frontend, an acoustic model and an audio synthesis module. Building these components often requires extensive domain expertise and may contain brittle design choices. In this paper, we present Tacotron, an end-to-end generative text-to-speech model that synthesizes speech directly from characters. Given <text, audio> pairs, the model can be trained completely from scratch with random initialization. We present several key techniques to make the sequence-to-sequence framework perform well for this challenging task. Tacotron achieves a 3.82 subjective 5-scale mean opinion score on US English, outperforming a production parametric system in terms of naturalness. In addition, since Tacotron […]

arXiv:1703.10135v2 [cs.CL] 6 Apr 2017

STATISTICAL PARAMETRIC SPEECH SYNTHESIS USING DEEP NEURAL NETWORKS
Heiga Zen, Andrew Senior, Mike Schuster, Google
{heigazen,andrewsenior,schuster}@google.com

ABSTRACT

Conventional approaches to statistical parametric speech synthesis typically use decision tree-clustered context-dependent hidden Markov models (HMMs) to represent probability densities of speech parameters given texts. Speech parameters are generated from the probability densities to maximize their output probabilities, then a speech waveform is reconstructed from the generated parameters. This approach is reasonably effective but has a couple of limitations, e.g. decision trees are inefficient to model complex context dependencies. This paper examines an alternative scheme that is based on a deep neural network (DNN). The relationship between input texts and their acoustic realizations is modeled by a DNN. The use of the DNN can address some limitations of the conventional approach. Experimental results show that the DNN-based systems outperformed the HMM-based systems with similar numbers of parameters.

Index Terms— Statistical parametric speech synthesis; Hidden Markov model; Deep neural network

1. INTRODUCTION

Statistical parametric speech synthesis based on hidden Markov models (HMMs) [1] has grown in popularity in the last decade. This approach has various advantages over the concatenative speech synthesis approach [2], such as the flexibility to change its voice characteristics [3–6], small footprint [7–9], and robustness [10]. However, its major limitation is the quality of the synthesized speech. Zen […] HMM through a binary decision tree, where one context-related binary question is associated with each non-terminal node. The number of clusters, namely the number of terminal nodes, determines the model complexity. The decision tree is constructed by sequentially selecting the questions which yield the largest log likelihood gain of the training data. The size of the tree is controlled using a pre-determined threshold of log likelihood gain, a model complexity penalty [14,15], or cross validation [16,17]. With the use of context-related questions and state parameter sharing, the unseen contexts and data sparsity problems are effectively addressed. As the method has been successfully used in speech recognition, HMM-based statistical parametric speech synthesis naturally employs a similar approach to model very rich contexts. Although the decision tree-clustered context-dependent HMMs work reasonably effectively in statistical parametric speech synthesis, there are some limitations. First, it is inefficient to express complex context dependencies such as XOR, parity or multiplex problems by decision trees [18]. To represent such cases, decision trees will be prohibitively large. Second, this approach divides the input space and uses separate parameters for each region, with each region associated with a terminal node of the decision tree. This results in fragmenting the training data and reducing the amount of the data that can be used in clustering the other contexts and estimating the distributions [19]. Having a prohibitively large tree and fragmenting training data will both lead to overfitting and degrade the quality of the synthesized speech. To address these limitations, this paper examines an alternative scheme that is based on a deep architecture [20]. The decision trees […]


[Figure: DNN-based speech synthesis pipeline (Zen et al. 2013). TEXT undergoes text analysis and input feature extraction, giving per-frame input features (binary and numeric) x_t for frames t = 1…T. These pass through the DNN's input layer, hidden layers h_t and output layer to produce y_t: statistics (mean and variance) of the speech parameter vector sequence. Parameter generation and waveform synthesis then yield SPEECH.]

2013: ‘Old paradigm’ 2017: ‘New paradigm’
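To make the 2013 pipeline above concrete, here is a minimal sketch of a DNN acoustic model in that style: a feed-forward network mapping each frame's linguistic features to acoustic features, with every frame predicted independently. All layer sizes and feature dimensions are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class DNNAcousticModel(nn.Module):
    """Feed-forward acoustic model: per-frame linguistic features in,
    per-frame acoustic (vocoder) parameters out; no feedback between frames."""
    def __init__(self, in_dim=425, hidden=1024, out_dim=187, n_hidden=3):
        super().__init__()
        layers, d = [], in_dim
        for _ in range(n_hidden):
            layers += [nn.Linear(d, hidden), nn.Tanh()]
            d = hidden
        layers.append(nn.Linear(d, out_dim))  # linear output layer
        self.net = nn.Sequential(*layers)

    def forward(self, x):        # x: (n_frames, in_dim)
        return self.net(x)       # (n_frames, out_dim)

# Usage: 500 frames of (hypothetical) linguistic input features.
model = DNNAcousticModel()
acoustic = model(torch.randn(500, 425))
```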

SLIDE 3

Old paradigm New paradigm

SLIDE 4

Old paradigm: Merlin (Wu et al. 2016), github.com/CSTR-Edinburgh/merlin
New paradigm: DCTTS (Tachibana et al. 2018), github.com/Kyubyong/dc_tts

SLIDE 5

Old paradigm: Merlin (Wu et al. 2016), github.com/CSTR-Edinburgh/merlin
New paradigm: DCTTS (Tachibana et al. 2018), github.com/Kyubyong/dc_tts

[Figure: "Naturalness rating": normalised subjective ratings (10–100) for systems M, G1, G1H, G1TH, G1HA, G1THA.]

SLIDE 6

Old paradigm: Merlin (Wu et al. 2016), github.com/CSTR-Edinburgh/merlin
New paradigm: DCTTS (Tachibana et al. 2018), github.com/Kyubyong/dc_tts

[Figure: "Naturalness rating": normalised subjective ratings (10–100) for systems M, G1, G1H, G1TH, G1HA, G1THA.]

Where do the improvements come from?

SLIDE 7

Old paradigm

SLIDE 8

Front end

Old paradigm

SLIDE 9

Front end Duration model/forced alignment

Old paradigm

SLIDE 10

Front end Duration model/forced alignment Acoustic model

Old paradigm


SLIDE 14

Front end Duration model/forced alignment Acoustic model

Old paradigm New paradigm

SLIDE 15

Text encoder Front end Duration model/forced alignment Acoustic model

Old paradigm New paradigm

SLIDE 16

Text encoder

attention

Front end Duration model/forced alignment Acoustic model

Old paradigm New paradigm

SLIDE 17

Text encoder Audio encoder

attention

Front end Duration model/forced alignment Acoustic model

Old paradigm New paradigm

SLIDE 18

Text encoder Audio decoder Audio encoder

attention

Front end Duration model/forced alignment Acoustic model

Old paradigm New paradigm

SLIDE 19

Text encoder Audio decoder Audio encoder

attention

Front end Duration model/forced alignment Acoustic model

Old paradigm New paradigm

SLIDE 20

Text encoder Audio decoder Audio encoder

attention

Front end Duration model/forced alignment Acoustic model


Text encoder vs. front end
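A minimal sketch of what "text encoder" means here: in DCTTS, the hand-built front end is replaced by character embeddings plus a stack of 1-D convolutions whose output is split into attention keys and values. The layer sizes below are illustrative assumptions, and the real DCTTS encoder uses dilated highway convolutions, omitted here for brevity.

```python
import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    """Learned front-end: character IDs -> attention keys and values."""
    def __init__(self, n_chars=64, emb=128, channels=256):
        super().__init__()
        self.embed = nn.Embedding(n_chars, emb)
        self.convs = nn.Sequential(
            nn.Conv1d(emb, channels, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(channels, channels, kernel_size=3, padding=1), nn.ReLU(),
        )
        # DCTTS-style split of the encoder output into keys and values.
        self.to_kv = nn.Conv1d(channels, 2 * channels, kernel_size=1)

    def forward(self, char_ids):                    # (batch, text_len)
        x = self.embed(char_ids).transpose(1, 2)    # (batch, emb, text_len)
        h = self.convs(x)
        k, v = self.to_kv(h).chunk(2, dim=1)        # keys, values
        return k, v
```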

SLIDE 21

Text encoder Audio decoder Audio encoder

attention

Front end Duration model/forced alignment Acoustic model


Text encoder vs. front end
Attention vs. precomputed alignment

SLIDE 22

Text encoder Audio decoder Audio encoder

attention

Front end Duration model/forced alignment Acoustic model


Text encoder vs. front end
Attention vs. precomputed alignment
Autoregression vs. frame independence

SLIDE 23

Text encoder Audio decoder Audio encoder

attention

Front end Duration model/forced alignment Acoustic model


Text encoder vs. front end
Attention vs. precomputed alignment (see the sketch below)
Autoregression vs. frame independence

Vocoder: WORLD vs. Griffin-Lim
Loss functions
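The attention contrast, sketched minimally: instead of durations from a forced aligner, each decoder step computes a soft alignment over all encoder positions. Scaled dot-product scoring is an assumption made for brevity; Tacotron and DCTTS use their own attention variants (location-sensitive and guided attention, respectively).

```python
import torch
import torch.nn.functional as F

def attend(queries, keys, values):
    """queries: (batch, dec_steps, d); keys/values: (batch, enc_steps, d).
    Returns per-step context vectors and the soft alignment matrix."""
    d = queries.size(-1)
    scores = queries @ keys.transpose(1, 2) / d ** 0.5   # (batch, dec, enc)
    align = F.softmax(scores, dim=-1)   # learned alignment, no forced aligner
    context = align @ values            # (batch, dec_steps, d)
    return context, align
```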

SLIDE 24

Text encoder Audio decoder Audio encoder

attention

Front end Duration model/forced alignment Acoustic model


Text encoder vs. front end
Attention vs. precomputed alignment
Autoregression vs. frame independence (see the sketch below)
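And the autoregression contrast: the seq2seq decoder conditions each output frame on its own previously generated frame via a pre-net ("acoustic feedback"), whereas the old-paradigm DNN predicted every frame independently. In this sketch, decoder_cell, prenet and attend are hypothetical stand-ins, not modules from any of the cited systems.

```python
import torch

def decode_autoregressive(decoder_cell, prenet, attend, keys, values,
                          n_frames, n_mels=80):
    frame = torch.zeros(1, n_mels)      # <GO> frame starts generation
    state, outputs = None, []
    for _ in range(n_frames):
        q = prenet(frame)               # acoustic feedback enters here
        context, _ = attend(q.unsqueeze(1), keys, values)
        frame, state = decoder_cell(q, context.squeeze(1), state)
        outputs.append(frame)
    return torch.stack(outputs, dim=1)  # (1, n_frames, n_mels)
```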

SLIDE 25

Acoustic model Text encoder Audio decoder Audio encoder

attention

Front end Duration model/forced alignment

SLIDE 26

Acoustic model Audio decoder Audio encoder

attention

Front end Duration model/forced alignment Text encoder

SLIDE 27

Audio decoder Audio encoder

attention

Front end Duration model/forced alignment Text encoder

SLIDE 28

Audio decoder Audio encoder

attention

Duration model/forced alignment Text encoder Front end

SLIDE 29

Audio decoder Audio encoder

attention

Front end Duration model/forced alignment Text encoder


SLIDE 36

[Figure: "Naturalness rating": normalised subjective ratings (10–100) for systems M, G1, G1H, G1TH, G1HA, G1THA; bars labelled Merlin and DCTTS.]

Where do the improvements come from?

SLIDE 37

[Figure: "Naturalness rating": normalised subjective ratings (10–100). Conditions: Merlin; DCTTS; DCTTS minus text encoder; DCTTS minus attention; DCTTS minus attention and text encoder; DCTTS minus attention, text encoder and acoustic history.]

SLIDE 38

[Figure: "Naturalness rating" as on the previous slide.]

Samples

  • LJSpeech
  • MUS(HRA) listening test
  • 24 paid native speakers
  • 60 Harvard sentences
SLIDE 39

[Figure: "Naturalness rating" as before, with one pairwise difference annotated "not significant".]

  • LJSpeech
  • MUS(HRA) listening test
  • 24 paid native speakers
  • 60 Harvard sentences
SLIDE 40

[Figure: "Naturalness rating" as before, with one pairwise difference annotated "not significant".]

Main findings

SLIDE 41

[Figure: "Naturalness rating" as before, with one pairwise difference annotated "not significant".]

  • 1. A learned front-end (text encoder) always improves quality (with or without attention, but more so with attention).
  • 2. Acoustic feedback has a very strong beneficial impact on quality.
  • 3. Attention ‘breaks’ without a learned front-end and helps (but not significantly) with a learned front-end.

Main findings




SLIDE 44

[Figure: "Naturalness rating": normalised subjective ratings (10–100); conditions as before.]

SLIDE 45

[Figure: "Naturalness rating": normalised subjective ratings (10–100) for an extended set of systems: M, MM, W2, W2T, W2H, G2, G1, G1H, G1TH, G1HA, G1THA.]

SLIDE 46

[Figure: "Naturalness rating": normalised subjective ratings (10–100); conditions as before.]

SLIDE 47

Text encoder: ✔   Attention: ❓   Autoregression: ✔

SLIDE 48

Oliver Watts ◆ Gustav Eje Henter ◆ Jason Fong ◆ Cassia Valentini-Botinhao