Parallel and cascaded deep neural networks for text-to-speech synthesis


SLIDE 1

Parallel and cascaded deep neural networks for text-to-speech synthesis

  • M. Sam Ribeiro, Oliver Watts, Junichi Yamagishi

School of Informatics, The University of Edinburgh
m.f.s.ribeiro@sms.ed.ac.uk

14 September 2016, Speech Synthesis Workshop 9 - Sunnyvale, United States

SLIDE 2

Introduction

  • Speech synthesis and prosody
  • Synthetic speech may sound bland and monotonous
  • A good understanding and modelling of prosody is essential for natural speech synthesis
  • Prosody is inherently suprasegmental
  • Suprasegmental features are mostly associated with long-term variation
  • Current features are very shallow (positional and POS/stress related)
  • Most systems operate at frame/state levels and rely heavily on segmental features

Ideally, we would have a framework that has good representations of contexts, but also the ability to exploit them.


SLIDE 4

Earlier work

  • Hierarchical models
    • Cascaded and parallel deep neural networks
    • Superpositional model of f0 [Yin et al (2016)]
    • Systems with hierarchical recurrences [Chen et al (1998)]
  • Continuous representations of linguistic contexts
    • Segmental-level [Lu et al (2013)] [Wu et al (2015)]
    • Word-level [Watts et al (2014)] [Wang et al (2015)]
    • Sentence-level [Watts et al (2015)]

Recent work: Ribeiro et al (2016), Syllable-level representations of suprasegmental features for DNN-based text-to-speech synthesis. Proceedings of Interspeech 2016.


SLIDE 6

Ribeiro et al (2016)

Contributions

  1. A top-down hierarchical model at the syllable level (cascaded)
  2. An investigation of its usefulness with additional features at the syllable and word levels

Main Findings

  1. The hierarchical approach performs best when segmental and suprasegmental features are balanced.
  2. Syllable-level bag-of-phones features give minor improvements in objective scores.
  3. Text-based word embeddings have little effect.
  4. No significant results in the subjective evaluation, but clear differences in the predicted f0 contours.


SLIDE 8

Ribeiro et al (2016)

  • Most improvements derive from the hierarchical framework
  • This suggests it is working mostly as a feature extractor or denoiser

Parallel and cascaded deep neural networks for text-to-speech synthesis

Ribeiro, M.S., Watts, O. & Yamagishi, J. (2016) Parallel and cascaded deep neural networks for text-to-speech synthesis. In Proc. of SSW, Sunnyvale, 2016.

SLIDE 9

Baseline Network

  • Feedforward deep neural network
  • 6 hidden layers, each with 1024 nodes
  • Output features
    • 60-dimensional MCCs, 25 band aperiodicities, 1 log-f0, 1 voicing decision (plus dynamic features)

[Diagram: feedforward network mapping input features to frame-level acoustic parameters]
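As a rough illustration, here is a minimal PyTorch sketch of a network with this shape. The input dimensionality (352 segmental + 244 pruned suprasegmental features, taken from the linguistic-features slide), the tanh activations, and the exact delta-feature layout are assumptions; the slides only specify the layer count, layer width, and output streams.

```python
import torch
import torch.nn as nn

# Static outputs: 60 MCCs + 25 band aperiodicities + 1 log-f0 + 1 V/UV flag.
# Assuming deltas and delta-deltas are appended to everything except V/UV,
# the frame-level target is (60 + 25 + 1) * 3 + 1 = 259 dimensions.
OUT_DIM = (60 + 25 + 1) * 3 + 1

class BaselineDNN(nn.Module):
    """Feedforward baseline: 6 hidden layers of 1024 units each."""
    def __init__(self, in_dim: int, hidden: int = 1024, n_layers: int = 6):
        super().__init__()
        layers, d = [], in_dim
        for _ in range(n_layers):
            layers += [nn.Linear(d, hidden), nn.Tanh()]  # activation assumed
            d = hidden
        layers.append(nn.Linear(d, OUT_DIM))  # linear output for regression
        self.net = nn.Sequential(*layers)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

# 352 segmental + 244 pruned suprasegmental input features (see slide 13)
model = BaselineDNN(in_dim=352 + 244)
frames = torch.randn(8, 352 + 244)   # a batch of 8 input frames
params = model(frames)               # -> (8, 259) acoustic parameters
```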

SLIDE 10

Hierarchical Networks

  • Input features
    • Segmental: phone-level and below
    • Suprasegmental: syllable-level and above
  • Output features
    • Frame-level acoustic parameters averaged over the entire syllable (see the sketch after this list)
  • Architecture
    • 6-hidden-layer triangular networks
    • Top hidden layer used as a bottleneck layer
  • Integration strategies
    • Cascaded strategy
    • Parallel strategy
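A small sketch of how such syllable-level targets can be derived, assuming frame-aligned syllable boundaries are available; the boundary format below is purely illustrative.

```python
import numpy as np

def syllable_level_targets(frame_params, syl_bounds):
    """Average frame-level acoustic parameters over each syllable.

    frame_params: (n_frames, n_params) frame-level acoustic parameters.
    syl_bounds:   list of (start, end) frame indices per syllable; this
                  boundary representation is an assumption for illustration.
    """
    return np.stack([frame_params[s:e].mean(axis=0) for s, e in syl_bounds])

# e.g. 100 frames of 87 static parameters split into three syllables
frames = np.random.randn(100, 87)
targets = syllable_level_targets(frames, [(0, 30), (30, 65), (65, 100)])
print(targets.shape)  # (3, 87)
```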

SLIDE 11

Cascaded Network

[Diagram: cascaded network. Suprasegmental features feed a syllable-level network that predicts syllable-level acoustic parameters; its hidden (bottleneck) representation is concatenated with the segmental features as input to a frame-level network predicting frame-level acoustic parameters.]
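A minimal PyTorch sketch of this cascaded arrangement. The triangular layer widths, the bottleneck size, and the repetition of syllable-level quantities across each syllable's frames are assumptions; the slides only specify 6 hidden layers with the top one used as a bottleneck.

```python
import torch
import torch.nn as nn

def mlp(dims):
    """Stack of Linear+Tanh layers; the final layer is left linear."""
    layers = []
    for i in range(len(dims) - 1):
        layers.append(nn.Linear(dims[i], dims[i + 1]))
        if i < len(dims) - 2:
            layers.append(nn.Tanh())
    return nn.Sequential(*layers)

class CascadedTTS(nn.Module):
    """Cascaded strategy: a syllable-level network maps suprasegmental
    features to syllable-level acoustics; its bottleneck activations are
    then passed, with the segmental features, to the frame-level network.
    In the cascaded setup the two stages are trained one after the other."""
    def __init__(self, supra_dim=244, seg_dim=352, syl_out=87,
                 frame_out=259, bottleneck=32):
        super().__init__()
        # "Triangular" shrinking hidden layers ending in the bottleneck;
        # the exact widths are assumptions.
        self.syl_net = mlp([supra_dim, 1024, 512, 256, 128, 64, bottleneck])
        self.syl_head = nn.Linear(bottleneck, syl_out)
        self.frame_net = mlp([seg_dim + bottleneck, 1024, 1024, 1024, frame_out])

    def forward(self, supra, seg):
        b = torch.tanh(self.syl_net(supra))          # bottleneck representation
        syl_params = self.syl_head(b)                # syllable-level acoustics
        frame_params = self.frame_net(torch.cat([seg, b], dim=-1))
        return syl_params, frame_params

model = CascadedTTS()
supra = torch.randn(8, 244)   # syllable features, repeated per frame (assumed)
seg = torch.randn(8, 352)
syl_params, frame_params = model(supra, seg)  # (8, 87), (8, 259)
```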

SLIDE 12

Parallel Network

[Diagram: parallel network, with suprasegmental features predicting syllable-level acoustic parameters and, in parallel, segmental features joining the suprasegmental branch's representation to predict frame-level acoustic parameters.]
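One plausible reading of the parallel strategy, in contrast to the two-stage cascade, is joint multi-task training of both outputs; the sketch below takes that reading. The shared-bottleneck wiring, layer sizes, and equal loss weighting are assumptions rather than details from the slides.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ParallelTTS(nn.Module):
    """Parallel strategy sketch: syllable-level and frame-level objectives
    are optimised jointly, so the frame-level loss also shapes the
    suprasegmental representation (unlike the two-stage cascaded setup)."""
    def __init__(self, supra_dim=244, seg_dim=352, bottleneck=32,
                 syl_out=87, frame_out=259):
        super().__init__()
        self.supra_net = nn.Sequential(
            nn.Linear(supra_dim, 1024), nn.Tanh(),
            nn.Linear(1024, bottleneck), nn.Tanh())
        self.syl_head = nn.Linear(bottleneck, syl_out)
        self.frame_net = nn.Sequential(
            nn.Linear(seg_dim + bottleneck, 1024), nn.Tanh(),
            nn.Linear(1024, frame_out))

    def forward(self, supra, seg):
        b = self.supra_net(supra)
        return self.syl_head(b), self.frame_net(torch.cat([seg, b], dim=-1))

model = ParallelTTS()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)

def train_step(supra, seg, syl_tgt, frame_tgt):
    """One joint update; the 0.5/0.5 loss weighting is an assumption."""
    syl_pred, frame_pred = model(supra, seg)
    loss = 0.5 * F.mse_loss(syl_pred, syl_tgt) \
         + 0.5 * F.mse_loss(frame_pred, frame_tgt)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

loss = train_step(torch.randn(8, 244), torch.randn(8, 352),
                  torch.randn(8, 87), torch.randn(8, 259))
```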

SLIDE 13

Linguistic Features

  • Segmental features
    • Constant for all systems
    • Phone- and state-level features (352 dimensions)
  • Suprasegmental - full set
    • Standard set of features used for HMM-based speech synthesis
    • Derived from a common front-end (Festival)
    • Syllable, word, phrase, utterance (roughly 1100 dimensions)
  • Suprasegmental - pruned set
    • Hand-selected set of features for DNN-based speech synthesis
    • Higher-level context was removed
    • Syllable, word (244 dimensions)

SLIDE 14

Database

  • Expressive audiobook data
    • Ideal for exploring higher-level prosodic phenomena
  • A Tramp Abroad, available from LibriVox, processed according to [Braunschweiler et al (2010)] and [Braunschweiler and Buchholz (2011)]
  • Training, development, and test sets consisting of 4500, 300, and 100 utterances, respectively.

SLIDE 15

Systems

  • 3 network architectures
  • 2 sets of linguistic features
  • 6 systems trained:

  1. Baseline - Hand-selected
  2. Cascaded - Hand-selected
  3. Parallel - Hand-selected
  4. Baseline - Standard
  5. Cascaded - Standard
  6. Parallel - Standard

SLIDE 16

Hypotheses

Addition of noisy suprasegmental features

  • Adding more (suprasegmental) features to a frame-level model will degrade its performance

Hierarchical systems

  • Hierarchical systems will outperform non-hierarchical systems
  • Previous work has suggested hierarchical systems are beneficial for speech synthesis

Parallel and cascaded networks

  • Parallel architectures will be preferred over cascaded architectures
SLIDE 19

Listening tests

  • MUSHRA test
    • MUltiple Stimuli with Hidden Reference and Anchor
    • Simultaneous comparison of multiple speech samples
    • Listeners rate each system against all conditions and against a reference
  • Test setup
    • 20 native English listeners
    • Each rates 20 sets of stimuli
    • Total of 400 parallel comparisons

SLIDE 20

Results

SLIDE 21

Results - additional features

SLIDE 22

Results - additional features

SLIDE 23

Results - hand-selected features

SLIDE 24

Results - hand-selected features

SLIDE 25

Results - hand-selected features

SLIDE 26

Results - standard features

SLIDE 27

Results - standard feature set

SLIDE 28

Results - standard features

SLIDE 29

Results - parallel networks

SLIDE 30

Results - parallel networks

SLIDE 31

Speech Samples

SLIDE 32

Summary

Main Findings

  1. Adding high-dimensional representations of context to a frame-level network may be harmful.
  2. Hierarchical systems (parallel or cascaded) can be useful when using noisy suprasegmental features.
    • This suggests they may be operating as feature extractors or denoisers.
  3. Parallel networks outperform cascaded networks in all cases.
    • Consistent with the findings of [Yin et al (2016)], although tested under different circumstances.
SLIDE 33

Future work

  • Explore the parallel approach with additional features
    • Syllable bag-of-phones, text-based word embeddings [Ribeiro et al (2016)]
    • Can these frameworks leverage new information?
  • Decoupling of linguistic levels with the parallel approach (similar to [Yin et al (2016)])
  • Hierarchical systems with recurrent layers
  • Alternative acoustic features for the suprasegmental network
SLIDE 34

Parallel and cascaded deep neural networks for text-to-speech synthesis

Thank you for listening

  • M. Sam Ribeiro, Oliver Watts, Junichi Yamagishi

School of Informatics, The University of Edinburgh
m.f.s.ribeiro@sms.ed.ac.uk

14 September 2016 - Sunnyvale, California, United States

SLIDE 35

References I

[Braunschweiler and Buchholz (2011)] Braunschweiler, N. & Buchholz, S. (2011) Automatic Sentence Selection from Speech Corpora Including Diverse Speech for Improved HMM-TTS Synthesis Quality. Interspeech 2011.

[Braunschweiler et al (2010)] Braunschweiler, N., Gales, M.J.F. & Buchholz, S. (2010) Lightly supervised recognition for automatic alignment of large coherent speech recordings. Interspeech 2010.

[Chen et al (1998)] Chen, S.H., Hwang, S.H. & Wang, Y.R. (1998) An RNN-based prosodic information synthesizer for Mandarin text-to-speech. IEEE Transactions on Speech and Audio Processing.

[Lu et al (2013)] Lu, H., King, S. & Watts, O. (2013) Combining a vector space representation of linguistic context with a deep neural network for text-to-speech synthesis. Speech Synthesis Workshop 8.

[Ribeiro et al (2016)] Ribeiro, M.S., Watts, O. & Yamagishi, J. (2016) Syllable-level representations of suprasegmental features for DNN-based text-to-speech synthesis. Proceedings of Interspeech. San Francisco, 2016.

[Wang et al (2015)] Wang, P., Qian, Y., Soong, F.K., He, L. & Zhao, H. (2015) Word Embedding for Recurrent Neural Network Based TTS Synthesis. ICASSP 2015.

[Watts et al (2014)] Watts, O., Gangireddy, S., Yamagishi, J., King, S., Renals, S., Stan, A. & Giurgiu, M. (2014) Neural net word representations for phrase-break prediction without a part of speech tagger. ICASSP 2014.

SLIDE 36

References II

[Watts et al (2015)] Watts, O., Wu, Z. & King, S. (2015) Sentence-level control vectors for deep neural network speech synthesis. Interspeech 2015.

[Wu et al (2015)] Wu, Z., Valentini-Botinhao, C., Watts, O. & King, S. (2015) Deep neural networks employing multi-task learning and stacked bottleneck features for speech synthesis. ICASSP 2015.

[Yin et al (2016)] Yin, X., Lei, M., Qian, Y., Soong, F., He, L., Ling, Z.H. & Dai, L.R. (2016) Modeling F0 trajectories in hierarchically structured deep neural networks. Speech Communication 76.