
SLIDE 1

Syllable-level representations of suprasegmental features for DNN-based text-to-speech synthesis

  • M. Sam Ribeiro, Oliver Watts, Junichi Yamagishi

School of Informatics, The University of Edinburgh, m.f.s.ribeiro@sms.ed.ac.uk. 12 September 2016, INTERSPEECH, San Francisco, United States


SLIDE 2

Introduction

  • Speech synthesis and prosody
  • Synthetic speech may sound bland and monotonous.
  • A good understanding and modelling of prosody is essential for natural speech synthesis.
  • Prosody is inherently suprasegmental
  • Suprasegmental features are mostly associated with long-term variation.
  • Current features are very shallow (positional and POS/stress related).
  • Most systems operate at frame/state levels and rely heavily on segmental features.

Idea: learn suprasegmental representations by pre-processing higher-level features separately.


SLIDE 4

Earlier work

  • Hierarchical models
  • Cascaded and parallel deep neural networks [Yin et al (2016)], [Ribeiro et al (2016)]
  • Systems with hierarchical recurrences [Chen et al (1998)]
  • Continuous representations of linguistic contexts
  • Segmental-level [Lu et al (2013)], [Wu et al (2015)]
  • Word-level [Watts et al (2014)], [Wang et al (2015)]
  • Sentence-level [Watts et al (2015)]

Contributions

  • A top-down hierarchical model at the syllable level (cascaded)
  • An investigation of additional features at the syllable and word levels


SLIDE 6

Database

  • Database
  • Expressive audiobook data, ideal for exploring higher-level prosodic phenomena
  • A Tramp Abroad, available from Librivox, processed according to [Braunschweiler et al (2010)] and [Braunschweiler and Buchholz (2011)]
  • Training, development, and test sets of 4500, 300, and 100 utterances
  • Baseline
  • Feedforward neural network: 6 hidden layers, each with 1024 nodes
  • Output features: 60-dimensional MCCs, 25 band aperiodicities, 1 log-f0, 1 voicing decision (with dynamic features)
  • Input features: linguistic contexts at state, phone, syllable, and word levels (594 features)

SLIDE 7

Hierarchical approach

  • Syllable-level network: triangular feedforward neural network
  • Output features: MCCs, BAPs, and continuous log-f0, averaged over the entire syllable
  • Input features: linguistic contexts defined at syllable and word levels (suprasegmental features)

[Figure: cascaded architecture. Suprasegmental features feed a syllable-level network whose hidden (bottleneck) representation predicts syllable-level acoustic parameters; that representation is combined with segmental features in a frame-level network predicting frame-level acoustic parameters.]
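A minimal numpy sketch of the cascaded idea above: a syllable-level network maps suprasegmental features to a bottleneck embedding, and that embedding is concatenated with segmental features as input to the frame-level network. All layer sizes, dimensionalities, and weights here are illustrative assumptions, not the paper's configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

# Illustrative dimensions (assumptions): suprasegmental input, segmental
# input, and bottleneck embedding size.
n_supra, n_seg, n_bneck = 100, 494, 64

# Syllable-level network: suprasegmental features -> bottleneck embedding.
w1 = rng.normal(scale=0.1, size=(n_supra, 256)); b1 = np.zeros(256)
w_bn = rng.normal(scale=0.1, size=(256, n_bneck)); b_bn = np.zeros(n_bneck)

def syllable_embedding(supra):
    """Hidden (bottleneck) representation learned at the syllable level."""
    return relu(relu(supra @ w1 + b1) @ w_bn + b_bn)

# Frame-level network: segmental features plus the syllable embedding.
# 187 output dimensions is illustrative, not the paper's acoustic dimension.
w2 = rng.normal(scale=0.1, size=(n_seg + n_bneck, 1024)); b2 = np.zeros(1024)
w_out = rng.normal(scale=0.1, size=(1024, 187)); b_out = np.zeros(187)

def frame_acoustics(seg, supra):
    """Predict frame-level acoustic parameters from segmental features
    concatenated with the syllable-level embedding."""
    h = relu(np.concatenate([seg, syllable_embedding(supra)]) @ w2 + b2)
    return h @ w_out + b_out

y = frame_acoustics(rng.normal(size=n_seg), rng.normal(size=n_supra))
print(y.shape)  # (187,)
```

In the cascaded (top-down) setup, the syllable network would first be trained to predict syllable-averaged acoustics, and its bottleneck activations then passed down as extra inputs to the frame network.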

SLIDE 8

Embedding size

  • Effect of the bottleneck layer size on objective measures
  • Does it matter if segmental and suprasegmental features are unbalanced?

[Figure: Mel Cepstral Distortion (approx. 4.56-4.60), LF0-RMSE (approx. 27.0-28.2), and LF0-CORR (approx. 0.445-0.480) for the baseline and bottleneck sizes of 32, 64, 128, 256, and 512.]
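For reference, Mel Cepstral Distortion is commonly computed per frame from the difference of mel-cepstral coefficient vectors, excluding the energy coefficient c0, then averaged over frames. The sketch below uses the standard 10·sqrt(2)/ln(10) scaling; the exact variant used in the paper is not specified here.

```python
import numpy as np

def mel_cepstral_distortion(mcc_ref, mcc_syn):
    """Frame-averaged MCD in dB between two aligned MCC sequences of shape
    (frames, coeffs), excluding the 0th (energy) coefficient."""
    diff = np.asarray(mcc_ref)[:, 1:] - np.asarray(mcc_syn)[:, 1:]
    per_frame = (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return float(np.mean(per_frame))

ref = np.ones((3, 60))
print(mel_cepstral_distortion(ref, ref))  # identical sequences -> 0.0
```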

SLIDE 9

Additional features

  • Hypotheses
  • The hierarchical approach will be able to leverage additional suprasegmental features.
  • The frame-level network will depend mostly on segmental features and ignore the new high-dimensional features.
  • Additional features
  • Syllable bag-of-phones
  • Text-based word embeddings (skip-gram model [Mikolov et al (2013)])

SLIDE 10

Syllable bag-of-phones

  • Syllables have a variable number of phones
  • A bag-of-phones lets us represent them with fixed-size units
  • Syllable structure: onset, nucleus, coda
  • For each syllable component, define an n-hot encoding
  • Includes identity and articulatory features for phones
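The n-hot encoding described above can be sketched as follows. The toy phone inventory and the restriction to phone identities (omitting the articulatory features the slide also mentions) are simplifying assumptions.

```python
# One n-hot block per syllable component (onset, nucleus, coda) over a
# fixed phone inventory gives a fixed-size vector regardless of how many
# phones the syllable contains. PHONES is a toy inventory, not the paper's.
PHONES = ["p", "t", "k", "s", "r", "l", "a", "e", "i", "o", "u", "n", "m"]
INDEX = {p: i for i, p in enumerate(PHONES)}

def n_hot(phones):
    """Binary vector with a 1 at the index of each phone present."""
    vec = [0] * len(PHONES)
    for p in phones:
        vec[INDEX[p]] = 1
    return vec

def syllable_bag_of_phones(onset, nucleus, coda):
    """Concatenate one n-hot block per syllable component."""
    return n_hot(onset) + n_hot(nucleus) + n_hot(coda)

# "strap" -> onset /s t r/, nucleus /a/, coda /p/
v = syllable_bag_of_phones(["s", "t", "r"], ["a"], ["p"])
print(len(v))  # 39 = 3 components x 13 phones
```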

SLIDE 11

Syllable bag-of-phones

  • Systems trained
  • frame: frame-level DNN
  • frame-BoP: frame-level DNN with syllable bag-of-phones
  • syl: cascaded DNN
  • syl-BoP: cascaded DNN with syllable bag-of-phones

[Figure: Mel-Cepstral Distortion (approx. 4.55-4.60), BAP distortion (approx. 2.15-2.21), LF0-RMSE (approx. 26.5-28.5), and LF0-CORR (approx. 0.44-0.48) for the frame, frame-BoP, syl, and syl-BoP systems.]

SLIDE 12

Word embeddings

  • Text-based word embeddings learned with the skip-gram model [Mikolov et al (2013)]
  • English Wikipedia data: 500 million words
  • Embedding sizes: 100 and 300 dimensions
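One way to use such embeddings, sketched below: look up a pre-trained vector for each word and append it to the existing linguistic feature vector. The toy vocabulary, random embedding table, and `<unk>` fallback are assumptions for illustration, not the paper's setup.

```python
import numpy as np

# Toy stand-in for a skip-gram embedding table trained on Wikipedia text.
EMB_DIM = 100
vocab = {"the": 0, "cat": 1, "sat": 2, "<unk>": 3}
emb_table = np.random.default_rng(1).normal(size=(len(vocab), EMB_DIM))

def augment(frame_feats, word):
    """Append the current word's embedding to a frame's linguistic features,
    falling back to an <unk> vector for out-of-vocabulary words."""
    idx = vocab.get(word, vocab["<unk>"])
    return np.concatenate([frame_feats, emb_table[idx]])

# 594 matches the baseline's linguistic feature count from slide 6.
x = augment(np.zeros(594), "cat")
print(x.shape)  # (694,)
```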

System         MCD    BAP    F0-RMSE  F0-CORR
frame          4.596  2.197  28.054   .449
frame-w100     4.598  2.204  28.048   .448
syl-BoP        4.557  2.176  27.095   .477
syl-BoP-w100   4.550  2.177  27.086   .463
syl-BoP-w300   4.565  2.178  26.850   .479

  • No real improvements, although previous work has suggested these embeddings are useful for text-to-speech [Wang et al (2015)].

SLIDE 13

Subjective evaluation

  • Preference test
  • 50 test utterances
  • 16 native listeners
  • 400 judgements per condition
  • Systems evaluated
  • baseline: basic feedforward DNN with all available features
  • syl: top-down hierarchical system with syllable bag-of-phones
  • syl-w300: adds 300-dimensional word embeddings to syl

ID    syl      syl-w300
1     48.15%   43.18%
2     60.87%   51.79%
3     59.26%   59.09%
4     56.52%   48.21%
5     53.70%   54.55%
6     63.04%   46.43%
7     38.89%   56.00%
8     56.52%   53.57%
9     44.44%   50.00%
10    65.38%   51.72%
11    42.59%   43.18%
12    52.17%   48.21%
13    55.56%   63.64%
14    65.22%   42.86%
15    46.30%   47.73%
16    45.65%   42.86%
all   53.39%   50.19%

SLIDE 14

What are listeners responding to?

Listeners judge the samples primarily on f0 variation, which suggests the current methodology mostly affects f0.


SLIDE 16

Speech Samples

Samples available at http://homepages.inf.ed.ac.uk/s1250520/samples/interspeech16.html

SLIDE 17

Summary

Contributions

1 A top-down hierarchical model at syllable-level (cascaded) 2 An investigation of its usefulness with additional features at

syllable and word-level Main Findings

1 Hierarchical approach performs best when segmental and

suprasegmental features are balanced.

2 Syllable-bag of phones give minor improvements on objective

scores

3 Text-based word embeddings have little effect 4 No significant results in terms of subjective evaluation, but

clear differences in terms of predicted f0 contours.

17 / 22


SLIDE 19

Future work

  • Most improvements derive from the hierarchical framework
  • This suggests it is working mostly as a feature extractor or denoiser

Ribeiro, M.S., Watts, O. & Yamagishi, J. (2016) Parallel and cascaded deep neural networks for text-to-speech synthesis. In Proc. of SSW, Sunnyvale, 2016.

SLIDE 20

Syllable-level representations of suprasegmental features for DNN-based text-to-speech synthesis

Thank you for listening

  • M. Sam Ribeiro, Oliver Watts, Junichi Yamagishi

School of Informatics, The University of Edinburgh, m.f.s.ribeiro@sms.ed.ac.uk

12 September 2016, San Francisco, United States

SLIDE 21

References I

[Braunschweiler and Buchholz (2011)] Braunschweiler, N. & Buchholz, S. (2011) Automatic sentence selection from speech corpora including diverse speech for improved HMM-TTS synthesis quality. Interspeech 2011.

[Braunschweiler et al (2010)] Braunschweiler, N., Gales, M.J.F. & Buchholz, S. (2010) Lightly supervised recognition for automatic alignment of large coherent speech recordings. Interspeech 2010.

[Chen et al (1998)] Chen, S.H., Hwang, S.H. & Wang, Y.R. (1998) An RNN-based prosodic information synthesizer for Mandarin text-to-speech. IEEE Transactions on Speech and Audio Processing.

[Lu et al (2013)] Lu, H., King, S. & Watts, O. (2013) Combining a vector space representation of linguistic context with a deep neural network for text-to-speech synthesis. Speech Synthesis Workshop 8, 2013.

[Mikolov et al (2013)] Mikolov, T., Chen, K., Corrado, G. & Dean, J. (2013) Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

[Ribeiro et al (2016)] Ribeiro, M.S., Watts, O. & Yamagishi, J. (2016) Parallel and cascaded deep neural networks for text-to-speech synthesis. 9th ISCA Workshop on Speech Synthesis, Sunnyvale, 2016.

[Wang et al (2015)] Wang, P., Qian, Y., Soong, F.K., He, L. & Zhao, H. (2015) Word embedding for recurrent neural network based TTS synthesis. ICASSP 2015.

SLIDE 22

References II

[Watts et al (2014)] Watts, O., Gangireddy, S., Yamagishi, J., King, S., Renals, S., Stan, A. & Giurgiu, M. (2014) Neural net word representations for phrase-break prediction without a part of speech tagger. ICASSP 2014.

[Wu et al (2015)] Wu, Z., Valentini-Botinhao, C., Watts, O. & King, S. (2015) Deep neural networks employing multi-task learning and stacked bottleneck features for speech synthesis. ICASSP 2015.

[Watts et al (2015)] Watts, O., Wu, Z. & King, S. (2015) Sentence-level control vectors for deep neural network speech synthesis. Interspeech 2015.

[Yin et al (2016)] Yin, X., Lei, M., Qian, Y., Soong, F., He, L., Ling, Z.H. & Dai, L.R. (2016) Modeling F0 trajectories in hierarchically structured deep neural networks. Speech Communication 76.