Parallel and cascaded deep neural networks for text-to-speech synthesis


  1. Parallel and cascaded deep neural networks for text-to-speech synthesis
  M. Sam Ribeiro, Oliver Watts, Junichi Yamagishi
  School of Informatics, The University of Edinburgh
  m.f.s.ribeiro@sms.ed.ac.uk
  14 September 2016, Speech Synthesis Workshop 9, Sunnyvale, United States

  2. Introduction
  • Speech synthesis and prosody
    • Synthetic speech may sound bland and monotonous
    • A good understanding and modelling of prosody is essential for natural speech synthesis
  • Prosody is inherently suprasegmental
    • Suprasegmental features are mostly associated with long-term variation
  • Current features are very shallow (positional and POS/stress related)
    • Most systems operate at frame/state levels and rely heavily on segmental features
  Ideally we would have a framework that has good representations of contexts, but also the ability to exploit them.

  4. Earlier work
  • Hierarchical models
    • Cascaded and parallel deep neural networks
    • Superpositional model of f0 [Yin et al. (2016)]
    • Systems with hierarchical recurrences [Chen et al. (1998)]
  • Continuous representations of linguistic contexts
    • Segmental-level [Lu et al. (2013)] [Wu et al. (2015)]
    • Word-level [Watts et al. (2014)] [Wang et al. (2015)]
    • Sentence-level [Watts et al. (2015)]
  Recent work: Ribeiro et al. (2016). Syllable-level representations of suprasegmental features for DNN-based text-to-speech synthesis. In Proc. of Interspeech 2016.

  6. Ribeiro et al. (2016)
  Contributions
  1. A top-down hierarchical model at syllable level (cascaded)
  2. An investigation of its usefulness with additional features at syllable and word level
  Main findings
  1. The hierarchical approach performs best when segmental and suprasegmental features are balanced
  2. Syllable bag-of-phones features give minor improvements on objective scores
  3. Text-based word embeddings have little effect
  4. No significant results in the subjective evaluation, but clear differences in the predicted f0 contours

  8. Ribeiro et al. (2016)
  • Most improvements derive from the hierarchical framework
  • This suggests it is working mostly as a feature extractor or denoiser
  Ribeiro, M. S., Watts, O. & Yamagishi, J. (2016). Parallel and cascaded deep neural networks for text-to-speech synthesis. In Proc. of SSW9, Sunnyvale, 2016.

  9. Baseline Network [diagram: input features → feedforward network → frame-level acoustic parameters]
  • Feedforward deep neural network
  • 6 hidden layers, each with 1024 nodes
  • Output features: 60-dimensional MCCs, 25 band aperiodicities, 1 log-f0, 1 voicing decision (plus dynamic features)
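The baseline on this slide can be sketched as a plain feedforward pass in numpy. The layer sizes and output split follow the slide; the tanh activation, the weight initialisation, and the 352-dimensional input (taken from the linguistic-features slide) are illustrative assumptions, not the exact training recipe.

```python
import numpy as np

rng = np.random.default_rng(0)

IN_DIM = 352                      # phone/state-level input features (assumed, from slide 13)
HIDDEN = [1024] * 6               # 6 hidden layers, 1024 nodes each
# Output: (60 MCC + 25 BAP + 1 log-f0) statics, each with delta and
# delta-delta dynamic features, plus 1 voicing decision: 86 * 3 + 1 = 259.
OUT_DIM = (60 + 25 + 1) * 3 + 1

def init_layer(n_in, n_out):
    return rng.normal(0, 0.01, (n_in, n_out)), np.zeros(n_out)

dims = [IN_DIM] + HIDDEN + [OUT_DIM]
layers = [init_layer(a, b) for a, b in zip(dims[:-1], dims[1:])]

def forward(x):
    for i, (W, b) in enumerate(layers):
        x = x @ W + b
        if i < len(layers) - 1:   # hidden layers use tanh; output is linear
            x = np.tanh(x)
    return x

frame = rng.normal(size=IN_DIM)
params = forward(frame)
print(params.shape)               # (259,)
```

At synthesis time one such vector is predicted per frame and passed to a vocoder after parameter generation from the dynamic features.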

  10. Hierarchical Networks
  • Input features
    • Segmental: phone-level and below
    • Suprasegmental: syllable-level and above
  • Output features
    • Frame-level acoustic parameters averaged over the entire syllable
  • Architecture
    • 6-hidden-layer triangular networks
    • Top hidden layer used as bottleneck layer
  • Integration strategies
    • Cascaded strategy
    • Parallel strategy
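The syllable-level targets described above are just frame-level acoustic parameters averaged over each syllable. A minimal sketch, with illustrative frame counts and syllable boundaries:

```python
import numpy as np

rng = np.random.default_rng(3)

# 50 frames of 259-dim acoustic parameters (stand-ins for real data)
frames = rng.normal(size=(50, 259))
# assumed (start, end) frame spans of three syllables
syllable_bounds = [(0, 12), (12, 30), (30, 50)]

# one averaged target vector per syllable
targets = np.stack([frames[a:b].mean(axis=0) for a, b in syllable_bounds])
print(targets.shape)   # (3, 259)
```

The suprasegmental network is then trained to predict these per-syllable averages from syllable-level-and-above linguistic features.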

  11. Cascaded Network [diagram: suprasegmental features → triangular network → syllable-level acoustic parameters; the network's hidden (bottleneck) representation and the segmental features → frame-level network → frame-level acoustic parameters]
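Putting the two previous slides together, the cascaded strategy can be sketched as follows: a triangular syllable-level network predicts syllable-averaged acoustic parameters, and its top hidden layer (the bottleneck) is passed down, alongside the segmental features, as extra input to the frame-level network. All dimensions, the tanh activation, and the random weights are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

SUPRA_DIM, SEG_DIM = 244, 352    # pruned suprasegmental / segmental features
BOTTLENECK = 32                  # assumed size of the shared hidden representation
SYL_OUT = FRAME_OUT = 259        # acoustic parameter dimensionality

def mlp(dims):
    return [(rng.normal(0, 0.01, (a, b)), np.zeros(b))
            for a, b in zip(dims[:-1], dims[1:])]

def forward(layers, x):
    acts = []
    for i, (W, b) in enumerate(layers):
        x = x @ W + b
        if i < len(layers) - 1:
            x = np.tanh(x)
        acts.append(x)
    return x, acts

# Triangular syllable-level network: 6 shrinking hidden layers,
# the top one acting as the bottleneck.
syl_net = mlp([SUPRA_DIM, 1024, 512, 256, 128, 64, BOTTLENECK, SYL_OUT])
# Frame-level network consumes segmental features plus the bottleneck.
frame_net = mlp([SEG_DIM + BOTTLENECK, 1024, 1024, 1024, FRAME_OUT])

supra = rng.normal(size=SUPRA_DIM)
seg = rng.normal(size=SEG_DIM)

syl_params, acts = forward(syl_net, supra)
bottleneck = acts[-2]            # top hidden layer of the syllable network
frame_params, _ = forward(frame_net, np.concatenate([seg, bottleneck]))
```

In the cascade, the suprasegmental network is trained first and its bottleneck activations are then fixed as inputs when training the frame-level network.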

  12. Parallel Network [diagram: two networks side by side; inputs: segmental features and suprasegmental features; outputs: frame-level acoustic parameters and syllable-level acoustic parameters]
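One plausible reading of the parallel diagram, sketched under explicit assumptions: the segmental and suprasegmental networks run side by side, each producing frame-level acoustic parameters (the suprasegmental one also predicting syllable-level parameters), with the two frame-level streams combined additively, in the spirit of the superpositional model of [Yin et al. (2016)] cited on the earlier-work slide. The dimensions, activations, and the additive combination are illustrative guesses, not the paper's exact recipe.

```python
import numpy as np

rng = np.random.default_rng(4)
SEG_DIM, SUPRA_DIM, OUT_DIM = 352, 244, 259

def mlp(dims):
    return [(rng.normal(0, 0.01, (a, b)), np.zeros(b))
            for a, b in zip(dims[:-1], dims[1:])]

def forward(layers, x):
    for i, (W, b) in enumerate(layers):
        x = x @ W + b
        if i < len(layers) - 1:
            x = np.tanh(x)
    return x

# Segmental branch: frame-level parameters only.
seg_net = mlp([SEG_DIM, 1024, 1024, OUT_DIM])
# Suprasegmental branch: frame-level contribution plus syllable-level targets.
supra_net = mlp([SUPRA_DIM, 512, 256, OUT_DIM + OUT_DIM])

seg_out = forward(seg_net, rng.normal(size=SEG_DIM))
supra_out = forward(supra_net, rng.normal(size=SUPRA_DIM))
frame_contrib, syl_params = supra_out[:OUT_DIM], supra_out[OUT_DIM:]

frame_params = seg_out + frame_contrib   # combined frame-level prediction
```

Unlike the cascade, neither branch waits on the other here, which is one way to read "parallel"; the summary slide notes this architecture outperformed the cascaded one in all conditions.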

  13. Linguistic Features
  • Segmental features
    • Constant for all systems
    • Phone- and state-level features (352 dimensions)
  • Suprasegmental - full set
    • Standard set of features used for HMM-based speech synthesis
    • Derived from a common front-end (Festival)
    • Syllable, word, phrase, utterance (roughly 1100 dimensions)
  • Suprasegmental - pruned set
    • Hand-selected set of features for DNN-based speech synthesis
    • Higher-level context was removed
    • Syllable, word (244 dimensions)

  14. Database
  • Expressive audiobook data
    • Ideal for exploring higher-level prosodic phenomena
  • A Tramp Abroad, available from LibriVox, processed according to [Braunschweiler et al. (2010)] and [Braunschweiler and Buchholz (2011)]
  • Training, development, and test sets of 4500, 300, and 100 utterances, respectively

  15. Systems
  • 3 network architectures × 2 sets of linguistic features = 6 systems trained
  1. Baseline - hand-selected
  2. Cascaded - hand-selected
  3. Parallel - hand-selected
  4. Baseline - standard
  5. Cascaded - standard
  6. Parallel - standard

  16. Hypotheses
  • Addition of noisy suprasegmental features
    • Adding more (suprasegmental) features to a frame-level model will degrade its performance
  • Hierarchical systems
    • Hierarchical systems will outperform non-hierarchical systems
    • Previous work has suggested hierarchical systems are beneficial for speech synthesis
  • Parallel and cascaded networks
    • Parallel architectures will be preferred over cascaded architectures

  19. Listening tests
  • MUSHRA test
    • MUltiple Stimuli with Hidden Reference and Anchor
    • Simultaneous comparison of multiple speech samples
    • Listeners rate each system against all other conditions and against a reference
  • Test setup
    • 20 native English listeners
    • Each rates 20 sets of stimuli
    • Total of 400 parallel comparisons
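Aggregating such a MUSHRA test is straightforward: every listener scores every system (including the hidden reference) on a 0-100 scale for each stimulus set, and per-system means are reported. The listener and set counts follow the slide; the system names and the random scores below are placeholders for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
systems = ["reference", "baseline", "cascaded", "parallel"]
n_listeners, n_sets = 20, 20      # 20 listeners x 20 sets = 400 screens

# ratings[s, l, k]: 0-100 score given by listener l to system s on set k
ratings = rng.integers(0, 101, size=(len(systems), n_listeners, n_sets))

# per-system mean over all listeners and stimulus sets
means = ratings.reshape(len(systems), -1).mean(axis=1)
for name, m in zip(systems, means):
    print(f"{name:10s} {m:5.1f}")
```

In practice the per-listener scores would also be checked against the hidden reference (which should score near 100) before computing system means and significance tests.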

  20. Results [results figure only]

  21. Results - additional features [figures only]

  23. Results - hand-selected features [figures only]

  26. Results - standard feature set [figures only]

  29. Results - parallel networks [figures only]

  31. Speech Samples [speech samples played here]

  32. Summary
  Main findings
  1. Adding high-dimensional representations of context to a frame-level network may be harmful
  2. Hierarchical systems (parallel or cascaded) can be useful when using noisy suprasegmental features
    • This suggests they may be operating as feature extractors or denoisers
  3. Parallel networks outperform cascaded networks in all cases
    • Consistent with the findings of [Yin et al. (2016)], although tested under different circumstances

  33. Future work
  • Explore the parallel approach with additional features
    • Syllable bag-of-phones, text-based word embeddings [Ribeiro et al. (2016)]
    • Can these frameworks leverage new information?
  • Decoupling of linguistic levels with the parallel approach (similar to [Yin et al. (2016)])
  • Hierarchical systems with recurrent layers
  • Alternative acoustic features for the suprasegmental network
