Cyborg speech: Deep multilingual speech synthesis for generating - PowerPoint PPT Presentation

Cyborg speech: Deep multilingual speech synthesis for generating segmental foreign accent with natural prosody Gustav Eje Henter 1 , Jaime Lorenzo-Trueba 1 , Xin Wang 1 , Mariko Kondo 2 , Junichi Yamagishi 1,3 gustav@nii.ac.jp, jyamagis@nii.ac.jp 1 National Institute of Informatics, Tokyo, Japan 2 Waseda University, Tokyo, Japan 3 The University of Edinburgh, Edinburgh, UK 2018-04-18 Henter et al. Cyborg speech 2018-04-18 1 / 28

Synopsis • We generate foreign-accented synthetic speech audio • . . . with native prosody • . . . having finely controllable accent • . . . as a new application of deep-learning-based speech synthesis • . . . using multilingual techniques • . . . from non-accented speech data alone Henter et al. Cyborg speech 2018-04-18 2 / 28

Overview 1. Introduction 2. Method 3. Experimental validation 3.1 Setup 3.2 Evaluation and results 4. Conclusion Henter et al. Cyborg speech 2018-04-18 3 / 28

Studying foreign accent What makes speech sound foreign-accented? • A question of speech perception research • Empirical method: Measure how listeners respond to speech stimuli with carefully controlled differences • Useful knowledge for improving foreign-language instruction Henter et al. Cyborg speech 2018-04-18 4 / 28

Cues to foreign accent What makes speech sound foreign-accented? • Supra-segmental properties • Intonation and pauses (Kang et al., 2010) • Nuclear stress (Hahn, 2004) • Duration (Tajima et al., 1997) • Speech rate (Munro and Derwing, 2001) • And more. . . • Segmental properties • Pronunciation errors • Listeners often consider this the most important aspect! (Derwing and Munro, 1997) • Worthwhile to correct even if not Henter et al. Cyborg speech 2018-04-18 5 / 28

Studying segmental foreign accent • Need speech stimuli isolating and interpolating segmental effects • Only specific segments should be affected • Without supra-segmental effects Henter et al. Cyborg speech 2018-04-18 6 / 28

Studying segmental foreign accent • Need speech stimuli isolating and interpolating segmental effects • Only specific segments should be affected • Without supra-segmental effects • Method 1: Record deliberate mispronunciations • Difficult/impossible to elicit Henter et al. Cyborg speech 2018-04-18 6 / 28

Studying segmental foreign accent • Need speech stimuli isolating and interpolating segmental effects • Only specific segments should be affected • Without supra-segmental effects • Method 1: Record deliberate mispronunciations • Difficult/impossible to elicit • Method 2: Cross-language splicing • Labour-intensive manual work • Artefacts at joins Henter et al. Cyborg speech 2018-04-18 6 / 28

Studying segmental foreign accent • Need speech stimuli isolating and interpolating segmental effects • Only specific segments should be affected • Without supra-segmental effects • Method 1: Record deliberate mispronunciations • Difficult/impossible to elicit • Method 2: Cross-language splicing • Labour-intensive manual work • Artefacts at joins • Method 3: Synthesise stimuli • Data-driven, automated approach • No joins • New tool; unusual application of speech synthesis Henter et al. Cyborg speech 2018-04-18 6 / 28

Our approach • Methods for synthesising foreign-accented stimuli • Multilingual HMM-based TTS (García Lecumberri et al., 2014) • Multilingual deep learning (this presentation!) • We improve on (García Lecumberri et al., 2014) in two ways: Henter et al. Cyborg speech 2018-04-18 7 / 28

Our approach • Methods for synthesising foreign-accented stimuli • Multilingual HMM-based TTS (García Lecumberri et al., 2014) • Multilingual deep learning (this presentation!) • We improve on (García Lecumberri et al., 2014) in two ways: • Improvement 1: Deep learning • Improved signal quality (Watts et al., 2016), meaning it better replicates the perceptual cues in natural speech • Enables easy control of the output synthesis (Watts et al., 2015; Luong et al., 2017) Henter et al. Cyborg speech 2018-04-18 7 / 28

Our approach • Methods for synthesising foreign-accented stimuli • Multilingual HMM-based TTS (García Lecumberri et al., 2014) • Multilingual deep learning (this presentation!) • We improve on (García Lecumberri et al., 2014) in two ways: • Improvement 1: Deep learning • Improved signal quality (Watts et al., 2016), meaning it better replicates the perceptual cues in natural speech • Enables easy control of the output synthesis (Watts et al., 2015; Luong et al., 2017) • Improvement 2: Use reference prosody (pitch and duration) • Can be taken from natural speech, or predicted by a separate system • Allows us to impose native-like suprasegmental properties Henter et al. Cyborg speech 2018-04-18 7 / 28

Building the synthesiser Traditional text-to-speech: Text Quinphones MGCs Text analysis Speech Other Acoustic BAPs Vocoder features model Duration Durations F0, VUV model Henter et al. Cyborg speech 2018-04-18 9 / 28

Building the synthesiser Speech synthesis with arbitrary prosody: Text Quinphones MGCs Text Acoustic analysis model Speech Other Durations BAPs Vocoder features Prosody generator F0, VUV Henter et al. Cyborg speech 2018-04-18 9 / 28

Building the synthesiser Speech synthesis with natural prosody: Text Quinphones MGCs Text Acoustic analysis Other model BAPs Speech Durations Vocoder Speech features analysis Natural F0, VUV + HTK speech Henter et al. Cyborg speech 2018-04-18 9 / 28

Building the synthesiser Speech synthesis with natural prosody: Text Quinphones MGCs Text Acoustic Machine Partially analysis model Other BAPs human Durations Vocoder Speech features speech Human analysis Natural F0, VUV + HTK speech Henter et al. Cyborg speech 2018-04-18 9 / 28

“Cyborg speech” Henter et al. Cyborg speech 2018-04-18 10 / 28

“Cyborg speech” • Cyborg: A being with both organic and biomechatronic body parts • Our acoustic parameters are a combination of man and machine Henter et al. Cyborg speech 2018-04-18 10 / 28

Making it foreign • Segmental foreign accent through multilingual speech synthesis: • Teach a single model to synthesise several languages natively • During synthesis, interpolate specific phones in the spoken language towards phones in the accent language • Maintain the same voice across languages • In this case by using data from a multilingually native speaker Henter et al. Cyborg speech 2018-04-18 11 / 28

Making it foreign • Segmental foreign accent through multilingual speech synthesis: • Teach a single model to synthesise several languages natively • During synthesis, interpolate specific phones in the spoken language towards phones in the accent language • Maintain the same voice across languages • In this case by using data from a multilingually native speaker • Running example: American English and Japanese • Combilex GAM (Richmond et al., 2009): 54 English phones • Open JTalk (Oura et al., 2010): 44 Japanese phones • Combined, bilingual phoneset: 54 + 44 = 98 phones Henter et al. Cyborg speech 2018-04-18 11 / 28

Synthesising foreign accent Cyborg speech: Text Quinphones MGCs Text Acoustic analysis model Speech Other Durations BAPs Vocoder Speech features analysis Natural F0, VUV + HTK speech Henter et al. Cyborg speech 2018-04-18 12 / 28

Synthesising foreign accent Bilingual cyborg speech synthesis: Language flag DBLSTM Text Bilingual quinphones bilingual MGCs Language- acoustic dependent model text Other BAPs Bilingual Durations Vocoder Speech analysis speech features analysis Native F0, VUV + HTK speech Henter et al. Cyborg speech 2018-04-18 12 / 28

Synthesising foreign accent Foreign-accented speech synthesis: Language flag CONTROL DBLSTM Text Bilingual quinphones bilingual MGCs Language- acoustic dependent model text Other BAPs Accented Durations Vocoder Speech analysis speech features analysis Native F0, VUV + HTK speech Henter et al. Cyborg speech 2018-04-18 12 / 28

Data and processing • Male voice talent native in both US English and Japanese • 2000 utterances per language • US English example • Japanese example • 20 pre-recorded test utterances in each language • Source of reference pitch and durations • 48 kHz at 16 bits Henter et al. Cyborg speech 2018-04-18 14 / 28

Data and processing • Male voice talent native in both US English and Japanese • 2000 utterances per language • US English example • Japanese example • 20 pre-recorded test utterances in each language • Source of reference pitch and durations • 48 kHz at 16 bits • WORLD vocoder (Morise et al., 2016) • Forced alignment using HTS (Zen et al., 2007) • Separate systems for each language Henter et al. Cyborg speech 2018-04-18 14 / 28

Cyborg speech: Deep multilingual speech synthesis for generating - PowerPoint PPT Presentation

Cyborg speech: Deep multilingual speech synthesis for generating segmental foreign accent with natural prosody Gustav Eje Henter 1 , Jaime Lorenzo-Trueba 1 , Xin Wang 1 , Mariko Kondo 2 , Junichi Yamagishi 1,3 gustav@nii.ac.jp, jyamagis@nii.ac.jp

Steve Mann: Half Cyborg, Half Badass Intro Steve Mann is considered by many to be the world's

Drupal 8s multilingual APIs Gbor Hojtsy DRUPAL 7 MULTILINGUAL DRUPAL 7 MULTILINGUAL Drupal

Drupal 8 Multilingual Wonderland Gabor Hojtsy Acquia Foreign language site Multilingual site

6-Text To Speech (TTS) Speech Synthesis Speech Synthesis Concept Speech Naturalness Phone

Speech Processing 11-492/18-492 Speech Processing 11-492/18-492 Speech Synthesis Evaluation

Speech Processing 15-492/18-492 Speech Synthesis Overview Text processing Speech Synthesis

Speech Processing 15- -492/18 492/18- -492 492 Speech Processing 15 Speech Synthesis Prosody

Speech Processing Speech Processing Using Speech with Computers Overview Overview Speech vs

Speech Processing 11-492/18-492 Speech Synthesis Overview Text processing Speech Synthesis

11-752: Speech Synthesis Objectives Understand basic processing in speech synthesis

Speech Processing 15-492/18-492 Speech Synthesis Evaluation Evaluating Speech Synthesis How

Speech Processing 15-492/18-492 Speech Synthesis Waveform generation 2 Speech Synthesis Text

SYNTHESIS OF SUPER SYNTHESIS OF SUPER NANOPOROUS SYNTHESIS OF SUPER SYNTHESIS OF

Speech Processing 15-492/18-492 Speech Synthesis Pronunciation Letter to Sound rules Speech

Multilingual App Toolkit Standards and multilingual software development 29, April 2015 Jan

Text-to-Speech Synthesis Bernd Mbius Language Science and Technology Saarland University

Martins Axiom and Choice Principles Eleftherios Tachtsis Department of Mathematics University

Alternative Data in Finance What is Alternative Data in Finance? Example: Lodging Key Metrics

Fast Timing via Cerenkov Radiation Earle Wilson, Advisor: Hans Wenzel Fermilab August 5, 2009

Some Problems and Generalizations on Erd os-Ko-Rado Theorem Huajun Zhang Department of

Matthew Series Lesson #152 January 29, 2017 Dean Bible Ministries www.deanbibleministries.org Dr.

Whats wrong with WebSocket APIs? Unveiling vulnerabilities in WebSocket APIs Mikhail Egorov /

Ethnic Fragmentation, Public Good Provision and Inequality in India, 1988 - 2012 Nishant Chadha 1

Direct influences vs contextuality in human choices Vctor H. Cervantes 1 , Ehtibar N. Dzhafarov 2

Sambuz

Useful Links

Newsletter

Mail Us