

SLIDE 1

End-to-end approach to ASR, TTS and Speech Translation

Satoshi Nakamura 1,2
with Sakriani Sakti 1,2, Andros Tjandra 1,2, Takatomo Kano, and Quoc Truong Do

1 Nara Institute of Science & Technology, Japan  2 RIKEN, Center for Advanced Intelligence Project AIP, Japan

Satoshi Nakamura @ AHCLab, NAIST, Japan | Aug. 14th 2019 CCF MTG Xining China

SLIDE 2

Outline

  • Machine Speech Chain
    • Machine Speech Chain: Listening while Speaking
      • Andros Tjandra, Sakriani Sakti, Satoshi Nakamura, "Listening while Speaking: Speech Chain by Deep Learning", ASRU 2017
    • Speech Chain with One-shot Speaker Adaptation
      • Andros Tjandra, Sakriani Sakti, Satoshi Nakamura, "Machine Speech Chain with One-shot Speaker Adaptation", Proceedings of INTERSPEECH 2018
    • End-to-end Feedback Loss in Speech Chain Framework via Straight-through Estimator
      • A. Tjandra, S. Sakti, S. Nakamura, "End-to-end Feedback Loss in Speech Chain Framework via Straight-through Estimator", in Proc. ICASSP, 2019
  • End-to-end Speech-to-speech Translation
    • Structure Based Curriculum Learning for End-to-end English-Japanese Speech Translation
      • Takatomo Kano, Sakriani Sakti, Satoshi Nakamura, "Structure Based Curriculum Learning for End-to-end English-Japanese Speech Translation", INTERSPEECH 2017


SLIDE 4

Motivation & Background

  • In human communication:
→ Speech relies on a closed-loop speech chain with a critical auditory feedback mechanism
→ Children who lose their hearing often have difficulty producing clear speech

[Figure: the speech chain, linking speaking and listening through motor nerves, sensory nerves, and auditory feedback]

SLIDE 5

Speech Chain: Denes & Pinson, 1973

SLIDE 6

Delayed Auditory Feedback*1,2

  • DAF:
  • A device that lets a user speak into a microphone and hear his or her own voice in headphones a fraction of a second later
  • Effects in people who stutter:
  • Those who stutter have an abnormal speech-auditory feedback loop that is corrected or bypassed while speaking under DAF
  • Effects in normal speakers:
  • DAF has been applied to non-stutterers to probe the structure of the auditory and verbal pathways in the brain
  • Indirect effects of DAF in non-stutterers include a reduced speech rate and increased intensity and fundamental frequency, as speakers try to overcome the feedback. Direct effects include repetition of syllables, mispronunciations, omissions, and omitted word endings.

*1 Bernard S. Lee, "Delayed Speech Feedback", The Journal of the Acoustical Society of America 22, 824 (1950); *2 Wikipedia, "Delayed Auditory Feedback"

SLIDE 7

Human-Machine Interaction

  • Modality in human-machine interaction
  • Providing technology with the ability to listen and speak

[Figure: speech recognition ("listening") maps the spoken input "Good afternoon" to the recognized words "Good afternoon"; speech synthesis ("speaking") produces the utterance "How are you?"; the human side shows sensory nerves, motor nerves, and auditory feedback]

SLIDE 8

Machine Speech Chain

  • Proposed method:
  • Develop a closed-loop speech chain model based on deep learning
  • The first deep learning model that integrates human speech perception & production behaviors
  • Not only has the capability to listen and speak, but can also listen while speaking

[Figure: closed-loop machine speech chain exchanging "Good afternoon" and "How are you?" between the speaking and listening components, with auditory feedback]

SLIDE 9

Motivation & Background

  • Despite the close relationship between speech perception & production,
→ ASR and TTS research have progressed independently

Property        | ASR                                     | TTS
Speech features | MFCC, Mel-fbank                         | MGC, log F0, Voiced/Unvoiced, BAP
Text features   | Phoneme, Character                      | Phoneme + POS + LEX + … (full-context label)
Model           | GMM-HMM, Hybrid DNN/HMM, End-to-end ASR | GMM-HSMM, DNN-HSMM, End-to-end TTS

SLIDE 10

Machine Speech Chain

  • Definition:
  • y = original speech, z = original text
  • ŷ = predicted speech, ẑ = predicted text
  • ASR(y): y → ẑ (a seq2seq model that transforms speech into text)
  • TTS(z): z → ŷ (a seq2seq model that transforms text into speech)

SLIDE 11

Machine Speech Chain
Case #1: Supervised Learning with Speech-Text Data

  • Given a paired speech-text sample (y, z)
  • Train ASR and TTS with supervised learning
  • Directly optimize:
→ ASR by minimizing ℒ_ASR(z, ẑ)
→ TTS by minimizing ℒ_TTS(y, ŷ)
  • Update both ASR and TTS independently

SLIDE 12

Machine Speech Chain
Case #2: Unsupervised Learning with Text Only

  • Given unlabeled text features z:
  • 1. TTS generates speech features ŷ
  • 2. Based on ŷ, ASR tries to reconstruct the text features ẑ
  • 3. Calculate ℒ_ASR(z, ẑ) between the original text features z and the prediction ẑ

→ It is possible to improve ASR with text only, with the support of TTS

SLIDE 13

Machine Speech Chain
Case #3: Unsupervised Learning with Speech Only

  • Given unlabeled speech features y:
  • 1. ASR predicts the most likely transcription ẑ
  • 2. Based on ẑ, TTS tries to reconstruct the speech features ŷ
  • 3. Calculate ℒ_TTS(y, ŷ) between the original speech features y and the prediction ŷ

→ It is possible to improve TTS with speech only, with the support of ASR
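The two unsupervised cycles above can be sketched with a toy linear model. Here `W_asr` and `W_tts` are hypothetical stand-ins for the real attention-based ASR and Tacotron-style TTS networks; the point is only the update pattern of the speech chain: each cycle computes a reconstruction loss and backpropagates it into just one of the two modules while the other stays fixed.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8                                          # toy dimensionality for "speech" and "text" vectors
W_asr = rng.normal(scale=0.1, size=(D, D))     # toy ASR: speech -> text
W_tts = rng.normal(scale=0.1, size=(D, D))     # toy TTS: text -> speech

def asr(y): return W_asr @ y
def tts(z): return W_tts @ z

def text_only_step(z, lr=0.05):
    """Case #2: TTS synthesizes speech from text, ASR reconstructs the text;
    only the ASR parameters receive the reconstruction gradient."""
    global W_asr
    y_hat = tts(z)                             # TTS generates speech features
    z_hat = asr(y_hat)                         # ASR reconstructs the text
    loss = np.mean((z - z_hat) ** 2)           # L_ASR(z, z_hat)
    grad = 2.0 / D * np.outer(z_hat - z, y_hat)
    W_asr -= lr * grad                         # update ASR only; TTS is frozen
    return loss

def speech_only_step(y, lr=0.05):
    """Case #3: ASR transcribes speech, TTS reconstructs the features;
    only the TTS parameters receive the reconstruction gradient."""
    global W_tts
    z_hat = asr(y)
    y_hat = tts(z_hat)
    loss = np.mean((y - y_hat) ** 2)           # L_TTS(y, y_hat)
    grad = 2.0 / D * np.outer(y_hat - y, z_hat)
    W_tts -= lr * grad                         # update TTS only; ASR is frozen
    return loss
```

Case #1 (paired data) would simply apply both gradients directly against the references z and y instead of against reconstructions.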

SLIDE 14

Sequence-to-Sequence ASR

Input & output
  • y = [y_1, …, y_T] (speech features)
  • z = [z_1, …, z_U] (text)

Model states
  • h^e_{1..T} = encoder states
  • h^d_u = decoder state at step u
  • a_u = attention probability at step u:
      a_u[t] = Align(h^e_t, h^d_u) = exp(Score(h^e_t, h^d_u)) / Σ_{t'=1}^{T} exp(Score(h^e_{t'}, h^d_u))
  • c_u = Σ_{t=1}^{T} a_u[t] · h^e_t (expected context)

Loss function
  ℒ_ASR(z, p_z) = −(1/U) Σ_{u=1}^{U} Σ_{c∈[1..C]} 1(z_u = c) · log p_{z_u}[c]
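The attention step and the character-level cross-entropy can be written compactly. This is a minimal numpy sketch of a single decoder step, using a dot-product `Score` (one common choice); it is not the exact network of the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_context(h_enc, h_dec_u):
    """a_u[t] = softmax over encoder positions of Score(h^e_t, h^d_u);
    c_u = expected context vector, the attention-weighted encoder state."""
    scores = h_enc @ h_dec_u           # (T,) dot-product scores
    a_u = softmax(scores)              # attention probabilities, sum to 1
    c_u = a_u @ h_enc                  # (hidden,) expected context
    return a_u, c_u

def asr_loss(z, p_z):
    """Negative log-likelihood of the reference characters:
    L_ASR = -(1/U) sum_u log p_{z_u}[z_u]."""
    U = len(z)
    return -np.mean(np.log(p_z[np.arange(U), z]))
```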

SLIDE 15

Sequence-to-Sequence TTS

Input & output
  • y^S = [y^S_1, …, y^S_T] (linear spectrogram features)
  • y^M = [y^M_1, …, y^M_T] (mel spectrogram features)
  • z = [z_1, …, z_U] (text)

Model states
  • h^e_{1..U} = encoder states
  • h^d_t = decoder state at step t
  • a_t = attention probability at step t
  • c_t = Σ_{u=1}^{U} a_t[u] · h^e_u (expected context)

Loss functions
  ℒ_TTS1(y, ŷ) = (1/T) Σ_{t=1}^{T} ( ‖y^M_t − ŷ^M_t‖² + ‖y^S_t − ŷ^S_t‖² )
  ℒ_TTS2(b, b̂) = −(1/T) Σ_{t=1}^{T} ( b_t log b̂_t + (1 − b_t) log(1 − b̂_t) )  (end-of-speech prediction)
  ℒ_TTS(y, ŷ, b, b̂) = ℒ_TTS1(y, ŷ) + ℒ_TTS2(b, b̂)

Architecture: fully connected layers and a CBHG module (Convolution Bank + Highway + bi-GRU), with an end-of-speech prediction head.
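The two TTS losses combine into a single scalar objective. A small numpy sketch (frame-wise squared error on mel and linear spectrograms, plus binary cross-entropy on the end-of-speech flag b; the epsilon clip is an addition for numerical safety):

```python
import numpy as np

def tts_loss(y_mel, y_mel_hat, y_lin, y_lin_hat, b, b_hat, eps=1e-7):
    """L_TTS = L_TTS1 (spectrogram regression) + L_TTS2 (stop-flag BCE)."""
    # L_TTS1: squared error on mel + linear frames, averaged over time
    l1 = np.mean(np.sum((y_mel - y_mel_hat) ** 2, axis=-1)
                 + np.sum((y_lin - y_lin_hat) ** 2, axis=-1))
    # L_TTS2: binary cross-entropy on the end-of-speech prediction
    b_hat = np.clip(b_hat, eps, 1 - eps)
    l2 = -np.mean(b * np.log(b_hat) + (1 - b) * np.log(1 - b_hat))
    return l1 + l2
```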

SLIDE 16

SLIDE 17

Experimental Set-up

  • Features
  • Speech:
  • 80-dim Mel-spectrogram (used by ASR & TTS)
  • 1024-dim linear magnitude spectrogram (STFT) (used by TTS)
  • TTS reconstructs the speech waveform using Griffin-Lim to estimate the phase, followed by an inverse STFT
  • Text:
  • Character-based prediction
  • a-z (26 letters)
  • 6 punctuation marks (,:'?.-)
  • 3 special tags <s> </s> <spc> (start, end, space)
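Griffin-Lim alternates between the magnitude constraint and STFT consistency to recover a plausible phase. A minimal numpy sketch, assuming a Hann window with a 512-sample frame and 128-sample hop (the actual system's STFT settings may differ):

```python
import numpy as np

def stft(x, n_fft=512, hop=128):
    win = np.hanning(n_fft)
    frames = [np.fft.rfft(x[s:s + n_fft] * win)
              for s in range(0, len(x) - n_fft + 1, hop)]
    return np.array(frames)                      # (T, n_fft // 2 + 1)

def istft(S, n_fft=512, hop=128):
    win = np.hanning(n_fft)
    out = np.zeros(hop * (len(S) - 1) + n_fft)
    norm = np.zeros_like(out)                    # window-sum normalization
    for i, spec in enumerate(S):
        s = i * hop
        out[s:s + n_fft] += np.fft.irfft(spec) * win
        norm[s:s + n_fft] += win ** 2
    return out / np.maximum(norm, 1e-8)

def griffin_lim(mag, n_iter=30, n_fft=512, hop=128, rng=None):
    """Start from random phase, then alternate ISTFT/STFT projections,
    keeping the given magnitude and the re-estimated phase."""
    rng = rng or np.random.default_rng()
    angles = np.exp(2j * np.pi * rng.random(mag.shape))
    for _ in range(n_iter):
        x = istft(mag * angles, n_fft, hop)
        angles = np.exp(1j * np.angle(stft(x, n_fft, hop)))
    return istft(mag * angles, n_fft, hop)
```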

SLIDE 18

Experiments on a Single Speaker

  • Dataset:
  • BTEC corpus (text); speech generated by Google TTS (using the gTTS library)
  • Supervised training: 10,000 utts (paired text & speech)
  • Unsupervised training: 40,000 utts (unpaired text & speech)
  • Result:

Data            | β    | γ | gen. mode | ASR CER (%) | TTS Mel | TTS Raw | Acc (%)
Paired (10k)    | -    | - | -         | 10.06       | 7.07    | 9.38    | 97.7
+Unpaired (40k) | 0.25 | 1 | greedy    | 5.83        | 6.21    | 8.49    | 98.4
+Unpaired (40k) | 0.5  | 1 | greedy    | 5.75        | 6.25    | 8.42    | 98.4
+Unpaired (40k) | 0.25 | 1 | beam 5    | 5.44        | 6.24    | 8.44    | 98.3
+Unpaired (40k) | 0.5  | 1 | beam 5    | 5.77       | 6.20    | 8.44    | 98.3

Acc: end-of-speech prediction accuracy
SLIDE 19

Experiments on Multiple Speakers

  • Dataset:
  • BTEC ATR-EDB corpus (text & speech) (25 male, 25 female speakers)
  • Supervised training: 80 utts/spk (paired text & speech)
  • Unsupervised training: 360 utts/spk (unpaired text & speech)
  • Result:

Data                  | β    | γ | gen. mode | ASR CER (%) | TTS Mel | TTS Raw | Acc (%)
Paired (80 utt/spk)   | -    | - | -         | 26.47       | 10.21   | 13.18   | 98.6
+Unpaired (remaining) | 0.25 | 1 | greedy    | 23.03       | 9.14    | 12.86   | 98.7
+Unpaired (remaining) | 0.5  | 1 | greedy    | 20.91       | 9.31    | 12.88   | 98.6
+Unpaired (remaining) | 0.25 | 1 | beam 5    | 22.55       | 9.36    | 12.77   | 98.6
+Unpaired (remaining) | 0.5  | 1 | beam 5    | 19.99       | 9.20    | 12.84   | 98.6

Acc: end-of-speech prediction accuracy

SLIDE 20

Outline

  • Machine Speech Chain
    • Machine Speech Chain: Listening while Speaking
      • Andros Tjandra, Sakriani Sakti, Satoshi Nakamura, "Listening while Speaking: Speech Chain by Deep Learning", ASRU 2017
    • Speech Chain with One-shot Speaker Adaptation
      • Andros Tjandra, Sakriani Sakti, Satoshi Nakamura, "Machine Speech Chain with One-shot Speaker Adaptation", Proceedings of INTERSPEECH 2018
    • End-to-end Feedback Loss in Speech Chain Framework via Straight-through Estimator
      • A. Tjandra, S. Sakti, S. Nakamura, "End-to-end Feedback Loss in Speech Chain Framework via Straight-through Estimator", in Proc. ICASSP, 2019
  • End-to-end Speech-to-speech Translation
    • Structure Based Curriculum Learning for End-to-end English-Japanese Speech Translation
      • Takatomo Kano, Sakriani Sakti, Satoshi Nakamura, "Structure Based Curriculum Learning for End-to-end English-Japanese Speech Translation", INTERSPEECH 2017

SLIDE 21

Speech Chain with One-shot Speaker Adaptation

Andros Tjandra 1,2, Sakriani Sakti 1,2, Satoshi Nakamura 1,2, "Machine Speech Chain with One-shot Speaker Adaptation", Proceedings of INTERSPEECH 2018

SLIDE 22

Sequel: Speech Chain with One-shot Speaker Adaptation

  • Motivation:
  • The previous model improved the single-speaker results significantly
  • Limitation: it could not train on unseen speakers (discrete speaker embedding)
  • Proposed model: [figure: speech chain extended with a speaker-embedding module]

SLIDE 23

One-shot Speaker Adaptation in TTS

  • Instead of using a discrete speaker index (one vector per speaker),
  • We generate a vector from a short utterance using DeepSpeaker (a speaker recognition model)
  • Take the last layer before the softmax as the embedding A
  • Integrate this information into Tacotron's decoder for generation
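The adaptation step can be sketched as: pool an utterance into a fixed-length, L2-normalized vector, then feed that vector to the decoder. Everything here is a toy stand-in; the real DeepSpeaker is a deep CNN/GRU network and `W_spk`, the sizes, and the concatenation scheme are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
FEAT, EMB, DEC = 40, 16, 32                       # toy feature/embedding/decoder sizes
W_spk = rng.normal(scale=0.1, size=(FEAT, EMB))   # hypothetical embedding projection

def speaker_embedding(frames):
    """Toy DeepSpeaker-style extractor: mean-pool the utterance frames,
    project, and L2-normalize to get a speaker vector A."""
    v = frames.mean(axis=0) @ W_spk
    return v / np.linalg.norm(v)

def condition_decoder_input(dec_state, spk_emb):
    """Concatenate the speaker embedding onto a decoder input frame,
    one simple way of integrating A into the TTS decoder."""
    return np.concatenate([dec_state, spk_emb], axis=-1)
```

Because the embedding is computed from the utterance itself rather than looked up by speaker index, a single short sample from an unseen speaker is enough to condition generation.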

SLIDE 24

ASR Results

Model | CER (%)
Supervised training: WSJ train_si84 (16 h speech, paired) → baseline
  Att Enc-Dec | 17.35
Supervised training: WSJ train_si284 (66 h speech, paired) → upper bound
  Att Enc-Dec | 7.12
Semi-supervised training: WSJ train_si84 (paired) + train_si200 (unpaired)
  Label propagation (greedy) | 17.52
  Label propagation (beam=5) | 14.58
  Proposed speech chain | 9.86

SLIDE 25

TTS Results

  • Text: "the busses aren't the problem, they actually provide a solution"
  • Single speaker (LJSpeech) (P = paired, U = unpaired): audio samples for Baseline (P 30%), Sp-Chain (P 30% + U 70%), and Full (P 100%)
  • Multi-speaker (WSJ): audio samples for speakers Female A and Male B under Baseline (P si84), Sp-Chain (P si84 + U si200), and Full (P si284)

SLIDE 26

Outline

  • Machine Speech Chain
    • Machine Speech Chain: Listening while Speaking
      • Andros Tjandra, Sakriani Sakti, Satoshi Nakamura, "Listening while Speaking: Speech Chain by Deep Learning", ASRU 2017
    • Speech Chain with One-shot Speaker Adaptation
      • Andros Tjandra, Sakriani Sakti, Satoshi Nakamura, "Machine Speech Chain with One-shot Speaker Adaptation", Proceedings of INTERSPEECH 2018
    • End-to-end Feedback Loss in Speech Chain Framework via Straight-through Estimator
      • A. Tjandra, S. Sakti, S. Nakamura, "End-to-end Feedback Loss in Speech Chain Framework via Straight-through Estimator", in Proc. ICASSP, 2019
  • End-to-end Speech-to-speech Translation
    • Structure Based Curriculum Learning for End-to-end English-Japanese Speech Translation
      • Takatomo Kano, Sakriani Sakti, Satoshi Nakamura, "Structure Based Curriculum Learning for End-to-end English-Japanese Speech Translation", INTERSPEECH 2017

SLIDE 27

End-to-end Feedback Loss in Speech Chain Framework via Straight-through Estimator

Andros Tjandra 1,2, Sakriani Sakti 1,2, Satoshi Nakamura 1,2, "End-to-end Feedback Loss in Speech Chain Framework via Straight-through Estimator", in Proc. ICASSP, 2019

SLIDE 28

Straight-Through Estimator for Speech Chain

  • Feedback loss: ℒ_TTS(y, ŷ), where ŷ = TTS(ẑ, A)
  a) Speech chain loop with speaker-embedding module A
  b) Original: the feedback loss ℒ_TTS cannot be backpropagated through the discrete variable ẑ
  c) Proposal: estimate the gradient through ẑ with a straight-through estimator
  • Proposed approach: handle backpropagation through discrete nodes

SLIDE 29

Straight-Through Estimator

a) ST-argmax: deterministically choose the token with the highest probability
   p_{z_u}[c] = exp(h^d_u[c]) / Σ_{j=1}^{C} exp(h^d_u[j])
   z̃_u = argmax_c p_{z_u}[c]

b) ST-Gumbel softmax: sample a token from p_{z_u} (g_c: Gumbel noise, τ: temperature)
   p_{z_u}[c] = exp((h^d_u[c] + g_c) / τ) / Σ_{j=1}^{C} exp((h^d_u[j] + g_j) / τ)
   z̃_u ~ Categorical(p_{z_u}[1], …, p_{z_u}[C])

In both cases the backward pass treats the discretization as the identity, yielding a new gradient of ℒ_TTS with respect to θ_ASR.
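The ST-Gumbel sampler fits in a few lines. This numpy sketch shows only the forward pass; the straight-through trick itself is described in the comment, since in an autograd framework (e.g. PyTorch) one would return `hard + soft - soft.detach()` so the gradient flows along the soft path.

```python
import numpy as np

def gumbel_softmax(logits, tau=1.0, rng=None):
    """ST-Gumbel softmax: sample a one-hot token from the relaxed
    distribution; the backward pass would use the soft probabilities."""
    rng = rng or np.random.default_rng()
    # Gumbel(0, 1) noise: -log(-log(U)), U ~ Uniform(0, 1)
    g = -np.log(-np.log(rng.uniform(1e-10, 1.0, logits.shape)))
    soft = np.exp((logits + g) / tau)
    soft /= soft.sum(axis=-1, keepdims=True)          # relaxed categorical sample
    hard = np.eye(logits.shape[-1])[soft.argmax(-1)]  # discrete one-hot token
    # Straight-through: forward value is `hard`; the gradient is taken as if
    # the output were `soft`, i.e. the discretization is treated as identity.
    return hard, soft
```

Lower temperatures make `soft` closer to one-hot (smaller forward/backward mismatch) at the cost of higher gradient variance.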

SLIDE 30

Experiments on Multi-Speaker WSJ Task

  • Dataset:
  • Training set (supervised, paired text & speech): WSJ SI-284 (upper bound) (37,318 utterances, ~81 h, 284 speakers)
  • Development set: dev93
  • Evaluation set: eval92

Model | CER (%)
Baseline:
  Enc-Dec Att-MLP [Kim et al., 2017] | 11.08
  Enc-Dec Att-MLP-Loc [Kim et al., 2017] | 8.17
  Enc-Dec Att-MLP [Tjandra et al., 2017] | 7.12
  Enc-Dec Att-MLP-MA (ours) [Tjandra et al., 2018] | 6.43
Proposed method:
  Enc-Dec Att-MLP-MA SP-Chain ST-argmax | 5.75
  Enc-Dec Att-MLP-MA SP-Chain ST-Gumbel | 5.70

SLIDE 31

Outline

  • Machine Speech Chain
    • Machine Speech Chain: Listening while Speaking
      • Andros Tjandra, Sakriani Sakti, Satoshi Nakamura, "Listening while Speaking: Speech Chain by Deep Learning", ASRU 2017
    • Speech Chain with One-shot Speaker Adaptation
      • Andros Tjandra, Sakriani Sakti, Satoshi Nakamura, "Machine Speech Chain with One-shot Speaker Adaptation", Proceedings of INTERSPEECH 2018
    • End-to-end Feedback Loss in Speech Chain Framework via Straight-through Estimator
      • A. Tjandra, S. Sakti, S. Nakamura, "End-to-end Feedback Loss in Speech Chain Framework via Straight-through Estimator", in Proc. ICASSP, 2019
  • End-to-end Speech-to-speech Translation
    • Structure Based Curriculum Learning for End-to-end English-Japanese Speech Translation
      • Takatomo Kano, Sakriani Sakti, Satoshi Nakamura, "Structure Based Curriculum Learning for End-to-end English-Japanese Speech Translation", INTERSPEECH 2017

SLIDE 32

Structure Based Curriculum Learning for End-to-end Direct English-Japanese Speech Translation

Takatomo Kano, Sakriani Sakti, Satoshi Nakamura, "Structure Based Curriculum Learning for End-to-end English-Japanese Speech Translation", INTERSPEECH 2017

SLIDE 33

Traditional Speech Translation

[Figure: cascade pipeline ASR → MT → TTS, translating the English speech "i am very nervous" into the Japanese speech 「私 は とても 緊張 して います」]

The traditional approach to speech-to-speech translation constructs:
  • automatic speech recognition (ASR)
  • machine translation (MT)
  • text-to-speech synthesis (TTS)
all of which are independently trained and tuned.

SLIDE 34

Related Works

  • L. Duong et al., NAACL 2016 [1]
  • Title: An Attentional Model for Speech Translation Without Transcription
  • Spanish-to-English speech-to-text direct translation with attentional encoder-decoder networks
  • Alexandre Berard et al., NIPS workshop 2016 [2]
  • Title: Listen and Translate: A Proof of Concept for End-to-End Speech-to-Text Translation
  • French-to-English speech-to-text direct translation with attentional encoder-decoder networks

SLIDE 35

Related Works [2]

  • End-to-end speech-to-text translation with an attentional model

[Figure: bi-directional LSTM encoder over acoustic features, attention, and an LSTM decoder emitting target words]

SLIDE 36

Problems

  • These works only apply to language pairs with similar syntax and word order (SVO to SVO) [1,2]
  • For such languages, only local movements are sufficient for translation.

[Figures: Spanish-to-English translation attention matrix [1]; French-to-English translation attention matrix [2]]

SLIDE 37

Problems

  • Syntactically distant language pairs (SVO versus SOV) suffer from long-distance reordering phenomena.

English-to-Japanese translation attention matrix (#####: cell value not readable):

            朝食    は      いくら   で      す      か
how         0.003   9e-04   0.09    0.057   0.001   #####
much        2e-04   0.819   0.828   0.033   0.833   #####
is          0.003   0.017   0.037   0.005   0.166   #####
the         0.01    0.026   0.038   0.024   2e-04   2e-04
breakfast   0.738   0.003   7e-04   #####   #####   0.004
?           0.08    0.122   0.001   0.882   #####   0.935

SLIDE 38

Proposed Method

  • A first attempt to build a direct speech-to-text translation (ST) system for syntactically distant language pairs
  • To guide the attentional encoder-decoder model through this difficult problem, we propose a structure-based curriculum learning strategy.

SLIDE 39

Attention-based ST with Curriculum Learning

Phase 1: Train attention-based encoder-decoder networks for the standard ASR task (Bi-LSTM encoder, attention, LSTM decoder) and the standard MT task (Bi-LSTM encoder, attention, LSTM decoder) separately.

SLIDE 40

Attention-based ST with Curriculum Learning

Phase 2 comes in two variants:
  • Fast track (ASR + MT): attach the MT attention and decoder to the ASR Bi-LSTM encoder; the model now predicts the corresponding word sequence in the target language given the input speech.
  • Slow track (ASR): replace the ASR decoder with an LSTM transcoder tuned with a mean-squared-error loss; its objective is to predict the word representations (like the MT encoder's output).

SLIDE 41

Attention-based ST with Curriculum Learning

Phase 3 (slow track, ASR + MT): combine the Bi-LSTM encoder and LSTM transcoder with the MT attention and decoder modules to perform the speech translation task from the source speech sequence to the target word sequence.

SLIDE 42

Attention-based ST with Curriculum Learning

[Figure: overview of all phases. Phase 1 trains ASR and MT separately; Phase 2 builds the fast track (ASR encoder + MT attention and decoder) and the slow track (ASR encoder + transcoder matched to the MT encoder); Phase 3 assembles encoder, transcoder, attention, and decoder for end-to-end ST]

The attention-based networks are first trained for ASR and text-based MT tasks, then gradually trained for the end-to-end ST task.
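The slow-track idea of Phase 2 (match the speech-side representation to the frozen MT encoder output under an MSE loss) can be illustrated with a toy linear transcoder. Here `h_speech`, `e_mt`, the sizes, and the linear form of the transcoder are all synthetic stand-ins for the real Bi-LSTM states:

```python
import numpy as np

rng = np.random.default_rng(0)
T, H = 50, 16                              # frames and hidden size (toy values)
h_speech = rng.normal(size=(T, H))         # stand-in for ASR encoder states
W_mt = rng.normal(size=(H, H))
e_mt = h_speech @ W_mt                     # stand-in for frozen MT encoder word representations

# Phase 2 (slow track): fit a linear "transcoder" so the speech-side
# representations match the MT encoder outputs, minimizing MSE.
W_tc = np.zeros((H, H))
for _ in range(200):
    pred = h_speech @ W_tc
    grad = h_speech.T @ (pred - e_mt) / T  # d(MSE)/d(W_tc)
    W_tc -= 0.05 * grad

mse = np.mean((h_speech @ W_tc - e_mt) ** 2)
```

Once the transcoder output is close to the MT encoder's representation space, the pretrained MT attention and decoder can be attached on top of it (Phase 3) instead of being trained from scratch.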

SLIDE 43

Experimental Set-up

System settings:
  ASR: input units 23; hidden units 512; output units 27,293; LSTM depth 2
  MT: source vocabulary 27,293; target vocabulary & output size 33,155; input units & embed size 12,823; hidden units 512; LSTM depth 2
  Optimizer: Adam

Data settings:
  BTEC para-text: 45,000 train / 500 test utterances
  BTEC speech: 45,000 train / 500 test utterances
  Speech feature: 23-dim F-bank
  Other: BTEC speech generated with the Google TTS system

SLIDE 44

Translation Accuracy

  • The best performance was achieved by the proposed Slow Track model
  • It surpassed the text-based MT and cascade ASR+MT systems.

SLIDE 45

Overall Summary

  • Machine Speech Chain by ASR-TTS coupling
  • Machine Speech Chain: Listening while Speaking
  • Speech Chain with One-shot Speaker Adaptation
  • End-to-end Feedback Loss in Speech Chain Framework via Straight-through Estimator
  • End-to-end Speech-to-speech Translation
  • Structure Based Curriculum Learning for End-to-end English-Japanese Speech Translation
  • Future Work
  • Multi-modal speech chain
  • Advanced ASR-TTS modules
  • Advanced MT modules
  • Learning human perception and cognitive processes