

SLIDE 1

Automatic Speech Recognition (CS753)

Lecture 25: Speech synthesis (Concluding lecture)

Instructor: Preethi Jyothi, Nov 6, 2017

SLIDE 2

Recall: SPSS framework

  • Training
  • Estimate the acoustic model λ̂ given speech utterances (O) and word sequences (W):

λ̂ = arg max_λ p(O | W, λ)

  • Synthesis
  • Find the most probable ô from λ̂ and a given word sequence w to be synthesised:

ô = arg max_o p(o | w, λ̂)

  • Synthesize speech from ô

[Figure: SPSS pipeline. Training: speech → Speech Analysis (O) and text → Text Analysis (W) feed Train Model, yielding λ̂. Synthesis: text → Text Analysis → Parameter Generation (ô) → Speech Synthesis.]

SLIDE 3

Synthesis using duration models

[Figure: HMM-based synthesis with duration models. TEXT → context-dependent HMMs and context-dependent duration models → sentence HMM with state duration densities d → mel-cepstrum coefficients c and pitch → MLSA filter → SYNTHETIC SPEECH]

  • Use delta features for smooth trajectories

Image from Yoshimura et al., "Duration modelling for HMM-based speech synthesis", ICSLP '98
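The cited Yoshimura et al. paper assigns each state's duration from its Gaussian duration density in closed form: d_k = m_k + ρ·σ_k², with ρ chosen so the durations sum to a desired total length T. A minimal numpy sketch of that rule; the state statistics and target length below are made up:

```python
import numpy as np

def assign_state_durations(means, variances, total_frames):
    """Set per-state durations from Gaussian duration densities so that
    they sum (approximately) to a desired total length in frames.

    Closed-form rule from Yoshimura et al. (ICSLP '98):
        d_k = m_k + rho * sigma_k^2,
    with rho chosen so that sum_k d_k == total_frames.
    """
    means = np.asarray(means, dtype=float)
    variances = np.asarray(variances, dtype=float)
    rho = (total_frames - means.sum()) / variances.sum()
    durations = means + rho * variances
    # Durations must be positive integer frame counts.
    return np.maximum(1, np.round(durations).astype(int))

# Toy example: 5 HMM states, stretch the utterance to 120 frames.
d = assign_state_durations([20, 15, 30, 10, 25], [4.0, 2.0, 9.0, 1.0, 6.0], 120)
print(d, d.sum())
```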

SLIDE 4

Transforming voice characteristics

  • We studied speaker adaptation techniques for ASR:
  • Maximum a posteriori (MAP) estimation
  • Maximum Likelihood Linear Regression (MLLR)

[Figure: MLLR adaptation. A general model is mapped to a transformed model via linear transforms tied to regression classes]

  • These techniques can also be applied to speech synthesis
  • MLLR: estimate a set of linear transforms that map an existing model into an adapted model s.t. the likelihood of the adaptation data is maximized
  • For limited adaptation data, MLLR is more effective than MAP

Image from Zen et al., “Statistical Parametric Speech Synthesis”, SPECOM 2009
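A minimal numpy sketch of the MLLR mean update: every Gaussian mean in a regression class passes through one shared affine transform μ' = Aμ + b. Estimating (A, b) by maximum likelihood is an EM-style computation omitted here; the transform values below are illustrative placeholders, not estimated ones:

```python
import numpy as np

def mllr_adapt_means(means, A, b):
    """Apply a shared MLLR mean transform: mu' = A @ mu + b."""
    return means @ A.T + b  # (num_gaussians, dim) -> (num_gaussians, dim)

rng = np.random.default_rng(0)
means = rng.normal(size=(8, 3))   # 8 Gaussian means in one regression class
A = np.eye(3) + 0.1 * rng.normal(size=(3, 3))  # placeholder transform
b = rng.normal(size=3)                          # placeholder bias
adapted = mllr_adapt_means(means, A, b)
```

Because one (A, b) pair is shared across a whole regression class, MLLR needs far fewer adaptation frames than re-estimating every Gaussian individually, which is why it beats MAP when adaptation data is limited.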

SLIDE 5

Transforming voice characteristics

  • What if no adaptation data is available?
  • HMM parameters can be interpolated
  • Synthesize speech with varying voice characteristics not encountered during training

[Figure: a new model λ′ is obtained by interpolating among trained models λ1, λ2, λ3, λ4 with weights I(λ′, λ1) … I(λ′, λ4)]

Image from Zen et al., "Statistical Parametric Speech Synthesis", SPECOM 2009
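A minimal sketch of the interpolation idea, assuming we simply take a convex combination of the Gaussian means of several trained speaker models (real systems also interpolate covariances and duration parameters; the four "speaker" models and the weights below are made up):

```python
import numpy as np

# Four trained speaker models lambda_1..lambda_4, each reduced here to a
# matrix of Gaussian means (8 Gaussians, 3-dim features) for illustration.
speaker_means = [np.full((8, 3), k, dtype=float) for k in range(4)]

# Interpolation weights I(lambda', lambda_k); they sum to 1, so the new
# voice lambda' lies "between" the training voices.
weights = np.array([0.4, 0.3, 0.2, 0.1])

interpolated = sum(w * m for w, m in zip(weights, speaker_means))  # lambda'
```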

SLIDE 6

GMM-based voice conversion

[Figure: JD-GMM voice conversion. Training: vocoder analysis of parallel source and target speech, DTW alignment, JD-GMM training. Conversion: vocoder analysis of source speech, GMM mixture decision and acoustic parameter conversion, vocoder synthesis of converted speech]

  • Parallel training data: align source and target speech frame-by-frame
  • Estimate a joint-distribution GMM (JD-GMM) to model the joint PDF of source and target features
  • At conversion time, predict the most likely converted acoustic features given a source acoustic feature sequence

Image from Ling et al., “Deep Learning for Acoustic Modeling in Parametric Speech Generation”, 2015
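A minimal sketch of the conversion step under a joint-density GMM: fit a GMM on stacked [source; target] vectors, then map a source frame x to the posterior-weighted conditional means E[y | x, m]. The data here is synthetic and the mixture count arbitrary:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
D = 2                                            # toy feature dimension
X = rng.normal(size=(1000, D))                   # DTW-aligned source frames (fake)
Y = 2.0 * X + 0.1 * rng.normal(size=(1000, D))   # aligned target frames (fake)

# Joint-density GMM over stacked [x; y] vectors.
gmm = GaussianMixture(n_components=4, covariance_type='full').fit(np.hstack([X, Y]))

def mvn_pdf(x, mean, cov):
    """Density of a multivariate Gaussian at x."""
    d = x - mean
    return np.exp(-0.5 * d @ np.linalg.solve(cov, d)) / \
        np.sqrt((2 * np.pi) ** len(x) * np.linalg.det(cov))

def convert(x):
    """MMSE conversion of one source frame x to a target-domain frame."""
    mu_x, mu_y = gmm.means_[:, :D], gmm.means_[:, D:]
    S_xx = gmm.covariances_[:, :D, :D]
    S_yx = gmm.covariances_[:, D:, :D]
    # Posterior of each mixture given x, via the marginal GMM over x.
    post = np.array([w * mvn_pdf(x, m, S)
                     for w, m, S in zip(gmm.weights_, mu_x, S_xx)])
    post /= post.sum()
    # Per-mixture conditional mean: mu_y + S_yx S_xx^{-1} (x - mu_x).
    cond = np.array([my + Syx @ np.linalg.solve(Sxx, x - mx)
                     for my, Syx, Sxx, mx in zip(mu_y, S_yx, S_xx, mu_x)])
    return post @ cond

print(convert(np.array([0.5, -0.3])))
```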

SLIDE 7

Neural approaches to speech generation

SLIDE 8

Recall: DNN-based speech synthesis

[Figure: DNN-based SPSS. Text analysis and input feature extraction produce per-frame input features x_t (binary and numeric) for frames 1..T; stacked hidden layers h_t map them to output features y_t, the statistics (mean and variance) of the speech parameter vector sequence; parameter generation and waveform synthesis follow]

Image from Zen et al., "Statistical Parametric Speech Synthesis using DNNs", 2014

  • Input features encode linguistic contexts and numeric values (# of words, duration of the phoneme, etc.)
  • Output features are spectral and excitation parameters and their delta values
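A minimal sketch of this frame-level mapping as a feedforward regression network trained with mean squared error; the layer sizes, feature dimensions, and data below are placeholders, not the configuration from Zen et al.:

```python
import torch
import torch.nn as nn

# Per-frame linguistic features in, acoustic parameters (spectral +
# excitation statics and deltas) out. Dimensions are illustrative.
IN_DIM, OUT_DIM = 342, 127

dnn = nn.Sequential(
    nn.Linear(IN_DIM, 1024), nn.Tanh(),
    nn.Linear(1024, 1024), nn.Tanh(),
    nn.Linear(1024, 1024), nn.Tanh(),
    nn.Linear(1024, OUT_DIM),      # linear output layer for regression
)
opt = torch.optim.Adam(dnn.parameters(), lr=1e-3)

x = torch.randn(64, IN_DIM)        # a batch of 64 frames (synthetic)
y = torch.randn(64, OUT_DIM)       # target acoustic features (synthetic)
loss = nn.functional.mse_loss(dnn(x), y)
opt.zero_grad()
loss.backward()
opt.step()
```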

SLIDE 9

Recall: RNN-based speech synthesis

[Figure: Text → Text Analysis → Input Feature Extraction → input features → biLSTM → output features → Vocoder → Waveform]

  • Access long-range context in both forward and backward directions using biLSTMs
  • Inference is expensive; these models inherently have large latency

Image from Fan et al., “TTS synthesis with BLSTM-based RNNs”, 2014
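A minimal sketch of such a biLSTM regression model (dimensions are illustrative). Note that the backward direction needs the whole input sequence before any output frame can be produced, which is the latency problem noted above:

```python
import torch
import torch.nn as nn

class BLSTMSynth(nn.Module):
    """Bidirectional LSTM mapping linguistic features to acoustic features."""
    def __init__(self, in_dim=342, hidden=256, out_dim=127):
        super().__init__()
        self.rnn = nn.LSTM(in_dim, hidden, num_layers=2,
                           bidirectional=True, batch_first=True)
        self.proj = nn.Linear(2 * hidden, out_dim)  # 2x: both directions

    def forward(self, x):            # x: (batch, frames, in_dim)
        h, _ = self.rnn(x)
        return self.proj(h)          # (batch, frames, out_dim)

out = BLSTMSynth()(torch.randn(2, 100, 342))  # 2 utterances, 100 frames each
```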

SLIDE 10

Frame-synchronous streaming speech synthesis

[Figure: streaming synthesis. Text analysis and linguistic feature extraction yield phoneme-level linguistic features x(1)..x(N); a duration LSTM-RNN (Ld) predicts phoneme durations d̂(i); frame-level linguistic features x(i) feed an acoustic LSTM-RNN (La) whose recurrent output layer emits acoustic features y(i) that drive the vocoder frame by frame to produce the waveform]

Image from Zen & Sak, "Unidirectional LSTM RNNs for low-latency speech synthesis", 2015
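A minimal sketch of why unidirectional recurrence enables streaming: with an LSTM cell that only consumes past context, each acoustic frame can be handed to the vocoder as soon as that frame's linguistic features arrive. All dimensions and data are made up:

```python
import torch
import torch.nn as nn

cell = nn.LSTMCell(input_size=342, hidden_size=256)  # acoustic LSTM-RNN
proj = nn.Linear(256, 127)                           # output layer

h = torch.zeros(1, 256)
c = torch.zeros(1, 256)
for t in range(100):                  # frames arrive one at a time
    x_t = torch.randn(1, 342)         # frame-level linguistic features
    h, c = cell(x_t, (h, c))
    y_t = proj(h)                     # acoustic features for frame t
    # y_t can go to the vocoder immediately -> low latency, unlike a biLSTM
```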

SLIDE 11

Deep generative models

[Figure: a deep generative model maps a code (Gaussian, Uniform, etc.) to real data (images, sounds, etc.)]

Image from https://blog.openai.com/generative-models/

Example: Autoregressive models (Wavenet)

SLIDE 12

Wavenet

  • Speech synthesis using an auto-regressive generative model
  • Generates the waveform sample by sample, at a 16kHz sampling rate

Gif from https://deepmind.com/blog/wavenet-generative-model-raw-audio/

SLIDE 13

Wavenet

  • Wavenet uses "dilated convolutions"
  • Main limitation: very slow generation rate [Oct 2017: Wavenet deployed in Google Assistant1]

Gif from https://deepmind.com/blog/wavenet-generative-model-raw-audio/

1https://techcrunch.com/2017/10/04/googles-wavenet-machine-learning-based-speech-synthesis-comes-to-assistant/
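A minimal sketch of a stack of dilated causal 1-D convolutions, the core of WaveNet: doubling the dilation each layer grows the receptive field exponentially while each output still depends only on past samples. Channel counts and depth are illustrative, and the real model's gated activations, residual/skip connections, and conditioning are omitted:

```python
import torch
import torch.nn as nn

class DilatedCausalStack(nn.Module):
    """Stack of causal Conv1d layers with exponentially growing dilation."""
    def __init__(self, channels=32, layers=8):
        super().__init__()
        self.dilations = [2 ** i for i in range(layers)]  # 1, 2, 4, ..., 128
        self.convs = nn.ModuleList(
            nn.Conv1d(channels, channels, kernel_size=2, dilation=d)
            for d in self.dilations)

    def forward(self, x):              # x: (batch, channels, samples)
        for conv, d in zip(self.convs, self.dilations):
            # Left-pad by the dilation so each output sees only the past.
            x = torch.relu(conv(nn.functional.pad(x, (d, 0))))
        return x

y = DilatedCausalStack()(torch.randn(1, 32, 16000))  # one second at 16 kHz
```

Generation is slow because sampling is sequential: each new sample requires a full forward pass conditioned on the samples already produced.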

SLIDE 14

Wavenet

  • Reduced the gap between the state of the art and human performance by > 50%

[Audio: Recording 1, Recording 2, Recording 3]

  • Which of the three recordings sounded most natural?
SLIDE 15

Deep generative models

[Figure: a deep generative model maps a code (Gaussian, Uniform, etc.) to true data (images, sounds, etc.)]

Image from https://blog.openai.com/generative-models/

Example: Generative Adversarial Networks (GANs)

SLIDE 16

GANs

  • The training process is formulated as a game between a generator network and a discriminator network
  • Objective of the generator: create samples that appear to come from the same distribution as the training data
  • Objective of the discriminator: examine samples and distinguish generated (fake) samples from real ones
  • The solution to this game is an equilibrium between the generator and the discriminator
  • Refer to [Goodfellow16] for a detailed tutorial on GANs

[Goodfellow16]: https://arxiv.org/pdf/1701.00160.pdf
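A minimal sketch of this game as a training loop on toy 1-D data; the network sizes, learning rates, and data distribution are all illustrative, not from the lecture:

```python
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 1))  # generator
D = nn.Sequential(nn.Linear(1, 64), nn.ReLU(), nn.Linear(64, 1))   # discriminator
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

for step in range(100):
    real = torch.randn(32, 1) * 0.5 + 2.0   # toy "training data"
    fake = G(torch.randn(32, 16))           # samples from noise codes
    # Discriminator step: push real -> 1, generated -> 0.
    loss_d = bce(D(real), torch.ones(32, 1)) + \
             bce(D(fake.detach()), torch.zeros(32, 1))
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()
    # Generator step: fool the discriminator into scoring fakes as real.
    loss_g = bce(D(fake), torch.ones(32, 1))
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
```

At the equilibrium the slide describes, the generator's samples match the data distribution and the discriminator can do no better than chance.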

SLIDE 17

GANs for speech synthesis

[Figure: GAN-based SPSS. Generator: linguistic features + noise → predicted samples, with an MSE term. Discriminator: a binary classifier fed predicted OR natural samples]

  • The generator produces synthesised speech, which the discriminator distinguishes from real speech
  • During synthesis, random noise + linguistic features generate speech

Image from Yang et al., “SPSS using GANs”, 2017

SLIDE 18

Course conclusion

SLIDE 19

Topics covered

[Figure: ASR pipeline. speech signal → Acoustic Feature Generator → O → SEARCH, using an Acoustic Model (phones), Pronunciation Model, and Language Model → word sequence W*]

  • Formalism: Finite State Transducers
  • Properties of speech sounds
  • Acoustic Signal Processing
  • Ngram/RNN LMs
  • G2P models
  • Search algorithms
  • Hidden Markov Models
  • Deep Neural Networks
  • Hybrid HMM-DNN Systems
  • Speaker Adaptation
  • Discriminative Training

SLIDE 20

Topics covered

[Figure: end-to-end models map acoustic signals O directly to word sequences W*]

  • End-to-end Models
  • Ngram/RNN LMs

Also, briefly covered:

  • Conversational Agents
  • Speech Synthesis

SLIDE 21

Exciting time to do speech research

SLIDE 22

Remaining coursework

SLIDE 23

Final Exam Syllabus

  • 1. WFST algorithms + WFSTs used in ASR
  • 2. EM algorithm
  • 3. HMMs + Tied state Triphone HMMs
  • 4. DNN/RNN-based acoustic models
  • 5. N-gram/RNN language models
  • 6. CTC end-to-end ASR
  • 7. Pronunciation models
  • 8. Search & decoding
  • 9. Discriminative training for HMMs
  • 10. Basics of speaker adaptation
  • 11. HMM-based speech synthesis models

In the final exam, questions can be asked on any of the 11 topics listed above. Weightage of topics will be shared later on Moodle.

SLIDE 24

Final Project

Deliverables

  • 5-8 page final report:

✓ Task definition, Methodology, Prior work, Implementation Details, Experimental Setup, Experiments and Discussion, Error Analysis (if any), Summary

  • Short talk summarizing the project:

✓ Each team will get 10 mins for their presentation and 10 minutes for Q/A

✓ Clearly demarcate which team member worked on what part

SLIDE 25

Final Project Schedule

  • Presentations will tentatively be held on Nov 27th and Nov 28th
  • The final report in pdf format should be sent to pjyothi@cse.iitb.ac.in before Nov 27th
  • The order of presentations will be decided on a lottery basis and shared via Moodle by Nov 18th

SLIDE 26

Final Project Grading

  • Break-up of 20 points:
  • 6 points for the report
  • 4 points for the presentation
  • 6 points for Q/A
  • 4 points for overall evaluation of the project