Automatic Speech Recognition (CS753)
Lecture 25: Speech synthesis (Concluding lecture)
Instructor: Preethi Jyothi
Nov 6, 2017
Recall: SPSS framework
- Training
  - Estimate acoustic model λ̂ given speech utterances O and word sequences W:
      λ̂ = arg max_λ p(O | W, λ)
- Synthesis
  - Find the most probable ô from λ̂ and a given word sequence w to be synthesised:
      ô = arg max_o p(o | w, λ̂)
  - Synthesize speech from ô
[Figure: SPSS pipeline: in training, speech passes through Speech Analysis to give O and text through Text Analysis to give W, which train the model λ̂; in synthesis, Text Analysis feeds Parameter Generation (giving ô) and Speech Synthesis produces the waveform]
Synthesis using duration models
- Use delta features for smooth trajectories (a sketch of parameter generation with delta features follows below)
[Figure: HMM-based synthesis: from text, context-dependent HMMs, with state duration densities from context-dependent duration models, generate mel-cepstrum and pitch parameter sequences that drive an MLSA filter to produce synthetic speech]
Image from Yoshimura et al., "Duration modelling for HMM-based speech synthesis", ICSLP '98
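To make the delta-feature idea concrete, here is a minimal sketch of maximum-likelihood parameter generation (MLPG) in Python/numpy, assuming a simple first-difference delta window; the function name and toy dimensions are illustrative, not from the Yoshimura et al. paper:

import numpy as np

def mlpg(mu_static, var_static, mu_delta, var_delta):
    """Find the trajectory c maximizing the Gaussian likelihood of both
    static and delta means, i.e. solve (W' P W) c = W' P mu."""
    T = len(mu_static)
    W_static = np.eye(T)                      # static window: c_t
    W_delta = np.eye(T) - np.eye(T, k=-1)     # delta window: c_t - c_{t-1}
    W = np.vstack([W_static, W_delta])        # (2T, T)
    mu = np.concatenate([mu_static, mu_delta])
    prec = np.concatenate([1.0 / var_static, 1.0 / var_delta])
    A = W.T @ (prec[:, None] * W)             # normal equations
    b = W.T @ (prec * mu)
    return np.linalg.solve(A, b)

# Toy usage: a step in the static means becomes a smooth transition, since the
# (all-zero) delta means penalize abrupt frame-to-frame jumps.
c = mlpg(np.array([0., 0., 1., 1.]), np.full(4, 0.1),
         np.zeros(4), np.full(4, 0.1))
print(np.round(c, 2))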
Transforming voice characteristics
- We studied speaker adaptation techniques for ASR
  - Maximum a posteriori (MAP) estimation
  - Maximum Likelihood Linear Regression (MLLR)
[Figure: MLLR: a set of linear transforms, organized by regression class, maps the general model to the transformed (adapted) model]
- Can also be applied to speech synthesis
- MLLR: estimate a set of linear transforms that map an existing model into an adapted model such that the likelihood of the adaptation data is maximized (a sketch follows below)
- For limited adaptation data, MLLR is more effective than MAP
Image from Zen et al., “Statistical Parametric Speech Synthesis”, SPECOM 2009
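As a concrete illustration, here is a minimal sketch of estimating a single MLLR mean transform in Python, assuming one regression class and identity covariances; the function name and statistics accumulation are illustrative, not a specific toolkit's API:

import numpy as np

def estimate_mllr_transform(means, posteriors, frames):
    """Estimate an affine transform (A, b) so that transformed means
    A @ mu_m + b best explain the adaptation frames.

    means:      (M, D) Gaussian means of the existing model
    posteriors: (T, M) occupation probabilities gamma_t(m)
    frames:     (T, D) adaptation data
    """
    T, M = posteriors.shape
    D = means.shape[1]
    ext = np.hstack([means, np.ones((M, 1))])    # extended means [mu_m; 1]
    G = np.zeros((D + 1, D + 1))                 # sufficient statistics
    K = np.zeros((D + 1, D))
    for t in range(T):
        for m in range(M):
            g = posteriors[t, m]
            G += g * np.outer(ext[m], ext[m])
            K += g * np.outer(ext[m], frames[t])
    W = np.linalg.solve(G, K)                    # (D+1, D) least-squares solution
    A, b = W[:D].T, W[D]
    return A, b

# Adapted model: replace each mean mu_m with A @ mu_m + b.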
Transforming voice characteristics
- What if no adaptation data is available?
- HMM parameters can be interpolated
- Synthesize speech with varying voice characteristics not encountered during training (a sketch follows below)
[Figure: a new model λ′ is formed from trained models λ1, λ2, λ3, λ4 with interpolation ratios I(λ′, λ1), ..., I(λ′, λ4)]
Image from Zen et al., "Statistical Parametric Speech Synthesis", SPECOM 2009
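A minimal sketch of interpolating Gaussian output parameters between trained models, assuming diagonal covariances and moment matching; the weights play the role of the interpolation ratios I(λ′, λk):

import numpy as np

def interpolate_models(models, weights):
    """models: list of dicts with 'mean' (D,) and 'var' (D,) for one state;
    weights: interpolation ratios (normalized to sum to 1). Returns λ'."""
    weights = np.asarray(weights) / np.sum(weights)
    mean = sum(w * m["mean"] for w, m in zip(weights, models))
    # Moment-matched variance: E[x^2] - (E[x])^2 over the weighted mixture.
    var = sum(w * (m["var"] + m["mean"] ** 2)
              for w, m in zip(weights, models)) - mean ** 2
    return {"mean": mean, "var": var}

# e.g. 70% of speaker A plus 30% of speaker B gives a voice between the two:
A = {"mean": np.array([1.0, 2.0]), "var": np.array([0.5, 0.5])}
B = {"mean": np.array([3.0, 0.0]), "var": np.array([1.0, 1.0])}
print(interpolate_models([A, B], [0.7, 0.3]))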
GMM-based voice conversion
[Figure: JD-GMM voice conversion pipeline: vocoder analysis of parallel source and target speech, DTW alignment, and JD-GMM training; at conversion time, GMM mixture decision and acoustic parameter conversion, followed by vocoder synthesis of the converted speech]
- Parallel training data: align source and target speech frame-by-frame
- Estimate a joint-density GMM to model the joint PDF of source/target features
- At conversion time, predict the most likely converted acoustic features given a source acoustic feature sequence (a sketch follows below)
Image from Ling et al., “Deep Learning for Acoustic Modeling in Parametric Speech Generation”, 2015
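Here is a minimal sketch of JD-GMM training and MMSE conversion in Python, assuming already DTW-aligned parallel features; scikit-learn's GaussianMixture stands in for whatever trainer a real VC system uses, and the dimensions are illustrative:

import numpy as np
from scipy.stats import multivariate_normal
from sklearn.mixture import GaussianMixture

D = 24  # per-speaker feature dimension (e.g. mel-cepstra)

def train_jdgmm(src_frames, tgt_frames, n_components=8):
    """Fit a GMM on joint vectors z = [x; y] from aligned frame pairs."""
    joint = np.hstack([src_frames, tgt_frames])        # (T, 2D)
    return GaussianMixture(n_components, covariance_type="full").fit(joint)

def convert_frame(gmm, x):
    """MMSE conversion: E[y|x] = sum_m p(m|x) (mu_y^m + S_yx S_xx^-1 (x - mu_x^m))."""
    mu_x, mu_y = gmm.means_[:, :D], gmm.means_[:, D:]
    S_xx = gmm.covariances_[:, :D, :D]
    S_yx = gmm.covariances_[:, D:, :D]
    lik = np.array([gmm.weights_[m] * multivariate_normal.pdf(x, mu_x[m], S_xx[m])
                    for m in range(gmm.n_components)])
    post = lik / lik.sum()                             # p(m | x)
    return sum(post[m] * (mu_y[m] + S_yx[m] @ np.linalg.solve(S_xx[m], x - mu_x[m]))
               for m in range(gmm.n_components))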
Neural approaches to speech generation
Recall: DNN-based speech synthesis
[Figure: feedforward DNN: text analysis and input feature extraction produce binary & numeric input features for each frame 1, ..., T; several hidden layers map these to an output layer giving statistics (mean & var) of the speech parameter vector sequence, followed by parameter generation and waveform synthesis]
Image from Zen et al., "Statistical Parametric Speech Synthesis using DNNs", 2014
- Input features describe linguistic contexts, including numeric values (# of words, duration of the phoneme, etc.)
- Output features are spectral and excitation parameters and their delta values (a sketch follows below)
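A minimal PyTorch sketch of the frame-level DNN; the layer sizes and feature dimensions here are illustrative, not the paper's exact configuration:

import torch
import torch.nn as nn

class TTSDNN(nn.Module):
    def __init__(self, in_dim=300, hidden=1024, out_dim=127):
        # in_dim: binary + numeric linguistic features per frame
        # out_dim: spectral + excitation parameters and their deltas
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, out_dim),
        )

    def forward(self, x):          # x: (T, in_dim), one row per frame
        return self.net(x)         # (T, out_dim) predicted acoustic features

# Training pairs each frame's linguistic features with its acoustic target:
model = TTSDNN()
loss = nn.MSELoss()(model(torch.randn(100, 300)), torch.randn(100, 127))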
Recall: RNN-based speech synthesis
[Figure: Text → Text Analysis → Input Feature Extraction → input features → biLSTM layers → output features → Vocoder → Waveform]
- Access long-range context in both forward and backward directions using biLSTMs
- Inference is expensive; bidirectional models inherently have high latency (a sketch follows below)
Image from Fan et al., “TTS synthesis with BLSTM-based RNNs”, 2014
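A minimal PyTorch sketch of the biLSTM acoustic model; sizes are illustrative. The bidirectional pass is what provides context in both directions, and also why the whole input must be available before any output can be produced:

import torch
import torch.nn as nn

class BiLSTMTTS(nn.Module):
    def __init__(self, in_dim=300, hidden=256, out_dim=127):
        super().__init__()
        self.lstm = nn.LSTM(in_dim, hidden, num_layers=2,
                            bidirectional=True, batch_first=True)
        self.proj = nn.Linear(2 * hidden, out_dim)  # 2x: forward + backward states

    def forward(self, x):            # x: (batch, T, in_dim)
        h, _ = self.lstm(x)          # (batch, T, 2*hidden)
        return self.proj(h)          # (batch, T, out_dim)

y = BiLSTMTTS()(torch.randn(1, 100, 300))   # needs all 100 frames up front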
Frame-synchronous streaming speech synthesis
[Figure: streaming pipeline: text analysis and linguistic feature extraction produce phoneme-level linguistic features x(1), ..., x(N); a duration LSTM-RNN Ld predicts phoneme durations d̂(i), which expand the input to frame-level linguistic features; an acoustic LSTM-RNN La with a recurrent output layer maps these to acoustic features y(i), which a vocoder converts to the waveform]
Image from Zen & Sak, "Unidirectional LSTM RNNs for low-latency speech synthesis", 2015
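A minimal PyTorch sketch of the two-network streaming setup, assuming illustrative dimensions and a hypothetical position-in-phoneme feature; unlike the biLSTM, each frame can be emitted as soon as it is computed:

import torch
import torch.nn as nn

dur_rnn = nn.LSTM(input_size=300, hidden_size=256, batch_first=True)
dur_out = nn.Linear(256, 1)              # predicted phoneme duration (frames)
ac_rnn = nn.LSTM(input_size=304, hidden_size=256, batch_first=True)
ac_out = nn.Linear(256, 127)             # acoustic features for one frame

def synthesize(phone_feats):             # phone_feats: (N, 300), one row per phoneme
    h, _ = dur_rnn(phone_feats.unsqueeze(0))
    durations = dur_out(h).squeeze().clamp(min=1).round().long()   # d̂(i)
    state = None
    for i, d in enumerate(durations):    # frame-synchronous generation
        for t in range(int(d)):
            # Hypothetical position-in-phoneme features appended per frame.
            pos = torch.tensor([t, int(d), t / int(d), (int(d) - t) / int(d)])
            frame = torch.cat([phone_feats[i], pos]).view(1, 1, -1)
            out, state = ac_rnn(frame, state)
            yield ac_out(out).view(-1)   # emit this frame's features immediately

frames = list(synthesize(torch.randn(5, 300)))   # ready for a vocoder, frame by frame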
Deep generative models
[Figure: a deep generative model maps a code (Gaussian, Uniform, etc.) to real data (images, sounds, etc.)]
Image from https://blog.openai.com/generative-models/
Example: Autoregressive models (WaveNet)
WaveNet
- Speech synthesis using an autoregressive generative model
- Generates the waveform sample by sample, at a 16 kHz sampling rate
Gif from https://deepmind.com/blog/wavenet-generative-model-raw-audio/
WaveNet
- WaveNet uses "dilated convolutions" (a sketch follows below)
- Main limitation: very slow generation rate [Oct 2017: WaveNet deployed in Google Assistant1]
Gif from https://deepmind.com/blog/wavenet-generative-model-raw-audio/
1https://techcrunch.com/2017/10/04/googles-wavenet-machine-learning-based-speech-synthesis-comes-to-assistant/
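A minimal PyTorch sketch of a stack of dilated causal convolutions, showing only the receptive-field trick; WaveNet's gated activations, residual/skip connections, and mu-law output layer are omitted:

import torch
import torch.nn as nn
import torch.nn.functional as F

class DilatedCausalStack(nn.Module):
    def __init__(self, channels=32, dilations=(1, 2, 4, 8, 16)):
        super().__init__()
        self.convs = nn.ModuleList(
            nn.Conv1d(channels, channels, kernel_size=2, dilation=d)
            for d in dilations)
        self.dilations = dilations

    def forward(self, x):                # x: (batch, channels, T)
        for conv, d in zip(self.convs, self.dilations):
            # Left-pad by the dilation so no output depends on future samples.
            x = torch.relu(conv(F.pad(x, (d, 0))))
        return x                         # receptive field: 1 + 1+2+4+8+16 = 32 samples

print(DilatedCausalStack()(torch.randn(1, 32, 1000)).shape)  # same length, causal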
WaveNet
- Reduced the gap between the state of the art and human performance by > 50%
[Audio demo: Recording 1, Recording 2, Recording 3]
- Which of the three recordings sounded most natural?
Deep generative models
[Figure: a deep generative model maps a code (Gaussian, Uniform, etc.) to true data (images, sounds, etc.)]
Image from https://blog.openai.com/generative-models/
Example: Generative Adversarial Networks (GANs)
GANs
- Training is formulated as a game between a generator network and a discriminator network
- Objective of the generator: create samples that seem to come from the same distribution as the training data
- Objective of the discriminator: examine a sample and distinguish between fake (generated) and real samples
- The solution to this game is an equilibrium between the generator and the discriminator (a sketch follows below)
- Refer to [Goodfellow16] for a detailed tutorial on GANs
[Goodfellow16]: https://arxiv.org/pdf/1701.00160.pdf
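A minimal PyTorch sketch of the generator/discriminator game on toy 2-D data (not a speech model); each step first updates the discriminator to separate real from fake, then updates the generator to fool it:

import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 2))   # generator
D = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 1))   # discriminator
opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

for step in range(2000):
    real = torch.randn(32, 2) + 3.0          # stand-in for training data
    fake = G(torch.randn(32, 8))             # samples generated from code vectors
    # Discriminator step: push D(real) toward 1 and D(fake) toward 0.
    loss_d = bce(D(real), torch.ones(32, 1)) + bce(D(fake.detach()), torch.zeros(32, 1))
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()
    # Generator step: fool the discriminator, pushing D(fake) toward 1.
    loss_g = bce(D(fake), torch.ones(32, 1))
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()

print(G(torch.randn(5, 8)))   # samples should drift toward the data distribution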
GANs for speech synthesis
[Figure: the generator maps linguistic features (and noise) to predicted samples, trained with an MSE term; the discriminator is a binary classifier separating predicted samples from natural samples]
- The generator produces synthesised speech, which the discriminator distinguishes from real speech
- During synthesis, random noise + linguistic features generate speech
Image from Yang et al., “SPSS using GANs”, 2017
Course conclusion
Topics covered
[Figure: ASR pipeline: speech signal → Acoustic Feature Generator → O → SEARCH, using Acoustic Model (phones), Pronunciation Model, and Language Model → word sequence W*]
- Formalism: Finite State Transducers
- Properties of speech sounds; Acoustic Signal Processing
- Hidden Markov Models; Deep Neural Networks; Hybrid HMM-DNN Systems
- Ngram/RNN LMs; G2P models; Search algorithms
- Speaker Adaptation; Discriminative Training
- End-to-end Models: mapping acoustic signals directly to word sequences
Also, briefly covered: Conversational Agents, Speech Synthesis
Exciting time to do speech research
Remaining coursework
Final Exam Syllabus
1. WFST algorithms + WFSTs used in ASR
2. EM algorithm
3. HMMs + Tied-state triphone HMMs
4. DNN/RNN-based acoustic models
5. N-gram/RNN language models
6. CTC end-to-end ASR
7. Pronunciation models
8. Search & decoding
9. Discriminative training for HMMs
10. Basics of speaker adaptation
11. HMM-based speech synthesis models
In the final exam, questions can be asked on any of the 11 topics listed above. Weightage of topics will be shared later on Moodle.
Final Project
Deliverables
- 5-8 page final report:
  ✓ Task definition, Methodology, Prior Work, Implementation Details, Experimental Setup, Experiments and Discussion, Error Analysis (if any), Summary
- Short talk summarizing the project:
  ✓ Each team will get 10 minutes for their presentation and 10 minutes for Q/A
  ✓ Clearly demarcate which team member worked on what part
Final Project Schedule
- Presentations will tentatively be held on Nov 27th and Nov 28th
- The final report in PDF format should be sent to pjyothi@cse.iitb.ac.in before Nov 27th
- The order of presentations will be decided on a lottery basis and shared via Moodle by Nov 18th
Final Project Grading
- Break-up of 20 points:
- 6 points for the report
- 4 points for the presentation
- 6 points for Q/A
- 4 points for overall evaluation of the project