

SLIDE 1

Automatic Speech Recognition (CS753)

Lecture 25: Speech synthesis (Concluding lecture)

Instructor: Preethi Jyothi, Nov 6, 2017

SLIDE 2

Recall: SPSS framework

  • Training
  • Estimate the acoustic model λ̂ given speech utterances (O) and word sequences (W):

λ̂ = arg max_λ p(O | W, λ)

  • Synthesis
  • Find the most probable ô from λ̂ and a given word sequence w to be synthesised:

ô = arg max_o p(o | w, λ̂)

  • Synthesize speech from ô

[Figure: SPSS pipeline. Training: speech → Speech Analysis (O) and text → Text Analysis (W) feed Train Model, yielding λ̂. Synthesis: text → Text Analysis → Parameter Generation (ô) → Speech Synthesis.]

SLIDE 3

Synthesis using duration models

[Figure: HMM-based synthesis with duration models. TEXT → context-dependent HMMs and context-dependent duration models → sentence HMM with state duration densities d → mel-cepstrum coefficients c and pitch → MLSA filter → SYNTHETIC SPEECH]

  • Use delta features for smooth trajectories

Image from Yoshimura et al., "Duration modelling for HMM-based speech synthesis", ICSLP '98
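The cited Yoshimura et al. paper assigns each state's duration from its Gaussian duration density in closed form: d_k = m_k + ρ·σ_k², with ρ chosen so the durations sum to a desired total length T. A minimal numpy sketch of that rule; the state statistics and target length below are made up:

```python
import numpy as np

def assign_state_durations(means, variances, total_frames):
    """Set per-state durations from Gaussian duration densities so that
    they sum (approximately) to a desired total length in frames.

    Closed-form rule from Yoshimura et al. (ICSLP '98):
        d_k = m_k + rho * sigma_k^2,
    with rho chosen so that sum_k d_k == total_frames.
    """
    means = np.asarray(means, dtype=float)
    variances = np.asarray(variances, dtype=float)
    rho = (total_frames - means.sum()) / variances.sum()
    durations = means + rho * variances
    # Durations must be positive integer frame counts.
    return np.maximum(1, np.round(durations).astype(int))

# Toy example: 5 HMM states, stretch the utterance to 120 frames.
d = assign_state_durations([20, 15, 30, 10, 25], [4.0, 2.0, 9.0, 1.0, 6.0], 120)
print(d, d.sum())
```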

SLIDE 4

Transforming voice characteristics

  • We studied speaker adaptation techniques for ASR:
  • Maximum a posteriori (MAP) estimation
  • Maximum Likelihood Linear Regression (MLLR)

[Figure: MLLR adaptation. A general model is mapped to a transformed model via linear transforms tied to regression classes]

  • These techniques can also be applied to speech synthesis
  • MLLR: estimate a set of linear transforms that map an existing model into an adapted model s.t. the likelihood of the adaptation data is maximized
  • For limited adaptation data, MLLR is more effective than MAP

Image from Zen et al., “Statistical Parametric Speech Synthesis”, SPECOM 2009
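A minimal numpy sketch of the MLLR mean update: every Gaussian mean in a regression class passes through one shared affine transform μ' = Aμ + b. Estimating (A, b) by maximum likelihood is an EM-style computation omitted here; the transform values below are illustrative placeholders, not estimated ones:

```python
import numpy as np

def mllr_adapt_means(means, A, b):
    """Apply a shared MLLR mean transform: mu' = A @ mu + b."""
    return means @ A.T + b  # (num_gaussians, dim) -> (num_gaussians, dim)

rng = np.random.default_rng(0)
means = rng.normal(size=(8, 3))   # 8 Gaussian means in one regression class
A = np.eye(3) + 0.1 * rng.normal(size=(3, 3))  # placeholder transform
b = rng.normal(size=3)                          # placeholder bias
adapted = mllr_adapt_means(means, A, b)
```

Because one (A, b) pair is shared across a whole regression class, MLLR needs far fewer adaptation frames than re-estimating every Gaussian individually, which is why it beats MAP when adaptation data is limited.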

SLIDE 5

Transforming voice characteristics

  • What if no adaptation data is available?
  • HMM parameters can be interpolated
  • Synthesize speech with varying voice characteristics not encountered during training

[Figure: a new model λ′ is obtained by interpolating among trained models λ1, λ2, λ3, λ4 with weights I(λ′, λ1) … I(λ′, λ4)]

Image from Zen et al., "Statistical Parametric Speech Synthesis", SPECOM 2009
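A minimal sketch of the interpolation idea, assuming we simply take a convex combination of the Gaussian means of several trained speaker models (real systems also interpolate covariances and duration parameters; the four "speaker" models and the weights below are made up):

```python
import numpy as np

# Four trained speaker models lambda_1..lambda_4, each reduced here to a
# matrix of Gaussian means (8 Gaussians, 3-dim features) for illustration.
speaker_means = [np.full((8, 3), k, dtype=float) for k in range(4)]

# Interpolation weights I(lambda', lambda_k); they sum to 1, so the new
# voice lambda' lies "between" the training voices.
weights = np.array([0.4, 0.3, 0.2, 0.1])

interpolated = sum(w * m for w, m in zip(weights, speaker_means))  # lambda'
```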

SLIDE 6

GMM-based voice conversion

[Figure: JD-GMM voice conversion. Training: vocoder analysis of parallel source and target speech, DTW alignment, JD-GMM training. Conversion: vocoder analysis of source speech, GMM mixture decision and acoustic parameter conversion, vocoder synthesis of converted speech]

  • Parallel training data: align source and target speech frame-by-frame
  • Estimate a joint-distribution GMM (JD-GMM) to model the joint PDF of source and target features
  • At conversion time, predict the most likely converted acoustic features given a source acoustic feature sequence

Image from Ling et al., “Deep Learning for Acoustic Modeling in Parametric Speech Generation”, 2015
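A minimal sketch of the conversion step under a joint-density GMM: fit a GMM on stacked [source; target] vectors, then map a source frame x to the posterior-weighted conditional means E[y | x, m]. The data here is synthetic and the mixture count arbitrary:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
D = 2                                            # toy feature dimension
X = rng.normal(size=(1000, D))                   # DTW-aligned source frames (fake)
Y = 2.0 * X + 0.1 * rng.normal(size=(1000, D))   # aligned target frames (fake)

# Joint-density GMM over stacked [x; y] vectors.
gmm = GaussianMixture(n_components=4, covariance_type='full').fit(np.hstack([X, Y]))

def mvn_pdf(x, mean, cov):
    """Density of a multivariate Gaussian at x."""
    d = x - mean
    return np.exp(-0.5 * d @ np.linalg.solve(cov, d)) / \
        np.sqrt((2 * np.pi) ** len(x) * np.linalg.det(cov))

def convert(x):
    """MMSE conversion of one source frame x to a target-domain frame."""
    mu_x, mu_y = gmm.means_[:, :D], gmm.means_[:, D:]
    S_xx = gmm.covariances_[:, :D, :D]
    S_yx = gmm.covariances_[:, D:, :D]
    # Posterior of each mixture given x, via the marginal GMM over x.
    post = np.array([w * mvn_pdf(x, m, S)
                     for w, m, S in zip(gmm.weights_, mu_x, S_xx)])
    post /= post.sum()
    # Per-mixture conditional mean: mu_y + S_yx S_xx^{-1} (x - mu_x).
    cond = np.array([my + Syx @ np.linalg.solve(Sxx, x - mx)
                     for my, Syx, Sxx, mx in zip(mu_y, S_yx, S_xx, mu_x)])
    return post @ cond

print(convert(np.array([0.5, -0.3])))
```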

SLIDE 7

Neural approaches to speech generation

SLIDE 8

Recall: DNN-based speech synthesis

[Figure: DNN-based SPSS. Text analysis and input feature extraction produce per-frame input features x_t (binary and numeric) for frames 1..T; stacked hidden layers h_t map them to output features y_t, the statistics (mean and variance) of the speech parameter vector sequence; parameter generation and waveform synthesis follow]

Image from Zen et al., "Statistical Parametric Speech Synthesis using DNNs", 2014

  • Input features encode linguistic contexts and numeric values (# of words, duration of the phoneme, etc.)
  • Output features are spectral and excitation parameters and their delta values
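A minimal sketch of this frame-level mapping as a feedforward regression network trained with mean squared error; the layer sizes, feature dimensions, and data below are placeholders, not the configuration from Zen et al.:

```python
import torch
import torch.nn as nn

# Per-frame linguistic features in, acoustic parameters (spectral +
# excitation statics and deltas) out. Dimensions are illustrative.
IN_DIM, OUT_DIM = 342, 127

dnn = nn.Sequential(
    nn.Linear(IN_DIM, 1024), nn.Tanh(),
    nn.Linear(1024, 1024), nn.Tanh(),
    nn.Linear(1024, 1024), nn.Tanh(),
    nn.Linear(1024, OUT_DIM),      # linear output layer for regression
)
opt = torch.optim.Adam(dnn.parameters(), lr=1e-3)

x = torch.randn(64, IN_DIM)        # a batch of 64 frames (synthetic)
y = torch.randn(64, OUT_DIM)       # target acoustic features (synthetic)
loss = nn.functional.mse_loss(dnn(x), y)
opt.zero_grad()
loss.backward()
opt.step()
```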

SLIDE 9

Recall: RNN-based speech synthesis

[Figure: Text → Text Analysis → Input Feature Extraction → input features → biLSTM → output features → Vocoder → Waveform]

  • Access long-range context in both forward and backward directions using biLSTMs
  • Inference is expensive; these models inherently have large latency

Image from Fan et al., “TTS synthesis with BLSTM-based RNNs”, 2014
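A minimal sketch of such a biLSTM regression model (dimensions are illustrative). Note that the backward direction needs the whole input sequence before any output frame can be produced, which is the latency problem noted above:

```python
import torch
import torch.nn as nn

class BLSTMSynth(nn.Module):
    """Bidirectional LSTM mapping linguistic features to acoustic features."""
    def __init__(self, in_dim=342, hidden=256, out_dim=127):
        super().__init__()
        self.rnn = nn.LSTM(in_dim, hidden, num_layers=2,
                           bidirectional=True, batch_first=True)
        self.proj = nn.Linear(2 * hidden, out_dim)  # 2x: both directions

    def forward(self, x):            # x: (batch, frames, in_dim)
        h, _ = self.rnn(x)
        return self.proj(h)          # (batch, frames, out_dim)

out = BLSTMSynth()(torch.randn(2, 100, 342))  # 2 utterances, 100 frames each
```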

SLIDE 10

Frame-synchronous streaming speech synthesis

[Figure: streaming synthesis. Text analysis and linguistic feature extraction yield phoneme-level linguistic features x(1)..x(N); a duration LSTM-RNN (Ld) predicts phoneme durations d̂(i); frame-level linguistic features x(i) feed an acoustic LSTM-RNN (La) whose recurrent output layer emits acoustic features y(i) that drive the vocoder frame by frame to produce the waveform]

Image from Zen & Sak, "Unidirectional LSTM RNNs for low-latency speech synthesis", 2015
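A minimal sketch of why unidirectional recurrence enables streaming: with an LSTM cell that only consumes past context, each acoustic frame can be handed to the vocoder as soon as that frame's linguistic features arrive. All dimensions and data are made up:

```python
import torch
import torch.nn as nn

cell = nn.LSTMCell(input_size=342, hidden_size=256)  # acoustic LSTM-RNN
proj = nn.Linear(256, 127)                           # output layer

h = torch.zeros(1, 256)
c = torch.zeros(1, 256)
for t in range(100):                  # frames arrive one at a time
    x_t = torch.randn(1, 342)         # frame-level linguistic features
    h, c = cell(x_t, (h, c))
    y_t = proj(h)                     # acoustic features for frame t
    # y_t can go to the vocoder immediately -> low latency, unlike a biLSTM
```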

SLIDE 11

Deep generative models

[Figure: a deep generative model maps a code (Gaussian, Uniform, etc.) to real data (images, sounds, etc.)]

Image from https://blog.openai.com/generative-models/

Example: Autoregressive models (Wavenet)

SLIDE 12

Wavenet

  • Speech synthesis using an auto-regressive generative model
  • Generates the waveform sample by sample, at a 16kHz sampling rate

Gif from https://deepmind.com/blog/wavenet-generative-model-raw-audio/

SLIDE 13

Wavenet

  • Wavenet uses "dilated convolutions"
  • Main limitation: very slow generation rate [Oct 2017: Wavenet deployed in Google Assistant1]

Gif from https://deepmind.com/blog/wavenet-generative-model-raw-audio/

1https://techcrunch.com/2017/10/04/googles-wavenet-machine-learning-based-speech-synthesis-comes-to-assistant/
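A minimal sketch of a stack of dilated causal 1-D convolutions, the core of WaveNet: doubling the dilation each layer grows the receptive field exponentially while each output still depends only on past samples. Channel counts and depth are illustrative, and the real model's gated activations, residual/skip connections, and conditioning are omitted:

```python
import torch
import torch.nn as nn

class DilatedCausalStack(nn.Module):
    """Stack of causal Conv1d layers with exponentially growing dilation."""
    def __init__(self, channels=32, layers=8):
        super().__init__()
        self.dilations = [2 ** i for i in range(layers)]  # 1, 2, 4, ..., 128
        self.convs = nn.ModuleList(
            nn.Conv1d(channels, channels, kernel_size=2, dilation=d)
            for d in self.dilations)

    def forward(self, x):              # x: (batch, channels, samples)
        for conv, d in zip(self.convs, self.dilations):
            # Left-pad by the dilation so each output sees only the past.
            x = torch.relu(conv(nn.functional.pad(x, (d, 0))))
        return x

y = DilatedCausalStack()(torch.randn(1, 32, 16000))  # one second at 16 kHz
```

Generation is slow because sampling is sequential: each new sample requires a full forward pass conditioned on the samples already produced.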

SLIDE 14

Wavenet

  • Reduced the gap between the state of the art and human performance by > 50%

[Audio: Recording 1, Recording 2, Recording 3]

  • Which of the three recordings sounded most natural?
SLIDE 15

Deep generative models

[Figure: a deep generative model maps a code (Gaussian, Uniform, etc.) to true data (images, sounds, etc.)]

Image from https://blog.openai.com/generative-models/

Example: Generative Adversarial Networks (GANs)

SLIDE 16

GANs

  • The training process is formulated as a game between a generator network and a discriminator network
  • Objective of the generator: create samples that appear to come from the same distribution as the training data
  • Objective of the discriminator: examine samples and distinguish generated (fake) samples from real ones
  • The solution to this game is an equilibrium between the generator and the discriminator
  • Refer to [Goodfellow16] for a detailed tutorial on GANs

[Goodfellow16]: https://arxiv.org/pdf/1701.00160.pdf
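A minimal sketch of this game as a training loop on toy 1-D data; the network sizes, learning rates, and data distribution are all illustrative, not from the lecture:

```python
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 1))  # generator
D = nn.Sequential(nn.Linear(1, 64), nn.ReLU(), nn.Linear(64, 1))   # discriminator
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

for step in range(100):
    real = torch.randn(32, 1) * 0.5 + 2.0   # toy "training data"
    fake = G(torch.randn(32, 16))           # samples from noise codes
    # Discriminator step: push real -> 1, generated -> 0.
    loss_d = bce(D(real), torch.ones(32, 1)) + \
             bce(D(fake.detach()), torch.zeros(32, 1))
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()
    # Generator step: fool the discriminator into scoring fakes as real.
    loss_g = bce(D(fake), torch.ones(32, 1))
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
```

At the equilibrium the slide describes, the generator's samples match the data distribution and the discriminator can do no better than chance.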

SLIDE 17

GANs for speech synthesis

[Figure: GAN-based SPSS. Generator: linguistic features + noise → predicted samples, with an MSE term. Discriminator: a binary classifier fed predicted OR natural samples]

  • The generator produces synthesised speech, which the discriminator distinguishes from real speech
  • During synthesis, random noise + linguistic features generate speech

Image from Yang et al., “SPSS using GANs”, 2017

SLIDE 18

Course conclusion

SLIDE 19

Topics covered

[Figure: ASR pipeline. speech signal → Acoustic Feature Generator → O → SEARCH, using an Acoustic Model (phones), Pronunciation Model, and Language Model → word sequence W*]

  • Formalism: Finite State Transducers
  • Properties of speech sounds
  • Acoustic Signal Processing
  • Ngram/RNN LMs
  • G2P models
  • Search algorithms
  • Hidden Markov Models
  • Deep Neural Networks
  • Hybrid HMM-DNN Systems
  • Speaker Adaptation
  • Discriminative Training

SLIDE 20

Topics covered

[Figure: end-to-end models map acoustic signals O directly to word sequences W*]

  • End-to-end Models
  • Ngram/RNN LMs

Also, briefly covered:

  • Conversational Agents
  • Speech Synthesis

SLIDE 21

Exciting time to do speech research

SLIDE 22

Remaining coursework

SLIDE 23

Final Exam Syllabus

  • 1. WFST algorithms + WFSTs used in ASR
  • 2. EM algorithm
  • 3. HMMs + Tied state Triphone HMMs
  • 4. DNN/RNN-based acoustic models
  • 5. N-gram/RNN language models
  • 6. CTC end-to-end ASR
  • 7. Pronunciation models
  • 8. Search & decoding
  • 9. Discriminative training for HMMs
  • 10. Basics of speaker adaptation
  • 11. HMM-based speech synthesis models

In the final exam, questions can be asked on any of the 11 topics listed above. Weightage of topics will be shared later on Moodle.

SLIDE 24

Final Project

Deliverables

  • 5-8 page final report:

✓ Task definition, Methodology, Prior work, Implementation Details, Experimental Setup, Experiments and Discussion, Error Analysis (if any), Summary

  • Short talk summarizing the project:

✓ Each team will get 10 mins for their presentation and 10 minutes for Q/A

✓ Clearly demarcate which team member worked on what part

SLIDE 25

Final Project Schedule

  • Presentations will tentatively be held on Nov 27th and Nov 28th
  • The final report in pdf format should be sent to pjyothi@cse.iitb.ac.in before Nov 27th
  • The order of presentations will be decided on a lottery basis and shared via Moodle by Nov 18th

SLIDE 26

Final Project Grading

  • Break-up of 20 points:
  • 6 points for the report
  • 4 points for the presentation
  • 6 points for Q/A
  • 4 points for overall evaluation of the project