

SLIDE 1

Automatic Speech Recognition (CS753)

Lecture 24: Statistical Parametric Speech Synthesis

Instructor: Preethi Jyothi, Nov 2, 2017

Images on the first 11 slides are from Zen et al., “Statistical Parametric Speech Synthesis”, SPECOM 2009

SLIDE 2

Parametric Speech Synthesis Framework

  • Training
  • Estimate the acoustic model λ given speech utterances (O) and word sequences (W):

λ̂ = argmax_λ p(O | W, λ)

[Figure: pipeline. Speech Analysis and Text Analysis extract O and W from the training speech and text; Train Model produces λ̂; at synthesis time, Text Analysis, Parameter Generation (ô) and Speech Synthesis produce the output speech.]

SLIDE 3

Parametric Speech Synthesis Framework

  • Training
  • Estimate the acoustic model λ given speech utterances (O) and word sequences (W):

λ̂ = argmax_λ p(O | W, λ)

  • Synthesis
  • Find the most probable output ô from λ̂ and a given word sequence w to be synthesised:

ô = argmax_o p(o | w, λ̂)

  • Synthesize speech from ô

HMMs serve as the acoustic model. [Figure: same pipeline as slide 2.]

SLIDE 4

HMM-based speech synthesis

[Figure: system diagram. Training part: speech database → spectral and excitation parameter extraction → training of context-dependent HMMs & duration models, using labels. Synthesis part: text → text analysis → labels → parameter generation from HMM → spectral and excitation parameters → excitation generation → synthesis filter → synthesized speech.]

SLIDE 5

Speech parameter generation

Generate the most probable observation vectors given the HMM and w:

ô = argmax_o p(o | w, λ̂)
  = argmax_o Σ_{∀q} p(o, q | w, λ̂)
  ≈ argmax_o max_q p(o, q | w, λ̂)
  = argmax_o max_q p(o | q, λ̂) P(q | w, λ̂)

Determine the best state sequence and outputs sequentially (let’s explore this first):

q̂ = argmax_q p(q | w, λ̂)
ô = argmax_o p(o | q̂, λ̂)

SLIDE 6

Determining state outputs

ô = argmax_o p(o | q̂, λ̂) = argmax_o N(o; μ_q̂, Σ_q̂)

where o = [o_1⊤, …, o_T⊤]⊤ is a state-output vector sequence to be generated, q = {q_1, …, q_T} is a state sequence, μ_q = [μ_{q_1}⊤, …, μ_{q_T}⊤]⊤ is the mean vector for q, and Σ_q = diag[Σ_{q_1}, …, Σ_{q_T}] is the covariance matrix for q.

What would ô look like? [Figure: ô follows the sequence of state means, with the state variances around them.]

SLIDE 7

Adding dynamic features to state outputs

State output vectors contain both static (c_t) and dynamic (Δc_t) features:

o_t = [c_t⊤, Δc_t⊤]⊤

where the dynamic feature is calculated as Δc_t = c_t − c_{t−1}.

The relationship between o and c can be arranged in matrix form as o = Wc:

[…, c_{t−1}⊤, Δc_{t−1}⊤, c_t⊤, Δc_t⊤, c_{t+1}⊤, Δc_{t+1}⊤, …]⊤ = W […, c_{t−2}⊤, c_{t−1}⊤, c_t⊤, c_{t+1}⊤, …]⊤   (17)

where each static row block of W is [⋯ I ⋯] and each dynamic row block is [⋯ −I I ⋯].
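The band structure of W is easy to build programmatically. A minimal NumPy sketch (my own illustration, not from the slides), assuming the boundary condition Δc_1 = c_1, i.e. c_0 is taken as zero:

```python
import numpy as np

def build_window_matrix(T, dim):
    """Map static features c = [c_1, ..., c_T] to stacked static+delta
    observations o = [c_1, Dc_1, ..., c_T, Dc_T] via o = W c.
    Uses Dc_t = c_t - c_{t-1}, with c_0 = 0 (a boundary assumption)."""
    I = np.eye(dim)
    W = np.zeros((2 * T * dim, T * dim))
    for t in range(T):
        rs, rd = 2 * t * dim, (2 * t + 1) * dim   # row offsets: static, delta
        W[rs:rs + dim, t * dim:(t + 1) * dim] = I       # static block: +I
        W[rd:rd + dim, t * dim:(t + 1) * dim] = I       # delta block: +I
        if t > 0:
            W[rd:rd + dim, (t - 1) * dim:t * dim] = -I  # delta block: -I
    return W

# o interleaves each c_t with its delta Dc_t = c_t - c_{t-1}
W = build_window_matrix(3, 1)
o = W @ np.array([1.0, 3.0, 6.0])
```

Here o comes out as [1, 1, 3, 2, 6, 3]: each static value followed by its delta.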

SLIDE 8

Speech parameter generation

  • Introducing dynamic feature constraints:

ô = argmax_o p(o | q̂, λ̂), where o = Wc

  • If the output distributions are single Gaussians:

p(o | q̂, λ̂) = N(o; μ_q̂, Σ_q̂)

  • Then, by setting ∂ log N(o; μ_q̂, Σ_q̂)/∂c = 0, we get:

W⊤ Σ_q̂⁻¹ W c = W⊤ Σ_q̂⁻¹ μ_q̂
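Given W, this set of linear equations can be solved directly for the static trajectory c. A hedged NumPy sketch; the tiny W, means and variances below are made-up illustrative values, not from any real model:

```python
import numpy as np

def ml_trajectory(W, mu, var):
    """Solve W^T S^{-1} W c = W^T S^{-1} mu for c, with S = diag(var)."""
    Sinv = np.diag(1.0 / var)
    A = W.T @ Sinv @ W
    b = W.T @ Sinv @ mu
    return np.linalg.solve(A, b)

# Toy example, T = 2 frames of 1-dim features.
# Rows of W correspond to [static c1, delta c1, static c2, delta c2],
# with delta c1 = c1 (c0 taken as zero).
W = np.array([[1.0, 0.0],
              [1.0, 0.0],
              [0.0, 1.0],
              [-1.0, 1.0]])
mu = np.array([2.0, 0.0, 5.0, 0.0])     # per-row state means
var = np.array([0.01, 1e6, 0.01, 1e6])  # huge delta variances: deltas barely constrain c
c = ml_trajectory(W, mu, var)           # ~ [2, 5]: c follows the static means
```

With tight delta variances instead, the delta means pull neighbouring frames together, which is exactly how dynamic features smooth the generated trajectory.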

SLIDE 9

Synthesis overview

[Figure: from the sentence HMM (clustered and merged states, with static and delta Gaussians per state) to the ML trajectory of generated parameters.]

SLIDE 10

Speech parameter generation

Generate the most probable observation vectors given the HMM and w:

ô = argmax_o p(o | w, λ̂)
  = argmax_o Σ_{∀q} p(o, q | w, λ̂)
  ≈ argmax_o max_q p(o, q | w, λ̂)
  = argmax_o max_q p(o | q, λ̂) P(q | w, λ̂)

Determine the best state sequence and outputs sequentially (let’s explore this next):

q̂ = argmax_q p(q | w, λ̂)
ô = argmax_o p(o | q̂, λ̂)

SLIDE 11

Duration modeling

  • How are durations modelled within an HMM?
  • Implicitly modelled by state self-transition probabilities
  • PMFs of state durations are geometric distributions:

p_k(d) = a_kk^(d−1) · (1 − a_kk)

[Figure: state duration probability p_k(d) against duration d (frames), for a_kk = 0.6.]

  • State durations are determined by maximising:

log P(d | λ) = Σ_{j=1}^N log p_j(d_j)

  • What would this solution look like if the PMFs of state durations are geometric distributions?
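The question above can be checked numerically with a small pure-Python sketch (my own illustration): for any a_kk < 1 the geometric PMF is strictly decreasing in d, so maximising log P(d | λ) without further constraints always picks the shortest duration d = 1 for every state, which motivates the explicit duration models that follow.

```python
def geometric_duration_pmf(d, a_kk):
    """p_k(d) = a_kk**(d-1) * (1 - a_kk): probability of staying in
    state k for exactly d frames, given self-transition prob a_kk."""
    return a_kk ** (d - 1) * (1 - a_kk)

# PMF values for d = 1..10 with a_kk = 0.6, as in the slide's plot
probs = [geometric_duration_pmf(d, 0.6) for d in range(1, 11)]

# each value is 0.6x the previous one, so the mode is always d = 1
best_d = 1 + probs.index(max(probs))
```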

SLIDE 12

Explicit modeling of state durations

  • Each state duration is explicitly modelled as a single Gaussian.

The mean ξ(i) and variance σ²(i) of the duration density of state i:

ξ(i) = [ Σ_{t0=1}^T Σ_{t1=t0}^T χ_{t0,t1}(i) (t1 − t0 + 1) ] / [ Σ_{t0=1}^T Σ_{t1=t0}^T χ_{t0,t1}(i) ]

σ²(i) = [ Σ_{t0=1}^T Σ_{t1=t0}^T χ_{t0,t1}(i) (t1 − t0 + 1)² ] / [ Σ_{t0=1}^T Σ_{t1=t0}^T χ_{t0,t1}(i) ] − ξ²(i)

where

χ_{t0,t1}(i) = (1 − γ_{t0−1}(i)) · Π_{t=t0}^{t1} γ_t(i) · (1 − γ_{t1+1}(i))

and γ_t(i) is the probability of being in state i at time t.

SLIDE 13

Determining state durations

During synthesis, for a given speech length T, the goal is to maximize:

log P(d | λ, T) = Σ_{k=1}^K log p_k(d_k)    … (1)

under the constraint that T = Σ_{k=1}^K d_k.

We saw that each duration density can be modelled as a single Gaussian:

p_k(d_k) = N(·; ξ_k, σ_k²)

The state durations d_k, k = 1 … K, which maximise (1) are given by:

d_k = ξ(k) + ρ · σ²(k),    ρ = (T − Σ_{k=1}^K ξ(k)) / Σ_{k=1}^K σ²(k)

where ξ(k) and σ²(k) are the mean and variance of the duration density of state k, and ρ controls the speaking rate.
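The closed-form solution above is straightforward to implement. A minimal sketch in pure Python; the state means, variances and total length are made-up numbers for illustration:

```python
def assign_durations(T, xi, var):
    """Distribute T frames over K states with Gaussian duration densities
    N(xi_k, var_k): d_k = xi_k + rho * var_k, with
    rho = (T - sum(xi)) / sum(var).
    Durations come out real-valued; rounding to whole frames is left aside."""
    rho = (T - sum(xi)) / sum(var)
    return [x + rho * v for x, v in zip(xi, var)]

# Three states expecting 10, 20 and 30 frames; stretch to T = 70.
# The extra 10 frames go mostly to the states with the largest
# duration variance, and the result sums exactly to T.
d = assign_durations(70, xi=[10, 20, 30], var=[4, 1, 5])
# d == [14.0, 21.0, 35.0]
```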

SLIDE 14

Synthesis using duration models

[Figure: text → context-dependent duration models and context-dependent HMMs → sentence HMM with state duration densities d_1, d_2, … (given T or ρ) → mel-cepstrum parameters c_1, …, c_T → MLSA filter driven by pitch → synthetic speech.]

Image from Yoshimura et al., “Duration modelling for HMM-based speech synthesis”, ICSLP ’98

SLIDE 15

DNN-based speech synthesis

[Figure: at each frame t = 1, …, T, input features x_t (binary & numeric features from text analysis and input feature extraction) pass through the hidden layers h_t of a DNN to outputs y_t, the statistics (mean & variance) of the speech parameter vector sequence; parameter generation and waveform synthesis then produce speech.]

Image from Zen et al., “Statistical Parametric Speech Synthesis using DNNs”, 2014

SLIDE 16

DNN-based speech synthesis

[Figure: same DNN architecture as slide 15.]

  • Input features include linguistic contexts and numeric values (# of words, duration of the phoneme, etc.)
  • Output features are spectral and excitation parameters and their delta values
  • Listening test results:

Table 1. Preference scores (%) between speech samples from the HMM and DNN-based systems. The systems which achieved significantly better preference at the p < 0.01 level are in bold font.

HMM (α)   | DNN (#layers × #units) | Neutral | p value | z value
15.8 (16) | 38.5 (4 × 256)         | 45.7    | < 10⁻⁶  | 9.9
16.1 (4)  | 27.2 (4 × 512)         | 56.8    | < 10⁻⁶  | 5.1
12.7 (1)  | 36.6 (4 × 1024)        | 50.7    | < 10⁻⁶  | 11.5

Image from Zen et al., “Statistical Parametric Speech Synthesis using DNNs”, 2014
SLIDE 17

RNN-based speech synthesis

[Figure: text → text analysis → input feature extraction → input features → biLSTM layers → output features → vocoder → waveform.]

  • Access long-range context in both forward and backward directions using biLSTMs
  • Inference is expensive; biLSTMs inherently have large latency

Image from Fan et al., “TTS synthesis with BLSTM-based RNNs”, 2014
SLIDE 18

Streaming synthesis using LSTMs

[Figure: text → text analysis → linguistic feature extraction → phoneme-level linguistic features x(i) → duration LSTM-RNN Λ_d → phoneme durations d̂(i) → frame-level linguistic features → acoustic LSTM-RNN Λ_a → acoustic features ŷ(i) → vocoder → waveform.]

  • Duration prediction, acoustic feature prediction and vocoding are done in streaming fashion:

1:  Perform text analysis over input text
2:  Extract {x(i)} for i = 1, …, N
3:  for i = 1, …, N do            ▷ Loop over phonemes
4:    Predict d̂(i) given x(i) by Λ_d
5:    for τ = 1, …, d̂(i) do       ▷ Loop over frames
6:      Compose x_τ(i) from x(i), τ, and d̂(i)
7:      Predict ŷ_τ(i) given x_τ(i) by Λ_a
8:      Synthesize waveform given ŷ_τ(i), then stream result
9:    end for
10: end for

Image from Zen & Sak, “Unidirectional LSTM RNNs for low-latency speech synthesis”, 2015
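The streaming loop above can be sketched as a generator that emits audio as soon as each frame is ready. The three models below are hypothetical stubs: predict_duration, predict_acoustics and vocode stand in for Λ_d, Λ_a and the vocoder, and are not the paper's networks:

```python
def predict_duration(x):
    """Stub for the duration model: a fixed 3 frames per phoneme."""
    return 3

def predict_acoustics(x, tau, d):
    """Stub for the acoustic model: dummy features for frame tau."""
    return (x, tau, d)

def vocode(y):
    """Stub vocoder: one 'audio chunk' per acoustic frame."""
    phoneme, tau, _ = y
    return f"chunk({phoneme},{tau})"

def stream_synthesize(phoneme_features):
    """Steps 3-10 of the algorithm: yield audio chunks one at a time,
    rather than waiting for the whole utterance."""
    for x in phoneme_features:               # loop over phonemes
        d = predict_duration(x)              # predict duration for x(i)
        for tau in range(1, d + 1):          # loop over frames
            y = predict_acoustics(x, tau, d) # compose features, predict
            yield vocode(y)                  # synthesize and stream

chunks = list(stream_synthesize(["p1", "p2"]))
```

Because stream_synthesize is a generator, a consumer can start playback after the first chunk; collecting into a list here is only for inspection.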

SLIDE 19

Streaming synthesis using LSTMs

[Figure: same streaming architecture as slide 18.]

  • Comparing DNN versus LSTM-RNN systems:

Model    | # of params | 5-scale MOS
DNN      | 3 749 79    | 3.370 ± 0.114
LSTM-RNN | 476 435     | 3.723 ± 0.105

Image from Zen & Sak, “Unidirectional LSTM RNNs for low-latency speech synthesis”, 2015