Speech Synthesis (Lecture 19, CS 753). Instructor: Preethi Jyothi



SLIDE 1

Speech Synthesis

Lecture 19, CS 753

Instructor: Preethi Jyothi

SLIDE 2

Project Preliminary Report

  • The preliminary project report will contribute 5% towards your final grade. The deadline is 27th October, 2019.

  • Define the following for your project: 1) the input-output behaviour of your system, 2) the evaluation metric, and 3) at least two existing (or related) approaches to your problem.

  • Propose a model and an algorithm for the problem you're tackling and give detailed descriptions of both. Do not provide generic descriptions of the model; describe precisely how it applies to your problem.

  • Describe how much of your algorithm has been implemented. If you are using existing APIs/libraries, clearly demarcate which parts you will implement yourself and for which parts you will rely on existing implementations.

  • Describe the experiments you are planning to run. If you have already run any preliminary experiments, describe them and report your initial results.

(Each of the four items above is worth 5 points.)

SLIDE 3

Text-To-Speech (TTS) Systems


Storied History

  • Von Kempelen's speaking machine (1791)
    • Bellows simulated the lungs
    • Rubber mouth and nose; the nostrils had to be covered with two fingers for non-nasals
  • Homer Dudley's VODER (1939)
    • First device to synthesize speech sounds by electrical means
  • Gunnar Fant's OVE formant synthesizer (1960s)
    • Formant synthesizer for vowels
  • Computer-aided speech synthesis (1970s onwards)
    • Concatenative (unit selection)
    • Parametric (HMM-based and NN-based)


All images from http://www2.ling.su.se/staff/hartmut/kemplne.htm

SLIDE 4

Speech synthesis or TTS systems

  • Goal of a TTS system: produce a natural-sounding, high-quality speech waveform for a given word sequence

  • TTS systems are typically divided into two parts:
  • A. Linguistic specification
  • B. Waveform generation
SLIDE 5

Current TTS systems

  • Constructed using a large amount of speech data
  • Referred to as corpus-based TTS systems
  • Two prominent instances of corpus-based TTS:
  • 1. Unit selection and concatenation
  • 2. Statistical parametric speech synthesis
SLIDE 6

Unit Selection Synthesis

SLIDE 7

Unit selection synthesis, or concatenative speech synthesis

  • Synthesize new sentences by selecting sub-word units from a database of speech
  • What is the optimal size of the units? Diphones? Half-phones?

[Figure: all candidate segments, with target costs and concatenation costs between them]

Image from Zen et al., "Statistical Parametric Speech Synthesis", Speech Communication, 2009

SLIDE 8
Unit selection synthesis

  • Target cost between a candidate unit u_i and a target unit t_i:

    C^{(t)}(t_i, u_i) = \sum_{j=1}^{p} w_j^{(t)} C_j^{(t)}(t_i, u_i)

  • Concatenation cost between consecutive candidate units:

    C^{(c)}(u_{i-1}, u_i) = \sum_{k=1}^{q} w_k^{(c)} C_k^{(c)}(u_{i-1}, u_i)

  • The overall cost of a string of units is:

    C(t_{1:n}, u_{1:n}) = \sum_{i=1}^{n} C^{(t)}(t_i, u_i) + \sum_{i=2}^{n} C^{(c)}(u_{i-1}, u_i)

  • Find the string of units that minimises the overall cost:

    \hat{u}_{1:n} = \arg\min_{u_{1:n}} C(t_{1:n}, u_{1:n})
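This minimisation has the same structure as Viterbi decoding, so it can be solved exactly by dynamic programming over the candidate units at each position. A minimal sketch (the function name and the toy costs below are hypothetical, not from the slides):

```python
import numpy as np

def unit_selection(target_costs, concat_costs):
    """Minimise C(t_{1:n}, u_{1:n}) by dynamic programming (Viterbi).

    target_costs[i][u]     : target cost C^(t)(t_i, u) of candidate u at position i
    concat_costs[i][up][u] : concatenation cost C^(c) from candidate up at
                             position i to candidate u at position i+1
    Returns (best total cost, chosen candidate indices).
    """
    n = len(target_costs)
    best = np.asarray(target_costs[0], dtype=float)  # best cost ending at each u_1
    back = []
    for i in range(1, n):
        cc = np.asarray(concat_costs[i - 1], dtype=float)
        totals = best[:, None] + cc                  # arrive from each predecessor
        back.append(totals.argmin(axis=0))           # best predecessor per u_i
        best = totals.min(axis=0) + np.asarray(target_costs[i], dtype=float)
    # trace back the optimal unit string
    path = [int(best.argmin())]
    for prev in reversed(back):
        path.append(int(prev[path[-1]]))
    return float(best.min()), path[::-1]

# Toy example: 2 target positions, 2 candidate units each.
cost, path = unit_selection(
    target_costs=[[1.0, 5.0], [1.0, 0.0]],
    concat_costs=[[[0.0, 10.0], [0.0, 0.0]]],
)
```

In practice the candidate lists are pruned first, since real databases contain thousands of candidates per target unit.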

SLIDE 9

Unit selection synthesis

  • The target cost is pre-calculated using a clustering method

[Figure: clustered segments, with target and concatenation costs]

SLIDE 10

Statistical Parametric Speech Synthesis

SLIDE 11

Parametric Speech Synthesis Framework

  • Training
    • Estimate the acoustic model λ given speech utterances (O) and word sequences (W)*:

    \hat{\lambda} = \arg\max_{\lambda} p(O \mid W, \lambda)

[Figure: pipeline. Training: speech analysis and text analysis feed model training. Synthesis: text analysis, parameter generation, and speech synthesis produce the waveform.]

* Here W could refer to any textual features relevant to the input text

SLIDE 12

Parametric Speech Synthesis Framework

  • Training
    • Estimate the acoustic model λ given speech utterances (O) and word sequences (W):

    \hat{\lambda} = \arg\max_{\lambda} p(O \mid W, \lambda)

  • Synthesis
    • Find the most probable ô from \hat{\lambda} and a given word sequence w to be synthesised:

    \hat{o} = \arg\max_{o} p(o \mid w, \hat{\lambda})

    • Synthesize speech from ô

The acoustic models here are HMMs!

[Figure: the same training/synthesis pipeline as before]

SLIDE 13

HMM-based speech synthesis

[Figure: HMM-based speech synthesis system. Training part: spectral and excitation parameters are extracted from the SPEECH DATABASE and used, together with labels, to train context-dependent HMMs and duration models. Synthesis part: text analysis converts TEXT into labels, parameter generation from the HMMs produces spectral and excitation parameters, and excitation generation followed by a synthesis filter yields the SYNTHESIZED SPEECH.]

SLIDE 14

Speech parameter generation

Generate the most probable observation vectors given the HMM and w:

\hat{o} = \arg\max_{o} p(o \mid w, \hat{\lambda})
        = \arg\max_{o} \sum_{\forall q} p(o, q \mid w, \hat{\lambda})
        \approx \arg\max_{o} \max_{q} p(o, q \mid w, \hat{\lambda})
        = \arg\max_{o} \max_{q} p(o \mid q, \hat{\lambda}) \, p(q \mid w, \hat{\lambda})

Determine the best state sequence and the state outputs sequentially:

\hat{q} = \arg\max_{q} p(q \mid w, \hat{\lambda}), \qquad \hat{o} = \arg\max_{o} p(o \mid \hat{q}, \hat{\lambda})

Let's explore determining the state outputs first.
SLIDE 15

Determining state outputs

\hat{o} = \arg\max_{o} p(o \mid \hat{q}, \hat{\lambda}) = \arg\max_{o} \mathcal{N}(o; \mu_{\hat{q}}, \Sigma_{\hat{q}})

where o = [o_1^\top, \ldots, o_T^\top]^\top is the state-output vector sequence to be generated, q = \{q_1, \ldots, q_T\} is a state sequence, \mu_q = [\mu_{q_1}^\top, \ldots, \mu_{q_T}^\top]^\top is the mean vector for q, and \Sigma_q = \mathrm{diag}[\Sigma_{q_1}, \ldots, \Sigma_{q_T}] is the covariance matrix for q.

What would \hat{o} look like?

[Figure: the generated trajectory follows the state means, with the state variances shown as bands around them]
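Since the argmax of a Gaussian density is attained at its mean, without any further constraints the generated trajectory simply repeats each state's mean: a piecewise-constant, step-wise output. A tiny illustration with made-up 1-D state means:

```python
import numpy as np

# Hypothetical 1-D static-feature means for three HMM states.
state_means = {0: -1.0, 1: 2.0, 2: 0.5}
q_hat = [0, 0, 1, 1, 1, 2]   # a decoded state for each frame

# argmax_o N(o; mu, Sigma) is o = mu, so the generated trajectory
# just repeats each state's mean: piecewise constant over time.
o_hat = np.array([state_means[q] for q in q_hat])
```

This discontinuous trajectory is exactly the problem that the dynamic-feature constraints on the next slides fix.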

SLIDE 16

Adding dynamic features to state outputs

State output vectors contain both static (c_t) and dynamic (Δc_t) features:

o_t = [c_t^\top, \Delta c_t^\top]^\top, where the dynamic feature is calculated as \Delta c_t = c_t - c_{t-1}

The relationship between o and c can be arranged in matrix form as o = Wc:

[\ldots, c_{t-1}^\top, \Delta c_{t-1}^\top, c_t^\top, \Delta c_t^\top, c_{t+1}^\top, \Delta c_{t+1}^\top, \ldots]^\top = W \, [\ldots, c_{t-2}^\top, c_{t-1}^\top, c_t^\top, c_{t+1}^\top, \ldots]^\top

where each static row of W contains a single identity block I (selecting c_t), and each delta row contains the blocks -I and I (computing c_t - c_{t-1}).
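The structure of W can be sketched in the 1-D case, where each I block is a single 1. This is a simplified sketch: the boundary Δc_1 is handled by assuming c_0 = 0, which is one of several possible conventions:

```python
import numpy as np

def delta_window_matrix(T):
    """1-D sketch of W: row 2t selects c_t (static), row 2t+1 computes
    the dynamic feature c_t - c_{t-1} (c_0 is taken to be 0 here,
    an assumed boundary condition)."""
    W = np.zeros((2 * T, T))
    for t in range(T):
        W[2 * t, t] = 1.0               # static row: c_t
        W[2 * t + 1, t] = 1.0           # delta row: +c_t ...
        if t > 0:
            W[2 * t + 1, t - 1] = -1.0  # ... minus c_{t-1}
    return W

c = np.array([0.0, 1.0, 3.0, 2.0])
o = delta_window_matrix(4) @ c
# o interleaves statics and deltas: [c_1, dc_1, c_2, dc_2, ...]
```

For vector-valued c_t the 1s and -1s become I and -I blocks, exactly as in the slide's matrix.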

SLIDE 17

Speech parameter generation

  • Introducing the dynamic feature constraints:

    \hat{o} = \arg\max_{o} p(o \mid \hat{q}, \hat{\lambda}) \quad \text{subject to} \quad o = Wc

  • If the output distributions are single Gaussians:

    p(o \mid \hat{q}, \hat{\lambda}) = \mathcal{N}(o; \mu_{\hat{q}}, \Sigma_{\hat{q}})

  • Then, by setting \partial \log \mathcal{N}(o; \mu_{\hat{q}}, \Sigma_{\hat{q}}) / \partial c = 0, we get:

    W^\top \Sigma_{\hat{q}}^{-1} W c = W^\top \Sigma_{\hat{q}}^{-1} \mu_{\hat{q}}
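The linear system above (often called maximum-likelihood parameter generation, MLPG) can be solved directly once W and the per-frame Gaussian statistics are known. A toy 1-D sketch with hypothetical means and a diagonal covariance:

```python
import numpy as np

def delta_matrix(T):
    """1-D version of W: static rows copy c_t, delta rows give c_t - c_{t-1}
    (assuming c_0 = 0 at the boundary)."""
    W = np.zeros((2 * T, T))
    for t in range(T):
        W[2 * t, t] = 1.0
        W[2 * t + 1, t] = 1.0
        if t > 0:
            W[2 * t + 1, t - 1] = -1.0
    return W

T = 5
W = delta_matrix(T)
# Hypothetical per-frame means for [c_t, delta c_t], interleaved,
# and a shared diagonal variance of 0.5 (so Sigma^{-1} has 2.0 entries).
mu = np.array([0.0, 0.0, 2.0, 2.0, 2.0, 0.0, 2.0, 0.0, 1.0, -1.0])
prec = np.diag(np.full(2 * T, 2.0))        # Sigma^{-1}

# Normal equations: W^T Sigma^{-1} W c = W^T Sigma^{-1} mu
A = W.T @ prec @ W
b = W.T @ prec @ mu
c_hat = np.linalg.solve(A, b)              # ML static trajectory
```

Because the delta rows couple adjacent frames, c_hat is a smoothed version of the static means rather than a step function. In real systems A is banded, so the solve is done in O(T) with a band solver rather than a dense one.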

SLIDE 18

Synthesis overview

[Figure: clustered states are combined into a sentence HMM with merged states; each state carries Gaussians over static and delta features, from which the ML trajectory is generated]

SLIDE 19

Speech parameter generation

Generate the most probable observation vectors given the HMM and w:

\hat{o} = \arg\max_{o} p(o \mid w, \hat{\lambda})
        = \arg\max_{o} \sum_{\forall q} p(o, q \mid w, \hat{\lambda})
        \approx \arg\max_{o} \max_{q} p(o, q \mid w, \hat{\lambda})
        = \arg\max_{o} \max_{q} p(o \mid q, \hat{\lambda}) \, p(q \mid w, \hat{\lambda})

Determine the best state sequence and the state outputs sequentially:

\hat{q} = \arg\max_{q} p(q \mid w, \hat{\lambda}), \qquad \hat{o} = \arg\max_{o} p(o \mid \hat{q}, \hat{\lambda})

Let's explore determining the state sequence next.

SLIDE 20

Duration modeling

  • How are durations modelled within an HMM?
    • Implicitly, by the state self-transition probabilities
    • The PMFs of state durations are geometric distributions:

    p_k(d) = a_{kk}^{d-1} (1 - a_{kk})

[Figure: state duration probability p_k(d) against duration d (in frames), for a_{kk} = 0.6]

  • State durations are determined by maximising:

    \log P(d \mid \lambda) = \sum_{j=1}^{N} \log p_j(d_j)

  • What would this solution look like if the PMFs of state durations are geometric distributions?
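The geometric PMF is easy to check numerically. Note that it always peaks at d = 1 and then decays, which is a poor fit for real phone durations and is one motivation for the explicit duration models that follow. A quick sketch using the slide's a_{kk} = 0.6:

```python
# Duration PMF implied by an HMM self-transition: staying in state k
# for d frames means taking the self-loop d-1 times, then leaving once.
a_kk = 0.6

def p_k(d, a=a_kk):
    """p_k(d) = a^(d-1) * (1 - a)."""
    return a ** (d - 1) * (1 - a)

probs = [p_k(d) for d in range(1, 200)]
assert abs(sum(probs) - 1.0) < 1e-9   # a proper PMF (tail is negligible)
assert p_k(1) == max(probs)           # mode is always at d = 1
# The mean duration is 1 / (1 - a_kk) = 2.5 frames here.
```

Maximising log P(d | λ) under this PMF just picks d_j = 1 for every state, which is clearly wrong for speech.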

SLIDE 21

Explicit modeling of state durations

  • Each state duration is explicitly modelled as a single Gaussian. The mean \xi(i) and variance \sigma^2(i) of the duration density of state i are:

    \xi(i) = \frac{\sum_{t_0=1}^{T} \sum_{t_1=t_0}^{T} \chi_{t_0,t_1}(i)\,(t_1 - t_0 + 1)}{\sum_{t_0=1}^{T} \sum_{t_1=t_0}^{T} \chi_{t_0,t_1}(i)}

    \sigma^2(i) = \frac{\sum_{t_0=1}^{T} \sum_{t_1=t_0}^{T} \chi_{t_0,t_1}(i)\,(t_1 - t_0 + 1)^2}{\sum_{t_0=1}^{T} \sum_{t_1=t_0}^{T} \chi_{t_0,t_1}(i)} - \xi^2(i)

    where

    \chi_{t_0,t_1}(i) = (1 - \gamma_{t_0-1}(i)) \cdot \prod_{t=t_0}^{t_1} \gamma_t(i) \cdot (1 - \gamma_{t_1+1}(i))

    and \gamma_t(i) is the probability of being in state i at time t.

SLIDE 22

Determining state durations

During synthesis, for a given speech length T, the goal is to maximise:

\log P(d \mid \lambda, T) = \sum_{k=1}^{K} \log p_k(d_k) \quad \ldots (1)

under the constraint that T = \sum_{k=1}^{K} d_k.

We saw that each duration density can be modelled as a single Gaussian:

p_k(d_k) = \mathcal{N}(d_k; \xi(k), \sigma^2(k))

The state durations d_k, k = 1 \ldots K, which maximise (1) are given by:

d_k = \xi(k) + \rho \cdot \sigma^2(k), \qquad \rho = \frac{T - \sum_{k=1}^{K} \xi(k)}{\sum_{k=1}^{K} \sigma^2(k)}

where \xi(k) and \sigma^2(k) are the mean and variance of the duration density of state k. The parameter \rho (equivalently, the total length T) controls the speaking rate.
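The closed-form durations can be computed in a few lines. The state means and variances below are made up for illustration:

```python
import numpy as np

def assign_durations(xi, var, T):
    """d_k = xi(k) + rho * sigma^2(k), with rho chosen so that
    the durations sum to the requested total length T."""
    xi, var = np.asarray(xi, float), np.asarray(var, float)
    rho = (T - xi.sum()) / var.sum()    # global speaking-rate factor
    return xi + rho * var

# Toy numbers: three states with mean durations 10, 20, 10 frames.
d = assign_durations(xi=[10.0, 20.0, 10.0], var=[4.0, 9.0, 4.0], T=57)
# rho = (57 - 40) / 17 = 1.0, so each state stretches by its variance.
```

Note how the extra frames are distributed in proportion to each state's duration variance: states whose lengths are more uncertain absorb more of the rate change. (In practice durations are rounded to whole frames.)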

SLIDE 23

Synthesis using duration models

[Figure: synthesis using context-dependent duration models and context-dependent HMMs. From TEXT, a sentence HMM and its state-duration densities determine the state durations d given T or ρ; mel-cepstrum parameters c and pitch then drive an MLSA filter to produce the SYNTHETIC SPEECH.]

Image from Yoshimura et al., “Duration modelling for HMM-based speech synthesis”, ICSLP ‘98

SLIDE 24

Recap: HMM-based speech synthesis

[Figure: the same pipeline as before, now with explicit state duration models. Training part: spectral and excitation parameters are extracted from the SPEECH DATABASE and used, with labels, to train context-dependent HMMs and state duration models. Synthesis part: text analysis yields labels, parameter generation from the HMMs produces spectral and excitation parameters, and excitation generation plus a synthesis filter output the SYNTHESIZED SPEECH.]

SLIDE 25

DNN-based speech synthesis

[Figure: a feed-forward DNN. Text analysis and input feature extraction produce frame-level input features (binary and numeric) x_1, ..., x_T; these pass through several hidden layers to output features y: the statistics (mean and variance) of the speech parameter vector sequence, which feed parameter generation and waveform synthesis.]

Image from Zen et al., “Statistical Parametric Speech Synthesis using DNNs”, 2014

SLIDE 26

DNN-based speech synthesis

[Figure: the same DNN architecture as on the previous slide]

Image from Zen et al., “Statistical Parametric Speech Synthesis using DNNs”, 2014

  • Input features encode linguistic contexts, including numeric values (number of words, duration of the phoneme, etc.)
  • Output features are spectral and excitation parameters and their delta values

  • Listening test results:

Table 1. Preference scores (%) between speech samples from the HMM and DNN-based systems. Systems that achieved significantly better preference at the p < 0.01 level are in bold.

HMM        | DNN (#layers x #units) | Neutral | p value | z value
15.8 (16)  | 38.5 (4 x 256)         | 45.7    | < 10^-6 | 9.9
16.1 (4)   | 27.2 (4 x 512)         | 56.8    | < 10^-6 | 5.1
12.7 (1)   | 36.6 (4 x 1024)        | 50.7    | < 10^-6 | 11.5
SLIDE 27

RNN-based speech synthesis

[Figure: pipeline: text, text analysis, input feature extraction, input features, BLSTM-based network, output features, vocoder, waveform]

  • Access long-range context in both the forward and backward directions using biLSTMs
  • Inference is expensive; these models inherently have large latency

Image from Fan et al., “TTS synthesis with BLSTM-based RNNs”, 2014