Speech Synthesis
Lecture 19, CS 753
Instructor: Preethi Jyothi
Project Preliminary Report
- Preliminary project report will contribute towards 5% of your final grade. The deadline is 27th October, 2019.
- Define the following for your project:
  1) Input-output behaviour of your system
  2) Evaluation metric
  3) At least two existing (or related) approaches to your problem
- Propose a model and an algorithm for the problem you're tackling and give detailed descriptions for both. Do not provide generic descriptions of the model; describe precisely how it applies to your problem.
- Describe how much of your algorithm has been implemented. If you are using existing APIs/libraries, clearly demarcate which parts you will be implementing and for which parts you will rely on existing implementations.
- Describe the experiments you are planning to run. If you have already run any preliminary experiments, describe them and report your initial results.
Each of the four items above is worth 5 points.
Text-To-Speech (TTS) Systems
Storied History
- Von Kempelen’s speaking machine (1791)
  - Bellows simulated the lungs
  - Rubber mouth and nose; nostrils had to be covered with two fingers for non-nasals
- Homer Dudley’s VODER (1939)
  - First device to synthesize speech sounds via electrical means
- Gunnar Fant’s OVE formant synthesizer (1960s)
  - Formant synthesizer for vowels
- Computer-aided speech synthesis (1970s)
  - Concatenative (unit selection)
  - Parametric (HMM-based and NN-based)
All images from http://www2.ling.su.se/staff/hartmut/kemplne.htm
Speech synthesis or TTS systems
- Goal of a TTS system: produce a natural-sounding, high-quality speech waveform for a given word sequence
- TTS systems are typically divided into two parts:
- A. Linguistic specification
- B. Waveform generation
Current TTS systems
- Constructed using a large amount of speech data
- Referred to as corpus-based TTS systems
- Two prominent instances of corpus-based TTS:
- 1. Unit selection and concatenation
- 2. Statistical parametric speech synthesis
Unit Selection Synthesis
Unit selection synthesis or Concatenative speech synthesis
[Figure: selecting units from all candidate segments, with target and concatenation costs]
- Synthesize new sentences by selecting sub-word units from a database of speech
- Optimal size of units? Diphones? Half-phones?
Image from Zen et al., “Statistical Parametric Speech Synthesis”, Speech Communication, 2009
Unit selection synthesis

- Target cost between a candidate unit $u_i$ and a target unit $t_i$:

  $C^{(t)}(t_i, u_i) = \sum_{j=1}^{p} w_j^{(t)} C_j^{(t)}(t_i, u_i)$

- Concatenation cost between candidate units:

  $C^{(c)}(u_{i-1}, u_i) = \sum_{k=1}^{q} w_k^{(c)} C_k^{(c)}(u_{i-1}, u_i)$

- Find the string of units that minimises the overall cost:

  $\hat{u}_{1:n} = \arg\min_{u_{1:n}} \{ C(t_{1:n}, u_{1:n}) \}$, where
  $C(t_{1:n}, u_{1:n}) = \sum_{i=1}^{n} C^{(t)}(t_i, u_i) + \sum_{i=2}^{n} C^{(c)}(u_{i-1}, u_i)$
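The minimisation above decomposes over adjacent units, so the optimal sequence can be found with a Viterbi-style dynamic programming search over the candidate units. Below is a minimal sketch of that search; `target_cost(i, j)` and `concat_cost(i, k, j)` are hypothetical stand-ins for the weighted sums $C^{(t)}$ and $C^{(c)}$ above (candidate j of target i, joined after candidate k of target i-1), and `candidates[i]` is the list of database units that could realise target $t_i$.

```python
import numpy as np

def select_units(candidates, target_cost, concat_cost):
    """Return indices of the unit sequence u_1:n minimising the overall cost."""
    n = len(candidates)
    # best[i][j]: minimal cost of any unit sequence ending in candidate j of target i
    best = [np.array([target_cost(0, j) for j in range(len(candidates[0]))])]
    back = []
    for i in range(1, n):
        cur = np.empty(len(candidates[i]))
        ptr = np.empty(len(candidates[i]), dtype=int)
        for j in range(len(candidates[i])):
            # cost of the best path ending in each previous candidate k,
            # plus the cost of concatenating candidate j after it
            trans = best[-1] + np.array(
                [concat_cost(i, k, j) for k in range(len(candidates[i - 1]))])
            ptr[j] = int(np.argmin(trans))
            cur[j] = trans[ptr[j]] + target_cost(i, j)
        back.append(ptr)
        best.append(cur)
    # backtrack the lowest-cost path
    path = [int(np.argmin(best[-1]))]
    for ptr in reversed(back):
        path.append(int(ptr[path[-1]]))
    return list(reversed(path))
```

Practical unit-selection engines perform essentially this search, usually with pruning, since the candidate sets per target unit can be very large.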
Unit selection synthesis

[Figure: clustered segments, with target and concatenation costs]

- Target cost is pre-calculated using a clustering method
Statistical Parametric Speech Synthesis
Parametric Speech Synthesis Framework
- Training
  - Estimate the acoustic model given speech utterances (O) and word sequences (W)*:

    $\hat{\lambda} = \arg\max_{\lambda} p(O \mid W, \lambda)$
[Figure: training pipeline. Speech analysis and text analysis produce O and W from the speech/text data, and the model $\hat{\lambda}$ is trained from them]
* Here W could refer to any textual features relevant to the input text
Parametric Speech Synthesis Framework
- Training
  - Estimate the acoustic model given speech utterances (O) and word sequences (W):

    $\hat{\lambda} = \arg\max_{\lambda} p(O \mid W, \lambda)$

- Synthesis
  - Find the most probable $\hat{o}$ from $\hat{\lambda}$ and a given word sequence to be synthesised, $w$:

    $\hat{o} = \arg\max_{o} p(o \mid w, \hat{\lambda})$

  - Synthesize speech from $\hat{o}$
[Figure: full pipeline (speech analysis, text analysis, train model, parameter generation, speech synthesis); the acoustic models used here are HMMs]
HMM-based speech synthesis
[Figure: HMM-based speech synthesis system. Training part: spectral and excitation parameters are extracted from the speech database and used, together with labels, to train context-dependent HMMs and duration models. Synthesis part: text analysis converts text to labels, parameters are generated from the HMMs, and excitation generation plus a synthesis filter produce the synthesized speech]
Speech parameter generation
Generate the most probable observation vectors given the HMM and w:

$\hat{o} = \arg\max_{o} p(o \mid w, \hat{\lambda}) = \arg\max_{o} \sum_{\forall q} p(o, q \mid w, \hat{\lambda}) \approx \arg\max_{o} \max_{q} p(o, q \mid w, \hat{\lambda}) = \arg\max_{o} \max_{q} p(o \mid q, \hat{\lambda}) \, p(q \mid w, \hat{\lambda})$

Determine the best state sequence and outputs sequentially:

$\hat{q} = \arg\max_{q} p(q \mid w, \hat{\lambda}), \qquad \hat{o} = \arg\max_{o} p(o \mid \hat{q}, \hat{\lambda})$

Let’s explore this first.

Determining state outputs
$\hat{o} = \arg\max_{o} p(o \mid \hat{q}, \hat{\lambda}) = \arg\max_{o} \mathcal{N}(o; \mu_{\hat{q}}, \Sigma_{\hat{q}})$

where $o = [o_1^\top, \ldots, o_T^\top]^\top$ is a state-output vector sequence to be generated, $q = \{q_1, \ldots, q_T\}$ is a state sequence, $\mu_{\hat{q}} = [\mu_{q_1}^\top, \ldots, \mu_{q_T}^\top]^\top$ is the mean vector for $\hat{q}$, and $\Sigma_{\hat{q}} = \mathrm{diag}[\Sigma_{q_1}, \ldots, \Sigma_{q_T}]$ is the covariance matrix for $\hat{q}$.
What would $\hat{o}$ look like?

[Figure: the generated trajectory, showing the per-state means and variances]
Adding dynamic features to state outputs
State output vectors contain both static ($c_t$) and dynamic ($\Delta c_t$) features:

$o_t = [c_t^\top, \Delta c_t^\top]^\top$, where the dynamic feature is calculated as $\Delta c_t = c_t - c_{t-1}$

The relationship between $o$ and $c$ can be arranged in matrix form as $o = Wc$:

$\begin{bmatrix} \vdots \\ c_{t-1} \\ \Delta c_{t-1} \\ c_t \\ \Delta c_t \\ c_{t+1} \\ \Delta c_{t+1} \\ \vdots \end{bmatrix} = \begin{bmatrix} & \vdots & \vdots & \vdots & \vdots & \\ \cdots & 0 & I & 0 & 0 & \cdots \\ \cdots & -I & I & 0 & 0 & \cdots \\ \cdots & 0 & 0 & I & 0 & \cdots \\ \cdots & 0 & -I & I & 0 & \cdots \\ \cdots & 0 & 0 & 0 & I & \cdots \\ \cdots & 0 & 0 & -I & I & \cdots \\ & \vdots & \vdots & \vdots & \vdots & \end{bmatrix} \begin{bmatrix} \vdots \\ c_{t-2} \\ c_{t-1} \\ c_t \\ c_{t+1} \\ \vdots \end{bmatrix}$
Speech parameter generation
- Introducing dynamic feature constraints:

  $\hat{o} = \arg\max_{o} p(o \mid \hat{q}, \hat{\lambda})$, where $o = Wc$

- If the output distributions are single Gaussians:

  $p(o \mid \hat{q}, \hat{\lambda}) = \mathcal{N}(o; \mu_{\hat{q}}, \Sigma_{\hat{q}})$

- Then, by setting $\partial \log \mathcal{N}(o; \mu_{\hat{q}}, \Sigma_{\hat{q}}) / \partial c = 0$, we get:

  $W^\top \Sigma_{\hat{q}}^{-1} W c = W^\top \Sigma_{\hat{q}}^{-1} \mu_{\hat{q}}$
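A minimal numerical sketch of this solution, assuming one-dimensional static features, the delta definition $\Delta c_t = c_t - c_{t-1}$ from the previous slide, and per-frame means and variances already gathered along the chosen state sequence $\hat{q}$. The `(T, 2)` layout of `mu` and `var` (static and delta statistics per frame) is an assumption made for this illustration.

```python
import numpy as np

def mlpg(mu, var):
    """Solve W^T Sigma^{-1} W c = W^T Sigma^{-1} mu for the static trajectory c.

    mu, var: (T, 2) arrays holding the [static, delta] mean/variance per frame."""
    T = mu.shape[0]
    # Build W so that o = W c, with rows alternating between a static row (c_t)
    # and a delta row (c_t - c_{t-1}); o has length 2T, c has length T.
    W = np.zeros((2 * T, T))
    for t in range(T):
        W[2 * t, t] = 1.0                 # static row: c_t
        W[2 * t + 1, t] = 1.0             # delta row: c_t ...
        if t > 0:
            W[2 * t + 1, t - 1] = -1.0    # ... minus c_{t-1}
    mu_vec = mu.reshape(-1)               # interleaved means [c_1, dc_1, c_2, dc_2, ...]
    prec = 1.0 / var.reshape(-1)          # diagonal of Sigma^{-1}
    A = W.T @ (prec[:, None] * W)         # W^T Sigma^{-1} W
    b = W.T @ (prec * mu_vec)             # W^T Sigma^{-1} mu
    return np.linalg.solve(A, b)          # smooth static trajectory c
```

Because $W^\top \Sigma_{\hat{q}}^{-1} W$ is banded, practical implementations solve this system with a banded Cholesky factorisation rather than the dense solve used here.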
Synthesis overview
[Figure: synthesis overview. Clustered states are merged into a sentence HMM; the static and delta Gaussians of its states are used to generate the ML trajectory]
Speech parameter generation
Generate the most probable observation vectors given the HMM and w:

$\hat{o} = \arg\max_{o} p(o \mid w, \hat{\lambda}) = \arg\max_{o} \sum_{\forall q} p(o, q \mid w, \hat{\lambda}) \approx \arg\max_{o} \max_{q} p(o, q \mid w, \hat{\lambda}) = \arg\max_{o} \max_{q} p(o \mid q, \hat{\lambda}) \, p(q \mid w, \hat{\lambda})$

Determine the best state sequence and outputs sequentially:

$\hat{q} = \arg\max_{q} p(q \mid w, \hat{\lambda}), \qquad \hat{o} = \arg\max_{o} p(o \mid \hat{q}, \hat{\lambda})$

Let’s explore this next.
Duration modeling
[Figure: state duration probability $p_k(d) = a_{kk}^{d-1}(1 - a_{kk})$ plotted against state duration d (frames), for $a_{kk} = 0.6$]
- How are durations modelled within an HMM?
  - Implicitly modelled by state self-transition probabilities
  - PMFs of state durations are geometric distributions:

    $p_k(d) = a_{kk}^{d-1} \cdot (1 - a_{kk})$

- State durations are determined by maximising:

  $\log P(d \mid \lambda) = \sum_{j=1}^{N} \log p_j(d_j)$

- What would this solution look like if the PMFs of state durations are geometric distributions?
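As a quick check of the question above: the geometric PMF is largest at d = 1 and decays monotonically, so maximising $\sum_j \log p_j(d_j)$ with no further constraint would assign every state the shortest possible duration. A tiny sketch for the self-transition value in the plot:

```python
# Implicit (geometric) state-duration PMF p_k(d) = a_kk^(d-1) * (1 - a_kk)
a_kk = 0.6
pmf = [(a_kk ** (d - 1)) * (1 - a_kk) for d in range(1, 11)]
print([round(p, 3) for p in pmf])   # [0.4, 0.24, 0.144, ...]: monotonically decreasing
```

This is one motivation for the explicit duration models described next.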
Explicit modeling of state durations
- Each state duration is explicitly modelled as a single Gaussian. The mean $\xi(i)$ and variance $\sigma^2(i)$ of the duration density of state i:

  $\xi(i) = \dfrac{\sum_{t_0=1}^{T} \sum_{t_1=t_0}^{T} \chi_{t_0,t_1}(i)\,(t_1 - t_0 + 1)}{\sum_{t_0=1}^{T} \sum_{t_1=t_0}^{T} \chi_{t_0,t_1}(i)}, \qquad \sigma^2(i) = \dfrac{\sum_{t_0=1}^{T} \sum_{t_1=t_0}^{T} \chi_{t_0,t_1}(i)\,(t_1 - t_0 + 1)^2}{\sum_{t_0=1}^{T} \sum_{t_1=t_0}^{T} \chi_{t_0,t_1}(i)} - \xi^2(i)$

  where $\chi_{t_0,t_1}(i) = (1 - \gamma_{t_0-1}(i)) \cdot \prod_{t=t_0}^{t_1} \gamma_t(i) \cdot (1 - \gamma_{t_1+1}(i))$ and $\gamma_t(i)$ is the probability of being in state i at time t.
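A minimal sketch of these statistics, assuming the occupancy probabilities $\gamma_t(i)$ are already available (e.g. from the forward-backward algorithm) and taking $\gamma_0(i) = \gamma_{T+1}(i) = 0$ at the boundaries; the O(T^2) double loop simply mirrors the sums above and is for illustration only.

```python
import numpy as np

def duration_stats(gamma):
    """gamma: (T, N) array with gamma[t, i] = P(state i at frame t+1).
    Returns the mean xi and variance sigma2 of the duration density per state."""
    T, N = gamma.shape
    # pad with zeros so gamma at times 0 and T+1 is 0 (boundary assumption)
    g = np.vstack([np.zeros(N), gamma, np.zeros(N)])   # valid time indices 0 .. T+1
    xi, sigma2 = np.zeros(N), np.zeros(N)
    for i in range(N):
        num1 = num2 = den = 0.0
        for t0 in range(1, T + 1):
            prod = 1.0
            for t1 in range(t0, T + 1):
                prod *= g[t1, i]                       # product of gamma_t(i), t = t0..t1
                chi = (1 - g[t0 - 1, i]) * prod * (1 - g[t1 + 1, i])
                d = t1 - t0 + 1
                den += chi
                num1 += chi * d
                num2 += chi * d * d
        xi[i] = num1 / den
        sigma2[i] = num2 / den - xi[i] ** 2
    return xi, sigma2
```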
Determining state durations
During synthesis, for a given speech length T, the goal is to maximize:
$\log P(d \mid \lambda, T) = \sum_{k=1}^{K} \log p_k(d_k) \qquad (1)$

under the constraint that $T = \sum_{k=1}^{K} d_k$.

We saw that each duration density can be modelled as a single Gaussian: $p_k(d_k) = \mathcal{N}(d_k; \xi(k), \sigma^2(k))$.

State durations $d_k$, $k = 1, \ldots, K$, which maximise (1) are given by:

$d_k = \xi(k) + \rho \cdot \sigma^2(k), \qquad \rho = \dfrac{T - \sum_{k=1}^{K} \xi(k)}{\sum_{k=1}^{K} \sigma^2(k)}$

where $\xi(k)$ and $\sigma^2(k)$ are the mean and variance of the duration density of state k, and $\rho$ controls the speaking rate.
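A minimal sketch of this closed-form solution, with continuous-valued durations (rounding to integer frame counts is omitted) and made-up $\xi$, $\sigma^2$ values purely for illustration:

```python
import numpy as np

def state_durations(xi, sigma2, T):
    """d_k = xi(k) + rho * sigma^2(k), with rho chosen so the durations sum to T."""
    rho = (T - np.sum(xi)) / np.sum(sigma2)   # speaking-rate factor
    return xi + rho * sigma2

# e.g. stretch a 3-state model (mean durations 10, 20, 15 frames) to T = 60 frames
print(state_durations(np.array([10.0, 20.0, 15.0]), np.array([4.0, 9.0, 6.0]), 60))
```

Setting T larger or smaller than the sum of the mean durations makes $\rho$ positive or negative, which slows down or speeds up the synthesized speech.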
Synthesis using duration models
[Figure: synthesis using duration models. Text is mapped to context-dependent state duration models and context-dependent HMMs; state durations d are generated for a requested length T (or speaking-rate factor $\rho$), mel-cepstrum parameters c are generated from the sentence HMM, and an MLSA filter driven by pitch produces the synthetic speech]
Image from Yoshimura et al., “Duration modelling for HMM-based speech synthesis”, ICSLP ‘98
Recap: HMM-based speech synthesis

[Figure: training part (spectral and excitation parameter extraction from the speech database; training of context-dependent HMMs and state duration models) and synthesis part (text analysis to labels, parameter generation from HMMs, excitation generation, synthesis filter) producing synthesized speech]
DNN-based speech synthesis
[Figure: DNN-based speech synthesis. Text analysis and input feature extraction produce frame-level input features (binary and numeric) $x_t$ for frames 1..T; a DNN with several hidden layers maps each $x_t$ to output features $y_t$, the statistics (mean and variance) of the speech parameter vector sequence; parameter generation and waveform synthesis then produce speech]
Image from Zen et al., “Statistical Parametric Speech Synthesis using DNNs”, 2014
DNN-based speech synthesis
- Input features: linguistic contexts and numeric values (# of words, duration of the phoneme, etc.)
- Output features: spectral and excitation parameters and their delta values
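A minimal sketch of the frame-level mapping described above: per-frame linguistic input features are passed through a few fully-connected hidden layers to produce per-frame acoustic output features. The layer sizes and the tanh nonlinearity are illustrative assumptions, not the configuration used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
sizes = [347, 512, 512, 512, 127]      # [input, 3 hidden, output] dims -- hypothetical
weights = [rng.standard_normal((m, n)) * 0.01 for m, n in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]

def dnn_forward(x):
    """x: (T, input_dim) linguistic features; returns (T, output_dim) acoustic features."""
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = np.tanh(h @ W + b)           # hidden layers
    return h @ weights[-1] + biases[-1]  # linear output layer

y = dnn_forward(rng.standard_normal((100, 347)))   # 100 frames of dummy input
```

The weights here are random; in the actual system they are trained to predict the acoustic features extracted from the speech database, and the per-frame outputs are then passed to the parameter generation and waveform synthesis steps shown in the figure.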
Table 1. Preference scores (%) between speech samples from the HMM and DNN-based systems. The systems which achieved significantly better preference at the p < 0.01 level are in bold font in the original table.

HMM         DNN (#layers x #units)   Neutral   p value    z value
15.8 (16)   38.5 (4 x 256)           45.7      < 10^-6    -9.9
16.1 (4)    27.2 (4 x 512)           56.8      < 10^-6    -5.1
12.7 (1)    36.6 (4 x 1024)          50.7      < 10^-6    -11.5

- Listening test results: the DNN-based systems achieved higher preference scores than the HMM-based systems in all three comparisons.
RNN-based speech synthesis
[Figure: Text -> Text Analysis -> Input Feature Extraction -> input features -> RNN -> output features -> Vocoder -> Waveform]

- Access long-range context in both forward and backward directions using biLSTMs
- Inference is expensive; these models inherently have large latency
Image from Fan et al., “TTS synthesis with BLSTM-based RNNs”, 2014
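As a rough illustration of such a model, here is a minimal PyTorch sketch of a BLSTM acoustic model: frame-level linguistic input features in, per-frame acoustic features for the vocoder out. The dimensions and the two-layer depth are assumptions for illustration, not the setup from Fan et al.

```python
import torch
import torch.nn as nn

class BLSTMAcousticModel(nn.Module):
    def __init__(self, in_dim=355, hidden=256, out_dim=127):
        super().__init__()
        self.blstm = nn.LSTM(in_dim, hidden, num_layers=2,
                             batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden, out_dim)   # forward + backward states

    def forward(self, x):                  # x: (batch, frames, in_dim)
        h, _ = self.blstm(x)               # h: (batch, frames, 2 * hidden)
        return self.proj(h)                # per-frame acoustic features

model = BLSTMAcousticModel()
y = model(torch.randn(1, 200, 355))        # one utterance of 200 frames
```

The backward recurrence is what gives access to future context, but it also means the whole utterance must be processed before outputs can be produced, which is the latency issue noted above.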