SLIDE 1

KDE-HMMs

New, Nonparametric Acoustic Models for Speech Synthesis
Gustav Eje Henter
Joint work with W. Bastiaan Kleijn and Arne Leijon at KTH
CSTR internal presentation, Monday 20 January 2014

Gustav Eje Henter (CSTR) KDE-HMMs for Speech Synthesis 2014-01-20 1

SLIDE 2

Take-Home Message

Current acoustic models in parametric speech synthesis are not a good fit. We present a new acoustic model for speech that:

1. Converges asymptotically on the true data-generating process
2. Can be interpreted as probabilistic hybrid speech synthesis
3. Models nonlinear time series better

The advantages come thanks to nonparametric speech synthesis

SLIDE 3

Outline

1. Introduction
2. Kernel density estimation
3. KDE Markov models
  • Experiments
4. KDE-HMMs
  • Parameter estimation
  • Experiments
5. Summary and outlook

SLIDE 5

Standard Sequence Models

Markovian paradigm

  • Finite-length memory
  • Examples:
  • Discrete Markov chain p_{X_t | X_{t−1}}(x_t | x_{t−1})
  • Linear autoregressive (AR) models:

X_t = μ + ∑_{l=1}^{p} α_l (X_{t−l} − μ) + E_t

[Graphical model: observation chain X_{t−1} → X_t → X_{t+1}]
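The linear AR model above is straightforward to simulate. Here is a minimal sketch (the function name and example parameters are illustrative, not from the talk):

```python
import numpy as np

def sample_ar(mu, alphas, sigma, T, rng=None):
    """Draw T samples from the linear AR(p) process
    X_t = mu + sum_{l=1}^p alphas[l-1] * (X_{t-l} - mu) + E_t,
    with IID Gaussian innovations E_t ~ N(0, sigma^2)."""
    rng = np.random.default_rng(0) if rng is None else rng
    alphas = np.asarray(alphas, float)
    p = len(alphas)
    x = np.full(T + p, mu, dtype=float)   # initialise the p lags at the process mean
    for t in range(p, T + p):
        lags = x[t - p:t][::-1] - mu      # x_{t-1}, ..., x_{t-p}
        x[t] = mu + alphas @ lags + sigma * rng.normal()
    return x[p:]

x = sample_ar(mu=0.0, alphas=[0.9], sigma=1.0, T=500)
```

For AR(1), |α₁| < 1 gives a stationary process; larger coefficients give slower mean reversion.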

SLIDE 6

Standard Sequence Models

Hidden-state paradigm

  • Unbounded memory
  • Admits a control signal
  • Examples:
  • Hidden Markov model (discrete state Q_t)
  • Kalman filter (continuous state)

[Graphical model: observations X_{t−1}, X_t, X_{t+1} emitted from hidden states Q_{t−1}, Q_t, Q_{t+1}]

SLIDE 7

Standard HMM Acoustic Model

Standard models for parametric speech synthesis are HMMs or HSMMs

  • States Q_t represent (sub)phone, context, and prosodic information
  • Observables X_t ∈ ℝ^D are vocoder parameters
  • State-conditional output distributions f_{X_t | Q_t}(x_t | q_t) are Gaussian
  • Dynamic features (∆s and ∆∆s) tie adjacent observations together
  • Autoregressive HMMs (AR-HMMs) are less mathematically objectionable

[Graphical model: observations X_{t−1}, X_t, X_{t+1} with hidden states Q_{t−1}, Q_t, Q_{t+1}]

SLIDE 8

Problems

Even using ground-truth durations, generated features are poor

  • Sampled output is warbly (Shannon, Zen, & Byrne, 2011)
  • Most probable output sequence (ML parameter generation, MLPG) sounds muffled and buzzy

Note: Unit selection does not have these problems

SLIDE 9

Problem Analysis

What is wrong with our parametric models?

  • The model is inadequate
  • State-conditional outputs are overly simplistic: essentially just linear AR processes
  • Results on full-covariance models from Shannon, Zen, & Byrne (2011) suggest that trajectory time dependence is not well modelled
  • Nonlinear AR models are a closer match
  • Product-of-experts models increase held-out data likelihood substantially, but not synthesis quality (Shannon, 2012)

SLIDE 10

New Idea

What to do?

  • No one knows what the “true” distribution f of speech is
  • It is not obvious how to improve current models
  • This calls for a generally applicable technique!
  • Proposal: Kernel Conditional Density Estimation + Markov processes
  • Can describe any Markov model
  • Then add hidden state to control process output

SLIDE 12

Kernel Density Estimation

Kernel Density Estimation (KDE) is a nonparametric density estimation technique

  • Training data D = {y_1, …, y_N} in ℝ^D sampled from a reference density f_X
  • Test points {x_1, …, x_T}
  • KDE can be seen as a smoothing or blurring (convolution) of the empirical density function

ḟ_X(x | D) = (1/N) ∑_{n=1}^{N} δ(x − y_n)

with a nonnegative kernel function k(r)

  • Intuition: KDE is squinting while looking at the data points

SLIDE 13

Kernel Density Estimation

  • The estimated PDF can be written

f̂_X(x | D, h) = (1/N) ∑_{n=1}^{N} (1/h^D) k((x − y_n)/h)

where h is a bandwidth parameter controlling the degree of smoothing

  • We require ∫ k(r) dr = 1 and ∫ r k(r) dr = 0
  • Probabilistic interpretation:
  • Mixture distribution with k(r)-shaped zero-mean components
  • One component centred on each training-data point
  • We use Gaussian kernels throughout
  • Bandwidth h matters more than kernel shape k(r)
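In code, the estimator is only a few lines. A minimal 1D sketch with a Gaussian kernel (the function name `kde1d` is my own, not from the talk):

```python
import numpy as np

def kde1d(x, data, h):
    """1D Gaussian KDE: f_hat(x) = (1/N) * sum_n (1/h) * k((x - y_n)/h),
    i.e. the average of one N(y_n, h^2) bump per training point."""
    x = np.atleast_1d(np.asarray(x, float))[:, None]   # (M, 1) query points
    data = np.asarray(data, float)[None, :]            # (1, N) training points
    z = (x - data) / h                                 # standardised distances
    return np.exp(-0.5 * z**2).sum(axis=1) / (data.size * h * np.sqrt(2.0 * np.pi))

# density estimate from 500 standard-normal samples
rng = np.random.default_rng(0)
samples = rng.normal(size=500)
xs = np.linspace(-5.0, 5.0, 1001)
f_hat = kde1d(xs, samples, h=0.3)
```

The same estimator in D dimensions replaces 1/h by 1/h^D and the scalar distance by a vector norm.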

SLIDE 14

Example Data

Running example: Santa Fe chaotic FIR laser series (1D, N = 1000 plotted). [Plot: laser intensity x_t vs. time index t]

SLIDE 15

Example Data

Running example: Santa Fe chaotic FIR laser series (detail). [Plot: laser intensity x_t vs. time index t]

SLIDE 16

Example Data

Scatter plot of consecutive values {(x_t, x_{t+1})}_t reveals attractor structure. [Plot: subsequent value x_{t+1} vs. current value x_t]

SLIDE 17

Example KDE

Gaussian blur of the points = 2D KDE (bandwidth h optimised for log-probability). [Plot: subsequent value x_{t+1} vs. current value x_t]

SLIDE 18

Example KDE

Scatter plot superimposed on the 2D KDE fit. [Plot: subsequent value x_{t+1} vs. current value x_t]

SLIDE 19

KDE Properties

Strengths:

  • Asymptotically consistent: lim_{N→∞} f̂_X = f_X under appropriate bandwidth selection (h → 0, Nh → ∞), regardless of f_X
  • Built from data points (nonparametric)
  • Single free parameter

Weaknesses:

  • Data demanding
  • Computationally demanding
  • Substantial speedups are possible (e.g., Holmes, Gray, & Isbell, 2007)

SLIDE 21

Handling Time Dependence

So far we have said nothing about time dependence

  • Key idea: a joint KDE PDF f_{X^t_{t−p}}(x^t_{t−p}) for sequence segments

x^t_{t−p} = (x^⊺_{t−p}, …, x^⊺_{t−1}, x^⊺_t)^⊺

induces a conditional distribution f_{X_t | X^{t−1}_{t−p}}(x_t | x^{t−1}_{t−p}) (Hyndman, Bashtannyk, & Grunwald, 1996)

  • These next-step distributions are sufficient to define a p-order Markov process
  • KDE Markov model (KDE-MM)
  • Nonlinear and nonparametric
  • Many independent proposals, e.g., Rajarshi (1990)
SLIDE 22

Graphical Illustration

A conditional distribution is a cut through the KDE. [Plot: subsequent value x_t vs. given value x_{t−1}]

SLIDE 23

Graphical Illustration

Resulting normalised next-step PDF f_{X_t | X_{t−1}}(x | x_{t−1} = 100). [Plot: conditional PDF f_{X_t | X_{t−1}}(x_t | 100) vs. subsequent value x_t]

SLIDE 24

KCDE Definition

Kernel Conditional Density Estimation (KCDE) is a normalisation of the KDE, with resulting PDF

f̂_{X_t | X^{t−1}_{t−p}}(x_t | x^{t−1}_{t−p}, D) = (1/h^D) · (∑_n ∏_{l=0}^{p} k((x_{t−l} − y_{n−l})/h)) / (∑_n ∏_{l=1}^{p} k((x_{t−l} − y_{n−l})/h)),

assuming the kernel factorises as k(r) = ∏_{l=0}^{p} k(r_l)
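A direct, unoptimised sketch of this estimator for a scalar series, with a Gaussian kernel (the helper names `gauss` and `kcde` are mine; cost is O(N) per evaluation):

```python
import numpy as np

def gauss(r, h):
    """Gaussian kernel (1/h) k(r/h), with k the standard normal PDF."""
    return np.exp(-0.5 * (r / h) ** 2) / (h * np.sqrt(2.0 * np.pi))

def kcde(x_t, context, y, p, h):
    """Estimate f(x_t | x_{t-1}, ..., x_{t-p}) from a scalar training series y
    as the ratio of two product-kernel sums over the data.
    `context` holds the p most recent values (x_{t-p}, ..., x_{t-1})."""
    y = np.asarray(y, float)
    n = np.arange(p, len(y))                   # anchors n with all p lags available
    ctx = np.ones(len(n))
    for l in range(1, p + 1):                  # context match: prod_{l=1}^p k(...)
        ctx = ctx * gauss(context[-l] - y[n - l], h)
    num = (ctx * gauss(x_t - y[n], h)).sum()   # adds the l = 0 (output) factor
    return num / ctx.sum()

y = np.sin(0.3 * np.arange(300))               # toy deterministic series
f_val = kcde(x_t=y[100], context=y[99:100], y=y, p=1, h=0.1)
```

Because the extra output kernel in the numerator carries the remaining 1/h factor, the ratio is a properly normalised density in x_t.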

SLIDE 25

KDE-MM Remarks

  • KDE-MM converges on the true process as N → ∞
  • Subject to some technical criteria
  • Ergodicity, stationarity, appropriate bandwidth selection
  • Maximum likelihood estimation for h is inappropriate
  • Training set likelihood is degenerate as h → 0
  • One component centered on each data point

SLIDE 26

Degeneracy Illustrated

As h → 0, the kernels become spikes at the points in D; no generalisation. [Plot: subsequent value x_{t+1} vs. current value x_t]

SLIDE 27

Degeneracy Circumvented

Maximising the pseudo-likelihood (a kind of cross-validation)

f̃_X(y^T_1 | D, h) = ∏_n (1/h^D) · (∑_{n′≠n} ∏_{l=0}^{p} k((y_{n−l} − y_{n′−l})/h)) / (∑_{n′≠n} ∏_{l=1}^{p} k((y_{n−l} − y_{n′−l})/h))

prevents data points from “explaining themselves”
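A minimal leave-one-out sketch of this criterion for a scalar series (the toy sine series, the bandwidth grid, and the function names are illustrative, not from the talk):

```python
import numpy as np

def gauss(r, h):
    """Gaussian kernel (1/h) k(r/h)."""
    return np.exp(-0.5 * (r / h) ** 2) / (h * np.sqrt(2.0 * np.pi))

def log_pseudo_likelihood(y, p, h):
    """Leave-one-out log pseudo-likelihood of a scalar series y under a
    p-order KDE Markov model with bandwidth h: each point's conditional
    density is estimated from all *other* points, so h -> 0 is penalised."""
    y = np.asarray(y, float)
    idx = np.arange(p, len(y))
    total = 0.0
    for n in idx:
        others = idx[idx != n]            # exclude the point itself (n' != n)
        ctx = np.ones(len(others))
        for l in range(1, p + 1):
            ctx = ctx * gauss(y[n - l] - y[others - l], h)
        num = (ctx * gauss(y[n] - y[others], h)).sum()
        total += np.log(num / ctx.sum())
    return total

# pick h by grid search on a toy noisy sine series
y = np.sin(0.3 * np.arange(200)) + 0.1 * np.random.default_rng(0).normal(size=200)
hs = [0.02, 0.05, 0.1, 0.2, 0.5]
best_h = max(hs, key=lambda h: log_pseudo_likelihood(y, 1, h))
```

Without the n′ ≠ n exclusion the criterion would diverge as h → 0, exactly as the previous slide illustrates.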

SLIDE 28

A Mixture Model

Rewrite the KDE-MM PDF as

f_{X_t | X^{t−1}_{t−p}}(x_t | x^{t−1}_{t−p}) = ∑_n (∏_{l=1}^{p} k((x_{t−l} − y_{n−l})/h)) / (∑_{n′} ∏_{l=1}^{p} k((x_{t−l} − y_{n′−l})/h)) · (1/h^D) k((x_t − y_n)/h)
  = ∑_n w_n(x^{t−1}_{t−p}) · (1/h^D) k((x_t − y_n)/h)

  • This is a mixture distribution with context-dependent weights
SLIDE 29

KDE-MM Output

KDE-MM data generation algorithm:

1. Given x^{t−1}_{t−p}, select a mixture component z_t ≤ N according to

p_{Z_t | X^{t−1}_{t−p}}(z_t | x^{t−1}_{t−p}) = w_{z_t}(x^{t−1}_{t−p}) = (∏_{l=1}^{p} k((x_{t−l} − y_{z_t−l})/h)) / (∑_n ∏_{l=1}^{p} k((x_{t−l} − y_{n−l})/h))

2. Set x_t = y_{z_t} + η_t, where η_t is kernel-shaped IID noise
3. Increment t and start over

[Graphical model: observations X_{t−1}, X_t, X_{t+1} with component indices Z_{t−1}, Z_t, Z_{t+1}]
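The three steps above can be sketched directly for a scalar series with a Gaussian kernel (the function name is mine; seeding the context from the training data is my own choice for the example):

```python
import numpy as np

def sample_kdemm(y, p, h, T, rng=None):
    """Generate T new samples from a p-order KDE Markov model fitted to the
    scalar series y: pick a training frame whose context matches the recent
    output, then emit it plus Gaussian kernel noise."""
    rng = np.random.default_rng(0) if rng is None else rng
    y = np.asarray(y, float)
    n = np.arange(p, len(y))
    start = rng.integers(p, len(y))          # seed the context from the data
    x = list(y[start - p:start])
    for _ in range(T):
        ctx = np.ones(len(n))
        for l in range(1, p + 1):            # unnormalised context kernels
            ctx = ctx * np.exp(-0.5 * ((x[-l] - y[n - l]) / h) ** 2)
        w = ctx / ctx.sum()                  # context-dependent mixture weights
        z = rng.choice(n, p=w)               # step 1: select component z_t
        x.append(y[z] + h * rng.normal())    # step 2: x_t = y_{z_t} + kernel noise
    return np.array(x[p:])                   # step 3 is the loop itself

y = np.sin(0.3 * np.arange(300))
out = sample_kdemm(y, p=1, h=0.05, T=100)
```

This makes the unit-selection analogy on the next slide concrete: generation concatenates well-matching data frames plus a little noise.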

SLIDE 30

Connection to Unit Selection

  • Data-driven output generation
  • Concatenate well-matching data frames (plus some noise)
  • Follow single trajectories in isolated regions
  • May switch to another trajectory where the context is ambiguous
  • The bandwidth h controls context sensitivity
  • Reminiscent of unit selection synthesis
  • h → 0 approaches unit selection, but fully probabilistic!
  • Also similar to the time-series bootstrap from statistics

SLIDE 32

Evaluation

p-order KDE-MMs vs. linear AR models on held-out laser data (N = 3000). [Plot: test-set per-sample log-probability vs. Markov model order p, for KDE-MM and AR model]

SLIDE 33

Reference Data

Excerpt from the original laser data series. [Plot: laser intensity x_t vs. time index t]

SLIDE 34

Sample Output

Sample from the best linear AR model (order p = 10). [Plot: output value x_t vs. time index t]

SLIDE 35

Sample Output

Sample from the best KDE-MM (p = 6). [Plot: output value x_t vs. time index t]

SLIDE 37

KDE in Synthesis

To use KDE/KCDE in synthesis, we need a hidden state to control the output

  • Novel proposal: KDE-HMM, a nonlinear autoregressive HMM
  • States follow a Markov chain p_{Q_t | Q_{t−1}}(q_t | q_{t−1})
  • State-conditional next-step distribution f_{X_t | Q_t, X^{t−1}_{t−p}}(x_t | q_t, x^{t−1}_{t−p}) switches between KDE-MMs

[Graphical model: observations X_{t−1}, X_t, X_{t+1} with component indices Z_{t−1}, Z_t, Z_{t+1} and hidden states Q_{t−1}, Q_t, Q_{t+1}]

SLIDE 38

KDE-HMM Details

  • Data points n are assigned to states using weights w_{qn}
  • w_{qn} ≥ 0, with ∑_{n=1}^{N} w_{qn} = 1 for normalisation
  • It is compelling to relax parts of the model
  • State- and lag-dependent bandwidths h_{ql}
  • Assuming a scalar series, the resulting PDF is

f_{X_t | Q_t, X^{t−1}_{t−p}}(x_t | q, x^{t−1}_{t−p}) = (∑_n κ_{qn}(x^{t−1}_{t−p} | h_q) · (1/h_{q0}) k((x_t − y_n)/h_{q0})) / (∑_n κ_{qn}(x^{t−1}_{t−p} | h_q))

κ_{qn}(x^{t−1}_{t−p} | h_q) = w_{qn} ∏_{l=1}^{p} k((x_{t−l} − y_{n−l})/h_{ql})
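Concretely, the state-conditional density for a scalar series can be evaluated as below. This is a sketch under the scalar assumption stated on the slide; the argument names are mine, with `h_q[0]` playing the role of the output bandwidth h_{q0} and `h_q[l]` the lag bandwidths h_{ql}:

```python
import numpy as np

def kdehmm_cond_density(x_t, context, y, w_q, h_q, p):
    """State-conditional next-step PDF of a scalar KDE-HMM: a mixture of
    Gaussian kernels centred on the training points y_n, with weights
    kappa_qn = w_q[n] * prod_{l=1}^p k((x_{t-l} - y_{n-l}) / h_q[l]),
    renormalised over n.  The kappa factors are left unnormalised because
    their constants cancel in the ratio."""
    y = np.asarray(y, float)
    n = np.arange(p, len(y))
    kappa = np.asarray(w_q, float)[n].copy()
    for l in range(1, p + 1):
        kappa *= np.exp(-0.5 * ((context[-l] - y[n - l]) / h_q[l]) ** 2)
    out = np.exp(-0.5 * ((x_t - y[n]) / h_q[0]) ** 2) / (h_q[0] * np.sqrt(2.0 * np.pi))
    return float((kappa * out).sum() / kappa.sum())

y = np.sin(0.3 * np.arange(300))
w_q = np.full(len(y), 1.0 / len(y))    # uniform point-to-state weights, for illustration
f_val = kdehmm_cond_density(y[100], y[99:100], y, w_q, h_q=[0.1, 0.1], p=1)
```

With uniform weights and equal bandwidths this reduces to the plain KDE-MM next-step density; the state q enters only through w_q and h_q.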

SLIDE 39

KDE-HMM Properties

Advantages:

  • Flexible short-range correlation modelling
  • Hidden state allows output control
  • Context-dependent bandwidths

Disadvantages:

  • Data requirements
  • Computational cost

SLIDE 40

Context-Dependent Bandwidths

A single bandwidth is too coarse in the centre, because of the sparse edges. [Plot: subsequent value x_{t+1} vs. current value x_t]

SLIDE 41

Context-Dependent Bandwidths

Data points coloured according to estimated instantaneous phase. [Plot: subsequent value x_{t+1} vs. current value x_t]

SLIDE 43

Parameter Estimation

Standard techniques apply to derive expectation-maximisation (EM) update equations for the bandwidths and weights

  • Auxiliary function

Q(θ′ | θ) = … + ∑_{q,t} ∑_{n≠t} γ_{qt} ϱ^num_{qnt} (ln(1/h′_{q0}) − (1/(2h′²_{q0})) (x_t − y_n)²)
  + ∑_{q,t} ∑_{n≠t} γ_{qt} ϱ^num_{qnt} (ln w′_{qn} − (1/2) ∑_{l=1}^{p} (1/h′²_{ql}) (x_{t−l} − y_{n−l})²)
  − ∑_{q,t} γ_{qt} ln(∑_{n≠t} w′_{qn} exp(−(1/2) ∑_{l=1}^{p} (1/h′²_{ql}) (x_{t−l} − y_{n−l})²))

  • The negative log-sum-exp term due to the conditioning is an issue

SLIDE 44

Handling Log-Sum-Exp

1. Extended Baum-Welch (EBW) heuristic from discriminative training
  • Guaranteed ascent for small step lengths (nonconstructive proof)
2. Minorise-maximisation
  • Optimise a locally tight lower bound Q̃(θ′ | θ) ≤ Q(θ′ | θ)
  • Such bounds can have the same form as the other terms in Q using reverse-Jensen inequalities (Jebara, 2002):

−ln(∑_{n≠t} w_{qn} exp(∑_{l=1}^{p} T_{nl}(x_{t−l}) (1/h′²_{ql}) − K(h′_q)))
  ≥ ∑_{n≠t} ω_{qtn} (∑_{l=1}^{p} U_{tnl}(x_{t−l}) (1/h′²_{ql}) − K(h′_q)) − k_{qt}

  • The modified sufficient statistics U_{tnl} and weights ω_{qtn} depend on the current parameter values h_q

SLIDE 45

Minorise-Maximisation Updates

One obtains a regularisation of the bandwidth update formula:

h²_{ql}(new) = (W_q h²_{ql} + ∑_{t} ∑_{n≠t} γ_{qt} (ϱ^num_{qnt} − ϱ^den_{qnt}) (x_{t−l} − y_{n−l})²) / (W_q + ∑_{t} ∑_{n≠t} γ_{qt} (ϱ^num_{qnt} − ϱ^den_{qnt}))

  • Dependence on the previous estimate h²_{ql} through the local bound
  • A similar formula holds for the updated weights w^{(new)}_{qn}
  • “Brake weights” W_q restrict the update step length
  • Large weights slow convergence

SLIDE 46

Releasing the Brakes

1. Best reverse-Jensen bounds
  • Guaranteed ascent, but impossibly conservative, e.g.,

W_q ≫ 10³ · ∑_{t} ∑_{n≠t} γ_{qt} (ϱ^num_{qnt} − ϱ^den_{qnt})

2. Less conservative weights are possible
  • Use approximations related to EBW heuristics (Afify, 2005)
  • Fix w_{qn}, only update the bandwidths
  • Reduced total weight, e.g.,

W_q ≈ 4 · ∑_{t} ∑_{n≠t} γ_{qt} (ϱ^num_{qnt} − ϱ^den_{qnt})

  • Always increased likelihood in experiments
SLIDE 48

Evaluation

Context-sensitive bandwidth improves on KDE-MMs (N = 3000). [Plot: test-set per-sample log-probability vs. Markov model order p, for KDE-MM, AR model, and KDE-HMM with M = 1]

SLIDE 49

Evaluation

KDE-HMMs yield greater model accuracy than linear AR-HMMs. [Plot: held-out per-sample log-probability vs. number of HMM states M, for Gaussian HMM, AR-HMMs (p = 1, 2, 3), and KDE-HMMs (p = 0, 1, 2, 3)]

SLIDE 50

Reference Data

Excerpt from the original laser data series. [Plot: laser intensity x_t vs. time index t]

SLIDE 51

Sample Output

Sample from the best linear AR-HMM (p = 3, M = 15 states). [Plot: output value x_t vs. time index t]

SLIDE 52

Sample Output

Sample from the best KDE-HMM (p = 3, M = 15). [Plot: output value x_t vs. time index t]

SLIDE 53

Second Dataset

KDE-HMMs are also superior to the other models on ECG data (N = 3000). [Plot: held-out per-sample log-probability vs. number of HMM states M, for Gaussian HMM, AR-HMMs (p = 1, 2, 3), and KDE-HMMs (p = 0, 1, 2, 3)]

SLIDE 54

Reference Data

Excerpt from the ECG data: empirical standard deviation σ_ECG ≈ 109. [Plot: ECG ADC value x_t vs. time index t]

SLIDE 55

Sample Output

Sample from the best linear AR-HMM (p = 3, M = 15): σ_AR ≈ 2490(!). [Plot: output value x_t vs. time index t]

SLIDE 56

Sample Output

Sample from the best KDE-HMM (p = 2, M = 13): σ_KDE ≈ 94.3. [Plot: output value x_t vs. time index t]

SLIDE 58

Successes

1. Theoretically powerful time-series model
  • Nonparametric, asymptotically consistent
2. Parameter update formulas
3. Better modelling of difficult nonlinear series than linear AR-HMMs
4. Compelling for signal synthesis
  • Converges on the true distribution
  • Probabilistic hybrid speech synthesis

SLIDE 59

Future Possibilities

  • Apply to speech
  • Glottal source data
  • Single-utterance synthesis
  • Also train the point-to-state assignments w_{qn} (realignment)
  • Adapt additional EBW heuristics from Woodland & Povey (2002)
  • Reduce the sample complexity from the infeasible O(N²)
  • Approximate kernel evaluations using, e.g., dual trees (Holmes, Gray, & Isbell, 2007)
  • Pseudo-likelihood maximisation is unsuitable
  • KDE methods are more developed for integrated square error
  • Unlike recognition, synthesis prioritises peaks rather than tails

SLIDE 60

The End

Thank you for listening!