
SLIDE 1

Non-parametric duration modelling for speech synthesis with a joint model of acoustics and duration

Gustav Eje Henter1, Srikanth Ronanki2, Oliver Watts2, and Simon King2

1Digital Content and Media Sciences Research Division,

National Institute of Informatics, Tokyo

2The Centre for Speech Technology Research (CSTR),

The University of Edinburgh, UK

Henter et al. (NII & UEDIN) Non-parametric TTS duration modelling 2017-01-20 1 / 40

SLIDE 2

Graphical overview

[Diagram: conventional TTS predicts phone durations with a phone-level duration model, then runs a frame-level acoustic model on text features to produce speech parameters; the proposed approach instead predicts per-frame transition probabilities, so duration and acoustic modelling run jointly at the frame level]

SLIDE 4

Key takeaways

  • Innovations
  • 1. Train an RNN/DNN to predict per-frame transition probabilities
  • 2. Generate durations using median or other distribution quantiles
  • Advantages
  • Non-parametric – can model any duration distribution shape!
  • Predicts acoustics and durations in tandem
  • Is a proper hidden semi-Markov model

SLIDE 5

Outline

  • 1. Background
  • 2. Formal specification
  • 3. Experiments
  • 4. Extensions

SLIDE 6

Motivation

  • Prosody remains a major shortcoming of TTS
  • Duration is an important prosodic component
  • State-of-the-art (Gaussian) duration models:
  • Allow non-positive durations
  • Do not sum to one on the integers (unnormalised)
  • Do not account for skewness
  • Are separate from the acoustic model

SLIDE 7

Real durations

Force-aligned durations from the dataset vs. a fitted Gaussian

[Figure: histogram of phone durations dp (frames) with a Gaussian fit; mean = 19.83 frames, median = 16 frames]

SLIDE 8

Statistical TTS

Statistical parametric speech synthesis requires three components:

  • 1. A stochastic distribution family fD(d; θ) for durations D
  • 2. A machine-learning predictor θ̂(l)
  • l are text-derived linguistic features
  • Predicts how duration distributions depend on text
  • Is learned from training data (statistical)
  • 3. A duration-generation principle
  • Mean-based generation:

d̂ = E(D | l)

SLIDE 9

HMM-based TTS

Speech is generated by a hidden Markov model (HMM)

  • Hidden-state models specified by:
  • Emissions: fO | S (o | s) (acoustic observations o)
  • Transition probability: P (St+1 = s + 1 | St = s) (durations)
  • State transitions follow a Markov process
  • State St tracks sub-phone time evolution
  • Training (EM-algorithm) is linear in sequence length

SLIDE 10

HMM-based durations

  • 1. Geometric duration distribution fD(d; a) = a(1 − a)^(d−1)
  • Implicit consequence of fixed HMM transition probability a
  • Memoryless (unrealistic)
  • 2. Regression tree (RT) predictor a (l)
  • 3. Mean-based generation

d̂ = E(D | l) ∝ 1/a(l)
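As a numerical sanity check on the geometric model above, truncating the pmf far into the tail recovers the mean E(D) = 1/a. This is a minimal numpy sketch; the exit probability a = 0.05 is an arbitrary illustrative value, not taken from the paper.

```python
import numpy as np

def geometric_pmf(d, a):
    # P(D = d) = a * (1 - a)**(d - 1), for d = 1, 2, ... (frames)
    return a * (1.0 - a) ** (d - 1)

a = 0.05                   # illustrative per-frame exit probability
d = np.arange(1, 2001)     # truncate the support far into the tail
pmf = geometric_pmf(d, a)

mean = float(np.sum(d * pmf))
print(round(mean, 3))      # -> 20.0, i.e. E(D) = 1/a
```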

SLIDE 11

HSMM-based TTS

Change to a hidden semi-Markov model (HSMM) (Zen et al., 2004)

  • Model specified by:
  • Emissions: fO | S (o | s)
  • Unchanged
  • Transition probability: P (St+1 = s + 1 | St = s, nt)
  • Can now depend on nt, time spent in current state
  • This is a semi-Markov process
  • Training complexity is now quadratic in sequence length

SLIDE 12

HSMM-based durations

  • 1. Any parametric distribution fD (d; θ) possible!
  • Gaussian distribution fD(d; θ) = fN(d; µ, σ²) is standard in HTS (Zen et al., 2007)

  • Log-normal (Campbell, 1989) or gamma (Huber, 1990)
  • 2. Regression tree (RT) predictor θ (l)
  • Unchanged
  • 3. Mean-based generation

d̂ = E(D | l) = µ(l)

  • Unchanged

SLIDE 13

NN-based durations

  • 1. Gaussian distribution fD(d; θ) = fN(d; µ, σ²)
  • Unchanged
  • 2. Deep or recurrent neural network µ (l)
  • DNNs/RNNs are more successful practical predictors
  • Typically, only µ is predicted (minimum MSE)
  • 3. Mean-based generation

d̂ = E(D | l) = µ(l)

  • Unchanged

Note: Data is force-aligned using an HMM/HSMM before training

SLIDE 14

Approaches in review

TTS type   fD(d; θ)   Level   Pred. θ̂(l)   Generation
Formant    –          Phone   Rule          –
Concat.    –          Phone   Exemplar      –
HMM        Geom.      State   RT            Mean
HSMM       Param.     State   RT            Mean
NN         Gauss.     State   NN            Mean

SLIDE 15

Approaches in review

TTS type   fD(d; θ)   Level    Pred. θ̂(l)   Generation
Formant    –          Phone    Rule          –
Concat.    –          Phone    Exemplar      –
HMM        Geom.      State    RT            Mean
HSMM       Param.     State    RT            Mean
NN         Gauss.     State    NN            Mean
Proposed   Non-par.   ≤Frame   NN            Quantile

SLIDE 16

Proposed approach

  • 1. General categorical distribution fD (d)
  • Not restricted to a specific parametric form
  • 2. Deep or recurrent neural network
  • Predicts a transition probability for each time unit (e.g., frame)
  • Runs in tandem with acoustic model
  • 3. Quantile-based generation
  • Can be computed using only the left tail P(D ≤ d) of fD
  • Median duration: a special case that is more probable (typical) than the mean
  • Benefits from statistical robustness (Henter et al., 2016)

SLIDE 17

Outline

  • 1. Background
  • 2. Formal specification
  • 3. Experiments
  • 4. Extensions

SLIDE 18

Preliminaries

  • p ∈ {1, . . . , P} is a phone/state index
  • t ∈ {1, . . . , T} is a time-step (frame) index
  • Dp is the (stochastic) duration of phone/state p
  • Outcome values dp ∈ Z>0 (the positive integers)
  • l p collects the per-phone linguistic features
  • The task is to generate durations: (l1, . . . , lP) → (d̂1, . . . , d̂P)

SLIDE 19

Conventional setup

  • Phone-level dataset Dp = ((l1, . . . , lP), (d1, . . . , dP))
  • Lp denotes the linguistic information influencing the predictor at p
  • Lp = (l1, . . . , lp) for a unidirectional RNN
  • A phone-level DNN/RNN d̂(Lp; W) predicts the duration directly
  • NN weights W chosen to minimise the MSE prediction error

Ŵ(Dp) = argminW Σp∈Dp (dp − d̂(Lp; W))²

  • The theoretical MSE minimiser is the expected duration
  • Frame-level acoustic modelling is a separate stage

SLIDE 20

Frame-level data

  • Frame-level sequence of linguistic features: Lt = (l1, . . . , lt) = (lp(1), . . . , lp(t))
  • p(t) is the current phone at frame t
  • t0 is the end frame of the previous phone
  • The current phone has lasted nt = t − t0 frames
  • Define per-frame indicator variables xt = I(nt = dp(t))
  • Equal to one if t is the last frame of phone p(t), and zero otherwise
  • Frame-level dataset Dt = (LT, (x1, . . . , xT))
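The indicator sequence xt can be built directly from force-aligned phone durations. A minimal sketch; the helper name and the toy durations are illustrative, not from the paper's code:

```python
import numpy as np

def durations_to_indicators(durations):
    # Expand per-phone durations d_p (in frames) into the per-frame
    # binary sequence x_t, where x_t = 1 iff frame t is the last
    # frame of its phone.
    xs = []
    for d in durations:
        xs.extend([0] * (d - 1) + [1])
    return np.array(xs)

x = durations_to_indicators([3, 1, 4])
print(x.tolist())  # -> [0, 0, 1, 1, 0, 0, 0, 1]
```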

SLIDE 21

Example

Example of binary xt sequence from database utterance

[Figure: per-frame binary transition indicator xt plotted over ~370 frames]

SLIDE 22

Transition probabilities

  • Idea: Consider the transition probability

πt = π(Lt) = P(Dp = nt | Dp ≥ nt, Lt)

  • 1 − πt is the probability of remaining in the same phone/state
  • This defines an unambiguous, proper duration distribution

P(Dp = nt | Lt) = π(Lt) · ∏_{t′ = t0+1}^{t0+nt−1} (1 − π(Lt′))

if and only if

  • πt ∈ [0, 1] for all t
  • ∏_{t′ = t0+1}^{∞} (1 − πt′) = 0 while p(t′) stays constant
  • All distributions on the positive integers can be written this way
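To see that per-frame transition probabilities define a proper distribution, the product above can be expanded numerically. A sketch with made-up π values; forcing the final π to 1 truncates the support to five frames:

```python
import numpy as np

def duration_pmf(pis):
    # P(D = n) = pi_n * prod_{k < n} (1 - pi_k), where pi_n is the
    # probability of leaving the phone at its n-th frame.
    pis = np.asarray(pis, dtype=float)
    survive = np.concatenate([[1.0], np.cumprod(1.0 - pis[:-1])])
    return pis * survive

# Illustrative per-frame exit probabilities for one phone
pis = [0.1, 0.3, 0.5, 0.8, 1.0]  # final 1.0 forces an exit
pmf = duration_pmf(pis)
print(pmf.round(4).tolist())     # -> [0.1, 0.27, 0.315, 0.252, 0.063]
```

The terms sum to one by construction, i.e. any choice of πt ∈ [0, 1] whose survival product vanishes yields a normalised distribution.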

SLIDE 23

Predicting transitions

  • Frame-level RNN x̂(Lt; W) predicts the transition indicator xt
  • RNN weights W can be trained to maximise likelihood...
  • ...or (as here) to minimise mean squared error

Ŵ(Dt) = argminW Σt (xt − x̂(Lt; W))²

  • Both criteria are optimised by the true transition probability P(Xt = 1 | Lt)
  • Non-parametric – can describe any duration distribution!
  • Since the NN can give a different output x̂ at every frame
  • Proper, positive, and possibly skewed, unlike Gaussians
  • Can be run at frame or sample level

SLIDE 24

From distribution to duration

  • Computing the mean of a general non-parametric distribution is not practical
  • Requires an infinite number of πt-evaluations
  • Tail probabilities can be computed from the left tail of the duration distribution only
  • Idea: Perform generation using quantiles, points where the tail probability reaches a certain value q

SLIDE 25

Quantiles are areas

The q-quantile d̂(q) is the point where the red area P(Dp ≤ d̂) equals q

[Figure: duration histogram with Gaussian fit; mean = 19.83 frames, median = 16 frames; the shaded left-tail area equals q]

SLIDE 26

Quantile-based generation

  • Mathematical definition:

d̂p(q) = min { nt : q ≤ P(Dp ≤ nt | Lt) },  where  P(Dp ≤ nt | Lt) = 1 − ∏_{t′ = t0+1}^{t0+nt} (1 − πt′)

  • Allows sequential generation with no lookahead
  • Choosing q = 1/2 gives median-based generation
  • The median is more probable (typical) than the mean, due to skewness
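Quantile-based generation only ever needs the running survival product, so it can be streamed frame by frame with no lookahead. A sketch; the constant-π stream is an illustrative stand-in for per-frame RNN outputs:

```python
def quantile_duration(pi_stream, q=0.5):
    # Advance frame by frame and stop at the first n where the
    # left-tail probability 1 - prod(1 - pi) reaches q.
    survival = 1.0
    for n, pi in enumerate(pi_stream, start=1):
        survival *= (1.0 - pi)
        if 1.0 - survival >= q:   # P(D <= n) has reached q
            return n
    raise ValueError("q-quantile not reached within the stream")

# A constant pi = 0.1 gives a geometric duration; its median is 7 frames
d = quantile_duration(iter([0.1] * 100), q=0.5)
print(d)  # -> 7
```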

SLIDE 27

Adding external memory

  • To express arbitrary distributions, the predictor must be capable of distinct predictions at every frame
  • Possible with an RNN x̂(Lt; W) due to its internal state
  • Not possible with a DNN x̂(lt; W), since lt = lp(t) is piecewise constant
  • Extension: Add a frame counter nt to the input features

l′t = [lt⊺ nt]⊺

  • An RNN x̂(L′t; W) no longer has to learn to track nt
  • A DNN x̂(l′t; W) is now capable of predicting arbitrary distributions
  • Since l′t changes with every frame
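The augmented input l′t is just each frame's feature vector with the within-phone frame count appended. A sketch, assuming per-frame features are stacked in a matrix; the helper name and the 1-D placeholder features are illustrative:

```python
import numpy as np

def add_frame_counter(l_frames, durations):
    # Append n_t (frames spent so far in the current phone) to each
    # frame's linguistic feature vector, giving l'_t = [l_t, n_t].
    counters = np.concatenate([np.arange(1, d + 1) for d in durations])
    return np.column_stack([l_frames, counters])

# Two phones of 2 and 3 frames, with 1-D placeholder features
l_frames = np.array([[0.5]] * 5)
aug = add_frame_counter(l_frames, [2, 3])
print(aug[:, 1].tolist())  # -> [1.0, 2.0, 1.0, 2.0, 3.0]
```

Because the counter column changes at every frame, even a feedforward DNN sees a distinct input per frame, which is the point of the extension.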

SLIDE 28

Outline

  • 1. Background
  • 2. Formal specification
  • 3. Experiments
  • 4. Extensions

SLIDE 29

Experiment setup

  • Blizzard Challenge 2016 data (Children’s audiobooks)
  • 4.3 hours of data, 4% (≈10 min) for testing
  • Feature extraction and code from Merlin (Wu et al., 2016)
  • We consider phone duration prediction only
  • No sub-phone states
  • No acoustic model/synthesis yet

SLIDE 30

Systems

  • 1. Two baselines trained on phone-level data Dp
  • Phone-DNN: feedforward DNN
  • Phone-LSTM: unidirectional simplified LSTM (Wu and King, 2016)
  • 2. Two proposed systems trained on frame-level data Dt
  • Frame-LSTM-I: unidirectional simplified LSTM without an external frame-counter input nt
  • Frame-LSTM-E: the same, but with the external frame-counter input nt
  • All used 5 hidden layers of 1024 tanh units each
  • Output layers had 512 (LSTM) or 1024 (DNN) linear units

SLIDE 31

Training and evaluation

  • Learning rate was manually tuned for each system
  • Maximum 25 epochs, with early stopping
  • Several evaluation metrics against forced-aligned durations:
  • Root mean squared error (RMSE)
  • Minimised by the true mean
  • Mean absolute error (MAE)
  • Minimised by the true median
  • Pearson correlation (Corr.)
  • Similar to RMSE, but higher is better

SLIDE 32

Example output

Frame-LSTM-E output x̂ in red, left-tail probability P(Dp ≤ nt) in blue, duration probability P(Dp = nt) in green

[Figure: the three curves plotted over ~300 frames of a test utterance]

SLIDE 33

Results

Model          RMSE    MAE     Corr.
Phone-DNN      8.037   4.759   0.750
Phone-LSTM     7.789   4.556   0.765
Frame-LSTM-I   8.254   4.610   0.761
Frame-LSTM-E   8.294   4.574   0.754

  • In MAE, Frame-LSTM-E beats Phone-DNN and is competitive with Phone-LSTM
  • Frame-LSTM-E is worse on vowels, but outdoes Phone-LSTM on all consonant classes except plosives
  • RMSE and correlation are less relevant, since these are not our targets

SLIDE 34

Outline

  • 1. Background
  • 2. Formal specification
  • 3. Experiments
  • 4. Extensions
  • Tuning the speaking rate
  • Refining alignments

SLIDE 35

Fast speech

Output durations shorter than data average, due to skewness

[Figure: duration histogram with Gaussian fit; mean = 19.83 frames, median = 16 frames]

(Natural speech)

SLIDE 36

Fast speech

Output durations shorter than data average, due to skewness

[Figure: duration histogram with Gaussian fit; mean = 16.94 frames, median = 14 frames]

(Median-based generation)

SLIDE 37

Matching the speaking rate

  • We can vary the generated quantile q (away from 1/2) to alter the speaking rate
  • Choose q̄ such that actual and generated mean phone durations match on Dp
  • This q̄ must satisfy

d̄ ≡ (1/|Dp|) Σp∈Dp dp = (1/|Dp|) Σp∈Dp d̂p(q̄)

  • The same idea can be used to enforce a specific utterance duration
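Finding q̄ is a one-dimensional root search over q. The sketch below uses bisection rather than the secant method, since q ↦ d̂(q) is a step function and bracketing is more robust; the geometric CDFs are illustrative stand-ins for the model's predicted per-phone distributions:

```python
import numpy as np

def quantile(cdf, q):
    # Smallest d with CDF(d) >= q; cdf[i] holds P(D <= i + 1)
    return int(np.searchsorted(cdf, q) + 1)

def match_rate(cdfs, target_mean, iters=40):
    # Bisect for q-bar so the mean generated duration hits target_mean
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        q = 0.5 * (lo + hi)
        if np.mean([quantile(c, q) for c in cdfs]) < target_mean:
            lo = q
        else:
            hi = q
    return 0.5 * (lo + hi)

# Illustrative geometric duration CDFs with different exit probabilities
N = 500
cdfs = [1.0 - (1.0 - a) ** np.arange(1, N + 1) for a in (0.05, 0.1, 0.2)]
target = np.mean([1 / 0.05, 1 / 0.1, 1 / 0.2])   # "data" mean duration
q_bar = match_rate(cdfs, target)
gen_mean = np.mean([quantile(c, q_bar) for c in cdfs])
```

Since the generated durations are integers, the match is only up to the step size of the mean-vs-q staircase; q̄ lands above 1/2 here because the median-based mean undershoots the data mean.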

SLIDE 38

A simple approximation

  • Finding q̄ requires iteration (e.g., the secant method)
  • Initialisation/rule of thumb: a q̂ based on the global duration distribution

q̂ = (1/|Dp|) Σp∈Dp I(dp ≤ d̄)

  • Can be computed prior to training
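The rule of thumb is a one-liner over the pooled training durations. Toy numbers below; the long-tailed value 40 mimics the right skew seen in Slide 7:

```python
import numpy as np

def rule_of_thumb_q(durations):
    # q_hat = fraction of training durations at or below the mean,
    # computable from the pooled duration distribution before training
    d = np.asarray(durations, dtype=float)
    return float(np.mean(d <= d.mean()))

# Skewed toy data: the mean is pulled above the median by the tail
q_hat = rule_of_thumb_q([5, 6, 7, 8, 9, 10, 40])
print(q_hat)  # -> 6/7, roughly 0.857
```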

SLIDE 39

Graphical demonstration

  • q̂ is the fraction of the area (red) that lies to the left of d̄ = 19.8

[Figure: duration histogram with Gaussian fit; mean = 19.83 frames, median = 16 frames; the shaded area left of the mean equals q̂]

SLIDE 40

Better aligned speech

  • Our joint models define emission and transition probabilities
  • Proper HSMM, but without parametric assumptions
  • HSMM theory and algorithms are directly applicable
  • Realignment using NNs can significantly improve TTS quality (Tokuda et al., 2016)

  • Fast, local refinements of alignment possible if using a DNN
  • An RNN can then be trained on improved alignments

SLIDE 41

Local refinement

  • Recompute training-data alignments using Viterbi algorithm
  • Constraint: Only allow phone boundaries to move ±N frames
  • Essentially dynamic time warping on a (2N + 1) × |S| matrix
  • Computational burden is O (N |S|)
  • Linear, not quadratic, in the number of states, |S|
  • Can be iterated until stable
  • Can be done every (few) epoch(s)
  • Similar ideas allow most-likely duration generation under a fixed global duration

SLIDE 42

Summary

  • We have proposed
  • 1. Training RNNs/DNNs to predict transition probabilities
  • 2. Using duration quantiles (e.g., the median) for output generation
  • This can describe any duration distribution
  • Predicted durations match baseline MAE
  • Synthesis, speaking-rate, and realignment are future work

SLIDE 44

The end

Thank you for listening!

SLIDE 45

References I

Campbell, W. N. (1989). Syllable-level duration determination. In Proc. Eurospeech, pages 2698–2701.

Henter, G. E., Ronanki, S., Watts, O., Wester, M., Wu, Z., and King, S. (2016). Robust TTS duration modelling using DNNs. In Proc. ICASSP, pages 5130–5134.

Huber, K. (1990). A statistical model of duration control for speech synthesis. In Proc. EUSIPCO, pages 1127–1130.

Tokuda, K., Hashimoto, K., Oura, K., and Nankaku, Y. (2016). Temporal modeling in neural network based statistical parametric speech synthesis. In Proc. SSW, pages 113–118.

Watts, O., Henter, G. E., Merritt, T., Wu, Z., and King, S. (2016). From HMMs to DNNs: where do the improvements come from? In Proc. ICASSP, pages 5505–5509.

SLIDE 46

References II

Wu, Z. and King, S. (2016). Investigating gated recurrent networks for speech synthesis. In Proc. ICASSP, pages 5140–5144.

Wu, Z., Watts, O., and King, S. (2016). Merlin: An open source neural network speech synthesis system. In Proc. SSW, pages 218–223.

Zen, H., Nose, T., Yamagishi, J., Sako, S., Masuko, T., Black, A., and Tokuda, K. (2007). The HMM-based speech synthesis system (HTS) version 2.0. In Proc. SSW, pages 294–299.

Zen, H., Tokuda, K., Masuko, T., Kobayashi, T., and Kitamura, T. (2004). Hidden semi-Markov model based speech synthesis. In Proc. Interspeech, pages 1393–1396.

SLIDE 47

Preliminary TTS experiment

  • Try a system with a joint frame-level acoustic-duration model
  • Did not improve perceived speech naturalness over the Merlin baseline
  • Not reported in the paper (out of space)
  • Caveats:
  • Baseline (two NNs) had significantly more parameters
  • Learning rate only tuned for the baseline
  • Experiment preceded the reported duration-prediction experiment
  • Baseline knows in advance when a phone is about to end
  • Such features improved quality in (Watts et al., 2016)
  • Proposed solution: Use the remaining probability mass and the previous-frame acoustic output ot−1 as extra inputs, similar to the external frame counter nt
