

SLIDE 1

Automatic Speech Recognition (CS753)

Lecture 24: Statistical Parametric Speech Synthesis

Instructor: Preethi Jyothi, Nov 2, 2017

Images on the first 11 slides are from Zen et al., “Statistical Parametric Speech Synthesis”, SPECOM 2009

SLIDE 2

Parametric Speech Synthesis Framework

  • Training
  • Estimate the acoustic model λ given speech utterances (O) and word sequences (W):

λ̂ = argmax_λ p(O | W, λ)

[Figure: pipeline. Speech Analysis and Text Analysis extract O and W from the training speech and text; Train Model produces λ̂; at synthesis time, Text Analysis, Parameter Generation (ô) and Speech Synthesis produce the output speech.]

SLIDE 3

Parametric Speech Synthesis Framework

  • Training
  • Estimate the acoustic model λ given speech utterances (O) and word sequences (W):

λ̂ = argmax_λ p(O | W, λ)

  • Synthesis
  • Find the most probable output ô from λ̂ and a given word sequence w to be synthesised:

ô = argmax_o p(o | w, λ̂)

  • Synthesize speech from ô

HMMs serve as the acoustic model. [Figure: same pipeline as slide 2.]

SLIDE 4

HMM-based speech synthesis

[Figure: system diagram. Training part: speech database → spectral and excitation parameter extraction → training of context-dependent HMMs & duration models, using labels. Synthesis part: text → text analysis → labels → parameter generation from HMM → spectral and excitation parameters → excitation generation → synthesis filter → synthesized speech.]

SLIDE 5

Speech parameter generation

Generate the most probable observation vectors given the HMM and w:

ô = argmax_o p(o | w, λ̂)
  = argmax_o Σ_{∀q} p(o, q | w, λ̂)
  ≈ argmax_o max_q p(o, q | w, λ̂)
  = argmax_o max_q p(o | q, λ̂) P(q | w, λ̂)

Determine the best state sequence and outputs sequentially (let’s explore this first):

q̂ = argmax_q p(q | w, λ̂)
ô = argmax_o p(o | q̂, λ̂)

SLIDE 6

Determining state outputs

ô = argmax_o p(o | q̂, λ̂) = argmax_o N(o; μ_q̂, Σ_q̂)

where o = [o_1⊤, …, o_T⊤]⊤ is a state-output vector sequence to be generated, q = {q_1, …, q_T} is a state sequence, μ_q = [μ_{q_1}⊤, …, μ_{q_T}⊤]⊤ is the mean vector for q, and Σ_q = diag[Σ_{q_1}, …, Σ_{q_T}] is the covariance matrix for q.

What would ô look like? [Figure: ô follows the sequence of state means, with the state variances around them.]

SLIDE 7

Adding dynamic features to state outputs

State output vectors contain both static (c_t) and dynamic (Δc_t) features:

o_t = [c_t⊤, Δc_t⊤]⊤

where the dynamic feature is calculated as Δc_t = c_t − c_{t−1}.

The relationship between o and c can be arranged in matrix form as o = Wc:

[…, c_{t−1}⊤, Δc_{t−1}⊤, c_t⊤, Δc_t⊤, c_{t+1}⊤, Δc_{t+1}⊤, …]⊤ = W […, c_{t−2}⊤, c_{t−1}⊤, c_t⊤, c_{t+1}⊤, …]⊤   (17)

where each static row block of W is [⋯ I ⋯] and each dynamic row block is [⋯ −I I ⋯].
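The band structure of W is easy to build programmatically. A minimal NumPy sketch (my own illustration, not from the slides), assuming the boundary condition Δc_1 = c_1, i.e. c_0 is taken as zero:

```python
import numpy as np

def build_window_matrix(T, dim):
    """Map static features c = [c_1, ..., c_T] to stacked static+delta
    observations o = [c_1, Dc_1, ..., c_T, Dc_T] via o = W c.
    Uses Dc_t = c_t - c_{t-1}, with c_0 = 0 (a boundary assumption)."""
    I = np.eye(dim)
    W = np.zeros((2 * T * dim, T * dim))
    for t in range(T):
        rs, rd = 2 * t * dim, (2 * t + 1) * dim   # row offsets: static, delta
        W[rs:rs + dim, t * dim:(t + 1) * dim] = I       # static block: +I
        W[rd:rd + dim, t * dim:(t + 1) * dim] = I       # delta block: +I
        if t > 0:
            W[rd:rd + dim, (t - 1) * dim:t * dim] = -I  # delta block: -I
    return W

# o interleaves each c_t with its delta Dc_t = c_t - c_{t-1}
W = build_window_matrix(3, 1)
o = W @ np.array([1.0, 3.0, 6.0])
```

Here o comes out as [1, 1, 3, 2, 6, 3]: each static value followed by its delta.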

SLIDE 8

Speech parameter generation

  • Introducing dynamic feature constraints:

ô = argmax_o p(o | q̂, λ̂), where o = Wc

  • If the output distributions are single Gaussians:

p(o | q̂, λ̂) = N(o; μ_q̂, Σ_q̂)

  • Then, by setting ∂ log N(o; μ_q̂, Σ_q̂)/∂c = 0, we get:

W⊤ Σ_q̂⁻¹ W c = W⊤ Σ_q̂⁻¹ μ_q̂
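Given W, this set of linear equations can be solved directly for the static trajectory c. A hedged NumPy sketch; the tiny W, means and variances below are made-up illustrative values, not from any real model:

```python
import numpy as np

def ml_trajectory(W, mu, var):
    """Solve W^T S^{-1} W c = W^T S^{-1} mu for c, with S = diag(var)."""
    Sinv = np.diag(1.0 / var)
    A = W.T @ Sinv @ W
    b = W.T @ Sinv @ mu
    return np.linalg.solve(A, b)

# Toy example, T = 2 frames of 1-dim features.
# Rows of W correspond to [static c1, delta c1, static c2, delta c2],
# with delta c1 = c1 (c0 taken as zero).
W = np.array([[1.0, 0.0],
              [1.0, 0.0],
              [0.0, 1.0],
              [-1.0, 1.0]])
mu = np.array([2.0, 0.0, 5.0, 0.0])     # per-row state means
var = np.array([0.01, 1e6, 0.01, 1e6])  # huge delta variances: deltas barely constrain c
c = ml_trajectory(W, mu, var)           # ~ [2, 5]: c follows the static means
```

With tight delta variances instead, the delta means pull neighbouring frames together, which is exactly how dynamic features smooth the generated trajectory.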

SLIDE 9

Synthesis overview

[Figure: from the sentence HMM (clustered and merged states, with static and delta Gaussians per state) to the ML trajectory of generated parameters.]

SLIDE 10

Speech parameter generation

Generate the most probable observation vectors given the HMM and w:

ô = argmax_o p(o | w, λ̂)
  = argmax_o Σ_{∀q} p(o, q | w, λ̂)
  ≈ argmax_o max_q p(o, q | w, λ̂)
  = argmax_o max_q p(o | q, λ̂) P(q | w, λ̂)

Determine the best state sequence and outputs sequentially (let’s explore this next):

q̂ = argmax_q p(q | w, λ̂)
ô = argmax_o p(o | q̂, λ̂)

SLIDE 11

Duration modeling

  • How are durations modelled within an HMM?
  • Implicitly modelled by state self-transition probabilities
  • PMFs of state durations are geometric distributions:

p_k(d) = a_kk^(d−1) · (1 − a_kk)

[Figure: state duration probability p_k(d) against duration d (frames), for a_kk = 0.6.]

  • State durations are determined by maximising:

log P(d | λ) = Σ_{j=1}^N log p_j(d_j)

  • What would this solution look like if the PMFs of state durations are geometric distributions?
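The question above can be checked numerically with a small pure-Python sketch (my own illustration): for any a_kk < 1 the geometric PMF is strictly decreasing in d, so maximising log P(d | λ) without further constraints always picks the shortest duration d = 1 for every state, which motivates the explicit duration models that follow.

```python
def geometric_duration_pmf(d, a_kk):
    """p_k(d) = a_kk**(d-1) * (1 - a_kk): probability of staying in
    state k for exactly d frames, given self-transition prob a_kk."""
    return a_kk ** (d - 1) * (1 - a_kk)

# PMF values for d = 1..10 with a_kk = 0.6, as in the slide's plot
probs = [geometric_duration_pmf(d, 0.6) for d in range(1, 11)]

# each value is 0.6x the previous one, so the mode is always d = 1
best_d = 1 + probs.index(max(probs))
```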

SLIDE 12

Explicit modeling of state durations

  • Each state duration is explicitly modelled as a single Gaussian.

The mean ξ(i) and variance σ²(i) of the duration density of state i:

ξ(i) = [ Σ_{t0=1}^T Σ_{t1=t0}^T χ_{t0,t1}(i) (t1 − t0 + 1) ] / [ Σ_{t0=1}^T Σ_{t1=t0}^T χ_{t0,t1}(i) ]

σ²(i) = [ Σ_{t0=1}^T Σ_{t1=t0}^T χ_{t0,t1}(i) (t1 − t0 + 1)² ] / [ Σ_{t0=1}^T Σ_{t1=t0}^T χ_{t0,t1}(i) ] − ξ²(i)

where

χ_{t0,t1}(i) = (1 − γ_{t0−1}(i)) · Π_{t=t0}^{t1} γ_t(i) · (1 − γ_{t1+1}(i))

and γ_t(i) is the probability of being in state i at time t.

SLIDE 13

Determining state durations

During synthesis, for a given speech length T, the goal is to maximize:

log P(d | λ, T) = Σ_{k=1}^K log p_k(d_k)    … (1)

under the constraint that T = Σ_{k=1}^K d_k.

We saw that each duration density can be modelled as a single Gaussian:

p_k(d_k) = N(·; ξ_k, σ_k²)

The state durations d_k, k = 1 … K, which maximise (1) are given by:

d_k = ξ(k) + ρ · σ²(k),    ρ = (T − Σ_{k=1}^K ξ(k)) / Σ_{k=1}^K σ²(k)

where ξ(k) and σ²(k) are the mean and variance of the duration density of state k, and ρ controls the speaking rate.
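The closed-form solution above is straightforward to implement. A minimal sketch in pure Python; the state means, variances and total length are made-up numbers for illustration:

```python
def assign_durations(T, xi, var):
    """Distribute T frames over K states with Gaussian duration densities
    N(xi_k, var_k): d_k = xi_k + rho * var_k, with
    rho = (T - sum(xi)) / sum(var).
    Durations come out real-valued; rounding to whole frames is left aside."""
    rho = (T - sum(xi)) / sum(var)
    return [x + rho * v for x, v in zip(xi, var)]

# Three states expecting 10, 20 and 30 frames; stretch to T = 70.
# The extra 10 frames go mostly to the states with the largest
# duration variance, and the result sums exactly to T.
d = assign_durations(70, xi=[10, 20, 30], var=[4, 1, 5])
# d == [14.0, 21.0, 35.0]
```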

SLIDE 14

Synthesis using duration models

[Figure: text → context-dependent duration models and context-dependent HMMs → sentence HMM with state duration densities d_1, d_2, … (given T or ρ) → mel-cepstrum parameters c_1, …, c_T → MLSA filter driven by pitch → synthetic speech.]

Image from Yoshimura et al., “Duration modelling for HMM-based speech synthesis”, ICSLP ’98

SLIDE 15

DNN-based speech synthesis

[Figure: at each frame t = 1, …, T, input features x_t (binary & numeric features from text analysis and input feature extraction) pass through the hidden layers h_t of a DNN to outputs y_t, the statistics (mean & variance) of the speech parameter vector sequence; parameter generation and waveform synthesis then produce speech.]

Image from Zen et al., “Statistical Parametric Speech Synthesis using DNNs”, 2014

SLIDE 16

DNN-based speech synthesis

[Figure: same DNN architecture as slide 15.]

  • Input features include linguistic contexts and numeric values (# of words, duration of the phoneme, etc.)
  • Output features are spectral and excitation parameters and their delta values
  • Listening test results:

Table 1. Preference scores (%) between speech samples from the HMM and DNN-based systems. The systems which achieved significantly better preference at the p < 0.01 level are in bold font.

HMM (α)   | DNN (#layers × #units) | Neutral | p value | z value
15.8 (16) | 38.5 (4 × 256)         | 45.7    | < 10⁻⁶  | 9.9
16.1 (4)  | 27.2 (4 × 512)         | 56.8    | < 10⁻⁶  | 5.1
12.7 (1)  | 36.6 (4 × 1024)        | 50.7    | < 10⁻⁶  | 11.5

Image from Zen et al., “Statistical Parametric Speech Synthesis using DNNs”, 2014
SLIDE 17

RNN-based speech synthesis

[Figure: text → text analysis → input feature extraction → input features → biLSTM layers → output features → vocoder → waveform.]

  • Access long-range context in both forward and backward directions using biLSTMs
  • Inference is expensive; biLSTMs inherently have large latency

Image from Fan et al., “TTS synthesis with BLSTM-based RNNs”, 2014
SLIDE 18

Streaming synthesis using LSTMs

[Figure: text → text analysis → linguistic feature extraction → phoneme-level linguistic features x(i) → duration LSTM-RNN Λ_d → phoneme durations d̂(i) → frame-level linguistic features → acoustic LSTM-RNN Λ_a → acoustic features ŷ(i) → vocoder → waveform.]

  • Duration prediction, acoustic feature prediction and vocoding are done in streaming fashion:

1:  Perform text analysis over input text
2:  Extract {x(i)} for i = 1, …, N
3:  for i = 1, …, N do            ▷ Loop over phonemes
4:    Predict d̂(i) given x(i) by Λ_d
5:    for τ = 1, …, d̂(i) do       ▷ Loop over frames
6:      Compose x_τ(i) from x(i), τ, and d̂(i)
7:      Predict ŷ_τ(i) given x_τ(i) by Λ_a
8:      Synthesize waveform given ŷ_τ(i), then stream result
9:    end for
10: end for

Image from Zen & Sak, “Unidirectional LSTM RNNs for low-latency speech synthesis”, 2015
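The streaming loop above can be sketched as a generator that emits audio as soon as each frame is ready. The three models below are hypothetical stubs: predict_duration, predict_acoustics and vocode stand in for Λ_d, Λ_a and the vocoder, and are not the paper's networks:

```python
def predict_duration(x):
    """Stub for the duration model: a fixed 3 frames per phoneme."""
    return 3

def predict_acoustics(x, tau, d):
    """Stub for the acoustic model: dummy features for frame tau."""
    return (x, tau, d)

def vocode(y):
    """Stub vocoder: one 'audio chunk' per acoustic frame."""
    phoneme, tau, _ = y
    return f"chunk({phoneme},{tau})"

def stream_synthesize(phoneme_features):
    """Steps 3-10 of the algorithm: yield audio chunks one at a time,
    rather than waiting for the whole utterance."""
    for x in phoneme_features:               # loop over phonemes
        d = predict_duration(x)              # predict duration for x(i)
        for tau in range(1, d + 1):          # loop over frames
            y = predict_acoustics(x, tau, d) # compose features, predict
            yield vocode(y)                  # synthesize and stream

chunks = list(stream_synthesize(["p1", "p2"]))
```

Because stream_synthesize is a generator, a consumer can start playback after the first chunk; collecting into a list here is only for inspection.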

SLIDE 19

Streaming synthesis using LSTMs

[Figure: same streaming architecture as slide 18.]

  • Comparing DNN versus LSTM-RNN systems:

Model    | # of params | 5-scale MOS
DNN      | 3 749 79    | 3.370 ± 0.114
LSTM-RNN | 476 435     | 3.723 ± 0.105

Image from Zen & Sak, “Unidirectional LSTM RNNs for low-latency speech synthesis”, 2015