 
Automatic Speech Recognition (CS753)
Lecture 24: Statistical Parametric Speech Synthesis
Instructor: Preethi Jyothi
Nov 2, 2017
Images on the first 11 slides are from Zen et al., “Statistical Parametric Speech Synthesis”, SPECOM 2009
Parametric Speech Synthesis Framework
[Figure: block diagram. Training: speech → Speech Analysis → O and text → Text Analysis → W feed Train Model, giving λ̂. Synthesis: text → Text Analysis, then Parameter Generation with λ̂ gives ô, then Speech Synthesis.]
Training
• Estimate the acoustic model given speech utterances (O) and word sequences (W). The models are HMMs!
$$\hat{\lambda} = \arg\max_{\lambda} p(O \mid W, \lambda)$$
Synthesis
• Find the most probable ô from λ̂ and a given word sequence to be synthesised, w:
$$\hat{o} = \arg\max_{o} p(o \mid w, \hat{\lambda})$$
• Synthesize speech from ô
HMM-based speech synthesis
[Figure: Training part: SPEECH DATABASE → excitation parameter extraction and spectral parameter extraction → excitation and spectral parameters, together with labels → training of HMM → context-dependent HMMs & duration models. Synthesis part: TEXT → text analysis → labels → parameter generation from HMM → excitation and spectral parameters → excitation generation → synthesis filter → SYNTHESIZED SPEECH.]
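Speech parameter generation
Generate the most probable observation vectors given the HMM and w:
$$
\begin{aligned}
\hat{o} &= \arg\max_{o} \, p(o \mid w, \hat{\lambda}) \\
&= \arg\max_{o} \sum_{\forall q} p(o, q \mid w, \hat{\lambda}) \\
&\approx \arg\max_{o} \max_{q} \, p(o, q \mid w, \hat{\lambda}) \\
&= \arg\max_{o} \max_{q} \, p(o \mid q, \hat{\lambda}) \, P(q \mid w, \hat{\lambda})
\end{aligned}
$$
Determine the best state sequence and outputs sequentially:
$$\hat{q} = \arg\max_{q} P(q \mid w, \hat{\lambda})$$
$$\hat{o} = \arg\max_{o} \, p(o \mid \hat{q}, \hat{\lambda}) \quad \text{(let's explore this first)}$$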
Determining state outputs
$$\hat{o} = \arg\max_{o} \, p(o \mid \hat{q}, \hat{\lambda}) = \arg\max_{o} \, \mathcal{N}(o; \mu_{\hat{q}}, \Sigma_{\hat{q}})$$
where $o = [o_1^\top, \ldots, o_T^\top]^\top$ is a state-output vector sequence to be generated, $q = \{q_1, \ldots, q_T\}$ is a state sequence, and $\mu_q = [\mu_{q_1}^\top, \ldots, \mu_{q_T}^\top]^\top$ is the mean vector for q. Here, $\Sigma_q = \mathrm{diag}[\Sigma_{q_1}, \ldots, \Sigma_{q_T}]$ is the covariance matrix for q.
What would ô look like?
[Figure: per-state mean and variance over time]
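One way to approach the question on this slide: a Gaussian density peaks at its mean, so without any further constraint ô is simply the sequence of state means, which is piecewise constant and jumps at every state boundary. The sketch below illustrates this for a 1-D feature. The state means, variances and durations are made-up values chosen only for illustration.

```python
import numpy as np

# Hypothetical 1-D example: three states with made-up statistics.
state_means = np.array([1.0, 3.0, 2.0])   # mean of each state (assumed values)
state_vars = np.array([0.5, 0.2, 0.4])    # variance of each state (assumed values)
durations = [3, 2, 4]                     # frames spent in each state (assumed)

# Expand the state sequence q_hat frame by frame.
q_hat = np.repeat(np.arange(len(durations)), durations)
mu_q = state_means[q_hat]
var_q = state_vars[q_hat]


def log_gaussian(o, mu, var):
    """Log density of a diagonal-covariance Gaussian, summed over frames."""
    return float(np.sum(-0.5 * (np.log(2 * np.pi * var) + (o - mu) ** 2 / var)))


# The density is maximised at its mean, so o_hat equals the mean sequence.
o_hat = mu_q.copy()
print(o_hat)  # [1. 1. 1. 3. 3. 2. 2. 2. 2.]

# Any perturbation of o_hat lowers the log density.
print(log_gaussian(o_hat, mu_q, var_q) > log_gaussian(o_hat + 0.1, mu_q, var_q))  # True
```

The step-like output is what motivates adding dynamic features as a smoothness constraint.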
Adding dynamic features to state outputs
State output vectors contain both static ($c_t$) and dynamic ($\Delta c_t$) features:
$$o_t = [c_t^\top, \Delta c_t^\top]^\top, \quad \text{where } \Delta c_t = c_t - c_{t-1}$$
o and c arranged in matrix form, $o = Wc$:
$$
\begin{bmatrix} \vdots \\ c_{t-1} \\ \Delta c_{t-1} \\ c_t \\ \Delta c_t \\ c_{t+1} \\ \Delta c_{t+1} \\ \vdots \end{bmatrix}
=
\begin{bmatrix}
\ddots & \vdots & \vdots & \vdots & \vdots & \\
\cdots & 0 & I & 0 & 0 & \cdots \\
\cdots & -I & I & 0 & 0 & \cdots \\
\cdots & 0 & 0 & I & 0 & \cdots \\
\cdots & 0 & -I & I & 0 & \cdots \\
\cdots & 0 & 0 & 0 & I & \cdots \\
\cdots & 0 & 0 & -I & I & \cdots \\
 & \vdots & \vdots & \vdots & \vdots & \ddots
\end{bmatrix}
\begin{bmatrix} \vdots \\ c_{t-2} \\ c_{t-1} \\ c_t \\ c_{t+1} \\ \vdots \end{bmatrix}
$$
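The relation o = Wc can be checked numerically. The following is a minimal sketch, assuming the simple delta Δc_t = c_t − c_{t−1} from this slide and, as a boundary convention not stated on the slide, c_0 = 0 so that Δc_1 = c_1. The function name `build_delta_window_matrix` and the sample values are hypothetical.

```python
import numpy as np


def build_delta_window_matrix(num_frames, dim):
    """Return W of shape (2*T*dim, T*dim) so that o = W @ c stacks [c_t, delta c_t]."""
    T, D = num_frames, dim
    identity = np.eye(D)
    W = np.zeros((2 * T * D, T * D))
    for t in range(T):
        static_rows = slice(2 * t * D, (2 * t + 1) * D)
        delta_rows = slice((2 * t + 1) * D, (2 * t + 2) * D)
        cols_t = slice(t * D, (t + 1) * D)
        W[static_rows, cols_t] = identity        # row block producing c_t
        W[delta_rows, cols_t] = identity         # +c_t term of delta c_t
        if t > 0:
            cols_prev = slice((t - 1) * D, t * D)
            W[delta_rows, cols_prev] = -identity  # -c_{t-1} term of delta c_t
        # Assumption: for the first frame c_{t-1} is taken as 0.
    return W


c = np.array([1.0, 4.0, 9.0, 16.0])   # made-up static features, dim = 1
W = build_delta_window_matrix(num_frames=4, dim=1)
o = W @ c

# Each row is [c_t, delta c_t]: [1, 1], [4, 3], [9, 5], [16, 7]
print(o.reshape(4, 2))
```

Because W is fixed by the delta definition, generating o reduces to choosing the static sequence c.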
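Speech parameter generation
• Introducing dynamic feature constraints:
$$\hat{o} = \arg\max_{o} \, p(o \mid \hat{q}, \hat{\lambda}) \quad \text{where } o = Wc$$
• If the output distributions are single Gaussians:
$$p(o \mid \hat{q}, \hat{\lambda}) = \mathcal{N}(o; \mu_{\hat{q}}, \Sigma_{\hat{q}})$$
• Then, by setting $\partial \log \mathcal{N}(Wc; \mu_{\hat{q}}, \Sigma_{\hat{q}}) / \partial c = 0$, we get:
$$W^\top \Sigma_{\hat{q}}^{-1} W c = W^\top \Sigma_{\hat{q}}^{-1} \mu_{\hat{q}}$$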
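The linear system on this slide can be solved directly for short sequences. Below is a minimal sketch for a 1-D static feature with diagonal covariances, assuming the delta Δc_t = c_t − c_{t−1} with c_0 = 0 at the boundary. The helper names `build_W` and `ml_trajectory` and all numeric values are hypothetical. Practical systems typically exploit the banded structure of WᵀΣ⁻¹W rather than calling a dense solver.

```python
import numpy as np


def build_W(num_frames):
    """W for 1-D features: row 2t is the static row, row 2t+1 is the delta row."""
    W = np.zeros((2 * num_frames, num_frames))
    for t in range(num_frames):
        W[2 * t, t] = 1.0
        W[2 * t + 1, t] = 1.0
        if t > 0:
            W[2 * t + 1, t - 1] = -1.0   # assumption: c_0 = 0 for the first frame
    return W


def ml_trajectory(mu, var):
    """Solve W^T S^-1 W c = W^T S^-1 mu.

    mu, var: arrays of shape (T, 2) holding [static, delta] means and variances
    for each frame of the chosen state sequence.
    """
    W = build_W(mu.shape[0])
    precision = 1.0 / var.reshape(-1)          # diagonal of Sigma^-1
    A = W.T @ (precision[:, None] * W)         # W^T Sigma^-1 W
    b = W.T @ (precision * mu.reshape(-1))     # W^T Sigma^-1 mu
    return np.linalg.solve(A, b)


# Made-up statistics: two states of 5 frames, static means 0 then 1, delta means 0.
T = 10
mu = np.zeros((T, 2))
mu[5:, 0] = 1.0
var = np.ones((T, 2))
var[:, 1] = 0.1                                # tight delta variance favours smoothness

c_hat = ml_trajectory(mu, var)
print(np.round(c_hat, 3))                      # a smooth rise instead of a step
```

Shrinking the delta variances pulls the trajectory toward a smoother curve, while shrinking the static variances pulls it back toward the step-like mean sequence.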
Synthesis overview
[Figure panels: Clustered states; Merged states; Sentence HMM; Static and Delta Gaussians; ML trajectory]
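Speech parameter generation
Generate the most probable observation vectors given the HMM and w:
$$
\begin{aligned}
\hat{o} &= \arg\max_{o} \, p(o \mid w, \hat{\lambda}) \\
&= \arg\max_{o} \sum_{\forall q} p(o, q \mid w, \hat{\lambda}) \\
&\approx \arg\max_{o} \max_{q} \, p(o, q \mid w, \hat{\lambda}) \\
&= \arg\max_{o} \max_{q} \, p(o \mid q, \hat{\lambda}) \, P(q \mid w, \hat{\lambda})
\end{aligned}
$$
Determine the best state sequence and outputs sequentially:
$$\hat{q} = \arg\max_{q} P(q \mid w, \hat{\lambda}) \quad \text{(let's explore this next)}$$
$$\hat{o} = \arg\max_{o} \, p(o \mid \hat{q}, \hat{\lambda})$$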
Duration modeling
• How are durations modelled within an HMM?
• Implicitly modelled by state self-transition probabilities: $p_k(d) = a_{kk}^{d-1} \cdot (1 - a_{kk})$
• PMFs of state durations are geometric distributions
[Figure: state duration probability $p_k(d)$ against state duration d (frames), for d = 1 to 10 and $a_{kk} = 0.6$]
• State durations are determined by maximising:
$$\log P(d \mid \lambda) = \sum_{j=1}^{N} \log p_j(d_j)$$
• What would this solution look like if the PMFs of state durations are geometric distributions?
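A short numerical check helps with the last question. Since a geometric PMF decreases strictly with d, maximising each log p_j(d_j) independently picks d_j = 1 for every state, which would give an unrealistically short utterance. The sketch below evaluates the PMF for a_kk = 0.6, the value used in the plot; the function name is hypothetical.

```python
import numpy as np


def geometric_duration_pmf(d, a_kk):
    """p_k(d) = a_kk^(d-1) * (1 - a_kk), the implicit HMM state duration PMF."""
    return a_kk ** (d - 1) * (1.0 - a_kk)


a_kk = 0.6
d = np.arange(1, 11)
pmf = geometric_duration_pmf(d, a_kk)

best_d = int(d[np.argmax(pmf)])
print(best_d)                       # 1: the most probable duration is a single frame

# The PMF is strictly decreasing, so this holds for any 0 < a_kk < 1.
print(bool(np.all(np.diff(pmf) < 0)))  # True
```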
Explicit modeling of state durations
• Each state duration is explicitly modelled as a single Gaussian.
• The mean ξ(i) and variance σ²(i) of the duration density of state i:
$$\xi(i) = \frac{\sum_{t_0=1}^{T} \sum_{t_1=t_0}^{T} \chi_{t_0,t_1}(i) \, (t_1 - t_0 + 1)}{\sum_{t_0=1}^{T} \sum_{t_1=t_0}^{T} \chi_{t_0,t_1}(i)}$$
$$\sigma^2(i) = \frac{\sum_{t_0=1}^{T} \sum_{t_1=t_0}^{T} \chi_{t_0,t_1}(i) \, (t_1 - t_0 + 1)^2}{\sum_{t_0=1}^{T} \sum_{t_1=t_0}^{T} \chi_{t_0,t_1}(i)} - \xi^2(i)$$
where
$$\chi_{t_0,t_1}(i) = (1 - \gamma_{t_0-1}(i)) \cdot \prod_{t=t_0}^{t_1} \gamma_t(i) \cdot (1 - \gamma_{t_1+1}(i))$$
and $\gamma_t(i)$ is the probability of being in state i at time t.
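The duration statistics can be computed from the state occupancies with two nested loops. The following is a minimal sketch for a single state i, assuming γ_0(i) = γ_{T+1}(i) = 0 at the utterance boundaries (a convention not stated on the slide). The function name `duration_stats` and the example occupancies are hypothetical.

```python
import numpy as np


def duration_stats(gamma):
    """Return (xi, sigma2) for one state.

    gamma: 1-D array with gamma[t-1] = gamma_t(i) for t = 1..T.
    """
    T = len(gamma)
    # Assumption: the state is not occupied before frame 1 or after frame T.
    g = np.concatenate(([0.0], np.asarray(gamma, dtype=float), [0.0]))
    weighted_len = weighted_len_sq = total = 0.0
    for t0 in range(1, T + 1):
        occupancy_product = 1.0
        for t1 in range(t0, T + 1):
            occupancy_product *= g[t1]            # prod_{t=t0}^{t1} gamma_t(i)
            chi = (1.0 - g[t0 - 1]) * occupancy_product * (1.0 - g[t1 + 1])
            length = t1 - t0 + 1
            weighted_len += chi * length
            weighted_len_sq += chi * length ** 2
            total += chi
    xi = weighted_len / total
    sigma2 = weighted_len_sq / total - xi ** 2
    return xi, sigma2


# Hard alignment: the state is occupied for exactly frames 2..4.
xi_hard, var_hard = duration_stats([0.0, 1.0, 1.0, 1.0, 0.0, 0.0])
print(xi_hard, var_hard)   # 3.0 0.0

# Soft occupancies (made-up values) give a fractional mean and a nonzero variance.
xi_soft, var_soft = duration_stats([0.1, 0.9, 0.8, 0.2])
print(xi_soft, var_soft)
```

With a hard alignment only one segment has nonzero weight, so the mean is the segment length and the variance is zero. In training these sums would be accumulated over all occurrences of the state.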
Determining state durations
During synthesis, for a given speech length T, the goal is to maximize:
$$\log P(d \mid \lambda, T) = \sum_{k=1}^{K} \log p_k(d_k) \quad \ldots (1)$$
under the constraint that $T = \sum_{k=1}^{K} d_k$.
We saw that each duration density $p_k(d_k)$ can be modelled as a single Gaussian $\mathcal{N}(\cdot\,; \xi(k), \sigma^2(k))$.
State durations $d_k$, $k = 1 \ldots K$, which maximise (1) are given by:
$$d_k = \xi(k) + \rho \cdot \sigma^2(k)$$
$$\rho = \left( T - \sum_{k=1}^{K} \xi(k) \right) \Big/ \sum_{k=1}^{K} \sigma^2(k)$$
where ξ(k) and σ²(k) are the mean and variance of the duration density of state k, and ρ controls the speaking rate.
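As a worked example of these two formulas, the sketch below computes ρ and the state durations for three states. The means, variances and target length are made-up values, and the function name is hypothetical. The resulting durations are real-valued; rounding them to whole frames is left out here.

```python
import numpy as np


def state_durations(xi, sigma2, total_frames):
    """Return (d, rho) with d_k = xi_k + rho * sigma2_k and sum(d) = total_frames."""
    xi = np.asarray(xi, dtype=float)
    sigma2 = np.asarray(sigma2, dtype=float)
    rho = (total_frames - xi.sum()) / sigma2.sum()
    return xi + rho * sigma2, rho


xi = [3.0, 5.0, 4.0]        # duration means per state (assumed values)
sigma2 = [1.0, 2.0, 1.0]    # duration variances per state (assumed values)

d, rho = state_durations(xi, sigma2, total_frames=16)
print(rho)        # 1.0
print(d)          # [4. 7. 5.]
print(d.sum())    # 16.0
```

States with a larger variance absorb more of the stretch, and setting ρ = 0 recovers the mean durations.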
Synthesis using duration models
[Figure: TEXT is converted into a sentence HMM using context dependent HMMs and context dependent duration models. The state duration densities, together with T or ρ, give the state durations d_1, d_2, ... The sentence HMM then generates the mel-cepstrum c_1, ..., c_T and pitch, which drive the MLSA filter to produce SYNTHETIC SPEECH.]
Image from Yoshimura et al., “Duration modelling for HMM-based speech synthesis”, ICSLP ‘98
DNN-based speech synthesis
[Figure: TEXT → text analysis → input feature extraction → input features (binary & numeric) for frames 1 to T → input layer → hidden layers → output layer → statistics (mean & var) of the speech parameter vector sequence → parameter generation → waveform synthesis → SPEECH]
• Input features about linguistic contexts, numeric values (# of words, duration of the phoneme, etc.)
• Output features are spectral and excitation parameters and their delta values
• Listening test results:
Table 1. Preference scores (%) between speech samples from the HMM and DNN-based systems. The systems which achieved significantly better preference at p < 0.01 level are in the bold font.
HMM (α)      DNN (#layers × #units)   Neutral   p value    z value
15.8 (16)    38.5 (4 × 256)           45.7      < 10^-6    -9.9
16.1 (4)     27.2 (4 × 512)           56.8      < 10^-6    -5.1
12.7 (1)     36.6 (4 × 1024)          50.7      < 10^-6    -11.5
Image from Zen et al., “Statistical Parametric Speech Synthesis using DNNs”, 2014
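The frame-level mapping in the figure can be sketched as a plain feed-forward pass. This is an illustration only: the weights are random and untrained, and the feature dimensions, the tanh activation and the function names are assumptions rather than details taken from the slide. Only the 4 × 256 hidden configuration mirrors one row of the table.

```python
import numpy as np

rng = np.random.default_rng(0)


def init_layer(n_in, n_out):
    """Random (untrained) weights and zero biases for one dense layer."""
    return rng.normal(0.0, 1.0 / np.sqrt(n_in), size=(n_in, n_out)), np.zeros(n_out)


def forward(x, layers):
    """Map per-frame input features to per-frame output statistics."""
    h = x
    for weights, bias in layers[:-1]:
        h = np.tanh(h @ weights + bias)   # hidden layers (activation is an assumption)
    weights, bias = layers[-1]
    return h @ weights + bias             # linear output layer


num_inputs = 30     # hypothetical number of binary & numeric linguistic features
num_outputs = 12    # hypothetical number of spectral/excitation statistics
sizes = [num_inputs] + [256] * 4 + [num_outputs]
layers = [init_layer(a, b) for a, b in zip(sizes[:-1], sizes[1:])]

x = rng.random((5, num_inputs))   # 5 frames of made-up input features
y = forward(x, layers)
print(y.shape)                    # (5, 12)
```

Each frame is mapped independently of its neighbours, so smoothness across frames has to come from the delta features and the parameter generation step.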
RNN-based speech synthesis
[Figure: Text → Text Analysis → Input Feature Extraction → input features → recurrent network → output features → Vocoder → Waveform]
• Access long-range context in both forward and backward directions using biLSTMs
• Inference is expensive, and latency is inherently high
Image from Fan et al., “TTS synthesis with BLSTM-based RNNs”, 2014
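The latency point can be illustrated with a toy bidirectional network. The sketch below swaps the LSTM cells for plain tanh recurrences to stay short, and uses random, untrained weights with hypothetical sizes. It checks whether the output for the first frame depends on the last input frame.

```python
import numpy as np

rng = np.random.default_rng(0)
D_IN, D_H = 4, 8   # hypothetical feature and hidden sizes


def make_params():
    """Random (untrained) input and recurrent weights for one direction."""
    return rng.normal(0.0, 0.5, (D_IN, D_H)), rng.normal(0.0, 0.5, (D_H, D_H))


def run_rnn(x, params, reverse=False):
    """Simple tanh recurrence (a stand-in for an LSTM), optionally run backwards."""
    w_in, w_rec = params
    frames = x[::-1] if reverse else x
    h = np.zeros(D_H)
    states = []
    for frame in frames:
        h = np.tanh(frame @ w_in + h @ w_rec)
        states.append(h)
    states = np.array(states)
    return states[::-1] if reverse else states


def bidirectional(x, fwd, bwd):
    """Concatenate forward and backward hidden states for every frame."""
    return np.concatenate([run_rnn(x, fwd), run_rnn(x, bwd, reverse=True)], axis=1)


fwd, bwd = make_params(), make_params()
x = rng.random((6, D_IN))
x_changed = x.copy()
x_changed[-1] += 1.0   # perturb only the last frame

first_fwd_same = np.array_equal(run_rnn(x, fwd)[0], run_rnn(x_changed, fwd)[0])
first_bi_same = np.array_equal(
    bidirectional(x, fwd, bwd)[0], bidirectional(x_changed, fwd, bwd)[0]
)
print(first_fwd_same, first_bi_same)
```

The forward-only state for the first frame is unaffected by the change, while the bidirectional output for that frame should differ, because the backward pass starts from the end of the utterance. No output frame can therefore be produced until the whole input sequence is available.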