

  1. Automatic Speech Recognition (CS753)
 Lecture 24: Statistical Parametric Speech Synthesis
 Instructor: Preethi Jyothi. Nov 2, 2017.
 Images on the first 11 slides are from Zen et al., “Statistical Parametric Speech Synthesis”, SPECOM 2009

  2. Parametric Speech Synthesis Framework
 Pipeline: speech is passed through speech analysis to give acoustic parameters O, and text through text analysis to give word sequences W; these are used to train a model λ. At synthesis time, text analysis plus parameter generation produce ô, which drives the synthesis step.

 Training: estimate the acoustic model given speech utterances (O) and word sequences (W):

 $$\hat{\lambda} = \arg\max_{\lambda} \; p(O \mid W, \lambda)$$

  3. Parametric Speech Synthesis Framework (contd.)
 Training, as before (the models are HMMs!):

 $$\hat{\lambda} = \arg\max_{\lambda} \; p(O \mid W, \lambda)$$

 Synthesis: find the most probable ô from λ̂ and a given word sequence w to be synthesised:

 $$\hat{o} = \arg\max_{o} \; p(o \mid w, \hat{\lambda})$$

 Then synthesize speech from ô.

  4. HMM-based speech synthesis
 [Figure: Training part — excitation and spectral parameters are extracted from a speech database and used, together with labels, to train context-dependent HMMs and duration models. Synthesis part — text analysis converts input TEXT to labels, excitation and spectral parameters are generated from the HMMs, and excitation generation followed by a synthesis filter produces the SYNTHESIZED SPEECH.]

  5. Speech parameter generation
 Generate the most probable observation vectors given the HMM and w:

 $$\begin{aligned} \hat{o} &= \arg\max_{o} \; p(o \mid w, \hat{\lambda}) \\ &= \arg\max_{o} \sum_{\forall q} p(o, q \mid w, \hat{\lambda}) \\ &\approx \arg\max_{o} \max_{q} \; p(o, q \mid w, \hat{\lambda}) \\ &= \arg\max_{o} \max_{q} \; p(o \mid q, \hat{\lambda}) \, P(q \mid w, \hat{\lambda}) \end{aligned}$$

 Determine the best state sequence and outputs sequentially:

 $$\hat{q} = \arg\max_{q} \; P(q \mid w, \hat{\lambda}) \qquad\qquad \hat{o} = \arg\max_{o} \; p(o \mid \hat{q}, \hat{\lambda})$$

 Let’s explore the second step (determining state outputs given $\hat{q}$) first; a small numeric check of the max-over-q approximation follows below.
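 A minimal sketch (toy HMM, assumed values) of the approximation above: for a small discrete HMM we can enumerate every state sequence q and compare the exact sum over q with the single best q that the Viterbi-style approximation keeps.

 ```python
 # Sketch (assumed toy example): compare sum over all state sequences q
 # with the max over q, i.e. the approximation used on slide 5.
 import itertools
 import numpy as np

 pi = np.array([0.7, 0.3])                # initial state probabilities
 A = np.array([[0.8, 0.2], [0.3, 0.7]])   # transition matrix a_ij
 B = np.array([[0.9, 0.1], [0.2, 0.8]])   # emission probs b_j(o) for o in {0, 1}
 obs = [0, 0, 1]                          # a short observation sequence

 def joint(q, o):
     """p(o, q | lambda) for one state sequence q."""
     p = pi[q[0]] * B[q[0], o[0]]
     for t in range(1, len(o)):
         p *= A[q[t - 1], q[t]] * B[q[t], o[t]]
     return p

 probs = [joint(q, obs) for q in itertools.product(range(2), repeat=len(obs))]
 print("sum over q:", sum(probs))   # exact likelihood p(o | lambda)
 print("max over q:", max(probs))   # Viterbi-style approximation
 ```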

  6. Determining state outputs

 $$\hat{o} = \arg\max_{o} \; p(o \mid \hat{q}, \hat{\lambda}) = \arg\max_{o} \; \mathcal{N}(o;\, \mu_{\hat{q}}, \Sigma_{\hat{q}})$$

 where $o = [o_1^\top, \ldots, o_T^\top]^\top$ is a state-output vector sequence to be generated, $q = \{q_1, \ldots, q_T\}$ is a state sequence, $\mu_q = [\mu_{q_1}^\top, \ldots, \mu_{q_T}^\top]^\top$ is the mean vector for q, and $\Sigma_q = \mathrm{diag}[\Sigma_{q_1}, \ldots, \Sigma_{q_T}]$ is the covariance matrix for q.

 What would ô look like? A Gaussian is maximised at its mean, so ô is just the sequence of state means: a piecewise-constant trajectory that jumps at state boundaries. [Figure: mean and variance of the generated trajectory per state.]
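 A minimal sketch (assumed toy values) of this observation: without any further constraints, maximising $\mathcal{N}(o; \mu_{\hat{q}}, \Sigma_{\hat{q}})$ simply copies each state's mean.

 ```python
 # Sketch (assumed toy values): arg max_o N(o; mu_qhat, Sigma_qhat) copies
 # each state's mean, giving a step-wise (piecewise-constant) trajectory.
 import numpy as np

 state_means = {0: -1.0, 1: 0.5, 2: 2.0}   # per-state output means (1-D for clarity)
 q_hat = [0, 0, 0, 1, 1, 2, 2, 2]          # best state sequence from slide 5

 o_hat = np.array([state_means[q] for q in q_hat])
 print(o_hat)   # [-1. -1. -1.  0.5  0.5  2.  2.  2.] -- steps at state boundaries
 ```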

  7. Adding dynamic features to state outputs
 State-output vectors contain both static ($c_t$) and dynamic ($\Delta c_t$) features: $o_t = [c_t^\top, \Delta c_t^\top]^\top$, where the dynamic feature is calculated as $\Delta c_t = c_t - c_{t-1}$.

 o and c can be arranged in matrix form as $o = Wc$:

 $$\begin{bmatrix} \vdots \\ c_{t-1} \\ \Delta c_{t-1} \\ c_{t} \\ \Delta c_{t} \\ c_{t+1} \\ \Delta c_{t+1} \\ \vdots \end{bmatrix} = \begin{bmatrix} \ddots & \vdots & \vdots & \vdots & \vdots & \\ \cdots & 0 & I & 0 & 0 & \cdots \\ \cdots & -I & I & 0 & 0 & \cdots \\ \cdots & 0 & 0 & I & 0 & \cdots \\ \cdots & 0 & -I & I & 0 & \cdots \\ \cdots & 0 & 0 & 0 & I & \cdots \\ \cdots & 0 & 0 & -I & I & \cdots \\ & \vdots & \vdots & \vdots & \vdots & \ddots \end{bmatrix} \begin{bmatrix} \vdots \\ c_{t-2} \\ c_{t-1} \\ c_{t} \\ c_{t+1} \\ \vdots \end{bmatrix}$$

 A sketch of constructing W follows below.
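 A minimal sketch (1-D features, assumed boundary $c_0 = 0$) of building the W above:

 ```python
 # Sketch: build the matrix W that stacks static and delta features, o = W c,
 # for 1-dimensional c_t and delta_c_t = c_t - c_{t-1} (c_0 taken as 0).
 import numpy as np

 def build_W(T):
     """Return W of shape (2T, T): rows alternate c_t and delta c_t."""
     W = np.zeros((2 * T, T))
     for t in range(T):
         W[2 * t, t] = 1.0             # static row: c_t
         W[2 * t + 1, t] = 1.0         # delta row:  c_t ...
         if t > 0:
             W[2 * t + 1, t - 1] = -1.0  # ... minus c_{t-1}
     return W

 c = np.array([0.0, 1.0, 3.0, 2.0])
 o = build_W(len(c)) @ c
 print(o)   # [ 0.  0.  1.  1.  3.  2.  2. -1.] -- interleaved (c_t, delta c_t)
 ```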

  8. Speech parameter generation
 • Introducing dynamic feature constraints:

 $$\hat{o} = \arg\max_{o} \; p(o \mid \hat{q}, \hat{\lambda}) \quad \text{where } o = Wc$$

 • If the output distributions are single Gaussians:

 $$p(o \mid \hat{q}, \hat{\lambda}) = \mathcal{N}(o;\, \mu_{\hat{q}}, \Sigma_{\hat{q}})$$

 • Then, by setting $\partial \log \mathcal{N}(Wc;\, \mu_{\hat{q}}, \Sigma_{\hat{q}}) / \partial c = 0$, we get:

 $$W^\top \Sigma_{\hat{q}}^{-1} W c = W^\top \Sigma_{\hat{q}}^{-1} \mu_{\hat{q}}$$
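 A minimal sketch (assumed toy means and variances) of solving these normal equations, reusing the 1-D static+delta W from the previous sketch (repeated here for self-containment):

 ```python
 # Sketch (assumed toy values): solve W^T Sigma^{-1} W c = W^T Sigma^{-1} mu
 # from slide 8 for the maximum-likelihood static trajectory c.
 import numpy as np

 def build_W(T):
     W = np.zeros((2 * T, T))
     for t in range(T):
         W[2 * t, t] = 1.0
         W[2 * t + 1, t] = 1.0
         if t > 0:
             W[2 * t + 1, t - 1] = -1.0
     return W

 T = 4
 W = build_W(T)
 mu = np.array([0.0, 0.0, 1.0, 1.0, 3.0, 2.0, 2.0, -1.0])  # stacked (mean, delta-mean)
 var = np.full(2 * T, 0.1)                                  # diagonal variances of o
 Sinv = np.diag(1.0 / var)                                  # Sigma^{-1}

 c_ml = np.linalg.solve(W.T @ Sinv @ W, W.T @ Sinv @ mu)    # smooth ML trajectory
 print(c_ml)
 ```

 Because the static rows of W make it full column rank, $W^\top \Sigma^{-1} W$ is positive definite and the system has a unique solution: a trajectory that trades off static means against delta (smoothness) constraints.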

  9. Synthesis overview
 [Figure: clustered context-dependent states are merged into a sentence HMM; the static and delta Gaussian statistics of its states determine the ML trajectory.]

  10. Speech parameter generation
 The same decomposition as slide 5:

 $$\begin{aligned} \hat{o} &= \arg\max_{o} \; p(o \mid w, \hat{\lambda}) \\ &= \arg\max_{o} \sum_{\forall q} p(o, q \mid w, \hat{\lambda}) \\ &\approx \arg\max_{o} \max_{q} \; p(o, q \mid w, \hat{\lambda}) \\ &= \arg\max_{o} \max_{q} \; p(o \mid q, \hat{\lambda}) \, P(q \mid w, \hat{\lambda}) \end{aligned}$$

 Determine the best state sequence and outputs sequentially:

 $$\hat{q} = \arg\max_{q} \; P(q \mid w, \hat{\lambda}) \qquad\qquad \hat{o} = \arg\max_{o} \; p(o \mid \hat{q}, \hat{\lambda})$$

 Let’s explore the first step (determining the state sequence, i.e., state durations) next.

  11. Duration modeling
 • How are durations modelled within an HMM? Implicitly, by state self-transition probabilities. The state duration probability is

 $$p_k(d) = a_{kk}^{\,d-1} \, (1 - a_{kk})$$

 so the PMFs of state durations are geometric distributions. [Figure: $p_k(d)$ against state duration d (frames) for $a_{kk} = 0.6$, decaying from 0.4 at d = 1 towards 0 by d = 10.]

 • State durations are determined by maximising:

 $$\log P(d \mid \lambda) = \sum_{j=1}^{N} \log p_j(d_j)$$

 • What would this solution look like if the PMFs of state durations are geometric distributions? (A geometric PMF is largest at d = 1, so every state would be assigned a single frame — see the sketch below.)
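 A minimal sketch of the geometric duration PMF above, using the slide's $a_{kk} = 0.6$:

 ```python
 # Sketch: geometric state-duration PMF p_k(d) = a_kk^(d-1) * (1 - a_kk).
 # Its maximum is always at d = 1, which is why maximising
 # sum_j log p_j(d_j) with geometric PMFs yields 1-frame durations.
 import numpy as np

 a_kk = 0.6
 d = np.arange(1, 11)
 pmf = a_kk ** (d - 1) * (1 - a_kk)
 for di, pi in zip(d, pmf):
     print(f"d={di:2d}  p_k(d)={pi:.3f}")   # monotonically decreasing from d=1
 ```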

  12. Explicit modeling of state durations
 • Each state duration is explicitly modelled as a single Gaussian. The mean ξ(i) and variance σ²(i) of the duration density of state i are:

 $$\xi(i) = \frac{\displaystyle\sum_{t_0=1}^{T} \sum_{t_1=t_0}^{T} \chi_{t_0,t_1}(i)\,(t_1 - t_0 + 1)}{\displaystyle\sum_{t_0=1}^{T} \sum_{t_1=t_0}^{T} \chi_{t_0,t_1}(i)}$$

 $$\sigma^2(i) = \frac{\displaystyle\sum_{t_0=1}^{T} \sum_{t_1=t_0}^{T} \chi_{t_0,t_1}(i)\,(t_1 - t_0 + 1)^2}{\displaystyle\sum_{t_0=1}^{T} \sum_{t_1=t_0}^{T} \chi_{t_0,t_1}(i)} - \xi^2(i)$$

 where

 $$\chi_{t_0,t_1}(i) = \bigl(1 - \gamma_{t_0-1}(i)\bigr) \cdot \prod_{t=t_0}^{t_1} \gamma_t(i) \cdot \bigl(1 - \gamma_{t_1+1}(i)\bigr)$$

 and $\gamma_t(i)$ is the probability of being in state i at time t.
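 A minimal sketch (assumed toy occupancies; in practice $\gamma_t(i)$ comes from the forward-backward algorithm) of computing ξ(i) and σ²(i) for one state directly from these formulas:

 ```python
 # Sketch (assumed inputs): duration-Gaussian mean xi(i) and variance
 # sigma2(i) from per-frame occupancies gamma_t(i) of a single state i.
 # Boundary terms gamma_0 and gamma_{T+1} are taken as 0.
 import numpy as np

 def duration_stats(gamma):
     T = len(gamma)
     g = np.concatenate(([0.0], gamma, [0.0]))   # pad so g[t] = gamma_t, t = 1..T
     num1 = num2 = den = 0.0
     for t0 in range(1, T + 1):
         for t1 in range(t0, T + 1):
             # chi_{t0,t1}(i): occupy state i over exactly frames t0..t1
             chi = (1 - g[t0 - 1]) * np.prod(g[t0:t1 + 1]) * (1 - g[t1 + 1])
             dur = t1 - t0 + 1
             den += chi
             num1 += chi * dur
             num2 += chi * dur ** 2
     xi = num1 / den
     return xi, num2 / den - xi ** 2             # (mean, variance)

 gamma = np.array([0.1, 0.8, 0.9, 0.7, 0.2])     # toy occupancies for one state
 print(duration_stats(gamma))
 ```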

  13. Determining state durations
 During synthesis, for a given speech length T, the goal is to maximise:

 $$\log P(d \mid \lambda, T) = \sum_{k=1}^{K} \log p_k(d_k) \qquad (1)$$

 under the constraint that $T = \sum_{k=1}^{K} d_k$.

 We saw that each duration density can be modelled as a single Gaussian, $p_k(d_k) = \mathcal{N}(\cdot\,;\, \xi_k, \sigma_k^2)$. The state durations $d_k$, $k = 1 \ldots K$, which maximise (1) are then given by:

 $$d_k = \xi(k) + \rho \cdot \sigma^2(k), \qquad \rho = \frac{T - \sum_{k=1}^{K} \xi(k)}{\sum_{k=1}^{K} \sigma^2(k)}$$

 where ξ(k) and σ²(k) are the mean and variance of the duration density of state k, and ρ controls the speaking rate.
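 A minimal sketch (assumed toy duration statistics) of this closed-form assignment:

 ```python
 # Sketch (assumed toy values): duration assignment from slide 13,
 # d_k = xi_k + rho * sigma2_k with rho = (T - sum xi) / sum sigma2.
 import numpy as np

 xi = np.array([4.0, 10.0, 6.0])       # duration means per state (frames)
 sigma2 = np.array([1.0, 4.0, 2.0])    # duration variances per state
 T = 26                                # requested utterance length in frames

 rho = (T - xi.sum()) / sigma2.sum()   # rho > 0 slows speech, rho < 0 speeds it up
 d = xi + rho * sigma2
 print(rho, d, d.sum())                # durations sum to T (before rounding)
 ```

 Note that states with larger duration variance absorb more of the stretching or compression, which is exactly the role of ρ as a speaking-rate control.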

  14. Synthesis using duration models
 [Figure: TEXT is converted into a sentence HMM built from context-dependent HMMs; context-dependent duration models (state duration densities) give state durations d_1, d_2, … for a chosen T or ρ; the generated mel-cepstrum parameters c_1 … c_T and pitch then drive an MLSA filter to produce the SYNTHETIC SPEECH.]
 Image from Yoshimura et al., “Duration modelling for HMM-based speech synthesis”, ICSLP ‘98

  15. DNN-based speech synthesis
 [Figure: TEXT → text analysis → input feature extraction yields per-frame input features (binary & numeric) x at frames 1 … T; a feedforward network (input layer, several hidden layers, output layer) maps each frame's features to the statistics (mean & variance) of the speech parameter vector sequence y; parameter generation and waveform synthesis then produce SPEECH.]
 Image from Zen et al., “Statistical Parametric Speech Synthesis using DNNs”, 2014

  16. DNN-based speech synthesis
 • Input features describe linguistic contexts: binary and numeric values (# of words, duration of the phoneme, etc.)
 • Output features are spectral and excitation parameters and their delta values (a minimal forward-pass sketch follows after the table below)
 • Listening test results:

 Table 1. Preference scores (%) between speech samples from the HMM and DNN-based systems. The systems which achieved significantly better preference at the p < 0.01 level — the DNN systems in every row — are in bold in the original.

 HMM (α)     DNN (#layers × #units)   Neutral   p value    z value
 15.8 (16)   38.5 (4 × 256)           45.7      < 10⁻⁶     −9.9
 16.1 (4)    27.2 (4 × 512)           56.8      < 10⁻⁶     −5.1
 12.7 (1)    36.6 (4 × 1024)          50.7      < 10⁻⁶     −11.5

 Image from Zen et al., “Statistical Parametric Speech Synthesis using DNNs”, 2014
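 A minimal sketch (random untrained weights; layer sizes, tanh activations, and feature dimensions are assumptions) of the frame-level mapping in slides 15–16: per-frame linguistic features in, statistics of the speech parameter vector out.

 ```python
 # Sketch (illustrative only): feedforward mapping from one frame's
 # linguistic input features x_t to acoustic statistics (means and
 # log-variances), which would be fed to parameter generation.
 import numpy as np

 rng = np.random.default_rng(0)
 n_in, n_hidden, n_layers, n_out = 60, 256, 4, 2 * 40   # 40 means + 40 log-variances

 # Random weights stand in for trained ones.
 Ws = [rng.normal(0, 0.1, (n_in if i == 0 else n_hidden, n_hidden))
       for i in range(n_layers)]
 W_out = rng.normal(0, 0.1, (n_hidden, n_out))

 def forward(x):
     h = x
     for W in Ws:
         h = np.tanh(h @ W)          # hidden layers
     return h @ W_out                # linear output layer: acoustic statistics

 x_t = rng.normal(size=n_in)         # one frame's binary/numeric input features
 print(forward(x_t).shape)           # (80,)
 ```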

  17. RNN-based speech synthesis
 • Access long-range context in both forward and backward directions using biLSTMs
 • Inference is expensive, and such models inherently have large latency: the backward direction cannot run until the whole utterance's input features are available (a minimal sketch follows below)
 [Figure: Text → text analysis → input feature extraction; the input features pass through biLSTM layers to produce output features, which a vocoder converts into the waveform.]
 Image from Fan et al., “TTS synthesis with BLSTM-based RNNs”, 2014
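 A minimal sketch (untrained weights; all dimensions are assumptions) of the biLSTM mapping on this slide, written with PyTorch:

 ```python
 # Sketch (assumed shapes): bidirectional LSTM mapping per-frame linguistic
 # features to acoustic output features for a vocoder. Note the whole
 # utterance is consumed at once -- the source of the latency noted above.
 import torch
 import torch.nn as nn

 n_in, n_hidden, n_out, T = 60, 128, 80, 200

 blstm = nn.LSTM(n_in, n_hidden, num_layers=2, bidirectional=True, batch_first=True)
 proj = nn.Linear(2 * n_hidden, n_out)      # 2x: forward + backward hidden states

 x = torch.randn(1, T, n_in)                # one utterance of T frames
 h, _ = blstm(x)                            # (1, T, 2 * n_hidden)
 y = proj(h)                                # (1, T, n_out) -> vocoder features
 print(y.shape)
 ```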
