A template-based approach for speech synthesis intonation generation


  1. A template-based approach for speech synthesis intonation generation using LSTMs
     Srikanth Ronanki, Gustav, Zhizheng, Simon

  2. Introduction: Statistical speech synthesis
     • SPSS: Generation-based approaches
     [Figure: synthesis pipeline. Recoverable labels: Text, Front-end, Regression Model, MGC, BAP, LF0 (Pitch), Vocoder (Waveform generator), Speech]

  3. Why a template-based approach?
     • Lack of convincing intonation makes current parametric systems sound dull and lifeless.
     • Typically, these systems predict F0 frame-by-frame using regression models.
     • This approach leads to overly smooth pitch contours and fails to construct an appropriate prosodic structure.
     • Templates retain the dynamic range of F0 within the segment.
     • We propose a classification-based approach to automatic F0 generation.

  4. Pitch contour

  5. Pitch contour segmentation

  6. Hierarchical clustering Pitch contours

  7. How to determine number of clusters?

  8. A set of templates (clusters)

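  9. Intonation reconstruction from templates
     Example: "Goldilocks and the three bears"
     Syllable   /g@Ulwd/  /lI/  /Qks/  /pau/  /and/  /D@/  /Tri/  /bE@z/
     Template   2         4     5      4      4      4     2      5
     [Figure: the same template sequence (2 4 5 4 4 4 2 5) shown again with the labels "Mean pitch" and "Duration"]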
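
The reconstruction step on this slide can be sketched as follows. This is a minimal illustration and not the authors' code: it assumes each template is stored as a short list of cosine (DCT) shape coefficients, that a syllable's contour is the template shape evaluated over the syllable's duration in frames plus the syllable's mean pitch, and that F0 values are in Hz. The template values, means, and durations below are hypothetical.

```python
import math

def shape_contour(shape_coeffs, n_frames):
    """Zero-mean contour of n_frames points from shape coefficients c1..cK.

    Assumed convention: x[t] = sum_k c_k * cos(pi * k * (t + 0.5) / n_frames).
    """
    return [
        sum(c * math.cos(math.pi * k * (t + 0.5) / n_frames)
            for k, c in enumerate(shape_coeffs, start=1))
        for t in range(n_frames)
    ]

def reconstruct_f0(template_ids, means, durations, templates):
    """Concatenate per-syllable contours: template shape + syllable mean."""
    f0 = []
    for tid, mean, n_frames in zip(template_ids, means, durations):
        shape = shape_contour(templates[tid], n_frames)
        f0.extend(mean + value for value in shape)
    return f0

# Hypothetical templates: id -> shape coefficients (template 2 is flat).
templates = {2: [0.0, 0.0], 4: [-8.0, 1.0], 5: [6.0, -2.0]}
template_ids = [2, 4, 5]        # one template per syllable
means = [120.0, 135.0, 110.0]   # hypothetical mean pitch per syllable
durations = [4, 6, 5]           # hypothetical duration per syllable, in frames

f0 = reconstruct_f0(template_ids, means, durations, templates)
```

Because the shape part is zero-mean, the mean of each reconstructed syllable equals the supplied mean pitch, so shape, level, and length can be predicted separately.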

  10. Intonation reconstruction

  11. Hierarchical clustering - Recap
      • Interpolate the F0 contour of each utterance and segment into syllables.
      • Apply DCT-based decomposition: c0 represents the mean over the syllable; c = [c1, …, cN-1] represents the shape of the contour.
      • Perform top-to-bottom hierarchical clustering over the patterns (c).
      [Figure: processing pipeline. Labels in order: Training data, force-aligned durations, segmentation (syllable), duration normalisation, pitch patterns (DCT features), mean normalisation, clustering (hierarchical)]
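
The DCT-based decomposition can be sketched as below. This is an illustration under stated assumptions, not the authors' implementation: the syllable contour is assumed to be already interpolated over unvoiced frames, duration normalisation is assumed to be linear resampling to a fixed number of points (`n_points`, a hypothetical parameter), and the DCT-II is scaled so that c0 equals the mean of the resampled contour.

```python
import math

def resample(contour, n_points):
    """Linearly resample a contour to n_points (duration normalisation)."""
    if len(contour) == 1 or n_points == 1:
        return [contour[0]] * n_points
    out = []
    for i in range(n_points):
        pos = i * (len(contour) - 1) / (n_points - 1)
        lo = int(math.floor(pos))
        hi = min(lo + 1, len(contour) - 1)
        frac = pos - lo
        out.append(contour[lo] * (1.0 - frac) + contour[hi] * frac)
    return out

def decompose_syllable(f0, n_points=10):
    """Return (c0, [c1..cN-1]): syllable mean and shape coefficients.

    Scaling is chosen so that
    x[t] = c0 + sum_k c_k * cos(pi * k * (t + 0.5) / N).
    """
    x = resample(f0, n_points)
    n = len(x)
    c0 = sum(x) / n
    shape = [
        (2.0 / n) * sum(x[t] * math.cos(math.pi * k * (t + 0.5) / n)
                        for t in range(n))
        for k in range(1, n)
    ]
    return c0, shape

# A flat contour has only a mean; a rising contour has a negative c1.
flat_c0, flat_shape = decompose_syllable([120.0] * 7, n_points=5)
rise_c0, rise_shape = decompose_syllable([100.0, 110.0, 120.0, 130.0], n_points=4)
```

Clustering then operates on the shape vector only, which is what makes the resulting templates independent of a syllable's pitch level and length.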

  12. Proposed approaches
      Neural network classifiers:
      • A hierarchical deep neural network classifier (HC).
        ‣ The first DNN chooses between flat and non-flat templates.
        ‣ The second DNN chooses among the remaining non-flat templates.
      • A simplified LSTM with a CTC output layer (CTC).
        ‣ Connectionist temporal classification coupled with S-LSTM to predict the sequence of templates given the sequence of phonemes.
      [Figure: bar chart of template counts in the data for templates 1 to 6 (values not recoverable); further labels: "Pitch contour", "Syllable"]
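
How a CTC output layer yields a template sequence can be illustrated with greedy (best-path) decoding. The slides do not say which decoding method was used, so this is only one common option: take the most probable label per input step, merge consecutive repeats, and drop the blank symbol. The blank index (0) and the posterior values are assumptions for the example.

```python
def ctc_greedy_decode(step_posteriors, blank=0):
    """Best-path CTC decoding: argmax per step, merge repeats, drop blanks."""
    best_path = [
        max(range(len(probs)), key=probs.__getitem__)
        for probs in step_posteriors
    ]
    decoded = []
    previous = None
    for label in best_path:
        if label != previous and label != blank:
            decoded.append(label)
        previous = label
    return decoded

# Hypothetical posteriors over [blank, template 1, template 2] for five steps.
posteriors = [
    [0.1, 0.8, 0.1],
    [0.1, 0.7, 0.2],
    [0.9, 0.05, 0.05],
    [0.2, 0.7, 0.1],
    [0.1, 0.2, 0.7],
]
decoded = ctc_greedy_decode(posteriors)
```

The blank between the second and fourth steps is what lets the same template appear twice in a row in the output; without it the two runs of template 1 would be merged.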

  13. [Figure: system architecture. Recoverable labels: Text; Text analysis; Linguistic features; Phone-level linguistic features; Duration S-LSTM; Intonation S-LSTM with CTC output layer; Phone durations d(t); Syllable templates; F0 reconstruction; Frame-level F0; Frame-level linguistic features; Acoustic S-LSTM; Output layer; Acoustic features (MGC, BAP, F0); smoothing; Vocoder; Waveform. An S-LSTM cell inset shows inputs #1 to #3 and the symbols o_t, i_t, C_t, f_t, 1 − f_t.]

  14. Results: systems
      • Baseline system
        ‣ MSE - A frame-wise regression baseline predicting F0 using LSTMs.
      • Proposed systems
        ‣ HC - A hierarchical deep neural network classifier
        ‣ CTC - A simplified LSTM coupled with a CTC output layer
        ‣ Oracle - An oracle system using templates derived from the natural F0 contour but with predicted F0 mean and duration

  15. Objective evaluation
      • Classification measures
        ‣ Accuracy - percentage of templates correctly classified
        ‣ F1 score - a measure of a test's accuracy that combines precision and recall
      Model   Accuracy   F1 score
      HC      61.1%      0.590
      CTC     63.8%      0.593
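
For reference, the two classification measures can be computed as in the sketch below. The slides do not state how F1 was averaged across templates; macro-averaging (an unweighted mean of per-template F1 scores) is assumed here, and the label sequences are made up.

```python
def accuracy(reference, predicted):
    """Fraction of templates classified correctly."""
    correct = sum(r == p for r, p in zip(reference, predicted))
    return correct / len(reference)

def macro_f1(reference, predicted):
    """Unweighted mean of per-class F1 (assumed averaging scheme)."""
    labels = sorted(set(reference) | set(predicted))
    scores = []
    for label in labels:
        pairs = list(zip(reference, predicted))
        tp = sum(r == label and p == label for r, p in pairs)
        fp = sum(r != label and p == label for r, p in pairs)
        fn = sum(r == label and p != label for r, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            scores.append(2 * precision * recall / (precision + recall))
        else:
            scores.append(0.0)
    return sum(scores) / len(scores)

# Hypothetical template labels for four syllables.
reference = [1, 1, 2, 2]
predicted = [1, 2, 2, 2]
acc = accuracy(reference, predicted)
f1 = macro_f1(reference, predicted)
```

With skewed template counts, accuracy can be dominated by the most frequent template, which is one reason to report F1 alongside it.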

  16. Objective evaluation
      • F0 prediction measures
        ‣ RMSE - Root mean square error
        ‣ CORR - Pearson correlation
      [Fig: RMSE of predicted F0 for MSE, HC, CTC, Oracle (axis range 40 to 47; bar values not recoverable)]
      [Fig: Correlation of predicted F0 for MSE, HC, CTC, Oracle (axis range 0 to 0.6; bar values not recoverable)]
      • Oracle templates + Oracle F0 mean - 0.89 (corr.)
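
Both F0 prediction measures are standard and can be written out directly. The sketch below assumes the natural and predicted contours are already frame-aligned and of equal length; the contour values are hypothetical.

```python
import math

def rmse(reference, predicted):
    """Root mean square error between two equal-length contours."""
    squared = [(r - p) ** 2 for r, p in zip(reference, predicted)]
    return math.sqrt(sum(squared) / len(squared))

def pearson(reference, predicted):
    """Pearson correlation coefficient between two equal-length contours."""
    n = len(reference)
    mean_r = sum(reference) / n
    mean_p = sum(predicted) / n
    cov = sum((r - mean_r) * (p - mean_p) for r, p in zip(reference, predicted))
    var_r = sum((r - mean_r) ** 2 for r in reference)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov / math.sqrt(var_r * var_p)

# Hypothetical contours: the prediction has the right shape but a 2-unit offset.
natural_f0 = [100.0, 110.0, 120.0]
predicted_f0 = [102.0, 112.0, 122.0]
error = rmse(natural_f0, predicted_f0)
corr = pearson(natural_f0, predicted_f0)
```

The example shows why the two measures are reported together: a constant offset leaves the correlation at its maximum while still producing a non-zero RMSE.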

  17. Subjective evaluation
      • Reference systems
        ‣ MSE - A frame-wise regression baseline predicting F0 using LSTMs.
        ‣ BOT - A bottom line using piecewise-constant F0 per syllable (the mean natural F0)
        ‣ BMK - A benchmark system using force-aligned durations and natural F0 contours
        ‣ VOC - A top line of vocoded speech (STRAIGHT in this work)

  18. Subjective evaluation: MUSHRA
      • 20 listeners
      • 20 out of 32 test stimuli
      [Fig: Box plot of aggregate ranks from listening test for VOC, BOT, BMK, MSE, HC, CTC, Oracle. Y-axis: subjective rank, 1 to 7 (highest is best). Red lines are medians, orange squares means.]

  19. Summary and conclusions
      • A classification approach to intonation prediction with syllable F0 templates
      • The proposed approach matches the performance of the conventional approach
      • It has the potential to exceed it once the issues with the oracle template system are overcome
      • Future work:
        ‣ Better smoothing techniques and word-level templates
        ‣ Use the prediction probabilities as features for frame-level regression approaches

  20. Code
      • Code for templates and clustering
        ‣ https://github.com/ronanki/Hybrid_prosody_model
      • Code for training neural networks
        ‣ https://github.com/CSTR-Edinburgh/merlin
