A template-based approach for speech synthesis intonation generation



slide-1
SLIDE 1

A template-based approach for speech synthesis intonation generation using LSTMs

Srikanth Ronanki

Gustav Zhizheng Simon

slide-2
SLIDE 2

Introduction: Statistical speech synthesis

  • SPSS: Generation-based approaches

Fig: Text → Front-end → Regression model → MGC, BAP, LF0 (pitch) → Vocoder (waveform generator) → Speech

slide-3
SLIDE 3

Why template-based approach?

  • Lack of convincing intonation makes current parametric systems sound dull and lifeless.
  • Typically, these systems predict F0 frame by frame using regression models.
  • This approach leads to overly smooth pitch contours and fails to construct an appropriate prosodic structure.
  • Templates retain the dynamic range of F0 within the segment.
  • We propose a classification-based approach to automatic F0 generation.

slide-4
SLIDE 4

Pitch contour

slide-5
SLIDE 5

Pitch contour segmentation

slide-6
SLIDE 6

Hierarchical clustering

Pitch contours

slide-7
SLIDE 7

How to determine number of clusters?

slide-8
SLIDE 8

A set of templates (clusters)

slide-9
SLIDE 9

Intonation reconstruction from templates

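Fig: The utterance "Goldilocks and the three bears", split into syllables (followed by a pause, /pau/), with one template index per segment (2 4 5 4 2 4 4 5 and 2 4 5 4 4 4 2 5 in the two panels). Each template is combined with the segment's duration and mean pitch.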
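
The reconstruction step can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes each template is stored as a short mean-free shape, that it is stretched to the syllable's duration by linear resampling, and that the syllable's mean pitch is then added back. The template shapes, the IDs chosen for them, and the names `resample` and `reconstruct_f0` are all hypothetical.

```python
def resample(shape, n_frames):
    """Linearly resample a template shape to n_frames points."""
    if n_frames == 1:
        return [shape[len(shape) // 2]]
    out = []
    for i in range(n_frames):
        pos = i * (len(shape) - 1) / (n_frames - 1)
        lo = int(pos)
        hi = min(lo + 1, len(shape) - 1)
        frac = pos - lo
        out.append((1 - frac) * shape[lo] + frac * shape[hi])
    return out


def reconstruct_f0(template_ids, durations, means, templates):
    """Concatenate per-syllable contours: stretched template shape + mean pitch."""
    contour = []
    for tid, n_frames, mean in zip(template_ids, durations, means):
        shape = resample(templates[tid], n_frames)
        contour.extend(mean + v for v in shape)
    return contour


# Hypothetical mean-free template shapes (values in Hz around the mean).
templates = {
    2: [-10.0, 0.0, 10.0],  # assumed rising shape
    4: [0.0, 0.0, 0.0],     # assumed flat shape
    5: [10.0, 0.0, -10.0],  # assumed falling shape
}

# Three syllables: template index, duration in frames, mean pitch in Hz.
f0 = reconstruct_f0([2, 4, 5], [5, 3, 5], [120.0, 130.0, 110.0], templates)
print(f0)
```

The contour produced this way is discontinuous at syllable boundaries, which is presumably why a smoothing stage follows reconstruction in the full system.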

slide-10
SLIDE 10

Intonation reconstruction

slide-11
SLIDE 11

Hierarchical clustering - Recap

Training: Data segmentation (syllable) → Pitch patterns (DCT features) → Clustering (hierarchical)

Diagram labels: force-aligned durations, mean normalisation, duration normalisation

  • Interpolate the F0 contour of each utterance and segment it into syllables.
  • Apply DCT-based decomposition: c0 represents the mean over the syllable, and c = [c1, …, cN-1] represents the shape of the contour.
  • Perform top-to-bottom hierarchical clustering over the patterns (c).
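
The first two steps of the recap can be sketched in plain Python. This is an illustration under stated assumptions, not the code from the linked repository: unvoiced frames are assumed to be marked with 0.0, syllable boundaries are assumed to be given as frame index pairs, and the DCT-II is scaled so that c0 equals the syllable mean exactly. The function names are hypothetical.

```python
import math


def interpolate_unvoiced(f0):
    """Linearly interpolate over unvoiced frames (assumed to be marked 0.0)."""
    voiced = [i for i, v in enumerate(f0) if v > 0.0]
    if not voiced:
        return list(f0)
    out = []
    for i, v in enumerate(f0):
        if v > 0.0:
            out.append(v)
            continue
        prev = max((j for j in voiced if j < i), default=None)
        nxt = min((j for j in voiced if j > i), default=None)
        if prev is None:
            out.append(f0[nxt])      # leading unvoiced frames: copy first voiced value
        elif nxt is None:
            out.append(f0[prev])     # trailing unvoiced frames: copy last voiced value
        else:
            w = (i - prev) / (nxt - prev)
            out.append((1 - w) * f0[prev] + w * f0[nxt])
    return out


def dct_decompose(contour):
    """DCT-II scaled so that c0 is the mean; returns (c0, [c1, ..., cN-1])."""
    n = len(contour)
    coeffs = []
    for k in range(n):
        s = sum(x * math.cos(math.pi / n * (i + 0.5) * k)
                for i, x in enumerate(contour))
        coeffs.append(s / n if k == 0 else 2.0 * s / n)
    return coeffs[0], coeffs[1:]


def dct_reconstruct(c0, shape, n):
    """Inverse of dct_decompose for a contour of n frames."""
    return [c0 + sum(c * math.cos(math.pi / n * (i + 0.5) * k)
                     for k, c in enumerate(shape, start=1))
            for i in range(n)]


raw_f0 = [0.0, 100.0, 0.0, 0.0, 130.0, 120.0, 0.0]   # toy utterance
boundaries = [(0, 4), (4, 7)]                         # toy syllable boundaries
contour = interpolate_unvoiced(raw_f0)
syllables = [contour[s:e] for s, e in boundaries]
c0, shape = dct_decompose(syllables[0])
rebuilt = dct_reconstruct(c0, shape, len(syllables[0]))
```

Clustering would then operate on the `shape` vectors only, so that syllables with the same contour shape but different pitch height fall into the same template.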
slide-12
SLIDE 12

Proposed approaches

Neural Network classifiers:

  • A hierarchical deep neural network classifier (HC).
  • The first DNN chooses between a flat and a non-flat template.
  • The second DNN chooses among the non-flat templates.
  • A simplified LSTM with a CTC output layer (CTC).
  • Connectionist temporal classification (CTC) is coupled with a simplified LSTM (S-LSTM) to predict the sequence of templates given the sequence of phonemes.

Fig: Template counts in the data, for templates 1 to 6.
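
The two decision schemes can be illustrated with a short sketch. It makes several assumptions that the slides do not state: the flat template is given ID 1, the first-stage decision uses a 0.5 threshold, and the CTC output is decoded with greedy best-path decoding (argmax per step, collapse repeats, drop blanks). The function names and all probabilities are hypothetical.

```python
def hierarchical_decision(p_flat, non_flat_probs, flat_id=1, threshold=0.5):
    """Two-stage decision: flat vs non-flat, then the best non-flat template.

    p_flat         -- first-stage probability that the syllable is flat
    non_flat_probs -- second-stage probabilities, {template_id: probability}
    flat_id        -- assumed ID of the flat template (hypothetical)
    """
    if p_flat >= threshold:
        return flat_id
    return max(non_flat_probs, key=non_flat_probs.get)


def ctc_greedy_decode(step_probs, blank=0):
    """Best-path CTC decoding: argmax per step, collapse repeats, drop blanks."""
    best = [max(range(len(p)), key=lambda k: p[k]) for p in step_probs]
    decoded = []
    prev = None
    for label in best:
        if label != prev and label != blank:
            decoded.append(label)
        prev = label
    return decoded


# Hypothetical per-phoneme output distributions over (blank, template 1, template 2).
frame_probs = [
    [0.90, 0.05, 0.05],
    [0.10, 0.10, 0.80],
    [0.20, 0.10, 0.70],
    [0.60, 0.20, 0.20],
    [0.10, 0.20, 0.70],
    [0.10, 0.80, 0.10],
]
print(ctc_greedy_decode(frame_probs))
print(hierarchical_decision(0.2, {2: 0.1, 3: 0.9}))
```

The blank label is what lets CTC emit fewer templates than there are input phonemes, so no explicit phoneme-to-syllable alignment is needed at the output.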

slide-13
SLIDE 13

Fig: System overview.
Text → Text analysis → Linguistic features (phone-level and frame-level).
Duration S-LSTM → Phone durations d(t).
Intonation S-LSTM with CTC output layer (Input #1 to #3) → Syllable templates → F0 reconstruction → smoothing → Frame-level F0.
Acoustic S-LSTM with output layer (Input #1 to #3) → Acoustic features (MGC, BAP, F0) → Vocoder → Waveform.
Inset: S-LSTM cell with labels it, ot, ft, Ct and 1 − ft.

slide-14
SLIDE 14

Results: systems

  • Baseline system
  • MSE - A frame-wise regression baseline predicting F0 using LSTMs.
  • Proposed systems
  • HC - A hierarchical deep neural network classifier.
  • CTC - A simplified LSTM coupled with a CTC output layer.
  • Oracle - An oracle system using templates derived from the natural F0 contour but with predicted F0 mean and duration.

slide-15
SLIDE 15

Objective evaluation

  • Classification measures
  • Accuracy - percentage of templates correctly classified
  • F1 score - harmonic mean of precision and recall

Model   Accuracy   F1 score
HC      61.1%      0.590
CTC     63.8%      0.593
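
For reference, the two classification measures can be computed as in the sketch below. The slides do not say how F1 is averaged over the template classes, so macro-averaging (an unweighted mean of per-class F1) is an assumption here, and the label sequences are made up for illustration.

```python
def accuracy(y_true, y_pred):
    """Fraction of templates classified correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 (the averaging scheme is an assumption)."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for lab in labels:
        tp = sum(t == lab and p == lab for t, p in zip(y_true, y_pred))
        fp = sum(t != lab and p == lab for t, p in zip(y_true, y_pred))
        fn = sum(t == lab and p != lab for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN), the harmonic mean of precision and recall.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical reference and predicted template IDs for six syllables.
y_true = [1, 1, 2, 2, 3, 3]
y_pred = [1, 2, 2, 2, 3, 1]
acc = accuracy(y_true, y_pred)
f1 = macro_f1(y_true, y_pred)
```

With imbalanced template counts, accuracy can be dominated by the most frequent template, which is one reason to report F1 alongside it.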

slide-16
SLIDE 16

Objective evaluation

  • F0 prediction measures
  • RMSE - Root mean square error
  • CORR - Pearson correlation

Fig: RMSE of predicted F0 for the MSE, HC, CTC and Oracle systems (axis range 40 to 47).

Fig: Correlation of predicted F0 for the MSE, HC, CTC and Oracle systems (axis range 0.15 to 0.6).

  • Oracle templates + Oracle F0 mean - 0.89 (corr.)
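
Both F0 measures are standard and can be written in a few lines. The sketch below uses made-up contours; which frames enter the comparison (for example, voiced frames only) is not stated on the slide, so the sketch simply compares two equal-length sequences.

```python
import math


def rmse(ref, pred):
    """Root mean square error between two equal-length F0 sequences."""
    return math.sqrt(sum((r - p) ** 2 for r, p in zip(ref, pred)) / len(ref))


def pearson(ref, pred):
    """Pearson correlation coefficient between two equal-length F0 sequences."""
    n = len(ref)
    mean_r = sum(ref) / n
    mean_p = sum(pred) / n
    cov = sum((r - mean_r) * (p - mean_p) for r, p in zip(ref, pred))
    var_r = sum((r - mean_r) ** 2 for r in ref)
    var_p = sum((p - mean_p) ** 2 for p in pred)
    return cov / math.sqrt(var_r * var_p)


# Hypothetical natural and predicted F0 values in Hz.
natural = [100.0, 110.0, 120.0, 130.0]
predicted = [102.0, 108.0, 123.0, 127.0]
err = rmse(natural, predicted)
corr = pearson(natural, predicted)
```

RMSE penalises offsets in pitch height, while correlation is insensitive to a constant offset and reflects how well the shape of the contour is followed, so the two measures can rank systems differently.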
slide-17
SLIDE 17

Subjective evaluation

  • Reference systems
  • MSE - A frame-wise regression baseline predicting F0 using LSTMs.
  • BOT - A bottom line using piecewise-constant F0 per syllable (the mean natural F0)
  • BMK - A benchmark system using force-aligned durations and natural F0 contours
  • VOC - A top line of vocoded speech (STRAIGHT in this work)
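
As an illustration of the BOT bottom line, the sketch below replaces each syllable's F0 with its mean, giving a piecewise-constant contour. The boundary format (start and end frame indices) and the function name are assumptions made for this example.

```python
def piecewise_constant_f0(f0, boundaries):
    """Replace every syllable's frames with that syllable's mean F0."""
    out = []
    for start, end in boundaries:
        segment = f0[start:end]
        mean = sum(segment) / len(segment)
        out.extend([mean] * len(segment))
    return out


# Hypothetical natural F0 (Hz) and syllable boundaries in frames.
natural_f0 = [100.0, 110.0, 120.0, 130.0, 140.0, 150.0]
flat_f0 = piecewise_constant_f0(natural_f0, [(0, 2), (2, 6)])
print(flat_f0)
```

Such a contour keeps the pitch height of every syllable but removes all within-syllable movement, which is the information the templates are meant to add.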
slide-18
SLIDE 18

Subjective evaluation: MUSHRA

Fig: Box plot of aggregate ranks from the listening test (subjective rank 1 to 7, highest is best) for VOC, BOT, BMK, MSE, HC, CTC and Oracle. Red lines are medians; orange squares are means.

  • 20 listeners
  • 20 out of 32 test stimuli
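
One plausible way to obtain ranks from MUSHRA ratings is to rank the systems within each trial, so that the best-rated system receives the highest rank. The slides do not describe the exact procedure, so the sketch below is an assumption, including the choice to give tied systems their average rank. The scores are invented.

```python
def rank_scores(scores):
    """Map {system: score} to {system: rank}, rank 1 = lowest score.

    Tied systems share the average of the ranks they span (an assumption).
    """
    ordered = sorted(scores, key=scores.get)
    ranks = {}
    i = 0
    while i < len(ordered):
        j = i
        while j + 1 < len(ordered) and scores[ordered[j + 1]] == scores[ordered[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[ordered[k]] = average_rank
        i = j + 1
    return ranks


# Hypothetical MUSHRA scores (0 to 100) from one listener for one stimulus.
trial = {"VOC": 95, "BMK": 80, "MSE": 60, "HC": 60, "BOT": 20}
ranks = rank_scores(trial)
print(ranks)
```

Ranks from all listeners and stimuli could then be pooled per system to produce the distributions summarised in a box plot.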

slide-19
SLIDE 19

Summary and conclusions

  • A classification approach to intonation prediction with syllable F0 templates
  • The proposed approach matches the performance of the conventional approach
  • It has the potential to exceed it once the issues with the oracle template system are overcome
  • Future work:
  • Better smoothing techniques and word-level templates
  • Use the prediction probabilities as features for frame-level regression approaches

slide-20
SLIDE 20

Code

  • Code for templates and clustering
  • https://github.com/ronanki/Hybrid_prosody_model
  • Code for training neural networks
  • https://github.com/CSTR-Edinburgh/merlin