Hidden Markov Model (HMM) based Speech Synthesis using HTS Toolkit

Presenter: Omer Nawaz, Research Officer (III)

SLIDE 2

Speech Synthesis Overview:

Text to be Synthesized → Natural Language Processing (NLP) → Speech Synthesis Engine → Synthesized Speech

SLIDE 3

Introduction:

Rule-based, formant synthesis

Hand-crafting each phonetic unit by rules

CORPUS-BASED: Concatenative synthesis

High quality speech can be synthesized using waveform concatenation algorithms.

To obtain various voices, a large amount of speech data is necessary.

Statistical parametric synthesis

Generate speech parameters from statistical models.

Voice quality can easily be changed by transforming HMM parameters.

SLIDE 4

Approaches at CLE:

CORPUS-BASED:

Unit Selection

HMM based

Comparison of two Approaches:

Unit Selection

Advantages: High quality at waveform level (specific domain)

Disadvantages: Large footprints; discontinuous; unstable quality

HMM based

Advantages: Small footprint; smooth; stable quality

Disadvantages: Vocoder sound (domain-independent)

SLIDE 5

Synthesis Model: Source Filter Model

Source excitation part: excitation e(n), either a pulse train or white noise

Vocal tract resonance part: linear time-invariant system h(n)

Speech: $x(n) = h(n) * e(n)$, where $*$ denotes convolution

The h(n) is defined by the state output vector of the HMM, e.g. mel-cepstrum
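The source filter relation x(n) = h(n) * e(n) can be illustrated with a minimal sketch. The impulse response below is a toy decaying resonance chosen only for illustration (an assumption); in an HTS system the filter would be derived from the mel-cepstral coefficients generated by the HMM. The function names are hypothetical.

```python
import math
import random

def pulse_train(n_samples, period):
    """Voiced excitation: a unit impulse every `period` samples."""
    return [1.0 if i % period == 0 else 0.0 for i in range(n_samples)]

def white_noise(n_samples, seed=0):
    """Unvoiced excitation: zero-mean Gaussian white noise."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(n_samples)]

def synthesize(h, e):
    """Convolution x(n) = sum_k h(k) e(n - k), truncated to len(e)."""
    x = []
    for n in range(len(e)):
        acc = 0.0
        for k in range(min(n + 1, len(h))):
            acc += h[k] * e[n - k]
        x.append(acc)
    return x

# Toy impulse response (assumption): a decaying cosine standing in for
# the vocal tract filter. It is NOT a mel-cepstrum based filter.
h = [(0.9 ** k) * math.cos(2 * math.pi * 0.1 * k) for k in range(64)]

voiced = synthesize(h, pulse_train(400, period=80))
unvoiced = synthesize(h, white_noise(400))
```

Because the pulse period (80 samples) is longer than the toy impulse response (64 samples), each pulse in `voiced` reproduces h(n) without overlap, which makes the convolution easy to inspect by eye.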

SLIDE 6

General Overview (HTS):

Training Part: Speech Input → Extract Spectrum, F0, labels → Train Acoustic Models → Stored Models

Synthesis Part: Text Input → Labels → Parameter Generation (from Stored Models) → Synthesis Filter → Synthesized Speech

SLIDE 7

Challenges:

Generation of the full-context style labels.

Addition of Stress/Syllable Layer.

Defining the Question Set.

Optimizing the Synthesized Quality.

SLIDE 8

Full-Context Label Style:

Phoneme sequence: P A K I S T_D A N

Tri-phone context dependent model: P-A-K … T_D-A-N

Full-context style context dependent model: x^P-A+K=I@x_x/A … S^T_D-A+N=x@x_x/A …
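The segmental part of the full-context style (two preceding phonemes, the current phoneme, two succeeding phonemes) can be generated mechanically from a phoneme sequence. The following is a minimal sketch; the function name is hypothetical, and using `x` as the filler for missing context is inferred from the example labels on this slide.

```python
def quinphone_labels(phonemes, pad="x"):
    """Return one 'pp^p-c+n=nn' string per phoneme; `pad` fills missing context."""
    padded = [pad, pad] + list(phonemes) + [pad, pad]
    labels = []
    for i in range(2, len(padded) - 2):
        pp, p, c, n, nn = padded[i - 2 : i + 3]
        labels.append(f"{pp}^{p}-{c}+{n}={nn}")
    return labels

labels = quinphone_labels("P A K I S T_D A N".split())
print(labels[1])   # x^P-A+K=I
print(labels[-2])  # S^T_D-A+N=x
```

The two printed labels correspond to the first and second "A" of the example word, matching the segmental prefixes shown on the slide.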

SLIDE 9

Full-Context Format:

x^x-SIL+A=L@1_0/A:0_0_0/B:0-0-0@1-0&1-1#1-1$1-1!0-0;0- …
x^SIL-A+L=I_I@1_1/A:0_0_0/B:0-0-1@1-2&1-9#1-3$1-1!0-2;0- …
SIL^A-L+I_I=A@1_2/A:0_0_1/B:0-0-2@2-1&2-8#1-3$1-1!0-1;0-0 …
A^L-I_I+A=P@2_1/A:0_0_1/B:0-0-2@2-1&2-8#1-3$1-1!0-1;0- …


۔۔۔ ا

SLIDE 10

Full-Context Format:

SIL^A-L+I_I=A@1_2/A:0_0_1/B:0-0-2@2-1&2-8#1-3$1-1!0-1;0-0|I_I/C:1+0+2/D:0_0/E:content+2@1+5&1+4#0+1/F:content_2/G:0_0/H:9=5^1=2|NONE/I:8=6/J:17+11-2

Segmental Context

Current Phoneme
Previous two Phonemes
Next two Phonemes

Supra-Segmental Context

Syllable
Stress
Word
Phrase
POS
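To make the split between segmental and supra-segmental context concrete, here is a minimal sketch that pulls the five phonemes out of a full-context label and groups the remainder by its `/A:` to `/J:` markers. The function and field names are hypothetical, and the sketch does not interpret the individual numbers inside each block, since their exact meaning is not spelled out on the slide.

```python
import re

# Segmental prefix: previous-previous ^ previous - current + next = next-next @ ...
SEGMENTAL = re.compile(
    r"^(?P<pp>[^\^]+)\^(?P<p>[^\-]+)-(?P<c>[^+]+)\+(?P<n>[^=]+)=(?P<nn>[^@]+)@(?P<rest>.*)$"
)

def split_label(label):
    """Return (phoneme fields, position field, supra-segmental blocks)."""
    m = SEGMENTAL.match(label)
    if m is None:
        raise ValueError(f"not a full-context label: {label!r}")
    fields = m.groupdict()
    rest = fields.pop("rest")
    # Split on the block markers /A: ... /J:, keeping the marker letters.
    parts = re.split(r"/([A-J]):", rest)
    position = parts[0]
    blocks = dict(zip(parts[1::2], parts[2::2]))
    return fields, position, blocks

label = (
    "SIL^A-L+I_I=A@1_2/A:0_0_1/B:0-0-2@2-1&2-8#1-3$1-1!0-1;0-0|I_I"
    "/C:1+0+2/D:0_0/E:content+2@1+5&1+4#0+1/F:content_2/G:0_0"
    "/H:9=5^1=2|NONE/I:8=6/J:17+11-2"
)
phones, position, blocks = split_label(label)
print(phones)         # the five segmental phonemes
print(sorted(blocks)) # block letters found in the label
```

Splitting on the block markers rather than on individual delimiters keeps the sketch robust to phoneme names that contain underscores, such as `I_I`.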

SLIDE 11

Steps to Generate Full-Context Labels:

TextGrid File → Extract Segmental & Word Layer → Apply Stress & Syllabification Rules → Align Syllable Boundaries with Segmental Layer → Generate new TextGrid File with Additional Layers → Convert to Full-Context format
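The alignment step can be sketched as follows: once the syllabification rules have grouped the phonemes of a word into syllables, each syllable inherits its start time from its first phoneme and its end time from its last. The timings and the syllabification of the example word below are made up for illustration, and the function name is hypothetical; this is not the presenter's implementation.

```python
def syllable_tier(segments, syllables):
    """Build syllable intervals from phoneme intervals.

    segments:  list of (start, end, phoneme) from the segmental layer
    syllables: list of phoneme lists produced by syllabification rules
    """
    tier, i = [], 0
    for syl in syllables:
        chunk = segments[i : i + len(syl)]
        if [p for _, _, p in chunk] != list(syl):
            raise ValueError(f"syllable {syl} does not match segments at index {i}")
        tier.append((chunk[0][0], chunk[-1][1], " ".join(syl)))
        i += len(syl)
    if i != len(segments):
        raise ValueError("segments left over after the last syllable")
    return tier

# Hypothetical timings (seconds) and syllabification, for illustration only.
segments = [
    (0.00, 0.06, "P"), (0.06, 0.15, "A"),
    (0.15, 0.22, "K"), (0.22, 0.31, "I"), (0.31, 0.42, "S"),
    (0.42, 0.49, "T_D"), (0.49, 0.61, "A"), (0.61, 0.70, "N"),
]
syllables = [["P", "A"], ["K", "I", "S"], ["T_D", "A", "N"]]

tier = syllable_tier(segments, syllables)
for start, end, text in tier:
    print(start, end, text)
```

The resulting intervals could then be written out as an additional tier of the TextGrid file alongside the existing segmental and word layers.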

SLIDE 12

TextGrid Format:



SLIDE 14

TextGrid Format with Additional Layers:


SLIDE 15

Context Clustering (Question Set) 1/2:

The number of possible combinations is quite enormous with these 53 different contexts.

With only Segmental Context, possible models are: $66^5 \approx 1252$ million

If we consider all the context, it will be practically infinite.

Solution:

Record data having maximum phoneme coverage at tri-phone or di-phone level.

Apply context clustering technique to classify and share acoustically similar models.
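The figure of roughly 1252 million segmental-only models is consistent with 66 raised to the fifth power, i.e. five phoneme slots (current, two previous, two next) each filled from 66 symbols. The value 66 for the symbol inventory is inferred from that figure rather than stated on the slide, so treat it as an assumption. A quick check:

```python
n_phonemes = 66     # assumption: inferred from the model count on the slide
context_width = 5   # current phoneme plus two on each side

n_models = n_phonemes ** context_width
print(f"{n_models:,}")        # 1,252,332,576
print(round(n_models / 1e6))  # 1252

# For comparison, a tri-phone inventory over the same symbols:
n_triphones = n_phonemes ** 3
print(f"{n_triphones:,}")     # 287,496
```

Even the tri-phone count is far beyond what a recorded corpus can cover exhaustively, which is why sharing parameters between acoustically similar contexts is needed.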

SLIDE 16

Context Clustering (Question Set) 2/2:

Phoneme

{preceding, current, succeeding} phonemes

Stress/Syllable/Word/

# of phonemes at {preceding, current, succeeding} syllable
stress of {preceding, current, succeeding} syllable
Position of current syllable in current word
# of syllables {from previous, to next} stressed syllable
Vowel within current syllable
# of syllables in {preceding, current, succeeding} word
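A question in the question set is essentially a yes/no test applied to a full-context label. One common way to express such tests is with wildcard patterns over the label string; the sketch below uses Python's `fnmatch` for that. The question names and patterns are hypothetical examples, not the actual question set used in this work.

```python
from fnmatch import fnmatchcase

# Hypothetical questions: names and patterns are illustrative only.
QUESTIONS = {
    "C-Phone_L": ["*-L+*"],      # is the current phoneme L?
    "C-Phone_A": ["*-A+*"],      # is the current phoneme A?
    "L-Phone_A": ["*^A-*"],      # is the preceding phoneme A?
    "R-Phone_I_I": ["*+I_I=*"],  # is the succeeding phoneme I_I?
}

def answer(label, patterns):
    """True if the label matches any wildcard pattern of the question."""
    return any(fnmatchcase(label, p) for p in patterns)

label = (
    "SIL^A-L+I_I=A@1_2/A:0_0_1/B:0-0-2@2-1&2-8#1-3$1-1!0-1;0-0|I_I"
    "/C:1+0+2/D:0_0/E:content+2@1+5&1+4#0+1/F:content_2/G:0_0"
    "/H:9=5^1=2|NONE/I:8=6/J:17+11-2"
)
answers = {name: answer(label, pats) for name, pats in QUESTIONS.items()}
print(answers)
```

During decision-tree clustering, each candidate question splits the set of context-dependent models into a "yes" group and a "no" group, and models that end up in the same leaf share parameters.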

SLIDE 17

Some Synthesized Examples:

Training Set:

Seen Context:

Un-seen Context:

Different Carrier Word:

SLIDE 18

Questions?