Hidden Markov Model (HMM) based Speech Synthesis using HTS Toolkit
Presenter: Omer Nawaz, Research Officer (III)
Speech Synthesis Overview:
Text to be Synthesized -> Natural Language Processing (NLP) -> Speech Synthesis Engine -> Synthesized Speech
Rule-based, formant synthesis
Hand-crafting each phonetic unit by rules
CORPUS-BASED: Concatenative synthesis
High quality speech can be synthesized using waveform concatenation algorithms. To obtain various voices, a large amount of data is necessary.
Statistical parametric synthesis
Generate speech parameters from statistical models. Voice quality can easily be changed by transforming HMM parameters.
CORPUS-BASED: Unit Selection, HMM based
Comparison of two Approaches:
Unit Selection
Advantages: High quality at waveform level (specific domain)
HMM based
Advantages: Small footprint, smooth, stable quality
Disadvantages: Vocoder sound (domain-independent)
Source excitation part: excitation e(n), a pulse train or white noise
Vocal tract resonance part: linear time-invariant system h(n)
Speech: x(n) = h(n) * e(n), where * denotes convolution
The h(n) is defined by the state output vector of the HMM, e.g. mel-cepstrum
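The source-filter relation x(n) = h(n) * e(n) can be illustrated with a minimal sketch. The impulse response below is a hypothetical decaying resonance chosen only for illustration; it is not the mel-cepstrum based filter that HTS derives from the HMM state outputs, and the function names, pitch period and signal lengths are assumptions.

```python
import numpy as np

def pulse_train(n_samples, period):
    """Voiced excitation e(n): a unit impulse every `period` samples."""
    e = np.zeros(n_samples)
    e[::period] = 1.0
    return e

def white_noise(n_samples, seed=0):
    """Unvoiced excitation e(n): zero-mean Gaussian white noise."""
    rng = np.random.default_rng(seed)
    return rng.standard_normal(n_samples)

def synthesize(h, e):
    """Speech x(n) = h(n) * e(n), truncated to the excitation length."""
    return np.convolve(h, e)[: len(e)]

# Hypothetical impulse response standing in for the vocal tract filter h(n).
n = np.arange(64)
h = (0.9 ** n) * np.cos(2 * np.pi * 0.1 * n)

voiced = synthesize(h, pulse_train(400, period=80))   # pulse-train source
unvoiced = synthesize(h, white_noise(400))            # white-noise source
```

With a pulse train, each impulse simply places a copy of h(n) in the output, which is why the pitch period controls the perceived F0 while h(n) controls the spectral shape.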
Training Part: Speech Input -> Extract Spectrum, F0, labels -> Train Acoustic Models -> Stored Models
Synthesis Part: Text Input -> Labels -> Parameter Generation (from Stored Models) -> Synthesis Filter -> Synthesized Speech
Generation of the full-context style labels.
Addition of Stress/Syllable Layer.
Defining the Question Set.
Optimizing the Synthesized Quality.
Tri-phone context dependent model
Phoneme sequence: P A K I S T_D A N
P-A-K
T_D-A-N
Full-context style context dependent model
Phoneme sequence: P A K I S T_D A N
x^P-A+K=I@x_x/A …
S^T_D-A+N=x@x_x/A …
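The phoneme part of the full-context labels above can be sketched as follows. The function name is hypothetical, and padding the sequence edges with `x` is an assumption taken from the `x^P-A+K=I` example; only the five-phone prefix is built, not the `@…/A…` fields that follow it.

```python
def quinphone_labels(phonemes, pad="x"):
    """Build LL^L-C+R=RR prefixes for every phoneme in the sequence."""
    # Two padding symbols on each side so edge phonemes have a full window.
    padded = [pad, pad] + list(phonemes) + [pad, pad]
    labels = []
    for i in range(2, len(padded) - 2):
        ll, l, c, r, rr = padded[i - 2 : i + 3]
        labels.append(f"{ll}^{l}-{c}+{r}={rr}")
    return labels

labels = quinphone_labels(["P", "A", "K", "I", "S", "T_D", "A", "N"])
for label in labels:
    print(label)
```

The second and seventh entries correspond to the two `A` phonemes shown on the slide.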
x^x-SIL+A=L@1_0/A:0_0_0/B:0-0-0@1-0&1-1#1-1$1-1!0-0;0- …
x^SIL-A+L=I_I@1_1/A:0_0_0/B:0-0-1@1-2&1-9#1-3$1-1!0-2;0- …
SIL^A-L+I_I=A@1_2/A:0_0_1/B:0-0-2@2-1&2-8#1-3$1-1!0-1;0-0 …
A^L-I_I+A=P@2_1/A:0_0_1/B:0-0-2@2-1&2-8#1-3$1-1!0-1;0- …
۔۔۔ ا
SIL^A-L+I_I=A@1_2/A:0_0_1/B:0-0-2@2-1&2-8#1-3$1-1!0-1;0-0|I_I/C:1+0+2/D:0_0/E:content+2@1+5&1+4#0+1/F:content_2/G:0_0/H:9=5^1=2|NONE/I:8=6/J:17+11-2
Segmental Context
Supra-Segmental Context: Syllable, Stress, Word, Phrase, POS
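A full-context label of this shape can be split into its phoneme window and its lettered blocks with a short parser. This is a sketch that assumes blocks are introduced by `/A:` through `/J:` as in the example label; the function name and the returned dictionary layout are hypothetical, and the meaning of each field inside a block is not interpreted.

```python
import re

# Five-phone window at the start of the label: LL^L-C+R=RR@...
QUINPHONE = re.compile(r"^(.+?)\^(.+?)-(.+?)\+(.+?)=(.+?)@")

def parse_label(label):
    """Split a full-context label into its quinphone and /A: ... /J: blocks."""
    match = QUINPHONE.match(label)
    if match is None:
        raise ValueError("label does not start with LL^L-C+R=RR@")
    # re.split with a capture group returns: head, 'A', value, 'B', value, ...
    parts = re.split(r"/([A-J]):", label)
    blocks = dict(zip(parts[1::2], parts[2::2]))
    return {"quinphone": match.groups(), "head": parts[0], "blocks": blocks}

example = (
    "SIL^A-L+I_I=A@1_2/A:0_0_1/B:0-0-2@2-1&2-8#1-3$1-1!0-1;0-0|I_I"
    "/C:1+0+2/D:0_0/E:content+2@1+5&1+4#0+1/F:content_2/G:0_0"
    "/H:9=5^1=2|NONE/I:8=6/J:17+11-2"
)
parsed = parse_label(example)
print(parsed["quinphone"])
print(sorted(parsed["blocks"]))
```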
TextGrid File
-> Extract Segmental & Word Layer
-> Apply Stress & Syllabification Rules
-> Align Syllable Boundaries with Segmental Layer
-> Generate new TextGrid File with Additional Layers
-> Convert to Full-Context format
The number of possible combinations is enormous with these 53 different contexts.
With only Segmental Context, possible models are: 66^5 ≈ 1252 million
If we consider all the contexts, the number is practically infinite.
Solution:
Record data having maximum phoneme …
Apply context clustering technique to classify and share acoustically similar models
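The figure of roughly 1252 million is consistent with a five-phone window over an inventory of 66 phonemes. The check below rests on that reading of the slide, which is an assumption; the variable names are illustrative.

```python
# Assumed reading of the slide: 66 phonemes, five positions (LL, L, C, R, RR).
n_phonemes = 66
context_width = 5

n_models = n_phonemes ** context_width
print(f"{n_models:,}")                       # 1,252,332,576
print(f"{n_models / 1e6:.0f} million")       # 1252 million
```

Every additional supra-segmental context multiplies this count again, which is why the models have to be shared through context clustering rather than trained individually.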
Phoneme
{preceding, current, succeeding} phone
Stress/Syllable/Word/…
# of phonemes at {preceding, current, succeeding} syllable
stress of {preceding, current, succeeding} syllable
Position of current syllable in current word
# of syllables {from previous, to next} stressed syllable
Vowel within current syllable
# of syllables in {preceding, current, succeeding} word
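Contexts such as these are exposed to the clustering step through a question set. The sketch below generates phoneme-identity questions in the `QS "name" {pattern,...}` form commonly used in HTS question files; the wildcard patterns assume the `LL^L-C+R=RR@` delimiters seen in the labels of this presentation, and the function name, the class name "Vowel" and the grouping of `A` and `I_I` under it are illustrative assumptions, not the presenter's actual question set.

```python
def phone_questions(class_name, phones):
    """Emit one QS line per window position for a group of phones."""
    # Pattern templates keyed by position; delimiters follow LL^L-C+R=RR@.
    patterns = {
        "LL": "{p}^*",
        "L": "*^{p}-*",
        "C": "*-{p}+*",
        "R": "*+{p}=*",
        "RR": "*={p}@*",
    }
    lines = []
    for position, template in patterns.items():
        items = ",".join(template.format(p=p) for p in phones)
        lines.append(f'QS "{position}-{class_name}" {{{items}}}')
    return lines

questions = phone_questions("Vowel", ["A", "I_I"])
for line in questions:
    print(line)
```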
Seen Context:
Un-seen Context:
Different Carrier Word:
Training Set: