[PPT] - 6-Text To Speech (TTS) Speech Synthesis Speech Synthesis Concept PowerPoint Presentation

SLIDE 1

1

6-Text To Speech (TTS) Speech Synthesis

Speech Synthesis Concept Speech Naturalness Phone Sequence To Speech

– Articulatory Approaches – Concatenative Approaches – HMM-based Approaches – Rule-Based Approaches

SLIDE 2

2

Speech Synthesis Concept

Text to Phone Sequence Phone Sequence to Speech Text Speech waveform Natural Language Processing (NLP) Speech Processing Text Speech waveform

SLIDE 3

3

Speech Naturalness

Obviation of undesirable noise and distortion and dissociation from speech Prosody generation

– Speech energy – Duration – pitch – Intonation – Stress

SLIDE 4

4

Speech Naturalness (Cont’d)

Intonation and Stress are very effective in speech naturalness Intonation : Variation of Pitch frequency along speaking Stress : Increasing the pitch frequency in a specific time

SLIDE 5

Which word receives an intonation?

It depends on the context.

– The ‘new’ information in the answer to a question is

ften accented

while the ‘old’ information is usually not.

– Q1: What types of foods are a good source of vitamins? – A1: LEGUMES are a good source of vitamins. – Q2: Are legumes a source of vitamins? – A2: Legumes are a GOOD source of vitamins. – Q3: I’ve heard that legumes are healthy, but what are they a good source of ? – A3: Legumes are a good source of VITAMINS.

Slide from Jennifer Venditti

SLIDE 6

Same ‘tune’, different alignment

50 100 150 200 250 300 350 400

LEGUMES are a good source of vitamins The main rise-fall accent (= “I assert this”) shifts locations.

Slide from Jennifer Venditti

SLIDE 7

Same ‘tune’, different alignment

50 100 150 200 250 300 350 400

Legumes are a GOOD source of vitamins The main rise-fall accent (= “I assert this”) shifts locations.

Slide from Jennifer Venditti

SLIDE 8

Same ‘tune’, different alignment

legumes are a good source of VITAMINS

50 100 150 200 250 300 350 400

The main rise-fall accent (= “I assert this”) shifts locations.

Slide from Jennifer Venditti

SLIDE 9

Types of Waveform Synthesis

9

Articulatory Synthesis:

– Model movements of articulators and acoustics of vocal tract

Concatenative Synthesis:

– Use databases of stored speech to assemble new utterances.

Diphone Unit Selection

Statistical (HMM) Synthesis

– Trains parameters on databases of speech

Rule-Based (Formant) Synthesis:

– Start with acoustics, create rules/filters to create waveform

SLIDE 10

Articulatory Synthesis

Simulation of physical processes of human articulation Wolfgang von Kempelen (1734-1804) and

thers used bellows, reeds and tubes to

construct mechanical speaking machines Modern versions “simulate” electronically the effect of articulator positions, vocal tract shape, etc on air flow.

10

SLIDE 11

Concatenative approaches

Two main approaches: 1- Concatenating Phone Units

– Example: concatenating samples of recorded diphones or syllables

2- Unit selection

– Uses several samples for each phone unit and selects the most appropriate one when synthesizing

11

SLIDE 12

12

Phone Units

Paragraph ( ) Sentence ( ) Word (Depends on the language. Usually more than 100,000) Syllable Diphone & Triphone Phoneme (Between 10 , 100)



SLIDE 13

13

Phone Units (Cont’d)

Diphone : We model Transitions between two phonemes

p1 p2 p3 p4 p5

. . . . .

Diphone Phoneme

SLIDE 14

14

Phone Units (Cont’d)

Farsi phonemes: 30 Farsi diphones: 30*30 = 900 Farsi triphones: 27000 in theory Not all of the triphones are used

SLIDE 15

15

Phone Units (Cont’d)

Syllable = Onset (Consonant) + Rhyme Syllable is a set of phonemes that exactly contains one vowel Syllables in Farsi : CV , CVC , CVCC We have about 4000 Syllables in Farsi Syllables in English :V, CV , CVC ,CCVC, CCVCC, CCCVC, CCCVCC, . . . Number of Syllables in English is too many

SLIDE 16

16

Phone Sequence To Speech (Cont’d)

Text to Phone Sequence Phone Sequence to primitive utterance Text Speech primitive utterance to Natural Speech NLP Speech Processing

SLIDE 17

17

Concatenative Approaches

In this approaches we store units of natural speech for reconstruction of desired speech We could select the appropriate phone unit for speech synthesis we can store compressed parameters instead of main waveform

SLIDE 18

18

Concatenative Approaches (Cont’d)

Benefits of storing compressed parameters instead of main waveform

– Less memory use – General state instead of a specific stored utterance – Generating prosody easily

SLIDE 19

19

Concatenative Approaches (Cont’d)

Phone Unit Type of Storing

Paragraph Sentence Word Syllable Diphone Phoneme Main Waveform Main Waveform Main Waveform Coded/Main Waveform Coded Waveform Coded Waveform

SLIDE 20

20

Concatenative Approaches (Cont’d)

Pitch Synchronous Overlap-Add-Method (PSOLA) is a famous method in phoneme transmit smoothing Overlap-Add-Method is a standard DSP method PSOLA is a base action for Voice Conversion. In this method in analysis stage we select frames that are synchronous by pitch markers.

SLIDE 21

Diphone Architecture Example

Training:

– Choose units (kinds of diphones) – Record 1 speaker saying 1 example of each diphone – Mark the boundaries of each diphones,

cut each diphone out and create a diphone database

Synthesizing an utterance,

– grab relevant sequence of diphones from database – Concatenate the diphones, doing slight signal processing at boundaries – use signal processing to change the prosody (F0, energy, duration) of selected sequence of diphones

21

SLIDE 22

Unit Selection

Same idea as concatenative synthesis, but database contains bigger varieties of “phone units” from diphones to sentences Multiple examples of phone units (under different prosodic conditions) are recorded Selection of appropriate unit therefore becomes more complex, as there are in the database competing candidates for selection

22

SLIDE 23

Unit Selection

23

Unlike diphone concatenation, little or no signal processing applied to each unit Natural data solves problems with diphones

– Diphone databases are carefully designed but:

Speaker makes errors Speaker doesn’t speak intended dialect Require database design to be right

– If it’s automatic

Labeled with what the speaker actually said Coarticulation, schwas, flaps are natural

“There’s no data like more data”

– Lots of copies of each unit mean you can choose just the right one for the context – Larger units mean you can capture wider effects

SLIDE 24

Unit Selection Issues

24

Given a big database For each segment (diphone) that we want to synthesize

– Find the unit in the database that is the best to synthesize this target segment

What does “best” mean?

– “Target cost”: Closest match to the target description, in terms of

Phonetic context F0, stress, phrase position

– “Join cost”: Best join with neighboring units

Matching formants + other spectral characteristics Matching energy Matching F0 n n n n target join 1 1 i i i-1 i i=1 i=2

C(t ,u )= C (t ,u )+ C (u ,u )

 

SLIDE 25

Unit Selection Search

25

SLIDE 26

Joining Units

26

unit selection, just like diphone, need to join the units

– Pitch-synchronously

For diphone synthesis, need to modify F0 and duration

– For unit selection, in principle also need to modify F0 and duration of selection units – But in practice, if unit-selection database is big enough (commercial systems)

no prosodic modifications (selected targets may already be close to desired prosody)

SLIDE 27

Unit Selection Summary

27

Advantages

– Quality is far superior to diphones – Natural prosody selection sounds better

Disadvantages:

– Quality can be very bad in some places

HCI problem: mix of very good and very bad is quite annoying

– Synthesis is computationally expensive – Needs more memory than diphone synthesis

SLIDE 28

28

Rule-Based Approach Stages

Determine the speech model and model parameters Determine type of phone units Determine some parameter amount for each phone unit Substitute sequence of phone units by its equivalent parameter sequence Put parameter sequence in speech model

SLIDE 29

29

KLATT 80 Model

SLIDE 30

30

KLATT 88 Model

SLIDE 31

31

KL GLOTT 88 model (default) SPECTRAL TILT LOW-PAS RESONANTOR MODIFIED LF MODEL ASPIRATION NOISE GENERATOR FIRST DIFFERENCE PREEMPHASIS NASAL FORMANT RESONATOR TRACHEAL FORMANT RESONATOR FOURTH FORMANT RESONATOR THIRTH FORMANT RESONATOR SECOND FORMANT RESONATOR FIRST FORMANT RESONATOR FRICATION NOISE GENERATOR SECOND FORMANT RESONATOR THIRD FORMANT RESONATOR FOURTH FORMANT RESONATOR FIFTH FORMANT RESONATOR SIXTH FORMANT RESONATOR

A2F A3F A4F A5F A6F AB ANV A1V A2V A3V A4V ATV

+

+
+
+

+

+
+

+

FILTERED IMPULSE TRAIN F0 AV OO FL DI SO SS TL AH AF GLOTTAL SOUND SOURCES CP BYPASS PATH B2F B3F B4F B5F B6F F6 PARALLEL VOCAL TRACT MODEL LYRYNGEAL SOUND SOURCES (NORMALLY NOT USED) PARALLEL VOCAL TRACT MODEL FRICATION SOUND SOURCES BNP BNZ BTP BTZ DF1 DB1 F2 B2 F3 B3 F4 B4 F5 B5 CASCADE VOCAL TRACT MODEL LARYNGEAL SOUND SOURCES NASAL POLE ZERO PAIR TRACHEAL POLE ZERO PAIR FIRST FORMANT RESONATOR SECOND FORMANT RESONATOR THIRTH FORMANT RESONATOR FOURTH FORMANT RESONATOR FIFTH FORMANT RESONATOR

THE KLSYN88 CASCADE PARALLEL FORMANT SYNTHESIZER

FNP FNZ FTP FTZ F1 B1

SLIDE 32

32

Three Voicing Source Model In KLATT 88

The old KLSYN impulsive source The KLGLOTT88 model The modified LF model

SLIDE 33

HMM-Based Synthesis

Corpus-based, statistical parametric synthesis – Proposed in mid-90s, becomes popular since mid-00s – Large data + automatic training => Automatic voice building

– Source-filter model + statistical acoustic model Flexible to change its voice characteristics – HMM as its statistical acoustic model – We focus on HMM-based speech synthesis

33

SLIDE 34

34

SLIDE 35

First extract parametric representations of speech including spectral and excitation parameters from a speech database Model them using a set of generative models (e.g., HMMs) Training Synthesis

35

SLIDE 36

Speech Parameter Modeling Based on HMM Spectral parameter modeling Excitation parameter modeling State duration modeling

36

SLIDE 37

Spectral parameter modeling

Mel-cepstral analysis has been used for spectral estimation A continuous density HMM is used for the vocal tract modeling in the same way as speech recognition systems. The continuous density Markov model is a finite state machine which makes one state transition at each time unit (i.e. frame). First, a decision is made to which state to succeed (including the state itself). Then an output vector is generated according to the probability density function (pdf) for the current state

37

SLIDE 38

Cont’d…

38

SLIDE 39

F0 parameter modeling

While the observation of F0 has a continuous value in the voiced region, there exists no value for the unvoiced

region. We can model this kind of
bservation sequence assuming that the
bserved F0 value occurs from one-

dimensional spaces and the “unvoiced” symbol occurs from the zero-dimensional space.

39

SLIDE 40

40

SLIDE 41

Calculation of dynamic feature:

As was mentioned, mel-cepstral coefficient is used as spectral parameter, Their dynamic featureس Δc and Δ2c are calculated as follows: Dynamic features for F0: In unvoiced region, pt, Δpt and Δ2pt are defined as a discrete symbol. When dynamic features at the boundary between voiced and unvoiced cannot be calculated, they are defined as a discrete symbol.

41

SLIDE 42

Effect of dynamic feature

By using dynamic features, the generated speech parameter vectors reflect not only the means of static and dynamic feature vectors but also the covariances of those Estimation will be smoother Good and bad effect

42

SLIDE 43

Multi-Stream HMM structure:

The sequence of mel-cepstral coefficient vector and F0 pattern are modeled by a continuous density HMM and multi-space probability distribution HMM Putting all this together has some advantages

43

SLIDE 44

Synthesis part

An arbitrarily given text to be synthesized is converted to a context-based label sequence. The text is converted a context dependent label sequence by a text analyzer. For the TTS system, the text analyzer should have the ability to extract contextual information. However, no text analyzer has the ability to extract accentual phrase and to decide accent type of accentual phrase.

44

SLIDE 45

Some contextual factors

When we have HMM for each phoneme {preceding, current, succeeding} phoneme position of breath group in sentence {preceding, current, succeeding} part-of-speech position of current accentual phrase in current breath group position of current phoneme in current accentual phrase

45

SLIDE 46

46

SLIDE 47

According to the obtained state durations, a sequence of mel-cepstral coefficients and F0 values including voiced/unvoiced decisions is generated from the sentence HMM by using the speech parameter generation algorithm Finally, speech is synthesized directly from the generated mel-cepstral coefficients and F0 values by the MLSA filter

47

SLIDE 48

Spectral representation & corresponding filter

48

cepstrum: LMA filter generalized cepstrum: GLSA filter mel-cepstrum: MLSA (Mel Log Spectrum Approximation) filter mel-generalized cepstrum: MGLSA filter LSP: LSP filter PARCOR: all-pole lattice filter LPC: all-pole filter

SLIDE 49

Advantages

Most of the advantages of statistical parametric synthesis against unit-selection synthesis are related to its flexibility due to the statistical modeling process. Transforming voice characteristics, speaking styles, and emotions. Also Combination of unit selection and voice conversion (VC) techniques can alleviate this problem, high-quality voice-conversion is still problematic.

49

SLIDE 50

Adaptation (mimicking voices):Techniques of adaptation were originally developed in speech recognition to adjust general acoustic model , These techniques have also been applied to HMM-based speech synthesis to

btain speaker-specific synthesis systems with a small amount of speech

data Interpolation (mixing voices): Interpolate parameters among representative HMM sets

Can obtain new voices even when no adaptation data is available
Gradually change speakers. & speaking styles

50

SLIDE 51

Some rule-based applications

KLATT C Implementation: http://www.cs.cmu.edu/afs/cs/project/ai- repository/ai/areas/speech/systems/klatt/0.html A web page which generates speech waveform online given KLATT parameters: http://www.asel.udel.edu/speech/ tutorials/synthesis/Klatt.html Formant Synthesis Demo using Fant’s Formant Model http://www.speech.kth.se/ wavesurfer/formant/

51

SLIDE 52

TTS Online Demos

52

AT&T:

– http://www2.research.att.com/~ttsweb/tts/demo.php

Festival

– http://www.cstr.ed.ac.uk/projects/festival/morevoices. html

Cepstral

– http://www.cepstral.com/cgi-bin/demos/general

IBM

– http://www.research.ibm.com/tts/coredemo.shtml

SLIDE 53

Festival

Open source speech synthesis system Designed for development and runtime use

– Use in many commercial and academic systems – Distributed with RedHat 9.x, etc – Hundreds of thousands of users – Multilingual

No built-in language Designed to allow addition of new languages

– Additional tools for rapid voice development

Statistical learning tools Scripts for building models

1/5/07

Text from Richard Sproat

SLIDE 54

Festival as software

http://festvox.org/festival/ General system for multi-lingual TTS C/C++ code with Scheme scripting language General replaceable modules:

– Lexicons, LTS, duration, intonation, phrasing, POS tagging, tokenizing, diphone/unit selection, signal processing

General tools

– Intonation analysis (f0, Tilt), signal processing, CART building, N-gram, SCFG, WFST

1/5/07

Text from Richard Sproat

SLIDE 55

Festival as software

http://festvox.org/festival/ No fixed theories New languages without new C++ code Multiplatform (Unix/Windows) Full sources in distribution Free software

1/5/07

Text from Richard Sproat

SLIDE 56

CMU FestVox project

Festival is an engine, how do you make voices? Festvox: building synthetic voices:

– Tools, scripts, documentation – Discussion and examples for building voices – Example voice databases – Step by step walkthroughs of processes

Support for English and other languages Support for different waveform synthesis methods

– Diphone – Unit selection – Limited domain

1/5/07

Text from Richard Sproat

SLIDE 57

Future Trends

Speaker adaptation Language adaptation

57