Speech Processing 15-492/18-492 Speech Synthesis Waveform - - PowerPoint PPT Presentation



slide-1
SLIDE 1

Speech Processing 15-492/18-492

Speech Synthesis Waveform generation 2

slide-2
SLIDE 2

Speech Synthesis

  • Text Analysis
      • Chunking, tokenization, token expansion
  • Linguistic Analysis
      • Pronunciations
      • Prosody
  • Waveform generation
      • From phones and prosody to waveforms

slide-3
SLIDE 3

Unit Selection vs Parametric

  • Unit Selection
      • The “standard” method
      • “Select appropriate sub-word units from large databases of natural speech”
  • Parametric Synthesis [NITECH: Tokuda et al]
      • HMM-generation based synthesis
      • Cluster units to form models
      • Generate from the models
      • “Take ‘average’ of units”

slide-4
SLIDE 4

Old vs New

  • Unit Selection:
      • large, carefully labelled database
      • quality good when good examples available
      • quality will sometimes be bad
      • no control of prosody
  • Parametric Synthesis:
      • smaller, less carefully labelled database
      • quality consistent
      • resynthesis requires vocoder (buzzy)
      • can (must) control prosody
      • model size much smaller than unit DB

slide-5
SLIDE 5

Example CG Voices

  • 7 Arctic databases: 1200 utterances, 43K segs, 1hr speech
      • awb, bdl, clb, jmk, ksp, rms, slt

slide-6
SLIDE 6

Data size vs Quality

slt_arctic data size

  Utts   Clusters   RMS F0   MCD
    50        230    24.29   6.761
   100        435    19.47   6.278
   200        824    17.41   6.047
   500       2227    15.02   5.755
  1100       4597    14.55   5.685
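
The MCD column is a mel-cepstral distortion score, where lower is better. The slide does not state the formula, so the sketch below assumes the commonly used definition: per frame, (10 / ln 10) * sqrt(2 * sum of squared coefficient differences), averaged over time-aligned frames. The function name and the convention of dropping the 0th (energy) coefficient beforehand are assumptions for illustration, not taken from the slides.

```python
import math


def mel_cepstral_distortion(ref_frames, syn_frames):
    """Mean mel-cepstral distortion in dB over time-aligned frames.

    Each frame is a sequence of mel-cepstral coefficients. The caller is
    assumed to have removed the 0th (energy) coefficient and to have
    aligned the two sequences frame by frame.
    """
    if len(ref_frames) != len(syn_frames) or not ref_frames:
        raise ValueError("need two non-empty, equally long frame sequences")
    scale = 10.0 / math.log(10.0) * math.sqrt(2.0)
    total = 0.0
    for ref, syn in zip(ref_frames, syn_frames):
        squared_diff = sum((r - s) ** 2 for r, s in zip(ref, syn))
        total += scale * math.sqrt(squared_diff)
    return total / len(ref_frames)
```

Identical sequences score 0.0, and the score grows with the Euclidean distance between natural and synthesized coefficient vectors.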

slide-7
SLIDE 7

Databases size vs Quality

  • SPS
      • rms_100
      • rms_1132
  • Unit selection
      • rms_100
      • rms_1132

slide-8
SLIDE 8

Advantages of SPS

  • Statistical Parametric Synthesis
      • More robust to errors in data
      • Requires less data
      • Models are smaller (< 2MB vs > 1GB)
      • Parametric models allow further processing

slide-9
SLIDE 9

Disadvantages of SPS

  • Statistical Parametric Synthesis
      • “Buzziness” of resynthesized speech
      • Doesn’t sound as good as the best unit selection
      • Still experimental

slide-10
SLIDE 10

Parametric Speech Models

  • Emotional Speech Synthesis
      • Can collect small amounts of emotional speech
      • Build models that transform base model
  • Cross Lingual Speech Synthesis
      • From language independent models
      • Transform with small amount of target language
  • Use various ASR techniques
      • Adaptation
      • Discriminative training
      • Use as much CPU as the ASR people

slide-11
SLIDE 11

Corpus-based Synthesis

  • Doesn’t really “just work”
  • Need to consider database content
  • Speaker style
  • What you send to the synthesizer

slide-12
SLIDE 12

The right type of database

  • Recording style defines synthesis style
      • News stories will give a news-style synthesizer
      • News style not appropriate for dialog system
  • Natural vs controlled prompts
      • Natural utterances good for general synthesizer
      • Domain targeted better for domain synthesizer

slide-13
SLIDE 13

The right type of speaker

  • Professional speakers are better
      • Consistent style and articulation
      • Lecturers, teachers are often better
      • You can learn to do it well
  • Ideal selection process (AT&T: Syrdal 99)
      • Record 20 professional speakers
      • Build limited synthesizers from them
      • Collect many people’s preferences (> 200)
      • Record the “best” speaker(s)
  • Find correlates in human speech
      • High power in unvoiced speech
      • High power in higher frequencies
      • Larger pitch range
  • Different people prefer different voices
      • Provide a choice
      • Errors are sometimes diminished by novelty

slide-14
SLIDE 14

The right type of things to synthesize

  • Instead of making the db appropriate
      • Restrict the text input
  • Domain synthesis
      • “The temperature is X degrees and the outlook is Y”.
      • Make the database directly match text
      • Fill templates with values

slide-15
SLIDE 15

Limited Domain Synthesis

  • General Unit Selection Synthesis
      • Can be high quality
      • Sometimes bad quality
      • Expensive to tune
  • Limited Domain Synthesis
      • Design database to match exactly what you want to synthesize
      • Only reasonable if building voice per application is easy

slide-16
SLIDE 16

Building a Voice

  • Designing the Prompts
  • Recording the Prompts
  • Labeling the Utterances
  • Finding parameters (F0, MCEP)
  • Building the synthesis voice
  • Tuning and Testing

slide-17
SLIDE 17

Designing the Prompts

  • From a grammar
      • System says: The temperature is X degrees
  • From example data
      • Using example output from the existing system
  • From thinking about it
      • But you *will* make mistakes
  • Ideally:
      • Word coverage
      • Bigram coverage
      • Prosody position coverage
  • Design prompts to limit prosodic variance
      • Boston, is that where you want to go?
      • Do you want to go to Boston?
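
Word and bigram coverage can be measured mechanically before any recording is done. The sketch below is one minimal way to do it, assuming whitespace tokenization with surrounding punctuation stripped; the function names `tokenize` and `coverage` are hypothetical and not part of any toolkit named in the slides.

```python
def tokenize(sentence):
    """Lowercase a sentence and strip surrounding punctuation from each word."""
    words = (w.strip(".,?!;:\"'").lower() for w in sentence.split())
    return [w for w in words if w]


def _types(sentences):
    """Collect the distinct words and distinct adjacent word pairs."""
    words, bigrams = set(), set()
    for sentence in sentences:
        tokens = tokenize(sentence)
        words.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return words, bigrams


def coverage(prompts, targets):
    """Return (word coverage, bigram coverage, missing words).

    Coverage is the fraction of distinct words (or bigrams) in the target
    sentences that also occur somewhere in the recorded prompts.
    """
    have_words, have_bigrams = _types(prompts)
    need_words, need_bigrams = _types(targets)
    word_cov = len(need_words & have_words) / len(need_words) if need_words else 1.0
    bigram_cov = len(need_bigrams & have_bigrams) / len(need_bigrams) if need_bigrams else 1.0
    return word_cov, bigram_cov, sorted(need_words - have_words)


word_cov, bigram_cov, missing = coverage(
    ["Do you want to go to Boston?"],
    ["Boston, is that where you want to go?"],
)
```

With the two Boston sentences, the recorded prompt covers the shared words but not "is", "that", or "where", and fewer than half of the target bigrams, even though both sentences ask the same thing.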

slide-18
SLIDE 18

Domains

  • Fixed template filling
      • Talking clocks, 24 utterances
      • Weather, 100 utterances (don’t say place name)
  • Larger domains (spoken dialog systems)
      • Let’s Go bus information (Hybrid)
          • Standard prompts
          • Times and bus numbers
          • 15,000 bus stop names (not fully covered)
          • Backup general synthesis prompts

slide-19
SLIDE 19

A talking clock

  • Design the prompts:
      • The time is now, about five past one, in the morning
      • The time is now, just after ten past two, in the morning
      • The time is now, exactly quarter to three, in the morning
      • The time is now, almost twenty past four, in the morning
  • Get full word coverage
      • *really* test you have word coverage
      • No, *really* test you have word coverage
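
One way to "really" test word coverage is to generate every sentence the clock can say and compare its vocabulary with the recorded prompts. The sketch below follows the prompt template shown above. The rounding to five-minute steps, the mapping from minute offset to "exactly", "just after", "about", and "almost", and the morning/afternoon/evening boundaries are assumptions made for illustration; the slides do not specify them.

```python
HOUR_WORDS = ["twelve", "one", "two", "three", "four", "five", "six",
              "seven", "eight", "nine", "ten", "eleven"]
MINUTE_PHRASES = {0: "", 5: "five past", 10: "ten past", 15: "quarter past",
                  20: "twenty past", 25: "twenty-five past", 30: "half past",
                  35: "twenty-five to", 40: "twenty to", 45: "quarter to",
                  50: "ten to", 55: "five to"}
# Assumed mapping from (actual minute - rounded minute) to a qualifier.
QUALIFIERS = {0: "exactly", 1: "just after", 2: "about", -1: "almost", -2: "about"}


def clock_prompt(hour, minute):
    """Fill the talking-clock template for a 24-hour time."""
    rounded = 5 * ((minute + 2) // 5)       # nearest multiple of five
    delta = minute - rounded
    if rounded == 60:                       # e.g. 11:58 rounds up to 12:00
        rounded, hour = 0, (hour + 1) % 24
    spoken_hour = (hour + 1) % 24 if rounded > 30 else hour
    part = "morning" if hour < 12 else "afternoon" if hour < 18 else "evening"
    pieces = (QUALIFIERS[delta], MINUTE_PHRASES[rounded], HOUR_WORDS[spoken_hour % 12])
    middle = " ".join(p for p in pieces if p)
    return f"The time is now, {middle}, in the {part}"


def prompt_words(prompt):
    """Distinct lowercase words in a prompt, commas removed."""
    return {w.strip(",").lower() for w in prompt.split()}


def missing_words(recorded_prompts):
    """Words the clock can say that never occur in the recorded prompts."""
    have = set().union(*(prompt_words(p) for p in recorded_prompts))
    need = set()
    for hour in range(24):
        for minute in range(60):
            need |= prompt_words(clock_prompt(hour, minute))
    return need - have
```

Running `missing_words` on the four example prompts alone reports words such as "afternoon" and "half", which is exactly the kind of gap that is easy to overlook when designing prompts by hand.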

slide-20
SLIDE 20

Record the prompts

  • Get highest quality recordings
      • Recording studio
      • Head-mounted mike
      • Repeatable conditions
  • Get signed permission
      • Explain what you are doing

slide-21
SLIDE 21

Label the data

  • Using HMM-based or DTW-based system
      • Find the phoneme segments
  • Simple cases (< 50 utterances)
      • Use DTW
      • Synthesize the prompts
      • Align synthesized prompts with actual prompts
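
The DTW step can be sketched as follows: the synthesized prompt has known phone boundaries, so aligning its frames with the recorded frames lets those boundaries be carried across. This is a textbook dynamic time warping sketch, not the labeling tool used in the course. The names `dtw_path` and `map_boundary` are hypothetical, and the default distance works on scalars; real use would pass a distance over feature vectors (for example Euclidean distance over cepstral frames).

```python
def dtw_path(synth, recorded, dist=lambda x, y: abs(x - y)):
    """Return (total cost, path) where path is a list of (synth_i, recorded_j)."""
    n, m = len(synth), len(recorded)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(synth[i - 1], recorded[j - 1])
            cost[i][j] = d + min(cost[i - 1][j - 1], cost[i - 1][j], cost[i][j - 1])
    # Backtrack from the end, always stepping to the cheapest predecessor.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min((cost[i - 1][j - 1], i - 1, j - 1),
                      (cost[i - 1][j], i - 1, j),
                      (cost[i][j - 1], i, j - 1))
    path.reverse()
    return cost[n][m], path


def map_boundary(path, synth_frame):
    """First recorded frame aligned with a given synthesized frame."""
    return min(j for i, j in path if i == synth_frame)


# Toy one-dimensional "features": the recorded version holds the middle
# sound longer than the synthesized one does.
synth = [0, 0, 1, 1, 2, 2]
recorded = [0, 1, 1, 1, 2]
total, path = dtw_path(synth, recorded)
```

Here a phone boundary at synthesized frame 2 maps to recorded frame 1 via `map_boundary(path, 2)`. When the alignment goes wrong, the transferred labels go wrong with it, which is why bad prompts have to be removed or relabeled later.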

slide-22
SLIDE 22

Automatic Labeling

slide-23
SLIDE 23

Automatic Labeling (bad)

slide-24
SLIDE 24

Parameterization

  • Extract pitch marks from data
      • Find voiced/unvoiced regions
      • Add “fake” pitch marks during unvoiced regions
  • Extract MFCC pitch synchronously
      • Instead of a fixed frame advance (e.g. 5ms)
      • Extract it at each pitch mark
      • Try to capture the spectrum at the pitch period
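
Adding "fake" pitch marks keeps analysis frames flowing through unvoiced stretches, where there is no real pitch period to mark. A minimal sketch, assuming pitch marks are given as times in milliseconds and that any gap longer than a threshold counts as unvoiced; the function name, the 20 ms threshold, and the 10 ms fake spacing are illustrative assumptions, not values from the slides.

```python
def add_fake_pitchmarks(marks_ms, end_ms, max_gap_ms=20.0, fake_period_ms=10.0):
    """Return a list of (time_ms, is_real) with evenly spaced fake marks
    inserted wherever consecutive real marks are more than max_gap_ms apart.
    The start (0 ms) and end of the utterance bound the first and last gaps.
    """
    out = []
    prev = 0.0
    for mark in list(marks_ms) + [None]:        # None stands for end of file
        target = end_ms if mark is None else mark
        gap = target - prev
        if gap > max_gap_ms:
            n = max(1, round(gap / fake_period_ms))   # sub-intervals in the gap
            step = gap / n
            out.extend((prev + k * step, False) for k in range(1, n))
        if mark is not None:
            out.append((mark, True))
        prev = target
    return out


marks = add_fake_pitchmarks([10, 20, 30, 80, 90], end_ms=100)
```

In this example the 50 ms gap between the marks at 30 and 80 is filled with fake marks at 40, 50, 60, and 70, so pitch-synchronous feature extraction can proceed at every mark, real or fake.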

slide-25
SLIDE 25

Pitchmarks

slide-26
SLIDE 26

Building a LDOM synthesizer

  • Build cluster tree on each unit type
      • Not just on phones
      • Tag phones with word they come from
      • d_limited and d_domain are treated as different
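
The word-tagging idea can be sketched as a naming scheme: each phone is renamed to phone_word, and units are then grouped by that name so that a separate cluster tree can be built per group. The phone strings and the helper names below are hypothetical and only illustrate the grouping, not the actual voice-building scripts.

```python
from collections import defaultdict


def unit_names(words_with_phones):
    """Turn [(word, [phone, ...]), ...] into a flat list of phone_word names."""
    return [f"{phone}_{word.lower()}"
            for word, phones in words_with_phones
            for phone in phones]


def group_by_unit_type(names):
    """Map each unit type to the positions where it occurs in the utterance."""
    groups = defaultdict(list)
    for position, name in enumerate(names):
        groups[name].append(position)
    return dict(groups)


utterance = [("limited", ["l", "ih", "m", "ih", "t", "ih", "d"]),
             ("domain", ["d", "ow", "m", "ey", "n"])]
names = unit_names(utterance)
groups = group_by_unit_type(names)
```

The /d/ at the end of "limited" and the /d/ at the start of "domain" end up in different groups (`d_limited` and `d_domain`), so a unit is only ever selected for the word it was recorded in.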

slide-27
SLIDE 27

Tuning and Testing

  • Test it on some real data
      • Ensure number/symbol expansions are correct
  • Prompts should probably be word expanded
      • Flight US187 -> flight u s one eight seven
  • Remove bad prompts
      • Or fix labels
  • Remember to keep access to the speaker
      • If you have to update the system, you need the same speaker available
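
Word expansion of prompts can be checked with a small script before recording. The sketch below spells out any token that contains a digit, character by character, which reproduces the flight-number example. Reading digits one at a time is an assumption that suits identifiers like flight numbers; quantities such as temperatures would need a different rule. The function name is hypothetical.

```python
DIGIT_WORDS = ["zero", "one", "two", "three", "four",
               "five", "six", "seven", "eight", "nine"]


def expand_prompt(text):
    """Lowercase a prompt and spell out tokens that contain digits."""
    out = []
    for token in text.split():
        if any(ch in "0123456789" for ch in token):
            # Identifier-like token: read letters and digits one by one.
            for ch in token:
                if ch in "0123456789":
                    out.append(DIGIT_WORDS[int(ch)])
                elif ch.isalpha():
                    out.append(ch.lower())
        else:
            out.append(token.lower())
    return " ".join(out)


expanded = expand_prompt("Flight US187")
```

Recording and labeling against the expanded form means the words in the database are the same words the synthesizer will be asked for at run time.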

slide-28
SLIDE 28

Summary

  • Unit selection vs Statistical Parametric Synthesis
      • US: can be excellent (but not always)
      • SPS: more robust
  • Building a voice
      • Database design, recording, labeling
      • Parameter extraction and model building
  • Limited domain synthesis

slide-29
SLIDE 29