Speech Processing 15-492/18-492: Speech Synthesis, Talking Heads

Aug 22, 2022

SLIDE 1

Speech Processing 15-492/18-492

Speech Synthesis

Talking Heads

Singing Synthesis

SLIDE 2

More Information is Better

Voice + text is easier to understand

Voice + face is easier too

SLIDE 3

Talking Heads

Adds novelty/character/personification

Experiments show better understanding

Lip synching

Facial movements

Listeners swear it's better synthesis

SLIDE 4

Talking heads

SLIDE 5

Talking Heads

Synthesize text

Output phone position in audio stream

Map phones to lip/tongue positions

Build visual stream

Choose appropriate frames

Aligned with audio

How many facial positions?
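
The steps on this slide can be sketched as a small pipeline: take phones with their start and end times from the synthesizer, map each phone to a mouth shape, and pick one shape per video frame so the picture stays aligned with the audio. This is a minimal sketch, not the method of any particular system. The `Phone` class, the `PHONE_TO_VISEME` table, the phone names, and the 25 fps frame rate are all assumptions made for illustration.

```python
from dataclasses import dataclass

# Hypothetical phone-to-mouth-shape table. The three classes follow a
# closed/open/rounded split; the phone assignments are illustrative only.
PHONE_TO_VISEME = {
    "pau": "closed", "m": "closed", "b": "closed", "p": "closed",
    "aa": "open", "ae": "open", "eh": "open", "l": "open",
    "ow": "rounded", "uw": "rounded", "w": "rounded",
}


@dataclass
class Phone:
    name: str
    start: float  # seconds from the start of the audio
    end: float    # seconds from the start of the audio


def viseme_track(phones, fps=25, default="open"):
    """Return one mouth-shape label per video frame, aligned to the audio."""
    if not phones:
        return []
    n_frames = round(phones[-1].end * fps)
    frames = []
    i = 0
    for f in range(n_frames):
        t = (f + 0.5) / fps  # time at the centre of this frame
        # Advance to the phone that is sounding at time t.
        while i < len(phones) - 1 and t >= phones[i].end:
            i += 1
        frames.append(PHONE_TO_VISEME.get(phones[i].name, default))
    return frames


# Assumed phone timings for a short utterance.
phones = [Phone("pau", 0.0, 0.08), Phone("m", 0.08, 0.16), Phone("ow", 0.16, 0.40)]
track = viseme_track(phones)
```

Sampling at the frame centre keeps the choice of frame stable when a phone boundary falls close to a frame edge.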

SLIDE 6

Visemes

Baphy

Three positions

Closed, open and rounded

Rho

10 lip positions

Eyelid 4

Eyes 2

When should they align?

Follow trajectories, not just at time instant

Shape for syllables, not just phones
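
One way to read "follow trajectories" is that the mouth should move smoothly between target shapes instead of jumping at each phone boundary. The sketch below interpolates a single lip-opening value linearly between keyframes. The function name, the 0-to-1 opening scale, and the keyframe times are assumptions for illustration; a real system would drive more parameters and might use a smoother curve.

```python
def lip_opening(t, keyframes):
    """Linearly interpolate lip opening (0 = closed, 1 = fully open) at time t.

    keyframes is a list of (time_in_seconds, opening) pairs sorted by time.
    """
    if t <= keyframes[0][0]:
        return keyframes[0][1]
    for (t0, v0), (t1, v1) in zip(keyframes, keyframes[1:]):
        if t <= t1:
            return v0 + (v1 - v0) * (t - t0) / (t1 - t0)
    return keyframes[-1][1]


# Assumed targets: closed, then open at 0.2 s, then closed again at 0.4 s.
keys = [(0.0, 0.0), (0.2, 1.0), (0.4, 0.0)]
halfway_up = lip_opening(0.1, keys)
halfway_down = lip_opening(0.3, keys)
```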

SLIDE 7

Synthesis Analogies

Articulatory Synthesis

Modeling the vocal tract

Baldi: movement of muscles

Formant:

Modeling of signal synthetically

Cartoon-based faces (Baphy)

Concatenative

Joining natural segments

JPL example

Interval's Video Rewrite

Unit size

Baphy == uniphone

JPL == diphone

Video Rewrite == unit selection
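
The difference between uniphone and diphone units can be shown on a phone sequence: a uniphone inventory has one unit per phone, while a diphone inventory has one unit per transition between adjacent phones. The sketch below is illustrative only. The `to_diphones` helper, the `pau` silence symbol, and the phone names are assumptions, and it says nothing about how any of the systems named above store their units.

```python
def to_diphones(phones, silence="pau"):
    """Turn a phone sequence into diphone unit names, padded with silence."""
    padded = [silence] + list(phones) + [silence]
    return [f"{a}-{b}" for a, b in zip(padded, padded[1:])]


phones = ["h", "eh", "l", "ow"]
uniphone_units = list(phones)        # one unit per phone
diphone_units = to_diphones(phones)  # one unit per phone-to-phone transition
```

A sequence of n phones needs n uniphone units but n + 1 diphone units, because each diphone covers the join between two neighbours, including the silence at either end.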

SLIDE 8

Talking Heads

Personalization:

Can look like a mask put on a dummy

Uncanny valley

The more human-like, the more critical we are

3-D movement (in real time)

Second Life-type characters

Gesture generation too

Off-line (Gollum, Jabba the Hutt)

Usually actors do the voices

SLIDE 9

Singing Synthesis

Simple pitch and duration control

But singing is more than that

Proper singing synthesis

Recording a singing database

Phonetic, prosodic, and singing style coverage

Sung rather than spoken voice

SLIDE 10

Flinger (Festival Singer) (Macon)

Sinusoidal modeling

More pitch control than just PSOLA

MIDI interface

Allow mixing with music

Standard MIDI authoring techniques

SLIDE 11

Festival Singing Mode

Dominic Mazzoni (11-752 project 2001)

XML-based song description

<DURATION BEATS="1.0">
<PITCH NOTE="C4">Oh</PITCH>
</DURATION>

But not just setting pitch at duration point

When do you move it (based on syllable and voicing)?

How quickly do you move pitch?
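
The question of how quickly to move pitch can be made concrete with a small sketch: convert note names such as `C4` to a frequency, then glide from the previous note to the next over a chosen time instead of stepping instantly. This is not how Festival's singing mode is implemented. The function names, the equal-temperament tuning with A4 = 440 Hz, and the log-scale glide are assumptions, and the sketch ignores the syllable and voicing question of when the move should start.

```python
# Semitone offsets of the natural notes within an octave.
SEMITONES = {"C": 0, "D": 2, "E": 4, "F": 5, "G": 7, "A": 9, "B": 11}


def note_to_hz(note):
    """Convert a name such as 'C4' or 'F#4' to Hz (equal temperament, A4 = 440)."""
    letter, rest = note[0], note[1:]
    offset = SEMITONES[letter]
    if rest.startswith("#"):
        offset, rest = offset + 1, rest[1:]
    elif rest.startswith("b"):
        offset, rest = offset - 1, rest[1:]
    midi = 12 * (int(rest) + 1) + offset  # MIDI note number, C4 = 60
    return 440.0 * 2 ** ((midi - 69) / 12)


def glide(f_from, f_to, glide_s, t):
    """F0 in Hz at t seconds after the note boundary.

    The pitch moves from f_from to f_to over glide_s seconds, evenly on a
    log-frequency (musical) scale, then holds f_to.
    """
    if glide_s <= 0 or t >= glide_s:
        return f_to
    if t <= 0:
        return f_from
    return f_from * (f_to / f_from) ** (t / glide_s)


# Assumed example: move from A3 to A4 in 100 ms, sampled at the midpoint.
midpoint_hz = glide(note_to_hz("A3"), note_to_hz("A4"), 0.1, 0.05)
```

A shorter `glide_s` gives a crisper, more instrument-like jump; a longer one gives a slower slide between notes.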

SLIDE 12

Singing Example

<?xml version="1.0"?>
<!DOCTYPE SINGING PUBLIC "-//SINGING//DTD SINGING mark up//EN" "Singing.v0_1.dtd" []>
<SINGING BPM="30">
<PITCH NOTE="G3"><DURATION BEATS="0.3">doe</DURATION></PITCH>
<PITCH NOTE="A3"><DURATION BEATS="0.3">ray</DURATION></PITCH>
<PITCH NOTE="B3"><DURATION BEATS="0.3">me</DURATION></PITCH>
<PITCH NOTE="C4"><DURATION BEATS="0.3">fah</DURATION></PITCH>
<PITCH NOTE="D4"><DURATION BEATS="0.3">sew</DURATION></PITCH>
<PITCH NOTE="E4"><DURATION BEATS="0.3">lah</DURATION></PITCH>
<PITCH NOTE="F#4"><DURATION BEATS="0.3">tee</DURATION></PITCH>
<PITCH NOTE="G4"><DURATION BEATS="0.3">doe</DURATION></PITCH>
</SINGING>
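
To see what the markup encodes, a song in this format can be read with Python's standard `xml.etree.ElementTree` and turned into a list of (syllable, note, seconds) events, using seconds = beats × 60 / BPM. This is a sketch for inspecting the data, not Festival's own parser. The `read_song` helper is hypothetical, the DOCTYPE line is left out so that no DTD file is needed, only the first three notes are included, and only the nesting used in this example (DURATION inside PITCH) is handled.

```python
import xml.etree.ElementTree as ET

# First three notes of the example song, without the DOCTYPE declaration.
SONG = """<SINGING BPM="30">
<PITCH NOTE="G3"><DURATION BEATS="0.3">doe</DURATION></PITCH>
<PITCH NOTE="A3"><DURATION BEATS="0.3">ray</DURATION></PITCH>
<PITCH NOTE="B3"><DURATION BEATS="0.3">me</DURATION></PITCH>
</SINGING>"""


def read_song(xml_text):
    """Return a list of (syllable, note_name, duration_in_seconds) tuples."""
    root = ET.fromstring(xml_text)
    bpm = float(root.get("BPM"))
    events = []
    for pitch in root.iter("PITCH"):
        note = pitch.get("NOTE")
        for dur in pitch.iter("DURATION"):
            seconds = float(dur.get("BEATS")) * 60.0 / bpm
            events.append(((dur.text or "").strip(), note, seconds))
    return events


events = read_song(SONG)
```

At 30 beats per minute one beat lasts two seconds, so each 0.3-beat syllable is held for about 0.6 seconds.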

SLIDE 13

Future in TTS

More natural voices

Sound human

Interact in a human way (not just words)

More personalization

Sound like a particular person

Cross-lingual synthesis

More flexible

Say it with more feeling

Real-time voice transformation

Have an American accent while you speak

SLIDE 14

Text-to-speech process

Text analysis

From characters to words

Linguistic analysis

From words to pronunciations

Waveform analysis

From pronunciations to noises
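
The first two stages can be illustrated with a toy example: text analysis turns raw characters into speakable words (expanding abbreviations and digits), and linguistic analysis looks each word up in a pronunciation lexicon. This is a deliberately tiny sketch and not Festival's implementation. The abbreviation table, the lexicon entries, the phone symbols, and the letter-by-letter fallback are all assumptions, and the final stage (pronunciations to sound) is left out.

```python
import re

# Illustrative tables; a real system uses far larger resources.
ABBREVIATIONS = {"dr.": "doctor", "st.": "street"}
DIGITS = ["zero", "one", "two", "three", "four",
          "five", "six", "seven", "eight", "nine"]
LEXICON = {
    "doctor": ["d", "aa", "k", "t", "er"],
    "who": ["hh", "uw"],
    "four": ["f", "ao", "r"],
    "two": ["t", "uw"],
}


def text_analysis(text):
    """From characters to words: expand abbreviations and read digits out."""
    words = []
    for token in text.lower().split():
        if token in ABBREVIATIONS:
            words.append(ABBREVIATIONS[token])
        elif token.isdecimal():
            words.extend(DIGITS[int(d)] for d in token)
        else:
            cleaned = re.sub(r"[^a-z']", "", token)
            if cleaned:
                words.append(cleaned)
    return words


def linguistic_analysis(words):
    """From words to pronunciations: lexicon lookup with a crude fallback."""
    phones = []
    for word in words:
        # Unknown words fall back to their letters, a stand-in for
        # proper letter-to-sound rules.
        phones.extend(LEXICON.get(word, list(word)))
    return phones


words = text_analysis("Dr. Who 42")
phones = linguistic_analysis(words)
```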

SLIDE 15

SLIDE 16

SLIDE 17

HW2: TTS

Due 3:30pm Monday October 20th

Install Festival and Festvox

Find 10 errors in each of two different synthesizers

Build a voice

A Talking Clock

A general voice

(or both)

SLIDE 18