Speech Processing 15-492/18-492 Speech Synthesis Talking heads



slide-1
SLIDE 1

Speech Processing 15-492/18-492

Speech Synthesis Talking heads Singing Synthesis

slide-2
SLIDE 2

More Information is Better

  • Voice + text is easier to understand
  • Voice + face is easier too

slide-3
SLIDE 3

Talking Heads

  • Adds novelty/character/personification
  • Experiments show better understanding
  • Lip synching
  • Facial movements
  • Listeners swear it's better synthesis

slide-4
SLIDE 4

Talking heads

slide-5
SLIDE 5

Talking Heads

  • Synthesize text
  • Output phone position in audio stream
  • Map phones to lip/tongue positions
  • Build visual stream
  • Choose appropriate frames
  • Aligned with audio
  • How many facial positions
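The steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the phone-to-viseme table, the three viseme labels, the 25 fps frame rate, and the function name `visemes_for_frames` are all hypothetical, and the phone timings are made up rather than taken from a real synthesizer.

```python
# Hypothetical phone-to-viseme table; real systems use larger, tuned inventories.
PHONE_TO_VISEME = {
    "pau": "closed", "p": "closed", "b": "closed", "m": "closed",
    "aa": "open", "ae": "open", "eh": "open", "iy": "open",
    "uw": "rounded", "ow": "rounded", "w": "rounded",
}


def visemes_for_frames(phone_segments, fps=25, default="open"):
    """Pick one viseme label per video frame.

    phone_segments: list of (phone, start_sec, end_sec), sorted by time and
    contiguous, as a synthesizer might report alongside its audio.
    """
    if not phone_segments:
        return []
    total = phone_segments[-1][2]
    n_frames = round(total * fps)
    frames = []
    seg = 0
    for i in range(n_frames):
        t = (i + 0.5) / fps  # sample at the middle of each video frame
        # Advance to the phone segment that contains time t.
        while seg < len(phone_segments) - 1 and t >= phone_segments[seg][2]:
            seg += 1
        phone = phone_segments[seg][0]
        frames.append(PHONE_TO_VISEME.get(phone, default))
    return frames


# Made-up timings for illustration.
segments = [("pau", 0.00, 0.08), ("m", 0.08, 0.16),
            ("aa", 0.16, 0.32), ("uw", 0.32, 0.40)]
frames = visemes_for_frames(segments, fps=25)
```

Because each frame is chosen from the phone timings, the visual stream stays aligned with the audio by construction. The size of the viseme inventory is the open question the slide ends on.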

slide-6
SLIDE 6

Visemes

  • Baphy: three positions
  • Closed, open and rounded
  • Rho
  • 10 lip positions
  • Eyelid: 4
  • Eyes: 2
  • When should they align?
  • Follow trajectories, not just at time instant
  • Shape for syllables, not just phones
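One way to read "follow trajectories, not just at time instant" is to treat each viseme as a target and interpolate between targets, rather than switching shapes at phone boundaries. The sketch below assumes a hypothetical two-parameter lip model (opening, rounding) and linear interpolation. The target values and the name `lip_trajectory` are invented for illustration.

```python
# Hypothetical targets: (mouth opening, lip rounding), each in 0..1.
VISEME_TARGETS = {
    "closed": (0.0, 0.0),
    "open": (1.0, 0.0),
    "rounded": (0.4, 1.0),
}


def lip_trajectory(keyframes, fps=25):
    """Linearly interpolate lip parameters between timed viseme targets.

    keyframes: list of (time_sec, viseme) with strictly increasing times,
    e.g. one per phone midpoint. Returns one (opening, rounding) tuple
    per video frame.
    """
    if len(keyframes) < 2:
        raise ValueError("need at least two keyframes")
    start, end = keyframes[0][0], keyframes[-1][0]
    n_frames = int(round((end - start) * fps)) + 1
    out = []
    k = 0
    for i in range(n_frames):
        t = min(start + i / fps, end)
        # Move to the pair of keyframes that brackets time t.
        while k < len(keyframes) - 2 and t > keyframes[k + 1][0]:
            k += 1
        (t0, v0), (t1, v1) = keyframes[k], keyframes[k + 1]
        w = (t - t0) / (t1 - t0)
        a, b = VISEME_TARGETS[v0], VISEME_TARGETS[v1]
        out.append(tuple((1 - w) * x + w * y for x, y in zip(a, b)))
    return out


trajectory = lip_trajectory(
    [(0.0, "closed"), (0.2, "open"), (0.4, "rounded")], fps=10)
```

Placing keyframes per syllable instead of per phone would use the same interpolation with fewer, broader targets.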

slide-7
SLIDE 7

Synthesis Analogies

  • Articulatory Synthesis
  • Modeling the vocal tract
  • Baldi: movement of muscles
  • Formant:
  • Modeling of signal synthetically
  • Cartoon based faces (Baphy)
  • Concatenative
  • Joining natural segments
  • JPL example
  • Interval’s Video Rewrite
  • Unit size
  • Baphy == uniphone
  • JPL == diphone
  • Video Rewrite == unit selection

slide-8
SLIDE 8

Talking Heads

  • Personalization:
  • Can look like a mask put on a dummy
  • Uncanny valley
  • The more human like, the more critical we are
  • 3-D movement (in real time)
  • Second-life type characters
  • Gesture generation too
  • Off-line
  • (Gollum, Jabba the Hutt)
  • Usually actors do the voices

slide-9
SLIDE 9

Singing Synthesis

  • Simple pitch and duration control
  • But singing is more than that
  • Proper singing synthesis
  • Recording a singing database
  • Phonetic, prosodic, and singing style coverage
  • Sung rather than spoken voice

slide-10
SLIDE 10

Flinger (Festival Singer) (Macon)

  • Sinusoidal modeling
  • More pitch control than just PSOLA
  • MIDI interface
  • Allow mixing with music
  • Standard MIDI authoring techniques
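To make the link between MIDI notes and sinusoidal modeling concrete, here is a rough sketch. It is not Flinger's implementation. It only shows the standard equal-tempered mapping from a MIDI note number to a frequency, and a toy additive tone whose partials all move when the fundamental changes. The function names, the three-harmonic amplitude list, and the 16 kHz sample rate are assumptions for illustration.

```python
import math


def midi_to_hz(note_number):
    """Equal-tempered frequency for a MIDI note number (A4 = 69 = 440 Hz)."""
    return 440.0 * 2.0 ** ((note_number - 69) / 12.0)


def harmonic_tone(f0, amplitudes, duration=0.5, sample_rate=16000):
    """Sum of sinusoids at integer multiples of f0.

    amplitudes[k] scales harmonic k+1. Changing f0 moves every partial
    while the amplitude list stays the same.
    """
    n = int(duration * sample_rate)
    samples = []
    for i in range(n):
        t = i / sample_rate
        s = sum(a * math.sin(2 * math.pi * (k + 1) * f0 * t)
                for k, a in enumerate(amplitudes))
        samples.append(s)
    return samples


c4 = midi_to_hz(60)  # middle C, about 261.63 Hz
tone = harmonic_tone(c4, [1.0, 0.5, 0.25], duration=0.1)
```

In a representation like this, pitch is an explicit parameter of each partial, which is one reason a sinusoidal model can offer finer pitch control than shifting and overlapping waveform pieces.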

slide-11
SLIDE 11

Festival Singing Mode

  • Dominic Mazzoni (11-752 project 2001)
  • XML based song description
  • <DURATION BEATS="1.0">
  • <PITCH NOTE="C4">Oh</PITCH>
  • </DURATION>
  • But not just setting pitch at duration point
  • When do you move it (based on syllable and voicing)
  • How quickly do you move pitch
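The "how quickly" question can be illustrated with a small F0 contour generator. This sketch assumes each note glides linearly from the previous pitch over a fixed time and then holds its target. The glide starts at the note onset, which sidesteps the "when" question: a real system would place the movement according to the syllable and its voiced region. The name `f0_contour` and the values of `glide_sec` and `step_sec` are illustrative, not Festival's.

```python
def f0_contour(notes, glide_sec=0.05, step_sec=0.01):
    """Sample an F0 contour (Hz) for a note sequence.

    notes: list of (freq_hz, duration_sec). Each note after the first
    starts by gliding from the previous note's pitch over glide_sec,
    then holds its target pitch.
    """
    contour = []
    prev = None
    for freq, dur in notes:
        n = round(dur / step_sec)
        for i in range(n):
            t = i * step_sec
            if prev is not None and t < glide_sec:
                w = t / glide_sec  # 0 at note onset, approaching 1
                contour.append(prev + w * (freq - prev))
            else:
                contour.append(freq)
        prev = freq
    return contour


# Two notes of 0.2 s each, roughly G3 then A3.
contour = f0_contour([(196.0, 0.2), (220.0, 0.2)],
                     glide_sec=0.05, step_sec=0.01)
```

Setting `glide_sec` to zero gives the naive "set pitch at duration point" behavior. Larger values give a slower, more portamento-like movement.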

slide-12
SLIDE 12

Singing Example

<?xml version="1.0"?>
<!DOCTYPE SINGING PUBLIC "-//SINGING//DTD SINGING mark up//EN" "Singing.v0_1.dtd" []>
<SINGING BPM="30">
<PITCH NOTE="G3"><DURATION BEATS="0.3">doe</DURATION></PITCH>
<PITCH NOTE="A3"><DURATION BEATS="0.3">ray</DURATION></PITCH>
<PITCH NOTE="B3"><DURATION BEATS="0.3">me</DURATION></PITCH>
<PITCH NOTE="C4"><DURATION BEATS="0.3">fah</DURATION></PITCH>
<PITCH NOTE="D4"><DURATION BEATS="0.3">sew</DURATION></PITCH>
<PITCH NOTE="E4"><DURATION BEATS="0.3">lah</DURATION></PITCH>
<PITCH NOTE="F#4"><DURATION BEATS="0.3">tee</DURATION></PITCH>
<PITCH NOTE="G4"><DURATION BEATS="0.3">doe</DURATION></PITCH>
</SINGING>
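As a rough illustration of what a reader of this markup has to compute, the sketch below parses a shortened copy of the song with Python's standard `xml.etree.ElementTree` and turns each note into a word, a frequency, and a duration in seconds. It is not Festival's parser. It assumes scientific pitch notation with A4 = 440 Hz, assumes `PITCH` wraps `DURATION` as in this example, and leaves out the XML declaration and DOCTYPE for brevity. The helper names `note_to_hz` and `read_song` are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

SEMITONES = {"C": 0, "D": 2, "E": 4, "F": 5, "G": 7, "A": 9, "B": 11}


def note_to_hz(name):
    """Convert a note name such as 'C4' or 'F#4' to Hz (assumes A4 = 440 Hz)."""
    m = re.fullmatch(r"([A-G])([#b]?)(-?\d+)", name)
    if m is None:
        raise ValueError(f"unrecognized note name: {name!r}")
    letter, accidental, octave = m.groups()
    semitone = SEMITONES[letter] + {"#": 1, "b": -1, "": 0}[accidental]
    midi = 12 * (int(octave) + 1) + semitone  # C4 maps to MIDI note 60
    return 440.0 * 2.0 ** ((midi - 69) / 12.0)


def read_song(xml_text):
    """Return (word, freq_hz, seconds) for each PITCH element."""
    root = ET.fromstring(xml_text)
    sec_per_beat = 60.0 / float(root.get("BPM"))
    events = []
    for pitch in root.iter("PITCH"):
        dur = pitch.find("DURATION")
        seconds = float(dur.get("BEATS")) * sec_per_beat
        events.append((dur.text.strip(), note_to_hz(pitch.get("NOTE")), seconds))
    return events


# Shortened copy of the song, for illustration.
SONG = """<SINGING BPM="30">
<PITCH NOTE="G3"><DURATION BEATS="0.3">doe</DURATION></PITCH>
<PITCH NOTE="F#4"><DURATION BEATS="0.3">tee</DURATION></PITCH>
</SINGING>"""
events = read_song(SONG)
```

At 30 beats per minute each beat lasts two seconds, so a note of 0.3 beats lasts 0.6 seconds.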

slide-13
SLIDE 13

Future in TTS

  • More natural voices
  • Sound human
  • Interact in a human way (not just words)
  • More personalization
  • Sound like a particular person
  • Cross lingual synthesis
  • More flexible
  • Say it with more feeling
  • Realtime voice transformation
  • Have an American accent while you speak

slide-14
SLIDE 14

Text to speech process

  • Text analysis
  • From characters to words
  • Linguistic analysis
  • From words to pronunciations
  • Waveform analysis
  • From pronunciations to noises
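The first two stages can be sketched with toy data. This is only an illustration of how the stages hand off to each other: the number table, the four-word lexicon, the phone symbols, and both function names are invented, and the waveform stage is left out entirely.

```python
# Toy tables for illustration only; real systems use large lexicons and rules.
NUMBER_WORDS = {"1": "one", "2": "two", "3": "three"}
LEXICON = {
    "at": ["ae", "t"],
    "one": ["w", "ah", "n"],
    "two": ["t", "uw"],
    "three": ["th", "r", "iy"],
}


def text_analysis(text):
    """Characters to words: lowercase, strip punctuation, expand digits."""
    words = []
    for token in text.lower().split():
        token = token.strip(".,!?")
        if token:
            words.append(NUMBER_WORDS.get(token, token))
    return words


def linguistic_analysis(words):
    """Words to pronunciations: lexicon lookup, None for unknown words."""
    return [(w, LEXICON.get(w)) for w in words]


words = text_analysis("At 3.")
pronunciations = linguistic_analysis(words)
```

Words that come back with `None` are the ones a real system would pass to letter-to-sound rules before waveform generation.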

slide-15
SLIDE 15
slide-16
SLIDE 16
slide-17
SLIDE 17

HW2: TTS

  • Due 3:30pm Monday October 20th
  • Install Festival and Festvox
  • Find 10 errors in each of two different synthesizers
  • Build a voice
  • A Talking Clock
  • A general voice
  • (or both)

slide-18
SLIDE 18