S2S ASR Advanced Issues: Tight Coupling - PowerPoint PPT Presentation


Dec 24, 2023

SLIDE 1

S2S ASR Advanced Issues

Tight coupling

ASR should output N-best

Translate all (lattice)

Choose best translation

(MT as a LM for ASR)

Remove disfluencies/hesitations

Add more relevant data

Automatically convert past tense/third person data to present tense/first+second person …
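The tight-coupling idea above (translate every ASR hypothesis, then choose the best translation) can be sketched as a small rescoring loop. This is only a minimal illustration: the function names, the toy translation table, the log-probability scores, and the `mt_weight` parameter are all hypothetical and not taken from the slides.

```python
from typing import Callable, List, Tuple


def pick_best_translation(
    nbest: List[Tuple[str, float]],
    translate: Callable[[str], Tuple[str, float]],
    mt_weight: float = 1.0,
) -> Tuple[str, str, float]:
    """Return (asr_hypothesis, translation, combined_score) with the best score.

    nbest holds (hypothesis, ASR log-probability) pairs. translate returns
    (translation, MT log-probability), so the MT score acts like a language
    model that re-ranks the ASR hypotheses.
    """
    best = None
    for hyp, asr_logp in nbest:
        translation, mt_logp = translate(hyp)
        combined = asr_logp + mt_weight * mt_logp
        if best is None or combined > best[2]:
            best = (hyp, translation, combined)
    if best is None:
        raise ValueError("empty N-best list")
    return best


def toy_translate(hyp: str) -> Tuple[str, float]:
    """Hypothetical stand-in for an MT system: a lookup table with scores."""
    table = {
        "where is the bus": ("wo ist der Bus", -1.0),
        "wear is the bus": ("tragen ist der Bus", -6.0),
    }
    return table.get(hyp, (hyp, -10.0))


# The ASR prefers the wrong hypothesis, but the MT score overrules it.
nbest = [("wear is the bus", -2.0), ("where is the bus", -2.5)]
best_hyp, best_translation, best_score = pick_best_translation(nbest, toy_translate)
```

In this toy case the second-ranked ASR hypothesis wins because its translation scores much better, which is the point of passing an N-best list (or lattice) rather than only the 1-best output.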

SLIDE 2

S2S TTS Advanced Issues

MT output isn’t grammatical

TTS doesn’t care and just says it

TTS should try to say MT output with more breaks.

TTS (unit selection)

As a LM on MT output

Choose the best translation based on what is said best

SLIDE 3

Speech Processing 15-492/18-492

Voice Conversion

SLIDE 4

Voice Conversion

Live (or offline)

Convert an existing voice to another

Use only a small amount of target speech

Uses:

Synthesis without collecting lots of data

Disguising voices

Emotional voices without full synthesis support

Also called

Voice transformation, Voice morphing

SLIDE 5

Voice Identity

What makes a voice identity?

Lexical choice:

  Woo-hoo,

  I pity the fool …

Phonetic choice

Intonation and duration

Spectral qualities (vocal tract shape)

Excitation

SLIDE 6

Voice Conversion techniques

Full ASR and TTS

Much too hard to do reliably

Codebook transformation

ASR HMM state to HMM state transformation

GMM based transformation

Build a mapping function between frames
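As an illustration of a GMM-based frame mapping, the sketch below applies the commonly used joint-density conversion rule to a single scalar feature: each mixture component contributes its own linear regression from source to target, weighted by the posterior probability of that component given the source frame. The component parameters are made up for the example, the names `Component` and `gmm_convert` are hypothetical, and a real system would use multi-dimensional spectral features and parameters trained from aligned data.

```python
import math
from typing import List, NamedTuple


class Component(NamedTuple):
    """One mixture component of a joint source/target GMM (scalar features)."""
    weight: float
    mean_x: float   # source mean
    mean_y: float   # target mean
    var_x: float    # source variance
    cov_xy: float   # source/target covariance


def gaussian_pdf(x: float, mean: float, var: float) -> float:
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def gmm_convert(x: float, components: List[Component]) -> float:
    """Map one source frame x to the expected target frame E[y | x]."""
    likelihoods = [c.weight * gaussian_pdf(x, c.mean_x, c.var_x) for c in components]
    total = sum(likelihoods)
    if total == 0.0:
        raise ValueError("source frame has zero likelihood under the model")
    converted = 0.0
    for c, lik in zip(components, likelihoods):
        posterior = lik / total
        # Per-component linear regression from source to target.
        converted += posterior * (c.mean_y + c.cov_xy / c.var_x * (x - c.mean_x))
    return converted


# Hypothetical two-component model.
model = [
    Component(weight=0.5, mean_x=0.0, mean_y=10.0, var_x=1.0, cov_xy=0.5),
    Component(weight=0.5, mean_x=10.0, mean_y=0.0, var_x=1.0, cov_xy=0.5),
]
near_first = gmm_convert(0.0, model)   # dominated by the first component
halfway = gmm_convert(5.0, model)      # both components contribute equally
```

Because the posteriors vary smoothly with the input, the mapping blends between the per-component regressions instead of switching abruptly between them.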

SLIDE 7

Learning VC models

First need to get parallel speech

Source and Target say same thing

Use DTW to align (in the spectral domain)

Trying to learn a functional mapping

20-50 utterances

“Text-independent” VC

Means no parallel speech available

Use some form of synthesis to generate it
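The DTW alignment step for parallel utterances can be sketched as follows. This is a minimal textbook dynamic time warping over lists of feature vectors with a Euclidean frame distance; the function names and the toy one-dimensional "frames" are illustrative assumptions, and a real system would align spectral features such as MFCCs.

```python
import math
from typing import List, Sequence, Tuple


def frame_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two feature frames."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw_align(
    source: List[Sequence[float]], target: List[Sequence[float]]
) -> Tuple[float, List[Tuple[int, int]]]:
    """Return (total_cost, path), where path lists aligned (source, target) frame indices."""
    n, m = len(source), len(target)
    if n == 0 or m == 0:
        raise ValueError("both utterances must contain at least one frame")
    inf = float("inf")
    # cost[i][j] = best cost of aligning the first i source and first j target frames.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_distance(source[i - 1], target[j - 1])
            cost[i][j] = d + min(cost[i - 1][j - 1], cost[i - 1][j], cost[i][j - 1])
    # Backtrack from the end to recover the warping path.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min(
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        )
    path.reverse()
    return cost[n][m], path


# The target holds its second frame for twice as long as the source.
src = [[0.0], [1.0], [2.0]]
tgt = [[0.0], [1.0], [1.0], [2.0]]
total, path = dtw_align(src, tgt)
```

The resulting path pairs each source frame with one or more target frames, and those pairs are the training examples for the source-to-target mapping.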

SLIDE 8

VC Training process

Extract F0, power and MFCC from source and target utterances

DTW align source and target

Loop until convergence

Build GMM to map between source/target

DTW source/target using GMM mapping
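The align, train, re-align loop above can be sketched in a few lines. To keep the example short and dependency-free, a one-dimensional least-squares line stands in for the GMM, and scalar values stand in for MFCC frames; both are simplifying assumptions, as are all function names and the stopping rule (stop when the alignment no longer changes).

```python
from typing import List, Tuple

Path = List[Tuple[int, int]]


def dtw_path(a: List[float], b: List[float]) -> Path:
    """Align two scalar sequences with DTW and return the (i, j) frame pairs."""
    n, m = len(a), len(b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best_prev = min(cost[i - 1][j - 1], cost[i - 1][j], cost[i][j - 1])
            cost[i][j] = abs(a[i - 1] - b[j - 1]) + best_prev
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min(
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        )
    path.reverse()
    return path


def fit_linear(xs: List[float], ys: List[float]) -> Tuple[float, float]:
    """Least-squares line y = slope * x + intercept (stand-in for GMM training)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0.0:
        return 1.0, mean_y - mean_x
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def train_mapping(source: List[float], target: List[float], max_iters: int = 10):
    """Alternate DTW alignment and mapping estimation until the path stops changing."""
    if not source or not target:
        raise ValueError("both utterances must be non-empty")
    slope, intercept, path = 1.0, 0.0, None
    mapped = list(source)
    for _ in range(max_iters):
        new_path = dtw_path(mapped, target)  # align using the current mapping
        if new_path == path:
            break  # alignment converged
        path = new_path
        xs = [source[i] for i, _ in path]
        ys = [target[j] for _, j in path]
        slope, intercept = fit_linear(xs, ys)
        mapped = [slope * x + intercept for x in source]
    return slope, intercept, path


# Toy data: the target is 2 * source + 10, with one frame held twice as long.
slope, intercept, path = train_mapping([0.0, 1.0, 2.0, 3.0], [10.0, 12.0, 12.0, 14.0, 16.0])
```

On this toy input the first alignment is poor because the raw source and target values are far apart; after one mapping update the second alignment pairs the frames correctly and the fit recovers the underlying line.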

SLIDE 9

VC Training process

SLIDE 10

VC Run-time

SLIDE 11

Voice Transformation

Festvox GMM transformation suite (Toda)

awb bdl jmk slt

awb bdl jmk slt

SLIDE 12

VC in Synthesis

Can be used as a post filter in synthesis

Build kal_diphone to target VC

Use on all output of kal_diphone

Can be used to convert a full DB

Convert a full DB and rebuild a voice

SLIDE 13

Style/Emotion Conversion

Unit Selection (or SPS)

Require lots of data in desired style/emotion

VC technique

Use as filter to main voice (same speaker)

Convert neutral to angry, sad, happy …

SLIDE 14

Can you say that again?

Voice conversion for speaking in noise

Different quality when you repeat things

Different quality when you speak in noise

Lombard effect (when very loud)

“Speech-in-noise” in regular noise

SLIDE 15

Speaking in Noise (Langner)

Collect data

Randomly play noise in person’s ears

Normal

In Noise

Collect 500 of each type

Build VC model

Normal -> in-Noise

Actually

Spectral, duration, F0 and power differences

SLIDE 16

Synthesis in Noise

For bus information task

Play different synthesis information utterances

With SIN synthesizer

With SWN synthesizer

With VC (SWN->SIN) synthesizer

Measure their understanding

SIN synthesizer better (in Noise)

SIN synthesizer better (without Noise for elderly)

SLIDE 17

Transterpolation

Incrementally transform a voice X%

BDL-SLT by 10%

SLT-BDL by 10%

Count when you think it changes from M-F

Fun but what are the uses …
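One plausible reading of "transform a voice X%" is a linear interpolation, in feature space, between the original frame and its fully converted counterpart. The sketch below assumes exactly that. The slides do not specify how the partial transformation is computed, so the function name, the percent parameter, and the interpolation rule are all assumptions.

```python
from typing import List


def transterpolate(
    source_frame: List[float], converted_frame: List[float], percent: float
) -> List[float]:
    """Move a source frame `percent` of the way toward its converted version."""
    if len(source_frame) != len(converted_frame):
        raise ValueError("frames must have the same dimensionality")
    if not 0.0 <= percent <= 100.0:
        raise ValueError("percent must be between 0 and 100")
    alpha = percent / 100.0
    return [s + alpha * (c - s) for s, c in zip(source_frame, converted_frame)]


# Hypothetical two-dimensional frames: the original and its fully converted form.
original = [1.0, 2.0]
fully_converted = [3.0, 6.0]
steps = [transterpolate(original, fully_converted, p) for p in range(0, 101, 10)]
halfway = transterpolate(original, fully_converted, 50.0)
```

Stepping `percent` from 0 to 100 in increments of 10 produces a series of intermediate voices like the one used in the listening test, where listeners judge at which step the perceived gender changes.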

SLIDE 18

De-identification

Remove speaker identity

But keep it still human like

Health Records

HIPAA laws require this

Not just removing names and SSNs

Use Voice conversion to get “new” voices

SLIDE 19

VC and SPS

Becoming closely related

Small amount of target speaker

Use larger background models

SLIDE 20

Cross Lingual Voice Conversion

Use phonetic mapping synthesis

Sounds like very accented speech

Use VC to convert the output

Require only small amount of target language

SLIDE 21