SLIDE 1

Speech Processing 15-492/18-492

Speaker ID

SLIDE 2

Who is speaking?

  • Speaker ID, Speaker Recognition

  • When do you use it?

  • Security, access

  • Speaker-specific modeling
    Recognize the speaker and use their options

  • Diarization
    In multi-speaker environments
    Assign speech to different people
    Allows questions like “did Fred agree or not?”

SLIDE 3

Voice Identity

  • What makes a voice identity?

  • Lexical choice:
    “Woo-hoo”, “I pity the fool …”

  • Phonetic choice

  • Intonation and duration

  • Spectral qualities (vocal tract shape)

  • Excitation

SLIDE 4

Voice Identity

  • What makes a voice identity?

  • Lexical choice:
    “Woo-hoo”, “I pity the fool …”

  • Phonetic choice

  • Intonation and duration

  • Spectral qualities (vocal tract shape)

  • Excitation

  • But which is most discriminative?

SLIDE 5

GMM Speaker ID

  • Just looking at the spectral part
    Which is, roughly, vocal tract shape

  • Build a single Gaussian of MFCCs
    Means and standard deviations of all speech

  • Actually build an N-mixture Gaussian (32 or 64)

  • Build a model for each speaker

  • Use test data and see which model it is closest to
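The per-speaker GMM recipe on this slide can be sketched with scikit-learn. Everything here is illustrative: random Gaussian clusters stand in for MFCC frames (a real system would extract MFCCs from audio), and the speaker names, dimensionality, and mixture size are invented.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

def fake_frames(mean, n=500, dim=13):
    """Stand-in for one speaker's MFCC frames (13-dim is typical)."""
    return rng.normal(loc=mean, scale=1.0, size=(n, dim))

# Build an N-mixture Gaussian per speaker (the slides suggest 32 or
# 64 components; 4 are plenty for this toy data).
train = {"spkA": fake_frames(0.0), "spkB": fake_frames(3.0)}
models = {spk: GaussianMixture(n_components=4, random_state=0).fit(X)
          for spk, X in train.items()}

def identify(test_frames):
    # score() is the average per-frame log-likelihood: pick the
    # speaker whose model explains the test frames best.
    return max(models, key=lambda spk: models[spk].score(test_frames))
```

With real MFCCs the loop is the same; the overlap problem raised on the next slides shows up as near-equal scores between models.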

SLIDE 6

GMM Speaker ID

  • How close does it need to be?
    One or two standard deviations?

  • The set of speakers needs to be distinct
    If they are closer than one or two stddev,
    you get confusion.

  • Should you have a “general” model?
    Not one of the set of training speakers

SLIDE 7

GMM Speaker ID

  • Works well on constrained tasks

  • In similar acoustic conditions
    (not phone vs wide-band)

  • Same spoken style as training data

  • Cooperative users

  • Doesn’t work well when:

  • Different speaking style (conversation/lecture)

  • Shouting, whispering

  • Speaker has a cold

  • Different language

SLIDE 8

Speaker ID Systems

  • Training
    Example speech from each speaker
    Build models for each speaker
    (maybe an exception model too)

  • ID phase
    Compare test speech to each model
    Choose “closest” model (or none)
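The “or none” decision, together with the “general” model from slide 6, can be sketched as a likelihood-ratio test: accept the closest enrolled speaker only if their model beats a background model trained on pooled speech. All data, names, and the margin below are invented for illustration.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)

def frames(mean, n=400, dim=13):
    """Stand-in for MFCC frames from one speaker."""
    return rng.normal(mean, 1.0, size=(n, dim))

# Enrolled speakers, plus broader pooled speech for the "general"
# model (a universal background model, UBM).
enrolled = {"spkA": frames(0.0), "spkB": frames(4.0)}
background = np.vstack([frames(0.0), frames(4.0), frames(8.0)])

models = {s: GaussianMixture(n_components=4, random_state=0).fit(X)
          for s, X in enrolled.items()}
ubm = GaussianMixture(n_components=6, random_state=0).fit(background)

def identify_open_set(X, margin=0.5):
    # Pick the closest enrolled model, but accept it only if it
    # beats the general model by a margin; otherwise answer None.
    best = max(models, key=lambda s: models[s].score(X))
    if models[best].score(X) - ubm.score(X) < margin:
        return None
    return best
```

An unknown speaker scores better under the broad background model than under any enrolled model, so the ratio falls below the margin and the system answers “none”.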

SLIDE 9

Basic Speaker ID system

SLIDE 10

Accuracy

  • Works well on smaller sets
    20-50 speakers

  • As the number of speakers increases
    Models begin to overlap – confuse speakers

  • What can we do to get better distinctions?

SLIDE 11

What about transitions?

  • Not just modeling isolated frames

  • Look at phone sequences

  • But ASR:
    Lots of variation
    Limited amount of phonetic space

  • What about lots of ASR engines?

SLIDE 12

Phone-based Speaker ID

  • Use *lots* of ASR engines

  • But they need to be different ASR engines
    Use ASR engines from lots of different languages

  • It doesn’t matter what language the speech is

  • Use many different ASR engines
    Gives lots of variation

  • Build models of which phones are recognized
    (actually we use HMM states, not phones)
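The phone-sequence idea can be sketched as bigram models over decoded phone strings. The decodings below are invented (no ASR engines are run here); a real system would pool the outputs of many language-specific recognizers, and, as the slide notes, model HMM states rather than phones.

```python
import math
from collections import Counter

def bigrams(seq):
    return list(zip(seq, seq[1:]))

def train_model(utterances):
    """Count phone bigrams over a speaker's decoded utterances."""
    counts = Counter()
    for u in utterances:
        counts.update(bigrams(u.split()))
    return counts

def log_prob(counts, utterance, vocab_size=50):
    # Add-one smoothing so unseen bigrams don't zero out the score.
    total = sum(counts.values())
    return sum(math.log((counts[bg] + 1) / (total + vocab_size))
               for bg in bigrams(utterance.split()))

# Hypothetical phone decodings for two speakers.
train = {
    "spkA": ["aa b aa b aa", "aa b aa"],
    "spkB": ["k t k t k", "t k t"],
}
models = {s: train_model(us) for s, us in train.items()}

def identify(utterance):
    return max(models, key=lambda s: log_prob(models[s], utterance))
```

The point of the bigrams is that they capture transitions (dynamics), which the frame-by-frame GMM ignores.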

SLIDE 13

Phone-based SID (Jin)

SLIDE 14

Phone-based Speaker ID

  • Much better distinctions for larger datasets

  • Can work with 100-plus voices

  • Slightly more robust across styles/channels

SLIDE 15

But we need more …

  • Combined models
    GMM models
    Phone-based models
    Combine them
    Slightly better results

  • What else?
    Prosody (duration and F0)
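Combining the two systems is, at its simplest, a weighted sum of per-speaker scores. The numbers and the weight below are invented; in practice the scores would first be normalized to comparable scales and the weight tuned on held-out data.

```python
def fuse(gmm_scores, phone_scores, w=0.6):
    """Weighted-sum fusion of two systems' per-speaker scores.

    Assumes both dicts cover the same speakers and are already on
    comparable scales (normalize first if they are not).
    """
    return max(gmm_scores,
               key=lambda s: w * gmm_scores[s] + (1 - w) * phone_scores[s])

# Hypothetical scores: GMM-SID finds the speakers nearly tied,
# while the phone-based system clearly prefers spkA.
gmm = {"spkA": -4.1, "spkB": -3.9}
phone = {"spkA": -10.0, "spkB": -25.0}
```

Here the fused decision is spkA: the confident phone-based system outvotes the nearly-tied GMM, which is exactly the “slightly better results” effect the slide describes.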

SLIDE 16

Can VC beat Speaker-ID?

  • Can we fake voices?

  • Can we fool Speaker ID systems?

  • Can we make lots of money out of it?

  • Yes to the first two
    Jin, Toth, Black and Schultz, ICASSP 2008

SLIDE 17

Training/Testing Corpus

  • LDC CSR-I (WSJ0)
    US English studio read speech
    24 male speakers
    50 sentences training, 5 test
    Plus 40 additional training sentences
    Sentence average length is 7 s

  • VT source speakers
    Kal_diphone (synthetic speech)
    US English male natural speaker (not all sentences)

SLIDE 18

Experiment I

  • VT GMM
    Kal_diphone source speaker
    GMM: train on 50 sentences
    GMM: transform 5 test sentences

  • SID GMM
    Train on 50 sentences
    (Test on 5 natural sentences: 100% correct)

SLIDE 19

GMM-VT vs GMM-SID

  • VT fools GMM-SID 100% of the time

SLIDE 20

GMM-VT vs GMM-SID

  • Not surprising (others show this)
    Both optimize spectral properties
    These used the same training set
    (different training sets don’t change the result)

  • VT output voices sound “bad”
    Poor excitation and voicing decisions
    Humans can distinguish VT vs natural

  • Actually, GMM-SID can distinguish these too
    If VT is included in the training set

SLIDE 21

GMM-VT vs Phone-SID

  • VT is always identified as S17, S24 or S20

  • Kal_diphone is recognized as S17 and S24

  • Phone-SID seems to recognize the source speaker

SLIDE 22

What about Synthetic Speech?

  • Clustergen: CG
    Statistical parametric synthesizer
    MLSA filter for resynthesis

  • Clunits: CL
    Unit selection synthesizer
    Waveform concatenation

SLIDE 23

Synth vs GMM-SID

  • Smaller is better

SLIDE 24

Synth vs Phone-SID

  • Smaller is better

  • Opposite order from GMM-SID

SLIDE 25

Conclusions

  • GMM-VT fools GMM-SID

  • Phone-SID can distinguish the source speaker

  • Phone-SID cares about dynamics

  • Synthesis (pretty much) fools Phone-SID

  • We’ve not tried to distinguish Synth vs Real

SLIDE 26

Future

  • Much larger dataset
    250 speakers (male and female)
    Open set (include background model)
    WSJ (0+1)

  • Use VT with long-term dynamics
    HTS adaptation
    Articulatory position data
    Prosodics (F0 and duration)

  • Use Phone-SID to tune the VT model

SLIDE 27

Future II

  • VT that fools Phone-SID

  • Develop X-SID (prosody?)
    Develop X-VT that fools X-SID
    Develop X2-SID
    Develop X2-VT that fools …

SLIDE 28

De-identification

  • Using Speaker ID to score de-identification

  • Reverse of voice transformation
    Masking the source, rather than being like the target

  • Simplest view
    Full ASR and TTS in a new engine (too hard)
    Voice conversion to a synthetic voice
    Natural speech to TTS (kal_diphone)

SLIDE 29

De-identification

  • Tested against 24 speakers

  • GMM transformation
    50% de-identification

  • GMM + duration normalization
    60% de-identification

  • GMM + duration + transterpolation
    80-100% de-identification

SLIDE 30

Speaker-ID and Language

  • Identify which language someone is speaking

  • Identify their dialect

  • In cross-lingual voice conversion
    Identify the accent (or lack of one)
    Identify the speaker
    Want close to the source speaker and close to the target language

SLIDE 31

Speaker-ID

  • Annual international competitions
    Given this data set (1000s of speakers),
    how well can you identify the test speakers?

  • Vary the issues:
    Channel conditions (phone, non-phone)
    Language/speaker style
    Realtime vs fully offline

SLIDE 32

HW4

  • A company that deploys a very large on-line gaming environment contacts you with the idea of adding a speech interface to their game. Your task is to describe feasible methods to integrate speech into the game.

  • Address the following issues:

SLIDE 33

HW4

  • What parts can use speech?

  • Contrast ASR/TTS, text to TTS, and voice conversion

  • How could you use data from the system?

  • How would you evaluate it?

  • What about translation?

  • Submission: 3:30pm Monday Dec 8th

SLIDE 34