Speech Processing 15-492/18-492: Speaker ID
SLIDE 1
SLIDE 2
Who is speaking?
- Speaker ID, Speaker Recognition
- When do you use it?
- Security, access
- Speaker-specific modeling
- Recognize the speaker and use their options
- Diarization
- In multi-speaker environments
- Assign speech to different people
- Allows questions like “did Fred agree or not?”
SLIDE 3
Voice Identity
- What makes a voice identity?
- Lexical choice: “Woo-hoo”, “I pity the fool …”
- Phonetic choice
- Intonation and duration
- Spectral qualities (vocal tract shape)
- Excitation
SLIDE 4
Voice Identity
- What makes a voice identity?
- Lexical choice: “Woo-hoo”, “I pity the fool …”
- Phonetic choice
- Intonation and duration
- Spectral qualities (vocal tract shape)
- Excitation
- But which is most discriminative?
SLIDE 5
GMM Speaker ID
- Just looking at the spectral part
- Which is (roughly) vocal tract shape
- Build a single Gaussian of MFCCs
- Means and standard deviations of all speech
- Actually build an N-mixture Gaussian (32 or 64)
- Build a model for each speaker
- Use test data and see which model it’s closest to
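The per-speaker modeling described above can be sketched in a few lines. This is a minimal illustration, not a real system: it uses the slide's "single Gaussian of MFCCs" (per-dimension mean and stddev) rather than a 32- or 64-mixture GMM, and random vectors stand in for real MFCC frames. All names and data here are illustrative.

```python
import numpy as np

def train_speaker_model(frames):
    """Single diagonal Gaussian over MFCC frames: per-dim mean and stddev.
    (A deployed system would fit an N-mixture GMM instead.)"""
    mu = frames.mean(axis=0)
    sigma = frames.std(axis=0) + 1e-6   # floor to avoid divide-by-zero
    return mu, sigma

def log_likelihood(frames, model):
    """Average per-frame log-likelihood of test frames under the model."""
    mu, sigma = model
    z = (frames - mu) / sigma
    return (-0.5 * (z ** 2 + np.log(2 * np.pi * sigma ** 2)).sum(axis=1)).mean()

def identify(test_frames, models):
    """Score the test speech against every speaker model; pick the closest."""
    scores = {spk: log_likelihood(test_frames, m) for spk, m in models.items()}
    return max(scores, key=scores.get)

# Toy stand-in for MFCC data: two "speakers" with different means.
rng = np.random.default_rng(0)
spk_a = rng.normal(0.0, 1.0, size=(500, 13))   # 500 frames, 13 "MFCCs"
spk_b = rng.normal(3.0, 1.0, size=(500, 13))
models = {"A": train_speaker_model(spk_a), "B": train_speaker_model(spk_b)}

test = rng.normal(3.0, 1.0, size=(100, 13))    # unseen speech from speaker B
print(identify(test, models))                  # picks "B"
```

"Closest" here means highest average log-likelihood; with mixtures the same argmax-over-models decision applies, just with a sum over mixture components inside the likelihood.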
SLIDE 6
GMM Speaker ID
- How close does it need to be?
- One or two standard deviations?
- The set of speakers needs to be distinct
- If two speakers’ models are within one or two stddevs of each other
- You get confusion
- Should you have a “general” model?
- Not one of the set of training speakers
SLIDE 7
GMM Speaker ID
- Works well on constrained tasks
- In similar acoustic conditions
- (not phone vs wide-band)
- Same spoken style as training data
- Cooperative users
- Doesn’t work well when:
- Different speaking style (conversation/lecture)
- Shouting, whispering
- Speaker has a cold
- Different language
SLIDE 8
Speaker ID Systems
- Training
- Example speech from each speaker
- Build models for each speaker
- (maybe an exception model too)
- ID phase
- Compare test speech to each model
- Choose “closest” model (or none)
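The “closest model (or none)” decision is often made by requiring the best speaker score to beat a “general” (background) model, as raised two slides back. A hypothetical sketch, assuming all scores are average log-likelihoods; the margin value is an arbitrary illustration, not a figure from the slides:

```python
def identify_open_set(scores, background_score, margin=0.5):
    """Pick the best-scoring known speaker, but return None unless that
    score beats a general background model by some margin."""
    best = max(scores, key=scores.get)
    if scores[best] - background_score < margin:
        return None   # closer to "anyone" than to any enrolled speaker
    return best

# Toy per-model scores (average log-likelihoods):
print(identify_open_set({"A": -18.2, "B": -17.1}, background_score=-17.9))  # "B"
print(identify_open_set({"A": -18.2, "B": -17.8}, background_score=-17.9))  # None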
SLIDE 9
Basic Speaker ID system
SLIDE 10
Accuracy
- Works well on smaller sets
- 20-50 speakers
- As the number of speakers increases
- Models begin to overlap – confuse speakers
- What can we do to get better distinctions?
SLIDE 11
What about transitions
- Not just modeling isolated frames
- Look at phone sequences
- But ASR:
- Lots of variation
- Limited amount of phonetic space
- What about lots of ASR engines?
SLIDE 12
Phone-based Speaker ID
- Use *lots* of ASR engines
- But they need to be different ASR engines
- Use ASR engines from lots of different languages
- It doesn’t matter what language the speech is in
- Many different engines give lots of variation
- Build models of what phones are recognized
- Actually we use HMM states, not phones
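One way to picture the phone-based approach: treat each speaker as a distribution over the labels the recognizers emit, and score a test label stream against every speaker's distribution. This sketch is a deliberate simplification, assuming a tiny made-up label alphabet, a single recognizer, and smoothed unigram counts; the systems on these slides model HMM-state sequences from many engines.

```python
from collections import Counter
from math import log

def train_phone_model(label_seq, alphabet):
    """Add-one-smoothed unigram distribution over recognized labels."""
    counts = Counter(label_seq)
    total = len(label_seq) + len(alphabet)
    return {p: (counts[p] + 1) / total for p in alphabet}

def score(label_seq, model):
    """Log-probability of a test label sequence under a speaker model."""
    return sum(log(model[p]) for p in label_seq)

# Toy label streams, as if emitted by one ASR engine per speaker.
alphabet = ["aa", "iy", "s", "t"]
spk_a_train = ["aa", "aa", "s", "t", "aa", "iy"] * 20   # favors "aa"
spk_b_train = ["iy", "iy", "t", "s", "iy", "aa"] * 20   # favors "iy"
models = {"A": train_phone_model(spk_a_train, alphabet),
          "B": train_phone_model(spk_b_train, alphabet)}

test = ["iy", "iy", "t", "iy", "s"]   # "iy"-heavy test speech
best = max(models, key=lambda s: score(test, models[s]))
print(best)                           # "B"
```

With many engines, each engine contributes its own label stream and the per-engine scores are summed; the decision is still an argmax over speakers.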
SLIDE 13
Phone-based SID (Jin)
SLIDE 14
Phone-based Speaker ID
- Much better distinctions for larger datasets
- Can work with 100-plus voices
- Slightly more robust across styles/channels
SLIDE 15
But we need more …
- Combined models
- GMM models
- Phone-based models
- Combine them
- Slightly better results
- What else …
- Prosody (duration and F0)
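Combining the GMM and phone-based systems can be done with a weighted sum of each system's per-speaker scores after normalization (so their very different score ranges can mix). A minimal sketch; the weight and the toy score values are illustrative, not numbers from the slides:

```python
def zscore(scores):
    """Normalize one system's per-speaker scores to zero mean, unit variance."""
    vals = list(scores.values())
    mu = sum(vals) / len(vals)
    sd = (sum((v - mu) ** 2 for v in vals) / len(vals)) ** 0.5 or 1.0
    return {s: (v - mu) / sd for s, v in scores.items()}

def fuse(gmm_scores, phone_scores, w=0.5):
    """Weighted sum of the two normalized systems; pick the best speaker."""
    g, p = zscore(gmm_scores), zscore(phone_scores)
    fused = {s: w * g[s] + (1 - w) * p[s] for s in g}
    return max(fused, key=fused.get)

# Toy disagreement: GMM mildly prefers A, phone-SID strongly prefers B.
gmm = {"A": -20.0, "B": -21.0, "C": -25.0}
phone = {"A": -300.0, "B": -250.0, "C": -320.0}
print(fuse(gmm, phone))   # "B": the stronger evidence wins after normalization
```

Setting `w=1.0` recovers the GMM-only decision, `w=0.0` the phone-only one; the "slightly better results" on the slide come from tuning such a weight on held-out data.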
SLIDE 16
Can VC beat Speaker-ID
- Can we fake voices?
- Can we fool Speaker ID systems?
- Can we make lots of money out of it?
- Yes to the first two
- Jin, Toth, Black and Schultz, ICASSP 2008
SLIDE 17
Training/Testing Corpus
- LDC CSR-I (WSJ0)
- US English studio read speech
- 24 male speakers
- 50 sentences training, 5 test
- Plus 40 additional training sentences
- Average sentence length is 7s
- VT source speakers:
- Kal_diphone (synthetic speech)
- US English male natural speaker (not all sentences)
SLIDE 18
Experiment I
- VT GMM
- Kal_diphone source speaker
- GMM trained on 50 sentences
- GMM transforms the 5 test sentences
- SID GMM
- Trained on 50 sentences
- (Test on 5 natural sentences: 100% correct)
SLIDE 19
GMM-VT vs GMM-SID
- VT fools GMM-SID 100% of the time
SLIDE 20
GMM-VT vs GMM-SID
- Not surprising (others have shown this)
- Both optimize spectral properties
- These used the same training set
- (different training sets don’t change the result)
- VT output voices sound “bad”
- Poor excitation and voicing decisions
- Humans can distinguish VT vs Natural
- Actually GMM-SID can distinguish these too
- If VT is included in the training set
SLIDE 21
GMM-VT vs Phone-SID
- VT is always identified as S17, S24 or S20
- Kal_diphone is recognized as S17 and S24
- Phone-SID seems to recognize the source speaker
SLIDE 22
What about Synthetic Speech?
- Clustergen: CG
- Statistical parametric synthesizer
- MLSA filter for resynthesis
- Clunits: CL
- Unit selection synthesizer
- Waveform concatenation
SLIDE 23
Synth vs GMM-SID
- Smaller is better
SLIDE 24
Synth vs Phone-SID
- Smaller is better
- Opposite order from GMM-SID
SLIDE 25
Conclusions
- GMM-VT fools GMM-SID
- Phone-SID can distinguish the source speaker
- Phone-SID cares about dynamics
- Synthesis (pretty much) fools Phone-SID
- We’ve not tried to distinguish Synth vs Real
SLIDE 26
Future
- Much larger dataset
- 250 speakers (male and female)
- Open set (include background model)
- WSJ (0+1)
- Use VT with long-term dynamics
- HTS adaptation
- Articulatory position data
- Prosodics (F0 and duration)
- Use Phone-SID to tune the VT model
SLIDE 27
Future II
- VT that fools Phone-SID
- Develop X-SID (prosody?)
- Develop X-VT that fools X-SID
- Develop X2-SID
- Develop X2-VT that fools …
SLIDE 28
De-identification
- Using Speaker ID to score de-identification
- Reverse of voice transformation
- Masking the source, rather than being like the target
- Simplest view:
- Full ASR and TTS in a new engine (too hard)
- Voice conversion to a synthetic voice
- Natural speech to TTS (kal_diphone)
SLIDE 29
De-identification
- Tested against 24 speakers
- GMM transformation: 50% de-identification
- GMM + duration normalization: 60% de-identification
- GMM + duration + transterpolation: 80-100% de-identification
SLIDE 30
Speaker-ID and Language
- Identify which language someone is speaking
- Identify their dialect
- In cross-lingual voice conversion:
- Identify the accent (or lack of one)
- Identify the speaker
- Want close to the source speaker and close to the target language
SLIDE 31
Speaker-ID
- Annual international competitions
- Given this data set (1000s of speakers)
- How well can you identify the test speakers?
- Vary the issues:
- Channel conditions (phone, non-phone)
- Language/speaking style
- Realtime vs fully offline
SLIDE 32
HW4
- A company that deploys a very large on-line gaming environment contacts you with the idea of adding a speech interface to their game. Your task is to describe feasible methods to integrate speech into the game.
- Address the following issues:
SLIDE 33
HW4
- What parts can use speech?
- Contrast ASR/TTS, text to TTS, and voice conversion
- How could you use data from the system?
- How would you evaluate it?
- What about translation?
- Submission: 3:30pm Monday Dec 8th
SLIDE 34