Speech Processing 11-492/18-492
Speech Recognition
Acoustic modeling
Pronunciation dictionary
Acoustic Modeling
Speech and Signal Variability
Measuring Error
Pronunciation lexicons
Variability in Speech Signal
“Mr Wright should write to Ms Wright right away about his Ford or four door Honda.”
Homophones: same pronunciation
“wright” “right” “write” / r ay t /
“ford or” “four door” / f ao r d ao r /
Style Variability
Different articulation in different situations
Clear vs Conversational
Whisper vs shouting
Talking to machine, talking to others
Frustrated speech
Speaker variability
Gender, age, dialect, health
Speaker dependent systems
Speaker independent systems
Speaker adaptive systems
Enrolment stage (acoustics and language)
Environment Variability
Different background noises
Office vs Outside
Different applications, different environments
Desktop dictation to warehouse picking
Single speaker vs Multispeaker
Background music
Channel Variability
Telephone vs Desktop
8 kHz vs 16 kHz
Mobile vs Desktop
Close-talking vs far-field
Cell Phone vs Landline vs VOIP
Measuring Speech Recognition Error
Word Error Rate
Substitutions: word is replaced
Deletions: word is missed out
Insertions: word is added
WER = 100% x (Subs + Dels + Ins) / (words in correct sentence)
Word Error Rate
WER requires:
Transcription (the correct word string)
Alignment between ASR output and Transcript
Not just left to right matching
Sometimes Accuracy is given
100 - WER
NOT number of words correct
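The alignment requirement can be sketched as a minimum edit distance between the transcript and the ASR output. This is only an illustrative sketch: the function name word_error_rate and whitespace tokenization are assumptions, and scoring tools typically also normalize case and punctuation before aligning.

```python
def word_error_rate(reference, hypothesis):
    """Return (wer_percent, subs, dels, ins) from a minimum edit distance alignment."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # best[i][j] = (errors, subs, dels, ins) for aligning ref[:i] with hyp[:j]
    best = [[None] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    best[0][0] = (0, 0, 0, 0)
    for i in range(1, len(ref) + 1):
        best[i][0] = (i, 0, i, 0)  # every reference word deleted
    for j in range(1, len(hyp) + 1):
        best[0][j] = (j, 0, 0, j)  # every hypothesis word inserted
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            e, s, d, n = best[i - 1][j - 1]
            if ref[i - 1] == hyp[j - 1]:
                match = (e, s, d, n)          # correct word, no cost
            else:
                match = (e + 1, s + 1, d, n)  # substitution
            e, s, d, n = best[i - 1][j]
            deletion = (e + 1, s, d + 1, n)
            e, s, d, n = best[i][j - 1]
            insertion = (e + 1, s, d, n + 1)
            best[i][j] = min(match, deletion, insertion)
    errors, subs, dels, ins = best[-1][-1]
    return 100.0 * errors / len(ref), subs, dels, ins


# Two substitutions out of four reference words.
print(word_error_rate("this specimen is absent", "the specimen is present"))
# Three insertions against a two-word reference push WER past 100%.
print(word_error_rate("is present", "the is not present here"))
```

Because the denominator is the number of words in the transcript, insertions alone can push the result above 100%, which is why Accuracy reported as 100 - WER is not the same as the percentage of words correct.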
Word Error Rate
Can get > 100%
But something is very wrong
Outputting “the” only, ignoring the speech
Sometimes gives WER < 100%
All words are treated equally
“This specimen” vs “The specimen”
“Is absent” vs “Is present”
Signal Acquisition
High signal quality
Lower sample rate will increase WER
8 kHz baseline
16 kHz: -10%
End-Point Detection
Long silence will likely increase WER
It will recognize phantom words
Need to find the speech in the signal
VAD (Voice Activity Detection)
Find beginning and end of speech
Typically do continuous recognition
Recognize while listening
But need end point (have to wait)
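One simple way to find the beginning and end of speech is a frame-energy threshold. The sketch below is an assumption-laden illustration, not the method used by any particular recognizer: find_endpoints, frame_len, threshold, and padding_frames are hypothetical names, and practical VADs usually adapt the threshold to the noise level.

```python
def find_endpoints(samples, frame_len=160, threshold=0.01, padding_frames=0):
    """Return (start_sample, end_sample) of the detected speech region, or None."""
    # Mean energy of each non-overlapping frame.
    energies = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energies.append(sum(x * x for x in frame) / frame_len)
    # Frames whose energy exceeds the (assumed fixed) threshold count as speech.
    active = [i for i, e in enumerate(energies) if e > threshold]
    if not active:
        return None
    # Keep a little context on each side so quiet onsets are not clipped.
    first = max(active[0] - padding_frames, 0)
    last = min(active[-1] + padding_frames, len(energies) - 1)
    return first * frame_len, (last + 1) * frame_len


# Synthetic signal: silence, a loud burst, silence.
signal = [0.0] * 800 + [0.5, -0.5] * 400 + [0.0] * 800
print(find_endpoints(signal))
print(find_endpoints(signal, padding_frames=1))
```

Trimming to the returned region before decoding is one way to keep long silences from being recognized as phantom words.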
Feature normalization
Sometimes do normalization
Remove mean from MFCCs
Can make recognition more reliable in noise
Often include deltas and delta deltas
Sometimes do feature reduction
Principal Component Analysis
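Mean removal and deltas can be sketched on a list of MFCC frames. The helper names are hypothetical, and the delta here is a simple two-frame difference with repeated edge frames, a simplification of the regression windows that toolkits commonly use.

```python
def subtract_mean(frames):
    """Remove the per-coefficient mean computed over the whole utterance."""
    n, dims = len(frames), len(frames[0])
    means = [sum(f[d] for f in frames) / n for d in range(dims)]
    return [[f[d] - means[d] for d in range(dims)] for f in frames]


def deltas(frames):
    """Simplified delta: half the difference between the next and previous frame."""
    last = len(frames) - 1
    out = []
    for t in range(len(frames)):
        prev_f = frames[max(t - 1, 0)]     # repeat the first frame at the start
        next_f = frames[min(t + 1, last)]  # repeat the last frame at the end
        out.append([(b - a) / 2.0 for a, b in zip(prev_f, next_f)])
    return out


def build_features(frames):
    """Concatenate normalized statics, deltas, and delta deltas per frame."""
    static = subtract_mean(frames)
    d1 = deltas(static)
    d2 = deltas(d1)
    return [s + a + b for s, a, b in zip(static, d1, d2)]


# Three toy frames with two coefficients each; the second one is a constant offset.
mfccs = [[1.0, 10.0], [2.0, 10.0], [3.0, 10.0]]
for row in build_features(mfccs):
    print(row)
```

The constant second coefficient becomes zero after mean removal, which is the sense in which this step cancels a fixed channel offset.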
What phones/segments
Need the best set for discrimination
Not necessarily the same as Linguistic Phones
More phones means more training
And needs a consistent Lexicon
Extra phones
t vs dx
t vs nx: / t w eh n t iy / vs / t w eh nx iy /
Stops as closures and bursts
Schwas: ax and ix
Syllabics: el, em, en
Accents/Tones: ah1, ah0, …
Context dependency
Care about the contexts of each phone
Post vocalic /r/ and /n/ /m/ affect vowel
Utterance start and end affect phonemes
Need more than simple phone models
Tri-phone Models
Have models for each phone and context
43^3 contexts: about 80K models
Not all contexts have enough examples
oy (oy) oy very rare
sh (ax) n very common
Merge tri-phones that are similar
E.g. t(ih)n with d(ih)n
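The size of the context space and the sparsity problem can be illustrated by expanding phone strings into left(center)right tri-phones and counting them. The function names and the "sil" boundary symbol are assumptions made for this sketch.

```python
from collections import Counter


def to_triphones(phones, boundary="sil"):
    """Expand a phone sequence into left(center)right tri-phone labels."""
    padded = [boundary] + list(phones) + [boundary]  # assumed utterance-edge context
    return [
        f"{padded[i - 1]}({padded[i]}){padded[i + 1]}"
        for i in range(1, len(padded) - 1)
    ]


def count_triphones(pronunciations):
    """Count how often each tri-phone occurs across a set of pronunciations."""
    counts = Counter()
    for phones in pronunciations:
        counts.update(to_triphones(phones))
    return counts


print(43 ** 3)  # 79507 possible contexts for a 43-phone set
print(to_triphones(["t", "ih", "n"]))
print(count_triphones([["t", "ih", "n"], ["d", "ih", "n"], ["t", "ih", "n"]]))
```

Counting over real training transcripts would show many contexts with few or no examples, which motivates merging similar tri-phones.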
Find phones to merge
Using phonetic features
Most similar features, most similar acoustics
Stops, voicing, vowel type …
Usually automatic clustering of triphones
Using CART trees indexed by phonetic features
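The idea that phones sharing more phonetic features are better merge candidates can be sketched with a set-overlap score. The feature table below is a small hypothetical example, and this pairwise score only illustrates the intuition; it is not the CART-based clustering itself.

```python
# Hypothetical feature table, for illustration only.
FEATURES = {
    "t": {"stop", "alveolar", "unvoiced"},
    "d": {"stop", "alveolar", "voiced"},
    "m": {"nasal", "bilabial", "voiced"},
    "iy": {"vowel", "high", "front", "voiced"},
}


def similarity(a, b):
    """Jaccard overlap between the feature sets of two phones."""
    fa, fb = FEATURES[a], FEATURES[b]
    return len(fa & fb) / len(fa | fb)


def closest_phone(phone, candidates):
    """Pick the candidate whose features overlap most with the given phone."""
    return max(candidates, key=lambda c: similarity(phone, c))


print(similarity("t", "d"))
print(closest_phone("t", ["m", "iy", "d"]))
```

Under this toy table, t and d differ only in voicing, so a context such as t(ih)n would be merged with d(ih)n before any context involving m or iy.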
Adaptation
Change behavior after use
Human adaptation
They will change how they speak
Channel adaptation
Cepstral Normalization
Model adaptation
Move the means (or weights on means)
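Moving the means can be sketched as interpolating between the existing Gaussian mean and the average of the new speaker's frames. This uses a MAP-style update as one possible choice: adapt_mean and the prior weight tau are assumptions, and the frames are assumed to have already been assigned to this Gaussian.

```python
def adapt_mean(prior_mean, frames, tau=10.0):
    """Shift a Gaussian mean toward adaptation frames (MAP-style interpolation)."""
    n = len(frames)
    dims = len(prior_mean)
    # Sum of the observed frames per dimension.
    sums = [sum(f[d] for f in frames) for d in range(dims)]
    # tau acts like a count of "virtual" frames located at the prior mean.
    return [(tau * prior_mean[d] + sums[d]) / (tau + n) for d in range(dims)]


# Ten frames at [2, 4] with tau = 10 move the mean halfway from [0, 0].
print(adapt_mean([0.0, 0.0], [[2.0, 4.0]] * 10))
# With no adaptation data the mean is unchanged.
print(adapt_mean([1.5, -2.0], []))
```

A larger tau makes the model more conservative, so a few frames from a new speaker shift the mean only slightly.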
Adaptation
Assume recognition is correct
(Maybe with some threshold)
Modify model to make answer more correct
Adaptation to speaker characteristics
Adaptation to speaker style
Can improve accuracy by a few %
Pronunciation lexicon
Need list of words and their pronunciation
Pencil p eh n s ih l
Two t uw
Too t uw
…
Need pronunciation of ALL words
What’s a word
Basic words are clear
What about morphological variants
walk, walks, walked, walking
Multi-word words
Los Angeles, New York
Contractions
Wanna, gonna …
Yes, ALL words that you will recognize
Pronunciation variants
Homographs (same writing, different pronunciation)
bass: / b ae s / (fish) / b ey s / (music)
project: N / p r aa jh eh k t / V / p r ax jh eh k t /
Natural variants
route: / r uw t / and / r aw t /
coupon: / k uw p ao n / and / k y uw p ao n /
water: / w ao t er / and / w ao dx er /
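A lexicon with multiple entries per word can be sketched as a mapping from each word to a list of phone strings. The entries come from the examples above; the dictionary layout and helper names are assumptions for illustration and do not reflect any specific lexicon file format.

```python
from collections import defaultdict

# Toy lexicon: each word maps to one or more pronunciations.
LEXICON = {
    "two": ["t uw"],
    "too": ["t uw"],
    "bass": ["b ae s", "b ey s"],
    "route": ["r uw t", "r aw t"],
    "water": ["w ao t er", "w ao dx er"],
}


def pronunciations(word):
    """Return every listed pronunciation, or an empty list for unknown words."""
    return LEXICON.get(word.lower(), [])


def homophones(lexicon):
    """Group words that share a pronunciation."""
    by_pron = defaultdict(set)
    for word, prons in lexicon.items():
        for pron in prons:
            by_pron[pron].add(word)
    return {p: sorted(ws) for p, ws in by_pron.items() if len(ws) > 1}


print(pronunciations("Route"))
print(homophones(LEXICON))
```

An empty result from pronunciations marks a word that is missing from the lexicon and needs either a manual entry or a predicted pronunciation.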
CMU Pronunciation Dict
Free pronunciation lexicon
American English
Over 100K words
Not always consistent
Words for your application will be missing
We can never get a complete lexicon
Pronunciation of Unknown Words
Build statistical model from lexicon
Predict pronunciation from letters
(Humans do this when they see a new word)
Typically about 70-85% correct for new words
Should always check domain words
Modeling Variability
In Gaussians (in HMM state)
Multiple mixtures
In HMM topology
Number of states and connectivity
In State Tying
Sharing Gaussians between states
In Phone choice
More/fewer phones
In Lexical Pronunciation
Multiple lexical entries