 
Speech Processing 11-492/18-492
Speech Recognition
Acoustic modeling
Pronunciation dictionary

Acoustic Modeling
- Speech and Signal Variability
- Measuring Error
- Pronunciation lexicons

Variability in Speech Signal
- “Mr Wright should write to Ms Wright right away about his Ford or four door Honda.”
- Homophones: same pronunciation
  - “wright” “right” “write”: / r ay t /
  - “ford or” “four door”: / f ao r d ao r /

Style Variability
- Different articulation in different situations
  - Clear vs conversational
  - Whispering vs shouting
  - Talking to a machine vs talking to others
  - Frustrated speech

Speaker Variability
- Gender, age, dialect, health
- Speaker dependent systems
- Speaker independent systems
- Speaker adaptive systems
  - Enrolment stage (acoustics and language)

Environment Variability
- Different background noises
  - Office vs outside
- Different applications, different environments
  - Desktop dictation to warehouse picking
  - Single speaker vs multispeaker
  - Background music

Channel Variability
- Telephone vs desktop
  - 8 kHz vs 16 kHz
- Mobile vs desktop
  - Close-talking vs far-field
- Cell phone vs landline vs VoIP

Measuring Speech Recognition Error
- Word Error Rate (WER)
  - Substitutions: word is replaced
  - Deletions: word is missed out
  - Insertions: word is added

  WER = 100% x (Subs + Dels + Ins) / (words in correct sentence)

Word Error Rate
- WER requires:
  - Transcription (the correct word string)
  - Alignment between ASR output and transcript
  - Not just left to right matching
- Sometimes accuracy is given
  - 100 - WER
  - NOT the number of words correct

Word Error Rate
- Can get > 100%
  - But something is very wrong
- Outputting “the” only, ignoring the speech
  - Sometimes gives WER < 100%
- All words are treated equal
  - “This specimen” vs “The specimen”
  - “Is absent” vs “Is present”

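The WER definition and the alignment requirement above can be sketched with a standard edit-distance (Levenshtein) alignment over words. This is a minimal illustration, not the scoring tool used in the course; the function name `word_error_rate` and whitespace tokenization are assumptions made here.

```python
def word_error_rate(reference, hypothesis):
    """Return WER in percent using a minimum edit-distance word alignment."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcription must contain at least one word")

    # dist[i][j] = fewest substitutions, deletions and insertions needed
    # to turn the first i reference words into the first j hypothesis words.
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i          # delete every reference word
    for j in range(len(hyp) + 1):
        dist[0][j] = j          # insert every hypothesis word

    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = dist[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            deletion = dist[i - 1][j] + 1
            insertion = dist[i][j - 1] + 1
            dist[i][j] = min(substitution, deletion, insertion)

    return 100.0 * dist[-1][-1] / len(ref)


if __name__ == "__main__":
    print(word_error_rate("is absent", "is present"))         # one substitution
    print(word_error_rate("the cat sat on the mat", "the"))   # stays below 100%
    print(word_error_rate("hello", "a b c"))                  # exceeds 100%
```

Because the denominator is the number of reference words, insertions can push the result above 100%, and the score treats "absent" vs "present" the same as any other single substitution.
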
Signal Acquisition
- High signal quality
- Lower sample rate will increase WER
  - 8 kHz: baseline
  - 16 kHz: -10% (relative to the 8 kHz baseline)

End-Point Detection
- Long silence will likely increase WER
  - It will recognize phantom words
- Need to find the speech in the signal
  - VAD (Voice Activity Detection)
  - Find beginning and end of speech
- Typically do continuous recognition
  - Recognize while listening
  - But need end point (have to wait)

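Finding the beginning and end of speech can be illustrated with a very simple energy threshold. This is only a sketch: the slides do not say which VAD method is used, and the frame length, threshold, and function names below are assumptions chosen for illustration. Practical VADs are considerably more robust to noise.

```python
def frame_energies(samples, frame_len):
    """Mean squared amplitude of each non-overlapping frame."""
    return [
        sum(s * s for s in samples[start:start + frame_len]) / frame_len
        for start in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def find_endpoints(samples, frame_len=160, threshold=0.01):
    """Return (start_sample, end_sample) of the speech region, or None.

    A frame counts as speech when its energy exceeds `threshold`
    (a hypothetical value; real systems tune or adapt it).
    """
    energies = frame_energies(samples, frame_len)
    active = [i for i, energy in enumerate(energies) if energy > threshold]
    if not active:
        return None
    return active[0] * frame_len, (active[-1] + 1) * frame_len


if __name__ == "__main__":
    # Synthetic signal: silence, a louder stretch, then silence again.
    signal = [0.0] * 320 + [0.5] * 480 + [0.0] * 320
    print(find_endpoints(signal))
```
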
Feature Normalization
- Sometimes do normalization
  - Remove mean from MFCCs
  - Can make recognition more reliable in noise
- Often include deltas and delta-deltas
- Sometimes do feature reduction
  - Principal Component Analysis

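Mean removal and delta features can be sketched on a list of feature frames as follows. This assumes the MFCCs have already been computed elsewhere, and it uses a simple two-point difference for the deltas; real front ends usually use a regression window over several frames, so treat the exact delta formula here as an assumption.

```python
def mean_normalize(frames):
    """Subtract the per-coefficient mean over the utterance from every frame."""
    count = len(frames)
    dims = len(frames[0])
    means = [sum(frame[d] for frame in frames) / count for d in range(dims)]
    return [[frame[d] - means[d] for d in range(dims)] for frame in frames]


def deltas(frames):
    """Two-point difference (next - previous) / 2, repeating edge frames."""
    count = len(frames)
    result = []
    for t in range(count):
        previous = frames[max(t - 1, 0)]
        following = frames[min(t + 1, count - 1)]
        result.append([(b - a) / 2.0 for a, b in zip(previous, following)])
    return result


def add_deltas(frames):
    """Append deltas and delta-deltas to each frame (triples the dimension)."""
    first = deltas(frames)
    second = deltas(first)
    return [f + d1 + d2 for f, d1, d2 in zip(frames, first, second)]


if __name__ == "__main__":
    mfcc = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]  # toy 2-dimensional frames
    normalized = mean_normalize(mfcc)
    print(normalized)
    print(add_deltas(normalized)[1])
```
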
What Phones/Segments
- Need the best set for discrimination
  - Not necessarily the same as linguistic phones
- More phones means more training
  - And needs a consistent lexicon
- Extra phones
  - t vs dx
  - t vs nx: / t w eh n t iy / vs / t w eh nx iy /
  - Stops as closures and bursts
  - Schwas: ax and ix
  - Syllabics: el, em, en
  - Accents/Tones: ah1, ah0, …

Context Dependency
- Care about the contexts of each phone
- Post-vocalic /r/, /n/ and /m/ affect the vowel
- Utterance start and end affect phonemes
- Need more than simple phone models

Tri-phone Models
- Have models for each phone and context
  - 43^3 contexts, about 80K models
- Not all contexts have enough examples
  - oy (oy) oy: very rare
  - sh (ax) n: very common
- Merge tri-phones that are similar
  - E.g. t(ih)n with d(ih)n

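The model count and the left(center)right notation above can be made concrete with a short sketch. The boundary symbol `sil` used to pad utterance edges is an assumption for illustration; the slides do not specify how edges are handled.

```python
NUM_PHONES = 43

# Every phone in every left and right context: 43 * 43 * 43 = 79507 models.
NUM_TRIPHONES = NUM_PHONES ** 3


def triphones(phones, boundary="sil"):
    """List the triphones of a phone sequence in left(center)right notation."""
    padded = [boundary] + list(phones) + [boundary]
    return [
        f"{padded[i - 1]}({padded[i]}){padded[i + 1]}"
        for i in range(1, len(padded) - 1)
    ]


if __name__ == "__main__":
    print(NUM_TRIPHONES)
    print(triphones(["p", "eh", "n", "s", "ih", "l"]))
```

Counting how often each triphone string occurs in training data shows directly which contexts are too rare to train on their own.
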
Find Phones to Merge
- Using phonetic features
  - Most similar features, most similar acoustics
  - Stops, voicing, vowel type …
- Usually automatic clustering of triphones
  - Using CART trees indexed by phonetic features

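As a rough illustration of merging by phonetic features, the sketch below ties triphones whose left and right contexts fall in the same broad class. This is a deliberate simplification and not CART clustering: a real system grows a decision tree from data, asking feature questions chosen by likelihood gain. The class table `BROAD_CLASS` is a small hypothetical example.

```python
from collections import defaultdict

# Hypothetical broad phonetic classes for a handful of context phones.
BROAD_CLASS = {
    "p": "stop", "b": "stop", "t": "stop", "d": "stop", "k": "stop", "g": "stop",
    "m": "nasal", "n": "nasal",
    "s": "fricative", "z": "fricative", "sh": "fricative",
}


def parse_triphone(triphone):
    """Split 'left(center)right' into its three phones."""
    left, rest = triphone.split("(")
    center, right = rest.split(")")
    return left, center, right


def tie_key(triphone):
    """Same center phone and same broad class on each side share one model."""
    left, center, right = parse_triphone(triphone)
    return BROAD_CLASS.get(left, left), center, BROAD_CLASS.get(right, right)


def tie_triphones(triphones):
    """Group triphones that would share a model under this simple rule."""
    groups = defaultdict(list)
    for triphone in triphones:
        groups[tie_key(triphone)].append(triphone)
    return dict(groups)


if __name__ == "__main__":
    print(tie_triphones(["t(ih)n", "d(ih)n", "s(ih)n"]))
```

With this table, t(ih)n and d(ih)n land in the same group because both left contexts are stops, which mirrors the merge example on the tri-phone slide.
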
Adaptation
- Change behavior after use
- Human adaptation
  - They will change how they speak
- Channel adaptation
  - Cepstral normalization
- Model adaptation
  - Move the means (or weights on means)

Adaptation
- Assume recognition is correct
  - (Maybe with some threshold)
- Modify model to make answer more correct
  - Adaptation to speaker characteristics
  - Adaptation to speaker style
- Can improve accuracy by a few %

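"Moving the means" can be sketched as an interpolation between the existing model mean and the frames the recognizer assigned to that model. The slides do not name a specific method; the MAP-style update below, the prior weight `tau`, and the confidence threshold are all assumptions made for illustration.

```python
def adapt_mean(old_mean, frames, tau=10.0):
    """Shift a model mean toward newly observed frames.

    new_mean = (tau * old_mean + sum(frames)) / (tau + number_of_frames)
    A larger tau trusts the original model more; more data moves the mean further.
    """
    count = len(frames)
    dims = len(old_mean)
    sums = [sum(frame[d] for frame in frames) for d in range(dims)]
    return [(tau * old_mean[d] + sums[d]) / (tau + count) for d in range(dims)]


def adapt_if_confident(old_mean, frames, confidence, threshold=0.8, tau=10.0):
    """Only adapt when the recognition result is trusted (hypothetical threshold)."""
    if confidence < threshold or not frames:
        return list(old_mean)
    return adapt_mean(old_mean, frames, tau)


if __name__ == "__main__":
    speaker_frames = [[1.0, 2.0]] * 10
    print(adapt_if_confident([0.0, 0.0], speaker_frames, confidence=0.9))
    print(adapt_if_confident([0.0, 0.0], speaker_frames, confidence=0.5))
```
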
Pronunciation Lexicon
- Need list of words and their pronunciations
  - Pencil: p eh n s ih l
  - Two: t uw
  - Too: t uw
  - …
- Need pronunciation of ALL words

What’s a Word
- Basic words are clear
- What about morphological variants?
  - walk, walks, walked, walking
- Multi-word words
  - Los Angeles, New York
- Contractions
  - Wanna, gonna …
- Yes, ALL words that you will recognize

Pronunciation Variants
- Homographs (same writing, different pronunciation)
  - bass: / b ae s / (fish), / b ey s / (music)
  - project: N / p r aa jh eh k t /, V / p r ax jh eh k t /
- Natural variants
  - route: / r uw t / and / r aw t /
  - coupon: / k uw p ao n / and / k y uw p ao n /
  - water: / w ao t er / and / w ao dx er /

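A pronunciation lexicon with variants can be sketched as a mapping from each word to a list of phone strings. The entries below are taken from the slides; the structure and the helper names `pronunciations` and `homophones` are assumptions for illustration, not a particular toolkit's format.

```python
from collections import defaultdict

# Each word maps to one or more pronunciations (space-separated phones).
LEXICON = {
    "pencil": ["p eh n s ih l"],
    "two": ["t uw"],
    "too": ["t uw"],
    "bass": ["b ae s", "b ey s"],
    "route": ["r uw t", "r aw t"],
    "coupon": ["k uw p ao n", "k y uw p ao n"],
    "water": ["w ao t er", "w ao dx er"],
}


def pronunciations(word):
    """Return all pronunciations; a missing word cannot be recognized at all."""
    key = word.lower()
    if key not in LEXICON:
        raise KeyError(f"no pronunciation for {word!r}: add it to the lexicon")
    return LEXICON[key]


def homophones():
    """Invert the lexicon to find pronunciations shared by several words."""
    by_pronunciation = defaultdict(set)
    for word, variants in LEXICON.items():
        for variant in variants:
            by_pronunciation[variant].add(word)
    return {
        phones: sorted(words)
        for phones, words in by_pronunciation.items()
        if len(words) > 1
    }


if __name__ == "__main__":
    print(pronunciations("Route"))
    print(homophones())
```

The lookup failure for unknown words reflects the requirement that every word to be recognized needs an entry.
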