
Speech Processing 11-492/18-492



  1. Speech Processing 11-492/18-492
     - Speech Recognition
     - Acoustic modeling
     - Pronunciation dictionary

  2. Acoustic Modeling
     - Speech and Signal Variability
     - Measuring Error
     - Pronunciation lexicons

  3. Variability in Speech Signal
     - “Mr Wright should write to Ms Wright right away about his Ford or four door Honda.”
     - Homophones: same pronunciation
       - “wright” “right” “write” / r ay t /
       - “ford or” “four door” / f ao r d ao r /
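
The homophone point above can be illustrated with a small sketch. It assumes a toy lexicon holding only the pronunciations shown on the slide; the names `LEXICON` and `homophone_groups` are made up for illustration and are not part of the course material.

```python
from collections import defaultdict

# Toy lexicon: only the entries and phone strings shown on the slide.
LEXICON = {
    "wright": "r ay t",
    "right": "r ay t",
    "write": "r ay t",
    "ford or": "f ao r d ao r",
    "four door": "f ao r d ao r",
}


def homophone_groups(lexicon):
    """Group entries that share exactly the same phone string."""
    by_pron = defaultdict(list)
    for word, pron in lexicon.items():
        by_pron[pron].append(word)
    # Keep only phone strings that map to more than one entry.
    return {pron: sorted(words) for pron, words in by_pron.items() if len(words) > 1}


groups = homophone_groups(LEXICON)
print(groups["r ay t"])
```

Because the acoustics of each group are identical, the acoustic model alone cannot choose between the members; something else (such as the surrounding words) has to.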

  4. Style Variability
     - Different articulation in different situations
       - Clear vs Conversational
       - Whisper vs shouting
       - Talking to machine, talking to others
       - Frustrated speech

  5. Speaker variability
     - Gender, age, dialect, health
     - Speaker dependent systems
     - Speaker independent systems
     - Speaker adaptive systems
       - Enrolment stage (acoustics and language)

  6. Environment Variability
     - Different background noises
       - Office vs Outside
     - Different applications, different environments
       - Desktop dictation, to Warehouse pick
     - Single speaker vs Multispeaker
     - Background music

  7. Channel Variability
     - Telephone vs Desktop
     - 8 kHz vs 16 kHz
     - Mobile vs Desktop
     - Close-talking vs far-field
     - Cell Phone vs Landline vs VOIP

  8. Measuring Speech Recognition Error
     - Word Error Rate
       - Substitutions: word is replaced
       - Deletions: word is missed out
       - Insertions: word is added
     - WER = 100% × (Subs + Dels + Ins) / (words in correct sentence)

  9. Word Error Rate
     - WER requires:
       - Transcription (the correct word string)
       - Alignment between ASR output and Transcript
       - Not just left to right matching
     - Sometimes Accuracy is given
       - 100-WER
       - NOT number of words correct

  10. Word Error Rate
     - Can get > 100%
       - But something is very wrong
     - Outputting “the” only, ignoring the speech
       - Sometimes gives WER < 100%
     - All words are treated equal
       - “This specimen” vs “The specimen”
       - “Is absent” vs “Is present”
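
The WER definition on the preceding slides can be sketched as a word-level edit distance. This is a minimal illustration, assuming whitespace tokenization and equal cost for substitutions, deletions and insertions; the function name `wer` is chosen here and is not taken from any scoring tool.

```python
def wer(reference, hypothesis):
    """Word error rate in percent, using a minimum-edit-distance alignment."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # d[i][j] = fewest edits needed to turn ref[:i] into hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # delete every reference word
    for j in range(len(hyp) + 1):
        d[0][j] = j  # insert every hypothesis word

    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            deletion = d[i - 1][j] + 1
            insertion = d[i][j - 1] + 1
            d[i][j] = min(substitution, deletion, insertion)

    return 100.0 * d[-1][-1] / len(ref)


print(wer("is absent", "is present"))            # one substitution in two words
print(wer("hello", "hello there my friend"))     # three insertions, one reference word
print(wer("the cat sat on the mat", "the"))      # five deletions in six words
```

The second call shows how insertions push WER above 100%, and the third shows how outputting only “the” can still score below 100%. The first call scores the same as any other single substitution, even though “absent” vs “present” reverses the meaning.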

  11. Signal Acquisition
     - High-quality signal
     - Lower sample rate will increase WER
       - 8 kHz: baseline
       - 16 kHz: -10%

  12. End-Point Detection
     - Long silence will likely increase WER
       - It will recognize phantom words
     - Need to find the speech in the signal
     - VAD (Voice Activity Detection)
       - Find beginning and end of speech
     - Typically do continuous recognition
       - Recognized while listening
       - But need end point (have to wait)
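
Finding the beginning and end of speech can be sketched with a simple frame-energy threshold. This is only one basic approach, shown under assumptions not stated on the slide: samples are floats in [-1, 1], frames do not overlap, and the values of `frame_len` and `threshold` are arbitrary illustrative choices.

```python
def find_endpoints(samples, frame_len=160, threshold=0.01):
    """Return (start, end) sample indices of the active region, or None.

    frame_len=160 corresponds to 20 ms at 8 kHz (an assumed setting).
    """
    active_starts = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energy = sum(x * x for x in frame) / frame_len  # mean squared amplitude
        if energy > threshold:
            active_starts.append(start)
    if not active_starts:
        return None  # nothing above threshold: treat as all silence
    return active_starts[0], active_starts[-1] + frame_len


# Synthetic signal: silence, a loud stretch, then silence again.
signal = [0.0] * 480 + [0.5, -0.5] * 240 + [0.0] * 320
print(find_endpoints(signal))
```

A fixed threshold like this is fragile in noise; it is meant only to show what "trimming long silence before recognition" looks like in code.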

  13. Feature normalization
     - Sometimes do normalization
       - Remove mean from MFCCs
       - Can make recognition more reliable in noise
     - Often include deltas and delta deltas
     - Sometimes do feature reduction
       - Principal Component Analysis
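
Mean removal and deltas can be sketched as follows. The sketch assumes the features arrive as a list of per-frame coefficient lists, uses a plain two-frame difference for the deltas, and repeats the boundary frame at the edges; real front ends often use a wider regression window. The function names are illustrative.

```python
def mean_normalize(frames):
    """Subtract the per-coefficient mean, computed over all frames."""
    n = len(frames)
    dims = len(frames[0])
    means = [sum(f[d] for f in frames) / n for d in range(dims)]
    return [[f[d] - means[d] for d in range(dims)] for f in frames]


def add_deltas(frames):
    """Append first-order deltas: (next frame - previous frame) / 2."""
    last = len(frames) - 1
    out = []
    for t, frame in enumerate(frames):
        prev_frame = frames[max(t - 1, 0)]    # repeat first frame at the start
        next_frame = frames[min(t + 1, last)]  # repeat last frame at the end
        delta = [(b - a) / 2.0 for a, b in zip(prev_frame, next_frame)]
        out.append(list(frame) + delta)
    return out


# Three frames of two made-up coefficients each.
features = [[1.0, 10.0], [2.0, 12.0], [3.0, 14.0]]
normalized = mean_normalize(features)
with_deltas = add_deltas(normalized)
print(with_deltas)
```

Delta-deltas would come from applying the same difference to the delta columns.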

  14. What phones/segments
     - Need the best set for discrimination
     - Not necessarily the same as Linguistic Phones
     - More phones means more training
     - And needs to have consistent Lexicon
     - Extra phones
       - t vs dx
       - t vs nx: / t w eh n t iy / vs / t w eh nx iy /
       - Stops as closures and bursts
       - Schwas: ax and ix
       - Syllabics: el, em, en
       - Accents/Tones: ah1, ah0, …

  15. Context dependency
     - Care about the contexts of each phone
     - Post vocalic /r/ and /n/ /m/ affect vowel
     - Utterance start and end affect phonemes
     - Need more than simple phone models

  16. Tri-phone Models
     - Have models for each phone and context
     - 43^3 contexts about 80K models
     - Not all contexts have enough examples
       - oy (oy) oy very rare
       - sh (ax) n very common
     - Merge tri-phones that are similar
       - E.g. t(ih)n with d(ih)n
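
The model count and the left(center)right notation above can be sketched in a few lines. The padding symbol `sil` for utterance boundaries and the function name `to_triphones` are assumptions made for this illustration.

```python
NUM_PHONES = 43
print(NUM_PHONES ** 3)  # 79507, i.e. "about 80K" possible triphones


def to_triphones(phones, boundary="sil"):
    """Expand a phone sequence into left(center)right triphone labels."""
    padded = [boundary] + list(phones) + [boundary]
    return [
        f"{padded[i - 1]}({padded[i]}){padded[i + 1]}"
        for i in range(1, len(padded) - 1)
    ]


triphones = to_triphones("p eh n s ih l".split())
print(triphones)
```

Counting how often each label occurs in the training data is what reveals the rare contexts that need merging.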

  17. Find phones to merge
     - Using phonetic features
       - Most similar features, most similar acoustics
       - Stops, voicing, vowel type …
     - Usually automatic clustering of triphones
       - Using CART trees indexed by phonetic features
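
The idea of merging by phonetic features can be illustrated with a hand-written stand-in. This is not CART clustering: a real system grows decision trees from training data, while this sketch simply groups triphones whose left and right contexts fall in the same broad class. The `BROAD_CLASS` table is a small assumed example, not a complete feature set.

```python
# Assumed broad classes for a handful of phones (illustrative only).
BROAD_CLASS = {
    "t": "stop", "d": "stop", "p": "stop", "b": "stop",
    "n": "nasal", "m": "nasal",
    "s": "fricative", "sh": "fricative",
}


def cluster_key(left, center, right):
    """Replace each context phone by its broad class; keep the center phone."""
    return (BROAD_CLASS.get(left, left), center, BROAD_CLASS.get(right, right))


def cluster_triphones(triphones):
    """Group (left, center, right) tuples that share a cluster key."""
    clusters = {}
    for tri in triphones:
        clusters.setdefault(cluster_key(*tri), []).append(tri)
    return clusters


clusters = cluster_triphones([("t", "ih", "n"), ("d", "ih", "n"), ("s", "ih", "n")])
print(clusters)
```

Here t(ih)n and d(ih)n land in the same group because both left contexts are stops, while s(ih)n stays separate.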

  18. Adaptation
     - Change behavior after use
     - Human adaptation
       - They will change how they speak
     - Channel adaptation
       - Cepstral Normalization
     - Model adaptation
       - Move the means (or weights on means)

  19. Adaptation
     - Assume recognition is correct
       - (Maybe with some threshold)
     - Modify model to make answer more correct
     - Adaptation to speaker characteristics
     - Adaptation to speaker style
     - Can improve accuracy by a few %
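
"Moving the means" under a confidence threshold can be sketched as below. The slides do not say which adaptation method is used; this sketch assumes a MAP-style interpolation between the old mean and the new frames, with an assumed prior weight `tau` and an assumed confidence cutoff `min_confidence`.

```python
def adapt_mean(old_mean, frames, tau=10.0):
    """Interpolate between the old mean and the mean of the new frames.

    tau acts like a count of pseudo-observations behind the old mean:
    a larger tau makes the model move more slowly.
    """
    n = len(frames)
    dims = len(old_mean)
    sums = [sum(f[d] for f in frames) for d in range(dims)]
    return [(tau * old_mean[d] + sums[d]) / (tau + n) for d in range(dims)]


def adapt_if_confident(old_mean, frames, confidence, min_confidence=0.9):
    """Only adapt when the recognizer's output is trusted enough."""
    if confidence < min_confidence or not frames:
        return list(old_mean)
    return adapt_mean(old_mean, frames)


old = [0.0, 0.0]
new_frames = [[1.0, 2.0]] * 10  # frames aligned to this model by the recognizer
print(adapt_if_confident(old, new_frames, confidence=0.95))
print(adapt_if_confident(old, new_frames, confidence=0.50))
```

With ten frames and `tau=10`, the mean moves halfway toward the new data; the low-confidence call leaves it unchanged.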

  20. Pronunciation lexicon
     - Need list of words and their pronunciation
       - Pencil p eh n s ih l
       - Two t uw
       - Too t uw
       - …
     - Need pronunciation of ALL words

  21. What’s a word
     - Basic words are clear
     - What about morphological variants
       - walk, walks, walked, walking
     - Multi-word words
       - Los Angeles, New York
     - Contractions
       - Wanna, gonna …
     - Yes, ALL words that you will recognize

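
A lexicon that holds several pronunciations per word, plus a check that every vocabulary word is covered, can be sketched as a plain dictionary. The entries reuse pronunciations from the slides; the structure and the helper names `pronunciations` and `missing_words` are assumptions for illustration.

```python
# Each word maps to a list of pronunciations (space-separated phones).
LEXICON = {
    "pencil": ["p eh n s ih l"],
    "two": ["t uw"],
    "too": ["t uw"],
    "bass": ["b ae s", "b ey s"],
    "route": ["r uw t", "r aw t"],
    "water": ["w ao t er", "w ao dx er"],
}


def pronunciations(word, lexicon):
    """Return all known pronunciations of a word (empty list if unknown)."""
    return lexicon.get(word.lower(), [])


def missing_words(vocabulary, lexicon):
    """List vocabulary words that have no pronunciation in the lexicon."""
    return sorted(w for w in set(vocabulary) if w.lower() not in lexicon)


print(pronunciations("Route", LEXICON))
print(missing_words(["two", "routes", "bass"], LEXICON))
```

The second call reports "routes" as missing: a morphological variant is a separate word as far as the lexicon is concerned, so it needs its own entry.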
