Speech Processing 15-492/18-492: Speech Recognition (Acoustic Modeling, Pronunciation Dictionary)


  1. Speech Processing 15-492/18-492
     - Speech Recognition
     - Acoustic modeling
     - Pronunciation dictionary

  2. Acoustic Modeling
     - Speech and Signal Variability
     - Measuring Error
     - Pronunciation lexicons

  3. Variability in Speech Signal
     - “Mr Wright should write to Ms Wright right away about his Ford or four door Honda.”
     - Homophones: same pronunciation
       - “wright”, “right”, “write”: / r ay t /
       - “ford or”, “four door”: / f ao r d ao r /
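
The homophone problem on this slide can be illustrated with a small sketch: if words are grouped by their phone string, every group with more than one member is acoustically ambiguous. The mini-lexicon below is a hypothetical illustration; only the pronunciations quoted on the slide are used.

```python
from collections import defaultdict

# Hypothetical mini-lexicon; pronunciations are the ones quoted on the slide.
LEXICON = {
    "wright": "r ay t",
    "right": "r ay t",
    "write": "r ay t",
    "ford or": "f ao r d ao r",
    "four door": "f ao r d ao r",
}


def homophone_groups(lexicon):
    """Group entries by pronunciation and keep only the ambiguous groups."""
    by_pron = defaultdict(list)
    for word, pron in lexicon.items():
        by_pron[pron].append(word)
    return {pron: words for pron, words in by_pron.items() if len(words) > 1}


groups = homophone_groups(LEXICON)
for pron, words in groups.items():
    print(f"/ {pron} / -> {words}")
```

The acoustic model alone cannot separate the members of a group; something beyond the acoustics has to choose among them.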

  4. Style Variability
     - Different articulation in different situations
     - Clear vs conversational
     - Whisper vs shouting
     - Talking to machine, talking to others
     - Frustrated speech

  5. Speaker variability
     - Gender, age, dialect, health
     - Speaker dependent systems
     - Speaker independent systems
     - Speaker adaptive systems
       - Enrolment stage (acoustics and language)

  6. Environment Variability
     - Different background noises
       - Office vs outside
     - Different applications, different environments
       - Desktop dictation, to warehouse pick
     - Single speaker vs multispeaker
     - Background music

  7. Channel Variability
     - Telephone vs desktop
       - 8 kHz vs 16 kHz
     - PDA vs desktop
     - Close-talking vs far-field
     - Cell phone vs landline

  8. Measuring Speech Recognition Error
     - Word Error Rate
       - Substitutions: word is replaced
       - Deletions: word is missed out
       - Insertions: word is added
     - $\mathrm{WER} = 100\% \times \dfrac{\text{Subs} + \text{Dels} + \text{Ins}}{\text{words in correct sentence}}$

  9. Word Error Rate
     - WER requires:
       - Transcription (the correct word string)
       - Alignment between ASR output and transcript
         - Not just left-to-right matching
     - Sometimes Accuracy is given
       - 100 - WER
       - NOT number of words correct
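
A minimal sketch of why WER needs an alignment rather than left-to-right matching. It assumes the usual minimum edit distance (Levenshtein) alignment over words, which the slides do not name explicitly. The function names and the example hypothesis are illustrative.

```python
def wer(reference, hypothesis):
    """WER in percent: minimum (subs + dels + ins) over all alignments."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # d[i][j] = fewest edits needed to turn ref[:i] into hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # delete every reference word
    for j in range(len(hyp) + 1):
        d[0][j] = j  # insert every hypothesis word
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j - 1] + cost,  # match or substitution
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
            )
    return 100.0 * d[len(ref)][len(hyp)] / len(ref)


def positional_mismatches(reference, hypothesis):
    """Naive left-to-right comparison, shown only for contrast."""
    return sum(r != h for r, h in zip(reference.split(), hypothesis.split()))


reference = "mr wright should write to ms wright right away"
hypothesis = "mr right should write ms wright right away now"
print(f"WER: {wer(reference, hypothesis):.2f}%")
print("positional mismatches:", positional_mismatches(reference, hypothesis))
```

One dropped word shifts everything after it, so position-by-position matching counts far more errors than the aligned count does.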

  10. Word Error Rate
     - Can get > 100%
       - But something is very wrong
     - Outputting “the” only, ignoring the speech
       - Sometimes gives WER < 100%
     - All words are treated equal
       - “This specimen” vs “The specimen”
       - “Is absent” vs “Is present”
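
The three observations on this slide can be checked numerically. This sketch assumes WER is computed from a word-level minimum edit distance. The reference and hypothesis strings, other than the slide's own “specimen” and “present” pairs, are made up for illustration.

```python
def edit_distance(ref, hyp):
    """Word-level Levenshtein distance, keeping only one DP row at a time."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j - 1] + (r != h),  # match or substitution
                            prev[j] + 1,             # deletion
                            curr[j - 1] + 1))        # insertion
        prev = curr
    return prev[-1]


def wer_percent(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    return 100.0 * edit_distance(ref, hyp) / len(ref)


# Many insertions against a short reference push WER above 100%.
over_100 = wer_percent("is present", "this is not at all present")

# Outputting only "the" still scores below 100%: one word happens to match.
only_the = wer_percent("the cat sat on the mat", "the")

# A harmless and a meaning-reversing substitution cost exactly the same.
harmless = wer_percent("this specimen", "the specimen")
reversing = wer_percent("is present", "is absent")

print(over_100, round(only_the, 1), harmless, reversing)
```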

  11. Signal Acquisition
     - High signal quality
       - Lower sample rate will increase WER
       - 8 kHz: baseline
       - 16 kHz: -10% WER

  12. End-Point Detection
     - Long silence will likely increase WER
       - It will recognize phantom words
     - Need to find the speech in the signal
       - VAD (Voice Activity Detection)
       - Find beginning and end of speech
     - Typically do continuous recognition
       - Recognized while listening
       - But need end point (have to wait)
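
A minimal sketch of end-point detection, assuming the simplest possible VAD: a fixed threshold on short-time frame energy. The slides do not say which VAD method is meant, and practical detectors are considerably more robust. The frame length, threshold, and synthetic signal are all illustrative choices.

```python
import math


def find_endpoints(samples, frame_len=160, threshold=0.01):
    """Return (start, end) sample indices of the detected speech, or None.

    A frame counts as speech when its mean squared amplitude exceeds
    `threshold`; the end points are the first and last such frames.
    """
    energies = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energies.append(sum(x * x for x in frame) / frame_len)
    active = [e > threshold for e in energies]
    if not any(active):
        return None
    first = active.index(True)
    last = len(active) - 1 - active[::-1].index(True)
    return first * frame_len, (last + 1) * frame_len


# Synthetic test signal at 8 kHz: silence, a 440 Hz tone as "speech", silence.
silence = [0.0] * 800
tone = [0.5 * math.sin(2 * math.pi * 440 * n / 8000) for n in range(1600)]
signal = silence + tone + silence

endpoints = find_endpoints(signal)
print(endpoints)
```

Only the samples between the two end points would be passed to the recognizer, so the leading and trailing silence cannot produce phantom words.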

  13. Feature normalization
     - Sometimes do normalization
       - Remove mean from MFCCs
       - Can make recognition more reliable in noise
     - Often include deltas and delta-deltas
     - Sometimes do feature reduction
       - Principal Component Analysis
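
Mean removal and delta features can be sketched on plain lists of per-frame coefficient vectors. This assumes per-utterance mean subtraction and a simple two-point difference for the deltas. Real front ends often use a longer regression window, and the tiny two-dimensional "MFCC" frames are made up.

```python
def cepstral_mean_normalize(frames):
    """Subtract the per-coefficient mean over the utterance from every frame."""
    n, dim = len(frames), len(frames[0])
    means = [sum(f[k] for f in frames) / n for k in range(dim)]
    return [[f[k] - means[k] for k in range(dim)] for f in frames]


def deltas(frames):
    """Two-point difference (next - previous) / 2, repeating the edge frames."""
    n = len(frames)
    out = []
    for t in range(n):
        prev = frames[max(t - 1, 0)]
        nxt = frames[min(t + 1, n - 1)]
        out.append([(b - a) / 2 for a, b in zip(prev, nxt)])
    return out


def add_deltas(frames):
    """Append deltas and delta-deltas to each static feature vector."""
    d = deltas(frames)
    dd = deltas(d)
    return [f + a + b for f, a, b in zip(frames, d, dd)]


# Made-up 2-dimensional frames standing in for MFCC vectors.
mfcc = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
normalized = cepstral_mean_normalize(mfcc)
features = add_deltas(normalized)
print(normalized)
print(len(features[0]), "values per frame")
```

Appending deltas and delta-deltas triples the vector length, which is one reason a reduction step such as PCA is sometimes applied afterwards.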

  14. What phones/segments
     - Need the best set for discrimination
       - Not necessarily the same as linguistic phones
       - More phones means more training
       - And needs to have a consistent lexicon
     - Extra phones
       - t vs dx
       - t vs nx: / t w eh n t iy / vs / t w eh nx iy /
       - Stops as closures and bursts
       - Schwas: ax and ix
       - Syllabics: el, em, en
       - Accents/Tones: ah1, ah0, …

  15. Context dependency
     - Care about the contexts of each phone
       - Post-vocalic /r/ and /n/ /m/ affect the vowel
       - Utterance start and end affect phonemes
     - Need more than simple phone models

  16. Tri-phone Models
     - Have models for each phone and context
       - 43^3 contexts, about 80K models
     - Not all contexts have enough examples
       - oy(oy)oy very rare
       - sh(ax)n very common
     - Merge tri-phones that are similar
       - E.g. t(ih)n with d(ih)n
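
The model count and the left(centre)right notation can be made concrete with a short sketch. The `sil` boundary symbol used to pad the utterance edges is an assumption for illustration, not something the slides specify.

```python
NUM_PHONES = 43

# One model per (left, centre, right) combination: "about 80K".
print(NUM_PHONES ** 3)


def triphones(phones, boundary="sil"):
    """Label each phone with its neighbours, written left(centre)right."""
    padded = [boundary] + list(phones) + [boundary]
    return [f"{left}({centre}){right}"
            for left, centre, right in zip(padded, padded[1:], padded[2:])]


print(triphones(["sh", "ax", "n"]))
print(triphones(["p", "eh", "n", "s", "ih", "l"]))
```

Counting how often each label occurs in a training set is what reveals the rare contexts that need to be merged.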

  17. Find phones to merge
     - Using phonetic features
       - Most similar feature, most similar acoustics
       - Stops, voicing, vowel type …
     - Usually automatic clustering of triphones
       - Using CART trees indexed by phonetic features
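
As a rough illustration of merging by phonetic features, the sketch below ties triphones whose left and right contexts fall into the same hand-written feature class. This is a deliberately simplified stand-in, not the CART clustering the slide refers to, which chooses its feature questions automatically from data. The class table is hypothetical.

```python
# Hypothetical, heavily simplified feature classes for a few context phones.
PHONE_CLASS = {
    "t": "alveolar-stop",
    "d": "alveolar-stop",
    "p": "bilabial-stop",
    "b": "bilabial-stop",
    "n": "nasal",
    "m": "nasal",
}


def tied_name(left, centre, right):
    """Name of the shared model: contexts are replaced by their feature class.

    Phones missing from the table keep their own identity, so they are
    never merged with anything else.
    """
    left_class = PHONE_CLASS.get(left, left)
    right_class = PHONE_CLASS.get(right, right)
    return f"{left_class}({centre}){right_class}"


print(tied_name("t", "ih", "n"))
print(tied_name("d", "ih", "n"))
print(tied_name("p", "ih", "n"))
```

With this table, t(ih)n and d(ih)n share one model, as in the example on the previous slide, while p(ih)n stays separate.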

  18. Adaptation
     - Change behavior after use
     - Human adaptation
       - They will change how they speak
     - Channel adaptation
       - Cepstral normalization
     - Model adaptation
       - Move the means (or weights on means)

  19. Adaptation
     - Assume recognition is correct
       - (Maybe with some threshold)
     - Modify model to make answer more correct
       - Adaptation to speaker characteristics
       - Adaptation to speaker style
       - Can improve accuracy by a few %
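
One way to picture "assume recognition is correct, maybe with some threshold" combined with moving a model mean is the sketch below. It assumes a MAP-style interpolation between the old mean and the new speaker's frames, applied only when the recognizer's confidence is high enough. The slides do not say which update rule is used, and the names `tau` and `min_confidence` are illustrative.

```python
def adapt_mean(mean, frames, tau=10.0):
    """Pull a Gaussian mean towards the frames assigned to it.

    `tau` acts as the weight of the prior mean: with few frames the mean
    barely moves, with many frames it approaches the frames' average.
    """
    n = len(frames)
    if n == 0:
        return list(mean)
    sums = [sum(f[k] for f in frames) for k in range(len(mean))]
    return [(tau * mean[k] + sums[k]) / (tau + n) for k in range(len(mean))]


def adapt_if_confident(mean, frames, confidence, min_confidence=0.8, tau=10.0):
    """Adapt only when the recognition result is trusted."""
    if confidence < min_confidence:
        return list(mean)
    return adapt_mean(mean, frames, tau)


old_mean = [0.0, 0.0]
speaker_frames = [[2.0, 4.0]] * 10

trusted = adapt_if_confident(old_mean, speaker_frames, confidence=0.95)
untrusted = adapt_if_confident(old_mean, speaker_frames, confidence=0.40)
print(trusted, untrusted)
```

The threshold matters because adapting on a misrecognized utterance moves the model towards the wrong answer.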

  20. Pronunciation lexicon
     - Need list of words and their pronunciation
       - Pencil: p eh n s ih l
       - Two: t uw
       - Too: t uw
       - …
     - Need pronunciation of ALL words
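
A lexicon of this kind maps naturally onto a dictionary, and the "ALL words" requirement becomes a coverage check. The sketch below uses the three entries from the slide. The helper name and the test vocabulary are illustrative.

```python
# Entries taken from the slide; a real lexicon would be far larger.
LEXICON = {
    "pencil": ["p", "eh", "n", "s", "ih", "l"],
    "two": ["t", "uw"],
    "too": ["t", "uw"],
}


def missing_words(vocabulary, lexicon):
    """Return the words the recognizer would need but cannot pronounce."""
    return sorted(w for w in set(vocabulary) if w.lower() not in lexicon)


vocabulary = ["two", "pencils", "too", "Pencil"]
print(missing_words(vocabulary, LEXICON))
```

Here the plural form is reported as missing even though the singular is present, because each surface form needs its own entry.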

  21. What’s a word
     - Basic words are clear
     - What about morphological variants
       - walk, walks, walked, walking
     - Multi-word words
       - Los Angeles, New York
     - Contractions
       - Wanna, gonna …
     - Yes, ALL words that you will recognize

  22. Pronunciation variants
     - Homographs (same writing, different pronunciation)
       - bass: / b ae s / (fish), / b ey s / (music)
       - project: N / p r aa jh eh k t /, V / p r ax jh eh k t /
     - Natural variants
       - route: / r uw t / and / r aw t /
       - coupon: / k uw p ao n / and / k y uw p ao n /
       - water: / w ao t er / and / w ao dx er /
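
Variants mean that a lexicon entry holds a list of pronunciations rather than a single one, and a word sequence expands into every combination. The sketch below uses pronunciations from the slide. The data layout and the ` | ` word separator are illustrative assumptions.

```python
from itertools import product

# Each word maps to all of its listed pronunciations (taken from the slide).
VARIANTS = {
    "bass": ["b ae s", "b ey s"],
    "route": ["r uw t", "r aw t"],
    "water": ["w ao t er", "w ao dx er"],
}


def pronunciations(words, lexicon):
    """Every phone-string realisation of a word sequence, one per combination."""
    options = [lexicon[w] for w in words]
    return [" | ".join(choice) for choice in product(*options)]


expansions = pronunciations(["bass", "route"], VARIANTS)
for line in expansions:
    print(line)
```

The number of combinations grows multiplicatively with sequence length, so variants are usually kept to those that actually occur.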
