

  1. Acoustic Modeling
     Hsin-min Wang
     References:
     1. X. Huang et al., Spoken Language Processing, Chapter 9
     2. The HTK Book

  2. Definition of the Speech Recognition Problem
     ❑ For a given acoustic observation X = X_1 X_2 \ldots X_n, the goal of speech recognition is to find the corresponding word sequence W = w_1 w_2 \ldots w_m that has the maximum posterior probability P(W \mid X):

       \hat{W} = \arg\max_{W} P(W \mid X) = \arg\max_{W} \frac{P(W)\,P(X \mid W)}{P(X)} = \arg\max_{W} P(W)\,P(X \mid W)

       where W = w_1 w_2 \ldots w_i \ldots w_m and w_i \in V = \{v_1, v_2, \ldots, v_N\}
     ❑ P(W) is the subject of language modeling; P(X \mid W) is the subject of acoustic modeling
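To make the decomposition concrete, below is a minimal, purely illustrative Python sketch of decoding as arg max over W of P(W)·P(X|W), scored in the log domain. The unigram probabilities and the acoustic score are hypothetical stand-ins of my own, not real models: an actual recognizer uses n-gram language models and HMM acoustic models over feature vectors.

    import math

    # Hypothetical toy unigram language model P(W); illustrative values only.
    UNIGRAM = {"ford": 0.4, "or": 0.3, "four": 0.2, "door": 0.1}

    def log_lm(words):
        # log P(W) under the toy unigram model (unseen words get a floor).
        return sum(math.log(UNIGRAM.get(w, 1e-6)) for w in words)

    def log_am(observation, words):
        # Stand-in for log P(X|W); a real system sums over HMM state paths.
        # Here: a dummy score favoring hypotheses whose length matches X.
        return -abs(len(observation) - 2 * len(words))

    def decode(observation, candidates):
        # W_hat = arg max over candidates of log P(W) + log P(X|W).
        return max(candidates,
                   key=lambda w: log_lm(w) + log_am(observation, w))

    X = [0.0] * 4  # dummy acoustic observation X = X1 X2 ... Xn
    print(decode(X, [["ford", "or"], ["four", "door"]]))  # ['ford', 'or']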

  3. Major Challenges
     ❑ The practical challenge is how to build accurate acoustic models, P(X|W), and language models, P(W), that truly reflect the spoken language to be recognized
       – For large-vocabulary speech recognition there are a large number of words, so we need to decompose a word into a subword sequence; P(X|W) is therefore closely related to phonetic modeling
       – P(X|W) should take into account speaker variations, pronunciation variations, environment variations, and context-dependent phonetic coarticulation
       – No static acoustic or language model can meet the needs of real applications, so it is vital to dynamically adapt both P(X|W) and P(W) to maximize P(W|X)
     ❑ The decoding process of finding the best word sequence W to match the input speech signal X is more than a simple pattern recognition problem, since there is an essentially infinite number of word sequences to search

  4. Variability in the Speech Signal
     ❑ Context variability
       – Context variability at the word/sentence level
         • E.g., "Mr. Wright should write to Ms. Wright right away about his Ford or four door Honda"
         • Same pronunciation but different meaning: Wright, write, right
         • Phonetically identical yet semantically distinct: Ford or vs. four door
       – Context variability at the phonetic level
         • The acoustic realization of the phoneme /ee/ in the words peat and wheel depends on its left and right context
         • What about fast speech or spontaneous speech?

  5. Variability in the Speech Signal (cont.)
     ❑ Style variability
       – Isolated speech recognition
         • Users have to pause between each word
         • Eliminates errors such as Ford or vs. four door
         • A significant reduction in computational complexity
         • Unnatural to most people; the throughput is significantly lower than for continuous speech
       – Continuous speech recognition
         • The error rate for casual, spontaneous, and conversational speech is higher than for carefully articulated read speech
         • The higher the speaking rate, the higher the error rate
         • Emotional changes cause even more significant variations

  6. Variability in the Speech Signal (cont.)
     ❑ Speaker variability
       – Inter-speaker differences
         • Vocal tract size, length and width of the neck, and a range of physical characteristics: gender, age, dialect, health, education, and personal style
       – The same speaker is often unable to produce the same utterance precisely
         • The shape of the vocal tract, its movement, and the rate of delivery may vary from utterance to utterance
       – Speaker-independent (SI) speech recognition
         • Large performance fluctuations among different speakers
         • Speakers with accents have higher error rates
       – Speaker-dependent (SD) speech recognition
         • With SD data and training, the system can capture the speaker-dependent acoustic characteristics and thus improve recognition accuracy
         • A typical SD speech recognition system can reduce the word recognition error by more than 30% compared with a comparable SI system

  7. Variability in the Speech Signal (cont.)
     ❑ Environment variability
       – The world we live in is full of sounds of varying loudness from different sources, so we have to deal with various background sounds (noises)
       – In mobile environments, the noise spectrum varies significantly as the speaker moves around
       – Noise may also come from the input device itself, such as microphone and A/D interference noise
       – We can reduce error rates by using multi-style training or adaptive techniques
       – Environment variability remains one of the most severe challenges facing today's state-of-the-art speech systems

  8. Evaluation of Automatic Speech Recognition
     ❑ Performance evaluation of speech recognition systems is critical, and the word error rate (WER) is one of the most important measures
     ❑ There are typically three types of word recognition errors: substitution, deletion, and insertion
       – E.g., Correct: "the effect is clear"; Recognized: "effect is not clear" → one deletion (the) and one insertion (not)

       \text{WER} = \frac{\text{Subs} + \text{Dels} + \text{Ins}}{\text{No. of words in the correct sentence}} \times 100\%

       – For the example above: WER = (0 + 1 + 1) / 4 × 100% = 50%
     ❑ The WER is calculated by aligning the correct word string against the recognized word string
       – A maximum substring matching problem, handled by the dynamic programming algorithm

  9. Algorithm to Measure the WER

     Cor_i   // denotes the word length of the correct sentence
     Rec_j   // denotes the word length of the recognized sentence

     // Two common settings of error penalties
     subPen = 10;      /* HTK error penalties */
     delPen = 7;
     insPen = 7;
     subPenNIST = 4;   /* NIST error penalties */
     delPenNIST = 3;
     insPenNIST = 3;

     Presentation topic: Write a tool to calculate the speech recognition accuracy of the 2nd project, and give a presentation introducing your algorithm and source code.
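As a concrete starting point for the presentation topic above, here is a minimal Python sketch of the standard dynamic-programming alignment: it fills a cost table with the penalties quoted above (HTK defaults; pass sub_pen=4, del_pen=3, ins_pen=3 for the NIST setting), backtraces to count the three error types, and returns the WER. The function and variable names are mine; this is an illustrative implementation of the standard algorithm, not the actual HTK or NIST scoring code.

    def word_error_rate(correct, recognized, sub_pen=10, del_pen=7, ins_pen=7):
        # Align the correct and recognized word strings by dynamic programming.
        cor, rec = correct.split(), recognized.split()
        n, m = len(cor), len(rec)
        # cost[i][j]: cheapest alignment of the first i correct words
        # with the first j recognized words.
        cost = [[0] * (m + 1) for _ in range(n + 1)]
        for i in range(1, n + 1):
            cost[i][0] = i * del_pen
        for j in range(1, m + 1):
            cost[0][j] = j * ins_pen
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                hit_or_sub = cost[i - 1][j - 1] + \
                    (0 if cor[i - 1] == rec[j - 1] else sub_pen)
                cost[i][j] = min(hit_or_sub,
                                 cost[i - 1][j] + del_pen,
                                 cost[i][j - 1] + ins_pen)
        # Backtrace the optimal path to count the three error types.
        subs = dels = ins = 0
        i, j = n, m
        while i > 0 or j > 0:
            if (i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1]
                    + (0 if cor[i - 1] == rec[j - 1] else sub_pen)):
                subs += cor[i - 1] != rec[j - 1]
                i, j = i - 1, j - 1
            elif i > 0 and cost[i][j] == cost[i - 1][j] + del_pen:
                dels += 1
                i -= 1
            else:
                ins += 1
                j -= 1
        # WER = 100% * (Subs + Dels + Ins) / words in the correct sentence.
        return 100.0 * (subs + dels + ins) / n

    print(word_error_rate("the effect is clear", "effect is not clear"))  # 50.0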

  10. Signal Processing – Extracting Features
     ❑ Signal acquisition
       – Microphone + PC sound card (sampling rate)
     ❑ End-point detection
       – We can activate speech signal acquisition either by push-to-talk (push and hold while talking) or by continuous listening; the latter requires a speech end-point detector
     ❑ MFCC and its dynamic features
       – Time-domain features vs. frequency-domain features
       – Capture temporal changes by using delta coefficients (see the sketch after this slide)
     ❑ Feature transformation
       – We can transform the feature vectors to improve class separability
       – We can use a number of dimension reduction techniques to map the feature vectors into more effective representations, e.g. principal component analysis (PCA), linear discriminant analysis (LDA), etc.
     Presentation topic: LDA for speech recognition
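As an illustration of the dynamic-feature idea, here is a small NumPy sketch of delta coefficients computed with the regression formula described in the HTK Book, d_t = \sum_k k (c_{t+k} - c_{t-k}) / (2 \sum_k k^2). The window size, the edge-padding convention, and the 13-dimensional MFCC input are illustrative choices of mine.

    import numpy as np

    def delta(features, window=2):
        # Delta coefficients via the regression formula over +/- `window`
        # frames; edges are padded by repeating the first/last frame
        # (one common convention).
        denom = 2 * sum(k * k for k in range(1, window + 1))
        padded = np.pad(features, ((window, window), (0, 0)), mode="edge")
        n = len(features)
        return sum(k * (padded[window + k:n + window + k] -
                        padded[window - k:n + window - k])
                   for k in range(1, window + 1)) / denom

    mfcc = np.random.randn(100, 13)  # hypothetical 13-dim static MFCC frames
    obs = np.hstack([mfcc, delta(mfcc), delta(delta(mfcc))])  # MFCC + Δ + ΔΔ
    print(obs.shape)  # (100, 39)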

  11. Phonetic Modeling – Selecting Appropriate Units
     ❑ For general-purpose large-vocabulary speech recognition, it is difficult to build whole-word models because
       – Every new task contains novel words with no available training data, such as proper nouns and newly invented words
       – There are simply too many words, and the same word may have different acoustic realizations; it is unlikely we will have sufficient repetitions of all words in all contexts
     ❑ Issues in choosing appropriate modeling units (see the lexicon sketch below)
       – Accurate: accurately represent the acoustic realization that appears in different contexts
       – Trainable: have enough data to estimate the parameters of the unit (HMM model parameters)
       – Generalizable: any new word can be derived from a predefined unit inventory, for task-independent speech recognition
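A minimal sketch of the generalizability point: a pronunciation lexicon maps any word, including ones never seen in the acoustic training data, onto a shared phone inventory. The entries below are illustrative; real systems use large dictionaries such as CMUdict.

    # Hypothetical mini lexicon; real systems use dictionaries like CMUdict.
    LEXICON = {
        "peat":  ["p", "iy", "t"],
        "wheel": ["w", "iy", "l"],
    }

    def word_to_phones(word):
        # Decompose a word into its subword (phone) sequence; P(X|W) is then
        # built from models of these shared units rather than whole words.
        return LEXICON[word.lower()]

    print([word_to_phones(w) for w in ["peat", "wheel"]])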

  12. Comparison of Different Units
     ❑ Word vs. subword
       – Word: carries semantic meaning and captures within-word coarticulation; accurate if enough data are available; trainable only for small tasks; not generalizable
         • For small-vocabulary speech recognition, e.g. digit recognition, whole-word models are both accurate and trainable, and there is no need to generalize
       – Phone: more trainable and generalizable, but less accurate
         • There are only about 50 phones in English and 30 in Mandarin Chinese
         • The realization of a phoneme is strongly affected by its immediately neighboring phonemes
       – Syllable: a compromise between the word and phonetic models; syllables are larger than phones
         • There are only about 1,300 tone-dependent syllables in Mandarin Chinese and 50 in Japanese, which makes the syllable a suitable unit for these languages
         • The large number of syllables (over 30,000) in English presents a challenge in terms of trainability

  13. Context Dependency
     ❑ Phone and phoneme
       – In speech science, the term phoneme denotes any of the minimal units of speech sound in a language that can serve to distinguish one word from another
       – The term phone denotes a phoneme's acoustic realization
       – E.g., the English phoneme /t/ has two very different acoustic realizations in the words sat and meter; we had better treat them as two different phones when building a spoken language system
     ❑ Why context dependency?
       – If we make units context dependent, we can significantly improve recognition accuracy, provided there are enough training data
       – A context usually refers to the immediate left and/or right neighboring phones

  14. Context Dependency (cont.)
     ❑ Triphone (intra-word triphone)
       – A triphone model is a phonetic model that takes into consideration both the left and the right neighboring phones (see the sketch below)
       – Two phones having the same identity but different left or right contexts are considered different triphones
       – Triphone models capture the most important coarticulatory effects
       – Trainability is a challenging issue: we need to balance trainability and accuracy with a number of parameter-sharing techniques
     ❑ Modeling inter-word context-dependent phones is complicated
       – The juncture effect at word boundaries is one of the most serious coarticulation phenomena in continuous speech
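To make the notion concrete, here is a small Python sketch that expands a phone sequence into word-internal triphones using the HTK-style l-p+r naming. Treating the word-boundary phones as left/right biphones (keeping only the available context) is just one common convention; the phone sequence for "peat" is illustrative.

    def to_triphones(phones):
        # Expand a word-internal phone sequence into HTK-style triphones
        # (left-phone+right); boundary phones keep only the context they have.
        tris = []
        for i, p in enumerate(phones):
            left = phones[i - 1] + "-" if i > 0 else ""
            right = "+" + phones[i + 1] if i < len(phones) - 1 else ""
            tris.append(left + p + right)
        return tris

    # "peat" -> /p iy t/: the realization of iy is modeled in its p/t context.
    print(to_triphones(["p", "iy", "t"]))  # ['p+iy', 'p-iy+t', 'iy-t']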
