  1. Lecture 6 pr@n2nsi"eIS@n "mAd@lIN Michael Picheny, Bhuvana Ramabhadran, Stanley F. Chen, Markus Nussbaum-Thom Watson Group IBM T.J. Watson Research Center Yorktown Heights, New York, USA {picheny,bhuvana,stanchen,nussbaum}@us.ibm.com 24 February 2016 and 2 March 2016

  2. Administrivia Lab 2 Not graded yet; handed back next lecture. Lab 3 Due nine days from now (Friday, Mar. 11) at 6pm. 2 / 96

  3. Lab Grading How things work: Overall scale: -1 to +1 (-2 if you don’t hand in). Programming part: max score +0.5. Short answers: default score 0, with small bonuses/penalties. +0.5 bonus if total penalties are at most 0.2 (think check-plus). Lab 1: +0.5 bonus wasn’t applied in original grading. If your score changed, you should have received an e-mail. Contact Stan if you still have questions. 3 / 96

  4. Feedback Clear (4), mostly clear (3), unclear (1). Pace: fast (2), OK (1). Muddiest: pronunciation modeling (1), Laplace smoothing (1). Comments (2+ votes): handing out grades is distracting, inefficient (2). 4 / 96

  5. Review to date Learned about features (MFCCs, etc.). Learned about Gaussian Mixture Models. Learned about HMMs and basic operations (finding best path, training models). Learned about basic language modeling. 5 / 96

  6. Where Are We? 1 How to Model Pronunciation Using HMM Topologies. 2 Modeling Context Dependence via Decision Trees. 6 / 96

  7. Where Are We? 1 How to Model Pronunciation Using HMM Topologies: Whole Word Models, Phonetic Models, Context-Dependence. 7 / 96

  8. In the beginning... ...was the whole word model. For each word in the vocabulary, decide on an HMM structure. Often the number of states in the model is chosen to be proportional to the number of phonemes in the word. Train the HMM parameters for a given word using examples of that word in the training data. Good domain for this approach: digits. 8 / 96

  9. Example topologies: Digits Vocabulary consists of (“zero”, “oh”, “one”, “two”, “three”, “four”, “five”, “six”, “seven”, “eight”, “nine”). Assume we assign two states per phoneme. Models look like the following (figures for “zero” and “oh”). 9 / 96
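As a rough illustration of the topology idea, here is a minimal Python sketch that builds a left-to-right whole-word HMM with two states per phoneme. The helper name, the baseform used for “zero”, and the arc structure (self-loop plus forward arc per state) are assumptions for illustration, not the lecture’s exact models.

```python
# Minimal sketch (not the lecture's exact topology): a left-to-right whole-word
# HMM with two states per phoneme, each state having a self-loop and a forward arc.

def whole_word_hmm(word, phonemes, states_per_phoneme=2):
    """Return state names and allowed transitions for a simple left-to-right HMM."""
    states = [f"{word}_{phone}_{i}"
              for phone in phonemes
              for i in range(states_per_phoneme)]
    arcs = [(s, s) for s in states]            # self-loops: stay in a state
    arcs += list(zip(states, states[1:]))      # forward arcs to the next state
    return states, arcs

# Illustrative baseform for "zero"; a real system would take it from a dictionary.
states, arcs = whole_word_hmm("zero", ["Z", "IY", "R", "OW"])
print(len(states), "states,", len(arcs), "arcs")   # 8 states, 15 arcs
```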

  10. 10 / 96

  11. 11 / 96

  12. How to represent any sequence of digits? 12 / 96

  13. “911” 13 / 96

  14. Whole-word model limitations The whole-word model suffers from two main problems. Cannot model unseen words. In fact, we need several samples of each word to train the models properly. Cannot share data among models – the data sparseness problem. The number of parameters in the system is proportional to the vocabulary size. Thus, whole-word models are best on small vocabulary tasks with lots of data per word. N.B. as the amount of public speech data continues to increase, this wisdom may be thrown into question. 14 / 96

  15. Where Are We? 1 How to Model Pronunciation Using HMM Topologies: Whole Word Models, Phonetic Models, Context-Dependence. 15 / 96

  16. Subword Units To reduce the number of parameters, we can compose word models from sub-word units. These units can be shared among words. Examples (unit: approximate count): phones: 50; diphones: 2,000; syllables: 5,000. Each unit is small in terms of amount of speech modeled. The number of parameters is proportional to the number of units (not the number of words in the vocabulary, as in whole-word models). 16 / 96
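For a rough sense of scale (illustrative numbers, not from the slides): with about 50 phones and, say, 3 states per phone, a phone-based system has on the order of 50 × 3 = 150 output distributions to train, regardless of vocabulary size, whereas a 20,000-word whole-word system with 6 states per word would need around 20,000 × 6 = 120,000.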

  17. Phonetic Models We represent each word as a sequence of phonemes. This representation is the “baseform” for the word. BANDS -> B AE N D Z Some words need more than one baseform: THE -> DH UH and THE -> DH IY 17 / 96

  18. Baseform Dictionary To determine the pronunciation of each word, we look it up in a dictionary. Each word may have several possible pronunciations. Every word in our training script and test vocabulary must be in the dictionary. The dictionary is generally written by hand. Prone to errors and inconsistencies. 18 / 96
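In code, a baseform dictionary is just a map from each word to one or more phoneme sequences. A minimal sketch, with illustrative entries only (a real dictionary is far larger and hand-checked):

```python
# Sketch of a baseform dictionary: word -> list of pronunciations (phoneme sequences).
# Entries are illustrative only.
lexicon = {
    "BANDS": [["B", "AE", "N", "D", "Z"]],
    "THE":   [["DH", "UH"], ["DH", "IY"]],   # two baseforms for "THE"
}

def baseforms(word):
    """Look up all pronunciations; out-of-vocabulary words are an error."""
    try:
        return lexicon[word.upper()]
    except KeyError:
        raise KeyError(f"{word!r} is not in the dictionary") from None

print(baseforms("the"))   # [['DH', 'UH'], ['DH', 'IY']]
```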

  19. Phonetic Models, cont’d We can allow for a wide variety of phonological variation by representing baseforms as graphs. 19 / 96

  20. Phonetic Models, cont’d Now, construct a Markov model for each phone. Examples: 20 / 96

  21. Embedding Replace each phone by its Markov model to get a model for the entire word. 21 / 96
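A sketch of the embedding step, assuming a toy representation in which a phone model is just a list of state labels (the real models also carry transition and output distributions): concatenating the phone models in baseform order yields the word model.

```python
# Sketch: embed phone HMMs into a word HMM by concatenating them in baseform order.
# Toy representation: a phone model is just its list of state labels.

def phone_hmm(phone, n_states=3):
    return [f"{phone}.{i}" for i in range(n_states)]

def word_hmm(baseform):
    """Concatenate the phone models of a baseform to get the word model's states."""
    states = []
    for phone in baseform:
        states.extend(phone_hmm(phone))
    return states

print(word_hmm(["B", "AE", "N", "D", "Z"]))   # 15 state labels: B.0, B.1, ..., Z.2
```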

  22. Reducing Parameters by Tying Consider the three-state model. Note that t1 and t2 correspond to the beginning of the phone, t3 and t4 correspond to the middle of the phone, and t5 and t6 correspond to the end of the phone. If we force the output distributions for each member of those pairs to be the same, then the training data requirements are reduced. 22 / 96

  23. Tying A set of arcs in a Markov model are tied to one another if they are constrained to have identical output distributions. Similarly, states are tied if they have identical transition probabilities. Tying can be explicit or implicit. 23 / 96

  24. Implicit Tying Occurs when we build up models for larger units from models of smaller units. Example: when word models are made from phone models. First, consider an example without any tying. Let the vocabulary consist of digits 0,1,2,... 9. We can make a separate model for each word. To estimate parameters for each word model, we need several samples for each word. Samples of “0” affect only parameters for the “0” model. 24 / 96

  25. Implicit Tying, cont’d Now consider phone-based models for this vocabulary. Training samples of “0” will also affect models for “3” and “4”, since those words share phones with “zero” (e.g. R). Useful in large vocabulary systems where the number of words is much greater than the number of phones. 25 / 96

  26. Explicit Tying Example: 6 non-null arcs, but only 3 different output distributions because of tying. Number of model parameters is reduced. Tying saves storage because only one copy of each distribution is saved. Fewer parameters mean less training data needed. 26 / 96
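One way to implement explicit tying is to keep a single shared distribution object that several arcs point to, so statistics collected through any tied arc update the same distribution. A minimal sketch of the slide’s example (6 arcs, 3 distributions), with stand-in objects instead of real output distributions:

```python
# Sketch: explicit tying via shared output-distribution objects.
# Stand-in dicts take the place of real output distributions.
begin_dist  = {"name": "begin"}
middle_dist = {"name": "middle"}
end_dist    = {"name": "end"}

# Six arcs, but only three distinct distributions: t1/t2, t3/t4, and t5/t6 are tied.
arcs = {
    "t1": begin_dist,  "t2": begin_dist,
    "t3": middle_dist, "t4": middle_dist,
    "t5": end_dist,    "t6": end_dist,
}

# Because t1 and t2 reference the same object, counts accumulated through either arc
# land in the same distribution, so less training data is needed per distribution.
print(arcs["t1"] is arcs["t2"])   # True
```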

  27. Where Are We? 1 How to Model Pronunciation Using HMM Topologies: Whole Word Models, Phonetic Models, Context-Dependence. 27 / 96

  28. Variations in realizations of phonemes The broad units, phonemes, have variants known as allophones. Example: p and pʰ (unaspirated and aspirated p). Exercise: put your hand in front of your mouth and pronounce spin and then pin. Note that the p in pin has a puff of air, while the p in spin does not. 28 / 96

  29. Variations in realizations of phonemes Articulators have inertia, thus the pronunciation of a phoneme is influenced by surrounding phonemes. This is known as co-articulation. Example: consider k in different contexts. In keep, the whole body of the tongue has to be pulled up to make the vowel; the closure of the k moves forward compared to coop. 29 / 96

  30. keep 30 / 96

  31. coop 31 / 96

  32. Phoneme Targets Phonemes have idealized articulator target positions that may or may not be reached in a particular utterance, depending on factors such as speaking rate and clarity of articulation. How do we model all this variation? 32 / 96

  33. Triphone models Model each phoneme in the context of its left and right neighbor. E.g., K IY P is a model for IY when K is its left context phoneme and P is its right context phoneme. "keep" → K IY P → [wb K IY] [K IY P] [IY P wb] (wb = word boundary). If we have 50 phonemes in a language, we could have as many as 50³ = 125,000 triphones to model. Not all of these occur, or they only occur a few times. Why is this bad? Suggestion: combine similar triphones together. For example, map K IY P and K IY F to a common model. 33 / 96
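A sketch of the triphone expansion, assuming a simple convention that pads the baseform with a word-boundary symbol wb on both sides (the exact naming of contexts varies between systems):

```python
# Sketch: expand a phone sequence into (left, center, right) triphones,
# using "wb" as the word-boundary context on both ends.

def to_triphones(phones, boundary="wb"):
    padded = [boundary] + list(phones) + [boundary]
    return [(padded[i - 1], padded[i], padded[i + 1])
            for i in range(1, len(padded) - 1)]

print(to_triphones(["K", "IY", "P"]))
# [('wb', 'K', 'IY'), ('K', 'IY', 'P'), ('IY', 'P', 'wb')]
```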

  34. "Bottom-up" (Agglomerative) Clustering Start with each item in a cluster by itself. Find “closest” pair of items. Merge them into a single cluster. Iterate. 34 / 96

  35. Triphone Clustering Helps with the data sparsity issue, BUT we still have an issue with unseen data. To model unseen events, we can “back off” to lower-order models such as bi-phones and uni-phones. But this is still sort of ugly. So instead, we use Decision Trees to deal with the sparse/unknown data problem. 35 / 96

  36. Where Are We? 1 How to Model Pronunciation Using HMM Topologies. 2 Modeling Context Dependence via Decision Trees. 36 / 96

  37. Where Are We? 2 Modeling Context Dependence via Decision Trees: Decision Tree Overview, Letter-to-Sound Example, Basics of Tree Construction, Criterion Function, Details of Context-Dependent Modeling. 37 / 96

  38. Decision Trees 38 / 96

  39. OK. What’s a decision tree? 39 / 96

  40. Types of Features Nominal or categorical: Finite set without any natural ordering (e.g., occupation, marital status, race). Ordinal: Ordered, but absolute differences between values are unknown (e.g., preference scale, severity of an injury). Numerical: Domain is numerically ordered (e.g., age, income). 40 / 96

  41. Types of Outputs Categorical: output is one of N classes. Examples: diagnosis (predict disease from symptoms); language modeling (predict the next word from previous words in the sentence); spelling-to-sound rules (predict a phone from spelling). Continuous: output is a continuous vector. Example: allophonic variation (predict spectral characteristics from phone context). 41 / 96
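As a toy illustration of a categorical-output tree (a hand-written spelling-to-sound rule, not a tree from the lecture; real letter-to-sound trees are grown automatically from data, as discussed next):

```python
# Toy decision tree with a categorical output: predict the phone for the letter "c"
# from its right context. Questions and leaves are hand-written for illustration only.

def predict_phone(letter, next_letter):
    if letter == "c":
        if next_letter in {"e", "i", "y"}:   # question on the letter's right context
            return "S"                       # soft c, as in "cell"
        return "K"                           # hard c, as in "cat"
    return "?"                               # letters this toy tree does not model

print(predict_phone("c", "a"))   # K
print(predict_phone("c", "i"))   # S
```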

  42. Where Are We? 2 Modeling Context Dependence via Decision Trees: Decision Tree Overview, Letter-to-Sound Example, Basics of Tree Construction, Criterion Function, Details of Context-Dependent Modeling. 42 / 96
