
Lecture 5: The Big Picture/Language Modeling
Michael Picheny, Bhuvana Ramabhadran, Stanley F. Chen, Markus Nussbaum-Thom
Watson Group, IBM T.J. Watson Research Center, Yorktown Heights, New York, USA


  1. Lecture 5: The Big Picture/Language Modeling. Michael Picheny, Bhuvana Ramabhadran, Stanley F. Chen, Markus Nussbaum-Thom. Watson Group, IBM T.J. Watson Research Center, Yorktown Heights, New York, USA. {picheny,bhuvana,stanchen,nussbaum}@us.ibm.com. 17 February 2016

  2. Administrivia Slides posted before lecture may not match lecture. Lab 1: Not graded yet; will be graded by next lecture? Awards ceremony for evaluation next week. Grading: what’s up with the optional exercises? Lab 2: Due nine days from now (Friday, Feb. 26) at 6pm. Start early! Avail yourself of Piazza.

  3. Feedback Clear (4); mostly clear (2); unclear (3). Pace: fast (3); OK (2). Muddiest: HMM’s in general (1); Viterbi (1); FB (1). Comments (2+ votes): want better/clearer examples (5); spend more time walking through examples (3); spend more time on high-level intuition before getting into details (3); good examples (2).

  4. Celebrity Sighting New York Times

  5. Part I The HMM/GMM Framework

  6. Where Are We? 1 Review from 10,000 Feet; 2 The Model; 3 Training; 4 Decoding; 5 Technical Details

  7. The Raw Data [waveform plot: amplitude vs. sample index] What do we do with waveforms?

  8. Front End Processing Convert waveform to features.

  9. What Have We Gained? Time domain ⇒ frequency domain. Removed vocal-fold excitation. Made features independent.

  10. ASR 1.0: Dynamic Time Warping

  11. Computing the Distance Between Utterances Find “best” alignment between frames. Sum distances between aligned frames. Sum penalties for “weird” alignments.
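The alignment-and-sum computation above can be written as a short dynamic program. Below is a minimal Python sketch, not the lecture's exact recipe: Euclidean distance between aligned frames, plus a fixed `skew_penalty` (an assumed placeholder) standing in for the penalties on “weird” alignments.

```python
import numpy as np

def dtw_distance(x, y, skew_penalty=1.0):
    """Dynamic time warping distance between two feature sequences.

    x: (Tx, D) array of frames; y: (Ty, D) array of frames.
    A diagonal step pays the frame-to-frame distance; a horizontal or
    vertical step (stretching one utterance against the other) pays the
    frame distance plus skew_penalty, the "weird alignment" penalty.
    """
    Tx, Ty = len(x), len(y)
    cost = np.full((Tx + 1, Ty + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, Tx + 1):
        for j in range(1, Ty + 1):
            d = np.linalg.norm(x[i - 1] - y[j - 1])          # distance of aligned frames
            cost[i, j] = min(cost[i - 1, j - 1] + d,                # advance in both
                             cost[i - 1, j] + d + skew_penalty,     # stretch y
                             cost[i, j - 1] + d + skew_penalty)     # stretch x
    return cost[Tx, Ty]
```

For example, `dtw_distance(np.random.randn(50, 13), np.random.randn(60, 13))` compares a 50-frame and a 60-frame utterance of 13-dimensional features; a larger `skew_penalty` pushes the alignment toward the diagonal.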

  12. ASR 2.0: The HMM/GMM Framework

  13. Notation

  14. How Do We Do Recognition? x_test = test features; P_ω(x) = word model. (answer) = ??? (answer) = argmax_{ω ∈ vocab} P_ω(x_test). Return the word whose model . . . assigns the highest prob to the utterance.
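In code, this decision rule is just a maximization over per-word scores. A minimal sketch; `score_fn` is a placeholder for whatever computes log P_ω(x), e.g. the Forward algorithm over that word's HMM.

```python
def recognize(x_test, word_models, score_fn):
    """Return the word whose model assigns the highest score to x_test.

    word_models: dict mapping word -> that word's model parameters.
    score_fn(x, model): returns log P_model(x), e.g. computed with the
    Forward algorithm over that word's HMM.
    """
    return max(word_models, key=lambda w: score_fn(x_test, word_models[w]))
```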

  15. Putting it All Together P_ω(x) = ??? How do we actually train? How do we actually decode? (Image: It’s a puzzlement by jubgo. Some rights reserved.)

  16. Where Are We? 1 Review from 10,000 Feet; 2 The Model; 3 Training; 4 Decoding; 5 Technical Details

  17. So What’s the Model? P_ω(x) = ??? Frequency that word ω generates features x. Has something to do with HMM’s and GMM’s. (Image: Untitled by Daniel Oines. Some rights reserved.)

  18. A Word Is A Sequence of Sounds e.g., the word ONE: W → AH → N. Phoneme inventory: AA AE AH AO AW AX AXR AY B BD CH D DD DH DX EH ER EY F G GD HH IH IX IY JH K KD L M N NG OW OY P PD R S SH T TD TH TS UH UW V W X Y Z ZH. What sounds make up TWO? What do we use to model sequences?
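The phoneme sequence for each word comes from a pronunciation lexicon. A toy sketch of that lookup is below; the entries, including TWO rendered as T → UW, are illustrative and not taken from the course's own dictionary.

```python
# Toy pronunciation lexicon: word -> phoneme sequence (illustrative entries;
# a real system reads these from a pronunciation dictionary file).
LEXICON = {
    "ONE": ["W", "AH", "N"],
    "TWO": ["T", "UW"],   # a typical answer to the slide's question
}

def word_to_phonemes(word):
    """Look up the phoneme sequence used to build the word's HMM."""
    return LEXICON[word]
```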

  19. HMM, v1.0 Outputs on arcs, not states. What’s the problem? What are the outputs?

  20. HMM, v2.0 What’s the problem? How many frames per phoneme?

  21. HMM, v3.0 Are we done?

  22. Concept: Alignment ⇔ Path Path through HMM ⇒ sequence of arcs, one per frame. Notation: A = a_1 ⋯ a_T; a_t = which arc generated frame t.

  23. The Game Plan Express P_ω(x), the total prob of x . . . in terms of P_ω(x, A), the prob of a single path. How? P(x) = Σ_{paths A} (path prob) = Σ_{paths A} P(x, A). Sum over all paths.

  24. How To Compute the Likelihood of a Path? Path: A = a_1 ⋯ a_T. P(x, A) = Π_{t=1}^{T} (arc prob) × (output prob) = Π_{t=1}^{T} p_{a_t} × P(x_t | a_t). Multiply arc and output probs along the path.
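In practice this product is computed in the log domain to avoid underflow. A minimal sketch, assuming `arc_logprob` and `output_logprob` are supplied by the model (illustrative names):

```python
def path_log_likelihood(frames, path, arc_logprob, output_logprob):
    """log P(x, A) = sum_t [ log p_{a_t} + log P(x_t | a_t) ].

    frames: length-T sequence of feature vectors x_1 ... x_T.
    path:   length-T sequence of arcs a_1 ... a_T (one arc per frame).
    arc_logprob[a]:        log transition probability of arc a.
    output_logprob(x, a):  log output density of frame x on arc a.
    """
    return sum(arc_logprob[a] + output_logprob(x, a)
               for x, a in zip(frames, path))
```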

  25. What Do Output Probabilities Look Like? Mixture of diagonal-covariance Gaussians. P(x | a) = Σ_{comp j} (mixture wgt) Π_{dim d} (Gaussian for dim d) = Σ_{comp j} p_{a,j} Π_{dim d} N(x_d; µ_{a,j,d}, σ²_{a,j,d}).
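A hedged sketch of evaluating this output probability for one arc, with diagonal covariances and a stable log-sum-exp over components; the argument names are illustrative, not the course's API.

```python
import numpy as np

def diag_gmm_logpdf(x, weights, means, variances):
    """log P(x | a) for a mixture of diagonal-covariance Gaussians.

    x:         (D,) feature vector.
    weights:   (J,) mixture weights p_{a,j} (summing to 1).
    means:     (J, D) component means mu_{a,j,d}.
    variances: (J, D) component variances sigma^2_{a,j,d}.
    """
    x = np.asarray(x, dtype=float)
    # Per-component log-likelihood: sum over dims of log N(x_d; mu, sigma^2).
    comp_ll = -0.5 * np.sum(np.log(2.0 * np.pi * variances)
                            + (x - means) ** 2 / variances, axis=1)
    # Stable log of sum_j weights_j * exp(comp_ll_j).
    scores = np.log(weights) + comp_ll
    m = np.max(scores)
    return m + np.log(np.sum(np.exp(scores - m)))
```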

  26. The Full Model P(x) = Σ_{paths A} P(x, A) = Σ_{paths A} Π_{t=1}^{T} p_{a_t} × P(x_t | a_t) = Σ_{paths A} Π_{t=1}^{T} p_{a_t} Σ_{comp j} p_{a_t,j} Π_{dim d} N(x_{t,d}; µ_{a_t,j,d}, σ²_{a_t,j,d}). p_a = transition probability for arc a. p_{a,j} = mixture weight, j-th component of GMM on arc a. µ_{a,j,d} = mean, d-th dim, j-th component, GMM on arc a. σ²_{a,j,d} = variance, d-th dim, j-th component, GMM on arc a.

  27. Pop Quiz What was the equation on the last slide?

  28. Where Are We? 1 Review from 10,000 Feet; 2 The Model; 3 Training; 4 Decoding; 5 Technical Details

  29. Training How to create the model P_ω(x) from examples x_{ω,1}, x_{ω,2}, . . . ?

  30. What is the Goal of Training? To estimate parameters . . . to maximize likelihood of training data. (Image: Crossfit 0303 by Runar Eilertsen. Some rights reserved.)

  31. What Are the Model Parameters? p_a = transition probability for arc a. p_{a,j} = mixture weight, j-th component of GMM on arc a. µ_{a,j,d} = mean, d-th dim, j-th component, GMM on arc a. σ²_{a,j,d} = variance, d-th dim, j-th component, GMM on arc a.

  32. Warm-Up: Non-Hidden ML Estimation e.g., Gaussian estimation, non-hidden Markov Models. How to do this? (Hint: ??? and ???.)
      parameter     description    statistic
      p_a           arc prob       # times arc taken
      p_{a,j}       mixture wgt    # times component used
      µ_{a,j,d}     mean           x_d
      σ²_{a,j,d}    variance       x_d²
      Count and normalize, i.e., collect a statistic; divide by a normalizer count.
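For the non-hidden case, count-and-normalize really is the whole algorithm. A small sketch for the arc probabilities, assuming fully observed training paths and a hypothetical `src()` helper that maps an arc to its source state:

```python
from collections import Counter, defaultdict

def estimate_arc_probs(paths, src):
    """Non-hidden ML estimate of the transition probabilities p_a.

    paths: iterable of fully observed arc sequences (the "non-hidden" case).
    src(a): source state of arc a.
    p_a = count(a) / total count of arcs leaving src(a): count and normalize.
    """
    counts = Counter(a for path in paths for a in path)
    state_totals = defaultdict(float)
    for a, c in counts.items():
        state_totals[src(a)] += c
    return {a: c / state_totals[src(a)] for a, c in counts.items()}
```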

  33. How To Estimate Hidden Models? The EM algorithm ⇒ FB algorithm for HMM’s. Hill-climbing maximum-likelihood estimation. (Image: Uphill Struggle by Ewan Cross. Some rights reserved.)

  34. The EM Algorithm Expectation step: using the current model, compute posterior counts . . . the prob that each event (e.g., taking an arc) occurred at time t. Maximization step: like non-hidden MLE, except . . . use fractional posterior counts instead of whole counts. Repeat.

  35. E step: Calculating Posterior Counts e.g., the posterior count γ(a, t) of taking arc a at time t. γ(a, t) = P(paths with arc a at time t) / P(all paths) = (1 / P(x)) × P(paths from start to src(a)) × P(arc a at time t) × P(paths from dst(a) to end) = (1 / P(x)) × α(src(a), t − 1) × p_a × P(x_t | a) × β(dst(a), t). Do the Forward algorithm: α(S, t), P(x). Do the Backward algorithm: β(S, t). Read off posterior counts.
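Reading off γ(a, t) from the forward and backward tables looks like the sketch below. It assumes log-domain α and β with columns t = 0 … T and the state/arc conventions used on the slide, so treat it as a shape guide rather than the course's reference code.

```python
import math

def arc_posteriors(frames, arcs, arc_logprob, output_logprob,
                   log_alpha, log_beta, log_px):
    """gamma(a, t) = alpha(src(a), t-1) * p_a * P(x_t | a) * beta(dst(a), t) / P(x).

    arcs: list of (src_state, dst_state, arc_id) triples.
    log_alpha[s][t], log_beta[s][t]: forward/backward log scores, with
    columns t = 0 ... T (t = 0 is "before the first frame").
    log_px: log P(x) from the Forward algorithm.
    Returns a dict mapping (arc_id, t) -> gamma(a, t).
    """
    T = len(frames)
    gamma = {}
    for (s, d, a) in arcs:
        for t in range(1, T + 1):
            log_g = (log_alpha[s][t - 1] + arc_logprob[a]
                     + output_logprob(frames[t - 1], a)
                     + log_beta[d][t] - log_px)
            gamma[(a, t)] = math.exp(log_g)
    return gamma
```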

  36. M step: Non-Hidden ML Estimation Count and normalize. Same stats as non-hidden, except the normalizer is fractional. e.g., arc prob p_a: p_a = (count of a) / Σ_{a′: src(a′)=src(a)} (count of a′) = Σ_t γ(a, t) / Σ_{a′: src(a′)=src(a)} Σ_t γ(a′, t). e.g., single Gaussian, mean µ_{a,d} for dim d: µ_{a,d} = (mean weighted by γ(a, t)) = Σ_t γ(a, t) x_{t,d} / Σ_t γ(a, t).
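The Gaussian mean update is then an ordinary weighted average, with γ(a, t) as the weights. A minimal sketch for one arc (argument names are illustrative):

```python
import numpy as np

def reestimate_mean(frames, gammas):
    """mu_{a,d} = sum_t gamma(a,t) * x_{t,d}  /  sum_t gamma(a,t).

    frames: (T, D) array of feature vectors x_t.
    gammas: length-T array of posterior counts gamma(a, t) for one arc a.
    """
    frames = np.asarray(frames, dtype=float)
    gammas = np.asarray(gammas, dtype=float)
    return gammas @ frames / gammas.sum()
```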

  37. Where Are We? 1 Review from 10,000 Feet; 2 The Model; 3 Training; 4 Decoding; 5 Technical Details

  38. What is Decoding? (answer) = argmax_{ω ∈ vocab} P_ω(x_test)

  39. What Algorithm? (answer) = argmax_{ω ∈ vocab} P_ω(x_test). For each word ω, how to compute P_ω(x_test)? Forward or Viterbi algorithm.

  40. What Are We Trying To Compute? P(x) = Σ_{paths A} P(x, A) = Σ_{paths A} Π_{t=1}^{T} p_{a_t} × P(x_t | a_t) = Σ_{paths A} Π_{t=1}^{T} (arc cost)

  41. Dynamic Programming Shortest path problem: (answer) = min_{paths A} Σ_{t=1}^{T_A} (edge length). Forward algorithm: P(x) = Σ_{paths A} Π_{t=1}^{T} (arc cost). Viterbi algorithm: P(x) ≈ max_{paths A} Π_{t=1}^{T} (arc cost). Any semiring will do.
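The “any semiring will do” point can be made concrete: the two recursions below differ only in how incoming scores are combined, log-sum-exp for Forward and max for Viterbi. A hedged sketch over an explicit arc list (one arc consumed per frame), with illustrative argument names:

```python
import numpy as np

def hmm_score(frames, arcs, arc_logprob, output_logprob, n_states,
              start_state=0, final_state=None, viterbi=False):
    """Forward (viterbi=False) or Viterbi (viterbi=True) log score of frames.

    arcs: list of (src_state, dst_state, arc_id); one arc is taken per frame.
    arc_logprob[a]: log transition probability of arc a.
    output_logprob(x, a): log output density of frame x on arc a.
    Only the combine step differs: log-sum-exp (sum over paths) for Forward,
    max (best single path) for Viterbi.
    """
    if final_state is None:
        final_state = n_states - 1
    combine = np.maximum if viterbi else np.logaddexp
    score = np.full(n_states, -np.inf)
    score[start_state] = 0.0
    for x in frames:
        new_score = np.full(n_states, -np.inf)
        for (s, d, a) in arcs:
            cand = score[s] + arc_logprob[a] + output_logprob(x, a)
            new_score[d] = combine(new_score[d], cand)
        score = new_score
    return score[final_state]
```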

  42. Scaling How does decoding time scale with vocab size?

  43. The One Big HMM Paradigm: Before

  44. The One Big HMM Paradigm: After [HMM diagram with one branch per word: one, two, three, four, five, six, seven, eight, nine, zero] How does this help us?

  45. Pruning What is the time complexity of Forward/Viterbi? How many values α(S, t) to fill? Idea: only fill the k best cells at each frame. What is the time complexity now? How does this scale with vocab size?
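The per-frame pruning idea amounts to keeping only the k best-scoring cells before moving on to the next frame. A minimal sketch of that step, assuming the active states at a frame are kept in a dict (names are illustrative):

```python
import heapq

def prune_to_k_best(frame_scores, k):
    """Keep only the k highest-scoring states ("cells") at the current frame.

    frame_scores: dict mapping state -> log score at this frame.
    With k fixed, each frame costs roughly O(k * fan-out) instead of
    O(number of states), so decoding time stops growing with vocabulary
    size, at the cost of occasional search errors.
    """
    if len(frame_scores) <= k:
        return frame_scores
    kept = heapq.nlargest(k, frame_scores.items(), key=lambda item: item[1])
    return dict(kept)
```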

  46. How Does This Change Decoding? Run Forward/Viterbi once, on one big HMM . . . instead of once for every word model. Same algorithm; different graph! [HMM diagram with one branch per word: one, two, three, four, five, six, seven, eight, nine, zero]

  47. Forward or Viterbi? What are we trying to compute? Total prob? Viterbi prob? Best word? [HMM diagram with one branch per word: one, two, three, four, five, six, seven, eight, nine, zero]

  48. Recovering the Word Identity [HMM diagram with one branch per word: one, two, three, four, five, six, seven, eight, nine, zero]

  49. Where Are We? 1 Review from 10,000 Feet; 2 The Model; 3 Training; 4 Decoding; 5 Technical Details

  50. Hyperparameters What is a hyperparameter? A tunable knob or something adjustable . . . that can’t be estimated with “normal” training. Can you name some? Number of states in each word HMM. HMM topology. Number of GMM components.
