
Lecture 5: The Big Picture/Language Modeling
Michael Picheny, Bhuvana Ramabhadran, Stanley F. Chen, Markus Nussbaum-Thom
Watson Group, IBM T.J. Watson Research Center, Yorktown Heights, New York, USA


  1. Lecture 5: The Big Picture/Language Modeling. Michael Picheny, Bhuvana Ramabhadran, Stanley F. Chen, Markus Nussbaum-Thom. Watson Group, IBM T.J. Watson Research Center, Yorktown Heights, New York, USA. {picheny,bhuvana,stanchen,nussbaum}@us.ibm.com. 17 February 2016

  2. Administrivia Slides posted before lecture may not match lecture. Lab 1: Not graded yet; will be graded by next lecture? Awards ceremony for evaluation next week. Grading: what’s up with the optional exercises? Lab 2: Due nine days from now (Friday, Feb. 26) at 6pm. Start early! Avail yourself of Piazza.

  3. Feedback Clear (4); mostly clear (2); unclear (3). Pace: fast (3); OK (2). Muddiest: HMM’s in general (1); Viterbi (1); FB (1). Comments (2+ votes): want better/clearer examples (5); spend more time walking through examples (3); spend more time on high-level intuition before getting into details (3); good examples (2).

  4. Celebrity Sighting New York Times

  5. Part I The HMM/GMM Framework

  6. Where Are We? 1 Review from 10,000 Feet; 2 The Model; 3 Training; 4 Decoding; 5 Technical Details

  7. The Raw Data [waveform plot: amplitude vs. sample index] What do we do with waveforms?

  8. Front End Processing Convert waveform to features.

  9. What Have We Gained? Time domain ⇒ frequency domain. Removed vocal-fold excitation. Made features independent.

  10. ASR 1.0: Dynamic Time Warping

  11. Computing the Distance Between Utterances Find “best” alignment between frames. Sum distances between aligned frames. Sum penalties for “weird” alignments.
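The alignment-and-sum computation above can be written as a short dynamic program. Below is a minimal Python sketch, not the lecture's exact recipe: Euclidean distance between aligned frames, plus a fixed `skew_penalty` (an assumed placeholder) standing in for the penalties on “weird” alignments.

```python
import numpy as np

def dtw_distance(x, y, skew_penalty=1.0):
    """Dynamic time warping distance between two feature sequences.

    x: (Tx, D) array of frames; y: (Ty, D) array of frames.
    A diagonal step pays the frame-to-frame distance; a horizontal or
    vertical step (stretching one utterance against the other) pays the
    frame distance plus skew_penalty, the "weird alignment" penalty.
    """
    Tx, Ty = len(x), len(y)
    cost = np.full((Tx + 1, Ty + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, Tx + 1):
        for j in range(1, Ty + 1):
            d = np.linalg.norm(x[i - 1] - y[j - 1])          # distance of aligned frames
            cost[i, j] = min(cost[i - 1, j - 1] + d,                # advance in both
                             cost[i - 1, j] + d + skew_penalty,     # stretch y
                             cost[i, j - 1] + d + skew_penalty)     # stretch x
    return cost[Tx, Ty]
```

For example, `dtw_distance(np.random.randn(50, 13), np.random.randn(60, 13))` compares a 50-frame and a 60-frame utterance of 13-dimensional features; a larger `skew_penalty` pushes the alignment toward the diagonal.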

  12. ASR 2.0: The HMM/GMM Framework

  13. Notation

  14. How Do We Do Recognition? x_test = test features; P_ω(x) = word model. (answer) = ??? (answer) = argmax_{ω ∈ vocab} P_ω(x_test). Return the word whose model . . . assigns the highest prob to the utterance.
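In code, this decision rule is just a maximization over per-word scores. A minimal sketch; `score_fn` is a placeholder for whatever computes log P_ω(x), e.g. the Forward algorithm over that word's HMM.

```python
def recognize(x_test, word_models, score_fn):
    """Return the word whose model assigns the highest score to x_test.

    word_models: dict mapping word -> that word's model parameters.
    score_fn(x, model): returns log P_model(x), e.g. computed with the
    Forward algorithm over that word's HMM.
    """
    return max(word_models, key=lambda w: score_fn(x_test, word_models[w]))
```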

  15. Putting it All Together P_ω(x) = ??? How do we actually train? How do we actually decode? (Image: It’s a puzzlement by jubgo. Some rights reserved.)

  16. Where Are We? 1 Review from 10,000 Feet; 2 The Model; 3 Training; 4 Decoding; 5 Technical Details

  17. So What’s the Model? P_ω(x) = ??? Frequency that word ω generates features x. Has something to do with HMM’s and GMM’s. (Image: Untitled by Daniel Oines. Some rights reserved.)

  18. A Word Is A Sequence of Sounds e.g., the word ONE: W → AH → N. Phoneme inventory: AA AE AH AO AW AX AXR AY B BD CH D DD DH DX EH ER EY F G GD HH IH IX IY JH K KD L M N NG OW OY P PD R S SH T TD TH TS UH UW V W X Y Z ZH. What sounds make up TWO? What do we use to model sequences?
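The phoneme sequence for each word comes from a pronunciation lexicon. A toy sketch of that lookup is below; the entries, including TWO rendered as T → UW, are illustrative and not taken from the course's own dictionary.

```python
# Toy pronunciation lexicon: word -> phoneme sequence (illustrative entries;
# a real system reads these from a pronunciation dictionary file).
LEXICON = {
    "ONE": ["W", "AH", "N"],
    "TWO": ["T", "UW"],   # a typical answer to the slide's question
}

def word_to_phonemes(word):
    """Look up the phoneme sequence used to build the word's HMM."""
    return LEXICON[word]
```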

  19. HMM, v1.0 Outputs on arcs, not states. What’s the problem? What are the outputs?

  20. HMM, v2.0 What’s the problem? How many frames per phoneme?

  21. HMM, v3.0 Are we done?

  22. Concept: Alignment ⇔ Path Path through HMM ⇒ sequence of arcs, one per frame. Notation: A = a_1 ⋯ a_T; a_t = which arc generated frame t.

  23. The Game Plan Express P_ω(x), the total prob of x . . . in terms of P_ω(x, A), the prob of a single path. How? P(x) = Σ_{paths A} (path prob) = Σ_{paths A} P(x, A). Sum over all paths.

  24. How To Compute the Likelihood of a Path? Path: A = a_1 ⋯ a_T. P(x, A) = Π_{t=1}^{T} (arc prob) × (output prob) = Π_{t=1}^{T} p_{a_t} × P(x_t | a_t). Multiply arc and output probs along the path.
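In practice this product is computed in the log domain to avoid underflow. A minimal sketch, assuming `arc_logprob` and `output_logprob` are supplied by the model (illustrative names):

```python
def path_log_likelihood(frames, path, arc_logprob, output_logprob):
    """log P(x, A) = sum_t [ log p_{a_t} + log P(x_t | a_t) ].

    frames: length-T sequence of feature vectors x_1 ... x_T.
    path:   length-T sequence of arcs a_1 ... a_T (one arc per frame).
    arc_logprob[a]:        log transition probability of arc a.
    output_logprob(x, a):  log output density of frame x on arc a.
    """
    return sum(arc_logprob[a] + output_logprob(x, a)
               for x, a in zip(frames, path))
```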

  25. What Do Output Probabilities Look Like? Mixture of diagonal-covariance Gaussians. P(x | a) = Σ_{comp j} (mixture wgt) Π_{dim d} (Gaussian for dim d) = Σ_{comp j} p_{a,j} Π_{dim d} N(x_d; µ_{a,j,d}, σ²_{a,j,d}).
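A hedged sketch of evaluating this output probability for one arc, with diagonal covariances and a stable log-sum-exp over components; the argument names are illustrative, not the course's API.

```python
import numpy as np

def diag_gmm_logpdf(x, weights, means, variances):
    """log P(x | a) for a mixture of diagonal-covariance Gaussians.

    x:         (D,) feature vector.
    weights:   (J,) mixture weights p_{a,j} (summing to 1).
    means:     (J, D) component means mu_{a,j,d}.
    variances: (J, D) component variances sigma^2_{a,j,d}.
    """
    x = np.asarray(x, dtype=float)
    # Per-component log-likelihood: sum over dims of log N(x_d; mu, sigma^2).
    comp_ll = -0.5 * np.sum(np.log(2.0 * np.pi * variances)
                            + (x - means) ** 2 / variances, axis=1)
    # Stable log of sum_j weights_j * exp(comp_ll_j).
    scores = np.log(weights) + comp_ll
    m = np.max(scores)
    return m + np.log(np.sum(np.exp(scores - m)))
```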

  26. The Full Model P(x) = Σ_{paths A} P(x, A) = Σ_{paths A} Π_{t=1}^{T} p_{a_t} × P(x_t | a_t) = Σ_{paths A} Π_{t=1}^{T} p_{a_t} Σ_{comp j} p_{a_t,j} Π_{dim d} N(x_{t,d}; µ_{a_t,j,d}, σ²_{a_t,j,d}). p_a = transition probability for arc a. p_{a,j} = mixture weight, j-th component of GMM on arc a. µ_{a,j,d} = mean, d-th dim, j-th component, GMM on arc a. σ²_{a,j,d} = variance, d-th dim, j-th component, GMM on arc a.

  27. Pop Quiz What was the equation on the last slide?

  28. Where Are We? 1 Review from 10,000 Feet; 2 The Model; 3 Training; 4 Decoding; 5 Technical Details

  29. Training How to create the model P_ω(x) from examples x_{ω,1}, x_{ω,2}, . . . ?

  30. What is the Goal of Training? To estimate parameters . . . to maximize likelihood of training data. (Image: Crossfit 0303 by Runar Eilertsen. Some rights reserved.)

  31. What Are the Model Parameters? p_a = transition probability for arc a. p_{a,j} = mixture weight, j-th component of GMM on arc a. µ_{a,j,d} = mean, d-th dim, j-th component, GMM on arc a. σ²_{a,j,d} = variance, d-th dim, j-th component, GMM on arc a.

  32. Warm-Up: Non-Hidden ML Estimation e.g., Gaussian estimation, non-hidden Markov Models. How to do this? (Hint: ??? and ???.)
      parameter     description    statistic
      p_a           arc prob       # times arc taken
      p_{a,j}       mixture wgt    # times component used
      µ_{a,j,d}     mean           x_d
      σ²_{a,j,d}    variance       x_d²
      Count and normalize, i.e., collect a statistic; divide by a normalizer count.
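For the non-hidden case, count-and-normalize really is the whole algorithm. A small sketch for the arc probabilities, assuming fully observed training paths and a hypothetical `src()` helper that maps an arc to its source state:

```python
from collections import Counter, defaultdict

def estimate_arc_probs(paths, src):
    """Non-hidden ML estimate of the transition probabilities p_a.

    paths: iterable of fully observed arc sequences (the "non-hidden" case).
    src(a): source state of arc a.
    p_a = count(a) / total count of arcs leaving src(a): count and normalize.
    """
    counts = Counter(a for path in paths for a in path)
    state_totals = defaultdict(float)
    for a, c in counts.items():
        state_totals[src(a)] += c
    return {a: c / state_totals[src(a)] for a, c in counts.items()}
```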

  33. How To Estimate Hidden Models? The EM algorithm ⇒ FB algorithm for HMM’s. Hill-climbing maximum-likelihood estimation. (Image: Uphill Struggle by Ewan Cross. Some rights reserved.)

  34. The EM Algorithm Expectation step: using the current model, compute posterior counts . . . the prob that each event (e.g., taking an arc) occurred at time t. Maximization step: like non-hidden MLE, except . . . use fractional posterior counts instead of whole counts. Repeat.

  35. E step: Calculating Posterior Counts e.g., the posterior count γ(a, t) of taking arc a at time t. γ(a, t) = P(paths with arc a at time t) / P(all paths) = (1 / P(x)) × P(paths from start to src(a)) × P(arc a at time t) × P(paths from dst(a) to end) = (1 / P(x)) × α(src(a), t − 1) × p_a × P(x_t | a) × β(dst(a), t). Do the Forward algorithm: α(S, t), P(x). Do the Backward algorithm: β(S, t). Read off posterior counts.
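Reading off γ(a, t) from the forward and backward tables looks like the sketch below. It assumes log-domain α and β with columns t = 0 … T and the state/arc conventions used on the slide, so treat it as a shape guide rather than the course's reference code.

```python
import math

def arc_posteriors(frames, arcs, arc_logprob, output_logprob,
                   log_alpha, log_beta, log_px):
    """gamma(a, t) = alpha(src(a), t-1) * p_a * P(x_t | a) * beta(dst(a), t) / P(x).

    arcs: list of (src_state, dst_state, arc_id) triples.
    log_alpha[s][t], log_beta[s][t]: forward/backward log scores, with
    columns t = 0 ... T (t = 0 is "before the first frame").
    log_px: log P(x) from the Forward algorithm.
    Returns a dict mapping (arc_id, t) -> gamma(a, t).
    """
    T = len(frames)
    gamma = {}
    for (s, d, a) in arcs:
        for t in range(1, T + 1):
            log_g = (log_alpha[s][t - 1] + arc_logprob[a]
                     + output_logprob(frames[t - 1], a)
                     + log_beta[d][t] - log_px)
            gamma[(a, t)] = math.exp(log_g)
    return gamma
```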

  36. M step: Non-Hidden ML Estimation Count and normalize. Same stats as non-hidden, except the normalizer is fractional. e.g., arc prob p_a: p_a = (count of a) / Σ_{a′: src(a′)=src(a)} (count of a′) = Σ_t γ(a, t) / Σ_{a′: src(a′)=src(a)} Σ_t γ(a′, t). e.g., single Gaussian, mean µ_{a,d} for dim d: µ_{a,d} = (mean weighted by γ(a, t)) = Σ_t γ(a, t) x_{t,d} / Σ_t γ(a, t).
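The Gaussian mean update is then an ordinary weighted average, with γ(a, t) as the weights. A minimal sketch for one arc (argument names are illustrative):

```python
import numpy as np

def reestimate_mean(frames, gammas):
    """mu_{a,d} = sum_t gamma(a,t) * x_{t,d}  /  sum_t gamma(a,t).

    frames: (T, D) array of feature vectors x_t.
    gammas: length-T array of posterior counts gamma(a, t) for one arc a.
    """
    frames = np.asarray(frames, dtype=float)
    gammas = np.asarray(gammas, dtype=float)
    return gammas @ frames / gammas.sum()
```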

  37. Where Are We? 1 Review from 10,000 Feet; 2 The Model; 3 Training; 4 Decoding; 5 Technical Details

  38. What is Decoding? (answer) = argmax_{ω ∈ vocab} P_ω(x_test)

  39. What Algorithm? (answer) = argmax_{ω ∈ vocab} P_ω(x_test). For each word ω, how to compute P_ω(x_test)? Forward or Viterbi algorithm.

  40. What Are We Trying To Compute? P(x) = Σ_{paths A} P(x, A) = Σ_{paths A} Π_{t=1}^{T} p_{a_t} × P(x_t | a_t) = Σ_{paths A} Π_{t=1}^{T} (arc cost)

  41. Dynamic Programming Shortest path problem: (answer) = min_{paths A} Σ_{t=1}^{T_A} (edge length). Forward algorithm: P(x) = Σ_{paths A} Π_{t=1}^{T} (arc cost). Viterbi algorithm: P(x) ≈ max_{paths A} Π_{t=1}^{T} (arc cost). Any semiring will do.
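The “any semiring will do” point can be made concrete: the two recursions below differ only in how incoming scores are combined, log-sum-exp for Forward and max for Viterbi. A hedged sketch over an explicit arc list (one arc consumed per frame), with illustrative argument names:

```python
import numpy as np

def hmm_score(frames, arcs, arc_logprob, output_logprob, n_states,
              start_state=0, final_state=None, viterbi=False):
    """Forward (viterbi=False) or Viterbi (viterbi=True) log score of frames.

    arcs: list of (src_state, dst_state, arc_id); one arc is taken per frame.
    arc_logprob[a]: log transition probability of arc a.
    output_logprob(x, a): log output density of frame x on arc a.
    Only the combine step differs: log-sum-exp (sum over paths) for Forward,
    max (best single path) for Viterbi.
    """
    if final_state is None:
        final_state = n_states - 1
    combine = np.maximum if viterbi else np.logaddexp
    score = np.full(n_states, -np.inf)
    score[start_state] = 0.0
    for x in frames:
        new_score = np.full(n_states, -np.inf)
        for (s, d, a) in arcs:
            cand = score[s] + arc_logprob[a] + output_logprob(x, a)
            new_score[d] = combine(new_score[d], cand)
        score = new_score
    return score[final_state]
```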

  42. Scaling How does decoding time scale with vocab size?

  43. The One Big HMM Paradigm: Before

  44. The One Big HMM Paradigm: After [HMM diagram with one branch per word: one, two, three, four, five, six, seven, eight, nine, zero] How does this help us?

  45. Pruning What is the time complexity of Forward/Viterbi? How many values α(S, t) to fill? Idea: only fill the k best cells at each frame. What is the time complexity now? How does this scale with vocab size?
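The per-frame pruning idea amounts to keeping only the k best-scoring cells before moving on to the next frame. A minimal sketch of that step, assuming the active states at a frame are kept in a dict (names are illustrative):

```python
import heapq

def prune_to_k_best(frame_scores, k):
    """Keep only the k highest-scoring states ("cells") at the current frame.

    frame_scores: dict mapping state -> log score at this frame.
    With k fixed, each frame costs roughly O(k * fan-out) instead of
    O(number of states), so decoding time stops growing with vocabulary
    size, at the cost of occasional search errors.
    """
    if len(frame_scores) <= k:
        return frame_scores
    kept = heapq.nlargest(k, frame_scores.items(), key=lambda item: item[1])
    return dict(kept)
```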

  46. How Does This Change Decoding? Run Forward/Viterbi once, on one big HMM . . . instead of once for every word model. Same algorithm; different graph! [HMM diagram with one branch per word: one, two, three, four, five, six, seven, eight, nine, zero]

  47. Forward or Viterbi? What are we trying to compute? Total prob? Viterbi prob? Best word? [HMM diagram with one branch per word: one, two, three, four, five, six, seven, eight, nine, zero]

  48. Recovering the Word Identity [HMM diagram with one branch per word: one, two, three, four, five, six, seven, eight, nine, zero]

  49. Where Are We? 1 Review from 10,000 Feet; 2 The Model; 3 Training; 4 Decoding; 5 Technical Details

  50. Hyperparameters What is a hyperparameter? A tunable knob or something adjustable . . . that can’t be estimated with “normal” training. Can you name some? Number of states in each word HMM. HMM topology. Number of GMM components.
