Hidden Markov Models


  1. 10-601 Introduction to Machine Learning, Machine Learning Department, School of Computer Science, Carnegie Mellon University. Hidden Markov Models. Matt Gormley, Lecture 19, Nov. 5, 2018. 1

  2. Reminders • Homework 6: PAC Learning / Generative Models – Out: Wed, Oct 31 – Due: Wed, Nov 7 at 11:59pm (1 week) • Homework 7: HMMs – Out: Wed, Nov 7 – Due: Mon, Nov 19 at 11:59pm • Grades are up on Canvas 2

  3. Q&A Q: Why would we use Naïve Bayes? Isn’t it too Naïve? A: Naïve Bayes has one key advantage over methods like Perceptron, Logistic Regression, Neural Nets: Training is lightning fast! While other methods require slow iterative training procedures that might require hundreds of epochs, Naïve Bayes computes its parameters in closed form by counting. 3
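To make the "counting" point concrete, here is a minimal sketch (not from the lecture) of closed-form parameter estimation for a Bernoulli Naïve Bayes model; the function name, the binary-feature assumption, and the pseudocount alpha are illustrative choices:

    import numpy as np

    def train_bernoulli_nb(X, y, alpha=1.0):
        # X: (n_samples, n_features) binary matrix; y: (n_samples,) labels in {0, 1}.
        # One pass of counting gives the parameters; alpha is a smoothing pseudocount.
        priors = np.array([np.mean(y == c) for c in (0, 1)])
        likelihoods = np.array([
            (X[y == c].sum(axis=0) + alpha) / ((y == c).sum() + 2 * alpha)
            for c in (0, 1)
        ])  # likelihoods[c, k] estimates P(x_k = 1 | y = c)
        return priors, likelihoods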

  4. DISCRIMINATIVE AND GENERATIVE CLASSIFIERS 4

  5. Generative vs. Discriminative • Generative Classifiers: – Example: Naïve Bayes – Define a joint model of the observations x and the labels y: p(x, y) – Learning maximizes (joint) likelihood – Use Bayes' Rule to classify based on the posterior: p(y | x) = p(x | y) p(y) / p(x) • Discriminative Classifiers: – Example: Logistic Regression – Directly model the conditional: p(y | x) – Learning maximizes conditional likelihood 5
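Continuing the illustrative Bernoulli setup above, a generative classifier can score each class with the joint p(x, y) = p(x | y) p(y) and take the argmax; dividing by p(x) would not change the decision. This is a sketch, not the lecture's code:

    import numpy as np

    def generative_classify(x, priors, likelihoods):
        # Score each class c by log p(x, y=c) = log p(y=c) + log p(x | y=c),
        # using a product of Bernoulli likelihoods over the features of x.
        log_joint = np.log(priors) + (
            x @ np.log(likelihoods.T) + (1 - x) @ np.log(1 - likelihoods.T)
        )
        return np.argmax(log_joint)  # same argmax as the posterior p(y | x)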

  6. Generative vs. Discriminative Whiteboard – Contrast: To model p(x) or not to model p(x)? 6

  7. Generative vs. Discriminative Finite Sample Analysis (Ng & Jordan, 2002) [Assume that we are learning from a finite training dataset] If model assumptions are correct: Naïve Bayes is a more efficient learner (requires fewer samples) than Logistic Regression. If model assumptions are incorrect: Logistic Regression has lower asymptotic error, and does better than Naïve Bayes. 7

  8. [Figure: learning curves comparing the two classifiers; solid: Naïve Bayes, dashed: Logistic Regression] 8 Slide courtesy of William Cohen

  9. [Figure: learning curves; solid: Naïve Bayes, dashed: Logistic Regression] Naïve Bayes makes stronger assumptions about the data but needs fewer examples to estimate the parameters. "On Discriminative vs Generative Classifiers: …" Andrew Ng and Michael Jordan, NIPS 2001. 9 Slide courtesy of William Cohen

  10. Generative vs. Discriminative Learning (Parameter Estimation) Naïve Bayes: Parameters are decoupled → Closed form solution for MLE. Logistic Regression: Parameters are coupled → No closed form solution – must use iterative optimization techniques instead. 10
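For contrast, a minimal sketch of iterative training for binary logistic regression by gradient ascent on the average conditional log-likelihood (the learning rate and epoch count are arbitrary illustrative values):

    import numpy as np

    def train_logistic_regression(X, y, lr=0.1, epochs=200):
        # y in {0, 1}; no closed form, so we iterate over the whole dataset each epoch.
        n, d = X.shape
        w = np.zeros(d)
        for _ in range(epochs):
            p = 1.0 / (1.0 + np.exp(-X @ w))   # predicted P(y = 1 | x)
            w += lr * X.T @ (y - p) / n        # gradient of the average log-likelihood
        return w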

  11. Naïve Bayes vs. Logistic Reg. Learning (MAP Estimation of Parameters) Bernoulli Naïve Bayes: Parameters are probabilities → Beta prior (usually) pushes probabilities away from zero / one extremes. Logistic Regression: Parameters are not probabilities → Gaussian prior encourages parameters to be close to zero (effectively pushes the probabilities away from zero / one extremes). 11
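Written out, the two MAP estimates look roughly like the following (a sketch in LaTeX; α, β and σ² are prior hyperparameters, and the exact form depends on the parameterization used in the course):

    % Bernoulli Naive Bayes with a Beta(alpha, beta) prior on each parameter:
    \hat{\theta}_{k,c} = \frac{\#\{x_k = 1, y = c\} + \alpha - 1}{\#\{y = c\} + \alpha + \beta - 2}

    % Logistic regression with a Gaussian prior N(0, sigma^2 I) on the weights
    % (equivalent to L2 regularization of the conditional log-likelihood):
    \hat{\mathbf{w}} = \arg\max_{\mathbf{w}} \sum_{i=1}^{N} \log p\big(y^{(i)} \mid \mathbf{x}^{(i)}, \mathbf{w}\big) - \frac{1}{2\sigma^2} \lVert \mathbf{w} \rVert^2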

  12. Naïve Bayes vs. Logistic Reg. Features Naïve Bayes: Features x are assumed to be conditionally independent given y . (i.e. Naïve Bayes Assumption) Logistic Regression: No assumptions are made about the form of the features x . They can be dependent and correlated in any fashion. 12
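The two modeling assumptions can be summarized in one line each (a sketch; w and b denote logistic regression's weight vector and bias):

    % Naive Bayes: features are conditionally independent given the label
    p(\mathbf{x}, y) = p(y) \prod_{k=1}^{K} p(x_k \mid y)

    % Logistic regression: model the conditional directly, no independence assumption on x
    p(y = 1 \mid \mathbf{x}) = \frac{1}{1 + \exp\!\big(-(\mathbf{w}^\top \mathbf{x} + b)\big)}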

  13. MOTIVATION: STRUCTURED PREDICTION 13

  14. Structured Prediction • Most of the models we've seen so far were for classification – Given observations: x = (x1, x2, …, xK) – Predict a (binary) label: y • Many real-world problems require structured prediction – Given observations: x = (x1, x2, …, xK) – Predict a structure: y = (y1, y2, …, yJ) • Some classification problems benefit from latent structure 14

  15. Structured Prediction Examples • Examples of structured prediction – Part-of-speech (POS) tagging – Handwriting recognition – Speech recognition – Word alignment – Congressional voting • Examples of latent structure – Object recognition 15

  16. Dataset for Supervised Part-of-Speech (POS) Tagging Data: D = {(x(n), y(n))}, n = 1, …, N. Sample 1: x(1) = "time flies like an arrow", y(1) = n v p d n; Sample 2: x(2) = "time flies like an arrow", y(2) = n n v d n; Sample 3: x(3) = "flies fly with their wings", y(3) = n v p d n; Sample 4: x(4) = "with time you will see", y(4) = p n n v v. 16
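As a data structure, the dataset is just a list of paired sequences of equal length. A small sketch with the samples as transcribed above (the container format is illustrative, not the course's starter code):

    # Each sample pairs a word sequence x with a tag sequence y of the same length.
    D = [
        (["time", "flies", "like", "an", "arrow"],   ["n", "v", "p", "d", "n"]),
        (["time", "flies", "like", "an", "arrow"],   ["n", "n", "v", "d", "n"]),
        (["flies", "fly", "with", "their", "wings"], ["n", "v", "p", "d", "n"]),
        (["with", "time", "you", "will", "see"],     ["p", "n", "n", "v", "v"]),
    ]
    for x, y in D:
        assert len(x) == len(y)  # one tag per token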

  17. Dataset for Supervised Handwriting Recognition Data: D = {(x(n), y(n))}, n = 1, …, N. Sample 1: y(1) = u n e x p e c t e d, x(1) = (images of the handwritten letters); Sample 2: y(2) = v o l c a n i c, x(2) = (images); Sample 3: y(3) = e m b r a c e s, x(3) = (images). 17 Figures from (Chatzis & Demiris, 2013)

  18. Dataset for Supervised Phoneme (Speech) Recognition Data: D = {(x(n), y(n))}, n = 1, …, N. Sample 1: y(1) = h# dh ih s iy w uh z z iy, x(1) = (speech signal); Sample 2: y(2) = f ao r ah s s h#, x(2) = (speech signal). 18 Figures from (Jansen & Niyogi, 2013)

  19. Application: Word Alignment / Phrase Extraction • Variables (boolean): – For each (Chinese phrase, English phrase) pair, are they linked? • Interactions: – Word fertilities – Few "jumps" (discontinuities) – Syntactic reorderings – "ITG constraint" on alignment – Phrases are disjoint (?) (Burkett & Klein, 2012) 19

  20. Application: Congressional Voting • Variables : – Text of all speeches of a representative – Local contexts of references between two representatives • Interactions : – Words used by representative and their vote – Pairs of representatives and their local context (Stoyanov & Eisner, 2012) 20

  21. Structured Prediction Examples • Examples of structured prediction – Part-of-speech (POS) tagging – Handwriting recognition – Speech recognition – Word alignment – Congressional voting • Examples of latent structure – Object recognition 21

  22. Case Study: Object Recognition Data consists of images x and labels y. [Figure: four example image/label pairs — pigeon, rhinoceros, leopard, llama] 22

  23. Case Study: Object Recognition Data consists of images x and labels y. • Preprocess data into "patches" • Posit a latent labeling z describing the object's parts (e.g. head, leg, tail, torso, grass) • Define graphical model with these latent variables in mind • z is not observed at train or test time [Figure: leopard image] 23

  24. Case Study: Object Recognition Data consists of images x and labels y. • Preprocess data into "patches" • Posit a latent labeling z describing the object's parts (e.g. head, leg, tail, torso, grass) • Define graphical model with these latent variables in mind • z is not observed at train or test time [Figure: leopard image with observed patch variables (e.g. X2, X7), latent part labels (e.g. Z2, Z7), and the image label Y] 24

  25. Case Study: Object Recognition Data consists of images x and labels y. • Preprocess data into "patches" • Posit a latent labeling z describing the object's parts (e.g. head, leg, tail, torso, grass) • Define graphical model with these latent variables in mind • z is not observed at train or test time [Figure: the same graphical model with factors ψ1–ψ4 connecting the patch variables X, latent labels Z, and image label Y] 25

  26. Structured Prediction Preview of challenges to come… • Consider the task of finding the most probable assignment to the output. Classification: ŷ = argmax_y p(y | x), where y ∈ {+1, −1}. Structured Prediction: ŷ = argmax_y p(y | x), where y ∈ Y and |Y| is very large. 26
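A toy calculation shows why |Y| is the challenge; the tag-set size and sentence length below are illustrative numbers, not from the slides:

    num_labels_classification = 2      # binary classification: y in {+1, -1}
    num_tags = 45                      # roughly the size of the Penn Treebank POS tag set
    sentence_length = 10
    num_structures = num_tags ** sentence_length
    print(num_labels_classification)   # 2
    print(num_structures)              # 34050628916015625, i.e. about 3.4e16 tag sequences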

  27. Machine Learning [Diagram relating Domain Knowledge, Mathematical Modeling, ML, Combinatorial Optimization, and Optimization] • The data inspires the structures we want to predict. • Our model defines a score for each structure; it also tells us what to optimize. • Inference finds {best structure, marginals, partition function} for a new observation. • Learning tunes the parameters of the model. • (Inference is usually called as a subroutine in learning.) 27

  28. Machine Learning [Diagram: Data (the sentence "time flies like an arrow"), Model (a graphical model over variables X1–X5), Objective, Inference, Learning] (Inference is usually called as a subroutine in learning) 28

  29. BACKGROUND 29

  30. Background: Chain Rule of Probability For random variables A and B: P(A, B) = P(A | B) P(B). For random variables X1, X2, X3, X4: P(X1, X2, X3, X4) = P(X1 | X2, X3, X4) P(X2 | X3, X4) P(X3 | X4) P(X4). 31
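A quick numeric sanity check of the chain rule on a made-up joint distribution over two binary variables (all probabilities below are arbitrary):

    import numpy as np

    P_AB = np.array([[0.10, 0.30],     # rows index A in {0, 1}, columns index B in {0, 1}
                     [0.20, 0.40]])
    P_B = P_AB.sum(axis=0)             # marginal P(B)
    P_A_given_B = P_AB / P_B           # conditional P(A | B), one column per value of B
    assert np.allclose(P_A_given_B * P_B, P_AB)   # P(A, B) = P(A | B) P(B)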

  31. Background: Conditional Independence Random variables A and B are conditionally independent given C if: (1) P(A, B | C) = P(A | C) P(B | C), or equivalently: (2) P(A | B, C) = P(A | C). We write this as: (3) A ⊥ B | C. Later we will also write: I<A, {C}, B>. 32

  32. HIDDEN MARKOV MODEL (HMM) 33

  33. HMM Outline • Motivation – Time Series Data • Hidden Markov Model (HMM) – Example: Squirrel Hill Tunnel Closures [courtesy of Roni Rosenfeld] – Background: Markov Models – From Mixture Model to HMM – History of HMMs – Higher-order HMMs • Training HMMs – (Supervised) Likelihood for HMM – Maximum Likelihood Estimation (MLE) for HMM – EM for HMM (aka. Baum-Welch algorithm) • Forward-Backward Algorithm – Three Inference Problems for HMM – Great Ideas in ML: Message Passing – Example: Forward-Backward on 3-word Sentence – Derivation of Forward Algorithm – Forward-Backward Algorithm – Viterbi algorithm 34

  34. Markov Models Whiteboard – Example: Squirrel Hill Tunnel Closures [courtesy of Roni Rosenfeld] – First-order Markov assumption – Conditional independence assumptions 35
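Since the Squirrel Hill tunnel example is worked on the whiteboard, here is a minimal first-order Markov chain sketch in the same spirit; the states and all probabilities are made up for illustration:

    import numpy as np

    states = ["open", "closed"]
    pi = np.array([0.8, 0.2])          # initial distribution P(S_1)
    A = np.array([[0.9, 0.1],          # A[i, j] = P(S_t = j | S_{t-1} = i)
                  [0.5, 0.5]])

    def joint_prob(seq):
        # First-order Markov assumption: P(S_1, ..., S_T) = P(S_1) * prod_t P(S_t | S_{t-1})
        idx = [states.index(s) for s in seq]
        p = pi[idx[0]]
        for prev, cur in zip(idx, idx[1:]):
            p *= A[prev, cur]
        return p

    print(joint_prob(["open", "open", "closed"]))   # 0.8 * 0.9 * 0.1 = 0.072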

