SLIDE 1

Hidden Markov Models

1

10-601 Introduction to Machine Learning

Matt Gormley Lecture 19

  • Nov. 5, 2018

Machine Learning Department School of Computer Science Carnegie Mellon University

SLIDE 2

Reminders

  • Homework 6: PAC Learning / Generative Models
– Out: Wed, Oct 31
– Due: Wed, Nov 7 at 11:59pm (1 week)
  • Homework 7: HMMs
– Out: Wed, Nov 7
– Due: Mon, Nov 19 at 11:59pm

  • Grades are up on Canvas

2

SLIDE 3

Q&A

3

Q: Why would we use Naïve Bayes? Isn’t it too Naïve? A: Naïve Bayes has one key advantage over methods like Perceptron, Logistic Regression, Neural Nets: Training is lightning fast! While other methods require slow iterative training procedures that might require hundreds of epochs, Naïve Bayes computes its parameters in closed form by counting.
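To make the point concrete, here is a minimal sketch (our own illustration, not from the slides) of closed-form training for a Bernoulli Naïve Bayes model with add-alpha smoothing: the parameters really are nothing more than normalized counts.

```python
import numpy as np

def train_bernoulli_nb(X, y, alpha=1.0):
    """Closed-form Naive Bayes training: just count and normalize.

    X: (N, D) binary feature matrix; y: (N,) labels in {0, ..., K-1}.
    alpha is an add-alpha (Laplace) smoothing pseudocount.
    """
    K, D = int(y.max()) + 1, X.shape[1]
    prior = np.zeros(K)          # p(y = k)
    cond = np.zeros((K, D))      # p(x_d = 1 | y = k)
    for k in range(K):
        Xk = X[y == k]
        prior[k] = len(Xk) / len(X)
        cond[k] = (Xk.sum(axis=0) + alpha) / (len(Xk) + 2 * alpha)
    return prior, cond
```

A single pass over the data suffices; there is no iterative optimization loop at all.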

SLIDE 4

DISCRIMINATIVE AND GENERATIVE CLASSIFIERS

4

SLIDE 5

Generative vs. Discriminative

  • Generative Classifiers:
– Example: Naïve Bayes
– Define a joint model of the observations x and the labels y: p(x, y)
– Learning maximizes (joint) likelihood
– Use Bayes’ Rule to classify based on the posterior: p(y|x) = p(x|y) p(y) / p(x)
  • Discriminative Classifiers:
– Example: Logistic Regression
– Directly model the conditional: p(y|x)
– Learning maximizes conditional likelihood

5
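Note that because p(x) does not depend on y, the generative classifier never actually needs the normalizer: ŷ = argmax_y p(y|x) = argmax_y p(x|y) p(y) (our notation, not from the slide).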

SLIDE 6

Generative vs. Discriminative

Whiteboard

– Contrast: To model p(x) or not to model p(x)?

6

SLIDE 7

Generative vs. Discriminative

Finite Sample Analysis (Ng & Jordan, 2002) [Assume that we are learning from a finite training dataset]

7

If model assumptions are correct: Naïve Bayes is a more efficient learner (requires fewer samples) than Logistic Regression. If model assumptions are incorrect: Logistic Regression has lower asymptotic error, and does better than Naïve Bayes.

SLIDE 8

[Figure: learning curves; solid = Naïve Bayes (NB), dashed = Logistic Regression (LR)]

8

Slide courtesy of William Cohen

SLIDE 9

Naïve Bayes makes stronger assumptions about the data but needs fewer examples to estimate the parameters “On Discriminative vs Generative Classifiers: ….” Andrew Ng and Michael Jordan, NIPS 2001.

9

[Figure: learning curves; solid = Naïve Bayes (NB), dashed = Logistic Regression (LR)]

Slide courtesy of William Cohen

SLIDE 10

Generative vs. Discriminative

Learning (Parameter Estimation)

10

Naïve Bayes: Parameters are decoupled → closed-form solution for MLE
Logistic Regression: Parameters are coupled → no closed-form solution; must use iterative optimization techniques instead
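As a contrast with the counting-based sketch above, here is a minimal sketch (our own code, not from the slides) of the iterative route for binary logistic regression; the learning rate and epoch count are arbitrary choices.

```python
import numpy as np

def train_logistic_regression(X, y, lr=0.1, epochs=1000):
    """No closed form: fit w by iterative gradient ascent on the log-likelihood.

    X: (N, D) real-valued features; y: (N,) labels in {0, 1}.
    """
    N, D = X.shape
    w = np.zeros(D)
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-X @ w))   # predicted p(y = 1 | x)
        grad = X.T @ (y - p) / N           # gradient of the average log-likelihood
        w += lr * grad                     # one gradient-ascent step
    return w
```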

SLIDE 11

Naïve Bayes vs. Logistic Reg.

Learning (MAP Estimation of Parameters)

11

Bernoulli Naïve Bayes: Parameters are probabilities → a Beta prior (usually) pushes probabilities away from the zero / one extremes
Logistic Regression: Parameters are not probabilities → a Gaussian prior encourages parameters to be close to zero (effectively pushes the probabilities away from the zero / one extremes)
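As a worked example of the first point (a standard result, not spelled out on the slide): with a Beta(α, β) prior on a Bernoulli parameter θ and N1 successes out of N trials, the MAP estimate is

  θ̂_MAP = (N1 + α − 1) / (N + α + β − 2)

so any α, β > 1 pulls the estimate away from 0 and 1 even when the observed counts are extreme (e.g. N1 = 0).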

SLIDE 12

Naïve Bayes vs. Logistic Reg.

Features

12

Naïve Bayes: Features x are assumed to be conditionally independent given y. (i.e. Naïve Bayes Assumption)
Logistic Regression: No assumptions are made about the form of the features x. They can be dependent and correlated in any fashion.

SLIDE 13

MOTIVATION: STRUCTURED PREDICTION

13

SLIDE 14

Structured Prediction

  • Most of the models we’ve seen so far were for classification
– Given observations: x = (x1, x2, …, xK)
– Predict a (binary) label: y
  • Many real-world problems require structured prediction
– Given observations: x = (x1, x2, …, xK)
– Predict a structure: y = (y1, y2, …, yJ)
  • Some classification problems benefit from latent structure

14

SLIDE 15

Structured Prediction Examples

  • Examples of structured prediction
– Part-of-speech (POS) tagging
– Handwriting recognition
– Speech recognition
– Word alignment
– Congressional voting
  • Examples of latent structure
– Object recognition

15

SLIDE 16

Dataset for Supervised Part-of-Speech (POS) Tagging

16

Data: D = {x^(n), y^(n)}_{n=1}^N

Sample 1 (x^(1), y^(1)):  time/n  flies/v  like/p  an/d  arrow/n
Sample 2 (x^(2), y^(2)):  time/n  flies/n  like/v  an/d  arrow/n
Sample 3 (x^(3), y^(3)):  flies/n  fly/v  with/p  their/n  wings/n
Sample 4 (x^(4), y^(4)):  with/p  time/n  you/n  will/v  see/v

SLIDE 17

Dataset for Supervised Handwriting Recognition

17

Data: D = {x^(n), y^(n)}_{n=1}^N

[Figure: three samples of handwritten words x^(n) with their character-sequence labels y^(n); figures from (Chatzis & Demiris, 2013)]

SLIDE 18

Dataset for Supervised Phoneme (Speech) Recognition

18

Data: D = {x^(n), y^(n)}_{n=1}^N

[Figure: two samples of speech x^(n) with their phoneme-sequence labels y^(n) (e.g. h#, ih, w, z, iy); figures from (Jansen & Niyogi, 2013)]

SLIDE 19

Word Alignment / Phrase Extraction

  • Variables (boolean):
– For each (Chinese phrase, English phrase) pair, are they linked?
  • Interactions:
– Word fertilities
– Few “jumps” (discontinuities)
– Syntactic reorderings
– “ITG constraint” on alignment
– Phrases are disjoint (?)

19

(Burkett & Klein, 2012)

Application:

SLIDE 20

Congressional Voting

20

(Stoyanov & Eisner, 2012)

Application:

  • Variables:
– Text of all speeches of a representative
– Local contexts of references between two representatives
  • Interactions:
– Words used by representative and their vote
– Pairs of representatives and their local context

SLIDE 21

Structured Prediction Examples

  • Examples of structured prediction
– Part-of-speech (POS) tagging
– Handwriting recognition
– Speech recognition
– Word alignment
– Congressional voting
  • Examples of latent structure
– Object recognition

21

SLIDE 22

Case Study: Object Recognition

Data consists of images x and labels y.

22

[Figure: four example images x^(1) … x^(4) with labels y^(1) … y^(4): pigeon, leopard, llama, rhinoceros]

SLIDE 23

Case Study: Object Recognition

Data consists of images x and labels y.

23

  • Preprocess data into “patches”
  • Posit a latent labeling z describing the object’s parts (e.g. head, leg, tail, torso, grass)
  • Define graphical model with these latent variables in mind
  • z is not observed at train or test time

[Figure: example image labeled “leopard”]

SLIDE 24

Case Study: Object Recognition

Data consists of images x and labels y.

24

  • Preprocess data into “patches”
  • Posit a latent labeling z describing the object’s parts (e.g. head, leg, tail, torso, grass)
  • Define graphical model with these latent variables in mind
  • z is not observed at train or test time

[Figure: the “leopard” image divided into patches, with latent part labels Z2, Z7, … for patches X2, X7, …, connected to the image label Y]

SLIDE 25

Case Study: Object Recognition

Data consists of images x and labels y.

25

  • Preprocess data into “patches”
  • Posit a latent labeling z describing the object’s parts (e.g. head, leg, tail, torso, grass)
  • Define graphical model with these latent variables in mind
  • z is not observed at train or test time

[Figure: the same model drawn as a factor graph, with potentials ψ1 … ψ4 connecting the latent part labels Z2, Z7, …, the patches X2, X7, …, and the image label Y]

SLIDE 26

Structured Prediction

26

Preview of challenges to come…

  • Consider the task of finding the most probable assignment to the output

Classification:         ŷ = argmax_y p(y | x)   where y ∈ {+1, −1}
Structured Prediction:  ŷ = argmax_y p(y | x)   where y ∈ Y and |Y| is very large
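To get a sense of scale (our numbers, not the slide’s): for sequence labeling with K possible tags per position and a length-T output, |Y| = K^T. With K = 45 part-of-speech tags and a T = 10 word sentence, that is 45^10 ≈ 3.4 × 10^16 candidate tag sequences, so simply enumerating Y is hopeless.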

SLIDE 27

Machine Learning

27

The data inspires the structures we want to predict. It also tells us what to optimize. Our model defines a score for each structure. Learning tunes the parameters of the model. Inference finds {best structure, marginals, partition function} for a new observation.

[Figure: diagram connecting Domain Knowledge, Mathematical Modeling, Optimization, and Combinatorial Optimization, with ML at the center]

(Inference is usually called as a subroutine in learning)

SLIDE 28

Machine Learning

28

[Figure: Data, Model, Objective, Learning, and Inference, illustrated with the sentence “time flies like an arrow” and a factor graph over variables X1 … X5]

(Inference is usually called as a subroutine in learning)

SLIDE 29

BACKGROUND

29

SLIDE 30

Background: Chain Rule of Probability

31

For random variables A and B:

  P(A, B) = P(A|B) P(B)

For random variables X1, X2, X3, X4:

  P(X1, X2, X3, X4) = P(X1|X2, X3, X4) P(X2|X3, X4) P(X3|X4) P(X4)
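The same pattern holds for any number of variables (a standard identity, written here to match the slide’s ordering):

  P(X1, X2, …, Xn) = P(X1|X2, …, Xn) P(X2|X3, …, Xn) … P(Xn−1|Xn) P(Xn)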

SLIDE 31

Background: Conditional Independence

32

Random variables A and B are conditionally independent given C if:

  P(A, B | C) = P(A|C) P(B|C)    (1)

or equivalently:

  P(A | B, C) = P(A|C)    (2)

We write this as:

  A ⊥ B | C    (3)

Later we will also write: I<A, {C}, B>

SLIDE 32

HIDDEN MARKOV MODEL (HMM)

33

SLIDE 33

HMM Outline

  • Motivation
– Time Series Data
  • Hidden Markov Model (HMM)
– Example: Squirrel Hill Tunnel Closures [courtesy of Roni Rosenfeld]
– Background: Markov Models
– From Mixture Model to HMM
– History of HMMs
– Higher-order HMMs
  • Training HMMs
– (Supervised) Likelihood for HMM
– Maximum Likelihood Estimation (MLE) for HMM
– EM for HMM (aka. Baum-Welch algorithm)
  • Forward-Backward Algorithm
– Three Inference Problems for HMM
– Great Ideas in ML: Message Passing
– Example: Forward-Backward on 3-word Sentence
– Derivation of Forward Algorithm
– Forward-Backward Algorithm
– Viterbi algorithm

34

SLIDE 34

Markov Models

Whiteboard

– Example: Squirrel Hill Tunnel Closures [courtesy of Roni Rosenfeld]
– First-order Markov assumption
– Conditional independence assumptions

35
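Since the detail is on the whiteboard, here is the first-order Markov assumption written out (standard form, our rendering): the next state depends on the history only through the most recent state,

  p(yt | yt−1, yt−2, …, y1) = p(yt | yt−1)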

SLIDE 35

36

SLIDE 36

Mixture Model for Time Series Data

38

Example sequence of tunnel states and travel times:
  states: O   S   S    O   C
  times:  2m  3m  18m  9m  27m

We could treat each (tunnel state, travel time) pair as independent. This corresponds to a Naïve Bayes model with a single feature (travel time).

Prior over tunnel states:
  O .8   S .1   C .1

Emission probabilities p(travel time | state):
       1min  2min  3min  …
  O    .1    .2    .3
  S    .01   .02   .03
  C    0     …

p(O, S, S, O, C, 2m, 3m, 18m, 9m, 27m) = (.8 * .2 * .1 * .03 * …)
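A minimal sketch of this computation (our own code, not from the slide; the C row of the emission table is truncated above, so the value used for C below is a placeholder):

```python
# Naive Bayes / mixture model: each (tunnel state, travel time) pair is
# treated as independent, so the joint is a product of p(state) * p(time | state).
prior = {'O': 0.8, 'S': 0.1, 'C': 0.1}
emission = {                              # p(travel time | tunnel state)
    'O': {'1min': 0.1, '2min': 0.2, '3min': 0.3},
    'S': {'1min': 0.01, '2min': 0.02, '3min': 0.03},
    'C': {'1min': 0.0},                   # placeholder: this row is truncated on the slide
}

def mixture_joint(states, times):
    p = 1.0
    for y, x in zip(states, times):
        p *= prior[y] * emission[y].get(x, 0.0)
    return p

# The first factors match the slide: 0.8 * 0.2 (O, 2min), then 0.1 * 0.03 (S, 3min), ...
print(mixture_joint(['O', 'S'], ['2min', '3min']))   # 0.8*0.2 * 0.1*0.03 = 0.00048
```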

SLIDE 37

Hidden Markov Model

39

Example sequence of tunnel states and travel times:
  states: O   S   S    O   C
  times:  2m  3m  18m  9m  27m

A Hidden Markov Model (HMM) provides a joint distribution over the tunnel states / travel times with an assumption of dependence between adjacent tunnel states.

Initial probabilities:
  O .8   S .1   C .1

Transition probabilities p(yt | yt−1):
       O    S    C
  O   .9   .08  .02
  S   .2   .7   .1
  C   .9   0    .1

Emission probabilities p(xt | yt):
       1min  2min  3min  …
  O    .1    .2    .3
  S    .01   .02   .03
  C    0     …

p(O, S, S, O, C, 2m, 3m, 18m, 9m, 27m) = (.8 * .08 * .2 * .7 * .03 * …)
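For contrast with the mixture-model sketch above, here is a minimal sketch of the HMM joint (our own code, not from the slide; as above, the C row of the emission table is truncated, so it is omitted here):

```python
init = {'O': 0.8, 'S': 0.1, 'C': 0.1}
trans = {                                  # p(y_t | y_{t-1}), rows from the slide
    'O': {'O': 0.9, 'S': 0.08, 'C': 0.02},
    'S': {'O': 0.2, 'S': 0.7,  'C': 0.1},
    'C': {'O': 0.9, 'S': 0.0,  'C': 0.1},
}
emit = {                                   # p(x_t | y_t)
    'O': {'1min': 0.1, '2min': 0.2, '3min': 0.3},
    'S': {'1min': 0.01, '2min': 0.02, '3min': 0.03},
}

def hmm_joint(states, times):
    # Chain transition and emission probabilities along the sequence.
    p = init[states[0]] * emit[states[0]][times[0]]
    for t in range(1, len(states)):
        p *= trans[states[t - 1]][states[t]] * emit[states[t]][times[t]]
    return p

# Same factors as on the slide, multiplied in a different order:
# 0.8 (init O) * 0.2 (2min|O) * 0.08 (O->S) * 0.03 (3min|S) * ...
print(hmm_joint(['O', 'S'], ['2min', '3min']))
```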

SLIDE 38

From Mixture Model to HMM

40

[Figure: two graphical models over Y1 … Y5 and X1 … X5: the “Naïve Bayes” (mixture) model, where each Yt independently emits Xt, and the HMM, where the Yt are chained]

SLIDE 39

From Mixture Model to HMM

41

[Figure: the same two graphical models, with the HMM now drawn with an initial state Y0 preceding Y1]

SLIDE 40

SUPERVISED LEARNING FOR HMMS

46

SLIDE 41

Hidden Markov Model

HMM Parameters:

48

Initial probabilities:
  O .8   S .1   C .1

Transition probabilities p(yt | yt−1):
       O    S    C
  O   .9   .08  .02
  S   .2   .7   .1
  C   .9   0    .1

Emission probabilities p(xt | yt):
       1min  2min  3min  …
  O    .1    .2    .3
  S    .01   .02   .03
  C    0     …

[Figure: HMM graphical model over states Y1 … Y5 and observations X1 … X5]

SLIDE 42

Training HMMs

Whiteboard

– (Supervised) Likelihood for an HMM – Maximum Likelihood Estimation (MLE) for HMM

49

SLIDE 43

Supervised Learning for HMMs

Learning an HMM decomposes into solving two (independent) Mixture Models

50

[Figure: the two sub-problems: a transition “mixture” over Yt → Yt+1 and an emission “mixture” over Yt → Xt]
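A minimal sketch (our own code, assuming fully observed training sequences) of what this decomposition means in practice: the transition and emission probabilities are each estimated by counting and normalizing, independently of one another. Initial probabilities are folded in as transitions out of START, matching the y0 = START convention on the next slides.

```python
from collections import Counter, defaultdict

def hmm_mle(sequences):
    """Supervised MLE for an HMM: count transitions and emissions, then normalize.

    sequences: iterable of (states, observations) pairs, fully observed.
    """
    trans_counts = defaultdict(Counter)   # counts of y_{t-1} -> y_t
    emit_counts = defaultdict(Counter)    # counts of y_t -> x_t
    for states, obs in sequences:
        for prev, curr in zip(['START'] + list(states), states):
            trans_counts[prev][curr] += 1
        for y, x in zip(states, obs):
            emit_counts[y][x] += 1
    # Each sub-problem normalizes its own counts, independently of the other.
    trans = {p: {c: n / sum(cnt.values()) for c, n in cnt.items()}
             for p, cnt in trans_counts.items()}
    emit = {s: {x: n / sum(cnt.values()) for x, n in cnt.items()}
            for s, cnt in emit_counts.items()}
    return trans, emit

# e.g. with the tunnel example:
trans, emit = hmm_mle([(['O', 'S', 'S', 'O', 'C'], ['2m', '3m', '18m', '9m', '27m'])])
```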

SLIDE 44

Hidden Markov Model

HMM Parameters / Assumption / Generative Story:

51

[Figure: HMM graphical model with initial state Y0, states Y1 … Y5, and observations X1 … X5]

Assumption: y0 = START

For notational convenience, we fold the initial probabilities C into the transition matrix B by our assumption that y0 = START.
SLIDE 45

Hidden Markov Model

Joint Distribution:

52

[Figure: HMM graphical model with initial state Y0, states Y1 … Y5, and observations X1 … X5]

y0 = START
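In standard form (our rendering; the slide presents the equation as an image), with the initial probability folded into the first transition via y0 = START:

  p(x1, …, xT, y1, …, yT) = ∏_{t=1}^{T} p(yt | yt−1) p(xt | yt),  where y0 = START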

SLIDE 46

Supervised Learning for HMMs

Learning an HMM decomposes into solving two (independent) Mixture Models

53

[Figure: the two sub-problems: a transition “mixture” over Yt → Yt+1 and an emission “mixture” over Yt → Xt]

SLIDE 47

HMMs: History

  • Markov chains: Andrey Markov (1906)
– Random walks and Brownian motion
  • Used in Shannon’s work on information theory (1948)
  • Baum-Welch learning algorithm: late 60’s, early 70’s.
– Used mainly for speech in 60s-70s.
  • Late 80’s and 90’s: David Haussler (major player in learning theory in 80’s) began to use HMMs for modeling biological sequences
  • Mid-late 1990’s: Dayne Freitag/Andrew McCallum
– Freitag thesis with Tom Mitchell on IE from Web using logic programs, grammar induction, etc.
– McCallum: multinomial Naïve Bayes for text
– With McCallum, IE using HMMs on CORA

55 Slide from William Cohen

SLIDE 48

Higher-order HMMs

  • 1st-order HMM (i.e. bigram HMM)
  • 2nd-order HMM (i.e. trigram HMM)
  • 3rd-order HMM

56

[Figure: three graphical models over <START>, Y1 … Y5, and X1 … X5, one per order, in which each Yt depends on the previous one, two, or three states respectively]
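Written out (our rendering; the slide shows this only graphically), the orders differ only in the transition factor, while each Xt still depends only on Yt:

  1st-order:  p(yt | yt−1)
  2nd-order:  p(yt | yt−1, yt−2)
  3rd-order:  p(yt | yt−1, yt−2, yt−3)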