
COMS 4721: Machine Learning for Data Science Lecture 21, 4/13/2017 - PowerPoint PPT Presentation



  1. COMS 4721: Machine Learning for Data Science Lecture 21, 4/13/2017 Prof. John Paisley Department of Electrical Engineering & Data Science Institute Columbia University

  2. HIDDEN MARKOV MODELS

  3. OVERVIEW
Motivation
We have seen how Markov models can model sequential data.
◮ We assumed the observation was the sequence of states.
◮ Instead, each state may define a distribution on observations.
Hidden Markov model
A hidden Markov model treats a sequence of data slightly differently.
◮ Assume a hidden (i.e., unobserved, latent) sequence of states.
◮ An observation is drawn from the distribution associated with its state.
[Figure: graphical models side by side. Markov model: an observed state sequence s_1, s_2, ..., s_n. Hidden Markov model: a hidden state chain ..., s_{n−1}, s_n, s_{n+1}, ... with each state emitting an observation x_{n−1}, x_n, x_{n+1}.]

  4. MARKOV TO HIDDEN MARKOV MODELS
Markov models
Imagine we have three possible states in R². The data is a sequence of these positions. Since there are only three unique positions, we can give an index in place of coordinates. For example, the sequence (1, 2, 1, 3, 2, ...) would map to a sequence of 2-D vectors.
Using the notation of the figure, A is a 3 × 3 transition matrix. A_{ij} is the probability of transitioning from state i to state j.
[Figure: three states k = 1, 2, 3 with arrows between them labeled by the transition probabilities A_{11}, A_{12}, ..., A_{33}.]

  5. MARKOV TO HIDDEN MARKOV MODELS
Hidden Markov models
Now imagine the same three states, but each time the coordinates are randomly perturbed. The state sequence is still a set of indexes, e.g., (1, 2, 1, 3, 2, ...) of positions in R². However, if μ_1 is the position of state #1, then we observe x_i = μ_1 + ε_i if s_i = 1.
Exactly as before, we have a state transition matrix A (in this case 3 × 3). However, the observed data is a sequence (x_1, x_2, x_3, ...) where each x ∈ R² is a random perturbation of the state it's assigned to, {μ_1, μ_2, μ_3}.
[Figure: the three state centers k = 1, 2, 3 in the unit square, with noisy observations scattered around each center.]

  6. MARKOV TO HIDDEN MARKOV MODELS
A continuous hidden Markov model
This HMM is continuous because each x ∈ R² in the sequence (x_1, ..., x_T).
[Figure, three panels:
(left) A Markov state transition distribution for an unobserved sequence.
(middle) The state-dependent distributions used to generate observations.
(right) The data sequence. Colors indicate the distribution (state) used.]

  7. HIDDEN MARKOV MODELS
Definition
A hidden Markov model (HMM) consists of:
◮ An S × S Markov transition matrix A for transitioning between S states.
◮ An initial state distribution π for selecting the first state.
◮ A state-dependent emission distribution, Prob(x_i | s_i = k) = p(x_i | θ_{s_i}).
The model generates a sequence (x_1, x_2, x_3, ...) by:
1. Sampling the first state s_1 ∼ Discrete(π) and x_1 ∼ p(x | θ_{s_1}).
2. Sampling the Markov chain of states, s_i | {s_{i−1} = k} ∼ Discrete(A_{k,:}), followed by the observation x_i | s_i ∼ p(x | θ_{s_i}).
Continuous HMM: p(x | θ_s) is a continuous distribution, often Gaussian.
Discrete HMM: p(x | θ_s) is a discrete distribution, with θ_s a vector of probabilities. We focus on the discrete case. Let B be a matrix, where B_{k,:} = θ_k (from above).

  8. EXAMPLE: DISHONEST CASINO
Problem
Here is an example of a discrete hidden Markov model.
◮ Consider two dice, one fair and one unfair.
◮ At each roll, we either keep the current die or switch to the other one.
◮ The observation is the sequence of numbers rolled.
The transition matrix is
  A = [ 0.95  0.05
        0.10  0.90 ]
The emission matrix is
  B = [ 1/6   1/6   1/6   1/6   1/6   1/6
        1/10  1/10  1/10  1/10  1/10  1/2 ]
Let π = [1/2, 1/2].
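The casino model is easy to simulate by following the generative procedure from the definition above. Below is a minimal sketch (my own, assuming numpy; state 0 = fair, state 1 = loaded, and die faces are 0-indexed so face 6 is symbol 5):

```python
import numpy as np

# Dishonest-casino parameters from the slide (state 0 = fair, 1 = loaded).
pi = np.array([0.5, 0.5])                 # initial state distribution
A = np.array([[0.95, 0.05],
              [0.10, 0.90]])              # transition matrix
B = np.array([[1/6] * 6,                  # fair die: uniform over faces
              [1/10] * 5 + [1/2]])        # loaded die: face 6 half the time

def sample_hmm(pi, A, B, T, rng=None):
    """Generate (states, observations) of length T from a discrete HMM."""
    rng = np.random.default_rng(rng)
    S, V = B.shape
    states = np.empty(T, dtype=int)
    obs = np.empty(T, dtype=int)
    states[0] = rng.choice(S, p=pi)                    # s_1 ~ Discrete(pi)
    obs[0] = rng.choice(V, p=B[states[0]])             # x_1 ~ Discrete(B_{s_1,:})
    for i in range(1, T):
        states[i] = rng.choice(S, p=A[states[i-1]])    # s_i | s_{i-1} ~ Discrete(A_{s_{i-1},:})
        obs[i] = rng.choice(V, p=B[states[i]])         # x_i | s_i ~ Discrete(B_{s_i,:})
    return states, obs

states, rolls = sample_hmm(pi, A, B, T=300, rng=0)
```

Long runs of sixes in `rolls` tend to line up with stretches where `states` is 1, which is what the estimation algorithms on the next slides try to recover.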

  9. SOME ESTIMATION PROBLEMS
State estimation
◮ Given: An HMM {π, A, B} and observation sequence (x_1, ..., x_T)
◮ Estimate: State probability for x_i using the "forward-backward algorithm,"
  p(s_i = k | x_1, ..., x_T, π, A, B).
State sequence
◮ Given: An HMM {π, A, B} and observation sequence (x_1, ..., x_T)
◮ Estimate: Most probable state sequence using the "Viterbi algorithm,"
  s_1, ..., s_T = arg max_s p(s_1, ..., s_T | x_1, ..., x_T, π, A, B).
Learn an HMM
◮ Given: An observation sequence (x_1, ..., x_T)
◮ Estimate: HMM parameters π, A, B using maximum likelihood,
  π_ML, A_ML, B_ML = arg max_{π,A,B} p(x_1, ..., x_T | π, A, B).
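To make the second problem concrete, here is a log-domain dynamic-programming sketch of the Viterbi algorithm for a discrete HMM (a standard textbook implementation, not code from the lecture), reusing the casino parameters from the previous slide with 0-indexed die faces:

```python
import numpy as np

# Casino parameters from the earlier example (0 = fair, 1 = loaded).
pi = np.array([0.5, 0.5])
A = np.array([[0.95, 0.05], [0.10, 0.90]])
B = np.array([[1/6] * 6, [1/10] * 5 + [1/2]])

def viterbi(x, pi, A, B):
    """Most probable hidden state path: argmax_s p(s_1,...,s_T | x_1,...,x_T)."""
    T, S = len(x), len(pi)
    logA, logB = np.log(A), np.log(B)
    delta = np.zeros((T, S))            # delta[i, k]: best log-prob of a path ending in state k at step i
    psi = np.zeros((T, S), dtype=int)   # back-pointers
    delta[0] = np.log(pi) + logB[:, x[0]]
    for i in range(1, T):
        scores = delta[i-1][:, None] + logA          # scores[j, k]: best path into j, then j -> k
        psi[i] = scores.argmax(axis=0)
        delta[i] = scores.max(axis=0) + logB[:, x[i]]
    path = np.empty(T, dtype=int)
    path[-1] = delta[-1].argmax()
    for i in range(T - 2, -1, -1):                   # backtrack through the pointers
        path[i] = psi[i + 1, path[i + 1]]
    return path
```

For example, a run of sixes (symbol 5) is decoded as all-loaded, while a run of ones (symbol 0) is decoded as all-fair.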

  10. EXAMPLES
Before we look at the details, here are examples for the dishonest casino.
◮ Not shown is that π, A, B were learned first in order to calculate this.
◮ Notice that the right plot isn't just a rounding of the left plot.
[Figure, two panels over 300 rolls:
(left) State estimation result ("filtered"). Gray bars: loaded die used. Blue: probability p(s_i = loaded | x_{1:T}, π, A, B).
(right) State sequence result ("Viterbi"). Gray bars: loaded die used. Blue: most probable state sequence, MAP state (0 = fair, 1 = loaded).]

  11. LEARNING THE HMM

  12. LEARNING THE HMM: THE LIKELIHOOD
We focus on the discrete HMM. To learn the HMM parameters, maximize
  p(x | π, A, B) = Σ_{s_1=1}^{S} ··· Σ_{s_T=1}^{S} p(x, s_1, ..., s_T | π, A, B)
                 = Σ_{s_1=1}^{S} ··· Σ_{s_T=1}^{S} Π_{i=1}^{T} p(x_i | s_i, B) p(s_i | s_{i−1}, π, A)
◮ p(x_i | s_i, B) = B_{s_i, x_i} ← s_i indexes the distribution, x_i is the observation
◮ p(s_i | s_{i−1}, π, A) = A_{s_{i−1}, s_i} (or π_{s_1}) ← since s_1, ..., s_T is a Markov chain
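For a tiny T this sum can be evaluated literally, by enumerating all S^T state sequences. The sanity-check sketch below (my own, assuming numpy; reusing the casino parameters) is exactly the computation the forward algorithm later avoids, since brute-force cost is exponential in T:

```python
import numpy as np
from itertools import product

# Casino parameters from the earlier example.
pi = np.array([0.5, 0.5])
A = np.array([[0.95, 0.05], [0.10, 0.90]])
B = np.array([[1/6] * 6, [1/10] * 5 + [1/2]])

def likelihood_brute(x, pi, A, B):
    """p(x | pi, A, B) as a sum of p(x, s | pi, A, B) over all S^T state sequences s."""
    S, T = len(pi), len(x)
    total = 0.0
    for s in product(range(S), repeat=T):
        p = pi[s[0]] * B[s[0], x[0]]                # initial state and first emission
        for i in range(1, T):
            p *= A[s[i-1], s[i]] * B[s[i], x[i]]    # transition, then emission
        total += p
    return total
```

As a check, summing this likelihood over every possible observation sequence of a fixed length gives 1, as a properly normalized p(x) must.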

  13. LEARNING THE HMM: THE LOG LIKELIHOOD
◮ Maximizing p(x | π, A, B) is hard since the objective has log-sum form,
  ln p(x | π, A, B) = ln Σ_{s_1=1}^{S} ··· Σ_{s_T=1}^{S} Π_{i=1}^{T} p(x_i | s_i, B) p(s_i | s_{i−1}, π, A).
◮ However, if we had or learned s it would be easy (remove the sums).
◮ In addition, we can calculate p(s | x, π, A, B), though it's much more complicated than in previous models.
◮ Therefore, we can use the EM algorithm! The following is high-level.

  14. LEARNING THE HMM: THE LOG LIKELIHOOD
E-step: Using q(s) = p(s | x, π, A, B), calculate
  L(x, π, A, B) = E_q[ln p(x, s | π, A, B)].
M-step: Maximize L with respect to π, A, B. This part is tricky since we need to take the expectation using q(s) of
  ln p(x, s | π, A, B) = Σ_{i=1}^{T} Σ_{k=1}^{S} 1(s_i = k) ln B_{k, x_i}   (observations)
                       + Σ_{k=1}^{S} 1(s_1 = k) ln π_k   (initial state)
                       + Σ_{i=2}^{T} Σ_{j=1}^{S} Σ_{k=1}^{S} 1(s_{i−1} = j, s_i = k) ln A_{j,k}   (Markov chain)
The following is an overview to help you better navigate the books/tutorials.¹
¹ See the classic tutorial: Rabiner, L.R. (1989). "A tutorial on hidden Markov models and selected applications in speech recognition." Proceedings of the IEEE 77(2), 257–285.

  15. LEARNING THE HMM WITH EM
E-step
Let's define the following conditional posterior quantities:
  γ_i(k) = the posterior probability that s_i = k
  ξ_i(j, k) = the posterior probability that s_{i−1} = j and s_i = k
Therefore, γ_i is a vector and ξ_i is a matrix, both varying over i. We can calculate both of these using the "forward-backward" algorithm. (We won't cover it in this class, but Rabiner's tutorial is good.)
Given these values the E-step is:
  L = Σ_{k=1}^{S} γ_1(k) ln π_k + Σ_{i=2}^{T} Σ_{j=1}^{S} Σ_{k=1}^{S} ξ_i(j, k) ln A_{j,k} + Σ_{i=1}^{T} Σ_{k=1}^{S} γ_i(k) ln B_{k, x_i}
This gives us everything we need to update π, A, B.
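For reference, here is a sketch of the forward-backward algorithm computing γ and ξ for a discrete HMM (a standard implementation with Rabiner-style per-step scaling to avoid underflow; assuming numpy, and not code distributed with the course):

```python
import numpy as np

# Casino parameters from the earlier example.
pi = np.array([0.5, 0.5])
A = np.array([[0.95, 0.05], [0.10, 0.90]])
B = np.array([[1/6] * 6, [1/10] * 5 + [1/2]])

def forward_backward(x, pi, A, B):
    """Return gamma[i, k] = p(s_i = k | x) and xi[i, j, k] = p(s_{i-1} = j, s_i = k | x)."""
    T, S = len(x), len(pi)
    alpha = np.zeros((T, S))            # scaled forward messages
    beta = np.zeros((T, S))             # scaled backward messages
    c = np.zeros(T)                     # per-step normalizers
    alpha[0] = pi * B[:, x[0]]
    c[0] = alpha[0].sum(); alpha[0] /= c[0]
    for i in range(1, T):
        alpha[i] = (alpha[i-1] @ A) * B[:, x[i]]
        c[i] = alpha[i].sum(); alpha[i] /= c[i]
    beta[-1] = 1.0
    for i in range(T - 2, -1, -1):
        beta[i] = A @ (B[:, x[i+1]] * beta[i+1]) / c[i+1]
    gamma = alpha * beta                # posterior marginals, rows sum to 1
    xi = np.zeros((T, S, S))            # xi[0] is unused; slices i >= 2 in slide notation
    for i in range(1, T):
        xi[i] = alpha[i-1][:, None] * A * (B[:, x[i]] * beta[i])[None, :] / c[i]
    return gamma, xi
```

A useful self-consistency check: each row of γ sums to 1, each ξ_i sums to 1, and summing ξ_i over its first index recovers γ_i.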

  16. LEARNING THE HMM WITH EM
M-step
The updates for the HMM parameters are:
  π_k = γ_1(k) / Σ_j γ_1(j),
  A_{j,k} = Σ_{i=2}^{T} ξ_i(j, k) / Σ_{i=2}^{T} Σ_{l=1}^{S} ξ_i(j, l),
  B_{k,v} = Σ_{i=1}^{T} γ_i(k) 1{x_i = v} / Σ_{i=1}^{T} γ_i(k).
The updates can be understood as follows:
◮ A_{j,k} is the expected fraction of transitions j → k when we start at j
  ◮ Numerator: Expected count of transitions j → k
  ◮ Denominator: Expected total number of transitions from j
◮ B_{k,v} is the expected fraction of data coming from state k and equal to v
  ◮ Numerator: Expected number of observations = v from state k
  ◮ Denominator: Expected total number of observations from state k
◮ π has an interpretation similar to A
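These ratios translate directly into code. A sketch (my own, assuming numpy, with γ stored as a T × S array and ξ as a T × S × S array whose slice ξ[i] corresponds to ξ_i; the name `m_step` is mine, not from the lecture):

```python
import numpy as np

def m_step(x, gamma, xi, V):
    """M-step updates for pi, A, B from posterior expectations (single sequence)."""
    x = np.asarray(x)
    pi = gamma[0] / gamma[0].sum()              # expected initial-state counts, normalized
    A = xi[1:].sum(axis=0)                      # expected transition counts j -> k
    A /= A.sum(axis=1, keepdims=True)           # normalize over destinations k
    S = gamma.shape[1]
    B = np.zeros((S, V))
    for v in range(V):
        B[:, v] = gamma[x == v].sum(axis=0)     # expected emissions of symbol v per state
    B /= gamma.sum(axis=0)[:, None]             # expected total emissions per state
    return pi, A, B
```

Because each update is an expected count divided by its total, π sums to 1 and every row of A and B sums to 1 by construction.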

  17. LEARNING THE HMM WITH EM
M-step: N sequences
Usually we'll have multiple sequences that are modeled by an HMM. In this case, the updates for the HMM parameters with N sequences are:
  π_k = Σ_{n=1}^{N} γ_1^n(k) / Σ_{n=1}^{N} Σ_j γ_1^n(j),
  A_{j,k} = Σ_{n=1}^{N} Σ_{i=2}^{T_n} ξ_i^n(j, k) / Σ_{n=1}^{N} Σ_{i=2}^{T_n} Σ_{l=1}^{S} ξ_i^n(j, l),
  B_{k,v} = Σ_{n=1}^{N} Σ_{i=1}^{T_n} γ_i^n(k) 1{x_i^n = v} / Σ_{n=1}^{N} Σ_{i=1}^{T_n} γ_i^n(k).
The modifications are:
◮ Each sequence can be of a different length, T_n
◮ Each sequence has its own set of γ and ξ values
◮ Using this we sum over the sequences, with the interpretation the same.
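Pooling over sequences only changes where the sums run: accumulate the expected counts across all sequences, then normalize once. A sketch under the same assumed array layout as before (per-sequence γ's and ξ's supplied as lists; names are mine):

```python
import numpy as np

def m_step_multi(xs, gammas, xis, V):
    """M-step pooled over N sequences; xs[n], gammas[n], xis[n] belong to sequence n."""
    S = gammas[0].shape[1]
    pi_num = np.zeros(S)                            # expected initial-state counts
    A_num = np.zeros((S, S))                        # expected transition counts
    B_num = np.zeros((S, V))                        # expected emission counts
    B_den = np.zeros(S)                             # expected total emissions per state
    for x, gamma, xi in zip(xs, gammas, xis):
        x = np.asarray(x)
        pi_num += gamma[0]
        A_num += xi[1:].sum(axis=0)
        for v in range(V):
            B_num[:, v] += gamma[x == v].sum(axis=0)
        B_den += gamma.sum(axis=0)
    pi = pi_num / pi_num.sum()
    A = A_num / A_num.sum(axis=1, keepdims=True)
    B = B_num / B_den[:, None]
    return pi, A, B
```

With N = 1 this reduces to the single-sequence updates from the previous slide.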

  18. APPLICATION: SPEECH RECOGNITION
