Hidden Markov Models Based on Foundations of Statistical NLP by C. - PowerPoint PPT Presentation

0. Hidden Markov Models

Based on
• “Foundations of Statistical NLP” by C. Manning & H. Schütze, ch. 9, MIT Press, 2002
• “Biological Sequence Analysis”, R. Durbin et al., ch. 3 and 11.6, Cambridge University Press, 1998

1. PLAN

1 Markov Models
   Markov assumptions
2 Hidden Markov Models
3 Fundamental questions for HMMs
   3.1 Probability of an observation sequence: the Forward algorithm, the Backward algorithm
   3.2 Finding the “best” sequence: the Viterbi algorithm
   3.3 HMM parameter estimation: the Forward-Backward (EM) algorithm
4 HMM extensions
5 Applications

2. 1 Markov Models (generally)

Markov Models are used to model a sequence of random variables in which each element depends on previous elements.

$$X = \langle X_1 \ldots X_T \rangle, \qquad X_t \in S = \{s_1, \ldots, s_N\}$$

X is also called a Markov Process or Markov Chain.

S = set of states
Π = initial state probabilities: $\pi_i = P(X_1 = s_i)$; $\sum_{i=1}^{N} \pi_i = 1$
A = transition probabilities: $a_{ij} = P(X_{t+1} = s_j \mid X_t = s_i)$; $\sum_{j=1}^{N} a_{ij} = 1 \ \forall i$

4. A 1st Markov chain example: DNA (from [Durbin et al., 1998])

[Figure: a Markov chain with four states, A, C, G and T.]

Note: Here we leave transition probabilities unspecified.

5. A 2nd Markov chain example: CpG islands in DNA sequences

Maximum Likelihood estimation of parameters using real data (+ and −):

$$a^{+}_{st} = \frac{c^{+}_{st}}{\sum_{t'} c^{+}_{st'}}, \qquad a^{-}_{st} = \frac{c^{-}_{st}}{\sum_{t'} c^{-}_{st'}}$$

 +      A      C      G      T
 A    0.180  0.274  0.426  0.120
 C    0.171  0.368  0.274  0.188
 G    0.161  0.339  0.375  0.125
 T    0.079  0.355  0.384  0.182

 −      A      C      G      T
 A    0.300  0.205  0.285  0.210
 C    0.322  0.298  0.078  0.302
 G    0.248  0.246  0.298  0.208
 T    0.177  0.239  0.292  0.292
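The maximum likelihood formula above amounts to counting adjacent symbol pairs and normalising each row. A minimal sketch, assuming that $c_{st}$ is the number of times symbol t follows symbol s in the training sequences; the function name `estimate_transitions` and the toy sequences are hypothetical and are not the real data behind the tables:

```python
def estimate_transitions(sequences, alphabet="ACGT"):
    """Return a[s][t] = c[s][t] / sum over t' of c[s][t'] from observed sequences."""
    # c[s][t]: how often symbol t immediately follows symbol s
    counts = {s: {t: 0 for t in alphabet} for s in alphabet}
    for seq in sequences:
        for s, t in zip(seq, seq[1:]):
            counts[s][t] += 1
    probs = {}
    for s in alphabet:
        total = sum(counts[s].values())
        # A row with no outgoing counts is left at zero (no smoothing here).
        probs[s] = {t: (counts[s][t] / total if total else 0.0) for t in alphabet}
    return probs

# Toy "+" training set (made up for illustration).
toy_plus = ["ACGCGCGT", "CGCGAACG"]
a_plus = estimate_transitions(toy_plus)
print(a_plus["A"])
```

With real data one such table would be estimated from the "+" sequences and another from the "−" sequences.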

6. Using log likelihood (log-odds) ratios for discrimination

$$S(x) = \log_2 \frac{P(x \mid \text{model}+)}{P(x \mid \text{model}-)} = \sum_{i=1}^{L} \log_2 \frac{a^{+}_{x_{i-1} x_i}}{a^{-}_{x_{i-1} x_i}} = \sum_{i=1}^{L} \beta_{x_{i-1} x_i}$$

 β       A       C       G       T
 A    −0.740   0.419   0.580  −0.803
 C    −0.913   0.302   1.812  −0.685
 G    −0.624   0.461   0.331  −0.730
 T    −1.169   0.573   0.393  −0.679
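The log-odds score can be sketched directly from the two transition tables of the CpG example. This is an illustrative sketch: the function name `log_odds` is hypothetical, and it sums only over transitions between consecutive symbols, so the term for the first symbol (which needs a begin state) is left out as a simplifying assumption.

```python
import math

# Transition tables for the "+" and "-" models, copied from the CpG slide.
A_PLUS = {
    "A": {"A": 0.180, "C": 0.274, "G": 0.426, "T": 0.120},
    "C": {"A": 0.171, "C": 0.368, "G": 0.274, "T": 0.188},
    "G": {"A": 0.161, "C": 0.339, "G": 0.375, "T": 0.125},
    "T": {"A": 0.079, "C": 0.355, "G": 0.384, "T": 0.182},
}
A_MINUS = {
    "A": {"A": 0.300, "C": 0.205, "G": 0.285, "T": 0.210},
    "C": {"A": 0.322, "C": 0.298, "G": 0.078, "T": 0.302},
    "G": {"A": 0.248, "C": 0.246, "G": 0.298, "T": 0.208},
    "T": {"A": 0.177, "C": 0.239, "G": 0.292, "T": 0.292},
}

def log_odds(x):
    """Sum of log2(a+ / a-) over consecutive symbol pairs of x, in bits."""
    return sum(math.log2(A_PLUS[s][t] / A_MINUS[s][t]) for s, t in zip(x, x[1:]))

print(log_odds("CGCG"))  # positive: looks more like a CpG island
print(log_odds("ATAT"))  # negative: looks more like background
```

Recomputing β from the rounded table entries can differ from the printed β values in the third decimal place.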

7. 2 Hidden Markov Models

K = output alphabet = $\{k_1, \ldots, k_M\}$
B = output emission probabilities: $b_{ijk} = P(O_t = k \mid X_t = s_i, X_{t+1} = s_j)$

Notice that $b_{ijk}$ does not depend on t.

In HMMs we only observe a probabilistic function of the state sequence: $\langle O_1 \ldots O_T \rangle$.
When the state sequence $\langle X_1 \ldots X_T \rangle$ is also observable: Visible Markov Model (VMM).

Remark: In all our subsequent examples $b_{ijk}$ is independent of j.

8. A program for a HMM

t = 1;
start in state s_i with probability π_i (i.e., X_1 = i);
forever do
    move from state s_i to state s_j with prob. a_ij (i.e., X_{t+1} = j);
    emit observation symbol O_t = k with probability b_ijk;
    t = t + 1;
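The generative program above can be sketched as a bounded loop. This is a sketch under stated assumptions: it uses the parameters of the crazy soft drink machine example from this deck, stops after T steps instead of running forever, and takes the emission to depend only on the state being left ($b_{ijk}$ independent of j). The names `draw` and `generate` are hypothetical.

```python
import random

PI = {"CP": 1.0, "IP": 0.0}                      # initial state probabilities
A = {"CP": {"CP": 0.7, "IP": 0.3},               # transition probabilities a_ij
     "IP": {"CP": 0.5, "IP": 0.5}}
B = {"CP": {"coke": 0.6, "ice_tea": 0.1, "lemon": 0.3},   # emissions b_ik
     "IP": {"coke": 0.1, "ice_tea": 0.7, "lemon": 0.2}}

def draw(dist, rng):
    """Sample one key of dist with probability proportional to its value."""
    keys = list(dist)
    return rng.choices(keys, weights=[dist[k] for k in keys])[0]

def generate(T, rng):
    """Return (X_1..X_{T+1}, O_1..O_T) sampled from the model."""
    state = draw(PI, rng)
    states, outputs = [state], []
    for _ in range(T):
        nxt = draw(A[state], rng)             # move from s_i to s_j
        outputs.append(draw(B[state], rng))   # emit O_t on the arc leaving s_i
        states.append(nxt)
        state = nxt
    return states, outputs

states, outputs = generate(5, random.Random(0))
print(states, outputs)
```

Note that T emitted symbols go with T + 1 visited states, which matches the state sequences $X_1 \ldots X_{T+1}$ used in the rest of the deck.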

9. A 1st HMM example: CpG islands (from [Durbin et al., 1998])

[Figure: an HMM with eight states: A+, C+, G+, T+ and A−, C−, G−, T−.]

Notes:
1. In addition to the transitions shown, there is also a complete set of transitions within each set (+ and − respectively).
2. Transition probabilities in this model are set so that within each group they are close to the transition probabilities of the original model, but there is also a small chance of switching into the other component. Overall, there is more chance of switching from ‘+’ to ‘−’ than vice versa.

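10. A 2nd HMM example: The occasionally dishonest casino (from [Durbin et al., 1998])

[Figure: a two-state HMM with states F (fair die) and L (loaded die).]

Emission probabilities:

 Face     F       L
 1       1/6     1/10
 2       1/6     1/10
 3       1/6     1/10
 4       1/6     1/10
 5       1/6     1/10
 6       1/6     1/2

Transition probabilities: F → F 0.95, F → L 0.05, L → L 0.9, L → F 0.1.
The figure also carries the labels 0.99 and 0.01.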

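11. A 3rd HMM example: The crazy soft drink machine (from [Manning & Schütze, 2000])

[Figure: a two-state HMM with states CP (Coke Preference) and IP (Ice tea Preference); the machine starts in CP, π_CP = 1.]

Emission probabilities:

 State   Coke   Ice tea   Lemon
 CP      0.6    0.1       0.3
 IP      0.1    0.7       0.2

Transition probabilities: CP → CP 0.7, CP → IP 0.3, IP → CP 0.5, IP → IP 0.5.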

12. A 4th example: A tiny HMM for 5’ splice site recognition (from [ Eddy, 2004 ] )

13. 3 Three fundamental questions for HMMs

1. Probability of an Observation Sequence: Given a model $\mu = (A, B, \Pi)$ over S, K, how do we (efficiently) compute the likelihood of a particular sequence, $P(O \mid \mu)$?
2. Finding the “Best” State Sequence: Given an observation sequence and a model, how do we choose a state sequence $(X_1, \ldots, X_{T+1})$ to best explain the observation sequence?
3. HMM Parameter Estimation: Given an observation sequence (or corpus thereof), how do we acquire a model $\mu = (A, B, \Pi)$ that best explains the data?

14. 3.1 Probability of an observation sequence

$$P(O \mid X, \mu) = \prod_{t=1}^{T} P(O_t \mid X_t, X_{t+1}, \mu) = b_{X_1 X_2 O_1} \, b_{X_2 X_3 O_2} \cdots b_{X_T X_{T+1} O_T}$$

$$P(O \mid \mu) = \sum_{X} P(O \mid X, \mu) \, P(X \mid \mu) = \sum_{X_1 \ldots X_{T+1}} \pi_{X_1} \prod_{t=1}^{T} a_{X_t X_{t+1}} \, b_{X_t X_{t+1} O_t}$$

Complexity: $(2T + 1) N^{T+1}$, too inefficient.
Better: use dynamic programming to store partial results:

$$\alpha_i(t) = P(O_1 O_2 \ldots O_{t-1}, X_t = s_i \mid \mu)$$

15. 3.1.1 Probability of an observation sequence: The Forward algorithm

1. Initialization: $\alpha_i(1) = \pi_i$, for $1 \le i \le N$
2. Induction: $\alpha_j(t+1) = \sum_{i=1}^{N} \alpha_i(t) \, a_{ij} \, b_{ijO_t}$, for $1 \le t \le T$, $1 \le j \le N$
3. Total: $P(O \mid \mu) = \sum_{i=1}^{N} \alpha_i(T+1)$

Complexity: $2N^2T$
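The three steps of the Forward algorithm can be sketched as follows. This is a minimal illustration, not a reference implementation: it assumes the crazy soft drink machine parameters from this deck and emissions that depend only on the source state ($b_{ijk}$ independent of j); the function name `forward` and the dictionary layout are hypothetical.

```python
STATES = ["CP", "IP"]
PI = {"CP": 1.0, "IP": 0.0}
A = {"CP": {"CP": 0.7, "IP": 0.3},
     "IP": {"CP": 0.5, "IP": 0.5}}
B = {"CP": {"coke": 0.6, "ice_tea": 0.1, "lemon": 0.3},
     "IP": {"coke": 0.1, "ice_tea": 0.7, "lemon": 0.2}}

def forward(obs):
    """Return [alpha(1), ..., alpha(T+1)]; list index t-1 holds alpha(t)."""
    alpha = [{i: PI[i] for i in STATES}]            # 1. initialization
    for o in obs:                                   # 2. induction over t = 1..T
        prev = alpha[-1]
        alpha.append({j: sum(prev[i] * A[i][j] * B[i][o] for i in STATES)
                      for j in STATES})
    return alpha

obs = ["lemon", "ice_tea", "coke"]
alpha = forward(obs)
likelihood = sum(alpha[-1].values())                # 3. total
print(likelihood)
```

Each induction step touches every pair of states once, which is where the N² factor in the stated complexity comes from.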

16. Proof of induction step:

$$\begin{aligned}
\alpha_j(t+1) &= P(O_1 O_2 \ldots O_{t-1} O_t, X_{t+1} = j \mid \mu) \\
&= \sum_{i=1}^{N} P(O_1 O_2 \ldots O_{t-1} O_t, X_t = i, X_{t+1} = j \mid \mu) \\
&= \sum_{i=1}^{N} P(O_t, X_{t+1} = j \mid O_1 O_2 \ldots O_{t-1}, X_t = i, \mu) \, P(O_1 O_2 \ldots O_{t-1}, X_t = i \mid \mu) \\
&= \sum_{i=1}^{N} P(O_1 O_2 \ldots O_{t-1}, X_t = i \mid \mu) \, P(O_t, X_{t+1} = j \mid O_1 O_2 \ldots O_{t-1}, X_t = i, \mu) \\
&= \sum_{i=1}^{N} \alpha_i(t) \, P(O_t, X_{t+1} = j \mid X_t = i, \mu) \\
&= \sum_{i=1}^{N} \alpha_i(t) \, P(O_t \mid X_t = i, X_{t+1} = j, \mu) \, P(X_{t+1} = j \mid X_t = i, \mu) \\
&= \sum_{i=1}^{N} \alpha_i(t) \, b_{ijO_t} \, a_{ij}
\end{aligned}$$

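17. Closeup of the Forward update step

[Figure: states $s_1, \ldots, s_N$ at time t, holding $\alpha_1(t), \ldots, \alpha_N(t)$ with $\alpha_i(t) = P(O_1 \ldots O_{t-1}, X_t = s_i \mid \mu)$, each connected to state $s_j$ at time t+1 by an arc labelled $a_{ij} b_{ijO_t}$; the arcs combine into $\alpha_j(t+1) = P(O_1 \ldots O_t, X_{t+1} = s_j \mid \mu)$.]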

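18. Trellis

[Figure: a trellis with the states $s_1, s_2, s_3, \ldots, s_N$ on the vertical axis (State) and the time steps $1, 2, \ldots, T+1$ on the horizontal axis (Time t).]

Each node $(s_i, t)$ stores information about paths through $s_i$ at time t.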

19. 3.1.2 Probability of an observation sequence: The Backward algorithm

$$\beta_i(t) = P(O_t \ldots O_T \mid X_t = i, \mu)$$

1. Initialization: $\beta_i(T+1) = 1$, for $1 \le i \le N$
2. Induction: $\beta_i(t) = \sum_{j=1}^{N} a_{ij} \, b_{ijO_t} \, \beta_j(t+1)$, for $1 \le t \le T$, $1 \le i \le N$
3. Total: $P(O \mid \mu) = \sum_{i=1}^{N} \pi_i \, \beta_i(1)$

Complexity: $2N^2T$
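The Backward recursion can be sketched in the same style, filling the table from time T+1 down to 1. As before this is only a sketch: it assumes the crazy soft drink machine parameters from this deck and emissions that depend only on the source state, and the function name `backward` is hypothetical.

```python
STATES = ["CP", "IP"]
PI = {"CP": 1.0, "IP": 0.0}
A = {"CP": {"CP": 0.7, "IP": 0.3},
     "IP": {"CP": 0.5, "IP": 0.5}}
B = {"CP": {"coke": 0.6, "ice_tea": 0.1, "lemon": 0.3},
     "IP": {"coke": 0.1, "ice_tea": 0.7, "lemon": 0.2}}

def backward(obs):
    """Return [beta(1), ..., beta(T+1)]; list index t-1 holds beta(t)."""
    beta = [{i: 1.0 for i in STATES}]               # 1. initialization: beta(T+1) = 1
    for o in reversed(obs):                         # 2. induction, t = T down to 1
        nxt = beta[0]
        beta.insert(0, {i: sum(A[i][j] * B[i][o] * nxt[j] for j in STATES)
                        for i in STATES})
    return beta

obs = ["lemon", "ice_tea", "coke"]
beta = backward(obs)
likelihood = sum(PI[i] * beta[0][i] for i in STATES)   # 3. total
print(likelihood)
```

On this observation sequence the total agrees with the value obtained from the Forward recursion on the same model, as the two algorithms compute the same quantity.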

20. The Backward algorithm: Proofs

Induction:

$$\begin{aligned}
\beta_i(t) &= P(O_t O_{t+1} \ldots O_T \mid X_t = i, \mu) \\
&= \sum_{j=1}^{N} P(O_t O_{t+1} \ldots O_T, X_{t+1} = j \mid X_t = i, \mu) \\
&= \sum_{j=1}^{N} P(O_t O_{t+1} \ldots O_T \mid X_t = i, X_{t+1} = j, \mu) \, P(X_{t+1} = j \mid X_t = i, \mu) \\
&= \sum_{j=1}^{N} P(O_{t+1} \ldots O_T \mid O_t, X_t = i, X_{t+1} = j, \mu) \, P(O_t \mid X_t = i, X_{t+1} = j, \mu) \, a_{ij} \\
&= \sum_{j=1}^{N} P(O_{t+1} \ldots O_T \mid X_{t+1} = j, \mu) \, b_{ijO_t} \, a_{ij} \\
&= \sum_{j=1}^{N} \beta_j(t+1) \, b_{ijO_t} \, a_{ij}
\end{aligned}$$

Total:

$$P(O \mid \mu) = \sum_{i=1}^{N} P(O_1 O_2 \ldots O_T \mid X_1 = i, \mu) \, P(X_1 = i \mid \mu) = \sum_{i=1}^{N} \beta_i(1) \, \pi_i$$

21. Combining Forward and Backward probabilities

$$P(O, X_t = i \mid \mu) = \alpha_i(t) \, \beta_i(t)$$

$$P(O \mid \mu) = \sum_{i=1}^{N} \alpha_i(t) \, \beta_i(t) \quad \text{for } 1 \le t \le T+1$$

Proofs:

$$\begin{aligned}
P(O, X_t = i \mid \mu) &= P(O_1 \ldots O_T, X_t = i \mid \mu) \\
&= P(O_1 \ldots O_{t-1}, X_t = i, O_t \ldots O_T \mid \mu) \\
&= P(O_1 \ldots O_{t-1}, X_t = i \mid \mu) \, P(O_t \ldots O_T \mid O_1 \ldots O_{t-1}, X_t = i, \mu) \\
&= \alpha_i(t) \, P(O_t \ldots O_T \mid X_t = i, \mu) \\
&= \alpha_i(t) \, \beta_i(t)
\end{aligned}$$

$$P(O \mid \mu) = \sum_{i=1}^{N} P(O, X_t = i \mid \mu) = \sum_{i=1}^{N} \alpha_i(t) \, \beta_i(t)$$

Note: The “total” forward and backward formulae are special cases of the above one (for t = T + 1 and t = 1, respectively).
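The identity $P(O \mid \mu) = \sum_i \alpha_i(t) \beta_i(t)$ can be checked numerically at every time step. The sketch below uses the occasionally dishonest casino model from this deck; the uniform initial distribution and the roll sequence are assumptions made only for this illustration, and the function names are hypothetical.

```python
STATES = ["F", "L"]
PI = {"F": 0.5, "L": 0.5}   # assumption: the slides give no initial distribution
A = {"F": {"F": 0.95, "L": 0.05},
     "L": {"F": 0.10, "L": 0.90}}
B = {"F": {face: 1 / 6 for face in range(1, 7)},
     "L": {1: 0.1, 2: 0.1, 3: 0.1, 4: 0.1, 5: 0.1, 6: 0.5}}

def forward(obs):
    """alpha(1..T+1); list index t-1 holds alpha(t)."""
    alpha = [dict(PI)]
    for o in obs:
        prev = alpha[-1]
        alpha.append({j: sum(prev[i] * A[i][j] * B[i][o] for i in STATES)
                      for j in STATES})
    return alpha

def backward(obs):
    """beta(1..T+1); list index t-1 holds beta(t)."""
    beta = [{i: 1.0 for i in STATES}]
    for o in reversed(obs):
        nxt = beta[0]
        beta.insert(0, {i: sum(A[i][j] * B[i][o] * nxt[j] for j in STATES)
                        for i in STATES})
    return beta

rolls = [6, 6, 1, 6, 3]     # made-up observation sequence
alpha, beta = forward(rolls), backward(rolls)
# One total per time step t = 1..T+1; all should agree up to rounding.
totals = [sum(alpha[t][i] * beta[t][i] for i in STATES)
          for t in range(len(rolls) + 1)]
print(totals)
```

The first and last entries of `totals` are the Backward and Forward totals respectively, matching the note that those formulae are the special cases t = 1 and t = T + 1.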

22. 3.2 Finding the “best” state sequence

3.2.1 Posterior decoding

One way to find the most likely state sequence underlying the observation sequence: choose the states individually.

$$\gamma_i(t) = P(X_t = i \mid O, \mu)$$

$$\hat{X}_t = \operatorname*{argmax}_{1 \le i \le N} \gamma_i(t) \quad \text{for } 1 \le t \le T+1$$

Computing $\gamma_i(t)$:

$$\gamma_i(t) = P(X_t = i \mid O, \mu) = \frac{P(X_t = i, O \mid \mu)}{P(O \mid \mu)} = \frac{\alpha_i(t) \, \beta_i(t)}{\sum_{j=1}^{N} \alpha_j(t) \, \beta_j(t)}$$

Remark: $\hat{X}$ maximizes the expected number of states that will be guessed correctly. However, it may yield a quite unlikely/unnatural state sequence.
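Posterior decoding follows directly from the forward and backward tables. A minimal sketch, again assuming the crazy soft drink machine parameters from this deck and emissions that depend only on the source state; the function name `posterior_decode` is hypothetical.

```python
STATES = ["CP", "IP"]
PI = {"CP": 1.0, "IP": 0.0}
A = {"CP": {"CP": 0.7, "IP": 0.3},
     "IP": {"CP": 0.5, "IP": 0.5}}
B = {"CP": {"coke": 0.6, "ice_tea": 0.1, "lemon": 0.3},
     "IP": {"coke": 0.1, "ice_tea": 0.7, "lemon": 0.2}}

def posterior_decode(obs):
    """Return (gamma, path): gamma[t-1][i] = P(X_t = i | O), path[t-1] = argmax_i."""
    # Forward table: alpha[t-1] holds alpha(t) for t = 1..T+1.
    alpha = [dict(PI)]
    for o in obs:
        prev = alpha[-1]
        alpha.append({j: sum(prev[i] * A[i][j] * B[i][o] for i in STATES)
                      for j in STATES})
    # Backward table: beta[t-1] holds beta(t) for t = 1..T+1.
    beta = [{i: 1.0 for i in STATES}]
    for o in reversed(obs):
        nxt = beta[0]
        beta.insert(0, {i: sum(A[i][j] * B[i][o] * nxt[j] for j in STATES)
                        for i in STATES})
    gamma = []
    for a_t, b_t in zip(alpha, beta):
        norm = sum(a_t[j] * b_t[j] for j in STATES)   # equals P(O | mu)
        gamma.append({i: a_t[i] * b_t[i] / norm for i in STATES})
    path = [max(STATES, key=lambda i: g[i]) for g in gamma]
    return gamma, path

gamma, path = posterior_decode(["lemon", "ice_tea", "coke"])
print(path)
```

Because each state is chosen independently, nothing in this procedure prevents the returned path from using a transition of probability zero in models that have such transitions, which is the caveat raised in the remark above.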

23. Note

Sometimes it is not the state itself that is of interest, but some other property derived from it. For instance, in the CpG islands example, let g be a function defined on the set of states: g takes the value 1 for $A_+, C_+, G_+, T_+$ and 0 for $A_-, C_-, G_-, T_-$. Then

$$\sum_{j} P(X_t = s_j \mid O) \, g(s_j)$$

designates the posterior probability that the symbol $O_t$ comes from a state in the + set. Thus it is possible to find the most probable label of the state at each position in the output sequence O.
