Hidden Markov Models


  1. Hidden Markov Models
Based on:
• "Foundations of Statistical NLP" by C. Manning & H. Schütze, ch. 9, MIT Press, 2002
• "Biological Sequence Analysis", R. Durbin et al., ch. 3 and 11.6, Cambridge University Press, 1998

  2. PLAN
1 Markov Models; Markov assumptions
2 Hidden Markov Models
3 Fundamental questions for HMMs
  3.1 Probability of an observation sequence: the Forward algorithm, the Backward algorithm
  3.2 Finding the "best" sequence: the Viterbi algorithm
  3.3 HMM parameter estimation: the Forward-Backward (EM) algorithm
4 HMM extensions
5 Applications

  3. 1 Markov Models (generally)
Markov Models are used to model a sequence of random variables in which each element depends on previous elements.
X = ⟨X_1 ... X_T⟩, with X_t ∈ S = {s_1, ..., s_N}. X is also called a Markov Process or Markov Chain.
S = set of states
Π = initial state probabilities: π_i = P(X_1 = s_i), with Σ_{i=1}^N π_i = 1
A = transition probabilities: a_ij = P(X_{t+1} = s_j | X_t = s_i), with Σ_{j=1}^N a_ij = 1 for all i

  4. Markov assumptions
• Limited Horizon: P(X_{t+1} = s_i | X_1 ... X_t) = P(X_{t+1} = s_i | X_t) (first-order Markov model)
• Time Invariance: P(X_{t+1} = s_j | X_t = s_i) = a_ij for all t
Probability of a Markov Chain:
P(X_1 ... X_T) = P(X_1) P(X_2 | X_1) P(X_3 | X_1 X_2) ... P(X_T | X_1 X_2 ... X_{T-1})
= P(X_1) P(X_2 | X_1) P(X_3 | X_2) ... P(X_T | X_{T-1})
= π_{X_1} ∏_{t=1}^{T-1} a_{X_t X_{t+1}}
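As a quick worked illustration of the last formula, here is a minimal Python sketch (the function name and array conventions are mine, not from the slides): the chain probability is just π_{X_1} times a product of transition-matrix entries.

```python
import numpy as np

def chain_prob(pi, A, X):
    """P(X_1 ... X_T) = pi_{X_1} * prod_{t=1}^{T-1} a_{X_t X_{t+1}}.

    pi: (N,) initial state probabilities
    A:  (N, N) transition matrix, A[i, j] = a_ij
    X:  state sequence given as a list of state indices
    """
    p = pi[X[0]]
    for i, j in zip(X, X[1:]):   # multiply the probability of each step i -> j
        p *= A[i, j]
    return p
```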

  5. A 1st Markov chain example: DNA (from [Durbin et al., 1998])
[Figure: a four-state Markov chain over the nucleotides A, C, G, T, with transitions between every pair of states. Note: here we leave the transition probabilities unspecified.]

  6. A 2nd Markov chain example: CpG islands in DNA sequences
Maximum Likelihood estimation of parameters using real data (+ and -):
a+_st = c+_st / Σ_{t'} c+_{st'}   and   a-_st = c-_st / Σ_{t'} c-_{st'}
where c+_st (resp. c-_st) counts the s → t transitions in the '+' (resp. '-') training data.

+ |  A      C      G      T
A |  0.180  0.274  0.426  0.120
C |  0.171  0.368  0.274  0.188
G |  0.161  0.339  0.375  0.125
T |  0.079  0.355  0.384  0.182

- |  A      C      G      T
A |  0.300  0.205  0.285  0.210
C |  0.322  0.298  0.078  0.302
G |  0.248  0.246  0.298  0.208
T |  0.177  0.239  0.292  0.292
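The estimates above are simply normalized transition counts. A minimal sketch of this Maximum Likelihood computation (Python/NumPy assumed; the names are mine), to be run once on '+' training data and once on '-' data:

```python
import numpy as np

def ml_transition_estimates(sequences, n_states=4):
    """Estimate a_st = c_st / sum_{t'} c_{st'} from training sequences.

    Each sequence is a list of state indices, e.g. A=0, C=1, G=2, T=3.
    """
    c = np.zeros((n_states, n_states))
    for seq in sequences:
        for s, t in zip(seq, seq[1:]):
            c[s, t] += 1                       # count the transition s -> t
    return c / c.sum(axis=1, keepdims=True)    # normalize each row
```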

  7. Using log-likelihood (log-odds) ratios for discrimination
S(x) = log2 [P(x | model+) / P(x | model-)] = Σ_{i=1}^L log2 (a+_{x_{i-1} x_i} / a-_{x_{i-1} x_i}) = Σ_{i=1}^L β_{x_{i-1} x_i}

β |  A       C      G      T
A | -0.740   0.419  0.580  -0.803
C | -0.913   0.302  1.812  -0.685
G | -0.624   0.461  0.331  -0.730
T | -1.169   0.573  0.393  -0.679
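A worked example of the discrimination score (a sketch; the β values are transcribed from the table above, everything else is my own scaffolding):

```python
import numpy as np

# beta_{st} = log2(a+_{st} / a-_{st}); rows and columns in the order A, C, G, T
BETA = np.array([
    [-0.740, 0.419, 0.580, -0.803],
    [-0.913, 0.302, 1.812, -0.685],
    [-0.624, 0.461, 0.331, -0.730],
    [-1.169, 0.573, 0.393, -0.679],
])
IDX = {base: i for i, base in enumerate("ACGT")}

def log_odds_score(x):
    """S(x) = sum_i beta_{x_{i-1} x_i}; S(x) > 0 favours the '+' (CpG) model."""
    return sum(BETA[IDX[a], IDX[b]] for a, b in zip(x, x[1:]))

print(log_odds_score("CGCGCG"))   # clearly positive: looks like a CpG island
```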

  8. 2 Hidden Markov Models
K = output alphabet = {k_1, ..., k_M}
B = output emission probabilities: b_ijk = P(O_t = k | X_t = s_i, X_{t+1} = s_j)
Notice that b_ijk does not depend on t.
In HMMs we only observe a probabilistic function of the state sequence: ⟨O_1 ... O_T⟩. When the state sequence ⟨X_1 ... X_T⟩ is also observable, we speak of a Visible Markov Model (VMM).
Remark: In all our subsequent examples, b_ijk is independent of j.

  9. A program for an HMM
t = 1;
start in state s_i with probability π_i (i.e., X_1 = i);
forever do:
    move from state s_i to state s_j with probability a_ij (i.e., X_{t+1} = j);
    emit observation symbol O_t = k with probability b_ijk;
    t = t + 1;
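The program above, rendered as a runnable Python sketch (arc-emission conventions as defined on slide 8; the function name and NumPy setup are my own):

```python
import numpy as np

def sample_hmm(pi, A, B, T, seed=0):
    """Generate T observations from an HMM mu = (A, B, Pi).

    pi: (N,) initial probabilities; A: (N, N) with A[i, j] = a_ij;
    B:  (N, N, M) with B[i, j, k] = b_ijk (emission tied to the arc i -> j).
    Returns the hidden state sequence X (length T + 1) and the observations O.
    """
    rng = np.random.default_rng(seed)
    N, M = len(pi), B.shape[2]
    X = [rng.choice(N, p=pi)]               # X_1 = i with probability pi_i
    O = []
    for _ in range(T):
        i = X[-1]
        j = rng.choice(N, p=A[i])           # X_{t+1} = j with probability a_ij
        O.append(rng.choice(M, p=B[i, j]))  # O_t = k with probability b_ijk
        X.append(j)
    return X, O
```

With the casino parameters of slide 11, for instance, this generates die rolls together with the hidden fair/loaded labels.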

  10. A 1st HMM example: CpG islands (from [Durbin et al., 1998])
[Figure: an eight-state HMM with '+' states A+, C+, G+, T+ and '-' states A-, C-, G-, T-.]
Notes:
1. In addition to the transitions shown, there is also a complete set of transitions within each set ('+' and '-', respectively).
2. Transition probabilities in this model are set so that within each group they are close to the transition probabilities of the original model, but there is also a small chance of switching into the other component. Overall, there is more chance of switching from '+' to '-' than vice versa.

  11. A 2nd HMM example: The occasionally dishonest casino (from [Durbin et al., 1998])
[Figure: a two-state HMM. State F (fair die) emits 1, 2, 3, 4, 5, 6 each with probability 1/6; state L (loaded die) emits 1 through 5 each with probability 1/10 and 6 with probability 1/2. Transitions: F→F 0.95, F→L 0.05, L→L 0.9, L→F 0.1; the figure also shows the values 0.99 and 0.01, apparently the initial probabilities of F and L.]

  12. A 3rd HMM example: The crazy soft drink machine (from [Manning & Schütze, 2000])
[Figure: a two-state HMM with states Cola Preference (CP) and Iced Tea Preference (IP); π_CP = 1.
Transitions: CP→CP 0.7, CP→IP 0.3, IP→IP 0.5, IP→CP 0.5.
Emissions from CP: P(Coke) = 0.6, P(Iced tea) = 0.1, P(Lemonade) = 0.3.
Emissions from IP: P(Coke) = 0.1, P(Iced tea) = 0.7, P(Lemonade) = 0.2.]

  13. A 4th example: A tiny HMM for 5' splice site recognition (from [Eddy, 2004])
[Figure omitted.]

  14. 3 Three fundamental questions for HMMs
1. Probability of an Observation Sequence: Given a model μ = (A, B, Π) over S, K, how do we (efficiently) compute the likelihood of a particular sequence, P(O | μ)?
2. Finding the "Best" State Sequence: Given an observation sequence and a model, how do we choose a state sequence (X_1, ..., X_{T+1}) that best explains the observation sequence?
3. HMM Parameter Estimation: Given an observation sequence (or a corpus thereof), how do we acquire a model μ = (A, B, Π) that best explains the data?

  15. 3.1 Probability of an observation sequence
P(O | X, μ) = ∏_{t=1}^T P(O_t | X_t, X_{t+1}, μ) = b_{X_1 X_2 O_1} b_{X_2 X_3 O_2} ... b_{X_T X_{T+1} O_T}
P(O | μ) = Σ_X P(O | X, μ) P(X | μ) = Σ_{X_1 ... X_{T+1}} π_{X_1} ∏_{t=1}^T a_{X_t X_{t+1}} b_{X_t X_{t+1} O_t}
Complexity: (2T + 1) · N^{T+1} multiplications, too inefficient.
Better: use dynamic programming to store partial results:
α_i(t) = P(O_1 O_2 ... O_{t-1}, X_t = s_i | μ)

  16. 3.1.1 Probability of an observation sequence: The Forward algorithm
1. Initialization: α_i(1) = π_i, for 1 ≤ i ≤ N
2. Induction: α_j(t + 1) = Σ_{i=1}^N α_i(t) a_ij b_{ij O_t}, for 1 ≤ t ≤ T, 1 ≤ j ≤ N
3. Total: P(O | μ) = Σ_{i=1}^N α_i(T + 1)
Complexity: 2N²T multiplications
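A direct NumPy transcription of the three steps (a sketch; array layout as in the sampler sketch above, function names mine):

```python
import numpy as np

def forward(pi, A, B, O):
    """Forward probabilities: alpha[t - 1, i] = alpha_i(t), for t = 1 .. T + 1.

    pi: (N,); A: (N, N); B: (N, N, M) with B[i, j, k] = b_ijk;
    O:  observations as a list of symbol indices, length T.
    """
    T, N = len(O), len(pi)
    alpha = np.zeros((T + 1, N))
    alpha[0] = pi                                  # alpha_i(1) = pi_i
    for t in range(T):
        # alpha_j(t + 1) = sum_i alpha_i(t) * a_ij * b_{ij O_t}
        alpha[t + 1] = alpha[t] @ (A * B[:, :, O[t]])
    return alpha

def likelihood(pi, A, B, O):
    """P(O | mu) = sum_i alpha_i(T + 1)."""
    return forward(pi, A, B, O)[-1].sum()
```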

  17. Proof of the induction step:
α_j(t + 1) = P(O_1 O_2 ... O_{t-1} O_t, X_{t+1} = j | μ)
= Σ_{i=1}^N P(O_1 O_2 ... O_{t-1} O_t, X_t = i, X_{t+1} = j | μ)
= Σ_{i=1}^N P(O_t, X_{t+1} = j | O_1 O_2 ... O_{t-1}, X_t = i, μ) P(O_1 O_2 ... O_{t-1}, X_t = i | μ)
= Σ_{i=1}^N P(O_1 O_2 ... O_{t-1}, X_t = i | μ) P(O_t, X_{t+1} = j | O_1 O_2 ... O_{t-1}, X_t = i, μ)
= Σ_{i=1}^N α_i(t) P(O_t, X_{t+1} = j | X_t = i, μ)   (by the Markov assumptions)
= Σ_{i=1}^N α_i(t) P(O_t | X_t = i, X_{t+1} = j, μ) P(X_{t+1} = j | X_t = i, μ) = Σ_{i=1}^N α_i(t) b_{ij O_t} a_ij

  18. Closeup of the Forward update step
[Figure: states s_1, ..., s_N at time t, each carrying α_i(t) = P(O_1 ... O_{t-1}, X_t = s_i | μ), feed state s_j at time t + 1 through arcs weighted a_ij b_{ij O_t}; summing the contributions α_i(t) a_ij b_{ij O_t} over i gives α_j(t + 1) = P(O_1 ... O_t, X_{t+1} = s_j | μ).]

  19. Trellis
[Figure: a trellis with states s_1, s_2, ..., s_N on the vertical axis and time 1, 2, ..., T + 1 on the horizontal axis. Each node (s_i, t) stores information about paths through s_i at time t.]

  20. 3.1.2 Probability of an observation sequence: The Backward algorithm
β_i(t) = P(O_t ... O_T | X_t = i, μ)
1. Initialization: β_i(T + 1) = 1, for 1 ≤ i ≤ N
2. Induction: β_i(t) = Σ_{j=1}^N a_ij b_{ij O_t} β_j(t + 1), for 1 ≤ t ≤ T, 1 ≤ i ≤ N
3. Total: P(O | μ) = Σ_{i=1}^N π_i β_i(1)
Complexity: 2N²T multiplications
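The Backward recursion in the same style (a sketch, sharing the conventions of the `forward` sketch above):

```python
import numpy as np

def backward(pi, A, B, O):
    """Backward probabilities: beta[t - 1, i] = beta_i(t), for t = 1 .. T + 1."""
    T, N = len(O), len(pi)
    beta = np.zeros((T + 1, N))
    beta[T] = 1.0                                  # beta_i(T + 1) = 1
    for t in range(T - 1, -1, -1):
        # beta_i(t) = sum_j a_ij * b_{ij O_t} * beta_j(t + 1)
        beta[t] = (A * B[:, :, O[t]]) @ beta[t + 1]
    return beta
```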

  21. The Backward algorithm: Proofs
Induction:
β_i(t) = P(O_t O_{t+1} ... O_T | X_t = i, μ)
= Σ_{j=1}^N P(O_t O_{t+1} ... O_T, X_{t+1} = j | X_t = i, μ)
= Σ_{j=1}^N P(O_t O_{t+1} ... O_T | X_t = i, X_{t+1} = j, μ) P(X_{t+1} = j | X_t = i, μ)
= Σ_{j=1}^N P(O_{t+1} ... O_T | O_t, X_t = i, X_{t+1} = j, μ) P(O_t | X_t = i, X_{t+1} = j, μ) a_ij
= Σ_{j=1}^N P(O_{t+1} ... O_T | X_{t+1} = j, μ) b_{ij O_t} a_ij = Σ_{j=1}^N β_j(t + 1) b_{ij O_t} a_ij
Total:
P(O | μ) = Σ_{i=1}^N P(O_1 O_2 ... O_T | X_1 = i, μ) P(X_1 = i | μ) = Σ_{i=1}^N β_i(1) π_i

  22. Combining Forward and Backward probabilities
P(O, X_t = i | μ) = α_i(t) β_i(t)
P(O | μ) = Σ_{i=1}^N α_i(t) β_i(t), for 1 ≤ t ≤ T + 1
Proofs:
P(O, X_t = i | μ) = P(O_1 ... O_T, X_t = i | μ)
= P(O_1 ... O_{t-1}, X_t = i, O_t ... O_T | μ)
= P(O_1 ... O_{t-1}, X_t = i | μ) P(O_t ... O_T | O_1 ... O_{t-1}, X_t = i, μ)
= α_i(t) P(O_t ... O_T | X_t = i, μ) = α_i(t) β_i(t)
P(O | μ) = Σ_{i=1}^N P(O, X_t = i | μ) = Σ_{i=1}^N α_i(t) β_i(t)
Note: The "total" forward and backward formulae are special cases of the above (for t = T + 1 and t = 1, respectively).
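This identity doubles as a sanity check for implementations: Σ_i α_i(t) β_i(t) must return the same value of P(O | μ) at every t. A quick check with the crazy soft drink machine of slide 12, reusing the `forward` and `backward` sketches above (the parameter layout is my assumption):

```python
import numpy as np

# Crazy soft drink machine: states CP, IP; symbols coke, iced tea, lemonade.
pi = np.array([1.0, 0.0])
A = np.array([[0.7, 0.3],
              [0.5, 0.5]])
emit = np.array([[0.6, 0.1, 0.3],            # emissions from CP
                 [0.1, 0.7, 0.2]])           # emissions from IP
B = np.repeat(emit[:, None, :], 2, axis=1)   # b_ijk independent of j here

O = [0, 1, 0]                                # coke, iced tea, coke
alpha, beta = forward(pi, A, B, O), backward(pi, A, B, O)
print((alpha * beta).sum(axis=1))            # same value P(O | mu) at every t
```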

  23. 3.2 Finding the "best" state sequence
3.2.1 Posterior decoding
One way to find the most likely state sequence underlying the observation sequence is to choose the states individually:
γ_i(t) = P(X_t = i | O, μ)
X̂_t = argmax_{1 ≤ i ≤ N} γ_i(t), for 1 ≤ t ≤ T + 1
Computing γ_i(t):
γ_i(t) = P(X_t = i | O, μ) = P(X_t = i, O | μ) / P(O | μ) = α_i(t) β_i(t) / Σ_{j=1}^N α_j(t) β_j(t)
Remark: X̂ maximizes the expected number of states that will be guessed correctly. However, it may yield a quite unlikely/unnatural state sequence.
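Posterior decoding on top of the two sketches above (again a sketch, with my own names):

```python
import numpy as np

def posterior_decode(pi, A, B, O):
    """gamma[t - 1, i] = P(X_t = i | O, mu); returns (gamma, decoded states)."""
    alpha, beta = forward(pi, A, B, O), backward(pi, A, B, O)
    gamma = alpha * beta                       # proportional to P(X_t = i, O | mu)
    gamma /= gamma.sum(axis=1, keepdims=True)  # normalize by P(O | mu)
    return gamma, gamma.argmax(axis=1)         # X_hat_t = argmax_i gamma_i(t)
```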

  24. Note
Sometimes it is not the state itself that is of interest, but some property derived from it. For instance, in the CpG islands example, let g be a function defined on the set of states: g takes the value 1 for A+, C+, G+, T+ and 0 for A-, C-, G-, T-. Then
Σ_j P(X_t = s_j | O) g(s_j)
is the posterior probability that the symbol O_t came from a state in the '+' set. Thus it is possible to find the most probable label of the state at each position in the output sequence O.
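On top of the γ matrix from the posterior-decoding sketch, this label probability is a one-liner (a sketch; the '+' state indices depend on how the implementation orders the states):

```python
def plus_label_posterior(gamma, plus_states):
    """P(g(X_t) = 1 | O) = sum_j gamma_j(t) * g(s_j),
    where plus_states lists the indices of A+, C+, G+, T+."""
    return gamma[:, plus_states].sum(axis=1)
```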
