Introduction to HMM




  1. Introduction to HMM. Joe Wu, Nov 4th 2011

  2. • Agenda. One: the applications of HMM. Two: standard Markov model (example: CG islands discrimination). Three: hidden Markov model (example: CG islands detection). Four: introduce profile HMMs and PSSMs. Five: introduce HMM databases and HMMER3.

  3. • The applications of HMM. Speech recognition (phonemes): the iPhone 4S's Siri, a voice-controlled artificial intelligence system, and Android 4.0's real-time speech-to-text translation. Biological sequence searching and aligning (nucleotides and amino acids): HMMER 3.0, used for searching sequence databases for homologs of protein sequences, and for making protein sequence alignments.

  4. • CG island example. [Structures shown: cytosine (C), thymine (T), 5-methylcytosine.] In the human genome, wherever the dinucleotide CG occurs, the C is typically chemically modified by methylation. There is a relatively high chance of this methyl-C mutating into a T, with the consequence that in general CG dinucleotides are rarer in the genome. For biologically important reasons, the methylation process is suppressed in short stretches of the genome, such as around the promoters or 'start' regions of many genes. In these regions we see many more CG dinucleotides than elsewhere, and in fact more C and G nucleotides in general. Such regions are called CG islands. Given a short stretch of genomic sequence, how would we decide if it comes from a CG island or not? How do we find the CG islands in a long unannotated sequence?

  5. • Standard Markov Model (introduction). Example: ATCGCCGATGGTAATGCCTT, length $L = 20$. States: A, T, C, G. Transition probability: $P_{AT}$ is the probability of A followed by T (Bayes' rule: $P(X, Y) = P(X \mid Y)\,P(Y)$). Markov property: $P(X_i \mid X_{i-1}, \ldots, X_1) = P(X_i \mid X_{i-1})$. Sequence probability: $P(x) = P(X_1) \prod_{i=2}^{L} P(X_i \mid X_{i-1})$, where $P(X_1) = {?}$ is left open until the begin state is introduced on the next slide. Exercise: based on the above Markov chain, show that the sum of the probabilities of all possible sequences of length L is equal to 1.
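To make the chain concrete, here is a minimal sketch (not from the slides) of scoring a sequence under a first-order Markov chain; the uniform transition values are placeholders, not the CG-island estimates given later.

```python
import math

states = "ACGT"
# trans[x][y] = P(next = y | current = x); each row must sum to 1.
# Placeholder values only; slide 7 gives real estimates for CG islands.
trans = {x: {y: 0.25 for y in states} for x in states}
init = {x: 0.25 for x in states}   # P(X_1 = x), also a placeholder

def log_prob(seq):
    """log P(x_1..x_L) = log P(X_1) + sum_i log P(X_i | X_{i-1})."""
    lp = math.log(init[seq[0]])
    for prev, cur in zip(seq, seq[1:]):
        lp += math.log(trans[prev][cur])
    return lp

print(log_prob("ATCGCCGATGGTAATGCCTT"))   # the slide's length-20 example
```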

  6. • Standard Markov Model (with begin and end state). ATCGCCGATGGTAATGCCTT, length $L = 20$. States: B, A, T, C, G, E. Transition probability: $P(X_1 = A) = P_{BA}$, the probability that the sequence begins with A; $P(E \mid X_L = T) = P_{TE}$, the probability that the sequence ends with T. This answers the $P(X_1)$ question from the previous slide. Sequence probability (with length L): $P(x) = P_{B x_1} \left( \prod_{i=2}^{L} P(x_i \mid x_{i-1}) \right) P_{x_L E}$. Exercise: assume that the model has an end state, and that the transition from any state to the end state has probability $\varepsilon$. Show that the sum of the probabilities over all sequences of length L (properly terminating by making a transition to the end state) is $\varepsilon(1-\varepsilon)^{L-1}$. Use this result to show that the sum of the probabilities over all possible sequences of any length is 1.
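A compact sketch of the exercise's solution (my derivation; the slide only states the result): a properly terminating sequence of length L makes L-1 residue-to-residue transitions, each contributing a factor of $(1-\varepsilon)$ once the successor symbol is summed out, plus one final $\varepsilon$-transition to E; summing the resulting geometric series over L gives 1.

```latex
\sum_{x :\, |x| = L} P(x) = \varepsilon\,(1-\varepsilon)^{L-1},
\qquad
\sum_{L \ge 1} \varepsilon\,(1-\varepsilon)^{L-1}
  = \frac{\varepsilon}{1 - (1-\varepsilon)} = 1 .
```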

  7. • Standard Markov Model (CG islands discrimination). Given a short stretch of genomic sequence, how would we decide if it comes from a CG island or not? From a set of human DNA sequences we extracted a total of 48 putative CG islands and derived two Markov chain models, one for the regions labelled as CG islands (the '+' model) and the other from the remainder of the sequence (the '-' model).

Model +    A      C      G      T
  A      0.180  0.274  0.426  0.120
  C      0.171  0.368  0.274  0.188
  G      0.161  0.339  0.375  0.125
  T      0.079  0.355  0.384  0.182

Model -    A      C      G      T
  A      0.300  0.205  0.285  0.210
  C      0.322  0.298  0.078  0.302
  G      0.248  0.246  0.298  0.208
  T      0.177  0.239  0.292  0.292

Log-odds score: $S(x) = \log \dfrac{P(x \mid \text{Model}+)}{P(x \mid \text{Model}-)}$.
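A minimal sketch of this discrimination score, hard-coding the two transition tables above; using base-2 logs is my assumption, since the slide does not state the base.

```python
import math

ORDER = "ACGT"   # column order of the tables above
PLUS = {"A": [0.180, 0.274, 0.426, 0.120], "C": [0.171, 0.368, 0.274, 0.188],
        "G": [0.161, 0.339, 0.375, 0.125], "T": [0.079, 0.355, 0.384, 0.182]}
MINUS = {"A": [0.300, 0.205, 0.285, 0.210], "C": [0.322, 0.298, 0.078, 0.302],
         "G": [0.248, 0.246, 0.298, 0.208], "T": [0.177, 0.239, 0.292, 0.292]}

def log_odds(seq):
    """S(x) = sum_i log( a+_{x_{i-1} x_i} / a-_{x_{i-1} x_i} )."""
    s = 0.0
    for prev, cur in zip(seq, seq[1:]):
        col = ORDER.index(cur)
        s += math.log2(PLUS[prev][col] / MINUS[prev][col])
    return s

# Positive scores favour the CG-island (+) model, negative the - model.
print(log_odds("CGCGCGCG"), log_odds("ATATATAT"))
```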

  8. • Hidden Markov Model (why HMM). How do we find the CG islands in a long unannotated sequence? We can use the standard Markov models to calculate the log-odds score for a window of, say, 100 nucleotides around every nucleotide in the sequence and plot it. (Why 100? The fixed window size is an arbitrary choice, which is one weakness of this approach.) We would rather have a single model that incorporates both Model + and Model -: a hidden Markov model (HMM).
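A hedged sketch of the windowing idea, reusing the log_odds function from the previous sketch; the centred window, truncated at the sequence ends, is my choice, since the slide does not specify how boundaries are handled.

```python
def window_scores(seq, score=log_odds, w=100):
    """Log-odds score of a w-nt window centred on each position of seq."""
    half = w // 2
    return [score(seq[max(0, i - half):i + half]) for i in range(len(seq))]

# Plotting these scores along the sequence makes putative CG islands
# visible as sustained high-scoring stretches.
```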

  9. • Hidden Markov Model (introduction). State path ($\pi$): T- C+ G+ C+ C- ...; sequence (x): T C G C C ... The essential difference between a Markov chain and a hidden Markov model is that for a hidden Markov model there is not a one-to-one correspondence between the states and the symbols (for example, states C+ and C- both emit the symbol C). Therefore we need to distinguish the sequence of states from the sequence of symbols. States: A+, A-, T+, T-, C+, C-, G+, G-. Symbols: A, T, C, G. Transition probability: $a_{kl} = P(\pi_i = l \mid \pi_{i-1} = k)$. Emission probability: $e_k(b) = P(x_i = b \mid \pi_i = k)$. Sequence probability (with state path $\pi$): $P(x, \pi) = a_{0\pi_1} \prod_{i=1}^{L} e_{\pi_i}(x_i)\, a_{\pi_i \pi_{i+1}}$, with $\pi_{L+1} = 0$ (the end state).
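A runnable sketch of $P(x, \pi)$ on a deliberately simplified two-state model ('+' and '-' only, instead of the slide's eight states); all parameter values below are invented for illustration.

```python
import math

STATES = ["+", "-"]
# Transition probabilities a_kl, with 0 as the shared begin/end state.
A = {(0, "+"): 0.5, (0, "-"): 0.5,
     ("+", "+"): 0.8, ("+", "-"): 0.1, ("+", 0): 0.1,
     ("-", "-"): 0.8, ("-", "+"): 0.1, ("-", 0): 0.1}
# Emission probabilities e_k(b): the + state prefers C/G, the - state A/T.
E = {("+", "A"): 0.1, ("+", "C"): 0.4, ("+", "G"): 0.4, ("+", "T"): 0.1,
     ("-", "A"): 0.4, ("-", "C"): 0.1, ("-", "G"): 0.1, ("-", "T"): 0.4}

def joint_log_prob(x, path):
    """log P(x, pi) = log a_{0 pi_1} + sum_i [log e_{pi_i}(x_i) + log a_{pi_i pi_{i+1}}]."""
    lp = math.log(A[(0, path[0])])
    for i in range(len(x)):
        lp += math.log(E[(path[i], x[i])])
        nxt = path[i + 1] if i + 1 < len(path) else 0   # final hop to end state
        lp += math.log(A[(path[i], nxt)])
    return lp

print(joint_log_prob("TCGCC", ["-", "+", "+", "+", "-"]))
```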

  10. • Hidden Markov Model (the most probable path). Many state paths can generate the target sequence! For x = T C G C C ..., possible paths include $\pi_1$ = T+ C+ G- C+ C- ..., $\pi_2$ = T- C+ G+ C+ C+ ..., $\pi_3$ = T- C- G+ C- C- ... The most probable path is $\pi^* = \operatorname{argmax}_\pi P(x, \pi)$, found by the Viterbi algorithm. Recursion: $V_l(i+1) = e_l(x_{i+1}) \max_k \big(V_k(i)\, a_{kl}\big)$, where $V_k(i)$ is the probability of the most probable path up to $x_i$ ending in state k, $V_l(i+1)$ is the probability of the most probable path up to $x_{i+1}$ ending in state l, $a_{kl}$ is the transition probability from state k to state l, and $e_l(x_{i+1})$ is the probability of emitting $x_{i+1}$ from state l. Initialization: $i = 0$, $V_0(0) = 1$, $V_k(0) = 0$ for $k > 0$. Termination: $P(x, \pi^*) = \max_k \big(V_k(L)\, a_{k0}\big)$ and $\pi_L^* = \operatorname{argmax}_k \big(V_k(L)\, a_{k0}\big)$. Note: the start and end state are both labelled 0.
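A sketch of the Viterbi recursion in log space (to avoid underflow), reusing the toy STATES, A, E model from the previous sketch.

```python
import math

def viterbi(x):
    V = {k: math.log(A[(0, k)]) + math.log(E[(k, x[0])]) for k in STATES}
    pointers = []
    for sym in x[1:]:
        # Best predecessor k for each state l, then the new V_l values.
        step = {l: max(STATES, key=lambda k: V[k] + math.log(A[(k, l)]))
                for l in STATES}
        V = {l: V[step[l]] + math.log(A[(step[l], l)]) + math.log(E[(l, sym)])
             for l in STATES}
        pointers.append(step)
    # Termination: include the transition into the end state (labelled 0).
    last = max(STATES, key=lambda k: V[k] + math.log(A[(k, 0)]))
    path = [last]
    for step in reversed(pointers):
        path.append(step[path[-1]])   # follow the traceback pointers
    return path[::-1]

print(viterbi("TCGCC"))   # most probable state path under the toy model
```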

  11. • Hidden Markov Model (the full sequence probability). We must add the probabilities for all possible paths to obtain the full probability of x: for x = T C G C C ..., e.g. $\pi_1$ = T- C+ G+ C+ C- ..., $\pi_2$ = T+ C- G+ C- C- ..., $\pi_3$ = T- C- G+ C+ C- ... The full probability of x is $P(x) = \sum_\pi P(x, \pi)$, computed by the forward algorithm. Recursion: $f_h(i+1) = e_h(x_{i+1}) \sum_k f_k(i)\, a_{kh}$, where $f_k(i) = P(x_1 \ldots x_i, \pi_i = k)$ is the probability of the sequence up to $x_i$ ending in state k, $f_h(i+1)$ is the probability of the sequence up to $x_{i+1}$ ending in state h, $a_{kh}$ is the transition probability from state k to state h, and $e_h(x_{i+1})$ is the probability of emitting $x_{i+1}$ from state h. Initialization: $i = 0$, $f_0(0) = 1$, $f_k(0) = 0$ for $k > 0$. Termination: $P(x) = \sum_k f_k(L)\, a_{k0}$. Note: the start and end state are both labelled 0.
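A sketch of the forward recursion on the same toy model; for clarity it works with raw probabilities, though real implementations rescale or use logs to avoid underflow.

```python
def forward_prob(x):
    f = {k: A[(0, k)] * E[(k, x[0])] for k in STATES}          # f_k(1)
    for sym in x[1:]:
        f = {h: E[(h, sym)] * sum(f[k] * A[(k, h)] for k in STATES)
             for h in STATES}                                  # f_h(i+1)
    return sum(f[k] * A[(k, 0)] for k in STATES)               # P(x)

print(forward_prob("TCGCC"))   # full probability, summed over all paths
```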

  12. • Hidden Markov Model (the posterior state probabilities). What if different paths have almost the same probability as the most probable one? We need posterior decoding. The posterior probability is $P(\pi_i = k \mid x) = P(x, \pi_i = k) / P(x)$, where $P(x)$ comes from the forward algorithm and $P(x, \pi_i = k) = P(x_1 \ldots x_i, \pi_i = k)\, P(x_{i+1} \ldots x_L \mid x_1 \ldots x_i, \pi_i = k) = f_k(i)\, P(x_{i+1} \ldots x_L \mid \pi_i = k) = f_k(i)\, b_k(i)$; $f_k(i)$ comes from the forward algorithm and $b_k(i)$ from the backward algorithm. Backward algorithm recursion: $b_k(i) = \sum_h a_{kh}\, e_h(x_{i+1})\, b_h(i+1)$, where $b_k(i) = P(x_{i+1} \ldots x_L \mid \pi_i = k)$, $b_h(i+1) = P(x_{i+2} \ldots x_L \mid \pi_{i+1} = h)$, $a_{kh}$ is the transition probability from state k to state h, and $e_h(x_{i+1})$ is the probability of emitting $x_{i+1}$ from state h. Initialization: $b_k(L) = a_{k0}$. Termination: $P(x) = \sum_h a_{0h}\, e_h(x_1)\, b_h(1)$. Note: the start and end state are both labelled 0.
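A sketch combining the forward and backward recursions into posterior probabilities, again on the toy model; indices are 0-based in the code and 1-based in the slide's notation.

```python
def posterior(x):
    # Forward: fwd[i][k] = f_k(i+1)
    fwd = [{k: A[(0, k)] * E[(k, x[0])] for k in STATES}]
    for sym in x[1:]:
        fwd.append({h: E[(h, sym)] * sum(fwd[-1][k] * A[(k, h)] for k in STATES)
                    for h in STATES})
    # Backward: bwd[i][k] = b_k(i+1), initialised with b_k(L) = a_{k0}
    bwd = [{k: A[(k, 0)] for k in STATES}]
    for sym in reversed(x[1:]):                 # x_L, x_{L-1}, ..., x_2
        nxt = bwd[0]
        bwd.insert(0, {k: sum(A[(k, h)] * E[(h, sym)] * nxt[h] for h in STATES)
                       for k in STATES})
    px = sum(fwd[-1][k] * A[(k, 0)] for k in STATES)           # P(x)
    return [{k: fwd[i][k] * bwd[i][k] / px for k in STATES}    # P(pi_i = k | x)
            for i in range(len(x))]

for i, p in enumerate(posterior("TCGCC"), start=1):
    print(i, p)   # each row sums to 1 across states
```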

  13. • Hidden Markov Model (parameter estimation). How do we specify the model in the first place? Step 1: design the structure (states, connections). Step 2: estimate the transition probabilities $a_{kh}$ and emission probabilities $e_k(b)$. Estimation when the state sequence is known: $a_{kh} = A_{kh} / \sum_{h'} A_{kh'}$ and $e_k(b) = E_k(b) / \sum_{b'} E_k(b')$, where $A_{kh}$ is the number of transitions from k to h in the training data plus $r_{kh}$, and $E_k(b)$ is the number of emissions of b from k in the training data plus $r_k(b)$. Note: $r_{kh}$ and $r_k(b)$ are pseudocounts. Estimation when the state sequence is unknown: the Baum-Welch algorithm or Viterbi training.
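A sketch of the counting estimators for the known-path case; the uniform pseudocount r = 1 is my assumption, since the slide only names $r_{kh}$ and $r_k(b)$ without giving values.

```python
from collections import defaultdict

def estimate_params(training_pairs, states, symbols, r=1.0):
    A_cnt = defaultdict(lambda: r)   # A_kh: transition counts plus pseudocount
    E_cnt = defaultdict(lambda: r)   # E_k(b): emission counts plus pseudocount
    for x, path in training_pairs:
        for st, sym in zip(path, x):
            E_cnt[(st, sym)] += 1
        for k, h in zip(path, path[1:]):
            A_cnt[(k, h)] += 1
    a = {(k, h): A_cnt[(k, h)] / sum(A_cnt[(k, h2)] for h2 in states)
         for k in states for h in states}
    e = {(k, b): E_cnt[(k, b)] / sum(E_cnt[(k, b2)] for b2 in symbols)
         for k in states for b in symbols}
    return a, e

a_hat, e_hat = estimate_params([("TCGCC", ["-", "+", "+", "+", "-"])],
                               ["+", "-"], "ACGT")
print(a_hat[("+", "+")], e_hat[("+", "C")])
```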

