Introduction to HMM




  1. Introduction to HMM. Joe Wu, Nov 4th 2011

  2. • Agenda. One: the applications of HMM. Two: standard Markov model (example: CG islands discrimination). Three: hidden Markov model (example: CG islands detection). Four: introduce profile HMMs and PSSMs. Five: introduce HMM databases and HMMER3.

  3. • The applications of HMM. Speech recognition (phonemes): the iPhone 4S's Siri, a voice-controlled artificial intelligence system, and Android 4.0's real-time speech-to-text translation. Biological sequence searching and aligning (nucleotides and amino acids): HMMER 3.0, used for searching sequence databases for homologs of protein sequences, and for making protein sequence alignments.

  4. • CG island example. [Structures shown: cytosine (C), thymine (T), 5-methylcytosine.] In the human genome, wherever the dinucleotide CG occurs, the C is typically chemically modified by methylation. There is a relatively high chance of this methyl-C mutating into a T, with the consequence that in general CG dinucleotides are rarer in the genome. For biologically important reasons, the methylation process is suppressed in short stretches of the genome, such as around the promoters or 'start' regions of many genes. In these regions we see many more CG dinucleotides than elsewhere, and in fact more C and G nucleotides in general. Such regions are called CG islands. Given a short stretch of genomic sequence, how would we decide if it comes from a CG island or not? How do we find the CG islands in a long unannotated sequence?

  5. • Standard Markov Model (introduction). Example: ATCGCCGATGGTAATGCCTT, length $L = 20$. States: A, T, C, G. Transition probability: $P_{AT}$ is the probability of A followed by T (Bayes' rule: $P(X, Y) = P(X \mid Y)\,P(Y)$). Markov property: $P(X_i \mid X_{i-1}, \ldots, X_1) = P(X_i \mid X_{i-1})$. Sequence probability: $P(x) = P(X_1) \prod_{i=2}^{L} P(X_i \mid X_{i-1})$, where $P(X_1) = {?}$ is left open until the begin state is introduced on the next slide. Exercise: based on the above Markov chain, show that the sum of the probabilities of all possible sequences of length L is equal to 1.
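To make the chain concrete, here is a minimal sketch (not from the slides) of scoring a sequence under a first-order Markov chain; the uniform transition values are placeholders, not the CG-island estimates given later.

```python
import math

states = "ACGT"
# trans[x][y] = P(next = y | current = x); each row must sum to 1.
# Placeholder values only; slide 7 gives real estimates for CG islands.
trans = {x: {y: 0.25 for y in states} for x in states}
init = {x: 0.25 for x in states}   # P(X_1 = x), also a placeholder

def log_prob(seq):
    """log P(x_1..x_L) = log P(X_1) + sum_i log P(X_i | X_{i-1})."""
    lp = math.log(init[seq[0]])
    for prev, cur in zip(seq, seq[1:]):
        lp += math.log(trans[prev][cur])
    return lp

print(log_prob("ATCGCCGATGGTAATGCCTT"))   # the slide's length-20 example
```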

  6. • Standard Markov Model (with begin and end state). ATCGCCGATGGTAATGCCTT, length $L = 20$. States: B, A, T, C, G, E. Transition probability: $P(X_1 = A) = P_{BA}$, the probability that the sequence begins with A; $P(E \mid X_L = T) = P_{TE}$, the probability that the sequence ends with T. This answers the $P(X_1)$ question from the previous slide. Sequence probability (with length L): $P(x) = P_{B x_1} \left( \prod_{i=2}^{L} P(x_i \mid x_{i-1}) \right) P_{x_L E}$. Exercise: assume that the model has an end state, and that the transition from any state to the end state has probability $\varepsilon$. Show that the sum of the probabilities over all sequences of length L (properly terminating by making a transition to the end state) is $\varepsilon(1-\varepsilon)^{L-1}$. Use this result to show that the sum of the probabilities over all possible sequences of any length is 1.
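A compact sketch of the exercise's solution (my derivation; the slide only states the result): a properly terminating sequence of length L makes L-1 residue-to-residue transitions, each contributing a factor of $(1-\varepsilon)$ once the successor symbol is summed out, plus one final $\varepsilon$-transition to E; summing the resulting geometric series over L gives 1.

```latex
\sum_{x :\, |x| = L} P(x) = \varepsilon\,(1-\varepsilon)^{L-1},
\qquad
\sum_{L \ge 1} \varepsilon\,(1-\varepsilon)^{L-1}
  = \frac{\varepsilon}{1 - (1-\varepsilon)} = 1 .
```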

  7. • Standard Markov Model (CG islands discrimination). Given a short stretch of genomic sequence, how would we decide if it comes from a CG island or not? From a set of human DNA sequences we extracted a total of 48 putative CG islands and derived two Markov chain models, one for the regions labelled as CG islands (the '+' model) and the other from the remainder of the sequence (the '-' model).

Model +    A      C      G      T
  A      0.180  0.274  0.426  0.120
  C      0.171  0.368  0.274  0.188
  G      0.161  0.339  0.375  0.125
  T      0.079  0.355  0.384  0.182

Model -    A      C      G      T
  A      0.300  0.205  0.285  0.210
  C      0.322  0.298  0.078  0.302
  G      0.248  0.246  0.298  0.208
  T      0.177  0.239  0.292  0.292

Log-odds score: $S(x) = \log \dfrac{P(x \mid \text{Model}+)}{P(x \mid \text{Model}-)}$.
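A minimal sketch of this discrimination score, hard-coding the two transition tables above; using base-2 logs is my assumption, since the slide does not state the base.

```python
import math

ORDER = "ACGT"   # column order of the tables above
PLUS = {"A": [0.180, 0.274, 0.426, 0.120], "C": [0.171, 0.368, 0.274, 0.188],
        "G": [0.161, 0.339, 0.375, 0.125], "T": [0.079, 0.355, 0.384, 0.182]}
MINUS = {"A": [0.300, 0.205, 0.285, 0.210], "C": [0.322, 0.298, 0.078, 0.302],
         "G": [0.248, 0.246, 0.298, 0.208], "T": [0.177, 0.239, 0.292, 0.292]}

def log_odds(seq):
    """S(x) = sum_i log( a+_{x_{i-1} x_i} / a-_{x_{i-1} x_i} )."""
    s = 0.0
    for prev, cur in zip(seq, seq[1:]):
        col = ORDER.index(cur)
        s += math.log2(PLUS[prev][col] / MINUS[prev][col])
    return s

# Positive scores favour the CG-island (+) model, negative the - model.
print(log_odds("CGCGCGCG"), log_odds("ATATATAT"))
```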

  8. • Hidden Markov Model (why HMM). How do we find the CG islands in a long unannotated sequence? We can use the standard Markov models to calculate the log-odds score for a window of, say, 100 nucleotides around every nucleotide in the sequence and plot it. (Why 100? The fixed window size is an arbitrary choice, which is one weakness of this approach.) We would rather have a single model that incorporates both Model + and Model -: a hidden Markov model (HMM).
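A hedged sketch of the windowing idea, reusing the log_odds function from the previous sketch; the centred window, truncated at the sequence ends, is my choice, since the slide does not specify how boundaries are handled.

```python
def window_scores(seq, score=log_odds, w=100):
    """Log-odds score of a w-nt window centred on each position of seq."""
    half = w // 2
    return [score(seq[max(0, i - half):i + half]) for i in range(len(seq))]

# Plotting these scores along the sequence makes putative CG islands
# visible as sustained high-scoring stretches.
```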

  9. • Hidden Markov Model (introduction). State path ($\pi$): T- C+ G+ C+ C- ...; sequence (x): T C G C C ... The essential difference between a Markov chain and a hidden Markov model is that for a hidden Markov model there is not a one-to-one correspondence between the states and the symbols (for example, states C+ and C- both emit the symbol C). Therefore we need to distinguish the sequence of states from the sequence of symbols. States: A+, A-, T+, T-, C+, C-, G+, G-. Symbols: A, T, C, G. Transition probability: $a_{kl} = P(\pi_i = l \mid \pi_{i-1} = k)$. Emission probability: $e_k(b) = P(x_i = b \mid \pi_i = k)$. Sequence probability (with state path $\pi$): $P(x, \pi) = a_{0\pi_1} \prod_{i=1}^{L} e_{\pi_i}(x_i)\, a_{\pi_i \pi_{i+1}}$, with $\pi_{L+1} = 0$ (the end state).
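A runnable sketch of $P(x, \pi)$ on a deliberately simplified two-state model ('+' and '-' only, instead of the slide's eight states); all parameter values below are invented for illustration.

```python
import math

STATES = ["+", "-"]
# Transition probabilities a_kl, with 0 as the shared begin/end state.
A = {(0, "+"): 0.5, (0, "-"): 0.5,
     ("+", "+"): 0.8, ("+", "-"): 0.1, ("+", 0): 0.1,
     ("-", "-"): 0.8, ("-", "+"): 0.1, ("-", 0): 0.1}
# Emission probabilities e_k(b): the + state prefers C/G, the - state A/T.
E = {("+", "A"): 0.1, ("+", "C"): 0.4, ("+", "G"): 0.4, ("+", "T"): 0.1,
     ("-", "A"): 0.4, ("-", "C"): 0.1, ("-", "G"): 0.1, ("-", "T"): 0.4}

def joint_log_prob(x, path):
    """log P(x, pi) = log a_{0 pi_1} + sum_i [log e_{pi_i}(x_i) + log a_{pi_i pi_{i+1}}]."""
    lp = math.log(A[(0, path[0])])
    for i in range(len(x)):
        lp += math.log(E[(path[i], x[i])])
        nxt = path[i + 1] if i + 1 < len(path) else 0   # final hop to end state
        lp += math.log(A[(path[i], nxt)])
    return lp

print(joint_log_prob("TCGCC", ["-", "+", "+", "+", "-"]))
```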

  10. • Hidden Markov Model (the most probable path). Many state paths can generate the target sequence! For x = T C G C C ..., possible paths include $\pi_1$ = T+ C+ G- C+ C- ..., $\pi_2$ = T- C+ G+ C+ C+ ..., $\pi_3$ = T- C- G+ C- C- ... The most probable path is $\pi^* = \operatorname{argmax}_\pi P(x, \pi)$, found by the Viterbi algorithm. Recursion: $V_l(i+1) = e_l(x_{i+1}) \max_k \big(V_k(i)\, a_{kl}\big)$, where $V_k(i)$ is the probability of the most probable path up to $x_i$ ending in state k, $V_l(i+1)$ is the probability of the most probable path up to $x_{i+1}$ ending in state l, $a_{kl}$ is the transition probability from state k to state l, and $e_l(x_{i+1})$ is the probability of emitting $x_{i+1}$ from state l. Initialization: $i = 0$, $V_0(0) = 1$, $V_k(0) = 0$ for $k > 0$. Termination: $P(x, \pi^*) = \max_k \big(V_k(L)\, a_{k0}\big)$ and $\pi_L^* = \operatorname{argmax}_k \big(V_k(L)\, a_{k0}\big)$. Note: the start and end state are both labelled 0.
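A sketch of the Viterbi recursion in log space (to avoid underflow), reusing the toy STATES, A, E model from the previous sketch.

```python
import math

def viterbi(x):
    V = {k: math.log(A[(0, k)]) + math.log(E[(k, x[0])]) for k in STATES}
    pointers = []
    for sym in x[1:]:
        # Best predecessor k for each state l, then the new V_l values.
        step = {l: max(STATES, key=lambda k: V[k] + math.log(A[(k, l)]))
                for l in STATES}
        V = {l: V[step[l]] + math.log(A[(step[l], l)]) + math.log(E[(l, sym)])
             for l in STATES}
        pointers.append(step)
    # Termination: include the transition into the end state (labelled 0).
    last = max(STATES, key=lambda k: V[k] + math.log(A[(k, 0)]))
    path = [last]
    for step in reversed(pointers):
        path.append(step[path[-1]])   # follow the traceback pointers
    return path[::-1]

print(viterbi("TCGCC"))   # most probable state path under the toy model
```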

  11. • Hidden Markov Model (the full sequence probability). We must add the probabilities for all possible paths to obtain the full probability of x: for x = T C G C C ..., e.g. $\pi_1$ = T- C+ G+ C+ C- ..., $\pi_2$ = T+ C- G+ C- C- ..., $\pi_3$ = T- C- G+ C+ C- ... The full probability of x is $P(x) = \sum_\pi P(x, \pi)$, computed by the forward algorithm. Recursion: $f_h(i+1) = e_h(x_{i+1}) \sum_k f_k(i)\, a_{kh}$, where $f_k(i) = P(x_1 \ldots x_i, \pi_i = k)$ is the probability of the sequence up to $x_i$ ending in state k, $f_h(i+1)$ is the probability of the sequence up to $x_{i+1}$ ending in state h, $a_{kh}$ is the transition probability from state k to state h, and $e_h(x_{i+1})$ is the probability of emitting $x_{i+1}$ from state h. Initialization: $i = 0$, $f_0(0) = 1$, $f_k(0) = 0$ for $k > 0$. Termination: $P(x) = \sum_k f_k(L)\, a_{k0}$. Note: the start and end state are both labelled 0.
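A sketch of the forward recursion on the same toy model; for clarity it works with raw probabilities, though real implementations rescale or use logs to avoid underflow.

```python
def forward_prob(x):
    f = {k: A[(0, k)] * E[(k, x[0])] for k in STATES}          # f_k(1)
    for sym in x[1:]:
        f = {h: E[(h, sym)] * sum(f[k] * A[(k, h)] for k in STATES)
             for h in STATES}                                  # f_h(i+1)
    return sum(f[k] * A[(k, 0)] for k in STATES)               # P(x)

print(forward_prob("TCGCC"))   # full probability, summed over all paths
```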

  12. • Hidden Markov Model (the posterior state probabilities). What if different paths have almost the same probability as the most probable one? We need posterior decoding. The posterior probability is $P(\pi_i = k \mid x) = P(x, \pi_i = k) / P(x)$, where $P(x)$ comes from the forward algorithm and $P(x, \pi_i = k) = P(x_1 \ldots x_i, \pi_i = k)\, P(x_{i+1} \ldots x_L \mid x_1 \ldots x_i, \pi_i = k) = f_k(i)\, P(x_{i+1} \ldots x_L \mid \pi_i = k) = f_k(i)\, b_k(i)$; $f_k(i)$ comes from the forward algorithm and $b_k(i)$ from the backward algorithm. Backward algorithm recursion: $b_k(i) = \sum_h a_{kh}\, e_h(x_{i+1})\, b_h(i+1)$, where $b_k(i) = P(x_{i+1} \ldots x_L \mid \pi_i = k)$, $b_h(i+1) = P(x_{i+2} \ldots x_L \mid \pi_{i+1} = h)$, $a_{kh}$ is the transition probability from state k to state h, and $e_h(x_{i+1})$ is the probability of emitting $x_{i+1}$ from state h. Initialization: $b_k(L) = a_{k0}$. Termination: $P(x) = \sum_h a_{0h}\, e_h(x_1)\, b_h(1)$. Note: the start and end state are both labelled 0.
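A sketch combining the forward and backward recursions into posterior probabilities, again on the toy model; indices are 0-based in the code and 1-based in the slide's notation.

```python
def posterior(x):
    # Forward: fwd[i][k] = f_k(i+1)
    fwd = [{k: A[(0, k)] * E[(k, x[0])] for k in STATES}]
    for sym in x[1:]:
        fwd.append({h: E[(h, sym)] * sum(fwd[-1][k] * A[(k, h)] for k in STATES)
                    for h in STATES})
    # Backward: bwd[i][k] = b_k(i+1), initialised with b_k(L) = a_{k0}
    bwd = [{k: A[(k, 0)] for k in STATES}]
    for sym in reversed(x[1:]):                 # x_L, x_{L-1}, ..., x_2
        nxt = bwd[0]
        bwd.insert(0, {k: sum(A[(k, h)] * E[(h, sym)] * nxt[h] for h in STATES)
                       for k in STATES})
    px = sum(fwd[-1][k] * A[(k, 0)] for k in STATES)           # P(x)
    return [{k: fwd[i][k] * bwd[i][k] / px for k in STATES}    # P(pi_i = k | x)
            for i in range(len(x))]

for i, p in enumerate(posterior("TCGCC"), start=1):
    print(i, p)   # each row sums to 1 across states
```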

  13. • Hidden Markov Model (parameter estimation). How do we specify the model in the first place? Step 1: design the structure (states, connections). Step 2: estimate the transition probabilities $a_{kh}$ and emission probabilities $e_k(b)$. Estimation when the state sequence is known: $a_{kh} = A_{kh} / \sum_{h'} A_{kh'}$ and $e_k(b) = E_k(b) / \sum_{b'} E_k(b')$, where $A_{kh}$ is the number of transitions from k to h in the training data plus $r_{kh}$, and $E_k(b)$ is the number of emissions of b from k in the training data plus $r_k(b)$. Note: $r_{kh}$ and $r_k(b)$ are pseudocounts. Estimation when the state sequence is unknown: the Baum-Welch algorithm or Viterbi training.
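A sketch of the counting estimators for the known-path case; the uniform pseudocount r = 1 is my assumption, since the slide only names $r_{kh}$ and $r_k(b)$ without giving values.

```python
from collections import defaultdict

def estimate_params(training_pairs, states, symbols, r=1.0):
    A_cnt = defaultdict(lambda: r)   # A_kh: transition counts plus pseudocount
    E_cnt = defaultdict(lambda: r)   # E_k(b): emission counts plus pseudocount
    for x, path in training_pairs:
        for st, sym in zip(path, x):
            E_cnt[(st, sym)] += 1
        for k, h in zip(path, path[1:]):
            A_cnt[(k, h)] += 1
    a = {(k, h): A_cnt[(k, h)] / sum(A_cnt[(k, h2)] for h2 in states)
         for k in states for h in states}
    e = {(k, b): E_cnt[(k, b)] / sum(E_cnt[(k, b2)] for b2 in symbols)
         for k in states for b in symbols}
    return a, e

a_hat, e_hat = estimate_params([("TCGCC", ["-", "+", "+", "+", "-"])],
                               ["+", "-"], "ACGT")
print(a_hat[("+", "+")], e_hat[("+", "C")])
```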

