csce 970 lecture 3 hmm application biological sequence
play

CSCE 970 Lecture 3: HMM Application: Biological Sequence Analysis - PowerPoint PPT Presentation

CSCE 970 Lecture 3: HMM Application: Biological Sequence Analysis Stephen D. Scott 1 Introduction Idea: Given a collection S of related biological sequences, build a (pro- file) hidden Markov model M to generate the sequences Then test


  1. CSCE 970 Lecture 3: HMM Application: Biological Sequence Analysis Stephen D. Scott 1

  2. Introduction • Idea: Given a collection S of related biological sequences, build a (pro- file) hidden Markov model M to generate the sequences • Then test new sequence X against M using (which algorithm?) to pre- dict X ’s membership in S • Can also align X against M using (which algorithm?) to see how X matches up position by position against sequences in S 2

  3. Introduction (cont’d) • Will build M based on a multiple alignment of sequences in S : ... V G A - - H A G E Y ... ... V - - - - N V D E V ... ... V E A - - D V A G H ... ... V K G - - - - - - D ... ... V Y S - - T Y E T S ... ... F N A - - N I P K H ... ... I A G A D N G A G V ... • In alignments, will differentiate matches, insertions, and deletions 3

  4. Outline • Ungapped regions • Insert and delete states • Deriving profile HMMs from multiple alignments • Searching with profile HMMs • Variants for non-global alignments • Estimating probabilities 4

  5. Organization of a Profile HMM • Start with a trivial HMM M (not really hidden at this point) 1 1 1 1 1 M 1 M B E i • Each match state has its own set of emission probs, so we can com- pute prob of a new sequence x being part of this family: L � P ( x | M ) = e i ( x i ) i =1 5

  6. Organization of a Profile HMM (cont’d) • But this assumes ungapped alignments! • To handle gaps, consider insertions and deletions – Insertion: part of x that doesn’t match anything in multiple align- ment (use insert states) I i M i 6

  7. Organization of a Profile HMM (cont’d) - Deletion: parts of multiple alignment not matched by any residue (sym- bol) in x (use silent delete states) D i M i 7

  8. General Profile HMM Structure B E 8

  9. Handling non-Global Alignments • Original profile HMMs model entire sequence • Add flanking model states (or free insertion modules) to generate non- local residues B E 9

  10. Building a Model • Given a multiple alignment, how to build an HMM? – General structure defined, but how many match states? ... V G A - - H A G E Y ... ... V - - - - N V D E V ... ... V E A - - D V A G H ... ... V K G - - - - - - D ... ... V Y S - - T Y E T S ... ... F N A - - N I P K H ... ... I A G A D N G A G V ... 10

  11. Building a Model (cont’d) • Given a multiple alignment, how to build an HMM? – General structure defined, but how many match states? – Heuristic: if more than half of characters in a column are non-gaps, include a match state for that column ... V G A - - H A G E Y ... ... V - - - - N V D E V ... ... V E A - - D V A G H ... ... V K G - - - - - - D ... ... V Y S - - T Y E T S ... ... F N A - - N I P K H ... ... I A G A D N G A G V ... 11

  12. Building a Model (cont’d) • Now, find parameters • Multiple alignment + HMM structure → state sequence Non-gap in match column -> M1 D3 I3 match state ... V G A - - H A G E Y ... Gap in match column -> ... V - - - - N V D E V ... delete state ... V E A - - D V A G H ... Non-gap in insert column -> ... V K G - - - - - - D ... insert state ... V Y S - - T Y E T S ... Gap in insert column -> ... F N A - - N I P K H ... ignore ... I A G A D N G A G V ... Durbin Fig 5.4, p. 109 12

  13. Building a Model (cont’d) • Count number of transitions and emissions and compute: A kl a kl = � l ′ A kl ′ E k ( b ) e k ( b ) = b ′ E k ( b ′ ) � • Still need to beware of some counts = 0 13

  14. Weighted Pseudocounts • Let c ja = observed count of residue a in position j of multiple align- ment c ja + Aq a e M j ( a ) = � a ′ c ja ′ + A • q a = background probability of a , A = weight placed on pseudo- counts (sometimes use A ≈ 20 ) • Also called a prior distribution 14

  15. Dirichlet Mixtures • Can be thought of a mixture of pseudocounts • The mixture has different components, each representing a different context of a protein sequence – E.g. in parts of a sequence folded near protein’s surface, more weight (higher q a ) can be given to hydrophilic residues (ones that readily bind with water) • Will find a different mixture for each position of the alignment based on the distribution of residues in that column 15

  16. Dirichlet Mixtures (cont’d) α k (so α k • Each component k consists of a vector of pseudocounts � a corresponds to Aq a ) and a mixture coefficient ( m k , for now) that is the probability that component k is selected • Pseudocount model k is the “correct” one with probability m k • We’ll set the mixture coefficients for each column based on which vec- tors best fit the residues in that column – E.g. first column of our example alignment is dominated by V, so α k that favors V will get a higher m k any vector � 16

  17. Dirichlet Mixtures (cont’d) • Let � c j be vector of counts in column j c ja + α k a � � � e M j ( a ) = P k | � c j � � c ja ′ + α k � a ′ k a ′ � � • P k | � c j are the posterior mixture coefficients, which are easily com- puted [Sj¨ olander et al. 1996], yielding: X a e M j ( a ) = a ′ X a ′ , � where α k c ja + � a � � α k � � α k �� � � , X a = m k 0 exp ln B � a + � c j − ln B � a � c ja ′ + α k � a ′ k a ′   � � ln B ( � x ) = ln Γ( x i ) − ln Γ x i  i i 17

  18. Dirichlet Mixtures (cont’d) • Γ is gamma function, and ln Γ is computed via lgamma and related functions in C • m k 0 is prior probability of component k ( = q in Sj¨ olander Table 1): . . . 18

  19. Searching for Homologues • Score a candidate match x by using log-odds: – P ( x, π ∗ | M ) is probability that x came from model M via most likely path π ∗ ⇒ Find using Viterbi – Pr ( x | M ) is probability that x came from model M summed over all possible paths ⇒ Find using forward algorithm – score ( x ) = log( P ( x | M ) /P ( x | φ )) ∗ φ is a “null model”, which is often the distribution of amino acids in the training set or AA distribution over each individual column ∗ If x matches M much better than φ , then score is large and positive 19

  20. Viterbi Equations • V M ( i ) = log-odds score of best path matching x 1 ...i to the model, j where x i emitted by state M j (similarly define V I j ( i ) and V D j ( i ) ) • Rename B as M 0 , V M 0 (0) = 0 , rename E as M L +1 ( V M L +1 = final) V M  j − 1 ( i − 1) + log a M j − 1 M j � e M j ( x i )  �  V M  V I j − 1 ( i − 1) + log a I j − 1 M j ( i ) = log + max j q x i V D  j − 1 ( i − 1) + log a D j − 1 M j   V M  ( i − 1) + log a M j I j j � e I j ( x i ) �    V I V I j ( i − 1) + log a I j I j j ( i ) = log + max q x i V D  j ( i − 1) + log a D j I j   V M  j − 1 ( i ) + log a M j − 1 D j    V D V I j − 1 ( i ) + log a I j − 1 D j j ( i ) = max V D  j − 1 ( i ) + log a D j − 1 D j   20

  21. Forward Equations � e M j ( x i ) � F M � � F M � j ( i ) = log + log a M j − 1 M j exp j − 1 ( i − 1) + q x i � F I � � F D �� a I j − 1 M j exp j − 1 ( i − 1) + a D j − 1 M j exp j − 1 ( i − 1) � e I j ( x i ) � F I � � F M � j ( i ) = log + log a M j I j exp j ( i − 1) + q x i � F I � � F D �� a I j I j exp j ( i − 1) + a D j I j exp j ( i − 1) F D � � F M � � F I � j ( i ) = log a M j − 1 D j exp j − 1 ( i ) + a I j − 1 D j exp j − 1 ( i ) � F D �� + a D j − 1 D j exp j − 1 ( i ) • exp( · ) needed to use sums and logs 21

  22. Aligning a Sequence with a Model (Multiple Alignment) • Given a string x , use Viterbi to find most likely path π ∗ and use the state sequence as the alignment • More detail in Durbin, Section 6.5 – Also discusses building an initial multiple alignment and HMM si- multaneously via Baum-Welch 22

Download Presentation
Download Policy: The content available on the website is offered to you 'AS IS' for your personal information and use only. It cannot be commercialized, licensed, or distributed on other websites without prior consent from the author. To download a presentation, simply click this link. If you encounter any difficulties during the download process, it's possible that the publisher has removed the file from their server.

Recommend


More recommend