CSCE 471/871 Lecture 4: Profile Hidden Markov Models
Stephen D. Scott

Introduction

• Designed to model (profile) a multiple alignment of a protein family (e.g., Durbin p. 102)
• Gives a probabilistic model of the proteins in the family
• Useful for searching databases for more homologues and for aligning strings to the family

Outline

• Organization of a profile HMM
  – Ungapped regions
  – Insert and delete states
• Building a model
• Searching with HMMs

Organization of a Profile HMM

• Start with a trivial HMM M: a begin state B, a chain of match states M_1, …, M_L, and an end state E, with every transition probability equal to 1 (not really hidden at this point)
• Each match state has its own set of emission probabilities, so we can compute the probability of a new sequence x being part of this family:

$$P(x \mid M) = \prod_{i=1}^{L} e_{M_i}(x_i)$$

• Can, as usual, convert probabilities to a log-odds score
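Since the ungapped model factors over positions, this log-odds score is just a sum of per-column terms. Below is a minimal sketch of that computation; the function name and the `emissions`/`background` structures are illustrative, not from any standard library:

```python
# Minimal sketch: log-odds score of x under an ungapped profile HMM.
# emissions[i] maps residue a -> e_{M_{i+1}}(a); background maps a -> q_a.
import math

def ungapped_log_odds(x, emissions, background):
    """Requires len(x) == L; returns sum_i log(e_{M_i}(x_i) / q_{x_i})."""
    assert len(x) == len(emissions)
    return sum(math.log(emissions[i][a] / background[a])
               for i, a in enumerate(x))

# Toy length-2 model: a large positive score suggests membership in the family.
emissions = [{'V': 0.7, 'I': 0.2, 'F': 0.1}, {'G': 0.6, 'N': 0.3, 'E': 0.1}]
background = {a: 0.05 for a in "ACDEFGHIKLMNPQRSTVWY"}  # uniform over 20 AAs
print(ungapped_log_odds("VG", emissions, background))
```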
Organization of a Profile HMM (cont'd)

• But this assumes ungapped alignments!
• To handle gaps, consider insertions and deletions:
  – Deletion: parts of the multiple alignment not matched by any residue in x (use silent delete states D_j)
  – Insertion: parts of x that don't match anything in the multiple alignment (use insert states I_j)

(Figure: the full profile HMM, with insert states I_j and silent delete states D_j added around the match states M_j.)
Handling non-Global Alignments

• Original profile HMMs model the entire sequence
• Add flanking model states (or "free insertion modules") to generate the non-local residues

(Figure: general profile HMM structure, with flanking insertion states between B and E surrounding the core model.)

Outline

• Organization of a profile HMM
• Building a model
  – Structure
  – Estimating probabilities
• Searching with HMMs

Building a Model

• Given a multiple alignment, how do we build an HMM?
  – General structure is defined, but how many match states?

    ... V G A - - H A G E Y ...
    ... V - - - - N V D E V ...
    ... V E A - - D V A G H ...
    ... V K G - - - - - - D ...
    ... V Y S - - T Y E T S ...
    ... F N A - - N I P K H ...
    ... I A G A D N G A G V ...

Building a Model (cont'd)

• Heuristic: if more than half of the characters in a column are non-gaps, include a match state for that column (Durbin Fig. 5.4, p. 109)
• In the alignment above, the two mostly-gap columns become insert columns; the remaining eight columns become match states M_1, …, M_8

Building a Model (cont'd)

• Now, find the parameters
• Multiple alignment + HMM structure → a state sequence for each row (see the sketch below):
  – Non-gap in a match column → match state
  – Gap in a match column → delete state
  – Non-gap in an insert column → insert state
  – Gap in an insert column → ignore
• E.g., the second row (V - - - - N V D E V) passes through M_1, D_2, D_3, M_4, …, M_8, and the A D in the last row is emitted by insert state I_3
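A short sketch of the match-column heuristic and the column-to-state rules above, assuming the alignment is a list of equal-length strings with '-' for gaps (all function names are illustrative):

```python
# Sketch of the match-column heuristic and the column -> state rules above.
# Assumes the alignment is a list of equal-length strings with '-' as the gap
# character; names here are illustrative, not from any particular library.

def match_columns(alignment):
    """A column is a match column if more than half of its characters are non-gaps."""
    n_rows, n_cols = len(alignment), len(alignment[0])
    return [sum(row[c] != '-' for row in alignment) > n_rows / 2
            for c in range(n_cols)]

def state_path(row, is_match):
    """Map one aligned row to its state sequence (M_j, D_j, or I_j)."""
    path, j = [], 0                       # j tracks the current match-state index
    for c, ch in enumerate(row):
        if is_match[c]:
            j += 1                        # match column: M_j if residue, D_j if gap
            path.append(('M%d' % j) if ch != '-' else ('D%d' % j))
        elif ch != '-':                   # insert column: I_j if residue, else ignore
            path.append('I%d' % j)
    return path

aln = ["VGA--HAGEY", "V----NVDEV", "VEA--DVAGH", "VKG------D",
       "VYS--TYETS", "FNA--NIPKH", "IAGADNGAGV"]
cols = match_columns(aln)                 # columns 4-5 fail the heuristic here
print(state_path(aln[1], cols))           # ['M1', 'D2', 'D3', 'M4', ..., 'M8']
```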

Building a Model (cont'd)

• Count the number of transitions and emissions in the alignment, and compute

$$a_{kl} = \frac{A_{kl}}{\sum_{l'} A_{kl'}}, \qquad e_k(b) = \frac{E_k(b)}{\sum_{b'} E_k(b')}$$

  where A_{kl} is the count of transitions from state k to state l, and E_k(b) is the count of emissions of b from state k
• Still need to beware of counts that are 0

Weighted Pseudocounts

• Let c_{ja} = observed count of residue a in position j of the multiple alignment:

$$e_{M_j}(a) = \frac{c_{ja} + A\,q_a}{\sum_{a'} c_{ja'} + A}$$

• q_a = background probability of a; A = weight placed on the pseudocounts (sometimes A ≈ 20 is used)
• The background probabilities are also called a prior distribution
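The weighted-pseudocount rule above is a one-line estimator once the column counts are in hand. A minimal sketch, assuming per-column count and background dictionaries (illustrative names, not a library API):

```python
# Sketch of the weighted-pseudocount estimator above:
#   e_{M_j}(a) = (c_{ja} + A*q_a) / (sum_{a'} c_{ja'} + A).
# The dictionaries below are illustrative structures, not a library API.

def pseudocount_emissions(counts, background, A=20.0):
    """counts[a] = c_{ja} for one column j; background[a] = q_a; A = pseudocount weight."""
    total = sum(counts.get(a, 0) for a in background) + A
    return {a: (counts.get(a, 0) + A * background[a]) / total for a in background}

# A column dominated by V (cf. the first column of the alignment shown earlier):
col_counts = {'V': 5, 'F': 1, 'I': 1}
bg = {'V': 0.25, 'F': 0.25, 'I': 0.25, 'G': 0.25}   # toy 4-letter background
e = pseudocount_emissions(col_counts, bg, A=4.0)
print(e['V'], e['G'])   # V dominates; unseen G still gets nonzero probability
```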
Dirichlet Mixtures

• The mixture has different components, each representing a different context of a protein sequence
  – E.g., in parts of a sequence folded near the protein's surface, more weight (higher q_a) can be given to hydrophilic residues
  – But in other regions, we may want to give more weight to hydrophobic residues
• We will find a different mixture for each position of the alignment, based on the distribution of residues in that column

Dirichlet Mixtures (cont'd)

• Can be thought of as a mixture of pseudocounts: each component k consists of a vector of pseudocounts α^k (so α^k_a corresponds to A q_a) and a mixture coefficient (m_k, for now) that is the probability that component k is selected
• Pseudocount model k is the "correct" one with probability m_k
• We set the mixture coefficients for each column based on which vectors best fit the residues in that column
  – E.g., the first column of the alignment shown earlier is dominated by V, so any vector α^k that favors V will get a higher m_k

Dirichlet Mixtures (cont'd)

• Let c_j be the vector of counts in column j; then

$$e_{M_j}(a) = \sum_k P(k \mid \vec{c}_j)\, \frac{c_{ja} + \alpha^k_a}{\sum_{a'} \left( c_{ja'} + \alpha^k_{a'} \right)}$$

• The P(k | c_j) are posterior mixture coefficients, which are easily computed [Sjölander et al. 1996], yielding:

$$P(k \mid \vec{c}_j) = \frac{m^0_k \exp\!\left( \ln B(\vec{c}_j + \vec{\alpha}^k) - \ln B(\vec{\alpha}^k) \right)}{\sum_{k'} m^0_{k'} \exp\!\left( \ln B(\vec{c}_j + \vec{\alpha}^{k'}) - \ln B(\vec{\alpha}^{k'}) \right)},
\qquad \ln B(\vec{x}) = \sum_i \ln \Gamma(x_i) - \ln \Gamma\!\left( \sum_i x_i \right)$$

• Γ is the gamma function, and ln Γ is computed via lgamma and related functions in C
• m^0_k is the prior probability of component k (= q in Sjölander et al.'s Table 1)
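The posterior mixture coefficients above reduce to a softmax over per-component log weights, with ln B computed from lgamma (Python's math.lgamma stands in for C's lgamma here). A sketch with toy component vectors and priors, all values illustrative:

```python
# Sketch of the posterior mixture coefficients P(k | c_j) above, using
#   ln B(x) = sum_i lgamma(x_i) - lgamma(sum_i x_i).
# The component vectors and priors below are toy values, not a published mixture.
import math

def ln_B(xs):
    return sum(math.lgamma(x) for x in xs) - math.lgamma(sum(xs))

def posterior_mixture(c, alphas, priors):
    """c = count vector for one column; alphas[k] = pseudocount vector alpha^k;
    priors[k] = m0_k. Returns P(k | c) for every component k, via log space."""
    logw = [math.log(m0) + ln_B([ci + ai for ci, ai in zip(c, alpha)]) - ln_B(alpha)
            for m0, alpha in zip(priors, alphas)]
    mx = max(logw)                        # factor out the max for stability
    w = [math.exp(v - mx) for v in logw]
    return [v / sum(w) for v in w]

# Two components over a 3-letter toy alphabet: one favors letter 1, one is flat.
alphas = [[8.0, 1.0, 1.0], [2.0, 2.0, 2.0]]
print(posterior_mixture([5, 0, 0], alphas, [0.5, 0.5]))  # first component dominates
```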
Outline

• Organization of a profile HMM
• Building a model
• Searching with HMMs

Searching for Homologues

• Score a candidate match x using log-odds:
  – P(x, π* | M) is the probability that x came from model M via the most likely path π* ⇒ find it using Viterbi
  – P(x | M) is the probability that x came from model M, summed over all possible paths ⇒ find it using the forward algorithm
  – score(x) = log( P(x | M) / P(x | φ) )
    * φ is a "null model", often the distribution of amino acids in the training set, or the AA distribution over each individual column
    * If x matches M much better than φ, then the score is large and positive

Viterbi Equations

• V^M_j(i) = log-odds score of the best path matching x_1…x_i to the model, where x_i is emitted by state M_j (similarly define V^I_j(i) and V^D_j(i))
• Rename B as M_0 with V^M_0(0) = 0, and rename E as M_{L+1} (so V^M_{L+1} holds the final score)

$$V^M_j(i) = \log\frac{e_{M_j}(x_i)}{q_{x_i}} + \max \begin{cases} V^M_{j-1}(i-1) + \log a_{M_{j-1}M_j} \\ V^I_{j-1}(i-1) + \log a_{I_{j-1}M_j} \\ V^D_{j-1}(i-1) + \log a_{D_{j-1}M_j} \end{cases}$$

$$V^I_j(i) = \log\frac{e_{I_j}(x_i)}{q_{x_i}} + \max \begin{cases} V^M_j(i-1) + \log a_{M_j I_j} \\ V^I_j(i-1) + \log a_{I_j I_j} \\ V^D_j(i-1) + \log a_{D_j I_j} \end{cases}$$

$$V^D_j(i) = \max \begin{cases} V^M_{j-1}(i) + \log a_{M_{j-1}D_j} \\ V^I_{j-1}(i) + \log a_{I_{j-1}D_j} \\ V^D_{j-1}(i) + \log a_{D_{j-1}D_j} \end{cases}$$

• Similar to Chapter 2's gapped alignment, but with a position-specific scoring scheme

Forward Equations

$$F^M_j(i) = \log\frac{e_{M_j}(x_i)}{q_{x_i}} + \log\Big[ a_{M_{j-1}M_j} \exp\big(F^M_{j-1}(i-1)\big) + a_{I_{j-1}M_j} \exp\big(F^I_{j-1}(i-1)\big) + a_{D_{j-1}M_j} \exp\big(F^D_{j-1}(i-1)\big) \Big]$$

$$F^I_j(i) = \log\frac{e_{I_j}(x_i)}{q_{x_i}} + \log\Big[ a_{M_j I_j} \exp\big(F^M_j(i-1)\big) + a_{I_j I_j} \exp\big(F^I_j(i-1)\big) + a_{D_j I_j} \exp\big(F^D_j(i-1)\big) \Big]$$

$$F^D_j(i) = \log\Big[ a_{M_{j-1}D_j} \exp\big(F^M_{j-1}(i)\big) + a_{I_{j-1}D_j} \exp\big(F^I_{j-1}(i)\big) + a_{D_{j-1}D_j} \exp\big(F^D_{j-1}(i)\big) \Big]$$

• exp(·) is needed to mix sums with logs (this can still be made fast; see p. 78)
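A compact log-space sketch of the Viterbi recurrences above, assuming dictionary-based parameters (the data layout and function name are illustrative; flanking states for local alignment are omitted):

```python
# A log-space sketch of the Viterbi recurrences above. Parameters are plain
# dictionaries (an illustrative layout, not a library API):
#   log_a[(('M', 0), ('I', 0))] etc. = log transition probabilities,
#   log_e_M[j][c], log_e_I[j][c]    = log emission probabilities,
#   log_q[c]                        = log background probability of residue c.
# B is treated as M_0 and E as M_{L+1}, as on the slide above.

NEG_INF = float('-inf')

def viterbi_log_odds(x, L, log_a, log_e_M, log_e_I, log_q):
    n = len(x)
    V = {('M', 0, 0): 0.0}                       # V^M_0(0) = 0 at the begin state
    get = lambda s, j, i: V.get((s, j, i), NEG_INF)
    tr = lambda s, j, t, k: log_a.get(((s, j), (t, k)), NEG_INF)
    for i in range(n + 1):
        for j in range(L + 1):
            if i > 0 and j > 0:                  # M_j emits x_i and advances a column
                V[('M', j, i)] = (log_e_M[j][x[i-1]] - log_q[x[i-1]]
                    + max(get(s, j-1, i-1) + tr(s, j-1, 'M', j) for s in 'MID'))
            if i > 0:                            # I_j emits x_i, stays at column j
                V[('I', j, i)] = (log_e_I[j][x[i-1]] - log_q[x[i-1]]
                    + max(get(s, j, i-1) + tr(s, j, 'I', j) for s in 'MID'))
            if j > 0:                            # D_j is silent and advances a column
                V[('D', j, i)] = max(get(s, j-1, i) + tr(s, j-1, 'D', j)
                                     for s in 'MID')
    # Final move into the end state E = M_{L+1} yields the log-odds score:
    return max(get(s, L, n) + tr(s, L, 'M', L + 1) for s in 'MID')
```

Recording the argmax alongside each max recovers the most likely path π*, which the next slide uses as an alignment; swapping each max for a log-sum-exp over the same three terms gives the forward score instead.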
Aligning a Sequence with a Model (Multiple Alignment)

• Given a string x, use Viterbi to find the most likely path π*, and use its state sequence as the alignment
• More detail in Durbin, Section 6.5
  – Also discusses building an initial multiple alignment and HMM simultaneously via Baum-Welch

Topic summary due in 1 week!
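A closing implementation note on the forward equations: each log[a·exp(F) + …] term is a log-sum-exp, and a standard way to keep it fast and numerically stable is to factor out the largest term before exponentiating. The helper below is a generic sketch of that idea, not necessarily the exact method behind the "see p. 78" remark; `scipy.special.logsumexp` is a ready-made alternative:

```python
# log(a1*exp(F1) + a2*exp(F2) + ...) is a log-sum-exp; factoring out the max
# keeps it numerically stable even when the individual terms underflow.
import math

def log_sum_exp(terms):
    """log(sum(exp(t) for t in terms)), with the max factored out."""
    m = max(terms)
    if m == float('-inf'):                # every term is a zero-probability path
        return m
    return m + math.log(sum(math.exp(t - m) for t in terms))

# Each forward update is the matching Viterbi update with max replaced by
# log_sum_exp over the same three predecessor terms, e.g. for F^M_j(i):
#   log_sum_exp([F_M + log_a_MM, F_I + log_a_IM, F_D + log_a_DM])
print(log_sum_exp([math.log(0.2), math.log(0.3)]))   # == log(0.5)
```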