CSE182-L9 Modeling Protein domains using HMMs Profiles Revisited - PowerPoint PPT Presentation

CSE182-L9 Modeling Protein domains using HMMs

Profiles Revisited • Note that profiles are a powerful way of capturing domain information • Pr(sequence x| profile P) = Pr(x 1 |P 1 ) P(x 2 |P 2 )… • Pr(AGCTTGTA|P) = 0.9*0.2*0.7*0.4*0.3*…. • Why do we want a different formalism? 1 2 3 4 5 6 7 8 A 0.9 0.4 0.3 0.6 0.1 0.0 0.2 1.0 C 0.0 0.2 0.7 0.0 0.3 0.0 0.0 0.0 G 0.1 0.2 0.0 0.0 0.3 1.0 0.3 0.0 T 0.0 0.2 0.0 0.4 0.3 0.0 0.5 0.0

Profiles might need to be extended • A Domain might only be a part of a sequence, and we need to identify and score that part – Pr[x|P] = max i Pr[x i …x i+l | P] • A sequence containing a domain might contain indels in the domain – EX: AACTTCGGA 1 2 3 4 5 6 7 8 A 0.9 0.4 0.3 0.6 0.1 0.0 0.2 1.0 C 0.0 0.2 0.7 0.0 0.3 0.0 0.0 0.0 G 0.1 0.2 0.0 0.0 0.3 1.0 0.3 0.0 T 0.0 0.2 0.0 0.4 0.3 0.0 0.5 0.0 A A C T T C G G A A A C T T C G G A

Profiles & Indels • Indels can be allowed in matching sequences to a profile. • However, how do we construct the profile in the first place? – Without indels, multiple alignment of a sequence is trivial – Multiple alignments in the presence of indels is much trickier. Similarly, construction of profiles is harder. – The HMM formalism allows you to do both, alignment of single a sequence to the HMM, and construction of the HMM given unaligned sequences.

Profiles and Compositional signals • While most protein domains have position dependent properties, other biological signals do not. • Ex: 1. CpG islands, 2. Signals for protein targeting, transmembrane etc.

Protein targeting • Proteins act in specific compartments of the cell, often distinct from where it is ‘manufactured’. • How does the cellular machinery know where to send each protein.

Protein targeting • In 1970, Gunter Blobel showed that proteins have an N-terminal signal sequence which directs proteins to the membrane. • Proteins have to be transported to other organelles: nucleus, mitochondria,… • Can we computationally identify the ‘signal’ which distinguishes the cellular compartment?

Predicting Transmembrane Regions • For transmembrane proteins, can we predict the transmembrane, outer, and inner regions? transmembrane outer inner These signals are of varying length and depend more on compositional, rather than position dependent properties

The answer is HMMs • Certainly, it is not the only answer! Many other formalisms have been proposed, such as Neural Networks, with varying degree of success.

Revisiting automaton • Recall that Profiles were introduced as a generalization of automata • The OR operator is replaced by a distribution. However, the * operator (Kleene’s closure) has no direct analog. • Instead we introduce probabilities directly into automata.

Profile HMMs • Problem: Given Sequence x, Profile P of length l, – compute Pr[x|P] = max i Pr[x i …x i+l | P] • Construct an automaton with a state (node) for every column • Extra states at the beginning and end. • In each step, the automaton emits a symbol, and moves to a new state. Unlike finite automata, the emission and transistion are governed by probabilistic rules 1 2 3 4 5 6 7 8 A 0.9 0.4 0.3 0.6 0.1 0.0 0.2 1.0 C 0.0 0.2 0.7 0.0 0.3 0.0 0.0 0.0 G 0.1 0.2 0.0 0.0 0.3 1.0 0.3 0.0 T 0.0 0.2 0.0 0.4 0.3 0.0 0.5 0.0

Emission Probability for states • Each state π emits a base (residue) with a probability (defined by the profile) • Thus Pr(A|s 1 ) = 0.9, Pr(T| s 4 )=0.4,….. 1 2 3 4 5 6 7 8 A 0.9 0.4 0.3 0.6 0.1 0.0 0.2 1.0 C 0.0 0.2 0.7 0.0 0.3 0.0 0.0 0.0 G 0.1 0.2 0.0 0.0 0.3 1.0 0.3 0.0 T 0.0 0.2 0.0 0.4 0.3 0.0 0.5 0.0 s 9 s 0 s 1 s 2 s 3 s 4 s 5 s 6 s 7 s 8

Transition probabilities • The probability of emitting a base depends upon a state. • An initial state π 0 is defined. We move to subsequent states using a probabilistic transition • EX: Pr(s 0 -> s 1 )= Pr(s 0 -> s 0 )=0.5, Pr(s 1 -> s 2 ) = 1.0 1 2 3 4 5 6 7 8 A 0.9 0.4 0.3 0.6 0.1 0.0 0.2 1.0 C 0.0 0.2 0.7 0.0 0.3 0.0 0.0 0.0 G 0.1 0.2 0.0 0.0 0.3 1.0 0.3 0.0 T 0.0 0.2 0.0 0.4 0.3 0.0 0.5 0.0 s 9 s 0 s 1 s 2 s 3 s 4 s 5 s 6 s 7 s 8

Emission of a sequence • What is the probability that the automaton emitted the sequence x 1 ..x n • The problem becomes easier if we know the sequence of states π = π 0 .. π n that the automaton was in ’ Pr[ x | p ] = P ( p 0 Æ p 1 ) Pr( x i | p i ) P ( p i Æ p i + 1 ) i

Computing Pr(x| π ) • Example: Consider the sequence GGGAACTTGTACCC – X = G G G A A C T T G T A C C C – π = 0 0 0 0 1 2 3 4 5 6 7 8 9 9 9 • Pr(x i |π i ): .25 .25 .25 .9 .4 .7 .4 .3 1 .5 1 .25 .25 .25 Pr( π i ->π i+1 )= .5 .5 .5 .5 1 1 1 1 1 1 1 1 1 1 • 1 2 3 4 5 6 7 8 A 0.9 0.4 0.3 0.6 0.1 0.0 0.2 1.0 C 0.0 0.2 0.7 0.0 0.3 0.0 0.0 0.0 G 0.1 0.2 0.0 0.0 0.3 1.0 0.3 0.0 T 0.0 0.2 0.0 0.4 0.3 0.0 0.5 0.0 s 9 s 0 s 1 s 2 s 3 s 4 s 5 s 6 s 7 s 8

HMM: Formal definition • M=(∑,Q,A,E), where • ∑=is the alphabet (EX: {A,C,G,T}) • Q: set of states, with an initial state q 0 , and perhaps, a final state. (EX: {s 0 ,…,s 9 }) • A: Transition Probabilities, A[i,j] = Pr(s i ->s j ) (EX: A[0,1]=0.5) • E: Emission probabilities e i (b) = Pr(b|s i ) (EX: e 3 (T)=0.7) 1 2 3 4 5 6 7 8 A 0.9 0.4 0.3 0.6 0.1 0.0 0.2 1.0 C 0.0 0.2 0.7 0.0 0.3 0.0 0.0 0.0 G 0.1 0.2 0.0 0.0 0.3 1.0 0.3 0.0 T 0.0 0.2 0.0 0.4 0.3 0.0 0.5 0.0 s 9 s 0 s 1 s 2 s 3 s 4 s 5 s 6 s 7 s 8

Pr(x|M) • Given the state sequence π , we know how to compute Pr(x| π ). • We would like to compute – Pr(x|M) = max π Pr(x| π ) • Let |x| = n, and |Q|=m, with s m being the final state. • As is common in Dynamic programming, we consider the probability of generating a prefix of x • Define v(i,j) = Probability of generating x 1 ..x i , and ending in state s j • Is it sufficient to compute v(i,j) for all i,j? – Yes, because Pr(x|M) = v(n,m)

Computing v(i,j) s l s j • If the previous state (at time i-1) was s l , – then v(i,j) = v(i-1,l).A[l,j].e j (x i ) • The previous state must be one of the m states. • Therefore v(i,j) = max l Œ Q {v(i-1,l).A[l,j] }.e j (x i )

The Viterbi Algorithm • v(i,j) = max l Œ Q {v(i-1,l).A[l,j] }.e j (x i ) • We take logarithms and an approximation to maintain precision – S(i,j) = log(e j (x i )) + max l Œ Q { S[i-1,l] + log(A[l,j] ) } • The Viterbi Algorithm • For i = 1..n – For j = 1.. M – S(i,j) = log(e j (x i )) + max l Œ Q { S[i-1,l] + log(A[l,j] ) }

Probability of being in specific states • What is the probability that we were in state k at step I? Pr[All paths that passed through state k at step I, and emitted x] Pr[All paths that emitted x] = Pr[ x , p i = k ] Pr[ x ]

CSE182-L9 Modeling Protein domains using HMMs Profiles Revisited - PowerPoint PPT Presentation

CSE182-L9 Modeling Protein domains using HMMs Profiles Revisited Note that profiles are a powerful way of capturing domain information Pr(sequence x| profile P) = Pr(x 1 |P 1 ) P(x 2 |P 2 ) Pr(AGCTTGTA|P) =

CSE182-L11 Protein sequencing and Mass Spectrometry CSE182 Course Summary Gene finding

CSE182-L7 CSE182-L7 Protein structure Basics Protein structure Basics Protein sequencing via MS

CSE182-L13 Mass Spectrometry Quantitation and other applications CSE182 The forbidden pairs

CSE182-L7 Dicitionary matching Pattern matching October 09 CSE182 Dictionary Matching

CSE182-L12 Mass Spectrometry Peptide identification CSE182 General isotope computation

CSE182-L6 P-value and E-value Dicitionary matching Pattern matching October 09 CSE182 Why is

CSE 182-L2:Blast & variants I Dynamic Programming www.cse cse. .ucsd ucsd. .edu

L14 Mass Spec Quantitation MS applications Microarray analysis CSE182 LC-MS Maps Peptide 2 I

CSE182-L10 Gene Finding November 09 HMM fair-coin example 0.6 0.6 1 0.4 0.4 E F (H)=0.5 E L

CSE182-L9 Protein domain analysis via HMMs Gene finding November 09 QUIZ! Question: Your

CSE182-L8 Protein Sequence Analysis Patterns (regular expressions) Profiles HMM Gene Finding

CSE 182: Biological Data Analysis Instructor: Vineet Bafna TA: Ryan Kelley www. www.cse cse.

CSE182-L16 Non-coding RNA Biol. Data analysis: Review Assembly Protein Sequence Sequence

CSE182-L12 LW statistics/Assembly Quiz Who are these people, and what is the occasion?

CSE 182-L2:Blast & variants I Dynamic Programming FA08 CSE182 Notes

CSE182-L5: Scoring matrices Dictionary Matching October 09 CSE 182 Expectation? Some

C LASSIFICATION OF S TRUCTURED O BJECTS I O H H C H H Methanol molecule 3 C LASSIFICATION

M ODEL C HECKING OF B IOCHEMICAL N ETWORKS U SING P ETRI N ETS Ina Koch Monika Heiner Technical

Rational Design of Organelle Compartments in Cells Claudio Angione Nettab 2012 Claudio Angione

Mitochondrial dysfunction and cancer metabolites and beyond Christian Frezza MRC Cancer Unit,

Programming MapReduce in Mathematica Paul-Jean Letourneau Data Scientist, Wolfram Research

} } 3. Take samples of HIV from other HIV positive people from the 3 Take samples of HIV from

Prediction in kernelized output spaces: output kernel trees and ensemble methods Pierre Geurts

Impact of Disclosures WHO 2017 and AJCC 8 th edition Shareholder Five Prime Therapeutics and

Sambuz

Useful Links

Newsletter

Mail Us

CSE182-L9 Modeling Protein domains using HMMs Profiles Revisited - PowerPoint PPT Presentation

CSE182-L9 Modeling Protein domains using HMMs Profiles Revisited Note that profiles are a powerful way of capturing domain information Pr(sequence x| profile P) = Pr(x 1 |P 1 ) P(x 2 |P 2 ) Pr(AGCTTGTA|P) =

CSE182-L11 Protein sequencing and Mass Spectrometry CSE182 Course Summary Gene finding

CSE182-L7 CSE182-L7 Protein structure Basics Protein structure Basics Protein sequencing via MS

CSE182-L13 Mass Spectrometry Quantitation and other applications CSE182 The forbidden pairs

CSE182-L7 Dicitionary matching Pattern matching October 09 CSE182 Dictionary Matching

CSE182-L12 Mass Spectrometry Peptide identification CSE182 General isotope computation

CSE182-L6 P-value and E-value Dicitionary matching Pattern matching October 09 CSE182 Why is

CSE 182-L2:Blast &amp; variants I Dynamic Programming www.cse cse. .ucsd ucsd. .edu

L14 Mass Spec Quantitation MS applications Microarray analysis CSE182 LC-MS Maps Peptide 2 I

CSE182-L10 Gene Finding November 09 HMM fair-coin example 0.6 0.6 1 0.4 0.4 E F (H)=0.5 E L

CSE182-L9 Protein domain analysis via HMMs Gene finding November 09 QUIZ! Question: Your

CSE182-L8 Protein Sequence Analysis Patterns (regular expressions) Profiles HMM Gene Finding

CSE 182: Biological Data Analysis Instructor: Vineet Bafna TA: Ryan Kelley www. www.cse cse.

CSE182-L16 Non-coding RNA Biol. Data analysis: Review Assembly Protein Sequence Sequence

CSE182-L12 LW statistics/Assembly Quiz Who are these people, and what is the occasion?

CSE 182-L2:Blast &amp; variants I Dynamic Programming FA08 CSE182 Notes

CSE182-L5: Scoring matrices Dictionary Matching October 09 CSE 182 Expectation? Some

C LASSIFICATION OF S TRUCTURED O BJECTS I O H H C H H Methanol molecule 3 C LASSIFICATION

M ODEL C HECKING OF B IOCHEMICAL N ETWORKS U SING P ETRI N ETS Ina Koch Monika Heiner Technical

Rational Design of Organelle Compartments in Cells Claudio Angione Nettab 2012 Claudio Angione

Mitochondrial dysfunction and cancer metabolites and beyond Christian Frezza MRC Cancer Unit,

Programming MapReduce in Mathematica Paul-Jean Letourneau Data Scientist, Wolfram Research

} } 3. Take samples of HIV from other HIV positive people from the 3 Take samples of HIV from

Prediction in kernelized output spaces: output kernel trees and ensemble methods Pierre Geurts

Impact of Disclosures WHO 2017 and AJCC 8 th edition Shareholder Five Prime Therapeutics and

Sambuz

Useful Links

Newsletter

Mail Us

CSE 182-L2:Blast & variants I Dynamic Programming www.cse cse. .ucsd ucsd. .edu

CSE 182-L2:Blast & variants I Dynamic Programming FA08 CSE182 Notes