CSE182-L8 Protein Sequence Analysis Patterns (regular expressions) - PowerPoint PPT Presentation

CSE182-L8 Protein Sequence Analysis Patterns (regular expressions) Profiles HMM Gene Finding November 09 CSE 182

QUIZ!
• Question: your ‘friend’ likes to gamble.
• He tosses a coin: HEADS, he gives you a dollar. TAILS, you give him a dollar.
• Usually, he uses a fair coin, but ‘once in a while’, he uses a loaded coin.
• Can you say if he is cheating?
• What fraction of the time does he load the coin?

Regular expressions as motifs
• What is a regular expression?
• Given a regular expression pattern and a database, find all sequences that match the pattern.
• Given a sequence as query, and a database of r.e. patterns, find all of the patterns that occur in the sequence.
• http://ca.expasy.org/prosite/

Regular Expressions
• Concise representation of a set of strings over alphabet Σ.
• Described by a string over Σ ∪ {·, ∗, +}.
• R is a r.e. if and only if one of the following holds:
– $R = \{\varepsilon\}$ (base case)
– $R = \{\sigma\}$, $\sigma \in \Sigma$
– $R = R_1 + R_2$ (union of strings)
– $R = R_1 \cdot R_2$ (concatenation)
– $R = R_1^*$ (0 or more repetitions)
Fa 07 CSE182

Regular Expression
• Q: Let Σ = {A,C,E}
– Is (A+C)*EEC* a regular expression?
– *(A+C)?
– AC*..E?
• Q: When is a string s in a regular expression?
– R = (A+C)*EEC*
– Is CEEC in R?
– AEC?
– ACEE?
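
The membership question above can be checked mechanically. The following is a minimal sketch, assuming Python's `re` syntax as a stand-in for the slide notation: the union (A+C) is written as the character class `[AC]`, and `fullmatch` is used because membership requires the whole string to match. The function name `in_language` is hypothetical.

```python
import re

# (A+C)*EEC* rewritten in Python regex syntax (assumption: [AC] stands for A+C).
PATTERN = re.compile(r"[AC]*EEC*")

def in_language(s: str) -> bool:
    """Return True if the entire string s belongs to the set described by R."""
    return PATTERN.fullmatch(s) is not None

for candidate in ["CEEC", "AEC", "ACEE"]:
    print(candidate, in_language(candidate))
```

CEEC and ACEE are accepted; AEC is rejected because it contains only one E.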

Regular Expression & Automata
• Every R.E. can be expressed by an automaton (a directed graph) with the following properties:
– The automaton has a start and end node
– Each edge is labeled with a symbol from Σ, or ε
• Suppose R is described by automaton A
• s ∈ R if and only if there is a path from start to end in A, labeled with s.

Examples: Regular Expression & Automata
• (A+C)*EEC*
[Figure: automaton for (A+C)*EEC*, with start and end nodes and edges labeled A, C, E, E, C]

Constructing automata from R.E.
• $R = \{\varepsilon\}$
• $R = \{\sigma\}$, $\sigma \in \Sigma$
• $R = R_1 + R_2$
• $R = R_1 \cdot R_2$
• $R = R_1^*$

Matching Regular expressions
• A string s belongs to R if and only if there is a path from START to END in R_A, labeled by s.
• Given a regular expression R (automaton R_A), and a database D, is there a string D[b..c] that matches R_A (D[b..c] ∈ R)?
• Simpler Q: Is D[1..c] accepted by the automaton of R?

Alg. For matching R.E.
• If D[1..c] is accepted by the automaton R_A
– There is a path labeled D[1]…D[c] that goes from START to END in R_A
[Figure: path from START to END with edges labeled D[1], D[2], …, D[c]]

Alg. For matching R.E.
• If D[1..c] is accepted by the automaton R_A
– There is a path labeled D[1]…D[c] that goes from START to END in R_A
– There is a path labeled D[1]..D[c-1] from START to node u, and a path labeled D[c] from u to the END
[Figure: path labeled D[1] .. D[c-1] into node u, then an edge labeled D[c]]

D.P. to match regular expression
• Define:
– A[u,σ] = Automaton node reached from u after reading σ
– Eps(u): set of all nodes reachable from node u using only ε-transitions.
– N[c] = subset of nodes reachable from START node after reading D[1..c]
– Q: when is v ∈ N[c]?

D.P. to match regular expression
• Q: when is v ∈ N[c]?
• A: If for some u ∈ N[c-1], with w = A[u,D[c]], we have v ∈ {w} ∪ Eps(w)

Algorithm
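
The recurrence for N[c] can be sketched as a set-based simulation of the automaton. This is an illustrative sketch, not the slide's own pseudocode: the dictionary encoding (`delta` for labeled edges, `eps` for ε-edges), the function names, and the hand-built automaton for (A+C)*EEC* with node numbers 0 to 3 are all assumptions made for the example.

```python
def eps_closure(nodes, eps):
    """All nodes reachable from `nodes` using zero or more epsilon edges."""
    seen, stack = set(nodes), list(nodes)
    while stack:
        u = stack.pop()
        for v in eps.get(u, ()):
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen

def reachable_sets(D, delta, eps, start):
    """Return the list N, where N[c] holds the nodes reachable after reading D[1..c]."""
    N = [eps_closure({start}, eps)]
    for symbol in D:
        stepped = set()
        for u in N[-1]:
            stepped |= delta.get((u, symbol), set())   # w = A[u, D[c]]
        N.append(eps_closure(stepped, eps))            # v in {w} or Eps(w)
    return N

def accepts(D, delta, eps, start, end):
    """D[1..c] (the whole of D) is accepted if END is in the last reachable set."""
    return end in reachable_sets(D, delta, eps, start)[-1]

# Hypothetical automaton for (A+C)*EEC*: node 0 is START, node 3 is END.
delta = {(0, "A"): {0}, (0, "C"): {0}, (0, "E"): {1}, (1, "E"): {2}, (3, "C"): {3}}
eps = {2: {3}}

print(accepts("CEEC", delta, eps, 0, 3), accepts("AEC", delta, eps, 0, 3))
```

Each step touches every node in N[c-1] once, so the running time is proportional to the length of D times the size of the automaton.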

The final step
• We have answered the question:
– Is D[1..c] accepted by R?
– Yes, if END ∈ N[c]
• We need to answer
– Is D[l..c] (for some l, and some c) accepted by R?
– For some l, $D[l..c] \in R \Leftrightarrow D[1..c] \in \Sigma^* R$
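
One way to read the Σ*R trick is that the START node stays active at every position, so a match may begin anywhere in D. The sketch below assumes an automaton without ε-edges to keep it short; the function name, the dictionary encoding, and the three-node automaton for (A+C)*EEC* are hypothetical.

```python
def match_end_positions(D, delta, start, end):
    """Return every 1-indexed c such that some substring D[l..c] is accepted."""
    ends = []
    active = {start}
    for c, symbol in enumerate(D, start=1):
        # Advance every active node that has an edge labeled with this symbol.
        active = {delta[(u, symbol)] for u in active if (u, symbol) in delta}
        if end in active:
            ends.append(c)
        # Sigma* prefix: a new match is allowed to start at the next position.
        active.add(start)
    return ends

# Hypothetical deterministic automaton for (A+C)*EEC*: node 0 is START, node 2 is END.
delta = {(0, "A"): 0, (0, "C"): 0, (0, "E"): 1, (1, "E"): 2, (2, "C"): 2}

print(match_end_positions("AEACEECA", delta, 0, 2))
```

For the toy database string AEACEECA, the reported end positions correspond to the substrings CEE and CEEC.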

Regular expressions as Protein sequence motifs
• C-X-[DE]-X{10,12}-C-X-C--[STYLV]
[Figure: Fam(B), with labels A, C, E, F]
• Problem: if there is a mismatch, the sequence is not accepted.
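
A motif written in this style can be translated into an ordinary regular expression and searched with standard tools. The sketch below is an assumption-laden illustration: it handles only the notation that appears on the slide (X for any residue, bracketed alternatives, and {m,n} repeats), it treats the doubled dash as a single separator, and the function name and test sequences are made up.

```python
import re

MOTIF = "C-X-[DE]-X{10,12}-C-X-C--[STYLV]"

def motif_to_regex(motif: str) -> str:
    """Translate the slide's dash-separated motif notation into Python regex syntax."""
    pieces = []
    for element in motif.split("-"):
        if not element:
            continue                      # assumption: "--" is read as a single "-"
        if element.startswith("X"):
            pieces.append("." + element[1:])   # X -> any residue, keep {m,n} if present
        else:
            pieces.append(element)             # literal residue or [..] alternatives
    return "".join(pieces)

regex = re.compile(motif_to_regex(MOTIF))

# Made-up sequences: the second differs from the first in one residue (D -> K).
hit = "MK" + "CAD" + "G" * 10 + "CACS" + "LL"
miss = "MK" + "CAK" + "G" * 10 + "CACS" + "LL"
print(motif_to_regex(MOTIF), bool(regex.search(hit)), bool(regex.search(miss)))
```

The second sequence shows the problem named on the slide: a single mismatch at the [DE] position is enough for the pattern to reject it.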

Representation 2: Profiles
• Profiles versus regular expressions
– Regular expressions are intolerant to an occasional mismatch.
– The Union operation (I+V+L) does not quantify the relative importance of I, V, L. It could be that V occurs in 80% of the family members.
– Profiles capture some of these ideas.

Profiles
• Start with an alignment of strings of length m, over an alphabet A.
• Build an |A| × m matrix $F = (f_{ki})$.
• Each entry $f_{ki}$ represents the frequency of symbol k in position i.
[Figure: example alignment with column frequencies 0.71, 0.14, 0.28, 0.14]
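
Building F from an alignment amounts to counting symbols column by column. A minimal sketch follows, assuming a gapless alignment given as equal-length strings; the function name, the dictionary layout (symbol to list of per-column frequencies), and the four toy sequences are illustrative and not taken from the slides.

```python
from collections import Counter

def build_profile(alignment, alphabet):
    """Return {symbol k: [f_k0, f_k1, ...]} with f_ki = frequency of k in column i."""
    m = len(alignment[0])
    if any(len(row) != m for row in alignment):
        raise ValueError("all aligned strings must have the same length m")
    n = len(alignment)
    profile = {k: [0.0] * m for k in alphabet}
    for i in range(m):
        counts = Counter(row[i] for row in alignment)
        for k in alphabet:
            profile[k][i] = counts[k] / n   # Counter returns 0 for absent symbols
    return profile

# Toy gapless alignment (made-up data).
toy_alignment = ["ALIL", "AIVL", "AILL", "AIVL"]
toy_profile = build_profile(toy_alignment, "AILV")
print(toy_profile["I"], toy_profile["V"])
```

Each column of the resulting matrix sums to 1 as long as every aligned symbol is in the alphabet.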

Scoring matrices
• Given a sequence s, does it belong to the family described by a profile?
• We align the sequence to the profile, and score it.
• Let S(i,j) be the score of aligning position i of the profile to residue s_j.
• The score of an alignment is the sum of column scores.

Scoring Profiles
• $S(i,j) = \sum_k f_{ki}\, M[r_k, s_j]$
• M is the scoring matrix, and r_k is the residue indexed by k.
[Figure: profile column i with entries f_ki, scored against residue s_j]
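
The column score is a frequency-weighted average of scoring-matrix entries. The sketch below assumes a profile stored as a dictionary from symbol to per-column frequencies and a toy scoring matrix with +1 for a match and -1 for a mismatch; a real application would use a substitution matrix instead. All names and numbers are hypothetical.

```python
def column_score(profile, i, residue, M):
    """S(i, j) = sum over symbols k of f_ki * M[k, s_j], with residue = s_j."""
    return sum(freqs[i] * M[(k, residue)] for k, freqs in profile.items())

def ungapped_score(profile, seq, M):
    """Score of aligning seq to the profile without gaps: the sum of column scores."""
    m = len(next(iter(profile.values())))
    if len(seq) != m:
        raise ValueError("an ungapped alignment needs len(seq) == profile length")
    return sum(column_score(profile, i, seq[i], M) for i in range(m))

# Toy profile over the alphabet {A, I, L} with m = 2 columns (made-up numbers).
toy_profile = {"A": [1.0, 0.0], "I": [0.0, 0.75], "L": [0.0, 0.25]}
# Toy scoring matrix (assumption): +1 for identical residues, -1 otherwise.
toy_M = {(a, b): (1 if a == b else -1) for a in "AIL" for b in "AIL"}

print(ungapped_score(toy_profile, "AI", toy_M), ungapped_score(toy_profile, "AL", toy_M))
```

Because the second column is mostly I, the sequence AI scores higher than AL, yet AL is penalized rather than rejected outright.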

Domain analysis via profiles
• Given a database of profiles of known domains/families, we can query our sequence against each of them, and choose the high scoring ones to functionally characterize our sequences.
• What if the sequence matches some other sequences weakly (using BLAST), but does not match any known profile?

Psi-BLAST idea
• Iterate:
– Find homologs using Blast on query
– Discard very similar homologs
– Align, make a profile, search with profile.
– Why is this more sensitive?
[Figure: query Seq searched against Db. In the next iteration, the red sequence will be thrown out; it matches the query in non-essential residues.]

Psi-BLAST speed
• Two time consuming steps.
1. Multiple alignment of homologs
2. Searching with Profiles.
• Multiple alignment:
– Use ungapped multiple alignments only
• Searching with profiles: does the keyword search idea work?
• Pigeonhole principle again:
– If profile of length m must score >= T
– Then, a sub-profile of length l must score >= lT/m
– Generate all l-mers that score at least lT/m
– Search using an automaton
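
The pigeonhole step can be spelled out as a short argument. It is a sketch under stated assumptions that the slide does not make explicit: the alignment is ungapped, the total score is the sum of column scores, and l divides m so the profile splits into m/l consecutive windows W_1, ..., W_{m/l} of length l.

```latex
% Assumptions: ungapped alignment, additive column scores, and l divides m.
\[
\mathrm{score}(\text{profile}) \;=\; \sum_{w=1}^{m/l} \mathrm{score}(W_w) \;\ge\; T .
\]
\[
\text{If } \mathrm{score}(W_w) < \frac{lT}{m} \text{ for every } w, \text{ then }
\sum_{w=1}^{m/l} \mathrm{score}(W_w) \;<\; \frac{m}{l}\cdot\frac{lT}{m} \;=\; T ,
\]
\[
\text{a contradiction, so at least one window satisfies }
\mathrm{score}(W_w) \ge \frac{lT}{m} .
\]
```

This is why it suffices to index the l-mers that reach the reduced threshold and extend only around their hits.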

Representation 3: HMMs
• Building good profiles relies upon good alignments.
– Difficult if there are gaps in the alignment.
– Psi-BLAST/BLOCKS etc. work with gapless alignments.
• An HMM representation of Profiles helps put the alignment construction/membership query in a uniform framework.
• Also allows for position specific gap scoring.

QUIZ!
• Question: your ‘friend’ likes to gamble.
• He tosses a coin: HEADS, he gives you a dollar. TAILS, you give him a dollar.
• Usually, he uses a fair coin, but ‘once in a while’, he uses a loaded coin.
• Can you say what fraction of the time he loads the coin?

The generative model
• Think of each column in the alignment as generating a distribution.
• For each column, build a node that outputs a residue with the appropriate distribution.
[Figure: column node with Pr[F]=0.71, Pr[Y]=0.14]

A simple Profile HMM
• Connect nodes for each column into a chain. This chain generates random sequences.
• What is the probability of generating FKVVGQVILD?
• In this representation
– Prob[New sequence S belongs to a family] = Prob[HMM generates sequence S]
• What is the difference with Profiles?
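
For a gap-free chain with one node per column, the probability of generating a sequence is the product of the per-column emission probabilities. The sketch below assumes that reading; apart from Pr[F]=0.71 and Pr[Y]=0.14, which appear on the earlier slide, the emission values and the two-column chain are made up for illustration.

```python
def chain_probability(seq, emissions):
    """Probability that a chain of column nodes emits seq, one residue per node."""
    if len(seq) != len(emissions):
        return 0.0          # a gap-free chain only generates sequences of its own length
    p = 1.0
    for residue, column in zip(seq, emissions):
        p *= column.get(residue, 0.0)
    return p

# Toy two-column chain; values other than F and Y in column 1 are assumptions.
toy_emissions = [
    {"F": 0.71, "Y": 0.14, "L": 0.15},
    {"K": 0.5, "R": 0.5},
]

print(chain_probability("FK", toy_emissions), chain_probability("WK", toy_emissions))
```

A residue never seen in a column gives probability zero here, which is one reason practical models smooth the column distributions.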

Profile HMMs can handle gaps
• The match states are the same as on the previous page.
• Insertion and deletion states help introduce gaps.
• A sequence may be generated using different paths.

Example
A L - L
A I V L
A I - L
• Probability [ALIL] is part of the family?
• Note that multiple paths can generate this sequence.
– M_1 I_1 M_2 M_3
– M_1 M_2 I_2 M_3
• In order to compute the probabilities, we must assign probabilities of transition between states

Profile HMMs
• Directed Automaton M with nodes and edges.
– Nodes emit symbols according to ‘emission probabilities’
– Transition from node to node is guided by ‘transition probabilities’
• Joint probability of seeing a sequence S, and path P
– $\Pr[S,P \mid M] = \Pr[S \mid P,M]\,\Pr[P \mid M]$
– $\Pr[\text{ALIL and } M_1 I_1 M_2 M_3 \mid M] = \Pr[\text{ALIL} \mid M_1 I_1 M_2 M_3, M]\,\Pr[M_1 I_1 M_2 M_3 \mid M]$
• $\Pr[\text{ALIL} \mid M] = ?$

Formally
• The emitted sequence is $S = S_1 S_2 \ldots S_m$
• The path traversed is $P_1 P_2 P_3 \ldots$
• $e_{P_j}(s)$ = emission probability of symbol s in state $P_j$
• Transition probability T[j,k]: probability of transitioning from state j to state k.
• $\Pr(P,S \mid M) = e_{P_1}(S_1)\, T[P_1,P_2]\, e_{P_2}(S_2) \cdots$
• What is $\Pr(S \mid M)$?
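
The product formula for Pr(P,S|M) can be evaluated directly for a given path. The sketch below assumes every state on the path emits exactly one symbol (so it covers match and insert states but not silent deletion states), and all emission and transition values are invented for the ALIL example; only the state names and the two paths come from the slides.

```python
def joint_probability(seq, path, emit, trans):
    """Pr(P, S | M) = e_P1(S1) * T[P1,P2] * e_P2(S2) * ... for an all-emitting path."""
    if len(seq) != len(path):
        raise ValueError("each state on the path must emit exactly one symbol")
    p = 1.0
    previous = None
    for state, symbol in zip(path, seq):
        if previous is not None:
            p *= trans.get((previous, state), 0.0)   # T[P_{j-1}, P_j]
        p *= emit[state].get(symbol, 0.0)            # e_{P_j}(S_j)
        previous = state
    return p

# Hypothetical parameters for a three-column profile HMM with two insert states.
uniform = {"A": 0.25, "I": 0.25, "L": 0.25, "V": 0.25}
emit = {"M1": {"A": 1.0}, "I1": uniform, "M2": {"I": 0.5, "L": 0.5},
        "I2": uniform, "M3": {"L": 1.0}}
trans = {("M1", "I1"): 0.2, ("M1", "M2"): 0.8, ("I1", "M2"): 1.0,
         ("M2", "I2"): 0.2, ("M2", "M3"): 0.8, ("I2", "M3"): 1.0}

print(joint_probability("ALIL", ["M1", "I1", "M2", "M3"], emit, trans))
print(joint_probability("ALIL", ["M1", "M2", "I2", "M3"], emit, trans))
```

Both paths generate ALIL with nonzero probability, which is why Pr(S|M) needs a rule for combining paths.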

Two solutions
• An unknown (hidden) path is traversed to produce (emit) the sequence S.
• The probability that M emits S can be either
– The sum of the joint probabilities over all paths.
• $\Pr(S \mid M) = \sum_P \Pr(S,P \mid M)$
– OR, the probability of the most likely path
• $\Pr(S \mid M) = \max_P \Pr(S,P \mid M)$
• Both are appropriate ways to model, and have similar algorithms to solve them.
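
The two definitions lead to the same dynamic program with a different combining operation: summing gives what is usually called the forward algorithm, and taking the maximum gives the Viterbi score. The sketch below illustrates this on a two-state fair/loaded coin model in the spirit of the quiz. The initial distribution `init` is an added assumption (the slide's formula has no start term), and every number in the model is made up.

```python
def sequence_probability(seq, states, init, emit, trans, combine):
    """combine=sum: total over all paths; combine=max: most likely single path."""
    if not seq:
        raise ValueError("seq must be non-empty")
    # score[k] combines all paths that emit the symbols read so far and end in state k.
    score = {k: init.get(k, 0.0) * emit[k].get(seq[0], 0.0) for k in states}
    for symbol in seq[1:]:
        score = {
            k: emit[k].get(symbol, 0.0)
               * combine(score[j] * trans.get((j, k), 0.0) for j in states)
            for k in states
        }
    return combine(score.values())

# Hypothetical coin model: F = fair coin, L = loaded coin.
states = ["F", "L"]
init = {"F": 0.5, "L": 0.5}
emit = {"F": {"H": 0.5, "T": 0.5}, "L": {"H": 0.9, "T": 0.1}}
trans = {("F", "F"): 0.9, ("F", "L"): 0.1, ("L", "L"): 0.9, ("L", "F"): 0.1}

total = sequence_probability("HH", states, init, emit, trans, sum)
best = sequence_probability("HH", states, init, emit, trans, max)
print(total, best)
```

The maximum over paths can never exceed the sum over paths, since it is one of the terms being summed.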
