CSE182-L5: Scoring matrices Dictionary Matching October 09 CSE 182

Expectation?
• Some quantities can be reasonably estimated from a statistical sample, others cannot:
  – Average weight of a group of 100 people
  – Average height of a group of 100 people
  – Average grade on a test
• Give an example of a quantity that cannot.
• When the distribution and the expectation are known, it is easy to tell when an observation is significant.
• If the distribution is not well understood, or the wrong distribution is chosen, a wrong conclusion can be drawn.

Scoring proteins
• Scoring protein sequence alignments is a much more complex task than scoring DNA
  – Not all substitutions are equal
• The problem was first worked on by Pauling and collaborators
• In the 1970s, Margaret Dayhoff created the first similarity matrices.
  – “One size does not fit all”
  – Homologous proteins that are evolutionarily close should be scored differently from proteins that are evolutionarily distant
  – Different proteins might evolve at different rates, and we need to normalize for that

Frequency based scoring
[Figure: an alignment column with residues A and B]
• Our goal is to score each column in the alignment
• Comparing against expectation:
  – Think about alignments of pairs of random sequences, and compute the probability $P_R(A,B)$ that A and B appear together just by chance
  – Compute the probability $P_O(A,B)$ of A and B appearing together in the alignment of related sequences (orthologs)
• A good score function? $\log \frac{P_O(A,B)}{P_R(A,B)}$
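The log-odds idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the input format (a list of aligned residue pairs), the function name `log_odds_scores`, the choice of base-10 logarithm, and the toy data are all hypothetical, and the random model simply assumes the two residues are drawn independently from the background frequencies.

```python
import math
from collections import Counter


def log_odds_scores(aligned_pairs):
    """Estimate log-odds column scores from aligned residue pairs.

    aligned_pairs: list of (a, b) tuples, one per column of trusted
    alignments of related sequences (hypothetical input format).
    """
    # Treat columns as unordered: (a, b) and (b, a) are the same observation.
    pair_counts = Counter(tuple(sorted(pair)) for pair in aligned_pairs)
    total_pairs = sum(pair_counts.values())

    # Background residue frequencies, used for the random model P_R.
    residue_counts = Counter(r for pair in aligned_pairs for r in pair)
    total_residues = sum(residue_counts.values())
    freq = {r: c / total_residues for r, c in residue_counts.items()}

    scores = {}
    for (a, b), count in pair_counts.items():
        p_related = count / total_pairs  # P_O(a, b)
        # Chance of the unordered pair under independence.
        p_random = freq[a] * freq[b] * (1 if a == b else 2)  # P_R(a, b)
        scores[(a, b)] = math.log10(p_related / p_random)
    return scores


# Toy data: A and V mostly align with themselves.
pairs = [("A", "A")] * 6 + [("V", "V")] * 2 + [("A", "V")] * 2
scores = log_odds_scores(pairs)
for pair, score in sorted(scores.items()):
    print(pair, round(score, 3))
```

In this toy data the identical pairs come out with positive scores and the A/V pair with a negative one, which is the behavior one would hope for from a score that compares against chance.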

Log-odds scoring
• The log-odds score makes sense.
• It is also sensitive to evolution.
• However, to compute a log-odds score function you need good alignments.
• To get good alignments of sequences, you need a (log-odds) score function.

PAM 1 distance
• Define: Two sequences are 1 PAM apart if they differ in 1% of the residues (1% mismatch).
• PAM1(a,b) = Pr[residue a substitutes residue b, when the sequences are 1 PAM apart]

PAM1 matrix
• Align many proteins that are very similar
  – Is this a problem?
• 1 PAM evolutionary distance represents the time in which 1% of the residues have changed
• Estimate the frequency $P_{b|a}$ of residue a being substituted by residue b.
• PAM1(a,b) = $P_{a|b}$ = Pr(b will mutate to an a after 1 PAM evolutionary distance)
• Scoring matrix
  – $S(a,b) = \log_{10}\frac{P_{ab}}{P_a P_b} = \log_{10}\frac{P_{b|a}}{P_b}$
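The two forms of the scoring formula are linked by the definition of conditional probability. The short derivation below assumes that P_ab is the joint probability of seeing a aligned with b, and that P_a and P_b are the background frequencies of the two residues; the slide does not spell these definitions out.

```latex
\begin{align*}
P_{ab} &= P_{b|a}\,P_a = P_{a|b}\,P_b \\
S(a,b) &= \log_{10}\frac{P_{ab}}{P_a P_b}
        = \log_{10}\frac{P_{b|a}\,P_a}{P_a P_b}
        = \log_{10}\frac{P_{b|a}}{P_b} \\
       &= \log_{10}\frac{P_{a|b}\,P_b}{P_a P_b}
        = \log_{10}\frac{P_{a|b}}{P_a}
\end{align*}
```

Under these assumptions the score is symmetric in a and b, so either conditional form can be used.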

PAM 1
• The top row shows the original residue, and the left column shows the replacement residue: PAM1(a,b) = Pr(a|b)

• For closely related sequences (1 PAM apart), we can make a set of alignments and use them to compute an appropriate evolutionary distance.
• What do we do for higher PAM sequences?

PAM distance
• Two sequences are 1 PAM apart when they differ in 1% of the residues.
• When are 2 sequences 2 PAMs apart?
[Figure labels: 1 PAM, 1 PAM, 2 PAM]

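Generating Higher PAMs
• $\mathrm{PAM}_2(a,b) = \sum_c \mathrm{PAM}_1(a,c) \cdot \mathrm{PAM}_1(c,b)$
• $\mathrm{PAM}_2 = \mathrm{PAM}_1 \times \mathrm{PAM}_1$ (matrix multiplication)
• $\mathrm{PAM}_{250}$
  – $= \mathrm{PAM}_1 \times \mathrm{PAM}_{249}$
  – $= \mathrm{PAM}_1^{250}$
[Figure: PAM2 = PAM1 × PAM1, with row a, column b, and the intermediate residue c marked]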
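The matrix-multiplication rule can be tried out on a small example. The sketch below uses a made-up 3-residue matrix rather than a real 20 × 20 PAM1; the names `mat_mul`, `pam_n`, and `TOY_PAM1` and all numeric values are hypothetical. Entry `[a][b]` is read as Pr(b mutates to a), so each column sums to 1.

```python
def mat_mul(x, y):
    """Multiply two square matrices given as lists of rows."""
    n = len(x)
    return [[sum(x[i][c] * y[c][j] for c in range(n)) for j in range(n)]
            for i in range(n)]


def pam_n(pam1, n):
    """Return PAM1 raised to the n-th power (n >= 1)."""
    result = pam1
    for _ in range(n - 1):
        result = mat_mul(pam1, result)  # PAM_k = PAM_1 * PAM_(k-1)
    return result


# Toy 3-residue PAM1 (hypothetical values). Entry [a][b] = Pr(b mutates to a);
# every column sums to 1 and each residue changes with probability of about 1%.
TOY_PAM1 = [
    [0.990, 0.006, 0.004],
    [0.006, 0.990, 0.004],
    [0.004, 0.004, 0.992],
]

pam2 = pam_n(TOY_PAM1, 2)
pam250 = pam_n(TOY_PAM1, 250)

# PAM2(a, b) sums over every intermediate residue c.
print(pam2[0][1])
# The probability of keeping the original residue shrinks as the distance grows.
print(pam2[0][0], pam250[0][0])
```

Plain repeated multiplication is used here because it mirrors the slide's recurrence directly; for large powers, repeated squaring would need fewer multiplications.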

Note: This is not the score matrix. What happens as you keep increasing the power?
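One way to explore the question is to raise a small substitution matrix to a very high power and look at the result. The sketch below again uses a hypothetical 3-residue matrix (values and names are assumptions, not taken from the lecture). In this toy model the matrix gradually forgets the original residue: every column approaches the same vector of residue frequencies, which is uniform here only because the toy matrix is symmetric.

```python
def mat_mul(x, y):
    """Multiply two square matrices given as lists of rows."""
    n = len(x)
    return [[sum(x[i][c] * y[c][j] for c in range(n)) for j in range(n)]
            for i in range(n)]


def repeated_square(m, times):
    """Square the matrix `times` times, giving m ** (2 ** times)."""
    for _ in range(times):
        m = mat_mul(m, m)
    return m


# Toy 3-residue PAM1 (hypothetical values); entry [a][b] = Pr(b mutates to a).
TOY_PAM1 = [
    [0.990, 0.006, 0.004],
    [0.006, 0.990, 0.004],
    [0.004, 0.004, 0.992],
]

high_power = repeated_square(TOY_PAM1, 12)  # PAM1 ** 4096

# Within each row, the entries for different original residues b become equal:
# the probability of ending at residue a no longer depends on where we started.
spread = max(max(row) - min(row) for row in high_power)
for row in high_power:
    print([round(p, 6) for p in row])
print("largest difference within a row:", spread)
```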

Scoring using PAM matrices
• Suppose we know that two sequences are 250 PAMs apart.
• $S(a,b) = \log_{10}\frac{P_{ab}}{P_a P_b} = \log_{10}\frac{P_{a|b}}{P_a} = \log_{10}\frac{\mathrm{PAM}_{250}(a,b)}{P_a}$
• How does it help?
  – $S_{250}(A,V) \gg S_1(A,V)$
  – Scoring of hum vs. dros should use a higher PAM matrix than scoring hum vs. mus.
  – An alignment with a smaller % identity could still have a higher score and be more significant
[Figure labels: hum, mus, dros]
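The effect of the PAM distance on the score of a mismatch can be illustrated with a toy calculation. Everything below is an assumption made for illustration: a 3-residue alphabet labelled A, V, L, a made-up symmetric PAM1, and uniform background frequencies (which match this symmetric matrix). The helper names `mat_mul` and `score_matrix` are hypothetical.

```python
import math


def mat_mul(x, y):
    """Multiply two square matrices given as lists of rows."""
    n = len(x)
    return [[sum(x[i][c] * y[c][j] for c in range(n)) for j in range(n)]
            for i in range(n)]


def score_matrix(pam, background):
    """S(a, b) = log10(PAM_n(a, b) / P_a) for every residue pair."""
    n = len(pam)
    return [[math.log10(pam[a][b] / background[a]) for b in range(n)]
            for a in range(n)]


RESIDUES = ["A", "V", "L"]  # illustrative labels only
TOY_PAM1 = [                # hypothetical values; entry [a][b] = Pr(a | b)
    [0.990, 0.006, 0.004],
    [0.006, 0.990, 0.004],
    [0.004, 0.004, 0.992],
]
BACKGROUND = [1 / 3, 1 / 3, 1 / 3]  # assumed residue frequencies P_a

# PAM250 = PAM1 multiplied by itself 250 times.
pam250 = TOY_PAM1
for _ in range(249):
    pam250 = mat_mul(TOY_PAM1, pam250)

s1 = score_matrix(TOY_PAM1, BACKGROUND)
s250 = score_matrix(pam250, BACKGROUND)

A, V = RESIDUES.index("A"), RESIDUES.index("V")
print("S1(A,V)   =", s1[A][V])
print("S250(A,V) =", s250[A][V])
print("S1(A,A)   =", s1[A][A])
print("S250(A,A) =", s250[A][A])
```

In this toy setting the A/V mismatch is penalized heavily at 1 PAM and only mildly at 250 PAM, while the reward for an A/A match shrinks, which is consistent with using a higher PAM matrix for more distant sequences.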

PAM250 based scoring matrix
• $S_{250}(a,b) = \log_{10}\frac{P_{ab}}{P_a P_b} = \log_{10}\frac{\mathrm{PAM}_{250}(a,b)}{P_a}$

BLOSUM series of Matrices
• Henikoff & Henikoff: Sequence substitutions in evolutionarily distant proteins do not seem to follow the PAM distributions
• A more direct method based on hand-curated multiple alignments of distantly related proteins from the BLOCKS database.
• BLOSUM60: Merge all proteins that have greater than 60% identity. Then, compute the substitution probabilities.
  – In practice BLOSUM62 seems to work very well.

PAM vs. BLOSUM
• What is the correspondence?
• PAM1 Blosum1
• PAM2 Blosum2
• Blosum62
• PAM250 Blosum100

P-value computation
• BLAST: The matching regions are expanded into alignments, which are scored using Smith-Waterman (SW) and an appropriate scoring matrix.
• The results are presented in order of decreasing score.
• The score is just a number.
• How significant is the top-scoring hit if it has a score S?
• Expect/E-value (score S) = number of times we would expect to see a random query generate a score of S or better
• How can we compute the E-value?
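One simple way to approach the question is empirical: score many random comparisons, measure how often they reach the score S, and scale by the number of comparisons a search performs. The sketch below follows that idea only; it is not presented as the method BLAST itself uses, and the function name, the `num_comparisons` parameter, and the sample scores are hypothetical.

```python
def empirical_e_value(observed_score, random_scores, num_comparisons):
    """Expected number of random hits scoring at least `observed_score`.

    random_scores: scores obtained from comparisons of unrelated (random)
    sequences, used as a sample of the chance score distribution.
    num_comparisons: how many independent comparisons the search makes
    (an assumed stand-in for database size).
    """
    at_least = sum(1 for s in random_scores if s >= observed_score)
    tail_probability = at_least / len(random_scores)  # empirical P(score >= S)
    return num_comparisons * tail_probability


# Hypothetical sample of chance scores.
random_scores = [1, 2, 8, 3, 5, 3, 6, 4, 4, 1, 5, 3, 6, 7]

e_low = empirical_e_value(6, random_scores, num_comparisons=1000)
e_high = empirical_e_value(9, random_scores, num_comparisons=1000)
print(e_low, e_high)
```

A sample this small cannot resolve very rare scores (any score above the sample maximum gets an estimate of 0), which is one reason a known distribution for chance scores is valuable.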

What is a distribution function
• Given a collection of numbers (scores)
  – 1, 2, 8, 3, 5, 3, 6, 4, 4, 1, 5, 3, 6, 7, …
• Plot its distribution as follows:
  – X-axis = each number
  – Y-axis = count/frequency/probability of seeing that number
  – More generally, the x-axis can be a range to accommodate real numbers
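The construction described on this slide can be sketched directly in Python. The scores are the ones listed on the slide (without the elided continuation); the function names `distribution` and `binned_counts` and the real-valued sample are hypothetical.

```python
from collections import Counter


def distribution(scores):
    """Map each distinct score to the fraction of times it was seen."""
    counts = Counter(scores)
    total = len(scores)
    return {value: counts[value] / total for value in sorted(counts)}


def binned_counts(values, bin_width):
    """Count real-valued observations per range [k*w, (k+1)*w)."""
    bins = Counter(int(v // bin_width) for v in values)
    return {(b * bin_width, (b + 1) * bin_width): bins[b] for b in sorted(bins)}


scores = [1, 2, 8, 3, 5, 3, 6, 4, 4, 1, 5, 3, 6, 7]
dist = distribution(scores)

# Text "plot": x-axis value on the left, one '#' per observation.
for value, probability in dist.items():
    print(value, "#" * round(probability * len(scores)))

# For real numbers, the x-axis becomes a set of ranges.
print(binned_counts([0.2, 0.7, 1.1, 2.9], bin_width=1.0))
```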

• End of L5
