Sequence comparison: Significance of similarity scores - PowerPoint PPT Presentation

Sequence comparison: Significance of similarity scores http://faculty.washington.edu/jht/GS559_2013/ Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas

Review • How to compute and use a score matrix. • log-odds of sum-of-pair counts vs. expected counts in aligned blocks. • Why gap scores should be affine.

Are these proteins related? (intuitive answers) SEQ 1: RVVNLVPS--FWVLDATYKNYAINYNCDVTYKLY NO (score = -1) L P L Y N Y C L identities-> SEQ 2: QFFPLMPPAPYFILATDYENLPLVYSCTTFFWLF SEQ 1: RVVNLVPS--FWVLDATYKNYAINYNCDVTYKLY PROBABLY (score = 15) L P W LDATYKNYA Y C L SEQ 2: QFFPLMPPAPYWILDATYKNYALVYSCTTFFWLF SEQ 1: RVVNLVPS--FWVLDATYKNYAINYNCDVTYKLY YES (score = 24) RVV L PS W LDATYKNYA Y CDVTYKL SEQ 2: RVVPLMPSAPYWILDATYKNYALVYSCDVTYKLF

Significance of scores HPDKKAHSIHAWILSKSKVLEGNTKEVVDNVLKT Alignment algorithm and 45 score matrix Low score = unrelated High score = related HADKRAHSIHAWLLSKSKVLGNTKEVVQNVLKS How high is high enough?

The null hypothesis • We first characterize the distribution of scores expected from sequences that are not related. • This assumption is called the null hypothesis. • The statistical test will be to determine whether the observed result provides a reason to reject the null hypothesis.

Sequence alignment score distribution Frequency Sequence alignment score • Use BLAST to search a randomly generated database of sequences using a given query sequence (recall that BLAST searches use DP local alignment). • What will be the form of the resulting distribution of pairwise alignment scores?

Empirical score distribution • Distribution of scores from a real database search using BLAST. Frequency • This distribution contains scores from a few related and lots of unrelated pairs. Score High scores from related sequences (note - there are lots of lower scoring alignments not reported)

Empirical null score distribution 1,685 scores • This distribution is similar to the previous one, but generated using a randomized sequence database (each sequence shuffled). (notice the x scale is shorter here) (note - there are lots of lower scoring alignments not reported)

Computing an empirical p-value • The probability of observing a score >=X is the area under the 'curve' to the right of X. • This probability is called a p-value. • p-value = Pr(data|null) (read as probability of data given a null hypothesis) e.g. out of 1,685 scores, 28 received a score of 20 or better. Thus, the p-value associated with a score of 20 is approximately 28/1685 = 0.0166.

Problems with empirical distributions • We are interested in very small probabilities. • These are computed from the tail of the null distribution. • Estimating a distribution with an accurate tail is feasible but computationally very expensive because we have to make a very large number of alignments.

A solution • Solution: characterize the form of the score distribution mathematically. • Fit the parameters of the distribution empirically (or compute them analytically if possible). • Use the resulting distribution to compute accurate p-values. • First solved by Karlin and Altschul.

Extreme value distribution (EVD) (aka Gumbel Distribution) This distribution is roughly normal near the peak, but has a longer tail on the right.

Computing a p-value • The probability of observing a score >=4 is the area under the curve to the right of 4. • p-value = Pr(data|null)

Unscaled EVD equation Compute this value for x=4. x ( e ) 1 P S x e S is data score, x is test score

Computing a p-value 4 ( e ) 4 1 P S e ( 4) 0.018149 P S

Other comments on probability distributions (FYI) • the PDF (probability density function) is the equation that generates the probability curve. • the CDF (cumulative distribution function) is the equation that describes the total area under the probability curve up to some point (inuitively the "area so far"). • for alignment scores we are interested in the area above some point. But since the total area under the curve is exactly 1, this is just 1 - CDF . • for the unscaled extreme value distribution (Gumbel): x ( e ) x CDF e x ( e ) PDF e e • and we want to compute 1 - CDF : x ( e ) 1 P S x e

Sequence comparison: Significance of similarity scores - PowerPoint PPT Presentation

Sequence comparison: Significance of similarity scores http://faculty.washington.edu/jht/GS559_2013/ Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas Review How to compute and use a score matrix.

Sequence comparison: Significance of similarity scores Genome 559: Introduction to Statistical

Chapter 5: z-Scores : Location of Scores Chapter 5: z-Scores : Location of Scores and Standardized

Sequence comparison: Sequence comparison: Significance of alignment scores

Sequence Comparison: Significance of similarity scores Genome 373 Genomic Informatics Elhanan

Sequence comparison: Significance of similarity scores Genome 559: Introduction to Statistical

Sequence comparison: Significance of similarity scores Genome 559: Introduction to Statistical

Protein Sequence Analysis Protein Sequence Analysis Protein sequence motifs Protein sequence

Sequence to Sequence models: Attention Models 1 Sequence-to-sequence modelling Problem:

Sequence to Sequence models: Attention Models 1 Sequence-to-sequence modelling Problem:

Sequence to Sequence models: Connectionist Temporal Classification 1 Sequence-to-sequence

Semantic Similarity MultiJEDI ERC 259234 Semantic Similarity Semantic Similarity Mostly

CSE 427 Computational Biology Autumn 2015 3: BLAST, Alignment score significance 1 Significance

Greenhouse Gas CEQA Greenhouse Gas CEQA Significance Threshold Significance Threshold

A quick review Significance of similarity scores (P-values) Empirical null score

SEQUENCE ANALYSIS The term " sequence analysis " in biology implies subjecting a DNA or

CSE 427 Comp Bio Sequence Alignment 1 Sequence Alignment What Why A Dynamic Programming

Learning Deep Architectures Yoshua Bengio, U. Montreal CIFAR NCAP Summer School 2009 August 6th,

Approximate Bayesian Computation using Auxiliary Models Tony Pettitt Co-authors Chris Drovandi,

Adaptive HMC via the Infinite Exponential Family Arthur Gretton Gatsby Unit, CSML, University

Improved Measurements of the -Decay Response of Liquid Xenon with the LUX Detector Jon Balajthy

Lecture 11. 100 years events - extreme loads Igor Rychlik Chalmers Department of Mathematical

Direct Optimization CSC2547 Adamo Young, Dami Choi, Sepehr Abbasi Zadeh Direct Optimization

studying the extreme rainfalls in Mediterranean Georgia Lazoglou : PhD student of Climatology, in

Multifractality and extreme value statistics O. Giraud LPTMS - CNRS and Universit Paris Sud,

Sequence comparison: Significance of similarity scores - PowerPoint PPT Presentation

Sequence comparison: Significance of similarity scores http://faculty.washington.edu/jht/GS559_2013/ Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas Review How to compute and use a score matrix.

Sequence comparison: Significance of similarity scores Genome 559: Introduction to Statistical

Chapter 5: z-Scores : Location of Scores Chapter 5: z-Scores : Location of Scores and Standardized

Sequence comparison: Sequence comparison: Significance of alignment scores

Sequence Comparison: Significance of similarity scores Genome 373 Genomic Informatics Elhanan

Sequence comparison: Significance of similarity scores Genome 559: Introduction to Statistical

Sequence comparison: Significance of similarity scores Genome 559: Introduction to Statistical

Protein Sequence Analysis Protein Sequence Analysis Protein sequence motifs Protein sequence

Sequence to Sequence models: Attention Models 1 Sequence-to-sequence modelling Problem:

Sequence to Sequence models: Attention Models 1 Sequence-to-sequence modelling Problem:

Sequence to Sequence models: Connectionist Temporal Classification 1 Sequence-to-sequence

Semantic Similarity MultiJEDI ERC 259234 Semantic Similarity Semantic Similarity Mostly

CSE 427 Computational Biology Autumn 2015 3: BLAST, Alignment score significance 1 Significance

Greenhouse Gas CEQA Greenhouse Gas CEQA Significance Threshold Significance Threshold

A quick review Significance of similarity scores (P-values) Empirical null score

SEQUENCE ANALYSIS The term &quot; sequence analysis &quot; in biology implies subjecting a DNA or

CSE 427 Comp Bio Sequence Alignment 1 Sequence Alignment What Why A Dynamic Programming

Learning Deep Architectures Yoshua Bengio, U. Montreal CIFAR NCAP Summer School 2009 August 6th,

Approximate Bayesian Computation using Auxiliary Models Tony Pettitt Co-authors Chris Drovandi,

Adaptive HMC via the Infinite Exponential Family Arthur Gretton Gatsby Unit, CSML, University

Improved Measurements of the -Decay Response of Liquid Xenon with the LUX Detector Jon Balajthy

Lecture 11. 100 years events - extreme loads Igor Rychlik Chalmers Department of Mathematical

Direct Optimization CSC2547 Adamo Young, Dami Choi, Sepehr Abbasi Zadeh Direct Optimization

studying the extreme rainfalls in Mediterranean Georgia Lazoglou : PhD student of Climatology, in

Multifractality and extreme value statistics O. Giraud LPTMS - CNRS and Universit Paris Sud,

SEQUENCE ANALYSIS The term " sequence analysis " in biology implies subjecting a DNA or