
Sequence comparison: Significance of alignment scores

Sequence comparison: Significance of alignment scores. http://faculty.washington.edu/jht/GS559_2014/ Genome 559: Introduction to Statistical and Computational Genomics. Prof. James H. Thomas


  1. Sequence comparison: Significance of alignment scores. http://faculty.washington.edu/jht/GS559_2014/ Genome 559: Introduction to Statistical and Computational Genomics. Prof. James H. Thomas

  2. Unscaled EVD equation
     $$P(S \ge x) = 1 - e^{-e^{-x}}$$
     S is the data score and x is the test score. (FYI: this is 1 minus the cumulative distribution function, or CDF.)
     [Figure: unscaled EVD curve, with its peak centered on 0 and its characteristic width marked.]

  3. Scaling the EVD
     • An EVD derived from, e.g., the Smith-Waterman algorithm with a given substitution matrix and gap penalties has a characteristic mode μ and scale (width) parameter λ.
     Scaled: $$P(S \ge x) = 1 - e^{-e^{-\lambda (x - \mu)}}$$
     μ and λ depend on the substitution matrix and the gap penalties.
     [Figure: several EVD curves; notice that the mode and width of the curves are different.]
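As a minimal sketch, the scaled EVD tail probability can be written as a small Python function. The name `evd_pvalue` and its default arguments are assumptions made for illustration; with μ = 0 and λ = 1 the function reduces to the unscaled form.

```python
import math

def evd_pvalue(x, mu=0.0, lam=1.0):
    """Return P(S >= x) = 1 - exp(-exp(-lam * (x - mu)))."""
    # -expm1(-t) evaluates 1 - exp(-t) without losing precision when t is tiny,
    # which is the usual situation for high alignment scores.
    return -math.expm1(-math.exp(-lam * (x - mu)))

# Unscaled EVD evaluated at its mode (x = 0).
print(evd_pvalue(0.0))
# A hypothetical score of 30 under hypothetical parameters mu = 25, lam = 0.693.
print(evd_pvalue(30.0, mu=25.0, lam=0.693))
```

Using `math.expm1` rather than `1 - math.exp(...)` is a design choice: for very small p-values the direct subtraction would lose most of its significant digits.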

  4. Similar to scaling the standard normal
     Standard normal: $$\mathrm{PDF}_{\text{snormal}} = C e^{-x^{2}/2}, \quad \text{where } C = \frac{1}{\sqrt{2\pi}}$$
     General normal: $$\mathrm{PDF}_{\text{gnormal}} = C e^{-(x - \mu)^{2}/(2v)}, \quad \text{where } C = \frac{1}{\sqrt{2\pi v}}$$
     v is the variance and μ is the mean (μ moves the peak and v adjusts the width).
     PDF = probability density function
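The same idea can be checked numerically. The sketch below implements the general normal PDF with a hypothetical helper `normal_pdf`, then compares it with Python's standard-library `statistics.NormalDist`; the test values are arbitrary.

```python
import math
from statistics import NormalDist

def normal_pdf(x, mu=0.0, v=1.0):
    """General normal PDF with mean mu and variance v."""
    c = 1.0 / math.sqrt(2.0 * math.pi * v)
    return c * math.exp(-((x - mu) ** 2) / (2.0 * v))

# With the defaults (mu = 0, v = 1) this is the standard normal PDF.
print(normal_pdf(0.0))
# Shifting the mean moves the peak; the variance adjusts the width.
# NormalDist takes the standard deviation, i.e. sqrt(v) = 3 here.
print(normal_pdf(1.0, mu=2.0, v=9.0), NormalDist(mu=2.0, sigma=3.0).pdf(1.0))
```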

  5. An example
     You run BLAST and get a maximum match score of 45. You then run BLAST on a shuffled version of the database, and fit an EVD to the resulting empirical distribution. The parameters of the EVD are μ = 25 and λ = 0.693. What is the p-value associated with score 45?
     $$\begin{aligned}
     P(S \ge 45) &= 1 - e^{-e^{-0.693\,(45 - 25)}} \\
                 &= 1 - e^{-e^{-13.86}} \\
                 &= 1 - e^{-9.565 \times 10^{-7}} \\
                 &= 1 - 0.999999043 \\
                 &= 9.565 \times 10^{-7}
     \end{aligned}$$
     BLAST has precomputed values of μ and λ for common matrices and gap penalties.
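The example above can be sketched end to end in Python. Everything here is an illustrative assumption rather than what BLAST does: the function names are hypothetical, the "shuffled database" scores are simulated by drawing from an EVD with known parameters, and the fit uses the method of moments (the slide does not say which fitting procedure was used).

```python
import math
import random
import statistics

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant

def evd_pvalue(x, mu, lam):
    """P(S >= x) for an EVD with mode mu and scale parameter lam."""
    return -math.expm1(-math.exp(-lam * (x - mu)))

def fit_evd_moments(scores):
    """Estimate (mu, lam) from null scores by the method of moments."""
    mean = statistics.fmean(scores)
    sd = statistics.stdev(scores)
    lam = math.pi / (sd * math.sqrt(6.0))  # EVD: sd = pi / (lam * sqrt(6))
    mu = mean - EULER_GAMMA / lam          # EVD: mean = mu + gamma / lam
    return mu, lam

def simulate_null_scores(n, mu, lam, seed=0):
    """Stand-in for scores from a shuffled database (inverse-CDF sampling)."""
    rng = random.Random(seed)
    scores = []
    while len(scores) < n:
        u = rng.random()
        if u > 0.0:  # skip u == 0 to avoid log(0)
            scores.append(mu - math.log(-math.log(u)) / lam)
    return scores

null_scores = simulate_null_scores(20000, mu=25.0, lam=0.693)
mu_hat, lam_hat = fit_evd_moments(null_scores)
print(mu_hat, lam_hat)                        # estimates close to 25 and 0.693
print(evd_pvalue(45.0, mu=25.0, lam=0.693))   # about 9.565e-07, as on the slide
```

The method of moments is used only because it needs nothing beyond the standard library; a maximum-likelihood fit would be a reasonable alternative.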

  6. What p-value is significant?
     • The most common thresholds are 0.01 and 0.05.
     • A threshold of 0.05 means you are 95% sure that the result is significant.
     • Is 95% enough? It depends upon the cost associated with making a mistake.
     • Examples of costs:
       – Doing extensive wet lab validation (expensive)
       – Making clinical treatment decisions (very expensive)
       – Misleading the scientific community (very expensive)
       – Doing further simple computational tests (cheap)
       – Telling your grandmother (very cheap)

  7. Multiple testing
     • Say that you perform a statistical test with a 0.05 threshold, but you repeat the test on twenty different observations (e.g. 20 different BLAST runs).
     • Assume that all of the observations are explainable by the null hypothesis.
     • What is the chance that at least one of the observations will receive a p-value of 0.05 or less?
     $$1 - 0.95^{20} = 0.6415$$
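The arithmetic on this slide can be reproduced with a short Python sketch. The function name is hypothetical, and the calculation assumes the tests are independent, which the formula 1 − 0.95^20 implicitly requires.

```python
def prob_at_least_one_false_positive(alpha, n_tests):
    """Chance that at least one of n independent null tests has p <= alpha."""
    return 1.0 - (1.0 - alpha) ** n_tests

print(round(prob_at_least_one_false_positive(0.05, 20), 4))  # 0.6415
```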

  8. Bonferroni correction
     • Assume that individual tests are independent.
     • Multiply the p-values by the number of tests performed.
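A minimal sketch of the correction in Python follows. The function name and the example p-values are hypothetical, and capping the adjusted value at 1 is a common convention that the slide does not mention.

```python
def bonferroni_adjust(p_values):
    """Multiply each p-value by the number of tests, capping the result at 1."""
    n_tests = len(p_values)
    return [min(1.0, p * n_tests) for p in p_values]

# Hypothetical p-values from three tests.
raw = [0.01, 0.04, 0.5]
adjusted = bonferroni_adjust(raw)
for p, p_adj in zip(raw, adjusted):
    print(p, p_adj, "significant" if p_adj <= 0.05 else "not significant")
```

Equivalently, the raw p-values can be left alone and compared against the threshold divided by the number of tests.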

  9. Database searching
     • Say that you search the non-redundant protein database at NCBI, containing roughly one million sequences (i.e. you are doing 10^6 pairwise tests). What p-value threshold should you use?
     • Say that you want to use a conservative p-value of 0.001.
     • Recall that you would observe such a p-value by chance approximately once every 1000 times in a random database.

  10. E-values
     • A p-value is the probability of making a mistake.
     • An E-value is the expected number of times that the given score would appear in a random database of the given size.
     • One simple way to compute the E-value is to multiply the p-value by the number of sequences in the database.
     • Thus, for a p-value of 0.001 and a database of 1,000,000 sequences, the corresponding E-value is 0.001 × 1,000,000 = 1,000.
     (BLAST actually calculates E-values in a different way, but they mean about the same thing.)
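The simple multiplication rule can be sketched in Python as follows. The function names are hypothetical, and, as the slide notes, this is not how BLAST itself computes E-values.

```python
def simple_evalue(p_value, n_sequences):
    """Expected number of chance hits: p-value times database size."""
    return p_value * n_sequences

def pvalue_threshold_for_evalue(target_evalue, n_sequences):
    """Per-comparison p-value threshold that yields the target E-value."""
    return target_evalue / n_sequences

print(simple_evalue(0.001, 1_000_000))  # 1000.0
# Threshold needed to expect only 0.001 chance hits in a million-sequence database.
print(pvalue_threshold_for_evalue(0.001, 1_000_000))
```

The second function runs the rule in reverse, which is one way to choose a per-comparison threshold for a database of a given size.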

  11. Summary
     • A distribution plots the frequencies of types of observation.
     • The area under the distribution curve is 1.
     • Most statistical tests compare observed data to the expected result according to a null hypothesis.
     • Sequence alignment scores for unrelated sequences follow an extreme value distribution, which is characterized by a long tail.
     • The p-value associated with a score is the area under the curve to the right of that score.
     • Selecting a significance threshold requires evaluating the cost of making a mistake.
     • Bonferroni correction: Multiply the p-value by the number of statistical tests performed.
     • The E-value is the expected number of times that a given score would appear in a randomized database.
