Sequence comparison: Significance of alignment scores

Genome 559: Introduction to Statistical and Computational Genomics
Prof. James H. Thomas
http://faculty.washington.edu/jht/GS559_2014/

Unscaled EVD equation

$$P(S \ge x) = 1 - e^{-e^{-x}}$$

S is the data score and x is the test score. (FYI: this is 1 minus the cumulative distribution function, or CDF.) The unscaled curve has a characteristic width, and its peak is centered on 0.

Scaling the EVD

Notice that the mode and width of the curves are different.

• An EVD derived from, e.g., the Smith-Waterman algorithm with a given substitution matrix and gap penalties has a characteristic mode μ and scale (width) parameter λ.

scaled: $$P(S \ge x) = 1 - e^{-e^{-\lambda (x - \mu)}}$$

μ and λ depend on the substitution matrix and the gap penalties.
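The scaled formula can be evaluated directly. The following is a minimal Python sketch, not BLAST's own code: the function name `evd_pvalue` and the overflow guard are illustrative assumptions. With the defaults μ = 0 and λ = 1 it reduces to the unscaled EVD.

```python
import math


def evd_pvalue(x, mu=0.0, lam=1.0):
    """Return P(S >= x) = 1 - exp(-exp(-lam * (x - mu))) for an EVD.

    The name and the defaults (mu=0, lam=1, i.e. the unscaled curve)
    are illustrative choices, not part of any library.
    """
    z = -lam * (x - mu)
    # Assumption: far below the mode the p-value is 1 to machine precision,
    # so return it directly instead of letting math.exp overflow.
    if z > 700.0:
        return 1.0
    # -expm1(-t) computes 1 - exp(-t) without cancellation when t is tiny,
    # which matters in the right tail where p-values are very small.
    return -math.expm1(-math.exp(z))


# Unscaled curve evaluated at its mode (x = 0).
print(evd_pvalue(0.0))
# Moving x and mu together leaves the p-value unchanged.
print(evd_pvalue(25.0, mu=25.0, lam=0.693))
```

Using `math.expm1` rather than `1 - math.exp(...)` is a design choice for numerical accuracy: interesting alignment scores sit far out in the tail, where the naive subtraction loses most of its significant digits.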

Similar to scaling the standard normal

standard normal: $$\mathrm{PDF}_{\text{snormal}} = C e^{-x^2/2}, \quad \text{where } C = \frac{1}{\sqrt{2\pi}}$$

general normal: $$\mathrm{PDF}_{\text{gnormal}} = C e^{-(x-\mu)^2/(2v)}, \quad \text{where } C = \frac{1}{\sqrt{2\pi v}}$$

v is the variance and μ is the mean (μ moves the peak and v adjusts the width). PDF = probability density function.

An example

You run BLAST and get a maximum match score of 45. You then run BLAST on a shuffled version of the database and fit an EVD to the resulting empirical distribution. The parameters of the EVD are μ = 25 and λ = 0.693. What is the p-value associated with score 45?

$$
\begin{aligned}
P(S \ge 45) &= 1 - e^{-e^{-0.693(45-25)}} \\
            &= 1 - e^{-e^{-13.86}} \\
            &= 1 - e^{-9.565 \times 10^{-7}} \\
            &= 1 - 0.999999043 \\
            &\approx 9.565 \times 10^{-7}
\end{aligned}
$$

BLAST has precomputed values of μ and λ for common matrices and gap penalties.
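The slide does not say how the EVD is fitted to the shuffled-database scores. One simple option, swapped in here purely for illustration, is the method of moments, which uses the known mean (μ + γ/λ, with γ the Euler-Mascheroni constant) and standard deviation (π/(λ√6)) of this distribution. In the sketch below the "shuffled" scores are simulated draws from an EVD with the example's parameters, not real BLAST output, and all function names are hypothetical.

```python
import math
import random
import statistics

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def fit_evd_moments(scores):
    """Estimate (mu, lam) from null scores by the method of moments.

    Assumption: this is one simple fitting method, not necessarily the
    one used in the lecture or by BLAST.
    """
    # For this EVD: std = pi / (lam * sqrt(6)), mean = mu + gamma / lam.
    lam = math.pi / (statistics.stdev(scores) * math.sqrt(6.0))
    mu = statistics.mean(scores) - EULER_GAMMA / lam
    return mu, lam


def evd_pvalue(x, mu, lam):
    """P(S >= x) = 1 - exp(-exp(-lam * (x - mu)))."""
    return -math.expm1(-math.exp(-lam * (x - mu)))


def simulate_null_scores(n, mu, lam, seed=0):
    """Hypothetical stand-in for shuffled-database scores."""
    rng = random.Random(seed)
    # If E ~ Exponential(1), then mu - ln(E) / lam follows the EVD.
    return [mu - math.log(rng.expovariate(1.0)) / lam for _ in range(n)]


null_scores = simulate_null_scores(20000, mu=25.0, lam=0.693)
mu_hat, lam_hat = fit_evd_moments(null_scores)
print("fitted mu, lambda:", mu_hat, lam_hat)
print("p-value for 45 (fitted):", evd_pvalue(45.0, mu_hat, lam_hat))
print("p-value for 45 (slide parameters):", evd_pvalue(45.0, 25.0, 0.693))
```

The last line reproduces the worked arithmetic with the slide's parameters. The fitted estimate differs slightly because a small error in λ is amplified in the far tail.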

What p-value is significant?
• The most common thresholds are 0.01 and 0.05.
• A threshold of 0.05 means that, if the null hypothesis is true, a result this extreme turns up by chance only 5% of the time; informally, you are 95% sure that the result is significant.
• Is 95% enough? It depends upon the cost associated with making a mistake.
• Examples of costs:
– Doing extensive wet lab validation (expensive)
– Making clinical treatment decisions (very expensive)
– Misleading the scientific community (very expensive)
– Doing further simple computational tests (cheap)
– Telling your grandmother (very cheap)

Multiple testing
• Say that you perform a statistical test with a 0.05 threshold, but you repeat the test on twenty different observations (e.g. 20 different BLAST runs).
• Assume that all of the observations are explainable by the null hypothesis.
• What is the chance that at least one of the observations will receive a p-value of 0.05 or less?

$$1 - 0.95^{20} = 0.6415$$
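The calculation above generalizes to any threshold and number of tests. A minimal Python sketch follows; it assumes the tests are independent (which the formula requires), and the function name is illustrative.

```python
def prob_at_least_one(alpha, n_tests):
    """Chance that at least one of n_tests null observations gets p <= alpha.

    Assumption: the tests are independent, so the chance that none of
    them passes the threshold is (1 - alpha) ** n_tests.
    """
    return 1.0 - (1.0 - alpha) ** n_tests


# The slide's case: threshold 0.05, twenty observations.
print(round(prob_at_least_one(0.05, 20), 4))  # 0.6415

# The chance grows quickly with the number of tests.
for n in (1, 5, 20, 100):
    print(n, prob_at_least_one(0.05, n))
```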

Bonferroni correction
• Assume that individual tests are independent.
• Multiply the p-values by the number of tests performed.
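As a rough illustration, the correction is a single multiplication per p-value. The Python sketch below uses hypothetical function names and made-up p-values. Capping the adjusted value at 1 is a common convention that the slide does not mention. Comparing an adjusted p-value with a threshold α is equivalent to comparing the raw p-value with α divided by the number of tests.

```python
def bonferroni(p_values):
    """Multiply each p-value by the number of tests performed.

    Assumption: results are capped at 1.0, since a probability
    cannot exceed 1 (a common convention, not stated in the slide).
    """
    n_tests = len(p_values)
    return [min(1.0, p * n_tests) for p in p_values]


def bonferroni_threshold(alpha, n_tests):
    """Per-test threshold equivalent to testing adjusted p-values at alpha."""
    return alpha / n_tests


raw = [0.001, 0.02, 0.04, 0.3]  # made-up p-values from four tests
print(bonferroni(raw))
print(bonferroni_threshold(0.05, 20))
```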

Database searching
• Say that you search the non-redundant protein database at NCBI, containing roughly one million sequences (i.e. you are doing $10^6$ pairwise tests). What p-value threshold should you use?
• Say that you want to use a conservative p-value of 0.001.
• Recall that you would observe such a p-value by chance approximately once every 1000 times in a random database.

E-values
• A p-value is the probability of making a mistake.
• An E-value is the expected number of times that the given score would appear in a random database of the given size.
• One simple way to compute the E-value is to multiply the p-value by the number of sequences in the database.
• Thus, for a p-value of 0.001 and a database of 1,000,000 sequences, the corresponding E-value is 0.001 × 1,000,000 = 1,000.

(BLAST actually calculates E-values in a different way, but they mean about the same thing.)
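The simple rule above is a single multiplication. A minimal Python sketch with a hypothetical function name follows; as the slide notes, this is not how BLAST itself computes E-values.

```python
def e_value(p_value, n_sequences):
    """Simple E-value: the p-value times the number of database sequences.

    Assumption: this is the simplified rule from the slide, not
    BLAST's actual E-value calculation.
    """
    return p_value * n_sequences


# The slide's case: p = 0.001 against one million sequences.
print(e_value(0.001, 1_000_000))

# To expect only 0.001 chance hits in that database,
# a single comparison needs a p-value of 1e-9.
print(e_value(1e-9, 1_000_000))
```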

Summary
• A distribution plots the frequencies of types of observation.
• The area under the distribution curve is 1.
• Most statistical tests compare observed data to the expected result according to a null hypothesis.
• Sequence alignment scores for unrelated sequences follow an extreme value distribution, which is characterized by a long tail.
• The p-value associated with a score is the area under the curve to the right of that score.
• Selecting a significance threshold requires evaluating the cost of making a mistake.
• Bonferroni correction: Multiply the p-value by the number of statistical tests performed.
• The E-value is the expected number of times that a given score would appear in a randomized database.
