Sequence comparison: Significance of alignment scores
http://faculty.washington.edu/jht/GS559_2014/
Genome 559: Introduction to Statistical and Computational Genomics
- Prof. James H. Thomas
Unscaled EVD
[Figure: unscaled EVD plotted against x; labels: "peak centered", "characteristic width"]
(FYI this is 1 minus the cumulative distribution function, or CDF.)
Notice that the mode and width of the curves are different.
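An EVD for a given substitution matrix and gap penalties has a characteristic mode μ and scale (width) parameter λ.
unscaled: $P(S \geq x) = 1 - \exp\left(-e^{-x}\right)$
scaled: $P(S \geq x) = 1 - \exp\left(-e^{-\lambda(x-\mu)}\right)$
μ and λ depend on the substitution matrix and the gap penalties.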
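As a minimal sketch, the EVD tail probability P(S ≥ x) = 1 − exp(−e^{−λ(x−μ)}) can be written as a small Python function. The function name `evd_pvalue` and its defaults are illustrative choices, not part of the course material; with the defaults μ = 0 and λ = 1 it reduces to the unscaled form.

```python
import math

def evd_pvalue(x, mu=0.0, lam=1.0):
    """Tail probability P(S >= x) for an EVD with mode mu and scale lam.

    Computes 1 - exp(-exp(-lam * (x - mu))). math.expm1 is used so that
    very small p-values are not lost to floating-point rounding.
    """
    inner = math.exp(-lam * (x - mu))
    return -math.expm1(-inner)  # same as 1 - exp(-inner)

# Unscaled EVD (mu = 0, lam = 1) evaluated at its mode.
print(evd_pvalue(0.0))

# Shifting the mode moves the curve; the value at the mode is unchanged.
print(evd_pvalue(25.0, mu=25.0, lam=0.693))
```

At the mode (x = μ) the tail probability is always 1 − 1/e, whatever the value of λ, which is a quick sanity check on any implementation.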
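standard normal: $\mathrm{PDF}_{\mathrm{snormal}}(x) = C\,e^{-x^2/2}$, where $C = \frac{1}{\sqrt{2\pi}}$
general normal: $\mathrm{PDF}_{\mathrm{gnormal}}(x) = C\,e^{-(x-\mu)^2/(2v)}$, where $C = \frac{1}{\sqrt{2\pi v}}$
v is the variance and μ is the mean (μ moves the peak and v adjusts the width). PDF = probability density function.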
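The same shift-and-scale idea can be checked numerically for the normal distribution. The sketch below uses the usual textbook densities; the function names and the test values are illustrative assumptions.

```python
import math

def standard_normal_pdf(x):
    """Density of the standard normal (mean 0, variance 1)."""
    return math.exp(-x * x / 2.0) / math.sqrt(2.0 * math.pi)

def general_normal_pdf(x, mu, v):
    """Density of a normal with mean mu and variance v."""
    return math.exp(-(x - mu) ** 2 / (2.0 * v)) / math.sqrt(2.0 * math.pi * v)

# The general normal is the standard normal shifted by mu and
# stretched by sqrt(v); the 1/sqrt(v) factor keeps the area equal to 1.
mu, v, x = 3.0, 4.0, 5.0  # arbitrary illustrative values
rescaled = standard_normal_pdf((x - mu) / math.sqrt(v)) / math.sqrt(v)
print(general_normal_pdf(x, mu, v), rescaled)
```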
You run BLAST and get a maximum match score of 45. You then run BLAST on a shuffled version of the database, and fit an EVD to the resulting empirical distribution. The parameters of the EVD are μ = 25 and λ = 0.693. What is the p-value associated with score 45?
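$P(S \geq 45) = 1 - \exp\left(-e^{-0.693(45-25)}\right) = 1 - \exp\left(-e^{-13.86}\right) = 1 - \exp\left(-9.565 \times 10^{-7}\right) \approx 9.565 \times 10^{-7}$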
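The arithmetic for this example can be reproduced step by step in Python. This is only a check of the numbers given in the problem (μ = 25, λ = 0.693, score 45); the variable names are illustrative.

```python
import math

mu, lam = 25.0, 0.693  # EVD parameters from the example
score = 45.0

exponent = lam * (score - mu)   # lambda * (x - mu), about 13.86
inner = math.exp(-exponent)     # e^(-13.86), about 9.565e-07
p_value = -math.expm1(-inner)   # 1 - exp(-inner), computed without rounding loss

print(f"lambda * (x - mu) = {exponent:.2f}")
print(f"e^-(lambda * (x - mu)) = {inner:.3e}")
print(f"p-value = {p_value:.3e}")
```

Because the inner term is so small, 1 − exp(−t) is very nearly t itself, which is why the final p-value matches the intermediate value to four significant figures.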
BLAST has precomputed values of μ and λ for common matrices and gap penalties.
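When precomputed values are not available, μ and λ can be estimated from null scores, as in the shuffled-database example. The sketch below uses the method of moments, which is one simple way to fit an EVD and is not necessarily how BLAST or the course does it. The null scores here are simulated from a known EVD as a stand-in for real shuffled-database scores, and all function names are hypothetical.

```python
import math
import random
import statistics

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant

def sample_evd(mu, lam, rng):
    """Draw one score from an EVD with mode mu and scale lam (inverse-CDF method)."""
    u = rng.random()
    while u == 0.0:  # avoid log(0)
        u = rng.random()
    return mu - math.log(-math.log(u)) / lam

def fit_evd_moments(scores):
    """Estimate (mu, lam) using mean = mu + gamma/lam and sd = pi / (lam * sqrt(6))."""
    mean = statistics.fmean(scores)
    sd = statistics.stdev(scores)
    lam = math.pi / (sd * math.sqrt(6.0))
    mu = mean - EULER_GAMMA / lam
    return mu, lam

# Simulated stand-in for scores from a shuffled database (assumption).
rng = random.Random(0)
null_scores = [sample_evd(25.0, 0.693, rng) for _ in range(20000)]

mu_hat, lam_hat = fit_evd_moments(null_scores)
print(f"estimated mu = {mu_hat:.2f}, estimated lambda = {lam_hat:.3f}")
```

With enough null scores the estimates should land close to the parameters used for the simulation; maximum-likelihood fitting is a common alternative when more accuracy is needed.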
– Doing extensive wet lab validation (expensive)
– Making clinical treatment decisions (very expensive)
– Misleading the scientific community (very expensive)
– Doing further simple computational tests (cheap)
– Telling your grandmother (very cheap)
(BLAST actually calculates E-values in a different way, but they mean about the same thing)
The area under the distribution curve is 1.
Statistical tests compare observed data to the expected result according to a null hypothesis. Sequence alignment scores for unrelated sequences follow an extreme value distribution, which is characterized by a long tail.
The p-value associated with a score is the area under the curve to the right of that score.
Bonferroni correction: Multiply the p-value by the number of statistical tests performed.
The E-value is the expected number of times that a given score would appear in a randomized database.
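A rough sketch of how multiple testing changes the picture for a database search: the p-value used here is the one from the worked example (9.565 × 10^-7), while the database size of one million sequences is a hypothetical number chosen for illustration. Treating each database sequence as one independent test is a simplification, and, as noted earlier, BLAST computes its E-values differently.

```python
def bonferroni(p_value, n_tests):
    """Bonferroni-corrected p-value: p times the number of tests, capped at 1."""
    return min(1.0, p_value * n_tests)

def expected_hits(p_value, n_comparisons):
    """Expected number of null comparisons scoring at least as well (E-value-like)."""
    return p_value * n_comparisons

p = 9.565e-7            # p-value for score 45 from the worked example
database_size = 1_000_000  # hypothetical number of sequences searched

print(f"Bonferroni-corrected p-value: {bonferroni(p, database_size):.4f}")
print(f"Expected chance hits:         {expected_hits(p, database_size):.4f}")
```

A p-value that looks tiny for a single comparison can correspond to roughly one chance hit once a million comparisons are made, which is why the threshold has to account for the size of the search.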