Sequence comparison: Significance of alignment scores

SLIDE 1

Sequence comparison: Significance of alignment scores

http://faculty.washington.edu/jht/GS559_2014/

Genome 559: Introduction to Statistical and Computational Genomics

  • Prof. James H. Thomas
SLIDE 2

Unscaled EVD equation

$$P(S \ge x) = 1 - e^{-e^{-x}}$$

S is data score, x is test score.

The peak is centered on 0 and the curve has a characteristic width.

(FYI: this is 1 minus the cumulative distribution function, or CDF.)
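The unscaled equation can be evaluated directly. The following is a minimal Python sketch; the function names `evd_tail` and `evd_cdf` are illustrative and not part of the course material.

```python
import math

def evd_tail(x: float) -> float:
    """P(S >= x) for the unscaled EVD: 1 - exp(-exp(-x))."""
    # -expm1(-t) equals 1 - exp(-t) but keeps precision when t is tiny,
    # which matters far out in the right tail.
    return -math.expm1(-math.exp(-x))

def evd_cdf(x: float) -> float:
    """P(S < x): the CDF, i.e. 1 minus the tail probability."""
    return math.exp(-math.exp(-x))

# Tail probability shrinks quickly as the test score x moves right of the peak.
for x in (-2.0, 0.0, 2.0, 5.0):
    print(f"x = {x:4.1f}  P(S >= x) = {evd_tail(x):.6f}")
```

At x = 0 (the peak) the tail probability is 1 - 1/e, about 0.632, so most scores from this distribution fall to the right of the mode, reflecting the long right tail.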

SLIDE 3

Scaling the EVD

Notice that the mode and width of the curves are different.

  • An EVD derived from, e.g., the Smith-Waterman algorithm with a given substitution matrix and gap penalties has a characteristic mode μ and scale (width) parameter λ.

unscaled:

$$P(S \ge x) = 1 - e^{-e^{-x}}$$

scaled:

$$P(S \ge x) = 1 - e^{-e^{-\lambda (x - \mu)}}$$

μ and λ depend on the substitution matrix and the gap penalties.
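One way to see how μ and λ relate to a set of null scores is to estimate them from data. The sketch below uses the method of moments, which is a simple stand-in chosen for illustration and not necessarily how BLAST or the course fits the EVD. It relies on two standard facts about this distribution: the mean is μ + γ/λ (γ is the Euler-Mascheroni constant) and the variance is π²/(6λ²). The null scores here are simulated, and all function names and parameter values are assumptions.

```python
import math
import random
import statistics

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant

def sample_evd(mu, lam, n, rng):
    """Draw n scores from the scaled EVD by inverting its CDF."""
    # CDF: F(x) = exp(-exp(-lam * (x - mu)))  =>  x = mu - ln(-ln(u)) / lam
    scores = []
    for _ in range(n):
        u = rng.random()
        while u == 0.0:  # avoid log(0)
            u = rng.random()
        scores.append(mu - math.log(-math.log(u)) / lam)
    return scores

def fit_evd_moments(scores):
    """Method-of-moments estimates (mu, lam) for the scaled EVD."""
    mean = statistics.fmean(scores)
    sd = statistics.stdev(scores)
    lam = math.pi / (sd * math.sqrt(6.0))  # from variance = pi^2 / (6 lam^2)
    mu = mean - EULER_GAMMA / lam          # from mean = mu + gamma / lam
    return mu, lam

rng = random.Random(0)  # fixed seed so the sketch is repeatable
null_scores = sample_evd(mu=25.0, lam=0.693, n=20000, rng=rng)
mu_hat, lam_hat = fit_evd_moments(null_scores)
print(f"mu ~ {mu_hat:.2f}, lambda ~ {lam_hat:.3f}")
```

With enough simulated scores the estimates land close to the values used to generate them. Real alignment scores are discrete and would come from shuffled or unrelated sequences, so this only illustrates the idea.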

SLIDE 4

Similar to scaling the standard normal

standard normal:

$$\mathrm{PDF}_{snormal} = C e^{-x^2/2}, \quad \text{where } C = \frac{1}{\sqrt{2\pi}}$$

general normal:

$$\mathrm{PDF}_{gnormal} = C e^{-(x - \mu)^2/(2v)}, \quad \text{where } C = \frac{1}{\sqrt{2\pi v}}$$

v is variance, μ is mean (μ moves peak and v adjusts width). PDF = probability density function.
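The relationship between the two normal densities can be checked numerically. This is a small sketch with illustrative function names and arbitrary example values for μ and v: shifting by μ and rescaling by √v maps the general curve onto the standard one.

```python
import math

def pdf_snormal(x):
    """Standard normal density: C * exp(-x^2 / 2), C = 1 / sqrt(2 pi)."""
    return math.exp(-x * x / 2.0) / math.sqrt(2.0 * math.pi)

def pdf_gnormal(x, mu, v):
    """General normal density with mean mu and variance v."""
    return math.exp(-((x - mu) ** 2) / (2.0 * v)) / math.sqrt(2.0 * math.pi * v)

mu, v = 3.0, 4.0   # arbitrary example values
x = 4.5
z = (x - mu) / math.sqrt(v)  # standardized position of x

# The general density at x equals the standard density at z, divided by sqrt(v).
print(pdf_gnormal(x, mu, v), pdf_snormal(z) / math.sqrt(v))
```

The EVD is scaled in the same spirit: μ moves the peak and λ stretches or compresses the curve around it.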

SLIDE 5

An example

You run BLAST and get a maximum match score of 45. You then run BLAST on a shuffled version of the database, and fit an EVD to the resulting empirical distribution. The parameters of the EVD are μ = 25 and λ = 0.693. What is the p-value associated with score 45?

$$
\begin{aligned}
P(S \ge 45) &= 1 - e^{-e^{-0.693(45 - 25)}} \\
            &= 1 - e^{-e^{-13.86}} \\
            &= 1 - e^{-9.565 \times 10^{-7}} \\
            &= 1 - 0.999999043 \\
            &\approx 9.565 \times 10^{-7}
\end{aligned}
$$

BLAST has precomputed values of μ and λ for common matrices and gap penalties.
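The arithmetic in this example can be reproduced in a few lines. The sketch below uses a hypothetical helper name, `evd_pvalue`, and applies the scaled EVD formula with μ = 25 and λ = 0.693.

```python
import math

def evd_pvalue(score, mu, lam):
    """P(S >= score) under the scaled EVD with mode mu and scale lam."""
    t = math.exp(-lam * (score - mu))  # inner exponential, e^{-lam (x - mu)}
    # 1 - exp(-t) written as -expm1(-t): when t is tiny, exp(-t) is so close
    # to 1 that the plain subtraction would lose most of its precision.
    return -math.expm1(-t)

p = evd_pvalue(45, mu=25, lam=0.693)
print(f"p-value for score 45: {p:.3e}")
```

The result is about 9.565e-07, matching the hand calculation. For p-values this small the outer exponential barely matters, so the p-value is nearly equal to the inner term e^{-λ(x - μ)}.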

SLIDE 6

What p-value is significant?

  • The most common thresholds are 0.01 and 0.05.
  • A threshold of 0.05 means that a result this extreme would arise by chance only 5% of the time under the null hypothesis (loosely, you are 95% sure that the result is significant).
  • Is 95% enough? It depends upon the cost associated with making a mistake.
  • Examples of costs:
    – Doing extensive wet lab validation (expensive)
    – Making clinical treatment decisions (very expensive)
    – Misleading the scientific community (very expensive)
    – Doing further simple computational tests (cheap)
    – Telling your grandmother (very cheap)

SLIDE 7

Multiple testing

  • Say that you perform a statistical test with a 0.05 threshold, but you repeat the test on twenty different observations (e.g., 20 different BLAST runs).
  • Assume that all of the observations are explainable by the null hypothesis.
  • What is the chance that at least one of the observations will receive a p-value of 0.05 or less?

$$1 - 0.95^{20} = 0.6415$$
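The calculation generalizes to any threshold and number of tests. Here is a minimal sketch, assuming the tests are independent; the function name is illustrative.

```python
def prob_at_least_one(alpha, n_tests):
    """Chance that at least one of n independent null tests has p <= alpha."""
    # Each test stays above the threshold with probability (1 - alpha);
    # all n do so with probability (1 - alpha) ** n.
    return 1.0 - (1.0 - alpha) ** n_tests

print(round(prob_at_least_one(0.05, 20), 4))  # 0.6415
```

Even at a 0.05 threshold, twenty tests on pure noise will more often than not produce at least one apparently significant result.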

SLIDE 8

Bonferroni correction

  • Assume that individual tests are independent.
  • Multiply the p-values by the number of tests performed.
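The correction can be sketched in Python as follows. The function name and the example p-values are made up for illustration, and capping the adjusted value at 1 is a common convention that the slide does not mention.

```python
def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capped at 1."""
    n = len(p_values)
    return [min(1.0, p * n) for p in p_values]

raw = [0.001, 0.02, 0.04, 0.3]           # hypothetical p-values from 4 tests
adjusted = bonferroni(raw)
significant = [p <= 0.05 for p in adjusted]

for r, a, s in zip(raw, adjusted, significant):
    print(f"raw = {r:<6} adjusted = {a:<6.3f} significant at 0.05: {s}")
```

Only the first test survives the correction: 0.02 and 0.04 pass a 0.05 threshold individually but not after multiplying by four.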

SLIDE 9

Database searching

  • Say that you search the non-redundant protein database at NCBI, containing roughly one million sequences (i.e., you are doing $10^6$ pairwise tests). What p-value threshold should you use?
  • Say that you want to use a conservative p-value of 0.001.
  • Recall that you would observe such a p-value by chance approximately once every 1000 times in a random database.
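A quick simulation makes the scale of the problem concrete. This sketch assumes that p-values for unrelated sequences are uniformly distributed between 0 and 1, which is what a well-calibrated test gives under the null hypothesis. The variable names and the seed are arbitrary.

```python
import random

rng = random.Random(559)   # fixed seed so the sketch is repeatable
n_sequences = 10**6        # roughly the database size from the slide
threshold = 0.001          # the "conservative" per-test p-value

# Count how many purely random comparisons pass the threshold by chance.
chance_hits = sum(1 for _ in range(n_sequences) if rng.random() <= threshold)
print(f"chance hits at p <= {threshold}: {chance_hits}")

# Bonferroni-style per-test threshold that keeps the overall level at 0.001.
per_test_threshold = threshold / n_sequences
print(f"per-test threshold: {per_test_threshold:.0e}")
```

Roughly a thousand random sequences pass the 0.001 threshold, so an uncorrected threshold is far too permissive for a search of this size. Dividing by the number of tests gives a per-test threshold of about 1e-09.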

SLIDE 10

E-values

  • A p-value is the probability of making a mistake (calling a chance match significant).
  • An E-value is the expected number of times that the given score would appear in a random database of the given size.
  • One simple way to compute the E-value is to multiply the p-value by the number of sequences in the database.
  • Thus, for a p-value of 0.001 and a database of 1,000,000 sequences, the corresponding E-value is 0.001 × 1,000,000 = 1,000.

(BLAST actually calculates E-values in a different way, but they mean about the same thing.)
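The simple p-value-times-database-size rule is easy to express in code. This sketch implements only that rule, not BLAST's own calculation, and the function name is illustrative. The second call combines a p-value of 9.565e-7 with the one-million-sequence database purely as an illustration.

```python
def e_value(p_value, db_size):
    """Expected number of chance hits: p-value times number of sequences."""
    return p_value * db_size

print(e_value(0.001, 1_000_000))     # 1000.0
print(e_value(9.565e-7, 1_000_000))  # a little under 1
```

An E-value near 1 means that about one hit this good is expected by chance in a database of that size, so even a p-value below one in a million is not strong evidence on its own in a search this large.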

SLIDE 13

Summary

  • A distribution plots the frequencies of types of observation.
  • The area under the distribution curve is 1.
  • Most statistical tests compare observed data to the expected result according to a null hypothesis.
  • Sequence alignment scores for unrelated sequences follow an extreme value distribution, which is characterized by a long tail.
  • The p-value associated with a score is the area under the curve to the right of that score.
  • Selecting a significance threshold requires evaluating the cost of making a mistake.
  • Bonferroni correction: Multiply the p-value by the number of statistical tests performed.
  • The E-value is the expected number of times that a given score would appear in a randomized database.