Outline BLAST CSE 527 Scoring Computational Biology Weekly Bio - PowerPoint PPT Presentation

CSE 527 Computational Biology, Autumn 2009
3: BLAST, Alignment score significance; PCR and DNA sequencing

Outline
  BLAST
  Scoring
  Weekly Bio Interlude: PCR & Sequencing

BLAST: Basic Local Alignment Search Tool
Altschul, Gish, Miller, Myers, Lipman, J Mol Biol 1990
  The most widely used comp bio tool.
  Which is better: a long mediocre match, or a few nearby, short, strong matches with the same total score?
    Score-wise, they are exactly equivalent.
    Biologically, the latter may be more interesting, and is common.
    At least, if we must miss some, we would rather miss the former.
  BLAST is a heuristic emphasizing the latter.
  Speed/sensitivity tradeoff: BLAST may miss the former, but gains greatly in speed.

BLAST: What
  Input:
    A query sequence (say, 300 residues)
    A database to search for other sequences similar to the query (say, 10^6 - 10^9 residues)
    A score matrix σ(r,s), giving the cost of substituting r for s (and perhaps gap costs)
    Various score thresholds and tuning parameters
  Output:
    "All" matches in the database above threshold
    "E-value" of each

BLAST: How
  Idea: the most interesting parts of the DB are those with a good ungapped match to some short subword of the query.
  Break the query into overlapping words w_i of small fixed length (e.g. 3 aa or 11 nt).
  For each w_i, find (empirically, ~50) "neighboring" words v_ij with score σ(w_i, v_ij) > thresh1.
  Look up each v_ij in the database (via a prebuilt index), i.e., find exact matches to short, high-scoring words.
  Extend each such "seed match" (bidirectionally).
  Report those scoring > thresh2, and calculate E-values.

BLAST: Example
  query: deadly, neighbor words scoring ≥ 7 (thresh1)
    de (11) -> de ee dd dq dk
    ea ( 9) -> ea
    ad (10) -> ad sd
    dl (10) -> dl di dm dv
    ly (11) -> ly my iy vy fy lf
  DB: ddgearlyk . . .
  hits scoring ≥ 10 (thresh2):
    ddge   10
    early  18

BLOSUM 62
      A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V
  A   4 -1 -2 -2  0 -1 -1  0 -2 -1 -1 -1 -1 -2 -1  1  0 -3 -2  0
  R  -1  5  0 -2 -3  1  0 -2  0 -3 -2  2 -1 -3 -2 -1 -1 -3 -2 -3
  N  -2  0  6  1 -3  0  0  0  1 -3 -3  0 -2 -3 -2  1  0 -4 -2 -3
  D  -2 -2  1  6 -3  0  2 -1 -1 -3 -4 -1 -3 -3 -1  0 -1 -4 -3 -3
  C   0 -3 -3 -3  9 -3 -4 -3 -3 -1 -1 -3 -1 -2 -3 -1 -1 -2 -2 -1
  Q  -1  1  0  0 -3  5  2 -2  0 -3 -2  1  0 -3 -1  0 -1 -2 -1 -2
  E  -1  0  0  2 -4  2  5 -2  0 -3 -3  1 -2 -3 -1  0 -1 -3 -2 -2
  G   0 -2  0 -1 -3 -2 -2  6 -2 -4 -4 -2 -3 -3 -2  0 -2 -2 -3 -3
  H  -2  0  1 -1 -3  0  0 -2  8 -3 -3 -1 -2 -1 -2 -1 -2 -2  2 -3
  I  -1 -3 -3 -3 -1 -3 -3 -4 -3  4  2 -3  1  0 -3 -2 -1 -3 -1  3
  L  -1 -2 -3 -4 -1 -2 -3 -4 -3  2  4 -2  2  0 -3 -2 -1 -2 -1  1
  K  -1  2  0 -1 -3  1  1 -2 -1 -3 -2  5 -1 -3 -1  0 -1 -3 -2 -2
  M  -1 -1 -2 -3 -1  0 -2 -3 -2  1  2 -1  5  0 -2 -1 -1 -1 -1  1
  F  -2 -3 -3 -3 -2 -3 -3 -3 -1  0  0 -3  0  6 -4 -2 -2  1  3 -1
  P  -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4  7 -1 -1 -4 -3 -2
  S   1 -1  1  0 -1  0  0  0 -1 -2 -2  0 -1 -2 -1  4  1 -3 -2 -2
  T   0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  1  5 -2 -2  0
  W  -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1  1 -4 -3 -2 11  2 -3
  Y  -2 -2 -2 -3 -2 -1 -2 -3  2 -1 -1 -2 -1  3 -3 -2 -2  2  7 -1
  V   0 -3 -3 -3 -1 -2 -2 -3 -3  3  1 -2  1 -1 -2 -2  0 -3 -1  4

BLAST Refinements
  "Two hit heuristic": need 2 nearby, nonoverlapping, gapless hits before trying to extend either.
  "Gapped BLAST": run a heuristic version of Smith-Waterman, bidirectionally from the hit, until the score drops by a fixed amount below the max.
  PSI-BLAST: for proteins, an iterated search, using a "weight matrix" pattern from the initial pass to find weaker matches in subsequent passes.
  Many others.
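The seeding steps in "BLAST: How" can be sketched in a few lines of Python. This is a minimal illustration, not BLAST's actual implementation: the function names (score, neighbors, seed_hits) are hypothetical, the word length is 2 as in the "deadly" example, neighbors are found by brute-force enumeration rather than a clever index, and the extension step is left out. The neighbor test uses "score >= thresh1", which is what the example's word lists imply (ee scores exactly 7 and is listed).

```python
from itertools import product

# BLOSUM62 as transcribed from the slide; rows and columns share the order in AA.
AA = "ARNDCQEGHILKMFPSTWYV"
ROWS = """
 4 -1 -2 -2  0 -1 -1  0 -2 -1 -1 -1 -1 -2 -1  1  0 -3 -2  0
-1  5  0 -2 -3  1  0 -2  0 -3 -2  2 -1 -3 -2 -1 -1 -3 -2 -3
-2  0  6  1 -3  0  0  0  1 -3 -3  0 -2 -3 -2  1  0 -4 -2 -3
-2 -2  1  6 -3  0  2 -1 -1 -3 -4 -1 -3 -3 -1  0 -1 -4 -3 -3
 0 -3 -3 -3  9 -3 -4 -3 -3 -1 -1 -3 -1 -2 -3 -1 -1 -2 -2 -1
-1  1  0  0 -3  5  2 -2  0 -3 -2  1  0 -3 -1  0 -1 -2 -1 -2
-1  0  0  2 -4  2  5 -2  0 -3 -3  1 -2 -3 -1  0 -1 -3 -2 -2
 0 -2  0 -1 -3 -2 -2  6 -2 -4 -4 -2 -3 -3 -2  0 -2 -2 -3 -3
-2  0  1 -1 -3  0  0 -2  8 -3 -3 -1 -2 -1 -2 -1 -2 -2  2 -3
-1 -3 -3 -3 -1 -3 -3 -4 -3  4  2 -3  1  0 -3 -2 -1 -3 -1  3
-1 -2 -3 -4 -1 -2 -3 -4 -3  2  4 -2  2  0 -3 -2 -1 -2 -1  1
-1  2  0 -1 -3  1  1 -2 -1 -3 -2  5 -1 -3 -1  0 -1 -3 -2 -2
-1 -1 -2 -3 -1  0 -2 -3 -2  1  2 -1  5  0 -2 -1 -1 -1 -1  1
-2 -3 -3 -3 -2 -3 -3 -3 -1  0  0 -3  0  6 -4 -2 -2  1  3 -1
-1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4  7 -1 -1 -4 -3 -2
 1 -1  1  0 -1  0  0  0 -1 -2 -2  0 -1 -2 -1  4  1 -3 -2 -2
 0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  1  5 -2 -2  0
-3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1  1 -4 -3 -2 11  2 -3
-2 -2 -2 -3 -2 -1 -2 -3  2 -1 -1 -2 -1  3 -3 -2 -2  2  7 -1
 0 -3 -3 -3 -1 -2 -2 -3 -3  3  1 -2  1 -1 -2 -2  0 -3 -1  4
"""
BLOSUM62 = {}
for a, row in zip(AA, ROWS.strip().splitlines()):
    for b, s in zip(AA, row.split()):
        BLOSUM62[a, b] = int(s)


def score(u, v):
    """Ungapped score of two equal-length words (case-insensitive)."""
    return sum(BLOSUM62[a, b] for a, b in zip(u.upper(), v.upper()))


def neighbors(word, thresh1):
    """All words of the same length scoring >= thresh1 against `word`.

    Brute force over 20^k candidates: fine for k = 2 or 3, illustration only.
    """
    candidates = ("".join(v) for v in product(AA, repeat=len(word)))
    return {v.lower() for v in candidates if score(word, v) >= thresh1}


def seed_hits(query, db, k, thresh1):
    """(query offset, db offset) pairs where a db k-mer is a neighbor word."""
    index = {}  # stand-in for the prebuilt database index: k-mer -> offsets
    for j in range(len(db) - k + 1):
        index.setdefault(db[j:j + k], []).append(j)
    hits = []
    for i in range(len(query) - k + 1):
        for v in neighbors(query[i:i + k], thresh1):
            hits.extend((i, j) for j in index.get(v, []))
    return sorted(hits)


hits = seed_hits("deadly", "ddgearlyk", k=2, thresh1=7)
```

With the slide's query and database, the seeds fall on two diagonals; scoring the corresponding ungapped segments with score("dead", "ddge") and score("eadly", "early") reproduces the two reported hits.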

Significance of Alignments
  Is "42" a good score?
  Compared to what?
  Usual approach: compared to a specific "null model", such as "random sequences".

Hypothesis Testing: A Very Simple Example
  Given: A coin, either fair (p(H)=1/2) or biased (p(H)=2/3)
  Decide: which
  How? Flip it 5 times. Suppose outcome D = HHHTH
  Null Model/Null Hypothesis M_0: p(H)=1/2
  Alternative Model/Alt Hypothesis M_1: p(H)=2/3
  Likelihoods:
    P(D | M_0) = (1/2)(1/2)(1/2)(1/2)(1/2) = 1/32
    P(D | M_1) = (2/3)(2/3)(2/3)(1/3)(2/3) = 16/243
  Likelihood Ratio:
    \[ \frac{p(D \mid M_1)}{p(D \mid M_0)} = \frac{16/243}{1/32} = \frac{512}{243} \approx 2.1 \]
  I.e., alt model is ≈ 2.1x more likely than null model, given data.

Hypothesis Testing, II
  Log of likelihood ratio is equivalent, often more convenient
    add logs instead of multiplying...
  "Likelihood Ratio Tests": reject null if LLR > threshold
    LLR > 0 disfavors null, but a higher threshold gives stronger evidence against it
  Neyman-Pearson Theorem: For a given error rate, LRT is as good a test as any (subject to some fine print).

p-values
  The p-value of such a test is the probability, assuming that the null model is true, of seeing data as extreme or more extreme than what you actually observed.
  E.g., we observed 4 heads; p-value is prob of seeing 4 or 5 heads in 5 tosses of a fair coin.
  Why interesting? It measures the probability that we would be making a mistake in rejecting null.
  Can analytically find p-value for simple problems like coins; often turn to simulation/permutation tests (introduced earlier) or to approximation (coming soon) for more complex situations.
  Usual scientific convention is to reject null only if p-value is < 0.05; sometimes demand p << 0.05 (esp. if estimates are inaccurate).
  [Figure: null distribution, with the observed value (obs) and the p-value region marked]
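The coin example can be checked with exact arithmetic. The sketch below is only an illustration of the two quantities on the slides, the likelihood ratio for D = HHHTH and the p-value for "4 or more heads in 5 tosses"; the helper names (likelihood, p_value_at_least) are made up for this example.

```python
from fractions import Fraction
from math import comb, log2


def likelihood(data, p_heads):
    """P(data | model) for independent flips with P(H) = p_heads."""
    p = Fraction(p_heads)
    result = Fraction(1)
    for flip in data:
        result *= p if flip == "H" else 1 - p
    return result


D = "HHHTH"
lik_null = likelihood(D, Fraction(1, 2))  # M_0: fair coin
lik_alt = likelihood(D, Fraction(2, 3))   # M_1: biased coin
ratio = lik_alt / lik_null                # likelihood ratio
llr = log2(ratio)                         # log likelihood ratio, in bits


def p_value_at_least(k, n, p_heads):
    """P(at least k heads in n flips) under the model with P(H) = p_heads."""
    p = Fraction(p_heads)
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# "As extreme or more extreme" than 4 heads means 4 or 5 heads, under the null.
p_val = p_value_at_least(4, 5, Fraction(1, 2))

print(ratio, float(ratio), llr > 0, p_val, float(p_val))
```

The ratio comes out as 512/243, matching the slide, and the p-value is 6/32 = 0.1875. So although LLR > 0 favors the biased coin, five flips are not enough to reject the fair-coin null at the usual 0.05 level.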

A Likelihood Ratio
  Defn: two proteins are homologous if they are alike because of shared ancestry; similarity by descent.
  Suppose among proteins overall, residue x occurs with frequency p_x.
  Then in a random alignment of 2 random proteins, you would expect to find x aligned to y with prob p_x p_y.
  Suppose among homologs, x & y align with prob p_xy.
  Are seqs X & Y homologous? Which is more likely, that the alignment reflects chance or homology? Use a likelihood ratio test:
    \[ \sum_i \log \frac{p_{x_i y_i}}{p_{x_i} \, p_{y_i}} \]

Non-ad hoc Alignment Scores
  Take alignments of homologs and look at frequency of x-y alignments vs freq of x, y overall.
  Issues:
    biased samples
    evolutionary distance
  BLOSUM approach:
    Large collection of trusted alignments (the BLOCKS DB)
    Subset by similarity: BLOSUM62 ⇒ 62% identity threshold
    e.g. http://blocks.fhcrc.org/blocks-bin/getblock.pl?IPB013598
    Scores:
      \[ \frac{1}{\lambda} \log_2 \frac{p_{xy}}{p_x \, p_y} \]

BLOSUM 62
  (the same matrix shown in the BLAST section above)

ad hoc Alignment Scores?
  Make up any scoring matrix you like.
  Somewhat surprisingly, under pretty general assumptions**, it is equivalent to the scores constructed as above from some set of probabilities p_xy, so you might as well understand what they are.
  NCBI-BLAST: +1/-2
  WU-BLAST: +5/-4
  ** e.g., average scores should be negative, but you probably want that anyway, otherwise local alignments turn into global ones, and some score must be > 0, else best match is empty.
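To make the log-odds construction concrete, here is a small sketch that builds scores of the form (1/λ) log2(p_xy / (p_x p_y)) and sums them over an ungapped alignment. Everything numeric in it is an assumption for illustration: the two-letter alphabet and its frequencies are invented toy values, not BLOCKS data, and λ = 1/2 (half-bit units) is simply one possible scale.

```python
from math import log2


def log_odds_scores(p_pair, p_single, lam=0.5):
    """s(x, y) = (1 / lam) * log2(p_xy / (p_x * p_y)) for every listed pair."""
    return {
        (x, y): log2(p_xy / (p_single[x] * p_single[y])) / lam
        for (x, y), p_xy in p_pair.items()
    }


def alignment_score(seq_x, seq_y, scores):
    """Sum of per-column log-odds scores for an ungapped alignment."""
    return sum(scores[a, b] for a, b in zip(seq_x, seq_y))


# Hypothetical toy frequencies (NOT real data): a two-letter alphabet where
# homologs mostly keep the same letter. The pair probabilities sum to 1 and
# their marginals equal the background frequencies.
p_single = {"a": 0.5, "b": 0.5}
p_pair = {("a", "a"): 0.4, ("b", "b"): 0.4, ("a", "b"): 0.1, ("b", "a"): 0.1}

scores = log_odds_scores(p_pair, p_single)

# Average score of a random (non-homologous) column: sum of p_x * p_y * s(x, y).
expected_random = sum(
    p_single[x] * p_single[y] * s for (x, y), s in scores.items()
)

print(scores)
print(expected_random)
print(alignment_score("aabba", "aabab", scores))
```

In this toy case matches score positive, mismatches score negative, and the average score of a random column is negative, which are the conditions the "ad hoc Alignment Scores?" footnote asks of any scoring matrix.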

Random (ungapped) local alignment
  [Figure: m x n grid for two sequences of lengths m and n]
  Its score is the max of m*n ~indp random scores.

Alignment Scores vs Test Statistic
  Alignment alg works hard to contort data into a high-scoring alignment.
  Goal of test statistic is to discriminate good/bad ones.
  Why use same score? Doesn't a better alg just push up scores? Maybe better to test via an independent criterion?
  A: Yes, a better alg may raise background scores. But we want the best discrimination in both phases, so use the best possible score/test statistic, with an appropriate threshold, rather than an indp. criterion.
  Note: best random match looks like a real match (e.g. same matching-letter frequencies), except for score.
  One reason to score/test differently: if the score is too expensive for search, might try search w/ an approx score, then look at multiple hits.

Overall Alignment Significance, I
A Theoretical Approach: EVD
  Let X_i, 1 ≤ i ≤ N, be indp. random variables drawn from some (non-pathological) distribution.
  Q. what can you say about distribution of y = sum{ X_i }?
  A. y is approximately normally distributed
  Q. what can you say about distribution of y = max{ X_i }?
  A. it's approximately an Extreme Value Distribution (EVD) [one of only 3 kinds; for our purposes, the relevant one is:]
    \[ P(y \le z) \approx \exp\left(-K N e^{-\lambda (z - \mu)}\right) \qquad (*) \]
  For ungapped local alignment of seqs x, y, N ~ |x|*|y|.
  λ, K depend on scores, etc., or can be estimated by curve-fitting random scores to (*). (cf. reading)
  [Figure: two panels, Normal and EVD density curves; x axis from -4 to 4, density from 0.0 to 0.4]
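Formula (*) is easy to evaluate directly. The sketch below assumes nothing about real alignment parameters: the function names (evd_cdf, evd_p_value) are illustrative, and the sanity check uses a case where the exact answer is known, the maximum of N independent Exponential(1) variables, for which P(max ≤ z) = (1 - e^{-z})^N and (*) applies with K = 1, λ = 1, μ = 0.

```python
from math import exp, expm1, log


def evd_cdf(z, K, N, lam, mu=0.0):
    """P(y <= z) from (*): exp(-K * N * exp(-lam * (z - mu)))."""
    return exp(-K * N * exp(-lam * (z - mu)))


def evd_p_value(z, K, N, lam, mu=0.0):
    """P(y > z) = 1 - P(y <= z), computed with expm1 for small tail values."""
    return -expm1(-K * N * exp(-lam * (z - mu)))


# Sanity check against a case with a known exact answer: the max of N
# independent Exponential(1) variables (K = 1, lam = 1, mu = 0 in (*)).
N = 1000
z = log(N) + 1.0
exact = (1.0 - exp(-z)) ** N
approx = evd_cdf(z, K=1.0, N=N, lam=1.0)

print(exact, approx, evd_p_value(z, K=1.0, N=N, lam=1.0))
```

For alignment scores, N would be on the order of |x|*|y| and K, λ would come from the scoring system or from curve-fitting, as the slide says; the values used here are placeholders for the check only.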
