Genome 559 Introduction to Statistical and Computational Genomics - PowerPoint PPT Presentation

Genome 559: Introduction to Statistical and Computational Genomics, Winter 2010
Lecture 14a: BLAST
Larry Ruzzo

1 minute responses
- Pacing was: (a) a little slow (1), (b) great (3) [maybe we don’t need semesters after all!], or (c) I was lost/equation-dense (4) (but I’ll try harder to keep up with reading)
- Paper slides for note-taking really help. [Agreed.]
- More time for problems helped. [Hopefully again today.]
- Is revised hw schedule on web? [Some.]
- Liked it, but need some practice problems for it to sink in. [See hw5!]
- Fuzzy on purpose of relative entropy; why does it matter? [If the motif distribution is like the background (low relative entropy), WMM prediction will be error-prone. Similarly, columns of low relative entropy may only add noise; at edges, especially, maybe delete them.]
- Didn’t explain substring matches/match objects (2) [Today.]

BLAST: Basic Local Alignment Search Tool
Altschul, Gish, Miller, Myers, Lipman, J Mol Biol 1990
The most widely used comp bio tool
Which is better: a long mediocre match or a few nearby, short, strong matches with the same total score?
- score-wise, exactly equivalent
- biologically, the latter may be more interesting
- if must miss some, rather miss the former (?)
BLAST is a heuristic emphasizing the latter
- speed/sensitivity tradeoff: BLAST may miss weak matches, but gains greatly in speed
Heuristic: a method proceeding towards a solution by trial and error, intuition or loosely defined rules. Cf. Algorithm; Smith-Waterman, etc.

[Figure: A Protein Structure (Dihydrofolate Reductase)]

BLAST: What
Input:
- a query sequence (say, 50-300 residues)
- a data base to search for other sequences similar to the query (say, 10^6 - 10^9 residues)
- a score matrix σ(r,s), giving cost of substituting r for s (& perhaps gap costs)
- various score thresholds & tuning parameters
Output:
- “all” matches in data base above threshold
- “E-value” of each

BLAST: How
Idea: emphasize parts of data base near a good match to some short subword of the query
- Break query into overlapping words w_i of small fixed length (e.g. 3 aa or 11 nt)
- For each w_i, find (empirically, ~50) “neighboring” words v_ij with score σ(w_i, v_ij) > thresh 1
- Look up each v_ij in database (via prebuilt index), i.e., exact match to short, high-scoring word
- Extend each such “seed match” (bidirectional)
- Report those scoring > thresh 2, calculate E-values

BLAST: Example
query: deadly
neighboring words scoring ≥ 7 (thresh 1):
  de (11) -> de ee dd dq dk
  ea ( 9) -> ea
  ad (10) -> ad sd
  dl (10) -> dl di dm dv
  ly (11) -> ly my iy vy fy lf
DB: ddgearlyk . . .
hits scoring ≥ 10 (thresh 2):
  ddge   10
  early  18

BLOSUM 62
     A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V
A    4 -1 -2 -2  0 -1 -1  0 -2 -1 -1 -1 -1 -2 -1  1  0 -3 -2  0
R   -1  5  0 -2 -3  1  0 -2  0 -3 -2  2 -1 -3 -2 -1 -1 -3 -2 -3
N   -2  0  6  1 -3  0  0  0  1 -3 -3  0 -2 -3 -2  1  0 -4 -2 -3
D   -2 -2  1  6 -3  0  2 -1 -1 -3 -4 -1 -3 -3 -1  0 -1 -4 -3 -3
C    0 -3 -3 -3  9 -3 -4 -3 -3 -1 -1 -3 -1 -2 -3 -1 -1 -2 -2 -1
Q   -1  1  0  0 -3  5  2 -2  0 -3 -2  1  0 -3 -1  0 -1 -2 -1 -2
E   -1  0  0  2 -4  2  5 -2  0 -3 -3  1 -2 -3 -1  0 -1 -3 -2 -2
G    0 -2  0 -1 -3 -2 -2  6 -2 -4 -4 -2 -3 -3 -2  0 -2 -2 -3 -3
H   -2  0  1 -1 -3  0  0 -2  8 -3 -3 -1 -2 -1 -2 -1 -2 -2  2 -3
I   -1 -3 -3 -3 -1 -3 -3 -4 -3  4  2 -3  1  0 -3 -2 -1 -3 -1  3
L   -1 -2 -3 -4 -1 -2 -3 -4 -3  2  4 -2  2  0 -3 -2 -1 -2 -1  1
K   -1  2  0 -1 -3  1  1 -2 -1 -3 -2  5 -1 -3 -1  0 -1 -3 -2 -2
M   -1 -1 -2 -3 -1  0 -2 -3 -2  1  2 -1  5  0 -2 -1 -1 -1 -1  1
F   -2 -3 -3 -3 -2 -3 -3 -3 -1  0  0 -3  0  6 -4 -2 -2  1  3 -1
P   -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4  7 -1 -1 -4 -3 -2
S    1 -1  1  0 -1  0  0  0 -1 -2 -2  0 -1 -2 -1  4  1 -3 -2 -2
T    0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  1  5 -2 -2  0
W   -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1  1 -4 -3 -2 11  2 -3
Y   -2 -2 -2 -3 -2 -1 -2 -3  2 -1 -1 -2 -1  3 -3 -2 -2  2  7 -1
V    0 -3 -3 -3 -1 -2 -2 -3 -3  3  1 -2  1 -1 -2 -2  0 -3 -1  4
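The seed-and-extend steps can be sketched in a few lines of Python and checked against the "deadly" example. This is a minimal illustration, not the real BLAST implementation: the function names (`neighbors`, `extend`, `blast_sketch`) and the `x_drop` cutoff of 5 are assumptions made here, sequences are upper-cased, extension is ungapped only, and the thresholds use ≥ as in the example slide.

```python
from itertools import product

AA = "ARNDCQEGHILKMFPSTWYV"
BLOSUM62_ROWS = """
 4 -1 -2 -2  0 -1 -1  0 -2 -1 -1 -1 -1 -2 -1  1  0 -3 -2  0
-1  5  0 -2 -3  1  0 -2  0 -3 -2  2 -1 -3 -2 -1 -1 -3 -2 -3
-2  0  6  1 -3  0  0  0  1 -3 -3  0 -2 -3 -2  1  0 -4 -2 -3
-2 -2  1  6 -3  0  2 -1 -1 -3 -4 -1 -3 -3 -1  0 -1 -4 -3 -3
 0 -3 -3 -3  9 -3 -4 -3 -3 -1 -1 -3 -1 -2 -3 -1 -1 -2 -2 -1
-1  1  0  0 -3  5  2 -2  0 -3 -2  1  0 -3 -1  0 -1 -2 -1 -2
-1  0  0  2 -4  2  5 -2  0 -3 -3  1 -2 -3 -1  0 -1 -3 -2 -2
 0 -2  0 -1 -3 -2 -2  6 -2 -4 -4 -2 -3 -3 -2  0 -2 -2 -3 -3
-2  0  1 -1 -3  0  0 -2  8 -3 -3 -1 -2 -1 -2 -1 -2 -2  2 -3
-1 -3 -3 -3 -1 -3 -3 -4 -3  4  2 -3  1  0 -3 -2 -1 -3 -1  3
-1 -2 -3 -4 -1 -2 -3 -4 -3  2  4 -2  2  0 -3 -2 -1 -2 -1  1
-1  2  0 -1 -3  1  1 -2 -1 -3 -2  5 -1 -3 -1  0 -1 -3 -2 -2
-1 -1 -2 -3 -1  0 -2 -3 -2  1  2 -1  5  0 -2 -1 -1 -1 -1  1
-2 -3 -3 -3 -2 -3 -3 -3 -1  0  0 -3  0  6 -4 -2 -2  1  3 -1
-1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4  7 -1 -1 -4 -3 -2
 1 -1  1  0 -1  0  0  0 -1 -2 -2  0 -1 -2 -1  4  1 -3 -2 -2
 0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  1  5 -2 -2  0
-3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1  1 -4 -3 -2 11  2 -3
-2 -2 -2 -3 -2 -1 -2 -3  2 -1 -1 -2 -1  3 -3 -2 -2  2  7 -1
 0 -3 -3 -3 -1 -2 -2 -3 -3  3  1 -2  1 -1 -2 -2  0 -3 -1  4
"""
# S[(a, b)] is the BLOSUM62 score for aligning residue a with residue b.
S = {(a, b): int(v)
     for a, row in zip(AA, BLOSUM62_ROWS.strip().splitlines())
     for b, v in zip(AA, row.split())}

def word_score(u, v):
    """Ungapped score of two equal-length words."""
    return sum(S[a, b] for a, b in zip(u, v))

def neighbors(word, thresh1):
    """All words of the same length scoring >= thresh1 against `word`."""
    return {"".join(v) for v in product(AA, repeat=len(word))
            if word_score(word, v) >= thresh1}

def extend(query, db, q, d, w, x_drop):
    """Ungapped bidirectional extension of the seed query[q:q+w] / db[d:d+w].
    Returns (score, query_start, db_start, length) of the best segment."""
    def one_side(step, qi, di):
        best = run = best_len = n = 0
        # Stop at a sequence end or once the score falls x_drop below the best.
        while 0 <= qi < len(query) and 0 <= di < len(db) and best - run <= x_drop:
            run += S[query[qi], db[di]]
            n += 1
            if run > best:
                best, best_len = run, n
            qi += step
            di += step
        return best, best_len
    right, r_len = one_side(+1, q + w, d + w)
    left, l_len = one_side(-1, q - 1, d - 1)
    score = word_score(query[q:q + w], db[d:d + w]) + left + right
    return score, q - l_len, d - l_len, w + l_len + r_len

def blast_sketch(query, db, w=2, thresh1=7, thresh2=10, x_drop=5):
    # "Prebuilt index": every length-w word of the database -> its positions.
    index = {}
    for d in range(len(db) - w + 1):
        index.setdefault(db[d:d + w], []).append(d)
    found = set()
    for q in range(len(query) - w + 1):
        for v in neighbors(query[q:q + w], thresh1):
            for d in index.get(v, []):
                hit = extend(query, db, q, d, w, x_drop)
                if hit[0] >= thresh2:
                    found.add(hit)
    return sorted(found)

hits = blast_sketch("DEADLY", "DDGEARLYK")
for score, q, d, n in hits:
    print(score, "DEADLY"[q:q + n], "DDGEARLYK"[d:d + n])
```

On the example inputs this reports the same two matches as the slide: `ddge` with score 10 and `early` with score 18. Enumerating all 20^w candidate words is fine for w = 2 but is only meant to make the neighborhood definition explicit.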

BLAST Refinements
- “Two hit heuristic”: need 2 nearby, nonoverlapping, gapless hits before trying to extend either
- “Gapped BLAST”: run heuristic version of Smith-Waterman, bi-directional from hit, until score drops by fixed amount below max
- PSI-BLAST: for proteins, iterated search, using “weight matrix” pattern from initial pass to find weaker matches in subsequent passes (PSI = position-specific iterated)
- Many others
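A rough sketch of the two-hit filter, under the assumption that "nearby" means two seed hits on the same diagonal (db position minus query position) whose starts are at least one word length apart and no more than some window apart. The function name `two_hit_pairs` and the window of 40 are illustrative choices, not taken from the slides.

```python
from collections import defaultdict

def two_hit_pairs(hits, word_len, window):
    """hits: iterable of (query_pos, db_pos) seed matches.
    Returns pairs of consecutive hits that share a diagonal, do not overlap,
    and start within `window` positions of each other."""
    by_diag = defaultdict(list)
    for q, d in hits:
        by_diag[d - q].append(q)
    pairs = []
    for diag, q_positions in by_diag.items():
        q_positions.sort()
        for first, second in zip(q_positions, q_positions[1:]):
            gap = second - first
            if word_len <= gap <= window:
                pairs.append(((first, first + diag), (second, second + diag)))
    return pairs

# Seed hits (query_pos, db_pos) for query "deadly" against "ddgearlyk":
# de~dd at (0, 0), ea at (1, 3), ly at (4, 6).
seed_hits = [(0, 0), (1, 3), (4, 6)]
pairs = two_hit_pairs(seed_hits, word_len=2, window=40)
print(pairs)
```

Only the `ea` and `ly` hits share a diagonal, so only that pair would trigger an extension; the lone `de`~`dd` hit would be skipped, which is the intended speed-for-sensitivity trade.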

A Likelihood Ratio
Defn: two proteins are homologous if they are alike because of shared ancestry; similarity by descent
- Suppose among proteins overall, residue x occurs with frequency p_x
- Then in a random ungapped alignment of 2 random proteins, you would expect to find x aligned to y with prob p_x p_y
- Suppose among homologs, x & y align with prob p_xy
- Are seqs X & Y homologous? Which is more likely, that the alignment reflects chance or homology? Use a likelihood ratio test:

$$\sum_i \log \frac{p_{x_i y_i}}{p_{x_i}\, p_{y_i}}$$

E.g., BLOSUM62: trusted “homologues” = BLOCKS w/ ≥ 62% identity.
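To make the test concrete, here is a small sketch that evaluates the log likelihood ratio for an ungapped alignment. The two-letter alphabet and all frequencies below are made up for illustration; only the formula comes from the slide.

```python
import math

# Hypothetical frequencies (illustration only).
background = {"A": 0.6, "B": 0.4}                     # p_x
homolog_pairs = {("A", "A"): 0.5, ("A", "B"): 0.1,    # p_xy, sums to 1 and
                 ("B", "A"): 0.1, ("B", "B"): 0.3}    # has marginals p_x

def log_likelihood_ratio(x, y):
    """Sum over aligned positions i of log2(p_{x_i y_i} / (p_{x_i} p_{y_i}))."""
    if len(x) != len(y):
        raise ValueError("an ungapped alignment needs equal-length sequences")
    return sum(math.log2(homolog_pairs[a, b] / (background[a] * background[b]))
               for a, b in zip(x, y))

same = log_likelihood_ratio("AABB", "AABB")   # positive: homology more likely
diff = log_likelihood_ratio("AABB", "BBAA")   # negative: chance more likely
print(same, diff)
```

A positive total favors homology and a negative total favors chance; each per-position term is exactly what a substitution matrix entry stores, up to scaling and rounding.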

Non-ad hoc Alignment Scores
Take alignments of homologs and look at frequency of x-y alignments vs freq of x, y overall
BLOSUM approach:
- large collection of trusted alignments (the BLOCKS DB)
- subsetted by similarity, e.g. BLOSUM62 => 62% identity
- scores:

$$\sigma(x,y) = \frac{1}{\lambda} \log_2 \frac{p_{xy}}{p_x\, p_y}$$

http://blocks.fhcrc.org/blocks-bin/getblock.pl?IPB013598
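The construction can be sketched as follows. The aligned pairs are a toy stand-in for trusted alignments, `log_odds_scores` is a hypothetical helper, and p_x is estimated here from the same aligned columns rather than from proteins overall. The scale λ = 0.5 (half-bit units, the scale BLOSUM62 is usually quoted in) is also a choice made for this sketch. Pairs never observed get no score in this sketch, since their log-odds would be minus infinity.

```python
import math
from collections import Counter

def log_odds_scores(aligned_pairs, lam=0.5):
    """Return {(x, y): round((1/lam) * log2(p_xy / (p_x * p_y)))}."""
    pair_counts = Counter()
    for s, t in aligned_pairs:
        for a, b in zip(s, t):
            pair_counts[a, b] += 1   # count each column in both orders
            pair_counts[b, a] += 1   # so the resulting matrix is symmetric
    total = sum(pair_counts.values())
    p_xy = {pair: n / total for pair, n in pair_counts.items()}
    p_x = Counter()
    for (a, _), p in p_xy.items():
        p_x[a] += p                  # marginal frequency of residue a
    return {(a, b): round(math.log2(p / (p_x[a] * p_x[b])) / lam)
            for (a, b), p in p_xy.items()}

# Toy "trusted" ungapped alignments (made up for illustration).
toy_alignments = [("LLIVDE", "LIIVDE"), ("DDEELV", "DEEELV")]
scores = log_odds_scores(toy_alignments)
print(scores["L", "L"], scores["L", "I"])
```

Identities come out with higher scores than the substitutions seen in the toy data, and the matrix is symmetric by construction.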

ad hoc Alignment Scores?
Make up any scoring matrix you like
Somewhat surprisingly, under pretty general assumptions**, it is equivalent to the scores constructed as above from some set of probabilities p_xy, so you might as well understand what they are
- NCBI-BLASTN: +1/-2 ↔ 95% identity
- WU-BLASTN: +5/-4 ↔ 66% identity
** e.g., average scores should be negative, but you probably want that anyway, otherwise local alignments turn into global ones, and some score must be > 0, else best match is empty
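One way to see the equivalence is to recover the implied probabilities numerically: find the λ > 0 for which p_xy = p_x p_y e^{λ·score(x,y)} sums to 1, then read off the implied identity as the total probability of matched pairs. The sketch below assumes uniform base frequencies of 0.25 and natural-log units for λ; `implied_identity` is a name chosen here for illustration.

```python
import math

def implied_identity(match, mismatch):
    """For a nucleotide match/mismatch scheme with uniform base frequencies,
    solve 0.25*exp(lam*match) + 0.75*exp(lam*mismatch) = 1 for lam > 0 and
    return (lam, fraction of identical pairs under the implied p_xy)."""
    if match <= 0 or 0.25 * match + 0.75 * mismatch >= 0:
        raise ValueError("need a positive match score and negative expected score")

    def f(lam):
        return 0.25 * math.exp(lam * match) + 0.75 * math.exp(lam * mismatch) - 1.0

    lo, hi = 1e-6, 1.0
    while f(hi) <= 0:          # grow the bracket until the root is enclosed
        hi *= 2.0
    for _ in range(100):       # plain bisection
        mid = (lo + hi) / 2.0
        if f(mid) > 0:
            hi = mid
        else:
            lo = mid
    lam = (lo + hi) / 2.0
    return lam, 0.25 * math.exp(lam * match)

for name, (m, mm) in {"+1/-2": (1, -2), "+5/-4": (5, -4)}.items():
    lam, identity = implied_identity(m, mm)
    print(name, round(lam, 4), round(identity, 3))
```

Under these assumptions +1/-2 implies about 95% identity and +5/-4 about 65%, in line with the figures on the slide. The guard at the top of the function is the footnoted condition: without a negative expected score and some positive score, no such λ exists.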

Summary
BLAST is a highly successful search/alignment heuristic. It looks for alignments anchored by short, strong, ungapped “seed” alignments.
- Strengths: speed, E-values, well-supported implementation & web server
- Weaknesses: heuristic search can miss weaker matches
