Chapter 7: Rapid alignment methods: FASTA and BLAST

• The biological problem
• Search strategies
• FASTA
• BLAST
Introduction to bioinformatics, Autumn 2007

BLAST: Basic Local Alignment Search Tool
• BLAST (Altschul et al., 1990) and its variants are some of the most common sequence search tools in use
• Roughly, the basic BLAST has three parts:
– 1. Find local alignments between the query sequence and a database sequence ("seed hits")
– 2. Extend seed hits into high-scoring local alignments
– 3. Calculate p-values and a rank ordering of the local alignments
• High-scoring local alignments are called high-scoring segment pairs (HSPs)
• Gapped BLAST, introduced in 1997, allows for gaps in alignments

Finding seed hits
• First, we generate a set of neighborhood sequences for given k, match score matrix and threshold T
• Neighborhood sequences of a k-word w include all strings of length k that, when aligned against w, have an alignment score of at least T
• For instance, let I = GCATCGGC, J = CCATCGCCATCG and k = 5, match score be 1, mismatch score be 0 and T = 4

Finding seed hits
• I = GCATCGGC, J = CCATCGCCATCG, k = 5, match score 1, mismatch score 0, T = 4
• This allows for one mismatch in each k-word
• The neighborhood of the first k-word of I, GCATC, is GCATC and the 15 sequences
{A,C,T}CATC, G{A,G,T}ATC, GC{C,G,T}TC, GCA{A,C,G}C, GCAT{A,G,T}
(braces list the alternative letters at the mismatching position)
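The neighborhood definition can be checked by brute force. The sketch below is only an illustration: the function name `neighborhood` and the DNA alphabet "ACGT" are assumptions, and enumerating all 4^k strings is practical only for small k.

```python
from itertools import product

def neighborhood(word, threshold, alphabet="ACGT", match=1, mismatch=0):
    """All strings of len(word) whose ungapped score against word is >= threshold."""
    result = []
    for candidate in product(alphabet, repeat=len(word)):
        score = sum(match if c == w else mismatch
                    for c, w in zip(candidate, word))
        if score >= threshold:
            result.append("".join(candidate))
    return result

# Slide example: k = 5, match 1, mismatch 0, T = 4
neighbors = neighborhood("GCATC", threshold=4)
print(len(neighbors))  # 16: GCATC itself plus the 15 one-mismatch variants
```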

Finding seed hits
• I = GCATCGGC has 4 k-words and thus 4 x 16 = 64 5-word patterns (seed hits) to locate in J
• These patterns can be found using exact search in time proportional to the sum of pattern lengths + length of J + number of matches (Aho-Corasick algorithm)
– Methods for pattern matching are developed on course 58093 String processing algorithms
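For the small example, the seed hits can also be listed by comparing every k-word of I with every k-word of J. Note that this is a plain quadratic scan, not the Aho-Corasick algorithm mentioned above; it finds the same hits but without the linear-time guarantee. The function names are illustrative assumptions.

```python
def kwords(seq, k):
    """All overlapping words of length k, in order of start position."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def word_score(u, v, match=1, mismatch=0):
    """Ungapped score of two equal-length words."""
    return sum(match if a == b else mismatch for a, b in zip(u, v))

def seed_hits(query, target, k, threshold):
    """Pairs (i, j): the k-word at query[i:] scores >= threshold against target[j:]."""
    hits = []
    for i, w in enumerate(kwords(query, k)):
        for j, v in enumerate(kwords(target, k)):
            if word_score(w, v) >= threshold:
                hits.append((i, j))
    return hits

hits = seed_hits("GCATCGGC", "CCATCGCCATCG", k=5, threshold=4)
print(hits)  # [(0, 0), (0, 6), (1, 1), (1, 7), (2, 2), (3, 3)]
```

Positions are 0-based start offsets in I and J.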

Extending seed hits: original BLAST
• Initial seed hits are extended
• Extensions do not add gaps to the alignment
• Sequence is extended into an HSP until the alignment score drops below the maximum attained score minus a threshold parameter value
• All statistically significant HSPs are reported
Example (figure labels: initial seed hit, extension):
AACCGTTCATTA
 | || || ||
TAGCGATCTTTT
Altschul, S.F., Gish, W., Miller, W., Myers, E.W. and Lipman, D.J., J. Mol. Biol., 215, 403-410, 1990
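A minimal sketch of the ungapped extension rule, under stated assumptions: match score +1 and mismatch score -1 (with mismatch score 0 the score would never drop), a drop-off threshold called `x_drop`, and hypothetical function names. The seed is extended in both directions along its diagonal, and each extension is trimmed back to the point where the best score was reached.

```python
def extend_one_direction(a, b, i, j, step, x_drop, match=1, mismatch=-1):
    """Extend from (i, j) along the diagonal; return (best gain, length at best)."""
    score = best = 0
    length = best_len = 0
    while 0 <= i < len(a) and 0 <= j < len(b):
        score += match if a[i] == b[j] else mismatch
        length += 1
        if score > best:
            best, best_len = score, length
        elif score < best - x_drop:
            break  # dropped below (maximum attained score - threshold)
        i += step
        j += step
    return best, best_len

def extend_seed(a, b, i, j, k, x_drop, match=1, mismatch=-1):
    """Return (start in a, start in b, length, score) of the ungapped HSP."""
    seed = sum(match if a[i + t] == b[j + t] else mismatch for t in range(k))
    right, rlen = extend_one_direction(a, b, i + k, j + k, +1, x_drop, match, mismatch)
    left, llen = extend_one_direction(a, b, i - 1, j - 1, -1, x_drop, match, mismatch)
    return i - llen, j - llen, k + llen + rlen, seed + left + right

# Sequences from the slide; the seed "CG" at offset 3 is an assumed example.
hsp = extend_seed("AACCGTTCATTA", "TAGCGATCTTTT", i=3, j=3, k=2, x_drop=2)
print(hsp)  # (3, 3, 8, 4)
```

Here the HSP covers CGTTCATT against CGATCTTT: six matches and two mismatches, score 4.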

Extending seed hits: gapped BLAST
• In a later version of BLAST, two seed hits have to be found on the same diagonal
– Hits have to be non-overlapping
– If the hits are closer than A (additional parameter), then they are joined into an HSP
• Threshold value T is lowered to achieve comparable sensitivity
• If the resulting HSP achieves a score of at least $S_g$, a gapped extension is triggered
Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W, and Lipman DJ, Nucleic Acids Res. 1;25(17), 3389-402, 1997
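The two-hit condition can be sketched as follows. This is an illustration only: hits are assumed to be (i, j) start positions of k-words, the diagonal of a hit is j - i, and the parameter A is called `window` here. The hit list in the example is hypothetical.

```python
from collections import defaultdict

def two_hit_pairs(hits, k, window):
    """Pairs of non-overlapping hits on the same diagonal, closer than `window`."""
    by_diagonal = defaultdict(list)
    for i, j in hits:
        by_diagonal[j - i].append(i)
    pairs = []
    for diagonal in sorted(by_diagonal):
        starts = sorted(by_diagonal[diagonal])
        for n, first in enumerate(starts):
            for second in starts[n + 1:]:
                distance = second - first
                # distance >= k: the words do not overlap; distance < window: close enough
                if k <= distance < window:
                    pairs.append(((first, first + diagonal),
                                  (second, second + diagonal)))
    return pairs

example_hits = [(0, 0), (1, 1), (7, 7), (2, 10), (40, 48)]  # hypothetical hits
pairs = two_hit_pairs(example_hits, k=5, window=40)
print(pairs)
```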

Gapped extensions of HSPs
• Local alignment is performed starting from the HSP
• Dynamic programming matrix filled in "forward" and "backward" directions (see figure)
• Skip cells where value would be $X_g$ below the best alignment score found so far
[Figure: dynamic programming matrix around the HSP, showing the region searched with score above the cutoff parameter and the region potentially searched by the alignment algorithm]
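A simplified sketch of the pruning idea, in the "forward" direction only. It is not the Gapped BLAST implementation: the scores (match +1, mismatch -1, linear gap penalty -2) and the function name are assumptions, and the full matrix row is still visited even though pruned cells are never extended.

```python
def xdrop_gapped_extension(a, b, x_g, match=1, mismatch=-1, gap=-2):
    """Best score of a gapped alignment starting at (0, 0), pruning cells
    whose value is more than x_g below the best score found so far."""
    NEG = float("-inf")
    best = 0
    prev = [0] + [NEG] * len(b)
    for j in range(1, len(b) + 1):          # first row: gaps only
        v = prev[j - 1] + gap
        prev[j] = v if v >= best - x_g else NEG
    for i in range(1, len(a) + 1):
        cur = [NEG] * (len(b) + 1)
        v = prev[0] + gap
        cur[0] = v if v >= best - x_g else NEG
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            v = max(prev[j - 1] + s, prev[j] + gap, cur[j - 1] + gap)
            if v >= best - x_g:             # otherwise the cell is skipped
                cur[j] = v
                best = max(best, v)
        if all(c == NEG for c in cur):      # nothing left to extend
            break
        prev = cur
    return best

print(xdrop_gapped_extension("ACGTACGT", "ACGTTACGT", x_g=3))  # 6
```

In the example, eight matches and one gap give 8 - 2 = 6. The "backward" direction can be handled by calling the same function on the reversed prefixes.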

Estimating the significance of results
• In general, we have a score S(D, X) = s for a sequence X found in database D
• BLAST rank-orders the sequences found by p-values
• The p-value for this hit is $P(S(D, Y) \ge s)$ where Y is a random sequence
– Measures the amount of "surprise" of finding sequence X
• A smaller p-value indicates a more significant hit
– A p-value of 0.1 means that one-tenth of random sequences would have a score as large as our result
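The definition can be illustrated with an empirical estimate: given scores obtained for random sequences, the p-value is the fraction that reach the observed score. BLAST does not compute p-values this way; the function name and the score list are assumptions for illustration.

```python
def empirical_p_value(random_scores, observed):
    """Fraction of random-sequence scores that are >= the observed score."""
    return sum(1 for s in random_scores if s >= observed) / len(random_scores)

# Hypothetical scores of ten random sequences; only one reaches s = 10.
print(empirical_p_value([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], observed=10))  # 0.1
```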

Estimating the significance of results
• In BLAST, p-values are computed roughly as follows
• There are nm places to begin an optimal alignment in the n x m alignment matrix
• Optimal alignment is preceded by a mismatch and has t matching (identical) letters
– (Assume match score 1 and mismatch score 0)
• Let p = P(two random letters are equal)
• The probability of having a mismatch and then t matches is $(1-p)p^t$

Estimating the significance of results
• We model this event by a Poisson distribution (why?) with mean $\lambda = nm(1-p)p^t$
• $P(\text{there is a local alignment of length } t \text{ or longer}) = 1 - P(\text{no such event}) = 1 - e^{-\lambda} = 1 - \exp(-nm(1-p)p^t)$
• An equation of the same form is used in BLAST: $P(S(D, Y) \ge s) \approx 1 - \exp(-nm\gamma\xi^t)$, where $\gamma > 0$ and $0 < \xi < 1$
– The exponent $nm\gamma\xi^t$ is the expected number of such alignments (the E-value); the probability itself is the p-value
• Parameters $\gamma$ and $\xi$ are estimated from data
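A small numeric sketch of the Poisson approximation with match score 1 and mismatch score 0. The function name and the example values (n = m = 100, p = 0.25, t = 10) are assumptions chosen for illustration.

```python
import math

def poisson_p_value(n, m, p, t):
    """P(at least one run of t matches preceded by a mismatch), Poisson approximation."""
    lam = n * m * (1 - p) * p ** t      # expected number of such runs
    return -math.expm1(-lam)            # 1 - exp(-lam), computed stably for small lam

p_value = poisson_p_value(n=100, m=100, p=0.25, t=10)
print(round(p_value, 6))  # 0.007127
```

For small means the p-value is close to the mean itself, which is why small E-values and p-values are nearly equal.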

Scoring amino acid alignments
• We need a way to compute the score S(D, X) for aligning the sequence X against database D
• Scoring DNA alignments was discussed previously
• Constructing a scoring model for amino acids is more challenging
– 20 different amino acids vs. 4 bases
• Figure shows the molecular structures of the 20 amino acids
http://en.wikipedia.org/wiki/List_of_standard_amino_acids

Scoring amino acid alignments
• Substitutions between chemically similar amino acids are more frequent than between dissimilar amino acids
• We can check our scoring model against this
http://en.wikipedia.org/wiki/List_of_standard_amino_acids

Score matrices
• Scores s = S(D, X) are obtained from score matrices
• Let $A = a_1 a_2 \ldots a_n$ and $B = b_1 b_2 \ldots b_n$ be sequences of equal length (no gaps allowed to simplify things)
• To obtain a score for alignment of A and B, where $a_i$ is aligned against $b_i$, we take the ratio of two probabilities
– The probability of having A and B where the characters match (match model M)
– The probability that A and B were chosen randomly (random model R)

Score matrices: random model
• Under the random model, the probability of having X and Y is $P(X, Y \mid R) = \prod_i q_{x_i} \prod_i q_{y_i}$, where $q_{x_i}$ is the probability of occurrence of amino acid type $x_i$
• Position where an amino acid occurs does not affect its type

Score matrices: match model
• Let $p_{ab}$ be the probability of having amino acids of type a and b aligned against each other given they have evolved from the same ancestor c
• The probability is $P(X, Y \mid M) = \prod_i p_{x_i y_i}$

Score matrices: log-odds ratio score
• We obtain the score S by taking the ratio of these two probabilities and taking a logarithm of the ratio:
$S = \log \frac{P(X, Y \mid M)}{P(X, Y \mid R)} = \log \prod_i \frac{p_{x_i y_i}}{q_{x_i} q_{y_i}}$

Score matrices: log-odds ratio score
• The score S is obtained by summing over character pair-specific scores: $S = \sum_i s(x_i, y_i)$, where $s(a, b) = \log \frac{p_{ab}}{q_a q_b}$
• The probabilities $q_a$ and $p_{ab}$ are extracted from data
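A minimal sketch of the log-odds score for an ungapped alignment. The two-letter alphabet and the probabilities below are made up for illustration (they are not real amino acid frequencies), and natural logarithms are an assumption; any base only rescales the scores.

```python
import math

def log_odds_score(x, y, p, q):
    """Sum over aligned positions of log(p_ab / (q_a * q_b))."""
    return sum(math.log(p[(a, b)] / (q[a] * q[b])) for a, b in zip(x, y))

# Hypothetical model: letters A and B, each with background probability 0.5
q = {"A": 0.5, "B": 0.5}
p = {("A", "A"): 0.4, ("B", "B"): 0.4, ("A", "B"): 0.1, ("B", "A"): 0.1}

score = log_odds_score("AAB", "ABB", p, q)
print(score)  # log(1.6) + log(0.4) + log(1.6)
```

Identical pairs are more likely under the match model than under the random model here, so they score positive; the mismatched pair scores negative.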

Calculating score matrices for amino acids
• Probabilities $q_a$ are in principle easy to obtain:
– Count relative frequencies of every amino acid in a sequence database

Calculating score matrices for amino acids
• To calculate $p_{ab}$ we can use a known pool of aligned sequences
• BLOCKS is a database of highly conserved regions for proteins
• It lists multiply aligned, ungapped and conserved protein segments
• Example from BLOCKS shows genes related to human gene associated with DNA-repair defect xeroderma pigmentosum
http://blocks.fhcrc.org

Block PR00851A
ID XRODRMPGMNTB; BLOCK
AC PR00851A; distance from previous block=(52,131)
DE Xeroderma pigmentosum group B protein signature
BL adapted; width=21; seqs=8; 99.5%=985; strength=1287
XPB_HUMAN|P19447 ( 74) RPLWVAPDGHIFLEAFSPVYK 54
XPB_MOUSE|P49135 ( 74) RPLWVAPDGHIFLEAFSPVYK 54
P91579 ( 80) RPLYLAPDGHIFLESFSPVYK 67
XPB_DROME|Q02870 ( 84) RPLWVAPNGHVFLESFSPVYK 79
RA25_YEAST|Q00578 ( 131) PLWISPSDGRIILESFSPLAE 100
Q38861 ( 52) RPLWACADGRIFLETFSPLYK 71
O13768 ( 90) PLWINPIDGRIILEAFSPLAE 100
O00835 ( 79) RPIWVCPDGHIFLETFSAIYK 86

BLOSUM matrix
• BLOSUM is a score matrix for amino acid sequences derived from BLOCKS data
• First, count pairwise matches $f_{x,y}$ for every amino acid type pair (x, y)
• For example, for column 3 and amino acids L and W, we find 8 pairwise matches: $f_{L,W} = f_{W,L} = 8$
Example block:
RPLWVAPD
RPLWVAPR
RPLWVAPN
PLWISPSD
RPLWACAD
PLWINPID
RPIWVCPD
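The count for column 3 can be reproduced with a short sketch. The function name is an assumption; each unordered pair of rows in the column is counted once, and the pair key is stored in sorted order so that (L, W) and (W, L) share one entry.

```python
from collections import Counter
from itertools import combinations

def column_pair_counts(column):
    """Count unordered residue pairs between all pairs of rows in one column."""
    counts = Counter()
    for a, b in combinations(column, 2):
        counts[tuple(sorted((a, b)))] += 1
    return counts

block = ["RPLWVAPD", "RPLWVAPR", "RPLWVAPN", "PLWISPSD",
         "RPLWACAD", "PLWINPID", "RPIWVCPD"]
column3 = [seq[2] for seq in block]    # L, L, L, W, L, W, I
f = column_pair_counts(column3)
print(f[("L", "W")])  # 8
```

Column 3 contains four L's and two W's, so there are 4 x 2 = 8 L-W pairs out of 21 pairs in total.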

Creating a BLOSUM matrix
• Probability $p_{ab}$ is obtained by dividing $f_{ab}$ by the total number of pairs (note difference with course book): [formula not preserved in the transcript]
• We get probabilities $q_a$ by [formula not preserved in the transcript]
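A sketch of the normalisation step for $p_{ab}$ only, assuming that "total number of pairs" means the sum of the counts over unordered residue pairs; the slide's exact formula is not preserved, so this convention is an assumption. The counts are those of column 3 of the example block (four L's, two W's, one I).

```python
def pair_probabilities(f):
    """Divide each pair count by the total number of pairs."""
    total = sum(f.values())
    return {pair: count / total for pair, count in f.items()}

# Pair counts for column 3 of the example block, keyed by sorted residue pair
f_column3 = {("L", "L"): 6, ("L", "W"): 8, ("I", "L"): 4,
             ("I", "W"): 2, ("W", "W"): 1}
p_column3 = pair_probabilities(f_column3)
print(p_column3[("L", "W")])  # 8/21
```

In practice the counts would be summed over all columns of all blocks before normalising.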
