CSE 182-L2:Blast & variants I Dynamic Programming www.cse cse. - PowerPoint PPT Presentation

CSE 182-L2:Blast & variants I Dynamic Programming www.cse cse. .ucsd ucsd. .edu edu/classes/fa05/cse182 /classes/fa05/cse182 www. www.cse cse. .ucsd ucsd. .edu edu/~ /~vbafna vbafna www. FA05 CSE182

Searching Sequence databases http://www.ncbi.nlm.nih.gov/BLAST/ FA05 CSE182

Query: >gi|26339572|dbj|BAC33457.1| unnamed protein product [Mus musculus] MSSTKLEDSLSRRNWSSASELNETQEPFLNPTDYDDEEFLRYLWREYLHPKEYEWVLIAGYIIVFVVA LIGNVLVCVAVWKNHHMRTVTNYFIVNLSLADVLVTITCLPATLVVDITETWFFGQSLCKVIPYLQTV SVSVSVLTLSCIALDRWYAICHPLMFKSTAKRARNSIVVIWIVSCIIMIPQAIVMECSSMLPGLANKT TLFTVCDEHWGGEVYPKMYHICFFLVTYMAPLCLMILAYLQIFRKLWCRQIPGTSSVVQRKWKQQQPV SQPRGSGQQSKARISAVAAEIKQIRARRKTARMLMVVLLVFAICYLPISILNVLKRVFGMFTHTEDRE TVYAWFTFSHWLVYANSAANPIIYNFLSGKFREEFKAAFSCCLGVHHRQGDRLARGRTSTESRKSLTT QISNFDNVSKLSEHVVLTSISTLPAANGAGPLQNWYLQQGVPSSLLSTWLEV • What is the function of this sequence? • Is there a human homolog? • Which organelle does it work in? (Secreted/membrane bound) • Idea: Search a database of known proteins to see if you can find similar sequences which have a known function FA05 CSE182

Querying with Blast FA05 CSE182

Blast Output • The output (Blastp query) is a series of protein sequences, ranked according to similarity with the query • Each database hit is aligned to a subsequence of the query FA05 CSE182

Blast Output Schematic 26 422 query db 19 405 FA05 CSE182

Blast Output S Id Q beg S beg Q end S end FA05 CSE182

Interpreting Blast results • How do we interpret these results? – Similar sequence in the 3 species implies that the common ancestor of the 3 had that sequence. – The sequence accumulates mutations over time. These mutations may be indels, or substitutions. – Hum and mus diverged more recently and so the sequences are more likely to be similar. hum hummus? mus dros FA05 CSE182

How do we measure similarity between sequences? • Percent identity? – Hard to compute without indels. A T C A A T C G A T C A A T C G T C A A T G G T T C A A T G G T Number of sequence edit operations?  Number of sequence edit operations?  Implies a notion of alignment.  Implies a notion of alignment.  FA05 CSE182

Computing alignments • What is an alignment? 2Xm table. • Each sequence is a row, with interspersed gaps • Columns describe the edit operations • A A - T C G G A A C T C G - A What is the score of an alignment? • What is the score of an alignment? • Score every column, and sum up the scores. Let C be the score • Score every column, and sum up the scores. Let C be the score • function for the column function for the column How do we compute the alignment with the best score? • How do we compute the alignment with the best score? • FA05 CSE182

Optimum scoring alignments, and score of optimum alignment • Instead of computing an optimum scoring alignment, we attempt to compute the score of an optimal alignment. • Later, we will show that the two are equivalent FA05 CSE182

Computing Optimal Alignment score 1 1 2 k t s • Observations: The optimum alignment has nice recursive properties: – The alignment score is the sum of the scores of columns. – If we break off at cell k, the left part and right part must be optimal sub-alignments. – The left part contains prefixes s[1..i], and t[1..j] FA05 CSE182

Optimum prefix alignments 1 k s t • Consider an optimum alignment of the prefix s[1..i], and t[1..j] • Look at the last cell. It can only have 3 possibilities. FA05 CSE182

3 possibilities for rightmost cell Optimum alignment of s[1..i-1], and t[1..j-1] s[i] 1. s[i] is aligned to t[j] t[j] Optimum alignment of s[1..i-1], and t[1..j] 2. s[i] is aligned to ‘-’ s[i] 3. t[j] is aligned to ‘-’ t[j] Optimum alignment of s[1..i], and t[1..j-1] FA05 CSE182

Optimal score of an alignment Optimum alignment of s[1..i-1], and t[1..j-1] s[i] S[i,j] = C(s i ,t j )+S(i-1,j-1) t[j] Optimum alignment of s[1..i-1], and t[1..j] s[i] S[i,j] = C(s i ,-)+S(i-1,j) - Optimum alignment of s[1..i], and t[1..j-1] - S[i,j] = C(-,t j )+S(i,j-1) t[j] • Let S[i,j] be the score of an optimal alignment of the prefix s[1..i], and t[1..j]. It must be one of 3 possibilities. FA05 CSE182

Optimal alignment score  S [ i − 1, j − 1] + C ( s i , t j )  S [ i , j ] = max S [ i − 1, j ] + C ( s i , − )   S [ i , j − 1] + C ( − , t j )  • Which prefix pairs (i,j) should we use? For now, simply use all. • If the strings are of length m, and n, respectively, what is the score of the optimal alignment? FA05 CSE182

An O(nm) algorithm for score computation For i = 1 to n For j = 1 to m  S [ i − 1, j − 1] + C ( s i , t j )  S [ i , j ] = max S [ i − 1, j ] + C ( s i , − )   S [ i , j − 1] + C ( − , t j )  • The iteration ensures that all values on the right are computed in earlier steps. FA05 CSE182

Initialization S [0,0] = 0 S [ i ,0] = C ( s i , − ) + S [ i − 1,0] ∀ i S [0, j ] = C ( − , s j ) + S [0, j − 1] ∀ j FA05 CSE182

A tableaux approach t 1 j n 1 s M S[i-1,j] S[i-1,j-1]  S [ i − 1, j − 1] + C ( s i , t j )  S [ i , j ] = max S [ i − 1, j ] + C ( s i , − )   S [ i , j − 1] + C ( − , t j )  i S[i,j-1] S[i,j] n Cell (i,j) contains the score S[i,j]. Each cell only looks at 3 neighboring cells FA05 CSE182

An Example T C A T - T C A T T G C A A T G C A A A1 A2 • Align s=TCAT with t=TGCAA • Match Score = 1 • Mismatch score = -1, Indel Score = -1 • Score A1?, Score A2? FA05 CSE182

Alignment Table T G C A A 0 -1 -2 -3 -4 -5 -1 1 0 -1 -2 -3 T -2 0 0 1 0 -1 C -3 -1 -1 0 2 1 A -4 -2 -2 -1 1 1 T FA05 CSE182

Alignment Table • S[4,5] = 1 is the score of an optimum T G C A A alignment 0 -1 -2 -3 -4 -5 • Therefore, A2 is an -1 1 0 -1 -2 -3 optimum alignment T • We know how to -2 0 0 1 0 -1 C obtain the optimum -3 -1 -1 0 2 1 A Score. How do we get -4 -2 -2 -1 1 1 the best alignment? T FA05 CSE182

Computing Optimum Alignment • At each cell, we have 3 choices • We maintain additional information to record the choice at each step. For i = 1 to n For j = 1 to m  S [ i − 1, j − 1] + C ( s i , t j )  S [ i , j ] = max S [ i − 1, j ] + C ( s i , − )   S [ i , j − 1] + C ( − , t j )  j-1 j If (S[i,j]= S[i-1,j-1] + C(s i ,t j )) M[i,j] = i-1 If (S[i,j]= S[i-1,j] + C(s i ,-)) M[i,j] = If (S[i,j]= S[i,j-1] + C(-,t j ) ) M[i,j] = i FA05 CSE182

Computing Optimal Alignments T G C A A 0 -1 -2 -3 -4 -5 -1 1 0 -1 -2 -3 T -2 0 0 1 0 -1 C -3 -1 -1 0 2 1 A -4 -2 -2 -1 1 1 T FA05 CSE182

Retrieving Opt.Alignment • M[4,5]= 1 2 3 4 5 Implies that T G C A A S[4,5]=S[3,4]+C( A,T ) 0 -1 -2 -3 -4 -5 or A -1 1 0 -1 -2 -3 1 T T -2 0 0 1 0 -1 2 C M[3,4]= -3 -1 -1 0 2 1 3 A Implies that S[3,4]=S[2,3] +C( A,A ) -4 -2 -2 -1 1 1 4 T or A A A T FA05 CSE182

Retrieving Opt.Alignment • M[2,3]= Implies that T G C A A S[2,3]=S[1,2]+C( C,C ) 0 -1 -2 -3 -4 -5 or C A A -1 1 0 -1 -2 -3 T C A T -2 0 0 1 0 -1 C M[1,2]= -3 -1 -1 0 2 1 A Implies that S[1,2]=S[1,1] +C (-,G ) -4 -2 -2 -1 1 1 T or T - C A A T G C A T FA05 CSE182

Algorithm to retrieve optimal alignment RetrieveAl(i,j) if (M[i,j] == `\’) s i t j return (RetrieveAl (i-1,j-1) . ) else if (M[i,j] == `|’) s i - return (RetrieveAl (i-1,j) . ) else if (M[i,j] == `--’ ’) ) else if (M[i,j] == `-- - return (RetrieveAl RetrieveAl (i,j-1) . (i,j-1) . ) ) return ( t j FA05 CSE182

Summary • An optimal alignment of strings of length n and m can be computed in O(nm) time • There is a tight connection between computation of optimal score, and computation of opt. Alignment – True for all DP based solutions FA05 CSE182

Global versus Local Alignment Consider s = ACCACCCCTT t = ATCCCCACAT ACCACCCCTT ACCACCCCT T ACCACCCCTT ACCACCCCT T ATCCCCACAT A TCCCCACAT ATCCCCACAT A TCCCCACAT FA05 CSE182

Blast Outputs Local Alignment Schematic 26 422 query db 19 405 FA05 CSE182

Local Alignment • Compute maximum scoring interval over all sub-intervals (a,b), and (a’,b’) a • How can we compute this efficiently? b a’ b’ FA05 CSE182

Local Alignment Trick  0 S [ i − 1, j − 1] + C ( s i , t j )  S [ i , j ] = max  S [ i − 1, j ] + C ( s i , − )  S [ i , j − 1] + C ( − , t j )  How can we compute the local alignment itself? FA05 CSE182

Generalizing Gap Cost • It is more likely for gaps to be contiguous • The penalty for a gap of length l should be go + ge ∗ l FA05 CSE182

CSE 182-L2:Blast & variants I Dynamic Programming www.cse cse. - PowerPoint PPT Presentation

CSE 182-L2:Blast & variants I Dynamic Programming www.cse cse. .ucsd ucsd. .edu edu/classes/fa05/cse182 /classes/fa05/cse182 www. www.cse cse. .ucsd ucsd. .edu edu/~ /~vbafna vbafna www. FA05 CSE182 Searching Sequence

BLAST Business License/ Web Update Business License/ Web Update BLAST BLAST BLAST BLAST (

A few BLAST details Julin Maloof April 16, 2019 Slides courtesy of Venkatsean Sundaresan BLAST

L4: Blast: Alignment Scores etc. L4: Blast: Alignment Scores etc. Why is Blast Fast? Why is

CSE 182-L2:Blast & variants I Dynamic Programming FA08 CSE182 Notes

Rapid alignment methods: FASTA and BLAST p The biological problem p Search strategies p FASTA p

Outline BLAST CSE 527 Scoring Computational Biology Weekly Bio Interlude: PCR & Sequencing

Consensus Variants Usman Mazhar Mirza 6/17/2013 1 Consensus Variants In the variants we

Slide 1 / 182 Slide 2 / 182 Algebra Based Physics Kinematics in One Dimension 2015-08-25

Blast Injuries and Landmines Travelling positive pressure wave C. Giannou Hat Yai July 2012

Chapter 7: Rapid alignment methods: FASTA and BLAST The biological problem l Search strategies

Alignments in Practice BLAST and CLUSTAL Introduction to Bioinformatics Dortmund, 16.-20.07.2007

Software Verification with BLAST Model Checking Blast Motivation Rigorous Sofware Development

CSE P 527 Computational Biology 3: BLAST, Alignment score significance; PCR and DNA sequencing

CSE 182: Biological Data Analysis Instructor: Vineet Bafna TA: Ryan Kelley www. www.cse cse.

The game Euclid , its variants, and continued fractions Nhan Bao Ho 23 April 2014 Nhan Bao Ho

Motion in One Dimension Average Speed Position and Reference Frame Displacement Average

Gene World: A large-scale, gene-centric seman5c web

GRABBAG! STEPHANIE J SPIELMAN, PHD BIO5312, FALL 2017 REGULAR EXPRESSIONS Pattern-based

Algorithms for Analyzing Intraspecific Sequence Variation Srinath Sridhar Computer Science

Mouse Epithelium Dataset S IN GLE-CELL RN A-S EQ W ORK F LOW S IN R Fanny Perraudeau Senior

Data specificities and normalization Etienne Delannoy 1 and Marie-Laure Martin-Magniette 1 , 2 1-

LAWSCI (2017) http://sciforum.net/conference/mol2net- 03/lawsci-01 Biotechnology in plants

Bioinformatics: Sequence Analysis COMP 571 - Fall 2010 Luay Nakhleh, Rice University Course

G alaxy for G enomics-enabled B reeding Star Yanxin Gao yg28@cornell.edu Introduction

CSE 182-L2:Blast & variants I Dynamic Programming www.cse cse. - PowerPoint PPT Presentation

CSE 182-L2:Blast & variants I Dynamic Programming www.cse cse. .ucsd ucsd. .edu edu/classes/fa05/cse182 /classes/fa05/cse182 www. www.cse cse. .ucsd ucsd. .edu edu/~ /~vbafna vbafna www. FA05 CSE182 Searching Sequence

BLAST Business License/ Web Update Business License/ Web Update BLAST BLAST BLAST BLAST (

A few BLAST details Julin Maloof April 16, 2019 Slides courtesy of Venkatsean Sundaresan BLAST

L4: Blast: Alignment Scores etc. L4: Blast: Alignment Scores etc. Why is Blast Fast? Why is

CSE 182-L2:Blast &amp; variants I Dynamic Programming FA08 CSE182 Notes

Rapid alignment methods: FASTA and BLAST p The biological problem p Search strategies p FASTA p

Outline BLAST CSE 527 Scoring Computational Biology Weekly Bio Interlude: PCR &amp; Sequencing

Consensus Variants Usman Mazhar Mirza 6/17/2013 1 Consensus Variants In the variants we

Slide 1 / 182 Slide 2 / 182 Algebra Based Physics Kinematics in One Dimension 2015-08-25

Blast Injuries and Landmines Travelling positive pressure wave C. Giannou Hat Yai July 2012

Chapter 7: Rapid alignment methods: FASTA and BLAST The biological problem l Search strategies

Alignments in Practice BLAST and CLUSTAL Introduction to Bioinformatics Dortmund, 16.-20.07.2007

Software Verification with BLAST Model Checking Blast Motivation Rigorous Sofware Development

CSE P 527 Computational Biology 3: BLAST, Alignment score significance; PCR and DNA sequencing

CSE 182: Biological Data Analysis Instructor: Vineet Bafna TA: Ryan Kelley www. www.cse cse.

The game Euclid , its variants, and continued fractions Nhan Bao Ho 23 April 2014 Nhan Bao Ho

Motion in One Dimension Average Speed Position and Reference Frame Displacement Average

Gene World: A large-scale, gene-centric seman5c web

GRABBAG! STEPHANIE J SPIELMAN, PHD BIO5312, FALL 2017 REGULAR EXPRESSIONS Pattern-based

Algorithms for Analyzing Intraspecific Sequence Variation Srinath Sridhar Computer Science

Mouse Epithelium Dataset S IN GLE-CELL RN A-S EQ W ORK F LOW S IN R Fanny Perraudeau Senior

Data specificities and normalization Etienne Delannoy 1 and Marie-Laure Martin-Magniette 1 , 2 1-

LAWSCI (2017) http://sciforum.net/conference/mol2net- 03/lawsci-01 Biotechnology in plants

Bioinformatics: Sequence Analysis COMP 571 - Fall 2010 Luay Nakhleh, Rice University Course

G alaxy for G enomics-enabled B reeding Star Yanxin Gao yg28@cornell.edu Introduction

CSE 182-L2:Blast & variants I Dynamic Programming FA08 CSE182 Notes

Outline BLAST CSE 527 Scoring Computational Biology Weekly Bio Interlude: PCR & Sequencing