CSE 182-L2:Blast & variants I Dynamic Programming www.cse cse. - - PowerPoint PPT Presentation

cse 182 l2 blast variants i
SMART_READER_LITE
LIVE PREVIEW

CSE 182-L2:Blast & variants I Dynamic Programming www.cse cse. - - PowerPoint PPT Presentation

CSE 182-L2:Blast & variants I Dynamic Programming www.cse cse. .ucsd ucsd. .edu edu/classes/fa05/cse182 /classes/fa05/cse182 www. www.cse cse. .ucsd ucsd. .edu edu/~ /~vbafna vbafna www. FA05 CSE182 Searching Sequence


slide-1
SLIDE 1

FA05 CSE182

CSE 182-L2:Blast & variants I

Dynamic Programming

www. www.cse cse. .ucsd ucsd. .edu edu/classes/fa05/cse182 /classes/fa05/cse182 www. www.cse cse. .ucsd ucsd. .edu edu/~ /~vbafna vbafna

slide-2
SLIDE 2

FA05 CSE182

Searching Sequence databases

http://www.ncbi.nlm.nih.gov/BLAST/

slide-3
SLIDE 3

FA05 CSE182

Query:

>gi|26339572|dbj|BAC33457.1| unnamed protein product [Mus musculus] MSSTKLEDSLSRRNWSSASELNETQEPFLNPTDYDDEEFLRYLWREYLHPKEYEWVLIAGYIIVFVVA LIGNVLVCVAVWKNHHMRTVTNYFIVNLSLADVLVTITCLPATLVVDITETWFFGQSLCKVIPYLQTV SVSVSVLTLSCIALDRWYAICHPLMFKSTAKRARNSIVVIWIVSCIIMIPQAIVMECSSMLPGLANKT TLFTVCDEHWGGEVYPKMYHICFFLVTYMAPLCLMILAYLQIFRKLWCRQIPGTSSVVQRKWKQQQPV SQPRGSGQQSKARISAVAAEIKQIRARRKTARMLMVVLLVFAICYLPISILNVLKRVFGMFTHTEDRE TVYAWFTFSHWLVYANSAANPIIYNFLSGKFREEFKAAFSCCLGVHHRQGDRLARGRTSTESRKSLTT QISNFDNVSKLSEHVVLTSISTLPAANGAGPLQNWYLQQGVPSSLLSTWLEV

  • What is the function of this sequence?
  • Is there a human homolog?
  • Which organelle does it work in? (Secreted/membrane bound)
  • Idea: Search a database of known proteins to see if you can find

similar sequences which have a known function

slide-4
SLIDE 4

FA05 CSE182

Querying with Blast

slide-5
SLIDE 5

FA05 CSE182

Blast Output

  • The output (Blastp query) is a series of protein

sequences, ranked according to similarity with the query

  • Each database hit is aligned to a subsequence of

the query

slide-6
SLIDE 6

FA05 CSE182

Blast Output

query

26 19 405 422

Schematic

db

slide-7
SLIDE 7

FA05 CSE182

Blast Output

Q beg S beg Q end S end S Id

slide-8
SLIDE 8

FA05 CSE182

Interpreting Blast results

  • How do we interpret these results?

– Similar sequence in the 3 species implies that the common ancestor

  • f the 3 had that sequence.

– The sequence accumulates mutations over time. These mutations may be indels, or substitutions. – Hum and mus diverged more recently and so the sequences are more likely to be similar. hum mus dros hummus?

slide-9
SLIDE 9

FA05 CSE182

How do we measure similarity between sequences?

  • Percent identity?

– Hard to compute without indels.

  • Number of sequence edit operations?

Number of sequence edit operations?

  • Implies a notion of alignment.

Implies a notion of alignment. A T C A A T C G T C A A T G G T A T C A A T C G T C A A T G G T

slide-10
SLIDE 10

FA05 CSE182

Computing alignments

  • What is an alignment?
  • 2Xm table.
  • Each sequence is a row, with interspersed gaps
  • Columns describe the edit operations
  • What is the score of an alignment?

What is the score of an alignment?

  • Score every column, and sum up the scores. Let C be the score

Score every column, and sum up the scores. Let C be the score function for the column function for the column

  • How do we compute the alignment with the best score?

How do we compute the alignment with the best score?

A

  • G

C T C A A G G C T

  • A

A

slide-11
SLIDE 11

FA05 CSE182

Optimum scoring alignments, and score of optimum alignment

  • Instead of computing an optimum scoring

alignment, we attempt to compute the score of an

  • ptimal alignment.
  • Later, we will show that the two are equivalent
slide-12
SLIDE 12

FA05 CSE182

Computing Optimal Alignment score

  • Observations: The optimum alignment has nice recursive

properties:

– The alignment score is the sum of the scores of columns. – If we break off at cell k, the left part and right part must be

  • ptimal sub-alignments.

– The left part contains prefixes s[1..i], and t[1..j]

1 2 1 k t s

slide-13
SLIDE 13

FA05 CSE182

Optimum prefix alignments

  • Consider an optimum alignment of the prefix s[1..i],

and t[1..j]

  • Look at the last cell. It can only have 3

possibilities. 1 k s t

slide-14
SLIDE 14

FA05 CSE182

3 possibilities for rightmost cell

1. s[i] is aligned to t[j]

  • 2. s[i] is aligned to ‘-’
  • 3. t[j] is aligned to ‘-’

s[i] t[j] s[i] t[j]

Optimum alignment of s[1..i-1], and t[1..j-1] Optimum alignment of s[1..i-1], and t[1..j] Optimum alignment of s[1..i], and t[1..j-1]

slide-15
SLIDE 15

FA05 CSE182

Optimal score of an alignment

  • Let S[i,j] be the score of an optimal alignment of the prefix

s[1..i], and t[1..j]. It must be one of 3 possibilities.

s[i] t[j] Optimum alignment of s[1..i-1], and t[1..j-1] s[i]

  • Optimum alignment of s[1..i-1], and t[1..j]
  • Optimum alignment of s[1..i], and t[1..j-1]

t[j]

S[i,j] = C(si,tj)+S(i-1,j-1) S[i,j] = C(si,-)+S(i-1,j) S[i,j] = C(-,tj)+S(i,j-1)

slide-16
SLIDE 16

FA05 CSE182

Optimal alignment score

  • Which prefix pairs (i,j) should we use? For now,

simply use all.

  • If the strings are of length m, and n, respectively,

what is the score of the optimal alignment?

S[i, j] = max S[i −1, j −1]+ C(si,t j) S[i −1, j]+ C(si,−) S[i, j −1]+ C(−,t j)     

slide-17
SLIDE 17

FA05 CSE182

An O(nm) algorithm for score computation

  • The iteration ensures that all values on the right

are computed in earlier steps.

S[i, j] = max S[i −1, j −1]+ C(si,t j) S[i −1, j]+ C(si,−) S[i, j −1]+ C(−,t j)     

For i = 1 to n For j = 1 to m

slide-18
SLIDE 18

FA05 CSE182

Initialization

S[0,0] = 0 S[i,0] = C(si,−) + S[i −1,0] ∀i S[0, j] = C(−,s j) + S[0, j −1] ∀j

slide-19
SLIDE 19

FA05 CSE182

A tableaux approach

s n 1 i 1 j n

M

S[i,j-1] S[i,j] S[i-1,j]

S[i-1,j-1]

t Cell (i,j) contains the score S[i,j]. Each cell only looks at 3 neighboring cells

S[i, j] = max S[i −1, j −1]+ C(si,t j) S[i −1, j]+ C(si,−) S[i, j −1]+ C(−,t j)     

slide-20
SLIDE 20

FA05 CSE182

An Example

  • Align s=TCAT with t=TGCAA
  • Match Score = 1
  • Mismatch score = -1, Indel Score = -1
  • Score A1?, Score A2?

T C A T - T G C A A T C A T T G C A A A1 A2

slide-21
SLIDE 21

FA05 CSE182

1 1

  • 1
  • 2
  • 2
  • 4

1 2

  • 1
  • 1
  • 3
  • 1

1

  • 2
  • 3
  • 2
  • 1

1

  • 1
  • 5
  • 4
  • 3
  • 2
  • 1

T G C A A T C A T

Alignment Table

slide-22
SLIDE 22

FA05 CSE182 1 1

  • 1
  • 2
  • 2
  • 4

1 2

  • 1
  • 1
  • 3
  • 1

1

  • 2
  • 3
  • 2
  • 1

1

  • 1
  • 5
  • 4
  • 3
  • 2
  • 1

T G C A A T C A T

Alignment Table

  • S[4,5] = 1 is the score
  • f an optimum

alignment

  • Therefore, A2 is an
  • ptimum alignment
  • We know how to
  • btain the optimum
  • Score. How do we get

the best alignment?

slide-23
SLIDE 23

FA05 CSE182

Computing Optimum Alignment

  • At each cell, we have 3 choices
  • We maintain additional information to record the choice at

each step.

For i = 1 to n For j = 1 to m

S[i, j] = max S[i −1, j −1]+ C(si,t j) S[i −1, j]+ C(si,−) S[i, j −1]+ C(−,t j)     

If (S[i,j]= S[i-1,j-1] + C(si,tj)) M[i,j] = If (S[i,j]= S[i-1,j] + C(si,-)) M[i,j] = If (S[i,j]= S[i,j-1] + C(-,tj) ) M[i,j] =

j-1 i-1 j i

slide-24
SLIDE 24

FA05 CSE182

T G C A A T C A T 1 1

  • 1
  • 2
  • 2
  • 4

1 2

  • 1
  • 1
  • 3
  • 1

1

  • 2
  • 3
  • 2
  • 1

1

  • 1
  • 5
  • 4
  • 3
  • 2
  • 1

Computing Optimal Alignments

slide-25
SLIDE 25

FA05 CSE182

Retrieving Opt.Alignment

1 1

  • 1
  • 2
  • 2
  • 4

1 2

  • 1
  • 1
  • 3
  • 1

1

  • 2
  • 3
  • 2
  • 1

1

  • 1
  • 5
  • 4
  • 3
  • 2
  • 1

T G C A A T C A T

  • M[4,5]=

Implies that S[4,5]=S[3,4]+C(A,T)

  • r

A T

M[3,4]=

Implies that S[3,4]=S[2,3] +C(A,A)

  • r

A T A A

1 2 3 4 5 1 3 2 4

slide-26
SLIDE 26

FA05 CSE182

Retrieving Opt.Alignment

1 1

  • 1
  • 2
  • 2
  • 4

1 2

  • 1
  • 1
  • 3
  • 1

1

  • 2
  • 3
  • 2
  • 1

1

  • 1
  • 5
  • 4
  • 3
  • 2
  • 1

T G C A A T C A T

  • M[2,3]=

Implies that S[2,3]=S[1,2]+C(C,C)

  • r

A T

M[1,2]=

Implies that S[1,2]=S[1,1] +C(-,G)

  • r

A T A A A A C C C C

  • G

T T

slide-27
SLIDE 27

FA05 CSE182

Algorithm to retrieve optimal alignment

RetrieveAl(i,j) if (M[i,j] == `\’) return (RetrieveAl (i-1,j-1) . ) else if (M[i,j] == `|’) return (RetrieveAl (i-1,j) . )

tj si

  • si

tj

  • else if (M[i,j] == `--

else if (M[i,j] == `--’ ’) ) return ( return (RetrieveAl RetrieveAl (i,j-1) . (i,j-1) . ) )

slide-28
SLIDE 28

FA05 CSE182

Summary

  • An optimal alignment of strings of length n

and m can be computed in O(nm) time

  • There is a tight connection between

computation of optimal score, and computation of opt. Alignment

– True for all DP based solutions

slide-29
SLIDE 29

FA05 CSE182

Global versus Local Alignment

Consider s = ACCACCCCTT t = ATCCCCACAT

ACCACCCCTT ACCACCCCTT A TCCCCACAT A TCCCCACAT ATCCCCACAT ATCCCCACAT ACCACCCCT T ACCACCCCT T

slide-30
SLIDE 30

FA05 CSE182

Blast Outputs Local Alignment

query

26 19 405 422

Schematic

db

slide-31
SLIDE 31

FA05 CSE182

Local Alignment

  • Compute maximum

scoring interval over all sub-intervals (a,b), and (a’,b’)

  • How can we compute

this efficiently?

a b a’ b’

slide-32
SLIDE 32

FA05 CSE182

Local Alignment Trick

S[i, j] = max S[i −1, j −1]+ C(si,t j) S[i −1, j]+ C(si,−) S[i, j −1]+ C(−,t j)     

How can we compute the local alignment itself?

slide-33
SLIDE 33

FA05 CSE182

Generalizing Gap Cost

  • It is more

likely for gaps to be contiguous

  • The penalty

for a gap of length l should be go + ge∗ l

slide-34
SLIDE 34

FA05 CSE182

Using affine gap penalties

S[i, j] = maxl S[i −1, j −1]+ C(si,t j) S[i − l, j]+ go + ge* l S[i, j − l]+ go + ge* l     

  • What is the time taken for this?
  • What are the values that l can take?
  • Can we get rid of the extra Dimension?
slide-35
SLIDE 35

FA05 CSE182

Affine gap penalties

  • Define D[i,j] : Score of the best alignment, given that the

final column is a ‘deletion’ (si is aligned to a gap)

  • Define I[i,j]: Score of the best alignment, given that the

final column is an insertion (tj is aligned to a gap)

S[i, j] = max S[i −1, j −1]+ C(si,t j) D[i, j] I[i, j]     

slide-36
SLIDE 36

FA05 CSE182

O(nm) solution for affine gap costs

D[i, j] = max D[i −1, j]+ ge S[i −1, j]+ go + ge    I[i, j] = max I[i, j −1]+ ge S[i, j −1]+ go + ge   

slide-37
SLIDE 37

FA05 CSE182

Blast variants

  • Blastn: DNA query, DNA database
  • Blastp: Protein Query, DNA database
  • Blastx,Tblastn,Tblastx

– all require a 6 frame translation

slide-38
SLIDE 38

FA05 CSE182

The genetic code

  • The DNA can be translated

conceptually using triplets, and looking up the table.

  • How many translations are

possible?

ATCGGATCGTGAT

slide-39
SLIDE 39

FA05 CSE182

Blast variants

  • Blastx: DNA query, protein database
  • Tblastn: Protein query, DNA database
  • Tblastx: DNA query, DNA database

– How is it different from Blastn?

slide-40
SLIDE 40

FA05 CSE182

Summary

  • A critical part of BLAST is local sequence alignment.
  • Optimum Local alignments between two sequences of length

n and m can be computed in O(nm) time.

  • The algorithm can be extended to deal with affine gap

penalties.

  • What is the memory requirement for this computation?
  • Q: If one of the sequences was a large database, would it be

feasible to compute these alignments?

  • Q: What if the query and database were large?