CS481: Bioinformatics Algorithms
Can Alkan EA224 calkan@cs.bilkent.edu.tr
http://www.cs.bilkent.edu.tr/~calkan/teaching/cs481/
Outline
DNA Sequence Comparison
Change Problem
Manhattan Tourist Problem
Longest Paths in Graphs
Sequence Alignment
Edit Distance
Longest Common Subsequence Problem
Dot Matrices
Gene similarities between two genes with
Computing a similarity score between two
Dynamic programming is a technique for
The Change Problem is a good problem to
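Step 1 (values reachable with one coin):
Value:          1  2  3  4  5  6  7  8  9  10
Min # of coins: 1     1     1

Step 2 (values reachable with two coins):
Value:          1  2  3  4  5  6  7  8  9  10
Min # of coins: 1  2  1  2  1  2     2     2

Step 3 (complete table):
Value:          1  2  3  4  5  6  7  8  9  10
Min # of coins: 1  2  1  2  1  2  3  2  3  2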
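$$
\text{minNumCoins}(M) = \min \begin{cases} \text{minNumCoins}(M-1) + 1 \\ \text{minNumCoins}(M-3) + 1 \\ \text{minNumCoins}(M-5) + 1 \end{cases}
$$

$$
\text{minNumCoins}(M) = \min \begin{cases} \text{minNumCoins}(M-c_1) + 1 \\ \text{minNumCoins}(M-c_2) + 1 \\ \quad\vdots \\ \text{minNumCoins}(M-c_d) + 1 \end{cases}
$$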
RecursiveChange(M, c, d)
  if M = 0
    return 0
  bestNumCoins ← infinity
  for i ← 1 to d
    if M ≥ c_i
      numCoins ← RecursiveChange(M − c_i, c, d)
      if numCoins + 1 < bestNumCoins
        bestNumCoins ← numCoins + 1
  return bestNumCoins
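The RecursiveChange pseudocode above can be sketched in Python roughly as follows. This is a minimal illustration, not the course's reference code: the function name `recursive_change` is made up here, and the parameter d is dropped because Python can take the number of denominations from the list itself.

```python
import math


def recursive_change(M, c):
    """Brute-force minimum number of coins from denominations c that sum to M."""
    if M == 0:
        return 0
    best_num_coins = math.inf
    for coin in c:                      # plays the role of "for i <- 1 to d"
        if M >= coin:
            num_coins = recursive_change(M - coin, c)
            if num_coins + 1 < best_num_coins:
                best_num_coins = num_coins + 1
    return best_num_coins


# Small input only: the running time grows exponentially with M.
print(recursive_change(10, [1, 3, 5]))
```

For the denominations 1, 3, 5 the call returns the same value as the last column of the table of minimum coin counts (value 10).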
It recalculates the optimal coin combination
e.g., M = 77, c = (1,3,7):
Optimal coin combo for 70 cents is
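[Figure: recursion tree of RecursiveChange for M = 77, c = (1,3,7); the repeated subproblem 70 is highlighted]
77
  76: 75 (74, 72, 68), 73 (72, 70, 66), 69 (68, 66, 62)
  74: 73 (72, 70, 66), 71 (70, 68, 64), 67 (66, 64, 60)
  70: 69 (68, 66, 62), 67 (66, 64, 60), 63 (62, 60, 56)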
We’re re-computing values in our algorithm more
Save results of each computation for 0 to M.
This way, we can do a reference call to find an already computed value.
Running time is M*d, where M is the value of money and d is the number of denominations.
DPChange(M, c, d)
  bestNumCoins_0 ← 0
  for m ← 1 to M
    bestNumCoins_m ← infinity
    for i ← 1 to d
      if m ≥ c_i
        if bestNumCoins_{m − c_i} + 1 < bestNumCoins_m
          bestNumCoins_m ← bestNumCoins_{m − c_i} + 1
  return bestNumCoins_M
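As a rough Python sketch of the dynamic programming version above (the name `dp_change` and the choice to return the whole table rather than only the last entry are assumptions made for illustration):

```python
import math


def dp_change(M, c):
    """Return a list best where best[m] is the minimum number of coins for value m."""
    best = [0] + [math.inf] * M         # best[0] = 0, everything else unknown
    for m in range(1, M + 1):
        for coin in c:
            # Reuse the stored answer for m - coin instead of recomputing it.
            if m >= coin and best[m - coin] + 1 < best[m]:
                best[m] = best[m - coin] + 1
    return best


table = dp_change(9, [1, 3, 7])
print(table)                            # one entry per value 0..9
print(dp_change(77, [1, 3, 7])[77])     # the M = 77 example, now in about M*d steps
```

Each entry is filled once from at most d earlier entries, which is where the M*d running time comes from.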
[Figure: the bestNumCoins table filled in one value at a time, c = (1,3,7), M = 9]
Value:          0  1  2  3  4  5  6  7  8  9
Min # of coins: 0  1  2  1  2  3  2  1  2  3
[Figure: Manhattan Tourist Problem grid with weighted edges from source to sink, labeled with i and j coordinates]
[Figure: greedy path from source to sink: promising start, but leads to bad choices!]
MT(n, m)
  if n = 0 and m = 0
    return 0
  if n = 0
    return MT(0, m−1) + length of the edge from (0, m−1) to (0, m)
  if m = 0
    return MT(n−1, 0) + length of the edge from (n−1, 0) to (n, 0)
  x ← MT(n−1, m) + length of the edge from (n−1, m) to (n, m)
  y ← MT(n, m−1) + length of the edge from (n, m−1) to (n, m)
  return max{x, y}
[Figure: first step from the source, edge weights 1 and 5]
S1,0 = 5, S0,1 = 1
score plus the weight of the respective edge in between
[Figure sequence: scores filled in along successive diagonals of the grid, with i and j as grid coordinates]
S2,0 = 8, S1,1 = 4, S0,2 = 3
S3,0 = 8, S2,1 = 9, S1,2 = 13, S0,3 = 8
greedy alg. fails!
S3,1 = 9, S2,2 = 12, S1,3 = 8
S3,2 = 9, S2,3 = 15
S3,3 = 16 (showing all back-traces)
$$
s_{i,j} = \max \begin{cases} s_{i-1,j} + \text{weight of the edge between } (i-1, j) \text{ and } (i, j) \\ s_{i,j-1} + \text{weight of the edge between } (i, j-1) \text{ and } (i, j) \end{cases}
$$
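$$
s_B = \max \begin{cases} s_{A_1} + \text{weight of the edge } (A_1, B) \\ s_{A_2} + \text{weight of the edge } (A_2, B) \\ s_{A_3} + \text{weight of the edge } (A_3, B) \end{cases}
$$
[Figure: vertex B with incoming edges from A1, A2, and A3]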
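$$
s_x = \max_{y \in \text{Predecessors}(x)} \left( s_y + \text{weight of the edge } (y, x) \right)
$$
where Predecessors(x) is the set of vertices that have edges leading to x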
The running time for a graph G(V, E) (V is the set of all vertices and E is the set of all edges) is O(E), since each edge is evaluated once.
$$
s_v = \max \begin{cases} s_{u_1} + \text{weight of the edge from } u_1 \text{ to } v \\ s_{u_2} + \text{weight of the edge from } u_2 \text{ to } v \\ s_{u_3} + \text{weight of the edge from } u_3 \text{ to } v \end{cases}
$$
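A minimal sketch of the longest-path recurrence for a general DAG follows. It assumes, for simplicity, that the vertices are numbered 0..n-1 in topological order (every edge goes from a smaller to a larger number); the function name and the example edges are hypothetical.

```python
import math


def longest_path(num_vertices, edges, source, sink):
    """Length of the longest source-to-sink path in a DAG.

    edges is a list of (u, v, weight) with u < v (assumed topological numbering).
    """
    predecessors = [[] for _ in range(num_vertices)]
    for u, v, weight in edges:
        if u >= v:
            raise ValueError("vertices must be numbered in topological order")
        predecessors[v].append((u, weight))

    s = [-math.inf] * num_vertices      # -inf marks "not reachable from source"
    s[source] = 0
    for v in range(source + 1, num_vertices):
        for u, weight in predecessors[v]:   # each edge is examined exactly once
            if s[u] + weight > s[v]:
                s[v] = s[u] + weight
    return s[sink]


# Hypothetical example graph.
edges = [(0, 1, 5), (0, 2, 1), (1, 2, 3), (1, 3, 2), (2, 3, 10)]
print(longest_path(4, edges, 0, 3))
```

Because every vertex is processed after all of its predecessors, each score is final when it is read, and the total work is proportional to the number of edges.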
[Figure: alignment of the letters of v against the letters of w]
5 matches, 2 insertions, 2 deletions
V = ATCTGATG
W = TGCATAC
n = 8
m = 7
[Figure: alignment of V and W with columns labeled match, mismatch, insertion, deletion (indels)]
4 matches, 1 mismatch, 2 insertions, 2 deletions
[Figure: alignment of the elements of v against the elements of w; matches shown in red]
i coords: 0 1 2 2 3 3 4 5 6 7 8
j coords: 0 0 1 2 3 4 5 5 6 6 7
positions in v: 2 < 3 < 4 < 6 < 8
positions in w: 1 < 3 < 5 < 6 < 7
(0,0) (1,0) (2,1) (2,2) (3,3) (3,4) (4,5) (5,5) (6,6) (7,6) (8,7)
[Figure: edit graph grid]
T G C A T A C
1 2 3 4 5 6 7    (i)
A T C T G A T C
1 2 3 4 5 6 7 8  (j)
Every path is a common subsequence.
Every diagonal edge adds an extra element to the common subsequence.
LCS Problem: Find a path with the maximum number of diagonal edges.
$$
s_{i,j} = \max \begin{cases} s_{i-1,j} \\ s_{i,j-1} \\ s_{i-1,j-1} + 1, & \text{if } v_i = w_j \end{cases}
$$
[Figure: vertex (i,j) with incoming edges from (i-1,j), (i,j-1), and (i-1,j-1)]
[Figure: LCS grid for V = ATGT and W = ATCG]
    0 1 2 2 3 4
V = A T - G T
    | |   |
W = A T C G -
    0 1 2 3 4 4
V = ATATATAT
W = TATATATA
Computing Hamming distance is a trivial task.
Computing edit distance is a non-trivial task.
Just one shift makes it all line up:
V = -ATATATAT
W = TATATATA
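To make the contrast concrete, here is a small sketch that computes both distances for the two strings above. It assumes the usual definition of edit distance (minimum number of insertions, deletions, and substitutions, each costing 1); the function names are chosen for illustration.

```python
def hamming_distance(v, w):
    """Number of positions at which two equal-length strings differ."""
    if len(v) != len(w):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(a != b for a, b in zip(v, w))


def edit_distance(v, w):
    """Minimum number of insertions, deletions, and substitutions (unit costs assumed)."""
    n, m = len(v), len(w)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i                     # delete the first i letters of v
    for j in range(m + 1):
        d[0][j] = j                     # insert the first j letters of w
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            substitution = 0 if v[i - 1] == w[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,
                          d[i][j - 1] + 1,
                          d[i - 1][j - 1] + substitution)
    return d[n][m]


v, w = "ATATATAT", "TATATATA"
print(hamming_distance(v, w), edit_distance(v, w))
```

Every position differs when the strings are compared column by column, yet a single shift (one deletion plus one insertion) is enough to line them up.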
Path (Old Alignment): (0,0), (1,1), (2,2), (2,3), (3,4), (4,5), (5,5), (6,6), (7,6), (7,7)

Old Alignment
    0 1 2 2 3 4 5 6 7 7
v = A T _ G T T A T _
w = A T C G T _ A _ C
    0 1 2 3 4 5 5 6 6 7

New Alignment
    0 1 2 2 3 4 5 6 7 7
v = A T _ G T T A T _
w = A T C G _ T A _ C
    0 1 2 3 4 4 5 6 6 7
$$
s_{i,j} = \max \begin{cases} s_{i-1,j-1} + 1, & \text{if } v_i = w_j \quad \text{(value from NW)} \\ s_{i-1,j} & \text{(value from North, top)} \\ s_{i,j-1} & \text{(value from West, left)} \end{cases}
$$
Find a match in row and column 2.
i=2, j=2,5 is a match (T).
j=2, i=4,5,7 is a match (T).
Since v_i = w_j, s_{i,j} = s_{i-1,j-1} + 1
s_{2,2} = [s_{1,1} = 1] + 1
s_{2,5} = [s_{1,4} = 1] + 1
s_{4,2} = [s_{3,1} = 1] + 1
s_{5,2} = [s_{4,1} = 1] + 1
s_{7,2} = [s_{6,1} = 1] + 1
$$
s_{i,j} = \max \begin{cases} s_{i-1,j} + 0 \\ s_{i,j-1} + 0 \\ s_{i-1,j-1} + 1, & \text{if } v_i = w_j \end{cases}
$$
This recurrence corresponds to the Manhattan Tourist problem (three incoming edges into a vertex) with all horizontal and vertical edges weighted by zero.
LCS(v, w)
  for i ← 0 to n
    s_{i,0} ← 0
  for j ← 1 to m
    s_{0,j} ← 0
  for i ← 1 to n
    for j ← 1 to m
      s_{i,j} ← max { s_{i-1,j},
                      s_{i,j-1},
                      s_{i-1,j-1} + 1, if v_i = w_j }
      b_{i,j} ← "↑" if s_{i,j} = s_{i-1,j}
                "←" if s_{i,j} = s_{i,j-1}
                "↖" if s_{i,j} = s_{i-1,j-1} + 1
  return (s_{n,m}, b)
LCS(v,w) created the
Now we need a way
Follow the arrows
PrintLCS(b, v, i, j)
  if i = 0 or j = 0
    return
  if b_{i,j} = "↖"
    PrintLCS(b, v, i-1, j-1)
    print v_i
  else
    if b_{i,j} = "↑"
      PrintLCS(b, v, i-1, j)
    else
      PrintLCS(b, v, i, j-1)
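The LCS and PrintLCS procedures can be sketched together in Python as below. This is an illustrative version, not the lecture's code: the backpointers are stored as the strings "up", "left", and "diag" instead of arrows, and the backtrace is written as a loop rather than recursion so that long sequences do not hit Python's recursion limit.

```python
def lcs(v, w):
    """Fill the score table s and backpointer table b; return (s[n][m], b)."""
    n, m = len(v), len(w)
    s = [[0] * (m + 1) for _ in range(n + 1)]      # row 0 and column 0 stay 0
    b = [[""] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # The diagonal edge exists only when the letters match.
            diag = s[i - 1][j - 1] + 1 if v[i - 1] == w[j - 1] else -1
            s[i][j] = max(s[i - 1][j], s[i][j - 1], diag)
            if s[i][j] == s[i - 1][j]:
                b[i][j] = "up"
            elif s[i][j] == s[i][j - 1]:
                b[i][j] = "left"
            else:
                b[i][j] = "diag"
    return s[n][m], b


def backtrack_lcs(b, v, i, j):
    """Follow the backpointers from (i, j) and return one LCS as a string."""
    letters = []
    while i > 0 and j > 0:
        if b[i][j] == "diag":
            letters.append(v[i - 1])
            i, j = i - 1, j - 1
        elif b[i][j] == "up":
            i -= 1
        else:
            j -= 1
    return "".join(reversed(letters))


v, w = "ATGT", "ATCG"
score, b = lcs(v, w)
print(score, backtrack_lcs(b, v, len(v), len(w)))
```

When several backpointers are valid for a cell, this sketch keeps only the first one that matches, so it reports one longest common subsequence rather than all of them.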
It takes O(nm) time to fill in the n×m dynamic programming matrix.