CSCI 490 Bioinformatics Part II: Pair-wise Sequence Alignment - PowerPoint PPT Presentation

Example x = AGTA F(i-1, j-1) +  (Xi,Yj) m = 1 F(i,j) = max F(i-1, j) – d y = ATA s = 1 F(i, j-1) – d d = 1 F(i,j) i = 0 1 2 3 4 A G T A 0 -1 -2 -3 -4 j = 0 A -1 1 0 -1 -2 1 2 T -2 A -3 3

Example x = AGTA F(i-1, j-1) +  (Xi,Yj) m = 1 F(i,j) = max F(i-1, j) – d y = ATA s = 1 F(i, j-1) – d d = 1 F(i,j) i = 0 1 2 3 4 A G T A 0 -1 -2 -3 -4 j = 0 A -1 1 0 -1 -2 1 2 T -2 0 0 1 0 3 A -3

Example x = AGTA F(i-1, j-1) +  (Xi,Yj) m = 1 F(i,j) = max F(i-1, j) – d y = ATA s = 1 F(i, j-1) – d d = 1 F(i,j) i = 0 1 2 3 4 A G T A 0 -1 -2 -3 -4 Optimal Alignment: j = 0 F(4,3) = 2 A -1 1 0 -1 -2 1 2 T -2 0 0 1 0 3 A -3 -1 -1 0 2

Example x = AGTA F(i-1, j-1) +  (Xi,Yj) m = 1 F(i,j) = max F(i-1, j) – d y = ATA s = 1 F(i, j-1) – d d = 1 F(i,j) i = 0 1 2 3 4 A G T A j = 0 0 -1 -2 -3 -4 Optimal Alignment: F(4,3) = 2 1 A -1 1 0 -1 -2 This only tells us the 2 T -2 0 0 1 0 best score 3 A -3 -1 -1 0 2

Trace-back x = AGTA F(i-1, j-1) +  (Xi,Yj) m = 1 F(i,j) = max F(i-1, j) – d y = ATA s = 1 F(i, j-1) – d d = 1 F(i,j) i = 0 1 2 3 4 A G T A 0 -1 -2 -3 -4 j = 0 A -1 1 0 -1 -2 1 A 2 T -2 0 0 1 0 A 3 A -3 -1 -1 0 2

Trace-back x = AGTA F(i-1, j-1) +  (Xi,Yj) m = 1 F(i,j) = max F(i-1, j) – d y = ATA s = 1 F(i, j-1) – d d = 1 F(i,j) i = 0 1 2 3 4 A G T A 0 -1 -2 -3 -4 j = 0 A -1 1 0 -1 -2 1 T A 2 T -2 0 0 1 0 T A 3 A -3 -1 -1 0 2

Trace-back x = AGTA F(i-1, j-1) +  (Xi,Yj) m = 1 F(i,j) = max F(i-1, j) – d y = ATA s = 1 F(i, j-1) – d d = 1 F(i,j) i = 0 1 2 3 4 A G T A 0 -1 -2 -3 -4 j = 0 A -1 1 0 -1 -2 1 G T A 2 T -2 0 0 1 0 - T A 3 A -3 -1 -1 0 2

Trace-back x = AGTA F(i-1, j-1) +  (Xi,Yj) m = 1 F(i,j) = max F(i-1, j) – d y = ATA s = 1 F(i, j-1) – d d = 1 F(i,j) i = 0 1 2 3 4 A G T A 0 -1 -2 -3 -4 j = 0 A -1 1 0 -1 -2 1 A G T A 2 T -2 0 0 1 0 A - T A 3 A -3 -1 -1 0 2

Trace-back x = AGTA F(i-1, j-1) +  (Xi,Yj) m = 1 F(i,j) = max F(i-1, j) – d y = ATA s = 1 F(i, j-1) – d d = 1 F(i,j) i = 0 1 2 3 4 A G T A 0 -1 -2 -3 -4 Optimal Alignment: j = 0 F(4,3) = 2 A -1 1 0 -1 -2 1 AGTA 2 T -2 0 0 1 0 A − TA 3 A -3 -1 -1 0 2

Using trace-back pointers x = AGTA m = 1 y = ATA s = 1 d = 1 F(i,j) i = 0 1 2 3 4 A G T A 0 -1 -2 -3 -4 j = 0 A -1 1 2 T -2 3 A -3

Using trace-back pointers x = AGTA m = 1 y = ATA s = 1 d = 1 F(i,j) i = 0 1 2 3 4 A G T A 0 -1 -2 -3 -4 j = 0 A -1 1 0 -1 -2 1 2 T -2 3 A -3

Using trace-back pointers x = AGTA m = 1 y = ATA s = 1 d = 1 F(i,j) i = 0 1 2 3 4 A G T A 0 -1 -2 -3 -4 j = 0 A -1 1 0 -1 -2 1 2 T -2 0 0 1 0 3 A -3

Using trace-back pointers x = AGTA m = 1 y = ATA s = 1 d = 1 F(i,j) i = 0 1 2 3 4 A G T A 0 -1 -2 -3 -4 j = 0 A -1 1 0 -1 -2 1 2 T -2 0 0 1 0 3 A -3 -1 -1 0 2

Using trace-back pointers x = AGTA m = 1 y = ATA s = 1 d = 1 F(i,j) i = 0 1 2 3 4 A G T A 0 -1 -2 -3 -4 Optimal Alignment: j = 0 F(4,3) = 2 A -1 1 0 -1 -2 1 AGTA 2 T -2 0 0 1 0 A − TA 3 A -3 -1 -1 0 2

The Needleman-Wunsch Algorithm 1. Initialization. a. F(0, 0) = 0 = - j  d b. F(0, j) = - i  d c. F(i, 0) 2. Main Iteration. Filling in scores For each i = 1……M a. For each j = 1……N F(i-1,j) – d [case 1] F(i, j-1) – d F(i, j) = max [case 2] F(i-1, j-1) + σ (x i , y j ) [case 3] UP, if [case 1] Ptr(i,j) = LEFT if [case 2] DIAG if [case 3] 3. Termination. F(M, N) is the optimal score, and from Ptr(M, N) can trace back optimal alignment

Complexity • Time: O(NM) • Space: O(NM) • Linear-space algorithms do exist (with the same time complexity)

Equivalent graph problem S1 = G A T A (0,0) → : a gap in the 2 nd sequence 1 1 S2 = A  : a gap in the 1 st sequence 1 : match / mismatch T Value on vertical/horizontal line: -d 1 1 Value on diagonal: m or -s A (3,4) • Number of steps: length of the alignment • Path length: alignment score • Optimal alignment: find the longest path from (0, 0) to (3, 4) • General longest path problem cannot be found with DP. Longest path on this graph can be found by DP since no cycle is possible.

Question • If we change the scoring scheme, will the optimal alignment be changed? – Old: Match = 1, mismatch = gap = -1 – New: match = 2, mismatch = gap = 0 – New: Match = 2, mismatch = gap = -2?

Question • What kind of alignment is represented by these paths? A A A A A B B B B B C C C C C A- A-- --A -A- -A BC -BC BC- B-C BC Alternating gaps are impossible if – s > -2d

A variant of the basic algorithm Scoring scheme: m = s = d: 1 Seq1: CAGCA-CTTGGATTCTCGG || |:||| Score = -7 Seq2: ---CAGCGTGG-------- Seq1: CAGCACTTGGATTCTCGG |||| | | || Seq2: CAGC-----G-T----GG Score = -2 The first alignment may be biologically more realistic in some cases (e.g. if we know s2 is a subsequence of s1)

A variant of the basic algorithm • Maybe it is OK to have an unlimited # of gaps in the beginning and end: ----------CTATCACCTGACCTCCAGGCCGATGCCCCTTCCGGC GCGAGTTCATCTATCAC--GACCGC--GGTCG-------------- • Then, we don ’ t want to penalize gaps in the ends

The Overlap Detection variant Changes: x 1 ……………………………… x M y N ……………………………… y 1 1. Initialization For all i, j, F(i, 0) = 0 F(0, j) = 0 2. Termination max i F(i, N) F OPT = max max j F(M, j)

Different types of overlaps x x y y

The local alignment problem X = x 1 ……x M , Given two strings Y = y 1 ……y N Find substrings x’, y’ whose similarity (optimal global alignment value) is maximum e.g. X = abcxdex X’ = cxde Y = xxxcde Y’ = c -de x y

Why local alignment • Conserved regions may be a small part of the whole – Global alignment might miss them if flanking “junk” outweighs similar regions • Genes are shuffled between genomes C D B A D B A C

Naïve algorithm for all substrings X’ of X and Y’ of Y Align X’ & Y’ via dynamic programming Retain pair with max value end ; Output the retained pair • Time: O(n 2 ) choices for A, O(m 2 ) for B, O(nm) for DP, so O(n 3 m 3 ) total.

Reminder • The overlap detection algorithm – We do not give penalty to gaps at either end Free gap Free gap

The local alignment idea • Do not penalize the unaligned regions (gaps or mismatches) • The alignment can start anywhere and ends anywhere • Strategy: whenever we get to some low similarity region (negative score), we restart a new alignment – By resetting alignment score to zero

The Smith-Waterman algorithm Initialization : F(0, j) = F(i, 0) = 0 0 F(i – 1, j) – d Iteration : F(i, j) = max F(i, j – 1) – d F(i – 1, j – 1) +  (x i , y j )

The Smith-Waterman algorithm Termination : 1. If we want the best local alignment… F OPT = max i,j F(i, j) 2. If we want all local alignments scoring > t For all i, j find F(i, j) > t, and trace back

x x x c d e Match: 2 0 0 0 0 0 0 0 Mismatch: -1 a 0 Gap: -1 b 0 c 0 x 0 d 0 e 0 x 0

x x x c d e Match: 2 0 0 0 0 0 0 0 Mismatch: -1 a 0 0 0 0 0 0 0 Gap: -1 b 0 0 0 0 0 0 0 c 0 x 0 d 0 e 0 x 0

x x x c d e Match: 2 0 0 0 0 0 0 0 Mismatch: -1 a 0 0 0 0 0 0 0 Gap: -1 b 0 0 0 0 0 0 0 c 0 0 0 0 2 1 0 x 0 d 0 e 0 x 0

x x x c d e Match: 2 0 0 0 0 0 0 0 Mismatch: -1 a 0 0 0 0 0 0 0 Gap: -1 b 0 0 0 0 0 0 0 c 0 0 0 0 2 1 0 x 0 2 2 2 1 1 0 d 0 e 0 x 0

x x x c d e Match: 2 0 0 0 0 0 0 0 Mismatch: -1 a 0 0 0 0 0 0 0 Gap: -1 b 0 0 0 0 0 0 0 c 0 0 0 0 2 1 0 x 0 2 2 2 1 1 0 d 0 1 1 1 1 3 2 e 0 x 0

x x x c d e Match: 2 0 0 0 0 0 0 0 Mismatch: -1 a 0 0 0 0 0 0 0 Gap: -1 b 0 0 0 0 0 0 0 c 0 0 0 0 2 1 0 x 0 2 2 2 1 1 0 d 0 1 1 1 1 3 2 e 0 0 0 0 0 2 5 x 0

x x x c d e Match: 2 0 0 0 0 0 0 0 Mismatch: -1 a 0 0 0 0 0 0 0 Gap: -1 b 0 0 0 0 0 0 0 c 0 0 0 0 2 1 0 x 0 2 2 2 1 1 0 d 0 1 1 1 1 3 2 e 0 0 0 0 0 2 5 x 0 2 2 2 1 1 4

Trace back x x x c d e Match: 2 0 0 0 0 0 0 0 Mismatch: -1 a 0 0 0 0 0 0 0 Gap: -1 b 0 0 0 0 0 0 0 c 0 0 0 0 2 1 0 x 0 2 2 2 1 1 0 d 0 1 1 1 1 3 2 e 0 0 0 0 0 2 5 x 0 2 2 2 1 1 4

Trace back x x x c d e Match: 2 0 0 0 0 0 0 0 Mismatch: -1 a 0 0 0 0 0 0 0 Gap: -1 b 0 0 0 0 0 0 0 cxde | || c 0 0 0 0 2 1 0 c-de x 0 2 2 2 1 1 0 d 0 1 1 1 1 3 2 x-de | || e 0 0 0 0 0 2 5 xcde x 0 2 2 2 1 1 4

• No negative values in local alignment DP array • Optimal local alignment will never have a gap on either end • Local alignment: “ Smith-Waterman ” • Global alignment: “ Needleman-Wunsch ”

Analysis • Time: – O(MN) for finding the best alignment – Time to report all alignments depends on the number of sub-opt alignments • Memory: – O(MN) – O(M+N) possible

More efficient alignment algorithms

• Given two sequences of length M, N • Time: O(MN) – Ok, but still slow for long sequences • Space: O(MN) – bad – 1Mb seq x 1Mb seq = 1TB memory • Can we do better?

Bounded alignment Good alignment should appear near the diagonal

Bounded Dynamic Programming If we know that x and y are very similar Assumption: # gaps(x, y) < k x i | i – j | < k Then, | implies y j

Bounded Dynamic Programming x 1 ………………………… x M Initialization: y N ………………………… y 1 F(i,0), F(0,j) undefined for i, j > k Iteration: For i = 1…M For j = max(1, i – k)…min(N, i+k) F(i – 1, j – 1)+  (x i , y j ) F(i, j – 1) – d, if j > i – k F(i, j) = max F(i – 1, j) – d, if j < i + k Termination: same k

Analysis • Time: O(kM) << O(MN) • Space: O(kM) with some tricks => M M 2k 2k

• Given two sequences of length M, N • Time: O(MN) – ok • Space: O(MN) – bad – 1mb seq x 1mb seq = 1TB memory • Can we do better?

Linear space algorithm • If all we need is the alignment score but not the alignment, easy! We only need to keep two rows (You only need one row, with a little trick) But how do we get the alignment?

Linear space algorithm • When we finish, we know how we have aligned the ends of the sequences X M Y N Naïve idea: Repeat on the smaller subproblem F(M-1, N-1) Time complexity: O((M+N)(MN))

(0, 0) M/2 (M, N) Key observation: optimal alignment (longest path) must use an intermediate point on the M/2-th row. Call it (M/2, k), where k is unknown.

(0,0) (3,2) (3,0) (3,4) (3,6) (6,6) • Longest path from (0, 0) to (6, 6) is max_k (LP(0,0,3,k) + LP(6,6,3,k))

Hirschberg’s idea • Divide and conquer! Y X Forward algorithm Align x 1 x 2 …x M/2 with Y M/2 F(M/2, k) represents the best alignment between x 1 x 2 …x M/2 and y 1 y 2 …y k

Backward Algorithm Y X Backward algorithm Align reverse(x M/2+1 …x M ) with reverse(Y) M/2 B(M/2, k) represents the best alignment between reverse(x M/2+1 …x M ) and reverse(y k y k+1 …y N )

Linear-space alignment Using 2 (4) rows of space, we can compute for k = 1…N, F(M/2, k), B(M/2, k) M/2

Linear-space alignment Now, we can find k * maximizing F(M/2, k) + B(M/2, k) Also, we can trace the path exiting column M/2 from k * Conclusion: In O(NM) time, O(N) space, we found optimal alignment path at row M/2

CSCI 490 Bioinformatics Part II: Pair-wise Sequence Alignment - PowerPoint PPT Presentation

CSCI 490 Bioinformatics Part II: Pair-wise Sequence Alignment Outline Whats the biological problem Algorithm issues Intro to dynamic programming Global sequence alignment Local sequence alignment More efficient

CSCI 490 Bioinformatics Part I: Introduction to Bioinformatics and Molecular Biology Course

CSCI 490 Bioinformatics Multiple Sequence Alignment Multiple Sequence Alignment Motivation:

CPSC 490: Problem Solving in Computer Science 1 Range-minimum query Given an array A of N 10

CPSC 490: Problem Solving in Computer Science which is greater than or equal to x ? 1 Problem

CPSC 490: Problem Solving in Computer Science Assignment 3 is due Friday, Feb 1 at 8 pm.

CPSC 490: Problem Solving in Computer Science Considering SCCs is useful when answering

CPSC 490: Problem Solving in Computer Science Select a presentation topic by Friday, March 22.

CPSC 490: Problem Solving in Computer Science 1 Convex sets A set S is convex x , y S

CPSC 490: Problem Solving in Computer Science Assignment 1 is due Jan 15 at noon.

CPSC 490: Problem Solving in Computer Science A bipartite graph is: and Y . A graph with no

CPSC 490: Problem Solving in Computer Science Assignment 2 is due Tuesday at noon.

CPSC 490: Problem Solving in Computer Science Presentations are next week (Feb 12 and 14).

CPSC 490: Problem Solving in Computer Science Assignment 4 is due tonight. Presentation 2

CPSC 490: Problem Solving in Computer Science 1 Range-minimum query Given an array A of N 10

CPSC 490: Problem Solving in Computer Science Assignment 1 is due Tuesday at noon. Start now!

CPSC 490: Problem Solving in Computer Science of money. You may buy fractional items (then you

Problems with metagenome annotation How much has been sequenced? Number of known sequences 100

NMR Spectral Assignment and Structural Calculations Lucia Banci CERM University of Florence

ER-DCOPS: A FRAMEWORK FOR DCOP WITH UNCERTAINTY IN CONSTRAINT UTILITIES Tiep Le, Ferdinando

Accessibility Education Todays Workshop Accessibility Education: A New Certificate and

MOL2NET, 2018 , 4, http://sciforum.net/conference/mol2net-04 2 the other hand, in the red fruit of

PROTEIN SYNTHESIS RNA (ribonucleic acid) 3 types RNA DIFFERENCES 1. messenger RNA (mRNA) DNA

Pairwise Sequence Alignment Todays Goal > DNA Sequence 1

Programming in Python Lecture 3: Patterns and Functions Michael Schroeder Sven Schreiber

CSCI 490 Bioinformatics Part II: Pair-wise Sequence Alignment - PowerPoint PPT Presentation

CSCI 490 Bioinformatics Part II: Pair-wise Sequence Alignment Outline Whats the biological problem Algorithm issues Intro to dynamic programming Global sequence alignment Local sequence alignment More efficient

CSCI 490 Bioinformatics Part I: Introduction to Bioinformatics and Molecular Biology Course

CSCI 490 Bioinformatics Multiple Sequence Alignment Multiple Sequence Alignment Motivation:

CPSC 490: Problem Solving in Computer Science 1 Range-minimum query Given an array A of N 10

CPSC 490: Problem Solving in Computer Science which is greater than or equal to x ? 1 Problem

CPSC 490: Problem Solving in Computer Science Assignment 3 is due Friday, Feb 1 at 8 pm.

CPSC 490: Problem Solving in Computer Science Considering SCCs is useful when answering

CPSC 490: Problem Solving in Computer Science Select a presentation topic by Friday, March 22.

CPSC 490: Problem Solving in Computer Science 1 Convex sets A set S is convex x , y S

CPSC 490: Problem Solving in Computer Science Assignment 1 is due Jan 15 at noon.

CPSC 490: Problem Solving in Computer Science A bipartite graph is: and Y . A graph with no

CPSC 490: Problem Solving in Computer Science Assignment 2 is due Tuesday at noon.

CPSC 490: Problem Solving in Computer Science Presentations are next week (Feb 12 and 14).

CPSC 490: Problem Solving in Computer Science Assignment 4 is due tonight. Presentation 2

CPSC 490: Problem Solving in Computer Science 1 Range-minimum query Given an array A of N 10

CPSC 490: Problem Solving in Computer Science Assignment 1 is due Tuesday at noon. Start now!

CPSC 490: Problem Solving in Computer Science of money. You may buy fractional items (then you

Problems with metagenome annotation How much has been sequenced? Number of known sequences 100

NMR Spectral Assignment and Structural Calculations Lucia Banci CERM University of Florence

ER-DCOPS: A FRAMEWORK FOR DCOP WITH UNCERTAINTY IN CONSTRAINT UTILITIES Tiep Le, Ferdinando

Accessibility Education Todays Workshop Accessibility Education: A New Certificate and

MOL2NET, 2018 , 4, http://sciforum.net/conference/mol2net-04 2 the other hand, in the red fruit of

PROTEIN SYNTHESIS RNA (ribonucleic acid) 3 types RNA DIFFERENCES 1. messenger RNA (mRNA) DNA

Pairwise Sequence Alignment Todays Goal &gt; DNA Sequence 1

Programming in Python Lecture 3: Patterns and Functions Michael Schroeder Sven Schreiber

Pairwise Sequence Alignment Todays Goal > DNA Sequence 1