BINF6201/8201 Sequence alignment algorithms 10-04-2016



SLIDE 1

BINF6201/8201 Sequence alignment algorithms

10-04-2016

SLIDE 2

What is an efficient algorithm?

• Our first goal is to make a pairwise alignment between two sequences that reveals the evolutionary relationship between them, using a computational method, that is, an algorithm.
• An algorithm is a series of executable step-by-step instructions that leads to a solution to a problem.
• For some problems no algorithm exists, so they cannot be solved on any computer.
• For other problems, algorithms do exist.
• The running time of an algorithm can grow quickly or slowly as the size of the problem increases.
• If the running time grows quickly with the problem size, the problem cannot be solved in a practical period of time when its size is large.
• Computer scientists strive to find faster algorithms for computer-solvable problems.

SLIDE 3

Sorting algorithms

• The sorting problem: given a list of numbers, put them in ascending or descending order.
• The insertion sort algorithm (ascending order):

    For each number, from the second to the last, do:
        Compare it with the number before it;
        If the number before it is greater, swap the two numbers and repeat the comparison;
        Otherwise, stop comparing and move on to the next number.

• If there are n numbers in the list, in the worst case the algorithm needs 1 + 2 + … + (n − 1) = n(n − 1)/2 comparisons, which is on the order of $n^2$.
• We say the algorithm has $O(n^2)$ time complexity.
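The insertion sort steps can be sketched in Python as follows. This is a minimal illustration rather than code from the slides; the function name `insertion_sort` and the comparison counter are assumptions added to make the worst-case count visible.

```python
def insertion_sort(numbers):
    """Return a sorted copy of `numbers` and the number of comparisons made."""
    items = list(numbers)          # work on a copy, leave the input untouched
    comparisons = 0
    for i in range(1, len(items)):
        j = i
        # Move items[j] to the left while the number before it is greater.
        while j > 0:
            comparisons += 1
            if items[j - 1] > items[j]:
                items[j - 1], items[j] = items[j], items[j - 1]
                j -= 1
            else:
                break              # the number before it is not greater: stop
    return items, comparisons


# A reversed list is the worst case: every number travels to the front.
sorted_items, count = insertion_sort([5, 4, 3, 2, 1])
print(sorted_items, count)
```

For the reversed list of n = 5 numbers the counter reaches 1 + 2 + 3 + 4 = 10 = n(n − 1)/2, which matches the quadratic growth stated above.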

SLIDE 4

• Other sorting algorithms run slower than, as fast as, or faster than insertion sort. Among the fastest comparison-based sorting algorithms is merge sort.
• Merge sort builds on an algorithm that merges two sorted lists of numbers into one larger sorted list.

Merge algorithm:

    While both lists are non-empty, do:
        Compare the first numbers of the two lists;
        Move the smaller one to the end of the third list;
    Copy the numbers remaining in the non-empty list to the end of the third list, in order.

Example: merging two sorted lists into one sorted list

    9 18 22 71   and   27 35 49 82
    give
    9 18 22 27 35 49 71 82
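As a rough sketch of the merge step, assuming both inputs are already sorted in ascending order (the function name `merge` is chosen here for illustration):

```python
def merge(left, right):
    """Merge two ascending lists into one ascending list."""
    merged = []
    i = j = 0
    # While both lists still have numbers, move the smaller front number over.
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    # At most one of these slices is non-empty; copy it over in order.
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged


print(merge([9, 18, 22, 71], [27, 35, 49, 82]))
# prints [9, 18, 22, 27, 35, 49, 71, 82]
```

Each number is moved exactly once, so merging lists with n numbers in total takes at most n − 1 comparisons.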

SLIDE 5

Sorting algorithms

• The merge sort algorithm: first, repeatedly divide the unsorted list into two equal parts until each part holds one number or none; then merge the parts pairwise with the merge procedure, starting from the smallest parts.
• If there are n numbers in the list, in the worst case the algorithm needs on the order of $n \log_2 n$ comparisons.
• We say that the algorithm has $O(n \log_2 n)$ time complexity.
• Therefore merge sort is faster than insertion sort when n is large.

Figure: merge sort applied to the list 71 22 9 18 35 82 27 49.

    Divide: [71 22 9 18] [35 82 27 49], then [71 22] [9 18] [35 82] [27 49], then single numbers.
    Merge:  [22 71] [9 18] [35 82] [27 49], then [9 18 22 71] [27 35 49 82], then [9 18 22 27 35 49 71 82].
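The divide-and-merge procedure might look like this in Python. It is a sketch under the assumption that the list is split at its midpoint; the name `merge_sort` and the inlined merge loop are illustrative choices, not taken from the slides.

```python
def merge_sort(numbers):
    """Return a new list with the numbers in ascending order."""
    if len(numbers) <= 1:              # one number or none: already sorted
        return list(numbers)
    middle = len(numbers) // 2
    left = merge_sort(numbers[:middle])    # sort each half recursively
    right = merge_sort(numbers[middle:])

    # Merge the two sorted halves into one sorted list.
    merged = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged


print(merge_sort([71, 22, 9, 18, 35, 82, 27, 49]))
# prints [9, 18, 22, 27, 35, 49, 71, 82]
```

The list is halved about log2(n) times, and each level of merging touches all n numbers once, which is where the n log2(n) growth comes from.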

SLIDE 6

Polynomial and Non-polynomial algorithms

• Both insertion sort and merge sort belong to a class of algorithms called polynomial algorithms (P), because their running time is bounded by $O(n^r)$, where r is a constant.
• All practical algorithms belong to P.
• For many theoretically computer-solvable problems, no polynomial algorithm has been found. They can therefore be solved exactly only at a very small scale, using a non-polynomial algorithm. A subset of these problems are called NP-hard problems.
• The travelling salesman problem (TSP) is an NP-hard problem: find the shortest route for a salesman to visit each of n cities exactly once and then return to his starting city.
• Often, we can use a heuristic algorithm to solve these problems approximately, without a guarantee of an optimal solution.
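To make the contrast between an exact non-polynomial search and a heuristic concrete, here is a small sketch for the TSP. The city coordinates are made up for illustration, and the nearest-neighbour rule is only one possible heuristic, chosen here as an assumption; the slides do not name a specific one.

```python
from itertools import permutations
from math import dist


def tour_length(order, cities):
    """Length of the closed tour visiting `cities` in the given order."""
    return sum(
        dist(cities[order[k]], cities[order[(k + 1) % len(order)]])
        for k in range(len(order))
    )


def exact_tour(cities):
    """Exact search: try all (n - 1)! orderings that start at city 0."""
    rest = range(1, len(cities))
    return min(
        ((0, *perm) for perm in permutations(rest)),
        key=lambda order: tour_length(order, cities),
    )


def nearest_neighbour_tour(cities):
    """Heuristic: always travel to the closest city not yet visited."""
    order = [0]
    unvisited = set(range(1, len(cities)))
    while unvisited:
        here = cities[order[-1]]
        nearest = min(unvisited, key=lambda c: dist(here, cities[c]))
        order.append(nearest)
        unvisited.remove(nearest)
    return tuple(order)


# Hypothetical city coordinates, small enough for the exact search to finish.
cities = [(0, 0), (2, 1), (5, 0), (1, 4), (4, 3), (3, 6)]
best = exact_tour(cities)
greedy = nearest_neighbour_tour(cities)
print(tour_length(best, cities), tour_length(greedy, cities))
```

The exact search becomes impractical after a few more cities because the number of orderings grows factorially, while the heuristic stays fast but may return a longer tour.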

SLIDE 7

Pairwise sequence alignment algorithms

• An example of a pairwise alignment between two sequences:

    a: CAGT-AGATATTTACGGCAGTATC----
    b: CAATCAGGATTTT--GGCAGACTGGTTG

• Here, to achieve a better alignment between the two sequences, indels (gaps) have to be included.
• We already know how to score a gap-free pairwise alignment, by summing over its l aligned positions:

$$S_{\text{alignment}}(a, b) = \sum_{k=1}^{l} S(a_k, b_k).$$

• When a gap is introduced in the alignment, a penalty score must be given to reflect such a potentially deleterious mutational event.
• The commonly used penalty scoring functions include

$$\text{Linear penalty: } W(l) = g\,l, \qquad \text{Affine gap penalty: } W(l) = g_{\text{open}} + g_{\text{ext}}\,(l - 1),$$

where, in the penalty functions, l is the length of the gap; g is the penalty for one space; $g_{\text{open}}$ is the penalty for opening a gap; and $g_{\text{ext}}$ is the penalty for extending a gap by one space.
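The two penalty functions translate directly into code. In this sketch the parameter values (g = 6, g_open = 10, g_ext = 1) are arbitrary examples picked for illustration, not values prescribed by the slides.

```python
def linear_gap_penalty(length, g):
    """W(l) = g * l: every space in the gap costs the same."""
    return g * length


def affine_gap_penalty(length, g_open, g_ext):
    """W(l) = g_open + g_ext * (l - 1): opening costs more than extending."""
    return g_open + g_ext * (length - 1)


# Compare how the two penalties grow with the gap length.
for length in range(1, 6):
    print(
        length,
        linear_gap_penalty(length, g=6),
        affine_gap_penalty(length, g_open=10, g_ext=1),
    )
```

With these example values the affine penalty starts higher for a single space but grows more slowly, so one long gap is penalized less than several short ones of the same total length.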

SLIDE 8

Pairwise sequence alignment algorithms

• The affine gap penalty function is close enough to the ideal gap penalty function shown in the figure, which should increase quickly when l is small and slowly when l is large.

    CAGT-AGATATTTACGGCAGTATC----
    CAATCAGGATTTT--GGCAGACTGGTTG

• Given a scoring matrix such as PAM250 or BLOSUM50, and a gap penalty function, we can compute the score of an alignment as

$$S_{\text{alignment}}(a, b) = \sum_{\text{aligned pairs}} S(a_k, b_k) - \sum_{\text{gaps}} W(l).$$
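A sketch of this scoring formula is given below. Since no scoring matrix values are reproduced here, it assumes a toy substitution score (+1 for a match, −1 for a mismatch) and an affine penalty with made-up parameters; `alignment_score`, `toy_score`, and `affine` are illustrative names.

```python
import re


def alignment_score(row_a, row_b, score, gap_penalty):
    """Sum S over aligned pairs, then subtract W(l) for every gap of length l."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    total = 0
    for x, y in zip(row_a, row_b):
        if x != "-" and y != "-":          # an aligned pair of letters
            total += score(x, y)
    for row in (row_a, row_b):
        for gap in re.findall(r"-+", row):  # each maximal run of '-' is one gap
            total -= gap_penalty(len(gap))
    return total


def toy_score(x, y):
    """Assumed stand-in for a real matrix such as PAM250 or BLOSUM50."""
    return 1 if x == y else -1


def affine(length, g_open=3, g_ext=1):
    """Affine penalty with illustrative parameter values."""
    return g_open + g_ext * (length - 1)


row_a = "CAGT-AGATATTTACGGCAGTATC----"
row_b = "CAATCAGGATTTT--GGCAGACTGGTTG"
print(alignment_score(row_a, row_b, toy_score, affine))
```

Passing the substitution score and the gap penalty as functions keeps the formula itself unchanged when a different matrix or penalty is swapped in.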

SLIDE 9

Pairwise sequence alignment algorithms

• The pairwise alignment problem can be formulated as follows: given a scoring system for matches and gaps, find the alignment of two sequences that has the maximum score among all possible ways to align the two sequences, where

$$S(a, b) = \sum_{\text{aligned pairs}} S(a_k, b_k) - \sum_{\text{gaps}} W(l).$$

• If sequence a has length m and sequence b has length n, then there are

$$\binom{m+n}{n} = \binom{m+n}{m}$$

ways to align them.
• To derive this formula, think of the ways in which we can intercalate the two sequences while preserving the order of the letters within each. An intercalation is fixed once we choose which n of the m + n positions hold the letters of b. For example, the alignment

    Sequence a:     CAGT-AGATA
    Sequence b:     CAATCAGGAT

corresponds to the following intercalation:

    Intercalation:  CCAAGATTCAAGGAGTAAT
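The count can be evaluated with Python's standard `math.comb`. The helper name `count_intercalations` is an assumption made for this sketch; it simply evaluates the binomial coefficient for sequence lengths m and n.

```python
from math import comb


def count_intercalations(m, n):
    """Number of ways to interleave sequences of lengths m and n."""
    return comb(m + n, n)


# The example above interleaves 9 letters of a with 10 letters of b.
print(count_intercalations(9, 10))

# The count explodes as the sequences get longer.
for length in (10, 50, 100):
    print(length, count_intercalations(length, length))
```

Even for two sequences of length 100 the count has about 59 digits, which is why enumerating every alignment is not an option.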

SLIDE 10

Pairwise sequence alignment algorithm

• Therefore, if we want to find the optimal alignment by evaluating all of these possible alignments, the solution is not polynomial; instead, it is exponential, because if m = n, then

$$\binom{m+n}{n} = \binom{2n}{n} = \frac{(2n)!}{(n!)^2} \approx \frac{2^{2n}}{\sqrt{\pi n}}.$$

• Fortunately, a polynomial algorithm, running in O(mn), was found to align two sequences globally: the Needleman-Wunsch algorithm (1970), which uses a dynamic programming method.
• Dynamic programming is a type of algorithm that finds an optimal value of a function based on previously computed optimal values.
• A dynamic programming algorithm speeds up the computation by avoiding repetitive calculations.
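A quick numerical check of the approximation, as a sketch: the function name `stirling_estimate` is illustrative, and the exact value comes from the standard `math.comb`.

```python
from math import comb, pi, sqrt


def stirling_estimate(n):
    """Approximate C(2n, n) by 2^(2n) / sqrt(pi * n)."""
    return 4 ** n / sqrt(pi * n)


for n in (5, 10, 50, 100):
    exact = comb(2 * n, n)
    ratio = stirling_estimate(n) / exact
    print(n, exact, round(ratio, 4))
```

The ratio approaches 1 as n grows, and the $2^{2n}$ factor shows that the number of alignments roughly quadruples each time both sequences grow by one letter.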

SLIDE 11

Needleman-Wunsch global alignment algorithm

• Let a and b be two sequences of lengths m and n, and let H(s, t) be the score of aligning the first s letters of a with the first t letters of b:

    a: a_1 a_2 ... a_{s-1} a_s ... a_m
    b: b_1 b_2 ... b_{t-1} b_t ... b_n

• Using the linear gap penalty function, we can compute H(i, j) from the three possible ways to add the i-th letter of a and the j-th letter of b to the previous alignments, and choose the alignment that maximizes H(i, j):

$$\begin{aligned}
a_i \text{ aligned with } b_j: \quad & H(i, j) = H(i-1, j-1) + S(a_i, b_j) \\
a_i \text{ aligned with a gap}: \quad & H(i, j) = H(i-1, j) - g \\
b_j \text{ aligned with a gap}: \quad & H(i, j) = H(i, j-1) - g
\end{aligned}$$

[Figure: in the matrix H(s, t), with s running from 0 to m and t from 0 to n, the cell H(i, j) is reached from H(i−1, j−1), H(i−1, j), or H(i, j−1).]

SLIDE 12

Needleman-Wunsch global alignment algorithm

• If we use the linear gap penalty function, then mathematically we have the following recursion relation:

$$H(i, j) = \max \begin{cases} H(i-1, j-1) + S(a_i, b_j) \\ H(i-1, j) - g \\ H(i, j-1) - g \end{cases}$$

with the initial conditions H(0,0) = 0, H(i,0) = −ig, and H(0,j) = −jg.
• This corresponds to filling out the matrix H, whose elements are H(i,j). H is called the dynamic programming matrix of a and b.
• We fill the matrix H from the upper-left corner to the bottom-right corner.
• To recover the optimal alignment later, we use another matrix to keep track of how each H(i,j) was computed.
• As H has mn elements to fill, this algorithm runs in O(mn) time.
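The recursion and its initial conditions can be sketched as a matrix fill. The function name `fill_matrix`, the move labels "diag", "up", and "left", and the toy scoring function in the usage lines are assumptions for illustration; a real run would pass a matrix such as PAM250 as `score`.

```python
def fill_matrix(a, b, score, g):
    """Fill the dynamic programming matrix H and a matrix of chosen moves."""
    m, n = len(a), len(b)
    H = [[0] * (n + 1) for _ in range(m + 1)]
    move = [[None] * (n + 1) for _ in range(m + 1)]

    # Initial conditions: H(i, 0) = -i*g and H(0, j) = -j*g.
    for i in range(1, m + 1):
        H[i][0] = -i * g
        move[i][0] = "up"
    for j in range(1, n + 1):
        H[0][j] = -j * g
        move[0][j] = "left"

    # Fill row by row, from the upper-left to the bottom-right corner.
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            candidates = [
                (H[i - 1][j - 1] + score(a[i - 1], b[j - 1]), "diag"),
                (H[i - 1][j] - g, "up"),
                (H[i][j - 1] - g, "left"),
            ]
            # On ties, max keeps the first candidate (the diagonal move).
            H[i][j], move[i][j] = max(candidates, key=lambda c: c[0])
    return H, move


def toy(x, y):
    """Assumed toy score: +1 for a match, -1 for a mismatch."""
    return 1 if x == y else -1


H, move = fill_matrix("AC", "A", toy, g=1)
print(H)   # prints [[0, -1], [-1, 1], [-2, 0]]
```

The two nested loops visit each of the mn cells once and do constant work per cell, which is the O(mn) running time stated above.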

SLIDE 13

Needleman-Wunsch global alignment algorithm

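[Figure: the dynamic programming matrix H, with rows labelled $a_1, a_2, \ldots, a_m$ and columns labelled $b_1, b_2, \ldots, b_n$. The first row holds $0, -g, -2g, \ldots, -(j-1)g, -jg, \ldots, -ng$ and the first column holds $0, -g, -2g, \ldots, -(i-1)g, -ig, \ldots, -mg$. Cell H(i, j) is reached from H(i−1, j−1) by adding $S(a_i, b_j)$, or from H(i−1, j) or H(i, j−1) by subtracting g. For example, H(1, 1) is reached by adding $S(a_1, b_1)$ diagonally or by subtracting g from either neighbour.]

• Initialize the first row and column, and then compute each cell from the upper-left corner to the bottom-right corner.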

SLIDE 14

Needleman-Wunsch global alignment algorithm

• Let's use an example to illustrate this matrix filling process.
• We want to align the following two amino acid sequences using the PAM250 scoring matrix and a linear gap penalty of 6 per space, W(l) = 6l (that is, g = 6):

    a: SHAKE
    b: SPEARE

[Table: amino acid alignment scores taken from PAM250.]

SLIDE 15

Needleman-Wunsch global alignment algorithm

• The optimal alignment is found by tracing how each element was computed, starting from the cell H(m,n).
• If a cell was computed from the upper-left cell H(i−1,j−1) (travel diagonally), we align $a_i$ with $b_j$.
• If a cell was computed from the upper cell H(i−1,j) (travel vertically), we align $a_i$ with a space.
• If a cell was computed from the left cell H(i,j−1) (travel horizontally), we align $b_j$ with a space.
• This procedure is called backtracking.
• For our toy example, we have the following alignment:

    S-HAKE
    SPEARE
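Backtracking can be sketched together with the matrix fill it depends on. The PAM250 values are not reproduced here, so this example assumes a toy score (+1 match, −1 mismatch) and g = 1 on short made-up sequences; `global_align` and the move labels are illustrative names, not part of the slides.

```python
def global_align(a, b, score, g):
    """Needleman-Wunsch sketch: return (score, aligned a, aligned b)."""
    m, n = len(a), len(b)
    H = [[0] * (n + 1) for _ in range(m + 1)]
    move = [[None] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        H[i][0], move[i][0] = -i * g, "up"
    for j in range(1, n + 1):
        H[0][j], move[0][j] = -j * g, "left"
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            candidates = [
                (H[i - 1][j - 1] + score(a[i - 1], b[j - 1]), "diag"),
                (H[i - 1][j] - g, "up"),
                (H[i][j - 1] - g, "left"),
            ]
            H[i][j], move[i][j] = max(candidates, key=lambda c: c[0])

    # Backtracking: start at H(m, n) and follow the recorded moves to H(0, 0).
    row_a, row_b = [], []
    i, j = m, n
    while i > 0 or j > 0:
        if move[i][j] == "diag":       # align a_i with b_j
            row_a.append(a[i - 1])
            row_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif move[i][j] == "up":       # align a_i with a space
            row_a.append(a[i - 1])
            row_b.append("-")
            i -= 1
        else:                          # align b_j with a space
            row_a.append("-")
            row_b.append(b[j - 1])
            j -= 1
    # The rows were built back to front, so reverse them.
    return H[m][n], "".join(reversed(row_a)), "".join(reversed(row_b))


def toy(x, y):
    """Assumed toy score standing in for a real matrix such as PAM250."""
    return 1 if x == y else -1


print(global_align("AT", "ACT", toy, g=1))   # prints (1, 'A-T', 'ACT')
```

Only one optimal alignment is returned; when several moves tie, this sketch prefers the diagonal, and other tie-breaking rules may produce a different alignment with the same score.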