binf6201 8201 sequence alignment algorithms
play

BINF6201/8201 Sequence alignment algorithms 10-04-2016 What is an - PowerPoint PPT Presentation

BINF6201/8201 Sequence alignment algorithms 10-04-2016 What is an efficient algorithm ? Our first goal is to make a pairwise alignment between two sequences that reveals the evolutionary relationship between them using a computational method


  1. BINF6201/8201 Sequence alignment algorithms 10-04-2016

  2. What is an efficient algorithm ? Ø Our first goal is to make a pairwise alignment between two sequences that reveals the evolutionary relationship between them using a computational method or an algorithm. Ø An algorithm is a series of executable step-by-step instructions that lead to a solution to a problem. Ø For some problems, there are no algorithms to solve them, therefore they cannot be solved on any computer. Ø For some other problems, there are algorithms to solve them. Ø The speed of an algorithm can increase quickly or slowly when the size of the problem increases. Ø If the speed of an algorithm increases quickly when the size increases, the problem cannot be solved in a practical period of time when its size is large. Ø Computer scientists strive to find a faster algorithm for computer- solvable problems.

  3. Sorting algorithms Ø The sorting problem: given a list of numbers, put them in a ascending/descending order. Ø The insertion sorting algorithm: For each number from the second number to the last number, do Compare it with the numbers before it; If the number before it is greater than it, swap the two numbers; If the number before it is less than it, stop comparison. Ø If there are n numbers in the list, in the worst case, the algorithm needs to do (1+2+…+n) comparisons, which is in the order of n 2 . Ø We say the program has O ( n 2 ) time complexity.

  4. Merger two sorted lists in one sorted list Ø There are other sorting algorithms running at the same speed as or slower, or faster speed than the insertion sorting algorithm. The known fastest one is the merger sorting algorithm. Ø An algorithm to merger two sorted lists of numbers in a larger sorted list: 9 9 27 18 Merge 22 18 35 27 35 22 49 49 71 71 82 82 Merger algorithm: When two lists are not empty do: Compare the first two numbers in the two lists; put the smaller in the third list in order; Copy the number in the non-empty list to the third list in order.

  5. Sorting algorithms Ø The merger sorting algorithm: First, divide the unsorted list into two equal parts repetitively until only one or no number remains, then apply the merger sorting procedure to each part starting from the smallest. Ø If there are n numbers 71 22 9 18 35 82 27 49 in the list, in the worst case, the algorithm needs to do nlog 2 n 71 22 9 18 35 82 27 49 comparisons. Ø We say that the 71 22 9 18 35 82 27 49 program has O ( nlog 2 n ) time complexity. 71 9 35 27 22 18 82 49 Ø Therefore the merger 22 71 9 18 35 82 27 49 sorting algorithm is faster than the 9 18 22 71 27 35 49 82 insertion sorting algorithm when n is 9 18 22 27 35 49 71 82 big.

  6. Polynomial and Non-polynomial algorithms Ø Both insertion sorting and merger sorting algorithms belong to a class of algorithms called the polynomial algorithms (P) , because they run in the order of O ( n r ) , where r is a constant. Ø All practical algorithms belong to P. Ø For many theoretically computer-solvable problems, a polynomial algorithm has not been found. Therefore, they can only be exactly solved in a very small scale using a non-polynomial algorithm. A subset of these problems are called NP-hard problems Ø The travelling salesman problem (TSP) is one of NP-hard problems: Find the shortest path for a salesman to visit each of n cities exactly once and then return to his starting city. Ø Often, we can use a heuristic algorithm to solve these problems approximately without guarantee of an optimal solution.

  7. Pairwise sequence alignment algorithms Ø An example of a pairwise alignment between two sequences: a: CAGT-AGATATTTACGGCAGTATC---- b: CAATCAGGATTTT--GGCAGACTGGTTG Ø Here, in order to achieve a better alignment between the two sequences, indels or gaps have to be included. Ø We already know how to score a pairwise gap-free alignment, using l the formula, ∑ S ( a , b ) S ( a , b ) . = alignemnt k k k 1 = Ø When a gap is introduced in the alignment, a penalty score must be given to reflect such a potentially deleterious mutational event. Ø The commonly used penalty scoring functions include, Linear penalty : W ( l ) gl , and = Affine gap penalty : W ( l ) g g ( l 1 ). = + − open ext Where l is the length of the gap; g , penalty for one space; g open , penalty for opening a gap; g ext , penalty for extending a gap by one space.

  8. Pairwise sequence alignment algorithms Ø The affine gap penalty function is close enough to the ideal gap penalty function shown in the figure, which should increases fast when l is small, and slowly when l is large. CAGT-AGATATTTACGGCAGTATC---- CAATCAGGATTTT--GGCAGACTGGTTG Ø Given a scoring matrix such as PAM250 or BLOSUM50, and a gap penalty function, we can compute the score of an alignment as, S ( a , b ) alignemnt ∑ ∑ S ( a , b ) W ( l ) . = − k k aligned gaps pairs

  9. Pairwise sequence alignment algorithms Ø The pairwise alignment problem can be formulated as follows. Given a scoring system for matches and gaps, find the alignment of two sequences that has the maximum score among all possible ways to align the two sequences. ∑ ∑ S ( a , b ) S ( a , b ) W ( l ) . = − k k aligned gaps pairs Ø If sequence a has a length m , and sequence b has a length n , then they are ways to align them. m n m n + + ⎛ ⎞ ⎛ ⎞ ⎜ ⎟ ⎜ ⎟ = ⎜ ⎟ ⎜ ⎟ n m ⎝ ⎠ ⎝ ⎠ Ø To derive this formula, think the ways that we can intercalate the two sequences while preserving their respective order. Sequence a: This alignment corresponds to CAGT-AGATA the following intercalation: CAATCAGGAT Sequence b: CCAAGATTCAAGGAGTAAT Intercalation:

  10. Pairwise sequence alignment algorithm Ø Therefore, if we want to find the optimal alignment by evaluating all of these possible alignments, the solution is not polynomial, in stead, it is exponential, because if m = n , then, m n 2 n ( 2 n )! 2 + ⎛ ⎞ ⎛ ⎞ 2 n . ⎜ ⎟ ⎜ ⎟ = = ≈ ⎜ ⎟ ⎜ ⎟ 2 n n ( n ! ) n π ⎝ ⎠ ⎝ ⎠ Ø Fortunately, a polynomial algorithm ( O ( mn )) was found to align two sequences globally--- the Needleman and Wunsch algorithm (1970), using a dynamic programming method. Ø Dynamic programming is a type of algorithms to find an optimal value of a function based on previously computed optimal values. Ø A dynamic programming algorithm speeds up the computation by avoiding repetitive calculations.

  11. Needleman-Wunsch global alignment algorithm Ø Let a and b be two sequences of length m and n , and H ( s, t ) be the score to align the first s letters in a with the first t letter in b : H ( s, t ) a : a 1 a 2 ... a s − 1 a s ... a m s m b : b 1 b 2 ... b t − 1 b t ... b n n t Ø Using linear gap penalty function, we can compute H ( i, j ) based on the three possible ways to add the i-th and j-th letter to the previous alignments, and choose the alignment that maximizes H ( i, j ) . H ( i-1, j-1 ) S ( a i b , ) j a i align with b j i-1 i H ( i , j ) H ( i 1 , j 1 ) S ( a i b , ) = − − + j j-1 j a i align with a gap b j align with a gap g − g H ( i-1, j ) H ( i, j-1 ) − H ( i , j ) H ( i , j ) i-1 i i H ( i 1 , j ) g H ( i , j 1 ) g j j-1 j = − − = − −

  12. Needleman-Wunsch global alignment algorithm Ø If we use the linear gap penalty function, then mathematically, we have the following recursion relations : H ( i 1 , j 1 ) S ( a , b ) − − + ⎧ i j ⎪ H ( i , j ) max H ( i 1 , j ) g = − − ⎨ ⎪ H ( i , j 1 ) g − − ⎩ with the initial condition H (0,0)=0, H ( i ,0)=- ig , and H (0, j )=- jg. Ø This corresponds to filling out the matrix H with the elements being H ( i,j ). H is called the dynamic programming matrix of a and b . Ø We fill the matrix H from the upper-left corner to the bottom-right corner. Ø To recover the optimal alignment later, we use another matrix to keep the track on how each H ( i,j ) is computed. Ø As H has mn elements to fill, this algorithm runs in the order of O ( mn ) .

  13. Needleman-Wunsch global alignment algorithm Ø Initialize the first row and column, and then compute each cell from the upper left corner to the bottom right corner. b : b b b ... b b ... b 0 1 2 j 1 j n − a : a 0 - g - 2 g ... - ( j- 1 ) g - jg ... - ng 0 s ( a 1 b , ) + g − 1 a - g g H ( 1 , 1 ) 1 − a - 2 g 2 ... ... a - ( i- 1 ) g H ( i 1 , j ) H ( i 1 , j 1 ) − − − i 1 − g − S ( a i b , ) + j a - ig H ( i , j ) H ( i , j 1 ) − i g − ... ... a - mg m

  14. Needleman-Wunsch global alignment algorithm Ø Let’s use an example to illustrate this matrix filling process. Ø We want to align the following two amino acid sequences using the PAM250 scoring matrix and a linear gap penalty function W( l ) = -6 l. a: SHAKE b: SPEARE Amino acid alignment scores taken from PAM250 2 1 1 0

Download Presentation
Download Policy: The content available on the website is offered to you 'AS IS' for your personal information and use only. It cannot be commercialized, licensed, or distributed on other websites without prior consent from the author. To download a presentation, simply click this link. If you encounter any difficulties during the download process, it's possible that the publisher has removed the file from their server.

Recommend


More recommend