sequence alignment
play

Sequence alignment Alignment specifies which positions in two - PowerPoint PPT Presentation

Sequence alignment Alignment specifies which positions in two sequences l match acgtctag acgtctag acgtctag || ||||| || ||||| actctag- -actctag ac-tctag 2 matches 5 matches 7 matches 5 mismatches 2 mismatches 0 mismatches 1 not


  1. Sequence alignment Alignment specifies which positions in two sequences l match acgtctag acgtctag acgtctag || ||||| || ||||| actctag- -actctag ac-tctag 2 matches 5 matches 7 matches 5 mismatches 2 mismatches 0 mismatches 1 not aligned 1 not aligned 1 not aligned Introduction to bioinformatics, Autumn 2007 41

  2. Mutations: Insertions, deletions and substitutions acgtctag Indel: insertion or Mismatch: substitution ||||| deletion of a base (point mutation) of with respect to the a single base -actctag ancestor sequence Insertions and/or deletions are called indels l − We can’t tell whether the ancestor sequence had a base or not at indel position Introduction to bioinformatics, Autumn 2007 42

  3. Problems What sorts of alignments should be considered? l How to score alignments? l How to find optimal or good scoring alignments? l How to evaluate the statistical significance of scores? l In this course, we discuss each of these problems briefly. Course Biological sequence analysis tackles all four in- depth. Introduction to bioinformatics, Autumn 2007 43

  4. Sequence Alignment (chapter 6) The biological problem l Global alignment l Local alignment l Multiple alignment l Introduction to bioinformatics, Autumn 2007 44

  5. Global alignment Problem: find optimal scoring alignment between two l sequences (Needleman & Wunsch 1970) Every position in both sequences is included in the alignment l We give score for each position in alignment l WHAT − Identity (match) +1 − Substitution (mismatch) -µ || �� − Indel WH-Y Total score: sum of position scores l S(WHAT/WH-Y) = 1 + 1 – � – µ Introduction to bioinformatics, Autumn 2007 45

  6. Dynamic programming How to find the optimal alignment? l We use previous solutions for optimal alignments of l smaller subsequences This general approach is known as dynamic l programming Introduction to bioinformatics, Autumn 2007 46

  7. Introduction to dynamic programming: the money change problem Suppose you buy a pen for 4.23€ and pay for with a l 5€ note You get 77 cents in change – what coins is the cashier l going to give you if he or she tries to minimise the number of coins? The usual algorithm: start with largest coin l (denominator), proceed to smaller coins until no change is left: − 50, 20, 5 and 2 cents This greedy algorithm is incorrect , in the sense that it l does not always give you the correct answer Introduction to bioinformatics, Autumn 2007 47

  8. The money change problem • How else to compute the 77 change? 50 20 5 • We could consider all possible 27 57 72 ways to reduce the amount of change • Suppose we have 77 cents 7 22 7 37 52 22 52 67 change, and the following coins: 50, 20, 5 cents … • We can compute the change • Many values are computed with recursion more than once! • Figure shows the recursion • This leads to a correct but tree for the example very inefficient algorithm Introduction to bioinformatics, Autumn 2007 48

  9. The money change problem We can speed the computation up by solving the l change problem for all i � n − Example: solve the problem for 9 cents with available coins being 1, 2 and 5 cents Solve the problem in steps, first for 1 cent, then 2 l cents, and so on In each step, utilise the solutions from the previous l steps Introduction to bioinformatics, Autumn 2007 49

  10. The money change problem Amount of 0 1 2 3 4 5 6 7 8 9 change left Algorithm runs in time proportional to Md, where M is l the amount of change and d is the number of coin types The same technique of storing solutions of l subproblems can be utilised in aligning sequences Introduction to bioinformatics, Autumn 2007 50

  11. Representing alignments and scores Alignments can be represented in the - W H A T following tabular form. Each alignment - corresponds to a path through the table. W X WHAT H X X || Y X WH-Y Introduction to bioinformatics, Autumn 2007 51

  12. Representing alignments and scores WHAT - W H A T || - 0 WH-Y W 1 H 2 2- � Global alignment Y 2- � -µ score S 3,4 = 2- � -µ Introduction to bioinformatics, Autumn 2007 52

  13. Filling the alignment matrix - W H A T Consider the alignment process at shaded square. - Case 1. Align H against H (match or substitution). W Case 1 Case 2. Align H in WHY against Case 2 – (indel) in WHAT. H Case 3 Case 3. Align H in WHAT against – (indel) in WHY. Y Introduction to bioinformatics, Autumn 2007 53

  14. Filling the alignment matrix (2) - W H A T Scoring the alternatives. Case 1. S 2,2 = S 1,1 + s(2, 2) - Case 2. S 2,2 = S 1,2 � � W Case 3. S 2,2 = S 2,1 � � Case 1 Case 2 s(i, j) = 1 for matching positions, H s(i, j) = - µ for substitutions. Case 3 Y Choose the case (path) that yields the maximum score. Keep track of path choices. Introduction to bioinformatics, Autumn 2007 54

  15. Global alignment: formal development A = a 1 a 2 a 3 …a n , 0 1 2 3 4 B = b 1 b 2 b 3 …b m - b 1 b 2 b 3 b 4 b 1 b 2 b 3 b 4 - 0 - - a 1 - a 2 a 3 l Any alignment can be written 1 a 1 as a unique path through the matrix 2 a 2 l Score for aligning A and B up to positions i and j: 3 a 3 S i,j = S(a 1 a 2 a 3 …a i , b 1 b 2 b 3 …b j ) Introduction to bioinformatics, Autumn 2007 55

  16. Scoring partial alignments Alignment of A = a 1 a 2 a 3 …a n with B = b 1 b 2 b 3 …b m can end in l three ways − Case 1: (a 1 a 2 …a i-1 ) a i (b 1 b 2 …b j-1 ) b j − Case 2: (a 1 a 2 …a i-1 ) a i (b 1 b 2 …b j ) - − Case 3: (a 1 a 2 …a i ) – (b 1 b 2 …b j-1 ) b j Introduction to bioinformatics, Autumn 2007 56

  17. Scoring alignments Scores for each case: l +1 if a i = b j s(a i , b j ) = { -µ otherwise − Case 1: (a 1 a 2 …a i-1 ) a i (b 1 b 2 …b j-1 ) b j − Case 2: (a 1 a 2 …a i-1 ) a i (b 1 b 2 …b j ) – s(a i , -) = s(-, b j ) = - � − Case 3: (a 1 a 2 …a i ) – (b 1 b 2 …b j-1 ) b j Introduction to bioinformatics, Autumn 2007 57

  18. Scoring alignments (2) • First row and first column 0 1 2 3 4 correspond to initial alignment against indels: - b 1 b 2 b 3 b 4 S(i, 0) = -i � S(0, j) = -j � �� -2 � -3 � -4 � 0 0 - • Optimal global alignment �� 1 a 1 score S(A, B) = S n,m -2 � 2 a 2 -3 � 3 a 3 Introduction to bioinformatics, Autumn 2007 58

  19. Algorithm for global alignment I nput sequences A, B, n = | A|, m = |B| Set S i,0 := - � i f or all i Set S 0,j := - � j f or all j f or i := 1 t o n f or j := 1 t o m S i,j := max{S i-1,j – � , S i-1,j -1 + s(a i ,b j ), S i,j -1 – � } end end Algorithm takes O(nm) time and space. Introduction to bioinformatics, Autumn 2007 59

  20. Global alignment: example - T G G T G µ = 1 - 0 -2 -4 -6 -8 -10 � = 2 A -2 T -4 C -6 G -8 T -10 ? Introduction to bioinformatics, Autumn 2007 60

  21. Global alignment: example (2) - T G G T G µ = 1 - 0 -2 -4 -6 -8 -10 � = 2 A -2 -1 -3 -5 -7 -9 T -4 -1 -2 -4 -4 -6 C -6 -3 -2 -3 -5 -5 ATCGT- G -8 -5 -2 -1 -3 -4 | || T -10 -7 -4 -3 0 -2 -TGGTG Introduction to bioinformatics, Autumn 2007 61

  22. Sequence Alignment (chapter 6) The biological problem l Global alignment l Local alignment l Multiple alignment l Introduction to bioinformatics, Autumn 2007 62

  23. Local alignment: rationale • Otherwise dissimilar proteins may have local regions of similarity -> Proteins may share a function Human bone morphogenic protein receptor type II precursor (left) has a 300 aa region that resembles 291 aa region in TGF- � receptor (right). The shared function here is protein kinase. Introduction to bioinformatics, Autumn 2007 63

  24. Local alignment: rationale A B Regions of similarity • Global alignment would be inadequate • Problem: find the highest scoring local alignment between two sequences • Previous algorithm with minor modifications solves this problem (Smith & Waterman 1981) Introduction to bioinformatics, Autumn 2007 64

  25. From global to local alignment Modifications to the global alignment algorithm l − Look for the highest-scoring path in the alignment matrix (not necessarily through the matrix), or in other words: − Allow preceding and trailing indels without penalty Introduction to bioinformatics, Autumn 2007 65

  26. Scoring local alignments A = a 1 a 2 a 3 …a n , B = b 1 b 2 b 3 …b m Let I and J be intervals (substrings) of A and B, respectively: , Best local alignment score: where S(I, J) is the score for substrings I and J. Introduction to bioinformatics, Autumn 2007 66

  27. Allowing preceding and trailing indels • First row and column 0 1 2 3 4 initialised to zero: - b 1 b 2 b 3 b 4 M i,0 = M 0,j = 0 0 0 0 0 0 0 - 0 1 a 1 b1 b2 b3 2 a 2 0 - - a1 0 3 a 3 Introduction to bioinformatics, Autumn 2007 67

  28. Recursion for local alignment • M i,j = max { - T G G T G M i-1,j-1 + s(a i , b i ), - 0 0 0 0 0 0 M i-1,j � � , A 0 0 0 0 0 0 M i,j-1 � � , 0 T 0 1 0 0 1 0 } C 0 0 0 0 0 0 G 0 0 1 1 0 1 T 0 1 0 0 2 0 Introduction to bioinformatics, Autumn 2007 68

Download Presentation
Download Policy: The content available on the website is offered to you 'AS IS' for your personal information and use only. It cannot be commercialized, licensed, or distributed on other websites without prior consent from the author. To download a presentation, simply click this link. If you encounter any difficulties during the download process, it's possible that the publisher has removed the file from their server.

Recommend


More recommend