sequence alignment chapter 6

  1. Sequence Alignment (chapter 6)
     - The biological problem
     - Global alignment
     - Local alignment
     - Multiple alignment

  2. Local alignment: rationale
     - Otherwise dissimilar proteins may have local regions of similarity, so the proteins may share a function.
     - [Figure: Human bone morphogenic protein receptor type II precursor (left) has a 300 aa region that resembles a 291 aa region in TGF-β receptor (right). The shared function here is protein kinase.]

  3. Local alignment: rationale
     - [Figure: sequences A and B with marked regions of similarity]
     - Global alignment would be inadequate.
     - Problem: find the highest-scoring local alignment between two sequences.
     - The previous algorithm with minor modifications solves this problem (Smith & Waterman 1981).

  4. From global to local alignment
     - Modifications to the global alignment algorithm:
       - Look for the highest-scoring path in the alignment matrix (not necessarily through the whole matrix), or in other words:
       - Allow preceding and trailing indels without penalty.

  5. Scoring local alignments
     - $A = a_1 a_2 a_3 \ldots a_n$, $B = b_1 b_2 b_3 \ldots b_m$
     - Let I and J be intervals (substrings) of A and B, respectively.
     - Best local alignment score: $\max_{I, J} S(I, J)$, where S(I, J) is the alignment score for substrings I and J.

  6. Allowing preceding and trailing indels
     - First row and column initialised to zero: $M_{i,0} = M_{0,j} = 0$
     - [Figure: alignment matrix for a1 a2 a3 (rows) against b1 b2 b3 b4 (columns) with zeros in the first row and column; the example alignment places b1 b2 b3 over - - a1.]

  7. Recursion for local alignment
     - $M_{i,j} = \max \{ M_{i-1,j-1} + s(a_i, b_j),\ M_{i-1,j} - \delta,\ M_{i,j-1} - \delta,\ 0 \}$
     - The 0 term allows an alignment to start anywhere in the sequences.
     - Example matrix (rows: ATCGT, columns: TGGTG):

              -  T  G  G  T  G
           -  0  0  0  0  0  0
           A  0  0  0  0  0  0
           T  0  1  0  0  1  0
           C  0  0  0  0  0  0
           G  0  0  1  1  0  1
           T  0  1  0  0  2  0
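
The recursion on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the course's own code: the function name `local_alignment_matrix` is made up here, and the scoring values (match +1, mismatch -1, indel penalty 2) are assumptions, since the slide does not state them. They were picked because they happen to reproduce the example matrix for ATCGT against TGGTG.

```python
def local_alignment_matrix(a, b, match=1, mismatch=-1, indel=2):
    """Fill the local alignment matrix M for a (rows) and b (columns).

    Scoring defaults are assumptions for illustration only.
    """
    # Row 0 and column 0 stay at zero, so leading indels are free.
    M = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            M[i][j] = max(
                M[i - 1][j - 1] + s,  # align a_i with b_j
                M[i - 1][j] - indel,  # a_i against a gap
                M[i][j - 1] - indel,  # b_j against a gap
                0,                    # start a new alignment here
            )
    return M


M = local_alignment_matrix("ATCGT", "TGGTG")
for row in M:
    print(row)
```

The only difference from a global alignment fill is the zero initialisation and the extra 0 inside `max`.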

  8. Finding best local alignment
     - The optimal score is the highest value in the matrix: $\max_{i,j} M_{i,j}$
     - The best local alignment can be found by backtracking from the highest value in M.
     - What is the best local alignment in this example? (Same example matrix as on the previous slide: rows ATCGT, columns TGGTG.)

  9. Local alignment: example
     - Sequences: A = ACCTAAGG (rows 1 to 8), B = GGCTCAATCA (columns 1 to 10)
     - $M_{i,j} = \max \{ M_{i-1,j-1} + s(a_i, b_j),\ M_{i-1,j} - \delta,\ M_{i,j-1} - \delta,\ 0 \}$
     - Scoring (for example): Match: +2, Mismatch: -1, Indel: -2
     - The matrix starts with row 0 and column 0 set to zero; the first computed cell is $M_{1,1} = 0$.

  10. Local alignment: example
      - Same sequences, recursion and scoring as on the previous slide (Match: +2, Mismatch: -1, Indel: -2).
      - Filling row 1 (A) from left to right: $M_{1,1} = \ldots = M_{1,5} = 0$ and $M_{1,6} = 2$, since $a_1$ = A first matches $b_6$ = A.

  11. Local alignment: example
      - Scoring (for example): Match: +2, Mismatch: -1, Indel: -2
      - Filled matrix (rows: ACCTAAGG, columns: GGCTCAATCA):

                  0  1  2  3  4  5  6  7  8  9 10
                  -  G  G  C  T  C  A  A  T  C  A
            0  -  0  0  0  0  0  0  0  0  0  0  0
            1  A  0  0  0  0  0  0  2  2  0  0  2
            2  C  0  0  0  2  0  2  0  1  1  2  0
            3  C  0  0  0  2  1  2  1  0  0  3  1
            4  T  0  0  0  0  4  2  1  0  2  1  2
            5  A  0  0  0  0  2  3  4  3  1  1  3
            6  A  0  0  0  0  0  1  5  6  4  2  3
            7  G  0  2  2  0  0  0  3  4  5  3  1
            8  G  0  2  4  2  0  0  1  2  3  4  2

      - Optimal local alignment:

            C T - A A
            C T C A A
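
The worked example can be checked with a short Smith-Waterman sketch that fills the matrix and then backtracks from the highest cell. The function name `smith_waterman` and the tie-breaking order during backtracking (diagonal, then up, then left) are choices made for this illustration; the scoring values are the ones given on the slide.

```python
def smith_waterman(a, b, match=2, mismatch=-1, indel=2):
    """Return (best score, aligned a, aligned b) for one optimal local alignment."""
    n, m = len(a), len(b)
    M = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            M[i][j] = max(M[i - 1][j - 1] + s,
                          M[i - 1][j] - indel,
                          M[i][j - 1] - indel,
                          0)
            if M[i][j] > best:
                best, best_pos = M[i][j], (i, j)

    # Backtrack from the highest value until a zero cell is reached.
    i, j = best_pos
    top, bottom = [], []
    while i > 0 and j > 0 and M[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if M[i][j] == M[i - 1][j - 1] + s:      # came from the diagonal
            top.append(a[i - 1]); bottom.append(b[j - 1])
            i, j = i - 1, j - 1
        elif M[i][j] == M[i - 1][j] - indel:    # gap in b
            top.append(a[i - 1]); bottom.append("-")
            i -= 1
        else:                                   # gap in a
            top.append("-"); bottom.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


score, top, bottom = smith_waterman("ACCTAAGG", "GGCTCAATCA")
print(score)
print(top)
print(bottom)
```

With the slide's sequences this yields score 6 and the alignment CT-AA over CTCAA. The sketch reports only one optimal alignment, the first highest cell found in row-major order.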

  12. Multiple optimal alignments; non-optimal, good-scoring alignments
      - Using the filled matrix from the previous example (rows: ACCTAAGG, columns: GGCTCAATCA), how can you find
        1. optimal alignments if more than one exists?
        2. non-optimal, good-scoring alignments?

  13. Overlap alignment
      - The overlap matrix used by the Overlap-Layout-Consensus algorithm can be computed with dynamic programming.
      - Initialization: $O_{i,0} = O_{0,j} = 0$ for all i, j
      - Recursion: $O_{i,j} = \max \{ O_{i-1,j-1} + s(a_i, b_j),\ O_{i-1,j} - \delta,\ O_{i,j-1} - \delta \}$
      - Best overlap: maximum value from the rightmost column and the bottom row.
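
A minimal sketch of the overlap computation described on this slide follows. The function name `best_overlap`, the scoring values and the two test reads are assumptions for illustration. Note that, unlike local alignment, the recursion has no 0 term, so cell values can become negative.

```python
def best_overlap(a, b, match=1, mismatch=-1, indel=2):
    """Best overlap score between a (rows) and b (columns).

    Scoring defaults are illustrative assumptions.
    """
    n, m = len(a), len(b)
    # Free start: first row and first column are zero.
    O = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            O[i][j] = max(O[i - 1][j - 1] + s,
                          O[i - 1][j] - indel,
                          O[i][j - 1] - indel)
    # Free end: take the maximum over the rightmost column and bottom row.
    last_column = [O[i][m] for i in range(n + 1)]
    bottom_row = O[n]
    return max(last_column + bottom_row)


# The suffix TTG of the first read overlaps the prefix TTG of the second.
overlap = best_overlap("ACGTTG", "TTGCAA")
print(overlap)
```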

  14. Non-uniform mismatch penalties
      - We used a uniform penalty for mismatches: s('A', 'C') = s('A', 'G') = … = s('G', 'T') = µ
      - Transition mutations (A->G, G->A, C->T, T->C) are approximately twice as frequent as transversions (A->T, T->A, A->C, G->T, …).
        - Use non-uniform mismatch penalties collected into a substitution matrix:

                  A     C     G     T
            A     1    -1   -0.5   -1
            C    -1     1    -1   -0.5
            G   -0.5   -1     1    -1
            T    -1   -0.5   -1     1
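
The substitution matrix above can be generated from the transition/transversion rule instead of being typed in by hand. This is a small sketch; the names `PURINES`, `PYRIMIDINES` and `substitution_score` are made up for the example, and the scores 1, -0.5 and -1 are taken from the slide's table.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def substitution_score(x, y):
    """Score a pair of nucleotides using the values from the slide's table."""
    if x == y:
        return 1.0
    # A transition stays within purines or within pyrimidines.
    is_transition = {x, y} <= PURINES or {x, y} <= PYRIMIDINES
    return -0.5 if is_transition else -1.0


# Full lookup table keyed by (row, column) nucleotide.
S = {(x, y): substitution_score(x, y) for x in "ACGT" for y in "ACGT"}

for x in "ACGT":
    print(x, [S[(x, y)] for y in "ACGT"])
```

In an alignment recursion, `S[(a_i, b_j)]` would take the place of the uniform match/mismatch score.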

  15. Gaps in alignment
      - A gap is a succession of indels in an alignment:

            C T - - - A A
            C T C G C A A

      - The previous model scored a gap of length k as $w(k) = -k\delta$.
      - Replication processes may produce longer stretches of insertions or deletions.
        - In coding regions, insertions or deletions of codons may preserve functionality.

  16. Gap open and extension penalties (2)
      - We can design a score that allows the penalty for opening a gap to be larger than the penalty for extending it: $w(k) = -\alpha - \beta (k - 1)$
      - Gap open cost $\alpha$, gap extension cost $\beta$
      - Alignment algorithms can be extended to use w(k) (not discussed in this course).
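
To see how the two gap models differ, the sketch below compares the linear penalty from the previous slide with the open/extension penalty from this one. The function names and the numeric values (indel penalty 2, gap open cost 3, gap extension cost 1) are assumptions chosen only for illustration.

```python
def linear_gap(k, delta=2):
    """Linear model: every indel in a gap of length k costs delta."""
    return -k * delta


def affine_gap(k, gap_open=3, gap_extend=1):
    """Open/extension model: the first indel costs gap_open, each further one gap_extend."""
    return -gap_open - gap_extend * (k - 1)


for k in (1, 3, 6):
    print(k, linear_gap(k), affine_gap(k))
```

With these values a single indel is cheaper under the linear model (-2 against -3), while a gap of length 6 is much cheaper under the open/extension model (-8 against -12), which favours one long gap over many short ones.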

  17. Amino acid sequences
      - We have discussed mainly DNA sequences.
      - Amino acid sequences can be aligned as well.
      - However, the design of the substitution matrix is more involved because of the larger alphabet.
      - More on the topic in the course Biological sequence analysis.

  18. Demonstration of the EBI web site
      - The European Bioinformatics Institute (EBI) offers many biological databases and bioinformatics tools at http://www.ebi.ac.uk/
        - Sequence alignment: Tools -> Sequence Analysis -> Align

  19. Sequence Alignment (chapter 6)
      - The biological problem
      - Global alignment
      - Local alignment
      - Multiple alignment

  20. Multiple alignment
      - Consider a set of n sequences:
        - Orthologous sequences from different organisms
        - Paralogs from multiple duplications
      - How can we study relationships between these sequences?
      - Example sequences:

            aggcgagctgcgagtgcta
            cgttagattgacgctgac
            ttccggctgcgac
            gacacggcgaacgga
            agtgtgcccgacgagcgaggac
            gcgggctgtgagcgcta
            aagcggcctgtgtgcccta
            atgctgctgccagtgta
            agtcgagccccgagtgc
            agtccgagtcc
            actcggtgc

  21. Optimal alignment of three sequences
      - An alignment of $A = a_1 a_2 \ldots a_i$ and $B = b_1 b_2 \ldots b_j$ can end in $(-, b_j)$, $(a_i, b_j)$ or $(a_i, -)$: $2^2 - 1 = 3$ alternatives.
      - An alignment of A, B and $C = c_1 c_2 \ldots c_k$ can end in $2^3 - 1 = 7$ ways: $(a_i, -, -)$, $(-, b_j, -)$, $(-, -, c_k)$, $(-, b_j, c_k)$, $(a_i, -, c_k)$, $(a_i, b_j, -)$ or $(a_i, b_j, c_k)$.
      - Solve the recursion using a three-dimensional dynamic programming matrix: $O(n^3)$ time and space.
      - This generalizes to n sequences but is impractical with even a moderate number of sequences.
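
The counting argument on this slide can be made concrete by enumerating the possible last columns of an alignment. In this sketch (the function name `alignment_endings` is made up), `True` stands for a residue and `False` for a gap, and the column consisting only of gaps is excluded.

```python
from itertools import product


def alignment_endings(num_sequences):
    """All possible last columns of an alignment of num_sequences sequences.

    Each column is a tuple of booleans: True = residue, False = gap.
    """
    return [column
            for column in product([True, False], repeat=num_sequences)
            if any(column)]  # drop the all-gap column


for k in (2, 3, 10):
    print(k, len(alignment_endings(k)))
```

Each cell of the dynamic programming matrix has to consider every one of these endings, so the work per cell grows as 2^k - 1 on top of the matrix size itself, which is why the exact method quickly becomes impractical.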

  22. Multiple alignment in practice
      - In practice, real-world multiple alignment problems are usually solved with heuristics.
      - Progressive multiple alignment:
        - Choose two sequences and align them.
        - Choose a third sequence with respect to the two previous sequences and align it against them.
        - Repeat until all sequences have been aligned.
        - There are different options for choosing sequences and scoring alignments.
        - Note the similarity to Overlap-Layout-Consensus.

  23. Multiple alignment in practice
      - Profile-based progressive multiple alignment: CLUSTALW
        - Construct a distance matrix of all pairs of sequences using dynamic programming.
        - Progressively align pairs in order of decreasing similarity.
        - CLUSTALW uses various heuristics to improve accuracy.
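
The first two steps listed above can be sketched as follows. This is not CLUSTALW's implementation: it is a simplified illustration that scores every pair with a plain global alignment and sorts the pairs by decreasing score. The function names, the scoring values and the three toy sequences are all assumptions.

```python
from itertools import combinations


def global_score(a, b, match=1, mismatch=-1, indel=2):
    """Global alignment score by dynamic programming, keeping one row at a time."""
    prev = [-indel * j for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [-indel * i]
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            cur.append(max(prev[j - 1] + s,     # a_i aligned with b_j
                           prev[j] - indel,     # a_i against a gap
                           cur[j - 1] - indel)) # b_j against a gap
        prev = cur
    return prev[-1]


def pairs_by_similarity(seqs):
    """All sequence pairs as (score, i, j), most similar pair first."""
    scored = [(global_score(seqs[i], seqs[j]), i, j)
              for i, j in combinations(range(len(seqs)), 2)]
    return sorted(scored, key=lambda item: -item[0])


seqs = ["ACGT", "ACGA", "TTTT"]
order = pairs_by_similarity(seqs)
print(order)
```

A progressive aligner would start from the first pair in `order` and add the remaining sequences one at a time.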

  24. Additional material
      - R. Durbin, S. Eddy, A. Krogh, G. Mitchison: Biological sequence analysis
      - N. C. Jones, P. A. Pevzner: An introduction to bioinformatics algorithms
      - Course Biological sequence analysis in period II, 2008

  25. Rapid alignment methods: FASTA and BLAST
      - The biological problem
      - Search strategies
      - FASTA
      - BLAST

  26. The biological problem
      - Global and local alignment algorithms are slow in practice.
      - Consider the scenario of aligning a query sequence against a large database of sequences:
        - A new sequence with unknown function
        - NCBI GenBank size in January 2007 was 65 369 091 950 bases (61 132 599 sequences)
        - Feb 2008: 85 759 586 764 bases (82 853 685 sequences)

  27. Problem with a large number of sequences
      - Exponential growth in both the number and the total length of sequences
      - Possible solution: compare against model organisms only
      - With a large number of sequences, matches are likely to occur by chance.
        - Need for statistical analysis

  28. Rapid alignment methods: FASTA and BLAST
      - The biological problem
      - Search strategies
      - FASTA
      - BLAST

  29. FASTA
      - FASTA is a multistep algorithm for sequence alignment (Wilbur and Lipman, 1983).
      - The sequence file format used by the FASTA software is widely used by other sequence analysis software.
      - Main idea:
        - Choose regions of the two sequences (query and database) that look promising (have some degree of similarity).
        - Compute local alignment using dynamic programming in these regions.
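
One way to pick promising regions is to look for short exact word matches and count how many fall on each diagonal of the comparison matrix. The slide itself does not spell out this step, so the sketch below is an assumption about how the region selection could look; the function name `diagonal_hits`, the word length 4 and the toy sequences are made up for illustration.

```python
from collections import Counter, defaultdict


def diagonal_hits(query, target, k=4):
    """Count shared k-letter words per diagonal (target position minus query position)."""
    # Index every k-letter word of the query by its start positions.
    index = defaultdict(list)
    for i in range(len(query) - k + 1):
        index[query[i:i + k]].append(i)

    # Scan the target and record the diagonal of every word hit.
    hits = Counter()
    for j in range(len(target) - k + 1):
        for i in index.get(target[j:j + k], []):
            hits[j - i] += 1
    return hits


hits = diagonal_hits("ACGTACGT", "TTACGTACGTTT")
best_diagonal, count = hits.most_common(1)[0]
print(best_diagonal, count)
```

A diagonal with many hits marks a region where the two sequences line up with few or no indels, so the expensive dynamic programming step can be restricted to a band around it.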
