this week cse 527

This week CSE 527 Sequence alignment Computational Biology More - PowerPoint PPT Presentation



  1. CSE 527 Computational Biology, Autumn 2007
     Lectures 2-3: Sequence Alignment; DNA Replication

     This week
     • Sequence alignment
     • More sequence alignment
     • Weekly “bio” interlude: DNA replication

     Sequence Alignment, Part I
     Motivation, dynamic programming, global alignment

     Sequence Alignment
     • What
     • Why
     • A Simple Algorithm
     • Complexity Analysis
     • A better Algorithm: “Dynamic Programming”

  2. Sequence Similarity: What

         G G A C C A
         T A C T A A G
         T C C A A T

     Sequence Similarity: What

         G G A C C A

         T A C T A A G
         | : | : | | :
         T C C - A A T

     Sequence Similarity: Why
     • Most widely used comp. tools in biology
     • New sequence always compared to sequence data bases
     • Similar sequences often have similar origin or function
     • Recognizable similarity after 10^8 to 10^9 yr

     BLAST Demo
     http://www.ncbi.nlm.nih.gov/blast/
     Try it! Pick any protein, e.g. hemoglobin, insulin, exportin,…

     Taxonomy Report

     root ................................. 64 hits 16 orgs
     . Eukaryota .......................... 62 hits 14 orgs [cellular organisms]
     . . Fungi/Metazoa group .............. 57 hits 11 orgs
     . . . Bilateria ...................... 38 hits  7 orgs [Metazoa; Eumetazoa]
     . . . . Coelomata .................... 36 hits  6 orgs
     . . . . . Tetrapoda .................. 26 hits  5 orgs [;;; Vertebrata;;;; Sarcopterygii]
     . . . . . . Eutheria ................. 24 hits  4 orgs [Amniota; Mammalia; Theria]
     . . . . . . . Homo sapiens ........... 20 hits  1 orgs [Primates;; Hominidae; Homo]
     . . . . . . . Murinae ................  3 hits  2 orgs [Rodentia; Sciurognathi; Muridae]
     . . . . . . . . Rattus norvegicus ....  2 hits  1 orgs [Rattus]
     . . . . . . . . Mus musculus .........  1 hits  1 orgs [Mus]
     . . . . . . . Sus scrofa .............  1 hits  1 orgs [Cetartiodactyla; Suina; Suidae; Sus]
     . . . . . . Xenopus laevis ...........  2 hits  1 orgs [Amphibia;;;;;; Xenopodinae; Xenopus]
     . . . . . Drosophila melanogaster .... 10 hits  1 orgs [Protostomia;;;; Drosophila;;;]
     . . . . Caenorhabditis elegans .......  2 hits  1 orgs [; Nematoda;;;;;; Caenorhabditis]
     . . . Ascomycota ..................... 19 hits  4 orgs [Fungi]
     . . . . Schizosaccharomyces pombe .... 10 hits  1 orgs [;;;; Schizosaccharomyces]
     . . . . Saccharomycetales ............  9 hits  3 orgs [Saccharomycotina; Saccharomycetes]
     . . . . . Saccharomyces ..............  8 hits  2 orgs [Saccharomycetaceae]
     . . . . . . Saccharomyces cerevisiae .  7 hits  1 orgs
     . . . . . . Saccharomyces kluyveri ...  1 hits  1 orgs
     . . . . . Candida albicans ...........  1 hits  1 orgs [mitosporic Saccharomycetales;]
     . . Arabidopsis thaliana .............  2 hits  1 orgs [Viridiplantae; …Brassicaceae;]
     . . Apicomplexa ......................  3 hits  2 orgs [Alveolata]
     . . . Plasmodium falciparum ..........  2 hits  1 orgs [Haemosporida; Plasmodium]
     . . . Toxoplasma gondii ..............  1 hits  1 orgs [Coccidia; Eimeriida; Sarcocystidae;]
     . synthetic construct ................  1 hits  1 orgs [other; artificial sequence]
     . lymphocystis disease virus .........  1 hits  1 orgs [Viruses; dsDNA viruses, no RNA …]

  3. Terminology (CS, not necessarily Bio)
     • String: ordered list of letters
         TATAAG
     • Prefix: consecutive letters from front
         empty, T, TA, TAT, ...
     • Suffix: … from end
         empty, G, AG, AAG, ...
     • Substring: … from ends or middle
         empty, TAT, AA, ...
     • Subsequence: ordered, nonconsecutive
         TT, AAA, TAG, ...

     Sequence Alignment

         a c b c d b        a c - - b c d b
         c a d b d          - c a d b - d -

     Defn: An alignment of strings S, T is a pair of strings S’, T’ (with spaces) s.t.
     (1) |S’| = |T’|, and
     (2) removing all spaces leaves S, T
     (|S| = “length of S”)

     Alignment Scoring (Mismatch = -1, Match = 2)

         a c b c d b         a  c  -  -  b  c  d  b
         c a d b d           -  c  a  d  b  -  d  -
                            -1  2 -1 -1  2 -1  2 -1

         Value = 3*2 + 5*(-1) = +1

     • The score of aligning (characters or spaces) x & y is $\sigma(x, y)$.
     • Value of an alignment: $\sum_{i=1}^{|S'|} \sigma(S'[i], T'[i])$
     • An optimal alignment: one of max value

     Optimal Alignment: A Simple Algorithm

         for all subseqs A of S, B of T s.t. |A| = |B| do
             align A[i] with B[i], 1 ≤ i ≤ |A|
             align all other chars to spaces
             compute its value
             retain the max
         end
         output the retained alignment

     Example: S = abcd, A = cd; T = wxyz, B = xz

         -abc-d      a-bc-d
         w--xyz      -w-xyz
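The scoring rule and the exhaustive algorithm on this page can be sketched in Python. This is a minimal illustration, not code from the lecture: the names `sigma`, `alignment_value`, and `brute_force_value` are made up here, `-` stands for a space, and a space is assumed to cost the same as a mismatch (as in the slide's worked example). Only the best value is returned, not the alignment itself.

```python
from itertools import combinations

MATCH, MISMATCH = 2, -1  # scores from the slide; a space is assumed to score like a mismatch


def sigma(x, y):
    """Score of aligning symbols x and y, where '-' stands for a space."""
    return MATCH if x == y and x != "-" else MISMATCH


def alignment_value(s_aligned, t_aligned):
    """Sum of the column scores of two equal-length, space-padded strings."""
    assert len(s_aligned) == len(t_aligned)
    return sum(sigma(x, y) for x, y in zip(s_aligned, t_aligned))


def brute_force_value(s, t):
    """Best value over all equal-length subsequences A of s and B of t.

    A[i] is aligned with B[i]; every other character faces a space.
    The running time is exponential, so this is only usable on tiny inputs.
    """
    best = None
    for k in range(min(len(s), len(t)) + 1):
        for s_idx in combinations(range(len(s)), k):
            for t_idx in combinations(range(len(t)), k):
                paired = sum(sigma(s[i], t[j]) for i, j in zip(s_idx, t_idx))
                unpaired = (len(s) - k) + (len(t) - k)  # characters aligned to spaces
                value = paired + unpaired * MISMATCH
                if best is None or value > best:
                    best = value
    return best


print(alignment_value("ac--bcdb", "-cadb-d-"))
print(brute_force_value("acbcdb", "cadbd"))
```

The first call scores the alignment shown on the Alignment Scoring slide; the second searches every pairing of subsequences for the same two strings.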

  4. Analysis
     • Assume |S| = |T| = n
     • Cost of evaluating one alignment: ≥ n
     • How many alignments are there: $\binom{2n}{n}$
         pick n chars of S, T together
         say k of them are in S
         match these k to the k unpicked chars of T
     • Total time: $\ge n \binom{2n}{n} > 2^{2n}$, for $n > 3$
     • E.g., for n = 20, time is > $2^{40}$ operations

     Polynomial vs Exponential Growth
     [figure: growth curves; plot not recoverable]

     Asymptotic Analysis
     • How does run time grow as a function of problem size?
         $n^2$ or $100n^2 + 100n + 100$ vs $2^{2n}$
     • Defn: f(n) = O(g(n)) iff there is a constant c s.t. |f(n)| ≤ c g(n) for all sufficiently large n.
         $100n^2 + 100n + 100 = O(n^2)$ [e.g. c = 300, or 101]
         $n^2 = O(2^{2n})$
         $2^{2n}$ is not $O(n^2)$

     Big-O Example
     [figure: curves g(n), f(n), and g’(n) plotted against n, labeled f(n) = O(g(n)) = O(g’(n))]
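The counting argument in the Analysis slide can be checked numerically. The sketch below is an illustration only; the function names are hypothetical. It uses the fact that choosing k characters of S and k of T, summed over all k, gives the same count as choosing n of the 2n characters, and then compares the lower bound n·C(2n, n) with 2^(2n).

```python
from math import comb


def count_subsequence_pairs(n):
    """Number of pairs (A, B) of equal-length subsequences when |S| = |T| = n."""
    return sum(comb(n, k) ** 2 for k in range(n + 1))


def brute_force_cost_lower_bound(n):
    """At least n steps per alignment, times the number of alignments."""
    return n * comb(2 * n, n)


for n in (4, 10, 20):
    # Both ways of counting the alignments agree.
    assert count_subsequence_pairs(n) == comb(2 * n, n)
    print(n, brute_force_cost_lower_bound(n), 2 ** (2 * n))
```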

  5. Utility of Asymptotics
     • “All things being equal,” smaller asymptotic growth rate is better
     • All things are never equal
     • Even so, big-O bounds often let you quickly pick most promising candidates among competing algorithms
     • Poly time algorithms often practical; non-poly algorithms seldom are. (Yes, there are exceptions.)

     Fibonacci Numbers

         fib(n) {
             if (n <= 1) {
                 return 1;
             } else {
                 return fib(n-1) + fib(n-2);
             }
         }

     Simple recursion, but many repeated subproblems!!
     => Time = $\Omega(1.61^n)$

     Fibonacci, II

         int fib[n+1];   /* n+1 entries, since fib[n] is read below */
         fib[0] = 1;
         fib[1] = 1;
         for (i = 2; i <= n; i++) {
             fib[i] = fib[i-1] + fib[i-2];
         }
         return fib[n];

     “Dynamic Programming”: avoid repeated work by tabulating solutions to repeated subproblems
     => Time = O(n) (in this case)

     Candidate for Dynamic Programming?
     • Common Subproblems?
         • Plausible: probably re-considering alignments of various small substrings unless we're careful.
     • Optimal Substructure?
         • Plausible: left and right "halves" of an optimal alignment probably should be optimally aligned (though they obviously interact a bit at the interface).
     • (Both made rigorous below.)
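The contrast between the two Fibonacci versions on this page can be made concrete by counting recursive calls. The following is a small sketch under the slide's convention fib(0) = fib(1) = 1; the names `fib_naive` and `fib_table` and the call counter are illustrative additions, not part of the lecture.

```python
def fib_naive(n, counter):
    """Plain recursion; counter[0] records how many calls were made."""
    counter[0] += 1
    if n <= 1:
        return 1
    return fib_naive(n - 1, counter) + fib_naive(n - 2, counter)


def fib_table(n):
    """Tabulated version: each subproblem is solved once, left to right."""
    table = [1] * (n + 1)  # table[0] = table[1] = 1
    for i in range(2, n + 1):
        table[i] = table[i - 1] + table[i - 2]
    return table[n]


calls = [0]
value = fib_naive(20, calls)
print(value, fib_table(20), calls[0])
```

The tabulated version does n - 1 additions, while the number of calls made by the plain recursion grows in proportion to the result itself.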

  6. Optimal Substructure (In More Detail)
     • Optimal alignment ends in 1 of 3 ways:
         • last chars of S & T aligned with each other
         • last char of S aligned with space in T
         • last char of T aligned with space in S
         • (never align space with space; $\sigma(-, -) < 0$)
     • In each case, the rest of S & T should be optimally aligned to each other

     Optimal Alignment in $O(n^2)$ via “Dynamic Programming”
     • Input: S, T, |S| = n, |T| = m
     • Output: value of optimal alignment
     Easier to solve a “harder” problem:
         V(i,j) = value of optimal alignment of S[1], …, S[i] with T[1], …, T[j]
         for all 0 ≤ i ≤ n, 0 ≤ j ≤ m.

     Base Cases
     • V(i,0): first i chars of S all match spaces
         $V(i,0) = \sum_{k=1}^{i} \sigma(S[k], -)$
     • V(0,j): first j chars of T all match spaces
         $V(0,j) = \sum_{k=1}^{j} \sigma(-, T[k])$

     General Case
     Opt align of S[1], …, S[i] vs T[1], …, T[j] ends in one of:

         ~~~~ S[i]        ~~~~ S[i]          ~~~~  -
         ~~~~ T[j]   ,    ~~~~  -     , or   ~~~~ T[j]

     $$V(i,j) = \max \left\{ \begin{array}{l} V(i-1,j-1) + \sigma(S[i], T[j]) \\ V(i-1,j) + \sigma(S[i], -) \\ V(i,j-1) + \sigma(-, T[j]) \end{array} \right\}$$

     for all 1 ≤ i ≤ n, 1 ≤ j ≤ m.
     (V(i-1,j-1) is the value of the opt align of S[1], …, S[i-1] & T[1], …, T[j-1].)
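The base cases and the general-case recurrence on this page translate directly into a table-filling loop. The sketch below is one possible rendering, assuming Match = 2, Mismatch = -1, and a space scored like a mismatch; `sigma` and `fill_table` are names chosen here for illustration. Python strings are 0-indexed, so S[i] in the slide's notation is `s[i - 1]` in the code.

```python
def sigma(x, y, match=2, mismatch=-1):
    """Score of aligning x with y; '-' stands for a space (assumed to score like a mismatch)."""
    return match if x == y and x != "-" else mismatch


def fill_table(s, t):
    """Return V, where V[i][j] is the optimal value for s[:i] versus t[:j]."""
    n, m = len(s), len(t)
    V = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):          # base case: first i chars of s face spaces
        V[i][0] = V[i - 1][0] + sigma(s[i - 1], "-")
    for j in range(1, m + 1):          # base case: first j chars of t face spaces
        V[0][j] = V[0][j - 1] + sigma("-", t[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            V[i][j] = max(
                V[i - 1][j - 1] + sigma(s[i - 1], t[j - 1]),  # last chars aligned
                V[i - 1][j] + sigma(s[i - 1], "-"),           # s[i] faces a space
                V[i][j - 1] + sigma("-", t[j - 1]),           # t[j] faces a space
            )
    return V


for row in fill_table("acbcdb", "cadbd"):
    print(row)
```

Each entry needs only three previously computed neighbors, so the two nested loops do a constant amount of work per cell.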

  7. Calculating One Entry

     $$V(i,j) = \max \left\{ \begin{array}{l} V(i-1,j-1) + \sigma(S[i], T[j]) \\ V(i-1,j) + \sigma(S[i], -) \\ V(i,j-1) + \sigma(-, T[j]) \end{array} \right\}$$

                                    T[j]
                                     :
                   V(i-1,j-1)     V(i-1,j)
         S[i] . .  V(i,j-1)       V(i,j)

     Time = O(mn)

     Example (Mismatch = -1, Match = 2), table partially filled

           j    0    1    2    3    4    5
         i           c    a    d    b    d    ← T
         0      0   -1   -2   -3   -4   -5
         1 a   -1   -1    1
         2 c   -2    1
         3 b   -3
         4 c   -4
         5 d   -5
         6 b   -6
           ↑
           S

     Example (Mismatch = -1, Match = 2), table completed

           j    0    1    2    3    4    5
         i           c    a    d    b    d    ← T
         0      0   -1   -2   -3   -4   -5
         1 a   -1   -1    1    0   -1   -2
         2 c   -2    1    0    0   -1   -2
         3 b   -3    0    0   -1    2    1
         4 c   -4   -1   -1   -1    1    1
         5 d   -5   -2   -2    1    0    3
         6 b   -6   -3   -3    0    3    2
           ↑
           S

     Finding Alignments: Trace Back
     (uses the same completed table as the Example above)
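The trace-back step can be sketched as follows: fill the table, then walk from V(n,m) back to V(0,0), at each cell following a neighbor whose value plus the column score equals the current entry. This is an illustrative sketch, not the lecture's code; `sigma` and `align` are hypothetical names, a space is assumed to score like a mismatch, and ties are broken in favor of the diagonal move, so other equally good alignments may exist.

```python
def sigma(x, y, match=2, mismatch=-1):
    """Score of aligning x with y; '-' stands for a space (assumed to score like a mismatch)."""
    return match if x == y and x != "-" else mismatch


def align(s, t):
    """Return (value, s_aligned, t_aligned) for one optimal global alignment."""
    n, m = len(s), len(t)
    V = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        V[i][0] = V[i - 1][0] + sigma(s[i - 1], "-")
    for j in range(1, m + 1):
        V[0][j] = V[0][j - 1] + sigma("-", t[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            V[i][j] = max(
                V[i - 1][j - 1] + sigma(s[i - 1], t[j - 1]),
                V[i - 1][j] + sigma(s[i - 1], "-"),
                V[i][j - 1] + sigma("-", t[j - 1]),
            )

    # Trace back from the bottom-right corner, collecting columns in reverse.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and V[i][j] == V[i - 1][j - 1] + sigma(s[i - 1], t[j - 1]):
            top.append(s[i - 1]); bottom.append(t[j - 1]); i -= 1; j -= 1
        elif i > 0 and V[i][j] == V[i - 1][j] + sigma(s[i - 1], "-"):
            top.append(s[i - 1]); bottom.append("-"); i -= 1
        else:  # the entry must have come from the left neighbor
            top.append("-"); bottom.append(t[j - 1]); j -= 1
    return V[n][m], "".join(reversed(top)), "".join(reversed(bottom))


value, s_aligned, t_aligned = align("acbcdb", "cadbd")
print(value)
print(s_aligned)
print(t_aligned)
```

The trace back visits at most n + m cells, so it adds little to the O(mn) cost of filling the table.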
