CSE 527 Computational Biology, Autumn 2007
Lectures 2-3: Sequence Alignment; DNA Replication

This week
• Sequence alignment
• More sequence alignment
• Weekly “bio” interlude: DNA replication

Sequence Alignment, Part I
Motivation, dynamic programming, global alignment

Sequence Alignment
• What
• Why
• A Simple Algorithm
• Complexity Analysis
• A better Algorithm: “Dynamic Programming”

Sequence Similarity: What

G G A C C A
T A C T A A G
T C C A A T

Sequence Similarity: What

G G A C C A

T A C T A A G
| : | : | | :
T C C – A A T

BLAST Demo

Try it! http://www.ncbi.nlm.nih.gov/blast/
Pick any protein, e.g. hemoglobin, insulin, exportin, …

Taxonomy Report
root ................................. 64 hits 16 orgs
. Eukaryota .......................... 62 hits 14 orgs [cellular organisms]
. . Fungi/Metazoa group .............. 57 hits 11 orgs
. . . Bilateria ...................... 38 hits 7 orgs [Metazoa; Eumetazoa]
. . . . Coelomata .................... 36 hits 6 orgs
. . . . . Tetrapoda .................. 26 hits 5 orgs [;;; Vertebrata;;;; Sarcopterygii]
. . . . . . Eutheria ................. 24 hits 4 orgs [Amniota; Mammalia; Theria]
. . . . . . . Homo sapiens ........... 20 hits 1 orgs [Primates;; Hominidae; Homo]
. . . . . . . Murinae ................ 3 hits 2 orgs [Rodentia; Sciurognathi; Muridae]
. . . . . . . . Rattus norvegicus .... 2 hits 1 orgs [Rattus]
. . . . . . . . Mus musculus ......... 1 hits 1 orgs [Mus]
. . . . . . . Sus scrofa ............. 1 hits 1 orgs [Cetartiodactyla; Suina; Suidae; Sus]
. . . . . . Xenopus laevis ........... 2 hits 1 orgs [Amphibia;;;;;; Xenopodinae; Xenopus]
. . . . . Drosophila melanogaster .... 10 hits 1 orgs [Protostomia;;;; Drosophila;;;]
. . . . Caenorhabditis elegans ....... 2 hits 1 orgs [; Nematoda;;;;;; Caenorhabditis]
. . . Ascomycota ..................... 19 hits 4 orgs [Fungi]
. . . . Schizosaccharomyces pombe .... 10 hits 1 orgs [;;;; Schizosaccharomyces]
. . . . Saccharomycetales ............ 9 hits 3 orgs [Saccharomycotina; Saccharomycetes]
. . . . . Saccharomyces .............. 8 hits 2 orgs [Saccharomycetaceae]
. . . . . . Saccharomyces cerevisiae . 7 hits 1 orgs
. . . . . . Saccharomyces kluyveri ... 1 hits 1 orgs
. . . . . Candida albicans ........... 1 hits 1 orgs [mitosporic Saccharomycetales;]
. . Arabidopsis thaliana ............. 2 hits 1 orgs [Viridiplantae; …Brassicaceae;]
. . Apicomplexa ...................... 3 hits 2 orgs [Alveolata]
. . . Plasmodium falciparum .......... 2 hits 1 orgs [Haemosporida; Plasmodium]
. . . Toxoplasma gondii .............. 1 hits 1 orgs [Coccidia; Eimeriida; Sarcocystidae;]
. synthetic construct ................ 1 hits 1 orgs [other; artificial sequence]
. lymphocystis disease virus ......... 1 hits 1 orgs [Viruses; dsDNA viruses, no RNA …]

Sequence Similarity: Why
• Most widely used comp. tools in biology
• New sequence always compared to sequence data bases
Similar sequences often have similar origin or function
• Recognizable similarity after $10^8$–$10^9$ yr

Terminology (CS, not necessarily Bio)
• String: ordered list of letters
    TATAAG
• Prefix: consecutive letters from front
    empty, T, TA, TAT, ...
• Suffix: … from end
    empty, G, AG, AAG, ...
• Substring: … from ends or middle
    empty, TAT, AA, ...
• Subsequence: ordered, nonconsecutive
    TT, AAA, TAG, ...

Sequence Alignment

a c b c d b        a c – – b c d b
c a d b d          – c a d b – d –

Defn: An alignment of strings S, T is a pair of strings S’, T’ (with spaces) s.t.
(1) |S’| = |T’|, and    (|S| = “length of S”)
(2) removing all spaces leaves S, T

Alignment Scoring
Mismatch = -1, Match = 2

a c b c d b         a  c  -  -  b  c  d  b
c a d b d           -  c  a  d  b  -  d  -
                   -1  2 -1 -1  2 -1  2 -1

Value = 3*2 + 5*(-1) = +1

• The score of aligning (characters or spaces) x & y is $\sigma(x, y)$.
• Value of an alignment $= \sum_{i=1}^{|S'|} \sigma(S'[i], T'[i])$
• An optimal alignment: one of max value

Optimal Alignment: A Simple Algorithm

for all subseqs A of S, B of T s.t. |A| = |B| do
    align A[i] with B[i], 1 ≤ i ≤ |A|
    align all other chars to spaces
    compute its value
    retain the max
end
output the retained alignment

Example: S = abcd, A = cd; T = wxyz, B = xz

-abc-d        a-bc-d
w--xyz        -w-xyz
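The scoring rule on the Alignment Scoring slide can be checked mechanically. The following is a minimal sketch, not code from the lecture: the names `sigma`, `alignment_value`, and `is_alignment` are hypothetical, the ASCII hyphen stands in for the slide's space symbol, and a letter against a space is assumed to score as a mismatch (as the column scores on the slide suggest).

```python
GAP = "-"  # assumption: ASCII hyphen plays the role of the slide's space symbol


def sigma(x, y, match=2, mismatch=-1):
    """Score of one column; anything involving a space scores as a mismatch."""
    return match if x == y and x != GAP else mismatch


def alignment_value(s_aligned, t_aligned):
    """Sum of sigma over the columns of an alignment (S', T')."""
    if len(s_aligned) != len(t_aligned):
        raise ValueError("aligned strings must have equal length")
    return sum(sigma(x, y) for x, y in zip(s_aligned, t_aligned))


def is_alignment(s, t, s_aligned, t_aligned):
    """Check the two conditions of the definition of an alignment."""
    return (len(s_aligned) == len(t_aligned)
            and s_aligned.replace(GAP, "") == s
            and t_aligned.replace(GAP, "") == t)


# The alignment shown on the slide: 3 matches and 5 other columns.
print(is_alignment("acbcdb", "cadbd", "ac--bcdb", "-cadb-d-"))  # True
print(alignment_value("ac--bcdb", "-cadb-d-"))                  # 1
```

Note that `sigma("-", "-")` is negative here, which is consistent with the later remark that a space is never aligned with a space.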

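The simple algorithm above can be written out directly for very short strings. This is a rough sketch under stated assumptions: `brute_force` is a hypothetical name, spaces are assumed to cost -1 like mismatches, and only the best value (plus a count of the subsequence pairs tried) is returned rather than the alignment itself. Since the value depends only on which characters are paired, the positions of the unpaired characters are not enumerated.

```python
from itertools import combinations


def brute_force(s, t, match=2, mismatch=-1, gap=-1):
    """Try every pair of equal-length subsequences A of s and B of t.

    Returns (best value found, number of (A, B) pairs examined).
    """
    best, tried = None, 0
    for k in range(min(len(s), len(t)) + 1):
        for a in combinations(range(len(s)), k):      # positions of A in s
            for b in combinations(range(len(t)), k):  # positions of B in t
                tried += 1
                # A[i] is aligned with B[i]; every other char faces a space.
                paired = sum(match if s[i] == t[j] else mismatch
                             for i, j in zip(a, b))
                unpaired = (len(s) - k) + (len(t) - k)
                value = paired + gap * unpaired
                if best is None or value > best:
                    best = value
    return best, tried


print(brute_force("acbcdb", "cadbd"))  # (2, 462)
print(brute_force("abcd", "wxyz"))     # (-4, 70)
```

Even at these sizes the number of pairs tried grows quickly, which is the point the analysis slides make next.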
Analysis
• Assume |S| = |T| = n
• Cost of evaluating one alignment: ≥ n
• How many alignments are there: $\binom{2n}{n}$
    pick n chars of S, T together
    say k of them are in S
    match these k to the k unpicked chars of T
• Total time: $\ge n \binom{2n}{n} > 2^{2n}$, for n > 3
• E.g., for n = 20, time is > $2^{40}$ operations

Polynomial vs Exponential Growth
[Figure: plot of polynomial vs exponential growth]

Asymptotic Analysis
• How does run time grow as a function of problem size?
    $n^2$ or $100n^2 + 100n + 100$ vs $2^{2n}$
• Defn: f(n) = O(g(n)) iff there is a constant c s.t. |f(n)| ≤ c g(n) for all sufficiently large n.
    $100n^2 + 100n + 100 = O(n^2)$ [e.g. c = 300, or 101]
    $n^2 = O(2^{2n})$
    $2^{2n}$ is not $O(n^2)$

Big-O Example
[Figure: curves f(n), g(n), and g’(n) plotted against n, illustrating f(n) = O(g(n)) = O(g’(n))]
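The counting argument in the Analysis slide is easy to check numerically. A small sketch, with hypothetical function names, using `math.comb` from the Python standard library:

```python
from math import comb


def num_alignments(n):
    """Number of subsequence pairings for |S| = |T| = n, i.e. C(2n, n)."""
    return comb(2 * n, n)


def brute_force_cost(n):
    """Lower bound n * C(2n, n) on the work done by the simple algorithm."""
    return n * num_alignments(n)


for n in (3, 4, 10, 20):
    # Last column: does the bound exceed 2^(2n)?  (False only for n = 3.)
    print(n, num_alignments(n), brute_force_cost(n) > 2 ** (2 * n))
```

For n = 20 the count alone is 137,846,528,820 alignments, so the bound of more than $2^{40}$ operations is comfortably met.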

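The constants in the big-O definition can be explored by brute force over a finite range. This is only an illustration (a finite scan proves nothing about all n), and `threshold` is a hypothetical helper: it reports the smallest n0 from which f(n) ≤ c·g(n) holds for every n checked.

```python
def f(n):
    return 100 * n * n + 100 * n + 100


def square(n):
    return n * n


def threshold(c, g, limit=2000):
    """Smallest n0 >= 1 with f(n) <= c * g(n) for all n0 <= n < limit."""
    failing = [n for n in range(1, limit) if f(n) > c * g(n)]
    return failing[-1] + 1 if failing else 1


print(threshold(300, square))  # 1: c = 300 works from n = 1 on
print(threshold(101, square))  # 101: c = 101 works only once n >= 101

# In the other direction, 2^(2n) eventually exceeds c * n^2 for any fixed c;
# here c = 10**6 is an arbitrary example constant.
first = next(n for n in range(1, 200) if 2 ** (2 * n) > 10 ** 6 * square(n))
print(first)
```

Both constants are valid witnesses for $100n^2 + 100n + 100 = O(n^2)$; they differ only in what "sufficiently large n" has to mean.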
Utility of Asymptotics
• “All things being equal,” smaller asymptotic growth rate is better
• All things are never equal
• Even so, big-O bounds often let you quickly pick most promising candidates among competing algorithms
• Poly time algorithms often practical; non-poly algorithms seldom are. (Yes, there are exceptions.)

Fibonacci Numbers

    int fib(int n) {
        if (n <= 1) {
            return 1;
        } else {
            return fib(n-1) + fib(n-2);
        }
    }

Simple recursion, but many repeated subproblems!!
=> Time = $\Omega(1.61^n)$

Candidate for Dynamic Programming?
• Common Subproblems?
    • Plausible: probably re-considering alignments of various small substrings unless we're careful.
• Optimal Substructure?
    • Plausible: left and right "halves" of an optimal alignment probably should be optimally aligned (though they obviously interact a bit at the interface).
• (Both made rigorous below.)

Fibonacci, II

    int fib[n+1];    /* entries fib[0..n]; assumes n ≥ 1 */
    fib[0] = 1;
    fib[1] = 1;
    for (int i = 2; i <= n; i++) {
        fib[i] = fib[i-1] + fib[i-2];
    }
    return fib[n];

“Dynamic Programming”: avoid repeated work by tabulating solutions to repeated subproblems
=> Time = O(n) (in this case)
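The gap between the two Fibonacci versions shows up clearly if the recursive calls are counted. A minimal sketch with hypothetical names (`fib_naive`, `fib_table`), using the slides' convention fib(0) = fib(1) = 1:

```python
def fib_naive(n, counter):
    """Plain recursion; counter[0] records how many calls were made."""
    counter[0] += 1
    if n <= 1:
        return 1
    return fib_naive(n - 1, counter) + fib_naive(n - 2, counter)


def fib_table(n):
    """Tabulated version: each subproblem is solved exactly once."""
    table = [1] * (max(n, 1) + 1)   # table[0] = table[1] = 1
    for i in range(2, n + 1):
        table[i] = table[i - 1] + table[i - 2]
    return table[n]


calls = [0]
print(fib_naive(20, calls), calls[0])  # 10946 21891
print(fib_table(20))                   # 10946
```

The recursion makes 2·fib(n) − 1 calls, so its cost grows as fast as the Fibonacci numbers themselves, while the table needs only n − 1 additions.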

Optimal Alignment in $O(n^2)$ via “Dynamic Programming”
• Input: S, T, |S| = n, |T| = m
• Output: value of optimal alignment
Easier to solve a “harder” problem:
    V(i,j) = value of optimal alignment of S[1], …, S[i] with T[1], …, T[j]
    for all 0 ≤ i ≤ n, 0 ≤ j ≤ m.

Optimal Substructure (In More Detail)
• Optimal alignment ends in 1 of 3 ways:
    • last chars of S & T aligned with each other
    • last char of S aligned with space in T
    • last char of T aligned with space in S
    • (never align space with space; $\sigma(-, -) < 0$)
• In each case, the rest of S & T should be optimally aligned to each other

Base Cases
• V(i,0): first i chars of S all match spaces
    $V(i, 0) = \sum_{k=1}^{i} \sigma(S[k], -)$
• V(0,j): first j chars of T all match spaces
    $V(0, j) = \sum_{k=1}^{j} \sigma(-, T[k])$

General Case
Opt align of S[1], …, S[i] vs T[1], …, T[j] ends in one of:

    ~~~~ S[i]        ~~~~ S[i]          ~~~~ –
    ~~~~ T[j]   ,    ~~~~ –     ,  or   ~~~~ T[j]

$$V(i,j) = \max \begin{cases} V(i-1,j-1) + \sigma(S[i], T[j]) \\ V(i-1,j) + \sigma(S[i], -) \\ V(i,j-1) + \sigma(-, T[j]) \end{cases}$$
for all 1 ≤ i ≤ n, 1 ≤ j ≤ m.
(V(i-1,j-1): opt align of $S_1 \ldots S_{i-1}$ & $T_1 \ldots T_{j-1}$)
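The base cases and the general case translate almost line for line into a table-filling loop. The sketch below is one possible rendering, not the lecture's code: `sigma` and `alignment_table` are hypothetical names, strings are 0-indexed in Python (so S[i] on the slides is `s[i - 1]` here), and the scores Match = 2, Mismatch = -1 with spaces scored as mismatches are assumptions carried over from the scoring example.

```python
GAP = "-"  # assumption: ASCII hyphen stands for the slides' space symbol


def sigma(x, y, match=2, mismatch=-1):
    """Column score; a letter against a space scores as a mismatch."""
    return match if x == y and x != GAP else mismatch


def alignment_table(s, t):
    """Return V with V[i][j] = value of an optimal alignment of s[:i], t[:j]."""
    n, m = len(s), len(t)
    V = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):          # base case V(i, 0)
        V[i][0] = V[i - 1][0] + sigma(s[i - 1], GAP)
    for j in range(1, m + 1):          # base case V(0, j)
        V[0][j] = V[0][j - 1] + sigma(GAP, t[j - 1])
    for i in range(1, n + 1):          # general case, row by row
        for j in range(1, m + 1):
            V[i][j] = max(
                V[i - 1][j - 1] + sigma(s[i - 1], t[j - 1]),
                V[i - 1][j] + sigma(s[i - 1], GAP),
                V[i][j - 1] + sigma(GAP, t[j - 1]),
            )
    return V


V = alignment_table("acbcdb", "cadbd")
print(V[6][5])  # 2
```

Each of the (n+1)(m+1) entries is computed from three neighbors in constant time, which is where the O(mn) running time comes from.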

Calculating One Entry

$$V(i,j) = \max \begin{cases} V(i-1,j-1) + \sigma(S[i], T[j]) \\ V(i-1,j) + \sigma(S[i], -) \\ V(i,j-1) + \sigma(-, T[j]) \end{cases}$$

                           T[j]
                            :
           V(i-1,j-1)    V(i-1,j)
S[i] . .   V(i,j-1)      V(i,j)

Example
Mismatch = -1, Match = 2

        j    0    1    2    3    4    5
   i         c    a    d    b    d    ← T
   0         0   -1   -2   -3   -4   -5
   1    a   -1   -1    1
   2    c   -2    1
   3    b   -3
   4    c   -4
   5    d   -5
   6    b   -6
        ↑
        S

Time = O(mn)

Example
Mismatch = -1, Match = 2

        j    0    1    2    3    4    5
   i         c    a    d    b    d    ← T
   0         0   -1   -2   -3   -4   -5
   1    a   -1   -1    1    0   -1   -2
   2    c   -2    1    0    0   -1   -2
   3    b   -3    0    0   -1    2    1
   4    c   -4   -1   -1   -1    1    1
   5    d   -5   -2   -2    1    0    3
   6    b   -6   -3   -3    0    3    2
        ↑
        S

Finding Alignments: Trace Back
[Figure: the same completed table as in the Example above]
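Once the table is filled, an optimal alignment can be read off by walking back from V(n, m) and, at each entry, following any predecessor that explains its value. The sketch below is one way to do this, under assumptions: `align` is a hypothetical name, ties are broken in the order diagonal, up, left (the slides do not specify a rule), spaces cost -1, and the input strings are assumed not to contain "-".

```python
def align(s, t, match=2, mismatch=-1, gap=-1):
    """Return (optimal value, S', T') for a global alignment of s and t."""
    n, m = len(s), len(t)

    def score(x, y):
        return match if x == y else mismatch

    # Fill the table exactly as in the recurrence.
    V = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        V[i][0] = V[i - 1][0] + gap
    for j in range(1, m + 1):
        V[0][j] = V[0][j - 1] + gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            V[i][j] = max(V[i - 1][j - 1] + score(s[i - 1], t[j - 1]),
                          V[i - 1][j] + gap,
                          V[i][j - 1] + gap)

    # Trace back from the bottom-right corner to (0, 0).
    s_cols, t_cols = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0
                and V[i][j] == V[i - 1][j - 1] + score(s[i - 1], t[j - 1])):
            s_cols.append(s[i - 1]); t_cols.append(t[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and V[i][j] == V[i - 1][j] + gap:
            s_cols.append(s[i - 1]); t_cols.append("-")
            i -= 1
        else:
            s_cols.append("-"); t_cols.append(t[j - 1])
            j -= 1
    return V[n][m], "".join(reversed(s_cols)), "".join(reversed(t_cols))


value, s_aligned, t_aligned = align("acbcdb", "cadbd")
print(value)
print(s_aligned)
print(t_aligned)
```

A different tie-breaking order may return a different alignment with the same optimal value; the trace back adds only O(n + m) steps on top of the O(mn) table fill.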
