applications of machine learning in computational biology
play

Applications of Machine Learning in Computational Biology Narges - PowerPoint PPT Presentation

Applications of Machine Learning in Computational Biology Narges Razavian New York University Slides thanks to James Galagan@Board Institute Su-In Lee@Univ of Washington Rainer Breitling@ Univ of Glasgow Christopher M. Bishop @ ECCV


  1. Applications of Machine Learning in Computational Biology Narges Razavian New York University Slides thanks to James Galagan@Board Institute Su-In Lee@Univ of Washington Rainer Breitling@ Univ of Glasgow Christopher M. Bishop @ ECCV 2004

  2. Central Dogma of Biology

  3. Examples of Challenges involved Slide Credit: Manolis Kellis

  4. Application : Decoding Sequences and Motif Discovery

  5. Motif Discovery GCGTCTGACGGCGCACCGTTCGCGCTGCCGGCACCCCGGGCTCCATAATGAAAATCATGT TCAGTAAGCTACACTCTGCATATCGGGCTACCAACGAAATGGAGTATCGGTCATGATCTT GCCAGCCGTGCCTAAAAGCTTGGCCGCAGGGCCGAGTATAATTGGTCGCGGTCGCCTCGA AGTTAGCTTATGCAATGCAGGAGGTGGGGCAAAGTTCAGGCGGATCGGCCGATGGCGGGC GTAGGTGAAGGAGACAGCGGAGGCGTGGAGCGTGATGACATTGGCATGGTGGCCGCTTCC CCCGTCGCGTCTCGGGTAAATGGCAAGGTAGACGCTGACGTCGTCGGTCGATTTGCCACC TGCTGCCGTGCCCTGGGCATCGCGGTTTACCAGCGTAAACGTCCGCCGGACCTGGCTGCC GCCCGGTCTGGTTTCGCCGCGCTGACCCGCGTCGCCCATGACCAGTGCGACGCCTGGACC GGGCTGGCCGCTGCCGGCGACCAGTCCATCGGGGTGCTGGAAGCCGCCTCGCGCACGGCG ACCACGGCTGGTGTGTTGCAGCGGCAGGTGGAACTGGCCGATAACGCCTTGGGCTTCCTG TACGACACCGGGCTGTACCTGCGTTTTCGTGCCACCGGACCTGACGATTTCCACCTCGCG TATGCCGCTGCGTTGGCTTCGACGGGCGGGCCGGAGGAGTTTGCCAAGGCCAATCACGTG GTGTCCGGTATCACCGAGCGCCGCGCCGGCTGGCGTGCCGCCCGTTGGCTCGCCGTGGTC ATCAACTACCGCGCCGAGCGCTGGTCGGATGTCGTGAAGCTGCTCACTCCGATGGTTAAT GATCCCGACCTCGACGAGGCCTTTTCGCACGCGGCCAAGATCACCCTGGGCACCGCACTG GCCCGACTGGGCATGTTTGCCCCGGCGCTGTCTTATCTGGAGGAACCCGACGGTCCTGTC GCGGTCGCTGCTGTCGACGGTGCACTGGCCAAAGCGCTGGTGCTGCGCGCGCATGTGGAT ATGGAGTCGGCCAGCGAAGTGCTGCAGGACTTGTATGCGGCTCACCCCGAAAACGAACAG GTCGAGCAGGCGCTGTCGGATACCAGCTTCGGGATCGTCACCACCACAGCCGGGCGGATC GAGGCCCGCACCGATCCGTGGGATCCGGCGACCGAGCCCGGCGCGGAGGATTTCGTCGAT CCCGCGGCCCACGAACGCAAGGCCGCGCTGCTGCACGAGGCCGAACTCCAACTCGCCGAG

  6. Sequence Annotation GCGTCTGACGGCGCACCGTTCGCGCTGCCGGCACCCCGGGCTCCATAATGAAAATCATGT TCAGTAAGCTACACTCTGCATATCGGGCTACCAACGAAATGGAGTATCGGTCATGATCTT GCCAGCCGTGCCTAAAAGCTTGGCCGCAGGGCCGAGTATAATTGGTCGCGGTCGCCTCGA AGTTAGCTTATGCAATGCAGGAGGTGGGGCAAAGTTCAGGCGGATCGGCCGATGGCGGGC GTAGGTGAAGGAGACAGCGGAGGCGTGGAGCGTGATGACATTGGCATGGTGGCCGCTTCC CCCGTCGCGTCTCGGGTAAATGGCAAGGTAGACGCTGACGTCGTCGGTCGATTTGCCACC TGCTGCCGTGCCCTGGGCATCGCGGTTTACCAGCGTAAACGTCCGCCGGACCTGGCTGCC GCCCGGTCTGGTTTCGCCGCGCTGACCCGCGTCGCCCATGACCAGTGCGACGCCTGGACC GGGCTGGCCGCTGCCGGCGACCAGTCCATCGGGGTGCTGGAAGCCGCCTCGCGCACGGCG Gene ACCACGGCTGGTGTGTTGCAGCGGCAGGTGGAACTGGCCGATAACGCCTTGGGCTTCCTG TACGACACCGGGCTGTACCTGCGTTTTCGTGCCACCGGACCTGACGATTTCCACCTCGCG TATGCCGCTGCGTTGGCTTCGACGGGCGGGCCGGAGGAGTTTGCCAAGGCCAATCACGTG GTGTCCGGTATCACCGAGCGCCGCGCCGGCTGGCGTGCCGCCCGTTGGCTCGCCGTGGTC ATCAACTACCGCGCCGAGCGCTGGTCGGATGTCGTGAAGCTGCTCACTCCGATGGTTAAT GATCCCGACCTCGACGAGGCCTTTTCGCACGCGGCCAAGATCACCCTGGGCACCGCACTG GCCCGACTGGGCATGTTTGCCCCGGCGCTGTCTTATCTGGAGGAACCCGACGGTCCTGTC GCGGTCGCTGCTGTCGACGGTGCACTGGCCAAAGCGCTGGTGCTGCGCGCGCATGTGGAT ATGGAGTCGGCCAGCGAAGTGCTGCAGGACTTGTATGCGGCTCACCCCGAAAACGAACAG GTCGAGCAGGCGCTGTCGGATACCAGCTTCGGGATCGTCACCACCACAGCCGGGCGGATC GAGGCCCGCACCGATCCGTGGGATCCGGCGACCGAGCCCGGCGCGGAGGATTTCGTCGAT CCCGCGGCCCACGAACGCAAGGCCGCGCTGCTGCACGAGGCCGAACTCCAACTCGCCGAG

  7. Sequence Annotation GCGTCTGACGGCGCACCGTTCGCGCTGCCGGCACCCCGGGCTCCATAATGAAAATCATGT Promoter TCAGTAAGCTACACTCTGCATATCGGGCTACCAACGAAATGGAGTATCGGTCATGATCTT Motif GCCAGCCGTGCCTAAAAGCTTGGCCGCAGGGCCGAGTATAATTGGTCGCGGTCGCCTCGA AGTTAGCTTATGCAATGCAGGAGGTGGGGCAAAGTTCAGGCGGATCGGCCGATGGCGGGC GTAGGTGAAGGAGACAGCGGAGGCGTGGAGCGTGATGACATTGGCATGGTGGCCGCTTCC CCCGTCGCGTCTCGGGTAAATGGCAAGGTAGACGCTGACGTCGTCGGTCGATTTGCCACC TGCTGCCGTGCCCTGGGCATCGCGGTTTACCAGCGTAAACGTCCGCCGGACCTGGCTGCC GCCCGGTCTGGTTTCGCCGCGCTGACCCGCGTCGCCCATGACCAGTGCGACGCCTGGACC GGGCTGGCCGCTGCCGGCGACCAGTCCATCGGGGTGCTGGAAGCCGCCTCGCGCACGGCG Gene ACCACGGCTGGTGTGTTGCAGCGGCAGGTGGAACTGGCCGATAACGCCTTGGGCTTCCTG TACGACACCGGGCTGTACCTGCGTTTTCGTGCCACCGGACCTGACGATTTCCACCTCGCG TATGCCGCTGCGTTGGCTTCGACGGGCGGGCCGGAGGAGTTTGCCAAGGCCAATCACGTG GTGTCCGGTATCACCGAGCGCCGCGCCGGCTGGCGTGCCGCCCGTTGGCTCGCCGTGGTC ATCAACTACCGCGCCGAGCGCTGGTCGGATGTCGTGAAGCTGCTCACTCCGATGGTTAAT GATCCCGACCTCGACGAGGCCTTTTCGCACGCGGCCAAGATCACCCTGGGCACCGCACTG GCCCGACTGGGCATGTTTGCCCCGGCGCTGTCTTATCTGGAGGAACCCGACGGTCCTGTC GCGGTCGCTGCTGTCGACGGTGCACTGGCCAAAGCGCTGGTGCTGCGCGCGCATGTGGAT ATGGAGTCGGCCAGCGAAGTGCTGCAGGACTTGTATGCGGCTCACCCCGAAAACGAACAG GTCGAGCAGGCGCTGTCGGATACCAGCTTCGGGATCGTCACCACCACAGCCGGGCGGATC GAGGCCCGCACCGATCCGTGGGATCCGGCGACCGAGCCCGGCGCGGAGGATTTCGTCGAT CCCGCGGCCCACGAACGCAAGGCCGCGCTGCTGCACGAGGCCGAACTCCAACTCGCCGAG

  8. A Generative Model 0.15 Background Island 0.85 0.75 0.25 A: 0.25 A: 0.15 T: 0.25 T: 0.13 G: 0.25 G: 0.30 C: 0.25 C: 0.42 TAAGAATTGTGTCACACACATAAAAACCCTAAGTTAGAGGATTGAGATTGGCA GACGATTGTTCGTGATAATAAACAAGGGGGGCATAGATCAGGCTCATATTGGC

  9. A Generative Model(cont.) P P P P P P P P P P P P P P P P B B B B B B B B B B B B B B B B S: G C A A A T G C P(L i+1 |L i ) P(S|B) P(S|P) B i+1 P i+1 A: 0.25 A: 0.42 T: 0.25 T: 0.30 0.85 0.15 B i G: 0.25 G: 0.13 0.25 0.75 P i C: 0.25 C: 0.15

  10. Fundamental HMM Operations Computation Biology Decoding Annotate pathogenicity islands on • Given an HMM and sequence S a new sequence • Find a corresponding sequence of labels, L Evaluation • Given an HMM and sequence S Score a particular sequence (not • Find P(S|HMM) as useful for this model – will come back to this later) Training • Given an HMM w/o parameters Learn a model for sequence and set of sequences S composed of background DNA • Find transition and emission and pathogenicity islands probabilities the maximize P(S | params, HMM)

  11. Application: Modeling Protein Families

  12. Modeling Protein Families • Given amino acid sequences from a protein family, how can we find other members? – Can search databases with each known member – not sensitive – More information is contained in full set • The HMM Profile Approach – Learn the statistical features of protein family – Model these features with an HMM – Search for new members by scoring with HMM

  13. Human Ubiquitin Conjugating Enzymes UBE2D2 FPTDYPFKPPKVAFTTRIYHPNINSN-GSICLDILR-------------SQWSPALTISK UBE2D3 FPTDYPFKPPKVAFTTRIYHPNINSN-GSICLDILR-------------SQWSPALTISK BAA91697 FPTDYPFKPPKVAFTTKIYHPNINSN-GSICLDILR-------------SQWSPALTVSK UBE2D1 FPTDYPFKPPKIAFTTKIYHPNINSN-GSICLDILR-------------SQWSPALTVSK UBE2E1 FTPEYPFKPPKVTFRTRIYHCNINSQ-GVICLDILK-------------DNWSPALTISK UBCH9 FSSDYPFKPPKVTFRTRIYHCNINSQ-GVICLDILK-------------DNWSPALTISK UBE2N LPEEYPMAAPKVRFMTKIYHPNVDKL-GRICLDILK-------------DKWSPALQIRT AAF67016 IPERYPFEPPQIRFLTPIYHPNIDSA-GRICLDVLKLP---------PKGAWRPSLNIAT UBCH10 FPSGYPYNAPTVKFLTPCYHPNVDTQ-GNICLDILK-------------EKWSALYDVRT CDC34 FPIDYPYSPPAFRFLTKMWHPNIYET-GDVCISILHPPVDDPQSGELPSERWNPTQNVRT BAA91156 FPIDYPYSPPTFRFLTKMWHPNIYEN-GDVCISILHPPVDDPQSGELPSERWNPTQNVRT UBE2G1 FPKDYPLRPPKMKFITEIWHPNVDKN-GDVCISILHEPGEDKYGYEKPEERWLPIHTVET UBE2B FSEEYPNKPPTVRFLSKMFHPNVYAD-GSICLDILQN-------------RWSPTYDVSS UBE2I FKDDYPSSPPKCKFEPPLFHPNVYPS-GTVCLSILEED-----------KDWRPAITIKQ E2EPF5 LGKDFPASPPKGYFLTKIFHPNVGAN-GEICVNVLKR-------------DWTAELGIRH UBE2L1 FPAEYPFKPPKITFKTKIYHPNIDEK-GQVCLPVISA------------ENWKPATKTDQ UBE2L6 FPPEYPFKPPMIKFTTKIYHPNVDEN-GQICLPIISS------------ENWKPCTKTCQ UBE2H LPDKYPFKSPSIGFMNKIFHPNIDEASGTVCLDVIN-------------QTWTALYDLTN UBC12 VGQGYPHDPPKVKCETMVYHPNIDLE-GNVCLNILR-------------EDWKPVLTINS

  14. Profile HMM A A C C D D E E F F G G H H D 1 D j D N I I K K L L M M N N O O P P Q Q R R I I 1 I j I N S S T T V V W W Y Y Start M 1 M j M N End E2EPF5 LG K D F PA S PP K G YF L T K I F H P N VGA N - G E ICV N VL KR A------------ D W T A E LGI RH UBE2L1 F PA E Y P F K PP K I T F K T K I Y H P N I DE K - G Q VCLPVI S A A----------- E N W K PA T K T D Q UBE2L6 F PP E Y P F K PPMI K F TT K I Y H P N V DE N - G Q ICLPII SS A----------- E N W K PC T K T C Q UBE2H LP D K Y P F K S P S IG F M N K I F H P N I DE A S G T VCL D VI N -P----------- QT W T AL Y D L TN

  15. Using Profile HMMs Computation Biology Decoding Find sequence of labels, L, Align a new sequence to a protein that maximizes family P(L|S, HMM) Evaluation • Find P(S|HMM) Score a sequence for membership in family Training • Find transition and emission Discover and model family probabilities the maximize structure P(S | params, HMM)

  16. Application: Modeling Protein Dynamics

  17. Background • Proteins: Molecular machines, composed of a sequences of Amino Acid sub-units

  18. Background: • Protein functional analysis pipeline Crystallize to Molecular Learn Analyze Get X-Ray Dynamics Probabilistic and Predict Snapshot Simulations Model 20 Image: H khanlou , et.al. “Durable Efficacy and Continued Safety of Ibalizumab in Treatment- Experienced Patients”, Infectious Diseases Society of America (IDSA) October 2011

  19. Modeling Protein Tertiary Structure

  20. 10 second Reminder! Probability Theory • Sum rule • Product rule • From these we have Bayes’ theorem – with normalization

  21. 10 second Reminder(cont.)! Decomposition • Consider an arbitrary joint distribution • By successive application of the product rule

  22. Directed Acyclic Graphs • Joint distribution where denotes the parents of i No directed cycles

  23. Undirected Graphs • Provided then joint distribution is product of non-negative functions over the cliques of the graph where are the clique potentials, and Z is a normalization constant

Download Presentation
Download Policy: The content available on the website is offered to you 'AS IS' for your personal information and use only. It cannot be commercialized, licensed, or distributed on other websites without prior consent from the author. To download a presentation, simply click this link. If you encounter any difficulties during the download process, it's possible that the publisher has removed the file from their server.

Recommend


More recommend