Computational Methods in Systems Biology



  1. Computational Methods in Systems Biology
     Nir Friedman, Maya Schuldiner

     What is Biology?
     “A branch of knowledge that deals with living organisms and vital processes”
     The hottest scientific frontier of our times
     • Many great processes have been figured out
     • Much is still unknown
     Tremendous impact on medicine
     • Diagnosis, prognosis, and treatment

     Biological Systems are Complex
     • The system is NOT just a sum of its parts

     Baker's Yeast (Saccharomyces cerevisiae)
     • Used to make bread and beer
     • The simplest cell that still resembles human cells

     The Age of Genomes
     • 404 complete microbial genomes (thousands in progress)
     • 31 complete eukaryotic genomes (315 in progress!)
     • 3 complete plant genomes (6 in progress)
     Organism     Genome size   Genes
     Bacteria     1.6 Mb        1,600
     Eukaryote    13 Mb         ~6,000
     Animal       100 Mb        ~20,000
     Human        3 Gb          ~30,000?
     Individual genomes?
     Timeline axis: 1995 to 2010

     What is Systems Biology?
     “Systems biology is the study of the interactions between the components of a biological system, and how these interactions give rise to the function and behavior of that system”
     • The last decades led to a revolution in how we can examine and understand biological systems
     Characterized by
     • High-throughput assays
     • Integration of multiple forms of experiments & knowledge
     • Mathematical modeling

  2. Ask Not What Systems Biology Can Do For You…

     Why Biology for NIPS Crowd?
     • Quantity
        • Data-intense discipline: too vast for manual interpretation
     • Systematic
        • Collection of data on all genes/proteins/…
     • Multi-faceted
        • Measurements of complementary aspects of cellular function, development and disease states
        • Challenge of integration and fusion of multiple data
     Has the potential to be medically applicable!

     Flow of Information in Biology
     DNA → RNA → Protein → Phenotype
     • DNA: recipe (in safe)
     • RNA: working copy
     • Protein: the resulting dish
     • Phenotype: the review

     The “Post-Genomic Era”: Systematic is Not Just More Assays
     DNA → RNA → Protein → Phenotype
     • DNA: genomic sequences; variations within a population; …
     • RNA: quantity; structure; degradation rate; …
     • Protein: quantity; location; modifications; interactions; …
     • Phenotype: genetic interventions; environmental interventions; …

     Outline
     DNA → RNA → Protein → Phenotype
     DNA:
     • Stores genetically inherited information
     • Sequence of four nucleotide types (A, C, G, T)
     • Two complementary strands creating base pairs (bp)
     • 10^5 bp in bacteria, 3×10^9 in humans, 6×10^13 in wheat

  3. Understanding Genome Sequences
     ~3,289,000,000 characters:
     aattgtgctctgcaaattatgatagtgatctgtatttactacgtgcatat
     attttgggccagtgaatttttttctaagctaatatagttatttggacttt
     tgacatgactttgtgtttaattaaaacaaaaaaagaaattgcagaagtgt
     tgtaagcttgtaaaaaaattcaaacaatgcagacaaatgtgtctcgcagt
     cttccactcagtatcatttttgtttgtaccttatcagaaatgtttctatg
     tacaagtctttaaaatcatttcgaacttgctttgtccactgagtatatta
     tggacatcttttcatggcaggacatatagatgtgttaatggcattaaaaa
     taaaacaaaaaactgattcggccgggtacggtggctcacgcctgtaatcc
     cagcactttgggagatcgaggagggaggatcacctgaggtcaggagttac
     agacatggagaaaccccgtctctactaaaaatacaaaattagcctggcgt
     ggtggcgcatgcctgtaatcccagctactcgggaggctgaggcaggagaa
     tcgcttgaacccgggagcggaggttgcggtgagccgagatcgcaccgttg
     cactccagcctgggcgacagagcgaaactgtctcaaacaaacaaacaaaa
     aaacctgatacatggtatgggaagtacattgtttaaacaatgcatggaga
     tttaggttgtttccagtttttactggcacagatacggcaatgaatataat
     tttatgtatacattcatacaaatatatcggtggaaaattcctagaagtgg
     aatggctgggtcagtgggcattcatattgagaaattggaaggatgttgtc
     aaactctgcaaatcagagtattttagtcttaacctctcttcttcacaccc
     ttttccttggaagaaagctaaatttagacttttaaacacaaaactccatt
     ttgagacccctgaaaatctgggttcaaagtgtttgaaaattaaagcagag
     gctttaatttgtacttatttaggtataatttgtactttaaagttgttcca
     . . .
     Goal: Identify components encoded in the DNA sequence

     Open Reading Frame
     ATGCTCAGCGTGACCTCA . . . CAGCGTTAA
      M  L  S  V  T  S  . . .  Q  R STP
     • Protein-encoding DNA sequence consists of a sequence of 3-letter codons
     • Starts with the START codon (ATG)
     • Ends with a STOP codon (TAA, TAG, or TGA)

     Finding Open Reading Frames
     ATGCTCAGCGTGACCTCA . . . CAGCGTTAA
      M  L  S  V  T  S  . . .  Q  R STP
     Try all possible starting points
     • 3 possible offsets
     • 2 possible strands
     Simple algorithm finds all ORFs in a genome
     • Many of these are spurious (are not real genes)
     • How do we focus on the real ones?

     Using Additional Genomes
     Basic premise: “What is important is conserved”
     Evolution = Variation + Selection
     • Variation is random
     • Selection reflects function
     Idea:
     • Instead of studying a single genome, compare related genomes
     • A real open reading frame will be conserved

     Phylogenetic Tree of Yeasts
     S. cerevisiae, S. paradoxus, S. mikatae, S. bayanus, C. glabrata, S. castellii, K. lactis, A. gossypii, K. waltii, D. hansenii, C. albicans, Y. lipolytica, N. crassa, M. graminearum, M. grisea, A. nidulans, S. pombe
     Scale label in the figure: ~10M years
     Kellis et al, Nature 2003
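
The "try all possible starting points" idea from the ORF slides can be sketched as follows. This is a minimal illustration, not the method of Kellis et al: the function names and the `min_codons` parameter are assumptions made here, and the scan simply pairs each ATG with the next in-frame stop codon on either strand.

```python
# Minimal ORF scan: 3 reading-frame offsets x 2 strands.
# All names here (find_orfs, min_codons, ...) are hypothetical.
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the opposite strand, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_codons=2):
    """Return (strand, start, end, orf) tuples.

    Coordinates are 0-based, end-exclusive, and refer to the strand
    that was scanned (the reverse complement for strand "-").
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            start = None  # position of the first ATG in the current open frame
            for i in range(offset, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i
                elif start is not None and codon in STOP_CODONS:
                    if (i + 3 - start) // 3 >= min_codons:
                        orfs.append((strand, start, i + 3, s[start:i + 3]))
                    start = None
    return orfs


# The slide's example sequence with the ". . ." gap simply left out.
EXAMPLE = "ATGCTCAGCGTGACCTCACAGCGTTAA"
for hit in find_orfs(EXAMPLE):
    print(hit)
```

As the slides note, a scan like this reports every candidate, including spurious ones; it says nothing about which ORFs are real genes.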

  4. Evolution of Open Reading Frame
     S. cerevisiae  ATGCTCAGCGTGACCTCA . . .
     S. paradoxus   ATGCTCAGCGTGACATCA . . .
     S. mikatae     ATGCTCAGGGTGACA--A . . .
     S. bayanus     ATGCTCAGG---ACA--A . . .
     Figure annotations: conserved positions; variable positions; a deletion; frame shift changes interpretation of downstream seq

     Examples
     Figure panels: Spurious ORF; Confirmed ORF. Annotations: conserved, variable, frame shift, ATG not conserved, sequencing error.
     Greedy algorithm to find conserved ORFs surprisingly effective (> 99% accuracy) on verified yeast data
     [Kellis et al, Nature 2003]

     Defining Conservation
     Naïve approach
     • Consensus between all species
     Problem:
     • Rough grained
     • Ignores distances between species
     • Ignores the tree topology
     Goal:
     • More sensitive and robust methods
     Figure: example alignment columns labeled Conserved and Variable, with % conserved of 100, 33, 55, 55

     Probabilistic Model of Evolution
     Figure: tree over Aardvark, Bison, Chimp, Dog, Elephant
     • Random variables: sequence at current day taxa or at ancestors
     • Potentials/Conditional distribution: represent the probability of evolutionary changes along each branch

     Parameterization of Phylogenies
     Assumptions:
     • Positions (columns) are independent of each other
     • Each branch is a reversible continuous time discrete state Markov process
       $P(a \to c \mid t + t') = \sum_b P(a \to b \mid t)\, P(b \to c \mid t')$
       $P(a)\, P(a \to b \mid t) = P(b)\, P(b \to a \mid t)$
     • governed by a rate matrix $Q$
       $Q_{a,b} = \left.\frac{d}{dt} P(a \to b \mid t)\right|_{t=0}$
       $P(a \to b \mid t) = \left[e^{tQ}\right]_{a,b}$

     Conserved vs. unconserved
     Two hypotheses:
     • Conserved: short branches (fewer mutations)
     • Unconserved: long branches (more mutations)
     Use $\log \frac{P(\text{position} \mid \text{unconserved})}{P(\text{position} \mid \text{conserved})}$
     [Boffelli et al, Science 2003]
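
The conserved-versus-unconserved comparison can be illustrated with a small sketch. Several things here are assumptions rather than content from the slides: the Jukes-Cantor substitution model (chosen because its transition probabilities have a closed form, so no matrix exponential is needed), the toy tree `TREE` with made-up branch lengths, and the `slow` and `fast` scale factors that stand in for the two hypotheses. The likelihood of one alignment column is computed with Felsenstein's pruning recursion.

```python
import math

BASES = "ACGT"


def jc_prob(a, b, t):
    """P(a -> b | t) under Jukes-Cantor with every off-diagonal rate 1/3."""
    decay = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * decay if a == b else 0.25 - 0.25 * decay


def partial_likelihood(node, column, scale):
    """Return {base: P(observed leaves below node | node has that base)}.

    A leaf is a species name; an internal node is a list of
    (child, branch_length) pairs. Branch lengths are multiplied by scale.
    """
    if isinstance(node, str):
        return {a: 1.0 if column[node] == a else 0.0 for a in BASES}
    result = {a: 1.0 for a in BASES}
    for child, length in node:
        below = partial_likelihood(child, column, scale)
        for a in BASES:
            result[a] *= sum(jc_prob(a, b, scale * length) * below[b]
                             for b in BASES)
    return result


def column_likelihood(tree, column, scale):
    """P(column | tree, scale) with a uniform distribution at the root."""
    root = partial_likelihood(tree, column, scale)
    return sum(0.25 * root[a] for a in BASES)


def log_ratio(tree, column, slow=0.3, fast=3.0):
    """log P(column | unconserved) / P(column | conserved)."""
    return math.log(column_likelihood(tree, column, fast)
                    / column_likelihood(tree, column, slow))


# Hypothetical topology and branch lengths, for illustration only.
TREE = [
    ([("cer", 0.1), ("par", 0.1)], 0.1),
    ([("mik", 0.2), ("bay", 0.2)], 0.1),
]

identical = {"cer": "A", "par": "A", "mik": "A", "bay": "A"}
variable = {"cer": "A", "par": "C", "mik": "G", "bay": "T"}
print(log_ratio(TREE, identical), log_ratio(TREE, variable))
```

With these settings an identical column gets a negative score (it favors the short-branch hypothesis) and a fully variable column gets a positive one. Because positions are assumed independent, scores for a region would simply be summed over its columns.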

  5. Genes Are Better Conserved
     Figure axes: log Fast/Slow; % conserved
     [Boffelli et al, Science 2003]

     Challenges
     Other types of genomic elements
     • Small polypeptides (peptide hormones, neuropeptides)
     • RNA coding genes
        • rRNA, tRNA, snoRNA…
        • miRNA
     • Regulatory regions

     Transcription Factor Binding Sites
     *Essential Cell Biology; p.268

     Regulatory Elements
     • Relatively short words (6-20 bp)
     • Recognition is not perfect
        • Binding sites allow variations
     • Often conserved

     Challenges (continued)
     Recognition of elements without comparisons
     • Clearly the sequence contains enough information to “parse” it within the living cell

     Outline
     DNA → RNA → Protein → Phenotype
     RNA:
     • Copied from DNA template
     • Conveys information (mRNA)
     • Can also perform function (tRNA, rRNA, …)
     • Single stranded, four nucleotide types (A, C, G, U)
     • For each expressed gene there can be as few as 1 molecule and up to 10,000 molecules per cell
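
One common way to capture "short words that allow variations" is a position weight matrix (PWM). The slides do not name this technique, so the sketch below is an assumption: the example binding sites, the pseudocount, the uniform background, and the threshold are all invented for illustration.

```python
import math

BASES = "ACGT"


def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Build a log-odds matrix (one dict per position) from aligned sites."""
    pwm = []
    for pos in range(len(sites[0])):
        counts = {b: pseudocount for b in BASES}
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        pwm.append({b: math.log2((counts[b] / total) / background)
                    for b in BASES})
    return pwm


def score(pwm, word):
    """Sum of per-position log-odds for a word as long as the matrix."""
    return sum(column[base] for column, base in zip(pwm, word))


def scan(pwm, seq, threshold):
    """Return (position, word, score) for every window scoring >= threshold."""
    width = len(pwm)
    hits = []
    for i in range(len(seq) - width + 1):
        word = seq[i:i + width]
        s = score(pwm, word)
        if s >= threshold:
            hits.append((i, word, s))
    return hits


# Hypothetical example sites; not taken from the slides.
SITES = ["TGACTC", "TGACTA", "TGAGTC", "TGACTC"]
PWM = build_pwm(SITES)
for hit in scan(PWM, "CCTGACTCAA", threshold=6.0):
    print(hit)
```

A site that differs from the consensus at one tolerated position still scores well, while an unrelated word scores below zero, which matches the observation that recognition is not perfect.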

  6. Gene Expression
     Figure labels: Transcription; Translation
     • Same DNA content
     • Very different phenotype
     • Difference is in regulation of expression of genes

     High Throughput Gene Expression
     Extract RNA → Microarray → expression levels of 10,000s of genes in one experiment

     Dynamic Measurements
     Figure: matrix of Genes by Conditions
     • Time courses
     • Different perturbations (genetic & environmental)
     • Biopsies from different patient populations
     • …
     Gasch et al. Mol. Cell 2001

     Expression: Supervised Approaches
     Labeled samples → Feature selection + Classification → Classifier confidence
     P-value ≤ 0.027
     • Potential diagnosis/prognosis tool
     • Characterizes the disease state ⇒ insights about underlying processes
     Segman et al, Mol. Psych. 2005

     Expression: Unsupervised
     Figure labels: PCA; Cluster
     Eisen et al. PNAS 1998; Alter et al, PNAS 2000

     Papers
     • Compendia: 26 datasets from Whitehead and Stanford
     Figure labels: Various tumors; Viral infection; Stimulated PBMC; B lymphoma; Breast cancer; Stimulated immune; Fibroblast EWS/FLI; Prostate cancer; Fibroblast infection; Neuro tumors; Fibroblast serum; NCI60; Gliomas; HeLa cell cycle; Leukemia; Lung cancer; Liver cancer
     Segal et al Nat. Gen. 2004
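
The supervised pipeline (labeled samples, feature selection, classification) can be sketched in a few lines. The slides do not say which methods were used, so everything below is an assumption: genes are ranked by a simple signal-to-noise score, and a nearest-centroid rule does the classification. The toy expression values and labels are invented.

```python
from statistics import mean, pstdev


def rank_genes(samples, labels):
    """Order gene indices by |difference of class means| / summed spread."""
    classes = sorted(set(labels))
    if len(classes) != 2:
        raise ValueError("this sketch handles exactly two classes")
    scored = []
    for g in range(len(samples[0])):
        first = [s[g] for s, l in zip(samples, labels) if l == classes[0]]
        second = [s[g] for s, l in zip(samples, labels) if l == classes[1]]
        noise = pstdev(first) + pstdev(second) + 1e-9  # avoid division by zero
        scored.append((abs(mean(first) - mean(second)) / noise, g))
    return [g for _, g in sorted(scored, reverse=True)]


def train_centroids(samples, labels, genes):
    """Mean expression of the selected genes within each class."""
    centroids = {}
    for c in set(labels):
        rows = [s for s, l in zip(samples, labels) if l == c]
        centroids[c] = [mean(r[g] for r in rows) for g in genes]
    return centroids


def classify(sample, centroids, genes):
    """Assign the class whose centroid is closest in squared distance."""
    def distance(c):
        return sum((sample[g] - m) ** 2 for g, m in zip(genes, centroids[c]))
    return min(centroids, key=distance)


# Toy data: rows are samples, columns are genes (values are made up).
SAMPLES = [[1.0, 5.0, 2.0], [1.2, 5.2, 1.0], [0.9, 1.0, 2.1], [1.1, 0.8, 0.9]]
LABELS = ["case", "case", "control", "control"]

top_genes = rank_genes(SAMPLES, LABELS)[:1]
centroids = train_centroids(SAMPLES, LABELS, top_genes)
print(top_genes, classify([1.0, 4.7, 1.5], centroids, top_genes))
```

In practice the gene ranking must be recomputed inside each cross-validation fold; selecting features on the full data set first would make the reported accuracy optimistic.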
