Comparative Genomics: Computational Challenges, by Bernard M.E. Moret


  1. Comparative Genomics: Computational Challenges
     Bernard M.E. Moret, Laboratory for Computational Biology and Bioinformatics, EPFL
     Nantes, 6/8/09

  2. Overview
     - Comparative approaches
     - The genome and its evolution
     - High-throughput data and computation
     - What do we want to know?
     - Comparing two genomes
     - Comparing multiple genomes
     - Ancestral reconstruction
     - Challenges

  4. Comparative Approaches
     "Nothing in biology makes sense except in the light of evolution." Th. Dobzhansky, The American Biology Teacher, 1973.
     - an evolutionary perspective requires models of evolution
     - a model of evolution requires data that reveals evolution
     - data that reveals evolution must come from several organisms or tissues
     - hence working in the light of evolution requires comparative approaches
     From the point of view of experimentalists and medical researchers:
     - some organisms, incl. humans, are difficult to study in a lab setting
     - some experiments cannot be performed on some organisms, incl. humans, for practical or ethical reasons
     - hence learning about these organisms is best done by studying others

  5. Characteristics of Comparative Approaches
     Comparative approaches
     - rely on identification of conserved patterns
     - develop evolutionary models for observed changes
     - use conserved patterns for datamining and evolutionary models for analysis of mined data
     Translated to comparative genomics:
     - data: whole-genome sequences
     - conserved patterns: subsequences, distribution statistics, or combinations
     - models: duplication/loss/rearrangement at the genome level, mutation/indel for nucleotides
     - datamining: important components, anchors, clusters, syntenic blocks

  6. Comparative Genomics Vocabulary
     - syntenic block: a conserved block of genes (10 kbp to 1 Mbp), i.e., a conserved pattern subject to microrearrangements; the term was originally used in genetics to denote colocation on the same chromosome
     - genomic alignment: sequence-level alignment of complete genomes, with block-level rearrangements, duplications, and losses
     - positive (Darwinian) selection: selects for favorable traits; accelerates observed change in affected regions
     - negative (filtering) selection: selects against changes of most kinds; slows down observed change in affected regions
     - ancestral reconstruction: inference of the putative contents and arrangement of the genome of a common ancestor
     - genomic signature: originally, the distribution of dinucleotide frequencies; now, a characterization of common patterns in a group of genomes
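
The original sense of "genomic signature" (a distribution of dinucleotide frequencies) is simple enough to sketch in code. The following is a minimal illustration, not a method taken from the talk: the function name `dinucleotide_frequencies` and the choice to count overlapping pairs and skip ambiguous bases are assumptions made for this example.

```python
from collections import Counter


def dinucleotide_frequencies(seq: str) -> dict[str, float]:
    """Relative frequency of each overlapping dinucleotide in seq.

    Assumption for this sketch: pairs containing anything other than
    A, C, G, T (e.g. N) are skipped rather than counted.
    """
    seq = seq.upper()
    pairs = [seq[i:i + 2] for i in range(len(seq) - 1)]
    pairs = [p for p in pairs if set(p) <= set("ACGT")]
    counts = Counter(pairs)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {pair: n / total for pair, n in sorted(counts.items())}


if __name__ == "__main__":
    # Toy sequence; a real signature would be computed over a whole genome.
    for pair, freq in dinucleotide_frequencies("ACGTACGT").items():
        print(pair, round(freq, 3))
```

Two genomes could then be compared by a distance between their frequency vectors; which distance is appropriate is left open here.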

  8. Evolution of the Genome
     What evolutionary events affect the genome?
     - nucleotide-level: "classical" sequence evolution (mutations and indels)
     - genomic rearrangements: inversions, transpositions, translocations, and chromosomal fusion and fission
     - duplication: gene retrotransposition, tandem duplication, segmental duplication, whole-genome duplication
     - loss: point mutation, segmental deletion, neofunctionalization
     - recombination: meiotic recombination, hybridization, lateral gene transfer
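
To make two of the rearrangement events concrete, here is a small sketch that models a chromosome as a list of signed integers, where each integer is a gene and its sign is the strand. This signed-gene-order representation and the function names `inversion` and `transposition` are assumptions chosen for illustration; the slide itself does not prescribe a data structure.

```python
def inversion(genome: list[int], i: int, j: int) -> list[int]:
    """Reverse the segment genome[i:j] and flip the strand of each gene in it."""
    return genome[:i] + [-g for g in reversed(genome[i:j])] + genome[j:]


def transposition(genome: list[int], i: int, j: int, k: int) -> list[int]:
    """Move the segment genome[i:j] to just before the gene at index k.

    Requires i <= j <= k; orientation of the moved genes is unchanged.
    """
    if not (0 <= i <= j <= k <= len(genome)):
        raise ValueError("need 0 <= i <= j <= k <= len(genome)")
    return genome[:i] + genome[j:k] + genome[i:j] + genome[k:]


if __name__ == "__main__":
    genome = [1, 2, 3, 4, 5]
    print(inversion(genome, 1, 4))         # [1, -4, -3, -2, 5]
    print(transposition(genome, 0, 2, 4))  # [3, 4, 1, 2, 5]
```

Translocations, fusions, and fissions would need a multi-chromosome representation (for example, a list of such lists), which this sketch omits.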

  9. Evolutionary Models
     And how well understood are they?
     - nucleotide-level: well established models with good statistics
     - genomic rearrangements: enormous work in the last 10 years, but still parameter-poor
     - duplication/loss: established work in lineage sorting (divergent gene evolution due to paralogs), much attention to whole-genome duplication, just starting on segmental duplications
     - recombination: established work in population genetics, much work on identifying lateral gene transfer, detailed work on recombination just starting

  11. High-throughput data
      Slowly pervading laboratory-based biology:
      - sequencing: the original high-throughput data source, now easily the most economical high-tech lab instrument
      - gene expression: microarrays and their ilk, now inexpensive and in very widespread use
      - transcription profiling: ChIP-chip, ChIP-seq, and future products, soon to replace microarrays
      - mass spectrometry: for protein analysis and sequencing; now also for mixed samples (metaproteomics)
      - SNP assays: for precise genotyping of humans
      - other domains: cell signalling, metabolomics (e.g., fluxes), 3D imaging, time series, etc.

  12. High-throughput sequencing
      Developed for the human genome project, now also used for:
      - de novo sequencing: still the most challenging
      - resequencing: verify base calls, test assemblies, etc.
      - deep sequencing: dense sampling or high coverage
      - metagenomics: random sampling of microbial communities
      Current technologies (454, Illumina) generate around 4 Gbp per half-day run.
      Next-gen technologies may yield over 20 Gbp per one-hour run, at better than 50x coverage, with very short (20 bp) fragments.
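
The throughput figures on this slide can be turned into rough coverage numbers. The sketch below uses the slide's yields (about 4 giga base pairs per half-day run, about 20 giga base pairs per hour, 20-base reads) and one loudly labeled assumption that is not on the slide: a target genome of about 3 Gbp, roughly the size of a mammalian genome. Coverage is computed as total sequenced bases divided by genome length.

```python
def coverage(total_bases: float, genome_size: float) -> float:
    """Average coverage: sequenced bases divided by genome length."""
    return total_bases / genome_size


# ASSUMPTION (not from the slide): a ~3 Gbp target genome.
GENOME_SIZE = 3e9

# Figures quoted on the slide.
current_run_bases = 4e9         # ~4 Gbp per half-day run
next_gen_bases_per_hour = 20e9  # ~20 Gbp per one-hour run
next_gen_read_length = 20       # very short fragments, in bases

reads_per_hour = next_gen_bases_per_hour / next_gen_read_length
hours_for_50x = 50 * GENOME_SIZE / next_gen_bases_per_hour

if __name__ == "__main__":
    print(f"one current run: {coverage(current_run_bases, GENOME_SIZE):.2f}x")
    print(f"next-gen reads per hour: {reads_per_hour:.1e}")
    print(f"hours of next-gen output for 50x: {hours_for_50x:.1f}")
```

Under that assumed genome size, a single current run gives only a little over 1x coverage, while the projected technology would produce on the order of a billion reads per hour for downstream software to process.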

  13. Can computation keep up?
      Computer power still follows Moore’s law, doubling every 12–15 months. However:
      - That power is getting harder to use (parallelism is hard to exploit).
      - Data accumulates faster than Moore’s law (sequence data alone doubles every year).
      - This comparison presupposes a linear relationship, but most genomic analysis algorithms are much slower.
      - High-performance computing does not help much: the fastest machines can provide only a 10^3–10^4 speedup.
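
The gap between the two doubling rates is easy to quantify. The sketch below compares data that doubles every 12 months with compute that doubles every 15 months (the slow end of the slide's range), and also shows the work required by a quadratic-time algorithm. The quadratic case is an illustrative assumption standing in for "much slower than linear"; the slide does not name a specific complexity.

```python
def growth(months: float, doubling_months: float) -> float:
    """Growth factor after `months` for a quantity doubling every `doubling_months`."""
    return 2 ** (months / doubling_months)


def gap_table(years_list=(1, 5, 10)) -> list[tuple[int, float, float, float]]:
    """Rows of (years, data growth, compute growth, quadratic work / compute)."""
    rows = []
    for years in years_list:
        months = 12 * years
        data = growth(months, 12)     # sequence data doubles every year (slide)
        compute = growth(months, 15)  # slow end of the 12-15 month range (slide)
        # ASSUMPTION: an O(n^2) algorithm, so work grows as data squared.
        quadratic_shortfall = data ** 2 / compute
        rows.append((years, data, compute, quadratic_shortfall))
    return rows


if __name__ == "__main__":
    for years, data, compute, shortfall in gap_table():
        print(f"{years:>2} y: data x{data:,.0f}, compute x{compute:,.0f}, "
              f"linear shortfall x{data / compute:.1f}, "
              f"quadratic shortfall x{shortfall:,.0f}")
```

Under these assumptions a linear-time analysis falls behind by a factor of 4 over ten years, whereas a quadratic one falls behind by a factor of several thousand, which is comparable to the entire 10^3–10^4 speedup attributed to the fastest machines.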

  14. Genome-scale computing
      Running time is only one facet of the problem.
      - Comparing several genomes of a few Gbp each requires a lot of memory: at least 128GB per node.
      - Available and not too expensive: a 16-core compute node with 128GB memory and 500GB disk can be had for $20K.
      - Still rare: most compute clusters have "thin" nodes (2–8GB of memory), unsuited to whole-genome analysis.
      - 2010 architectures will pack 64–128 cores per node, with 0.5–1TB memory. That is great for comparing a few genomes, but by then we will have 100s of vertebrate genomes...
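
A back-of-the-envelope estimate shows how a requirement on the order of 128GB can arise. Everything numeric in this sketch is an assumption made for illustration, not the speaker's calculation: genomes of 3 Gbp, one byte per base for the sequence, and eight bytes per base for a position index such as a suffix array with 64-bit entries. The function name `index_memory_bytes` is likewise hypothetical.

```python
def index_memory_bytes(n_genomes: int,
                       genome_size: float,
                       bytes_per_base: float = 1.0,
                       index_bytes_per_base: float = 8.0) -> float:
    """Rough memory needed to hold n genomes plus a per-base index in RAM.

    ASSUMPTIONS: 1 byte per base for the sequence and 8 bytes per base
    for the index; real tools vary widely in both directions.
    """
    return n_genomes * genome_size * (bytes_per_base + index_bytes_per_base)


GB = 1e9  # decimal gigabytes

if __name__ == "__main__":
    for n in (2, 5, 100):
        need_gb = index_memory_bytes(n, 3e9) / GB
        print(f"{n:>3} genomes of 3 Gbp: ~{need_gb:,.0f} GB")
```

With these assumed constants, five genomes already need roughly 135 GB, and a hundred genomes need several terabytes, which is consistent with the slide's concern that even 0.5–1TB nodes will not suffice for hundreds of vertebrate genomes.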

  16. Basic annotation per genome
      We want to identify:
      - all coding genes (or exons);
      - all noncoding genes;
      - gene families;
      - SINEs, LINEs, and other repeat elements;
      - regions under positive, neutral, or negative selection.
      Beyond this basic level, we want to identify
      - gene clusters, operons, alternative splicing scenarios, etc.;
      - gene function.

  17. Comparative annotation
      Using pairwise comparative approaches, we can ask for
      - all pairwise homologies;
      - orthology and paralogy relationships within gene families (within limits due to lack of phylogenetic information);
      - syntenic blocks;
      - mapping of the syntenic blocks between the two genomes (simple translocations and transpositions, with inversions);
      - translation of functional annotations from each genome into the other.
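
As a rough illustration of what identifying and mapping blocks between two genomes involves, the sketch below splits one gene order into maximal runs whose adjacencies are conserved in the other, allowing the run to appear inverted. It is a deliberately simplified stand-in: it assumes both genomes contain the same genes exactly once, written as signed integers, and it tolerates no microrearrangements, gene loss, or duplication, all of which real syntenic-block methods must handle. The function name `conserved_blocks` is hypothetical.

```python
def conserved_blocks(a: list[int], b: list[int]) -> list[list[int]]:
    """Split gene order `a` into maximal runs that are also contiguous in `b`.

    A pair of neighbours (x, y) in `a` is conserved if `b` contains x
    followed by y, or -y followed by -x (the same adjacency read on the
    opposite strand, as after an inversion).
    """
    adjacencies = set()
    for x, y in zip(b, b[1:]):
        adjacencies.add((x, y))
        adjacencies.add((-y, -x))

    blocks: list[list[int]] = []
    current = a[:1]
    for x, y in zip(a, a[1:]):
        if (x, y) in adjacencies:
            current.append(y)       # extend the current block
        else:
            blocks.append(current)  # breakpoint: close the block
            current = [y]
    if current:
        blocks.append(current)
    return blocks


if __name__ == "__main__":
    genome_a = [1, 2, 3, 4, 5, 6]
    genome_b = [1, 2, -5, -4, -3, 6]  # genes 3..5 inverted relative to genome_a
    print(conserved_blocks(genome_a, genome_b))  # [[1, 2], [3, 4, 5], [6]]
```

Each boundary between consecutive blocks corresponds to a breakpoint between the two gene orders; here the single inversion produces two breakpoints and three blocks.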
