Challenges of ancient genomics and pan-genomics
Kay Nieselt
Center for Bioinformatics Tübingen, University of Tübingen
Overview
- Introduction:
- ancient DNA
- (microbial) paleogenomics
- Challenges of computational paleogenomics
Ancient DNA - Paleogenomics
With the advent of NGS, studying extinct organisms has become possible.
Most prominent extinct organisms are:
- Neandertals
- Other humans, e.g. Denisovans
- Mammoth
- Lately also: bacteria and viruses
Field of paleogenomics: reconstruction and analysis of ancient genomes
Microbial Paleogenomics
Motivation:
- Emergence of human diseases
- Evolution of Bacteria and Viruses
- Co-evolution with host
Computational Paleogenomics
- 1. Genome assembly:
- reference-based (reconstruction of the genome via mapping)
- de novo
- 2. SNP calling
- 3. Genome comparison
- 4. Phylogeny reconstruction
Ancient DNA - Workflow
Source of figure: Janet Kelso, keynote, BioVis 2016
Ancient DNA - Issues
- Short fragments:
mean fragment length between 60 and 150 bp
- Damaged:
aDNA is chemically damaged -> C->T conversions at the 5' end (and G->A conversions at the 3' end) of fragments
- Metagenomes:
sample contains complex background (mixture of ancient and modern DNA)
- Contamination:
nature of complex background can also be due to contamination
- Low amount of endogenous DNA
Ancient DNA - Approaches
- Short fragments:
Produce paired-end reads longer than the mean fragment length; after sequencing, the forward and reverse reads are merged into a single read
[Figure: forward read and reverse read with their overlap, combined into the merged read]
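The merging step can be sketched in a few lines of Python. This is a minimal illustration, not the Clip&Merge implementation: it assumes adapters have already been clipped, looks for the longest exact suffix/prefix overlap, and ignores base qualities. The function names and the `min_overlap` parameter are hypothetical.

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(comp[base] for base in reversed(seq))


def merge_pair(forward: str, reverse: str, min_overlap: int = 10):
    """Merge a read pair into one read, or return None if no overlap is found.

    Assumption: adapters are already removed, so the 3' end of the forward
    read overlaps the 5' end of the reverse-complemented reverse read.
    """
    rc = reverse_complement(reverse)
    # Try the longest possible overlap first and shrink until one matches.
    for overlap in range(min(len(forward), len(rc)), min_overlap - 1, -1):
        if forward[-overlap:] == rc[:overlap]:
            return forward + rc[overlap:]
    return None
```

A real merger would tolerate mismatches in the overlap and pick the base with the higher quality score; the exact-match rule here only keeps the sketch short.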
Ancient DNA - Approaches
- Damage:
damage patterns are used for authentication; afterwards, samples are often treated with UDG (uracil-DNA glycosylase) to remove the damage
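How a damage pattern supports authentication can be illustrated with a small sketch that computes the C->T mismatch frequency per position from the 5' end of mapped reads. This is only the underlying idea, not the MapDamage implementation; it assumes ungapped read/reference pairs already oriented 5' to 3', and the function name `c_to_t_profile` is hypothetical.

```python
def c_to_t_profile(pairs, n_positions=5):
    """Fraction of reference C positions read as T, per 5' position.

    pairs: iterable of (read, reference) strings, aligned without gaps.
    Returns a list of length n_positions; None where no reference C was seen.
    """
    ref_c = [0] * n_positions      # reference C observed at this position
    c_to_t = [0] * n_positions     # ... of which the read shows T
    for read, ref in pairs:
        for i in range(min(n_positions, len(read), len(ref))):
            if ref[i] == "C":
                ref_c[i] += 1
                if read[i] == "T":
                    c_to_t[i] += 1
    return [c_to_t[i] / ref_c[i] if ref_c[i] else None
            for i in range(n_positions)]
```

For authentic aDNA one would expect the frequency to be highest at the first position and to decay towards the interior of the fragment; a flat profile hints at modern DNA.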
Ancient DNA - Issues
- Metagenomes:
sample contains complex background (mixture of ancient and modern DNA)
- Possible approach: use MALT¹ to characterize taxonomic content
- Low amount of endogenous DNA:
Approach: specific enrichment for target species
Problems: modern reference, no de novo assembly, …
¹ Herbig et al., bioRxiv 2016
EAGER¹ – a fully automated (ancient) genome reconstruction pipeline
[Figure: EAGER pipeline overview]
- Preprocessing (FastQ): FastQC, Clip&Merge
- Read Mapping (FastQ to SAM/BAM): BWA, BWA-mem, …, CircularMapper, Samtools, DeDup, QualiMap, MapDamage, Preseq
- Genotyping (VCF, FastA): GATK (Preprocessing, HaplotypeCaller, UnifiedGenotyper, VariantFiltration), VCF2Genome
- Further tools shown: Schmutzi, ANGSD (genotype likelihoods)
¹ Peltzer et al., Genome Biol. 2016
Challenge 1: Genome Reconstruction
Question: how to reconstruct the genome of the DNA of interest from the raw reads
- by mapping against a reference (only one?), or
- by de novo assembly, or
- by a hybrid approach?
Challenge 1: Genome Reconstruction
Approaches:
- by mapping against a reference:
Problems: 1) single-reference mapping, 2) for aDNA only modern references used
- by de novo assembly:
Problems: low coverage, almost never mate pairs, ...
Possible approach: MADAM¹ – improved ancient DNA genome assembly
- by a hybrid of both: we are not aware of how this could be done
¹ Seitz and Nieselt, PeerJ Preprints 2016
Challenge 2: Calling SNPs
To call SNPs after mapping, the typical (best practice) approach is:
- Apply tools such as GATK, freebayes or ANGSD to mapping assemblies (i.e., BAM files)
Challenge: how to efficiently call SNPs from de novo assemblies when the input is low-coverage (ancient) DNA?
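To make the low-coverage problem concrete, here is a deliberately naive sketch of mapping-based calling with a coverage and an allele-fraction threshold. It is not how GATK, freebayes or ANGSD work (those use probabilistic models); the function `call_positions`, its thresholds and the pileup representation (one string of observed bases per reference position) are assumptions made for illustration.

```python
from collections import Counter


def call_positions(reference, pileup, min_coverage=3, min_fraction=0.9):
    """Return (consensus, snps) from per-position base observations.

    Positions with too few reads or no clear majority base become 'N'.
    snps is a list of (position, reference_base, called_base), 0-based.
    """
    consensus = []
    snps = []
    for pos, (ref_base, bases) in enumerate(zip(reference, pileup)):
        counts = Counter(b for b in bases if b in "ACGT")
        depth = sum(counts.values())
        if depth < min_coverage:
            consensus.append("N")      # not enough evidence to call
            continue
        base, count = counts.most_common(1)[0]
        if count / depth < min_fraction:
            consensus.append("N")      # ambiguous, e.g. damage or mixture
            continue
        consensus.append(base)
        if base != ref_base:
            snps.append((pos, ref_base, base))
    return "".join(consensus), snps
```

With low-coverage aDNA, many positions fall below `min_coverage` and end up as 'N', which is exactly the missing-data problem that reappears in the phylogenetic analysis.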
Challenge 3: Genome Comparison
Comparison of ancient and modern genomes:
- Large-scale variations: genomic rearrangements (translocations, inversions)
- Small-scale variations: gene content, insertions, deletions, SNPs
Challenge 3: Genome Comparison
Ancient genomes: mostly built from a single common reference
Modern genomes: assembled
Challenge: how to compare?
Challenge 3: Genome Comparison
Comparative analysis based on genomic positions is challenging due to different coordinate systems across genomes.
Possible approaches:
- compare genomes via one specific reference
But: genomic regions that cannot be aligned to the reference are lost.
- or build a meta-reference that contains the superset of all genomes:
SuperGenome¹ (others call this a pan-genome)
¹ Herbig et al., Bioinformatics 2012
SuperGenome
SuperGenome: a common coordinate system of all aligned genomes, independent of a prechosen reference genome, together with an injective mapping of each genome onto the SuperGenome
Input: whole-genome alignment (WGA) of all known genomes of the species of interest
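The idea of an injective mapping onto a common coordinate system can be sketched as follows. This is a strong simplification and not the published SuperGenome implementation: it assumes the WGA is a single collinear block given as gapped strings of equal length, so alignment columns serve directly as SuperGenome coordinates, and it ignores rearrangements. The function names are hypothetical.

```python
def supergenome_maps(alignment):
    """Map each genome position to its alignment column (0-based).

    alignment: dict of genome name -> gapped sequence ('-' marks a gap).
    Returns dict of genome name -> list where entry i is the column
    (SuperGenome coordinate) of genome position i.
    """
    if len({len(row) for row in alignment.values()}) != 1:
        raise ValueError("all aligned sequences must have the same length")
    return {name: [col for col, ch in enumerate(row) if ch != "-"]
            for name, row in alignment.items()}


def translate(maps, source, position, target):
    """Translate a position from one genome to another via the SuperGenome.

    Returns None if the target genome has a gap at that column.
    A linear search is used for brevity; real data would need an index.
    """
    column = maps[source][position]
    try:
        return maps[target].index(column)
    except ValueError:
        return None
```

Because each genome position receives its own column, the mapping is injective by construction, and annotations from different genomes can be compared by their columns rather than via one chosen reference.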
The SuperGenome and Pan-Genome
A nice "side effect" of the SuperGenome: it allows
- consistent assignment of coordinates to genomic annotations (e.g. genes)
- straightforward determination of the pan-genome* by determining the orthologs from overlapping coordinates in the SuperGenome
(* For us the pan-genome (aka supra-genome) is the full complement of all genes of a clade, e.g. a species)
Challenge 3A: WGA
Sheer number of available genomes of a single species: currently up to thousands
Challenge: how to compute a WGA of thousands of genomes?
Challenge 4: Phylogenetics
Phylogenetic trees from aDNA and modern genomes are reconstructed
- 1. to assess the history of evolution,
- 2. to date the 'root' and/or the emergence of specific clades.
Challenge 4: Phylogenetics
How reliable are phylogenetic trees built from genomic data reconstructed from short reads?
Typical workflow:
- 1. map short reads to a single reference sequence,
- 2. extract single nucleotide polymorphisms (SNPs),
- 3. infer phylogenetic tree from the aligned SNP
positions.
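Steps 2 and 3 of this workflow rely on a SNP alignment, i.e. only the variable columns of the reference-based alignment. A minimal sketch of that extraction is given below; it assumes all sequences were reconstructed against the same reference and therefore share coordinates, and it treats 'N' and '-' as missing data. The function name `snp_columns` is hypothetical.

```python
def snp_columns(alignment):
    """Extract variable columns from a reference-based alignment.

    alignment: dict of sample name -> sequence, all of equal length.
    Returns (columns, snp_alignment): the 0-based variable columns and
    the per-sample sequences restricted to those columns.
    """
    rows = list(alignment.values())
    variable = []
    for col in range(len(rows[0])):
        # Missing data does not count as an allele.
        observed = {row[col] for row in rows} - {"N", "-"}
        if len(observed) > 1:
            variable.append(col)
    snp_alignment = {name: "".join(seq[c] for c in variable)
                     for name, seq in alignment.items()}
    return variable, snp_alignment
```

Note that discarding the invariant columns is the step whose effect on tree inference is questioned in the slides on biased phylogenies.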
Biased phylogenies? - I
- 1. Source of common alignment?
Reads mapped to one reference -> bias?
Possible answer: REALPHY¹ (Reference sequence Alignment-based Phylogeny builder)
¹ Bertels et al., Mol. Biol. Evol. 2014
Biased phylogenies? - II
- 2. SNPs versus whole genome-based phylogeny?
Bias of ML trees reconstructed from SNP-only positions shown by Bertels et al.¹
Possible approach to further investigate: TreeToReads², a pipeline for simulating raw reads from phylogenies
¹ Bertels et al., Mol. Biol. Evol. 2014
² McTavish et al., bioRxiv 2016
Biased phylogenies? - III
- 3. Influence of 'Ns' on phylogeny?
Many aDNA samples suffer from a low amount of endogenous DNA -> low-coverage genomes -> missing positions ('Ns')
Question / Challenges:
- impact of 'complete column deletion' versus 'partial column deletion'
- threshold of coverage for a 'good' phylogeny
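The two deletion strategies can be expressed with a single threshold, as in the sketch below. It assumes the usual meaning of the terms: complete deletion drops every column that contains any missing character, partial deletion drops a column only if the fraction of missing characters exceeds a cutoff. The function name and the `max_missing_fraction` parameter are hypothetical.

```python
def filter_columns(alignment, max_missing_fraction=0.0):
    """Drop alignment columns with too many missing characters ('N' or '-').

    max_missing_fraction=0.0 corresponds to complete column deletion;
    larger values (e.g. 0.5) correspond to partial column deletion.
    """
    rows = list(alignment.values())
    n_samples = len(rows)
    keep = [col for col in range(len(rows[0]))
            if sum(row[col] in "N-" for row in rows) / n_samples
            <= max_missing_fraction]
    return {name: "".join(seq[c] for c in keep)
            for name, seq in alignment.items()}
```

Running a tree inference on the output for several cutoffs would be one way to study how strongly the 'Ns' of low-coverage samples influence the topology.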
Biased phylogenies? - IV
- 4. SNP-based phylogenies without alignment and/or genome reconstruction?
Possible approach: kSNP3.0³
- 5. General challenge: alignment-free versus alignment-based phylogenies?⁴
³ Gardner et al., Bioinformatics 2015
⁴ Haubold, Brief. Bioinformatics 2013
Summary:
Challenges of ancient (pan-)genomics:
- 1. Reconstruction of ancient genomes from bacteria (or viruses) via mapping, assembly or a hybrid approach
- 2. SNP calling of low-coverage genomes
- 3. WGA of ancient and modern genomes
- 4. Phylogenetic tree reconstruction from ancient and modern genomes
References
Bertels F, Silander OK, Pachkov M, Rainey PB & van Nimwegen E (2014). Automated reconstruction of whole-genome phylogenies from short-sequence reads. Molecular Biology and Evolution, 31, 1077-1088.
Gardner SN, Slezak T & Hall BG (2015). kSNP3.0: SNP detection and phylogenetic analysis of genomes without genome alignment or reference genome. Bioinformatics, 31, 2877.
Haubold B (2014). Alignment-free phylogenetics and population genetics. Briefings in Bioinformatics, 15, 407-418.
Herbig A, Jäger G, Battke F & Nieselt K (2012). GenomeRing: alignment visualization based on SuperGenome coordinates. Bioinformatics, 28, i7-i15.
Herbig A, Maixner F, Bos KI, Zink A, Krause J & Huson DH (2016). MALT: Fast alignment and analysis of metagenomic DNA sequence data applied to the Tyrolean Iceman. bioRxiv, 050559.
McTavish EJ, Pettengill J, Davis S, Rand H, Strain E, Allard M & Timme RE (2016). TreeToReads - a pipeline for simulating raw reads from phylogenies. bioRxiv, 037655.
Peltzer A, Jäger G, Herbig A, Seitz A, Kniep C, Krause J & Nieselt K (2016). EAGER: efficient ancient genome reconstruction. Genome Biology, 17, 1.
Seitz A & Nieselt K (2016). Improving ancient genome assembly. PeerJ Preprints.