Challenges of ancient genomics and pan-genomics - Kay Nieselt



SLIDE 1

Challenges of ancient genomics and pan-genomics

Kay Nieselt Center for Bioinformatics Tübingen University of Tübingen

SLIDE 2

Overview

  • Introduction:
      • ancient DNA
      • (microbial) paleogenomics
  • Challenges of computational paleogenomics

SLIDE 3

Ancient DNA - Paleogenomics

With the advent of next-generation sequencing (NGS), studying extinct organisms has become possible. The most prominent extinct organisms are:

  • Neandertals
  • Other humans, e.g. Denisovans
  • Mammoth
  • Lately also: bacteria and viruses

Field of Paleogenomics: reconstruction and analysis of ancient genomes

SLIDE 4

Microbial Paleogenomics

Motivation:

  • Emergence of human diseases
  • Evolution of Bacteria and Viruses
  • Co-evolution with host

SLIDE 5

Computational Paleogenomics

  • 1. Genome assembly:
      • reference-based (reconstruction of the genome via mapping)
      • de novo
  • 2. SNP calling
  • 3. Genome comparison
  • 4. Phylogeny reconstruction

SLIDE 6

Ancient DNA - Workflow

[Figure: ancient DNA workflow]

Source of Figure: Janet Kelso, Keynote BioVis 2016

SLIDE 7

Ancient DNA - Issues

  • Short fragments: mean fragment length between 60 and 150 bp
  • Damage: aDNA is chemically damaged, which shows up as C->T conversions at the 5' end of fragments (and G->A conversions at the 3' end)
  • Metagenomes: the sample contains a complex background (mixture of ancient and modern DNA)
  • Contamination: the complex background can also be due to contamination
  • Low amount of endogenous DNA

SLIDE 8

Ancient DNA - Approaches

  • Short fragments: produce paired-end reads longer than the mean fragment length; after sequencing, the two reads of a pair are merged into a single read

[Figure: forward read and reverse read, their overlap, and the resulting merged read]
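The merging step can be illustrated with a minimal sketch. It assumes adapters have already been clipped, ignores base qualities, and keeps the forward base wherever the two reads disagree in the overlap. The names `merge_pair`, `min_overlap` and `max_mismatch_rate` are hypothetical and chosen for illustration only; they are not the interface of Clip&Merge or of any other tool.

```python
from typing import Optional

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string (A, C, G, T, N)."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def merge_pair(forward: str, reverse: str,
               min_overlap: int = 10,
               max_mismatch_rate: float = 0.05) -> Optional[str]:
    """Merge a read pair into one read, or return None if no overlap is found.

    The reverse read is reverse-complemented so that both reads lie on the
    same strand; the longest acceptable suffix/prefix overlap is used.
    """
    rc = reverse_complement(reverse)
    longest = min(len(forward), len(rc))
    for overlap in range(longest, min_overlap - 1, -1):
        suffix = forward[-overlap:]
        prefix = rc[:overlap]
        mismatches = sum(a != b for a, b in zip(suffix, prefix))
        if mismatches <= max_mismatch_rate * overlap:
            # Assumption: on disagreement the forward base wins.
            return forward + rc[overlap:]
    return None


if __name__ == "__main__":
    fragment = "ACGTTGCAAGGCTTAACCGGTTAGC"          # 25 bp toy fragment
    fwd = fragment[:18]                              # 18 bp forward read
    rev = reverse_complement(fragment[-18:])         # 18 bp reverse read
    print(merge_pair(fwd, rev) == fragment)
```

A production merger would typically use the base qualities to pick the consensus base in the overlap and to score candidate overlaps; this sketch only shows the geometry of the merge.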

SLIDE 9

Ancient DNA - Approaches

  • Damage: the damage pattern is used for authentication; afterwards the DNA is often treated with UDG (uracil-DNA glycosylase)
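How the damage signal can support authentication is sketched below: count how often a reference C is read as T, by distance from the 5' end of the read. An elevated rate at the first positions that decays further into the read is the pattern one would look for. The input format (gap-free pairs of reference segment and read, both oriented 5'->3') and the function name `c_to_t_profile` are assumptions made for this example; dedicated tools work on BAM files and handle many more details.

```python
from typing import Iterable, List, Tuple


def c_to_t_profile(aligned_pairs: Iterable[Tuple[str, str]],
                   n_positions: int = 5) -> List[float]:
    """Fraction of reference C positions read as T, per 5' distance.

    aligned_pairs: (reference_segment, read) tuples of equal orientation,
    aligned without gaps (an assumption of this sketch).
    """
    c_total = [0] * n_positions
    c_to_t = [0] * n_positions
    for ref, read in aligned_pairs:
        for i in range(min(n_positions, len(ref), len(read))):
            if ref[i] == "C":
                c_total[i] += 1
                if read[i] == "T":
                    c_to_t[i] += 1
    # Positions without any reference C are reported as 0.0.
    return [t / c if c else 0.0 for t, c in zip(c_to_t, c_total)]


if __name__ == "__main__":
    pairs = [("CACG", "TACG"), ("CCGT", "CTGT"),
             ("CAGT", "CAGT"), ("ACGT", "ACGT")]
    print(c_to_t_profile(pairs, n_positions=2))
```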

SLIDE 10

Ancient DNA - Issues

  • Metagenomes: the sample contains a complex background (mixture of ancient and modern DNA)
  • Possible approach: use MALT [1] to characterize the taxonomic content
  • Low amount of endogenous DNA:

Approach: specific enrichment for the target species

Problems: modern reference, no de novo assembly, …

[1] Herbig et al., bioRxiv 2016

SLIDE 11

EAGER [1] – a fully automated (ancient) genome reconstruction pipeline

[Figure: EAGER pipeline. Preprocessing: FastQ, Clip&Merge, FastQC. Read Mapping: FastQ, BWA, BWA-mem, …, Samtools, DeDup, QualiMap, MapDamage, Preseq, CircularMapper, SAM, BAM. Genotyping: GATK (Preprocessing, HaplotypeCaller, UnifiedGenotyper, VariantFiltration), VCF2Genome, FastA, VCF, Schmutzi, ANGSD, genotype likelihoods.]

[1] Peltzer et al., Genome Biol. 2016

SLIDE 12

Challenge 1: Genome Reconstruction

Question: how to reconstruct the genome of the DNA of interest from the raw reads

  • by mapping against a reference (only one?), or
  • by de novo assembly, or
  • by a hybrid approach?

SLIDE 13

Challenge 1: Genome Reconstruction

Approaches:

  • by mapping against a reference:

Problems: 1) single-reference mapping, 2) for aDNA only modern references are used

  • by de novo assembly:

Problems: low coverage, almost never mate pairs, ...

Possible approach: MADAM [1] – improved ancient DNA genome assembly

  • by a hybrid of both: not aware of how this could be done

[1] Seitz and Nieselt, PeerJ Preprints 2016

SLIDE 14

Challenge 2: Calling SNPs

To call SNPs after mapping, the typical (best-practice) approach is:

  • Apply tools such as GATK, freebayes or ANGSD to the mapping assemblies (i.e., BAM files)

Challenge: how to efficiently call SNPs from de novo assemblies when the input is low-coverage (ancient) DNA?

SLIDE 15

Challenge 3: Genome Comparison

Comparison of ancient and modern genomes:

  • Large-scale variations: genomic rearrangements (translocations, inversions)
  • Small-scale variations: gene content, insertions, deletions, SNPs

SLIDE 16

Challenge 3: Genome Comparison

Ancient genomes: mostly built from a single common reference

Modern genomes: assembled

Challenge: how to compare?

SLIDE 17

Challenge 3: Genome Comparison

Comparative analysis based on genomic positions is challenging because coordinate systems differ across genomes. Possible approaches:

  • compare genomes via one specific reference

But: genomic regions that cannot be aligned to the reference are lost.

or

SLIDE 18

  • build a meta-reference that contains the superset of all genomes: the SuperGenome [1] (others call this a pan-genome)

[1] Herbig et al., Bioinformatics 2012

SLIDE 19

SuperGenome

SuperGenome: a common coordinate system of all aligned genomes, independent of a prechosen reference genome, together with an injective mapping of each genome onto the SuperGenome

Input: whole-genome alignment (WGA) of all known genomes of the species of interest
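As a toy illustration of the coordinate idea, consider a single collinear alignment block in which every genome is a gapped row of equal length. Each alignment column then serves as one SuperGenome position, and each genome position maps to exactly one column. This is a simplified sketch under that assumption; it does not handle multiple blocks or rearrangements, and `supergenome_maps` is a hypothetical name, not part of the published SuperGenome software.

```python
from typing import Dict, List


def supergenome_maps(alignment: Dict[str, str],
                     gap: str = "-") -> Dict[str, List[int]]:
    """Map each genome position to its alignment column.

    maps[name][i] is the SuperGenome coordinate (column index) of the
    i-th ungapped base of genome `name`. Because columns are visited in
    order, each map is strictly increasing and therefore injective.
    """
    lengths = {len(row) for row in alignment.values()}
    if len(lengths) > 1:
        raise ValueError("all alignment rows must have the same length")
    return {
        name: [col for col, base in enumerate(row) if base != gap]
        for name, row in alignment.items()
    }


if __name__ == "__main__":
    wga = {"g1": "AC-GT", "g2": "A-TGT"}
    for name, positions in supergenome_maps(wga).items():
        print(name, positions)
```

With such maps, a position in one genome can be translated to another genome by going through the shared column index, without privileging either genome as the reference.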

SLIDE 20

The SuperGenome and Pan-Genome

Nice "side effect" of the SuperGenome: it allows

  • consistent assignment of coordinates to genomic annotations (e.g. genes)
  • straightforward determination of the pan-genome* by determining the orthologs from overlapping coordinates in the SuperGenome

(* For us the pan-genome (aka supra-genome) is the full complement of all genes of a clade, e.g. a species)
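The overlap idea can be sketched as interval grouping: genes whose SuperGenome intervals overlap are put into one candidate ortholog group, and a group counts as core if it contains a gene from every genome. This is only an illustration under simplifying assumptions (any overlap counts, strand is ignored, coordinates are inclusive); the `Gene` tuple and both function names are hypothetical, and a real analysis would likely require a minimum overlap fraction.

```python
from typing import List, NamedTuple


class Gene(NamedTuple):
    genome: str
    gene_id: str
    start: int   # SuperGenome coordinate, inclusive
    end: int     # SuperGenome coordinate, inclusive


def group_by_overlap(genes: List[Gene]) -> List[List[Gene]]:
    """Group genes whose SuperGenome intervals overlap (single-linkage)."""
    groups: List[List[Gene]] = []
    current: List[Gene] = []
    current_end = -1
    for gene in sorted(genes, key=lambda g: g.start):
        if current and gene.start <= current_end:
            current.append(gene)
            current_end = max(current_end, gene.end)
        else:
            if current:
                groups.append(current)
            current, current_end = [gene], gene.end
    if current:
        groups.append(current)
    return groups


def core_groups(groups: List[List[Gene]], n_genomes: int) -> List[List[Gene]]:
    """Groups that contain at least one gene from every genome."""
    return [g for g in groups if len({x.genome for x in g}) == n_genomes]


if __name__ == "__main__":
    genes = [Gene("g1", "a1", 0, 99), Gene("g2", "b1", 5, 104),
             Gene("g3", "c0", 10, 90), Gene("g1", "a2", 200, 299),
             Gene("g3", "c1", 210, 290), Gene("g2", "b2", 400, 450)]
    groups = group_by_overlap(genes)
    print(len(groups), len(core_groups(groups, n_genomes=3)))
```

In this picture the pan-genome is the set of all groups, and the core genome is the subset returned by `core_groups`.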

SLIDE 21

Challenge 3A: WGA

Sheer number of available genomes of a single species: currently up to thousands

Challenge: how to compute a WGA of thousands of genomes?

SLIDE 22

Challenge 4: Phylogenetics

Phylogenetic trees from aDNA and modern genomes are reconstructed

  • 1. to assess the history of evolution,
  • 2. to date the 'root' and/or the emergence of specific clades.

SLIDE 23

Challenge 4: Phylogenetics

How reliable are phylogenetic trees built from genomic data reconstructed from short reads?

Typical workflow:

  • 1. map short reads to a single reference sequence,
  • 2. extract single nucleotide polymorphisms (SNPs),
  • 3. infer a phylogenetic tree from the aligned SNP positions.
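Step 2 of this workflow can be sketched as follows, assuming the reconstructed genomes are already available as equal-length sequences in reference coordinates. A column is kept as a SNP position if at least two different called bases occur in it; 'N' and '-' are treated as missing. The function names and the input format are assumptions made for this example.

```python
from typing import Dict, List


def snp_columns(alignment: Dict[str, str], missing: str = "N-") -> List[int]:
    """Indices of columns with at least two distinct called bases."""
    rows = list(alignment.values())
    if len({len(row) for row in rows}) > 1:
        raise ValueError("all sequences must have the same length")
    length = len(rows[0]) if rows else 0
    columns = []
    for i in range(length):
        called = {row[i] for row in rows} - set(missing)
        if len(called) > 1:
            columns.append(i)
    return columns


def snp_alignment(alignment: Dict[str, str], missing: str = "N-") -> Dict[str, str]:
    """Reduce the alignment to its SNP columns (input for tree inference)."""
    columns = snp_columns(alignment, missing)
    return {name: "".join(row[i] for i in columns)
            for name, row in alignment.items()}


if __name__ == "__main__":
    genomes = {"ref": "ACGTAC", "s1": "ACGTAT", "s2": "ATGNAC"}
    print(snp_columns(genomes))
    print(snp_alignment(genomes))
```

Note that the reduced alignment discards all invariant sites, which is exactly the property whose effect on tree inference is questioned on the following slides.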

SLIDE 24

Biased phylogenies? - I

  • 1. Source of the common alignment?

Reads are mapped to one reference -> bias?

Possible answer: REALPHY [1] (Reference sequence Alignment-based Phylogeny builder)

[1] Bertels et al., Mol. Biol. Evol. 2014

SLIDE 25

Biased phylogenies? - II

  • 2. SNPs versus whole-genome-based phylogeny?

A bias of ML trees reconstructed from SNP-only positions was shown by Bertels et al. [1]

Possible approach to investigate this further: TreeToReads [2] - a pipeline for simulating raw reads from phylogenies

[1] Bertels et al., Mol. Biol. Evol. 2014
[2] McTavish et al., bioRxiv 2016

SLIDE 26

Biased phylogenies? - III

  • 3. Influence of 'Ns' on the phylogeny?

Many aDNA samples suffer from a low amount of endogenous DNA -> low-coverage genomes -> missing positions ('Ns')

Questions / challenges:

  • impact of 'complete column deletion' versus 'partial column deletion'
  • coverage threshold for a 'good' phylogeny
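The two deletion strategies can be contrasted with a small sketch. Complete deletion drops every alignment column in which any sample is missing; partial deletion keeps a column as long as the fraction of missing samples does not exceed a threshold. The function name `filter_columns` and the parameter `max_missing_fraction` are hypothetical; phylogenetics packages expose the same idea under their own option names.

```python
from typing import Dict


def filter_columns(alignment: Dict[str, str],
                   max_missing_fraction: float = 0.0,
                   missing: str = "N-") -> Dict[str, str]:
    """Drop columns whose fraction of missing entries exceeds the threshold.

    max_missing_fraction = 0.0 corresponds to complete column deletion;
    any larger value gives a form of partial column deletion.
    """
    names = list(alignment)
    if not names:
        return {}
    if len({len(alignment[n]) for n in names}) > 1:
        raise ValueError("all sequences must have the same length")
    length = len(alignment[names[0]])
    keep = [
        i for i in range(length)
        if sum(alignment[n][i] in missing for n in names) / len(names)
        <= max_missing_fraction
    ]
    return {n: "".join(alignment[n][i] for i in keep) for n in names}


if __name__ == "__main__":
    samples = {"a": "ACGT", "b": "ANGT", "c": "ANNT", "d": "ACGT"}
    print(filter_columns(samples))                             # complete deletion
    print(filter_columns(samples, max_missing_fraction=0.25))  # partial deletion
```

With many low-coverage samples, complete deletion can shrink the alignment drastically, which is one way to see why the choice of threshold matters for the resulting tree.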

SLIDE 27

Biased phylogenies? - IV

  • 4. SNP-based phylogenies without alignment and/or genome reconstruction?

Possible approach: kSNP3.0 [3]

  • 5. General challenge: alignment-free versus alignment-based phylogenies? [4]
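A toy version of alignment-free SNP detection, in the spirit of k-mer based approaches: collect all k-mers of odd length k from every genome and report loci where the flanking bases are identical but the central base differs between genomes. This is a sketch written for illustration, not the kSNP3.0 implementation; it ignores reverse complements, does not filter loci where one genome carries several alleles, and the name `kmer_snps` is hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Flanks = Tuple[str, str]


def kmer_snps(genomes: Dict[str, str], k: int = 5) -> Dict[Flanks, Dict[str, List[str]]]:
    """Find loci with identical flanks and a varying central base.

    Returns {(left_flank, right_flank): {central_base: [genome names]}}
    for all flank pairs observed with more than one central base.
    """
    if k < 3 or k % 2 == 0:
        raise ValueError("k must be an odd number >= 3")
    half = k // 2
    loci = defaultdict(lambda: defaultdict(set))
    for name, seq in genomes.items():
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            flanks = (kmer[:half], kmer[half + 1:])
            loci[flanks][kmer[half]].add(name)
    return {
        flanks: {base: sorted(names) for base, names in alleles.items()}
        for flanks, alleles in loci.items()
        if len(alleles) > 1
    }


if __name__ == "__main__":
    toy = {"g1": "AACGTTA", "g2": "AACCTTA"}
    print(kmer_snps(toy, k=5))
```

The allele table per locus can then be turned into a SNP matrix for tree inference without ever aligning or reconstructing the genomes.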

[3] Gardner et al., Bioinformatics 2015
[4] Haubold, Brief. Bioinformatics 2013

SLIDE 28

Summary:

Challenges of ancient (pan-)genomics:

  • 1. Reconstruction of ancient genomes from bacteria (or viruses) via mapping, assembly or a hybrid approach
  • 2. SNP calling of low-coverage genomes
  • 3. WGA of ancient and modern genomes
  • 4. Phylogenetic tree reconstruction from ancient and modern genomes

SLIDE 29

References

Bertels F, Silander OK, Pachkov M, Rainey PB & van Nimwegen E (2014). Automated reconstruction of whole-genome phylogenies from short-sequence reads. Molecular Biology and Evolution, 31, 1077-1088.

Gardner SN, Slezak T & Hall BG (2015). kSNP3.0: SNP detection and phylogenetic analysis of genomes without genome alignment or reference genome. Bioinformatics, 31, 2877.

Haubold B (2014). Alignment-free phylogenetics and population genetics. Briefings in Bioinformatics, 15, 407-418.

Herbig A, Jäger G, Battke F & Nieselt K (2012). GenomeRing: alignment visualization based on SuperGenome coordinates. Bioinformatics, 28, i7-i15.

Herbig A, Maixner F, Bos KI, Zink A, Krause J & Huson DH (2016). MALT: Fast alignment and analysis of metagenomic DNA sequence data applied to the Tyrolean Iceman. bioRxiv, 050559.

McTavish EJ, Pettengill J, Davis S, Rand H, Strain E, Allard M & Timme RE (2016). TreeToReads - a pipeline for simulating raw reads from phylogenies. bioRxiv, 037655.

Peltzer A, Jäger G, Herbig A, Seitz A, Kniep C, Krause J & Nieselt K (2016). EAGER: efficient ancient genome reconstruction. Genome Biology, 17, 1.

Seitz A & Nieselt K (2016). Improving ancient genome assembly. PeerJ Preprints.