SLIDE 1

Fast and Accurate Distance Computation from Unaligned Genomes

Fabian Klötzl & Bernhard Haubold GCB 2018

MPI for Evolutionary Biology, Plön

SLIDE 2

Alignment-Based Phylogeny Reconstruction

Unaligned Sequences
>A AACGTTGTGCA
>B CACGTTGGT
>C AACGATGCGT
>D ACCGGTGTGCT

⇒ Alignment
>A AACGTTGTGCA
>B CACGTT--GGT
>C AACGATGCG-T
>D ACCGGTGTGCT

⇒ Phylogeny
[Figure: tree with leaves A, B, C, D]


SLIDE 3

Alignment-Free Phylogeny Reconstruction

Unaligned Sequences
>A AACGTTGTGCA
>B CACGTTGGT
>C AACGATGCGT
>D ACCGGTGTGCT

⇒ Distance Matrix
      A      B      C      D
A     -      0.1    0.25   0.3
B     0.1    -      0.3    0.3
C     0.25   0.3    -      0.05
D     0.3    0.3    0.05   -

⇒ Phylogeny
[Figure: tree with leaves A, B, C, D]
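The slide does not say how the tree is built from the distance matrix. As a minimal sketch, assuming plain average-linkage clustering (UPGMA) in place of whatever method the authors used, the example matrix can be turned into a tree topology as follows. The function name `upgma` and the nested-tuple tree representation are illustrative choices, not part of the talk.

```python
from itertools import combinations

def upgma(names, dist):
    """Cluster taxa by average linkage; return the topology as nested tuples.

    `dist` maps frozenset({a, b}) -> distance for every pair of names.
    """
    # Each cluster is a pair: (subtree, list of leaf names below it).
    clusters = [(name, [name]) for name in names]

    def avg(c1, c2):
        # Average of all leaf-to-leaf distances between two clusters.
        pairs = [(x, y) for x in c1[1] for y in c2[1]]
        return sum(dist[frozenset(p)] for p in pairs) / len(pairs)

    while len(clusters) > 1:
        # Pick the closest pair of clusters and merge them.
        c1, c2 = min(combinations(clusters, 2), key=lambda p: avg(*p))
        clusters = [c for c in clusters if c is not c1 and c is not c2]
        clusters.append(((c1[0], c2[0]), c1[1] + c2[1]))
    return clusters[0][0]

# Distances taken from the example matrix on this slide.
names = ["A", "B", "C", "D"]
dist = {
    frozenset("AB"): 0.1, frozenset("AC"): 0.25, frozenset("AD"): 0.3,
    frozenset("BC"): 0.3, frozenset("BD"): 0.3, frozenset("CD"): 0.05,
}
tree = upgma(names, dist)
print(tree)
```

For this matrix the closest pair (C, D) is joined first, then (A, B), so the sketch groups A with B and C with D.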


SLIDE 4

Phylonium

  1. Use one sequence as the common coordinate system.
  2. Align all other sequences against this reference.
  3. For all pairs inspect the overlapping regions.
  4. Estimate evolutionary distance from substitution rate.

[Figure: regions of the queries Q1 and Q2 aligned against the reference]
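Steps 3 and 4 can be sketched as follows. This is an illustration under stated assumptions, not the Phylonium implementation: each query's alignment is represented as a dictionary from reference position to the aligned query base (a hypothetical data structure), and the substitution rate is converted to a distance with the Jukes-Cantor correction, which the slide does not name.

```python
import math

def pairwise_distance(aln1, aln2):
    """Estimate the distance between two queries from their reference alignments.

    Each argument maps a reference position to the query base aligned there
    (a hypothetical representation chosen for this sketch).
    """
    # Step 3: only positions covered by both queries can be compared.
    shared = aln1.keys() & aln2.keys()
    if not shared:
        return float("nan")
    mismatches = sum(aln1[pos] != aln2[pos] for pos in shared)
    p = mismatches / len(shared)  # observed substitution rate
    if p >= 0.75:
        return float("inf")  # the correction below is undefined here
    # Step 4: Jukes-Cantor correction (an assumption of this sketch).
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

# Q1 covers reference positions 0-9, Q2 covers positions 5-14.
reference = "ACGTACGTACGTACG"
q1 = {i: reference[i] for i in range(0, 10)}
q2 = {i: reference[i] for i in range(5, 15)}
q2[7] = "A"  # one substitution inside the overlap (the reference has T here)
d = pairwise_distance(q1, q2)
print(round(d, 4))
```

Here the two queries overlap on five reference positions and differ at one of them, so the observed rate is 0.2 before correction.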


SLIDE 10

Quality

[Figure: Accuracy. Estimated distance plotted against simulated distance on logarithmic axes, for Phylonium and Mash.]


SLIDE 11

Phylogenetic Quality — Robinson-Foulds Distance

[Figure: two trees with leaves A, B, C, D and different topologies]

Robinson-Foulds Distance
The RF distance counts the partitions that are present in one tree but not in the other. Thus, it only considers the topology. For the trees above, the RF distance is 2.
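A minimal sketch of this computation, assuming trees are given as nested tuples of leaf names. The two example trees are hypothetical stand-ins, since the trees drawn on the slide are not recoverable from the transcript; the function names are illustrative.

```python
def leaves(tree):
    """Return the set of leaf names below a nested-tuple tree node."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))

def splits(tree):
    """Collect the non-trivial bipartitions induced by the internal edges."""
    all_leaves = leaves(tree)
    found = set()

    def visit(node):
        if isinstance(node, str):
            return
        side = leaves(node)
        other = all_leaves - side
        # Skip trivial splits that separate fewer than two leaves.
        if len(side) > 1 and len(other) > 1:
            # Store each bipartition in a canonical, orientation-free form.
            found.add(frozenset([side, other]))
        for child in node:
            visit(child)

    visit(tree)
    return found

def robinson_foulds(t1, t2):
    """Count the splits that occur in exactly one of the two trees."""
    return len(splits(t1) ^ splits(t2))

# Hypothetical example trees on four leaves.
t1 = (("A", "B"), ("C", "D"))
t2 = (("A", "C"), ("B", "D"))
print(robinson_foulds(t1, t2))  # prints 2
```

Each four-leaf tree has exactly one non-trivial split, so two different topologies always differ by 2, while identical topologies give 0 regardless of branch lengths.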


SLIDE 12

Phylogenetic Quality — Relative Matrix Dissimilarity

\[
A = \begin{pmatrix} & 0.1 & 0.2 \\ 0.1 & & 0.3 \\ 0.2 & 0.3 & \end{pmatrix},
\quad
B = \begin{pmatrix} & 0.11 & 0.22 \\ 0.11 & & 0.33 \\ 0.22 & 0.33 & \end{pmatrix}
\]

Matrix Dissimilarity
Compute the average relative dissimilarity of the entries.

\[
d(A, B) = \frac{4}{n(n - 1)} \sum_{i} \sum_{j < i} \frac{|a_{ij} - b_{ij}|}{a_{ij} + b_{ij}}
\]

For the example above, d(A, B) = 0.095, approximately 10 %.
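The formula can be checked against the example with a short sketch. The function name `relative_dissimilarity` is illustrative, and the empty diagonal is filled with zeros here because it is never read.

```python
def relative_dissimilarity(a, b):
    """Average relative dissimilarity of two symmetric distance matrices."""
    n = len(a)
    total = 0.0
    for i in range(n):
        for j in range(i):  # strictly lower triangle, j < i
            # Assumes a[i][j] + b[i][j] > 0 for every off-diagonal pair.
            total += abs(a[i][j] - b[i][j]) / (a[i][j] + b[i][j])
    return 4.0 / (n * (n - 1)) * total

A = [[0.0, 0.1, 0.2],
     [0.1, 0.0, 0.3],
     [0.2, 0.3, 0.0]]
B = [[0.0, 0.11, 0.22],
     [0.11, 0.0, 0.33],
     [0.22, 0.33, 0.0]]
d = relative_dissimilarity(A, B)
print(round(d, 3))  # prints 0.095
```

Every entry of B is 10 % larger than its counterpart in A, so each term equals 0.01 / 0.21 and the result is independent of the magnitude of the distances.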


SLIDE 14

109 E. coli Genomes

Method                    Runtime   RF distance   Relative dissimilarity
Mugsy (alignment-based)   2 days
Phylonium                 23 s      130           20 %
Mash                      20 s      161           84 %


SLIDE 15

2681 E. coli from Ensembl Genomes

Phylonium: 378 s
Mash: 49 s


SLIDE 16

Summary

  • Goal: Phylogeny reconstruction from whole genomes.
  • Alignment-free distance methods are fast and accurate.
  • Work best on data from pathogen outbreaks.
  • Scale up to massive data sets.
  • Paper on Phylonium in prep.

kloetzl@evolbio.mpg.de
