Phylogenetic methods for taxonomic profiling (Siavash Mirarab)



slide-1
SLIDE 1

Phylogenetic methods for taxonomic profiling

Siavash Mirarab

University of California at San Diego (UCSD)

Joint work with Tandy Warnow, Nam-Phuong Nguyen, Mike Nute, Mihai Pop, and Bo Liu

slide-2
SLIDE 2

Phylogeny reconstruction pipeline

[Figure: sequencing samples pass through bioinformatic processing to yield unaligned sequences for each gene (gene 1, gene 2, ..., gene 1000).]

slide-3
SLIDE 3

Phylogeny reconstruction pipeline

[Figure: Step 1: Multiple sequence alignment. The unaligned sequences of each gene (gene 1, gene 2, ..., gene 1000) are aligned into one MSA per gene.]

slide-4
SLIDE 4

Phylogeny reconstruction pipeline

[Figure: Step 1: Multiple sequence alignment, followed by Step 2: Species tree reconstruction for Orangutan, Gorilla, Chimpanzee, and Human.]

  • Approach 1: Concatenation. The per-gene MSAs are joined into a supermatrix, from which a phylogeny is inferred.
  • Approach 2: Summary methods. A gene tree is estimated from each MSA (gene 1, gene 2, ..., gene 1000), and a summary method combines the gene trees into a species tree.
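
The concatenation approach can be illustrated with a minimal Python sketch. It assumes each gene alignment is a dictionary from taxon name to aligned sequence; the function name `concatenate` and the assignment of the example rows to particular species are hypothetical and only serve the illustration. Taxa missing from a gene are padded with gaps so that all rows of the supermatrix have equal length.

```python
def concatenate(gene_alignments):
    """Join per-gene alignments ({taxon: aligned sequence}) into a supermatrix."""
    taxa = sorted(set().union(*gene_alignments))
    supermatrix = {taxon: "" for taxon in taxa}
    for alignment in gene_alignments:
        # All rows of one gene alignment share the same number of columns.
        width = len(next(iter(alignment.values())))
        for taxon in taxa:
            # A taxon missing from this gene is padded with gaps.
            supermatrix[taxon] += alignment.get(taxon, "-" * width)
    return supermatrix


# Hypothetical example data; the species labels are assumptions.
gene1 = {
    "Orangutan": "ACTGCACACCG",
    "Gorilla": "ACTGC-CCCCG",
    "Chimpanzee": "AATGC-CCCCG",
    "Human": "-CTGCACACGG",
}
gene2 = {
    "Orangutan": "CTGAGCATCG",
    "Gorilla": "CTGAGC-TCG",
    "Chimpanzee": "ATGAGC-TC-",
    # Human is deliberately absent from this gene.
}
supermatrix = concatenate([gene1, gene2])
```

A tree inference tool would then be run on the supermatrix; summary methods instead infer one tree per gene and combine the trees.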

slide-5
SLIDE 5

Phylogeny reconstruction pipeline

[Figure: the same pipeline (Step 1: Multiple sequence alignment; Step 2: Species tree reconstruction by concatenation or by summary methods), now with an additional set of unaligned sequences shown below the gene trees.]

slide-6
SLIDE 6

Phylogeny reconstruction pipeline

[Figure: the same pipeline with Step 3: Phylogenetic placement added. Fragmentary sequences are aligned to existing gene alignments (gene 200, gene 30, gene 1000) and placed on the corresponding trees of Orangutan, Gorilla, Chimpanzee, and Human.]

slide-7
SLIDE 7

Phylogeny reconstruction pipeline

[Figure: the same three-step pipeline, annotated with tools for Step 1 (multiple sequence alignment): PASTA and UPP.]

slide-8
SLIDE 8

Phylogeny reconstruction pipeline

[Figure: the same three-step pipeline, annotated with tools: PASTA and UPP for Step 1 (multiple sequence alignment); ASTRAL and statistical binning for Step 2 (species tree reconstruction).]

slide-9
SLIDE 9

Phylogeny reconstruction pipeline

[Figure: the same three-step pipeline, annotated with tools: PASTA and UPP for Step 1 (multiple sequence alignment); ASTRAL and statistical binning for Step 2 (species tree reconstruction); SEPP and TIPP for Step 3 (phylogenetic placement).]

slide-10
SLIDE 10

Phylogeny reconstruction pipeline

[Figure: the same three-step pipeline, annotated with PASTA and UPP for Step 1 (multiple sequence alignment) and SEPP and TIPP for Step 3 (phylogenetic placement).]

slide-11
SLIDE 11

Microbiome analyses using evolutionary trees

[Figure: fragmentary metagenomic reads (ACCG, CGAG, CGGT, GGCT, TAGA, GGAG, GCTT, ACCT) next to a reference dataset of full-length sequences with an alignment and a tree (Vibrio cholerae, Yersinia pestis, Salmonella enterica, Salmonella bongori, Escherichia coli).]

Place each fragmentary read independently on a reference tree of known sequences

slide-12
SLIDE 12

Phylogenetic placement

  • Input:
    • A backbone multiple sequence alignment for a marker gene, including sequences from known species
    • A backbone ML phylogenetic tree, corresponding to the backbone alignment
    • A collection of (fragmentary, error-prone) query sequences
slide-13
SLIDE 13

Phylogenetic placement

  • Output: Probabilistic placements of each query sequence on the phylogenetic tree after (locally) aligning the query to the reference
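
To make "probabilistic placement" concrete, here is a minimal Python sketch that turns per-branch log-likelihoods for one query into normalized weights, in the spirit of the likelihood weight ratios reported by placement tools. The branch names and log-likelihood values are hypothetical, and the function is an illustration under that assumption, not the procedure of any particular tool.

```python
import math


def likelihood_weight_ratios(log_likelihoods):
    """Map {branch: log-likelihood} to {branch: normalized weight}."""
    best = max(log_likelihoods.values())
    # Subtract the maximum before exponentiating to avoid numerical underflow.
    weights = {b: math.exp(ll - best) for b, ll in log_likelihoods.items()}
    total = sum(weights.values())
    return {b: w / total for b, w in weights.items()}


# Hypothetical log-likelihoods of one query placed on three candidate branches.
placements = likelihood_weight_ratios({
    "edge_3": -1050.2,
    "edge_7": -1051.6,
    "edge_9": -1058.0,
})
best_edge = max(placements, key=placements.get)
```

The weights sum to one, so downstream steps can treat them as the probability that the query belongs on each branch.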

slide-14
SLIDE 14

Phylogenetic placement


  • Tools:
    • Alignment: HMMER
    • Placement: pplacer (Matsen) and EPA (RAxML)

slide-15
SLIDE 15

Phylogenetic placement


SATe-Enabled Phylogenetic Placement (SEPP)

slide-16
SLIDE 16

Phylogenetic placement simulations

[Figure: simulation results plotted against increasing rate of evolution.]

  • S. Mirarab et al., PSB. (2012).
slide-17
SLIDE 17

Reference tree

slide-18
SLIDE 18

HMM for the alignment step

slide-19
SLIDE 19

Ensemble of HMMs

slide-20
SLIDE 20

Ensemble of HMMs

slide-21
SLIDE 21

Ensemble of HMMs

slide-22
SLIDE 22

SATe-Enabled Phylogenetic Placement (SEPP)

Step 1: Align each query sequence to the backbone alignment

  • Use an ensemble of disjoint HMMs instead of a single HMM to improve accuracy.
  • The ensemble is created based on the reference tree, such that each model better captures the details of one part of the tree.

  • S. Mirarab et al., PSB. (2012).
slide-23
SLIDE 23

SATe-Enabled Phylogenetic Placement (SEPP)

Step 2: Place each query sequence into the backbone tree, using the extended alignment

  • Use divide-and-conquer on the backbone tree to improve scalability to reference trees with tens of thousands of leaves.

  • S. Mirarab et al., PSB. (2012).
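
The tree-based decomposition behind both steps can be sketched in Python. This is an illustration under stated assumptions, not SEPP's implementation: it assumes the split is made at the edge that best balances the number of taxa on each side (a centroid-style split) and repeats until every subset has at most `max_size` taxa. The helper names `build_adjacency`, `side`, and `decompose` and the toy tree are hypothetical. In the alignment step, each resulting subset would receive its own HMM.

```python
def build_adjacency(edges):
    """Build an undirected adjacency map {node: set(neighbors)} from edge pairs."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def side(adj, start, blocked):
    """Nodes on start's side of the tree once the edge (start, blocked) is cut."""
    seen, stack = {start, blocked}, [start]
    while stack:
        node = stack.pop()
        for neighbor in adj[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                stack.append(neighbor)
    seen.discard(blocked)
    return seen


def decompose(adj, taxa, max_size):
    """Recursively split the tree until each taxon subset has <= max_size taxa."""
    if len(taxa) <= max_size:
        return [sorted(taxa)]
    best = None
    for u in adj:
        for v in adj[u]:
            if u < v:  # visit each undirected edge once
                u_side = side(adj, u, v)
                imbalance = abs(len(taxa) - 2 * len(taxa & u_side))
                if best is None or imbalance < best[0]:
                    best = (imbalance, u_side)
    u_side = best[1]
    v_side = set(adj) - u_side
    subsets = []
    for part in (u_side, v_side):
        sub_adj = {node: adj[node] & part for node in part}
        subsets += decompose(sub_adj, taxa & part, max_size)
    return subsets


# Hypothetical six-taxon tree; n1..n4 are internal nodes.
adj = build_adjacency([
    ("A", "n1"), ("B", "n1"), ("n1", "n2"), ("C", "n2"), ("n2", "n3"),
    ("D", "n3"), ("n3", "n4"), ("E", "n4"), ("F", "n4"),
])
taxa = set("ABCDEF")
subsets = decompose(adj, taxa, max_size=3)
```

With `max_size=3` the toy tree is cut at its middle edge, giving the subsets A, B, C and D, E, F. The quadratic edge scan keeps the sketch short; a production implementation would compute subtree sizes in a single traversal.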
slide-24
SLIDE 24

SEPP on simulated data

[Figure: two panels of results on simulated data, plotted against increasing rate of evolution.]

  • S. Mirarab et al., PSB. (2012).
slide-25
SLIDE 25

SEPP on large 16S references

Simulations: 16S bacteria, 13k curated backbone tree, 13k fragments

[Figure panels: running time; memory.]

slide-26
SLIDE 26

SEPP on large 16S references

Simulations: 16S bacteria, 13k curated backbone tree, 13k fragments

Real data (with Rob Knight’s lab; Daniel McDonald):

  • EMP: placing ~300,000 fragments on the Greengenes reference tree with 203,452 sequences: 8 hours (16 cores)
  • AG: placing ~40,000 fragments on the Greengenes reference tree with 203,452 sequences: 10 minutes (16 cores)

[Figure panels: running time; memory.]

slide-27
SLIDE 27

Taxonomic Profiling

  • Input:
    • Reference multiple sequence alignments for a collection of marker genes, each including sequenced species
    • Reference trees for the marker genes. We constrain the trees to be compatible with the taxonomy (this is not required).
    • A metagenomic sample: a collection of fragmentary reads from many species with different abundances

slide-28
SLIDE 28

Taxonomic Profiling


  • Output:
    • The taxonomic profile of the sample

Genus          | %     || Phylum         | %
Pseudomonas    | 16.6  || Proteobacteria | 63.1
Campylobacter  | 8.9   || Actinobacteria | 9.6
Streptomyces   | 7.6   || Firmicutes     | 9.6
Pasteurella    | 6.4   || Euryarchaeota  | 7.6
Clostridium    | 5.1   || Cyanobacteria  | 4.5
Alcanivorax    | 4.5   || Crenarchaeota  | 3.8
…              |       || …              |
unclassified   | 1.2   || unclassified   | 0.0

slide-29
SLIDE 29

Taxonomic Profiling


Taxon Identification and Phylogenetic Profiling (TIPP)

slide-30
SLIDE 30

Algorithmic steps


Step 1: map fragments to ~30 “marker” genes using BLAST

Nguyen et al., Bioinformatics (2014)

slide-31
SLIDE 31

Algorithmic steps

Step 2: Use SEPP to place reads on the marker trees

  • Take uncertainty into account: use several alignments and placements on the tree (enough to reach a predefined level of statistical support)

Nguyen et al., Bioinformatics (2014)

slide-32
SLIDE 32

Algorithmic steps

Step 3: Summarize results across genes to get a taxonomic profile

  • Each read contributes to each branch, and to all branches above it, proportionally to the probability that it belongs to that branch
  • Results from all genes are simply aggregated as counts

Nguyen et al., Bioinformatics (2014)
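
The summarization in Step 3 can be sketched in Python. As a simplifying assumption, each placement is represented by the taxonomic lineage of the branch it falls on (a tuple from phylum downward) together with its probability; the function names, the lineages, and the probabilities are hypothetical. Reads from all marker genes are pooled into one list, mirroring the "aggregated as counts" step, and probability mass placed above a rank is reported as unclassified at that rank.

```python
from collections import defaultdict


def aggregate_counts(read_placements):
    """Sum placement probabilities for every taxon and all of its ancestors."""
    counts = defaultdict(float)
    for placements in read_placements:
        for lineage, probability in placements:
            # Credit the placed taxon and every ancestor above it.
            for depth in range(1, len(lineage) + 1):
                counts[lineage[:depth]] += probability
    return dict(counts)


def profile_at_depth(counts, depth, n_reads):
    """Percentages at one taxonomic depth (1 = phylum, 2 = genus here)."""
    level = {lin[-1]: c for lin, c in counts.items() if len(lin) == depth}
    profile = {name: 100.0 * c / n_reads for name, c in level.items()}
    # Mass placed above this depth cannot be assigned to a taxon at this rank.
    profile["unclassified"] = 100.0 * (n_reads - sum(level.values())) / n_reads
    return profile


# Hypothetical placements for four reads pooled across marker genes.
reads = [
    [(("Proteobacteria", "Pseudomonas"), 0.9), (("Proteobacteria",), 0.1)],
    [(("Proteobacteria", "Pasteurella"), 1.0)],
    [(("Firmicutes", "Clostridium"), 0.6), (("Firmicutes",), 0.4)],
    [(("Actinobacteria", "Streptomyces"), 1.0)],
]
counts = aggregate_counts(reads)
phylum_profile = profile_at_depth(counts, depth=1, n_reads=len(reads))
genus_profile = profile_at_depth(counts, depth=2, n_reads=len(reads))
```

Because uncertain reads are credited to higher ranks, the unclassified share in this toy example is larger at the genus level than at the phylum level, the same pattern as in the example profile shown earlier.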

slide-33
SLIDE 33

Phylogeny reconstruction pipeline

[Figure: the three-step pipeline shown earlier (Step 1: multiple sequence alignment; Step 2: species tree reconstruction; Step 3: phylogenetic placement), annotated with PASTA and UPP for Step 1 and SEPP and TIPP for Step 3.]

slide-34
SLIDE 34

Multiple sequence alignment

  • Input: a (potentially ultra-large) set of input sequences from a single gene
    • Sequences may be full-length or fragmentary
  • Output: a multiple sequence alignment
    • Optional: co-estimate the alignment and tree
  • Relevance: useful for building very large reference alignments and trees with up to hundreds of thousands of leaves

slide-35
SLIDE 35

Multiple sequence alignment


  • PASTA
    • Great for trees
    • Not good for fragmentary data
slide-36
SLIDE 36

Multiple sequence alignment

  • UPP
    • Good for fragmentary data
slide-37
SLIDE 37

UPP Steps

  • Step 1: Randomly select a “small” subset of full-length sequences (e.g., 1000) as the backbone.
  • Step 2: Align the backbone using other tools (e.g., PASTA).
  • Step 3: Use a SEPP-like approach to align the remaining sequences into the reference.
  • Note: this leaves some “insertion” sites unaligned.
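
Step 1 of UPP can be sketched as follows. The slide does not define "full length", so this Python sketch assumes a sequence qualifies when it is at least a fixed fraction of the median length; the function name `select_backbone` and the parameters `min_fraction` and `seed` are hypothetical. Sequences not chosen for the backbone become the queries of Step 3.

```python
import random


def select_backbone(sequences, backbone_size=1000, min_fraction=0.75, seed=0):
    """Split {name: unaligned sequence} into backbone names and query names."""
    lengths = sorted(len(seq) for seq in sequences.values())
    median = lengths[len(lengths) // 2]
    # Assumption: "full length" means at least min_fraction of the median length.
    full_length = sorted(
        name for name, seq in sequences.items() if len(seq) >= min_fraction * median
    )
    rng = random.Random(seed)
    backbone = set(rng.sample(full_length, min(backbone_size, len(full_length))))
    queries = sorted(name for name in sequences if name not in backbone)
    return sorted(backbone), queries


# Hypothetical input: three full-length sequences and two fragments.
sequences = {
    "s1": "ACGT" * 25,
    "s2": "ACGT" * 25,
    "s3": "ACGT" * 24,
    "frag1": "ACGT" * 5,
    "frag2": "ACGT" * 7,
}
backbone, queries = select_backbone(sequences, backbone_size=2)
```

The backbone would then be aligned with a tool such as PASTA, and the queries added with the SEPP-like alignment step.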
slide-38
SLIDE 38

PASTA: Iterative divide-and-conquer alignment and tree estimation

[Figure: iterative cycle on sequences A, B, C, D, E: decompose dataset, align subproblems, merge sub-alignments, estimate tree, then repeat.]

  • S. Mirarab et al., Res. Comput. Mol. Biol. (2014).
  • S. Mirarab et al., J. Comput. Biol. 22 (2015).
slide-39
SLIDE 39

PASTA on Greengenes

  • Testing the performance of PASTA for building the Greengenes 16S reference tree
  • Q1: Ability to distinguish samples using UniFrac?

dataset             || unweighted    || weighted
                    || GG   | PASTA  || GG   | PASTA
88 soils            || 0.78 | 0.78   || 0.75 | 0.74
infant-time-series  || 0.55 | 0.55   || 0.37 | 0.42
moving pictures     || 728  | 724    || 2188 | 2439
global gut          || 52.9 | 51.1   || 79   | 72

  • Q2: Speed (16 cores):
    • 97% tree (99,322 leaves): 28 hours
    • 99% tree (203,452 leaves): 49 hours

With Daniel McDonald (Knight’s lab) and Uyen Mai

slide-40
SLIDE 40

Software availability

  • PASTA: github.com/smirarab/pasta (internally uses FastTree, Mafft, HMMER, and OPAL)
  • SEPP: github.com/smirarab/sepp (internally uses pplacer and HMMER)
  • UPP: https://github.com/smirarab/sepp/blob/master/README.UPP.md (internally uses HMMER)
  • TIPP: https://github.com/smirarab/sepp/blob/master/README.TIPP.md (internally uses pplacer and HMMER)
  • Species tree estimation:
    • Statistical binning: https://github.com/smirarab/binning
    • ASTRAL: github.com/smirarab/ASTRAL
slide-41
SLIDE 41

Acknowledgments

  • Nam-Phuong Nguyen
  • Tandy Warnow’s lab:
  • Mike Nute
  • Mihai Pop’s lab:
  • Bo Liu
  • Rob Knight’s lab
  • Daniel McDonald
  • Mirarab lab
  • Uyen Mai