Phylogenetic methods for taxonomic profiling
Siavash Mirarab
University of California at San Diego (UCSD)
Joint work with Tandy Warnow, Nam-Phuong Nguyen, Mike Nute, Mihai Pop, and Bo Liu
Phylogeny reconstruction pipeline

Sequencing samples, then bioinformatic processing, yield unaligned sequences for each gene (gene 1, gene 2, ..., gene 1000).

Step 1: Multiple sequence alignment (one MSA per gene)
Methods: PASTA, UPP

Step 2: Species tree reconstruction
Approach 1: Concatenation (supermatrix, then phylogeny inference)
Approach 2: Summary methods (gene tree estimation for each gene, then a summary method)
Methods: ASTRAL, statistical binning

Step 3: Phylogenetic placement (new sequences are placed on the gene trees)
Methods: SEPP, TIPP

[Figure: pipeline diagram with example alignments and trees on Orangutan, Gorilla, Chimpanzee, and Human]
Fragmentary metagenomic reads (e.g., ACCG, CGAG, CGGT, GGCT, TAGA, GGAG, GcTT, ACCT)
A reference dataset of full-length sequences with an alignment and a tree (e.g., Vibrio cholerae, Yersinia pestis, Salmonella enterica, Salmonella bongori, Escherichia coli)
Place each fragmentary read independently on a reference tree of known sequences
including sequences from known species
backbone alignment
phylogenetic tree after (locally) aligning the query to the reference
— Placement: pplacer (Matsen) and EPA (RAxML)
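The placement idea can be illustrated with a minimal sketch, assuming the query fragment has already been aligned to the backbone columns. The snippet scores the fragment against each reference leaf by the fraction of mismatches over the columns the fragment covers and reports the closest leaf. This is a deliberately simplified, distance-based stand-in and not the likelihood-based placement that pplacer or EPA perform; the function names and the toy sequences are hypothetical.

```python
def mismatch_fraction(query, reference):
    """Fraction of mismatching columns, counted only where both rows have a base."""
    compared = mismatches = 0
    for q, r in zip(query, reference):
        if q == "-" or r == "-":
            continue  # skip columns that the fragment or the reference does not cover
        compared += 1
        if q != r:
            mismatches += 1
    # With no comparable column there is no signal; treat the pair as maximally distant.
    return mismatches / compared if compared else 1.0


def closest_leaf(aligned_query, backbone):
    """Return (leaf_name, distance) for the backbone row nearest to the query."""
    scores = {name: mismatch_fraction(aligned_query, seq) for name, seq in backbone.items()}
    best = min(scores, key=scores.get)
    return best, scores[best]


# Hypothetical backbone alignment; all rows have the same length.
backbone = {
    "Vibrio_cholerae":     "ACTGCACACCG",
    "Yersinia_pestis":     "ACTGC-CCCCG",
    "Salmonella_enterica": "AATGC-CCCCG",
}
# A short fragment already aligned to the backbone; gaps mark uncovered columns.
fragment = "-ATGC-CC---"
print(closest_leaf(fragment, backbone))
```

A real placement tool evaluates branches of the tree (not only leaves) under a substitution model, so the sketch only conveys the shape of the problem: one independent decision per read against a fixed reference.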
[Figure: x-axis is increasing rate of evolution]
Step 1: Align each query sequence to the backbone alignment
HMM to improve accuracy.
that each model better captures details of a part of a tree
Step 2: Place each query sequence into the backbone tree, using extended alignment
scalability to reference trees with tens of thousands of leaves
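The idea that each model captures a part of the tree can be sketched as follows, under strong simplifying assumptions. Instead of profile HMMs, each subset of the backbone is summarized by per-column base frequencies with a pseudocount, and a query that is assumed to be already aligned is assigned to the subset with the highest log-probability. The names `column_profile`, `log_score`, `best_subset`, and the clade labels are hypothetical and are not part of SEPP.

```python
import math
from collections import Counter

BASES = "ACGT"


def column_profile(rows, pseudocount=1.0):
    """Per-column base probabilities for aligned rows; gaps are ignored."""
    profile = []
    for column in zip(*rows):
        counts = Counter(base for base in column if base in BASES)
        total = sum(counts.values()) + pseudocount * len(BASES)
        profile.append({b: (counts[b] + pseudocount) / total for b in BASES})
    return profile


def log_score(aligned_query, profile):
    """Sum of log-probabilities over the columns that the query covers."""
    return sum(math.log(col[b]) for b, col in zip(aligned_query, profile) if b in BASES)


def best_subset(aligned_query, subsets):
    """Score the query against every subset model and return the best one with all scores."""
    scores = {name: log_score(aligned_query, column_profile(rows))
              for name, rows in subsets.items()}
    return max(scores, key=scores.get), scores


# Hypothetical subsets of a backbone alignment, one per part of the tree.
subsets = {
    "clade_A": ["ACGTACGT", "ACGTACGA"],
    "clade_B": ["TTGTCCGT", "TTGTCCGA"],
}
best, scores = best_subset("ACGT----", subsets)
print(best, scores)
```

A profile HMM additionally models insertions and deletions and produces the alignment itself, which this frequency-only sketch does not attempt.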
Simulations: 16S bacteria, 13k curated backbone tree, 13k fragments
[Figure panels: Running time, Memory]
Real data (with Rob Knight’s lab; Daniel McDonald):
reference tree with 203,452 sequences 8 hours (16 cores)
reference tree with 203,452 sequences 10 minutes (16 cores)
collection of marker genes, each including sequenced species
be compatible with the taxonomy (not necessary).
reads from many species with different abundances
Genus            %
Pseudomonas      16.6
Campylobacter    8.9
Streptomyces     7.6
Pasteurella      6.4
Clostridium      5.1
Alcanivorax      4.5
…
unclassified     1.2

Phylum           %
Proteobacteria   63.1
Actinobacteria   9.6
Firmicutes       9.6
Euryarchaeota    7.6
Cyanobacteria    4.5
Crenarchaeota    3.8
…
unclassified     0.0
Step 1: Map fragments to ~30 “marker” genes using BLAST
Step 2: Use SEPP to place reads on the marker trees
placements on the tree (to reach a predefined level of statistical support)
Step 3: Summarize results across genes to get a taxonomic profile
it proportionally to the probability that it belongs to that branch
Nguyen et al., Bioinformatics (2014)
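Steps 2 and 3 can be sketched in a few lines, assuming each read comes with a list of candidate placements and their probabilities. The sketch keeps the most probable placements until a support threshold is reached, then splits the read across them in proportion to their probabilities and reports percentages. The 0.95 threshold, the use of taxon labels in place of tree branches, and the function names are assumptions made for illustration, not TIPP's actual interface.

```python
from collections import defaultdict


def supported_placements(placements, support=0.95):
    """Keep the most probable placements until their cumulative probability reaches `support`."""
    kept, cumulative = [], 0.0
    for taxon, prob in sorted(placements, key=lambda p: p[1], reverse=True):
        kept.append((taxon, prob))
        cumulative += prob
        if cumulative >= support:
            break
    return kept


def taxonomic_profile(reads, support=0.95):
    """Split each read across its kept placements in proportion to their probabilities."""
    weights = defaultdict(float)
    for placements in reads:
        kept = supported_placements(placements, support)
        total = sum(prob for _, prob in kept)
        for taxon, prob in kept:
            weights[taxon] += prob / total  # each read contributes a total weight of 1
    n_reads = len(reads)
    return {taxon: 100.0 * weight / n_reads for taxon, weight in weights.items()}


# Hypothetical placement probabilities for two reads.
reads = [
    [("Pseudomonas", 0.97), ("Campylobacter", 0.03)],
    [("Pseudomonas", 0.50), ("Streptomyces", 0.50)],
]
profile = taxonomic_profile(reads)
print(profile)
```

The first read is assigned entirely to its single well-supported placement, while the second is shared equally between two placements, so ambiguous reads still contribute to the profile without being forced into one taxon.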
from a single gene
and trees with up to hundreds of thousands of leaves
length sequences (e.g., 1000) as backbone.
(e.g., using PASTA)
remaining sequences into the reference
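The backbone selection step can be sketched as below, under stated assumptions. Here "full length" is taken to mean within 25% of the median sequence length, and the backbone is a random sample of those sequences; both the tolerance and the function name `select_backbone` are assumptions for illustration. Aligning the backbone (e.g., using PASTA) and adding the remaining sequences are outside the scope of the sketch.

```python
import random
import statistics


def select_backbone(sequences, size=1000, tolerance=0.25, seed=0):
    """Split sequences into a backbone sample of near-full-length sequences and the rest."""
    median_len = statistics.median(len(seq) for seq in sequences.values())
    # Assumed definition of full length: within `tolerance` of the median length.
    full_length = [name for name, seq in sequences.items()
                   if abs(len(seq) - median_len) <= tolerance * median_len]
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    backbone = set(rng.sample(full_length, min(size, len(full_length))))
    queries = [name for name in sequences if name not in backbone]
    return sorted(backbone), queries


# Hypothetical input: three near-full-length sequences and one short fragment.
sequences = {
    "a": "ACGT" * 10,
    "b": "ACGA" * 10,
    "c": "ACGT" * 9 + "AC",
    "frag": "ACGT",
}
backbone, queries = select_backbone(sequences, size=2)
print(backbone, queries)
```

Fragments never enter the backbone in this sketch, so they are always handled as query sequences to be added afterwards.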
[Figure: iterative alignment cycle on example taxa A, B, C, D, E: estimate tree, decompose dataset, align subproblems, merge sub-alignments]
reference tree
                     unweighted         weighted
Dataset              GG      PASTA      GG      PASTA
88 soils             0.78    0.78       0.75    0.74
infant-time-series   0.55    0.55       0.37    0.42
moving pictures      728     724        2188    2439
global gut           52.9    51.1       79      72
(16 cores) 99% tree (203,452 leaves): 49 hours
With Daniel McDonald (Knight’s lab) and Uyen Mai
(internally uses FastTree, MAFFT, HMMER, and OPAL)
(internally uses pplacer and HMMER)
(internally uses HMMER)
(internally uses pplacer and HMMER)