SLIDE 1 Population Genomics
Rob Edwards San Diego State University
Image: Lisa Brown for National Public Radio
SLIDE 2
GOM: 41 samples, 13 sites, 5 years
SAR: 1 sample, 1 site, 1 year
BBC: 85 samples, 38 sites, 8 years
ARC: 56 samples, 16 sites, 1 year
LI: 4 sites, 1 year
Phages in the World's Oceans
SLIDE 3
Most Marine Phage Sequences are Novel
SLIDE 4
- 6% of SAR sequences are ssDNA phage (Chlamydia-like Microviridae)
- 40% of viral particles in SAR are ssDNA phage
- Several full-genome sequences were recovered via de novo assembly of these fragments
- Confirmed by PCR and sequencing
Marine Single-Stranded DNA Viruses
SLIDE 5 12,297 sequence fragments hit using TBLASTX over a ~4.5 kb genome
Figure tracks: individual sequence reads; Chlamydia phi 4 genome; coverage; concatenated hits; Chl4 ORF calls
SAR metagenome and Chlamydia φ4
SLIDE 6
The phage proteomic tree
SLIDE 7
– Degenerate primers that amplify T7 DNA polymerase
– Tested samples from around the world
– Tested by different investigators in different laboratories
Signature sequences
Mya Breitbart
SLIDE 8 Breitbart et al, FEMS Micro Lett
~10^26 copies of each sequence on the planet = 60 metric tons of this DNA sequence
T7 phages are globally distributed
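As a rough consistency check on the mass figure, the copy number (read as ~10^26) and the 60 metric tons together imply a fragment length. The sketch below assumes double-stranded DNA at about 650 g/mol per base pair, an approximation that is not stated on the slide, and the function name is only illustrative.

```python
AVOGADRO = 6.02214076e23     # molecules per mole
BP_MASS_G_PER_MOL = 650.0    # assumed average mass of one dsDNA base pair


def implied_length_bp(total_mass_g: float, copies: float) -> float:
    """Length (bp) each copy must have for `copies` copies to weigh `total_mass_g`."""
    mass_per_copy_g = total_mass_g / copies
    mass_per_bp_g = BP_MASS_G_PER_MOL / AVOGADRO
    return mass_per_copy_g / mass_per_bp_g


# 60 metric tons = 60 * 1e6 g; copy number taken as ~1e26
length = implied_length_bp(60 * 1e6, 1e26)
print(f"implied fragment length: {length:.0f} bp")
```

Under these assumptions the two numbers are consistent with a fragment of a few hundred base pairs.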
SLIDE 9 Phage Proteomic Tree v. 5 (Edwards, Rohwer)
ssDNA λ-like T7-like T4-like
Some phages are everywhere
SLIDE 10
Phage P4 – 11kb, 10 ORFs
Blue – individual sequence reads in a metagenome
Green – coverage across genome
Compare viruses to all metagenomes
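The coverage track in a plot like this can be derived from the positions of the individual read hits. A minimal sketch, assuming hits are given as 0-based, half-open (start, end) intervals on the reference genome; the function name and the toy intervals are hypothetical and not taken from the slides.

```python
from typing import List, Tuple


def coverage(genome_length: int, hits: List[Tuple[int, int]]) -> List[int]:
    """Per-base depth from 0-based, half-open (start, end) hit intervals."""
    # Difference array: +1 where a hit starts, -1 just past where it ends.
    diff = [0] * (genome_length + 1)
    for start, end in hits:
        start = max(0, start)
        end = min(genome_length, end)
        if start < end:
            diff[start] += 1
            diff[end] -= 1
    # A running sum of the difference array gives the depth at each position.
    depth, running = [], 0
    for delta in diff[:genome_length]:
        running += delta
        depth.append(running)
    return depth


# Toy example: three overlapping hits on a 10 nt "genome"
cov = coverage(10, [(0, 4), (2, 6), (5, 9)])
print(cov)
```

The difference-array approach keeps the cost linear in genome length plus number of hits, which matters when many metagenomes are compared against the same genome.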
SLIDE 11
P4 phage genome # metagenome hits
Parts of viruses are everywhere
SLIDE 12
Unknown genes Known genes Viral Microbial
Viruses have lots of unknown genes
SLIDE 13
Bas Dutilh
SLIDE 14 cross Assembly
metagenome 1 metagenome 2 metagenome 3 metagenome 4
SLIDE 15 cross Assembly
Assembly metagenome 1 metagenome 2 metagenome 3 metagenome 4
SLIDE 16 cross Assembly
http://edwards.sdsu.edu/crass/
Contigs directly represent the overlap between samples
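One way to read "overlap between samples" off a cross-assembly is to record, for every contig, which metagenomes contributed reads to it. The sketch below assumes the assembler output has already been reduced to (contig, sample) pairs, one per read; it illustrates the idea and is not the crAss tool's implementation. All names and data are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Set, Tuple


def samples_per_contig(assignments: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Map each contig to the set of samples that contributed at least one read."""
    contributors: Dict[str, Set[str]] = defaultdict(set)
    for contig, sample in assignments:
        contributors[contig].add(sample)
    return dict(contributors)


def sharing_histogram(contributors: Dict[str, Set[str]]) -> Counter:
    """Count contigs by the number of samples contributing reads to them."""
    return Counter(len(samples) for samples in contributors.values())


# Toy read assignments: (contig, metagenome the read came from)
reads = [
    ("contig1", "metagenome 1"), ("contig1", "metagenome 2"),
    ("contig1", "metagenome 1"), ("contig2", "metagenome 3"),
    ("contig3", "metagenome 1"), ("contig3", "metagenome 2"),
    ("contig3", "metagenome 4"),
]
contributors = samples_per_contig(reads)
histogram = sharing_histogram(contributors)
print(histogram)
```

Contigs built from a single sample are sample-specific; contigs with reads from several samples are the shared entities.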
SLIDE 17 Reyes et al. Nature 2010
HMP viruses
SLIDE 18 Microbes Phages
Phages are more variable than microbes
Reyes et al. Nature 2010
Functions present in samples
SLIDE 19
[Plot: number of contigs (log scale, 1 to 10,000) against number of samples contributing reads to contig (1 to 12)]
6,988 de novo cross-contigs
Reyes et al. Nature 2010
De novo assembly HMP data
SLIDE 20 Big data – microbiome style
F1M F1T1 F1T2 F2M F2T1 F2T2 F3M F3T1 F3T2 F4M F4T1 F4T2
Average depth → Samples →
SLIDE 21
Complete crAssphage genome
SLIDE 22
SLIDE 23
Complete crAssphage genome
SLIDE 24 How big is the chimerization problem?
Assembly algorithms include “chimera protection”
- Break contigs at ambiguities
contig1 contig2 contig4 contig5 contig3
Investigate the effect of chimerization:
- Use different assembly parameters and assess results
- High stringency → few chimeras
- Low stringency → many chimeras
SLIDE 25 What are chimeras?
Chimerization is more frequent between closely related strains
Aziz et al. NAR 2010
Venus the chimeric cat
https://www.facebook.com/VenusTheAmazingChimeraCat https://twitter.com/Venustwofacecat
What are intra-phyla chimeras???
SLIDE 26 What are chimeras?
Chimerization is more frequent between closely related strains
- Similar sequences
- What are intra-phyla chimeras???
Aziz et al. NAR 2010
Evolutionarily conserved entities!
abundant and conserved enough to assemble
SLIDE 27 What is the host?
1) Sequence homology between phage and bacterial genes
2) Similarity in CRISPR spacers
3) Oligonucleotide usage profile
4) Co-occurrence across metagenomic samples
- Reads mapped from 152 fecal total community metagenomes
- Reads mapped to phages and bacteria
- Normalize; Spearman rank correlations; cluster
- crAssphage clusters with Bacteroidetes
- Just like two known Bacteroides phages B40-8 and B124-14
5) Plaques
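The co-occurrence approach (normalize, Spearman rank correlations, cluster) can be sketched as follows. This is a minimal illustration that assumes normalization means dividing mapped reads by each sample's total read count; the clustering step is replaced by simply picking the best-correlated candidate host. The counts and genome names are invented for the example.

```python
from math import sqrt
from typing import Dict, List, Sequence


def ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rho: Pearson correlation of the ranks (profiles must not be constant)."""
    rx, ry = ranks(x), ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / sqrt(vx * vy)


# Hypothetical mapped-read counts per genome across five samples
totals = [1000, 2000, 1500, 1200, 1800]      # total reads per sample
counts: Dict[str, List[int]] = {
    "phage":       [10, 80, 30, 18, 90],
    "bacterium_A": [50, 300, 120, 70, 400],
    "bacterium_B": [200, 100, 150, 190, 60],
}
# Normalize each count by the size of the sample it came from
relative = {g: [c / t for c, t in zip(cs, totals)] for g, cs in counts.items()}

rho = {g: spearman(relative["phage"], relative[g]) for g in ("bacterium_A", "bacterium_B")}
best_host = max(rho, key=rho.get)
print(rho, best_host)
```

Rank correlation is a reasonable choice here because abundances in metagenomes span orders of magnitude and are far from normally distributed.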
SLIDE 28
SLIDE 29
- Requires correct host strain
- Requires that the phage makes visible plaques
- Concentrations of Mg++, Ca++, etc
What is the host?
No PCR hits in at least 100 plaques isolated from 10 pooled viral preparations on Bacteroides fragilis and B. thetaiotaomicron lawns.
SLIDE 30 % Genome position: 0 – 97,065 nt
crAssphage found in intestines
Looked at 2,906 metagenomes
Only found in 940 metagenomes
SLIDE 31
crAssphage is abundant!
Abundance-ubiquity plot
SLIDE 32
- Present in 32.3% of sequenced environmental samples (940 / 2,906)
– Includes virus metagenomes and total community metagenomes
- >6x more abundant than all (1,192) other known phages combined
– Corrected for genome size
- Present in 73.4% of sequenced human fecal samples (342 / 466)
– 99.9% of all crAssphage reads were found in feces (significant)
- 1.68% of the reads in all human fecal metagenomes
- Estimate: ~6 crAssphage genomes per Bacteroides genome in your gut
- >90% of the reads in some of the virus metagenomes from the US twin study
- 24% of the reads in an unrelated virus metagenome from Korea
- 22% of the reads in total community metagenomes from USA (HMP data)
- Found on every continent (where we have data)
crAssphage by the numbers
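The prevalence percentages above follow directly from the sample counts, and "corrected for genome size" can be read as dividing read counts by genome length so that large genomes are not favoured. A small sketch under that assumption; the read counts are made up, the 97,065 nt and 11 kb lengths are the crAssphage and P4 sizes given on other slides, and the function names are illustrative.

```python
def prevalence_percent(positive: int, total: int) -> float:
    """Percentage of samples in which the genome was detected."""
    return 100.0 * positive / total


def reads_per_kb(read_count: int, genome_length_nt: int) -> float:
    """Read count corrected for genome size (reads per kilobase of genome)."""
    return read_count / (genome_length_nt / 1000.0)


all_samples = round(prevalence_percent(940, 2906), 1)   # all metagenomes examined
fecal_samples = round(prevalence_percent(342, 466), 1)  # human fecal metagenomes
print(all_samples, fecal_samples)

# Hypothetical: the same raw read count on a large and a small genome
large = reads_per_kb(9700, 97065)   # crAssphage-sized genome
small = reads_per_kb(9700, 11000)   # P4-sized genome
print(f"{large:.1f} vs {small:.1f} reads per kb")
```

Without the size correction, two phages with equal raw read counts would look equally abundant even though the smaller genome is present in many more copies.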
SLIDE 33 Virome reads mapping to viral database
Viral database vs crAssphage
Reyes et al. Nature 2010 Virome reads mapping to crAssphage
Unknown sequences
SLIDE 34
– Highly abundant in viral metagenomes size- and density-filtered for VLPs
– ORFs show similarity to bacteriophage and bacterial proteins (no conserved bacterial or archaeal metabolic genes)
– Phage-like modularity among functions
– Coding structure of the ORFs is typical of a phage genome
– Putative prokaryotic promoter patterns
– Genome detected in many metagenomes around the world
Potential caveats
SLIDE 35 Summary
- crAssphage is everywhere
- everyone has it (rounding up)
- we don't know what it does
SLIDE 36 metagenomics
- metagenomics 1.0: profiling
- metagenomics 2.0: population genomics
SLIDE 37 Tools for population genomics
- AbundanceBin
- CompostBin
- CONCOCT
- crAss
- GroopM
- MetaBAT
- mmgenome
SLIDE 38 Discussion points
- How many genomes would you expect in a population?
- More coverage versus more samples?
- Cutoffs for inclusion (e.g. GC, closeness, etc)
SLIDE 39 Discussion points
How do you know the contigs are from the same organism (genotype)?
– http://edwards.sdsu.edu/GenomePeek
– BLAST hits
– GC content or k-mer composition
– Single copy genes
– Abundance profiles across metagenomes
– Paired ends/mate pairs
– PCR
– PFGE and size comparisons
– SIP and metabolically active fraction
– Single cell genomics
– Culturing / genome sequencing
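For the GC content and k-mer composition criterion, a simple comparison of two contigs might look like the sketch below. It assumes tetranucleotide (k = 4) frequencies on the forward strand only and a Manhattan distance between profiles; binning tools often also merge reverse-complement k-mers and use other distance measures, so treat this as an illustration with hypothetical function names and toy sequences.

```python
from collections import Counter
from itertools import product
from typing import List


def gc_content(seq: str) -> float:
    """Fraction of G and C among the unambiguous (A, C, G, T) bases."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    return (seq.count("G") + seq.count("C")) / acgt if acgt else 0.0


def kmer_profile(seq: str, k: int = 4) -> List[float]:
    """Relative frequency of every A/C/G/T k-mer, in a fixed alphabetical order."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[m] for m in kmers)   # k-mers containing N etc. are ignored
    return [counts[m] / total if total else 0.0 for m in kmers]


def profile_distance(a: List[float], b: List[float]) -> float:
    """Manhattan distance between two k-mer profiles (0 means identical)."""
    return sum(abs(x - y) for x, y in zip(a, b))


# Toy contigs: one GC-rich, one AT-rich
contig_a = "ATGCGCGCATGCGCGC"
contig_b = "ATATATATTTAAATAT"
print(gc_content(contig_a), gc_content(contig_b))
print(profile_distance(kmer_profile(contig_a), kmer_profile(contig_b)))
```

Contigs with similar GC content and a small profile distance are candidates for the same genome, but composition alone cannot separate closely related strains, which is why the list pairs it with abundance profiles and other evidence.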