Data analysis of 16S rRNA gene amplicons Gerrit Botha Microbiome - - PowerPoint PPT Presentation

data analysis of 16s rrna gene amplicons
SMART_READER_LITE
LIVE PREVIEW

Data analysis of 16S rRNA gene amplicons Gerrit Botha Microbiome - - PowerPoint PPT Presentation

Data analysis of 16S rRNA gene amplicons Gerrit Botha Microbiome workshop University of the Cape Town, Cape Town, South Africa October 2017 This material is licensed under Creative Commons Attribution 4.0 International (CC BY 4.0) 1 Why 16S


slide-1
SLIDE 1

Data analysis of 16S rRNA gene amplicons

Gerrit Botha Microbiome workshop University of the Cape Town, Cape Town, South Africa October 2017

This material is licensed under Creative Commons Attribution 4.0 International (CC BY 4.0) 1

slide-2
SLIDE 2

Why 16S rRNA analysis?

To describe microbiota profiles, which can then be related to sample-specific characteristics e.g. how do gut microbiota profiles differ between obese and lean individuals.

Darryl Leja, National Human Genome Research Institute (CC BY 2.0) 2

slide-3
SLIDE 3

Introduction

  • The genes encoding the RNA component of the small subunit of ribosomes,

commonly known as the 16S rRNA in bacteria and archaea, are among the most conserved across all kingdoms of life.

  • They contain regions that are less evolutionarily constrained and whose

sequences are indicative of their phylogeny.

  • Amplification of these genomic regions by PCR from an environmental sample

and subsequent sequencing of a sufficiently large number of individual amplicons enables the analysis of the diversity of clades in the sample and a rough estimate

  • f their relative abundance.

3

slide-4
SLIDE 4

16S rRNA diversity vs shotgun analysis

From: Morgan XC, Huttenhower C (2012) Chapter 12: Human Microbiome Analysis. PLoS Comput Biol 8(12): e1002808. doi:10.1371/journal.pcbi.1002808

4

slide-5
SLIDE 5

5

Secondary structure of 16S rRNA gene

From: Pablo et al., 2014 (http://www.nature.com/nrmicro/journal/v12/n9/fig_tab/nrmicro3330_F1.html)

slide-6
SLIDE 6

6

Justin K et al., 2013

16S rRNA gene region

slide-7
SLIDE 7

Sanger sequencing

  • Good quality
  • Long reads, up to 1000 bp
  • By comparison very little data
  • Up to 384 sequences read in parallel
  • Expensive per base pair

From: https://en.wikipedia.org/wiki/Sanger_sequencing

7

slide-8
SLIDE 8

454 and IonTorrent

  • Based on pyrosequencing
  • About 1 million sequences in parallel
  • 450bp, 700bp or longer
  • Expensive chemicals
  • More errors compared with Sanger
  • Homopolymers
  • Ion Torrent is similar
  • Shorter reads (100bp - 400bp)
  • Much less expensive

From: Rothberg & Leamon, Nature Biotechnology 26, 1117 - 1124 (2008)

8

slide-9
SLIDE 9

Illumina

  • Quite short reads: 2x80-150bp (HiSeq),

2x300bp (MiSeq)

  • 3 billion reads per flow cell (8 lanes;

HiSeq), 20 million reads (MiSeq)

  • Much cheaper than 454
  • More error compared both with Sanger and

454 ○ Error distribution different to 454, single base pairs more frequent, homopolymer errors rare From: https://www.illumina.com/

9

slide-10
SLIDE 10

PacBio SMRT

  • Single Molecule Real-Time Sequencing

nanotechnology: ○ Observe a single nucleotide being added to polymer (fluorescence)

  • Long reads: 3 kb average
  • 90 Mb per run
  • Often combined with Illumina

From: http://www.pacificbiosciences.com

10

slide-11
SLIDE 11

Sequence technology comparisons

From: http://www.molecularecologist.com/next-gen-fieldguide-2016

Instrument Amplification Run time Bases / read bp/run cost/Gb Applied Biosystems 3730 (capillary) PCR, cloning 2 hrs. 650 62,400 $2,307,692.31 454 GS Jr. Titanium emPCR 10 hrs. 400 50,000,000 $19,540.00 454 FLX+ emPCR 20 hrs. 650 650,000,000 $9,538.46 Illumina GA IIx - v5 PE bridgePCR 14 days 288 184,320,000,000 $97.54 Illumina MiSeq v3 bridgePCR 55 hrs. 600 13,200,000,000 $109.24 Illumina HiSeq 2500 - high output v3 BridgePCR 11 days 200 300,000,000,000 $45.27 Illumina HiSeq X (2 flow cells) BridgePCR 3 days 300 1,800,000,000,000 $7.08 Ion Torrent – PGM 318 chip emPCR 7.3 hrs. 400 1,900,000,000 $460.00 Ion Torrent - Proton III (forecast) emPCR 6 hrs. 175 87,500,000,000 $11.43 Life Technologies SOLiD – 5500xl emPCR 8 days 110 155,100,000,000 $67.72 Pacific Biosciences RS II None - SMS 2 hrs. 3000 90,000,000 $1,111.11 Oxford Nanopore MinION (forecast) None - SMS ≤6 hrs. 9000 900,000,000 $1,000.00 Oxford Nanopore GridION 8000 (forecast) None - SMS varies 10000 100,000,000,000 $10.00 paired-end sum

11

slide-12
SLIDE 12

Input file formats

  • SFF (standard flowgram format) - 454
  • Fastq - Illumina
  • BAM (binary of sequence alignment map) – Ion Torrent
  • Metadata in tabular format that can be used for downstream analysis

12

slide-13
SLIDE 13
  • FASTA - Contains sequences and their names
  • FASTQ – FASTA with quality scores
  • SAM – Sequence alignment / map format
  • BAM – Binary sequence alignment / map format
  • GFF3 – Gene feature format
  • Bed – Browser extensible data format, basic genome interval
  • BIOM - Biological observation matrix format

File formats

Fastq SAM VCF BED GFF3 BIOM

slide-14
SLIDE 14

Sample pooling

  • Twelve base error-correcting barcodes allow hundreds of samples per run.

From: Micah Hamady, et al., Nature Methods, 2008. Error-correcting barcodes for pyrosequencing hundreds of samples in multiplex

14

slide-15
SLIDE 15

Fastq format

  • Line 1 begins with a '@' character and is followed by a sequence identifier and an
  • ptional description (like a FASTA title line).
  • Line 2 is the raw sequence letters.
  • Line 3 begins with a '+' character and is optionally followed by the same sequence

identifier (and any description) again.

  • Line 4 encodes the quality values for the sequence in Line 2, and must contain

the same number of symbols as letters in the sequence From: http://en.wikipedia.org/wiki/FASTQ_format

15

slide-16
SLIDE 16

Phred score

  • Used to describe the quality of the output and present this by an integer value.

The larger the value the more confidence there can be in the output.

  • The probability a base is called incorrectly =
  • Q = 10, The probability that the call is wrong = 0.1
  • Q = 20, The probability that the call is wrong = 0.01
  • Q = 30, The probability that the call is wrong = 0.001
  • For Illumina a > 30Q base call quality value is good

16

slide-17
SLIDE 17

Phred score

From: http://en.wikipedia.org/wiki/FASTQ_format

17

slide-18
SLIDE 18

16S rRNA diversity analysis workflow

From: http://h3abionet.org/tools-and-resources/sops/16s-rrna-diversity-analysis

18

slide-19
SLIDE 19

Preprocessing of input reads

  • Demutliplexing
  • Check for short read lengths
  • Remove low quality bases
  • Remove adapters and primers
  • Remove chimeric sequences
  • Merging of reads

19

slide-20
SLIDE 20

OTU picking

  • An operational taxonomic unit is an
  • perational definition of a species or group of

species often used when only DNA sequence data is available.

  • Clusters are formed based on sequence

identity.

  • 3 different approaches
  • De novo OTU picking
  • Closed reference OTU picking
  • Open reference picking
  • A representative sequence is selected for

downstream analysis.

20

slide-21
SLIDE 21

Classification

  • A taxonomic identity is assigned to each

representative sequence.

  • Classification are done against three main

reference databases with aligned, validated and annotated 16S rRNA genes: GreenGenes, Ribosomal Database Project (RDP) and Silva.

21

slide-22
SLIDE 22

Alignment

  • Alignment is the first step in generating a

phylogenetic tree to understand the evolutionary relationship between samples.

  • The aligners of choice are those that does

alignments to a template (secondary structure) of the 16S sequence.

22

slide-23
SLIDE 23

Create a phylogenetic tree

  • The phylogenetic tree represents the

relationship between the sequences in terms

  • f the evolutionary distance from a common

ancestor.

  • In downstream analysis this tree is used for

example in calculating the UniFrac distances and some alpha diversity measures.

23

slide-24
SLIDE 24

Determine alpha diversity

  • Alpha diversity is a measure of diversity

within a sample.

  • It gives an indication of richness and/or

evenness of species present in a sample.

  • Rarefaction analysis is required to

understand the actual diversity within a sample and to determine if your sequencing effort is sufficient and if the total diversity within the sample has been captured.

24

slide-25
SLIDE 25

Determine beta diversity

  • Beta diversity is a measure of diversity

between samples.

  • One of the most commonly used metrics is

the UniFrac distance that compares samples using phylogenetic information. An all-by all

  • r pairwise matrix of the beta diversity metrics

between all the samples in the study is generated and can be visualised in different ways such as a tree, graph, network,

  • rdination methods.

25

slide-26
SLIDE 26

Other statistical analysis

  • Additional statistical tests between samples
  • r groups of samples can be done in QIIME,

and R using

  • OTU tables
  • metadata
  • phylogenetic info

26

slide-27
SLIDE 27

Important ecological concepts

  • How is biodiversity defined and measured?
  • Component of biodiversity: richness, relative abundance, evenness
  • Species richness: number of different species in a habitat/sample
  • Species relative abundance: number of each species relative to total number of all

species in a sample (number of reads per OTU in a sample relative to total number of reads in that sample)

  • Species evenness: how close in numbers each species in an environment are

27

slide-28
SLIDE 28

Diversity levels

  • Alpha diversity: diversity within a habitat sample
  • Beta diversity: diversity between samples
  • Gamma diversity: total diversity in a landscape

From: http://www.webpages.uidaho.edu/veg_measure/Modules/Lessons/Module%209%28 Composition&Diversity%29/9_2_Biodiversity.htm Problem: Doesn’t take abundance of each species OR relatedness of each species into account

28

slide-29
SLIDE 29

Abundance and phylogeny adjustment

29

slide-30
SLIDE 30

Alpha metrics: estimate richness

  • Observed species: count of unique OTUs in a sample
  • Chao1: how likely it is there are more undiscovered species

Sobs = number of species in the sample, F1 = number of singletons (number of species appear once in the sample) F2 = is the number of doubletons (number of species appear twice in the sample). Central concept is that if rare species (singletons) are still being discovered when sampling a community then there is probably more rare species yet to be found. If all species have been found at least twice (doubleton) then it is less likely new species still to be discovered.

  • Both measure richness (number of species)
  • Richness does not take the abundances of the types into account, it is not the

same thing as diversity

  • May be useful for judging “completeness” of sampling, i.e. is sample

size/sequencing depth enough to capture all species?

30

slide-31
SLIDE 31

Alpha metrics: estimate diversity

  • Shannon diversity index:
  • Shannon-Weaver, Shannon-Wiener, or Shannon Index
  • Complicated computation: Information Theory (other metric use this: Brillouin Indices)
  • Shannon Diversity index (H) characterizes species diversity and accounts for abundance and evenness of the species.
  • Shannon equitability index (EH) is a measure of evenness. If S is the number of observed species, then EH = H/ln (S)
  • Simpson diversity index:
  • Simpsons Diversity Index = 1-l, Value between 0 and 1. 0 = no diversity, 1 = infinite diversity
  • Simpson Index:

N= total number of individuals of all species, n = total number of individuals for each species

  • Simpsons reciprocal = 1/I
  • Probability of 2 individuals being conspecifics if drawn randomly from an infinitely large community
  • Simple computation: measures species dominance (weighted towards abundance of most common species) (other metrics: McIntosh, and

Berger-Parker)

  • Total species richness is downweighted relative to evenness
  • Both indices estimate diversity (richness, abundance and evenness)
  • Simpson diversity less sensitive to richness and more sensitive to evenness than

Shannon diversity

31

slide-32
SLIDE 32

Rarefaction

  • Collector’s curves
  • NGS: individuals = reads
  • Evaluate sample size: is sequencing

depth enough?

  • Comparing the richness and diversity
  • bserved in different samples

50 individuals 250 individuals 500 individuals 2 species 4 species 8 species

  • A. R. E. Sinclair, Simon A. R. Mduma, Peter Arcese. Proc. R. Soc. Lond. B 2002 269

2401-2405; DOI: 10.1098/rspb.2002.2116. Published 7 December 2002.

32

slide-33
SLIDE 33

Non-phylogenetic beta diversity:

  • Bray–Curtis dissimilarity: based on species abundance or count data
  • 0 < BC > 1
  • 0 = identical, two sites have all the same species
  • 1 = two sites do not share any species
  • NB: Not a distance
  • Jaccard index: dissimilarity measure for presence–absence data (species present
  • r absent)

33

slide-34
SLIDE 34

Phylogenetic beta diversity: UniFrac distance

34

slide-35
SLIDE 35

UniFrac

  • Raw unweighted UniFrac: sum of branch length that is unique to one environment
  • r the other
  • Raw weighted UniFrac: Branch lengths are weighted by the relative abundance of

sequences

li is the branch length between node i and its parent, and Ai and Bi are indicators equal to 0 or 1 as descendants of node i are absent or present in communities A and B respectively

  • Normalised weighted Unifrac: takes abundance and normalises branch length
  • rapidly evolving lineages (with long branch length can skew unifrac)

A = red, B= blue, branches in common are purple, branches unique to A are red and unique to B are blue. Presence/absence metric.

35

slide-36
SLIDE 36

Principal Coordinate Analysis (PCoA)

  • Also sometimes called classical MDS (multidimensional scaling).
  • Usually constructed using non-Euclidean distance measures (in this case the

UniFrac distance matrix).

  • Represent distance between samples graphically in multidimensional space (n-1

dimension, n = number samples).

  • The goal is to decrease the dimensionality in the data set while preserving the

major patterns of variation in the data set

  • Samples now represented on 2D or 3D plot with these new variables as axes and

the relationship between the sample on the plot should reflect their underlying distance.

  • Ordinates data on plot so that axis 1 (PC1) explains the greatest amount of

variance, axis 2 (PC2) explains the next greatest amount of variance, etc.

  • NMDS is different in two ways (1) the ranks of the distances are used instead of

the actual distance values and (2) the solution it outputs is not unique.

36

slide-37
SLIDE 37
  • Databases may differ in:
  • GreenGenes (archael, bacterial), Silva (archael, bacterial, eukaryotic), RDP

(archael, bacterial)

  • Coverage or number of sequences
  • Taxonomic classification of sequences
  • Frequency of updating
  • Compatibility of data with choice of analysis platform
  • Options

1. Choose same database throughout your study that is compatible with the tools you are using and/or same database used in other studies if you want to compare. 2. Compare analysis using different databases. 3. Choose database of well curated/classified sequences specific to your environment.

Which 16S database?

37

slide-38
SLIDE 38

Available pipelines

  • UPARSE: http://www.drive5.com/uparse
  • IM Tornado: https://github.com/pjeraldo/imtornado2
  • QIIME: http://qiime.org
  • Mothur: http://www.mothur.org
  • FROGS: https://github.com/geraldinepascal/FROGS
  • DADA2: https://github.com/benjjneb/dada2
  • VSEARCH: https://github.com/torognes/vsearch

38

slide-39
SLIDE 39

For practical

  • https://github.com/grbot/16SrRNA-hex-tutorial (contains upstream processing and

downstream R analysis tutorials)

39

slide-40
SLIDE 40

Acknowledgements

  • Andrew Lewis and Timothy Carr for managing all the compute resources on Hex
  • H3ABioNet for providing the training datasets.
  • Tracy Meiring for the slides on ecology principles.

40