Data analysis of 16S rRNA gene amplicons Gerrit Botha Microbiome - PowerPoint PPT Presentation

Data analysis of 16S rRNA gene amplicons Gerrit Botha Microbiome workshop University of the Cape Town, Cape Town, South Africa October 2017 This material is licensed under Creative Commons Attribution 4.0 International (CC BY 4.0) 1

Why 16S rRNA analysis? To describe microbiota profiles, which can then be related to sample-specific characteristics e.g. how do gut microbiota profiles differ between obese and lean individuals. Darryl Leja, National Human Genome Research Institute (CC BY 2.0) 2

Introduction ● The genes encoding the RNA component of the small subunit of ribosomes, commonly known as the 16S rRNA in bacteria and archaea, are among the most conserved across all kingdoms of life. ● They contain regions that are less evolutionarily constrained and whose sequences are indicative of their phylogeny. ● Amplification of these genomic regions by PCR from an environmental sample and subsequent sequencing of a sufficiently large number of individual amplicons enables the analysis of the diversity of clades in the sample and a rough estimate of their relative abundance. 3

16S rRNA diversity vs shotgun analysis From: Morgan XC, Huttenhower C (2012) Chapter 12: Human Microbiome Analysis. PLoS Comput Biol 8(12): e1002808. doi:10.1371/journal.pcbi.1002808 4

Secondary structure of 16S rRNA gene From: Pablo et al., 2014 (http://www.nature.com/nrmicro/journal/v12/n9/fig_tab/nrmicro3330_F1.html) 5

16S rRNA gene region Justin K et al., 2013 6

Sanger sequencing ● Good quality ● Long reads, up to 1000 bp ● By comparison very little data ● Up to 384 sequences read in parallel ● Expensive per base pair From: https://en.wikipedia.org/wiki/Sanger_sequencing 7

454 and IonTorrent ● Based on pyrosequencing ● About 1 million sequences in parallel ● 450bp, 700bp or longer ● Expensive chemicals ● More errors compared with Sanger ● Homopolymers ● Ion Torrent is similar ● Shorter reads (100bp - 400bp) ● Much less expensive From: Rothberg & Leamon, Nature Biotechnology 26, 1117 - 1124 (2008) 8

Illumina ● Quite short reads: 2x80-150bp (HiSeq), 2x300bp (MiSeq) ● 3 billion reads per flow cell (8 lanes; HiSeq), 20 million reads (MiSeq) ● Much cheaper than 454 ● More error compared both with Sanger and 454 ○ Error distribution different to 454, single base pairs more frequent, homopolymer errors rare From: https://www.illumina.com/ 9

PacBio SMRT ● Single Molecule Real-Time Sequencing nanotechnology: ○ Observe a single nucleotide being added to polymer (fluorescence) ● Long reads: 3 kb average ● 90 Mb per run ● Often combined with Illumina From: http://www.pacificbiosciences.com 10

Sequence technology comparisons Instrument Amplification Run time Bases / read bp/run cost/Gb Applied Biosystems 3730 (capillary) PCR, cloning 2 hrs. 650 62,400 $2,307,692.31 454 GS Jr. Titanium emPCR 10 hrs. 400 50,000,000 $19,540.00 454 FLX+ emPCR 20 hrs. 650 650,000,000 $9,538.46 Illumina GA IIx - v5 PE bridgePCR 14 days 288 184,320,000,000 $97.54 Illumina MiSeq v3 bridgePCR 55 hrs. 600 13,200,000,000 $109.24 Illumina HiSeq 2500 - high output v3 BridgePCR 11 days 200 300,000,000,000 $45.27 Illumina HiSeq X (2 flow cells) BridgePCR 3 days 300 1,800,000,000,000 $7.08 Ion Torrent – PGM 318 chip emPCR 7.3 hrs. 400 1,900,000,000 $460.00 Ion Torrent - Proton III (forecast) emPCR 6 hrs. 175 87,500,000,000 $11.43 Life Technologies SOLiD – 5500xl emPCR 8 days 110 155,100,000,000 $67.72 Pacific Biosciences RS II None - SMS 2 hrs. 3000 90,000,000 $1,111.11 Oxford Nanopore MinION (forecast) None - SMS ≤6 hrs. 9000 900,000,000 $1,000.00 Oxford Nanopore GridION 8000 (forecast) None - SMS varies 10000 100,000,000,000 $10.00 paired-end sum From: http://www.molecularecologist.com/next-gen-fieldguide-2016 11

Input file formats ● SFF (standard flowgram format) - 454 ● Fastq - Illumina ● BAM (binary of sequence alignment map) – Ion Torrent ● Metadata in tabular format that can be used for downstream analysis 12

Fastq SAM File formats VCF ● FASTA - Contains sequences and their names ● FASTQ – FASTA with quality scores ● SAM – Sequence alignment / map format ● BAM – Binary sequence alignment / map format ● GFF3 – Gene feature format BIOM ● Bed – Browser extensible data format, basic genome interval ● BIOM - Biological observation matrix format GFF3 BED

Sample pooling ● Twelve base error-correcting barcodes allow hundreds of samples per run. From: Micah Hamady, et al., Nature Methods, 2008. Error-correcting barcodes for pyrosequencing hundreds of samples in multiplex 14

Fastq format ● Line 1 begins with a '@' character and is followed by a sequence identifier and an optional description (like a FASTA title line). ● Line 2 is the raw sequence letters. ● Line 3 begins with a '+' character and is optionally followed by the same sequence identifier (and any description) again. ● Line 4 encodes the quality values for the sequence in Line 2, and must contain the same number of symbols as letters in the sequence From: http://en.wikipedia.org/wiki/FASTQ_format 15

Phred score ● Used to describe the quality of the output and present this by an integer value. The larger the value the more confidence there can be in the output. ● The probability a base is called incorrectly = ● Q = 10, The probability that the call is wrong = 0.1 ● Q = 20, The probability that the call is wrong = 0.01 ● Q = 30, The probability that the call is wrong = 0.001 ● For Illumina a > 30Q base call quality value is good 16

Phred score From: http://en.wikipedia.org/wiki/FASTQ_format 17

16S rRNA diversity analysis workflow 18 From: http://h3abionet.org/tools-and-resources/sops/16s-rrna-diversity-analysis

Preprocessing of input reads ● Demutliplexing ● Check for short read lengths ● Remove low quality bases ● Remove adapters and primers ● Remove chimeric sequences ● Merging of reads 19

OTU picking ● An operational taxonomic unit is an operational definition of a species or group of species often used when only DNA sequence data is available. ● Clusters are formed based on sequence identity. ● 3 different approaches ● De novo OTU picking ● Closed reference OTU picking ● Open reference picking ● A representative sequence is selected for downstream analysis. 20

Classification ● A taxonomic identity is assigned to each representative sequence. ● Classification are done against three main reference databases with aligned, validated and annotated 16S rRNA genes: GreenGenes, Ribosomal Database Project (RDP) and Silva. 21

Alignment ● Alignment is the first step in generating a phylogenetic tree to understand the evolutionary relationship between samples. ● The aligners of choice are those that does alignments to a template (secondary structure) of the 16S sequence. 22

Create a phylogenetic tree ● The phylogenetic tree represents the relationship between the sequences in terms of the evolutionary distance from a common ancestor. ● In downstream analysis this tree is used for example in calculating the UniFrac distances and some alpha diversity measures. 23

Determine alpha diversity ● Alpha diversity is a measure of diversity within a sample. ● It gives an indication of richness and/or evenness of species present in a sample. ● Rarefaction analysis is required to understand the actual diversity within a sample and to determine if your sequencing effort is sufficient and if the total diversity within the sample has been captured. 24

Determine beta diversity ● Beta diversity is a measure of diversity between samples. ● One of the most commonly used metrics is the UniFrac distance that compares samples using phylogenetic information. An all-by all or pairwise matrix of the beta diversity metrics between all the samples in the study is generated and can be visualised in different ways such as a tree, graph, network, ordination methods. 25

Other statistical analysis ● Additional statistical tests between samples or groups of samples can be done in QIIME, and R using ● OTU tables ● metadata ● phylogenetic info 26

Important ecological concepts ● How is biodiversity defined and measured? ● Component of biodiversity: richness, relative abundance, evenness ● Species richness: number of different species in a habitat/sample ● Species relative abundance: number of each species relative to total number of all species in a sample (number of reads per OTU in a sample relative to total number of reads in that sample) ● Species evenness: how close in numbers each species in an environment are 27

Diversity levels ● Alpha diversity: diversity within a habitat sample ● Beta diversity: diversity between samples Problem: Doesn’t take abundance of ● Gamma diversity: total diversity in a landscape each species OR relatedness of each species into account From: http://www.webpages.uidaho.edu/veg_measure/Modules/Lessons/Module%209%28 Composition&Diversity%29/9_2_Biodiversity.htm 28

Abundance and phylogeny adjustment 29

Data analysis of 16S rRNA gene amplicons Gerrit Botha Microbiome - PowerPoint PPT Presentation

Data analysis of 16S rRNA gene amplicons Gerrit Botha Microbiome workshop University of the Cape Town, Cape Town, South Africa October 2017 This material is licensed under Creative Commons Attribution 4.0 International (CC BY 4.0) 1 Why 16S

Metagenomics an introduction Katie Lennard Metagenomics vs. amplicon sequencing (16S)

Eukaryotic Gene Eukaryotic Gene Prediction Prediction Eukaryotic gene structure Eukaryotic

Gene Expression Data Introduction to gene expression data Expression data storage concept An

Using Stata 16s lasso features for prediction and inference Di Liu StataCorp August, 2019

Cbio 16S analysis pipeline Katie Lennard Microbiome analysis workflow Data preprocessing (UCT

Gene Finding Strategies to find gene structures on the web Swiss Institute of Bioinformatics

Staphylococcus aureus Pathogenesis - Gene exchanges - Gene regulation - Gene products - Gene

Family-based analysis of genome-wide gene gene interactions Marit Ackermann Biotec TU Dresden

Detecting gene-gene interactions in high-throughput genotype data through a Bayesian clustering

Examples of online analysis tools for gene expression data Tools integrated in data repositories

A Bayesian clustering approach for detecting gene-gene interactions in high-dimensional genotype

DNA/RNAtargeted molecular analysis of bacterial communities Lysis, DNA/RNA DNA/RNA

A Data Warehouse-based A Data Warehouse-based Gene Expression Analysis Gene Expression Analysis

Gene-gene and gene-environment interactions in genetic case- control association studies Jurg Ott

Gene finding Lorenzo Cerutti Swiss Institute of Bioinformatics EMBNet course, September 2002

Boolean models of gene regulatory networks Matthew Macauley Math 4500: Mathematical Modeling

Bonnie DeVarco - Co-Evolution of Geovisualization and Information Visualization Visualization for

Algorithm Summary Method Input Output Neighbor Joining Distance matrix D T, B Distance

The Origin of Higher Taxa David Campbell University of Alabama (pending full employment somewhere)

Shiny-phyloseq: Web Application for Interactive Microbiome Analysis with Provenance Tracking Paul

The Genetic Code tRNAs charged with uncharged tRNAs an amino acid Polypeptide chain

Ontology-Aware Partitioning for Knowledge Graph Identification Jay Pujara, Hui Miao, Lise Getoor,

Descent nt with Modifica catio tion evolutionary change is based on the interactions Chapter

Phylogenetic tree Michael Schroeder Biotechnology Center TU Dresden Phylogenetic trees

Sambuz

Useful Links

Newsletter

Mail Us