SANBio BIOINFORMATICS TRAINING COURSE THE MICROBIOME: ANALYSIS OF NGS DATA CBIO-PIPELINE SAMSON, KM 10/23/2017 Microbiome : Analysis of NGS Data 1

Outline
• Background
• Wet Lab!
• Raw reads
• Quality Assessment
• Quality Control
• Merging and Filtering
• OTU picking
• Decontamination, Annotation and BIOM

WET LAB
“Garbage in, garbage out”
• Good laboratory practice is needed to produce reliable data for downstream processing.
• Mistakes made in the wet lab cannot be corrected in the dry lab.

1. PCR and Sequencing flow chart
Stage 1: PCR
Stage 2: QC analysis
Stage 3: Sequencing

Why the V4 16S rRNA region?
Pros
• Well-established protocols
• Full overlap of forward and reverse reads
• Less error during assembling
• Highly reduced sequencing noise
Cons
• Hypervariable regions only
• Less information
• Limited resolution in Bacillus *

2. Raw sequence Reads Quality Assessment

Raw sequences
A FASTQ file always has 4 lines per sequence.
✓ The first line starts with “@” and shows the sequence ID and an optional description.
✓ The second line contains the nucleotide sequence.
✓ The third line generally holds only a “+” symbol and, occasionally, the same ID and description as the first line.
✓ The fourth line holds the quality score of each nucleotide on the second line, which encodes “the probability of a sequencing error at each position”.
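
The four-line layout above can be read with a few lines of Python. This is a minimal sketch, assuming an uncompressed FASTQ with exactly four lines per read and no blank lines; the names `FastqRecord` and `read_fastq` and the two example reads are made up for illustration and are not part of the course material.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    read_id: str   # line 1 without the leading "@" (ID plus optional description)
    sequence: str  # line 2, the nucleotides
    quality: str   # line 4, one ASCII character per nucleotide


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield one record per group of four lines."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of file
        sequence = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield FastqRecord(header[1:], sequence, quality)


# Made-up two-read example (not data from the course).
example = io.StringIO(
    "@read1 sample=demo\nACGT\n+\nIIII\n"
    "@read2\nGGCA\n+read2\n!5?I\n"
)
records = list(read_fastq(example))
for rec in records:
    print(rec.read_id, rec.sequence, rec.quality)
```

The length check reflects the rule that line 4 carries exactly one quality character per nucleotide on line 2.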

• The quality score Q is related to the error probability p by $Q = -10 \log_{10}(p)$.
• For example, if the probability of an error (p) equals 0.01, then the corresponding quality score will be 20; if p = 0.001, then Q = 30.
• Each quality value is encoded as a single ASCII character, rather than as a two- or three-digit number.
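
The relationship between p and Q can be checked numerically. The sketch below assumes the Phred+33 encoding (ASCII code minus 33), which is the common convention for current Illumina FASTQ files but is not stated on the slide; the function names are illustrative.

```python
import math
from typing import List

PHRED_OFFSET = 33  # assumption: Phred+33 encoding


def error_prob_to_phred(p: float) -> float:
    """Q = -10 * log10(p)."""
    return -10.0 * math.log10(p)


def phred_to_error_prob(q: float) -> float:
    """Inverse of the formula above: p = 10^(-Q/10)."""
    return 10.0 ** (-q / 10.0)


def decode_quality(quality: str, offset: int = PHRED_OFFSET) -> List[int]:
    """Turn a FASTQ quality string into one integer score per base."""
    return [ord(ch) - offset for ch in quality]


print(error_prob_to_phred(0.01))   # about 20
print(error_prob_to_phred(0.001))  # about 30
scores = decode_quality("!5?I")    # ASCII codes 33, 53, 63, 73
print(scores)
```

With this offset the characters `!`, `5`, `?` and `I` decode to Q0, Q20, Q30 and Q40.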

VISUALIZE FASTQ FILE SEQUENCE QUALITY
FastQC package (Andrews S., 2010)
    fastqc_base/fastqc --extract $fastq -f fastq -o $out_dir -t $fastqc_threads
    fastqc --extract -f fastq -o $fastqc_dir -t 6 $raw_reads_dir/*
    fastqc_combine_base/fastqc_combine.pl -v --out $out_dir --skip --files "$out_dir/*_fastqc"

Raw Sequences: Sample Dog8_R1

Raw Sequences: Sample Dog8_R2

What about this quality?
https://www.bioinformatics.babraham.ac.uk/projects/fastqc/RNA-Seq_fastqc.html

3. Processing of 16S rRNA NGS data

Some tools available
CBIO-PIPELINE integrates tools from UPARSE and QIIME to process NGS microbiome data.

3.1 Merging Paired-End Reads
The UPARSE pipeline uses USEARCH commands (Edgar, 2010).
usearch9 -fastq_mergepairs; maxdiff=3
R1 (300 bp): ATGGATCCC G GAGG G GCGCGAAAAGAGAGAGATTCTCC ....
R2 (300 bp): …..ATGGATCCC T GAGG C GCGCGAAAGGAGAGAGATCTCTCC
Merged: ATGGATCCC T GAGG G GCGCGAAA G GAGAGAGATCTCTCC
• If a base differs between R1 and R2, the base kept in the merged sequence should have a quality score 3 times higher than the other; otherwise the position is called N (ambiguous).
• If R1 and R2 differ at more than 3 positions (maxdiff=3), the pair is rejected.
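
The two rules on this slide can be written down as a small function. This is only a sketch of the rule as the slide states it, not USEARCH's actual implementation, which may resolve mismatches differently. It assumes the overlap has already been found and that R2 has been reverse-complemented, so both inputs have the same length; `merge_overlap`, `ratio` and the example reads are hypothetical.

```python
from typing import List, Optional


def merge_overlap(r1: str, q1: List[int], r2: str, q2: List[int],
                  maxdiff: int = 3, ratio: float = 3.0) -> Optional[str]:
    """Merge two aligned, equal-length overlap segments.

    Returns the merged sequence, or None if the pair is rejected
    because it has more than `maxdiff` mismatches.
    """
    if not (len(r1) == len(r2) == len(q1) == len(q2)):
        raise ValueError("sequences and quality lists must have equal length")
    diffs = sum(1 for a, b in zip(r1, r2) if a != b)
    if diffs > maxdiff:
        return None  # too many differences: reject the pair
    merged = []
    for a, qa, b, qb in zip(r1, q1, r2, q2):
        if a == b:
            merged.append(a)
        elif qa >= ratio * qb:
            merged.append(a)    # R1 base is clearly better supported
        elif qb >= ratio * qa:
            merged.append(b)    # R2 base is clearly better supported
        else:
            merged.append("N")  # neither base wins: ambiguous call
    return "".join(merged)


# Made-up example: two mismatches, one resolvable and one ambiguous.
merged = merge_overlap("ACGTAC", [30, 30, 30, 30, 30, 30],
                       "ACCTAT", [30, 30, 5, 30, 30, 20])
print(merged)

# Four mismatches exceed maxdiff=3, so this pair is rejected.
rejected = merge_overlap("AAAAAA", [30] * 6, "TTTTAA", [30] * 6)
print(rejected)
```

In the first call, position 3 keeps the R1 base (Q30 against Q5) while the last position becomes N (Q30 against Q20 is less than a 3-fold difference).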

Merged summary output
• Fwd /researchdata/fhgfs/cbio/cbio/courses/IBS5003Z/samson/uparse/renamed/Dog10_R1.fastq
• Rev /researchdata/fhgfs/cbio/cbio/courses/IBS5003Z/samson/uparse/renamed/Dog10_R2.fastq
• Totals:
  • 79342 Pairs (79.3k)
  • 70287 Merged (70.3k, 88.59%)
  • 49910 Alignments with zero diffs (62.90%)
  • 8990 Too many diffs (> 3) (11.33%)
  • 0 Fwd tails Q <= 2 trimmed (0.00%)
  • 174 Rev tails Q <= 2 trimmed (0.22%)
  • 0 Fwd too short (< 64) after tail trimming (0.00%)
  • 38 Rev too short (< 64) after tail trimming (0.05%)
  • 27 No alignment found (0.03%)
  • 0 Alignment too short (< 16) (0.00%)
  • 79141 Staggered pairs (99.75%) merged & trimmed
  • 252.65 Mean alignment length
  • 252.65 Mean merged length
  • 0.29 Mean fwd expected errors
  • 2.23 Mean rev expected errors
  • 0.03 Mean merged expected errors

3.2 Filtering Merged Reads
Generally, filtering involves three steps:
✓ Filtering on the error contribution of each nucleotide base (maxee)
✓ Primer stripping (nowadays done by the sequencing platform)
✓ Length truncation
Filtering based on maximum expected error (maxee = 0.1)
uparse_filter_fastq_maxee=0.1
Here maxee is read as the maximum expected error per nucleotide in a DNA sequence.
✓ Thus, a 100 bp sequence is rejected only if its total expected error is > 10 [0.1 x 100].
✓ A 250 bp sequence is rejected if its total expected error is > 25.
What if maxee = 0.5?
✓ A 250 bp sequence is rejected only if its total expected error is > 250 x 0.5 = 125!
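
The expected error of a read is the sum of the per-base error probabilities, E = sum over bases of 10^(-Q/10). The sketch below applies the per-base reading of maxee used on this slide. Note that the USEARCH documentation describes `-fastq_maxee` as a limit on the total expected errors of a read and `-fastq_maxee_rate` as the per-base limit, so which behaviour applies depends on how the pipeline passes the value. Phred+33 encoding and the function names are assumptions.

```python
def expected_errors(quality: str, offset: int = 33) -> float:
    """Total expected errors: sum of 10^(-Q/10) over all bases."""
    return sum(10.0 ** (-(ord(ch) - offset) / 10.0) for ch in quality)


def passes_filter(quality: str, max_ee_per_base: float = 0.1,
                  offset: int = 33) -> bool:
    """Keep the read unless total expected error exceeds max_ee_per_base * length."""
    return expected_errors(quality, offset) <= max_ee_per_base * len(quality)


good_read = "I" * 100  # 100 bases at Q40, p = 0.0001 each
poor_read = "&" * 100  # 100 bases at Q5, p of about 0.316 each

print(expected_errors(good_read), passes_filter(good_read))
print(expected_errors(poor_read), passes_filter(poor_read))
```

For the 100 bp reads above the rejection limit is 0.1 x 100 = 10 expected errors: the Q40 read (about 0.01 expected errors) is kept and the Q5 read (about 31.6) is discarded.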

Was Quality Control Effective?

3.3 FastQC of Merged, Trimmed and Filtered Reads

4. Uparse_downstream

4.1. De-replication
Full-length de-replication is done to find the set of unique sequences. Sequences are compared letter by letter.
Sample result:
>524e5df45a66fb616ef4a553473dd833dedff0ca;size=2;
AACACAGGGGGCAAGCGTTGTCCGGAATCACTGGGCGTAAAGGGCGCGTAGGCGGTCTGTTAAGTCGGATG
TGAAATGTAAGGGCTCAACCCTTAACGTGCATCCGATACTGGCAGACTTGAGTGCGGAAGAGGCAAGTGGAA
TTCCTAGTGTAGCGGTGAAATGCGTAGATATTAGGAGGAACACCAGTGGCGAAGGCGACTTGCTGGGCCGTA
ACTGACGCTGAGGCGCGAAAGCGTGGGGAGCAAACAGG
>459142c21f9a9981d43f98e53cc276b781ad2c6a;size=5;
AACATAAGGGGCAAGCGTTGTCCGGAATCACTGGGCGTAAAGGGCGCGTAGGCGGTCTGTTAAGTCGGATGT
GAAATGTAAGGGCTCAACCCTTAACGTGCATCCGATACTGGCAGACTTGAGTGCGGAAGAGGCAAGTGGAAT
TCCTAGTGTAGCGGTGAAATGCGTAGATATTAGGAGGAACACCAGTGGCGAAGGCGACTTGCTGGGCCGTAA
CTGACGCTGAGGCGCGAAAGCGTGGGGAGCAAACAGG
>4ea702c69516c467927860701b5d1a3b59b5d9c6;size=1;
AACATAGAGGGCAAGCGTTGTCCGGAATCACTGGGCGTAAAGGGCGCGTAGGCGGTCTGTTAAGTCGGATGT
GAAATGTAGGGGCTCAACCCTTAACGTGCATCCGATACTGGCAGACTTGAGTGCGGAAGAGGCAAGTGGAAT
TCCTAGTGTAGCGGTGAAATGCGTAGATATTAGGAGGAACACCAGTGGCGAAGGCGACTTGCTGGGCCGTAA
CTGACGCTGAGGCGCGAAAGCGTGGGGAGCAAACAGG
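
Full-length de-replication amounts to counting exact string matches. The sketch below is a plain-Python illustration, not the USEARCH code. The labels in the sample result look like 40-character hexadecimal digests, so the sketch labels each unique sequence with its SHA-1 digest; that choice is an assumption, as are the function name and the toy reads.

```python
import hashlib
from collections import Counter
from typing import Iterable, List, Tuple


def dereplicate(sequences: Iterable[str]) -> List[Tuple[str, str]]:
    """Collapse identical sequences into (label, sequence) pairs.

    Each label has the form "<digest>;size=<count>;", mirroring the
    sample result on the slide.
    """
    counts = Counter(seq.upper() for seq in sequences)  # exact, letter-by-letter match
    records = []
    for seq, size in counts.items():
        digest = hashlib.sha1(seq.encode("ascii")).hexdigest()
        records.append((f"{digest};size={size};", seq))
    return records


# Toy reads (not course data): "ACGT" occurs three times, "ACGA" once.
reads = ["ACGT", "acgt", "ACGA", "ACGT"]
unique = dereplicate(reads)
for label, seq in unique:
    print(f">{label}\n{seq}")
```

The `size=` annotation is what the next step uses to rank and filter the unique sequences.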

4.2. Sort sequences by size
usearch9 -sortbysize command, min_size=2
Unique sequences are sorted by decreasing abundance, and those with fewer than min_size reads are discarded.
Sample result:
>2fd264476fe1367dbe062db6e5bdcc7d384a8487;size=190716;
TACGTAGGGGGCTAGCGTTATCCGGATTTACTGGGCGTAAAGGGTGCGTAGGCGGTCTTTCAAGTCAGGAGTTAAAGGCTAC
GGCTCAACCGTAGTAAGCTCCTGATACTGTCTGACTTGAGTGCAGGAGAGGAAAGCGGAATTCCCAGTGTAGCGGTGAAATG
CGTAGATATTGGGAGGAACACCAGTAGCGAAGGCGGCTTTCTGGACTGTAACTGACGCTGAGGCACGAAAGCGTGGGGAGC
AAACAGG
>dfcca28a6795cdd3c43b2fbfd5d1f7f64ead1fa8;size=161971;
TACGGAAGGTCCAGGCGTTATCCGGATTTATTGGGTTTAAAGGGAGCGCAGGCGGACTCTTAAGTCAGTTGTGAAATACGGC
GGCTCAACCGTCGGACTGCAGTTGATACTGGGAGTCTTGAGTGCACACAGGGATGCTGGAATTCATGGTGTAGCGGTGAAAT
GCTCAGATATCATGAAGAACTCCGATCGCGAAGGCAGGTATCCGGGGTGCAACTGACGCTGAGGCTCGAAAGTGCGGGTATC
AAACAGG
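
Sorting by size only needs the `size=` annotation in each label. A minimal sketch, assuming labels of the form `<id>;size=<n>;` as in the sample results; the short identifiers and sequences in the example are placeholders, not course data.

```python
import re
from typing import List, Tuple

SIZE_RE = re.compile(r";size=(\d+);")


def label_size(label: str) -> int:
    """Extract the abundance from a label such as 'abc;size=5;'."""
    match = SIZE_RE.search(label)
    if match is None:
        raise ValueError(f"no size annotation in {label!r}")
    return int(match.group(1))


def sort_by_size(records: List[Tuple[str, str]],
                 min_size: int = 2) -> List[Tuple[str, str]]:
    """Drop records below min_size, then sort by decreasing abundance."""
    kept = [rec for rec in records if label_size(rec[0]) >= min_size]
    return sorted(kept, key=lambda rec: label_size(rec[0]), reverse=True)


# Placeholder records with the sizes 2, 5 and 1 seen in the de-replication sample.
records = [("seqA;size=2;", "AACACAGG"),
           ("seqB;size=5;", "AACATAAG"),
           ("seqC;size=1;", "AACATAGA")]
ranked = sort_by_size(records, min_size=2)
print([label for label, _ in ranked])
```

With min_size=2 the singleton is removed, which is why sequences seen only once never seed an OTU in the following step.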

4.3. De novo OTU picking
usearch9 -cluster_otus, otu_radius_pct=3
Performs 97% OTU clustering (an OTU radius of 3%) using the UPARSE-OTU algorithm (Edgar, R.C., 2013).
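
To make the 3% radius concrete, here is a deliberately simplified greedy clustering sketch. It is not the UPARSE-OTU algorithm: it assumes equal-length, pre-aligned sequences, measures identity by counting matching positions, and skips the chimera filtering that UPARSE-OTU performs during clustering. All names and sequences are illustrative.

```python
from typing import List


def identity(a: str, b: str) -> float:
    """Fraction of matching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("this sketch only compares equal-length sequences")
    return sum(1 for x, y in zip(a, b) if x == y) / len(a)


def greedy_cluster(seqs_by_abundance: List[str],
                   radius_pct: float = 3.0) -> List[List[str]]:
    """Assign each sequence to the first centroid within the radius.

    Input must be ordered from most to least abundant; the first
    member of each returned cluster is its centroid.
    """
    threshold = 1.0 - radius_pct / 100.0  # radius 3% means identity >= 97%
    clusters: List[List[str]] = []
    for seq in seqs_by_abundance:
        for cluster in clusters:
            if identity(seq, cluster[0]) >= threshold:
                cluster.append(seq)
                break
        else:
            clusters.append([seq])  # no centroid close enough: new OTU
    return clusters


centroid = "A" * 100
close = "A" * 98 + "CC"       # 98% identical to the centroid
distant = "A" * 90 + "C" * 10  # 90% identical to the centroid
clusters = greedy_cluster([centroid, close, distant])
print([len(c) for c in clusters])
```

Processing sequences in abundance order is the reason the sort-by-size step comes first: the most abundant sequences become the OTU centroids.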

4.4. Chimera detection and removal
usearch9 -uchime2_ref, gold_db
✓ Chimeric sequences are detected and removed.
Sample output:
– 59Mb 100.0% Reading /scratch/DB/bio/qiime/uchime/gold.fa
– 26Mb 100.0% Converting to upper case
– 27Mb 100.0% Word stats
– 27Mb 100.0% Alloc rows
– 86Mb 100.0% Build index
– 93Mb 100.0% Chimeras 5/184 (2.7%), in db 27 (14.7%), not matched 152 (82.6%)

4.5. OTU table generation
De-dereplication and QIIME-compatible otu_table
usearch9 -usearch_global
The usearch_global command searches for how many times each OTU appears in each sample and then generates a QIIME-compatible otu_table.
OTUId Dog10/1 Dog15/1 Dog16/1 Dog17/1 Dog1/1 Dog22/1 Dog24/1 Dog29/1 Dog2/1
OTU_19 2961 151 25 569 212 967 64 330 2691 257 567 374
OTU_2 14004 7549 13826 14747 8370 5715 33064 658 11497 21298 44
OTU_1 10077 29178 11913 9804 10362 33473 22356 25381 8320 13869
OTU_12 1276 586 1185 1258 1906 476 3418 128 1247 998 1510
OTU_4 5932 11319 4568 5609 8082 14859 9988 6492 6135 8157 12908
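
Once every read has been assigned to an OTU, building the table is a matter of counting (sample, OTU) pairs. The sketch below assumes the search step has already been reduced to such pairs; the function names, the pair format and the toy hits are assumptions for illustration and do not reproduce USEARCH output.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def build_otu_table(hits: Iterable[Tuple[str, str]]
                    ) -> Tuple[List[str], Dict[str, List[int]]]:
    """Count reads per OTU per sample from (sample, otu) pairs."""
    counts: Dict[str, Dict[str, int]] = defaultdict(lambda: defaultdict(int))
    samples = set()
    for sample, otu in hits:
        counts[otu][sample] += 1
        samples.add(sample)
    sample_order = sorted(samples)
    table = {otu: [by_sample.get(s, 0) for s in sample_order]
             for otu, by_sample in counts.items()}
    return sample_order, table


def format_table(sample_order: List[str], table: Dict[str, List[int]]) -> str:
    """Render a tab-separated table with an OTUId header column."""
    lines = ["\t".join(["OTUId"] + sample_order)]
    for otu in sorted(table):
        lines.append("\t".join([otu] + [str(n) for n in table[otu]]))
    return "\n".join(lines)


# Toy assignments (not course data): one pair per read.
hits = [("Dog10", "OTU_1"), ("Dog10", "OTU_1"),
        ("Dog15", "OTU_1"), ("Dog15", "OTU_2")]
sample_order, table = build_otu_table(hits)
text = format_table(sample_order, table)
print(text)
```

OTUs that never occur in a sample get an explicit 0, so every row has one value per sample column.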

5. Decontamination

5.1 Overview of Decontamination
✓ We need to identify OTUs that may come from contamination: reagents used for sampling, DNA extraction and purification, and the environments and personnel involved in DNA extraction.
✓ This is critical, especially for clinical samples. Why?
✓ These OTUs must be subtracted from the biological samples to retain a true representation of the OTUs in the sample of interest.
✓ To achieve this, reagents / blanks [controls] are spiked with known bacteria at the same DNA concentrations as those used in the samples under study.
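
One simple way to picture the subtraction step is to drop every OTU that shows up in a control sample. The sketch below uses that presence-based rule purely for illustration; the slides do not show how the pipeline actually decides which OTUs to subtract, and handling of the spiked-in bacteria is left out. The function names, the `min_reads` threshold and the toy counts are hypothetical.

```python
from typing import Dict, Iterable, Set

OtuTable = Dict[str, Dict[str, int]]  # OTU id -> sample name -> read count


def contaminant_otus(table: OtuTable, control_samples: Iterable[str],
                     min_reads: int = 1) -> Set[str]:
    """OTUs with at least min_reads reads in any control sample."""
    controls = list(control_samples)
    return {otu for otu, counts in table.items()
            if any(counts.get(s, 0) >= min_reads for s in controls)}


def remove_contaminants(table: OtuTable, control_samples: Iterable[str],
                        min_reads: int = 1) -> OtuTable:
    """Drop flagged OTUs and the control columns themselves."""
    controls = set(control_samples)
    flagged = contaminant_otus(table, controls, min_reads)
    return {otu: {s: n for s, n in counts.items() if s not in controls}
            for otu, counts in table.items() if otu not in flagged}


# Toy table (not course data) with one biological sample and one blank.
table = {"OTU_1": {"Dog10": 50, "Blank": 0},
         "OTU_2": {"Dog10": 3, "Blank": 12}}
flagged = contaminant_otus(table, ["Blank"])
cleaned = remove_contaminants(table, ["Blank"])
print(flagged, cleaned)
```

Raising `min_reads` makes the rule less aggressive, which matters when a genuinely abundant sample OTU appears at trace levels in a blank.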
