SANBio BIOINFORMATICS TRAINING COURSE THE MICROBIOME: ANALYSIS OF NGS DATA CBIO-PIPELINE SAMSON, KM
10/23/2017 Microbiome : Analysis of NGS Data 1
Outline
✓ Background
✓ Wet Lab!
✓ Raw reads
✓ Quality Assessment
✓ Quality Control
✓ Merging and Filtering
✓ OTU picking
✓ Decontamination, Annotation and BIOM
Stage 1: PCR
Stage 2: QC analysis
Stage 3: Sequencing
A FASTQ file always has 4 lines per sequence.
✓ The first line shows the sequence ID and an optional description.
✓ The second line contains the nucleotide sequence.
✓ The third line generally holds only a "+" symbol and, occasionally, the same ID and description as the first line.
✓ The fourth line gives the quality score of each nucleotide on the second line. Each score encodes "the probability of a sequencing error" at that position.
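The four-line layout above can be read with a few lines of Python. This is a minimal sketch, assuming Phred+33 encoded qualities (the common encoding on current Illumina platforms); the record and the helper names `read_fastq` and `error_probabilities` are made up for illustration and are not part of CBIO-PIPELINE.

```python
from typing import Iterator, List, Tuple


def read_fastq(lines: List[str]) -> Iterator[Tuple[str, str, str]]:
    """Yield (read_id, sequence, quality) from FASTQ text, 4 lines per record."""
    for i in range(0, len(lines) - 3, 4):
        header, seq, plus, qual = (lines[i + k].rstrip("\n") for k in range(4))
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record at line {i + 1}")
        yield header[1:], seq, qual


def error_probabilities(qual: str, offset: int = 33) -> List[float]:
    """Convert quality characters to per-base error probabilities.

    Assumes Phred scaling: Q = ord(char) - offset and P(error) = 10 ** (-Q / 10).
    """
    return [10 ** (-(ord(c) - offset) / 10) for c in qual]


# Hypothetical record, made up for illustration.
record = ["@read1 example", "ACGT", "+", "I5+!"]
for read_id, seq, qual in read_fastq(record):
    probs = error_probabilities(qual)
    print(read_id, seq, probs)
```

Here "I" is Q40 (1 error in 10,000), "5" is Q20, "+" is Q10 and "!" is Q0.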
https://www.bioinformatics.babraham.ac.uk/projects/fastqc/RNA-Seq_fastqc.html
CBIO-PIPELINE integrates some tools from UPARSE and QIIME to process NGS microbiome data
300bp …..ATGGATCCCTGAGGCGCGCGAAAGGAGAGAGATCTCTCC R2
Merged: ATGGATCCCTGAGGGGCGCGAAAGGAGAGAGATCTCTCC
If a base differs between R1 and R2, the base kept in the merged sequence must have a quality score at least 3x that of the other; otherwise the position is called N (ambiguous call).
If R1 and R2 differ at more than 3 nucleotides, the pair is rejected.
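The merging rule stated here can be sketched in Python. This is only an illustration of the rule as worded on the slide, not the merging code of USEARCH or CBIO-PIPELINE. It assumes the two overlap segments are already aligned (R2 reverse-complemented), reads "3x" as "at least 3 times", and uses a hypothetical `merge_overlap` helper with made-up R1 bases and quality scores.

```python
from typing import List, Optional


def merge_overlap(r1: str, q1: List[int], r2: str, q2: List[int],
                  ratio: int = 3, max_diffs: int = 3) -> Optional[str]:
    """Merge two aligned overlap segments base by base.

    Returns the merged sequence, or None if the pair is rejected
    because R1 and R2 differ at more than max_diffs positions.
    """
    merged = []
    diffs = 0
    for b1, s1, b2, s2 in zip(r1, q1, r2, q2):
        if b1 == b2:
            merged.append(b1)
            continue
        diffs += 1
        if s1 >= ratio * s2:      # R1 base is clearly better supported
            merged.append(b1)
        elif s2 >= ratio * s1:    # R2 base is clearly better supported
            merged.append(b2)
        else:                     # neither base wins: ambiguous call
            merged.append("N")
    if diffs > max_diffs:
        return None
    return "".join(merged)


# R2 segment taken from the example above; R1 and all qualities are hypothetical.
print(merge_overlap("GAGGGGCG", [36] * 8, "GAGGCGCG", [10] * 8))  # R1 base kept
print(merge_overlap("GAGGGGCG", [30] * 8, "GAGGCGCG", [20] * 8))  # N at the mismatch
print(merge_overlap("AAAA", [30] * 4, "CCCC", [30] * 4))          # rejected
```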
Generally, filtering involves three steps:
✓ Filtering on the error contribution of each nucleotide base (maxee)
✓ Primer stripping (nowadays done by the sequencing platform)
✓ Length truncation
Filtering based on maximum expected error (maxee = 0.1)
uparse_filter_fastq_maxee=0.1
This is the maximum expected error allowed per nucleotide in a DNA sequence.
✓ Thus, a 100bp sequence is rejected only if its total expected error is > 10 [0.1 x 100].
✓ A 250bp sequence is rejected if its total expected error is > 25 [0.1 x 250].
What if maxee = 0.5?
✓ A 250bp sequence is rejected only if its total expected error is > 250 x 0.5 = 125 !!!!
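The arithmetic above can be checked with a short sketch. It follows the per-nucleotide reading of maxee given on this slide and assumes Phred+33 qualities; `expected_errors` and `passes_filter` are hypothetical helper names. It is worth checking how the pipeline passes this value on, since USEARCH's own `-fastq_maxee` option is documented as a total per read, with `-fastq_maxee_rate` as the per-base variant.

```python
def expected_errors(qual: str, offset: int = 33) -> float:
    """Total expected errors: the sum of per-base error probabilities."""
    return sum(10 ** (-(ord(c) - offset) / 10) for c in qual)


def passes_filter(qual: str, maxee_per_base: float = 0.1) -> bool:
    """Keep a read if its total expected error is within length x maxee."""
    return expected_errors(qual) <= maxee_per_base * len(qual)


good = "I" * 100   # hypothetical read, every base Q40
poor = "&" * 100   # hypothetical read, every base Q5 (about 32 expected errors)

print(passes_filter(good, 0.1))  # well under the limit of 10
print(passes_filter(poor, 0.1))  # over the limit of 10
print(passes_filter(poor, 0.5))  # under the much looser limit of 50
```

The last call shows why a value of 0.5 is so permissive: a read in which roughly one base in three is expected to be wrong still passes.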
>524e5df45a66fb616ef4a553473dd833dedff0ca;size=2;
AACACAGGGGGCAAGCGTTGTCCGGAATCACTGGGCGTAAAGGGCGCGTAGGCGGTCTGTTAAGTCGGATG
TGAAATGTAAGGGCTCAACCCTTAACGTGCATCCGATACTGGCAGACTTGAGTGCGGAAGAGGCAAGTGGAA
TTCCTAGTGTAGCGGTGAAATGCGTAGATATTAGGAGGAACACCAGTGGCGAAGGCGACTTGCTGGGCCGTA
ACTGACGCTGAGGCGCGAAAGCGTGGGGAGCAAACAGG
>459142c21f9a9981d43f98e53cc276b781ad2c6a;size=5;
AACATAAGGGGCAAGCGTTGTCCGGAATCACTGGGCGTAAAGGGCGCGTAGGCGGTCTGTTAAGTCGGATGT
GAAATGTAAGGGCTCAACCCTTAACGTGCATCCGATACTGGCAGACTTGAGTGCGGAAGAGGCAAGTGGAAT
TCCTAGTGTAGCGGTGAAATGCGTAGATATTAGGAGGAACACCAGTGGCGAAGGCGACTTGCTGGGCCGTAA
CTGACGCTGAGGCGCGAAAGCGTGGGGAGCAAACAGG
>4ea702c69516c467927860701b5d1a3b59b5d9c6;size=1;
AACATAGAGGGCAAGCGTTGTCCGGAATCACTGGGCGTAAAGGGCGCGTAGGCGGTCTGTTAAGTCGGATGT
GAAATGTAGGGGCTCAACCCTTAACGTGCATCCGATACTGGCAGACTTGAGTGCGGAAGAGGCAAGTGGAAT
TCCTAGTGTAGCGGTGAAATGCGTAGATATTAGGAGGAACACCAGTGGCGAAGGCGACTTGCTGGGCCGTAA
CTGACGCTGAGGCGCGAAAGCGTGGGGAGCAAACAGG
>2fd264476fe1367dbe062db6e5bdcc7d384a8487;size=190716;
TACGTAGGGGGCTAGCGTTATCCGGATTTACTGGGCGTAAAGGGTGCGTAGGCGGTCTTTCAAGTCAGGAGTTAAAGGCTAC
GGCTCAACCGTAGTAAGCTCCTGATACTGTCTGACTTGAGTGCAGGAGAGGAAAGCGGAATTCCCAGTGTAGCGGTGAAATG
CGTAGATATTGGGAGGAACACCAGTAGCGAAGGCGGCTTTCTGGACTGTAACTGACGCTGAGGCACGAAAGCGTGGGGAGC
AAACAGG
>dfcca28a6795cdd3c43b2fbfd5d1f7f64ead1fa8;size=161971;
TACGGAAGGTCCAGGCGTTATCCGGATTTATTGGGTTTAAAGGGAGCGCAGGCGGACTCTTAAGTCAGTTGTGAAATACGGC
GGCTCAACCGTCGGACTGCAGTTGATACTGGGAGTCTTGAGTGCACACAGGGATGCTGGAATTCATGGTGTAGCGGTGAAAT
GCTCAGATATCATGAAGAACTCCGATCGCGAAGGCAGGTATCCGGGGTGCAACTGACGCTGAGGCTCGAAAGTGCGGGTATC
AAACAGG
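The `size=` field in FASTA headers like these records how many identical reads were collapsed into each unique sequence. Below is a minimal sketch of reading that annotation and ordering records by decreasing abundance, using the three headers from the first listing in this section. The helper name `abundance` is hypothetical, and treating a header with no annotation as size 1 is an assumption.

```python
import re

# Matches the ";size=N;" abundance annotation used in the headers above.
SIZE_RE = re.compile(r";size=(\d+);?")


def abundance(header: str) -> int:
    """Return the size annotation of a FASTA header (1 if absent, by assumption)."""
    match = SIZE_RE.search(header)
    return int(match.group(1)) if match else 1


headers = [
    ">524e5df45a66fb616ef4a553473dd833dedff0ca;size=2;",
    ">459142c21f9a9981d43f98e53cc276b781ad2c6a;size=5;",
    ">4ea702c69516c467927860701b5d1a3b59b5d9c6;size=1;",
]

# Most abundant unique sequence first.
ordered = sorted(headers, key=abundance, reverse=True)
for header in ordered:
    print(abundance(header), header)
```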
usearch9 -cluster_otus, otu_radius_pct=3 (Edgar, R.C., 2013)
Performs 97% OTU clustering using the UPARSE-OTU algorithm: an OTU radius of 3% corresponds to a 97% identity threshold.
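To show what a 3% radius means in practice, here is a deliberately simplified greedy centroid clustering. It is not the UPARSE-OTU algorithm, which works on alignments and also handles chimeras. The sketch assumes equal-length sequences compared position by position, and the sequences and helper names are made up for illustration.

```python
from typing import List, Tuple


def identity_pct(a: str, b: str) -> float:
    """Percent identity of two equal-length sequences (no alignment)."""
    if len(a) != len(b):
        raise ValueError("this sketch only compares equal-length sequences")
    return 100.0 * sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_cluster(seqs: List[str], otu_radius_pct: float = 3.0) -> Tuple[List[str], List[int]]:
    """Assign each sequence (most abundant first) to the first centroid
    within the radius, or start a new OTU if none is close enough."""
    threshold = 100.0 - otu_radius_pct   # radius 3 -> 97% identity
    centroids: List[str] = []
    assignment: List[int] = []
    for seq in seqs:
        for idx, centroid in enumerate(centroids):
            if identity_pct(seq, centroid) >= threshold:
                assignment.append(idx)
                break
        else:
            centroids.append(seq)
            assignment.append(len(centroids) - 1)
    return centroids, assignment


def mutate(seq: str, positions: List[int]) -> str:
    """Substitute the base at each given position (illustration helper)."""
    swap = {"A": "C", "C": "G", "G": "T", "T": "A"}
    chars = list(seq)
    for p in positions:
        chars[p] = swap[chars[p]]
    return "".join(chars)


base = "ACGT" * 25                        # hypothetical 100bp sequence
near = mutate(base, [10, 50])             # 98% identical: same OTU
far = mutate(base, [0, 20, 40, 60, 80])   # 95% identical: new OTU
centroids, assignment = greedy_cluster([base, near, far])
print(len(centroids), assignment)
```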
– 59Mb 100.0% Reading /scratch/DB/bio/qiime/uchime/gold.fa
– 26Mb 100.0% Converting to upper case
– 27Mb 100.0% Word stats
– 27Mb 100.0% Alloc rows
– 86Mb 100.0% Build index
– 93Mb 100.0% Chimeras 5/184 (2.7%), in db 27 (14.7%), not matched 152 (82.6%)
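The percentages in the final log line are each count divided by the 184 sequences checked. A quick sketch that reproduces them from the three counts in the log (the dictionary keys simply reuse the log's own labels):

```python
# Counts copied from the last line of the log above.
counts = {"chimeras": 5, "in db": 27, "not matched": 152}

total = sum(counts.values())
percent = {label: round(100.0 * n / total, 1) for label, n in counts.items()}

print(total)    # the three categories add up to the 184 sequences checked
print(percent)
```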
usearch9 -usearch_global
The usearch_global command counts how many times each OTU appears in each sample and then generates a QIIME-compatible OTU table.
OTUId Dog10/1 Dog15/1 Dog16/1 Dog17/1 Dog1/1 Dog22/1 Dog24/1 Dog29/1 Dog2/1
OTU_19 2961 151 25 569 212 967 64 330 2691 257 567 374
OTU_2 14004 7549 13826 14747 8370 5715 33064 658 11497 21298 44
OTU_1 10077 29178 11913 9804 10362 33473 22356 25381 8320 13869
OTU_12 1276 586 1185 1258 1906 476 3418 128 1247 998 1510
OTU_4 5932 11319 4568 5609 8082 14859 9988 6492 6135 8157 12908
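The counting behind an OTU table like this one can be sketched as follows. The list of (sample, OTU) hits is hypothetical and stands in for the read-to-OTU matches a search tool reports; `build_otu_table` is an illustrative helper, not a USEARCH or QIIME function.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def build_otu_table(hits: List[Tuple[str, str]]) -> List[List[str]]:
    """Count reads per (OTU, sample) and return rows with a header line."""
    counts: Dict[str, Dict[str, int]] = defaultdict(lambda: defaultdict(int))
    samples = set()
    for sample, otu in hits:
        counts[otu][sample] += 1
        samples.add(sample)
    sample_order = sorted(samples)
    rows = [["OTUId"] + sample_order]
    for otu in sorted(counts):
        rows.append([otu] + [str(counts[otu][s]) for s in sample_order])
    return rows


# Hypothetical hits: one (sample, OTU) pair per matched read.
hits = [
    ("Dog10", "OTU_1"), ("Dog10", "OTU_1"), ("Dog15", "OTU_1"),
    ("Dog10", "OTU_2"), ("Dog15", "OTU_19"),
]
table = build_otu_table(hits)
for row in table:
    print("\t".join(row))
```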
assign_taxonomy.py -i otus_repsetOUT.fa -o tax -r gg_db/rep_set/97_otus.fasta -t gg_db/taxonomy/97_otu_taxonomy.txt -m uclust
OTUId, sequencing_control primestorecyano P1_G06, sequencing_control primestorecyano P2_G06, sequencing_control primestorecyano P3_G06, sequencing_control primestorecyano P4_G06
OTU_1 12848 9358 12349 10627
OTU_10 1 1 2
OTU_11 18 17 23 15
OTU_12 67 66 64 60
OTU_13 11 13 9 8
OTU_14 443 368 426 349
OTU_15 3 5 5 4
OTU_16 4 5 4 4
OTU_17 1 1 2
OTU_18 2 1
OTU_19 10 13 18 6
OTU_2 2 2 2
OTU_20 8 4 9 5
OTU_21 2
OTU_22 918 771 1125 778
OTU_3 176 193 235 147
OTU_4 10 8 16 10
OTU_5 2363 2106 2607 2553
OTU_6 2 1
OTU_7 6 3 4 4
OTU_8 1 1
OTU_9 1 2 1 1
OTU, Tax, % ID, Av_reads
OTU_1 Bacteria;__Cyanobacteria;__Cyanobacteria;__SubsectionIII;__FamilyI;__Arthrospira 100 11296
OTU_10 Bacteria;__Bacteroidetes;__Flavobacteria;__Flavobacteriales;__Cryomorphaceae;__Fluviicola 88 1
OTU_11 Bacteria;__Deinococcus-Thermus;__Deinococci;__Deinococcales;__Trueperaceae;__Truepera 100 18
OTU_12 Bacteria;__Proteobacteria;__Gammaproteobacteria;__Oceanospirillales;__Alcanivoracaceae;__A 100 64
OTU_13 Bacteria;__Proteobacteria;__Gammaproteobacteria;__Order_Incertae_Sedis;__Family_Incertae_S 100 10
OTU_14 Bacteria;__Bacteroidetes;__Cytophagia;__Order_III_Incertae_Sedis;__ML310M-34;__g 69 397
OTU_15 Bacteria;__Verrucomicrobia;__Opitutae;__Puniceicoccales;__Puniceicoccaceae;__g 72 4
OTU_16 Bacteria;__Actinobacteria;__Acidimicrobiia;__Acidimicrobiales;__Acidimicrobiaceae;__g 86 4
OTU_17 Bacteria 99 1
OTU_18 Bacteria;__Bacteroidetes;__Cytophagia;__Order_III_Incertae_Sedis;__ML310M-34;__g 95 1
OTU_19 Bacteria;__Proteobacteria;__Alphaproteobacteria;__Caulobacterales;__Hyphomonadaceae;__Oc 100 12
OTU_2 Bacteria;__Proteobacteria;__Gammaproteobacteria;__Pasteurellales;__Pasteurellaceae;__Haemo 79 2
OTU_20 Bacteria;__Proteobacteria;__Gammaproteobacteria;__Xanthomonadales;__Sinobacteraceae;__g 52 7
OTU_21 Bacteria;__Lentisphaerae;__Lentisphaeria;__SS1-B-03-39;__f;__g 60 1
OTU_22 Bacteria;__Verrucomicrobia;__Opitutae;__Puniceicoccales;__Puniceicoccaceae;__g 100 898
OTU_3 Bacteria;__Proteobacteria;__Alphaproteobacteria;__Rhodobacterales;__Rhodobacteraceae 100 188
OTU_4 Bacteria;__Bacteroidetes;__Cytophagia;__Cytophagales;__Cyclobacteriaceae;__g 99 11
OTU_5 Bacteria;__Proteobacteria;__Gammaproteobacteria;__Alteromonadales;__Alteromonadaceae;__M 100 2407
OTU_6 Bacteria;__Proteobacteria;__Alphaproteobacteria;__Rhizobiales;__Phyllobacteriaceae;__Pseudam 79 1
OTU_7 Bacteria;__Proteobacteria;__Gammaproteobacteria;__Pseudomonadales;__Moraxellaceae;__Mo 100 4
OTU_8 Bacteria;__Firmicutes;__Bacilli;__Lactobacillales;__Carnobacteriaceae;__Dolosigranulum 100 1
OTU_9 Bacteria;__Proteobacteria;__Alphaproteobacteria;__Rhizobiales;__Bradyrhizobiaceae;__Salinarim 100 1
Note: Before calculating the average reads of the control replicates, establish whether they are comparable by calculating the % of each OTU in each spiked control.
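The comparability check in this note can be sketched with three rows of the control table (OTU_1, OTU_5 and OTU_22, the three most abundant). Because only a subset of rows is used, the percentages below are relative to that subset and are for illustration only; a real check would use every OTU in the table. The short column names are abbreviations of the control labels.

```python
# Read counts copied from the sequencing-control table (subset of rows).
controls = ["P1_G06", "P2_G06", "P3_G06", "P4_G06"]
counts = {
    "OTU_1": [12848, 9358, 12349, 10627],
    "OTU_5": [2363, 2106, 2607, 2553],
    "OTU_22": [918, 771, 1125, 778],
}

# Column totals, one per control replicate (subset only, see lead-in).
totals = [sum(column) for column in zip(*counts.values())]

# Percentage of each OTU within each replicate.
percent = {
    otu: [100.0 * value / total for value, total in zip(values, totals)]
    for otu, values in counts.items()
}

# Average reads across replicates, computed once replicates look comparable.
average = {otu: sum(values) / len(values) for otu, values in counts.items()}

for otu in counts:
    shares = ", ".join(f"{p:.1f}%" for p in percent[otu])
    print(f"{otu}: {shares}; average reads {average[otu]:.2f}")
```

Rounded to whole reads, the averages agree with the Av_reads column of the taxonomy table (11296, 2407 and 898).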
candidate sequence ID, candidate nucleotide count, template ID, BLAST percent identity to template, candidate nucleotide count post-NAST
OTU_1 250 OTU_41 100.00 250
OTU_2 250 OTU_22 100.00 250
OTU_3 250 OTU_77 100.00 250
OTU_4 250 OTU_285 100.00 250
OTU_5 250 OTU_75 100.00 250
OTU_6 250 OTU_1 100.00 250
OTU_7 250 OTU_251 100.00 250
OTU_8 250 No search results.
OTU_9 250 OTU_195 100.00 250
OTU_10 250 OTU_208 100.00 250
OTU_11 250 No search results.
OTU_12 250 OTU_90 100.00 250
align_seqs.py -i $inDir/conta.fa -o $outDir/decon100 -t $inDir/otus_repsetOUT.fa -e 250 -p 100.0 -m pynast
✓ -i: contaminant OTUs
✓ -t: biological OTUs
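One way to use the alignment log is to collect the biological OTUs (template IDs) that a contaminant OTU matched at 100% identity, presumably so they can be removed or inspected more closely. The sketch below assumes the log is tab-delimited with the five columns listed in the log header; the rows are a few of those shown in the log, and `contaminant_matches` is a hypothetical helper, not a QIIME function.

```python
from typing import List, Set


def contaminant_matches(rows: List[str], min_pct: float = 100.0) -> Set[str]:
    """Return template (biological) OTU IDs matched by a contaminant OTU."""
    flagged = set()
    for row in rows:
        fields = row.split("\t")
        if len(fields) < 5:          # e.g. "No search results."
            continue
        template_id, pct = fields[2], float(fields[3])
        if pct >= min_pct:
            flagged.add(template_id)
    return flagged


# Rows copied from the log, with assumed tab separators.
log_rows = [
    "OTU_1\t250\tOTU_41\t100.00\t250",
    "OTU_2\t250\tOTU_22\t100.00\t250",
    "OTU_6\t250\tOTU_1\t100.00\t250",
    "OTU_8\t250\tNo search results.",
]
flagged = contaminant_matches(log_rows)

# Hypothetical list of biological OTU IDs to screen.
biological = ["OTU_1", "OTU_2", "OTU_22", "OTU_41"]
kept = [otu for otu in biological if otu not in flagged]
print(sorted(flagged), kept)
```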
[Workflow figure: Raw sequence data → CBIO-PIPELINE → BIOM file, phylogenetic tree, taxonomy file → STATISTICIAN]