SLIDE 1

SANBio BIOINFORMATICS TRAINING COURSE
THE MICROBIOME: ANALYSIS OF NGS DATA
CBIO-PIPELINE
SAMSON, KM

10/23/2017 Microbiome : Analysis of NGS Data 1

SLIDE 2

Outline

  • Background
  • Wet Lab!
  • Raw reads
  • Quality Assessment
  • Quality Control
  • Merging and Filtering
  • OTU picking
  • Decontamination, Annotation and BIOM

SLIDE 3

WET LAB

“Garbage in, garbage out.” It takes good lab practice to produce reliable data for downstream processing. If we mess up in the wet lab, it cannot be corrected in the dry lab.

SLIDE 4
1. PCR and Sequencing flow chart

  • Stage 1: PCR
  • Stage 2: QC analysis
  • Stage 3: Sequencing

SLIDE 5

Why the V4 region of 16S rRNA?

Pros

  • Well-established protocols
  • Full overlap of forward and reverse reads
  • Fewer errors during assembly
  • Highly reduced sequencing noise

Cons

  • Hypervariable region only
  • Less information
  • Limited resolution in Bacillus*

SLIDE 6
2. Raw Sequence Reads Quality Assessment

SLIDE 7

Raw sequences

A FASTQ file always has 4 lines per sequence.
✓ The first line shows the sequence ID and an optional description.
✓ The second line contains the sequence of nucleotides.
✓ The third line generally holds only a “+” symbol and, occasionally, the same ID and description as the first line.
✓ The fourth line gives the quality score of each nucleotide on the second line, which encodes the probability of a sequencing error at that position.
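As a rough illustration of the four-line layout, here is a minimal Python sketch of a FASTQ reader. The function name `read_fastq` and the example record are made up for illustration; it assumes unwrapped records (exactly 4 lines each) and a header line starting with "@". A real pipeline would normally rely on an established parser.

```python
import io

def read_fastq(handle):
    """Yield (seq_id, sequence, quality) from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break  # end of file (or a trailing blank line)
        sequence = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        # Line 1 starts with "@", line 3 starts with "+"
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record: " + header)
        # Line 4 carries exactly one quality symbol per nucleotide
        if len(sequence) != len(quality):
            raise ValueError("sequence/quality length mismatch: " + header)
        yield header[1:], sequence, quality

# Hypothetical record, for illustration only
example = io.StringIO("@read1 sample=Dog8\nACGT\n+\nIIII\n")
records = list(read_fastq(example))
```

Each yielded tuple holds the ID with its description, the nucleotides, and the matching quality string.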

SLIDE 8
  • For example, if the probability of an error (p) equals 0.01, then the corresponding quality score (Q) will be 20; if p = 0.001, then Q = 30.
  • Quality values are written as special ASCII characters, so that each value is encoded with a single symbol rather than two or three digits.
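The relationship behind these numbers is the Phred formula Q = -10 x log10(p). The sketch below converts between error probability, quality score and ASCII symbol. The offset of 33 (Sanger / Illumina 1.8+ encoding) is an assumption; older Illumina files used an offset of 64. Function names are illustrative.

```python
import math

PHRED_OFFSET = 33  # assumption: Sanger / Illumina 1.8+ encoding

def error_prob_to_q(p):
    """Phred quality score for an error probability p."""
    return -10 * math.log10(p)

def q_to_error_prob(q):
    """Error probability for a Phred quality score q."""
    return 10 ** (-q / 10)

def q_to_char(q, offset=PHRED_OFFSET):
    """Single ASCII symbol that encodes quality score q."""
    return chr(int(round(q)) + offset)

def char_to_q(symbol, offset=PHRED_OFFSET):
    """Quality score encoded by one ASCII symbol."""
    return ord(symbol) - offset
```

With this offset, Q = 30 is written as "?" and Q = 40 as "I".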

SLIDE 9

VISUALIZE FASTQ FILE SEQUENCE QUALITY

FastQC package (Andrews S, 2010)

fastqc_base/fastqc --extract $fastq -f fastq -o $out_dir -t $fastqc_threads

fastqc --extract -f fastq -o $fastqc_dir -t 6 $raw_reads_dir/*

fastqc_combine_base/fastqc_combine.pl -v --out $out_dir --skip --files "$out_dir/*_fastqc"

SLIDE 10

Raw Sequences: Sample Dog8_R1

SLIDE 11

Raw Sequences: Sample Dog8_R2

SLIDE 12

What about this quality??

https://www.bioinformatics.babraham.ac.uk/projects/fastqc/RNA-Seq_fastqc.html

SLIDE 13
3. Processing of 16S rRNA NGS data

SLIDE 14

Some tools available

CBIO-PIPELINE integrates some tools from UPARSE and QIIME to process NGS microbiome data

SLIDE 15

3.1 Merging Paired End reads

The UPARSE pipeline uses USEARCH commands (Edgar, 2010): usearch9 -fastq_mergepairs; maxdiff=3

R1:     ATGGATCCCGGAGGGGCGCGAAAAGAGAGAGATTCTCC ....300bp
R2:     300bp …..ATGGATCCCTGAGGCGCGCGAAAGGAGAGAGATCTCTCC
Merged: ATGGATCCCTGAGGGGCGCGAAAGGAGAGAGATCTCTCC

If a base differs between R1 and R2, the base that appears in the merged sequence should have a quality score 3 x higher than the other; otherwise it is called N (ambiguous call). If R1 and R2 differ at more than 3 positions, the pair is rejected.
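The two rules above can be sketched in Python as follows. This is a toy that follows the wording of this slide, not USEARCH's actual merging algorithm: it assumes the overlap has already been aligned (R2 reverse-complemented, both strings the same length) and that qualities are given as integer scores. The name `merge_overlap` and the `ratio` parameter are illustrative.

```python
def merge_overlap(r1, q1, r2, q2, maxdiff=3, ratio=3):
    """Merge two pre-aligned overlaps; return None if the pair is rejected."""
    if not (len(r1) == len(q1) == len(r2) == len(q2)):
        raise ValueError("reads and quality lists must have equal length")
    # Rule 2: too many mismatches, reject the pair
    diffs = sum(1 for a, b in zip(r1, r2) if a != b)
    if diffs > maxdiff:
        return None
    merged = []
    for a, qa, b, qb in zip(r1, q1, r2, q2):
        if a == b:
            merged.append(a)
        elif qa >= ratio * qb:
            merged.append(a)    # R1 base is clearly better
        elif qb >= ratio * qa:
            merged.append(b)    # R2 base is clearly better
        else:
            merged.append("N")  # Rule 1: ambiguous call
    return "".join(merged)
```

For example, a mismatch with qualities 30 versus 5 keeps the Q30 base, while 30 versus 30 yields N.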

SLIDE 16

Merged summary output

  • Fwd /researchdata/fhgfs/cbio/cbio/courses/IBS5003Z/samson/uparse/renamed/Dog10_R1.fastq
  • Rev /researchdata/fhgfs/cbio/cbio/courses/IBS5003Z/samson/uparse/renamed/Dog10_R2.fastq
  • Totals:
  • 79342 Pairs (79.3k)
  • 70287 Merged (70.3k, 88.59%)
  • 49910 Alignments with zero diffs (62.90%)
  • 8990 Too many diffs (> 3) (11.33%)
  • 0 Fwd tails Q <= 2 trimmed (0.00%)
  • 174 Rev tails Q <= 2 trimmed (0.22%)
  • 0 Fwd too short (< 64) after tail trimming (0.00%)
  • 38 Rev too short (< 64) after tail trimming (0.05%)
  • 27 No alignment found (0.03%)
  • 0 Alignment too short (< 16) (0.00%)
  • 79141 Staggered pairs (99.75%) merged & trimmed
  • 252.65 Mean alignment length
  • 252.65 Mean merged length
  • 0.29 Mean fwd expected errors
  • 2.23 Mean rev expected errors
  • 0.03 Mean merged expected errors

SLIDE 17

3.2 Filtering Merged Reads

Generally, filtering involves three steps:
✓ Filtering based on the error contribution of each nucleotide base (maxee)
✓ Primer stripping (nowadays primers are stripped by the sequencing platform)
✓ Length truncation

Filtering based on maximum expected error (maxee = 0.1)

uparse_filter_fastq_maxee=0.1

Here this is read as the maximum expected error per nucleotide in a DNA sequence.
✓ Thus, a sequence of length 100 bp is rejected only if its total expected error > 10 [0.1 x 100]
✓ A 250 bp sequence is rejected if its total expected error > 25.

What if maxee = 0.5?

✓ A 250 bp sequence is rejected only if its total expected error > 250 x 0.5 = 125!

Note: in USEARCH itself, -fastq_maxee limits the total expected errors of the whole read, while -fastq_maxee_rate limits the expected errors per base; the per-base reading used on this slide corresponds to the latter.
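The expected error of a read is the sum of the per-base error probabilities, E = sum of 10^(-Q/10). A minimal sketch of the filter as it is described on this slide (threshold = maxee x read length) follows; function names are illustrative, and qualities are assumed to be integer Phred scores.

```python
def expected_errors(quals):
    """Total expected errors of a read: sum of per-base error probabilities."""
    return sum(10 ** (-q / 10) for q in quals)

def passes_filter(quals, maxee_per_base=0.1):
    """Keep the read if total expected errors <= maxee_per_base x length."""
    return expected_errors(quals) <= maxee_per_base * len(quals)

# A 100 bp read at Q20 throughout has about 1 expected error
q20_read = [20] * 100
# A 100 bp read at Q5 throughout has about 31.6 expected errors
q5_read = [5] * 100
```

With maxee = 0.1 the Q5 read is rejected (31.6 > 10), but with maxee = 0.5 it would pass (31.6 <= 50), which is why a high value filters very little.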

SLIDE 18

Was Quality Control Effective?

SLIDE 19

3.3 FastQC of Merged, Trimmed and Filtered Reads

SLIDE 20
4. Uparse_downstream

SLIDE 21

4.1. De-replication

Full-length de-replication is done to find the set of unique sequences. Sequences are compared letter by letter.

Sample result

>524e5df45a66fb616ef4a553473dd833dedff0ca;size=2;
AACACAGGGGGCAAGCGTTGTCCGGAATCACTGGGCGTAAAGGGCGCGTAGGCGGTCTGTTAAGTCGGATG
TGAAATGTAAGGGCTCAACCCTTAACGTGCATCCGATACTGGCAGACTTGAGTGCGGAAGAGGCAAGTGGAA
TTCCTAGTGTAGCGGTGAAATGCGTAGATATTAGGAGGAACACCAGTGGCGAAGGCGACTTGCTGGGCCGTA
ACTGACGCTGAGGCGCGAAAGCGTGGGGAGCAAACAGG
>459142c21f9a9981d43f98e53cc276b781ad2c6a;size=5;
AACATAAGGGGCAAGCGTTGTCCGGAATCACTGGGCGTAAAGGGCGCGTAGGCGGTCTGTTAAGTCGGATGT
GAAATGTAAGGGCTCAACCCTTAACGTGCATCCGATACTGGCAGACTTGAGTGCGGAAGAGGCAAGTGGAAT
TCCTAGTGTAGCGGTGAAATGCGTAGATATTAGGAGGAACACCAGTGGCGAAGGCGACTTGCTGGGCCGTAA
CTGACGCTGAGGCGCGAAAGCGTGGGGAGCAAACAGG
>4ea702c69516c467927860701b5d1a3b59b5d9c6;size=1;
AACATAGAGGGCAAGCGTTGTCCGGAATCACTGGGCGTAAAGGGCGCGTAGGCGGTCTGTTAAGTCGGATGT
GAAATGTAGGGGCTCAACCCTTAACGTGCATCCGATACTGGCAGACTTGAGTGCGGAAGAGGCAAGTGGAAT
TCCTAGTGTAGCGGTGAAATGCGTAGATATTAGGAGGAACACCAGTGGCGAAGGCGACTTGCTGGGCCGTAA
CTGACGCTGAGGCGCGAAAGCGTGGGGAGCAAACAGG
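Full-length de-replication amounts to counting exact duplicates. A minimal Python sketch is given below. Labelling each unique sequence with its SHA-1 digest is an assumption, suggested only by the 40-character hexadecimal IDs in the sample result; the function name is illustrative.

```python
import hashlib
from collections import Counter

def dereplicate(sequences):
    """Collapse identical sequences; return (label, sequence) pairs.

    Labels look like '<sha1>;size=<count>;' (SHA-1 is an assumption here).
    """
    # Exact, letter-by-letter comparison (case-insensitive)
    counts = Counter(seq.upper() for seq in sequences)
    records = []
    for seq, size in counts.most_common():
        digest = hashlib.sha1(seq.encode("ascii")).hexdigest()
        records.append((f"{digest};size={size};", seq))
    return records

# Hypothetical reads, for illustration only
uniques = dereplicate(["ACGT", "acgt", "TTTT"])
```

Here "ACGT" and "acgt" collapse into one record with size=2, and "TTTT" stays a singleton with size=1.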

SLIDE 22

4.2. Sort sequences by size

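usearch9 -sortbysize command, min_size=2: unique sequences are sorted by decreasing abundance, and those seen fewer than 2 times (singletons) are discarded.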

Sample result

>2fd264476fe1367dbe062db6e5bdcc7d384a8487;size=190716;
TACGTAGGGGGCTAGCGTTATCCGGATTTACTGGGCGTAAAGGGTGCGTAGGCGGTCTTTCAAGTCAGGAGTTAAAGGCTAC
GGCTCAACCGTAGTAAGCTCCTGATACTGTCTGACTTGAGTGCAGGAGAGGAAAGCGGAATTCCCAGTGTAGCGGTGAAATG
CGTAGATATTGGGAGGAACACCAGTAGCGAAGGCGGCTTTCTGGACTGTAACTGACGCTGAGGCACGAAAGCGTGGGGAGC
AAACAGG
>dfcca28a6795cdd3c43b2fbfd5d1f7f64ead1fa8;size=161971;
TACGGAAGGTCCAGGCGTTATCCGGATTTATTGGGTTTAAAGGGAGCGCAGGCGGACTCTTAAGTCAGTTGTGAAATACGGC
GGCTCAACCGTCGGACTGCAGTTGATACTGGGAGTCTTGAGTGCACACAGGGATGCTGGAATTCATGGTGTAGCGGTGAAAT
GCTCAGATATCATGAAGAACTCCGATCGCGAAGGCAGGTATCCGGGGTGCAACTGACGCTGAGGCTCGAAAGTGCGGGTATC
AAACAGG
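Sorting by size can be sketched in Python as below: the abundance is read from the ";size=N;" annotation in each label, records below the minimum size are dropped, and the rest are ordered from most to least abundant. The function name and the example labels are illustrative.

```python
import re

SIZE_RE = re.compile(r";size=(\d+);")

def sort_by_size(records, min_size=2):
    """Sort (label, sequence) records by ';size=N;' and drop size < min_size."""
    sized = []
    for label, seq in records:
        match = SIZE_RE.search(label)
        if match is None:
            raise ValueError("no size annotation in label: " + label)
        size = int(match.group(1))
        if size >= min_size:
            sized.append((size, label, seq))
    # Most abundant first; ties keep their input order
    sized.sort(key=lambda item: item[0], reverse=True)
    return [(label, seq) for _, label, seq in sized]

# Hypothetical records, for illustration only
example_records = [("a;size=1;", "AA"), ("b;size=5;", "CC"), ("c;size=2;", "GG")]
kept = sort_by_size(example_records)
```

With min_size=2 the singleton "a" is discarded and "b" comes before "c".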

SLIDE 23

4.3. De novo OTU picking

usearch9 -cluster_otus, otu_radius_pct=3 (Edgar, R.C., 2013)

Performs 97% OTU clustering using the UPARSE-OTU algorithm: a radius of 3% corresponds to an identity threshold of 97%.
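To show what a 3% radius means in practice, here is a toy greedy clustering in Python. It is a deliberate simplification and not the UPARSE-OTU algorithm: identity is computed position by position on equal-length sequences (no alignment, no gaps), and there is no chimera handling. Sequences are assumed to arrive sorted by decreasing abundance; all names are illustrative.

```python
def identity_pct(a, b):
    """Percent identity of two equal-length sequences (toy: no gaps)."""
    if len(a) != len(b):
        raise ValueError("toy identity needs equal-length sequences")
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return 100.0 * matches / len(a)

def greedy_cluster(seqs_by_abundance, otu_radius_pct=3):
    """Assign each sequence to the first centroid within the radius."""
    threshold = 100.0 - otu_radius_pct  # radius 3% -> 97% identity
    centroids = []
    assignment = {}
    for seq in seqs_by_abundance:
        for idx, centroid in enumerate(centroids):
            if identity_pct(seq, centroid) >= threshold:
                assignment[seq] = idx
                break
        else:
            # No centroid is close enough: this sequence founds a new OTU
            centroids.append(seq)
            assignment[seq] = len(centroids) - 1
    return centroids, assignment

# Hypothetical 100 bp sequences, for illustration only
seq_a = "A" * 100
seq_b = "A" * 98 + "CC"        # 98% identical to seq_a
seq_c = "A" * 90 + "C" * 10    # 90% identical to seq_a
centroids, assignment = greedy_cluster([seq_a, seq_b, seq_c])
```

The second sequence joins the first OTU (98% >= 97%), while the third founds a new one.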

SLIDE 24

4.4. Chimera detection and removal

usearch9 -uchime2_ref, gold_db

✓ Chimeric sequences detected and removed

Sample output:

– 59Mb 100.0% Reading /scratch/DB/bio/qiime/uchime/gold.fa
– 26Mb 100.0% Converting to upper case
– 27Mb 100.0% Word stats
– 27Mb 100.0% Alloc rows
– 86Mb 100.0% Build index
– 93Mb 100.0% Chimeras 5/184 (2.7%), in db 27 (14.7%), not matched 152 (82.6%)

SLIDE 25

4.5. OTU table generation

De-dereplication and QIIME-compatible otu_table

usearch9 -usearch_global

The usearch_global command searches for how many times each OTU appears in each sample and then generates a QIIME-compatible otu_table.

OTUId Dog10/1 Dog15/1 Dog16/1 Dog17/1 Dog1/1 Dog22/1 Dog24/1 Dog29/1 Dog2/1
OTU_19 2961 151 25 569 212 967 64 330 2691 257 567 374
OTU_2 14004 7549 13826 14747 8370 5715 33064 658 11497 21298 44
OTU_1 10077 29178 11913 9804 10362 33473 22356 25381 8320 13869
OTU_12 1276 586 1185 1258 1906 476 3418 128 1247 998 1510
OTU_4 5932 11319 4568 5609 8082 14859 9988 6492 6135 8157 12908
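Conceptually, the OTU table is a count of read-to-OTU assignments per sample. The Python sketch below builds a tab-separated table from (sample, OTU) pairs; it assumes the read-to-OTU mapping has already been done by the search step, and the function name and example hits are made up for illustration.

```python
from collections import defaultdict

def build_otu_table(hits):
    """Build a tab-separated OTU table from (sample_id, otu_id) pairs.

    Each pair stands for one read of that sample mapped to that OTU.
    """
    counts = defaultdict(lambda: defaultdict(int))
    samples = set()
    for sample, otu in hits:
        counts[otu][sample] += 1
        samples.add(sample)
    columns = sorted(samples)
    lines = ["\t".join(["OTUId"] + columns)]
    for otu in sorted(counts):
        row = [str(counts[otu][sample]) for sample in columns]
        lines.append("\t".join([otu] + row))
    return "\n".join(lines)

# Hypothetical hits, for illustration only
example_hits = [("Dog1", "OTU_1"), ("Dog1", "OTU_1"),
                ("Dog2", "OTU_1"), ("Dog2", "OTU_2")]
table = build_otu_table(example_hits)
```

Samples in which an OTU was never seen get a count of 0 in that row.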

SLIDE 26
5. Decontamination

SLIDE 27

5.1 Overview of the Decontamination

✓ We need to identify OTUs that may have been contributed by contamination from the reagents used for sampling, DNA extraction and purification, and from the environment and personnel where the DNA was extracted.
✓ This is critical, especially for clinical samples. Why?
✓ These OTUs must be subtracted from the biological samples to retain a true representation of the OTUs from the sample of interest.
✓ To achieve this, reagents / blanks [controls] are spiked with known bacteria at the same DNA concentrations as those used in the samples under study.

SLIDE 28

5.2 Assign Taxonomy to controls

Note: The assumption here is that QC has been conducted as explained in the previous slides.

Taxonomy of controls [spiked]

assign_taxonomy.py -i otus_repsetOUT.fa -o tax -r gg_db/rep_set/97_otus.fasta -t gg_db/taxonomy/97_otu_taxonomy.txt -m uclust

✓ From the output we can tell spiked OTUs from contaminant OTUs

  • In most cases, the spiked OTUs will be the most abundant

✓ Spiked OTU sequences are removed from the controls, so the remaining sequences are contaminants
✓ Contaminant OTU sequences are aligned to the biological sample sequences at 100% identity and over their entire length

SLIDE 29

5.3 Example of spiked control OTUs reads

Columns: OTUId, then read counts in the four spiked controls: sequencing_control primestorecyano P1_G06, P2_G06, P3_G06 and P4_G06.

OTU_1 12848 9358 12349 10627
OTU_10 1 1 2
OTU_11 18 17 23 15
OTU_12 67 66 64 60
OTU_13 11 13 9 8
OTU_14 443 368 426 349
OTU_15 3 5 5 4
OTU_16 4 5 4 4
OTU_17 1 1 2
OTU_18 2 1
OTU_19 10 13 18 6
OTU_2 2 2 2
OTU_20 8 4 9 5
OTU_21 2
OTU_22 918 771 1125 778
OTU_3 176 193 235 147
OTU_4 10 8 16 10
OTU_5 2363 2106 2607 2553
OTU_6 2 1
OTU_7 6 3 4 4
OTU_8 1 1
OTU_9 1 2 1 1

SLIDE 30

5.4. Example of spiked control taxonomy table

OTU Tax %ID Av_reads
OTU_1 Bacteria;__Cyanobacteria;__Cyanobacteria;__SubsectionIII;__FamilyI;__Arthrospira 100 11296
OTU_10 Bacteria;__Bacteroidetes;__Flavobacteria;__Flavobacteriales;__Cryomorphaceae;__Fluviicola 88 1
OTU_11 Bacteria;__Deinococcus-Thermus;__Deinococci;__Deinococcales;__Trueperaceae;__Truepera 100 18
OTU_12 Bacteria;__Proteobacteria;__Gammaproteobacteria;__Oceanospirillales;__Alcanivoracaceae;__A 100 64
OTU_13 Bacteria;__Proteobacteria;__Gammaproteobacteria;__Order_Incertae_Sedis;__Family_Incertae_S 100 10
OTU_14 Bacteria;__Bacteroidetes;__Cytophagia;__Order_III_Incertae_Sedis;__ML310M-34;__g 69 397
OTU_15 Bacteria;__Verrucomicrobia;__Opitutae;__Puniceicoccales;__Puniceicoccaceae;__g 72 4
OTU_16 Bacteria;__Actinobacteria;__Acidimicrobiia;__Acidimicrobiales;__Acidimicrobiaceae;__g 86 4
OTU_17 Bacteria 99 1
OTU_18 Bacteria;__Bacteroidetes;__Cytophagia;__Order_III_Incertae_Sedis;__ML310M-34;__g 95 1
OTU_19 Bacteria;__Proteobacteria;__Alphaproteobacteria;__Caulobacterales;__Hyphomonadaceae;__Oc 100 12
OTU_2 Bacteria;__Proteobacteria;__Gammaproteobacteria;__Pasteurellales;__Pasteurellaceae;__Haemo 79 2
OTU_20 Bacteria;__Proteobacteria;__Gammaproteobacteria;__Xanthomonadales;__Sinobacteraceae;__g 52 7
OTU_21 Bacteria;__Lentisphaerae;__Lentisphaeria;__SS1-B-03-39;__f;__g 60 1
OTU_22 Bacteria;__Verrucomicrobia;__Opitutae;__Puniceicoccales;__Puniceicoccaceae;__g 100 898
OTU_3 Bacteria;__Proteobacteria;__Alphaproteobacteria;__Rhodobacterales;__Rhodobacteraceae 100 188
OTU_4 Bacteria;__Bacteroidetes;__Cytophagia;__Cytophagales;__Cyclobacteriaceae;__g 99 11
OTU_5 Bacteria;__Proteobacteria;__Gammaproteobacteria;__Alteromonadales;__Alteromonadaceae;__M 100 2407
OTU_6 Bacteria;__Proteobacteria;__Alphaproteobacteria;__Rhizobiales;__Phyllobacteriaceae;__Pseudam 79 1
OTU_7 Bacteria;__Proteobacteria;__Gammaproteobacteria;__Pseudomonadales;__Moraxellaceae;__Mo 100 4
OTU_8 Bacteria;__Firmicutes;__Bacilli;__Lactobacillales;__Carnobacteriaceae;__Dolosigranulum 100 1
OTU_9 Bacteria;__Proteobacteria;__Alphaproteobacteria;__Rhizobiales;__Bradyrhizobiaceae;__Salinarim 100 1

Note: Before calculating the average reads of the control replicates, establish whether they are comparable by calculating the % of each OTU in each spiked control.
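The comparability check in the note can be sketched in Python as below. The counts are the four most abundant rows copied from the spiked-control read table in section 5.3; because only these four OTUs are included, the percentages are relative to this subset rather than to the full control. Function names are illustrative.

```python
# Read counts per replicate (P1_G06 .. P4_G06), copied from section 5.3
controls = {
    "OTU_1": [12848, 9358, 12349, 10627],
    "OTU_5": [2363, 2106, 2607, 2553],
    "OTU_22": [918, 771, 1125, 778],
    "OTU_14": [443, 368, 426, 349],
}

def percent_per_replicate(table):
    """Percentage of each OTU within each replicate (column-wise)."""
    n_reps = len(next(iter(table.values())))
    totals = [sum(row[i] for row in table.values()) for i in range(n_reps)]
    return {otu: [100.0 * count / total for count, total in zip(row, totals)]
            for otu, row in table.items()}

def average_reads(table):
    """Mean read count of each OTU across the replicates."""
    return {otu: sum(row) / len(row) for otu, row in table.items()}

percentages = percent_per_replicate(controls)
averages = average_reads(controls)
```

If the percentages of an OTU are close across the four replicates, averaging its reads is reasonable; for OTU_1 the mean is 11295.5, which rounds to the 11296 shown in the taxonomy table.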

SLIDE 31

5.6 Search of contaminant OTUs from Biological samples

align_seqs.py -i $inDir/conta.fa -o $outDir/decon100 -t $inDir/otus_repsetOUT.fa -e 250 -p 100.0 -m pynast

Columns: candidate sequence ID (contaminant OTUs), candidate nucleotide count, template ID (biological OTUs), BLAST percent identity to template, candidate nucleotide count post-NAST

OTU_1 250 OTU_41 100.00 250
OTU_2 250 OTU_22 100.00 250
OTU_3 250 OTU_77 100.00 250
OTU_4 250 OTU_285 100.00 250
OTU_5 250 OTU_75 100.00 250
OTU_6 250 OTU_1 100.00 250
OTU_7 250 OTU_251 100.00 250
OTU_8 250 No search results.
OTU_9 250 OTU_195 100.00 250
OTU_10 250 OTU_208 100.00 250
OTU_11 250 No search results.
OTU_12 250 OTU_90 100.00 250

SLIDE 32

5.7. Removing contaminant sequences

If a contaminant OTU matches an OTU in the biological sample at 100%, that contaminant is present in the biological sample; otherwise it is not.

✓ If present, action taken:

  • 1. If the average number of reads in the contaminant is similar to that in the biological sample, the OTU is completely removed from the biological sample.
  • 2. If the average number of reads in the contaminant is more than in the biological sample, again the OTU is completely removed from the biological sample.
  • 3. If the average number of reads in the contaminant is less than in the biological sample, an equivalent number of reads is removed from the biological sample (BS).
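The three rules can be written as a small Python function. This is a sketch under stated assumptions: the slide does not define "similar", so the relative tolerance `similar_tol` (10% here) is a hypothetical parameter, and the function name is illustrative.

```python
def decontaminate(bio_reads, contaminant_avg, similar_tol=0.1):
    """Return the read count left in the biological sample for one OTU.

    similar_tol is hypothetical: counts within this relative difference
    are treated as 'similar'.
    """
    # Rule 2: contaminant has at least as many reads, remove the OTU
    if contaminant_avg >= bio_reads:
        return 0
    # Rule 1: counts are similar, remove the OTU
    if bio_reads - contaminant_avg <= similar_tol * bio_reads:
        return 0
    # Rule 3: contaminant has fewer reads, subtract the equivalent reads
    return bio_reads - contaminant_avg
```

For example, an OTU with 1000 reads in the biological sample and an average of 64 reads in the controls keeps 936 reads, while one with 50 reads against 80 in the controls is removed entirely.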

SLIDE 33

What if the % alignment is lowered, e.g. to 99%?

This gives the contaminants a 1% margin to match BS sequences that they could not match at 100%. The risk is that we lose sequences / OTUs from the BS that we would otherwise keep at 100%.

Note: Since DNA competes during PCR amplification, unless there is serious contamination, it makes sense that very few contaminant sequences will match the BS at 100%.

SLIDE 34

5.8 Product of decontamination

After removing contaminant OTUs / sequences:
✓ An otu_table free from contaminant OTUs / sequences is generated, giving a true representation of the sample under investigation
✓ Taxonomy of the clean OTUs is assigned using the same approach as explained for the spiked controls
✓ A phylogenetic tree is generated after aligning the sequences to the reference database (Greengenes / SILVA)

SLIDE 35
6. Taxonomy assignment of BS

✓ Taxonomy of biological sequences is assigned using the same approach as explained for the controls
✓ A phylogenetic tree is generated after aligning the sequences to the reference db

SLIDE 36

SUMMARY

[Diagram: Raw sequence data, CBIO-PIPELINE, otus_repsetOUT.fa, otu_table, Taxonomy file, Phylogenetic tree, BIOM file, STATISTICIAN]

SLIDE 37

Acknowledgement

CBIO Team: Prof Nicola Mulder, Gerrit and Katie, for their constructive inputs
SLIDE 38

Thank you for your attention
