SANBio BIOINFORMATICS TRAINING COURSE THE MICROBIOME: ANALYSIS OF - PowerPoint PPT Presentation

  1. SANBio BIOINFORMATICS TRAINING COURSE THE MICROBIOME: ANALYSIS OF NGS DATA CBIO-PIPELINE SAMSON, KM 10/23/2017 Microbiome : Analysis of NGS Data 1

  2. Outline
     • Background
     • Wet Lab!
     • Raw reads
     • Quality Assessment
     • Quality Control
     • Merging and Filtering
     • OTU picking
     • Decontamination, Annotation and BIOM

  3. WET LAB
     “Garbage in, garbage out”
     • Good laboratory practice is needed to produce reliable data for downstream processing.
     • Mistakes made in the wet lab cannot be corrected in the dry lab.

  4. 1. PCR and Sequencing flow chart
     Stage 1: PCR
     Stage 2: QC analysis
     Stage 3: Sequencing

  5. Why the V4 16S rRNA region?
     Pros
     • Well-established protocols
     • Full overlap of forward and reverse reads
     • Less error during assembling
     • Highly reduced sequencing noise
     Cons
     • Hypervariable regions only
     • Less information
     • Limited resolution in Bacillus *

  6. 2. Raw Sequence Reads Quality Assessment

  7. Raw sequences
     A FASTQ file always has 4 lines per sequence.
     ✓ The first line shows the sequence ID and an optional description.
     ✓ The second line contains the sequence of nucleotides.
     ✓ The third line generally holds only a “+” symbol and, occasionally, the same ID and description as the first line.
     ✓ The fourth line gives the quality score of each nucleotide on the second line, which encodes “the probability of a sequencing error at each position of the nucleotide”.
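
The four-line layout described above can be read with a few lines of Python. This is a minimal sketch, not part of the CBIO pipeline: the names `FastqRecord` and `parse_fastq` are hypothetical, and the check that the first line starts with "@" follows the general FASTQ convention rather than anything stated on the slide.

```python
from typing import Iterable, Iterator, NamedTuple


class FastqRecord(NamedTuple):
    seq_id: str     # line 1 without the leading "@", including any description
    sequence: str   # line 2, the nucleotides
    quality: str    # line 4, one quality character per nucleotide


def parse_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield one record per group of four lines (hypothetical helper)."""
    # Strip newlines and ignore completely empty lines (e.g. a trailing blank).
    stripped = [ln.rstrip("\n") for ln in lines]
    stripped = [ln for ln in stripped if ln]
    if len(stripped) % 4 != 0:
        raise ValueError("FASTQ input must contain 4 lines per sequence")
    for i in range(0, len(stripped), 4):
        header, seq, plus, qual = stripped[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record starting at line {i + 1}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch in {header}")
        yield FastqRecord(header[1:], seq, qual)


# Toy input invented for illustration.
records = list(parse_fastq(["@read1 demo\n", "ACGT\n", "+\n", "IIII\n"]))
```

With a real file, the same function can be fed an open file handle, for example `parse_fastq(open(path))`, where `path` is whatever FASTQ file is being inspected.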

  8. • The quality score Q is derived from the error probability p as Q = -10 × log10(p). For example, if p = 0.01, then Q = 20; if p = 0.001, then Q = 30.
     • Quality values are encoded as ASCII characters, so that each value takes a single symbol rather than two or three digits.
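
The relationship between error probability and quality score can be checked numerically. The sketch below assumes the common Phred+33 ASCII offset (Sanger / recent Illumina); the slide does not state which offset the course data use, so `offset=33` is an assumption, and all function names are hypothetical.

```python
import math


def phred_from_prob(p: float) -> float:
    """Quality score for an error probability: Q = -10 * log10(p)."""
    return -10 * math.log10(p)


def prob_from_phred(q: float) -> float:
    """Inverse mapping: p = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


def decode_quality(qual: str, offset: int = 33) -> list[int]:
    """Turn a FASTQ quality string into integer scores (offset is assumed)."""
    return [ord(ch) - offset for ch in qual]


q20 = phred_from_prob(0.01)    # the slide's first example
q30 = phred_from_prob(0.001)   # the slide's second example
scores = decode_quality("I5")  # two quality characters, one score each
```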

  9. VISUALIZE FASTQ FILE SEQUENCE QUALITY
     FastQC Package (Andrews S, 2010)
       fastqc_base/fastqc --extract $fastq -f fastq -o $out_dir -t $fastqc_threads
       fastqc --extract -f fastq -o $fastqc_dir -t 6 $raw_reads_dir/*
       fastqc_combine_base/fastqc_combine.pl -v --out $out_dir --skip --files "$out_dir/*_fastqc"

  10. Raw Sequences: Sample Dog8_R1

  11. Raw Sequence: Sample Dog8_R2

  12. What about this quality?? https://www.bioinformatics.babraham.ac.uk/projects/fastqc/RNA-Seq_fastqc.html

  13. 3. Processing of 16S rRNA NGS data

  14. Some tools available
      CBIO-PIPELINE integrates some tools from UPARSE and QIIME to process NGS microbiome data.

  15. 3.1 Merging Paired End reads
      The UPARSE pipeline uses USEARCH commands (Edgar, 2010).
      Usearch9 – fastq_mergepairs; maxdiff=3
      R1 (300bp):  ATGGATCCC G GAGG G GCGCGAAAAGAGAGAGATTCTCC ....
      R2 (300bp):  …..ATGGATCCC T GAGG C GCGCGAAAGGAGAGAGATCTCTCC
      Merged:      ATGGATCCC T GAGG G GCGCGAAA G GAGAGAGATCTCTCC
      • If a base differs between R1 and R2, the base kept in the merged sequence should have a quality score 3 times that of the other; otherwise the position is called N (ambiguous).
      • If R1 and R2 differ at more than 3 nucleotides, the pair is rejected.
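
The two rules on this slide can be written out as a toy function. This is only a sketch of the rule as the slide states it, applied to an overlap that is assumed to be already aligned and of equal length; it is not USEARCH's actual merging code, which may resolve mismatches differently. The function name, the toy reads and their quality values are all hypothetical.

```python
def merge_overlap(r1, q1, r2, q2, max_diffs=3, ratio=3):
    """Merge two pre-aligned overlaps base by base (slide's rule, simplified).

    r1, r2: overlapping bases from the forward and reverse read
    q1, q2: integer quality scores, one per base
    Returns the merged string, or None if the pair has too many differences.
    """
    if not (len(r1) == len(q1) == len(r2) == len(q2)):
        raise ValueError("reads and quality lists must have equal length")
    merged = []
    diffs = 0
    for b1, s1, b2, s2 in zip(r1, q1, r2, q2):
        if b1 == b2:
            merged.append(b1)
            continue
        diffs += 1
        if s1 >= ratio * s2:      # R1 base is at least 3x more confident
            merged.append(b1)
        elif s2 >= ratio * s1:    # R2 base is at least 3x more confident
            merged.append(b2)
        else:                     # neither base clearly wins: ambiguous call
            merged.append("N")
    if diffs > max_diffs:         # more than maxdiff mismatches: reject pair
        return None
    return "".join(merged)


# Toy overlap: position 4 is resolved in favour of R1, position 5 becomes N.
merged = merge_overlap("ATGGC", [30, 30, 30, 36, 10],
                       "ATGCT", [30, 30, 30, 10, 20])
rejected = merge_overlap("AAAA", [30] * 4, "TTTT", [30] * 4)
```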

  16. Merged summary output
      • Fwd /researchdata/fhgfs/cbio/cbio/courses/IBS5003Z/samson/uparse/renamed/Dog10_R1.fastq
      • Rev /researchdata/fhgfs/cbio/cbio/courses/IBS5003Z/samson/uparse/renamed/Dog10_R2.fastq
      • Totals:
      • 79342 Pairs (79.3k)
      • 70287 Merged (70.3k, 88.59%)
      • 49910 Alignments with zero diffs (62.90%)
      • 8990 Too many diffs (> 3) (11.33%)
      • 0 Fwd tails Q <= 2 trimmed (0.00%)
      • 174 Rev tails Q <= 2 trimmed (0.22%)
      • 0 Fwd too short (< 64) after tail trimming (0.00%)
      • 38 Rev too short (< 64) after tail trimming (0.05%)
      • 27 No alignment found (0.03%)
      • 0 Alignment too short (< 16) (0.00%)
      • 79141 Staggered pairs (99.75%) merged & trimmed
      • 252.65 Mean alignment length
      • 252.65 Mean merged length
      • 0.29 Mean fwd expected errors
      • 2.23 Mean rev expected errors
      • 0.03 Mean merged expected errors

  17. 3.2 Filtering Merged Reads
      Generally, filtering involves three steps:
      ✓ Filtering based on the error contribution of each nucleotide base (maxee)
      ✓ Primer stripping (nowadays primers are stripped by the sequencing platform)
      ✓ Length truncation
      Filtering based on maximum expected error (maxee = 0.1)
      uparse_filter_fastq_maxee=0.1
      This is the maximum expected error per nucleotide in a DNA sequence.
      ✓ Thus, a 100bp sequence is rejected only if its total expected error > 10 [0.1 x 100].
      ✓ A 250bp sequence is rejected if its total expected error > 25.
      What if maxee = 0.5?
      ✓ A 250bp sequence would be rejected only if its total expected error > 250 x 0.5 = 125 !!!!
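
The arithmetic on this slide can be reproduced with a short sketch. It follows the slide's reading of the threshold as a per-base rate (reject when the total expected error exceeds rate x length) and computes the expected error of a read as the sum of its per-base error probabilities. The function names and the toy quality values are hypothetical, and the sketch does not claim to mirror how the pipeline passes this setting to USEARCH.

```python
def error_probability(q: int) -> float:
    """Per-base error probability for a Phred score: p = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


def expected_errors(quals: list[int]) -> float:
    """Total expected number of errors in a read: the sum of per-base p."""
    return sum(error_probability(q) for q in quals)


def passes_filter(quals: list[int], maxee_rate: float = 0.1) -> bool:
    """Keep the read unless total expected error > maxee_rate * read length."""
    return expected_errors(quals) <= maxee_rate * len(quals)


good_read = [20] * 100   # every base Q20, so about 1 expected error in total
poor_read = [5] * 100    # every base Q5, so about 31.6 expected errors

keep_good = passes_filter(good_read)            # 1.0 <= 10
keep_poor_strict = passes_filter(poor_read)     # 31.6 > 10
keep_poor_loose = passes_filter(poor_read, 0.5) # 31.6 <= 50
```

The last line illustrates the slide's warning: with a rate of 0.5, even a read made entirely of Q5 bases is kept.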

  18. Was Quality Control Effective?

  19. 3.3 FastQC of Merged, Trimmed and Filtered Reads

  20. 4. Uparse_downstream

  21. 4.1. De-replication
      Full-length de-replication is done to find a set of unique sequences. Sequences are compared letter by letter.
      Sample result:
      >524e5df45a66fb616ef4a553473dd833dedff0ca;size=2;
      AACACAGGGGGCAAGCGTTGTCCGGAATCACTGGGCGTAAAGGGCGCGTAGGCGGTCTGTTAAGTCGGATG
      TGAAATGTAAGGGCTCAACCCTTAACGTGCATCCGATACTGGCAGACTTGAGTGCGGAAGAGGCAAGTGGAA
      TTCCTAGTGTAGCGGTGAAATGCGTAGATATTAGGAGGAACACCAGTGGCGAAGGCGACTTGCTGGGCCGTA
      ACTGACGCTGAGGCGCGAAAGCGTGGGGAGCAAACAGG
      >459142c21f9a9981d43f98e53cc276b781ad2c6a;size=5;
      AACATAAGGGGCAAGCGTTGTCCGGAATCACTGGGCGTAAAGGGCGCGTAGGCGGTCTGTTAAGTCGGATGT
      GAAATGTAAGGGCTCAACCCTTAACGTGCATCCGATACTGGCAGACTTGAGTGCGGAAGAGGCAAGTGGAAT
      TCCTAGTGTAGCGGTGAAATGCGTAGATATTAGGAGGAACACCAGTGGCGAAGGCGACTTGCTGGGCCGTAA
      CTGACGCTGAGGCGCGAAAGCGTGGGGAGCAAACAGG
      >4ea702c69516c467927860701b5d1a3b59b5d9c6;size=1;
      AACATAGAGGGCAAGCGTTGTCCGGAATCACTGGGCGTAAAGGGCGCGTAGGCGGTCTGTTAAGTCGGATGT
      GAAATGTAGGGGCTCAACCCTTAACGTGCATCCGATACTGGCAGACTTGAGTGCGGAAGAGGCAAGTGGAAT
      TCCTAGTGTAGCGGTGAAATGCGTAGATATTAGGAGGAACACCAGTGGCGAAGGCGACTTGCTGGGCCGTAA
      CTGACGCTGAGGCGCGAAAGCGTGGGGAGCAAACAGG
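
Full-length de-replication amounts to counting identical strings. The sketch below shows the idea on toy reads; it is not the USEARCH implementation. The labels in the sample result look like 40-character hex digests, so the sketch labels each unique sequence with its SHA-1 digest, but that choice is an assumption about how the labels were produced, and `dereplicate` is a hypothetical name.

```python
import hashlib
from collections import Counter


def dereplicate(seqs):
    """Collapse identical sequences and annotate each with its abundance.

    Returns a list of (label, sequence) pairs in first-seen order, where the
    label has the form "<sha1 hex digest>;size=<count>;" (assumed format).
    """
    counts = Counter(s.upper() for s in seqs)  # exact, letter-by-letter match
    records = []
    for seq, size in counts.items():
        digest = hashlib.sha1(seq.encode("ascii")).hexdigest()
        records.append((f"{digest};size={size};", seq))
    return records


# Toy reads invented for illustration: three copies of one sequence, one other.
unique = dereplicate(["ACGT", "acgt", "ACGA", "ACGT"])
```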

  22. 4.2. Sort sequences by size
      usearch9 -sortbysize command, min_size=2
      Sample result:
      >2fd264476fe1367dbe062db6e5bdcc7d384a8487;size=190716;
      TACGTAGGGGGCTAGCGTTATCCGGATTTACTGGGCGTAAAGGGTGCGTAGGCGGTCTTTCAAGTCAGGAGTTAAAGGCTAC
      GGCTCAACCGTAGTAAGCTCCTGATACTGTCTGACTTGAGTGCAGGAGAGGAAAGCGGAATTCCCAGTGTAGCGGTGAAATG
      CGTAGATATTGGGAGGAACACCAGTAGCGAAGGCGGCTTTCTGGACTGTAACTGACGCTGAGGCACGAAAGCGTGGGGAGC
      AAACAGG
      >dfcca28a6795cdd3c43b2fbfd5d1f7f64ead1fa8;size=161971;
      TACGGAAGGTCCAGGCGTTATCCGGATTTATTGGGTTTAAAGGGAGCGCAGGCGGACTCTTAAGTCAGTTGTGAAATACGGC
      GGCTCAACCGTCGGACTGCAGTTGATACTGGGAGTCTTGAGTGCACACAGGGATGCTGGAATTCATGGTGTAGCGGTGAAAT
      GCTCAGATATCATGAAGAACTCCGATCGCGAAGGCAGGTATCCGGGGTGCAACTGACGCTGAGGCTCGAAAGTGCGGGTATC
      AAACAGG
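
Sorting by size with a minimum abundance of 2 drops singletons and puts the most abundant unique sequences first. A minimal sketch, assuming labels carry a `;size=N;` annotation as in the sample results; `size_of` and `sort_by_size` are hypothetical helpers, not USEARCH functions, and the toy labels are invented.

```python
import re

SIZE_RE = re.compile(r";size=(\d+);")


def size_of(label: str) -> int:
    """Extract the abundance from a label such as 'abc123;size=5;'."""
    match = SIZE_RE.search(label)
    if match is None:
        raise ValueError(f"no size annotation in label: {label!r}")
    return int(match.group(1))


def sort_by_size(records, min_size=2):
    """Drop records below min_size, then sort by decreasing abundance."""
    kept = [(label, seq) for label, seq in records if size_of(label) >= min_size]
    return sorted(kept, key=lambda rec: size_of(rec[0]), reverse=True)


toy = [("a;size=2;", "ACGT"), ("b;size=5;", "ACGA"), ("c;size=1;", "ACGC")]
ordered = sort_by_size(toy)  # the singleton "c" is removed
```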

  23. 4.3. De novo OTU picking
      Usearch9 cluster_otus, otu_radius_pct=3
      Performs 97% OTU clustering using the UPARSE-OTU algorithm (Edgar, R.C., 2013).
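
A radius of 3% corresponds to a 97% identity threshold. To give a feel for what clustering at that threshold means, the sketch below uses a much simpler greedy centroid scheme: sequences are visited in order of decreasing abundance, and each one either joins the first centroid it matches at 97% or more, or becomes a new centroid. This is explicitly not the UPARSE-OTU algorithm (which, among other things, also considers chimeric models), and the identity function assumes equal-length, pre-aligned sequences. All names and toy sequences are hypothetical.

```python
def identity(a: str, b: str) -> float:
    """Fraction of matching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("this sketch only compares equal-length sequences")
    matches = sum(x == y for x, y in zip(a, b))
    return matches / len(a)


def greedy_cluster(seqs_by_abundance, threshold=0.97):
    """Assign each sequence to the first centroid within the threshold.

    seqs_by_abundance: unique sequences, most abundant first.
    Returns (centroids, assignment) where assignment maps sequence -> index.
    """
    centroids = []
    assignment = {}
    for seq in seqs_by_abundance:
        for index, centroid in enumerate(centroids):
            if identity(seq, centroid) >= threshold:
                assignment[seq] = index
                break
        else:
            # No centroid is close enough, so this sequence founds a new OTU.
            centroids.append(seq)
            assignment[seq] = len(centroids) - 1
    return centroids, assignment


# Toy sequences of length 100, invented for illustration.
seq_a = "A" * 100                 # most abundant, becomes the first centroid
seq_b = "A" * 98 + "CC"           # 98% identical to seq_a, joins its OTU
seq_c = "A" * 90 + "C" * 10       # 90% identical to seq_a, founds a new OTU
centroids, assignment = greedy_cluster([seq_a, seq_b, seq_c])
```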

  24. 4.4. Chimera detection and removal
      usearch9 -uchime2_ref, gold_db
      ✓ Chimeric sequences detected and removed
      Sample output:
      – 59Mb 100.0% Reading /scratch/DB/bio/qiime/uchime/gold.fa
      – 26Mb 100.0% Converting to upper case
      – 27Mb 100.0% Word stats
      – 27Mb 100.0% Alloc rows
      – 86Mb 100.0% Build index
      – 93Mb 100.0% Chimeras 5/184 (2.7%), in db 27 (14.7%), not matched 152 (82.6%)

  25. 4.5. OTU table generation
      De-dereplication and QIIME-compatible otu_table
      usearch9 -usearch_global
      The usearch_global command counts how many times each OTU appears in each sample and then generates a QIIME-compatible otu_table.
      OTUId Dog10/1 Dog15/1 Dog16/1 Dog17/1 Dog1/1 Dog22/1 Dog24/1 Dog29/1 Dog2/1
      OTU_19 2961 151 25 569 212 967 64 330 2691 257 567 374
      OTU_2 14004 7549 13826 14747 8370 5715 33064 658 11497 21298 44
      OTU_1 10077 29178 11913 9804 10362 33473 22356 25381 8320 13869
      OTU_12 1276 586 1185 1258 1906 476 3418 128 1247 998 1510
      OTU_4 5932 11319 4568 5609 8082 14859 9988 6492 6135 8157 12908
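
Once every read has been matched to an OTU, the table itself is just a count of reads per (OTU, sample) pair. The sketch below shows that counting step on toy data; it does not perform the sequence search that usearch_global does, and the sample names, OTU ids and function names are hypothetical.

```python
from collections import defaultdict


def build_otu_table(hits):
    """Count reads per OTU and sample.

    hits: iterable of (sample, otu_id) pairs, one per read matched to an OTU.
    Returns (samples, table) where table maps otu_id -> counts per sample,
    with samples in first-seen order.
    """
    counts = defaultdict(lambda: defaultdict(int))
    samples = []
    for sample, otu in hits:
        if sample not in samples:
            samples.append(sample)
        counts[otu][sample] += 1
    table = {otu: [per_sample.get(s, 0) for s in samples]
             for otu, per_sample in counts.items()}
    return samples, table


def to_tsv(samples, table):
    """Render the table as tab-separated text with an OTUId header column."""
    lines = ["\t".join(["OTUId", *samples])]
    for otu, row in table.items():
        lines.append("\t".join([otu, *map(str, row)]))
    return "\n".join(lines)


# Toy read-to-OTU assignments invented for illustration.
toy_hits = [("Dog10", "OTU_1"), ("Dog10", "OTU_1"),
            ("Dog15", "OTU_1"), ("Dog15", "OTU_2")]
samples, table = build_otu_table(toy_hits)
tsv = to_tsv(samples, table)
```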

  26. 5. Decontamination

  27. 5.1 Overview of the Decontamination
      ✓ We need to identify OTUs that may have been contributed by contamination from reagents used for sampling, DNA extraction and purification, and from the environments and personnel where DNA was extracted.
      ✓ This is critical, especially in clinical samples. Why?
      ✓ These OTUs must be subtracted from biological samples to retain a true representation of the OTUs from the sample of interest.
      ✓ To achieve this, reagents / blanks [controls] are spiked with known bacteria at the same DNA concentrations as those used in the samples under study.
