Sequencing data files and Quality Control Gilgi Friedlander - PowerPoint PPT Presentation

Raw Illumina Next Generation Sequencing data files and Quality Control Gilgi Friedlander Bioinformatics Unit, Biological Services

Illumina Workflow
1. Sample preparation (application-specific)
2. Load on flow cell
3. Generate clusters
4. Sequence base-by-base
5. Run pipeline + QC

Next Generation Sequencing (NGS) experiments
● Plan the experiment!
Design
How to perform: single end/paired end, read length
Replicates
Data analysis
We encourage you to have a "kick-off" meeting.
Lab, High Throughput Sequencing Unit: Dr. Shirley Horn-Saban (Head of Genomic Technologies), Dr. Daniela Amann-Zalcenstein, Muriel Chemla
Bioinformatics (NGS analysis): Dr. Dena Leshkowitz, Dr. Ester Feldmesser, Dr. Gilgi Friedlander

Today:
● Illumina's pipeline
● How is the data organized?
● The format of the output files
● Quality control
● Viewing the data in a genome browser
Exercise

Illumina's pipeline
CASAVA (v1.8.2): Consensus Assessment of Sequence And Variation
Irit Orr

How is the data organized?
htsdata
  Run_ID (e.g. 23456_SN808_0051_BA0BC9ABXX)
    Unaligned
      Sample_name1, ...
        SampleSheet.csv
        fastq.gz files, divided into multiple files (4M reads in each file)
    Aligned
      Summary_Stats
        Barcode_Lane_Summary.htm
      Sample_name1, ...
        Export

Output files
The FASTQ format (standard text representation of short reads)
● A FASTQ text file normally uses four lines per sequence.
● Example:
@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
● Line 1 begins with a '@' character and is followed by a sequence identifier and an optional description (like a FASTA title line).
● Line 2 is the raw sequence letters.
● Line 3 begins with a '+' character (optionally followed by SEQ_ID).
● Line 4 encodes the quality values for the sequence in Line 2, and must contain the same number of symbols as letters in the sequence. The symbols encode Phred quality scores.
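The four-line layout described above can be read with a few lines of Python. This is only a minimal sketch: the names `FastqRecord` and `read_fastq` are made up for illustration, and it assumes the strict four-lines-per-read layout (no wrapped sequence or quality lines).

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    identifier: str  # text after the leading '@' on line 1
    sequence: str    # line 2: raw sequence letters
    quality: str     # line 4: one quality symbol per base


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a FASTQ stream that uses exactly four lines per read."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        if not header.strip():
            continue  # tolerate blank lines between records
        sequence = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        header = header.rstrip("\n")
        if not header.startswith("@"):
            raise ValueError(f"line 1 must start with '@': {header!r}")
        if not plus.startswith("+"):
            raise ValueError(f"line 3 must start with '+': {plus!r}")
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ")
        yield FastqRecord(header[1:], sequence, quality)


# The example record from the slide.
example = (
    "@SEQ_ID\n"
    "GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT\n"
    "+\n"
    "!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65\n"
)
records = list(read_fastq(io.StringIO(example)))
print(records[0].identifier, len(records[0].sequence))
```

For the gzipped files delivered by the pipeline, the same function should work on a handle opened with `gzip.open(path, "rt")`.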

Output files: FASTQ files (cont'd)
Quality scores
Each base has a quality score that expresses the probability that the base was called incorrectly.
Illumina's base scoring is similar to Phred scores, a way of expressing estimates of sequencing error probabilities.
The quality score is stored as an ASCII character: Q_phred = ASCII code - 33 = -10 log10(Pe)
Pe = error probability of a particular base call
Q20 = 1 error in 100 bases
Q30 = 1 error in 1000 bases
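As a rough illustration of the encoding above, the sketch below converts quality characters to Phred scores and back to error probabilities. The function names are hypothetical, and the offset of 33 is the one stated on the slide.

```python
import math
from typing import List


def phred_scores(quality: str, offset: int = 33) -> List[int]:
    """Decode a quality string: Q = ASCII code - offset."""
    return [ord(ch) - offset for ch in quality]


def error_probability(q: float) -> float:
    """Invert Q = -10 log10(Pe) to get Pe = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def phred_from_probability(pe: float) -> float:
    """Q = -10 log10(Pe)."""
    return -10 * math.log10(pe)


scores = phred_scores("!+5I")
print(scores)
print([error_probability(q) for q in (20, 30)])
```

Here `'!'` decodes to Q0 and `'I'` to Q40, matching the first and last rows of the lookup table on the next slide.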

Q_phred = -10 log10(Pe)

Char  ASCII  Q_phred (ASCII-33)  P(error)
!     33     0                   1
"     34     1                   0.7943282
#     35     2                   0.6309573
$     36     3                   0.5011872
%     37     4                   0.3981072
&     38     5                   0.3162278
'     39     6                   0.2511886
(     40     7                   0.1995262
)     41     8                   0.1584893
*     42     9                   0.1258925
+     43     10                  0.1
...
H     72     39                  0.0001259
I     73     40                  0.0001

Example:
@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAAT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))

Output files: FASTQ files (cont'd)
● Divided: each file contains 4M reads
● Zipped
● Contains only reads that passed the filter
Header format:
@instrument:run number:flowcell ID:lane:tile:xpos:ypos read*:is filtered:control number:index
(is filtered: N = passed filter; index: present if multiplexed)
Example:
@EAS139:136:FC706VJ:2:5:1000:12850 1:N:18:ATCACG
CNAGGCTGGAGTGCAATGGCACAATCTTGGCTCNTNNCANCCTTTGGCTC
+
@#1ADDDDHCF<D9EGBEE>FHAHBCGICHFBE#1##00#008?FHB>D#
* Read number: 1 for a single read or read 1 of a paired-end run; 2 for read 2 of a paired-end run.
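The header fields can be pulled apart by splitting on the space and the colons. The sketch below is an illustration only: `ReadHeader` and `parse_header` are hypothetical names, and the third field after the space (18 in the slide's example) is not labeled on the slide, so treating it as a control number is an assumption based on the usual CASAVA 1.8 header layout.

```python
from typing import NamedTuple


class ReadHeader(NamedTuple):
    instrument: str
    run_number: int
    flowcell_id: str
    lane: int
    tile: int
    x_pos: int
    y_pos: int
    read: int            # 1, or 2 for the second read of a pair
    is_filtered: bool    # True for 'Y'; 'N' means the read passed the filter
    control_number: int  # assumption: not labeled on the slide
    index: str           # barcode sequence, if multiplexed


def parse_header(line: str) -> ReadHeader:
    """Split '@inst:run:flowcell:lane:tile:x:y read:filtered:control:index'."""
    location, _, description = line.strip().lstrip("@").partition(" ")
    instrument, run, flowcell, lane, tile, x, y = location.split(":")
    read, filtered, control, index = description.split(":")
    return ReadHeader(
        instrument, int(run), flowcell, int(lane), int(tile), int(x), int(y),
        int(read), filtered == "Y", int(control), index,
    )


header = parse_header("@EAS139:136:FC706VJ:2:5:1000:12850 1:N:18:ATCACG")
print(header.flowcell_id, header.lane, header.index)
```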

Output files

Quality Control What is the quality of my data?

Quality Control
Two levels:
1. The qualities of the bases of the reads
2. If there is an available genome: What fraction of the reads align to the genome? What is the error rate?

Quality Control
When your data is ready, you receive an email with the location of the files and a link to some tables and plots. The link looks something like: http://dapsas.weizmann.ac.il/ngsreports/110922_SN808_0058_BB07HNABXX/

Open the link and go to the Summary tab. Columns to check:
● Average of the four intensities at the first cycle
● % intensity after 20 cycles: should be 50% or more
● % reads passed filter (chastity filter)
● % aligned to reference using ELAND (the read-mapper supplied by Illumina)
● % Error rate: the PhiX reads are mapped to the PhiX reference genome; the error rate is then estimated as the number of mismatches over the total number of bases of mapped PhiX reads. It should be below 1.5.

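Quality Control
Explore the quality of the data by looking at boxplots of various parameters.
Boxplot legend: the box spans the 25th to the 75th percentile, with the median marked inside; the whiskers extend to the min/max or to a maximum of 1.5 times the inter-quartile range; points beyond the whiskers are outliers.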
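To make the boxplot legend concrete, here is a small sketch that computes the same summary values for a list of numbers. The function name `boxplot_summary` is hypothetical, and the "inclusive" quartile method is just one common convention; the plotting software used for the run reports may compute quartiles slightly differently.

```python
import statistics
from typing import Dict, List, Sequence, Union


def boxplot_summary(values: Sequence[float]) -> Dict[str, Union[float, List[float]]]:
    """Quartiles, whisker ends (within 1.5 x IQR of the box), and outliers."""
    data = sorted(values)
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    inside = [v for v in data if low_fence <= v <= high_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": min(inside),    # lowest value not flagged as an outlier
        "whisker_high": max(inside),   # highest value not flagged as an outlier
        "outliers": [v for v in data if v < low_fence or v > high_fence],
    }


# Made-up quality scores at one read position; the value 2 is far below the rest.
summary = boxplot_summary([2, 30, 31, 32, 33, 34, 35, 36, 37])
print(summary)
```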

Quality Control: viewing plots
Check one lane at a time.
Q_phred = -10 log10(Pe)
Q34 => P(error) ~ 0.0004

Quality Control: viewing plots
Qualities drop gradually along the read.
Expected share of bases at Q30 or above (Q30 => P(error) = 0.001):
For reads with 50 bases => >90%
For reads with 100 bases => >75%
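Assuming the percentages above refer to the share of bases with a quality of Q30 or higher, that share can be estimated directly from the quality strings of a FASTQ file. The sketch below uses a hypothetical function name and the ASCII offset of 33 described earlier.

```python
from typing import Iterable


def fraction_at_or_above(qualities: Iterable[str], threshold: int = 30,
                         offset: int = 33) -> float:
    """Fraction of bases whose Phred score (ASCII code - offset) is >= threshold."""
    total = 0
    passing = 0
    for quality in qualities:
        for ch in quality:
            total += 1
            if ord(ch) - offset >= threshold:
                passing += 1
    return passing / total if total else 0.0


# 'I' is Q40 and '#' is Q2, so six of the eight bases are at Q30 or above.
share = fraction_at_or_above(["IIII", "II##"])
print(f"{share:.0%} of bases are >= Q30")
```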

Important: These plots are created during the run on the HiSeq. The qualities are calibrated later, during CASAVA.
● It is a good idea to look at the quality scores again after the calibration (with the FastQC tool) and then decide whether to filter the reads before the downstream analyses.
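If you do decide to filter, one simple option is to drop reads whose mean quality falls below a cutoff. The sketch below illustrates that idea only: the function names and the cutoff of 20 are arbitrary assumptions, not a recommendation from the workshop, and dedicated trimming and filtering tools offer more refined strategies.

```python
from typing import Iterable, Iterator, Tuple

Read = Tuple[str, str, str]  # (identifier, sequence, quality string)


def mean_quality(quality: str, offset: int = 33) -> float:
    """Average Phred score of one read (ASCII code - offset per base)."""
    return sum(ord(ch) - offset for ch in quality) / len(quality)


def filter_reads(reads: Iterable[Read],
                 min_mean_quality: float = 20.0) -> Iterator[Read]:
    """Yield only reads whose mean quality reaches the cutoff."""
    for identifier, sequence, quality in reads:
        if quality and mean_quality(quality) >= min_mean_quality:
            yield identifier, sequence, quality


# Made-up reads: 'I' is Q40, '#' is Q2.
reads = [
    ("good_read", "ACGT", "IIII"),
    ("poor_read", "ACGT", "####"),
]
kept = list(filter_reads(reads))
print([identifier for identifier, _, _ in kept])
```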

In case we have alignment, it is important to check the % of reads that were aligned

Quality Control: viewing plots Fastqc: a great tool for assessing the quality of the data http://www.bioinformatics.bbsrc.ac.uk/projects/fastqc/ Simon Andrews, Cambridge - UK

Quality Control: viewing plots
Good dataset (plot: quality vs. position in read, bp)
http://www.bioinformatics.bbsrc.ac.uk/projects/fastqc/good_sequence_short_fastqc/fastqc_report.html

Quality Control: viewing plots
Poor dataset (plot: quality vs. position in read, bp)
http://www.bioinformatics.bbsrc.ac.uk/projects/fastqc/bad_sequence_fastqc/fastqc_report.html

Genome Browser
After you make sure your data is of good quality: the analysis step (beyond the scope of this workshop).
During the year the bioinformatics unit will give various workshops for specific NGS applications: http://bip.weizmann.ac.il/ws/
During the downstream steps of the analysis we can use a genome browser to view the data and assess specific, local quality, depending on the application.
Examples:
In RNA-seq: investigating a newly identified transcript
In genomic DNA-seq: investigating a specific called SNP

Genome Browser
There are many available genome browsers. Among them:
● UCSC browser
● IGV (Integrative Genomics Viewer)
IGV: a desktop application for integrated visualization of multiple data types and annotations in the context of the genome
http://www.broadinstitute.org/software/igv
Developed by Jim Robinson, Broad Institute

Genomic Browser: IGV IGV provides a set of hosted genomes, but it is also possible to import other genomes
