Raw Illumina Next Generation Sequencing data files and Quality Control
Gilgi Friedlander
Bioinformatics Unit, Biological Services
Illumina Workflow
- Sample preparation (application-specific)
- Load on flow cell
- Generate clusters
- Sequence base-by-base
- Run pipeline + QC
Next Generation Sequencing (NGS) experiments
- Plan the experiment!
  - Design
  - How to perform
  - Single end/paired end
  - Read length
  - Replicates
  - Data analysis
We encourage you to have a "kick-off" meeting.
Lab - High Throughput Sequencing Unit:
- Dr. Shirley Horn-Saban - Head of Genomic Technologies
- Dr. Daniela Amann-Zalcenstein
- Muriel Chemla
Bioinformatics (NGS analysis):
- Dr. Dena Leshkowitz
- Dr. Ester Feldmesser
- Dr. Gilgi Friedlander
Today:
- Illumina’s pipeline
- How is the data organized?
- The format of the output files
- Quality control
- Viewing the data in a genomic browser
- Exercise
Illumina’s pipeline
CASAVA (v1.8.2): Consensus Assessment of Sequence And Variation
Irit Orr
How is the data organized?
htsdata/
  Run_ID/  (e.g. 23456_SN808_0051_BA0BC9ABXX)
    Unaligned/
      Sample_name1/
        SampleSheet.csv
        fastq.gz files, divided into multiple files (4M reads in each file)
      . . .
    Aligned/
      Sample_name1/
        Export files
      . . .
      Summary_Stats/
        Barcode_Lane_Summary.htm
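Because the reads of one sample are split across several gzipped FASTQ files, a quick sanity check is to count the reads in every chunk. The following is a minimal Python sketch, not part of Illumina's pipeline: the function name `count_reads` is made up for illustration, and it assumes the chunks sit directly inside the sample directory and end in `.fastq.gz`.

```python
import gzip
from pathlib import Path


def count_reads(sample_dir):
    """Return {file name: number of reads} for every *.fastq.gz chunk in sample_dir.

    Assumption (not from the slides): chunks live directly in sample_dir.
    """
    counts = {}
    for path in sorted(Path(sample_dir).glob("*.fastq.gz")):
        with gzip.open(path, "rt") as handle:
            n_lines = sum(1 for _ in handle)
        # A FASTQ record is exactly four lines long.
        counts[path.name] = n_lines // 4
    return counts
```

Calling `count_reads("Unaligned/Sample_name1")` (a hypothetical path) would be expected to report about 4M reads for every chunk except possibly the last one.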
The FastQ format (standard text representation of short reads)
A FASTQ text file normally uses four lines per sequence. Example:
@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
- Line 1 begins with a '@' character, followed by a sequence identifier and an optional description (like a FASTA title line).
- Line 2 is the raw sequence letters.
- Line 3 begins with a '+' character (optionally followed by SEQ_ID).
- Line 4 encodes the quality values for the sequence in Line 2 and must contain the same number of symbols as there are letters in the sequence. The letters encode Phred Quality Scores.
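The four-line layout described above can be read with a few lines of Python. This is a minimal sketch that assumes strictly four lines per record (no wrapped sequence lines); the function name `read_fastq` is illustrative and not taken from any library.

```python
import io


def read_fastq(handle):
    """Yield (identifier, sequence, quality) tuples from an open FASTQ text handle."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break  # end of file
        sequence = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        # Line 1 must start with '@' and line 3 with '+'.
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record starting at {header!r}")
        # Line 4 must have one quality symbol per base of line 2.
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield header[1:], sequence, quality


# The example record from the slide, held in memory instead of a file.
example = io.StringIO(
    "@SEQ_ID\n"
    "GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT\n"
    "+\n"
    "!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65\n"
)
records = list(read_fastq(example))
```

The same function works on a compressed file when the handle comes from `gzip.open(path, "rt")`.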
Output files: fastq files (cont'd)
Quality scores
- Each base has a quality score that measures the probability that the base was called incorrectly.
- Illumina's base scoring is similar to Phred scores, a way of expressing estimates of sequencing error probabilities.
- The quality score is stored as an ASCII character: Q phred = ASCII code - 33 = -10 log10( Pe )
- Pe = error probability of a particular base call
- Q20 = 1 error in 100 bases
- Q30 = 1 error in 1000 bases
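The relation Q phred = ASCII code - 33 = -10 log10( Pe ) can be checked directly. The sketch below is illustrative only (the function names are made up); it decodes a quality string into Phred scores and converts between scores and error probabilities.

```python
import math


def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string into a list of Phred scores."""
    return [ord(char) - offset for char in quality_string]


def error_probability(q):
    """Error probability for a Phred score: Pe = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


def phred_from_probability(pe):
    """Inverse direction: Q = -10 * log10(Pe)."""
    return -10 * math.log10(pe)


# Characters with ASCII codes 33, 43, 53 and 73.
scores = phred_scores("!+5I")
probabilities = [error_probability(q) for q in scores]
```

Here `scores` is `[0, 10, 20, 40]`, so the third character corresponds to Q20, one expected error in 100 bases.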
Output files
Char  ASCII  Q phred (ASCII-33)  P(error)
!     33     0                   1
"     34     1                   0.7943282
#     35     2                   0.6309573
$     36     3                   0.5011872
%     37     4                   0.3981072
&     38     5                   0.3162278
'     39     6                   0.2511886
(     40     7                   0.1995262
)     41     8                   0.1584893
*     42     9                   0.1258925
+     43     10                  0.1
. . .
H     72     39                  0.0001259
I     73     40                  0.0001
@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAAT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))
Q phred = -10 log10( Pe )
@EAS139:136:FC706VJ:2:5:1000:12850 1:N:18:ATCACG
CNAGGCTGGAGTGCAATGGCACAATCTTGGCTCNTNNCANCCTTTGGCTC
+
@#1ADDDDHCF<D9EGBEE>FHAHBCGICHFBE#1##00#008?FHB>D#
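Output files: fastq files (cont'd)
Fields of the header line:
@instrument:run number:flowcell ID:lane:tile:x-pos:y-pos read:is filtered:control number:index
- Read number: 1 for a single read or the first read of a pair, 2 for the second read of a paired-end run
- Is filtered: N means the read passed the filter
- Index: the index sequence (if multiplexed)
The fastq files are:
- Divided: each file contains 4M reads
- Zipped
- Restricted to reads that passed the filter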
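The header fields can be pulled apart by splitting on the space and the colons. This is a sketch under stated assumptions: the class and function names are invented for illustration, and the third field after the space (18 in the example) is taken to be the control number of the CASAVA 1.8 header convention, which the slide does not name.

```python
from typing import NamedTuple


class ReadHeader(NamedTuple):
    instrument: str
    run_number: int
    flowcell_id: str
    lane: int
    tile: int
    x_pos: int
    y_pos: int
    read: int
    is_filtered: bool     # True means the read FAILED the filter ("Y")
    control_number: int   # assumed field name, see the note above
    index: str


def parse_header(line):
    """Split a header line such as '@EAS139:136:FC706VJ:2:5:1000:12850 1:N:18:ATCACG'."""
    location, description = line.lstrip("@").split(" ", 1)
    instrument, run, flowcell, lane, tile, x, y = location.split(":")
    read, filtered, control, index = description.split(":")
    return ReadHeader(
        instrument, int(run), flowcell, int(lane), int(tile), int(x), int(y),
        int(read), filtered == "Y", int(control), index,
    )


header = parse_header("@EAS139:136:FC706VJ:2:5:1000:12850 1:N:18:ATCACG")
```

For the example read, `header.lane` is 2, `header.is_filtered` is `False` (the read passed the filter) and `header.index` is `"ATCACG"`.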
Quality Control
What is the quality of my data?
Quality Control
Two levels:
- 1. The qualities of the bases of the reads
- 2. If there is an available genome:
  - What fraction of the reads align to the genome?
  - What is the error rate?
Quality Control
When you get your data, you receive an email with the location of the files and a link to some tables and plots. The link looks something like: http://dapsas.weizmann.ac.il/ngsreports/110922_SN808_0058_BB07HNABXX/
Open the link and go to the Summary tab. It reports:
- % reads passed filter (chastity filter)
- % aligned to reference using ELAND (the read mapper supplied by Illumina)
- Average of the four intensities at the first cycle
- % intensity after 20 cycles: should be 50% or more
- % error rate: should be below 1.5%
The PhiX reads are mapped to the PhiX reference genome. The error rate is then estimated as the number of mismatches divided by the total number of bases in the mapped PhiX reads.
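The PhiX error-rate estimate is a simple ratio, shown here as a small Python sketch. The function names and the input numbers are made up for illustration; only the definition (mismatches over mapped bases) and the 1.5% limit come from the slide.

```python
def error_rate_percent(mismatches, mapped_bases):
    """Error rate in percent: mismatches / total bases of mapped PhiX reads."""
    if mapped_bases <= 0:
        raise ValueError("mapped_bases must be positive")
    return 100.0 * mismatches / mapped_bases


def passes_error_threshold(rate_percent, threshold=1.5):
    """True if the error rate is below the threshold (1.5% on the slide)."""
    return rate_percent < threshold


# Illustrative numbers only: 4,200 mismatches in 1,000,000 mapped PhiX bases.
rate = error_rate_percent(4_200, 1_000_000)
ok = passes_error_threshold(rate)
```

With these invented counts the rate is 0.42%, which is below the 1.5% limit.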
Quality Control
Explore the quality of the data by looking at boxplots of various parameters. Each boxplot shows:
- Median
- 25th and 75th percentiles
- Whiskers: min/max, or at most 1.5 times the interquartile range
- Outliers
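The elements of such a boxplot can be computed from a list of values, for example the quality scores observed at one read position. This is a minimal sketch using Python's standard `statistics` module; the function name and the sample values are illustrative, and the quartile method (`"inclusive"`) is an assumption, since plotting tools differ in how they interpolate quartiles.

```python
import statistics


def boxplot_summary(values):
    """Median, quartiles, whisker ends and outliers for at least two values."""
    data = sorted(values)
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    # Whiskers end at the most extreme data points that are still inside the fences.
    inside = [v for v in data if low_fence <= v <= high_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": inside[0],
        "whisker_high": inside[-1],
        "outliers": [v for v in data if v < low_fence or v > high_fence],
    }


# Made-up quality scores at one position, including one very low value.
summary = boxplot_summary([30, 31, 32, 33, 34, 35, 36, 37, 38, 2])
```

In this invented example the value 2 falls more than 1.5 interquartile ranges below the 25th percentile, so it is reported as an outlier rather than stretching the lower whisker.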
Quality Control: viewing plots
Check one lane at a time.
Q phred = -10 log10( Pe )
Q34 => P(error) ~ 0.0004
Quality Control: viewing plots
Qualities drop gradually along the read.
Fraction of bases with Q30 or higher (Q30 => P(error) = 0.001):
- For reads with 50 bases: >90%
- For reads with 100 bases: >75%
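If the percentages above are read as the share of bases with a quality of Q30 or higher, that share can be computed from the quality strings themselves. A minimal sketch, assuming Phred+33 encoding; the function name and the two sample quality strings are invented for illustration.

```python
def percent_bases_at_least(quality_strings, threshold=30, offset=33):
    """Percentage of bases whose Phred score is >= threshold."""
    total = 0
    passing = 0
    for quality in quality_strings:
        for char in quality:
            total += 1
            if ord(char) - offset >= threshold:
                passing += 1
    if total == 0:
        raise ValueError("no bases to evaluate")
    return 100.0 * passing / total


# 'I' is Q40, '?' is Q30, '5' is Q20 and '#' is Q2.
q30_share = percent_bases_at_least(["IIII?", "II##5"])
```

Seven of the ten example bases reach Q30, so `q30_share` is 70.0.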
Important: these plots are created during the run on the HiSeq. During CASAVA, the qualities are calibrated. It is a good idea to also look at the quality scores after the calibration (with the FastQC tool) and to decide whether to filter the reads prior to the downstream analyses.
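One simple way to filter reads before downstream analysis is to drop reads whose mean quality is low. The sketch below is only an illustration of that idea: the function names are made up, the cutoff of 20 is an arbitrary example rather than a recommendation from the slides, and Phred+33 encoding is assumed. Dedicated trimming and filtering tools offer more refined criteria.

```python
def mean_quality(quality, offset=33):
    """Arithmetic mean of the Phred scores in one quality string."""
    return sum(ord(char) - offset for char in quality) / len(quality)


def filter_reads(records, min_mean_quality=20):
    """Yield only the (identifier, sequence, quality) records that pass the cutoff."""
    for identifier, sequence, quality in records:
        # Empty quality strings are skipped to avoid dividing by zero.
        if quality and mean_quality(quality) >= min_mean_quality:
            yield identifier, sequence, quality


# Two invented reads: one with Q40 bases, one with Q2 bases.
reads = [("good", "ACGT", "IIII"), ("bad", "ACGT", "####")]
kept = [identifier for identifier, _, _ in filter_reads(reads)]
```

Only the read named `good` survives the cutoff in this example.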
If we have an alignment, it is important to check the % of reads that were aligned.
Quality Control: viewing plots
FastQC: a great tool for assessing the quality of the data
http://www.bioinformatics.bbsrc.ac.uk/projects/fastqc/
Simon Andrews, Cambridge, UK
Quality Control: viewing plots
Good dataset (plot of quality against position in read, in bp)
http://www.bioinformatics.bbsrc.ac.uk/projects/fastqc/good_sequence_short_fastqc/fastqc_report.html
Quality Control: viewing plots
Poor dataset (plot of quality against position in read, in bp)
http://www.bioinformatics.bbsrc.ac.uk/projects/fastqc/bad_sequence_fastqc/fastqc_report.html
After you make sure your data is of good quality: analysis step (beyond the scope of this workshop).
During the year, the Bioinformatics Unit will give various workshops for specific NGS applications: http://bip.weizmann.ac.il/ws/
During downstream steps of the analysis, we can use a genome browser to view the data and assess specific, local quality, depending on the application. Examples:
- In RNA-seq: investigating a newly identified transcript
- In genomic DNA-seq: investigating a specific called SNP
Genome Browser
There are many available genomic browsers. Among them:
- UCSC browser
- IGV (Integrative Genomics Viewer)
IGV: a desktop application for integrated visualization of multiple data types and annotations in the context of the genome
http://www.broadinstitute.org/software/igv
Developed by Jim Robinson, Broad Institute
Genomic Browser: IGV
IGV provides a set of hosted genomes, but it is also possible to import other genomes