SLIDE 1 DepthOfCoverage
Genetics for Dummies 2017 NGS II – Illumina Sequencing
Robert Kraaij Department of Internal Medicine r.kraaij@erasmusmc.nl
SLIDE 2
- Data Analysis
- Applications
- Example: Exome Sequencing
Overview
SLIDE 3
Things to be addressed
- NGS: many short reads that might contain errors
- Data analysis will handle these reads and errors
SLIDE 4
- Data Analysis
- Applications
- Example: Exome Sequencing
Overview
SLIDE 5
[Figure labels: cBot, flowcell, bridge PCR, HiSeq 2000]
Illumina Sequencing
SLIDE 6
Per Cycle Imaging
SLIDE 7
G A T C
Per Cycle Imaging
SLIDE 8
G good quality G poor quality
Per Cycle Base Calling
SLIDE 9
Phred Score   Incorrect base   Accuracy
10            1 in 10          90 %
20            1 in 100         99 %
30            1 in 1000        99.9 %
40            1 in 10000       99.99 %
50            1 in 100000      99.999 %
Phred scores 0 to 93 are stored as ASCII characters 33 to 126, one character per score
Quality Scoring
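The score-to-accuracy values above follow from the Phred definition Q = -10 * log10(P), where P is the probability that a base call is wrong. A minimal Python sketch of that conversion and of the ASCII encoding with offset 33; the function names are illustrative and not part of any tool named in these slides:

```python
import math

PHRED_OFFSET = 33  # ASCII offset: score 0 is stored as chr(33), score 93 as chr(126)

def phred_to_error_prob(q: int) -> float:
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p: float) -> int:
    """Phred score for error probability p, rounded to the nearest integer."""
    return round(-10 * math.log10(p))

def phred_to_char(q: int) -> str:
    """Single ASCII character that encodes score q."""
    return chr(q + PHRED_OFFSET)

def char_to_phred(c: str) -> int:
    """Score encoded by a single quality character."""
    return ord(c) - PHRED_OFFSET

# Reproduce the rows of the quality table.
for q in (10, 20, 30, 40, 50):
    p = phred_to_error_prob(q)
    print(f"Q{q}: 1 in {round(1 / p)} incorrect, accuracy {100 * (1 - p):.3f} %")
```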
SLIDE 10
@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTC
+SEQ_ID
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>
FASTQ File
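A FASTQ record has four parts: a header starting with @, the sequence, a separator starting with +, and one quality character per base. As a rough illustration, the Python sketch below reads the record from this slide and decodes its quality string. It assumes exactly four lines per record and the offset-33 encoding; `FastqRecord` and `parse_fastq` are hypothetical names.

```python
from typing import Iterator, List, NamedTuple

class FastqRecord(NamedTuple):
    name: str
    sequence: str
    qualities: List[int]

def parse_fastq(lines: List[str]) -> Iterator[FastqRecord]:
    """Yield records from FASTQ text, assuming exactly four lines per record."""
    for i in range(0, len(lines) - 3, 4):
        header, seq, plus, qual = (line.rstrip("\n") for line in lines[i:i + 4])
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record starting at line {i + 1}")
        if len(seq) != len(qual):
            raise ValueError("sequence and quality string differ in length")
        # Each quality character encodes one Phred score with ASCII offset 33.
        yield FastqRecord(header[1:], seq, [ord(c) - 33 for c in qual])

example = [
    "@SEQ_ID",
    "GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTC",
    "+SEQ_ID",
    "!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>",
]
record = next(parse_fastq(example))
print(record.name, len(record.sequence), min(record.qualities), max(record.qualities))
```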
SLIDE 11 Alignment or Mapping of Reads
[Figure: short reads aligned to the reference genome (hg19)]
chromosome + position + strand
sample.bam
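To make "chromosome + position + strand" concrete, here is a toy Python sketch that places a read by exact string matching, trying the read itself (+ strand) and then its reverse complement (- strand). This is not how a real aligner works: tools such as BWA use an index and tolerate mismatches. The reference string and the name `chrToy` are made up for the example.

```python
from typing import Optional, Tuple

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq: str) -> str:
    """Sequence of the opposite strand, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]

def map_read(read: str, chrom: str, reference: str) -> Optional[Tuple[str, int, str]]:
    """Exact-match lookup; returns (chromosome, 1-based position, strand) or None."""
    pos = reference.find(read)
    if pos != -1:
        return chrom, pos + 1, "+"
    pos = reference.find(reverse_complement(read))
    if pos != -1:
        return chrom, pos + 1, "-"
    return None

reference = "GATTACGGTACTTGCATAGCT"  # hypothetical mini reference
print(map_read("TACGGTACTTGC", "chrToy", reference))
print(map_read(reverse_complement("CTTGCATAG"), "chrToy", reference))
```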
SLIDE 12 Run QC and filtering
sample.bam
SLIDE 13 sample.bam
- both reads
- quality scores
- chromosome
- position
- quality flag
- duplicate flag
- off target flag
sortedBAM file
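Several of these per-read properties are stored as bits of a single integer, the FLAG field of each BAM/SAM record. The Python sketch below decodes a few of those bits, using the values defined in the SAM specification. It is deliberately incomplete: an off-target status is not one of the standard FLAG bits as far as this sketch assumes, so it is left out.

```python
from typing import List

# Selected bits of the SAM/BAM FLAG field (values from the SAM specification).
FLAG_BITS = {
    0x1: "paired",
    0x4: "unmapped",
    0x10: "reverse strand",
    0x40: "first in pair",
    0x80: "second in pair",
    0x200: "failed quality checks",
    0x400: "duplicate",
}

def decode_flag(flag: int) -> List[str]:
    """Names of the known bits that are set in flag."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]

# 1105 = 0x400 + 0x40 + 0x10 + 0x1
print(decode_flag(1105))
```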
SLIDE 14
Coverage
[Figure: overlapping reads stacked over the reference sequence]
5x coverage
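Coverage at a base is simply the number of reads that overlap it. A small Python sketch, with made-up read intervals (0-based start, exclusive end) chosen so that one position reaches 5x:

```python
from typing import List, Tuple

def per_base_depth(reads: List[Tuple[int, int]], target_length: int) -> List[int]:
    """Count, for every target position, how many reads overlap it."""
    depth = [0] * target_length
    for start, end in reads:
        # Clip each read to the target before counting.
        for pos in range(max(start, 0), min(end, target_length)):
            depth[pos] += 1
    return depth

reads = [(0, 6), (2, 8), (3, 9), (4, 10), (5, 10)]  # hypothetical alignments
depth = per_base_depth(reads, 10)
print(depth)
```

Real tools read these intervals from the BAM file and also take duplicate and quality flags into account, which this sketch ignores.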
SLIDE 15
Mean Coverage
mean coverage = bases on target / size of target
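As a worked example of this ratio in Python: the input numbers are invented, picked so that the result matches the ~54x and 44.1 Mb figures that appear later in the deck.

```python
def mean_coverage(bases_on_target: int, target_size: int) -> float:
    """Mean depth: sequenced bases that landed on target, divided by target size."""
    if target_size <= 0:
        raise ValueError("target_size must be positive")
    return bases_on_target / target_size

# Hypothetical sample: 2,381,400,000 on-target bases over a 44.1 Mb capture target.
print(mean_coverage(2_381_400_000, 44_100_000))
```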
SLIDE 16
% of Bases Above a Certain Threshold
[Figure: overlapping reads stacked over the reference, with the depth marked per region]
5x 5x 4x 1x
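This metric can be sketched in Python as the share of target positions whose depth reaches a threshold. The sketch assumes "above" means "at or above" (as in "10x or better") and treats the four depth values from the slide as four positions:

```python
from typing import List

def percent_bases_at_or_above(depths: List[int], threshold: int) -> float:
    """Percentage of positions with depth >= threshold."""
    if not depths:
        raise ValueError("depths must not be empty")
    covered = sum(1 for d in depths if d >= threshold)
    return 100.0 * covered / len(depths)

depths = [5, 5, 4, 1]  # depth values shown on the slide
print(percent_bases_at_or_above(depths, 4))
```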
SLIDE 17
Variant Calling
[Figure: pileup of aligned reads against the reference; the reads carry G at the variant position]
G = homozygous alternative
SLIDE 18
Variant Calling
[Figure: pileup of aligned reads against the reference; some reads carry A and others G at the variant position]
A/G = heterozygous
SLIDE 19
Variant Calling
[Figure: pileup with fewer reads covering the variant position]
A/G = heterozygous?
SLIDE 20 Variant Calling
[Figure: pileup with few reads; the G base calls are marked by sequencing quality]
G
sequencing quality: good / poor
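The last four slides can be summed up in a deliberately naive Python sketch: drop poor-quality base calls, refuse to call when too few reads remain, then decide from the fraction of reads showing the alternative base. All thresholds here are arbitrary assumptions for illustration. Production callers such as HaplotypeCaller use statistical models rather than fixed cut-offs.

```python
from collections import Counter
from typing import List, Tuple

def call_genotype(pileup: List[Tuple[str, int]], ref: str,
                  min_quality: int = 20, min_depth: int = 8,
                  het_low: float = 0.2, het_high: float = 0.8) -> str:
    """pileup holds (base, Phred quality) pairs observed at one position."""
    good = [base for base, q in pileup if q >= min_quality]
    if len(good) < min_depth:
        return "no call"  # too few reliable reads to decide
    alt_counts = {b: n for b, n in Counter(good).items() if b != ref}
    if not alt_counts:
        return f"{ref}/{ref}"
    alt, n_alt = max(alt_counts.items(), key=lambda item: item[1])
    fraction = n_alt / len(good)
    if fraction < het_low:
        return f"{ref}/{ref}"   # stray alternative reads, treated as errors
    if fraction > het_high:
        return f"{alt}/{alt}"   # homozygous alternative
    return f"{ref}/{alt}"       # heterozygous

print(call_genotype([("G", 35)] * 10, "A"))
print(call_genotype([("A", 35)] * 5 + [("G", 35)] * 5, "A"))
print(call_genotype([("A", 35)] * 2 + [("G", 35)] * 2, "A"))
print(call_genotype([("A", 35)] * 9 + [("G", 5)] * 3, "A"))
```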
SLIDE 21 sample.vcf
- chromosome
- position
- quality
- annotations
VCF File
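In a VCF file each variant is one tab-separated line whose first eight columns are CHROM, POS, ID, REF, ALT, QUAL, FILTER and INFO; the annotations sit in INFO as key=value pairs separated by semicolons. A minimal Python sketch for one such line follows. The example record is invented, and the sketch assumes a single ALT allele.

```python
from typing import Dict, NamedTuple, Optional

class Variant(NamedTuple):
    chrom: str
    pos: int
    ref: str
    alt: str
    qual: Optional[float]
    info: Dict[str, str]

def parse_vcf_line(line: str) -> Variant:
    """Parse the eight fixed columns of one VCF data line."""
    chrom, pos, _id, ref, alt, qual, _filter, info = line.rstrip("\n").split("\t")[:8]
    annotations: Dict[str, str] = {}
    if info != ".":
        for item in info.split(";"):
            # Flag-style entries without "=" get an empty string as value.
            key, _, value = item.partition("=")
            annotations[key] = value
    return Variant(chrom, int(pos), ref, alt,
                   None if qual == "." else float(qual), annotations)

line = "chr1\t12345\t.\tA\tG\t50.0\tPASS\tDP=20;AF=0.5"  # hypothetical record
variant = parse_vcf_line(line)
print(variant)
```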
SLIDE 22
Variant Calling
[Figure: pileup of aligned reads against the reference; some reads show a gap (-) relative to the reference]
deletion = heterozygous
SLIDE 23
Paired-End Sequencing
2 x 100 bp
SLIDE 24 Variant Calling: Mate Pairs
- normal: 400 bp
- deletion: 800 bp
- insertion: 200 bp
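The logic of this slide is that the two reads of a pair should map about 400 bp apart on the reference. If they map further apart, the sample is missing sequence that the reference has (deletion). If they map closer together, the sample carries extra sequence (insertion). A Python sketch under that reading; the tolerance of 100 bp is an arbitrary assumption:

```python
def classify_pair(mapped_distance: int, expected: int = 400, tolerance: int = 100) -> str:
    """Classify a read pair by the distance between its mates on the reference."""
    if mapped_distance > expected + tolerance:
        return "deletion"
    if mapped_distance < expected - tolerance:
        return "insertion"
    return "normal"

for distance in (400, 800, 200):
    print(distance, classify_pair(distance))
```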
SLIDE 25 Variant Calling: Mate Pairs
normal
400 bp
translocation
SLIDE 26 Variant Calling: Split Reads
genome
800 bp
mRNA (cDNA)
SLIDE 27
- Data Analysis
- Applications
- Example: Exome Sequencing
Overview
SLIDE 28 Applications
- Re-sequencing, full genome: SNPs and indels
- Re-sequencing, mate pairs: structural variations
- Re-sequencing, regional: SNPs and indels
- Sequencing: de novo assembly
- RNAseq
- ChIPseq
- …seq
SLIDE 29 www.illumina.com
SLIDE 30
Example: Exome Sequencing
SLIDE 31
- Funding by NGI-NCHA, NWO, BBMRI
- n > 3,000 samples, a random set from RS-I
- Start May 2011; Nimblegen
- Part of the “CHARGE-S” effort: >5,000 exomes across 4 cohorts (Framingham, CHS, ARIC, Rotterdam Study)
- Expand with exome variants array?
CHARGE
Exome Sequencing
SLIDE 32
Exome vs Full Genome
[Figure: exons scattered along the genome]
- genome: 3 Gb
- exome: ~30 Mb
SLIDE 33 Exome Sequencing Workflow
- DNA isolation
- Library preparation
- Exome capture
- Sequencing
- Data analysis
SLIDE 34
Exome capture
SLIDE 35 Nimblegen SeqCap EZ v2 Capture
- CCDS (Sept 2009)
- miRBase (v14, Sept 2009)
- RefSeq (Jan 2010)
- 2,100,000 probes
- 30,246 coding genes
- 329,028 exons
- 710 miRNAs
- 36.5 Mb primary target
- 44.1 Mb capture target
SLIDE 36
Illumina TruSeq V3 2x100 PE Sequencing
SLIDE 37 Data analysis: BWA-GATK pipeline
- Demultiplexing: CASAVA
- Alignment: BWA
- Processing: MarkDuplicates (picard); Recalibration, IndelRealignment (GATK)
- Variant-Calling: HaplotypeCaller, VQSR, VariantEval
- Analysis: VCFtools, R
SLIDE 38
Sample QC and Variant QC
SLIDE 39 RSX-2 Samples were sequenced to ~54x Mean Coverage
[Plot axes: Average Mean Depth of Coverage across the 44 Mb SeqCap Exome; Percentage of 44 Mb covered 10x or better]
SLIDE 40 Mean Depth of Coverage by Flowcell
[Plot axes: Mean Depth of Coverage; Flowcell Number (Roughly Chronological Order)]
SLIDE 41 Freemix Values by Flowcell
[Plot axes: Estimated Freemix Values; Flowcell Number (Roughly Chronological Order)]
SLIDE 42 Determining Heterozygous Concordance versus 550k genotyping arrays
[Plot axes: Heterozygous Concordance; Flowcell Number (Roughly Chronological Order)]
SLIDE 43 Comparing Concordance versus Freemix reveals cutoff around 13% correction
[Plot axes: Heterozygous Concordance; Estimated Freemix Values]
SLIDE 44
Sample QC and Variant QC
SLIDE 45 Number of Detected SNPs per Samples by Flowcell
Flowcell Number (Roughly Chronological Order)
SLIDE 46 Heterozygous to Homozygous ratio per Sample by Flowcell
Flowcell Number (Roughly Chronological Order)
SLIDE 47
Transition to Transversion Ratio
[Figure: purines (A, G) and pyrimidines (C, T); a transition swaps bases within one class, a transversion swaps between the classes]
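The ratio itself is easy to compute once each SNP is classified: A<->G and C<->T are transitions, every other substitution is a transversion. A Python sketch on a made-up list of (reference, alternative) base pairs:

```python
from typing import List, Tuple

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def is_transition(ref: str, alt: str) -> bool:
    """True when both bases are purines or both are pyrimidines."""
    pair = {ref, alt}
    return pair <= PURINES or pair <= PYRIMIDINES

def ti_tv_ratio(snps: List[Tuple[str, str]]) -> float:
    """Number of transitions divided by number of transversions."""
    if any(ref == alt for ref, alt in snps):
        raise ValueError("every SNP needs different ref and alt bases")
    transitions = sum(1 for ref, alt in snps if is_transition(ref, alt))
    transversions = len(snps) - transitions
    if transversions == 0:
        raise ValueError("no transversions, ratio is undefined")
    return transitions / transversions

snps = [("A", "G"), ("C", "T"), ("G", "A"), ("A", "C"), ("T", "C"), ("G", "T")]
print(ti_tv_ratio(snps))
```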
SLIDE 48 Transition to Transversion Ratio per Sample by Flowcell
Flowcell Number (Roughly Chronological Order)
SLIDE 49
QC and filtering results
SLIDE 50
SLIDE 51 Things to Remember
- NGS: many short reads that might contain errors
- Coverage: the number of independent reads that cover a base, needed to analyse a genome
- FASTQ file: sequence + quality scores
- BAM file: aligned reads
- VCF file: called variants + annotation