Next Generation Sequencing: The basics
Wilfred van IJcken, Erasmus MC Center for Biomics
Biomedical Research Techniques (XVIth ed.), Nov 6
Learning objectives
Next generation sequencing (NGS): The basics
- Background
- Illumina sequencing technology
- Terminology
Next presentation
- Research applications
- Diagnostic applications
- Future directions
What is next generation sequencing?
- Sequencing technology developed after Sanger
- Millions of reads in parallel (MPS)
- Shorter (<400 bp) sequencing reads than Sanger sequencing
- Enables analysis of complex mixtures of DNA or RNA
- Enables a genome-wide approach
- Different vendors with different approaches
- MPS = massive parallel sequencing
NGS systems on the market
[Figure: systems grouped as high throughput, special and desktop]
Different characteristics
- Sequencing technology
- Read length
- Speed
- Output
- Applications
- Run cost
Illumina systems
[Figure: MiSeq, HiSeq 2500, NextSeq 500, HiSeq X Ten, HiSeq 4000, MiniSeq and NovaSeq 6000 compared by data amount, purchase cost and run costs; output labels: 8 Gb and 6 Tb per run]
NGS flow
- Intake: ID, amount, sex, disease
- Isolate: DNA or RNA; blood, plasma, saliva, FFPE, cells
- Library: select region of interest; PCR, capture
- Sequence: chemistry, enzymes, detection, signal
- Report: yield, quality, variation; match phenotype?
DNA library prep
Sequencing by Synthesis: cluster generation
[Figure labels: lane, flowcell]
Bridge amplification
Sequencing
[Figure label: incorporated]
Sequencing and basecalling: Read 1
[Figure: image acquisition (images numbered 1 to 9) followed by base calling; example read C A A G T A A C …; base legend A, T, G, C]
Single-end, paired-end, index read
[Figure: single read, paired end read and index read; example index GATCG]
- Single read = sequence from one side of the fragment
- Paired end = sequence from both sides of the fragment
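The two definitions can be illustrated with a minimal Python sketch. The fragment sequence and the read length are made up for illustration. Reporting read 2 as the reverse complement of the far end of the fragment is an assumption that reflects how paired-end reads are commonly delivered; it is not stated on the slide.

```python
# Translation table for complementing DNA bases.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def single_read(fragment: str, read_length: int) -> str:
    """Single read: sequence from one side of the fragment only."""
    return fragment[:read_length]


def paired_end_reads(fragment: str, read_length: int) -> tuple[str, str]:
    """Paired end: one read from each side of the fragment.

    Assumption: read 2 comes from the opposite strand, so it is the
    reverse complement of the last `read_length` bases.
    """
    read1 = fragment[:read_length]
    read2 = reverse_complement(fragment[-read_length:])
    return read1, read2


# Hypothetical 20 bp fragment, sequenced with 8 cycles per read.
fragment = "ACGTTGCAAGGCTTAACCGG"
read1, read2 = paired_end_reads(fragment, 8)
print("single read:", single_read(fragment, 8))
print("paired end :", read1, read2)
```

Both reads of a pair come from the same fragment, so the unsequenced middle part is bounded on each side by known sequence.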
Indexing enables sample multiplexing
- Index = a different nucleic acid code per sample
- Introduced during sample prep
- Read during the index read
- Enables multiple samples in one flowcell lane
[Figure: Patient 1 to Patient 4, each tagged with its own index (GATCG, CGTGA, ATCGG, TCTCT)]
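Sorting the reads of one lane back to their samples by index (demultiplexing) can be sketched as follows. This is a minimal illustration with exact index matching only. The pairing of each index with a patient, the example reads and the `demultiplex` helper are assumptions made for the example; production tools typically also tolerate sequencing errors in the index.

```python
from collections import defaultdict

# Hypothetical sample sheet: which index was added to which sample.
SAMPLE_SHEET = {
    "GATCG": "Patient 1",
    "CGTGA": "Patient 2",
    "ATCGG": "Patient 3",
    "TCTCT": "Patient 4",
}


def demultiplex(records, sample_sheet):
    """Group sequence reads by sample using the index read.

    `records` is an iterable of (index_read, sequence_read) pairs.
    Reads whose index is not in the sample sheet go to "undetermined".
    """
    bins = defaultdict(list)
    for index_read, sequence in records:
        sample = sample_sheet.get(index_read, "undetermined")
        bins[sample].append(sequence)
    return dict(bins)


# Made-up reads from one flowcell lane.
lane_records = [
    ("GATCG", "CAAGTAAC"),
    ("TCTCT", "TGCTACGA"),
    ("GATCG", "ATGCATGC"),
    ("NNNNN", "GGGGGGGG"),  # index could not be read
]

result = demultiplex(lane_records, SAMPLE_SHEET)
for sample, reads in sorted(result.items()):
    print(sample, reads)
```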
Sequence Index 1
Sequence Index 2
Sequence Read 2
[Figure: image acquisition (images numbered 1 to 9) and base calling, as for Read 1; example read C A A G T A A C …]
Summary sequencing technology
[Figure labels: Read 1, Index 1, Read 2, Index 2]
Simplified RNA sample preparation
[Figure labels: RNA, reverse transcriptase, DNA, Adaptor 1, Adaptor 2]
Output file from basecalling
- Many file types: qseq, fastq, etc…
- Each system has its own format.
- Large file sizes: >400 million reads per lane
[Figure: example record annotated with instrument, run ID, lane, tile, X-coord, Y-coord, index #, read #, sequence (C A A G T A A C …), Q-score (ASCII character) and PF (0,1)]
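The field labels on this slide can be turned into a small parser. The sketch below assumes the qseq layout as it is commonly described: 11 tab-separated fields in the order used here, with one quality character per base encoded as Q-score + 64. Both the field order and the offset are assumptions, as are the `QseqRecord` and `parse_qseq_line` names and the example line, which is invented.

```python
from typing import NamedTuple


class QseqRecord(NamedTuple):
    instrument: str
    run_id: str
    lane: int
    tile: int
    x: int
    y: int
    index: str
    read_number: int
    sequence: str
    qualities: list
    passed_filter: bool


def parse_qseq_line(line: str, quality_offset: int = 64) -> QseqRecord:
    """Parse one tab-separated qseq line (assumed 11-field layout)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 11:
        raise ValueError(f"expected 11 fields, got {len(fields)}")
    (instrument, run_id, lane, tile, x, y,
     index, read_number, sequence, quality, pf) = fields
    return QseqRecord(
        instrument=instrument,
        run_id=run_id,
        lane=int(lane),
        tile=int(tile),
        x=int(x),
        y=int(y),
        index=index,
        read_number=int(read_number),
        sequence=sequence,
        # One ASCII character per base; subtract the offset to get Q.
        qualities=[ord(ch) - quality_offset for ch in quality],
        passed_filter=(pf == "1"),
    )


# Invented example record (instrument name and coordinates are made up).
example = "\t".join(
    ["HWI-EXAMPLE", "1", "3", "12", "1045", "2210",
     "GATCG", "1", "CAAGTAAC", "hhhhgfeB", "1"]
)
record = parse_qseq_line(example)
print(record.sequence, record.qualities, record.passed_filter)
```

FASTQ files typically use a different offset (33), which is why the offset is a parameter here rather than a constant.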
Data analysis not trivial due to data volumes and complexity
| Run | Data | Total volume | Final volume | Comment |
|---|---|---|---|---|
| HiSeq 2000, 200G run | Image Data | 32 TB | | |
| | Intensity Data | 2 TB | | Optionally transferred |
| | Base Call / Quality Score Data | 0.25 TB | 0.25 TB | 1 byte/base (raw), assuming qseq generation offline |
| | Alignment Output | 6 TB (3 TB) | 1.2 TB | Remove intermediate files |
| GAIIx, 50G run | Image Data | 6.9 TB | | Optionally transferred |
| | Intensity Data | 0.93 TB | 0.93 TB | |
| | Base Call / Quality Score Data | 0.17 TB | 0.17 TB | |
| | Alignment Output | 1.2 TB | 1.2 TB | |

150 M reads x 8 lanes x 100 bp x 2 (paired end) = 240 Gbp
- Storage and compute needed
- Core facilities
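The yield calculation can be reproduced in a few lines of Python. The `run_yield_bases` helper is a hypothetical name. The storage estimate uses the 1 byte/base figure from the table and assumes decimal units (1 TB = 10^12 bytes), which is an assumption of this sketch.

```python
def run_yield_bases(reads_per_lane, lanes, read_length_bp, reads_per_fragment):
    """Total number of bases produced by one run."""
    return reads_per_lane * lanes * read_length_bp * reads_per_fragment


# Numbers from the slide: 150 M reads, 8 lanes, 100 bp, paired end.
total_bases = run_yield_bases(150_000_000, 8, 100, 2)
total_gbp = total_bases / 1e9

# Raw base call storage at 1 byte per base, in decimal terabytes.
raw_basecall_tb = total_bases * 1 / 1e12

print(f"{total_gbp:.0f} Gbp, about {raw_basecall_tb:.2f} TB of raw base calls")
# prints: 240 Gbp, about 0.24 TB of raw base calls
```

The 0.24 TB result is in line with the 0.25 TB listed for base call and quality score data of the HiSeq 2000 run.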
Terminology
- Next generation sequencing, AKA:
- - Deep sequencing
- - MPS = massive parallel sequencing
[Figure: one cluster imaged in images 1 to 9; example read T G C T A C G A T …]
- Read
- Cluster
- Number of sequencing cycles = read length
Alignment, Mapping
AAAACGCGCTTAGCCTTTTTTCGACTGTCGAGTGGAACGCCGCTAGCTAGGCGC
AAAACGCGCTTAGCCTTTTTTCGACTGTCGAGTGGATCGCCGCTAGCTAGGCGC
          TAGCCTTTTTTCGACTGTCGAGTGGATCGCCG
           AGCCTTTTTTCGACTGTCGAGTGGATCGCCGC
            GCCTTTGTTCGACTGTCGAGTGGATCGCCGCT
             CCTTTGTTCGACTGTCGAGTGGATCGCCGCTA
Figure labels: consensus sequence, reference sequence, heterozygous SNP, mismatch
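The idea behind alignment and variant detection can be sketched with the sequences from this slide. The sketch treats the first full-length sequence (the one that differs from all reads at one base) as the reference; that reading of the figure is an assumption. Placement is a naive ungapped search for the position with the fewest mismatches, which is only an illustration and not how production aligners work.

```python
from collections import Counter

# Assumed to be the reference sequence in the figure.
REFERENCE = "AAAACGCGCTTAGCCTTTTTTCGACTGTCGAGTGGAACGCCGCTAGCTAGGCGC"
READS = [
    "TAGCCTTTTTTCGACTGTCGAGTGGATCGCCG",
    "AGCCTTTTTTCGACTGTCGAGTGGATCGCCGC",
    "GCCTTTGTTCGACTGTCGAGTGGATCGCCGCT",
    "CCTTTGTTCGACTGTCGAGTGGATCGCCGCTA",
]


def best_offset(read, reference):
    """Return (offset, mismatches) of the ungapped placement with the
    fewest mismatches against the reference."""
    best = None
    for offset in range(len(reference) - len(read) + 1):
        window = reference[offset:offset + len(read)]
        mismatches = sum(a != b for a, b in zip(read, window))
        if best is None or mismatches < best[1]:
            best = (offset, mismatches)
    return best


def pileup(reads, reference):
    """Count the bases observed at every reference position."""
    columns = [Counter() for _ in reference]
    for read in reads:
        offset, _ = best_offset(read, reference)
        for i, base in enumerate(read):
            columns[offset + i][base] += 1
    return columns


# Report covered positions where the reads do not all agree with the reference.
variants = [
    (pos, ref_base, dict(counts))
    for pos, (ref_base, counts) in enumerate(zip(REFERENCE, pileup(READS, REFERENCE)))
    if counts and set(counts) != {ref_base}
]
for pos, ref_base, counts in variants:
    print(f"position {pos}: reference {ref_base}, reads {counts}")
```

Two positions are reported. At one, half of the reads carry the reference base and half carry another base, the pattern expected for a heterozygous SNP. At the other, every read disagrees with the reference.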
Read depth
- The read depth at a single position can differ a lot from the average read depth!
- Also known as depth of coverage
AAAACGCGCTTAGCCTTTTTTCGACTGTCGAGTGGATCGCCGCTAGCTAGGCGC
          TAGCCTTTTTTCGACTGTCGAGTGGATCGCCG
           AGCCTTTTTTCGACTGTCGAGTGGATCGCCGC
            GCCTTTGTTCGACTGTCGAGTGGATCGCCGCT
             CCTTTGTTCGACTGTCGAGTGGATCGCCGCTA
                      GACTGTCGAGTGGATCGCCGCTAGCTAGG
                        CTGTCGAGTGGATCGCCGCTAGCTAGG
Depth labels in the figure: 5, 7, 1
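Read depth per position, and how far it can sit from the average, can be computed from the sequences on this slide. The sketch counts all seven sequences, including the full-length top one; under that assumption the depths 5, 7 and 1 shown in the figure all occur. Each sequence is placed by a naive ungapped search for the start with the fewest mismatches, which is an illustration only.

```python
SEQUENCES = [
    "AAAACGCGCTTAGCCTTTTTTCGACTGTCGAGTGGATCGCCGCTAGCTAGGCGC",
    "TAGCCTTTTTTCGACTGTCGAGTGGATCGCCG",
    "AGCCTTTTTTCGACTGTCGAGTGGATCGCCGC",
    "GCCTTTGTTCGACTGTCGAGTGGATCGCCGCT",
    "CCTTTGTTCGACTGTCGAGTGGATCGCCGCTA",
    "GACTGTCGAGTGGATCGCCGCTAGCTAGG",
    "CTGTCGAGTGGATCGCCGCTAGCTAGG",
]
# Assumption: the full-length top sequence defines the coordinates.
TEMPLATE = SEQUENCES[0]


def place(read, template):
    """Ungapped start position with the fewest mismatches."""
    def mismatches(offset):
        return sum(a != b for a, b in zip(read, template[offset:]))
    return min(range(len(template) - len(read) + 1), key=mismatches)


def depth_per_position(sequences, template):
    """Number of sequences covering each template position."""
    depth = [0] * len(template)
    for seq in sequences:
        start = place(seq, template)
        for pos in range(start, start + len(seq)):
            depth[pos] += 1
    return depth


depth = depth_per_position(SEQUENCES, TEMPLATE)
average_depth = sum(depth) / len(depth)
print(f"average {average_depth:.1f}, min {min(depth)}, max {max(depth)}")
print(depth)
```

The average hides the spread: the middle of the region is covered by every sequence, while the ends are covered by only one.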
Accuracy, error rate, quality score
- Single base error rate = total number of mismatched bases found in mapped sequence reads from a sequencing run, divided by the mappable yield.
- Quality scores (Q scores / phred scores)
- - derived from an examination of the intensity peaks around each base
- - range from 0 to 41; a higher score corresponds to higher quality
- - Q = -10 x log10(p), where p is the base call error probability
| Quality score | Probability of incorrect base call | Base call accuracy |
|---|---|---|
| 10 (Q10) | 1 in 10 | 90% |
| 20 (Q20) | 1 in 100 | 99% |
| 30 (Q30) | 1 in 1000 | 99.9% |
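The relation between Q-score and error probability, and the single base error rate definition above, can be checked with a short Python sketch. The function names are hypothetical, and the mismatch and yield counts in the last example are invented for illustration.

```python
import math


def error_probability(q: float) -> float:
    """Base call error probability p for a quality score Q."""
    return 10 ** (-q / 10)


def phred_score(p: float) -> float:
    """Quality score Q = -10 * log10(p) for an error probability p."""
    return -10 * math.log10(p)


def single_base_error_rate(mismatched_bases: int, mappable_yield_bases: int) -> float:
    """Mismatched bases in mapped reads divided by the mappable yield."""
    return mismatched_bases / mappable_yield_bases


# Reproduce the rows of the quality score table.
for q in (10, 20, 30):
    p = error_probability(q)
    print(f"Q{q}: 1 in {round(1 / p)}, accuracy {100 * (1 - p):.1f}%")

# Invented run: 1.2 million mismatches in 1 Gbp of mappable yield.
rate = single_base_error_rate(1_200_000, 1_000_000_000)
print(f"error rate {rate:.4f}, equivalent to about Q{phred_score(rate):.0f}")
```

Expressing a measured error rate as a Q-score makes it directly comparable with the per-base scores reported by the instrument.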
Traditional vs NextGen Sequencing
- Sanger sequencing: 1 sequence read per basepair
- NGS: multiple sequence reads per basepair