Next Generation Sequencing in Molecular Diagnostics
Wilfred van IJcken, PhD
Erasmus MC Center for Biomics
Nov 2 2017, Molecular Diagnostics Course XI
Learning objectives
- Next generation sequencing (NGS): the basics
- Illumina sequencing technology
- Terminology
- Enrichment technology
- Clinical applications
- Targeted gene panels vs exome vs whole genome
- NIPT
- Future directions
Next next next generation sequencing…
- 1st generation sequencing technique: amplified multiple molecule sequencing
- - Sanger sequencing
- 2nd generation sequencing techniques: amplified single molecule sequencing
- - 454 sequencing - Roche
- - SBS sequencing - Illumina
- - SOLiD sequencing - Applied Biosystems/Life Technologies
- - Ion Torrent - Life Technologies
- 3rd generation sequencing techniques: single molecule sequencing
- - Helicos tSMS
- - PacBio SMRT (real-time DNA sequencing)
- - Nanopore sequencing - Oxford Nanopore Technologies
NGS systems on the market
[Figure: systems grouped as High Throughput, Special and Desktop]
Sequence technology dynamics
[Figure: systems grouped as High Throughput, Special and Desktop]
What is next generation sequencing?
- Sequencing technology developed after Sanger
- Millions of reads in parallel (MPS)
- Shorter (<400bp) sequencing reads
- Enables analysis of complex mixtures of DNA or RNA
- Enables genome wide approach
- Different vendors with different approaches
- MPS = massively parallel sequencing
NGS flow
- Intake: ID, amount, sex, disease
- Isolate: DNA or RNA, from blood, plasma, saliva, FFPE, cells
- Library: select region of interest (PCR, capture)
- Sequence: chemistry, enzymes, detection, signal, yield, quality
- Report: variation, match phenotype?
Illumina systems
[Figure: Illumina systems compared by data amount, purchase cost and run costs. Systems shown: MiniSeq, MiSeq, NextSeq 500, HiSeq 2500, HiSeq 4000, HiSeq X Ten, NovaSeq 6000. Output ranges from 8 Gb to 6 Tb per run.]
Simplified sample preparation
[Figure: RNA is converted to DNA by reverse transcriptase; Adaptor 1 and Adaptor 2 are added to the DNA fragment]
Bridge amplification
Each DNA molecule hybridizes at a different location in the flowcell lane
Clustering and Sequencing
Cluster growth
[Figure: clusters grow from strands bound at their 5’ ends; images are acquired in successive cycles (1 to 9) and base calling converts them into the read T G C T A C G A T …]
Sequencing: each base is coupled to a different fluorescent dye
Output file from basecalling
- Many file types: qseq, fastq, etc…
- Each system has its own format.
- Large file sizes: ~150 million reads per lane
Fields per read: Instrument, Run ID, Lane, Tile, X-coord, Y-coord, Index #, Read #, Sequence, Q-score (ASCII character), PF (0,1)
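As a minimal sketch of how such a record could be read, the Python below splits one tab-separated line into the fields listed above. The field order is assumed to follow the qseq layout (instrument, run ID, lane, tile, X, Y, index, read number, sequence, quality, filter), the quality offset of 64 is an assumption tied to that older format, and the example record is made up for illustration rather than taken from a real run.

```python
from dataclasses import dataclass

# Assumption: qseq-style quality characters use an ASCII offset of 64.
# Many FASTQ files use 33 instead, so check your own data before reuse.
QSEQ_QUALITY_OFFSET = 64


@dataclass
class QseqRecord:
    instrument: str
    run_id: str
    lane: int
    tile: int
    x: int
    y: int
    index: str
    read_number: int
    sequence: str
    qualities: list
    passed_filter: bool


def parse_qseq_line(line: str) -> QseqRecord:
    """Split one tab-separated record into named fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 11:
        raise ValueError(f"expected 11 tab-separated fields, got {len(fields)}")
    (instrument, run_id, lane, tile, x, y,
     index, read_number, sequence, quality, pf) = fields
    return QseqRecord(
        instrument=instrument,
        run_id=run_id,
        lane=int(lane),
        tile=int(tile),
        x=int(x),
        y=int(y),
        index=index,
        read_number=int(read_number),
        # Uncalled bases are written as "." in this layout; show them as N.
        sequence=sequence.replace(".", "N"),
        qualities=[ord(c) - QSEQ_QUALITY_OFFSET for c in quality],
        passed_filter=(pf == "1"),
    )


# Hypothetical record, for illustration only.
example = "\t".join(
    ["HWI-EXAMPLE", "12", "3", "45", "1000", "2000",
     "GATCG", "1", "TGCTACGA.", "hhhhhhhBB", "1"]
)
record = parse_qseq_line(example)
print(record.sequence, record.qualities, record.passed_filter)
```

Keeping the PF flag as a boolean makes it easy to drop reads that failed the instrument's filter before any downstream analysis.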
Data analysis not trivial due to data volumes and complexity
Data volume per run (total, final, comment)
HiSeq 2000, 200G run
- Image data: 32 TB total
- Intensity data: 2 TB total; optionally transferred
- Base call / quality score data: 0.25 TB total; 0.25 TB final; 1 byte/base (raw), assuming qseq generation offline
- Alignment output: 6 TB (3 TB) total; 1.2 TB final; remove intermediate files
GAIIx, 50G run
- Image data: 6.9 TB total; optionally transferred
- Intensity data: 0.93 TB total; 0.93 TB final
- Base call / quality score data: 0.17 TB total; 0.17 TB final
- Alignment output: 1.2 TB total; 1.2 TB final
Data storage and compute are needed to handle up to petabytes of data
Core facilities needed
Terminology
- Next generation sequencing, AKA:
- - Deep sequencing
- - MPS = massively parallel sequencing
[Figure: one cluster imaged over cycles 1 to 9 yields the read T G C T A C G A T …]
Number of sequencing cycles = read length
Single-end, paired-end, index read
- Single read = sequence from one side of the fragment
- Paired end = sequence from both sides of the fragment
- Index read, e.g. GATCG
Indexing enables sample multiplexing
Index = a different nucleic acid code per sample, introduced during sample prep and read during the index read
Enables multiple samples in one flowcell lane
Index: GATCG, CGTGA, ATCGG, TCTCT
Samples: Patient 1, Patient 2, Patient 3, Patient 4
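The sketch below illustrates how index reads could be used to sort reads back to their samples after sequencing. The pairing of the four index sequences with Patient 1 to 4 is assumed from their order on the slide, and the one-mismatch tolerance and the `Undetermined` bin are illustrative choices, not a description of any vendor's demultiplexing software.

```python
from collections import defaultdict

# Assumption: indices are paired with patients in the order shown on the slide.
SAMPLE_BY_INDEX = {
    "GATCG": "Patient 1",
    "CGTGA": "Patient 2",
    "ATCGG": "Patient 3",
    "TCTCT": "Patient 4",
}


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def assign_sample(index_read: str, max_mismatches: int = 1):
    """Return the sample whose index is the only one within the tolerance."""
    hits = [
        sample
        for index, sample in SAMPLE_BY_INDEX.items()
        if len(index) == len(index_read)
        and hamming(index, index_read) <= max_mismatches
    ]
    return hits[0] if len(hits) == 1 else None


def demultiplex(reads):
    """Group (index_read, sequence) pairs by sample."""
    bins = defaultdict(list)
    for index_read, sequence in reads:
        sample = assign_sample(index_read)
        bins[sample if sample else "Undetermined"].append(sequence)
    return dict(bins)


# Hypothetical reads from one lane: (index read, insert sequence).
lane_reads = [
    ("GATCG", "AAAA"),
    ("GATCC", "CCCC"),  # one sequencing error in the index
    ("CGTGA", "GGGG"),
    ("AAAAA", "TTTT"),  # matches no index closely enough
]
binned = demultiplex(lane_reads)
print(binned)
```

The four indices differ from each other at three or more positions, which is why tolerating a single mismatch cannot assign a read to two patients at once.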
Alignment, Mapping
AAAACGCGCTTAGCCTTTTTTCGACTGTCGAGTGGAACGCCGCTAGCTAGGCGC
AAAACGCGCTTAGCCTTTTTTCGACTGTCGAGTGGATCGCCGCTAGCTAGGCGC
          TAGCCTTTTTTCGACTGTCGAGTGGATCGCCG
           AGCCTTTTTTCGACTGTCGAGTGGATCGCCGC
            GCCTTTGTTCGACTGTCGAGTGGATCGCCGCT
             CCTTTGTTCGACTGTCGAGTGGATCGCCGCTA
Figure labels: consensus sequence, reference sequence, heterozygous SNP, mismatch
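To make the idea concrete, here is a small sketch that places the four example reads on a reference without gaps, counts the bases seen at each position, and reports positions that disagree with the reference. It assumes the first long sequence above is the reference. The brute-force placement and the thresholds are simplifications for illustration, since real aligners and variant callers work very differently.

```python
from collections import Counter

REFERENCE = "AAAACGCGCTTAGCCTTTTTTCGACTGTCGAGTGGAACGCCGCTAGCTAGGCGC"
READS = [
    "TAGCCTTTTTTCGACTGTCGAGTGGATCGCCG",
    "AGCCTTTTTTCGACTGTCGAGTGGATCGCCGC",
    "GCCTTTGTTCGACTGTCGAGTGGATCGCCGCT",
    "CCTTTGTTCGACTGTCGAGTGGATCGCCGCTA",
]


def best_offset(reference: str, read: str) -> int:
    """Ungapped placement with the fewest mismatches (brute force)."""
    offsets = range(len(reference) - len(read) + 1)
    return min(
        offsets,
        key=lambda o: sum(r != q for r, q in zip(reference[o:], read)),
    )


def pileup(reference: str, reads):
    """Count the bases observed at every reference position."""
    columns = [Counter() for _ in reference]
    for read in reads:
        start = best_offset(reference, read)
        for i, base in enumerate(read):
            columns[start + i][base] += 1
    return columns


def call_differences(reference, columns, min_depth=2, min_fraction=0.2):
    """Report non-reference bases; the thresholds are illustrative only."""
    calls = []
    for pos, counts in enumerate(columns):
        depth = sum(counts.values())
        if depth < min_depth:
            continue
        for base, n in sorted(counts.items()):
            if base != reference[pos] and n / depth >= min_fraction:
                kind = "all reads differ" if n == depth else "heterozygous"
                calls.append((pos, reference[pos], base, n, depth, kind))
    return calls


columns = pileup(REFERENCE, READS)
calls = call_differences(REFERENCE, columns)
for call in calls:
    print(call)
```

Positions are 0-based. In this example two of four reads carry a G where the reference has a T (the heterozygous SNP), while all four reads carry a T where the reference has an A.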
Read depth
- Average read depth can differ a lot from the read depth at a given position!
- Also known as depth of coverage
AAAACGCGCTTAGCCTTTTTTCGACTGTCGAGTGGATCGCCGCTAGCTAGGCGC
          TAGCCTTTTTTCGACTGTCGAGTGGATCGCCG
           AGCCTTTTTTCGACTGTCGAGTGGATCGCCGC
            GCCTTTGTTCGACTGTCGAGTGGATCGCCGCT
             CCTTTGTTCGACTGTCGAGTGGATCGCCGCTA
                      GACTGTCGAGTGGATCGCCGCTAGCTAGG
                        CTGTCGAGTGGATCGCCGCTAGCTAGG
Read depths indicated in the figure: 5, 7, 1
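A short sketch of the difference between average and per-position read depth follows. The start positions and lengths are an assumption: they come from placing the six example reads above on the 54-base sequence by eye, so the numbers illustrate the principle and do not reproduce the values in the figure.

```python
def depth_per_position(reference_length, alignments):
    """Read depth at each 0-based position, given (start, length) alignments."""
    depth = [0] * reference_length
    for start, length in alignments:
        for pos in range(start, min(start + length, reference_length)):
            depth[pos] += 1
    return depth


# Assumed placements (0-based start, read length) of the six example reads.
alignments = [(10, 32), (11, 32), (12, 32), (13, 32), (22, 29), (24, 27)]

depth = depth_per_position(54, alignments)
mean_depth = sum(depth) / len(depth)

print("mean depth:", round(mean_depth, 2))
print("min depth:", min(depth), "max depth:", max(depth))
print("positions without any read:", depth.count(0))
```

Here the mean depth is about 3.4, yet some positions are covered six times and others not at all, which is exactly why an average alone says little about whether a specific base was sequenced reliably.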
Accuracy, error rate, quality score
- Single base error rate = total number of mismatched bases found in mapped sequence reads from a sequencing run, divided by the mappable yield
- Quality scores (Q scores / Phred scores)
- - derived from an examination of the intensity peaks around each base
- - range from 0 to 41; higher corresponds to higher quality
- - Q = -10 log10(p), where p is the base call error probability
Quality score   Probability of incorrect base call   Base call accuracy
10 (Q10)        1 in 10                              90%
20 (Q20)        1 in 100                             99%
30 (Q30)        1 in 1000                            99.9%
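The relation between Q score and error probability, and the definition of the single base error rate, can be checked with a few lines of Python. The function names are made up for this sketch, and the ASCII offset of 33 used to decode quality characters is an assumption (it is common in FASTQ files, but some older formats use 64).

```python
import math


def error_probability(q: float) -> float:
    """Base call error probability p for a Phred score Q."""
    return 10 ** (-q / 10)


def phred_score(p: float) -> float:
    """Phred score Q = -10 * log10(p) for an error probability p."""
    return -10 * math.log10(p)


def single_base_error_rate(mismatched_bases: int, mappable_yield: int) -> float:
    """Mismatched bases in mapped reads divided by the mappable yield."""
    return mismatched_bases / mappable_yield


def decode_quality_string(quality: str, offset: int = 33):
    """Turn ASCII quality characters into Q scores (offset is an assumption)."""
    return [ord(c) - offset for c in quality]


for q in (10, 20, 30):
    p = error_probability(q)
    print(f"Q{q}: 1 in {round(1 / p)} wrong, accuracy {100 * (1 - p):.1f}%")

# Hypothetical run: 1,000 mismatches in 1,000,000 mapped bases.
rate = single_base_error_rate(1_000, 1_000_000)
print("observed error rate:", rate, "equivalent Q:", round(phred_score(rate)))
```

Converting an observed error rate back to a Q value, as in the last line, is a convenient way to compare what a run actually delivered with the quality the instrument predicted.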
NGS systems on the market
[Figure: systems grouped as High Throughput, Special and Desktop]
Different characteristics
Systems differ in sequencing technology, read length, speed, output, applications and run cost
NGS Applications
- Whole genome de novo sequencing
- Epigenetic profiling (DNA methylation)
- Gene expression analysis
- Discovery of novel transcripts, splice variants, miRNAs
- Protein-DNA/RNA interactions (ChIP-Seq)
- Genomic DNA interactions (3C, 4C, 5C Seq)
- Targeted DNA sequencing
- Exome sequencing
- Whole genome re-sequencing
Clinical use
Diagnostic applications
- Targeted sequencing: cardiomyopathies, ciliopathies, cancer hotspot panel, Noonan, neurodegenerative diseases, …
- Exome sequencing: unknown disease, de novo
- Whole genome sequencing: unknown disease, non-exonic
- Non-invasive diagnostics: prenatal plasma, T21 testing (NIPT)
- Cancer sequencing: germline mutations, therapy
- HLA typing: transplantation
Enrichment technology
Exome = all coding regions (~ exons) of the genome
Choose your baits
- Vendors: Agilent, NimbleGen (Roche), Illumina, IDT, …
- Targets: exome, panel or other targets
- CRE: boosted coverage for ~5000 clinically relevant genes
Exome performance
- Target coverage: >20X coverage for 95% of genes
- Even coverage: read depth distribution
- Specificity of capture: false positive / negative variants, high homology genes
[Figure labels: gene, V4, CRE, halo]
Exome data analysis overview
Steps: Mapping, Coverage, Sanger, Variants (+ copy number), Filtering
- Exome depth
- Mapping %, on/off target
- % >20x, min, max, bases not sequenced
- Bases <20x: add Sanger amplicons
- Low frequency variants + indel
- GATK: SNP + indel
- Annotation: >100 databases, function
- Inheritance: dominant, recessive, etc.
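The coverage checks named above (percentage of bases at 20x or more, minimum, maximum, bases not sequenced, and stretches below 20x that need Sanger amplicons) can be sketched as follows. The depth values are invented for illustration, and treating exactly 20x as sufficient is an assumption about how the cut-off is applied.

```python
def coverage_summary(depth, threshold=20):
    """Basic coverage statistics for a list of per-base read depths."""
    covered = sum(1 for d in depth if d >= threshold)
    return {
        "min": min(depth),
        "max": max(depth),
        "mean": sum(depth) / len(depth),
        "pct_at_threshold": 100 * covered / len(depth),
        "not_sequenced": sum(1 for d in depth if d == 0),
    }


def low_coverage_regions(depth, threshold=20):
    """Half-open (start, end) runs of positions with depth below threshold."""
    regions = []
    start = None
    for pos, d in enumerate(depth):
        if d < threshold and start is None:
            start = pos
        elif d >= threshold and start is not None:
            regions.append((start, pos))
            start = None
    if start is not None:
        regions.append((start, len(depth)))
    return regions


# Hypothetical per-base depths across a small target region.
target_depth = [35, 40, 22, 18, 5, 0, 0, 25, 60, 19]

summary = coverage_summary(target_depth)
gaps = low_coverage_regions(target_depth)
print(summary)
print("regions to cover with Sanger amplicons:", gaps)
```

Reporting low-coverage stretches as intervals rather than single bases maps directly onto the practical question of where an additional Sanger amplicon has to be designed.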
Quality
- High throughput
- ISO 15189/17025 accreditation needed for clinical use in NL
- Sample swap is a real possibility
- Spike-in to uniquely identify each sample after sequencing
[Figure: spike-ins A1, B1, C1; workflow steps shown: spike-in, shear, capture, sequencing, with QC steps in between]
What does a targeted sequencing result look like?
Zoom in on the sequence result
Variation is not only SNP
SNPs
GATTTAGATCGCGATAGAG
GATTTAGATCTCGATAGAG
The genomes of any two individuals differ by ~0.1% due to SNPs
Short InDels
GATT------------GAG
GATTTAGATCTCGATAGAG
More difficult to detect than SNPs
Structural variants (SVs) [e.g. kb-Mb-sized deletions, insertions, inversions, fusion genes]
presumably >0.1% of the genome
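Using the two example alignments above, the following sketch shows how a SNP and a deletion look different once sequences are aligned. It assumes the inputs are already aligned and of equal length, with `-` marking bases missing from the sample, and it treats `GATTTAGATCTCGATAGAG` as the reference. Producing that alignment in the first place is the hard part and is not covered here.

```python
def describe_differences(aligned_sample: str, aligned_reference: str):
    """List SNPs and deletions between two pre-aligned sequences."""
    if len(aligned_sample) != len(aligned_reference):
        raise ValueError("aligned sequences must have the same length")
    events = []
    pos = 0
    n = len(aligned_reference)
    while pos < n:
        s, r = aligned_sample[pos], aligned_reference[pos]
        if s == "-":
            # Extend over the whole run of gap characters.
            start = pos
            while pos < n and aligned_sample[pos] == "-":
                pos += 1
            events.append(("deletion", start, aligned_reference[start:pos]))
            continue
        if s != r:
            events.append(("SNP", pos, f"{r}>{s}"))
        pos += 1
    return events


reference = "GATTTAGATCTCGATAGAG"
snp_events = describe_differences("GATTTAGATCGCGATAGAG", reference)
deletion_events = describe_differences("GATT------------GAG", reference)
print(snp_events)
print(deletion_events)
```

Positions are 0-based. A SNP changes one column of the alignment, whereas the deletion removes twelve consecutive bases, and short reads spanning such a gap are harder to place correctly, which is one reason indels are more difficult to detect than SNPs.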
Recent Case report
2005: 5-week-old girl hospitalized with RS virus, on artificial respiration
2008: Developmental delay, maybe due to brain damage by hypoxia
2011: Re-evaluation by clinical geneticist: possibly Sotos syndrome. SNP array, Sanger NSD1, PTEN, AOA, fraX, metabolism: negative
2015: Re-evaluation: speech affected. WES trio, filter for ID genes: de novo c.1216C>T, p.Gln406* mutation in MECP2
-> atypical form of Rett syndrome
2016: Rett specialist: 5 other girls found with atypical Rett syndrome with C-terminal frameshift mutations in MECP2 (unpublished)
WES helps to solve previously unsolved cases
Evidence is increasing for using WES as first-tier care
Human and disease, what to sequence?
- Most Mendelian diseases are caused by mutations in the exome
- The exome is only ~1.6% of the human genome (50 Mbp)
                 Panel     Exome     Whole genome
Genome           >0.01%    1.6%      95%
Sequencing       1/400x    1x        60x
Interpretation   ++        +         +/-
Validation       ++        +         +/-
Speed            ++        +         -
Cost (est.)      € 500     € 700     € 3000
Whole genome sequencing
- X Ten: $1000 genome, 30x
- Outsource: $1000 genome, 40x
?
Comparison of exome and genome sequencing
Non-invasive trisomy testing (NIPT)
Workflow: DNA isolation, prepare NGS, analysis, trisomy report
At 10 weeks of pregnancy: 5% fetal DNA
NIPT: determine fetal chromosomal copy number
[Figure: fetal cfDNA and maternal cfDNA from chr 21, compared between a euploid pregnancy and a fetal trisomy]
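A rough calculation shows why counting reads can reveal a fetal trisomy even when only 5% of the cfDNA is fetal. The sketch assumes reads are proportional to the amount of DNA per chromosome. The euploid chr 21 read fraction, the reference standard deviation and the idea of comparing a z-score against a cut-off are illustrative assumptions, not values or methods taken from this presentation.

```python
def expected_chr21_fraction(euploid_fraction: float, fetal_fraction: float) -> float:
    """Expected chr 21 read fraction when the fetus carries three copies.

    The fetal part contributes 3 copies instead of 2, so chr 21 material
    rises by a factor (1 + fetal_fraction / 2); the total is renormalised.
    """
    extra = fetal_fraction / 2
    return euploid_fraction * (1 + extra) / (1 + euploid_fraction * extra)


def z_score(observed: float, reference_mean: float, reference_sd: float) -> float:
    """Distance of a sample from the euploid reference set, in SD units."""
    return (observed - reference_mean) / reference_sd


FETAL_FRACTION = 0.05       # 5% fetal DNA at 10 weeks, as on the slide
EUPLOID_CHR21 = 0.013       # hypothetical chr 21 read fraction
REFERENCE_SD = 0.00005      # hypothetical spread among euploid samples

trisomy_fraction = expected_chr21_fraction(EUPLOID_CHR21, FETAL_FRACTION)
relative_increase = trisomy_fraction / EUPLOID_CHR21 - 1
z = z_score(trisomy_fraction, EUPLOID_CHR21, REFERENCE_SD)

print(f"relative increase in chr 21 reads: {100 * relative_increase:.2f}%")
print(f"z-score against the euploid reference: {z:.1f}")
```

The expected shift is only about 2.5%, so a reliable call depends on counting very many reads, which is what makes massively parallel sequencing suitable for this test.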
Future of NGS
MinION
- USB sized sequencer
- One time use
- $900
- 500 nanopores
- > 1 Gbp
- User defined runtime
- Electrode lifetime is limiting (days)
No sample prep
Measure directly from blood
Longer read length, e.g. Sequel
- Enables first pass de novo assembly, phasing, epigenetics
- Single molecule real time (SMRT) technology
- 1M ZMWs, 500 Mb – 1 Gb
- No amplification bias
- Maximum read length 60 kbp
- High accuracy due to sequencing the same molecule multiple times
- Epigenetic characterization