Mutation detection in massively parallel sequencing
2012 Winter School in Mathematical and Computational Biology
Ann-Marie Patch
Sequencing literature
From medicine to evolution - large scale sequencing data is impacting all of our research
- Virology: 11 new complete genomes in this month's issue of Journal of Virology
- Bacterial genome sequencing (BMC Microbiology)
- Fungi and plant genome sequencing
- Vertebrate evolution (Nature)
- Human cancer genetics (Nature)
Why do we want to sequence?
ICGC aims to obtain a comprehensive description of genomic, transcriptomic, and epigenomic changes in 50 different tumor types and/or subtypes which are of clinical importance across the globe
International Cancer Genome Consortium, Nature 2010; website: http://www.icgc.org/
Basic level: Comparing the characteristics of two or more populations of cells and recording the differences
The 1000 Genomes Project is an international collaboration to produce an extensive public catalog of human genetic variation, including SNPs and structural variants, and their haplotype contexts. This resource will support genome-wide association studies and other medical research studies.
The 1000 Genomes Project Consortium, Nature 2010; http://www.1000genomes.org/
Talk on metagenomics tomorrow morning
How many changes are we going to find?
[Illustration: human genome]
The genomes of any two people will typically have:
- 1 single base difference every ~1000 bp (0.1%)
- Add in all other classes of mutations and typically <0.5% of two human genomes are different
- In Craig Venter's genome 4.1 million DNA variants were reported (12,500 in the exome), encompassing in total a huge 12.3 Mb (0.41%)
Levy et al 2007, Ng et al 2008
What sort of differences will we find in DNA sequencing?
Mutation = a change in nucleic acid sequence
- Single Nucleotide variations (SNV)
- Small insertions and deletions (INDELS)
- Large chromosome rearrangements
- Copy number changes
Cloonan 2010
RNA talks later today
Mutation Detection
- Sequencing Basics
- Small mutations, SNVs and indels
- Genomic rearrangements
- Copy number changes
- Verification
Sequencing process recap - library preparation
- Genomic DNA
- Sheared DNA fragments
- Captured or size-selected DNA fragments
- Paired-end sequencing
- Align back to a reference genome
Read pairs are mapped to a reference genome
[Figure: paired-end sequences mapped to the reference genome, with coverage depth]
- How many reads stack on top of each other at any one position is called the coverage depth
- Examining how the mapping position and content of the pairs of reads vary across the reference genome allows us to determine mutations and structural rearrangements
We convert our data into positional information and counts
"How many bases out of the total are different to the reference at any position?"
[Figure: variant classes seen in mapped read pairs: single nucleotide variant, small insertions and deletions (INDELs), homozygous copy number loss, heterozygous copy number loss, copy number gain, large chromosome rearrangement]
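To make "positional information and counts" concrete, here is a minimal Python sketch that turns mapped read intervals into a per-position coverage depth. The read coordinates are invented for illustration and the function name is hypothetical; real pipelines would read these intervals from an alignment file.

```python
def coverage_depth(reads, reference_length):
    """Per-position depth for reads given as (start, length), 0-based starts."""
    # Difference array: +1 where a read starts, -1 just past where it ends.
    delta = [0] * (reference_length + 1)
    for start, length in reads:
        end = min(start + length, reference_length)
        delta[start] += 1
        delta[end] -= 1
    depth = []
    running = 0
    for change in delta[:reference_length]:
        running += change
        depth.append(running)
    return depth


# Hypothetical reads on a 12 bp reference (illustration only).
toy_reads = [(0, 5), (2, 5), (4, 5), (9, 3)]
toy_depth = coverage_depth(toy_reads, 12)
print(toy_depth)
```

The depth is highest where the three overlapping reads stack up, which is the same quantity the coverage track in the figure summarises.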
Detecting single nucleotide variants (SNVs)
Position          10        20        30
Reference ACGATATTACACGTACACTCAAGTCGTTCGGAACCT
Read 1    ACGATATTACACGTACATTCAAATCGT
Read 2    ACGTTATTACACGTACATTCAACTCGT
Read 3    ACGATATTACACGCACATTCAAGTCGT
Read 4     CGATCTTACACGTACATTCAAGTCGTT
Read 5       ATATTTCACGTACATTCAAGTCGTTCG
Read 6       ATATTAAA-GTACATTCAAGTCGTTCG
Read 7         ATTACACGTACATTCAAGTCGATCGGA
Read 8         ATTACACGTACATTCACGTCGTTCGGA
Read 9             CACGTACATTCGAGTCGTTCGGAACCT
SNV call  -----------------T------------------
Mutation = homozygous 18 C>T: 9 bases out of a total of 9 reads covering this position do not match the reference
What counts are acceptable?
- 4 out of 9: SNV call, heterozygous C>T
- 2 out of 9: SNV call, heterozygous C>T?
How good is the data?
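The counting behind an SNV call can be sketched in a few lines of Python. The sketch below uses the reference and the nine reads from the example above, with each read's start offset inferred by matching it against the reference. The cut-offs (`min_depth`, `hom_fraction`, `het_fraction`) are arbitrary illustration values, not thresholds from the talk, and real callers also weigh base and mapping qualities.

```python
from collections import Counter

REFERENCE = "ACGATATTACACGTACACTCAAGTCGTTCGGAACCT"

# (0-based start offset, read sequence); offsets are inferred, "-" is a deleted base.
READS = [
    (0, "ACGATATTACACGTACATTCAAATCGT"),
    (0, "ACGTTATTACACGTACATTCAACTCGT"),
    (0, "ACGATATTACACGCACATTCAAGTCGT"),
    (1, "CGATCTTACACGTACATTCAAGTCGTT"),
    (3, "ATATTTCACGTACATTCAAGTCGTTCG"),
    (3, "ATATTAAA-GTACATTCAAGTCGTTCG"),
    (5, "ATTACACGTACATTCAAGTCGATCGGA"),
    (5, "ATTACACGTACATTCACGTCGTTCGGA"),
    (9, "CACGTACATTCGAGTCGTTCGGAACCT"),
]


def pileup(reads, position):
    """Count the bases observed at a 1-based reference position."""
    counts = Counter()
    for start, seq in reads:
        index = position - 1 - start
        if 0 <= index < len(seq):
            counts[seq[index]] += 1
    return counts


def call_snv(reference, reads, position,
             min_depth=8, hom_fraction=0.8, het_fraction=0.3):
    """Return (pos, ref, alt, support, depth, zygosity) or None if no call."""
    ref_base = reference[position - 1]
    counts = pileup(reads, position)
    depth = sum(counts.values())
    non_ref = {b: n for b, n in counts.items() if b not in (ref_base, "-")}
    if depth < min_depth or not non_ref:
        return None
    alt, support = max(non_ref.items(), key=lambda item: item[1])
    fraction = support / depth
    if fraction >= hom_fraction:
        zygosity = "homozygous"
    elif fraction >= het_fraction:
        zygosity = "heterozygous"
    else:
        return None  # too few supporting reads under these assumed cut-offs
    return (position, ref_base, alt, support, depth, zygosity)


print(call_snv(REFERENCE, READS, 18))
```

With these assumed cut-offs, 4 out of 9 would be called heterozygous and 2 out of 9 would not be called. That is exactly the judgement the slide asks about, so the thresholds should be tuned against verified data rather than taken from this sketch.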
Controlling the quality of sequencing data
Filtering data can take place at various stages
Pre-alignment
- Remove or trim reads where base quality is low, e.g. SolexaQA (Cox et al 2010)
[Figure: per-base quality at the 3' end of sequence reads (positions 64 to 76), with the q20 threshold marked]
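A minimal sketch of 3' quality trimming in Python follows. It removes trailing bases until one reaches the quality threshold. It assumes Phred+33 quality encoding (older Illumina data used an offset of 64), and it is a deliberately simple rule, not the algorithm used by SolexaQA.

```python
def phred_scores(quality_string, offset=33):
    """Decode an ASCII quality string into Phred scores (offset is an assumption)."""
    return [ord(ch) - offset for ch in quality_string]


def trim_3prime(sequence, quality_string, min_quality=20, offset=33):
    """Drop bases from the 3' end while their quality is below min_quality."""
    scores = phred_scores(quality_string, offset)
    keep = len(scores)
    while keep > 0 and scores[keep - 1] < min_quality:
        keep -= 1
    return sequence[:keep], quality_string[:keep]


# Hypothetical read: qualities 40 x7, then 20, 2 and 0 at the 3' end.
trimmed_seq, trimmed_qual = trim_3prime("ACGTACGTAC", "IIIIIII5#!")
print(trimmed_seq, trimmed_qual)
```

Reads that become very short after trimming are usually discarded as well; the minimum length to keep is a separate choice not shown here.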
Variant calling workflow: Pre-filter reads, Evaluate variants, Annotate variants, Rank variants, Verify variants
Talk this afternoon on trimming and errors
Control the quality of input data before calling variants
Alignment thresholds
- Set minimums for mapping quality
Post-alignment
- Marking duplicates e.g. Picard (http://picard.sourceforge.net)
- Set maximum number of mismatches for a read
- Flagging reads that map to more than one location in the genome
[Figure: a read with 3 mismatches; PCR duplicate reads]
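Duplicate marking can be illustrated with a short Python sketch. Read pairs that share chromosome, strand, start and mate start are treated as PCR duplicates, and only the one with the highest mapping quality is kept. This is a simplification for illustration: Picard's own rules are more detailed, and the `Read` fields below are hypothetical names.

```python
from collections import namedtuple

# Hypothetical minimal read record (field names are assumptions).
Read = namedtuple("Read", "name chrom start mate_start strand mapq")


def mark_duplicates(reads):
    """Return the names of reads flagged as duplicates of a better read."""
    best = {}
    duplicates = set()
    for read in reads:
        key = (read.chrom, read.start, read.mate_start, read.strand)
        kept = best.get(key)
        if kept is None:
            best[key] = read
        elif read.mapq > kept.mapq:
            duplicates.add(kept.name)  # previously kept read is now the duplicate
            best[key] = read
        else:
            duplicates.add(read.name)
    return duplicates


toy_reads = [
    Read("r1", "chr1", 100, 300, "+", 60),
    Read("r2", "chr1", 100, 300, "+", 40),
    Read("r3", "chr1", 100, 320, "+", 60),  # different mate start: not a duplicate
    Read("r4", "chr1", 100, 300, "+", 70),
]
flagged = mark_duplicates(toy_reads)
print(sorted(flagged))
```

Marking rather than deleting duplicates keeps the original data intact, so later steps can decide whether to ignore flagged reads.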
Software for calling variants
Many software tools are available, more than are listed here:
- GATK: McKenna et al 2010, Genome Res
- SAMtools (mpileup and BCFtools): Li 2009, Bioinformatics
- DiBayes: SOLiD software, http://www.lifetechnologies.com
- InGAP: Qi 2011, Nucleic Acids Res
- MAQGene: Bigelow 2009, Nat. Methods (C. elegans only)
- PolyBayesShort: http://bioinformatics.bc.edu/marthlab/PbShort
- SomaticSniper: Larson 2012, Bioinformatics
- Sniper: Simola 2011, Genome Biol
- Strelka: Saunders 2012, Bioinformatics
- Dindel: Albers 2011, Genome Res
- SNiPlay: Dereeper 2011, BMC Bioinformatics
- SRiC: Zang 2011, BMC Genomics
- qSNP: QCMG, manuscript in preparation
See also http://seqanswers.com/wiki/Software/list
Evaluate the overall calling of variants by the software
SNP concordance with genotyping arrays
Germline variants       Illumina array (+)   Illumina array (-)
SOLiD sequencing (+)    339,935              1,453
SOLiD sequencing (-)    5,806                434,554
(+) variant called by technology; (-) variant not called
Sensitivity = TP/(TP+FN): 97%
Specificity = TN/(TN+FP): 99%
Effective median coverage: 37
[Figure: overlap of genotyping array calls and sequencing calls, showing true positives, false positives, false negatives and true negatives]
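The two formulas can be checked with a few lines of Python. The sketch assumes the concordance table is read with rows as sequencing calls and columns as array calls, so that TP = 339,935, FP = 1,453, FN = 5,806 and TN = 434,554. Under that reading the ratios come out near 98.3% and 99.7%, a little higher than the 97% and 99% quoted on the slide, so the slide figures may have been computed on a different subset or reading of the table.

```python
def sensitivity(tp, fn):
    """Fraction of array-positive sites that sequencing also called."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """Fraction of array-negative sites that sequencing also left uncalled."""
    return tn / (tn + fp)


# Counts taken from the concordance table (cell assignment is an assumption).
TP, FP, FN, TN = 339_935, 1_453, 5_806, 434_554

sens = sensitivity(TP, FN)
spec = specificity(TN, FP)
print(f"sensitivity {sens:.1%}, specificity {spec:.1%}")
```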
Visualise the variants called by the software
Visualising SNV calls and small INDELs in IGV (http://www.broadinstitute.org/software/igv/)
- Assess coverage and quality
- Check for hidden duplicates
- Examine sequence context
Solutions for annotating variants
- SeattleSeq: http://snp.gs.washington.edu/SeattleSeqAnnotation131/
- MU2A: Garla V et al 2010, Bioinformatics
- Segtor: Renaud et al 2011, PLoS One
- Galaxy: http://galaxy.psu.edu/
- ANNOVAR: http://www.openbioinformatics.org/annovar/
- Ensembl Perl API: http://www.ensembl.org
And more ...
Annotation categories include:
- Downstream, Upstream (5kb)
- Intergenic
- Intronic
- Essential splice site
- 5'UTR, 3'UTR
- Synonymous coding
- Non-synonymous coding
- Stop gained, stop lost
- Within non-coding gene
- miRNA
Measuring how damaging a variant might be
Rank variants on a number of characteristics
- Conservation of amino acid across species, e.g. GERP (Davydov 2010, Goode 2010), phastCons and phyloP (Pollard 2009, Siepel 2005)
- Assess potential for damage using e.g. SIFT (Kumar 2009), PolyPhen2 (Adzhubei 2010, http://genetics.bwh.harvard.edu/pph2/), MutationTaster (Schwarz 2010)
- Manual curation, i.e. variation within a known candidate gene or locus of interest
Talk on Thursday about pathway analysis
Summary of workflow for mutation detection (QCMG workflow)
Pre-filter:
- remove duplicates
- alignment length >34 or (F5 and in proper pair)
- mapping quality > 14
- fewer than 3 mismatches to reference
qSNP:
- Pileup of variants in tumour and normal BAMs
- Coverage minimum of 12 reads in the normal
- Calls somatic if not in pileup of matched normal
- Flags if variant has been seen in the normal of another patient
- Annotation using Ensembl API
Evaluation of variants:
- > 3 novel starts supporting mutation/variant
- Coverage
- IGV review
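The thresholds in this workflow can be expressed as a small Python sketch. This is not the qSNP implementation: the record fields, function names, return labels and the order of the checks are assumptions made for illustration, and only the numeric cut-offs come from the slide.

```python
from dataclasses import dataclass


@dataclass
class AlignedRead:
    # Hypothetical field names for one aligned read.
    is_duplicate: bool
    aligned_length: int
    is_f5: bool
    proper_pair: bool
    mapq: int
    mismatches: int


def passes_prefilter(read):
    """Apply the pre-filter rules listed in the workflow summary."""
    if read.is_duplicate:
        return False
    if not (read.aligned_length > 34 or (read.is_f5 and read.proper_pair)):
        return False
    return read.mapq > 14 and read.mismatches < 3


def classify_variant(alt, tumour_starts, normal_bases,
                     min_normal_cov=12, min_novel_starts=3):
    """Label a tumour variant using the matched normal pileup.

    tumour_starts: start positions of tumour reads carrying the variant.
    normal_bases: bases observed in the normal at the same position.
    """
    if len(normal_bases) < min_normal_cov:
        return "low normal coverage"
    if len(set(tumour_starts)) <= min_novel_starts:
        return "insufficient novel starts"
    if alt in normal_bases:
        return "not somatic"
    return "somatic"


print(classify_variant("T", [100, 104, 110, 121], "C" * 15))
```

Counting distinct start positions ("novel starts") rather than raw reads guards against a variant that is supported only by copies of the same fragment.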
Genomic rearrangements
Genomic variants typically arise from incorrectly repaired DNA damage
- Deletions
- Insertions
- Translocations
[Figure: Spectral Karyotype Analysis (SKY), Edwards et al 2010, Journal of Pathology]
Other detection methods include
Microscopy based
- Metaphase spread karyotype
- Fluorescent in-situ hybridisation (FISH)
- Spectral karyotype analysis
Array based
- Array CGH (comparative genomic hybridisation)
- High density genotyping arrays
Paired-end sequencing detection of genomic rearrangements
Deletion example
Library preparation size selection is key: size-selected sheared genomic DNA gives an expected distance between reads of 200-400 bp
- A pair of reads maps across a deletion present in your sample (DNA repaired with some sequence missing)
- When the pair is mapped back to the reference genome, which does not have that deletion, the distance between the two reads is bigger than the expected size range
Discordant pair mapping to identify genomic rearrangements
[Figure: sample versus reference read-pair mappings for three rearrangement examples]
Discordant pair identification tools
- qSV: in-house developed tool
- BreakDancer: Chen et al 2009
- SVDetect: Zeitouni et al 2010
All use the unexpected distance or orientation of read pairs to identify genomic rearrangements
Evidence of a rearrangement is inferred from the number of abnormally mapping read pairs clustering at genomic positions
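The idea of discordant pairs clustering at a position can be sketched in Python as below. The sketch uses the 200-400 bp size range from the deletion example, checks only distance (not orientation), and clusters pairs by the position of their left read. The `window` and `min_support` values and the toy coordinates are assumptions, and this is not how qSV, BreakDancer or SVDetect are implemented.

```python
def discordant_pairs(pairs, min_size=200, max_size=400):
    """Keep (left, right) pairs whose mapped distance is outside the expected range."""
    return [(left, right) for left, right in pairs
            if not (min_size <= right - left <= max_size)]


def cluster_pairs(pairs, window=500, min_support=3):
    """Group pairs whose left positions lie within `window` of the previous pair."""
    clusters = []
    current = []
    for pair in sorted(pairs):
        if current and pair[0] - current[-1][0] > window:
            if len(current) >= min_support:
                clusters.append(current)
            current = []
        current.append(pair)
    if len(current) >= min_support:
        clusters.append(current)
    return clusters


# Hypothetical read-pair coordinates on one chromosome.
toy_pairs = [(1000, 1300), (1050, 2900), (1100, 2950),
             (1120, 3000), (5000, 5350), (9000, 12000)]
abnormal = discordant_pairs(toy_pairs)
candidates = cluster_pairs(abnormal)
print(candidates)
```

The single stretched pair at position 9000 is discarded because it has no neighbours, which reflects the point that evidence comes from several abnormal pairs agreeing.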
Genomic rearrangements detected by soft clipping clusters
- Soft clipping will often happen around breakpoints of genomic rearrangements (if your aligner supports soft clipping)
- Tools have been developed that search for clusters of reads with soft clips and perform remapping: CREST (Wang et al 2011), ClipCrop (Suzuki et al 2011)
[Figure: Suzuki et al 2011]
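Finding clusters of soft clips can be sketched in Python by parsing SAM CIGAR strings, where `S` marks soft-clipped bases and POS is the 1-based position of the first aligned base. The sketch only counts reads clipped at the same reference position and side. The remapping step performed by tools such as CREST or ClipCrop is not shown, and the function names and `min_reads` value are assumptions.

```python
import re
from collections import Counter

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
REF_CONSUMING = set("MDN=X")  # CIGAR operations that advance along the reference


def soft_clip_sites(pos, cigar):
    """Return (position, side) for each soft-clipped end of one alignment."""
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    core = [item for item in ops if item[1] != "H"]  # ignore hard clips
    sites = []
    if core and core[0][1] == "S":
        sites.append((pos, "left"))
    if len(core) > 1 and core[-1][1] == "S":
        ref_len = sum(n for n, op in core if op in REF_CONSUMING)
        sites.append((pos + ref_len - 1, "right"))
    return sites


def clip_clusters(alignments, min_reads=3):
    """Count soft-clip sites over (pos, cigar) alignments; keep well-supported ones."""
    counts = Counter(site
                     for pos, cigar in alignments
                     for site in soft_clip_sites(pos, cigar))
    return {site: n for site, n in counts.items() if n >= min_reads}


# Hypothetical alignments given as (POS, CIGAR).
toy_alignments = [(1000, "20S80M"), (1000, "35S65M"), (1000, "10S90M"),
                  (2941, "60M40S"), (2921, "80M20S"), (500, "100M")]
clusters = clip_clusters(toy_alignments)
print(clusters)
```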
Genomic rearrangement detection by split read clustering
- Similarly, if your aligner allows large gaps when mapping reads back to the reference genome, split reads will be found at breakpoints of genomic rearrangements
- Pindel (Ye et al 2009) is a tool developed to search for clusters of split reads
[Figure: split-read diagram, Ye et al 2009]
See also the Alkan 2011 review article
Copy number mutations
Detection of copy number changes can be achieved by measuring signal intensities on genotyping SNP chips or array CGH
- e.g. Illumina SNP arrays report Log R ratio and B-allele frequency for each probe
- We can interpret these signals as changes in copy number
- They are low resolution
[Figure: Log R ratio and B-allele frequency tracks for CN=2, CN=1 and CN=3, http://www.illumina.com]
[Figure: read pairs mapped to the reference genome showing homozygous copy number loss, heterozygous copy number loss and copy number gain]
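As a rough guide to reading these tracks, the Python sketch below lists the B-allele frequencies expected for a given total copy number, together with an idealised log2 ratio relative to two copies. It assumes a pure sample with no normal-cell contamination; real array values are noisier and typically less extreme than these ideal figures, and the function names are hypothetical.

```python
import math


def expected_baf(copy_number):
    """Possible B-allele fractions when 0..copy_number copies carry the B allele."""
    if copy_number == 0:
        return []
    return [b / copy_number for b in range(copy_number + 1)]


def ideal_log_r_ratio(copy_number, normal_copies=2):
    """Idealised log2 intensity ratio versus the normal two-copy state."""
    return math.log2(copy_number / normal_copies)


for cn in (1, 2, 3):
    print(cn, expected_baf(cn), round(ideal_log_r_ratio(cn), 3))
```

The heterozygous band at 0.5 for CN=2 disappears for CN=1 and splits into bands at one third and two thirds for CN=3, which is the pattern to look for in the B-allele frequency track.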
Copy number changes in sequencing data
Coverage depth is not even across the genome
- e.g. coverage varies across the genome due to the 'mappability' of repeated sequence regions; this is not evidence of copy number changes (McKenna et al 2010, and see Koehler et al 2010)
- CNVs are often associated with an SV
- CNVnator (Abyzov et al 2011) is a tool to identify copy number variants from read depth, using partitioning and GC content correction
- Copy number variants can also be identified in targeted capture sequencing
[Figure: read depth plotted against genomic position]
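A read-depth approach can be sketched in Python by comparing binned read counts between a tumour and its matched normal. Dividing by the normal cancels, to a first approximation, bins where coverage is low in both samples for mappability reasons. This is not CNVnator's method (no partitioning or GC correction is done here), and the bin counts and the `gain` and `loss` thresholds are invented for illustration.

```python
import math


def log2_ratios(tumour_counts, normal_counts):
    """Per-bin log2 ratio of tumour to normal depth, scaled by library size."""
    tumour_total = sum(tumour_counts)
    normal_total = sum(normal_counts)
    ratios = []
    for t, n in zip(tumour_counts, normal_counts):
        if t == 0 or n == 0:
            ratios.append(None)  # no usable signal in this bin
        else:
            ratios.append(math.log2((t / tumour_total) / (n / normal_total)))
    return ratios


def call_bins(ratios, gain=0.4, loss=-0.4):
    """Label each bin using assumed log2 ratio thresholds."""
    calls = []
    for ratio in ratios:
        if ratio is None:
            calls.append("no call")
        elif ratio >= gain:
            calls.append("gain")
        elif ratio <= loss:
            calls.append("loss")
        else:
            calls.append("neutral")
    return calls


# Hypothetical read counts per bin; bin 1 has low coverage in both samples.
normal = [100, 50, 100, 100, 0, 100]
tumour = [100, 50, 50, 150, 0, 100]
bin_calls = call_bins(log2_ratios(tumour, normal))
print(bin_calls)
```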
Visualisation of genomic rearrangements and copy number variants
Circos (Krzywinski et al 2009) is a visualisation tool for creating attractive plots of genomic data tracks
[Figure: Circos plot with the following tracks]
- Chromosomes
- SNP array track that shows copy number gain in red and loss in green
- SNP array track that shows heterozygous and homozygous regions
- Centre: sequencing data
Colour key for sequencing data:
- Translocations => blue
- Deletions => green
- Duplications => red
- Inversions => orange
- Intrachromosomal rearrangements => light blue
Combining copy number and genomic rearrangements
- Copy number changes and genomic variants are often associated
- Combining these data can be challenging
[Figure: Stephens et al 2011, Cell]
Verification of variants is an essential part of the workflow
- Verify ideally with a different technology, e.g. capillary sequencing (Sanger method) or non-optical sequencing (Ion Torrent, Life Technologies)
- Verification of variants called is a useful internal quality check
- Feedback loops should inform the whole process
- It allows further development of the variant calling workflow to balance sensitivity and specificity
Summary
Challenges in mutation detection
- Poisson sampling (low coverage regions)
- Mismapping reads
- Sequence/ PCR errors
- Allele-capture bias
- Low frequency mutations
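Two of these challenges, Poisson sampling and low frequency mutations, can be put into numbers with a short Python sketch. It assumes read depth at a position follows a Poisson distribution around the mean coverage and that variant reads follow a binomial distribution at a given allele fraction. Both are idealised models, and real data are usually more dispersed.

```python
import math


def poisson_cdf(k, mean):
    """P(X <= k) for X ~ Poisson(mean): chance that depth is at most k."""
    return sum(math.exp(-mean) * mean ** i / math.factorial(i)
               for i in range(k + 1))


def prob_at_least(m, depth, allele_fraction):
    """P(at least m variant reads) for Binomial(depth, allele_fraction)."""
    below = sum(math.comb(depth, i)
                * allele_fraction ** i
                * (1 - allele_fraction) ** (depth - i)
                for i in range(m))
    return 1.0 - below


# Chance of seeing 4 or more variant reads at 30x depth, for a heterozygous
# germline variant (0.5) versus a low frequency mutation (0.1).
print(prob_at_least(4, 30, 0.5), prob_at_least(4, 30, 0.1))
# Fraction of positions expected to fall below 12 reads at a mean depth of 37.
print(poisson_cdf(11, 37))
```

Running the comparison at different depths shows how quickly the chance of detecting a low frequency mutation drops in low coverage regions.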
A cyclical process can help develop or hone mutation detection
- Call mutations
- Verification by alternate technology
- Evaluate sensitivity/specificity
- Update algorithms or methods
Acknowledgements
QCMG: Sean Grimmond, Peter Wilson, Deborah Gwynne
Genome Biology, Bioinformatics, Sequencing: Nic Waddell, Karin Kassahn, John Pearson, David Miller, Nicole Cloonan, Katia Nones, Darrin Taylor, Craig Nourse, Keerthana Krishnan, Shiv Hiriyur Nagaraj, Scott Wood, Tim Bruxner, Shivangi Wani, Conrad Leonard, Ehsan Nourbakhsh, David Wood, Oliver Holmes, Suzanne Manning, Jason Steen, Christina Xu, Ivon Harliwong, Alan Robertson, Matt Anderson, Senel Idrisoglu, Kelly Quek, Felicity Newell, Angelika Christ, Anita Steptoe, Lynn Fink, Sarah Song
Life Technologies: John Sheppard, Emma Campbell, Evgeny Glazov