SLIDE 1

Mutation detection in massively parallel sequencing

2012 Winter School in Mathematical and Computational Biology

Ann-Marie Patch

SLIDE 2

Sequencing literature

Virology – 11 new complete genomes in this month’s issue of Journal of Virology

From medicine to evolution - large scale sequencing data is impacting all of our research

SLIDE 3

Sequencing literature

Bacterial genome sequencing (BMC Microbiology)

SLIDE 4

Sequencing literature

Fungal and plant genome sequencing

SLIDE 5

Sequencing literature

Vertebrate evolution (Nature)

SLIDE 6

Sequencing literature

Human cancer genetics (Nature)

SLIDE 7

Why do we want to sequence?

ICGC aims to obtain a comprehensive description of genomic, transcriptomic, and epigenomic changes in 50 different tumor types and/or subtypes which are of clinical importance across the globe http://www.icgc.org/

International Cancer Genome Consortium, Nature 2010 (and website)

Basic level: Comparing the characteristics of two or more populations of cells and recording the differences

The 1000 Genomes Project is an international collaboration to produce an extensive public catalog of human genetic variation, including SNPs and structural variants, and their haplotype contexts. This resource will support genome-wide association studies and other medical research studies

The 1000 Genomes Project Consortium, Nature 2010 http://www.1000genomes.org/

Talk on metagenomics tomorrow morning

SLIDE 8

How many changes are we going to find?

The genomes of any two people will typically have:
1 single base difference every ~1000 bp (0.1%)
Add in all other classes of mutations and typically <0.5% of two human genomes are different

In Craig Venter’s genome 4.1 million DNA variants were reported (12,500 in the exome), encompassing in total a huge 12.3 Mb (0.41%)

Levy et al 2007, Ng et al 2008

SLIDE 9

What sort of differences will we find in DNA sequencing?

Mutation = a change in nucleic acid sequence

Single Nucleotide variations (SNV)
Small insertions and deletions (INDELS)
Large chromosome rearrangements
Copy number changes

Cloonan 2010

RNA talks later today

SLIDE 10

Mutation Detection

Sequencing Basics
Small mutations, SNVs and indels
Genomic rearrangements
Copy number changes
Verification

SLIDE 11

Sequencing process recap - library preparation

Genomic DNA -> Sheared DNA fragments -> Captured or size-selected DNA fragments -> Paired-end sequencing

SLIDE 12

Sequencing process recap - library preparation

Genomic DNA -> Sheared DNA fragments -> Captured or size-selected DNA fragments -> Align back to a reference genome

SLIDE 13

Read pairs are mapped to a reference genome

How many reads stack on top of each other at any one position is called the coverage depth.

Examining how the mapping position and content of the pairs of reads vary across the reference genome allows us to determine mutations and structural rearrangements.

Figure: paired-end sequences mapped to the reference genome, with coverage depth
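
As a minimal sketch of the coverage depth idea, the Python below counts how many reads overlap each reference position. The read intervals, the 10 bp reference length and the function name `coverage_depth` are made up for illustration and are not from the slides.

```python
def coverage_depth(reads, ref_length):
    """Return per-position depth for reads given as 0-based, half-open (start, end) intervals."""
    depth = [0] * ref_length
    for start, end in reads:
        # Clamp each read to the reference so out-of-range coordinates are ignored
        for pos in range(max(start, 0), min(end, ref_length)):
            depth[pos] += 1
    return depth


# Hypothetical alignments on a 10 bp reference
reads = [(0, 5), (2, 8), (4, 10)]
depth = coverage_depth(reads, 10)
print(depth)
```

Real pipelines compute this from the alignment file rather than from a Python list, but the counting is the same.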

SLIDE 14

Read pairs are mapped to a reference genome

We convert our data into positional information and counts

“How many bases out of the total are different to the reference at any position”

Figure: reference genome with mapped read pairs, illustrating:
Single Nucleotide Variant
Small insertions and deletions (INDELs)
Copy Number Loss Homozygous
Copy Number Loss Heterozygous
Copy Number Gain
Large chromosome rearrangement

SLIDE 15

Sequencing Basics
Small mutations, SNVs and indels
Genomic rearrangements
Copy number changes
Verification

Mutation Detection

SLIDE 16

Detecting single nucleotide variants (SNVs)

Reference:
ACGATATTACACGTACACTCAAGTCGTTCGGAACCT

Aligned reads:
ACGATATTACACGTACATTCAAATCGT
ACGTTATTACACGTACATTCAACTCGT
ACGATATTACACGCACATTCAAGTCGT
 CGATCTTACACGTACATTCAAGTCGTT
   ATATTTCACGTACATTCAAGTCGTTCG
   ATATTAAA-GTACATTCAAGTCGTTCG
     ATTACACGTACATTCAAGTCGATCGGA
     ATTACACGTACATTCACGTCGTTCGGA
         CACGTACATTCGAGTCGTTCGGAACCT

SNV call:
-----------------T------------------

Mutation = Homozygous 18 C>T: 9 bases out of a total of 9 reads covering this position do not match the reference

Figure: coverage plot, with position ticks at 10, 20 and 30
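
The counting behind this call can be sketched in a few lines of Python. The reference and read sequences are the ones on the slide; the start offsets are not on the slide and were inferred by lining each read up against the reference, so treat them as an assumption. The function name `pileup` is illustrative only.

```python
from collections import Counter

REFERENCE = "ACGATATTACACGTACACTCAAGTCGTTCGGAACCT"

# (0-based start offset, read sequence); offsets are inferred, not taken from the slide
READS = [
    (0, "ACGATATTACACGTACATTCAAATCGT"),
    (0, "ACGTTATTACACGTACATTCAACTCGT"),
    (0, "ACGATATTACACGCACATTCAAGTCGT"),
    (1, "CGATCTTACACGTACATTCAAGTCGTT"),
    (3, "ATATTTCACGTACATTCAAGTCGTTCG"),
    (3, "ATATTAAA-GTACATTCAAGTCGTTCG"),
    (5, "ATTACACGTACATTCAAGTCGATCGGA"),
    (5, "ATTACACGTACATTCACGTCGTTCGGA"),
    (9, "CACGTACATTCGAGTCGTTCGGAACCT"),
]


def pileup(position):
    """Count the bases observed at a 1-based reference position."""
    counts = Counter()
    for start, seq in READS:
        idx = position - 1 - start
        if 0 <= idx < len(seq):
            counts[seq[idx]] += 1
    return counts


counts = pileup(18)
ref_base = REFERENCE[18 - 1]
total = sum(counts.values())
non_ref = sum(n for base, n in counts.items() if base != ref_base)
print(f"position 18: ref={ref_base}, observed={dict(counts)}, {non_ref}/{total} non-reference")
```

The same function can be pointed at other positions to see the scattered single-read mismatches that a caller has to ignore.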

SLIDE 17

What counts are acceptable?

SNV call, 4 out of 9: Heterozygous C>T
SNV call, 2 out of 9: Heterozygous C>T?
How good is the data?
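
One hedged way to put numbers on this question is a binomial model. The sketch below compares how likely 4 of 9 or 2 of 9 non-reference reads are under sequencing error alone and under a true heterozygous site. The 1% per-base error rate and the 50% expected allele fraction are assumptions for illustration, not values from the talk, and real callers use richer models.

```python
from math import comb


def binom_pmf(k, n, p):
    """Probability of exactly k successes in n trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def binom_tail(k, n, p):
    """Probability of k or more successes in n trials."""
    return sum(binom_pmf(i, n, p) for i in range(k, n + 1))


ERROR_RATE = 0.01  # assumed per-base error rate
HET_FRACTION = 0.5  # expected non-reference fraction at a heterozygous site

for k in (4, 2):
    p_error = binom_tail(k, 9, ERROR_RATE)
    p_het = binom_pmf(k, 9, HET_FRACTION)
    print(f"{k}/9 non-reference: P(>= {k} by error) = {p_error:.2e}, P({k} | het) = {p_het:.3f}")
```

Under these assumptions 2 of 9 is much harder to separate from error than 4 of 9, which is why coverage and base quality matter for low counts.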

SLIDE 18

Controlling the quality of sequencing data

Filtering data can take place at various stages

Pre-alignment

Remove or trim reads where base quality is low

e.g. SolexaQA (Cox et al 2010)

Figure: per base quality at the 3’ end of sequence reads (read positions 64 to 76), with a q20 threshold line
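
A minimal sketch of 3’ quality trimming, assuming Phred-scaled qualities and the q20 threshold shown in the figure. This is a simple illustration and not the algorithm used by SolexaQA; the function name `trim_3prime` and the example read are hypothetical.

```python
def trim_3prime(seq, quals, threshold=20):
    """Remove bases from the 3' end while their quality is below the threshold."""
    end = len(seq)
    while end > 0 and quals[end - 1] < threshold:
        end -= 1
    return seq[:end], quals[:end]


# Hypothetical read whose quality decays towards the 3' end
seq = "ACGTACGT"
quals = [30, 32, 31, 28, 25, 19, 12, 8]
trimmed_seq, trimmed_quals = trim_3prime(seq, quals)
print(trimmed_seq, trimmed_quals)
```

A read that ends up too short after trimming would normally be removed altogether.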

Workflow steps: Variant calling, Pre-filter reads, Evaluate Variants, Annotate Variants, Rank Variants, Verify Variants

Talk this afternoon on trimming and errors

SLIDE 19

Control the quality of input data before calling variants

Alignment thresholds

Set minimums for mapping quality

Post-alignment

Marking duplicates e.g. Picard (http://picard.sourceforge.net)
Set maximum number of mismatches for a read
Flagging reads that map to more than one location in the genome

Figure labels: 3 mismatches; PCR duplicate reads

SLIDE 20

Software for calling variants

Many software tools available ...more than are listed here

GATK – McKenna et al 2010 Genome Res
SAMtools (mpileup and BCFtools) – Li 2009 Bioinformatics
DiBayes – SOLiD software http://www.lifetechnologies.com
InGAP – Qi 2011 Nucleic Acids Res
MAQGene – Bigelow 2009 Nat. Methods (C. elegans only)
PolyBayesShort – http://bioinformatics.bc.edu/marthlab/PbShort
SomaticSniper – Larson 2012 Bioinformatics
Sniper – Simola 2011 Genome Biol
Strelka – Saunders 2012 Bioinformatics
Dindel – Albers 2011 Genome Res
SNiPlay – Dereeper 2011 BMC Bioinformatics
SRiC – Zang 2011 BMC Genomics
qSNP – QCMG manuscript in preparation

http://seqanswers.com/wiki/Software/list

SLIDE 21

Evaluate the overall calling of variants by the software

SNP concordance with genotyping arrays (germline variants)

                        Illumina array (+)    Illumina array (-)
SOLiD sequencing (+)    339,935               1,453
SOLiD sequencing (-)    5,806                 434,554

(+) variant called by technology; (-) variant not called

Sensitivity, TP/(TP+FN): 97%
Specificity, TN/(TN+FP): 99%
Effective median coverage: 37

Figure: overlap of genotyping array calls and sequencing calls, labelled True Positives, False Positives, False Negatives and True Negatives
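
The two formulas can be applied directly to the concordance counts. The sketch below assumes the array is the reference standard and reads the table as TP = 339,935, FP = 1,453, FN = 5,806 and TN = 434,554; that reading of the slide layout is an assumption. Under it the results land near 98% and above 99%, so the rounded figures on the slide may rest on a slightly different layout or filtering.

```python
def sensitivity(tp, fn):
    """Fraction of array-positive sites that sequencing also called: TP / (TP + FN)."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """Fraction of array-negative sites that sequencing also left uncalled: TN / (TN + FP)."""
    return tn / (tn + fp)


# Counts as read from the concordance table (layout is an assumption)
TP, FP, FN, TN = 339_935, 1_453, 5_806, 434_554

sens = sensitivity(TP, FN)
spec = specificity(TN, FP)
print(f"sensitivity = {sens:.3f}, specificity = {spec:.3f}")
```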

SLIDE 22

Visualise the variants called by the software

Visualising SNV calls in IGV

(IGV: http://www.broadinstitute.org/software/igv/)

Assess coverage and quality
Check for hidden duplicates
Examine sequence context

SLIDE 23

Visualise the variants called by the software

Visualising small INDELS in IGV

(IGV: http://www.broadinstitute.org/software/igv/)

Assess coverage and quality
Check for hidden duplicates
Examine sequence context

SLIDE 24

Solutions for annotating variants

SeattleSeq - http://snp.gs.washington.edu/SeattleSeqAnnotation131/
MU2A - Garla V et al 2010 Bioinformatics
Segtor - Renaud et al 2011 PLoS One
Galaxy - http://galaxy.psu.edu/
ANNOVAR - http://www.openbioinformatics.org/annovar/
Ensembl Perl API - http://www.ensembl.org

And more ...

Downstream, Upstream (5kb)
Intergenic
Intronic
Essential Splice site
5’UTR, 3’UTR
Synonymous Coding
Non-Synonymous Coding
Stop gained, Stop lost
Within non-coding gene
miRNA

SLIDE 25

Measuring how damaging a variant might be

PolyPhen: Adzhubei 2010; Sift: Kumar 2009; MutationTaster: Schwarz 2010. PolyPhen2 (http://genetics.bwh.harvard.edu/pph2/)

Rank variants on a number of characteristics

Conservation of amino acid across species e.g. GERP, phastCons, PhyloP
Assess potential for damage using e.g. Sift, PolyPhen2, MutationTaster
Manual curation i.e. variation within a known candidate gene or locus of interest

Talk on Thursday about pathway analysis

GERP: Davydov 2010, Goode 2010; phastCons and phyloP: Pollard 2009, Siepel 2005

SLIDE 26

Pre-filter:

remove duplicates
alignment length >34 or (F5 and in proper pair)
mapping quality > 14
less than 3 mismatches to reference
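
The pre-filter rules above can be expressed as a single predicate. This is a sketch of the logic as written on the slide, not QCMG's code; the `Read` fields are hypothetical names, and `is_f5` is assumed to mean that the read is the F5 tag of a SOLiD pair.

```python
from dataclasses import dataclass


@dataclass
class Read:
    is_duplicate: bool
    aligned_length: int
    is_f5: bool          # assumed: read is the F5 tag of a SOLiD pair
    proper_pair: bool
    mapq: int
    mismatches: int


def passes_prefilter(r):
    """Apply the four pre-filter rules from the slide."""
    if r.is_duplicate:
        return False
    if not (r.aligned_length > 34 or (r.is_f5 and r.proper_pair)):
        return False
    if r.mapq <= 14:          # rule is "mapping quality > 14"
        return False
    if r.mismatches >= 3:     # rule is "less than 3 mismatches"
        return False
    return True


good = Read(False, 50, False, True, 30, 1)
short_f5 = Read(False, 30, True, True, 30, 0)
short_other = Read(False, 30, False, True, 30, 0)
print(passes_prefilter(good), passes_prefilter(short_f5), passes_prefilter(short_other))
```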

qSNP:

Pileup of variants in Tumour and Normal bams
Coverage minimum of 12 reads in the normal
Calls somatic if not in pileup of matched normal
Flags if variant has been seen in the normal of another patient
Annotation using Ensembl API
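
The tumour versus normal comparison can be sketched as a small decision function. This is an illustration of the rules listed above and not qSNP's implementation; the function name, the argument names and the "germline" label for variants that are also present in the matched normal are assumptions.

```python
def classify_variant(tumour_alt_reads, normal_alt_reads, normal_coverage,
                     seen_in_other_normals=False, min_normal_coverage=12):
    """Label a candidate variant using tumour and matched normal pileup counts."""
    if tumour_alt_reads == 0:
        return "no call"
    if normal_coverage < min_normal_coverage:
        # Too few normal reads to be confident the variant is absent there
        return "insufficient normal coverage"
    if normal_alt_reads > 0:
        return "germline"
    if seen_in_other_normals:
        return "somatic (flagged: seen in another normal)"
    return "somatic"


print(classify_variant(8, 0, 30))
print(classify_variant(8, 5, 30))
print(classify_variant(8, 0, 5))
```

The minimum normal coverage matters because a variant can only be called absent from the normal if the normal was sequenced deeply enough to have shown it.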

Evaluation of variants:

> 3 novel starts supporting mutation/variant
Coverage
IGV review
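
The novel starts criterion guards against a variant being supported only by copies of the same fragment. A minimal sketch, assuming a "novel start" means a distinct alignment start position and strand among the reads supporting the variant; that definition and the function names are assumptions.

```python
def count_novel_starts(supporting_reads):
    """Count distinct (start position, strand) combinations among supporting reads."""
    return len(set(supporting_reads))


def enough_novel_starts(supporting_reads, threshold=3):
    # The slide's rule is "> 3 novel starts", so the comparison is strict
    return count_novel_starts(supporting_reads) > threshold


# Hypothetical supporting reads; the first two share a start and count once
reads = [(1000, "+"), (1000, "+"), (1012, "-"), (1020, "+"), (1031, "+")]
print(count_novel_starts(reads), enough_novel_starts(reads))
```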

Summary of workflow for mutation detection: QCMG workflow

SLIDE 27

Sequencing Basics
Small mutations, SNVs and indels
Genomic rearrangements
Copy number changes
Verification

Mutation Detection

SLIDE 28

Genomic rearrangements

Genomic variants typically arise from incorrectly repaired DNA damage

Deletions
Insertions
Translocations

Spectral Karyotype Analysis (SKY): Edwards et al 2010 Journal of Pathology

Other detection methods include:
Microscopy based: metaphase spread karyotype, fluorescent in-situ hybridisation (FISH) and spectral karyotype analysis
Array based: array CGH (comparative genomic hybridisation), high density genotyping arrays

SLIDE 29

Paired-end sequencing detection of genomic rearrangements

Deletion example

Library preparation size selection is key

Size-selected sheared genomic DNA: expected distance between reads is 200-400 bp (the size range)

A pair of reads maps across a deletion present in your sample. When the reads are mapped back to the reference genome, which does not have that deletion, the distance between the two reads is bigger than the expected size range.

Figure label: DNA repaired with some sequence missing
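
The distance test can be sketched as follows, using the 200-400 bp size range from the slide. The "possible insertion" branch for pairs that map closer together than expected is an addition for illustration, as are the function name and the example distances.

```python
def classify_pair(distance, expected_min=200, expected_max=400):
    """Classify a read pair by the distance between its mapped reads."""
    if distance > expected_max:
        # Reads map further apart on the reference than the library allows:
        # sequence between them is probably missing in the sample
        return "possible deletion"
    if distance < expected_min:
        return "possible insertion"
    return "concordant"


for d in (300, 1500, 120):
    print(d, classify_pair(d))
```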

SLIDE 30

Discordant pair mapping to identify genomic rearrangements

Figure: three sample versus reference read pair mappings

Discordant pair identification tools:
qSV – in house developed tool
BreakDancer – Chen et al 2009
SVDetect – Zeitouni et al 2010

All use the unexpected distance or orientation of read pairs to identify genomic rearrangements.

Evidence of a rearrangement is inferred from the number of abnormally mapping read pairs clustering at genomic positions.
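
A minimal sketch of the clustering step: abnormally mapping pairs are grouped by position and only groups with enough support are kept. This is not the algorithm of qSV, BreakDancer or SVDetect; the `max_gap` and `min_support` values and the positions are hypothetical.

```python
def cluster_positions(positions, max_gap=500, min_support=3):
    """Group sorted positions separated by at most max_gap; keep groups with enough members.

    Returns a list of (first position, last position, number of pairs).
    """
    clusters = []
    current = []
    for pos in sorted(positions):
        if current and pos - current[-1] > max_gap:
            if len(current) >= min_support:
                clusters.append((current[0], current[-1], len(current)))
            current = []
        current.append(pos)
    if len(current) >= min_support:
        clusters.append((current[0], current[-1], len(current)))
    return clusters


# Hypothetical mapping positions of discordant pairs on one chromosome
discordant = [1000, 1100, 1250, 5000, 9000, 9100, 9200, 9300]
clusters = cluster_positions(discordant)
print(clusters)
```

The isolated pair at 5000 is dropped, which reflects the point that a single abnormal pair is weak evidence on its own.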

SLIDE 31

Genomic rearrangements detected by soft clipping clusters

Soft clipping will often happen around breakpoints of genomic rearrangements (if your aligner supports soft clipping).

Tools have been developed that search for clusters of reads with soft clips and perform remapping:
CREST – Wang et al 2011
ClipCrop – Suzuki et al 2011

Figure: Suzuki et al 2011
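
As an illustration of the first half of that idea, the sketch below finds where soft clips start from SAM-style CIGAR strings and counts how many reads are clipped at the same reference position. It is not the CREST or ClipCrop algorithm and does no remapping; the alignments and the `min_reads` threshold are hypothetical.

```python
import re
from collections import Counter

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
REF_CONSUMING = set("MDN=X")  # CIGAR operations that advance along the reference


def clip_breakpoints(pos, cigar):
    """Return reference positions where soft clipping starts for one alignment.

    pos is the 1-based leftmost aligned position, as in SAM.
    """
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar) if op != "H"]
    if not ops:
        return []
    breakpoints = []
    if ops[0][1] == "S":
        breakpoints.append(pos)  # first aligned base after a 5' clip
    if len(ops) > 1 and ops[-1][1] == "S":
        ref_len = sum(n for n, op in ops if op in REF_CONSUMING)
        breakpoints.append(pos + ref_len - 1)  # last aligned base before a 3' clip
    return breakpoints


def clip_clusters(alignments, min_reads=3):
    """Count clipped reads per breakpoint position and keep well-supported positions."""
    counts = Counter(bp for pos, cigar in alignments for bp in clip_breakpoints(pos, cigar))
    return {bp: n for bp, n in counts.items() if n >= min_reads}


# Hypothetical (position, CIGAR) pairs
alignments = [(100, "50M26S"), (110, "40M36S"), (120, "30M46S"), (150, "20S56M"), (300, "76M")]
print(clip_clusters(alignments))
```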

SLIDE 32

Genomic rearrangement detection by split read clustering

Similarly, if your aligner allows large gaps when mapping reads back to the reference genome, split reads will be found at breakpoints of genomic rearrangements.

Pindel (Ye et al 2009) is a tool developed to search for clusters of split reads.

Diagram: Ye et al 2009; review article: Alkan 2011

SLIDE 33

Sequencing Basics
Small mutations, SNVs and indels
Genomic rearrangements
Copy number changes
Verification

Mutation Detection

SLIDE 34

Copy number mutations

Detection of copy number changes can be achieved by measuring signal intensities on genotyping SNP chips or array CGH.

e.g. Illumina SNP arrays report Log R ratio and B-allele frequency for each probe.

We can interpret these signals as changes in copy number.

They are low resolution.

Figure: Log R Ratio and B allele Frequency plots for CN=2, CN=1 and CN=3 (http://www.illumina.com)
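
To show how the two signals relate to copy number, the sketch below computes idealised expectations for CN=2, CN=1 and CN=3. It assumes a log2 ratio against two normal copies and allele fractions of b/CN for b copies of the B allele; real array values are noisier and typically compressed relative to these theoretical figures, so this is a guide to the pattern and not a calling method.

```python
from math import log2


def expected_log_r_ratio(copy_number, normal_copies=2):
    """Idealised log2 intensity ratio for a region with the given copy number."""
    return log2(copy_number / normal_copies)


def expected_baf_clusters(copy_number):
    """Idealised B allele frequencies: 0 to copy_number copies of the B allele."""
    return sorted({b / copy_number for b in range(copy_number + 1)})


for cn in (2, 1, 3):
    lrr = expected_log_r_ratio(cn)
    bafs = [round(b, 2) for b in expected_baf_clusters(cn)]
    print(f"CN={cn}: expected LRR = {lrr:+.2f}, expected BAF clusters = {bafs}")
```

The loss of the 0.5 cluster at CN=1 and its split into two clusters at CN=3 is the pattern to look for in the B allele Frequency track.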

SLIDE 35

Copy number mutations

Figure: reference genome with mapped read pairs, illustrating:
Copy Number Loss Homozygous
Copy Number Loss Heterozygous
Copy Number Gain

SLIDE 36

Copy number changes in sequencing data

Coverage depth is not even across the genome

e.g. Coverage varies across genomes due to the ‘mappability’ of repeated sequence regions. This is not evidence of copy number changes.

McKenna et al 2010 (and see Koehler et al 2010)

SLIDE 37

CNVs are often associated with an SV

CNVnator (Abyzov et al 2011) is a tool to identify copy number variants from read depth partitioning and GC content correction.

Copy number variants can also be identified in targeted capture sequencing.

Figure axis: genomic position
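
The read depth idea can be sketched with fixed-size bins: correct each bin's read count for GC content, then scale it to a copy number. This is a simplified illustration and not CNVnator's implementation, which partitions the depth signal rather than calling bins one at a time. The bin data, the one-decimal GC strata and the ploidy of 2 are assumptions.

```python
from collections import defaultdict
from statistics import median


def gc_corrected_copy_number(bins, ploidy=2):
    """Estimate copy number per bin from (read_count, gc_fraction) tuples.

    Each count is rescaled by global median / median of bins with similar GC,
    then expressed relative to the global median depth.
    """
    global_median = median(count for count, _ in bins)
    by_gc = defaultdict(list)
    for count, gc in bins:
        by_gc[round(gc, 1)].append(count)

    estimates = []
    for count, gc in bins:
        gc_median = median(by_gc[round(gc, 1)])
        corrected = count * global_median / gc_median if gc_median else 0.0
        estimates.append(ploidy * corrected / global_median)
    return estimates


# Hypothetical bins: high-GC bins have lower depth overall, and one bin in each
# GC stratum has lost a copy while one low-GC bin has gained a copy
bins = [(100, 0.4)] * 5 + [(50, 0.4), (150, 0.4)] + [(80, 0.6)] * 4 + [(40, 0.6)]
estimates = gc_corrected_copy_number(bins)
print([round(e, 2) for e in estimates])
```

Without the GC step the high-GC bins at depth 80 would look like partial losses, which is the kind of false signal the correction is meant to remove.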

SLIDE 38

Visualisation of genomic rearrangements and copy number variants

Krzywinski et al 2009

Circos is a visualisation tool for creating attractive plots of genomic data tracks

Tracks:
chromosomes
SNP array track that shows copy number gain in red and loss in green
SNP array track that shows heterozygous and homozygous regions
Centre: sequencing data

Translocations => blue
Deletions => green
Duplications => red
Inversions => orange
Intrachromosomal rearrangements => light blue

SLIDE 39

Combining copy number and genomic rearrangements

Stephens et al 2011 Cell

Copy number changes and genomic variants are often associated.

Combining these data can be challenging.

SLIDE 40

Sequencing Basics
Small mutations, SNVs and indels
Genomic rearrangements
Copy number changes
Verification

Mutation Detection

SLIDE 41

Verification of variants is an essential part of the workflow

Verify ideally with a different technology:
Capillary sequencing (Sanger method sequencing)
Non optical sequencing (Ion Torrent, Life Technologies)

Verification of variants called is a useful internal quality check.

Feedback loops should inform the whole process. This allows further development of the variant calling workflow to balance sensitivity and specificity.

SLIDE 42

Summary

Challenges in mutation detection

Poisson sampling (low coverage regions)

Mismapping reads
Sequence/ PCR errors
Allele-capture bias
Low frequency mutations
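
The Poisson sampling point can be made concrete. Assuming depth at a position follows a Poisson distribution (a simplification, since real coverage is more variable), the sketch below estimates the chance that a position falls below a minimum depth. The depths of 37 and 12 echo figures used earlier in the talk and the depth of 15 is hypothetical; all are used here only as illustrative inputs.

```python
from math import exp, factorial


def poisson_cdf(k, mean):
    """P(X <= k) for a Poisson distribution with the given mean."""
    return sum(exp(-mean) * mean**i / factorial(i) for i in range(k + 1))


def prob_below(min_depth, mean_depth):
    """Probability that a position is covered by fewer than min_depth reads."""
    return poisson_cdf(min_depth - 1, mean_depth)


for mean_depth in (37, 15):
    p = prob_below(12, mean_depth)
    print(f"mean depth {mean_depth}: P(depth < 12) = {p:.2e}")
```

Even at a comfortable mean depth some positions will be under-covered by chance, and the fraction grows quickly as the mean depth drops.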

A cyclical process can help develop or hone mutation detection

Call mutations -> Verification by alternate technology -> Evaluate sensitivity/specificity -> Update algorithms or methods -> Call mutations

SLIDE 43

Acknowledgements

QCMG Sean Grimmond Peter Wilson Deborah Gwynne Genome Biology Bioinformatics Sequencing

Nic Waddell Karin Kassahn John Pearson David Miller Nicole Cloonan Katia Nones Darrin Taylor Craig Nourse Keerthana Krishnan Shiv Hiriyur Nagaraj Scott Wood Tim Bruxner Shivangi Wani Conrad Leonard Ehsan Nourbakhsh David Wood Oliver Holmes Suzanne Manning Jason Steen Christina Xu Ivon Harliwong Alan Robertson Matt Anderson Senel Idrisoglu Kelly Quek Felicity Newell Angelika Christ Anita Steptoe Lynn Fink Sarah Song

Life Technologies

John Sheppard Emma Campbell Evgeny Glazov

SLIDE 44

Published pipeline references

Integrated annotation and analysis of genetic variants from next-generation sequencing studies with variant tools. San Lucas FA et al. Bioinformatics. 2012

Detecting and annotating genetic variations using the HugeSeq pipeline. Lam HY et al. Nat Biotechnol. 2012