
SLIDE 1

Population-based Detection of Structural Variants in Normal and Aberrant Genomes.

Jean Monlong

Guillaume Bourque’s group

Genome Informatics, September 21-24, 2014

Human Genetics Dept.


SLIDE 2

Structural variation

Genetic variation involving more than 500bp.

Baker 2012, Nature Methods. Raphael Lab, Brown University.

Structural Variant: SV; Copy Number Variation: CNV.


SLIDE 3

SV detection using High-Throughput Sequencing

Baker 2012, Nature Methods.

SLIDE 4

Limitation

Low mappability

◮ Noisy or reduced signal in repeat-rich regions, centromeres, telomeres.
◮ Unpredictable segmentation → reduced sensitivity/specificity.
◮ Filtering problematic regions reduces the genome range tested.

[Figure: number of reads mapped (y-axis) per genomic window (x-axis).]

SLIDE 5

Objective

Test the entire genome, including low-mappability regions, and detect subtle abnormal coverage.

PopSV: Population-based approach

Use a set of reference experiments to detect abnormal patterns.

[Figure: number of reads mapped (y-axis) per genomic window (x-axis); samples are labeled as reference or tested.]

SLIDE 6

PopSV: Population-based approach

[Figure: number of reads mapped (y-axis) per genomic window (x-axis); samples are labeled as reference or tested.]

Workflow

1. Genome is fragmented in bins.
2. Reads in each bin are counted, for each sample.
3. Normalization of the bin counts.
4. Each sample and each bin is tested for divergence from reference samples (Z-score).
5. P-value estimation and multiple test correction.
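Steps 1 to 3 of the workflow can be sketched as follows. This is a minimal illustration, not the PopSV implementation: the function names, the 0-based read start positions, and the use of a plain median rescaling for step 3 are assumptions (the slides go on to argue that such naive normalization is often not enough).

```python
from statistics import median

BIN_SIZE = 2000  # assumed bin width in bp


def count_reads_per_bin(read_starts, chrom_length, bin_size=BIN_SIZE):
    """Steps 1-2: cut [0, chrom_length) into fixed bins and count read starts in each."""
    n_bins = (chrom_length + bin_size - 1) // bin_size
    counts = [0] * n_bins
    for pos in read_starts:
        if 0 <= pos < chrom_length:
            counts[pos // bin_size] += 1
    return counts


def median_normalize(counts, target=1.0):
    """Step 3, naive version: rescale so the sample's median bin count equals target."""
    med = median(counts)
    if med == 0:
        raise ValueError("median bin count is zero; cannot normalize")
    return [c * target / med for c in counts]


# Toy chromosome of 6 kb with eight read start positions (hypothetical data).
counts = count_reads_per_bin(
    [10, 1500, 2100, 2500, 3999, 4100, 5000, 5999], chrom_length=6000
)
normalized = median_normalize(counts)
```

Running the same two functions on every sample gives one vector of normalized bin counts per sample, which is the input that steps 4 and 5 compare against the reference samples.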


SLIDE 7

PopSV: importance of normalization

◮ Experiment-specific technical bias.
◮ Naive normalization (linear, quantile) is often not enough.

[Figure: proportion of the studied genome (y-axis) per sample (x-axis); legend: coverage, highest or lowest.]

SLIDE 8

PopSV: importance of normalization

◮ PCA-based normalization (Krumm, 2012; Boeva, 2014).
◮ Targeted normalization: linear using a subset of the genome.

[Figure: reference samples Ref1 to Ref4 and a test sample.]
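The slide describes targeted normalization only as linear and based on a subset of the genome. The sketch below illustrates that idea under stated assumptions: the subset of bins is passed in by the caller (how PopSV chooses it is not given here), and the function names are hypothetical.

```python
from statistics import mean


def targeted_scale_factor(test_counts, ref_counts, subset):
    """Linear factor matching the test sample to the references on `subset` only.

    `subset` is a list of bin indices; its selection is an assumption left to the caller.
    """
    ref_level = mean(mean(ref[i] for i in subset) for ref in ref_counts)
    test_level = mean(test_counts[i] for i in subset)
    if test_level == 0:
        raise ValueError("no coverage in the chosen subset of bins")
    return ref_level / test_level


def normalize_bin(test_counts, ref_counts, bin_index, subset):
    """Rescale one bin of the test sample with the factor learned on `subset`."""
    factor = targeted_scale_factor(test_counts, ref_counts, subset)
    return test_counts[bin_index] * factor


# Hypothetical data: the test sample is sequenced about twice as deep as the
# references and has lost coverage in the last bin.
refs = [[100, 110, 90, 100], [102, 108, 92, 98]]
test = [200, 220, 180, 100]
norm = normalize_bin(test, refs, bin_index=3, subset=[0, 1, 2])
```

Because the factor is learned only on the supporting bins, the drop in the last bin survives normalization (about 50 against a reference level near 100) instead of being diluted by a genome-wide scaling.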

SLIDE 9

PopSV: Z-score and test

For a sample s:

◮ For each bin b: $z = \dfrac{BC^{b}_{s} - BC^{b}_{\mathrm{reference}}}{sd^{b}_{\mathrm{reference}}}$

◮ $pv = P(|z| \le |Z|)$ with $Z \sim N(0, \sigma)$, where $\sigma$ is estimated from the distribution of z across all bins.
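The test above can be sketched in a few functions. Several choices here are assumptions not stated on the slide: the reference term is taken as the mean bin count across reference samples, σ is estimated with a scaled median absolute deviation, and the multiple test correction is Benjamini-Hochberg. All function names are hypothetical.

```python
from statistics import NormalDist, mean, median, stdev


def z_scores(sample_counts, ref_counts):
    """Per bin: (sample count - reference mean) / reference standard deviation."""
    zs = []
    for b, bc in enumerate(sample_counts):
        ref_b = [ref[b] for ref in ref_counts]
        sd = stdev(ref_b)
        # Bins with no variation in the references cannot be tested; score them 0.
        zs.append((bc - mean(ref_b)) / sd if sd > 0 else 0.0)
    return zs


def robust_sigma(zs):
    """Assumed estimator of sigma: 1.4826 * median absolute deviation of z."""
    center = median(zs)
    return 1.4826 * median(abs(z - center) for z in zs)


def two_sided_pvalues(zs, sigma):
    """pv = P(|z| <= |Z|) with Z ~ N(0, sigma)."""
    null = NormalDist(0.0, sigma)
    return [2.0 * null.cdf(-abs(z)) for z in zs]


def benjamini_hochberg(pvals):
    """Adjusted p-values controlling the false discovery rate."""
    n = len(pvals)
    order = sorted(range(n), key=lambda i: pvals[i])
    adjusted = [0.0] * n
    running_min = 1.0
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical normalized bin counts: three references, one tested sample.
refs = [
    [100, 98, 102, 101, 99, 100],
    [104, 101, 99, 97, 103, 98],
    [96, 101, 99, 102, 98, 102],
]
sample = [101, 99, 102, 98, 150, 101]

zs = z_scores(sample, refs)
pvals = two_sided_pvalues(zs, robust_sigma(zs))
adjusted = benjamini_hochberg(pvals)
```

In this toy input only the fifth bin, where the tested sample has 150 reads against roughly 100 in the references, remains significant after correction.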

[Figure: density of Z-scores for each normalization: targeted, median, median+variance, quantile.]

SLIDE 10

Application

CageKid: Renal Cell Cancer

Whole-Genome Sequencing of 100 individuals, ∼40X coverage, Illumina paired-end 100bp, normal and tumor paired samples.

◮ Normal samples → reference samples.
◮ 2kb bins.

Read-Depth measure - 2 strategies

◮ concordant reads: only properly paired and mapped read pairs.
◮ discordant reads: improperly mapped read pairs or low mapping quality.
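The two read classes can be separated from the FLAG and MAPQ fields of each alignment record. The sketch below uses the standard SAM flag bits; the mapping quality threshold of 30 and the decision to set unmapped reads aside are assumptions, since the slide gives neither.

```python
# Standard SAM FLAG bits.
PAIRED = 0x1
PROPER_PAIR = 0x2
UNMAPPED = 0x4

MIN_MAPQ = 30  # assumed cutoff for "low mapping quality"


def classify_read(flag, mapq, min_mapq=MIN_MAPQ):
    """Return 'concordant', 'discordant' or 'unmapped' for one alignment record."""
    if flag & UNMAPPED:
        return "unmapped"  # assumption: unmapped reads belong to neither class
    if (flag & PAIRED) and (flag & PROPER_PAIR) and mapq >= min_mapq:
        return "concordant"
    return "discordant"


# Hypothetical (flag, mapq) pairs as they would appear in a SAM file.
labels = [classify_read(f, q) for f, q in [(99, 60), (97, 60), (99, 5), (77, 0)]]
```

Counting each class separately per bin gives the two read-depth measures used in the rest of the talk.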


SLIDE 11

Using concordant reads

[Figure: tumor sample Z-score versus normal sample Z-score; color gives the number of bins.]

“funky snowman” plot


SLIDE 12

Example: Telomeric region

[Figure: read coverage along position (Mb) for normal sample D000GQ9 compared with the normal samples; bins are marked abnormal or normal.]

Chr.10, overlapping genes (PRAP1, CALY), not detected by other approaches.

SLIDE 13

Example: Partial tumoral event

[Figure: read coverage along position (Mb) for tumor sample D000GMU compared with the normal samples; bins are marked abnormal or normal.]

Chr.1, overlapping CDC14A gene (cell division cycle), not detected by other approaches.

SLIDE 14

Validation and benchmark

◮ Germline events detected in tumor samples?
◮ Consistent with SNP-array calls?
◮ Twin dataset: consistent with the pedigree?

Germline events detected in tumor samples

[Figure: number and proportion of germline events in tumor for cn.MOPS, FREEC and PopSV, shown for all events and for low-mappability regions.]

PopSV detected more consistent calls than other methods with similar specificity.
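The germline check can be illustrated with a simple interval comparison between the calls of a normal sample and those of its paired tumor. This is a sketch under assumptions: calls are (chromosome, start, end) tuples with half-open coordinates, and any shared base counts as a match. The slides do not state the matching criterion actually used.

```python
def overlaps(a, b):
    """True when two (chrom, start, end) half-open intervals share at least one base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def germline_recovery(normal_calls, tumor_calls):
    """Number and proportion of normal-sample calls also found in the paired tumor."""
    if not normal_calls:
        raise ValueError("no calls in the normal sample")
    found = sum(any(overlaps(n, t) for t in tumor_calls) for n in normal_calls)
    return found, found / len(normal_calls)


# Hypothetical calls for one normal/tumor pair.
normal = [("chr1", 1000, 3000), ("chr1", 10000, 12000), ("chr2", 500, 2500)]
tumor = [("chr1", 2000, 4000), ("chr2", 0, 400), ("chr1", 50000, 60000)]
n_found, proportion = germline_recovery(normal, tumor)
```

A germline variant should be present in both samples of a pair, so a higher proportion at a comparable number of calls points to more reproducible detection.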


SLIDE 15

Centromere/telomere/gap and systematic errors

[Figure: CNV frequency in normals versus distance to centromere/telomere/gap (Mb), for cn.MOPS, FREEC and PopSV.]

SLIDE 16

PopSV using discordant reads

◮ Discordant reads support SVs.
◮ Goal: robust detection of an excess of discordant reads genome-wide.
◮ Challenging to estimate a background/expected model.

Usage

PopSV flags abnormal regions for further characterization using orthogonal approaches.

Discordant versus concordant reads

◮ Heterogeneous coverage ⇒ hybrid Poisson-Normal Z-score.
◮ Targeted normalization from PopSV on concordant reads.


SLIDE 17

PopSV and BreakDancer

[Figure: proportion of BreakDancer calls by number of supporting reads in BreakDancer, for calls made by BreakDancer only versus BreakDancer + PopSV.]

BreakDancer: SV caller using paired-end mapping information (Chen, 2009).

SLIDE 18

Conclusion

PopSV: Robust and sensitive approach

◮ Superior to other Read-Depth methods.
◮ Wider range of the genome tested.
◮ Detection in low mappability regions and partial tumoral signal.

Work in progress

◮ More than a CNV caller.
  ◮ Excess of discordant read pairs.
  ◮ Combination with orthogonal approaches (PEM, Assembly).
◮ Custom binning: repeat annotation, Whole-Exome Sequencing.


SLIDE 19

Acknowledgment

◮ Guillaume Bourque
◮ Mathieu Bourgey
◮ Louis Letourneau
◮ Francois Lefebvre
◮ Eric Audemard
◮ Toby Hocking
◮ Simon Girard
◮ Simon Gravel
◮ Mathieu Blanchette
◮ Mehran Karimzadeh Reghbati


SLIDE 20

Thank you!


SLIDE 21

SNP-array concordance

[Figure: proportion of SNP-array GS events also in WGS calls, for cn.MOPS, FREEC and PopSV, under loose and stringent criteria.]

SLIDE 22

Copy-number distribution

[Figure: number of events (y-axis) by copy number estimate (x-axis).]

SLIDE 23

PCA vs Targeted normalization in tumor samples

[Figure: count (y-axis) by z (x-axis) for tumor samples D000GNY, D000GO1, D000GOC and D000GQK, under PCA (pca) and targeted (tn) normalization.]

SLIDE 24

PopSV and BreakDancer

[Figure: proportion of BreakDancer deletion (DEL) calls by class of the repeat overlapping the call (None, Simple_repeat, Satellite, DNA, LTR, SINE, LINE), for BreakDancer only versus BreakDancer + PopSV.]

BreakDancer: SV caller using paired-end mapping information (Chen, 2009).