Population-based Detection of Structural Variants in Normal and Aberrant Genomes.
Jean Monlong
Guillaume Bourque’s group
Genome Informatics - September 21-24, 2014 Human Genetics Dept.
Baker 2012, Nature Methods. Raphael Lab, Brown University.
Structural Variant: SV; Copy Number Variation: CNV.
Baker 2012, Nature Methods.
Low mappability
◮ Noisy or reduced signal in repeat-rich regions, centromeres, telomeres.
◮ Unpredictable segmentation → reduced sensitivity/specificity.
◮ Filtering problematic regions reduces the genome range tested.
[Figure: number of reads mapped per genomic window; samples: reference, tested.]
◮ Experiment-specific technical bias.
◮ Naive normalization (linear, quantile) is often not enough.
[Figure: proportion of the studied genome, per sample; legend "coverage": highest, lowest.]
◮ PCA-based normalization (Krumm, 2012; Boeva, 2014).
◮ Targeted normalization: linear using a subset of the genome.
[Figure: reference samples Ref1, Ref2, Ref3, Ref4 and the test sample.]
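A minimal sketch of what a targeted, linear normalization could look like. The slide only states that the normalization is linear and uses a subset of the genome. How that subset is chosen is an assumption here: the sketch picks the bins whose coverage across samples correlates best with the target bin. The names `pearson`, `targeted_normalize` and `n_support` are hypothetical.

```python
from statistics import mean

def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / (sxx * syy) ** 0.5

def targeted_normalize(counts, target_bin, n_support=2):
    """Normalize one bin across samples using a subset of supporting bins.

    counts: dict mapping bin id -> list of read counts, one per sample.
    Returns the normalized counts of `target_bin`, one per sample.
    """
    target = counts[target_bin]
    # Assumption: supporting bins are those whose coverage tracks the target bin best.
    others = [b for b in counts if b != target_bin]
    others.sort(key=lambda b: pearson(counts[b], target), reverse=True)
    support = others[:n_support]
    n_samples = len(target)
    # Coverage level of each sample, measured on the supporting bins only.
    level = [mean(counts[b][s] for b in support) for s in range(n_samples)]
    ref_level = mean(level)
    # Linear scaling: bring every sample to the same level on the supporting bins.
    return [target[s] * ref_level / level[s] if level[s] > 0 else 0.0
            for s in range(n_samples)]

counts = {
    "b1": [100, 200, 150],
    "b2": [50, 100, 75],
    "b3": [10, 20, 15],
    "b4": [30, 10, 50],
}
normalized_b1 = targeted_normalize(counts, "b1")
```

In this toy input the three samples differ only by sequencing depth on `b1`, `b2` and `b3`, so the normalized counts of `b1` come out equal across samples.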
For a sample s:
◮ For each bin b: $z = \frac{BC^{b}_{s} - BC^{b}_{reference}}{sd^{b}_{reference}}$
◮ $pv = P(|z| \le |Z|)$ with $Z \sim N(0, \sigma)$, where $\sigma$ is estimated from the z distribution across all bins.
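The two steps above can be sketched in a few lines of Python. Several details are assumptions, since the slide does not spell them out: the reference bin count is taken as the mean across reference samples, the reference sd as the sample standard deviation, and σ is estimated with a scaled median absolute deviation. The function names are hypothetical.

```python
import math
from statistics import mean, median, stdev

def bin_z_scores(sample_counts, reference_counts):
    """Per-bin z = (BC_s - BC_reference) / sd_reference.

    sample_counts: one bin count per bin for the tested sample.
    reference_counts: for each bin, the list of counts in the reference samples.
    Assumes every bin has a non-zero sd across the reference samples.
    """
    return [(bc - mean(ref)) / stdev(ref)
            for bc, ref in zip(sample_counts, reference_counts)]

def estimate_sigma(z):
    """Spread of the z distribution across all bins (assumption: scaled MAD)."""
    med = median(z)
    return 1.4826 * median(abs(v - med) for v in z)

def p_values(z, sigma):
    """Two-sided p-value P(|Z| >= |z|) for Z ~ N(0, sigma)."""
    return [math.erfc(abs(v) / (sigma * math.sqrt(2))) for v in z]

z = bin_z_scores([12, 30], [[8, 10, 12], [9, 10, 11]])
pv = p_values([0.0, 1.96], sigma=1.0)
```

A robust estimate of σ is used so that the few bins carrying true variants do not inflate the null distribution; a plain standard deviation would be a simpler alternative.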
[Figure: density of Z-scores by normalization method: targeted, median, median+variance, quantile.]
◮ Normal samples → reference samples.
◮ 2 kb bins.
◮ concordant reads: only properly paired and mapped read
◮ discordant reads: improperly mapped read pairs or low
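As an illustration of the 2 kb binning, here is a minimal sketch that counts reads per bin. It assumes the reads have already been selected (for example the concordant ones) and are given as 0-based start positions on a single chromosome; `count_reads_per_bin` is a hypothetical helper, not part of any tool named in these slides.

```python
from collections import Counter

BIN_SIZE = 2000  # 2 kb bins, as stated on the slide

def count_reads_per_bin(read_starts, bin_size=BIN_SIZE):
    """Count reads per fixed-size bin from 0-based read start positions."""
    # Integer division maps every position to the index of the bin containing it.
    return Counter(pos // bin_size for pos in read_starts)

bin_counts = count_reads_per_bin([0, 1999, 2000, 4500, 4600])
```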
[Figure: “funky snowman” plot. Tumor sample Z-score against normal sample Z-score, colored by number of bins.]
[Figure: read coverage by position (Mb) for normal sample D000GQ9, shown against the normal samples.]
Chr.10, overlapping genes (PRAP1, CALY), not detected by other approaches.
[Figure: read coverage by position (Mb) for tumor sample D000GMU, shown against the normal samples.]
Chr.1, overlapping CDC14A gene (cell division cycle), not detected by other approaches.
◮ Germline events detected in tumor samples?
◮ Consistent with SNP-array calls?
◮ Twin dataset: consistent with the pedigree?
Germline events detected in tumor samples
[Figure: number and proportion of germline events in tumor, for all events and for low mappability events; methods: FREEC, PopSV, cn.MOPS.]
PopSV detected more consistent calls than other methods with similar specificity.
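The "germline events detected in tumor" measure can be sketched as an interval-overlap proportion. This is an assumption-laden illustration: calls are represented as `(chrom, start, end)` half-open intervals, and a call from the normal sample counts as detected in the tumor if it overlaps any tumor call by at least one base. The slides do not state the overlap criterion actually used, and both function names are hypothetical.

```python
def overlaps(a, b):
    """True if two (chrom, start, end) half-open intervals share at least one base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]

def proportion_in_tumor(normal_calls, tumor_calls):
    """Fraction of calls from the normal sample that overlap a tumor call."""
    if not normal_calls:
        return 0.0
    found = sum(any(overlaps(n, t) for t in tumor_calls) for n in normal_calls)
    return found / len(normal_calls)

normal_calls = [("chr1", 0, 2000), ("chr1", 4000, 6000), ("chr2", 0, 2000)]
tumor_calls = [("chr1", 1000, 3000), ("chr2", 5000, 7000)]
proportion = proportion_in_tumor(normal_calls, tumor_calls)
```

A stricter criterion, such as reciprocal overlap above a threshold, would lower the proportion for every method; the any-overlap rule is simply the easiest to state.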
[Figure: CNV frequency in normals against distance to centromere/telomere/gap (Mb); methods: cn.MOPS, FREEC, PopSV.]
◮ Discordant reads support SVs.
◮ Goal: robust detection of an excess of discordant reads
◮ Challenging to estimate a background/expected model.
◮ Heterogeneous coverage ⇒ hybrid Poisson-Normal Z-score.
◮ Targeted normalization from PopSV on concordant reads.
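The slide does not define the hybrid Poisson-Normal Z-score, so the following is only one plausible reading, stated as an assumption: bins with a low expected count of discordant reads are scored with a Poisson upper-tail probability converted to a Z-score, and bins with a higher expected count use the usual Normal Z-score. The threshold `low_mean` and both function names are hypothetical.

```python
import math
from statistics import NormalDist, mean, stdev

def poisson_upper_tail(k, lam):
    """P(X >= k) for X ~ Poisson(lam), computed as 1 - P(X <= k - 1)."""
    term = math.exp(-lam)  # pmf at 0
    cdf = 0.0
    for i in range(k):
        cdf += term
        term *= lam / (i + 1)  # pmf at i + 1 from pmf at i
    return max(1.0 - cdf, 0.0)

def hybrid_z(count, reference_counts, low_mean=5.0):
    """Z-score for an excess of discordant reads in one bin.

    Assumption: Poisson scoring when the reference mean is below `low_mean`,
    Normal scoring otherwise.
    """
    lam = mean(reference_counts)
    if lam < low_mean:
        p = poisson_upper_tail(count, lam)
        # Keep p strictly inside (0, 1) so the inverse CDF is defined.
        p = min(max(p, 1e-300), 1.0 - 1e-12)
        return -NormalDist().inv_cdf(p)
    return (count - lam) / stdev(reference_counts)

z_low = hybrid_z(6, [1, 0, 2, 1])    # low expected count: Poisson branch
z_high = hybrid_z(26, [18, 20, 22])  # higher expected count: Normal branch
```

Putting both branches on the Z-score scale keeps bins with very different coverage comparable, which is the motivation the slide gives for a hybrid score.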
[Figure: proportion of BreakDancer calls by number of supporting reads in BreakDancer; groups: BreakDancer only, BreakDancer + PopSV.]
BreakDancer: SV caller using paired-end mapping information (Chen, 2009).
◮ Superior to other Read-Depth methods.
◮ Wider range of the genome tested.
◮ Detection in low mappability regions and partial tumoral
◮ More than a CNV caller.
◮ Excess of discordant read pairs.
◮ Combination with orthogonal approaches (PEM, Assembly).
◮ Custom binning: repeat annotation, Whole-Exome
◮ Guillaume Bourque
◮ Mathieu Bourgey
◮ Louis Letourneau
◮ Francois Lefebvre
◮ Eric Audemard
◮ Toby Hocking
◮ Simon Girard
◮ Simon Gravel
◮ Mathieu Blanchette
◮ Mehran Karimzadeh Reghbati
[Figure: proportion of SNP-array GS events also in WGS calls, loose and stringent; methods: FREEC, PopSV, cn.MOPS.]
[Figure: number of events by copy number estimate.]
[Figure: count of bins by z, panels "pca" and "tn", for samples D000GNY, D000GO1, D000GOC, D000GQK.]
[Figure: proportion of BreakDancer calls by class of the overlapping repeat (None, Simple_repeat, Satellite, DNA, LTR, SINE, LINE); groups: BreakDancer only, BreakDancer + PopSV.]
BreakDancer: SV caller using paired-end mapping information (Chen, 2009).