Genotyping structural variants in pangenome graphs using the vg toolkit
Jean Monlong November 7, 2019
Genome Informatics
Pangenome graphs and variant-aware read mapping
[Figure: sequencing reads aligned to a linear reference genome]
Introduction 2
[Figure: read alignment example annotated with 1KGP variants]
Introduction 3
[Figure: linear reference genome vs. variation graph, with deletion and insertion sites labeled]
Introduction 4
[Figure: SV size distribution in the HGSVC and gnomAD-SV catalogs. Axes: size (bp) and variant; color: SV type (DEL, INS)]
Goal 5
Garrison et al. Nature Biotech 2018
Hickey et al. bioRxiv 2019
From SV catalogs in human 6
From SV catalogs in human 7
[Figure: read mapping to a linear reference vs. a graph reference. With the linear reference, reads from the insertion are not mapped. With the graph reference, reads map either to the reference path or to the variant path.]
Genotypes follow from the path coverage ratio in each snarl:
Snarl 1: path coverage ratio 1:1.6 → het
Snarl 2: path coverage ratio 0:2 → hom
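The coverage ratios on this slide (1:1.6 → het, 0:2 → hom) can be illustrated with a minimal sketch. This is not vg's actual genotyping model: the function name `call_genotype` and the `min_fraction` threshold are hypothetical, chosen only to show how two path coverages could be turned into a genotype.

```python
def call_genotype(ref_cov, alt_cov, min_fraction=0.2):
    """Toy genotype call from read coverage on two paths of a snarl.

    ref_cov, alt_cov: read coverage on the reference and variant paths.
    min_fraction: hypothetical cutoff; a path supported by less than this
    fraction of the total coverage is treated as absent.
    """
    total = ref_cov + alt_cov
    if total == 0:
        return "./."  # no reads on either path: no call
    alt_fraction = alt_cov / total
    if alt_fraction < min_fraction:
        return "0/0"  # only the reference path is supported
    if alt_fraction > 1 - min_fraction:
        return "1/1"  # only the variant path is supported (hom)
    return "0/1"      # both paths are supported (het)


print(call_genotype(1, 1.6))  # ratio 1:1.6
print(call_genotype(0, 2))    # ratio 0:2
```

A fixed fraction cutoff keeps the example short; a real genotyper would typically weigh the coverage with a statistical model of read depth and mapping error.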
From SV catalogs in human 8
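SV matching criteria used for evaluation:
Deletions: at least 50% coverage and 10% reciprocal overlap. Calls with <50% coverage or <10% reciprocal overlap are not matched.
Insertions: at least 50% of inserted sequence matching nearby insertions (within 20 bp).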
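The deletion criterion (at least 50% coverage and 10% reciprocal overlap) can be sketched as below. This is one possible reading of the slide, not the evaluation code used in the study: intervals are assumed to be half-open `(start, end)` pairs on the same chromosome, and all function names are hypothetical.

```python
def overlap_len(a, b):
    """Number of bases shared by two half-open intervals (start, end)."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))


def reciprocal_overlap(a, b):
    """Overlap as a fraction of the longer of the two intervals."""
    ov = overlap_len(a, b)
    return min(ov / (a[1] - a[0]), ov / (b[1] - b[0]))


def covered_fraction(query, others):
    """Fraction of `query` covered by the union of `others`."""
    clipped = sorted(
        (max(s, query[0]), min(e, query[1]))
        for s, e in others
        if overlap_len(query, (s, e)) > 0
    )
    covered, cur_end = 0, query[0]
    for s, e in clipped:
        s = max(s, cur_end)  # skip bases already counted
        if e > s:
            covered += e - s
            cur_end = e
    return covered / (query[1] - query[0])


def deletion_is_matched(call, truth_set, min_cov=0.5, min_ro=0.1):
    """Keep truth deletions with enough reciprocal overlap, then require
    that together they cover enough of the called deletion."""
    candidates = [t for t in truth_set if reciprocal_overlap(call, t) >= min_ro]
    return covered_fraction(call, candidates) >= min_cov


print(deletion_is_matched((100, 200), [(100, 130), (130, 160)]))
print(deletion_is_matched((100, 200), [(190, 1000)]))
```

Filtering on reciprocal overlap first prevents a very large truth deletion from "covering" many small calls; coverage is then computed from the remaining candidates, so a call fragmented across several truth deletions can still match.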
From SV catalogs in human 9
[Figure: F1 score (0 to 1) for insertions (INS) and deletions (DEL), whole-genome and in non-repeat regions. Methods: vg, Paragraph, BayesTyper (graph-based SV genotypers); Delly, SVTyper (traditional SV genotypers)]
Non-repeat regions: regions not overlapping segmental duplications or simple repeats
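For reference, the F1 score reported in these comparisons is the harmonic mean of precision and recall. The sketch below uses the standard definitions; the function name and the example counts are hypothetical and not taken from the study.

```python
def precision_recall_f1(tp, fp, fn):
    """Standard precision, recall and F1 from counts of true positive,
    false positive and false negative SV calls."""
    precision = tp / (tp + fp) if tp + fp > 0 else 0.0
    recall = tp / (tp + fn) if tp + fn > 0 else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts: 80 matched calls, 20 unmatched calls, 20 missed SVs.
p, r, f1 = precision_recall_f1(80, 20, 20)
print(f"precision={p:.2f} recall={r:.2f} F1={f1:.2f}")
```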
From SV catalogs in human 10
[Figure: F1 score (0 to 0.8) for insertions (INS) and deletions (DEL), whole-genome and in non-repeat regions. Methods: vg, Paragraph, BayesTyper (graph-based SV genotypers); Delly, SVTyper (traditional SV genotypers)]
From SV catalogs in human 11
[Figure: precision vs. recall by repeat class/family (SINE/Alu, LTR/ERV1, LINE/L1, Retroposon/SVA, Low_complexity, Satellite, Satellite/centr, Simple_repeat) and SV type (INS, DEL)]
SV sequence annotated with RepeatMasker. Class assigned if covered ≥80% by a repeat element.
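The ≥80% rule for assigning a repeat class can be sketched as follows. This assumes the annotation has already been parsed into `(start, end, repeat_class)` hits in SV-sequence coordinates (0-based, half-open), and that a single repeat element must reach the threshold; both the input layout and the function name are hypothetical.

```python
def assign_repeat_class(sv_length, hits, min_fraction=0.8):
    """Return the class of the repeat element covering the largest part of
    the SV sequence, if it covers at least `min_fraction` of it.

    hits: iterable of (start, end, repeat_class) tuples, with coordinates
    relative to the SV sequence (hypothetical pre-parsed annotation).
    """
    best_class, best_cov = None, 0
    for start, end, repeat_class in hits:
        # Clip the hit to the SV sequence before measuring coverage.
        cov = max(0, min(end, sv_length) - max(start, 0))
        if cov > best_cov:
            best_class, best_cov = repeat_class, cov
    if sv_length > 0 and best_cov / sv_length >= min_fraction:
        return best_class
    return None  # no single element covers enough of the sequence


print(assign_repeat_class(300, [(5, 290, "SINE/Alu")]))
print(assign_repeat_class(300, [(0, 100, "LINE/L1"), (150, 250, "SINE/Alu")]))
```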
From de novo assemblies in yeast 12
VCF v4.2 specs
From de novo assemblies in yeast 13
[Diagram: assemblies for 5 yeast strains are used to build two graphs with vg. VCF graph: pairwise alignment with reference → VCF → vg. Cactus graph: Cactus aligner → vg. Illumina reads are mapped to each graph (vg map) to compare mapping metrics, then genotyped (vg genotype) to produce a VCF per graph.]
From de novo assemblies in yeast 14
[Diagram: same workflow as on the previous slide, with short reads as input and an evaluation step added at the end]
From de novo assemblies in yeast 15
[Figure: average mapping identity per strain, VCF graph vs. Cactus graph (both axes from 0.4 to 0.8). Strains: UFRJ50816, YPS138, N44, CBS432, UWOPS034614, YPS128, Y12, SK1, DBVPG6765, DBVPG6044. Legend: included during graph construction; clade (paradoxus)]
Conclusions and future directions 16
Hickey et al. bioRxiv 2019 https://jmonlong.github.io/manu-vgsv/
Acknowledgment 17
19
[Figure: average number of genotyped calls per truth call (0.9 to 1.2), by method (Paragraph, BayesTyper, SVTyper, Delly Genotyper, SMRT-SV v2 Genotyper), experiment (GIAB, CHM-PD, SVPOP) and type (DEL). Inset labels: S1, S2, S3; Truth set, S1 Paragraph, S1 vg]
20
[Figure: reads mapped to the graph at a deletion in the 3' UTR of the LONRF2 gene (GRCh38, chr2)]
51 bp homozygous deletion in the 3’ UTR of the LONRF2 gene.
21
[Figure: best F1 (0 to 1) vs. depth (1, 3, 7, 13, 20) for INS, DEL and INV, with true SVs in the VCF and with errors in the VCF. Methods: vg, Paragraph, BayesTyper, SVTyper, Delly Genotyper]
22
[Figure: best F1 (0 to 1) for INS and DEL across datasets (HGSVC simulated reads, HGSVC real reads, GIAB, CHM-PD, SVPOP), in all and non-repeat genomic regions. Methods: vg, Paragraph, BayesTyper, SVTyper, Delly Genotyper, SMRT-SV v2 Genotyper. SV evaluation: presence, genotype]
23
[Figure: precision vs. recall for DEL (both axes from 0 to 1), by genomic regions (non-repeat). Methods: Paragraph, BayesTyper, SVTyper, Delly Genotyper]
24
[Figure: F1 score (0 to 1) by SV size (bp), binned up to >5K. Panels: INS all regions, INS non-repeat regions, DEL all regions, DEL non-repeat regions. Datasets: HGSVC simulated reads, HGSVC real reads, GIAB]
25
[Figure: average mapping quality per strain, VCF graph vs. Cactus graph (both axes from 20 to 40). Strains: YPS138, N44, CBS432, UWOPS034614, YPS128, Y12, SK1, DBVPG6765, DBVPG6044. Legend: included during graph construction; clade (paradoxus)]