Genotyping structural variants in TOPMed using pangenome graphs
Jean Monlong February 12-13, 2020
GSP-TOPMed Analysis Workshop
Pangenome graphs and variant-aware read mapping
[Figure: sequencing reads (T G G) aligned to a linear reference genome (A C A T T G G C)]
Introduction 2
[Figure: sequencing reads (T T G G G G G C C) shown with 1KGP variants]
Introduction 3
[Figure: a linear reference genome compared with a variation graph; deletions and an insertion are labeled]
Introduction 4
Garrison et al. Nature Biotech 2018
Hickey et al. bioRxiv 2019, in press at Genome Biology
SV genotyping with vg 5
[Figure: SV genotyping workflow for sample HG00514 (labels: HG00514, VCF)]
SV genotyping with vg 6
[Figure: F1 score (axis 0.0 to 0.8) for insertions (INS) and deletions (DEL), genome-wide and in non-repeat regions. Tools compared: vg, Paragraph, BayesTyper, Delly, SVTyper, grouped as graph-based SV genotypers and traditional SV genotypers.]
Non-repeat regions: regions not overlapping segmental duplications or simple repeats.
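The F1 score in the benchmark above combines precision and recall into a single number. The sketch below shows the standard formula. It assumes genotyped SVs have already been matched against a truth set and counted as true positives, false positives, and false negatives; the function name and the example counts are hypothetical and are not taken from the slides.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) from matched-call counts.

    tp: calls that match a truth-set SV
    fp: calls with no match in the truth set
    fn: truth-set SVs that were not called
    Empty denominators yield 0.0 rather than raising an error.
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall,
    # which simplifies to 2*TP / (2*TP + FP + FN).
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical counts, for illustration only.
precision, recall, f1 = precision_recall_f1(tp=8, fp=2, fn=2)
print(f"precision={precision:.2f} recall={recall:.2f} F1={f1:.2f}")
```

How calls are matched to the truth set (for example by overlap or sequence similarity) determines the counts and is not covered by this sketch.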
SV genotyping with vg 7
[Figure: number of variants by size (bp, from 50 to 100,000), split by SV type (DEL, INS)]
SV genotyping in the BioData Catalyst ecosystem 8
SV genotyping in the BioData Catalyst ecosystem 9
SV genotyping in the BioData Catalyst ecosystem 10
SV genotyping in the BioData Catalyst ecosystem 11
SVs across 760 samples 12
SVs across 760 samples 13
SVs across 760 samples 14
Screenshots from https://gnomad.broadinstitute.org/
SVs across 760 samples 15
[Figure: deletions in the SV catalog built from samples S1, S2, S3, compared with the truth set and with the Paragraph and vg genotypes for sample 1]
SVs across 760 samples 16
[Figure: genome browser view of hg38 chr11:639,600-640,400 at DRD4 (OMIM 126452), with GENCODE v32, OMIM Genes, RepeatMasker, and Simple Tandem Repeats (TRF) tracks]
Conclusions and future directions 17
Short reads
vg
Short-read studies: gnomAD, TOPMed SV-WG
TOPMed
GRCh38 Human Pangenome
High-quality phased assemblies
Long-read studies: HGSVC, SVPOP, GIAB
Phenotypes
SV genotypes: 0/1 0/0 0/0 1/1 0/1 0/0 0/1 1/1 0/0 0/1 0/1 0/0 1/1 0/1
vg
Association study Annotated SV catalog
Structural Variation
Acknowledgment 18
20
[Figure: with a linear reference, reads from an insertion are not mapped; with a graph reference, reads map to either the reference path or the variant path at each insertion or deletion]
Snarl 1: path coverage ratio 1:1.6 → het
Snarl 2: path coverage ratio 0:2 → hom
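The two snarl examples above can be reproduced with a simple rule on the coverage of the reference and variant paths. The sketch below is an illustration only, not vg's genotyping model: it assumes the ratio is written reference:variant, and the function name and the `min_frac` threshold are hypothetical.

```python
def genotype_from_path_coverage(ref_cov, alt_cov, min_frac=0.2):
    """Call a diploid genotype from read coverage on the two paths of a snarl.

    ref_cov: read coverage on the reference path
    alt_cov: read coverage on the variant path
    min_frac: hypothetical threshold; a path needs at least this fraction
              of the total coverage to count as present
    """
    total = ref_cov + alt_cov
    if total == 0:
        return "./."  # no reads on either path: no call
    alt_frac = alt_cov / total
    if alt_frac < min_frac:
        return "0/0"  # almost all reads on the reference path
    if alt_frac > 1 - min_frac:
        return "1/1"  # almost all reads on the variant path
    return "0/1"      # both paths supported: heterozygous


print(genotype_from_path_coverage(1, 1.6))  # Snarl 1: het
print(genotype_from_path_coverage(0, 2))    # Snarl 2: hom
```

A fixed threshold is the simplest choice for a sketch; a real genotyper would typically also account for sequencing depth and mapping quality.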
21
[Figure: F1 score (axis 0.00 to 1.00) for insertions (INS) and deletions (DEL), genome-wide and in non-repeat regions. Tools compared: vg, Paragraph, BayesTyper, Delly, SVTyper, grouped as graph-based SV genotypers and traditional SV genotypers.]
Non-repeat regions: regions not overlapping segmental duplications or simple repeats.
22
[Figure: reads aligned to the graph at a deletion in the 3’ UTR of the LONRF2 gene (GRCh38, chr2)]
51 bp homozygous deletion in the 3’ UTR of the LONRF2 gene.
23
[Figure: precision vs. recall (both 0.00 to 1.00) by repeat class/family (SINE/Alu, LTR/ERV1, LINE/L1, Retroposon/SVA, Low_complexity, Satellite, Satellite/centr, Simple_repeat) and SV type (INS, DEL)]
SV sequence annotated with RepeatMasker. Class assigned if covered ≥80% by a repeat element.
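The ≥80% coverage rule above can be sketched as follows. This is one possible reading of the rule, not the code used for the slides: it assumes RepeatMasker hits have been parsed into `(start, end, repeat_class)` tuples in 0-based half-open coordinates on the SV sequence, and that a single element must cover the required fraction on its own. The function name and input layout are hypothetical.

```python
def assign_repeat_class(sv_len, hits, min_cov=0.8):
    """Return the repeat class covering at least min_cov of the SV, else None.

    sv_len: length of the SV sequence in bp
    hits: list of (start, end, repeat_class) annotations on the SV sequence,
          0-based half-open (assumed input layout)
    """
    if sv_len <= 0:
        raise ValueError("sv_len must be positive")
    best_class, best_cov = None, 0.0
    for start, end, repeat_class in hits:
        # Clip the annotation to the SV sequence before measuring it.
        start, end = max(0, start), min(sv_len, end)
        cov = max(0, end - start) / sv_len
        if cov > best_cov:
            best_class, best_cov = repeat_class, cov
    return best_class if best_cov >= min_cov else None


# Hypothetical annotations, for illustration only.
print(assign_repeat_class(300, [(5, 295, "SINE/Alu")]))
print(assign_repeat_class(1000, [(0, 300, "LINE/L1"), (400, 600, "Simple_repeat")]))
```

If several elements of the same class should be pooled before applying the threshold, the per-element coverage would need to be summed per class instead.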