Genotyping structural variants in pangenome graphs using the vg toolkit

Jean Monlong, November 7, 2019, Genome Informatics


slide-1
SLIDE 1

Genotyping structural variants in pangenome graphs using the vg toolkit

Jean Monlong November 7, 2019

Genome Informatics

slide-2
SLIDE 2

Introduction 2

Pangenome graphs and variant-aware read mapping

[Figure: a linear reference genome shown next to a variation graph that includes 1KGP variants, with sequencing reads (Seq. reads) aligned to each.]
slide-3
SLIDE 3

Introduction 3

Mapping reads across structural variants

Structural variants are genomic variants larger than 50 bp, e.g. insertions, deletions, inversions, and translocations.

[Figure: linear reference genome and variation graph, with deletion and insertion sites labeled.]

slide-4
SLIDE 4

Introduction 4

SV catalogs from long-read sequencing studies

Ref.                  Project                                                 Samples
Chaisson et al. 2019  Human Genome Structural Variation Consortium (HGSVC)    3
Audano et al. 2019    SVPOP                                                   15
Zook et al. 2019      Genome in a Bottle (GIAB)                               1

[Figure: number of variants by size (bp, log scale from 10 to 100,000) for HGSVC and gnomAD-SV, colored by SV type (DEL, INS).]

slide-5
SLIDE 5

Goal 5

The vg toolkit is a complete, open source solution for graph construction, read mapping, and variant calling. https://github.com/vgteam/vg

Garrison et al. Nature Biotech 2018

Can we genotype SVs from short-read sequencing datasets with the vg toolkit?

Starting from public SV catalogs or de novo assemblies.

Hickey et al. bioRxiv 2019

slide-6
SLIDE 6

From SV catalogs in human 6

Genotyping public SV catalogs in human

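[Workflow: vg builds a graph from GRCh38 and the HGSVC SV catalog (VCF); short reads from HG00514 are mapped to the graph and genotyped with vg; the genotyped SVs (VCF) for HG00514 are then evaluated.]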

Evaluate genotype predictions for a sample from the truth set (e.g. HG00514).

slide-7
SLIDE 7

From SV catalogs in human 7

Genotyping variants in vg

[Figure: linear reference compared with graph reference. Reads from an insertion are not mapped on the linear reference. In the graph, which contains an insertion and a deletion, reads map either to the reference path or to the variant path. Snarl 1: path coverage ratio 1:1.6 → het. Snarl 2: path coverage ratio 0:2 → hom.]

Genotyping is based on the path coverage. A snarl is a variant site in the graph, a “bubble”.
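
As a rough illustration of coverage-based genotyping, the sketch below assigns a genotype to one snarl from the read coverage on its reference path and on its variant path. The function name `genotype_from_coverage`, the `min_fraction` threshold, and the reading of the slide's ratios as reference:variant are assumptions made for this example. It is a toy rule, not the model implemented in vg.

```python
def genotype_from_coverage(ref_cov: float, alt_cov: float,
                           min_fraction: float = 0.2) -> str:
    """Toy genotype call from reference-path and variant-path coverage.

    Hypothetical rule (not vg's model): a path counts as supported when
    it carries at least `min_fraction` of the total coverage at the snarl.
    """
    total = ref_cov + alt_cov
    if total == 0:
        return "./."  # no reads at this snarl: genotype unknown
    ref_ok = ref_cov / total >= min_fraction
    alt_ok = alt_cov / total >= min_fraction
    if ref_ok and alt_ok:
        return "0/1"  # both paths supported: heterozygous
    return "1/1" if alt_ok else "0/0"


# Coverage ratios from the slide, read here as reference:variant.
print(genotype_from_coverage(1.0, 1.6))  # snarl 1
print(genotype_from_coverage(0.0, 2.0))  # snarl 2
```

With the default threshold, the 1:1.6 snarl comes out heterozygous and the 0:2 snarl homozygous for the variant, which agrees with the labels on the slide.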

slide-8
SLIDE 8

From SV catalogs in human 8

Evaluating SV genotypes with a truth set

Deletions/Inversions

A call matches the truth calls with at least 50% coverage and 10% reciprocal overlap. Calls with <50% coverage or <10% reciprocal overlap do not match.

Insertions

At least 50% of inserted sequence matching nearby insertions (the figure marks 20 bp on each side of the insertion).

R package: https://github.com/jmonlong/sveval
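
The criteria above can be sketched in a few lines. This is one reading of the slide, not the sveval implementation: the function names, the half-open coordinates, and the use of insertion size as a stand-in for sequence matching are all assumptions.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open [start, end) on one chromosome


def reciprocal_overlap(a: Interval, b: Interval) -> float:
    """Smaller of the two overlap fractions between intervals a and b."""
    overlap = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    len_a, len_b = a[1] - a[0], b[1] - b[0]
    if len_a <= 0 or len_b <= 0:
        return 0.0
    return min(overlap / len_a, overlap / len_b)


def deletion_matches(call: Interval, truth_calls: List[Interval],
                     min_cov: float = 0.5, min_ro: float = 0.1) -> bool:
    """True if truth calls with >= min_ro reciprocal overlap cover
    at least min_cov of the call (assumed reading of the slide)."""
    if call[1] <= call[0]:
        return False
    # Keep qualifying truth calls, clipped to the call's coordinates.
    clipped = sorted(
        (max(call[0], t[0]), min(call[1], t[1]))
        for t in truth_calls
        if reciprocal_overlap(call, t) >= min_ro
    )
    covered, cur_end = 0, call[0]
    for start, end in clipped:
        start = max(start, cur_end)  # skip bases already counted
        if end > start:
            covered += end - start
            cur_end = end
    return covered / (call[1] - call[0]) >= min_cov


def insertion_matches(pos: int, size: int,
                      truth_insertions: List[Tuple[int, int]],
                      max_dist: int = 20, min_cov: float = 0.5) -> bool:
    """True if truth insertions within max_dist bp account for at least
    min_cov of the inserted size (size is an assumed proxy for sequence)."""
    if size <= 0:
        return False
    nearby = sum(s for p, s in truth_insertions if abs(p - pos) <= max_dist)
    return min(nearby, size) / size >= min_cov


print(deletion_matches((100, 200), [(120, 190)]))
print(insertion_matches(1000, 300, [(1010, 200)]))
```

Clipping and merging the truth intervals keeps overlapping truth calls from being counted twice when computing coverage.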

slide-9
SLIDE 9

From SV catalogs in human 9

Results on HGSVC - Simulated reads

[Figure: F1 score (0 to 1) for INS and DEL, whole-genome and non-repeat regions. Methods compared: vg, Paragraph, BayesTyper, Delly, SVTyper, grouped as graph-based SV genotypers and traditional SV genotypers.]

Non-repeat regions: regions not overlapping segmental duplications or simple repeats.
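
For reference, the F1 score reported in these comparisons is the harmonic mean of precision and recall. The sketch below computes all three from counts of true positives, false positives, and false negatives; the function name and the example counts are illustrative only.

```python
from typing import Tuple


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Precision, recall, and F1 from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up counts, for illustration only.
precision, recall, f1 = precision_recall_f1(tp=80, fp=20, fn=20)
print(f"precision={precision:.2f} recall={recall:.2f} F1={f1:.2f}")
```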

slide-10
SLIDE 10

From SV catalogs in human 10

Results on HGSVC - Real reads

[Figure: F1 score (0 to 0.8) for INS and DEL, whole-genome and non-repeat regions. Methods compared: vg, Paragraph, BayesTyper, Delly, SVTyper, grouped as graph-based SV genotypers and traditional SV genotypers.]

Non-repeat regions: regions not overlapping segmental duplications or simple repeats.

slide-11
SLIDE 11

From SV catalogs in human 11

Simple repeat/low complexity regions are challenging

[Figure: precision and recall (0 to 1) per repeat class/family (SINE/Alu, LTR/ERV1, LINE/L1, Retroposon/SVA, Low_complexity, Satellite, Satellite/centr, Simple_repeat), by SV type (INS, DEL).]

SV sequence annotated with RepeatMasker. Class assigned if covered ≥80% by a repeat element.
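
The 80% rule can be sketched as follows, assuming the repeat annotations have already been parsed into `(repeat_class, start, end)` tuples in SV-sequence coordinates. The function name, the input layout, and the choice to merge hits of the same class are assumptions for this example; RepeatMasker output parsing is not shown.

```python
from collections import defaultdict
from typing import List, Optional, Tuple


def assign_repeat_class(sv_length: int,
                        repeat_hits: List[Tuple[str, int, int]],
                        min_cov: float = 0.8) -> Optional[str]:
    """Return the repeat class covering at least min_cov of the SV sequence.

    repeat_hits holds (repeat_class, start, end) with half-open coordinates
    on the SV sequence. Hits of the same class are merged (an assumption).
    """
    if sv_length <= 0:
        return None
    by_class = defaultdict(list)
    for cls, start, end in repeat_hits:
        by_class[cls].append((max(0, start), min(sv_length, end)))
    best_class, best_cov = None, 0.0
    for cls, intervals in by_class.items():
        covered, cur_end = 0, 0
        for start, end in sorted(intervals):
            start = max(start, cur_end)  # do not count overlapping bases twice
            if end > start:
                covered += end - start
                cur_end = end
        cov = covered / sv_length
        if cov >= min_cov and cov > best_cov:
            best_class, best_cov = cls, cov
    return best_class


print(assign_repeat_class(300, [("SINE/Alu", 10, 290)]))
print(assign_repeat_class(1000, [("LINE/L1", 0, 500), ("Simple_repeat", 500, 700)]))
```

An SV whose sequence is split between several classes, none reaching the threshold, is left unassigned (`None`).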

slide-12
SLIDE 12

From de novo assemblies in yeast 12

Challenges with the VCF format

Multiple equivalent representations, over-simplification, impractical.

VCF v4.2 specs

Why not start directly from de novo assemblies?

slide-13
SLIDE 13

From de novo assemblies in yeast 13

Analysis of 12 yeast strains from 2 clades

Selected 5 strains to build graph: one reference + 2 per clade.

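[Workflow: assemblies for 5 yeast strains are turned into two graphs with vg. The VCF graph comes from a VCF derived from pairwise alignment with the reference; the Cactus graph comes from the Cactus aligner. Illumina reads are mapped to each graph (vg map) and genotyped with vg, giving one VCF per graph, and the mapping metrics are compared.]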

slide-14
SLIDE 14

From de novo assemblies in yeast 14

Evaluating SV genotyping using mapping statistics

No gold-standard to compare with.

[Workflow: assemblies for 5 yeast strains are turned into a VCF graph (from pairwise alignment with the reference) and a Cactus graph (from the Cactus aligner). Short reads are mapped to each graph (vg map) and genotyped with vg, giving one VCF per graph. Evaluation step: compare mapping metrics.]

slide-15
SLIDE 15

From de novo assemblies in yeast 14

Evaluating SV genotyping using mapping statistics

Map reads to a sample graph built from the SV calls.

Mapping quality ∼ Sample graph quality ∼ SV calls quality.
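
A minimal sketch of this comparison, assuming each mapped read has already been summarized as the set of graph node IDs it touches plus its mapping identity, and that the node IDs belonging to variation sites are known. The names `average_identity`, `reads_vcf_graph`, and `reads_cactus_graph`, and all values, are made up for illustration; extracting these fields from vg alignments is not shown.

```python
from statistics import mean
from typing import Iterable, List, Optional, Set, Tuple

Read = Tuple[List[int], float]  # (node IDs touched by the read, mapping identity)


def average_identity(reads: Iterable[Read],
                     variant_nodes: Set[int]) -> Optional[float]:
    """Average mapping identity over reads that touch a variation site."""
    selected = [identity for nodes, identity in reads
                if variant_nodes & set(nodes)]
    return mean(selected) if selected else None


# Made-up reads mapped to two sample graphs for the same strain.
reads_vcf_graph = [([1, 2, 3], 0.70), ([3, 4], 0.90), ([7, 8], 1.00)]
reads_cactus_graph = [([1, 2, 3], 0.85), ([3, 4], 0.95), ([7, 8], 1.00)]
variant_nodes = {3}

print(average_identity(reads_vcf_graph, variant_nodes))
print(average_identity(reads_cactus_graph, variant_nodes))
```

Restricting to reads at variation sites keeps the large majority of reads, which map equally well to either graph, from masking the difference between the two sets of SV calls.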

slide-16
SLIDE 16

From de novo assemblies in yeast 15

Better mapping for SVs called in the cactus graph

Analysis restricted to reads at variation sites.

[Figure: average mapping identity per strain (0.4 to 0.8), VCF graph against Cactus graph. Strains: UWOPS919171, UFRJ50816, YPS138, N44, CBS432, UWOPS034614, YPS128, Y12, SK1, DBVPG6765, DBVPG6044. Points are marked by whether the strain was excluded or included during graph construction and by clade (cerevisiae, paradoxus).]

slide-17
SLIDE 17

Conclusions and future directions 16

Conclusions

The vg toolkit can integrate and genotype SVs. Graphs built from the alignment of de novo assemblies perform better than graphs built from a VCF.

Hickey et al. bioRxiv 2019 https://jmonlong.github.io/manu-vgsv/

slide-18
SLIDE 18

Conclusions and future directions 16

Future directions

Experiment with high-quality human de novo assemblies (e.g. the Human PanGenome Project). Combine public SV catalogs and genotype SVs in a large and diverse cohort.

slide-19
SLIDE 19

Acknowledgment 17

Acknowledgment

Benedict Paten, Glenn Hickey, David Heller, Adam Novak, Erik Garrison, Jouni Siren, Jordan Eizenga, Charles Markello, Xian Chang, Robin Rounthwaite, Jonas Sibbesen, Eric T. Dawson

slide-20
SLIDE 20

Universal genome graph

slide-21
SLIDE 21

Some methods “over-genotype” similar variants

[Figure: average number of genotyped calls per truth call (0.9 to 1.2) for each method (vg, Paragraph, BayesTyper, SVTyper, Delly Genotyper, SMRT-SV v2 Genotyper), by experiment (HGSVC real reads, GIAB, CHM-PD, SVPOP) and SV type (INS, DEL). A schematic alongside compares similar variants S1, S2, S3 in the truth set with the calls from Paragraph and vg.]
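
One way to compute such a metric is sketched below, assuming each genotyped call has already been matched to a truth call. The input layout, the function name, and the exact definition (matched calls divided by truth calls with at least one match) are assumptions; the slide does not spell out how the value was derived.

```python
from collections import Counter
from typing import List, Optional, Tuple


def calls_per_truth_call(matches: List[Tuple[str, str]]) -> Optional[float]:
    """Average number of genotyped calls matched to each truth call.

    matches holds (call_id, truth_id) pairs. A value above 1 means that
    some truth calls are matched by more than one genotyped call.
    """
    per_truth = Counter(truth_id for _, truth_id in matches)
    if not per_truth:
        return None
    return sum(per_truth.values()) / len(per_truth)


# Made-up example: two similar calls both match truth call S1.
print(calls_per_truth_call([("c1", "S1"), ("c2", "S1"), ("c3", "S2")]))
```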

slide-22
SLIDE 22

Deletion correctly genotyped by vg

[Figure: reads aligned to the graph across a deletion in the 3’ UTR of the LONRF2 gene, GRCh38 chr2.]

51 bp homozygous deletion in the 3’ UTR of the LONRF2 gene.

slide-23
SLIDE 23

Simulation experiment

[Figure: best F1 (0 to 1) against sequencing depth (1, 3, 7, 13, 20) for INS, DEL, and INV, with true SVs in the VCF and with errors in the VCF. Methods: vg, Paragraph, BayesTyper, SVTyper, Delly Genotyper.]

slide-24
SLIDE 24

SV catalog summary results

[Figure: best F1 (0 to 1) for INS and DEL in all and non-repeat genomic regions, across HGSVC simulated reads, HGSVC real reads, GIAB, CHM-PD, and SVPOP. Methods: vg, Paragraph, BayesTyper, SVTyper, Delly Genotyper, SMRT-SV v2 Genotyper. SV evaluation: presence and genotype.]

slide-25
SLIDE 25

Precision-recall curve

[Figure: precision against recall (0 to 1) for INS and DEL, in all and non-repeat genomic regions. Methods: vg, Paragraph, BayesTyper, SVTyper, Delly Genotyper.]

slide-26
SLIDE 26

Evaluation per SV size

[Figure: F1 score (0 to 1) per SV size bin (bp, with the largest bin >5K) for INS and DEL, in all regions and non-repeat regions, across HGSVC simulated reads, HGSVC real reads, and GIAB.]

slide-27
SLIDE 27

Better mapping for SVs called in the cactus graph

Analysis restricted to reads at variation sites.

[Figure: average mapping quality per strain (20 to 40), VCF graph against Cactus graph. Strains: UFRJ50816, YPS138, N44, CBS432, UWOPS034614, YPS128, Y12, SK1, DBVPG6765, DBVPG6044. Points are marked by whether the strain was excluded or included during graph construction and by clade (cerevisiae, paradoxus).]