Genotyping structural variants in TOPMed using pangenome graphs


slide-1
SLIDE 1

Genotyping structural variants in TOPMed using pangenome graphs

Jean Monlong February 12-13, 2020

GSP-TOPMed Analysis Workshop

slide-2
SLIDE 2

Introduction 2

Pangenome graphs and variant-aware read mapping

[Figure: sequencing reads (Seq. reads) aligned to a linear reference genome and to a variation graph; variants in the graph are labeled 1KGP.]
slide-3
SLIDE 3

Introduction 3

Mapping reads across structural variants

Structural variants (SVs) are genomic variants larger than 50 bp, e.g. insertions, deletions, inversions, and translocations.

[Figure: linear reference genome versus variation graph, with deletion and insertion sites marked.]

slide-4
SLIDE 4

Introduction 4

The vg toolkit is a complete, open source solution for graph construction, read mapping, and variant calling. https://github.com/vgteam/vg

Garrison et al. Nature Biotech 2018

vg can genotype structural variants from short-read sequencing datasets starting from public SV catalogs or de novo assemblies.

Hickey et al. bioRxiv 2019, in press at Genome Biology

slide-5
SLIDE 5

SV genotyping with vg 5

Genotyping SVs from long-read sequencing studies

Ref.                   Project                                               Samples
Chaisson et al. 2019   Human Genome Structural Variation Consortium (HGSVC)  3
Audano et al. 2019     SVPOP                                                 15
Zook et al. 2019       Genome in a Bottle (GIAB)                             1

[Figure: evaluation workflow for HG00514. Labels: HGSVC SV catalog (VCF), GRCh38, vg, HG00514 short reads, vg, genotyped SVs (VCF), evaluation.]

slide-6
SLIDE 6

SV genotyping with vg 6

SV genotyping accuracy for deletions and insertions

[Figure: F1 score (0.0 to 0.8) for insertions (INS) and deletions (DEL), whole-genome and in non-repeat regions, for graph-based SV genotypers (vg, Paragraph, BayesTyper) and traditional SV genotypers (Delly, SVTyper).]

Non-repeat regions: regions not overlapping segmental duplications or simple repeats.
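The F1 score used in this comparison is the harmonic mean of precision and recall. A minimal Python sketch of the calculation follows; the function name and the counts are made up for illustration and are not taken from the evaluation.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall from call counts.

    tp: genotyped SVs that match the truth set
    fp: genotyped SVs absent from the truth set
    fn: truth-set SVs that were missed
    """
    if tp == 0:
        return 0.0  # no correct calls: precision and recall are both 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts, for illustration only.
example_f1 = f1_score(tp=80, fp=20, fn=20)
```

With these toy counts, precision and recall are both 0.8, so the F1 score is 0.8 as well.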

slide-7
SLIDE 7

SV genotyping with vg 7

Combined SV catalogs from 3 long-read studies

Ref.                   Project                                               Samples
Chaisson et al. 2019   Human Genome Structural Variation Consortium (HGSVC)  3
Audano et al. 2019     SVPOP                                                 15
Zook et al. 2019       Genome in a Bottle (GIAB)                             1

[Figure: number of variants by size (bp, 50 to 100,000) and SV type (DEL, INS).]

71K deletions and 70K insertions include most of the common deletions and insertions in the population.

slide-8
SLIDE 8

SV genotyping in the BioData Catalyst ecosystem 8

760 TOPMed samples genotyped in 5 days

  • Using BioData Catalyst as an alpha user.
  • Workflow in Dockstore.
  • TOPMed data imported from Gen3.
  • Genotyping and exploratory analysis on Terra using workflows and notebooks.
  • ∼$12 per sample (soon <$4 with new read mapper).

slide-9
SLIDE 9

SV genotyping in the BioData Catalyst ecosystem 9

TOPMed data available in Gen3

I selected the MESA cohort and exported the CRAM files to Terra.

slide-10
SLIDE 10

SV genotyping in the BioData Catalyst ecosystem 10

WDL workflow for vg in Dockstore

slide-11
SLIDE 11

SV genotyping in the BioData Catalyst ecosystem 11

Genotyping and analysis on Terra

slide-12
SLIDE 12

SVs across 760 samples 12

SVs genotyped in 760 diverse genomes

slide-13
SLIDE 13

SVs across 760 samples 13

Frequency estimates

Insertions are slightly more frequent than deletions, especially for larger variants. Hundreds of SVs are fixed, especially insertions.
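Allele frequencies like these can be estimated directly from the per-sample genotypes. The Python sketch below assumes biallelic, diploid VCF-style genotype strings; the function names and the toy genotypes are hypothetical, and the 0.99 cutoff is the one this deck uses for fixed insertions.

```python
def allele_frequency(genotypes, alt_allele="1"):
    """Fraction of called alleles that are the alternate allele.

    genotypes: VCF-style strings such as "0/1", "1|1", or "./."
    Missing alleles (".") are skipped. Returns None if nothing was called.
    """
    alt = 0
    called = 0
    for gt in genotypes:
        for allele in gt.replace("|", "/").split("/"):
            if allele == ".":
                continue
            called += 1
            if allele == alt_allele:
                alt += 1
    return alt / called if called else None


def is_fixed(af, threshold=0.99):
    """True if the allele frequency exceeds the 'fixed' cutoff."""
    return af is not None and af > threshold


# Toy genotypes for one SV across 14 samples (illustration only).
toy = "0/1 0/0 0/0 1/1 0/1 0/0 0/1 1/1 0/0 0/1 0/1 0/0 1/1 0/1".split()
toy_af = allele_frequency(toy)
```

In the toy data, 12 of the 28 called alleles are alternate, so the estimated frequency is 3/7 and the variant is not fixed.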

slide-14
SLIDE 14

SVs across 760 samples 14

Fixed insertions

736 insertions with allele frequency >0.99. Two repeat expansions in coding regions of SAMD1 and FOXO6.

Screenshots from https://gnomad.broadinstitute.org/

slide-15
SLIDE 15

SVs across 760 samples 15

Fine-tuning breakpoints of deletions

Although sequence-resolved, many deletions are extremely similar and likely near-duplicates of the same real deletion.

[Figure: near-duplicate deletions from samples S1, S2, and S3 in the SV catalog, shown with the truth set and with Paragraph and vg genotypes for Sample 1.]

In more than 9,000 clusters of near-duplicate deletions, the 760 samples mostly supported a single variant.
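The slide does not say how the clusters were built. One possible approach, sketched here in Python, is to group deletions by reciprocal overlap and keep the variant supported by the most samples. The 0.9 threshold, the greedy grouping, the class and function names, and the coordinates are all assumptions made for illustration.

```python
from typing import NamedTuple


class Deletion(NamedTuple):
    chrom: str
    start: int          # 0-based, inclusive
    end: int            # exclusive
    n_alt_samples: int  # samples genotyped with at least one alt allele


def reciprocal_overlap(a: Deletion, b: Deletion) -> float:
    """Overlap length divided by the longer of the two deletions."""
    if a.chrom != b.chrom:
        return 0.0
    inter = min(a.end, b.end) - max(a.start, b.start)
    if inter <= 0:
        return 0.0
    return min(inter / (a.end - a.start), inter / (b.end - b.start))


def cluster_deletions(dels, min_ro=0.9):
    """Greedy grouping: join the first cluster holding a similar deletion."""
    clusters = []
    for d in sorted(dels, key=lambda x: (x.chrom, x.start, x.end)):
        for cluster in clusters:
            if any(reciprocal_overlap(d, o) >= min_ro for o in cluster):
                cluster.append(d)
                break
        else:
            clusters.append([d])
    return clusters


def best_supported(cluster):
    """Pick the variant that the genotyped samples support most often."""
    return max(cluster, key=lambda d: d.n_alt_samples)


# Hypothetical catalog: a 96 bp and a 97 bp near-duplicate, plus one other.
catalog = [
    Deletion("chr11", 1000, 1096, 700),
    Deletion("chr11", 1000, 1097, 0),
    Deletion("chr11", 5000, 5200, 10),
]
clusters = cluster_deletions(catalog)
chosen = best_supported(clusters[0])
```

Here the 96 bp and 97 bp deletions fall into one cluster, and the 96 bp version is kept because it is the only one with sample support.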

slide-16
SLIDE 16

SVs across 760 samples 16

Coding deletions with fine-tuned breakpoints

95 of the fine-tuned deletions overlap coding regions. Two near-duplicate deletions overlap the DRD4 gene, within a long short tandem repeat: is the deletion 96 bp or 97 bp? → All samples supported the 96 bp deletion. This corresponds to the known 2-copy version of the 48-nt repeat (DRD4-2R).

[Figure: UCSC Genome Browser view of DRD4 (hg38, chr11:639,600-640,400) with the GENCODE v32, OMIM Genes, RepeatMasker, and Simple Tandem Repeats (TRF) tracks.]

slide-17
SLIDE 17

Conclusions and future directions 17

Conclusions

  • The vg toolkit can integrate and genotype SVs.
  • 760 TOPMed samples genotyped in 5 days using the BioData Catalyst ecosystem.
  • SV catalog from long-read studies annotated with frequencies and better breakpoint resolution.

[Figure: overview diagram. Labels: long-read studies (HGSVC, SVPOP, GIAB); short-read studies (gnomAD, TOPMed SV-WG); high-quality phased assemblies (Human Pangenome); GRCh38; structural variation; vg; TOPMed short reads; vg; SV genotypes (e.g. 0/1, 0/0, 1/1); phenotypes; association study; annotated SV catalog.]

slide-18
SLIDE 18

Conclusions and future directions 17


Future directions

  • Documented workflows for the BioData Catalyst community (and GSP through NHGRI AnVIL).
  • More SVs genotyped in more TOPMed samples for association studies.


slide-19
SLIDE 19

Acknowledgment 18

Acknowledgment

vg Team: Benedict Paten, Glenn Hickey, David Heller, Adam Novak, Erik Garrison, Jouni Siren, Jordan Eizenga, Charles Markello, Xian Chang, Robin Rounthwaite, Jonas Sibbesen, Eric T. Dawson

BioData Catalyst Team: Beth Sheets (talk to her!), Michael Baumann, Brian Hannafious

slide-20
SLIDE 20

19

slide-21
SLIDE 21

20

Genotyping variants in vg

[Figure: read mapping on a linear reference versus a graph reference, for an insertion and a deletion. Insertion reads are not mapped on the linear reference; on the graph, reads map to either the reference path or the variant path. Snarl 1: path coverage ratio 1:1.6 → het. Snarl 2: path coverage ratio 0:2 → hom.]

Genotyping is based on the path coverage. A snarl is a variant site in the graph, a “bubble”.
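To make the idea of coverage-based genotyping concrete, here is a toy Python sketch that turns reference-path and variant-path coverage at a snarl into a genotype. It is not vg's actual model: the function name, the fixed 0.2 cutoff, and reading the slide's ratios as reference:variant are all assumptions.

```python
def genotype_from_coverage(ref_cov: float, alt_cov: float,
                           min_fraction: float = 0.2) -> str:
    """Call a diploid genotype from path coverage at one snarl.

    ref_cov: read coverage on the reference path
    alt_cov: read coverage on the variant path
    min_fraction: hypothetical cutoff; a path carrying less than this
        share of the coverage is treated as absent
    """
    total = ref_cov + alt_cov
    if total == 0:
        return "./."  # no reads on either path: no call
    alt_fraction = alt_cov / total
    if alt_fraction < min_fraction:
        return "0/0"  # homozygous reference
    if alt_fraction > 1 - min_fraction:
        return "1/1"  # homozygous variant
    return "0/1"      # both paths supported: heterozygous


# The two ratios from the slide, read as reference:variant (assumption).
snarl_1 = genotype_from_coverage(1, 1.6)
snarl_2 = genotype_from_coverage(0, 2)
```

Under these assumptions, the 1:1.6 ratio gives a heterozygous call and the 0:2 ratio a homozygous one, matching the labels on the slide.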

slide-22
SLIDE 22

21

Results on HGSVC - Simulated reads

[Figure: F1 score (0.00 to 1.00) on simulated reads for insertions (INS) and deletions (DEL), whole-genome and in non-repeat regions, for graph-based SV genotypers (vg, Paragraph, BayesTyper) and traditional SV genotypers (Delly, SVTyper).]

Non-repeat regions: regions not overlapping segmental duplications or simple repeats.

slide-23
SLIDE 23

22

Deletion correctly genotyped by vg

[Figure: reads aligned to the graph at a deletion in the 3’ UTR of the LONRF2 gene (GRCh38, chr2).]

51 bp homozygous deletion in the 3’ UTR of the LONRF2 gene.

slide-24
SLIDE 24

23

Simple repeat/low complexity regions are challenging

[Figure: precision and recall (0.00 to 1.00) by repeat class/family (SINE/Alu, LTR/ERV1, LINE/L1, Retroposon/SVA, Low_complexity, Satellite, Satellite/centr, Simple_repeat) and SV type (INS, DEL).]

SV sequence annotated with RepeatMasker. Class assigned if covered ≥80% by a repeat element.
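The 80% rule can be sketched in Python as follows. The sketch assumes that RepeatMasker hits are given as (start, end, class) tuples in coordinates of the SV sequence and that a single element must reach the threshold; the function name and example hits are hypothetical.

```python
def assign_repeat_class(sv_len, hits, min_cov=0.8):
    """Return the repeat class covering at least min_cov of the SV sequence.

    sv_len: length of the inserted or deleted sequence, in bp
    hits: iterable of (start, end, repeat_class), 0-based, end-exclusive
    Returns None if no single element covers enough of the sequence.
    """
    if sv_len <= 0:
        raise ValueError("sv_len must be positive")
    best_class = None
    best_cov = 0.0
    for start, end, repeat_class in hits:
        # Clip the hit to the SV sequence before measuring coverage.
        covered = max(0, min(sv_len, end) - max(0, start))
        cov = covered / sv_len
        if cov >= min_cov and cov > best_cov:
            best_class, best_cov = repeat_class, cov
    return best_class


# Hypothetical annotations for two 300 bp SV sequences.
alu_like = assign_repeat_class(300, [(5, 290, "SINE/Alu")])
partial = assign_repeat_class(300, [(0, 100, "Simple_repeat")])
```

The first sequence is 95% covered by one element and gets its class; the second is only a third covered and stays unclassified.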

slide-25
SLIDE 25

24

Frequency distribution vs variant size

slide-26
SLIDE 26

25

UMAP

slide-27
SLIDE 27

26

Genotype quality and samples with genotype calls