Genotyping structural variants in TOPMed using pangenome graphs

Genotyping structural variants in TOPMed using pangenome graphs Jean Monlong February 12-13, 2020 GSP-TOPMed Analysis Workshop

Pangenome graphs and variant-aware read mapping. [Figure: sequencing reads mapped to a linear reference genome, compared with sequencing reads mapped to a variation graph; variants in the graph are labeled 1KGP.] (Section: Introduction)

Mapping reads across structural variants. Structural variants (SVs) are genomic variants larger than 50 bp, e.g. insertions, deletions, inversions, and translocations. [Figure: a deletion shown on the linear reference genome; a deletion and an insertion shown in the variation graph.]
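
The size-based definition above can be sketched in a few lines of Python. This is only an illustration: the function name `classify_variant` and the `min_sv_size` parameter are hypothetical, the threshold is applied as "at least 50 bp" (an assumption, following a common convention), and only sequence-resolved insertions and deletions are handled, not inversions or translocations.

```python
def classify_variant(ref: str, alt: str, min_sv_size: int = 50) -> str:
    """Label a sequence-resolved variant as INS, DEL, or a small variant.

    min_sv_size is an assumed threshold: a size difference of at least
    50 bp between the REF and ALT alleles counts as a structural variant.
    """
    size = abs(len(alt) - len(ref))
    if size < min_sv_size:
        return "small"
    return "INS" if len(alt) > len(ref) else "DEL"


# Hypothetical alleles, written as REF/ALT strings as in a VCF record
print(classify_variant("A", "A" + "T" * 60))  # 60 bp insertion
print(classify_variant("A" + "C" * 51, "A"))  # 51 bp deletion
print(classify_variant("A", "AT"))            # 1 bp indel, not an SV
```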

The vg toolkit is a complete, open-source solution for graph construction, read mapping, and variant calling. https://github.com/vgteam/vg (Garrison et al., Nature Biotech 2018). vg can genotype structural variants from short-read sequencing datasets, starting from public SV catalogs or de novo assemblies (Hickey et al., bioRxiv 2019, in press at Genome Biology).

Genotyping SVs from long-read sequencing studies
Ref.                   Project                                                 Samples
Chaisson et al. 2019   Human Genome Structural Variation Consortium (HGSVC)    3
Audano et al. 2019     SVPOP                                                   15
Zook et al. 2019       Genome in a Bottle (GIAB)                               1
[Figure: evaluation workflow. GRCh38 and the HGSVC SV catalog VCF go into vg; HG00514 short reads go into vg; the resulting HG00514 genotyped SVs are evaluated against the HG00514 VCF.] (Section: SV genotyping with vg)

SV genotyping accuracy for deletions and insertions. [Figure: F1 score (0 to 0.8) for insertions (INS) and deletions (DEL), whole-genome and in non-repeat regions, for Paragraph, vg, BayesTyper, Delly, and SVTyper, grouped into graph-based SV genotypers and traditional SV genotypers.] Non-repeat regions: regions not overlapping segmental duplications or simple repeats.
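
For reference, the F1 score plotted on this slide is the harmonic mean of precision and recall. The sketch below computes it from counts of true positives, false positives, and false negatives. How calls were matched to the truth set is not described on the slide, and the counts used here are made up for illustration.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall; 0.0 when undefined."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts, not results from the talk
print(round(f1_score(tp=80, fp=20, fn=20), 3))
```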

Combined SV catalogs from 3 long-read studies
Ref.                   Project                                                 Samples
Chaisson et al. 2019   Human Genome Structural Variation Consortium (HGSVC)    3
Audano et al. 2019     SVPOP                                                   15
Zook et al. 2019       Genome in a Bottle (GIAB)                               1
[Figure: number of variants by size (bp, 50 to 100,000) for each SV type (DEL, INS).]
The 71K deletions and 70K insertions include most of the common deletions and insertions in the population.

760 TOPMed samples genotyped in 5 days. Using BioData Catalyst as an alpha user. Workflow in Dockstore. TOPMed data imported from Gen3. Genotyping and exploratory analysis on Terra using workflows and notebooks. ~$12 per sample (soon < $4 with new read mapper). (Section: SV genotyping in the BioData Catalyst ecosystem)

TOPMed data available in Gen3. I selected the MESA cohort and exported the CRAM files to Terra.

WDL workflow for vg in Dockstore

Genotyping and analysis on Terra

SVs genotyped in 760 diverse genomes. (Section: SVs across 760 samples)

Frequency estimates. Insertions are slightly more frequent than deletions, especially for larger variants. Hundreds of fixed SVs, especially insertions.

Fixed insertions. 736 insertions with allele frequency > 0.99. Two repeat expansions in coding regions of SAMD1 and FOXO6. [Figure: screenshots from https://gnomad.broadinstitute.org/]
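
As a minimal sketch of how an allele frequency such as the 0.99 cutoff above could be computed, the following Python function counts alternate alleles in diploid genotype strings of the form `0/1`. The function name and the handling of missing calls are assumptions; the talk does not say how frequencies were estimated or how missing genotypes were treated.

```python
def allele_frequency(genotypes):
    """Alternate allele frequency from diploid genotype strings like '0/1'.

    Missing alleles ('.') are skipped. Returns None if no allele was called.
    """
    alt = total = 0
    for gt in genotypes:
        for allele in gt.replace("|", "/").split("/"):
            if allele == ".":
                continue
            total += 1
            alt += allele == "1"
    return alt / total if total else None


# Hypothetical genotype calls for one insertion across five samples
genotypes = ["1/1", "1/1", "0/1", "./.", "1/1"]
af = allele_frequency(genotypes)
print(af, "fixed" if af > 0.99 else "not fixed")
```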

Fine-tuning breakpoints of deletions. Although sequence-resolved, many deletions in the SV catalog are extremely similar and likely near-duplicates of the same real deletion. [Figure: schematic with labels Deletions, SV catalog, Sample 1, Paragraph, vg, Truth set.] In more than 9K clusters of near-duplicates, the 760 samples mostly supported a single variant.
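
One way to picture this fine-tuning step is sketched below: group near-duplicate deletions by reciprocal overlap, then keep the variant carried by the most samples in each cluster. This is a hypothetical illustration, not the method used in the talk. The clustering rule, the 0.8 threshold, the field names, and the example records are all assumptions, and all deletions are assumed to lie on the same chromosome.

```python
def reciprocal_overlap(a, b):
    """Overlap length divided by the longer of two (start, end) intervals."""
    overlap = min(a[1], b[1]) - max(a[0], b[0])
    if overlap <= 0:
        return 0.0
    return overlap / max(a[1] - a[0], b[1] - b[0])


def cluster_deletions(deletions, min_ro=0.8):
    """Greedy clustering: a deletion joins the first cluster whose first
    member it overlaps reciprocally by at least min_ro (assumed threshold)."""
    clusters = []
    for d in sorted(deletions, key=lambda d: (d["start"], d["end"])):
        for cluster in clusters:
            rep = cluster[0]
            ro = reciprocal_overlap((d["start"], d["end"]),
                                    (rep["start"], rep["end"]))
            if ro >= min_ro:
                cluster.append(d)
                break
        else:
            clusters.append([d])
    return clusters


def best_supported(cluster):
    """Pick the variant carried by the most samples."""
    return max(cluster, key=lambda d: d["n_carriers"])


# Hypothetical catalog entries with made-up carrier counts
deletions = [
    {"id": "del_a", "start": 1000, "end": 1096, "n_carriers": 310},
    {"id": "del_b", "start": 1000, "end": 1097, "n_carriers": 2},
    {"id": "del_c", "start": 5000, "end": 5200, "n_carriers": 40},
]
for cluster in cluster_deletions(deletions):
    print([d["id"] for d in cluster], "->", best_supported(cluster)["id"])
```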

Coding deletions with fine-tuned breakpoints. 95 of the fine-tuned deletions overlap coding regions. Two near-duplicate deletions overlapped the DRD4 gene, within a long short tandem repeat: is it a 96 bp or a 97 bp deletion? → All samples supported the 96 bp deletion. Known 2-copy version of the 48 nt repeat (DRD4-2R). [Figure: genome browser view of hg38 chr11, around 639,600 to 640,400, with tracks for GENCODE v32 (DRD4), OMIM Genes (126452), RepeatMasker, and Simple Tandem Repeats by TRF.]

Conclusions. The vg toolkit can integrate and genotype SVs. 760 TOPMed samples genotyped in 5 days using the BioData Catalyst ecosystem. SV catalog from long-read studies annotated with frequencies and better breakpoint resolution. Future directions. Documented workflows for the BioData Catalyst community (and GSP through NHGRI AnVIL). More SVs genotyped in more TOPMed samples for association studies. [Figure: overview diagram with labels: Structural Variation; Long-read studies (HGSVC, SVPOP, GIAB); Short-read studies (gnomAD, TOPMed SV-WG); Human Pangenome (high-quality phased assemblies); GRCh38; vg; TOPMed short reads; SV genotypes (0/1, 0/0, 1/1, ...); Phenotypes; Association study; Annotated SV catalog.] (Section: Conclusions and future directions)

Acknowledgment. vg Team: Benedict Paten, Glenn Hickey, David Heller, Adam Novak, Erik Garrison, Jouni Siren, Jordan Eizenga, Charles Markello, Xian Chang, Robin Rounthwaite, Jonas Sibbesen, Eric T. Dawson. BioData Catalyst Team: Beth Sheets (talk to her!), Michael Baumann, Brian Hannafious.

Genotyping variants in vg. Genotyping is based on the path coverage. A snarl is a variant site in the graph, a “bubble”. [Figure: a deletion and an insertion on the linear reference, where reads from the insertion are not mapped, compared with the graph, where reads map to either the reference path or the variant path. Two snarls (Snarl 1, Snarl 2) are shown, with path coverage ratios 1:1.6 → het and 0:2 → hom.]
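
The idea of calling a genotype from path coverage can be illustrated with a simple threshold rule on the fraction of reads supporting the variant path. This is a toy sketch and not vg's actual genotyping model: the function name and the `het_min` and `hom_min` thresholds are hypothetical. The two example ratios are the ones shown on the slide.

```python
def genotype_from_coverage(ref_cov: float, alt_cov: float,
                           het_min: float = 0.2,
                           hom_min: float = 0.8) -> str:
    """Call a diploid genotype at one snarl from read coverage on the
    reference path and the variant path (thresholds are assumptions)."""
    total = ref_cov + alt_cov
    if total == 0:
        return "./."  # no reads on either path: no call
    alt_fraction = alt_cov / total
    if alt_fraction >= hom_min:
        return "1/1"
    if alt_fraction >= het_min:
        return "0/1"
    return "0/0"


print(genotype_from_coverage(1, 1.6))  # ratio 1:1.6 → 0/1 (het)
print(genotype_from_coverage(0, 2))    # ratio 0:2 → 1/1 (hom)
```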

Results on HGSVC - Simulated reads. [Figure: F1 score (0 to 1) for insertions (INS) and deletions (DEL), whole-genome and in non-repeat regions, for Paragraph, vg, BayesTyper, Delly, and SVTyper, grouped into graph-based SV genotypers and traditional SV genotypers.] Non-repeat regions: regions not overlapping segmental duplications or simple repeats.

Deletion correctly genotyped by vg. 51 bp homozygous deletion in the 3’ UTR of the LONRF2 gene. [Figure: reads, GRCh38 chr2, and the deletion in the graph, at the 3’ UTR of the LONRF2 gene.]

Simple repeat/low complexity regions are challenging. [Figure: recall vs. precision (0 to 1) by repeat class/family (SINE/Alu, LTR/ERV1, LINE/L1, Retroposon/SVA, Low_complexity, Satellite, Satellite/centr, Simple_repeat) and SV type (INS, DEL).] SV sequence annotated with RepeatMasker. Class assigned if covered ≥ 80% by a repeat element.
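
The 80% coverage rule can be sketched as follows. The code assumes repeat hits are given as half-open (start, end, class) intervals on the SV sequence, and it sums coverage per repeat class before applying the threshold. Whether the original analysis required a single element or the combined elements of a class to reach 80% is not stated, so this reading is an assumption, as are the function names and the example hits.

```python
def covered_fraction(sv_len, intervals):
    """Fraction of an SV sequence covered by the union of half-open
    (start, end) intervals, clipped to [0, sv_len)."""
    covered = 0
    last_end = 0
    for start, end in sorted(intervals):
        start = max(start, last_end, 0)  # skip bases already counted
        end = min(end, sv_len)
        if end > start:
            covered += end - start
            last_end = end
    return covered / sv_len if sv_len else 0.0


def assign_repeat_class(sv_len, annotations, min_cov=0.8):
    """Return the repeat class covering at least min_cov of the SV
    sequence (the best-covering one if several qualify), else None."""
    by_class = {}
    for start, end, cls in annotations:
        by_class.setdefault(cls, []).append((start, end))
    best_cls, best_cov = None, 0.0
    for cls, intervals in by_class.items():
        cov = covered_fraction(sv_len, intervals)
        if cov >= min_cov and cov > best_cov:
            best_cls, best_cov = cls, cov
    return best_cls


# Hypothetical RepeatMasker-style hits on a 300 bp inserted sequence
hits = [(0, 280, "SINE/Alu"), (285, 300, "Simple_repeat")]
print(assign_repeat_class(300, hits))
print(assign_repeat_class(100, [(0, 50, "LINE/L1")]))
```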

Frequency distribution vs variant size

UMAP

Genotype quality and samples with genotype calls
