SLIDE 1

How we built a global search engine for genetic data

Miro Cupak (@mirocupak)
VP Engineering, DNAstack
13/06/2018

SLIDE 2

What and why?

Beacon Network
https://beacon-network.org/
largest search and discovery engine of human genetic mutations
from the Global Alliance for Genomics & Health (GA4GH)
case study: problem, standard, architecture, technologies, fun with stats

SLIDE 3

Background

SLIDE 4

https://beacon-network.org

SLIDE 5

https://beacon-network.org

SLIDE 6

Trends

sequencing cost decreasing exponentially (3M times since 2000)

https://www.nature.com/news/technology-the-1-000-genome-1.14901

SLIDE 7

Trends

genomic data volume increasing exponentially (1M times since 2000)

http://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1002195

SLIDE 8

Trends

up to 2 billion human genomes sequenced in the next 10 years (more data annually than uploaded to Twitter and YouTube)

[Chart: Data Volumes by 2025 (GB), lower and upper bounds for Twitter, YouTube, and Genomics]

http://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1002195

SLIDE 9

Problem

no single institution will have sufficient resources
still, institutions don’t have enough data (common diseases, rare diseases)
challenge: discovering data
solution
the traditional approach of aggregating data in a single centralized site is not working
a federated system capable of executing cross-dataset and cross-institution queries is needed

SLIDE 10

GA4GH & Beacon Project

GA4GH
nonprofit standards organization
a coalition of over 500 leading institutions working in health care, research, disease advocacy, life science, and information technology
goal: enable responsible sharing of genomic and clinical data
established in 2013

Beacon Project
experiment to test the willingness of international sites to share genetic data in the simplest of all technical contexts
initiative requiring collaboration of many different GA4GH groups
started in 2014 and quickly gained traction

http://ga4gh.org/
https://www.broadinstitute.org/files/news/pdfs/GAWhitePaperJune3.pdf
https://beacon-project.io/

SLIDE 11

Beacon

SLIDE 12

Beacon

simple web service allowing users to query an institution’s databases to determine whether they contain a genetic variant of interest
receives questions of the form "Do you have information about this mutation?"
responds with yes or no, optionally with additional information about the mutation
design principles:
A beacon has to be technically simple.
A beacon has to minimize risks associated with genomic data sharing.
It has to be possible to make a beacon publicly available.
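The question-and-answer shape of a beacon can be sketched in a few lines. This is an illustrative in-memory model, not the Beacon API: the `Variant` and `Beacon` names and the `query` signature are assumptions made for the example, and the coordinates are simply ones that appear in the search statistics later in this deck.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    """A mutation identified by chromosome, position, and allele (hypothetical model)."""
    chromosome: str
    position: int
    allele: str


class Beacon:
    """Toy beacon: it knows a set of variants and only ever answers yes or no."""

    def __init__(self, name, variants):
        self.name = name
        self._variants = frozenset(variants)

    def query(self, chromosome, position, allele):
        # The caller learns whether the variant exists, nothing about who carries it.
        return Variant(chromosome, position, allele) in self._variants


beacon = Beacon("example", [Variant("13", 32936732, "C")])
print(beacon.query("13", 32936732, "C"))   # known variant
print(beacon.query("7", 140453136, "C"))   # unknown variant
```

Returning a bare boolean is what keeps such a service technically simple and limits what it discloses about the underlying data.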

SLIDE 13

Standard: Before Beacon Network

no formal specification
receives questions of the form "Do you have information about this mutation?"
responds with yes or no
4 public beacons, each with a different API, differing in:
request method
supported parameters
parameter names
chromosome identifiers
positional base
assembly notation
supported alleles
dataset support
response format
data included in the response

SLIDE 14

Standard: Before Beacon Network

SLIDE 15

Standard: 0.1

2014
really simple (2 records)
true/false response
format: Avro
not enough traction
too vague
issues partially addressed by the Beacon Network

SLIDE 16

Standard: 0.2

2015
true/false/overlap/null response
datasets
simple data use conditions
self description
format: Avro
complex (9 records)
not well adopted
not polished enough

SLIDE 17

Standard: 0.3

2016
simplified 0.2
based on real needs, successful
true/false/null response
data model improvements, extended metadata and response, improved support for datasets and cross-dataset queries, data versioning
modular and extensible
tooling
format: Avro → Proto3

SLIDE 18

Standard: 0.4

2018
stable and more flexible
support for more complex mutations
improved error handling
improved data use conditions
various minor improvements
developer experience
format: Proto3 → OpenAPI

SLIDE 19

Beacon Network

SLIDE 20

Architecture

SLIDE 21

Data

access to data stored in a relational database

SLIDE 22

Service

communication with other subsystems
query normalization
aggregators
participant resolution
query distribution
audit trail
L1 parallelization
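Query distribution with first-level parallelism might look roughly like the sketch below. It assumes that "L1 parallelization" means asking all participating beacons concurrently, and it models each beacon as a plain callable. The `distribute` and `aggregate` names, and the rule for combining answers, are illustrative assumptions rather than the Beacon Network implementation.

```python
from concurrent.futures import ThreadPoolExecutor


def distribute(query, beacons, max_workers=8):
    """Send one query to every beacon in parallel and collect per-beacon answers."""

    def ask(item):
        name, beacon = item
        try:
            return name, beacon(query)
        except Exception:
            # A failing beacon contributes "no answer" instead of failing the search.
            return name, None

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(ask, beacons.items()))


def aggregate(responses):
    """True if any beacon said yes, False if some answered but none said yes, else None."""
    answers = [r for r in responses.values() if r is not None]
    if any(answers):
        return True
    return False if answers else None


def broken(query):
    raise TimeoutError("beacon did not respond")


responses = distribute(
    {"chromosome": "13", "position": 32936732, "allele": "C"},
    {"a": lambda q: True, "b": lambda q: False, "c": broken},
)
print(responses, aggregate(responses))
```

Mapping failures to `None` keeps one slow or broken participant from hiding the answers of the others.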

SLIDE 23

Processor

executing a query against a beacon and processing its response
management of a flexible, dynamic and easily extensible query execution pipeline
pipeline stages resolution (CDI and EJB)
L2 parallelization
cross-assembly query handling
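The query execution pipeline can be pictured as an ordered list of stages, each consuming the previous stage's output. The sketch below is in Python with a plain list of functions. The talk resolves stages through CDI and EJB, which this example does not attempt to reproduce. The stage bodies, the `beacon.example.org` host, and the canned response are all placeholders.

```python
import json
from functools import reduce


def run_pipeline(stages, query):
    """Pass the query through each stage in order; one stage's output feeds the next."""
    return reduce(lambda value, stage: stage(value), stages, query)


def convert(query):
    """Converter: translate canonical parameters into a (hypothetical) beacon's naming."""
    chromosome = query["chromosome"]
    if chromosome.startswith("chr"):
        chromosome = chromosome[3:]
    return {"chrom": chromosome, "pos": query["position"], "allele": query["allele"]}


def request(params):
    """Requester: build the request URL (placeholder host and path)."""
    return "https://beacon.example.org/query?chrom={chrom}&pos={pos}&allele={allele}".format(**params)


def fetch(url):
    """Fetcher: stand-in for the network call, a canned beacon that knows one variant."""
    known = "chrom=13&pos=32936732&allele=C"
    return json.dumps({"exists": url.endswith(known)})


def parse(raw):
    """Parser: extract the yes/no answer from the raw response."""
    return json.loads(raw)["exists"]


STAGES = [convert, request, fetch, parse]

print(run_pipeline(STAGES, {"chromosome": "chr13", "position": 32936732, "allele": "C"}))
print(run_pipeline(STAGES, {"chromosome": "chr7", "position": 140453136, "allele": "C"}))
```

Because the pipeline is just data, supporting a beacon with an unusual API means swapping one stage rather than rewriting the whole flow.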

SLIDE 24

Converter

first stage in the query execution pipeline
translating query parameters
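A converter has to bridge the kinds of differences listed earlier for the pre-standard beacons: parameter names, chromosome identifiers, positional base, and assembly notation. The sketch below assumes a canonical query that is 1-based, uses unprefixed chromosome names, and names assemblies in GRC style. The `Dialect` class and its fields are hypothetical and only illustrate the idea.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Dialect:
    """How one (hypothetical) beacon expects its parameters."""
    chromosome_prefix: str = ""                            # e.g. "chr"
    zero_based: bool = False                               # positional base
    assembly_names: dict = field(default_factory=dict)     # canonical name -> beacon's name
    parameter_names: dict = field(default_factory=dict)    # canonical key -> beacon's key


def convert(query, dialect):
    """Translate a canonical query into the parameters a specific beacon understands."""
    translated = {
        "chromosome": dialect.chromosome_prefix + query["chromosome"],
        "position": query["position"] - (1 if dialect.zero_based else 0),
        "allele": query["allele"],
        "assembly": dialect.assembly_names.get(query["assembly"], query["assembly"]),
    }
    return {dialect.parameter_names.get(key, key): value for key, value in translated.items()}


ucsc_style = Dialect(
    chromosome_prefix="chr",
    zero_based=True,
    assembly_names={"GRCh37": "hg19"},
    parameter_names={"chromosome": "chrom", "position": "pos"},
)
canonical = {"chromosome": "7", "position": 140453136, "allele": "C", "assembly": "GRCh37"}
print(convert(canonical, ucsc_style))
```

Keeping the per-beacon differences in a small data object means adding a participant is mostly a matter of describing its dialect.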

SLIDE 25

Requester

second stage in the query execution pipeline
constructing beacon requests based on their URIs and parameters produced by the converters
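Constructing a request from a beacon's URI and the converted parameters could be sketched as follows, here for a GET-style beacon only. The `build_request` helper and the `beacon.example.org` address are assumptions for illustration; the standard-library `urllib.parse` functions do the actual encoding.

```python
from urllib.parse import urlencode, urlsplit, urlunsplit


def build_request(base_uri, params):
    """Append converted parameters to a beacon's base URI as a query string."""
    parts = urlsplit(base_uri)
    # Sort the parameters so the same query always yields the same URL.
    query = urlencode(sorted(params.items()))
    if parts.query:
        # Keep any fixed parameters already present in the beacon's URI.
        query = parts.query + "&" + query
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, ""))


url = build_request(
    "https://beacon.example.org/query",
    {"chrom": "13", "pos": 32936732, "allele": "C"},
)
print(url)
```

A deterministic URL is convenient for logging and for any caching placed in front of the fetcher.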

SLIDE 26

Fetcher

third stage in the query execution pipeline
unit actually talking to the API of beacons
submitting requests over the network and obtaining the raw response

SLIDE 27

Parser

last stage in the pipeline
extracting information of interest from the raw response obtained by a fetcher
dealing with various formats
handling metadata, multiple responses, errors
response normalization
parallelized
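Response normalization can be illustrated with a parser that maps several raw formats onto the true/false/null answer used by the later versions of the standard. The formats handled here (a JSON object with an `exists` or `response` key, a bare JSON boolean, or plain "yes"/"no" text) are invented for the example and are not a catalogue of real beacon responses.

```python
import json


def parse_response(raw):
    """Normalize a raw beacon response to True, False, or None (no usable answer)."""
    text = raw.strip()
    try:
        payload = json.loads(text)
    except ValueError:
        payload = text  # not JSON, treat it as plain text

    if isinstance(payload, dict):
        if "error" in payload:
            return None
        for key in ("exists", "response"):  # hypothetical field names
            if key in payload:
                payload = payload[key]
                break
        else:
            return None

    if isinstance(payload, bool):
        return payload
    if isinstance(payload, str):
        lowered = payload.strip().lower()
        if lowered in ("yes", "true"):
            return True
        if lowered in ("no", "false"):
            return False
    return None


for raw in ('{"exists": true}', '{"response": "no"}', "YES", '{"error": "bad request"}', "<html>"):
    print(repr(raw), "->", parse_response(raw))
```

Anything the parser cannot interpret becomes `None`, so malformed output from one beacon is reported as "no answer" instead of being mistaken for a "no".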

SLIDE 28

Mapper

translation between different representations of objects

SLIDE 29

REST

handling client requests
data serialization

SLIDE 30

Search execution

SLIDE 31

Stats

SLIDE 32

Size

100 installations
40 institutions
18 countries
6 continents

SLIDE 33

Users

13k users
136 countries

SLIDE 34

Searches

SLIDE 35

Assemblies

GRCh37 83%
GRCh38 6%
Others 11%

SLIDE 36

Chromosomes

Chr. 2 18%
Chr. 17 14%
Chr. 1 11%
Chr. 13 11%
Chr. 7 7%
Others 39%

SLIDE 37

Variants

84k distinct mutations

2 : 38938 C (FAM110C) 6%
13 : 32936732 C (BRCA2) 6%
22 : 46546565 A (PPARA) 3%
2 : 45895 G (FAM110C) 3%
7 : 140453136 C (BRAF) 2%
1 : 43815163 C (MPL) 2%
1 : 115258747 A (NRAS) 1%
14 : 23894969 A (MYH7) 1%
2 : 29432776 C (ALK) 1%
2 : 212289100 C (ERBB4) 1%
Others 74%

SLIDE 38

Deleteriousness

SIFT (Sorting Intolerant From Tolerant): 69% damaging, 31% tolerated
PolyPhen-2 HDIV (Polymorphism Phenotyping v2): 55% probably damaging, 22% possibly damaging, 23% benign

[Two histograms, one per predictor: number of variants (axis from 1 to 1000000) by score (0.00 to 0.98)]

SLIDE 39

Rarity

25% rare variants (1000 Genomes Project)

[Histogram: number of variants (axis from 1 to 10000) by allele frequency (0.00 to 0.99)]

SLIDE 40

Genes

Rank  Symbol   Name                                              Share
1     FAM110C  Family With Sequence Similarity 110 Member C      11%
2     BRCA1    BRCA1, DNA Repair Associated                      10%
3     BRCA2    BRCA2, DNA Repair Associated                      9%
4     PPARA    Peroxisome Proliferator Activated Receptor Alpha  4%
5     ERBB4    Erb-B2 Receptor Tyrosine Kinase 4                 3%
6     BRAF     B-Raf Proto-Oncogene, Serine/Threonine Kinase     3%
7     MPL      MPL Proto-Oncogene, Thrombopoietin Receptor       2%
8     MYH7     Myosin Heavy Chain 7                              2%
9     KIT      KIT Proto-Oncogene Receptor Tyrosine Kinase       1%
10    RET      Ret Proto-Oncogene                                1%
      Others                                                     53%

SLIDE 41

Disorders & clinical abnormalities

Rank  OMIM                                      HPO
1     Pancreatic cancer, susceptibility to, 4   Autosomal dominant inheritance
2     Breast-ovarian cancer, familial, 1        Autosomal recessive inheritance
3     Fanconi anemia, complementation group D1  Scoliosis
4     Prostate cancer                           Short stature
5     Pancreatic cancer 2                       Cognitive impairment
6     Medulloblastoma                           Constipation
7     Glioblastoma 3                            Somatic mutation
8     Breast-ovarian cancer, familial, 2        Cafe-au-lait spot
9     Breast cancer, male, susceptibility to    Failure to thrive
10    Wilms tumor                               Nausea and vomiting

SLIDE 42

Questions?

https://mirocupak.com