How we built a global search engine for genetic data




SLIDE 1

How we built a global search engine for genetic data

Miro Cupak (@mirocupak)
VP Engineering, DNAstack
13/06/2018

SLIDE 2

What and why?

  • Beacon Network
  • https://beacon-network.org/
  • largest search and discovery engine of human genetic mutations
  • from the Global Alliance for Genomics & Health (GA4GH)
  • case study
      • problem
      • standard
      • architecture
      • technologies
      • fun with stats

SLIDE 3

Background

SLIDE 4

https://beacon-network.org

SLIDE 5

https://beacon-network.org

SLIDE 6

Trends

sequencing cost decreasing exponentially (3M times since 2000)

https://www.nature.com/news/technology-the-1-000-genome-1.14901

SLIDE 7

Trends

genomic data volume increasing exponentially (1M times since 2000)

http://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1002195

SLIDE 8

Trends

[Chart: Data volumes by 2025 (GB), lower and upper bound, for Twitter, YouTube, and genomics]

up to 2 billion human genomes sequenced in the next 10 years (more data annually than uploaded to Twitter and YouTube)

http://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1002195

SLIDE 9

Problem

  • no single institution will have sufficient resources
  • still, institutions don’t have enough data
      • common diseases
      • rare diseases
  • challenge
      • discovering data
  • solution
      • traditional approach of data aggregation in a single centralized site not working
      • federated system capable of executing cross-dataset and cross-institution queries is needed

SLIDE 10

GA4GH & Beacon Project

GA4GH

  • nonprofit standards organization
  • a coalition of over 500 leading institutions working in health care, research, disease advocacy, life science, and information technology
  • goal: enable responsible sharing of genomic and clinical data
  • established in 2013

Beacon Project

  • experiment to test the willingness of international sites to share genetic data in the simplest of all technical contexts
  • initiative requiring collaboration of many different GA4GH groups
  • started in 2014 and quickly gained traction

http://ga4gh.org/
https://www.broadinstitute.org/files/news/pdfs/GAWhitePaperJune3.pdf
https://beacon-project.io/

SLIDE 11

Beacon

SLIDE 12

Beacon

  • simple web service allowing users to query an institution’s databases to determine whether they contain a genetic variant of interest
  • receives questions of the form “Do you have information about this mutation?”
  • responds with yes or no, optionally with additional information about the mutation
  • design principles
      • A beacon has to be technically simple.
      • A beacon has to minimize risks associated with genomic data sharing.
      • It has to be possible to make a beacon publicly available.
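
The yes/no contract of a beacon can be illustrated with a minimal sketch. It assumes an in-memory set of variants; the `Variant` tuple, the `Beacon` class, and the sample database are hypothetical and not taken from any Beacon specification. The queried position is one of the variants listed later in the deck (13 : 32936732 C, BRCA2).

```python
from typing import NamedTuple


class Variant(NamedTuple):
    """A single-base mutation: which allele is seen at which position."""
    assembly: str      # reference genome build, e.g. "GRCh37"
    chromosome: str    # e.g. "13"
    position: int      # coordinate on the chromosome
    allele: str        # observed base, e.g. "C"


class Beacon:
    """Answers 'do you have this mutation?' without exposing any records."""

    def __init__(self, variants):
        self._variants = set(variants)

    def query(self, variant: Variant) -> bool:
        # Only a yes/no leaves the institution, never the underlying data.
        return variant in self._variants


# Hypothetical content of one institution's database.
beacon = Beacon([Variant("GRCh37", "13", 32936732, "C")])

print(beacon.query(Variant("GRCh37", "13", 32936732, "C")))  # True
print(beacon.query(Variant("GRCh37", "13", 32936732, "G")))  # False
```

Keeping the answer to a single boolean is what makes it plausible to expose such a service publicly.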

SLIDE 13

Standard: Before Beacon Network

  • no formal specification
  • receives questions of the form “Do you have information about this mutation?”
  • responds with yes or no
  • 4 public beacons, each API different
      • request method
      • supported parameters
      • parameter names
      • chromosome identifiers
      • positional base
      • assembly notation
      • supported alleles
      • dataset support
      • response format
      • data included in the response
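
Three of the differences listed above (chromosome identifiers, positional base, assembly notation) lend themselves to a short normalization sketch. The function names and the alias table are assumptions made for illustration, and treating hg19 as GRCh37 and hg38 as GRCh38 is a common simplification rather than something the slides state.

```python
# Alias table mapping UCSC-style and GRC-style build names to one notation.
ASSEMBLY_ALIASES = {
    "grch37": "GRCh37", "hg19": "GRCh37",
    "grch38": "GRCh38", "hg38": "GRCh38",
}


def normalize_chromosome(raw: str) -> str:
    """Map 'chr1', 'Chr1', '1', 'chrx' and similar to a bare upper-case identifier."""
    value = raw.strip()
    if value.lower().startswith("chr"):
        value = value[3:]
    return value.upper()


def normalize_assembly(raw: str) -> str:
    """Map a build name in either notation onto the GRC-style name."""
    key = raw.strip().lower()
    if key not in ASSEMBLY_ALIASES:
        raise ValueError(f"unknown assembly: {raw!r}")
    return ASSEMBLY_ALIASES[key]


def to_zero_based(position: int, one_based: bool) -> int:
    """Express a coordinate in 0-based form whatever the caller's convention."""
    return position - 1 if one_based else position


print(normalize_chromosome("chr17"))            # 17
print(normalize_assembly("hg19"))               # GRCh37
print(to_zero_based(32936732, one_based=True))  # 32936731
```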
SLIDE 14

Standard: Before Beacon Network

SLIDE 15

Standard: 0.1

  • 2014
  • really simple (2 records)
  • true/false response
  • format: Avro
  • not enough traction
  • too vague
  • issues partially addressed by the Beacon Network

SLIDE 16

Standard: 0.2

  • 2015
  • true/false/overlap/null response
  • datasets
  • simple data use conditions
  • self description
  • format: Avro
  • complex (9 records)
  • not well adopted
  • not polished enough

SLIDE 17

Standard: 0.3

  • 2016
  • simplified 0.2
  • based on real needs, successful
  • true/false/null response
  • data model improvements, extended metadata and response, improved support for datasets and cross-dataset queries, data versioning
  • modular and extensible
  • tooling
  • format: Avro → Proto3

SLIDE 18

Standard: 0.4

  • 2018
  • stable and more flexible
  • support for more complex mutations
  • improved error handling
  • improved data use conditions
  • various minor improvements
  • developer experience
  • format: Proto3 → OpenAPI

SLIDE 19

Beacon Network

SLIDE 20

Architecture

SLIDE 21

Data

  • access data stored in a relational database
SLIDE 22

Service

  • communication with other subsystems
  • query normalization
  • aggregators
  • participant resolution
  • query distribution
  • audit trail
  • L1 parallelization
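
Query distribution and first-level parallelization might look roughly like the sketch below. It assumes that "L1" means fanning one query out across beacons concurrently; `distribute`, `aggregate`, and the three toy beacons are hypothetical names, and the thread pool stands in for whatever concurrency mechanism the real service uses.

```python
from concurrent.futures import ThreadPoolExecutor


def distribute(query, beacons, max_workers=8):
    """Send one query to every participating beacon in parallel.

    `beacons` maps a beacon name to a callable that takes the query and
    returns True, False, or None (no answer).
    """
    def ask(item):
        name, beacon = item
        try:
            return name, beacon(query)
        except Exception:
            # One failing beacon must not take down the whole search.
            return name, None

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(ask, beacons.items()))


def aggregate(responses):
    """Collapse per-beacon answers: True if any says yes, None if nobody answered."""
    answers = [r for r in responses.values() if r is not None]
    if not answers:
        return None
    return any(answers)


def broken(query):
    raise TimeoutError("beacon unreachable")


beacons = {"a": lambda q: True, "b": lambda q: False, "c": broken}
responses = distribute("13:32936732:C", beacons)
print(responses)             # {'a': True, 'b': False, 'c': None}
print(aggregate(responses))  # True
```

Isolating failures per beacon matters in a federated setting, since any participant can be slow or offline at query time.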
SLIDE 23

Processor

  • executing a query against a beacon and processing its response
  • management of a flexible, dynamic and easily extensible query execution pipeline
  • pipeline stages resolution (CDI and EJB)
  • L2 parallelization
  • cross-assembly query handling
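
The pipeline idea can be sketched as a chain of named stages looked up at run time. The slide says stages are resolved through CDI and EJB; the dictionary registry below is only a stand-in for that container lookup, and every stage body, the example URL, and the faked reply are assumptions for illustration.

```python
from typing import Callable, Dict, List

# Registry standing in for container-managed lookup of stage implementations.
STAGES: Dict[str, Callable] = {}


def stage(name: str):
    """Register a function as a named pipeline stage."""
    def register(func):
        STAGES[name] = func
        return func
    return register


@stage("converter")
def convert(query):
    # Translate to the (hypothetical) beacon's 0-based parameters.
    return {"chrom": query["chromosome"], "pos": query["position"] - 1}


@stage("requester")
def build_request(params):
    return "https://beacon.example.org/query?chrom={chrom}&pos={pos}".format(**params)


@stage("fetcher")
def fetch(url):
    # A real fetcher would go over the network; this one fakes the reply.
    return "YES" if url.endswith("pos=99") else "NO"


@stage("parser")
def parse(raw):
    return raw.strip().upper() == "YES"


def run_pipeline(query, stage_names: List[str]):
    """Feed the output of each stage into the next one."""
    value = query
    for name in stage_names:
        value = STAGES[name](value)
    return value


ORDER = ["converter", "requester", "fetcher", "parser"]
result = run_pipeline({"chromosome": "1", "position": 100}, ORDER)
print(result)  # True
```

Because the stage list is data, a beacon with an unusual API can be given a different converter or parser without touching the rest of the chain.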
SLIDE 24

Converter

  • first stage in the query execution pipeline
  • translating query parameters
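
Parameter translation could be driven by a per-beacon description of what that beacon expects. The `Dialect` record, its field names, and the example values below are hypothetical; the sketch assumes the network keeps queries in one canonical form (bare chromosome name, 0-based position) and converts outward.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dialect:
    """How one beacon expects its parameters (all values hypothetical)."""
    chromosome_param: str
    position_param: str
    allele_param: str
    chromosome_prefix: str   # "" or "chr"
    one_based: bool          # positional base used by the beacon


def convert(query: dict, dialect: Dialect) -> dict:
    """Translate a canonical query (0-based position, bare chromosome)."""
    position = query["position"] + 1 if dialect.one_based else query["position"]
    return {
        dialect.chromosome_param: dialect.chromosome_prefix + query["chromosome"],
        dialect.position_param: str(position),
        dialect.allele_param: query["allele"],
    }


canonical = {"chromosome": "13", "position": 32936731, "allele": "C"}
prefixed = Dialect("chrom", "pos", "allele", "chr", one_based=True)
bare = Dialect("chromosome", "position", "alt", "", one_based=False)

print(convert(canonical, prefixed))
# {'chrom': 'chr13', 'pos': '32936732', 'allele': 'C'}
print(convert(canonical, bare))
# {'chromosome': '13', 'position': '32936731', 'alt': 'C'}
```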
SLIDE 25

Requester

  • second stage in the query execution pipeline
  • constructing beacon requests based on their URIs and parameters produced by the converters
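
Request construction amounts to combining a beacon's base URI with the converted parameters. The sketch below assumes a GET-style beacon that takes its input as a query string; `build_request` and the example URI are hypothetical.

```python
from urllib.parse import urlencode, urlsplit, urlunsplit


def build_request(base_uri: str, params: dict) -> str:
    """Attach converted parameters to a beacon's base URI as a query string."""
    parts = urlsplit(base_uri)
    query = urlencode(params)
    if parts.query:
        # Keep any fixed parameters the beacon's URI already carries.
        query = parts.query + "&" + query
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, ""))


url = build_request(
    "https://beacon.example.org/query?format=json",  # hypothetical beacon URI
    {"chrom": "chr13", "pos": "32936732", "allele": "C"},
)
print(url)
# https://beacon.example.org/query?format=json&chrom=chr13&pos=32936732&allele=C
```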

SLIDE 26

Fetcher

  • third stage in the query execution pipeline
  • unit actually talking to the API of beacons
  • submitting requests over the network and obtaining the raw response
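
A fetcher in this sense can be kept very small: send the request, hand back the raw body, and treat any network failure as "no answer". The sketch below uses the standard library's `urllib.request.urlopen`; the `opener` parameter is an assumption added so the example can run offline with a stand-in.

```python
import io
import urllib.request
from typing import Callable, Optional


def fetch(url: str, timeout: float = 5.0,
          opener: Callable = urllib.request.urlopen) -> Optional[str]:
    """Submit the request and return the raw response body, or None on failure."""
    try:
        with opener(url, timeout=timeout) as response:
            return response.read().decode("utf-8", errors="replace")
    except OSError:
        # URLError, HTTPError and socket timeouts are all OSError subclasses.
        return None


# Offline demonstration with stand-in openers; no network involved.
def fake_opener(url, timeout):
    return io.BytesIO(b'{"exists": true}')


def unreachable(url, timeout):
    raise OSError("beacon unreachable")


body = fetch("https://beacon.example.org/query", opener=fake_opener)
missing = fetch("https://beacon.example.org/query", opener=unreachable)
print(body)     # {"exists": true}
print(missing)  # None
```

Returning the body untouched keeps format knowledge out of this stage and leaves interpretation to the parser.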
SLIDE 27

Parser

  • last stage in the pipeline
  • extracting information of interest from the raw response obtained by a fetcher
  • dealing with various formats
  • handling metadata, multiple responses, errors
  • response normalization
  • parallelized
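
Response normalization can be sketched as a function that accepts whatever a beacon returned and reduces it to true, false, or null. The plain-text words and the JSON keys (`exists`, `response`) checked below are guesses at plausible formats, not a list of what real beacons send.

```python
import json
from typing import Optional


def parse(raw: Optional[str]) -> Optional[bool]:
    """Normalize a raw beacon reply to True, False, or None (no usable answer)."""
    if raw is None:
        return None
    text = raw.strip()
    # Plain-text beacons: a bare yes/no style word.
    word = text.lower()
    if word in ("yes", "true", "1"):
        return True
    if word in ("no", "false", "0"):
        return False
    # JSON beacons: look for a boolean under a few plausible keys.
    try:
        payload = json.loads(text)
    except ValueError:
        return None
    if isinstance(payload, dict):
        for key in ("exists", "response"):
            value = payload.get(key)
            if isinstance(value, dict):
                # Nested form, e.g. {"response": {"exists": true}}.
                value = value.get("exists")
            if isinstance(value, bool):
                return value
    return None


print(parse("YES\n"))                            # True
print(parse('{"exists": false}'))                # False
print(parse('{"response": {"exists": true}}'))   # True
print(parse("<html>error</html>"))               # None
```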
SLIDE 28

Mapper

  • translation between different representations of objects
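
A mapper of this kind typically converts between an internal object and the shape exposed to clients. The sketch below is one possible reading; `BeaconResponseEntity`, its fields, and the client-facing keys are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BeaconResponseEntity:
    """Internal representation (hypothetical field names)."""
    beacon_id: str
    exists: Optional[bool]
    raw_payload: str          # kept for auditing, not for clients


def to_dto(entity: BeaconResponseEntity) -> dict:
    """Map the internal object onto the shape exposed to API clients."""
    return {"beacon": entity.beacon_id, "response": entity.exists}


def from_dto(dto: dict) -> BeaconResponseEntity:
    """Map a client-facing object back; the raw payload is not recoverable."""
    return BeaconResponseEntity(dto["beacon"], dto["response"], raw_payload="")


entity = BeaconResponseEntity("b1", True, raw_payload='{"exists": true}')
dto = to_dto(entity)
print(dto)  # {'beacon': 'b1', 'response': True}
```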
SLIDE 29

REST

  • handling client requests
  • data serialization
SLIDE 30

Search execution

SLIDE 31

Stats

SLIDE 32

Size

  • 100 installations
  • 40 institutions
  • 18 countries
  • 6 continents

SLIDE 33

Users

  • 13k users
  • 136 countries
SLIDE 34

Searches

SLIDE 35

Assemblies

  • GRCh37: 83%
  • GRCh38: 6%
  • Others: 11%

SLIDE 36

Chromosomes

  • Chr. 2: 18%
  • Chr. 17: 14%
  • Chr. 1: 11%
  • Chr. 13: 11%
  • Chr. 7: 7%
  • Others: 39%

SLIDE 37

Variants

  • 84k distinct mutations

Variant (gene)                 Share
2 : 38938 C (FAM110C)          6%
13 : 32936732 C (BRCA2)        6%
22 : 46546565 A (PPARA)        3%
2 : 45895 G (FAM110C)          3%
7 : 140453136 C (BRAF)         2%
1 : 43815163 C (MPL)           2%
1 : 115258747 A (NRAS)         1%
14 : 23894969 A (MYH7)         1%
2 : 29432776 C (ALK)           1%
2 : 212289100 C (ERBB4)        1%
Others                         74%
SLIDE 38

Deleteriousness

[Histograms: number of variants (log scale) by score, one each for SIFT and PolyPhen-2 HDIV]

  • SIFT (Sorting Intolerant From Tolerant): 69% damaging, 31% tolerated
  • PolyPhen-2 HDIV (Polymorphism Phenotyping v2): 55% probably damaging, 22% possibly damaging, 23% benign
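
The categories above come from thresholding each predictor's score. The slides do not say which cut-offs were used, so the values below are an assumption: they are the commonly cited conventions (SIFT damaging at or below 0.05; PolyPhen-2 HDIV probably damaging from 0.957, possibly damaging from 0.453), and the function names are hypothetical.

```python
def classify_sift(score: float) -> str:
    """SIFT: low scores mean the substitution is predicted not to be tolerated."""
    # Assumed cut-off: 0.05 (commonly cited convention).
    return "damaging" if score <= 0.05 else "tolerated"


def classify_polyphen2_hdiv(score: float) -> str:
    """PolyPhen-2 HDIV: high scores mean a more confident damaging prediction."""
    # Assumed cut-offs: 0.957 and 0.453 (commonly cited conventions).
    if score >= 0.957:
        return "probably damaging"
    if score >= 0.453:
        return "possibly damaging"
    return "benign"


print(classify_sift(0.01))            # damaging
print(classify_sift(0.50))            # tolerated
print(classify_polyphen2_hdiv(0.99))  # probably damaging
print(classify_polyphen2_hdiv(0.70))  # possibly damaging
print(classify_polyphen2_hdiv(0.10))  # benign
```

Note that the two scales run in opposite directions, which is why the two histograms cannot be read the same way.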

SLIDE 39

Rarity

  • 25% rare variants (1,000 Genomes Project)

[Histogram: number of variants (log scale) by allele frequency, 0.00 to 0.99]

SLIDE 40

Genes

Rank  Symbol   Name                                              Share
1     FAM110C  Family With Sequence Similarity 110 Member C      11%
2     BRCA1    BRCA1, DNA Repair Associated                      10%
3     BRCA2    BRCA2, DNA Repair Associated                      9%
4     PPARA    Peroxisome Proliferator Activated Receptor Alpha  4%
5     ERBB4    Erb-B2 Receptor Tyrosine Kinase 4                 3%
6     BRAF     B-Raf Proto-Oncogene, Serine/Threonine Kinase     3%
7     MPL      MPL Proto-Oncogene, Thrombopoietin Receptor       2%
8     MYH7     Myosin Heavy Chain 7                              2%
9     KIT      KIT Proto-Oncogene Receptor Tyrosine Kinase       1%
10    RET      Ret Proto-Oncogene                                1%
      Others                                                     53%

SLIDE 41

Disorders & clinical abnormalities

Rank  OMIM                                       HPO
1     Pancreatic cancer, susceptibility to, 4    Autosomal dominant inheritance
2     Breast-ovarian cancer, familial, 1         Autosomal recessive inheritance
3     Fanconi anemia, complementation group D1   Scoliosis
4     Prostate cancer                            Short stature
5     Pancreatic cancer 2                        Cognitive impairment
6     Medulloblastoma                            Constipation
7     Glioblastoma 3                             Somatic mutation
8     Breast-ovarian cancer, familial, 2         Cafe-au-lait spot
9     Breast cancer, male, susceptibility to     Failure to thrive
10    Wilms tumor                                Nausea and vomiting

SLIDE 42

Questions?

https://mirocupak.com