Discovering Adenoid Cystic Carcinoma Biomarkers Using a Purpose-Built Hypergraph Database and Link Prediction


slide-1
SLIDE 1

Discovering Adenoid Cystic Carcinoma Biomarkers Using a Purpose-Built Hypergraph Database and Link Prediction

SYSTEMS IMAGINATION, INC. PIETER DERDEYN CHRIS YOO, PH.D.

slide-2
SLIDE 2

Mapping Big Data

slide-3
SLIDE 3

Maps throughout history…

slide-4
SLIDE 4

How do we map cancer?

slide-5
SLIDE 5

Dr. Gerhard Michal, editor of the Roche Biochemical Pathways

  • Data management: linking data into a useful framework
  • Interpretation of the meaning of the data in context
  • Dr. Michal: 40 years of biochemistry in one map

slide-6
SLIDE 6

A Hypergraph Map of Cancer

Represent the data as knowledge: what is the best way?

[Figure: hypergraph with nodes MYB, NFIB, MYB–NFIB fusion, ACC, and GO:0008150 (biological process)]
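A minimal sketch of how a hypergraph could be held in memory, where one hyperedge can join more than two nodes. The `Hypergraph` class, the hyperedge names, and the membership of each hyperedge are illustrative assumptions, not the schema of the database described in the talk.

```python
from collections import defaultdict


class Hypergraph:
    """Toy hypergraph: each hyperedge is a named set of nodes (hypothetical design)."""

    def __init__(self):
        self.hyperedges = {}                # hyperedge name -> frozenset of member nodes
        self.incidence = defaultdict(set)   # node -> names of hyperedges containing it

    def add_hyperedge(self, name, nodes):
        members = frozenset(nodes)
        self.hyperedges[name] = members
        for node in members:
            self.incidence[node].add(name)

    def neighbors(self, node):
        """All nodes that share at least one hyperedge with `node`."""
        result = set()
        for name in self.incidence.get(node, ()):
            result |= self.hyperedges[name]
        result.discard(node)
        return result


graph = Hypergraph()
# Assumed membership: the fusion hyperedge joins both genes and the cancer type.
graph.add_hyperedge("MYB-NFIB fusion", ["MYB", "NFIB", "ACC"])
# Assumed annotation linking a gene to a Gene Ontology term.
graph.add_hyperedge("MYB annotation", ["MYB", "GO:0008150"])

print(sorted(graph.neighbors("MYB")))
```

Storing the node-to-hyperedge incidence alongside the hyperedges keeps neighbor lookups cheap, at the cost of some duplicated bookkeeping.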

slide-7
SLIDE 7

Populating the hypergraph

  • Multiple sources
  • Requires harmonization

slide-8
SLIDE 8

Use Case: Adenoid cystic carcinoma (ACC)

  • Rare (~1,200 cases per year in the US)
  • The majority of ACC cases display activation of MYB, commonly through a genomic translocation event with NFIB; both are transcription factors
  • Initial prognosis with surgery is good (5-year: 89%), but long-term follow-up indicates aggressive recurrence (15-year: 40%)
  • What data can be examined to find hypotheses that explain these results?

slide-9
SLIDE 9

Target: Gene fusions

  • Hybrid gene formed from two previously separate genes (Wikipedia)
  • Are often oncogenes because they lead to much more active abnormal proteins than normal genes
  • MYB+MYB1+NFIB

slide-10
SLIDE 10

Gene fusions – Data Sources

slide-11
SLIDE 11

Link Prediction

For a given pair of nodes, we would like to predict whether a certain edge type connects them. For example, what is the likelihood that Lily works at Systems Imagination?

[Figure: example network with scientists Ryan, Lily, Li, and Jane; institutions University of Arizona and Systems Imagination; edge types "Has written a paper with" and "Works at"]

slide-12
SLIDE 12

Link Prediction

Train a supervised learning model using topological features like:

  • Path counts
  • Metapath counts

[Figure: example network with Ryan, Lily, Li, Jane, University of Arizona, and Systems Imagination]
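A small sketch of a metapath count as a topological feature. The node names come from the slide, but the edges in `EDGES` are invented for illustration (the figure's actual edges are not recoverable), and `build_adjacency` and `metapath_count` are hypothetical helper names.

```python
from collections import defaultdict

# Invented toy edges: (source, edge type, target).
EDGES = [
    ("Lily", "has_written_a_paper_with", "Ryan"),
    ("Lily", "has_written_a_paper_with", "Li"),
    ("Lily", "has_written_a_paper_with", "Jane"),
    ("Ryan", "works_at", "Systems Imagination"),
    ("Li", "works_at", "Systems Imagination"),
    ("Jane", "works_at", "University of Arizona"),
]


def build_adjacency(edges):
    """Index edges as edge type -> node -> set of neighbors (treated as undirected)."""
    adj = defaultdict(lambda: defaultdict(set))
    for src, edge_type, dst in edges:
        adj[edge_type][src].add(dst)
        adj[edge_type][dst].add(src)
    return adj


def metapath_count(adj, source, target, metapath):
    """Count paths from source to target whose edge types follow `metapath`.

    Paths that revisit a node are skipped.
    """
    def walk(node, step, visited):
        if step == len(metapath):
            return 1 if node == target else 0
        total = 0
        for nxt in adj[metapath[step]].get(node, ()):
            if nxt not in visited:
                total += walk(nxt, step + 1, visited | {nxt})
        return total

    return walk(source, 0, {source})


ADJ = build_adjacency(EDGES)
METAPATH = ["has_written_a_paper_with", "works_at"]
print(metapath_count(ADJ, "Lily", "Systems Imagination", METAPATH))
print(metapath_count(ADJ, "Lily", "University of Arizona", METAPATH))
```

In this toy network Lily reaches Systems Imagination through two co-authors and University of Arizona through one, so the counts could serve as features for a supervised model scoring the "Works at" edge.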

slide-13
SLIDE 13

Link Prediction

Network Schema: a representation of all node types (metanodes) and the edge types (metaedges) between them

[Figure: schema with metanodes Scientist and Institution; metaedges "Has written a paper with" and "Works at"]
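One way to picture the schema is as a lookup from each metaedge to the pair of metanodes it may connect. This is a sketch only: the dictionary layout, the `NODE_TYPES` assignments, and `edge_is_valid` are assumptions for illustration.

```python
# Metaedge -> (source metanode, target metanode), taken from the slide's example.
METAEDGES = {
    "has_written_a_paper_with": ("Scientist", "Scientist"),
    "works_at": ("Scientist", "Institution"),
}

# Assumed node typing for the example network.
NODE_TYPES = {
    "Ryan": "Scientist",
    "Lily": "Scientist",
    "Li": "Scientist",
    "Jane": "Scientist",
    "University of Arizona": "Institution",
    "Systems Imagination": "Institution",
}


def edge_is_valid(source, edge_type, target):
    """Return True if the edge's endpoints match the types the schema allows."""
    if edge_type not in METAEDGES:
        return False
    source_type, target_type = METAEDGES[edge_type]
    return (NODE_TYPES.get(source) == source_type
            and NODE_TYPES.get(target) == target_type)


print(edge_is_valid("Lily", "works_at", "Systems Imagination"))
print(edge_is_valid("Systems Imagination", "works_at", "Lily"))
```

A check like this restricts candidate links to type-compatible pairs before any features are computed.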

slide-14
SLIDE 14

Link Prediction – Gene Fusions

[Figure: network schema with metanodes Gene, Gene Fusion, Cancer, and Biological Process; metaedges "Member of", "Related to", "Controls expression", and "Participates"]

slide-15
SLIDE 15

Link Prediction – Gene Fusions

[Figure: example subgraph with nodes MYB, NFIB, MYB-NFIB Gene Fusion, SIM1, BRK1, ACC, BP1, and BP2]

slide-16
SLIDE 16

Hyperedge Prediction – Gene Fusions

[Figure: example subgraph with nodes SIM1, BRK1, MYB, NFIB, MYB-NFIB Gene Fusion, SIM1-BRK1 Gene Fusion, ACC, BP1, and BP2]

slide-17
SLIDE 17

Mining Heterogeneous Information Networks

             Hetionet                    Cancer Research Hypergraph
Source       Daniel Himmelstein et al.   Systems Imagination, Inc.
Nodes        47,000                      695,464
Metanodes    11                          16
Edges        2,250,000                   12,007,912
Metaedges    24                          41

(Hetionet figures: Himmelstein et al.)

slide-18
SLIDE 18

Nodes and Edges

slide-19
SLIDE 19

Paths

Predictions

slide-20
SLIDE 20

Gene Fusion Prediction Pipeline

For a given pair of genes, are they in a gene fusion or not?

  • Dataset: Cancer Research Hypergraph Database
  • Features: DWPC (degree-weighted path count), degrees of nodes, prior likelihood of gene fusion
  • Supervised learning models: Random Forest, Logistic Regression, Decision Trees, XGBoost, Neural Networks
  • Model interpretation: assess predictions, feature analysis
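A sketch of the degree-weighted path count (DWPC) feature as it is usually defined: each path along a metapath contributes the product of its edges' endpoint degrees raised to a negative damping exponent, and the contributions are summed. The toy edges, the `dwpc` function name, and the damping value of 0.5 are assumptions for illustration, not the pipeline's actual code or settings.

```python
from collections import defaultdict


def dwpc(edges, source, target, metapath, damping=0.5):
    """Degree-weighted path count between source and target along a metapath.

    Degrees are counted per edge type; paths that revisit a node are skipped.
    With damping=0 this reduces to a plain path count.
    """
    adj = defaultdict(lambda: defaultdict(set))
    for src, edge_type, dst in edges:
        adj[edge_type][src].add(dst)
        adj[edge_type][dst].add(src)

    def degree(node, edge_type):
        return len(adj[edge_type].get(node, ()))

    def walk(node, step, visited, weight):
        if step == len(metapath):
            return weight if node == target else 0.0
        edge_type = metapath[step]
        total = 0.0
        for nxt in adj[edge_type].get(node, ()):
            if nxt in visited:
                continue
            # Down-weight edges that touch high-degree (less specific) nodes.
            edge_weight = (degree(node, edge_type) * degree(nxt, edge_type)) ** -damping
            total += walk(nxt, step + 1, visited | {nxt}, weight * edge_weight)
        return total

    return walk(source, 0, {source}, 1.0)


# Invented toy edges between genes and biological processes.
TOY_EDGES = [
    ("SIM1", "participates", "BP1"),
    ("BRK1", "participates", "BP1"),
    ("MYB", "participates", "BP1"),
    ("SIM1", "participates", "BP2"),
    ("BRK1", "participates", "BP2"),
]
GENE_BP_GENE = ["participates", "participates"]

print(dwpc(TOY_EDGES, "SIM1", "BRK1", GENE_BP_GENE))
print(dwpc(TOY_EDGES, "SIM1", "BRK1", GENE_BP_GENE, damping=0.0))
```

In the toy data the path through BP2 counts for more than the path through BP1, because BP1 is shared by three genes and is therefore less specific evidence that SIM1 and BRK1 are related.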

slide-21
SLIDE 21

Challenges

  • Data integration: integrating data from dozens of sources and converting between 3 different formats
  • Feature computation: 10 times the data, 100 times the computational cost

slide-22
SLIDE 22

Strategies

NVIDIA DGX

  • 40 CPUs
  • 256 GB RAM
  • 4 x Tesla V100 GPUs (64GB memory total)

Can do production level computation locally

slide-23
SLIDE 23

Results

Systems Imagination Benchmarking

[Chart: Neural Net Training on NVIDIA DGX. Time spent (hours), axis ticks from 5 to 25, for 1 GPU versus the NVIDIA DGX]

  • Dense NN built with mxnet and keras
  • 7 hidden layers with 200-700 neurons each
  • 33,658,931 rows of data
  • 18 features
  • 6 classes

slide-24
SLIDE 24

Strategies

Multi-processing: 3 lines of Python code sped processing up by 6 times

GPU acceleration:

  • Accelerated NumPy computations by 10 times by moving to CuPy
  • Accelerated deep learning by 20 times by using mxnet on the NVIDIA DGX

Profiling and debugging code: what is the bottleneck, and how can I relieve it?

  • Rabbit hole: optimizing not just the code, but also the time spent developing and running the code
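The slides do not show the three lines of multi-processing code. The sketch below is one common way to get this kind of speedup with the standard library's `multiprocessing.Pool`; `parallel_map` is a hypothetical helper, and `math.sqrt` stands in for whatever per-item feature computation was actually parallelized.

```python
import math
import multiprocessing as mp


def parallel_map(func, items, workers=4):
    """Apply func to every item using a pool of worker processes."""
    # Prefer the "fork" start method where it exists (an assumption made so the
    # sketch also runs from an interactive session); otherwise use the default.
    method = "fork" if "fork" in mp.get_all_start_methods() else None
    ctx = mp.get_context(method)
    # The pool creation and the map call are the core of the change.
    with ctx.Pool(processes=workers) as pool:
        return pool.map(func, items)


if __name__ == "__main__":
    print(parallel_map(math.sqrt, [1.0, 4.0, 9.0, 16.0], workers=2))
```

The function passed in must be picklable, for example defined at module level, and the achievable speedup depends on how much work each item carries relative to the cost of shipping it to a worker.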

slide-25
SLIDE 25

Results

Gene 1     Gene 2     Probability of Gene Fusion
EWSR1      HMGA2      0.929894619
BBS9       KMT2A      0.928350421
IQCJ       KMT2A      0.927711269
CYP11B1    KMT2A      0.926647616
KMT2A      TCIRG1     0.912818918
KMT2A      VEPH1      0.911844923
CCR6       KMT2A      0.873986434
KCNQ1      KMT2A      0.834963505
ACSL1      KMT2A      0.834097153
EWSR1      RUNX1T1    0.832868597

slide-26
SLIDE 26

Results

Predictions

slide-27
SLIDE 27

Results

Predictions

slide-28
SLIDE 28

The Team