Discovering Adenoid Cystic Carcinoma Biomarkers Using a Purpose-Built Hypergraph Database and Link Prediction


Jul 16, 2023


SLIDE 1

Discovering Adenoid Cystic Carcinoma Biomarkers Using a Purpose-Built Hypergraph Database and Link Prediction

SYSTEMS IMAGINATION, INC. PIETER DERDEYN CHRIS YOO, PH.D.

SLIDE 2

Mapping Big Data

SLIDE 3

Maps throughout history…

SLIDE 4

How do we map cancer?

SLIDE 5

Dr. Gerhard Michal, Editor of the Roche Biochemical Pathways

Data management - linking data into a useful framework
Interpretation of the meaning of the data in context
Dr. Michal: 40 years of biochemistry in one map

SLIDE 6

A Hypergraph Map of Cancer

Represent the data as knowledge – what’s the best way?

Figure (Hypergraph): nodes MYB, NFIB, MYB–NFIB Fusion, ACC, GO:0008150 (Biological Process)
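A minimal sketch of the hypergraph idea: an ordinary edge joins exactly two nodes, while a hyperedge can join any number of them. The class below is an illustration only. The hyperedge ids (`fusion_event_1`, `annotation_1`) and the choice of which nodes each one groups are assumptions for this example, not the schema of the actual database.

```python
from collections import defaultdict


class Hypergraph:
    """Toy hypergraph: each hyperedge is a named set of nodes."""

    def __init__(self):
        self.hyperedges = {}               # edge id -> frozenset of member nodes
        self.incidence = defaultdict(set)  # node -> ids of hyperedges containing it

    def add_hyperedge(self, edge_id, nodes):
        members = frozenset(nodes)
        self.hyperedges[edge_id] = members
        for node in members:
            self.incidence[node].add(edge_id)

    def neighbors(self, node):
        """All nodes that share at least one hyperedge with `node`."""
        result = set()
        for edge_id in self.incidence.get(node, ()):
            result |= self.hyperedges[edge_id]
        result.discard(node)
        return result


# Hypothetical groupings, chosen only to mirror the node labels on the slide.
hg = Hypergraph()
hg.add_hyperedge("fusion_event_1", ["MYB", "NFIB", "MYB-NFIB Fusion", "ACC"])
hg.add_hyperedge("annotation_1", ["MYB", "GO:0008150"])
print(sorted(hg.neighbors("MYB")))
```

A single hyperedge here records that two genes, their fusion, and a cancer belong together, which an ordinary graph would have to split into several pairwise edges.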

SLIDE 7

Populating the hypergraph

Multiple sources
Requires harmonization

SLIDE 8

Use Case: Adenoid cystic carcinoma (ACC)

Rare (~1,200 cases/yr in the US)
Majority of ACC cases display activation of MYB, commonly through a genomic translocation event with NFIB; both are transcription factors
Initial prognosis with surgery is good (5-yr: 89%), but long-term follow-up indicates aggressive recurrence (15-yr: 40%)
What data can be examined to find hypotheses to explain these results?

SLIDE 9

Target: Gene fusions

Hybrid gene formed from two previously separate genes (Wikipedia)
Often oncogenes, because they lead to much more active abnormal proteins than the normal genes

MYB+MYB1+NFIB

SLIDE 10

Gene fusions – Data Sources

SLIDE 11

Link Prediction

For a given pair of nodes, we would like to predict whether they have a certain edge type connecting them
For example, what is the likelihood that Lily works at Systems Imagination?

Figure: example graph with nodes Ryan, Lily, Li, Jane, University of Arizona, Systems Imagination; edge types "Has written a paper with" and "Works at"

SLIDE 12

Link Prediction

Train a supervised learning model using topological features like:

Path counts
Metapath counts

Figure: example graph with nodes Ryan, Lily, Li, Jane, University of Arizona, Systems Imagination
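The two topological features named above can be sketched on the example graph. The node names come from the slide, but the figure's edges did not survive, so every edge in `EDGES` is an assumption made for illustration. A metapath is simplified here to the sequence of edge types along a path.

```python
from collections import Counter, defaultdict

COAUTHOR = "Has written a paper with"
WORKS_AT = "Works at"

# Assumed edges (the original figure is not available).
EDGES = [
    ("Lily", COAUTHOR, "Ryan"),
    ("Lily", COAUTHOR, "Li"),
    ("Li", COAUTHOR, "Jane"),
    ("Ryan", WORKS_AT, "Systems Imagination"),
    ("Jane", WORKS_AT, "Systems Imagination"),
    ("Li", WORKS_AT, "University of Arizona"),
]

# Undirected adjacency list: node -> [(edge type, neighbour), ...]
adjacency = defaultdict(list)
for u, kind, v in EDGES:
    adjacency[u].append((kind, v))
    adjacency[v].append((kind, u))


def simple_paths(source, target, max_edges):
    """Yield each simple path from source to target as a list of (edge type, node) steps."""
    stack = [(source, [], {source})]
    while stack:
        node, steps, seen = stack.pop()
        if node == target and steps:
            yield steps
            continue
        if len(steps) == max_edges:
            continue
        for kind, nxt in adjacency[node]:
            if nxt not in seen:
                stack.append((nxt, steps + [(kind, nxt)], seen | {nxt}))


def path_features(source, target, max_edges=3):
    """Return (path count, Counter of metapaths) for one candidate node pair."""
    paths = list(simple_paths(source, target, max_edges))
    metapath_counts = Counter(tuple(kind for kind, _node in steps) for steps in paths)
    return len(paths), metapath_counts


count, metapaths = path_features("Lily", "Systems Imagination")
print(count)
print(metapaths)
```

Each candidate pair yields one feature row (the path count plus one count per metapath), which a supervised model can then be trained on.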

SLIDE 13

Link Prediction

Network Schema – a representation of all node types (metanodes) and the edge types (metaedges) between them

Figure: example schema with node types Scientist and Institution; edge types "Has written a paper with" and "Works at"
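A network schema can be written down as plain data and used to check instance edges. In this sketch the endpoint types of each metaedge (Scientist to Scientist, Scientist to Institution) are inferred from the running example and should be read as assumptions. The helper names are hypothetical.

```python
# Metanodes: the node types. Metaedges: (source type, edge type, target type).
METANODES = {"Scientist", "Institution"}
METAEDGES = {
    ("Scientist", "Has written a paper with", "Scientist"),
    ("Scientist", "Works at", "Institution"),
}

# Type assignment for a few nodes from the example graph.
NODE_TYPE = {
    "Lily": "Scientist",
    "Ryan": "Scientist",
    "Systems Imagination": "Institution",
}


def metaedge_of(edge):
    """Map a concrete edge (source, edge type, target) to its metaedge."""
    source, kind, target = edge
    return (NODE_TYPE[source], kind, NODE_TYPE[target])


def conforms(edge):
    """True if the edge's metaedge is part of the schema."""
    return metaedge_of(edge) in METAEDGES


print(conforms(("Lily", "Works at", "Systems Imagination")))
print(conforms(("Systems Imagination", "Works at", "Lily")))
```

Link prediction then asks, for a pair whose types match a metaedge, how likely the concrete edge is to exist.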

SLIDE 14

Link Prediction – Gene Fusions

Figure: schema with metanodes Gene, Gene Fusion, Cancer, Biological Process; metaedges "Member of", "Related to", "Controls expression", "Participates"

SLIDE 15

Link Prediction – Gene Fusions

Figure: graph with nodes MYB-NFIB Gene Fusion, SIM1, BRK1, MYB, NFIB, ACC, BP1, BP2

SLIDE 16

Hyperedge Prediction – Gene Fusions

Figure: graph with nodes SIM1, BRK1, MYB, NFIB, MYB-NFIB Gene Fusion, SIM1-BRK1 Gene Fusion, ACC, BP1, BP2

SLIDE 17

Mining Heterogeneous Information Networks

                Hetionet                     Cancer Research Hypergraph
Source          Daniel Himmelstein et al.    Systems Imagination, Inc.
Nodes           47,000                       695,464
Metanodes       11                           16
Edges           2,250,000                    12,007,912
Metaedges       24                           41

(Hetionet: Himmelstein et al.)

SLIDE 18

Nodes and Edges

SLIDE 19

Paths

Predictions

SLIDE 20

Gene Fusion Prediction Pipeline

For a given pair of genes, are they in a gene fusion or not?
Dataset: Cancer Research Hypergraph Database
Features: DWPC (Degree-Weighted Path Count), degrees of nodes, prior likelihood of gene fusion
Supervised learning models: Random Forest, Logistic Regression, Decision Trees, XGBoost, Neural Networks
Model interpretation: assess predictions, feature analysis
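The DWPC feature can be sketched as follows, using the definition from the Hetionet work: each path along a metapath contributes the product of its edge endpoint degrees, each raised to the power -w, and the DWPC is the sum over paths. The toy edges, the single "participates" edge type, and the damping exponent w = 0.4 are assumptions for illustration. Whether the pipeline on this slide uses exactly these settings is not stated.

```python
from collections import defaultdict

# Hypothetical toy edges between genes and biological processes.
EDGES = [
    ("MYB", "participates", "BP1"),
    ("NFIB", "participates", "BP1"),
    ("MYB", "participates", "BP2"),
    ("NFIB", "participates", "BP2"),
    ("SIM1", "participates", "BP2"),
]

# node -> edge type -> set of neighbours (edges treated as undirected)
neighbors = defaultdict(lambda: defaultdict(set))
for u, kind, v in EDGES:
    neighbors[u][kind].add(v)
    neighbors[v][kind].add(u)


def degree(node, kind):
    """Number of edges of type `kind` touching `node`."""
    return len(neighbors[node][kind])


def dwpc(source, target, metapath, w=0.4):
    """Sum of path degree products over simple paths that follow `metapath`.

    `metapath` is a sequence of edge types; node types are not checked in
    this simplified version.
    """
    total = 0.0
    stack = [(source, 0, 1.0, (source,))]
    while stack:
        node, step, pdp, visited = stack.pop()
        if step == len(metapath):
            if node == target:
                total += pdp
            continue
        kind = metapath[step]
        for nxt in neighbors[node][kind]:
            if nxt in visited:
                continue
            # Damp each edge by the degrees of both of its endpoints.
            weight = (degree(node, kind) * degree(nxt, kind)) ** -w
            stack.append((nxt, step + 1, pdp * weight, visited + (nxt,)))
    return total


gene_bp_gene = ("participates", "participates")
print(dwpc("MYB", "NFIB", gene_bp_gene))
print(dwpc("MYB", "NFIB", gene_bp_gene, w=0.0))  # w = 0 reduces to a plain path count
```

Paths through high-degree nodes are down-weighted, so a shared hub process counts for less than a shared rare one.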

SLIDE 21

Challenges

Data integration: integrating data from dozens of sources and converting between 3 different formats
Feature computation: 10 times the data, 100 times the computational cost

SLIDE 22

Strategies

NVIDIA DGX

40 CPUs
256 GB RAM
4 x Tesla V100 GPUs (64GB memory total)

Can do production-level computation locally

SLIDE 23

Results

Figure: Neural Net Training on NVIDIA DGX (Systems Imagination benchmarking). Bar chart of time spent in hours (axis ticks 5 to 25), comparing 1 GPU with the NVIDIA DGX.

Dense NN built with mxnet and keras
7 hidden layers with 200-700 neurons each
33,658,931 rows of data
18 features
6 classes

SLIDE 24

Strategies

Multi-processing: 3 lines of Python code sped processing up 6 times
GPU acceleration:

Accelerated NumPy computations 10 times by moving to CuPy
Accelerated deep learning 20 times by using mxnet on NVIDIA DGX

Profiling and debugging code: what is the bottleneck, and how can I relieve it?

Rabbit hole: optimizing not just the code, but also the time spent developing and running the code
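The multi-processing point can be illustrated with the standard library. This is a sketch of the general pattern, not the presenters' code: `pair_feature` is a hypothetical stand-in for an expensive per-pair feature computation, and the pool size of 4 is arbitrary.

```python
from multiprocessing import Pool


def pair_feature(pair):
    """Hypothetical stand-in for an expensive, independent per-pair computation."""
    a, b = pair
    return sum((a * i + b) % 7 for i in range(10_000))


def compute_serial(pairs):
    return [pair_feature(p) for p in pairs]


def compute_parallel(pairs, processes=4):
    # The parallel version is about three lines: open a pool,
    # map the work across processes, collect the results in order.
    with Pool(processes=processes) as pool:
        results = pool.map(pair_feature, pairs)
    return results


if __name__ == "__main__":
    pairs = [(a, b) for a in range(20) for b in range(20)]
    # Both versions return the same values in the same order.
    assert compute_parallel(pairs) == compute_serial(pairs)
```

The pattern works because each pair is processed independently. The actual speedup depends on core count and on how expensive each call is relative to the cost of sending work between processes.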

SLIDE 25

Results

Gene 1     Gene 2     Probability of Gene Fusion
EWSR1      HMGA2      0.929894619
BBS9       KMT2A      0.928350421
IQCJ       KMT2A      0.927711269
CYP11B1    KMT2A      0.926647616
KMT2A      TCIRG1     0.912818918
KMT2A      VEPH1      0.911844923
CCR6       KMT2A      0.873986434
KCNQ1      KMT2A      0.834963505
ACSL1      KMT2A      0.834097153
EWSR1      RUNX1T1    0.832868597

SLIDE 26

Results

Predictions

SLIDE 27

Results

Predictions

SLIDE 28

The Team