SLIDE 1

Personalized Regression Enables Sample-Specific Pan-Cancer Analysis

Benjamin J. Lengerich, Bryon Aragam, Eric P. Xing
{blengeri, naragam, epxing}@cs.cmu.edu
@ben_lengerich, @itsrainingdata

SLIDE 2

Ben Lengerich | ISMB 2018

Cancer is Complex

Different mutations can cause similar phenotypes.
There are many possible driver mutations.

Do we need to build a single model that works for all cancers?
Could we build a different model for each type of cancer?
But cancer “type” may not correspond to any single clinical covariate.

SLIDE 3

The Extreme: Sample-Specific Models

What if we try to understand tumors one at a time?
Could we use simple models that each work for a single patient?
Enable new types of questions to be asked: “How does this tumor’s model differ from the cohort’s?”

SLIDE 4

Our Goal

Sample-Specific, Pan-Cancer Models:

[Figure: each sample paired with its own model parameters. Labels: Samples, Model Parameters]

SLIDE 5

Why Sample-Specific Models?

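[Figure: modeling approaches placed along two axes, Simple Effects to Complicated Effects and Universal Effects to Personal Effects. Approaches shown: Deep Learning, Mixed Effects, Mixtures, Sample-Specific, Varying-Coefficient. Example annotations: “This tumor is due to a mutation in gene TP53”; “Self-driving cars”]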

SLIDE 6

Why Pan-Cancer Models?

Share information between rare and common cancer types
Uncover molecular subtypes
If we can handle clinical covariates well, tissue type can be simply treated as another covariate

[Figure: Number of Samples by Tissue Type in TCGA [1]]

1. cancergenome.nih.gov

SLIDE 7

Related Work

[Table: approaches compared on three criteria; the per-cell marks are not preserved in this transcript.
Columns: Sample-Specific Models? | Unknown Covariate Effects? | General Framework?
Rows: Varying-Coefficient [1] | Known Structure [2,3,4] | Sample-Specific Network Estimation [5,6] | Personalized Regression]

1. Hastie and Tibshirani. Journal of the Royal Statistical Society 1993
2. Song et al. NIPS 2009
3. Kolar et al. NIPS 2009
4. Parikh et al. ISMB 2011
5. Kuijjer et al. arXiv 2015
6. Liu et al. Nucleic Acids Research 2016

SLIDE 8

Personalized Regression

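From estimating a single model:

$$Y = X\beta^{T} + \epsilon$$

To estimating sample-specific models:

$$Y^{(i)} = X^{(i)}\beta^{(i)T} + \epsilon^{(i)}$$

Overparameterized, but not hopeless!

[Figure: each of the N samples has its own model parameters $\beta^{(1)}, \beta^{(2)}, \ldots, \beta^{(N)}$. Labels: Samples, Model Parameters]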
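To make the jump in parameter count concrete, here is a minimal sketch of the two model forms for a linear predictor, with the noise term left out. The function names and toy numbers are hypothetical and are not taken from the authors' code.

```python
# Illustrative sketch only; names and data are hypothetical.

def predict_population(X, beta):
    """One shared coefficient vector: y_i = <x_i, beta>."""
    return [sum(x_p * b_p for x_p, b_p in zip(x, beta)) for x in X]


def predict_personalized(X, betas):
    """One coefficient vector per sample: y_i = <x_i, beta_i>."""
    if len(X) != len(betas):
        raise ValueError("need exactly one coefficient vector per sample")
    return [sum(x_p * b_p for x_p, b_p in zip(x, b)) for x, b in zip(X, betas)]


X = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]        # N = 3 samples, P = 2 features
beta = [0.5, -1.0]                               # P parameters in total
betas = [[0.5, -1.0], [0.0, 1.0], [1.0, 0.0]]    # N * P parameters in total

print(predict_population(X, beta))     # [-1.5, -2.5, -3.5]
print(predict_personalized(X, betas))  # [-1.5, 4.0, 5.0]
```

With N × P free parameters and only N observations, the personalized form cannot be fit without extra structure, which is why the slide calls it overparameterized.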

SLIDE 9

Personalized Regression

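Define the sample-specific loss functional to be minimized:

$$\mathcal{L}(\beta; d_\beta, d_U) \propto \sum_{i=1}^{N} \mathcal{L}^{(i)}(\beta^{(i)}; d_\beta, d_U)$$

$$\mathcal{L}^{(i)}(\beta^{(i)}; d_\beta, d_U) \propto \underbrace{f(X^{(i)}, Y^{(i)}, \beta^{(i)})}_{\text{Prediction Loss}} + \underbrace{\rho^{\beta}_{\lambda}(\beta^{(i)})}_{\text{Regularization}} + \underbrace{\varrho^{(i)}_{\gamma}(d_\beta, d_U)}_{\text{Distance-Matching}}$$

Overparameterized, but not hopeless!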
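A rough sketch of how the three terms add up for one sample. The slide leaves f and the regularizer generic; the logistic loss and lasso penalty used here are assumptions chosen to mirror the TCGA experiment in this deck, and the distance-matching value is passed in as a precomputed number to keep the block short. All function names are hypothetical.

```python
import math


def logistic_loss(x, y, beta):
    """Negative log-likelihood of a label y in {0, 1} under a logistic model."""
    z = sum(x_p * b_p for x_p, b_p in zip(x, beta))
    # log(1 + exp(z)) - y*z, written so it stays stable for large |z|
    return max(z, 0.0) + math.log1p(math.exp(-abs(z))) - y * z


def lasso_penalty(beta, lam):
    """lam * ||beta||_1 (assumed form of the regularization term)."""
    return lam * sum(abs(b) for b in beta)


def sample_loss(x, y, beta, lam, distance_matching):
    """Prediction loss + regularization + distance-matching term for one sample."""
    return logistic_loss(x, y, beta) + lasso_penalty(beta, lam) + distance_matching


def total_loss(X, Y, betas, lam, distance_matching_terms):
    """Sum of the per-sample losses over all N samples."""
    return sum(
        sample_loss(x, y, b, lam, dm)
        for x, y, b, dm in zip(X, Y, betas, distance_matching_terms)
    )
```

Usage: `total_loss(X, Y, betas, lam=0.1, distance_matching_terms=dm)` where `dm[i]` holds the distance-matching value for sample i.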

SLIDE 10

Distance Matching Regularization

Main idea: Distance between sample parameters should be similar to distance between sample covariates.

Define a regularization loss functional to be minimized:

$$\varrho^{(i)}_{\gamma}(d_\beta, d_U) = \gamma \sum_{j \neq i} \Big( \underbrace{d_\beta(\beta^{(i)}, \beta^{(j)})}_{\text{parameter distance}} - \underbrace{d_U(U^{(i)}, U^{(j)})}_{\text{covariate distance}} \Big)^{2}$$

The sum over j ≠ i covers pairwise distances between all samples.
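The regularizer can be sketched directly from its definition. The two distance functions are passed in as arguments; the plain L1 distance used in the example is only a stand-in, and the names are hypothetical.

```python
def distance_matching(i, betas, U, d_beta, d_U, gamma):
    """gamma * sum over j != i of (d_beta(beta_i, beta_j) - d_U(U_i, U_j))^2."""
    total = 0.0
    for j in range(len(betas)):
        if j == i:
            continue
        gap = d_beta(betas[i], betas[j]) - d_U(U[i], U[j])
        total += gap * gap
    return gamma * total


def l1(x, y):
    """Stand-in distance for the example: sum of absolute differences."""
    return sum(abs(a - b) for a, b in zip(x, y))


betas = [[0.0, 0.0], [1.0, 0.0], [3.0, 0.0]]   # one parameter vector per sample
U = [[0.0], [1.0], [2.0]]                      # one covariate vector per sample

# Samples 0 and 1 match exactly (distance 1 in both spaces); samples 0 and 2
# are 3 apart in parameters but only 2 apart in covariates.
print(distance_matching(0, betas, U, l1, l1, gamma=0.5))  # 0.5
```

The penalty is zero only when every pairwise parameter distance equals the matching covariate distance.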

SLIDE 11

Distance Metrics Can Be Learned From Data

Define distance metrics as linear combinations of feature-wise distance metrics:

$$d_\beta(x, y) = [\,|x_1 - y_1|, \ldots, |x_P - y_P|\,]\,\phi_\beta^{T}$$

$$d_U(x, y) = [\,d_{U_1}(x_1, y_1), \ldots, d_{U_K}(x_K, y_K)\,]\,\phi_U^{T}$$

After optimization, we can inspect the values in $\phi_\beta$, $\phi_U$ to understand contributions to personalization.
User must supply covariate-specific distance metrics.
Can use complicated covariate distance metrics.
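A small sketch of the two weighted distances. The covariate-specific metrics (absolute difference for a numeric covariate, a 0/1 discrete metric for a categorical one), the example covariates, and the weights are all hypothetical choices for illustration.

```python
def d_beta(x, y, phi_beta):
    """Weighted sum of per-coefficient absolute differences."""
    return sum(w * abs(a - b) for w, a, b in zip(phi_beta, x, y))


def d_U(x, y, metrics, phi_U):
    """Weighted sum of user-supplied, covariate-specific distances."""
    return sum(w * d(a, b) for w, d, a, b in zip(phi_U, metrics, x, y))


def abs_diff(a, b):
    """Distance for a numeric covariate."""
    return abs(a - b)


def discrete(a, b):
    """Distance for a categorical covariate: 0 if equal, 1 otherwise."""
    return 0.0 if a == b else 1.0


metrics = [abs_diff, discrete]   # hypothetical covariates: age, primary site
phi_U = [0.1, 2.0]               # hypothetical learned weights

print(d_beta([1.0, 2.0], [0.0, 4.0], [0.5, 0.25]))           # 0.5*1 + 0.25*2
print(d_U([60, "lung"], [50, "breast"], metrics, phi_U))     # 0.1*10 + 2.0*1
```

Because each covariate brings its own metric, a categorical covariate can be compared directly without first expanding it into one-hot columns, and a larger weight in `phi_U` marks a covariate that contributes more to the overall distance.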

SLIDE 12

When is Personalized Regression Useful?

We are seeking a model for inference, not necessarily the most accurate predictive model.
We are seeking relatively simple personalized effects, not complex universal effects.
We have covariate data which is informative of each sample.

SLIDE 13

Experiments

SLIDE 14

TCGA Pan-Cancer Analysis

Model: Logistic Regression with Lasso Regularization
Task: Predict Case/Control Status
Data:
28 primary sites
9663 samples (8944 case, 719 control)
4123 RNA-Seq features
14 clinical covariates

[Figure: Number of Samples by Tissue Type in TCGA [1]]

1. cancergenome.nih.gov

SLIDE 15

Clinical Covariates

14 Clinical Covariates:
Tissue Features: Disease Type, Primary Site, Days to Collection
Sample Molecular Biomarkers: Pct. Tumor Cells, Pct. Normal Cells, Pct. Tumor Nuclei, Pct. Lymphocyte Infiltration, Pct. Stromal Cells, Pct. Monocyte Infiltration, Pct. Neutrophil Infiltration
Patient Demographic Features: Age at Diagnosis, Year of Birth, Gender, Race

Traditional methods expect these data encoded as one-hot vectors, which expands dimensionality 5X!

SLIDE 16

Personalized Models Are More Efficient with Variable Selection

Selects Fewer Genes Per Sample:
[Figure: Red Lines Indicate Number of Variables Selected by Tissue-Specific Models]

Uses each Gene in Fewer Samples:
[Figure: Most Genes are Selected for Fewer than 500 Samples]

SLIDE 17

Personalized Regression Gives More Weight to Known Oncogenes [1]

Many methods effectively identify common oncogenes
Few methods effectively identify rare oncogenes

1. Oncogenes as annotated in COSMIC (Forbes et al. Nucleic Acids Research 2014)

SLIDE 18

Personalized Regression Produces Sample-Specific Pan-Cancer Models

[Figure: sample-specific models. Axes: Samples, Genes. Red Line = oncogene]

SLIDE 19

Personalized Models Reveal Molecular Subtypes Which Span Tissues

[Figure: axes Samples, Genes]

Over-represented for the GO biological process term “Modulation of Chemical Synaptic Transmission” (p < 0.05 FDR)
Includes genes ATP1A2, SLC6A4, ASIC1, GRM3, and SLC8A3, which code for ion-transport processes.
Ion-transport processes have long been seen in vivo as an important system in thyroid cancer [1] and in vitro from leukemic cells [2], but only recently as a functional marker across different cancer types [3].

1. Filetti et al. European Journal of Endocrinology 1999
2. Morgan et al. Cancer Research 1986
3. Scafoglio et al. PNAS 2015

SLIDE 20

Personalized Models Form Clusters with Distinct Signatures

[Figure: cluster signatures. Panel titles: Extracellular Processes - Antigen; Cellular Metabolism; Extracellular Processes - Membrane]

SLIDE 21

Personalized Regression Learns Clinical Distance Metrics

SLIDE 22

Conclusions

Sample-specific models can give us a new perspective.
Unlock bottom-up in addition to traditional top-down analyses.
Personalized Regression with Distance-Matching Regularization effectively learns sample-specific models.
Personalized Regression reveals patterns in pan-cancer transcriptomic data that are overlooked by traditional analyses.

SLIDE 23

Future Work

Biological Questions - Sample-Specific Processes?
More complex personalized models
Personalized Regression for Single-Cell Data, Election Modeling, Stock Prediction

SLIDE 24

Thank You

Code available at: github.com/blengerich/personalized_regression
Collaborators: Bryon Aragam, Eric P. Xing
Contact: {blengeri, epxing}@cs.cmu.edu
Travel to ISMB generously supported by ISCB
Research supported by NIH

SLIDE 25

The Gory Details

SLIDE 26

Personalized Regression: Optimization

Define pairwise distance vectors by:

$$\Delta^{(i,j)}_{\beta} = [\,d_{\beta_1}(\beta^{(i)}_1, \beta^{(j)}_1), \ldots, d_{\beta_P}(\beta^{(i)}_P, \beta^{(j)}_P)\,]$$

$$\Delta^{(i,j)}_{U} = [\,d_{U_1}(U^{(i)}_1, U^{(j)}_1), \ldots, d_{U_K}(U^{(i)}_K, U^{(j)}_K)\,]$$

Construction of the covariate distance tensor can be amortized.
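One way to read the amortization remark: the covariates do not change during fitting, so the N × N × K tensor of covariate-wise distances can be built once and reused whenever the weights change. The sketch below follows that reading; the function names, covariates, and weights are hypothetical.

```python
def covariate_distance_tensor(U, metrics):
    """delta[i][j][k] = metrics[k](U[i][k], U[j][k]); built once, reused afterwards."""
    n = len(U)
    return [
        [[d(a, b) for d, a, b in zip(metrics, U[i], U[j])] for j in range(n)]
        for i in range(n)
    ]


def weighted_distance(delta_ij, phi):
    """Collapse one pairwise distance vector with the current weights."""
    return sum(w * d for w, d in zip(phi, delta_ij))


def abs_diff(a, b):
    return abs(a - b)


def discrete(a, b):
    return 0.0 if a == b else 1.0


U = [[60, "lung"], [50, "breast"], [60, "breast"]]   # hypothetical: age, primary site
delta_U = covariate_distance_tensor(U, [abs_diff, discrete])

# The same tensor serves any weight vector, so updating the weights is cheap.
print(weighted_distance(delta_U[0][1], [0.1, 2.0]))
print(weighted_distance(delta_U[0][1], [1.0, 0.0]))
```

The parameter-side vectors would need recomputing whenever the coefficients move, which is presumably why only the covariate tensor is singled out.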

SLIDE 27

Avoiding Degenerate Solutions

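Add priors to distance metrics

From:

$$\varrho^{(i)}_{\gamma}(d_\beta, d_U) = \gamma \sum_{j \neq i} \Big( \underbrace{d_\beta(\beta^{(i)}, \beta^{(j)})}_{\text{parameter distance}} - \underbrace{d_U(U^{(i)}, U^{(j)})}_{\text{covariate distance}} \Big)^{2}$$

To:

$$\varrho^{(i)}_{\gamma}(d_\beta, d_U) = \gamma \sum_{j \neq i} \Big( \underbrace{d_\beta(\beta^{(i)}, \beta^{(j)})}_{\text{parameter distance}} - \underbrace{d_U(U^{(i)}, U^{(j)})}_{\text{covariate distance}} \Big)^{2} + \psi_\alpha(d_\beta) + \psi_\upsilon(d_U)$$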

SLIDE 28

Avoiding Degenerate Solutions

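Add priors to distance metrics

$$\varrho^{(i)}_{\gamma}(d_\beta, d_U) = \gamma \sum_{j \neq i} \Big( \underbrace{d_\beta(\beta^{(i)}, \beta^{(j)})}_{\text{parameter distance}} - \underbrace{d_U(U^{(i)}, U^{(j)})}_{\text{covariate distance}} \Big)^{2} + \psi_\alpha(d_\beta) + \psi_\upsilon(d_U)$$

where

$$\psi_\alpha(d_\beta) = \alpha \,\|\phi_\beta - \phi^{0}_{\beta}\|_2$$

$$\psi_\upsilon(d_U) = \upsilon \,\|\phi_U - \phi^{0}_{U}\|_2$$

and we project loadings into the non-negative reals.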
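A minimal sketch of the prior terms and the projection. It assumes the trailing "2" in the slide's formulas denotes the Euclidean norm, and the function names are hypothetical. One motivation that fits the slide title: if both loading vectors shrink to zero, every distance is zero and the distance-matching term vanishes trivially, so pulling the loadings toward a prior value penalizes that collapse.

```python
import math


def prior_penalty(phi, phi0, strength):
    """strength * ||phi - phi0||_2 (Euclidean norm assumed)."""
    return strength * math.sqrt(sum((a - b) ** 2 for a, b in zip(phi, phi0)))


def project_nonnegative(phi):
    """Clip each loading at zero so it stays in the non-negative reals."""
    return [max(0.0, w) for w in phi]


phi_U = project_nonnegative([-0.5, 0.2])        # e.g. after a gradient step
print(phi_U)                                    # [0.0, 0.2]
print(prior_penalty([3.0, 4.0], [0.0, 0.0], 2.0))  # 2 * 5 = 10.0
```

Non-negative loadings keep each weighted sum of feature-wise distances non-negative, so the result still behaves like a distance.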

SLIDE 29

Personalized Regression

Initialize at population solution
Allow each personalized model to “fine-tune” away from the central population solution (block coordinate descent)
Distance-matching regularization ensures the personalized models respect covariate structure
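The procedure can be sketched as a toy block coordinate descent in which each sample's coefficient vector is one block. Several simplifications here are assumptions rather than the authors' implementation: squared-error prediction loss, no lasso term, covariate weights `phi_U` held fixed, finite-difference gradients, and a step that is kept only if it lowers the objective.

```python
def objective(betas, X, Y, U, phi_U, gamma):
    """Squared-error loss plus the distance-matching penalty over all samples."""
    n = len(X)
    total = 0.0
    for i in range(n):
        pred = sum(x * b for x, b in zip(X[i], betas[i]))
        total += (pred - Y[i]) ** 2
        for j in range(n):
            if j != i:
                d_b = sum(abs(a - b) for a, b in zip(betas[i], betas[j]))
                d_u = sum(w * abs(a - b) for w, a, b in zip(phi_U, U[i], U[j]))
                total += gamma * (d_b - d_u) ** 2
    return total


def descend(vec, f, step=0.01, eps=1e-6):
    """One finite-difference gradient step on vec, kept only if it lowers f."""
    base = f(vec)
    grad = []
    for k in range(len(vec)):
        bumped = list(vec)
        bumped[k] += eps
        grad.append((f(bumped) - base) / eps)
    candidate = [v - step * g for v, g in zip(vec, grad)]
    return candidate if f(candidate) < base else vec


def fine_tune(X, Y, U, beta_pop, phi_U, gamma=0.1, sweeps=50):
    """Start every sample at the population solution, then update one block at a time."""
    betas = [list(beta_pop) for _ in X]
    for _ in range(sweeps):
        for i in range(len(betas)):
            def f_i(b, i=i):
                return objective(betas[:i] + [b] + betas[i + 1:], X, Y, U, phi_U, gamma)
            betas[i] = descend(betas[i], f_i)
    return betas


X = [[1.0], [1.0], [1.0]]
Y = [0.0, 1.0, 2.0]
U = [[0.0], [1.0], [2.0]]          # covariate tracks the per-sample effect
tuned = fine_tune(X, Y, U, beta_pop=[1.0], phi_U=[1.0])
print(tuned)
```

In this toy data the covariate orders the samples the same way as their true effects, so the personalized coefficients spread out from the shared starting value in covariate order.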

SLIDE 30

Inference Procedure

Conveniently, we have already learned distance metrics to use for predictions.
On test data, we identify the closest neighbors and use their sample-specific models.
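A sketch of the test-time lookup. The slide does not say how the neighbors' models are combined; averaging the coefficient vectors of the k closest training samples is an assumption made here, and the function name, distance, and data are hypothetical.

```python
def nearest_neighbor_model(u_test, U_train, betas, d_U, k=3):
    """Average the coefficient vectors of the k training samples closest to u_test."""
    order = sorted(range(len(U_train)), key=lambda i: d_U(u_test, U_train[i]))
    nearest = order[:k]
    p = len(betas[0])
    return [sum(betas[i][q] for i in nearest) / len(nearest) for q in range(p)]


def l1(x, y):
    """Stand-in for the learned covariate distance."""
    return sum(abs(a - b) for a, b in zip(x, y))


U_train = [[0.0], [1.0], [10.0]]
betas = [[1.0, 0.0], [3.0, 2.0], [100.0, 100.0]]

# The test covariate 0.4 is closest to training samples 0 and 1.
print(nearest_neighbor_model([0.4], U_train, betas, l1, k=2))  # [2.0, 1.0]
```

The returned vector can then be used like any single sample-specific model to score the test sample.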

SLIDE 31

Simulation Results

[Figure: Recovery Error (Lower is Better), $\|\hat{\beta} - \beta\|_2 \,/\, \|\hat{\beta}_{pop} - \beta\|_2$, against Number of Samples (25, 50, 125, 250, 500). Methods: Population, Mixture, VC, Personalized]

At moderate sample sizes, personalized regression recovers parameters well.
At low sample size, cannot learn distance metrics.
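The recovery error can be sketched as below, assuming the two norms on the plot's axis form a ratio and that each norm is taken over all samples' coefficients stacked together. Both points are readings of the axis label rather than something the slide spells out, and the names are hypothetical.

```python
import math


def l2(a, b):
    """Euclidean distance between two lists of coefficient vectors, stacked."""
    return math.sqrt(
        sum((x - y) ** 2 for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b))
    )


def relative_recovery_error(beta_hat, beta_pop, beta_true):
    """||beta_hat - beta|| / ||beta_pop - beta||, with beta_pop repeated for every sample."""
    betas_pop = [beta_pop] * len(beta_true)
    return l2(beta_hat, beta_true) / l2(betas_pop, beta_true)


beta_true = [[0.0, 0.0], [3.0, 4.0]]     # true sample-specific coefficients
beta_pop = [0.0, 0.0]                    # single population estimate
beta_hat = [[0.0, 0.0], [3.0, 0.0]]      # personalized estimates

print(relative_recovery_error(beta_hat, beta_pop, beta_true))  # 4 / 5 = 0.8
```

Under this reading the population model scores exactly 1.0, and values below 1.0 mean the estimates sit closer to the true coefficients than the population model does.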

SLIDE 32

Personalized Regression Fine-Tunes Accuracy

Here, personalized regression overfits the data but is still better than competing methods.
Better clinical distance metrics and hyperparameter tuning will likely alleviate overfitting.

SLIDE 33

Personalized Regression Does Not Merely Identify More Enriched Gene Sets

Instead, it identifies a variety of sample-specific patterns which do not fit into a small number of mixtures

Enrichment Analysis of Complete Rankings: