Personalized Regression Enables Sample-Specific Pan-Cancer Analysis


SLIDE 1

Personalized Regression Enables Sample-Specific Pan-Cancer Analysis

Benjamin J. Lengerich, Bryon Aragam, Eric P. Xing
{blengeri, naragam, epxing}@cs.cmu.edu
@ben_lengerich, @itsrainingdata
SLIDE 2

Ben Lengerich | ISMB 2018

Cancer is Complex

  • Different mutations can cause similar phenotypes.
  • There are many possible driver mutations.
  • Do we need to build a single model that works for all cancers?
  • Could we build a different model for each type of cancer?
  • But cancer “type” may not correspond to any single clinical covariate.

SLIDE 3

The Extreme: Sample-Specific Models

  • What if we try to understand tumors one at a time?
  • Could we use simple models that each work for a single patient?
  • Enable new types of questions to be asked: “How does this tumor’s model differ from the cohort’s?”

SLIDE 4

Our Goal

Sample-Specific, Pan-Cancer Models:

[Figure: matrix of model parameters, with axes labeled Samples and Model Parameters.]

SLIDE 5

Why Sample-Specific Models?

[Figure: methods placed along two axes, from Simple Effects to Complicated Effects and from Universal Effects to Personal Effects. Methods shown: Deep Learning, Mixed Effects, Mixtures, Varying-Coefficient, Sample-Specific. Example annotations: “This tumor is due to a mutation in gene TP53” and “Self-driving cars”.]

SLIDE 6

Why Pan-Cancer Models?

  • Share information between rare and common cancer types
  • Uncover molecular subtypes
  • If we can handle clinical covariates well, tissue type can be simply treated as another covariate

[Figure: Number of Samples by Tissue Type in TCGA [1]]

  • 1. cancergenome.nih.gov
SLIDE 7

Related Work

[Table comparing approaches on three criteria. Columns: Sample-Specific Models? / Unknown Covariate Effects? / General Framework? Rows: Varying-Coefficient [1]; Known Structure [2,3,4]; Sample-Specific Network Estimation [5,6]; Personalized Regression.]

  • 1. Hastie and Tibshirani. Journal of the Royal Statistical Society 1993
  • 2. Song et al. NIPS 2009
  • 3. Kolar et al. NIPS 2009
  • 4. Parikh et al. ISMB 2011
  • 5. Kuijjer et al. arXiv 2015
  • 6. Liu et al. Nucleic Acids Research 2016

SLIDE 8

Personalized Regression

  • From estimating a single model:

$$Y = X\beta^{T} + \epsilon$$

  • To estimating sample-specific models:

$$Y^{(i)} = X^{(i)}\beta^{(i)T} + \epsilon^{(i)}$$

Overparameterized, but not hopeless!

[Figure: matrix of sample-specific parameters, with axes labeled Samples and Model Parameters and entries $\beta^{(1)}, \beta^{(2)}, \ldots, \beta^{(N)}$.]
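To make the move from one shared coefficient vector to one vector per sample concrete, here is a minimal sketch in plain Python. The function names and the toy numbers are illustrative assumptions, not taken from the talk or its code release.

```python
def predict_population(X, beta):
    """One shared coefficient vector for every sample: y_i = <x_i, beta>."""
    return [sum(x * b for x, b in zip(row, beta)) for row in X]


def predict_personalized(X, betas):
    """One coefficient vector per sample: y_i = <x_i, beta_i>."""
    if len(X) != len(betas):
        raise ValueError("need exactly one coefficient vector per sample")
    return [sum(x * b for x, b in zip(row, beta_i)) for row, beta_i in zip(X, betas)]


# Toy data (assumed): 3 samples, 2 features.
X = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
beta_pop = [0.5, -1.0]
betas = [[0.5, -1.0], [1.0, 0.0], [0.0, 1.0]]

y_pop = predict_population(X, beta_pop)
y_pers = predict_personalized(X, betas)

# The personalized model has N times as many parameters as the population model.
n_params_pop = len(beta_pop)
n_params_pers = sum(len(b) for b in betas)
```

With N samples and P features the personalized model carries N x P parameters, which is why it is overparameterized without further regularization.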

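SLIDE 9

Personalized Regression

  • Define the sample-specific loss functional to be minimized:

$$\mathcal{L}(\beta; d_\beta, d_U) \propto \sum_{i=1}^{N} \mathcal{L}^{(i)}(\beta^{(i)}; d_\beta, d_U)$$

$$\mathcal{L}^{(i)}(\beta^{(i)}; d_\beta, d_U) \propto \underbrace{f(X^{(i)}, Y^{(i)}, \beta^{(i)})}_{\text{Prediction Loss}} + \underbrace{\rho^{\beta}_{\lambda}(\beta^{(i)})}_{\text{Regularization}} + \underbrace{\varrho^{(i)}_{\gamma}(d_\beta, d_U)}_{\text{Distance-Matching}}$$

Overparameterized, but not hopeless!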
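The three terms of the per-sample loss can be sketched as follows. This is an illustration under stated assumptions: squared error stands in for the prediction loss f, an L1 penalty stands in for the regularizer, and an unweighted L1 distance is used for both d_β and d_U. None of these choices is claimed to be the authors' exact configuration.

```python
def l1_distance(a, b):
    """Unweighted L1 distance, used here for both parameters and covariates (assumed)."""
    return sum(abs(p - q) for p, q in zip(a, b))


def sample_loss(i, X, Y, betas, U, lam, gamma, d_beta=l1_distance, d_U=l1_distance):
    """Loss for sample i: prediction + regularization + distance matching."""
    pred = sum(x * b for x, b in zip(X[i], betas[i]))
    prediction_loss = (Y[i] - pred) ** 2                    # f(X_i, Y_i, beta_i), assumed squared error
    regularization = lam * sum(abs(b) for b in betas[i])    # assumed L1 penalty
    distance_matching = gamma * sum(
        (d_beta(betas[i], betas[j]) - d_U(U[i], U[j])) ** 2
        for j in range(len(betas)) if j != i
    )
    return prediction_loss + regularization + distance_matching


def total_loss(X, Y, betas, U, lam, gamma):
    """Sum of the sample-specific losses over all N samples."""
    return sum(sample_loss(i, X, Y, betas, U, lam, gamma) for i in range(len(X)))


# Toy data (assumed): two samples whose covariates are 2 apart.
X = [[1.0], [1.0]]
Y = [1.0, 3.0]
U = [[0.0], [2.0]]

# Parameters that fit each sample and are also 2 apart: only the L1 term remains.
matched = total_loss(X, Y, [[1.0], [3.0]], U, lam=0.5, gamma=0.25)
# A single shared parameter: pays prediction error and a distance-matching penalty.
shared = total_loss(X, Y, [[2.0], [2.0]], U, lam=0.5, gamma=0.25)
```

In this toy case the matched parameters give a total of 2.0 and the shared parameter gives 6.0, so the loss prefers personalized parameters whose spacing mirrors the covariates.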

SLIDE 10

Distance Matching Regularization

  • Main idea: Distance between sample parameters should be similar to distance between sample covariates.
  • Define a regularization loss functional to be minimized:

$$\varrho^{(i)}_{\gamma}(d_\beta, d_U) = \gamma \sum_{j \neq i} \Big( \underbrace{d_\beta(\beta^{(i)}, \beta^{(j)})}_{\text{parameter distance}} - \underbrace{d_U(U^{(i)}, U^{(j)})}_{\text{covariate distance}} \Big)^{2}$$

Pairwise distances between all samples
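A small worked example may help. The numbers are assumed for illustration only: three samples with scalar parameters and scalar covariates, and the absolute difference used for both distances.

```latex
% Assumed toy values; both d_beta and d_U are absolute differences.
\begin{align*}
\beta^{(1)} &= 1,\ \beta^{(2)} = 2,\ \beta^{(3)} = 4, \qquad U^{(1)} = 0,\ U^{(2)} = 1,\ U^{(3)} = 2 \\
\varrho^{(1)}_{\gamma} &= \gamma\left[(|1-2| - |0-1|)^2 + (|1-4| - |0-2|)^2\right] = \gamma\,(0 + 1) = \gamma \\
\varrho^{(2)}_{\gamma} &= \gamma\left[(|2-1| - |1-0|)^2 + (|2-4| - |1-2|)^2\right] = \gamma\,(0 + 1) = \gamma \\
\varrho^{(3)}_{\gamma} &= \gamma\left[(|4-1| - |2-0|)^2 + (|4-2| - |2-1|)^2\right] = \gamma\,(1 + 1) = 2\gamma
\end{align*}
```

Sample 3 pays the largest penalty because its parameter sits farther from the others than its covariate does. Setting β(3) = 3 would make every term zero.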

SLIDE 11

Distance Metrics Can Be Learned From Data

  • Define distance metrics as linear combinations of feature-wise distance metrics:

$$d_\beta(x, y) = [\,|x_1 - y_1|, \ldots, |x_P - y_P|\,]\,\phi_\beta^{T}$$

$$d_U(x, y) = [\,d_{U_1}(x_1, y_1), \ldots, d_{U_K}(x_K, y_K)\,]\,\phi_U^{T}$$

  • After optimization, we can inspect the values in $\phi_\beta, \phi_U$ to understand contributions to personalization.
  • User must supply covariate-specific distance metrics.
  • Can use complicated covariate distance metrics.
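The two weighted distances can be sketched as below. The per-covariate metrics (absolute difference for a continuous covariate, a 0/1 discrete metric for a categorical one), the weights, and the example values are assumptions chosen for illustration.

```python
def absolute_difference(a, b):
    """Feature-wise metric for a continuous covariate (assumed choice)."""
    return abs(a - b)


def discrete_metric(a, b):
    """Feature-wise metric for a categorical covariate: 0 if equal, else 1 (assumed choice)."""
    return 0.0 if a == b else 1.0


def parameter_distance(x, y, phi_beta):
    """d_beta(x, y): feature-wise absolute differences weighted by phi_beta."""
    return sum(w * abs(a - b) for a, b, w in zip(x, y, phi_beta))


def covariate_distance(u, v, metrics, phi_U):
    """d_U(u, v): user-supplied feature-wise metrics weighted by phi_U."""
    return sum(w * d(a, b) for a, b, d, w in zip(u, v, metrics, phi_U))


# Hypothetical samples described by (age, primary site).
metrics = [absolute_difference, discrete_metric]
d_params = parameter_distance([1.0, -2.0], [0.0, 1.0], phi_beta=[1.0, 0.5])
d_covs = covariate_distance((62.0, "lung"), (50.0, "breast"), metrics, phi_U=[0.5, 2.0])
```

Because each covariate keeps its own metric, a categorical covariate contributes a single weight in ϕU rather than one column per category.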

SLIDE 12

When is Personalized Regression Useful?

  • We are seeking a model for inference, not necessarily the most accurate predictive model.
  • We are seeking relatively simple personalized effects, not complex universal effects.
  • We have covariate data which is informative of each sample.

SLIDE 13

Experiments

SLIDE 14

TCGA Pan-Cancer Analysis

  • Model: Logistic Regression with Lasso Regularization
  • Task: Predict Case/Control Status
  • Data:
    • 28 primary sites
    • 9663 samples (8944 case, 719 control)
    • 4123 RNA-Seq features
    • 14 clinical covariates

[Figure: Number of Samples by Tissue Type in TCGA [1]]

  • 1. cancergenome.nih.gov
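For reference, the per-sample objective of a lasso-regularized logistic regression can be sketched as below. This is a generic textbook formulation with assumed names and toy values; it is not taken from the authors' code.

```python
import math


def logistic_loss(x, y, beta):
    """Negative log-likelihood of one sample with label y in {0, 1}."""
    z = sum(xk * bk for xk, bk in zip(x, beta))
    margin = z if y == 1 else -z
    # log(1 + exp(-margin)), written so that exp() never overflows
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def lasso_penalty(beta, lam):
    """L1 penalty, which drives some coefficients exactly to zero."""
    return lam * sum(abs(b) for b in beta)


# Toy values (assumed): with all-zero coefficients the model predicts 0.5,
# so the loss is log(2) for either label.
loss_at_zero = logistic_loss([1.0, 2.0], 1, [0.0, 0.0])
penalty = lasso_penalty([2.0, 0.0, -1.0], lam=0.5)
```

In the personalized setting, this loss would play the role of the prediction-loss term and the L1 penalty the role of the regularization term for each sample.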
SLIDE 15

Clinical Covariates

  • 14 Clinical Covariates:
    • Tissue Features: Disease Type, Primary Site, Days to Collection
    • Sample Molecular Biomarkers: Pct. Tumor Cells, Pct. Normal Cells, Pct. Tumor Nuclei, Pct. Lymphocyte Infiltration, Pct. Stromal Cells, Pct. Monocyte Infiltration, Pct. Neutrophil Infiltration
    • Patient Demographic Features: Age at Diagnosis, Year of Birth, Gender, Race
  • Traditional methods expect these data encoded as one-hot vectors, which expands dimensionality 5X!
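The dimensionality cost of one-hot encoding can be seen with a small count. The covariate names and values below are hypothetical and much smaller than the TCGA data; the sketch only shows how each categorical covariate expands to one column per observed level.

```python
def one_hot_width(column):
    """Number of columns a categorical covariate needs after one-hot encoding."""
    return len(set(column))


def encoded_width(columns, categorical):
    """Total width after one-hot encoding the categorical covariates only."""
    return sum(
        one_hot_width(values) if name in categorical else 1
        for name, values in columns.items()
    )


# Hypothetical covariates for four samples.
columns = {
    "primary_site": ["lung", "breast", "kidney", "lung"],
    "gender": ["F", "M", "F", "F"],
    "age_at_diagnosis": [61.0, 48.0, 70.0, 55.0],
}
raw_width = len(columns)
expanded_width = encoded_width(columns, categorical={"primary_site", "gender"})
```

Here three covariates become six columns. With per-covariate distance metrics, each covariate instead stays a single entry in the distance vector.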

SLIDE 16

Personalized Models Are More Efficient with Variable Selection

[Figure panels: “Selects Fewer Genes Per Sample” and “Uses Each Gene in Fewer Samples”.]

Red lines indicate the number of variables selected by tissue-specific models.
Most genes are selected for fewer than 500 samples.
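The two summaries named in the panel titles can be computed from a matrix of sample-specific coefficients as sketched below. The function names and the tiny coefficient matrix are assumptions for illustration.

```python
def genes_per_sample(betas, tol=0.0):
    """For each sample (row), count the genes with a nonzero coefficient."""
    return [sum(1 for b in row if abs(b) > tol) for row in betas]


def samples_per_gene(betas, tol=0.0):
    """For each gene (column), count the samples whose model selects it."""
    n_genes = len(betas[0])
    return [sum(1 for row in betas if abs(row[g]) > tol) for g in range(n_genes)]


# Hypothetical coefficients: 3 samples x 3 genes.
betas = [
    [0.0, 1.2, 0.0],
    [0.5, 0.0, 0.0],
    [0.0, -0.3, 0.0],
]
per_sample = genes_per_sample(betas)
per_gene = samples_per_gene(betas)
```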

SLIDE 17

Personalized Regression Gives More Weight to Known Oncogenes [1]

Many methods effectively identify common oncogenes.
Few methods effectively identify rare oncogenes.

  • 1. Oncogenes as annotated in COSMIC (Forbes et al. Nucleic Acids Research 2014)
SLIDE 18

Personalized Regression Produces Sample-Specific Pan-Cancer Models

[Figure: axes labeled Samples and Genes. Red line = oncogene.]

SLIDE 19

Personalized Models Reveal Molecular Subtypes Which Span Tissues

[Figure: axes labeled Samples and Genes.]

  • Over-represented for the GO biological process term “Modulation of Chemical Synaptic Transmission” (p < 0.05, FDR)
  • Includes genes ATP1A2, SLC6A4, ASIC1, GRM3, and SLC8A3, which code for ion-transport processes.
  • Ion-transport processes have long been seen in vivo as an important system in thyroid cancer [1] and in vitro in leukemic cells [2], but only recently as a functional marker across different cancer types [3].

  • 1. Filetti et al. European Journal of Endocrinology 1999
  • 2. Morgan et al. Cancer Research 1986
  • 3. Scafoglio et al. PNAS 2015

SLIDE 20

Personalized Models Form Clusters with Distinct Signatures

[Figure: clusters labeled Extracellular Processes - Antigen; Cellular Metabolism; Extracellular Processes - Membrane.]

SLIDE 21

Personalized Regression Learns Clinical Distance Metrics

SLIDE 22

Conclusions

  • Sample-specific models can give us a new perspective.
  • Unlock bottom-up in addition to traditional top-down analyses.
  • Personalized Regression with Distance-Matching Regularization effectively learns sample-specific models.
  • Personalized Regression reveals patterns in pan-cancer transcriptomic data that are overlooked by traditional analyses.

SLIDE 23

Future Work

  • Biological Questions - Sample-Specific Processes?
  • More complex personalized models
  • Personalized Regression for Single-Cell Data, Election Modeling, Stock Prediction

SLIDE 24

Code available at: github.com/blengerich/personalized_regression

Collaborators:

  • Bryon Aragam
  • Eric P. Xing

Contact: {blengeri, epxing}@cs.cmu.edu

Travel to ISMB generously supported by ISCB
Research supported by NIH

Thank You

SLIDE 25

The Gory Details

SLIDE 26

Personalized Regression: Optimization

  • Define pairwise distance vectors by:

$$\Delta^{(i,j)}_{\beta} = [\,d_{\beta_1}(\beta^{(i)}_1, \beta^{(j)}_1), \ldots, d_{\beta_P}(\beta^{(i)}_P, \beta^{(j)}_P)\,]$$

$$\Delta^{(i,j)}_{U} = [\,d_{U_1}(U^{(i)}_1, U^{(j)}_1), \ldots, d_{U_K}(U^{(i)}_K, U^{(j)}_K)\,]$$

  • Construction of the covariate distance tensor can be amortized
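The covariate distance tensor depends only on the covariates and the user-supplied feature-wise metrics, so it can be built once and reused every time the weights ϕU change. A minimal sketch follows, with assumed metric choices and toy covariates.

```python
def absolute_difference(a, b):
    return abs(a - b)


def discrete_metric(a, b):
    return 0.0 if a == b else 1.0


def covariate_distance_tensor(U, metrics):
    """delta[i][j][k] = d_{U_k}(U[i][k], U[j][k]); computed once from the data."""
    return [[[d(a, b) for a, b, d in zip(u_i, u_j, metrics)] for u_j in U] for u_i in U]


def weighted_distances(delta, phi):
    """D[i][j] = <delta[i][j], phi>; cheap to recompute whenever phi changes."""
    return [[sum(w * v for w, v in zip(phi, pair)) for pair in row] for row in delta]


# Toy covariates (assumed): one continuous and one categorical feature.
U = [[0.0, "a"], [1.0, "b"], [3.0, "a"]]
delta_U = covariate_distance_tensor(U, [absolute_difference, discrete_metric])

# The same tensor serves two different weight vectors.
D_first = weighted_distances(delta_U, [1.0, 2.0])
D_second = weighted_distances(delta_U, [0.5, 0.0])
```

The tensor has N x N x K entries, so for large cohorts a real implementation would likely need batching or neighbor sampling. That is outside this sketch.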

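SLIDE 27

Avoiding Degenerate Solutions

  • Add priors to distance metrics
  • From:

$$\varrho^{(i)}_{\gamma}(d_\beta, d_U) = \gamma \sum_{j \neq i} \Big( \underbrace{d_\beta(\beta^{(i)}, \beta^{(j)})}_{\text{parameter distance}} - \underbrace{d_U(U^{(i)}, U^{(j)})}_{\text{covariate distance}} \Big)^{2}$$

  • To:

$$\varrho^{(i)}_{\gamma}(d_\beta, d_U) = \gamma \sum_{j \neq i} \Big( \underbrace{d_\beta(\beta^{(i)}, \beta^{(j)})}_{\text{parameter distance}} - \underbrace{d_U(U^{(i)}, U^{(j)})}_{\text{covariate distance}} \Big)^{2} + \psi_\alpha(d_\beta) + \psi_\upsilon(d_U)$$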

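SLIDE 28

Avoiding Degenerate Solutions

  • Add priors to distance metrics

$$\varrho^{(i)}_{\gamma}(d_\beta, d_U) = \gamma \sum_{j \neq i} \Big( \underbrace{d_\beta(\beta^{(i)}, \beta^{(j)})}_{\text{parameter distance}} - \underbrace{d_U(U^{(i)}, U^{(j)})}_{\text{covariate distance}} \Big)^{2} + \psi_\alpha(d_\beta) + \psi_\upsilon(d_U)$$

  • where

$$\psi_\alpha(d_\beta) = \alpha \|\phi_\beta - \phi^{0}_{\beta}\|_2$$

$$\psi_\upsilon(d_U) = \upsilon \|\phi_U - \phi^{0}_{U}\|_2$$

  • and we project loadings into the non-negative reals.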
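One degenerate solution the priors guard against: if both weight vectors shrink to zero, every distance is zero and the distance-matching term vanishes regardless of the data. The sketch below shows the prior penalty and the non-negative projection. The prior value, strengths, and function names are assumptions for illustration.

```python
import math


def prior_penalty(phi, phi_0, strength):
    """strength * ||phi - phi_0||_2, pulling the learned weights toward a prior."""
    return strength * math.sqrt(sum((p - q) ** 2 for p, q in zip(phi, phi_0)))


def project_nonnegative(phi):
    """Clip negative loadings to zero so each weighted distance stays non-negative."""
    return [max(0.0, p) for p in phi]


phi_0 = [1.0, 1.0]                    # assumed prior weights
collapsed = [0.0, 0.0]                # degenerate weights: all distances become zero

penalty_at_prior = prior_penalty(phi_0, phi_0, strength=0.5)
penalty_collapsed = prior_penalty(collapsed, phi_0, strength=0.5)
projected = project_nonnegative([-0.2, 0.7, 0.0])
```

The collapsed weights now incur a positive penalty, so setting all loadings to zero no longer minimizes the objective for free.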

SLIDE 29

Personalized Regression

  • Initialize at population solution
  • Allow each personalized model to “fine-tune” away from the central population solution (block coordinate descent)
  • Distance-matching regularization ensures the personalized models respect covariate structure
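The initialize-then-fine-tune idea can be sketched as a block coordinate descent in which each block is one sample's parameter vector. Everything below is an assumption made to keep the example short and runnable: squared-error loss, an unweighted L1 parameter distance, fixed covariate distances, finite-difference gradients, and a backtracking step size. It is not the authors' optimizer.

```python
def total_loss(betas, X, Y, D_U, gamma):
    """Squared-error prediction loss plus the distance-matching penalty."""
    n = len(X)
    loss = 0.0
    for i in range(n):
        pred = sum(x * b for x, b in zip(X[i], betas[i]))
        loss += (Y[i] - pred) ** 2
        for j in range(n):
            if j != i:
                d_b = sum(abs(p - q) for p, q in zip(betas[i], betas[j]))
                loss += gamma * (d_b - D_U[i][j]) ** 2
    return loss


def fine_tune(beta_pop, X, Y, D_U, gamma, sweeps=50, step=0.1, eps=1e-6):
    """Block coordinate descent: one block is one sample's parameter vector."""
    betas = [list(beta_pop) for _ in X]  # every sample starts at the population solution
    for _ in range(sweeps):
        for i in range(len(betas)):
            base = total_loss(betas, X, Y, D_U, gamma)
            old = list(betas[i])
            grad = []
            for k in range(len(old)):
                # central finite difference (a stand-in for an analytic gradient)
                betas[i][k] = old[k] + eps
                up = total_loss(betas, X, Y, D_U, gamma)
                betas[i][k] = old[k] - eps
                down = total_loss(betas, X, Y, D_U, gamma)
                betas[i][k] = old[k]
                grad.append((up - down) / (2 * eps))
            t = step
            while t > 1e-8:
                betas[i] = [b - t * g for b, g in zip(old, grad)]
                if total_loss(betas, X, Y, D_U, gamma) < base:
                    break  # accept the first step size that lowers the loss
                t *= 0.5
            else:
                betas[i] = old  # no improving step found; keep this block unchanged
    return betas


# Toy data (assumed): each sample's true slope equals its outcome.
X = [[1.0], [1.0], [1.0]]
Y = [1.0, 2.0, 3.0]
U = [0.0, 1.0, 2.0]
D_U = [[abs(a - b) for b in U] for a in U]

# Population least-squares solution for the one-feature case.
beta_pop = [sum(x[0] * y for x, y in zip(X, Y)) / sum(x[0] ** 2 for x in X)]
initial = total_loss([list(beta_pop) for _ in X], X, Y, D_U, gamma=0.5)
betas = fine_tune(beta_pop, X, Y, D_U, gamma=0.5)
final = total_loss(betas, X, Y, D_U, gamma=0.5)
```

Because a step is accepted only when it lowers the total loss, the fine-tuned loss can never exceed the loss at the population solution. The objective is non-convex, so the result is a local solution.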

SLIDE 30

Inference Procedure

  • Conveniently, we have already learned distance metrics to use for predictions.
  • On test data, we identify the closest neighbors and use their sample-specific models.
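A sketch of this lookup: compute the learned covariate distance from the test sample to every training sample, then borrow the models of the nearest ones. Averaging the k nearest models is an assumption made here for illustration; the slide only says that the neighbors' models are used.

```python
def absolute_difference(a, b):
    return abs(a - b)


def neighbor_model(u_test, U_train, betas, metrics, phi_U, k=1):
    """Average the sample-specific models of the k nearest training samples."""
    dists = [
        sum(w * d(a, b) for a, b, d, w in zip(u_test, u, metrics, phi_U))
        for u in U_train
    ]
    nearest = sorted(range(len(U_train)), key=lambda i: dists[i])[:k]
    n_params = len(betas[0])
    return [sum(betas[i][p] for i in nearest) / len(nearest) for p in range(n_params)]


# Toy training set (assumed): one covariate, two coefficients per model.
U_train = [[0.0], [1.0], [5.0]]
betas = [[1.0, 0.0], [3.0, 2.0], [10.0, 10.0]]
metrics = [absolute_difference]
phi_U = [1.0]  # stands in for the learned covariate weights

model_k1 = neighbor_model([0.4], U_train, betas, metrics, phi_U, k=1)
model_k2 = neighbor_model([0.4], U_train, betas, metrics, phi_U, k=2)
```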

SLIDE 31

Simulation Results

[Figure: Recovery Error (lower is better), $\|\hat{\beta} - \beta\|_2 / \|\hat{\beta}_{pop} - \beta\|_2$ on the vertical axis (0.4 to 1.0), against Number of Samples (25, 50, 125, 250, 500), for four methods: Population, Mixture, VC, Personalized.]

  • At moderate sample sizes, personalized regression recovers parameters well.
  • At low sample size, cannot learn distance metrics.
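The recovery-error ratio on the vertical axis can be computed as sketched below. It is read here as the error of the estimated sample-specific parameters divided by the error of the population estimate, with the norm taken over all samples and parameters. That reading and the toy values are assumptions.

```python
import math


def frobenius_distance(A, B):
    """L2 norm of the difference between two N x P parameter matrices."""
    return math.sqrt(sum((a - b) ** 2 for ra, rb in zip(A, B) for a, b in zip(ra, rb)))


def relative_recovery_error(beta_hat, beta_pop_hat, beta_true):
    """||beta_hat - beta|| / ||beta_pop_hat - beta||; values below 1 beat the population model."""
    pop_rows = [beta_pop_hat for _ in beta_true]  # the population estimate repeated per sample
    return frobenius_distance(beta_hat, beta_true) / frobenius_distance(pop_rows, beta_true)


# Toy values (assumed): two samples with true slopes 1 and 3.
beta_true = [[1.0], [3.0]]
beta_pop_hat = [2.0]
beta_hat = [[1.0], [2.0]]  # recovers the first sample exactly, misses the second by 1

ratio = relative_recovery_error(beta_hat, beta_pop_hat, beta_true)
ratio_population = relative_recovery_error([[2.0], [2.0]], beta_pop_hat, beta_true)
```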

SLIDE 32

Personalized Regression Fine-Tunes Accuracy

  • Here, personalized regression overfits the data but is still better than competing methods.
  • Better clinical distance metrics and hyperparameter tuning will likely alleviate overfitting.

SLIDE 33

Personalized Regression Does Not Merely Identify More Enriched Gene Sets

Instead, it identifies a variety of sample-specific patterns which do not fit into a small number of mixtures.

[Figure: Enrichment Analysis of Complete Rankings]