CAMDA 03: Weakest Link Models for Detecting Small Groups of Genes to Predict Lung Cancer Survival
Presenter: Thomas J. Richards, Ph.D.
November 13, 2003
Affiliation: Dorothy P. & Richard P. Simmons Center for Interstitial Lung Diseases in the Division of Pulmonary, Allergy, and Critical Care Medicine, University of Pittsburgh
In Collaboration with: Roger S. Day, Sc.D.
University of Pittsburgh
Department of Biostatistics
and
University of Pittsburgh Cancer Institute
Weakest Link Models
- Make sense in biology;
- Can be applied to gene expression data;
- May identify novel gene interactions.
Response: Plant Growth
5 Necessary factors:
- Water;
- Sunlight;
- P;
- K;
- Ca;
How do factors combine to affect plant growth?
They don’t work together like this…
They may work together like this…
[Figure: contour plots of E(Y|X) over Water and Sun, comparing Traditional Models with the Weakest link model; regions of excess water and excess sun are marked.]
“Curve of Optimal Use (COU)” Reality?
They may work together like this…
Or like this…
Source: H. Frederik Nijhout, American Scientist (2003)
The Weakest Link Idea
$E(Y_i) = \min_j \{\phi_j(x_{ij}; \theta_j) : j = 1, \ldots, m\}$
- Usually, $\phi_j = \phi$ for all $j$;
- Weakest link gene minimizes $\phi$;
- Each patient has his/her own weakest link.
WL Model for Binary Response Data:
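$\phi_j(x_{ij}; \theta_j) = \operatorname{logit}^{-1}(\alpha_j + \beta_j x_{ij})$, with $\theta_j = (\alpha_j, \beta_j)$, so that
$E(Y_i) = \min_j \{\operatorname{logit}^{-1}(\alpha_j + \beta_j x_{ij}) : j = 1, \ldots, m\}$.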
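A minimal Python sketch of the binary weakest link mean, E(Y_i) = min_j logit^{-1}(α_j + β_j x_ij). The function names `inv_logit` and `wl_mean` and all parameter values are illustrative assumptions; they are not part of the Weaklink software.

```python
import math

def inv_logit(z):
    """Inverse logit: maps a real number to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def wl_mean(x, theta):
    """Weakest link mean for one patient.

    x     : list of m covariate values (x_i1, ..., x_im)
    theta : list of m (alpha_j, beta_j) pairs (hypothetical values)
    Returns (E(Y_i), index of the weakest link gene).
    """
    probs = [inv_logit(a + b * xj) for xj, (a, b) in zip(x, theta)]
    j_min = min(range(len(probs)), key=probs.__getitem__)
    return probs[j_min], j_min

# Hypothetical example: two genes sharing the same phi (alpha = 0, beta = 1).
theta = [(0.0, 1.0), (0.0, 1.0)]
p, j = wl_mean([2.0, -1.0], theta)        # second gene is the weakest link
p_more, _ = wl_mean([5.0, -1.0], theta)   # raising the non-limiting gene
```

In this sketch, raising the non-limiting covariate leaves the mean unchanged, which is the behaviour an additive logistic model cannot reproduce.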
Parametric Weakest Link (PWL) Model
Parametric Weakest Link Model For Survival Data
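$\lambda(t; x_i) = \lambda_0(t) \exp[\min_j \{\phi_j(x_{ij}; \theta_j)\}]$
With $\phi_j(x_{ij}; \theta_j) = \beta_j x_{ij}$:
$\lambda(t; x_i) = \lambda_0(t) \exp[\min_j \{\beta_j x_{ij}\}]$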
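As a small illustration of the survival form, the hazard relative to the baseline λ0(t) is exp(min_j β_j x_ij). The sketch below uses a hypothetical function name and made-up coefficients, and leaves the baseline hazard unspecified.

```python
import math

def wl_relative_hazard(x, beta):
    """Hazard ratio versus baseline: exp(min_j beta_j * x_ij)."""
    return math.exp(min(b * xj for b, xj in zip(beta, x)))

beta = [0.5, 0.5]                               # hypothetical coefficients
rh = wl_relative_hazard([2.0, 4.0], beta)       # min(1.0, 2.0) = 1.0
rh_more = wl_relative_hazard([2.0, 10.0], beta) # non-limiting gene raised
```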
Quantile-Matching Weakest Link (QWL) Model:
Curve of Optimal Use: $p_2 = f_\Delta(p_1)$, where
$f_\Delta(p) = 1 - F(F^{-1}(1 - p) - \Delta)$,
$f_\Delta^{-1}(p) = 1 - F(F^{-1}(1 - p) + \Delta)$,
and $F$ is a symmetric CDF (Normal or Logistic).
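A minimal sketch of the curve-of-optimal-use transform, assuming the definition f_Δ(p) = 1 − F(F^{-1}(1 − p) − Δ) with F taken to be the standard logistic CDF (the normal CDF would be handled the same way). Function names are illustrative.

```python
import math

def F(z):
    """Standard logistic CDF, a symmetric distribution function."""
    return 1.0 / (1.0 + math.exp(-z))

def F_inv(p):
    """Logistic quantile function (the logit)."""
    return math.log(p / (1.0 - p))

def f_delta(p, delta):
    """f_Delta(p) = 1 - F(F^{-1}(1 - p) - Delta)."""
    return 1.0 - F(F_inv(1.0 - p) - delta)

def f_delta_inv(p, delta):
    """Inverse of f_Delta: 1 - F(F^{-1}(1 - p) + Delta)."""
    return 1.0 - F(F_inv(1.0 - p) + delta)

# Round trip on a grid of probabilities, for an arbitrary Delta.
grid = [k / 10 for k in range(1, 10)]
round_trip_error = max(abs(f_delta_inv(f_delta(p, 0.7), 0.7) - p) for p in grid)
```

With Δ = 0 the transform is the identity, so the curve reduces to the diagonal p2 = p1.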
Data Pre-processing: Simplify!
Simplify the process, minimize data handling:
- Affy: run RMA, then generate ratios.
- cDNA arrays: use ratios.
- Focus on known genes only;
- 2000 LocusLink IDs in all 4 data sets;
Approach to Data Analysis
Gene Selection: Based on substantive hypotheses;
- Use DAVID at NIAID to get gene classes:
- Not optimal, but necessary in this case;
Approach to Data Analysis
Groups of genes, from DAVID:
- Cell Cycle (CELL, 24 genes);
- Apoptosis (AP, 12 genes);
- Extracellular Matrix (ECM, 18 genes);
- Matrix Metalloproteinases (MMPs, 10 genes);
- WNT Pathway (11 genes).
Approach to Data Analysis
Form dyads of genes, for testing:
- CELL.AP (288), CELL.ECM (432), …
- AP.ECM (216), AP.MMP (120), …
Etc.
- Pair up all of the above genes with 45 genes from the Beer supplemental data.
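The dyad counts follow from the class sizes if every gene in one class is paired with every gene in the other. A short sketch under that assumption (the dictionary layout and names are illustrative; the slides do not say whether every class pair was actually tested):

```python
from itertools import combinations

# Class sizes as listed on the DAVID slide; 45 genes from the Beer data.
group_sizes = {"CELL": 24, "AP": 12, "ECM": 18, "MMP": 10, "WNT": 11}
BEER = 45

# Cross-class dyads: one pair for each gene in class a and each gene in class b.
dyads = {f"{a}.{b}": group_sizes[a] * group_sizes[b]
         for a, b in combinations(group_sizes, 2)}

# Each class paired with the Beer genes.
for g, n in group_sizes.items():
    dyads[f"{g}.BEER"] = n * BEER
```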
Approach to Data Analysis
- Use profile likelihood to estimate a COU for each pair of genes;
- Use Bonferroni-by-4 on the p-values;
- For the direction, take the smallest of the four p-values.
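A sketch of the direction choice and the Bonferroni-by-4 adjustment: take the smallest of the four direction p-values and multiply it by 4, capped at 1. The function name and the p-values below are hypothetical.

```python
def bonferroni_by_4(pvalues):
    """pvalues: dict mapping each of the four directions to its p-value.

    Returns (chosen direction, Bonferroni-adjusted p-value).
    """
    direction = min(pvalues, key=pvalues.get)
    adjusted = min(1.0, 4 * pvalues[direction])
    return direction, adjusted

# Hypothetical p-values for one gene pair.
pvals = {"minp1p2": 0.21, "maxp1p2": 0.048, "maxp1q2": 0.0031, "minp1q2": 0.55}
direction, adj = bonferroni_by_4(pvals)
```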
Selected Results
- CELL.AP: 60 of 288 had adjusted p < 0.05.
- ECM.MMP: 37 of 180 had adjusted p < 0.05.
- ECM.BEER: 299 of 810 had adjusted p < 0.05.
- WNT.BEER: 152 of 495 had adjusted p < 0.05.
Selected Results
Significant pairs by direction:

Dyad class   Significant   minp1p2   maxp1p2   maxp1q2   minp1q2
CELL.AP               60         5        17        13        25
ECM.MMP               37         2         6        11        18
ECM.BEER             299        60        65       100        74
WNT.BEER             152        32        19        56        45
Selected Results
LocusLink ID = 4175, a Cell Cycle component: MCM6, minichromosome maintenance deficient 6 (S. cerevisiae), involved in initiating replication.
Biological interaction with 7 LocusLink IDs in the apoptosis class (5 in same direction):
- 2 minp1p2: TRAF1, TNFRSF1B;
- 3 maxp1p2: SFRS2IP, MCL1, TRADD;
- 1 maxp1q2: CRADD (good prognosis);
- 1 minp1q2: BCL2L2.
MCM6
- MCMs 2-7 bind to DNA after mitosis and enable DNA replication.
- MCM2 is a biomarker of proliferating cells and a marker for premalignant lung cells.
- MCM6 is in a chromosomal region that is amplified in lung cancer, and its mRNA level is also increased (Kaminski, Dehan, unpublished data).
Selected Results II
Can we find unexpected interactions?
Biological interactions between Beer & ECM?
ECM genes show up in every cancer dataset.
Fibronectin is a predictor of melanoma invasiveness.
PAI-1 (Plasminogen Activator Inhibitor 1)
- Is a known marker of bad prognosis
- Interacts significantly with at least 4 ECM genes:
– Vitronectin maxp1p2 (Good Prognosis!)
– Collagen 1A2 maxp1q2
– Collagen 9A2 minp1q1
– Collagen 5A1 minp1q1
Does it make sense?
- Elevated PAI-1 activities are associated with coronary thrombosis and with a poor prognosis in many cancers.
- Vitronectin binding extends the lifetime of active PAI-1, which controls hemostasis and has also been implicated in angiogenesis.
- The PAI-1 effects on cell adhesion and motility depend on vitronectin binding…
Conclusions
Weakest Link Models:
- Make sense in biology;
- Can be applied to gene expression data;
- May identify novel gene interactions.
Next Steps
- Validation on independent data set;
- Extend from dyads to triads;
- Use triads to explore pathways;
- Extend to arbitrary number of genes.
Acknowledgements:
Naftali Kaminski, M.D., Director, Dorothy P. & Richard P. Simmons Center for Interstitial Lung Diseases
Public Defenders’ Association
Supplementary Slides
18-Nov-03 Introduction: Motivation for Model 43
Potential Problems with Linear Models
- Mechanistic model, not just predictive.
- Several covariates impact a response.
– Example: immune response in Melanoma.
- Each covariate is “necessary.”
– Necessary = “Necessary to impact response probability.”
- Logistic Model is unrealistic:
– Increasing a covariate always has an effect.
– One covariate can be traded off for another.
- Example: Branch, Bryant, et al. (1997): N-acetyltransferase Metabolic Activity and Bladder Cancer.
– Goal: determine role of N-acetyltransferase slow acetylator phenotype in susceptibility to occupationally related aggressive bladder cancer.
– Problem: possible interaction without main effect.
Interaction without Main Effects
- For categorical data, not a new idea:
– “Synergism” in BFH (1975).
– 2 x 2 x 2 contingency table.
– BFH cite Worcester [1971] model, for thromboembolism data.
- My adaptation of BFH…
– (To SWP3.0)
- Est. RR (controlling for age, sex, alcohol, tobacco):

                        Occupational exposure
Acetylator Phenotype    Unexposed    Exposed
Fast                    1.0          1.0
Slow                    1.1          8.0

(1.9, 3.4) = 95% CI; p < 0.01
Is there “synergy”, or “synergism”, here?
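One way to look at the question, as a rough sketch: under a purely multiplicative model with no interaction, the relative risk in the slow-acetylator, exposed cell would be the product of the two single-factor relative risks. The multiplicative scale and the name `interaction_ratio` are assumptions made here for illustration; the slides do not specify a synergy measure.

```python
def interaction_ratio(rr_a_only, rr_b_only, rr_both):
    """Observed joint RR divided by the RR expected with no interaction."""
    expected = rr_a_only * rr_b_only
    return rr_both / expected

# Estimated relative risks from the table above.
rr_exposed_fast = 1.0      # exposure alone
rr_unexposed_slow = 1.1    # slow acetylator phenotype alone
rr_exposed_slow = 8.0      # both factors present

ratio = interaction_ratio(rr_exposed_fast, rr_unexposed_slow, rr_exposed_slow)
```

Neither factor alone moves the relative risk much, yet the joint cell is far above the product: an interaction with essentially no main effects.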
$E[Y_i \mid X_1, X_2] = \pi_i$, with $\operatorname{logit}(\pi_i) = \alpha + \beta \rho_i$, where
$\rho_i = \min\{p_1, f_\Delta^{-1}(p_2)\}$; or
$\rho_i = \max\{p_1, f_\Delta^{-1}(p_2)\}$; or
$\rho_i = \max\{p_1, 1 - f_\Delta^{-1}(p_2)\}$; or
$\rho_i = \min\{p_1, 1 - f_\Delta^{-1}(p_2)\}$,
and $f_\Delta : [0,1] \to [0,1]$ is defined by $f_\Delta(p_1) = 1 - F(F^{-1}(1 - p_1) - \Delta)$, where $F$ is a symmetric distribution function.
The Quantile-Matching Weakest Link (QWL) Model
In p1-p2 space, the unit square, define a new covariate, one of:
- ρ = min{p1, p2} (minp1p2)
- ρ = max{p1, p2} (maxp1p2)
- ρ = max{p1, 1 - p2} (maxp1q2)
- ρ = min{p1, 1 - p2} (minp1q2)
QWL Model
For binary response data: logit E[Yi | X1, X2] = α + β ρi
For survival data: λ(t; xi) = λ0(t)exp(β ρi)
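The four candidate covariates are simple to compute once p1 and p2 are in hand. A minimal sketch (the function name is illustrative, and how p1 and p2 are obtained from the expression data is outside its scope):

```python
def qwl_covariates(p1, p2):
    """Return the four QWL covariates for a point (p1, p2) in the unit square."""
    q2 = 1.0 - p2
    return {
        "minp1p2": min(p1, p2),
        "maxp1p2": max(p1, p2),
        "maxp1q2": max(p1, q2),
        "minp1q2": min(p1, q2),
    }

rho = qwl_covariates(0.5, 0.75)
```

Each covariate then enters an ordinary one-covariate logistic or proportional hazards fit, which is why fitting the model needs no special machinery.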
Fitting this QWL Model: Done.
Type A: $\rho_i = \min\{p_1, f_\Delta^{-1}(p_2)\}$, i.e. $\rho_i = p_1$ if $p_2 > f_\Delta(p_1)$, and $\rho_i = f_\Delta^{-1}(p_2)$ if $p_2 < f_\Delta(p_1)$.
Type B: $\rho_i = \max\{p_1, f_\Delta^{-1}(p_2)\}$, i.e. $\rho_i = f_\Delta^{-1}(p_2)$ if $p_2 > f_\Delta(p_1)$, and $\rho_i = p_1$ if $p_2 < f_\Delta(p_1)$.
Type C: $\rho_i = \max\{p_1, 1 - f_\Delta^{-1}(p_2)\}$, i.e. $\rho_i = p_1$ if $p_2 > f_\Delta(1 - p_1)$, and $\rho_i = 1 - f_\Delta^{-1}(p_2)$ if $p_2 < f_\Delta(1 - p_1)$.
Type D: $\rho_i = \min\{p_1, 1 - f_\Delta^{-1}(p_2)\}$, i.e. $\rho_i = 1 - f_\Delta^{-1}(p_2)$ if $p_2 > f_\Delta(1 - p_1)$, and $\rho_i = p_1$ if $p_2 < f_\Delta(1 - p_1)$.
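The agreement between a piecewise definition and its min/max form can be checked numerically. The sketch below does so for Type A, assuming a logistic F and an arbitrary Δ = 0.7; all function names are illustrative.

```python
import math

def F(z):
    """Standard logistic CDF (symmetric)."""
    return 1.0 / (1.0 + math.exp(-z))

def F_inv(p):
    """Logistic quantile function."""
    return math.log(p / (1.0 - p))

def f_delta(p, delta):
    return 1.0 - F(F_inv(1.0 - p) - delta)

def f_delta_inv(p, delta):
    return 1.0 - F(F_inv(1.0 - p) + delta)

def rho(p1, p2, delta, kind):
    """Covariate for Types A-D written in min/max form."""
    g = f_delta_inv(p2, delta)
    return {"A": min(p1, g), "B": max(p1, g),
            "C": max(p1, 1.0 - g), "D": min(p1, 1.0 - g)}[kind]

def rho_a_piecewise(p1, p2, delta):
    """Type A written piecewise, split at the curve p2 = f_delta(p1)."""
    return p1 if p2 > f_delta(p1, delta) else f_delta_inv(p2, delta)

grid = [k / 10 for k in range(1, 10)]
max_gap = max(abs(rho(p1, p2, 0.7, "A") - rho_a_piecewise(p1, p2, 0.7))
              for p1 in grid for p2 in grid)
```

Setting Δ = 0 in `rho` recovers the simple covariates min{p1, p2}, max{p1, p2}, max{p1, 1 - p2} and min{p1, 1 - p2}.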
Simple Expressions for Covariates
[Figure: contour plot over % DR+CD8+ Lymphocytes and # CD8+ Lymphocytes, with panels of 1-yr DFS against each of the two covariates.]
[Figure: contour plot over x1 and x2.]
> plot(WL.qregf)
[Figure: panels of y against x1 and y against x2.]
> plot(ES,sm.h=c(0.5,0.5),xlab1="x1",xlab2="x2",ylab="y")
Weaklink Software
Here, we generate a 1000-observation data set from a Weakest Link Model with binary response, and see how a logistic regression model fails to fit the data. First, generate the objects with 3 easy commands:
> WL <- SimBinaryWL(Theta=c(1.0,2.0,1.0,3.0),n.obs=1000,x.vcv=diag(rep(2,2)))
> WL.qregf <- qregf(vnames=c("x1","x2"),dframe="DD",binresp="y")
> ES <- EpsSelect(WL.qregf,pcontour=0.48,epsilon=1)
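For readers without the Weaklink package, the simulation step can be approximated in plain Python. This is a sketch under stated assumptions, not a port of `SimBinaryWL`: it reads `Theta=c(1.0,2.0,1.0,3.0)` as (α1, β1, α2, β2), draws independent normal covariates with mean 0 and variance 2 (matching `x.vcv=diag(rep(2,2))`), and uses the hypothetical name `sim_binary_wl`.

```python
import math
import random

def sim_binary_wl(theta, n_obs, x_var=2.0, seed=0):
    """Simulate (x_1, ..., x_m, y) rows from a binary weakest link model.

    theta : list of (alpha_j, beta_j) pairs; the pairing is an assumption
    x_var : variance of each independent N(0, x_var) covariate
    """
    rng = random.Random(seed)
    sd = math.sqrt(x_var)
    rows = []
    for _ in range(n_obs):
        x = [rng.gauss(0.0, sd) for _ in theta]
        # Success probability is the smallest inverse-logit across covariates.
        p = min(1.0 / (1.0 + math.exp(-(a + b * xj)))
                for xj, (a, b) in zip(x, theta))
        y = 1 if rng.random() < p else 0
        rows.append((*x, y))
    return rows

data = sim_binary_wl([(1.0, 2.0), (1.0, 3.0)], n_obs=1000)
```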
- Analyze existing data; or
- Simulate WL data.
- Model Signatures.
Next, plot the objects …
Results from software
beta        stage      MinPval    Direction   Bonferroni   name2                            name1
-1.04E+02   1.178205   7.39E-05   maxp1q2     2.96E-04     S100 calcium binding protein P   fibronectin 1
-1.09E+02   1.164991   4.75E-04   maxp1q2     1.90E-03     S100 calcium binding protein P   collagen, type XI, alpha 1
-1.14E+02   1.184755   2.54E-04   maxp1q2     1.02E-03     S100 calcium binding protein P   collagen, type IX, alpha 3
-1.17E+02   1.173454   1.66E-05   maxp1q2     6.62E-05     S100 calcium binding protein P   tenascin C (hexabrachion)