

SLIDE 1

Variable Selection and Model Choice in Survival Models with Time-Varying Effects

Boosting Survival Models

Benjamin Hofner 1

Department of Medical Informatics, Biometry and Epidemiology (IMBE), Friedrich-Alexander-Universität Erlangen-Nürnberg

joint work with Thomas Kneib and Torsten Hothorn

Department of Statistics, Ludwig-Maximilians-Universität München

useR! 2008

1 benjamin.hofner@imbe.med.uni-erlangen.de

SLIDE 2


Introduction

Cox PH model:
$$\lambda_i(t) = \lambda(t, x_i) = \lambda_0(t) \exp(x_i^\top \beta)$$
with
- $\lambda_i(t)$: hazard rate of observation $i$ $[i = 1, \dots, n]$
- $\lambda_0(t)$: baseline hazard rate
- $x_i$: vector of covariates for observation $i$
- $\beta$: vector of regression coefficients

Problem: a restrictive model, not allowing for
- non-proportional hazards (e.g., time-varying effects)
- non-linear effects
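For reference, the standard Cox PH model can be fitted in R with the survival package; the following is a minimal sketch with simulated data (all objects are illustrative):

```r
## Minimal Cox PH fit; the data are simulated for illustration only
library(survival)
set.seed(1)
x <- rnorm(100)
time <- rexp(100, rate = exp(0.5 * x))  # hazard depends on x, PH assumption holds
status <- rbinom(100, 1, 0.8)           # 1 = event observed, 0 = censored
coxph(Surv(time, status) ~ x)           # estimates beta; lambda_0(t) stays unspecified
```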

SLIDE 3


Additive Hazard Regression

Generalisation: Additive Hazard Regression

(Kneib & Fahrmeir, 2007)

$$\lambda_i(t) = \exp(\eta_i(t)) \quad \text{with} \quad \eta_i(t) = \sum_{j=1}^{J} f_j(x_i(t))$$

generic representation of covariate effects $f_j$:
a) linear effects: $f_j(x_i(t)) = f_{\text{linear}}(\tilde x_i) = \tilde x_i \beta$
b) smooth effects: $f_j(x_i(t)) = f_{\text{smooth}}(\tilde x_i)$
c) time-varying effects: $f_j(x_i(t)) = f_{\text{smooth}}(t) \cdot \tilde x_i$
where $\tilde x_i \in x_i(t)$. Note: c) includes the log-baseline hazard for $\tilde x_i \equiv 1$.

SLIDE 4


P-Splines

flexible terms can be represented using P-splines

(Eilers & Marx, 1996)

model term ($x$ can be either $\tilde x_i$ or $t$):
$$f_j(x) = \sum_{m=1}^{M} \beta_{jm} B_{jm}(x) \qquad (j = 1, \dots, J)$$
penalty (cases b) and c); the parametric case a) is left unpenalized):
$$\text{pen}_j(\beta_j) = \kappa_j \, \beta_j^\top K \beta_j$$
with $K = D^\top D$ (i.e., the cross product of the difference matrix $D$), e.g.
$$D = \begin{pmatrix} 1 & -2 & 1 & & \\ & 1 & -2 & 1 & \\ & & \ddots & \ddots & \ddots \end{pmatrix}$$
$\kappa_j$: smoothing parameter (larger $\kappa_j$ $\Rightarrow$ more penalization $\Rightarrow$ smoother fit)
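In R, the difference matrix $D$ and the penalty $K = D^\top D$ can be constructed directly; a minimal sketch for $M = 6$ basis coefficients (the value of $M$ is an arbitrary assumption):

```r
## Second-order difference matrix D and penalty K = D'D for M = 6 coefficients
M <- 6
D <- diff(diag(M), differences = 2)  # each row carries the pattern 1 -2 1
K <- crossprod(D)                    # K = t(D) %*% D
pen <- function(beta, kappa) kappa * drop(t(beta) %*% K %*% beta)
pen(beta = rnorm(M), kappa = 10)     # larger kappa => larger penalty => smoother fit
```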

SLIDE 5


Inference

Penalized Likelihood Criterion:

(NB: this is the full log-likelihood, not Cox's partial likelihood)
$$L_{\text{pen}}(\beta) = \sum_{i=1}^{n} \left[ \delta_i \, \eta_i(t_i) - \int_0^{t_i} \exp(\eta_i(t)) \, dt \right] - \sum_{j=0}^{J} \text{pen}_j(\beta_j)$$
with
- $T_i$: true survival time
- $C_i$: censoring time
- $t_i = \min(T_i, C_i)$: observed survival time (right censoring)
- $\delta_i = \mathbb{1}(T_i \le C_i)$: indicator for non-censoring

Problem: estimation and, in particular, model choice.
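For a constant predictor $\eta$ the integral collapses to $t_i \cdot \exp(\eta)$, and the (unpenalized) criterion can be written down in a few lines of R; this also verifies the offset formula used in the algorithm below (data simulated, no penalty term):

```r
## Full log-likelihood for a constant log-hazard eta (simulated data, no penalty)
set.seed(1)
Ttrue <- rexp(100, rate = 0.5)   # true survival times
C <- rexp(100, rate = 0.2)       # censoring times
ti <- pmin(Ttrue, C)             # observed survival times (right censoring)
delta <- as.numeric(Ttrue <= C)  # indicator for non-censoring
loglik <- function(eta) sum(delta * eta - ti * exp(eta))
## the maximizer equals the offset used in step (i): log(sum(delta) / sum(ti))
optimize(loglik, c(-5, 5), maximum = TRUE)$maximum
log(sum(delta) / sum(ti))
```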

SLIDE 6


CoxflexBoost

Aim: maximization of a (potentially) high-dimensional log-likelihood with different modeling alternatives. Thus, we use:
- an iterative algorithm,
- likelihood-based boosting,
- component-wise base-learners.
Therefore: use one base-learner $g_j(\cdot)$ for each covariate (or each model component) $[\,j \in \{1, \dots, J\}\,]$.
$\Rightarrow$ Component-wise boosting as a means of estimation and variable selection, combined with model choice.


SLIDE 9


CoxflexBoost Algorithm

(i) Initialization: iteration index $m := 0$.
- Function estimates (for all $j \in \{1, \dots, J\}$): $\hat f_j^{[0]}(\cdot) \equiv 0$
- Offset (MLE for a constant log-hazard): $\hat\eta^{[0]}(\cdot) \equiv \log\left( \sum_{i=1}^{n} \delta_i \Big/ \sum_{i=1}^{n} t_i \right)$

SLIDE 10


(ii) Estimation: $m := m + 1$. Fit all (linear/P-spline) base-learners separately, $\hat g_j = g_j(\cdot\,; \hat\beta_j)$, $\forall j \in \{1, \dots, J\}$, by penalized MLE, i.e.,
$$\hat\beta_j = \arg\max_{\beta} L^{[m]}_{j,\text{pen}}(\beta)$$
with the penalized log-likelihood (analogously to the above)
$$L^{[m]}_{j,\text{pen}}(\beta) = \sum_{i=1}^{n} \left[ \delta_i \left( \hat\eta_i^{[m-1]} + g_j(x_i(t_i); \beta) \right) - \int_0^{t_i} \exp\left( \hat\eta_i^{[m-1]}(\tilde t\,) + g_j(x_i(\tilde t\,); \beta) \right) d\tilde t \right] - \text{pen}_j(\beta),$$
where the additive predictor $\eta_i$ is split into the estimate from the previous iteration, $\hat\eta_i^{[m-1]}$, and the current base-learner $g_j(\cdot\,; \beta)$.

SLIDE 11


(iii) Selection: choose the base-learner $\hat g_{j^*}$ with
$$j^* = \arg\max_{j \in \{1, \dots, J\}} L^{[m]}_{j,\text{unpen}}(\hat\beta_j)$$
(iv) Update:
- Function estimates (for all $j \in \{1, \dots, J\}$):
$$\hat f_j^{[m]} = \begin{cases} \hat f_j^{[m-1]} + \nu \cdot \hat g_j & j = j^* \\ \hat f_j^{[m-1]} & j \neq j^* \end{cases}$$
- Additive predictor (= fit): $\hat\eta^{[m]} = \hat\eta^{[m-1]} + \nu \cdot \hat g_{j^*}$
with step length $\nu \in (0, 1]$ (here: $\nu = 0.1$).
(v) Stopping rule: continue iterating steps (ii) to (iv) until $m = m_{\text{stop}}$. A compact sketch of the full loop follows below.
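A self-contained sketch of steps (i) to (v) for linear, time-constant base-learners only (so the integral collapses to $t_i \cdot \exp(\eta_i)$); penalization is omitted and all names are illustrative, so this is not the CoxflexBoost implementation:

```r
## Component-wise likelihood boosting, reduced to linear time-constant effects
set.seed(1)
n <- 200
x <- matrix(rnorm(3 * n), ncol = 3)          # three candidate covariates
Ttrue <- rexp(n, rate = exp(0.5 * x[, 1]))   # only x1 carries an effect
C <- rexp(n, rate = 0.1)
ti <- pmin(Ttrue, C)
delta <- as.numeric(Ttrue <= C)
loglik <- function(eta) sum(delta * eta - ti * exp(eta))

nu <- 0.1                                    # step length
mstop <- 100
eta <- rep(log(sum(delta) / sum(ti)), n)     # (i) offset: constant log-hazard MLE
beta <- numeric(ncol(x))
for (m in seq_len(mstop)) {
  ## (ii) fit each base-learner separately (here: unpenalized MLE)
  fits <- lapply(seq_len(ncol(x)), function(j) {
    opt <- optimize(function(b) loglik(eta + b * x[, j]),
                    interval = c(-5, 5), maximum = TRUE)
    list(b = opt$maximum, ll = opt$objective)
  })
  ## (iii) select the base-learner with the largest log-likelihood
  jstar <- which.max(sapply(fits, `[[`, "ll"))
  ## (iv) update coefficient and additive predictor with step length nu
  beta[jstar] <- beta[jstar] + nu * fits[[jstar]]$b
  eta <- eta + nu * fits[[jstar]]$b * x[, jstar]
}                                            # (v) stop at m = mstop
round(beta, 3)  # mass concentrates on the first coefficient
```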

SLIDE 12


Some Aspects of CoxflexBoost

- Estimation: full penalized MLE, damped by the step length $\nu$
- Selection: based on the unpenalized log-likelihood $L^{[m]}_{j,\text{unpen}}$
- Base-learners: specified by (initial) degrees of freedom $\text{df}_j$
- Likelihood-based boosting (in general): see, e.g., Tutz and Binder (2006)
- The above aspects in CoxflexBoost: see, e.g., model-based boosting (Bühlmann & Hothorn, 2007)

SLIDE 13


Degrees of Freedom

- Specifying df is more intuitive than specifying the smoothing parameter $\kappa$.
- It makes flexible terms comparable to other model components, e.g., linear effects.
- Problem: the df are not constant over the (boosting) iterations.
- But simulation studies showed: no big deviation from the initially specified $\text{df}_j$.

[Figure: estimated degrees of freedom $\text{df}(m)$, traced over the boosting iterations $m$ for the flexible base-learner bbs(x3) in 200 replicates; the initially specified degrees of freedom are shown as a dashed line.]
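The df of a penalized base-learner equal the trace of its hat matrix, which makes the $\kappa$-to-df relation easy to inspect; a minimal sketch (the basis size of 20 is an arbitrary assumption):

```r
## df(kappa) = trace of the P-spline hat matrix B (B'B + kappa K)^{-1} B'
library(splines)
x <- seq(0, 1, length.out = 100)
B <- bs(x, df = 20)                                   # B-spline design matrix
K <- crossprod(diff(diag(ncol(B)), differences = 2))  # difference penalty
df_kappa <- function(kappa)
  sum(diag(B %*% solve(crossprod(B) + kappa * K, t(B))))
c(df_kappa(1), df_kappa(100))  # larger kappa => fewer effective df
```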
SLIDE 14


Model Choice

Recall from the generic representation: $f_j$ can be
a) a linear effect: $f_j(x_i(t)) = f_{\text{linear}}(\tilde x_i) = \tilde x_i \beta$
b) a smooth effect: $f_j(x_i(t)) = f_{\text{smooth}}(\tilde x_i)$
c) a time-varying effect: $f_j(x_i(t)) = f_{\text{smooth}}(t) \cdot \tilde x_i$

$\Rightarrow$ We see: $\tilde x_i$ can enter the model in three different ways. But how? Add all possibilities as base-learners to the model; boosting can choose between them (see the sketch below). But the df must be comparable! Otherwise, more flexible base-learners are preferred.
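In mboost-style notation such a specification could look as follows; bols(), bbs(), and the by argument exist in mboost, whereas cfboost() and the data objects are assumptions, so treat this purely as a sketch:

```r
## Hypothetical specification: all three representations of x compete as base-learners
library(survival)
library(mboost)
fm <- Surv(time, status) ~
  bols(x) +                        # a) linear effect
  bbs(x, df = 1, center = TRUE) +  # b) smooth deviation, df kept comparable
  bbs(time, by = x, df = 1)        # c) time-varying effect f_smooth(t) * x
## mod <- cfboost(fm, data = dat)  # assumed CoxflexBoost call (illustrative only)
```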


SLIDE 16


For higher-order differences ($d \ge 2$): df $> 1$ as $\kappa \to \infty$, since a polynomial of order $d - 1$ remains unpenalized.
Solution: decomposition (based on Kneib, Hothorn, & Tutz, 2008):
$$g(x) = \underbrace{\beta_0 + \beta_1 x + \dots + \beta_{d-1} x^{d-1}}_{\text{unpenalized, parametric part}} + \underbrace{g_{\text{centered}}(x)}_{\text{deviation from the polynomial}}$$
- Add the unpenalized part as separate, parametric base-learners.
- Assign df $= 1$ to the centered effect (and add it as a P-spline base-learner).
- Proceed analogously for time-varying effects.

Technical realization (see Fahrmeir, Kneib, & Lang, 2004): decompose the vector of regression coefficients $\beta$ into $(\beta_{\text{unpen}}, \beta_{\text{pen}})$, utilizing a spectral decomposition of the penalty matrix, as sketched below.
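A minimal sketch of that technical step: the eigenvectors of the penalty matrix with (numerically) zero eigenvalues span the unpenalized part:

```r
## Spectral decomposition of a 2nd-order difference penalty
M <- 10
K <- crossprod(diff(diag(M), differences = 2))
e <- eigen(K, symmetric = TRUE)
penalized   <- e$vectors[, e$values >  1e-8]  # directions the penalty acts on
unpenalized <- e$vectors[, e$values <= 1e-8]  # null space of K
ncol(unpenalized)  # 2: constant + linear remain unpenalized for d = 2
```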

SLIDE 17


Early Stopping

1. Run the algorithm $m_{\text{stop}}$ times (with $m_{\text{stop}}$ defined beforehand).
2. Determine a new $m_{\text{stop,opt}} \le m_{\text{stop}}$:
   - based on an out-of-bag sample (easy to use in simulations), or
   - based on an information criterion, e.g., the AIC.

$\Rightarrow$ Prevents the algorithm from stopping in a local maximum of the log-likelihood.
$\Rightarrow$ Early stopping prevents overfitting. A toy sketch of the out-of-bag variant follows below.
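In the sketch, the path of out-of-bag predictors is faked as random numbers just to show the bookkeeping; in practice eta_path would be produced by the boosting run itself (all names are assumptions):

```r
## Toy out-of-bag early stopping; eta_path is faked for illustration
set.seed(1)
t_oob <- rexp(50)                   # held-out observed times
delta_oob <- rbinom(50, 1, 0.8)     # held-out censoring indicators
eta_path <- lapply(1:100, function(m) rnorm(50, mean = -m / 100))
oob_loglik <- sapply(eta_path, function(eta)
  sum(delta_oob * eta - t_oob * exp(eta)))
mstop_opt <- which.max(oob_loglik)  # new m_stop,opt <= m_stop
```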

SLIDE 18


Variable Selection and Model Choice

... is achieved by the selection of base-learners (in step (iii) of CoxflexBoost), i.e., component-wise boosting, combined with early stopping.
Simulation results (in short):
- good variable selection strategy
- good model choice strategy if only linear and smooth effects are used
- selection bias in favor of time-varying base-learners (if present) $\Rightarrow$ standardizing time could be a solution
- estimates are better if model choice is performed

SLIDE 19


Computational Aspects

CoxflexBoost is implemented in R. The crucial computation is the integral in $L^{[m]}_{j,\text{pen}}(\beta)$:
$$\int_0^{t_i} \exp\left( \hat\eta_i^{[m-1]}(\tilde t\,) + g_j(x_i(\tilde t\,); \beta) \right) d\tilde t$$
- time consuming
- evaluated very often (maximization of $L^{[m]}_{j,\text{pen}}(\beta)$)

The R function integrate() is slow in this context $\Rightarrow$ a (specialized) vectorized trapezoid integration was implemented $\Rightarrow$ $\approx$ 100 times quicker (a sketch follows below). Efficient storage of matrices can reduce the computational burden $\Rightarrow$ recycling of results.
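A minimal sketch of such a vectorized trapezoid rule (grid size and all names are illustrative, not the package's actual implementation):

```r
## Vectorized trapezoid rule: integrates f from 0 to upper[i] for all i at once
trapezoid <- function(f, upper, ngrid = 40) {
  grid <- outer(upper, seq(0, 1, length.out = ngrid))  # row i: grid on [0, t_i]
  vals <- f(grid)                                      # f evaluated on all grids
  h <- upper / (ngrid - 1)                             # step width per observation
  h * (rowSums(vals) - 0.5 * (vals[, 1] + vals[, ngrid]))
}
## e.g. integral of exp(eta(t)) dt for eta(t) = -0.1 * t; exact: (1 - exp(-0.1 * t)) / 0.1
trapezoid(function(t) exp(-0.1 * t), upper = c(1, 2, 5))
```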


SLIDE 22


Summary & Outlook

CoxflexBoost ...
- ... allows for variable selection and model choice,
- ... allows for flexible modeling:
  - flexible, non-linear effects
  - time-varying effects (i.e., non-proportional hazards)
- ... provides functions to manipulate and show results (summary(), plot(), subset(), ...).

To be continued ...
- a formula for the AIC (for boosting in survival models)
- inclusion of mandatory covariates (updated in each step)
- a measure of variable importance, e.g., $\big\| \hat f_j^{[m_{\text{stop}}]}(\cdot) \big\|$


SLIDE 24


Literature

Bühlmann, P., & Hothorn, T. (2007). Boosting algorithms: Regularization, prediction and model fitting. Statistical Science, 22(4), 477–505.

Eilers, P. H. C., & Marx, B. D. (1996). Flexible smoothing with B-splines and penalties. Statistical Science, 11(2), 89–121.

Fahrmeir, L., Kneib, T., & Lang, S. (2004). Penalized structured additive regression: A Bayesian perspective. Statistica Sinica, 14, 731–761.

Kneib, T., & Fahrmeir, L. (2007). A mixed model approach for geoadditive hazard regression. Scandinavian Journal of Statistics, 34, 207–228.

Kneib, T., Hothorn, T., & Tutz, G. (2008). Variable selection and model choice in geoadditive regression. Biometrics (accepted).

Tutz, G., & Binder, H. (2006). Generalized additive modelling with implicit variable selection by likelihood-based boosting. Biometrics, 62, 961–971.

SLIDE 25

[Appendix figure: estimated effect of x1 on the log(hazard rate), with model choice (left panel) vs. without model choice (right panel); further panels show smooth effects of x on the log(hazard rate).]

SLIDE 26

[Appendix figure: estimated time-varying effects, log(hazard rate) over time (two panels).]