SLIDE 1

Boosting: more than an ensemble method for prediction

Peter Bühlmann, ETH Zürich

SLIDE 2

1. Historically: Boosting is about multiple predictions

Data: (X_1, Y_1), . . . , (X_n, Y_n) (i.i.d. or stationary),
predictor variables X_i ∈ R^p, response variables Y_i ∈ R or Y_i ∈ {0, 1, . . . , J − 1}

Aim: estimation of a function f(·): R^p → R, e.g.

f(x) = E[Y | X = x], or f(x) = P[Y = 1 | X = x] with Y ∈ {0, 1},

or: the distribution of the survival time Y given X depends on some function f(X) only

"historical" view (for classification): Boosting is a multiple predictions (estimation) & combination method

SLIDE 3

Base procedure:

data → algorithm A → θ̂(·) (a function estimate)

e.g.: simple linear regression, tree, MARS, "classical" smoothing, neural nets, ...

Generating multiple predictions:

weighted data 1 → algorithm A → θ̂_1(·)
weighted data 2 → algorithm A → θ̂_2(·)
· · ·
weighted data M → algorithm A → θ̂_M(·)

Aggregation: f̂_A(·) = Σ_{m=1}^M a_m θ̂_m(·)

data weights? averaging weights a_m?

SLIDE 4

classification of lymph nodal status in breast cancer using gene expressions from microarray data:

n = 33, p = 7129 (for CART: gene-preselection, reducing to p = 50)

method                         test set error   gain over CART
CART                           22.5%            –
LogitBoost with trees          16.3%            28%
LogitBoost with bagged trees   12.2%            46%

this kind of boosting: mainly prediction, not much interpretation

SLIDE 5

2. Boosting algorithms

around 1990: Schapire constructed some early versions of boosting

AdaBoost: proposed for classification by Freund & Schapire (1996)

data weights (rough original idea): large weights for previously heavily misclassified instances (sequential algorithm)

averaging weights a_m: large if the in-sample performance in the mth round was good

Why should this be good?

SLIDE 6

Why should this be good? some common answers 5 years ago ... because

  • it works so well for prediction (which is quite true)
  • it concentrates on the "hard cases" (so what?)
  • AdaBoost almost never overfits the data, no matter how many iterations it is run (not true)

SLIDE 7

A better explanation

Breiman (1998/99): AdaBoost is a functional gradient descent (FGD) procedure

aim: find f*(·) = argmin_{f(·)} E[ρ(Y, f(X))]

e.g. for ρ(y, f) = |y − f|²: f*(x) = E[Y | X = x]

FGD solution: consider the empirical risk n⁻¹ Σ_{i=1}^n ρ(Y_i, f(X_i)) and do iterative steepest descent in function space

SLIDE 8

2.1. Generic FGD algorithm

Step 1. f̂_0 ≡ 0; set m = 0.

Step 2. Increase m by 1. Compute the negative gradient −∂ρ(Y, f)/∂f and evaluate it at f = f̂_{m−1}(X_i):
U_i (i = 1, . . . , n)

Step 3. Fit the negative gradient vector U_1, . . . , U_n by the base procedure:
(X_i, U_i)_{i=1}^n → algorithm A → θ̂_m(·),
e.g. θ̂_m fitted by (weighted) least squares;
i.e. θ̂_m(·) is an approximation of the negative gradient vector.

Step 4. Update
f̂_m(·) = f̂_{m−1}(·) + ν s_m · θ̂_m(·),
s_m = argmin_s n⁻¹ Σ_{i=1}^n ρ(Y_i, f̂_{m−1}(X_i) + s · θ̂_m(X_i)), 0 < ν ≤ 1,
i.e. proceed along an estimate of the negative gradient vector.

Step 5. Iterate Steps 2–4 until m = m_stop for some stopping iteration m_stop.
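The five steps above can be sketched in a few lines of Python. This is an illustrative toy implementation (not the speaker's code): the base procedure is a least-squares regression stump, and the line search is omitted (s_m = 1).

```python
import numpy as np

def fit_stump(X, U):
    """Least-squares regression stump: best single split on one predictor."""
    n, p = X.shape
    best = (np.inf, 0, 0.0, float(U.mean()), float(U.mean()))
    for j in range(p):
        order = np.argsort(X[:, j])
        xs, us = X[order, j], U[order]
        for i in range(1, n):
            left, right = us[:i], us[i:]
            rss = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
            if rss < best[0]:
                best = (rss, j, (xs[i - 1] + xs[i]) / 2, left.mean(), right.mean())
    _, j, c, a, b = best
    return lambda x: np.where(x[:, j] <= c, a, b)

def fgd_boost(X, Y, neg_gradient, m_stop=50, nu=0.2):
    """Generic FGD: repeatedly fit the negative gradient vector by the
    base procedure and move a small step nu along it (s_m = 1)."""
    F = np.zeros(len(Y))
    fits = []
    for _ in range(m_stop):
        U = neg_gradient(Y, F)      # Step 2: negative gradient at f_{m-1}(X_i)
        theta = fit_stump(X, U)     # Step 3: base procedure on (X_i, U_i)
        F = F + nu * theta(X)       # Step 4: update f_m
        fits.append(theta)
    return lambda Xnew: nu * sum(t(Xnew) for t in fits)
```

For squared error loss ρ(y, f) = (y − f)²/2 the negative gradient is simply the residual y − f, so `neg_gradient = lambda y, f: y - f` recovers L2Boosting.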

SLIDE 9

Why "functional gradient"? Alternative formulation in function space:

empirical risk functional: C(f) = n⁻¹ Σ_{i=1}^n ρ(Y_i, f(X_i))

inner product: ⟨f, g⟩ = n⁻¹ Σ_{i=1}^n f(X_i) g(X_i)

negative Gateaux derivative: −dC(f)(x) = −(∂/∂α) C(f + α 1_x)|_{α=0}; −dC(f̂_{m−1})(X_i) = U_i

if U_1, . . . , U_n are fitted by least squares: equivalent to maximizing ⟨−dC(f̂_{m−1}), θ⟩ w.r.t. θ(·) (if ∥θ∥ = 1), over all possible θ(·)'s from the base procedure

i.e.: θ̂_m(·) is the best approximation (most parallel) to the negative gradient −dC(f̂_{m−1})

SLIDE 10

By definition: FGD yields an additive combination of base procedure fits

ν Σ_{m=1}^{m_stop} s_m θ̂_m(·)

Breiman (1998): FGD with ρ(y, f) = exp(−(2y − 1) · f) for binary classification yields the AdaBoost algorithm (great result!)

Remark: FGD cannot be represented as some explicit estimation function(al)

f̂_m(·) = argmin_{f∈F} n⁻¹ Σ_{i=1}^n ρ(Y_i, f(X_i))

for some function class F

FGD is mathematically more difficult to analyze, but generically applicable (as an algorithm!) in very complex models

SLIDE 11

2.2. L2Boosting

(see also Friedman, 2001)

loss function ρ(y, f) = |y − f|², population minimizer f*(x) = E[Y | X = x]

FGD with base procedure θ̂(·): repeated fitting of residuals

m = 1: (X_i, Y_i)_{i=1}^n → θ̂_1(·), f̂_1 = ν θ̂_1; residuals U_i = Y_i − f̂_1(X_i)
m = 2: (X_i, U_i)_{i=1}^n → θ̂_2(·), f̂_2 = f̂_1 + ν θ̂_2; residuals U_i = Y_i − f̂_2(X_i)
. . .
f̂_{m_stop}(·) = ν Σ_{m=1}^{m_stop} θ̂_m(·) (stagewise greedy fitting of residuals)

Tukey (1977): "twicing" for m_stop = 2 and ν = 1
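The residual-fitting loop is only a few lines. A minimal sketch (illustrative only; the k-NN smoother used as the base procedure is a stand-in, not anything from the talk):

```python
import numpy as np

def l2boost(x, Y, base_fit, m_stop=20, nu=0.5):
    """L2Boosting: stagewise greedy fitting of residuals U_i = Y_i - f_{m-1}(x_i).
    Tukey's 'twicing' is the special case m_stop = 2, nu = 1."""
    F = np.zeros(len(Y))
    learners = []
    for _ in range(m_stop):
        theta = base_fit(x, Y - F)     # fit the current residuals
        F = F + nu * theta(x)
        learners.append(theta)
    return lambda xn: nu * sum(t(xn) for t in learners)

def knn_smoother(x, U, k=15):
    """A crude low-variance base procedure: k-nearest-neighbour average
    for a one-dimensional predictor."""
    def predict(xn):
        idx = np.argsort(np.abs(xn[:, None] - x[None, :]), axis=1)[:, :k]
        return U[idx].mean(axis=1)
    return predict
```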

SLIDE 12

Any gain over classical methods? (for additive modeling)

Ozone data: n = 300, p = 8

[Figure: MSE (≈ 18–22) against boosting iterations (20–100)]

  • magenta: L2Boosting with stumps (horiz. line = cross-validated stopping)
  • black: L2Boosting with componentwise smoothing spline (horiz. line = cross-validated stopping), i.e. smoothing spline fitting against the selected predictor which reduces RSS most
  • green: MARS restricted to additive modeling
  • red: additive model using backfitting

L2Boosting with stumps or componentwise smoothing splines also yields an additive model:

Σ_{m=1}^{m_stop} θ̂_m(x^{(Ŝ_m)}) = ĝ_1(x^{(1)}) + . . . + ĝ_p(x^{(p)})

SLIDE 13

Simulated data: non-additive regression function, n = 200, p = 100

[Figure: MSE (≈ 11–16) against boosting iterations (50–300)]

  • magenta: L2Boosting with stumps
  • black: L2Boosting with componentwise smoothing spline
  • green: MARS restricted to additive modeling
  • red: additive model using backfitting and forward variable selection

SLIDE 14

similar for classification

SLIDE 15

3. Structured models and choosing the base procedure

have just seen the componentwise smoothing spline base procedure: it smoothes the response against the one predictor variable which reduces RSS most; we keep the degrees of freedom fixed for all candidate predictors, e.g. d.f. = 2.5

L2Boosting yields an additive model fit, including variable selection

SLIDE 16

Componentwise linear least squares: simple linear OLS against the one predictor variable which reduces RSS most

θ̂(x) = β̂_Ŝ x^{(Ŝ)}, β̂_j = Σ_{i=1}^n Y_i X_i^{(j)} / Σ_{i=1}^n (X_i^{(j)})², Ŝ = argmin_j Σ_{i=1}^n (Y_i − β̂_j X_i^{(j)})²

first round of estimation: selected predictor variable X^{(Ŝ_1)} (e.g. = X^{(3)}), corresponding β̂_{Ŝ_1}, fitted function f̂_1(x)

second round of estimation: selected predictor variable X^{(Ŝ_2)} (e.g. = X^{(21)}), corresponding β̂_{Ŝ_2}, fitted function f̂_2(x)

etc.

L2Boosting: f̂_m(x) = f̂_{m−1}(x) + ν · θ̂(x)

L2Boosting yields a linear model fit, including variable selection, i.e. a structured model fit
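The componentwise selection rule above can be sketched as follows (illustrative toy code, assuming centered data with no intercept):

```python
import numpy as np

def componentwise_ls_boost(X, Y, m_stop=100, nu=0.1):
    """L2Boosting with componentwise linear least squares: each round
    regresses the current residuals on the single predictor that
    reduces RSS most, and adds nu times that fit."""
    n, p = X.shape
    beta = np.zeros(p)                                # accumulated coefficients
    F = np.zeros(n)
    for _ in range(m_stop):
        U = Y - F
        b = X.T @ U / (X ** 2).sum(axis=0)            # per-predictor OLS coefficient
        rss = ((U[:, None] - X * b) ** 2).sum(axis=0)
        S = int(rss.argmin())                         # selected predictor
        beta[S] += nu * b[S]
        F += nu * b[S] * X[:, S]
    return beta
```

With early stopping, only a few coordinates of `beta` ever get updated: the procedure does variable selection and shrinkage at once.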

SLIDE 17

for ν = 1, this is known as:
  • Matching Pursuit (Mallat and Zhang, 1993)
  • weak greedy algorithm (DeVore & Temlyakov, 1997)
  • a version of boosting (Schapire, 1992; Freund & Schapire, 1996)
  • Gauss-Southwell algorithm: C.F. Gauss in 1803, "Princeps Mathematicorum"; R.V. Southwell in 1933, Professor in engineering, Oxford

SLIDE 18

binary lymph node classification in breast cancer using gene expressions: a high noise problem

n = 49 samples, p = 7129 gene expressions

method        CV-misclassif. err.
L2Boosting    17.7%
FPLR          35.25%
Pelora        27.8%
1-NN          43.25%
DLDA          36.12%
SVM           36.88%

gene selection: multivariate for L2Boosting; best 200 genes from the Wilcoxon test for the other methods

L2Boosting selected 42 out of p = 7129 genes

for this data-set: not good prediction with any of the methods, but L2Boosting may be a reasonable(?) multivariate gene selection method

SLIDE 19

42 (out of 7129) selected genes (n = 49)

[Figure: sorted regression coefficients (≈ −0.15 to 0.05) of the selected genes]

identifiability problem: strong correlations among some genes

consider groups of highly correlated genes, biological categories (e.g. GO), ...

linear model: multivariate association between genes and tumor-type; very different from 2-sample tests for individual genes

SLIDE 20

Pairwise smoothing splines: smoothes the response against the pair of predictor variables which reduces RSS most; we keep the degrees of freedom fixed for all candidate pairs, e.g. d.f. = 2.5

L2Boosting yields a nonparametric interaction model, including variable selection

SLIDE 21

Example: degree-2 nonparametric interaction modelling

Friedman #1 model:

Y = 10 sin(π X_1 X_2) + 20 (X_3 − 0.5)² + 10 X_4 + 5 X_5 + N(0, 1), X = (X_1, . . . , X_20) ∼ Unif([0, 1]^20)

[Figure: MSE (≈ 4–7) against boosting iterations (100–500) for MARS and L2Boost with pairwise splines; AIC_c-stopped L2Boost marked]

L2Boosting with pairwise splines: sample size n = 50, p = 20, effective p_eff = 5

SLIDE 22

Regression trees:
  • stumps (2 terminal nodes): L2Boosting fits an additive model
  • trees with d terminal nodes: L2Boosting fits an interaction model of degree d − 2

SLIDE 23

The low variance, high bias "principle"

once we have decided about some structural properties:

choose a base procedure with low variance but potentially large estimation bias; the bias can be reduced by further boosting iterations (which will increase the variance)

example: low degrees of freedom in componentwise smoothing splines for additive modeling

a justification will be given later

SLIDE 24

4. More on L2Boosting

L2Boosting for linear models: use the componentwise linear least squares base procedure

L2Boosting converges to a least squares solution as the number of boosting iterations m → ∞ (the unique LS solution if the design has full rank, p ≤ n)

when stopping early:
  • it does variable selection
  • coefficient estimates are typically shrunken versions of LS

"similar to" the Lasso

SLIDE 25

Connections to the Lasso (for linear models):

Efron, Hastie, Johnstone, Tibshirani (2004): for special design matrices, iterations of L2Boosting with "infinitesimally" small ν yield all Lasso solutions when varying λ

computationally interesting: produces all Lasso solutions in one sweep of boosting

Least Angle Regression (LARS; Efron et al., 2004) is computationally even more clever and efficient than L2Boosting

Zhao and Yu (2005): in "general", when adding some backward steps, the solutions from the Lasso and modified boosting "coincide"

greedy (plus backward steps) and convex optimization are surprisingly similar

SLIDE 26

p = 10, peff = 3, n = 20

[Figure: MSE (≈ 2–6) against boosting iterations (100–500) for uncorrelated and correlated designs; methods: AIC-stopped L2Boost, Lasso, forward variable selection, OLS]

SLIDE 27

binary lymph node classification using gene expressions

n = 49 samples, p = 7129 gene expressions

method        CV-misclassif. err.
L2Boosting    17.7%
FPLR          35.25%
Pelora        27.8%
1-NN          43.25%
DLDA          36.12%
SVM           36.88%
Lasso         21.2%

gene selection: multivariate for L2Boosting; best 200 genes from the Wilcoxon test for the other methods

L2Boosting selected 42 out of p = 7129 genes; the Lasso selected 15 genes

SLIDE 28

how well can we do? statistically consistent for very high-dimensional, sparse linear models

Y_i = β_0 + Σ_{j=1}^p β_j X_i^{(j)} + ε_i (i = 1, . . . , n), p ≫ n

Theorem (PB, 2004)
L2Boosting with componentwise linear LS is consistent (for a suitable number of boosting iterations) if:
  • p_n = O(exp(C n^{1−ξ})) (0 < ξ < 1) (high-dimensional): essentially exponentially many variables relative to n
  • sup_n Σ_{j=1}^{p_n} |β_{j,n}| < ∞: ℓ1-sparseness of the true function

i.e. for a suitable, slowly growing m = m_n:

E_X |f̂_{m_n,n}(X) − f_n(X)|² = o_P(1) (n → ∞)

"no" assumptions about the predictor variables/design matrix

SLIDE 29

analogous results also for

  • multivariate regression
  • vector autoregressive time series

(Lutz & PB, 2005)

SLIDE 30

For linear models: L2Boosting or Lasso?

  • “similar” prediction performance
  • LARS algorithm is computationally more efficient (for all Lasso solutions)

O(np · min(n, p)) for LARS; O(np · m_stop) for L2Boosting

  • notion of degrees of freedom is easier for L2Boosting
  • boosting is more generic (nonparametric models, other loss functions,...)

SLIDE 31

4.1. Degrees of freedom for boosting

(PB, 2004)

the only tuning parameter: the number of boosting iterations

could use cross-validation: works reasonably well

alternatively: use AIC, BIC or gMDL as model selection criteria, which involve the degrees of freedom of boosting

SLIDE 32

hat-matrix of the componentwise linear LS base procedure, H^{(j)}: (Y_1, . . . , Y_n) → (Ŷ_1, . . . , Ŷ_n), when using the jth predictor variable only:

H^{(j)} = X^{(j)} (X^{(j)})^T / ∥X^{(j)}∥²

L2Boosting hat-matrix:

B_m = B_{m−1} + ν · H^{(Ŝ_m)} (I − B_{m−1})
    = I − (I − ν · H^{(Ŝ_m)}) (I − ν · H^{(Ŝ_{m−1})}) · · · (I − ν · H^{(Ŝ_1)})

(Ŝ_m = predictor selected in the mth iteration)

degrees of freedom of boosting in iteration m: d.f.(B_m) = trace(B_m)

d.f. ignores the selection effect, i.e. it is "slightly" too small ("negligible" (?) since we can allow for o(exp(n)) candidate basis functions)

SLIDE 33

d.f. is very different from the number of variables in the model

example: 3 (or more) correlated variables, ν = 1
sequence of selected variables 3, 2, 1, 3, 2, 1: d.f.(B_6) = 1.79 < 3
sequence of selected variables 1, 2, 3, 2, 3, 1: d.f.(B_6) = 1.54 < 3
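The effect can be checked numerically from the product formula for B_m. A toy sketch (the highly correlated design below is made up for illustration, not the slide's exact data):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
z = rng.normal(size=n)
# three highly correlated predictor columns
X = np.column_stack([z + 0.1 * rng.normal(size=n) for _ in range(3)])

def df_for_sequence(X, seq, nu=1.0):
    """d.f.(B_m) = trace(B_m) with B_m = I - prod_k (I - nu H^(S_k))."""
    n = X.shape[0]
    P = np.eye(n)
    for j in seq:
        H = np.outer(X[:, j], X[:, j]) / (X[:, j] ** 2).sum()
        P = (np.eye(n) - nu * H) @ P
    return np.trace(np.eye(n) - P)

df6 = df_for_sequence(X, [2, 1, 0, 2, 1, 0])
```

Because the three columns are nearly collinear, `df6` stays well below 3 even though all three variables have been selected.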

SLIDE 34

Stopping the boosting iterations

we often use the corrected AIC criterion:

AICc(B_m) = log(RSS_m / n) + (1 + trace(B_m)/n) / (1 − (trace(B_m) + 2)/n)

estimate the stopping iteration by m̂_stop = argmin_m AICc(B_m)
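A sketch of this stopping rule for L2Boosting with componentwise linear LS, tracking the boosting hat-matrix recursion from the previous slide (illustrative only; assumes centered data and that trace(B_m) + 2 < n throughout):

```python
import numpy as np

def aicc_stopping(X, Y, m_max=200, nu=0.1):
    """Run L2Boosting with componentwise linear LS and return the
    AICc-minimizing iteration, using d.f.(B_m) = trace(B_m)."""
    n, p = X.shape
    B = np.zeros((n, n))
    best_aicc, best_m = np.inf, 0
    for m in range(1, m_max + 1):
        U = Y - B @ Y                                   # current residuals
        b = X.T @ U / (X ** 2).sum(axis=0)
        S = int((((U[:, None] - X * b) ** 2).sum(axis=0)).argmin())
        H_S = np.outer(X[:, S], X[:, S]) / (X[:, S] ** 2).sum()
        B = B + nu * H_S @ (np.eye(n) - B)              # hat-matrix recursion
        rss = ((Y - B @ Y) ** 2).sum()
        tr = np.trace(B)
        aicc = np.log(rss / n) + (1 + tr / n) / (1 - (tr + 2) / n)
        if aicc < best_aicc:
            best_aicc, best_m = aicc, m
    return best_m
```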

SLIDE 35

p = 10, peff = 3, n = 20

[Figure: MSE (≈ 2–6) against boosting iterations (100–500) for uncorrelated and correlated designs; methods: AIC-stopped L2Boost, Lasso, forward variable selection, OLS]

SLIDE 36

Analogously for nonparametric base procedures: hat-matrix H^{(S)} with a selected subset S of predictor variables:

B_m = I − (I − ν · H^{(Ŝ_m)}) (I − ν · H^{(Ŝ_{m−1})}) · · · (I − ν · H^{(Ŝ_1)})

e.g. L2Boosting with pairwise splines for nonparametric interaction modeling

[Figure: MSE (≈ 4–7) against boosting iterations (100–500) for MARS and L2Boost with pairwise splines; AIC_c-stopped L2Boost marked; p = 20, p_eff = 10, n = 50]

SLIDE 37

More on degrees of freedom

example: L2Boosting with componentwise smoothing splines for additive modeling

boosting hat-matrix B_m: since f̂(X_i) = Σ_{j=1}^p f̂_j(X_i), decompose

B_m = Σ_{j=1}^p A_m^{(j)}, A_m^{(j)} = hat-matrix for f̂_j(·)

easy to compute recursively:

A_m^{(j)} = A_{m−1}^{(j)} + δ_{j,Ŝ_m} ν · H^{(Ŝ_m)} (I − B_{m−1})

thus

d.f. = trace(B_m) = Σ_{j=1}^p d.f.^{(j)}, d.f.^{(j)} = trace(A_m^{(j)})

SLIDE 38

Y = Σ_{j=1}^{10} g_j(X^{(j)}) + ε, X ∼ Unif([0, 1]^{100}); n = 200, p = 100, p_eff = 10

[Figure: 12 panels of fitted additive components against their predictors, with assigned degrees of freedom df = 3.5, 2.7, 0, 2.6, 4.9, 5.3, 6.9, 8.2, 6.3, 6.4, 0.9, 2.1]

L2Boosting does a “very reasonable” assignment of degrees of freedom

SLIDE 39

a very interesting way to search and estimate in high dimensions!

with classical methods (backfitting) for large p: "infeasible" to do variable selection and a variable amount of d.f.

L2Boosting runs with one (!) tuning parameter

SLIDE 40

for standard errors in additive modelling:

s.e.(f̂_j(X_i)) = sqrt( σ̂²_ε (A_m^{(j)} (A_m^{(j)})^T)_{ii} ), A_m^{(j)} = hat matrix for the jth component

in our experience: seems quite OK; maybe slightly too small because we ignore the selection effect

for comparing models: use AIC, BIC, gMDL, etc.

SLIDE 41

4.2. The MSE curve and asymptotic optimality

toy example: L2Boosting with smoothing spline for a p = 1-dimensional predictor

[Figure: generalization squared error against boosting iterations m (left) and against varying degrees of freedom (right)]

sub-linear increase of the MSE in boosting

L2Boosting is quite resistant against overfitting; "easy to tune"

SLIDE 42

consider (any) base procedure as an operator:

H: Y = (Y_1, . . . , Y_n)′ → Ŷ = (Ŷ_1, . . . , Ŷ_n)′

L2Boosting operator in iteration m: B_m = I − (I − H)^m

if H is strictly shrinking, i.e. ∥I − H∥ < 1:
L2Boosting converges to the identity I (fully saturated model): need for early stopping

SLIDE 43

in the case where H is a smoothing spline:

L2Boosting does shrinkage in the same eigenspace as the smoothing spline H

eigenvalues of the smoothing spline: λ_1 = λ_2 = 1, 0 < λ_i < 1 (i = 3, . . . , n)

eigenvalues of L2Boosting: ev_1 = ev_2 = 1, 0 < ev_i = 1 − (1 − λ_i)^m < 1 (i = 3, . . . , n)

change these eigenvalues (the spectrum) by varying the iteration number m; tuning via m leads to a sub-linear increase of the MSE w.r.t. m
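The spectral identity is easy to verify numerically. A toy check with a made-up symmetric smoother (illustrative; a real smoothing spline would additionally have its two leading eigenvalues equal to 1):

```python
import numpy as np

# toy symmetric smoother H with spectrum lam in (0, 1]
rng = np.random.default_rng(0)
n = 20
Q, _ = np.linalg.qr(rng.normal(size=(n, n)))   # random orthonormal eigenbasis
lam = np.linspace(0.05, 1.0, n)
H = Q @ np.diag(lam) @ Q.T

# L2Boosting operator after m iterations: B_m = I - (I - H)^m
m = 10
B_m = np.eye(n) - np.linalg.matrix_power(np.eye(n) - H, m)

# its eigenvalues are 1 - (1 - lam_i)^m, in the same eigenbasis as H
ev_boost = np.sort(np.linalg.eigvalsh(B_m))
ev_theory = np.sort(1 - (1 - lam) ** m)
```

Increasing m pushes every eigenvalue 1 − (1 − λ_i)^m towards 1, which is exactly why boosting forever converges to the saturated fit.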

SLIDE 44

Theorem (PB & Yu, 2003)
L2Boosting with smoothing splines having any fixed degrees of freedom ("low variance"):
  • when stopping the iterations suitably, it achieves asymptotically the optimal minimax MSE rate (over a Sobolev space)
  • it adapts to unknown greater smoothness of the underlying function (adaptation to the optimal MSE rate)

e.g. L2Boost with cubic smoothing splines automatically achieves a faster rate than O(n^{−4/5}) if the underlying function is smooth enough

SLIDE 45

Summary about (L2-)Boosting

  • need for early stopping: "obvious", but was still debated in 2000
  • choose the base procedure to obtain the qualitative model fit of your own "choice"; having decided on the structure, use the low variance, high estimation bias "principle"
  • reasonable degrees of freedom and hat-matrices can be easily derived for L2Boosting with base procedures involving linear fitting after selection of variables
  • non-linear boosting algorithms: all this applies also to boosting with other loss functions

SLIDE 46

5. Boosting for binary classification

binary lymph node classification using gene expressions: data (X_i, Y_i), X_i ∈ R^{7129}, Y_i ∈ {−1, 1}

Various loss functions (with p(x) = P[Y = 1 | X = x]):

ρ(y, f) = log₂(1 + exp(−yf)): negative binomial log-likelihood; f*(x) = log(p(x) / (1 − p(x)))

ρ(y, f) = |y − f|² = 1 − 2yf + (yf)²: squared error; f*(x) = E[Y | X = x] = 2p(x) − 1

ρ(y, f) = exp(−yf): exponential loss in AdaBoost; f*(x) = (1/2) log(p(x) / (1 − p(x)))

ρ(y, f) = 1_{[yf<0]}: misclassification loss; f*(x) = sign(p(x) − 1/2)
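All four losses depend on (y, f) only through the margin yf, so they can be written as one-argument functions; a minimal sketch:

```python
import numpy as np

# margin-based losses rho(y, f) = rho(y * f) for y in {-1, +1}
def loss_01(margin):       # misclassification loss
    return (np.asarray(margin) < 0).astype(float)

def loss_exp(margin):      # exponential loss (AdaBoost)
    return np.exp(-np.asarray(margin))

def loss_loglik(margin):   # negative binomial log-likelihood, base 2
    return np.log2(1 + np.exp(-np.asarray(margin)))

def loss_l2(margin):       # squared error: |y - f|^2 = (1 - yf)^2 when y^2 = 1
    return (1 - np.asarray(margin)) ** 2
```

Each surrogate dominates the 0-1 loss pointwise, which is what makes them usable as convex upper bounds on the misclassification error.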

SLIDE 47

all these loss functions: ρ(y, f) = ρ(yf), a function of the margin value yf

[Figure: losses against the margin yf; monotone: exponential, log-likelihood, SVM, 0-1; non-monotone: L2, L1, 0-1]

minimization of the non-convex misclassification loss: computationally infeasible

other loss functions: convex surrogate loss functions, dominating the misclassification error

SLIDE 48

Buja, Stuetzle and Shen (2005): all these surrogate loss functions are "proper": almost no difference from an asymptotic point of view

my favourite: the log-likelihood
  • monotone
  • approximately linear for large negative margin values yf
SLIDE 49

5.1. LogitBoost

(Friedman, Hastie & Tibshirani, 2000)

algorithm: FGD with the negative log-likelihood, using the Hessian instead of a line-search

iterative weighted LS fitting: in iteration m, minimize

n⁻¹ Σ_{i=1}^n w_i ( (Y_i − p̂_{m−1}(X_i)) / (p̂_{m−1}(X_i)(1 − p̂_{m−1}(X_i))) − θ(X_i) )², w_i = p̂_{m−1}(X_i)(1 − p̂_{m−1}(X_i))

since f*(x) = log(p(x) / (1 − p(x))): f̂_m(·) is an estimate of the log-odds ratio

examples:
  • componentwise weighted linear LS: logistic linear model fit
  • weighted componentwise smoothing splines: logistic additive model fit
  • weighted stumps: logistic additive model fit

works quite nicely for high-dimensional logistic linear or additive or low-order interaction models
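A rough sketch of the weighted-LS iteration with componentwise linear least squares (illustrative only: Y is coded {0, 1} here, the ν-shrinkage is added by analogy with L2Boosting, and the exact factor conventions of Friedman, Hastie & Tibshirani's LogitBoost are not reproduced):

```python
import numpy as np

def logitboost(X, Y, m_stop=100, nu=0.1):
    """LogitBoost-style FGD: Newton weights w = p(1-p), working response
    z = (Y - p) / w, componentwise weighted linear LS base procedure.
    F estimates the log-odds; Y must be in {0, 1}."""
    n, p = X.shape
    F = np.zeros(n)
    beta = np.zeros(p)
    for _ in range(m_stop):
        prob = 1.0 / (1.0 + np.exp(-F))
        w = np.clip(prob * (1 - prob), 1e-8, None)    # Newton weights
        Z = (Y - prob) / w                            # working response
        b = (w * Z) @ X / (w @ X ** 2)                # weighted componentwise LS
        wrss = ((Z[:, None] - X * b) ** 2 * w[:, None]).sum(axis=0)
        S = int(wrss.argmin())
        beta[S] += nu * b[S]
        F += nu * b[S] * X[:, S]
    return beta, F
```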

SLIDE 50

6. Boosting in survival analysis

acute myeloid leukemia (AML) study from Bullinger et al., 2004: survival times of n = 116 patients; 68 died during the study period

p = 155 predictors: 8 clinical variables, 147 gene expression levels

full data: survival time T_i ∈ R₊, predictor X_i ∈ R^p; we use here Y_i = log(T_i)

full data loss function: ρ(y, f) = (y − f)²

observed data: O_i = (Ỹ_i, X_i, Δ_i), Ỹ_i = log(T̃_i), T̃_i = min(T_i, C_i), censoring indicator Δ_i = 1_{[T_i ≤ C_i]}

assume: the censoring time C_i is conditionally independent of T_i given X_i, so the coarsening at random assumption holds

SLIDE 51

inverse probability censoring weights and the observed data loss: define the observed data loss

ρ_obs(o, f) = (ỹ − f)² · Δ / G(t̃ | x), with the inverse probability G(c | x) = P[C > c | X = x]

then (van der Laan & Robins, 2003):

E_{Y,X}[(Y − f(X))²] = E_O[ρ_obs(O, f)]

strategy: estimate G(· | x), e.g. by Kaplan-Meier, and do boosting on the weighted squared error loss:

Σ_{i=1}^n w_i (Ỹ_i − f(X_i))², w_i = Δ_i / Ĝ(T̃_i | X_i), Ỹ_i = log(min(C_i, T_i))
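A sketch of the weight construction (illustrative only: an unconditional Kaplan-Meier estimate of the censoring survivor function, ignoring x and ties, stands in for Ĝ(· | x)):

```python
import numpy as np

def censoring_km(t_obs, delta):
    """Kaplan-Meier estimate of the censoring survivor function
    G(c) = P[C > c], treating censorings (delta == 0) as the 'events'."""
    order = np.argsort(t_obs)
    n = len(t_obs)
    G = np.empty(n)
    surv = 1.0
    for rank, i in enumerate(order):
        if delta[i] == 0:                    # a censoring event
            surv *= 1.0 - 1.0 / (n - rank)   # n - rank individuals at risk
        G[i] = surv
    return G

def ipc_weights(t_obs, delta):
    """Inverse probability of censoring weights w_i = Delta_i / G_hat(T~_i);
    censored observations get weight zero."""
    G = censoring_km(t_obs, delta)
    return np.where(delta == 1, 1.0 / np.clip(G, 1e-8, None), 0.0)
```

With these weights, any weighted-least-squares boosting routine can be run on (X_i, Ỹ_i) as if the data were uncensored.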

SLIDE 52

we did componentwise weighted linear least squares: a linear fit of the regression function f(·)

M: location model; RF: random forest for survival data; L2B: L2Boosting; cRF: RF with the 8 clinical variables only; cL2B: L2B with the 8 clinical variables only

SLIDE 53

not possible to do the Henderson et al. (2001) loss:

ρ(T, f) = 1 − 1_{[T/2 ≤ f ≤ 2T]} ⇔ ρ(y, f) = 1_{[|y−f| > log(2)]}

which is non-convex...!

SLIDE 54

in many real applications: main interest is finding the relevant variables (and prediction is of “minor” importance)

  • tumor classification based on gene expression: which genes are important?
  • Bullinger et al. survival study: which genes and variables are important?
  • riboflavin concentration (vitamin B2) produced by Bacillus subtilis

which genes are important? (in collaboration with DSM)

SLIDE 55

7. Variable selection and additional sparsity

is boosting a good variable selection method? the analogy with the Lasso for linear models

consider again the linear model (or a highly overcomplete dictionary):

Y = f(X) + ε, f(x) = Σ_{j=1}^p β_j x^{(j)}, p ≫ n

Lasso or ℓ1-penalized regression (Tibshirani, 1996):

β̂_Lasso = argmin_β n⁻¹ Σ_{i=1}^n (Y_i − Σ_{j=1}^p β_j X_i^{(j)})² + λ Σ_{j=1}^p |β_j|, λ ≥ 0 the penalty parameter
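The penalized criterion above can be minimized by cyclic coordinate descent with soft-thresholding; a minimal solver sketch (illustrative, assuming centered data and a fixed number of sweeps rather than a convergence check):

```python
import numpy as np

def soft_threshold(z, gamma):
    return np.sign(z) * np.maximum(np.abs(z) - gamma, 0.0)

def lasso_cd(X, Y, lam, n_sweeps=200):
    """Lasso by cyclic coordinate descent, minimizing
    n^{-1} ||Y - X beta||^2 + lam * ||beta||_1."""
    n, p = X.shape
    beta = np.zeros(p)
    R = Y.copy()                           # residual Y - X beta
    for _ in range(n_sweeps):
        for j in range(p):
            R += X[:, j] * beta[j]         # partial residual without coord j
            z = X[:, j] @ R / n
            beta[j] = soft_threshold(z, lam / 2) / ((X[:, j] ** 2).sum() / n)
            R -= X[:, j] * beta[j]
    return beta
```

The exact zeros produced by the soft-threshold are the variable selection; the subtraction of lam/2 inside it is the shrinkage.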

SLIDE 56

Lasso:
  • does variable selection: some (many) β̂_j's exactly equal to 0
  • does shrinkage
  • involves a convex optimization only (instead of exhaustively checking 2^p sub-models)

SLIDE 57

Some theory for high dimensions

Theorem (Meinshausen & PB, 2004). For λ_n ∼ C n^{−1/2+δ/2} (0 < δ < 1):

P[estimated sub-model(λ_n) = true model] = 1 − O(exp(−C n^δ)) (n → ∞)

if
  • Gaussian data
  • p = p_n = O(n^r) for any r > 0 (high-dimensional)
  • number of effective variables p_eff = O(n^κ) (0 < κ < 1) (sparseness)
  • plus some other technical conditions

justification for the relaxation with a computationally simple convex problem!

SLIDE 58

Choice of λ

the Theorem doesn't say much about choosing λ...

first (not so good) idea: choose λ to optimize prediction, e.g. via some cross-validation scheme

but: for the prediction oracle solution

λ* = argmin_λ E[(Y − Σ_{j=1}^p β̂_j(λ) X^{(j)})²]:

P[estimated sub-model(λ*) = true model] → 0 (p_n → ∞, n → ∞)

asymptotically: the prediction-optimal model is too large (Meinshausen & PB, 2004; related example by Meng et al., 2004)

SLIDE 59

reason: variable selection needs a large λ, i.e. strong bias/strong shrinkage; for orthogonal design: strong bias in soft-thresholding

[Figure: threshold functions: hard-thresholding, nn-garrote, soft-thresholding]

Better:
  • SCAD (Fan and Li, 2001)
  • Nonnegative Garrote (Breiman, 1995)
  • Bridge estimation (Frank and Friedman, 1993)

they all work for general X; for non-orthogonal X:
  • non-convex optimization for SCAD or Bridge estimation
  • NN-Garrote only for p ≤ n
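The three threshold functions in the figure (for the orthogonal-design case) are easy to write down; a minimal sketch:

```python
import numpy as np

def hard_threshold(z, lam):
    """Keep z unchanged beyond the threshold, zero inside."""
    z = np.asarray(z, dtype=float)
    return z * (np.abs(z) > lam)

def soft_threshold(z, lam):
    """Lasso-type: every surviving coefficient is shrunken by lam
    (constant bias, even for large |z|)."""
    z = np.asarray(z, dtype=float)
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def nn_garrote(z, lam):
    """Nonnegative garrote: z * (1 - lam^2 / z^2)_+ ; the bias
    lam^2 / |z| vanishes for large |z|."""
    z = np.asarray(z, dtype=float)
    out = np.zeros_like(z)
    nz = z != 0
    out[nz] = z[nz] * np.maximum(1.0 - (lam / z[nz]) ** 2, 0.0)
    return out
```

Comparing the three at a large input shows the point of the slide: soft-thresholding keeps a constant bias, while hard-thresholding and the garrote do not (or only asymptotically).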

SLIDE 60

The good message: the Lasso produces a set of sub-models

M_1 ⊂ . . . ⊂ M_pred-opt ⊂ . . . ⊂ M_N

(M_pred-opt: optimal for prediction with the Lasso), with N = O(min(n, p)); M_true is, with probability 1 − O(exp(−C n^δ)), among these models, but M_true ≠ M_pred-opt

Solutions using this "good message":
  • relaxed Lasso (Meinshausen, 2005): a second round of Lasso on the selected sub-models; but surprisingly: computationally no need to do a second round of Lasso fitting
  • BIC-scoring for the selected sub-models (?)

SLIDE 61

8. SparseL2Boosting

(PB and Yu, 2005)

instead of minimizing the RSS in every iteration, minimize a final prediction error (FPE) criterion; we propose gMDL:

θ̂_m = argmin_{θ(·)} Σ_{i=1}^n (Y_i − f̂_{m−1}(X_i) − θ(X_i))² + gMDL-penalty

(or AIC, BIC, ...): another use of the degrees of freedom
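A sketch of the idea for componentwise linear LS (illustrative only: a BIC-type penalty on trace(B_m) stands in for the gMDL criterion, and the same criterion is reused as a stopping rule):

```python
import numpy as np

def sparse_l2boost(X, Y, m_stop=200, nu=0.1):
    """SparseL2Boosting sketch: in each iteration select the predictor
    minimizing a penalized criterion (BIC-type here) instead of raw RSS,
    and stop once no candidate improves the criterion."""
    n, p = X.shape
    I = np.eye(n)
    B = np.zeros((n, n))
    beta = np.zeros(p)
    H = [np.outer(X[:, j], X[:, j]) / (X[:, j] ** 2).sum() for j in range(p)]

    def crit(Bmat):
        rss = ((Y - Bmat @ Y) ** 2).sum()
        return n * np.log(rss / n) + np.log(n) * np.trace(Bmat)

    for _ in range(m_stop):
        U = Y - B @ Y
        cand = [B + nu * H[j] @ (I - B) for j in range(p)]
        scores = np.array([crit(Bj) for Bj in cand])
        S = int(scores.argmin())
        if scores[S] >= crit(B):        # no penalized improvement: stop
            break
        beta[S] += nu * (X[:, S] @ U) / (X[:, S] ** 2).sum()
        B = cand[S]
    return beta
```

Because every candidate step is charged for its degrees-of-freedom increase, predictors whose RSS reduction is at noise level are never selected, which is what makes the solution sparser than plain L2Boosting.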

SLIDE 62

Theorem (PB & Yu, 2005). For the orthonormal linear model: SparseL2Boosting with componentwise linear least squares yields Breiman's nonnegative garrote estimator.

[Figure: threshold functions: hard-thresholding, nn-garrote, soft-thresholding]

  • SparseL2Boosting yields sparser solutions than L2Boosting
  • SparseL2Boosting is still very generic (although less generic than L2Boosting), e.g. for nonparametric problems and non-quadratic loss functions
62

slide-63
SLIDE 63

✬ ✫ ✩ ✪

Linear modeling: L2Boosting with componentwise linear LS sample size n = 50, dimension p = 50 model SparseL2Boosting

L2Boosting

Y = 1 + 5X(1) + 2X(2) + X(3) + N (0, 1) X = (X(1), . . . , X(49)) ∼ N49(0, I) MSE 0.16 (0.0018) 0.46 (0.0041)

I E[no. of seleccted variables]

5 13.68 Y = P50

j=1 βjX(j) + N (0, 1)

β1, . . . , β50 ∼ Double-Exponential; X as above MSE 3.64 (0.188) 2.19 (0.083)

SLIDE 64

Nonparametric first-order interaction modeling

Friedman #1 model:

Y = 10 sin(π X_1 X_2) + 20 (X_3 − 0.5)² + 10 X_4 + 5 X_5 + N(0, 1), X = (X_1, . . . , X_20) ∼ Unif([0, 1]^20)

sample size n = 50, dimension p = 20, p_eff = 5

[Figure: MSE (≈ 2–7) against boosting iterations (100–500) for L2Boosting, SparseL2Boosting and MARS]

SLIDE 65

Riboflavin concentration in Bacillus subtilis

Y_i ∈ R: log-concentration of riboflavin (vitamin B2); X_i: p = 6939 gene expressions; sample size n = 89

[Figure: 4 scatterplots of log-concentration against log-expression for selected genes]

L2Boosting with componentwise linear least squares: selected 41 genes
SparseL2Boosting with componentwise linear least squares: selected 21 genes
15 genes are in common

note the identifiability problem due to high correlations among genes!

quite a few other measurements are available for this dataset... (in collaboration with DSM)

SLIDE 66

9. Conclusions

statistical view of boosting: a regularization method for estimation and variable selection, mainly useful for high-dimensional data problems

  • boosting is very generic
  • boosting is computationally attractive: complexity O(p) for p ≫ n
  • simple statistical inference is possible, but more needs to be done
