SLIDE 1

Robust strategies and model selection
Stefan Van Aelst

Department of Applied Mathematics and Computer Science
Ghent University, Belgium
Stefan.VanAelst@UGent.be
ERCIM09 - COMISEF/COST Tutorial

SLIDE 2

Outline

1. Regression model
2. Least squares
3. Manual variable selection approach
4. Automatic variable selection approach
5. Robustness
6. Robust variable selection: sequencing
7. Robust variable selection: segmentation

SLIDE 3

Regression model

Regression setting

Consider a dataset $Z_n = \{(y_i, x_{i1}, \ldots, x_{id}) = (y_i, \mathbf{x}_i);\ i = 1, \ldots, n\} \subset \mathbb{R}^{d+1}$.
- $Y$ is the response variable
- $X_1, \ldots, X_d$ are the candidate regressors

The corresponding linear model is

$$y_i = \beta_1 x_{i1} + \cdots + \beta_d x_{id} + \epsilon_i = \mathbf{x}_i'\boldsymbol{\beta} + \epsilon_i, \qquad i = 1, \ldots, n,$$

where the errors $\epsilon_i$ are assumed to be iid with $E(\epsilon_i) = 0$ and $\mathrm{Var}(\epsilon_i) = \sigma^2 > 0$. The goal is to estimate the regression coefficients $\boldsymbol{\beta}$ from the data.

SLIDE 4

Least squares

Least squares solution

$\hat{\boldsymbol{\beta}}_{LS}$ solves

$$\min_{\boldsymbol{\beta}} \sum_{i=1}^n (y_i - \mathbf{x}_i'\boldsymbol{\beta})^2.$$

Write $X = (\mathbf{x}_1, \ldots, \mathbf{x}_n)^t$ and $\mathbf{y} = (y_1, \ldots, y_n)^t$. Then $\hat{\boldsymbol{\beta}}_{LS}$ solves

$$\min_{\boldsymbol{\beta}} (\mathbf{y} - X\boldsymbol{\beta})^t(\mathbf{y} - X\boldsymbol{\beta}) \;\Rightarrow\; \hat{\boldsymbol{\beta}}_{LS} = (X^tX)^{-1}X^t\mathbf{y}, \qquad \hat{\mathbf{y}} = X\hat{\boldsymbol{\beta}} = X(X^tX)^{-1}X^t\mathbf{y} = H\mathbf{y}.$$
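As a quick illustration, a minimal NumPy sketch of the closed-form solution and the hat matrix on simulated data (in practice np.linalg.lstsq is numerically preferable to forming the inverse):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 3
X = rng.normal(size=(n, d))                    # design matrix
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_ls = XtX_inv @ X.T @ y                    # (X'X)^{-1} X'y
H = X @ XtX_inv @ X.T                          # hat matrix H
y_hat = H @ y                                  # fitted values X beta_ls
```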

SLIDE 5

Least squares

Least squares properties

- Unbiased estimator: $E(\hat{\boldsymbol{\beta}}_{LS}) = \boldsymbol{\beta}$
- Gauss-Markov theorem: LS has the smallest variance among all unbiased linear estimators of $\boldsymbol{\beta}$.

Why do variable selection?

SLIDE 6

Least squares

Expected prediction error

Assume the true regression function is linear:

$$Y|\mathbf{x} = f(\mathbf{x}) + \epsilon = \mathbf{x}^t\boldsymbol{\beta} + \epsilon.$$

Predict the response $Y_0$ at $\mathbf{x}_0$: $Y_0 = \mathbf{x}_0^t\boldsymbol{\beta} + \epsilon_0 = f(\mathbf{x}_0) + \epsilon_0$. Using an estimator $\tilde{\boldsymbol{\beta}}$ of the regression coefficients, the estimated prediction is $\tilde{f}(\mathbf{x}_0) = \mathbf{x}_0^t\tilde{\boldsymbol{\beta}}$ and the expected prediction error is $E[(Y_0 - \tilde{f}(\mathbf{x}_0))^2]$.

SLIDE 7

Least squares

Expected prediction error

$$E[(Y_0 - \tilde{f}(\mathbf{x}_0))^2] = E[(f(\mathbf{x}_0) + \epsilon_0 - \tilde{f}(\mathbf{x}_0))^2] = \sigma^2 + E[(f(\mathbf{x}_0) - \tilde{f}(\mathbf{x}_0))^2] = \sigma^2 + \mathrm{MSE}(\tilde{f}(\mathbf{x}_0))$$

- $\sigma^2$: irreducible variance of the new observation $y_0$
- $\mathrm{MSE}(\tilde{f}(\mathbf{x}_0))$: mean squared error of the prediction at $\mathbf{x}_0$ by the estimator $\tilde{f}$

SLIDE 8

Least squares

MSE of a prediction

$$\mathrm{MSE}(\tilde{f}(\mathbf{x}_0)) = E[(f(\mathbf{x}_0) - \tilde{f}(\mathbf{x}_0))^2] = E\{[\mathbf{x}_0^t(\boldsymbol{\beta} - \tilde{\boldsymbol{\beta}})]^2\} = E\{[\mathbf{x}_0^t(\boldsymbol{\beta} - E(\tilde{\boldsymbol{\beta}}) + E(\tilde{\boldsymbol{\beta}}) - \tilde{\boldsymbol{\beta}})]^2\} = \mathrm{bias}(\tilde{f}(\mathbf{x}_0))^2 + \mathrm{Var}(\tilde{f}(\mathbf{x}_0))$$

- LS is unbiased $\Rightarrow \mathrm{bias}(\tilde{f}(\mathbf{x}_0)) = 0$
- LS minimizes $\mathrm{Var}(\tilde{f}(\mathbf{x}_0))$ (Gauss-Markov)
- LS has the smallest MSPE among all linear unbiased estimators

SLIDE 9

Least squares

LS instability

LS becomes unstable, with large MSPE, when $\mathrm{Var}(\tilde{f}(\mathbf{x}_0))$ is high. This can happen if there are
- many noise variables among the candidate regressors
- highly correlated predictors (multicollinearity)

$\Rightarrow$ Improve on the least squares MSPE by trading (a little) bias for (a lot of) variance reduction!

SLIDE 10

Manual variable selection approach

Manual variable selection

Goals:
- Determine the set of the most important regressors
- Remove the noise regressors from the model
- Avoid multicollinearity

Methods (each requires choosing a selection criterion):
- All subsets
- Backward elimination
- Forward selection
- Stepwise selection

SLIDE 11

Manual variable selection approach

Submodels

Dataset $Z_n = \{(y_i, x_{i1}, \ldots, x_{id}) = (y_i, \mathbf{x}_i);\ i = 1, \ldots, n\} \subset \mathbb{R}^{d+1}$. Let $\alpha \subset \{1, \ldots, d\}$ denote the predictors included in a submodel. The corresponding submodel is

$$y_i = \mathbf{x}_{\alpha i}'\boldsymbol{\beta}_\alpha + \epsilon_{\alpha i}, \qquad i = 1, \ldots, n.$$

A selected model is considered a good model if
- it is parsimonious
- it fits the data well
- it yields good predictions for similar data

SLIDE 12

Manual variable selection approach

Some standard selection criteria

- Adjusted $R^2$: $A(\alpha) = 1 - \dfrac{\mathrm{RSS}(\alpha)/(n - d(\alpha))}{\mathrm{RSS}(1)/(n - 1)}$
- Mallows' $C_p$: $C(\alpha) = \dfrac{\mathrm{RSS}(\alpha)}{\hat{\sigma}^2} - (n - 2d(\alpha))$
- Final Prediction Error: $\mathrm{FPE}(\alpha) = \dfrac{\mathrm{RSS}(\alpha)}{\hat{\sigma}^2} + 2d(\alpha)$
- AIC: $\mathrm{AIC}(\alpha) = -2L(\alpha) + 2d(\alpha)$
- BIC: $\mathrm{BIC}(\alpha) = -2L(\alpha) + \log(n)\,d(\alpha)$

where $\hat{\sigma}$ is the residual scale estimate in the "full" model.
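A minimal sketch computing these criteria from residual sums of squares; the Gaussian-likelihood forms of AIC/BIC (up to additive constants) are one common convention and an assumption here:

```python
import numpy as np

def selection_criteria(rss_alpha, d_alpha, rss_intercept, sigma2_full, n):
    """rss_alpha: RSS of the submodel; d_alpha: its number of predictors;
    rss_intercept: RSS of the intercept-only model; sigma2_full: residual
    variance estimate from the full model."""
    adj_r2 = 1 - (rss_alpha / (n - d_alpha)) / (rss_intercept / (n - 1))
    cp = rss_alpha / sigma2_full - (n - 2 * d_alpha)
    fpe = rss_alpha / sigma2_full + 2 * d_alpha
    # Gaussian-likelihood versions of AIC/BIC, up to an additive constant
    aic = n * np.log(rss_alpha / n) + 2 * d_alpha
    bic = n * np.log(rss_alpha / n) + np.log(n) * d_alpha
    return adj_r2, cp, fpe, aic, bic
```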

SLIDE 13

Manual variable selection approach

Resampling based selection criteria

Consider the (conditional) expected prediction error

$$\mathrm{PE}(\alpha) = E\left[\frac{1}{n}\sum_{i=1}^n \left(z_i - \mathbf{x}_{\alpha i}'\hat{\boldsymbol{\beta}}_\alpha\right)^2 \,\middle|\, \mathbf{y}, X\right],$$

where the $z_i$ are new responses at the observed design points. Estimates of the PE can be used as a selection criterion; they can be obtained by cross-validation or the bootstrap. A more advanced selection criterion takes both goodness-of-fit and PE into account:

$$\mathrm{PPE}(\alpha) = \frac{1}{n}\sum_{i=1}^n \left(y_i - \mathbf{x}_{\alpha i}'\hat{\boldsymbol{\beta}}_\alpha\right)^2 + f(n)\,d(\alpha) + \hat{E}\left[\frac{1}{n}\sum_{i=1}^n \left(z_i - \mathbf{x}_{\alpha i}'\hat{\boldsymbol{\beta}}_\alpha\right)^2 \,\middle|\, \mathbf{y}, X\right].$$
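For instance, PE(α) can be estimated by K-fold cross-validation; a minimal sketch (the fold count and helper names are ours):

```python
import numpy as np

def cv_prediction_error(X, y, alpha, K=5, seed=0):
    """Estimate PE(alpha) for the submodel using the predictors in `alpha`."""
    n = len(y)
    folds = np.array_split(np.random.default_rng(seed).permutation(n), K)
    sq_err = []
    for k in range(K):
        test = folds[k]
        train = np.concatenate([f for j, f in enumerate(folds) if j != k])
        # LS fit of the submodel on the training folds
        beta_a, *_ = np.linalg.lstsq(X[np.ix_(train, alpha)], y[train], rcond=None)
        sq_err.extend((y[test] - X[np.ix_(test, alpha)] @ beta_a) ** 2)
    return np.mean(sq_err)
```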
SLIDE 14

Automatic variable selection approach

Automatic variable selection

Try to find a stable model that fits the data well.
Approaches:
- Shrinkage: constrained least squares optimization
- Stagewise forward procedures

Methods:
- Ridge regression
- Lasso
- Least Angle Regression
- L2 Boosting
- Elastic Net

SLIDE 15

Automatic variable selection approach

Lasso

Least Absolute Shrinkage and Selection Operator:

$$\hat{\boldsymbol{\beta}}_{lasso} = \arg\min_{\boldsymbol{\beta}} \sum_{i=1}^n \Big(y_i - \beta_0 - \sum_{j=1}^d \beta_j x_{ij}\Big)^2 \quad \text{subject to} \quad \|\boldsymbol{\beta}\|_1 = \sum_{j=1}^d |\beta_j| \le t,$$

where $0 < t < \|\hat{\boldsymbol{\beta}}_{LS}\|_1$ is a tuning parameter.
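In practice one usually solves the equivalent penalized (Lagrangian) form; e.g. scikit-learn's Lasso minimizes $(1/2n)\,\mathrm{RSS} + \lambda\|\boldsymbol{\beta}\|_1$, where each constraint level $t$ corresponds to some penalty $\lambda$. A sketch on simulated data:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))
y = X[:, 1] - 2 * X[:, 4] + rng.normal(scale=0.5, size=100)

fit = Lasso(alpha=0.1).fit(X, y)   # alpha is the penalty lambda here
print(fit.coef_)                   # several coefficients are exactly zero
```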

SLIDE 16

Automatic variable selection approach

Example: LASSO fits

[Figure: LASSO coefficient paths — standardized coefficients versus degrees of freedom; path labels: 6, 3, 7, 4, 8, 1.]

SLIDE 17

Automatic variable selection approach

Least angle regression

Standardize the variables.
1. Select $x_1$ such that $|\mathrm{cor}(y, x_1)| = \max_j |\mathrm{cor}(y, x_j)|$.
2. Put $r = y - \gamma x_1$, where $\gamma$ is determined such that $|\mathrm{cor}(r, x_1)| = \max_{j \neq 1} |\mathrm{cor}(r, x_j)|$.
3. Select $x_2$ corresponding to the maximum above. Determine the equiangular direction $b$ such that $\mathbf{x}_1'b = \mathbf{x}_2'b$.
4. Put $r = r - \gamma b$, where $\gamma$ is determined such that $|\mathrm{cor}(r, x_1)| = |\mathrm{cor}(r, x_2)| = \max_{j \neq 1,2} |\mathrm{cor}(r, x_j)|$.
5. Continue the procedure . . .
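scikit-learn's Lars estimator implements this algorithm and records the order in which predictors enter the active set; a small sketch on simulated data:

```python
import numpy as np
from sklearn.linear_model import Lars

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))
y = X[:, 2] - 2 * X[:, 5] + rng.normal(scale=0.5, size=100)

lar = Lars(n_nonzero_coefs=np.inf).fit(X, y)  # np.inf: no limit on the path
print(lar.active_)   # indices of the predictors in their order of entry
```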

SLIDE 18

Automatic variable selection approach

Properties of LAR

- Least angle regression (LAR) selects the predictors in order of importance.
- LAR changes the contributions of the predictors gradually, as they are needed.
- LAR is very similar to the LASSO and can easily be adjusted to produce the LASSO solution.
- LAR only uses the means, variances and correlations of the variables.
- LAR is computationally as efficient as LS.

SLIDE 19

Automatic variable selection approach

Example: LAR fits

[Figure: LAR coefficient paths — standardized coefficients versus degrees of freedom; path labels: 6, 3, 7, 4, 2, 1.]

SLIDE 20

Automatic variable selection approach

L2 boosting

Standardize the variables.
1. Put $r = y$ and $\hat{F}_0 = 0$.
2. Select $x_1$ such that $|\mathrm{cor}(r, x_1)| = \max_j |\mathrm{cor}(r, x_j)|$.
3. Update $r = y - \nu \hat{f}(x_1)$, where $0 < \nu \le 1$ is the step length and $\hat{f}(x_1)$ are the fitted values from the LS regression of $y$ on $x_1$. Similarly, update $\hat{F}_1 = \hat{F}_0 + \nu \hat{f}(x_1)$.
4. Continue the procedure . . .
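A minimal componentwise L2 boosting sketch (the defaults and names are ours): at each step the current residual is regressed on the most correlated predictor and only a fraction ν of that fit is taken:

```python
import numpy as np

def l2_boost(X, y, nu=0.1, steps=200):
    X = (X - X.mean(0)) / X.std(0)            # standardize the variables
    y = y - y.mean()
    r, coef = y.copy(), np.zeros(X.shape[1])
    for _ in range(steps):
        cors = [abs(np.corrcoef(r, X[:, j])[0, 1]) for j in range(X.shape[1])]
        j = int(np.argmax(cors))              # most correlated predictor
        gamma = (X[:, j] @ r) / (X[:, j] @ X[:, j])  # LS slope of r on x_j
        coef[j] += nu * gamma                 # update the ensemble F
        r -= nu * gamma * X[:, j]             # update the residual
    return coef
```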

SLIDE 21

Automatic variable selection approach

Sequencing variables

Several selection algorithms sequence the predictors in "order of importance" or screen out the most relevant variables:
- Forward/stepwise selection
- Stagewise forward selection
- Penalty methods
- Least angle regression
- L2 boosting

These methods are computationally very efficient because they are based only on means, variances and correlations.

SLIDE 22

Robustness

Robustness: Data with outliers

Question: how many partners do men and women desire to have in the next 30 years?
Men: mean = 64.3, median = 1
→ The mean is sensitive to outliers.
→ The median is robust and thus more reliable.

SLIDE 23

Robustness

Least squares regression

[Figure: log light intensity versus log surface temperature, with the LS fit.]

LS: minimize $\sum_i r_i^2(\boldsymbol{\beta})$

SLIDE 24

Robustness

Outliers

[Figure: log light intensity versus log surface temperature, now with outliers; LS fit shown.]

Outliers attract LS!

SLIDE 25

Robustness

Robust regression estimators

[Figure: log light intensity versus log surface temperature; LS and MM fits shown.]

Robust MM estimator is less influenced by outliers!

SLIDE 26

Robustness

Robust univariate location estimators

The sample mean $\bar{X}_n$ satisfies the equation

$$\sum_{i=1}^n (X_i - \bar{X}_n) = 0.$$

The ML estimator $\hat{\theta}$ solves the equation

$$\sum_{i=1}^n \frac{\partial}{\partial\theta}\log f_\theta(X_i)\Big|_{\theta=\hat{\theta}} = 0.$$

For a suitable score function $\psi(x, \theta)$, the M-estimator $T_n$ solves the equation

$$\sum_{i=1}^n \psi(X_i - T_n) = 0.$$

SLIDE 27

Robustness

Univariate location M-estimators

$$\sum_{i=1}^n \psi(X_i - T_n) = 0$$

- Consistent if $\int \psi(y)\,dF(y) = E_F(\psi(y)) = 0$
- Asymptotic efficiency: $\dfrac{\left(\int \psi'\,d\Phi\right)^2}{\int \psi^2\,d\Phi}$
- Robustness: maximal breakdown point (50%) if $\psi(y)$ is bounded!

SLIDE 28

Robustness

Examples of M-estimators

- Sample mean: $\psi(t) = t$. Unbounded! Efficiency: 100%.
- Median: $\psi(t) = \mathrm{sign}(t)$. Bounded; efficiency: 63.7%.
- Huber estimator, with $b > 0$:

$$\psi_b(t) = \min\{b, \max\{t, -b\}\} = \begin{cases} t & \text{if } |t| \le b \\ \mathrm{sign}(t)\,b & \text{if } |t| > b \end{cases}$$
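A minimal IRLS sketch for the Huber location M-estimator; fixing the scale at the MAD is a common choice and an assumption here:

```python
import numpy as np

def huber_location(x, b=1.37, tol=1e-8):
    s = 1.483 * np.median(np.abs(x - np.median(x)))   # MAD scale (assumed > 0)
    t = np.median(x)                                  # robust starting value
    while True:
        u = (x - t) / s
        w = np.minimum(1.0, b / np.maximum(np.abs(u), 1e-12))  # psi(u)/u weights
        t_new = np.sum(w * x) / np.sum(w)             # weighted mean update
        if abs(t_new - t) < tol:
            return t_new
        t = t_new
```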

SLIDE 29

Robustness

Huber psi function

[Figure: the Huber ψ function — linear between −b and b, constant at ±b outside.]

SLIDE 30

Robustness

Tuning the Huber M-estimator

The Huber M-estimator has maximal breakdown point for any $b < \infty$
→ $b$ can be chosen for good efficiency at $\Phi$: $b = 1.37$ yields 95% efficiency
→ trade-off between robustness and efficiency!

SLIDE 31

Robustness

Example: Copper content in flour

Copper content (parts per million) in 24 wholemeal flour samples

[Figure: dot plot of the 24 copper measurements (axis 5–30 ppm).]

SLIDE 32

Robustness

Example: Copper content in flour

Copper content (parts per million) in 24 wholemeal flour samples:
- Sample mean: 4.28
- Sample median: 3.39
- Huber M-estimator: 3.21

SLIDE 33

Robustness

Monotone M-estimates

The Huber M-estimator has a monotone psi-function. If the function $\psi(t)$ is monotone, then
- the equation $\sum_{i=1}^n \psi(X_i - T_n) = 0$ has a unique solution
- $T_n$ is easy to compute
- $T_n$ has maximal breakdown point
- large outliers still affect the estimate (although the effect remains bounded)

SLIDE 34

Robustness

Redescending M-estimates

If the function $\psi(t)$ is not monotone but redescends to zero, then the equation $\sum_{i=1}^n \psi(X_i - T_n) = 0$ has multiple solutions. Define $\rho(t)$ such that $\rho'(t) = \psi(t)$; then we need the solution of

$$\min_{T_n} \sum_{i=1}^n \rho(X_i - T_n).$$

- $T_n$ can be more difficult to compute
- $T_n$ has maximal breakdown point
- The effect of large outliers on the estimate reduces to zero! Increased robustness against large outliers.

SLIDE 35

Robustness

Redescending M-estimates

A popular family of redescending loss functions is the Tukey biweight (bisquare) family:

$$\rho_c(t) = \begin{cases} \dfrac{t^2}{2} - \dfrac{t^4}{2c^2} + \dfrac{t^6}{6c^4} & \text{if } |t| \le c \\[1ex] \dfrac{c^2}{6} & \text{if } |t| > c. \end{cases}$$

The constant $c$ can be tuned for efficiency.

SLIDE 36

Robustness

Tukey biweight ρ functions

[Figure: Tukey biweight ρ functions for c = 3, c = 2 and c = ∞.]

SLIDE 37

Robustness

Tukey biweight ψ function

[Figure: the Huber ψ function (bounded at ±b) and the Tukey biweight ψ function (redescending to zero beyond ±c).]

SLIDE 38

Robustness

Example: Copper content in flour

Copper content (parts per million) in 24 wholemeal flour samples:
- Sample mean: 4.28
- Sample median: 3.39
- Huber M-estimator: 3.21
- Tukey biweight M-estimator: 3.16

SLIDE 39

Robustness

Univariate scale estimators

Example: copper content (parts per million) in 24 wholemeal flour samples.
- Standard deviation: 5.30
- Median absolute deviation: $S_n = 1.483\,\mathrm{med}_i(|X_i - \mathrm{med}_j(X_j)|)$; MAD: 0.53

→ The standard deviation is sensitive to outliers.
→ The MAD is robust and thus more reliable.

SLIDE 40

Robustness

M-estimators of scale

An M-estimator of scale is the solution $S_n$ of

$$\sum_{i=1}^n \psi(X_i/S_n) = 0.$$

For symmetric distributions, use symmetric $\psi$ functions. The estimator is consistent if $\int \psi(y)\,dF(y) = E_F(\psi(y)) = 0$. The Tukey biweight loss functions $\rho_c$ are symmetric. Put $b = E_\Phi(\rho_c)$ and define $\psi_c(t) = \rho_c(t) - b$; then the Tukey biweight M-estimator of scale $S_n$ solves

$$\sum_{i=1}^n \psi_c(X_i/S_n) = 0, \quad \text{or equivalently} \quad \frac{1}{n}\sum_{i=1}^n \rho_c(X_i/S_n) = b.$$
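A sketch of the M-scale computed by fixed-point iteration, using the biweight ρ from the earlier slide; the tuning c = 1.548 (aimed at 50% breakdown) and the MAD starting value are assumptions:

```python
import numpy as np

def rho_biweight(t, c):
    t = np.minimum(np.abs(t), c)                 # rho equals c^2/6 beyond c
    return t**2 / 2 - t**4 / (2 * c**2) + t**6 / (6 * c**4)

def m_scale(x, c=1.548, tol=1e-8):
    # b = E_Phi(rho_c) by a simple Riemann sum over the standard normal
    z = np.linspace(-8, 8, 20001)
    b = np.sum(rho_biweight(z, c) * np.exp(-z**2 / 2) / np.sqrt(2 * np.pi)) * (z[1] - z[0])
    s = 1.483 * np.median(np.abs(x - np.median(x)))   # MAD start (assumed > 0)
    while True:                                       # solve (1/n) sum rho(x/s) = b
        s_new = s * np.sqrt(np.mean(rho_biweight(x / s, c)) / b)
        if abs(s_new - s) < tol * s:
            return s_new
        s = s_new
```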

SLIDE 41

Robustness

Example: Copper content in flour

Copper content (parts per million) in 24 wholemeal flour samples:
- Standard deviation: 5.30
- Median absolute deviation: 0.53
- Tukey biweight M-estimator: 0.66

SLIDE 42

Robustness

Robust regression

Denote by $r_i(\boldsymbol{\beta}) = y_i - \mathbf{x}_i'\boldsymbol{\beta}$ the residuals corresponding to $\boldsymbol{\beta}$. $\hat{\boldsymbol{\beta}}_{LS}$ solves

$$\min_{\boldsymbol{\beta}} \sum_{i=1}^n (y_i - \mathbf{x}_i'\boldsymbol{\beta})^2 = \sum_{i=1}^n r_i(\boldsymbol{\beta})^2.$$

Denote by $\hat{\sigma}(\boldsymbol{\beta}) = \sqrt{\sum_{i=1}^n r_i(\boldsymbol{\beta})^2/(n - d)}$ the estimate of the residual scale. The LS estimator $\hat{\boldsymbol{\beta}}_{LS}$ then equivalently solves $\min_{\boldsymbol{\beta}} \hat{\sigma}(\boldsymbol{\beta})$.

$\Rightarrow$ Instead, minimize a robust estimate of the residual scale.

SLIDE 43

Robustness

Least Median of Squares regression

LS: minimize $\dfrac{1}{n-d}\sum_{i=1}^n r_i(\boldsymbol{\beta})^2$ → LMS: minimize $\mathrm{med}_i\, r_i(\boldsymbol{\beta})^2$

- Maximal breakdown point (50%)
- Small bias
- Slow rate of convergence ($n^{-1/3}$)
- Inefficient

SLIDE 44

Robustness

Least Trimmed Squares regression

LS: minimize $\dfrac{1}{n-d}\sum_{i=1}^n r_i(\boldsymbol{\beta})^2$ → LTS: minimize $\dfrac{1}{h}\sum_{i=1}^h (r(\boldsymbol{\beta})^2)_{i:n}$, where $(r(\boldsymbol{\beta})^2)_{1:n} \le \cdots \le (r(\boldsymbol{\beta})^2)_{n:n}$

- Breakdown point is $\min\{h, n-h\}/n \le 50\%$
- Asymptotically normal
- Trade-off robustness-efficiency
- Low efficiency (less than 10%)

SLIDE 45

Robustness

Regression S-estimators

LS: minimize $\dfrac{1}{n}\sum_{i=1}^n r_i(\boldsymbol{\beta})^2$ → S-estimate: minimize $\hat{\sigma}(\boldsymbol{\beta})$, where for each $\boldsymbol{\beta}$, $\hat{\sigma}(\boldsymbol{\beta})$ solves

$$\frac{1}{n}\sum_{i=1}^n \rho_c\!\left(\frac{r_i(\boldsymbol{\beta})}{\sigma}\right) = b.$$

- $c$ determines both robustness and efficiency
- Trade-off robustness-efficiency
- Breakdown point can be up to 50%
- Asymptotically normal
- Efficiency can still be low (less than 35%)

SLIDE 46

Robustness

Regression M-estimators

LS: minimize $\sum_{i=1}^n r_i(\boldsymbol{\beta})^2$ → M-estimate: minimize

$$\sum_{i=1}^n \rho\!\left(\frac{r_i(\boldsymbol{\beta})}{\hat{\sigma}}\right), \quad \text{or solve} \quad \sum_{i=1}^n \psi\!\left(\frac{r_i(\boldsymbol{\beta})}{\hat{\sigma}}\right)\mathbf{x}_i = 0.$$

Requires a robust scale estimate $\hat{\sigma}$!

SLIDE 47

Robustness

MM estimates

LS: minimize $\sum_{i=1}^n r_i(\boldsymbol{\beta})^2$ → MM-estimate: minimize

$$\sum_{i=1}^n \rho\!\left(\frac{r_i(\boldsymbol{\beta})}{\hat{\sigma}}\right),$$

where $\hat{\sigma}$ is the S-estimator's M-scale.
- The M- and S-estimators both use Tukey biweight $\rho_c$ functions
- The S-estimator is tuned for robustness (breakdown point)
- The redescending M-estimator is tuned for efficiency
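A sketch of the M-step only: IRLS with the efficiency-tuned biweight $\psi_1$ and the scale held fixed at the S-estimator's M-scale. The inputs beta0 (initial S-estimate) and sigma_hat are assumed given, and c = 4.685 (95% efficiency) is an assumed tuning:

```python
import numpy as np

def mm_irls(X, y, beta0, sigma_hat, c=4.685, iters=50):
    beta = beta0.astype(float).copy()
    for _ in range(iters):
        u = (y - X @ beta) / sigma_hat
        # biweight weights psi_1(u)/u = (1 - (u/c)^2)^2 inside [-c, c], 0 outside
        w = np.where(np.abs(u) <= c, (1 - (u / c) ** 2) ** 2, 0.0)
        WX = w[:, None] * X
        beta = np.linalg.solve(WX.T @ X, WX.T @ y)   # weighted LS step
    return beta
```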

SLIDE 48

Robustness

MM: loss functions

Tukey biweight family (rescaled so that the loss equals 1 beyond $c$):

$$\rho_c(t) = \begin{cases} 3\left(\dfrac{t}{c}\right)^2 - 3\left(\dfrac{t}{c}\right)^4 + \left(\dfrac{t}{c}\right)^6 & \text{if } |t| \le c \\ 1 & \text{if } |t| > c, \end{cases}$$

[Figure: the two loss functions $\rho_0$ (tuning constant $c_0$) and $\rho_1$ (tuning constant $c_1 > c_0$).]

- $\rho_0$ determines the breakdown point (S-estimator)
- $\rho_1$ determines the efficiency (MM-estimator)

SLIDE 49

Robustness

MM estimates

Same construction as on the previous slide: minimize $\sum_{i=1}^n \rho(r_i(\boldsymbol{\beta})/\hat{\sigma})$ with $\hat{\sigma}$ the S-estimator's M-scale, the S-part tuned for breakdown and the redescending M-part tuned for efficiency.
⇒ Highly robust and efficient!

SLIDE 50

Robustness

Redescending psi function

⋆ A redescending psi function is needed for robustness, but this implies:
- multiple solutions of the score equations
- the global solution is needed (high breakdown point)
- difficult (time consuming) to compute

SLIDE 51

Robust variable selection: sequencing

Robust variable selection

Issues:
- Robust regression estimators are computationally demanding
- 'Outliers' depend on the model under consideration
- High dimensional data: outlying cases?

Our approach: a two-step procedure
1. Sequencing: construct a reduced sequence of good predictors in an efficient way.
2. Segmentation: build an optimal model from the reduced set of predictors.

SLIDE 52

Robust variable selection: sequencing

Sequencing the variables in order of importance

Automatic variable selection methods such as forward/stepwise selection, LAR and L2 boosting are computationally efficient methods to sequence predictors. These methods are based only on the means, variances and correlations of the data.
⇒ Construct computationally efficient, robust methods to sequence predictors by using computationally efficient and highly robust estimates of center, scale and correlation.

SLIDE 53

Robust variable selection: sequencing

Robust building blocks

- Location: median
- Scatter: median absolute deviation
- Correlation: bivariate Winsorization
- Correlation: bivariate M-estimators
- Correlation: Gnanadesikan-Kettenring estimators

SLIDE 54

Robust variable selection: sequencing

Winsorized correlation estimates

1. Robustly standardize the data using the median and MAD.
2. Transform the data by shifting outliers towards the center.
3. Calculate the Pearson correlation of the transformed data.
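A minimal sketch of these three steps with the univariate (Huber-type) transformation of the next slide; c = 2 is the constant shown there:

```python
import numpy as np

def winsorized_cor(x, y, c=2.0):
    def rob_standardize(v):
        mad = 1.483 * np.median(np.abs(v - np.median(v)))
        return (v - np.median(v)) / mad
    u = np.clip(rob_standardize(x), -c, c)   # shift outliers towards the center
    w = np.clip(rob_standardize(y), -c, c)
    return np.corrcoef(u, w)[0, 1]           # Pearson correlation of transformed data
```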

SLIDE 55

Robust variable selection: sequencing

Univariate Winsorization

Componentwise transformation: $u = \psi_c(x) = \min(\max(-c, x), c)$

[Figure: the Huber ψ function with c = 2 used as the Winsorizing transformation.]

SLIDE 56

Robust variable selection: sequencing

Univariate Winsorization

Componentwise transformation: $u = \psi_c(x) = \min(\max(-c, x), c)$

[Figure: scatter plot of bivariate data after univariate Winsorization of each coordinate.]

SLIDE 57

Robust variable selection: sequencing

Bivariate Winsorization

Bivariate transformation: $u = \min\left(\sqrt{c/D(\mathbf{x})},\, 1\right)\mathbf{x}$ with $c = F^{-1}_{\chi^2_2}(0.95)$ and $D(\mathbf{x}) = \mathbf{x}^t R_0^{-1}\mathbf{x}$, where $R_0$ is an initial bivariate correlation matrix.

[Figure: scatter plot of bivariate data after bivariate Winsorization.]


SLIDE 61

Robust variable selection: sequencing

Initial correlation estimate

Adjusted Winsorization: univariate Winsorization with different tuning constants in different quadrants.

Denote by $h$ the ratio of the number of observations in the second and fourth quadrants to the number in the first and third quadrants, and suppose $h \le 1$. Then
- use constant $c_1$ for Winsorizing points in the first and third quadrants
- use $c_2 = \sqrt{h}\,c_1$ for the second and fourth quadrants

$R_0$ is the correlation matrix of the adjusted Winsorized data.

SLIDE 62

Robust variable selection: sequencing

Initial correlation estimate

Adjusted Winsorization: univariate Winsorization with different tuning constants for different quadrants.

[Figure: scatter plot of bivariate data after adjusted Winsorization.]

SLIDE 63

Robust variable selection: sequencing

Initial correlation estimate

Univariate Winsorization

[Figure: the same data after plain univariate Winsorization, for comparison.]

SLIDE 64

Robust variable selection: sequencing

Correlation M-estimators

1. First center the two variables using their medians.
2. An M-estimate of the covariance matrix is the solution $V$ of the equation

$$\frac{1}{n}\sum_i u_2(d_i^2)\,\mathbf{x}_i\mathbf{x}_i' = V,$$

where $d_i^2 = \mathbf{x}_i'V^{-1}\mathbf{x}_i$ and $u_2(t) = \min(\chi^2_2(0.99)/t,\, 1)$.
3. Calculate the correlation corresponding to the bivariate covariance matrix $V$.

SLIDE 65

Robust variable selection: sequencing

Gnanadesikan-Kettenring correlation estimators

Consider the identity

$$\mathrm{cov}(X, Y) = \frac{1}{4}\left[\mathrm{sd}(X + Y)^2 - \mathrm{sd}(X - Y)^2\right].$$

Replace the sample standard deviations by robust estimates of scale to obtain robust correlation estimates.
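A sketch with the MAD as the robust scale; standardizing first and normalizing the two terms keeps the estimate in [−1, 1] (a common refinement, assumed here):

```python
import numpy as np

def mad(v):
    return 1.483 * np.median(np.abs(v - np.median(v)))

def gk_cor(x, y, scale=mad):
    u, v = x / scale(x), y / scale(y)        # robustly standardized variables
    sp, sm = scale(u + v) ** 2, scale(u - v) ** 2
    return (sp - sm) / (sp + sm)             # robust correlation estimate
```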

SLIDE 66

Robust variable selection: sequencing

Robust correlations: Computational efficiency

[Figure: CPU time versus sample size for the univariate Winsorization, adjusted Winsorization, bivariate Winsorization and Maronna (M-estimator) correlations.]

SLIDE 67

Robust variable selection: sequencing

Robust LAR: Computational efficiency

The computational efficiency of the correlation estimates largely determines the computing time of robust LAR.

[Figure: CPU time versus dimension for LARS, W-RLARS and M-RLARS.]

SLIDE 68

Robust variable selection: sequencing

Bootstrapping the sequencing algorithms

Use bootstrap averages to obtain more reliable and stable sequences. Procedure:
1. Generate 50 bootstrap samples.
2. Sequence the predictors in each sample.
3. Rank the predictors according to their average rank over the bootstrap samples.

Not all predictors have to be ranked in each bootstrap sample.

SLIDE 69

Robust variable selection: sequencing

Bootstrap effect on robust LAR

Simulation design:
- samples of size 150 in 200 dimensions
- 10 target predictors
- 20 noise covariates correlated with the target predictors
- 170 independent noise covariates
- 10% of symmetric or asymmetric high-leverage outliers

We compare with random forests, using variable importance measures to sequence the variables.

SLIDE 70

Robust variable selection: sequencing

Bootstrap RLAR vs RLAR/Random Forests

[Figure: number of target variables recovered versus number of sequenced variables for B-RLARS, RLARS, RF-OOB and RF-IMP; left panel: symmetric high-leverage outliers, right panel: asymmetric high-leverage outliers.]

SLIDE 71

Robust variable selection: sequencing

Example: Demographic data

- $n = 50$ states of the USA, $d = 25$ covariates
- Response: $y$ = murder rate
- One outlier
- 5-fold cross-validation selects a model with 7 variables
- We sequence the variables using B-RLARS and construct a learning curve: a graphical tool to select the size of the reduced sequence in practice, based on a robust $R^2$ measure, e.g.

$$R^2 = 1 - \frac{\mathrm{Med}(\text{residual}^2)}{\mathrm{MAD}^2(y)}$$

SLIDE 72

Robust variable selection: sequencing

Demographic data: learning curve

[Figure: learning curve — robust learning rate versus number of variables in the model.]

⇒ Reduced set of at most 12 predictors

SLIDE 73

Robust variable selection: sequencing

Demographic data: models

- Full CV model: 7 predictors
- B-RLAR+CV: 6 predictors
- LAR+CV: 8 predictors
- RF-SEL: 5 predictors
- RF-SEL+CV: 4 predictors
- RF-RED+CV: 5 predictors
- MSVM-RFE: 8 predictors
- MSVM-RFE+CV: 6 predictors

SLIDE 74

Robust variable selection: sequencing

Demographic data: model comparison

Density estimates based on 1000 5-fold CV-MSPE estimates.

[Figure: density estimates of the 5-fold CV-MSPE for Full-CV, LARS+CV, B-RLARS+CV, RF-SEL+CV, RF-RED+CV and MSVM-RFE.]

SLIDE 75

Robust variable selection: sequencing

Example: Protein data

$n = 4141$ protein sequences, $d = 77$ covariates. Training sample of size 2072 and test sample of size 2069. We selected predictors using
- B-RLAR: 5 predictors
- RF using OOB importance: 22 predictors
- MSVM-RFE: 22 predictors

For RF we could determine an optimal submodel in the reduced sequence using robust MM-estimates with the robust FPE. ⇒ RF+RFPE: 18 predictors

SLIDE 76

Robust variable selection: sequencing

Protein data: test sample errors

Trimmed means of squared prediction errors:

              Trimming fraction
Model        1%       5%       10%
B-RLAR       116.19   97.73    84.67
RF           111.11   93.80    81.30
RF-RFPE      111.30   93.92    81.27
MSVM-RFE     173.70   150.48   133.17

SLIDE 77

Robust variable selection: sequencing

Example: Particle data

- Quantum physics data with $d = 64$ predictors
- Training sample of size 5,000; test sample of size 45,000
- FS and SW produced a model with 25 predictors
- Robust FS and SW produced a model with only 1 predictor: indeed, for more than 80% of the cases $X_1 = Y = 0$
- For the cases with $X_1 \neq 0$, FS produced a model with 5 predictors
- We fit the final models using MM-estimators

SLIDE 78

Robust variable selection: sequencing

Particle data: test sample errors

Trimmed means of squared prediction errors:

            Trimming fraction
Model        1%      5%
FS           0.110   0.012
Robust FS    0.032   0.001

SLIDE 79

Robust variable selection: segmentation

Segmentation: Robust adjusted R-squared

Adjusted $R^2$: $A(\alpha) = 1 - \dfrac{\mathrm{RSS}(\alpha)/(n - d(\alpha))}{\mathrm{RSS}(1)/(n - 1)}$.

Based on a robust regression estimator we can construct a robust adjusted $R^2$:

$$RR^2_a(\alpha) = 1 - \frac{\hat{\sigma}^2_\alpha/(n - d(\alpha))}{\hat{\sigma}^2_0/(n - 1)},$$

where $\hat{\sigma}_\alpha$ is the robust residual scale of the submodel with predictors indexed by $\alpha$, and $\hat{\sigma}_0$ is the robust residual scale of the intercept-only model.

SLIDE 80

Robust variable selection: segmentation

Segmentation: Robust FPE

$\mathrm{FPE}(\alpha) = \dfrac{\mathrm{RSS}(\alpha)}{\hat{\sigma}^2} + 2d(\alpha)$ estimates the final prediction error

$$\mathrm{FPE}(\alpha) = \frac{1}{\sigma^2}\sum_{i=1}^n E\left[(z_i - \mathbf{x}_{\alpha i}'\hat{\boldsymbol{\beta}}_\alpha)^2\right],$$

assuming that the model is correct. Consider now the robust final prediction error

$$\mathrm{RFPE}(\alpha) = \sum_{i=1}^n E\left[\rho\!\left(\frac{z_i - \mathbf{x}_{\alpha i}'\hat{\boldsymbol{\beta}}_\alpha}{\sigma}\right)\right].$$

Assuming that the model is correct and using a second-order Taylor expansion, this can be estimated by

$$\widehat{\mathrm{RFPE}}(\alpha) = \sum_{i=1}^n \rho\!\left(\frac{r_i(\hat{\boldsymbol{\beta}}_\alpha)}{\hat{\sigma}_n}\right) + d(\alpha)\,\frac{\sum_{i=1}^n \psi^2\!\left(r_i(\hat{\boldsymbol{\beta}}_\alpha)/\hat{\sigma}_n\right)}{\sum_{i=1}^n \psi'\!\left(r_i(\hat{\boldsymbol{\beta}}_\alpha)/\hat{\sigma}_n\right)},$$

where $\hat{\sigma}_n$ is the robust scale estimate of a 'full' model $\alpha_f$. Usually, $\alpha_f = \{1, \ldots, d\}$.
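A sketch of this RFPE estimate given the MM residuals for submodel α and the full-model scale; ρ here is the rescaled biweight of slide 48, ψ = ρ′ and ψ′ = ρ″, with an assumed efficiency tuning c = 4.685:

```python
import numpy as np

def rfpe(res, sigma_hat, d_alpha, c=4.685):
    u = res / sigma_hat
    inside = np.abs(u) <= c
    v = u / c
    rho = np.where(inside, 3 * v**2 - 3 * v**4 + v**6, 1.0)
    psi = np.where(inside, 6 * u / c**2 * (1 - v**2) ** 2, 0.0)           # rho'
    dpsi = np.where(inside, 6 / c**2 * (1 - v**2) * (1 - 5 * v**2), 0.0)  # rho''
    return rho.sum() + d_alpha * (psi**2).sum() / dpsi.sum()
```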

SLIDE 81

Robust variable selection: segmentation

Robust resampling based selection criteria

Robust equivalents of the resampling-based selection criteria:

$$\mathrm{RPE}(\alpha) = \frac{\hat{\sigma}_n^2}{n}\, E^\star\!\left[\sum_{i=1}^n \rho\!\left(\frac{z_i - \mathbf{x}_{\alpha i}'\hat{\boldsymbol{\beta}}_\alpha}{\hat{\sigma}_n}\right) \,\middle|\, \mathbf{y}, X\right]$$

$$\mathrm{PRPE}(\alpha) = \frac{\hat{\sigma}_n^2}{n}\left[\sum_{i=1}^n \rho\!\left(\frac{y_i - \mathbf{x}_{\alpha i}'\hat{\boldsymbol{\beta}}_\alpha}{\hat{\sigma}_n}\right) + f(n)\,d(\alpha)\right] + \tilde{M}_n(\alpha)$$

- $\rho$ is the MM loss function and $\hat{\boldsymbol{\beta}}_{\alpha,n}$ is the MM estimate
- $f(n)\,d(\alpha)$ is the penalty term, with e.g. $f(n) = 2\log n$
- $\hat{\sigma}_n$ is the robust scale estimate of a 'full' model $\alpha_f$; usually $\alpha_f = \{1, \ldots, d\}$
- $E^\star$ is a robust resampling estimate of the expected value

SLIDE 82

Robust variable selection: segmentation

Robustness and resampling

Resampling robust estimators causes problems with both robustness and speed. The stratified bootstrap (Müller and Welsh, JASA, 2005) only solves the first problem → limited practical use. The fast and robust bootstrap solves both problems.

SLIDE 83

Robust variable selection: segmentation

MM-estimators revisited

For the model comparison we use slightly adjusted MM-estimators. The MM-estimates $\hat{\boldsymbol{\beta}}_\alpha$ satisfy

$$\frac{1}{n}\sum_{i=1}^n \psi_1\!\left(\frac{y_i - \mathbf{x}_{\alpha i}'\hat{\boldsymbol{\beta}}_\alpha}{\hat{\sigma}_n}\right)\mathbf{x}_{\alpha i} = 0,$$

where $\hat{\sigma}_n$ minimizes the M-scale $\hat{\sigma}_n(\boldsymbol{\beta})$, which for any $\boldsymbol{\beta} \in \mathbb{R}^d$ is defined as the solution of

$$\frac{1}{n}\sum_{i=1}^n \rho_0\!\left(\frac{y_i - \mathbf{x}_i'\boldsymbol{\beta}}{\hat{\sigma}_n(\boldsymbol{\beta})}\right) = b.$$

- $\rho_0$ determines the breakdown point (S-estimator)
- $\rho_1$ determines the efficiency (MM-estimator)

SLIDE 84

Robust variable selection: segmentation

Bootstrapping MM-estimates

Weighted least squares representation of the MM-estimator:

$$\hat{\boldsymbol{\beta}}_{\alpha,n} = \left(\sum_{i=1}^n \omega_{\alpha i}\,\mathbf{x}_{\alpha i}\mathbf{x}_{\alpha i}'\right)^{-1}\sum_{i=1}^n \omega_{\alpha i}\,\mathbf{x}_{\alpha i}\,y_i$$

with $\omega_{\alpha i} = \rho_1'(r_{\alpha i}/\hat{\sigma}_n)/r_{\alpha i}$ and $r_{\alpha i} = y_i - \hat{\boldsymbol{\beta}}_{\alpha,n}'\mathbf{x}_{\alpha i}$.

Let $(y_i^\star, \mathbf{x}_{\alpha i}^\star)$, $i = 1, \ldots, m$, be a bootstrap sample of size $m \le n$. Then $\hat{\boldsymbol{\beta}}^\star_\alpha$ satisfies

$$\hat{\boldsymbol{\beta}}^\star_{\alpha,m} = \left(\sum_{i=1}^m \omega^\star_{\alpha i}\,\mathbf{x}^\star_{\alpha i}\mathbf{x}^{\star\prime}_{\alpha i}\right)^{-1}\sum_{i=1}^m \omega^\star_{\alpha i}\,\mathbf{x}^\star_{\alpha i}\,y^\star_i$$

with $\omega^\star_{\alpha i} = \rho_1'(r^\star_{\alpha i}/\hat{\sigma}^\star_n)/r^\star_{\alpha i}$ and $r^\star_{\alpha i} = y^\star_i - \hat{\boldsymbol{\beta}}^{\star\prime}_{\alpha,m}\mathbf{x}^\star_{\alpha i}$.

SLIDE 85

Robust variable selection: segmentation

Fast and robust bootstrap

Weighted least squares representation of the MM-estimator:

$$\hat{\boldsymbol{\beta}}_{\alpha,n} = \left(\sum_{i=1}^n \omega_{\alpha i}\,\mathbf{x}_{\alpha i}\mathbf{x}_{\alpha i}'\right)^{-1}\sum_{i=1}^n \omega_{\alpha i}\,\mathbf{x}_{\alpha i}\,y_i$$

with $\omega_{\alpha i} = \rho_1'(r_{\alpha i}/\hat{\sigma}_n)/r_{\alpha i}$ and $r_{\alpha i} = y_i - \hat{\boldsymbol{\beta}}_{\alpha,n}'\mathbf{x}_{\alpha i}$.

Let $(y_i^\star, \mathbf{x}_{\alpha i}^\star)$, $i = 1, \ldots, m$, be a bootstrap sample of size $m \le n$. Define $\hat{\boldsymbol{\beta}}^{1,\star}_\alpha$ by

$$\hat{\boldsymbol{\beta}}^{1,\star}_{\alpha,m} = \left(\sum_{i=1}^m \omega^\star_{\alpha i}\,\mathbf{x}^\star_{\alpha i}\mathbf{x}^{\star\prime}_{\alpha i}\right)^{-1}\sum_{i=1}^m \omega^\star_{\alpha i}\,\mathbf{x}^\star_{\alpha i}\,y^\star_i$$

with $\omega^\star_{\alpha i} = \rho_1'(r^\star_{\alpha i}/\hat{\sigma}_n)/r^\star_{\alpha i}$ and $r^\star_{\alpha i} = y^\star_i - \hat{\boldsymbol{\beta}}_{\alpha,n}'\mathbf{x}^\star_{\alpha i}$.

Note that $\hat{\boldsymbol{\beta}}_{\alpha,n}$ and $\hat{\sigma}_n$ are not recalculated!

SLIDE 86

Robust variable selection: segmentation

Fast and robust bootstrap

The estimates $\hat{\boldsymbol{\beta}}^{1,\star}_{\alpha,m}$ will under-estimate the variability of the completely recalculated estimates $\hat{\boldsymbol{\beta}}^\star_{\alpha,m}$ → a correction is needed. The fast and robust bootstrap estimates $\hat{\boldsymbol{\beta}}^{R\star}_{\alpha,m}$ are given by

$$\hat{\boldsymbol{\beta}}^{R\star}_{\alpha,m} = \hat{\boldsymbol{\beta}}_{\alpha,n} + K_{\alpha,n}\left(\hat{\boldsymbol{\beta}}^{1,\star}_{\alpha,m} - \hat{\boldsymbol{\beta}}_{\alpha,n}\right),$$

where

$$K_{\alpha,n} = \hat{\sigma}_n\left(\sum_{i=1}^n \rho_1''(r_{\alpha i}/\hat{\sigma}_n)\,\mathbf{x}_{\alpha i}\mathbf{x}_{\alpha i}'\right)^{-1}\sum_{i=1}^n \omega_{\alpha i}\,\mathbf{x}_{\alpha i}\mathbf{x}_{\alpha i}'.$$

Note that $K_{\alpha,n}$ is computed only once, for the original sample.
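A sketch of the whole recipe for one submodel, assuming the original MM fit supplies beta_hat, sigma_hat and the functions psi1 (= ρ₁′) and dpsi1 (= ρ₁″); all helper names are ours:

```python
import numpy as np

def fast_robust_bootstrap(X, y, beta_hat, sigma_hat, psi1, dpsi1,
                          B=1000, m=None, seed=0):
    n = len(y)
    m = n if m is None else m                       # m-out-of-n resampling
    r = y - X @ beta_hat                            # original residuals
    w = psi1(r / sigma_hat) / r                     # WLS weights (r != 0 assumed)
    M = (dpsi1(r / sigma_hat)[:, None] * X).T @ X   # sum rho_1''(r/s) x x'
    N = (w[:, None] * X).T @ X                      # sum w x x'
    K = sigma_hat * np.linalg.solve(M, N)           # correction matrix, computed once
    rng = np.random.default_rng(seed)
    draws = np.empty((B, X.shape[1]))
    for bb in range(B):
        idx = rng.integers(n, size=m)               # resample cases
        Xb, yb, wb = X[idx], y[idx], w[idx]         # weights are NOT refit
        beta1 = np.linalg.solve((wb[:, None] * Xb).T @ Xb,
                                (wb[:, None] * Xb).T @ yb)
        draws[bb] = beta_hat + K @ (beta1 - beta_hat)  # linear correction
    return draws
```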

SLIDE 87

Robust variable selection: segmentation

Properties of fast and robust bootstrap

- Computationally efficient: only weighted least squares calculations
- Robust: no recalculation of observation weights

SLIDE 88

Robust variable selection: segmentation

Consistent model selection

Suppose a true model $\alpha_0 \subset \{1, \ldots, d\}$ exists and is included in the set $\mathcal{A}$ of models considered. If we select the model that minimizes $\mathrm{RPE}(\alpha)$ or $\mathrm{PRPE}(\alpha)$, that is,

$$\hat{\alpha}_{m,n} = \arg\min_{\alpha \in \mathcal{A}} \mathrm{RPE}(\alpha) \quad \text{and} \quad \tilde{\alpha}_{m,n} = \arg\min_{\alpha \in \mathcal{A}} \mathrm{PRPE}(\alpha),$$

then, under appropriate regularity conditions, the model selection criteria are consistent in the sense that

$$\lim_{n\to\infty} P(\hat{\alpha}_{m,n} = \alpha_0) = 1 \quad \text{and} \quad \lim_{n\to\infty} P(\tilde{\alpha}_{m,n} = \alpha_0) = 1.$$

Two conditions have practical consequences:
- $m = o(n)$ ($m$ out of $n$ bootstrap)
- $f(n) = o(n/m)$

SLIDE 90

Robust variable selection: segmentation

Examples

We compare the full model with the models selected by backward elimination based on
- $\mathrm{RPE}(\alpha)$
- $\mathrm{PRPE}(\alpha)$ with $f(n) = \log(n)$
- RFPE

For each of the models we report $RR^2_a(\alpha)$, the robust adjusted $R^2$. To compare predictive power we calculated the 5-fold CV trimmed MSPE.

SLIDE 91

Robust variable selection: segmentation

Example 1: Ozone data

Los Angeles ozone pollution data, 1976. 366 observations (different days) on 9 variables.
- Response: temperature (degrees F) at El Monte, CA
- Covariates: measurements of temperature, pressure, humidity, ozone, etc. at other places in CA
- We start from the full quadratic model ($d = 45$)

model    size   RR²a     5% trimmed MSPE
Full     45     0.8660   10.78
RFPE     23     0.8174   10.66
α̃m,n    10     0.7583   11.67
α̂m,n    7      0.7643   10.45

SLIDE 92

Robust variable selection: segmentation

Example 2: Diabetes data

442 observations on 16 variables.
- Response: measure of disease progression one year after baseline
- Covariates: 10 baseline variables (age, sex, BMI, blood pressure, ...)
- We start from a quadratic model with some interactions ($d = 65$)

model    size   RR²a     5% trimmed MSE
Full     65     0.7731   4988.1
RFPE     16     0.6045   2231.2
α̃m,n    11     0.5127   2657.2
α̂m,n    7      0.5302   2497.0

SLIDE 93

References

Khan, J.A., Van Aelst, S., and Zamar, R.H. (2007). Building a Robust Linear Model with Forward Selection and Stepwise Procedures. Computational Statistics and Data Analysis, 52, 239-248.

Khan, J.A., Van Aelst, S., and Zamar, R.H. (2007). Robust Linear Model Selection Based on Least Angle Regression. Journal of the American Statistical Association, 102, 1289-1299.

Lutz, R.W., Kalisch, M., and Bühlmann, P. (2008). Robustified L2 Boosting. Computational Statistics and Data Analysis, 52, 3331-3341.

Maronna, R. A., Martin, D. R. and Yohai, V. J. (2006). Robust Statistics: Theory and Methods, Wiley: New York.

Salibian-Barrera, M. and Van Aelst S. (2007). Robust Model Selection Using Fast and Robust Bootstrap. Computational Statistics and Data Analysis, 52, 5121-5135
