SLIDE 1

ST 430/514 Introduction to Regression Analysis/Statistics for Management and the Social Sciences II

Variable Screening

You will often have many candidate variables to use as independent variables in a regression model. Using all of them may be infeasible (more parameters than observations).

Even if feasible, a prediction equation with many parameters may not perform well, either in validation or in application.

SLIDE 2

Stepwise regression

How do we choose the subset to use? One approach: stepwise regression.

Example: executive salary, with 10 candidate variables:

execSal <- read.table("Text/Exercises&Examples/EXECSAL2.txt", header = TRUE)
execSal[1:5, ]        # first five rows
pairs(execSal[, -1])  # scatterplot matrix of the candidate variables

SLIDE 3

Variables

X1: Experience (years)
X2: Education (years)
X3: Gender (1 if male, 0 if female)
X4: Number of employees supervised
X5: Corporate assets ($ millions)
X6: Board member (1 if yes, 0 if no)
X7: Age (years)
X8: Company profits (past 12 months, $ millions)
X9: Has international responsibility (1 if yes, 0 if no)
X10: Company's total sales (past 12 months, $ millions)

SLIDE 4

Note that X3, X6, and X9 are indicator variables. The complete second-order model is quadratic in the other 7 variables, with interactions with all combinations of the indicator variables. A quadratic function of 7 variables has 36 coefficients. The complete second-order model has 36 × 8 = 288 parameters. Infeasible: the data set has only 100 observations.
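As a check on the counts in this slide, a quadratic function of 7 variables has

```latex
\underbrace{1}_{\text{intercept}}
+ \underbrace{7}_{\text{linear terms}}
+ \underbrace{7}_{\text{squared terms}}
+ \underbrace{\tbinom{7}{2} = 21}_{\text{pairwise interactions}}
= 36
```

coefficients, and crossing these with all 2³ = 8 combinations of the three indicators gives 36 × 8 = 288 parameters.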

SLIDE 5

Forward stepwise selection

First, consider all the one-variable models E(Y) = β0 + βj xj, j = 1, 2, ..., k. For each, test the hypothesis H0: βj = 0 at some level α. If none is significant, the model is E(Y) = β0. Otherwise, choose the best (in terms of R2, adjusted R2, |t|, or |r|; it doesn't matter which) and call that variable x_j1.

SLIDE 6

Now consider all two-variable models that include x_j1: E(Y) = β0 + β_j1 x_j1 + βj xj, j ≠ j1. For each, test the significance of the new coefficient βj. If none is significant, the model is E(Y) = β0 + β_j1 x_j1. Otherwise, choose the best new variable and call it x_j2. Continue adding variables until no remaining variable is significant at level α.
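The loop described on these two slides can be sketched in R. Here forward_select is a hypothetical helper written for illustration (it is not code from the text), and it is demonstrated on the built-in mtcars data since the EXECSAL2 file is not loaded here:

```r
# Sketch of forward selection by t-test p-values (alpha-to-enter = 0.05).
# forward_select is a hypothetical helper, not code from the text.
forward_select <- function(data, response, candidates, alpha = 0.05) {
  chosen <- character(0)
  repeat {
    remaining <- setdiff(candidates, chosen)
    if (length(remaining) == 0) break
    # p-value of the new coefficient when each remaining candidate is added
    pvals <- sapply(remaining, function(x) {
      fit <- lm(reformulate(c(chosen, x), response), data)
      coef(summary(fit))[x, "Pr(>|t|)"]
    })
    if (min(pvals) >= alpha) break  # no remaining variable is significant
    chosen <- c(chosen, names(which.min(pvals)))
  }
  chosen
}

# Demonstrated on the built-in mtcars data as a stand-in for execSal
forward_select(mtcars, "mpg", c("cyl", "disp", "hp", "wt", "qsec"))
```

With the execSal data loaded, the call would be forward_select(execSal, "Y", paste0("X", 1:10)), assuming the columns are named X1 through X10.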

SLIDE 7

Backward stepwise elimination

Alternatively, begin with the model containing all the variables: the full first-order model (assuming you can fit it). Test the significance of each coefficient at some level α. If all are significant, use that model. Otherwise, eliminate the least significant variable (smallest |t|, smallest reduction in R2, ...; again, it doesn't matter which). Continue eliminating variables until all remaining variables are significant at level α.
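A matching sketch of backward elimination, with the same caveats (backward_eliminate is a hypothetical helper, not code from the text, demonstrated on mtcars):

```r
# Sketch of backward elimination by t-test p-values (alpha-to-remove = 0.05).
# backward_eliminate is a hypothetical helper, not code from the text.
backward_eliminate <- function(data, response, candidates, alpha = 0.05) {
  chosen <- candidates
  while (length(chosen) > 0) {
    fit <- lm(reformulate(chosen, response), data)
    pvals <- coef(summary(fit))[chosen, "Pr(>|t|)"]
    if (max(pvals) < alpha) break          # everything left is significant
    chosen <- setdiff(chosen, chosen[which.max(pvals)])
  }
  chosen
}

# Demonstrated on the built-in mtcars data as a stand-in for execSal
backward_eliminate(mtcars, "mpg", c("cyl", "disp", "hp", "wt", "qsec"))
```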

SLIDE 8

Either forward selection or backward elimination could be used to select a subset of variables for further study. Problem: forward selection and backward elimination may identify different subsets.

SLIDE 9

Bidirectional stepwise regression

A combination of forward selection and backward elimination. Choose a starting model; it could be:

  • no independent variables;
  • all independent variables;
  • some other subset of independent variables suggested a priori.

Look for a variable to add to the model, by adding each candidate, one at a time, and testing the significance of the coefficient. Then look for a variable to eliminate, by testing all coefficients. You could use a different α-to-enter and α-to-remove, with α_enter < α_remove.

SLIDE 10

Repeat both steps until no variable can be added or eliminated. The final model is one at which both forward selection and backward elimination would terminate. But it is still possible that you get different final models depending on the choice of initial model.

SLIDE 11

Criterion-based stepwise regression

In hypothesis-test-based subset selection, many tests are used. Each test, in isolation, has a specified error rate α. The per-test error rate α controls the choice of final subset, but in an indirect way.

SLIDE 12

Modern methods are instead based on improving a criterion such as:

  • the adjusted coefficient of determination, adjusted R2;
  • MSE, s2 (equivalent to adjusted R2);
  • Mallows's Cp criterion;
  • the PRESS criterion;
  • Akaike's information criterion, AIC.

PRESS and AIC are equivalent when n is large.
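For reference, standard forms of these criteria (not spelled out on the slide; here p is the number of predictors in the candidate subset and s² is the MSE of the full model):

```latex
R_a^2 = 1 - (1 - R^2)\,\frac{n-1}{n-(k+1)}, \qquad
C_p = \frac{SSE_p}{s^2} + 2(p+1) - n, \qquad
\mathrm{PRESS} = \sum_{i=1}^{n} \bigl( y_i - \hat{y}_{(i)} \bigr)^2
```

where ŷ_(i) is the prediction for observation i from the model fitted without observation i.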

SLIDE 13

In R, using AIC, starting with the empty model:

start <- lm(Y ~ 1, execSal)
all <- Y ~ X1 + X2 + X3 + X4 + X5 + X6 + X7 + X8 + X9 + X10
summary(step(start, scope = all))

Starting with the full model:

# no scope, so direction defaults to "backward":
summary(step(lm(all, execSal)))
summary(step(lm(all, execSal), direction = "both"))

SLIDE 14

Note:

AIC = n log(σ̂²) + 2(k + 1)
    = n log(SSE / n) + 2(k + 1)
    = n log(SSE) + 2(k + 1) [− n log n].

This works well when choosing from nested models. But in the example, the 5-variable model is the best of the C(10, 5) = 252 possible 5-variable models. Some statisticians prefer the Bayesian Information Criterion

BIC = n log(σ̂²) + (log n)(k + 1).
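The subset counts are easy to check directly in R:

```r
choose(10, 5)          # number of 5-variable subsets of 10 candidates: 252
sum(choose(10, 0:10))  # subsets of any size: 2^10 = 1024
```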

SLIDE 15

BIC imposes a higher penalty on the number of parameters in the model. In R:

summary(step(start, scope = all, k = log(nrow(execSal))))

The final model is the same in this case; it will never be larger than the AIC choice, but it may be smaller.

SLIDE 16

Best Subset Regression

When used with a criterion, stepwise regression terminates with a subset of variables that cannot be improved by adding or dropping a single variable. That is, it is locally optimal. But some other subset may have a better value of the criterion. In R, the bestglm package implements best subset regression for various criteria.
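A minimal sketch of such a search, assuming the bestglm package is installed and that the candidate columns of execSal are named X1 through X10 with response Y:

```r
# bestglm expects a data frame with the predictors first and the
# response in the LAST column (its "Xy" convention).
library(bestglm)

Xy <- execSal[, c(paste0("X", 1:10), "Y")]  # assumes these column names
fit <- bestglm(Xy, IC = "BIC")              # exhaustive search over 2^10 subsets
summary(fit$BestModel)                      # best subset as a fitted lm object
```

Unlike step(), this examines every subset, so it finds the global optimum of the criterion over first-order models.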

SLIDE 17

Concerning best subset methods, the text asserts that these techniques lack the objectivity of a stepwise regression procedure. I disagree: finding the subset of variables that optimizes some criterion is completely objective. In fact, because of the opaque way that the choice of α controls the procedure, I would argue that it is stepwise regression that lacks the transparency of best subset regression.

SLIDE 18

Why not use stepwise methods to build a complete model?

We need to try second-order terms such as products of independent variables (interactions) and squared terms (curvature). Some software tools do not know that an interaction should be included only if both main effects are also included, but step() does. Try the full second-order model:

all <- Y ~ ((X1 + X2 + X4 + X5 + X7 + X8 + X10)^2 +
            I(X1^2) + I(X2^2) + I(X4^2) + I(X5^2) +
            I(X7^2) + I(X8^2) + I(X10^2)) * X3 * X6 * X9
summary(step(start, scope = all, k = log(nrow(execSal))))

SLIDE 19

Call:
lm(formula = Y ~ X1 + X3 + I(X4^2) + X2 + X5 + I(X1^2) + I(X2^2) +
    X3:I(X4^2), data = execSal)

Residuals:
      Min        1Q    Median        3Q       Max
-0.165099 -0.048349  0.001662  0.052617  0.138882

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  9.232e+00  3.356e-01  27.509  < 2e-16 ***
X1           4.624e-02  3.747e-03  12.341  < 2e-16 ***
X3           1.491e-01  2.511e-02   5.938 5.19e-08 ***
I(X4^2)      4.733e-07  1.087e-07   4.354 3.50e-05 ***
X2           1.218e-01  4.247e-02   2.867 0.005151 **
X5           2.237e-03  4.387e-04   5.100 1.84e-06 ***
I(X1^2)     -7.276e-04  1.374e-04  -5.294 8.24e-07 ***
I(X2^2)     -2.927e-03  1.341e-03  -2.183 0.031627 *
X3:I(X4^2)   4.680e-07  1.307e-07   3.581 0.000552 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.06477 on 91 degrees of freedom
Multiple R-squared:  0.9429, Adjusted R-squared:  0.9379
F-statistic: 187.7 on 8 and 91 DF,  p-value: < 2.2e-16

SLIDE 20

Oops! The model includes I(X4^2) but not X4. step() is smart enough not to include an interaction without all of its main effects, but it does not know that I(X4^2) should not be included without X4. We can fix this manually, by forcing X4 into the model:

start <- lm(Y ~ X4, execSal)
summary(step(start, scope = list(lower = Y ~ X4, upper = all),
             k = log(nrow(execSal))))

SLIDE 21

Call:
lm(formula = Y ~ X4 + X1 + X3 + X2 + X5 + I(X1^2) + X4:X3,
    data = execSal)

Residuals:
      Min        1Q    Median        3Q       Max
-0.163466 -0.048971 -0.001111  0.041345  0.124534

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 9.862e+00  9.703e-02 101.634  < 2e-16 ***
X4          3.259e-04  7.850e-05   4.152 7.36e-05 ***
X1          4.364e-02  3.761e-03  11.604  < 2e-16 ***
X3          1.166e-01  3.696e-02   3.155  0.00217 **
X2          3.094e-02  2.950e-03  10.487  < 2e-16 ***
X5          2.391e-03  4.439e-04   5.386 5.49e-07 ***
I(X1^2)    -6.347e-04  1.384e-04  -4.588 1.41e-05 ***
X4:X3       3.020e-04  9.239e-05   3.269  0.00152 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.06596 on 92 degrees of freedom
Multiple R-squared:  0.9401, Adjusted R-squared:  0.9355
F-statistic: 206.3 on 7 and 92 DF,  p-value: < 2.2e-16

SLIDE 22

Footnote

The model that we arrived at using BIC-based stepwise regression is the same as was used in Example 4.10, where it was proposed with no discussion.

Testing for gender bias:

lmFull <- lm(Y ~ X4 + X1 + X3 + X2 + X5 + I(X1^2) + X4:X3, data = execSal)
lmReduced <- lm(Y ~ X4 + X1 + X2 + X5 + I(X1^2), data = execSal)
anova(lmReduced, lmFull)

Analysis of Variance Table

Model 1: Y ~ X4 + X1 + X2 + X5 + I(X1^2)
Model 2: Y ~ X4 + X1 + X3 + X2 + X5 + I(X1^2) + X4:X3
  Res.Df     RSS Df Sum of Sq      F    Pr(>F)
1     94 1.54018
2     92 0.40029  2    1.1399 130.99 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
