SLIDE 1

ST 430/514 Introduction to Regression Analysis/Statistics for Management and the Social Sciences II

Variable Screening

You will often have many candidate variables to use as independent variables in a regression model. Using all of them may be infeasible (more parameters than observations).

Even if feasible, a prediction equation with many parameters may not perform well, either in validation or in application.

SLIDE 2

Stepwise regression

How do we choose the subset to use? One approach: stepwise regression.

Example: executive salary, with 10 candidate variables:

execSal <- read.table("Text/Exercises&Examples/EXECSAL2.txt", header = TRUE)
execSal[1:5, ]        # first five rows
pairs(execSal[, -1])  # scatterplot matrix of the candidate variables

SLIDE 3

Variables

X1: Experience (years)
X2: Education (years)
X3: Gender (1 if male, 0 if female)
X4: Number of employees supervised
X5: Corporate assets ($ millions)
X6: Board member (1 if yes, 0 if no)
X7: Age (years)
X8: Company profits (past 12 months, $ millions)
X9: Has international responsibility (1 if yes, 0 if no)
X10: Company's total sales (past 12 months, $ millions)

SLIDE 4

Note that X3, X6, and X9 are indicator variables. The complete second-order model is quadratic in the other 7 variables, with interactions with all combinations of the indicator variables. A quadratic function of 7 variables has 36 coefficients. The complete second-order model has 36 × 8 = 288 parameters. Infeasible: the data set has only 100 observations.
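As a check on the counts in this slide, a quadratic function of 7 variables has

```latex
\underbrace{1}_{\text{intercept}}
+ \underbrace{7}_{\text{linear terms}}
+ \underbrace{7}_{\text{squared terms}}
+ \underbrace{\tbinom{7}{2} = 21}_{\text{pairwise interactions}}
= 36
```

coefficients, and crossing these with all 2³ = 8 combinations of the three indicators gives 36 × 8 = 288 parameters.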

SLIDE 5

Forward stepwise selection

First, consider all the one-variable models E(Y) = β0 + βj xj, j = 1, 2, ..., k. For each, test the hypothesis H0: βj = 0 at some level α. If none is significant, the model is E(Y) = β0. Otherwise, choose the best (in terms of R2, adjusted R2, |t|, or |r|; it doesn't matter which) and call that variable x_j1.

SLIDE 6

Now consider all two-variable models that include x_j1: E(Y) = β0 + β_j1 x_j1 + βj xj, j ≠ j1. For each, test the significance of the new coefficient βj. If none is significant, the model is E(Y) = β0 + β_j1 x_j1. Otherwise, choose the best new variable and call it x_j2. Continue adding variables until no remaining variable is significant at level α.
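The loop described on these two slides can be sketched in R. Here forward_select is a hypothetical helper written for illustration (it is not code from the text), and it is demonstrated on the built-in mtcars data since the EXECSAL2 file is not loaded here:

```r
# Sketch of forward selection by t-test p-values (alpha-to-enter = 0.05).
# forward_select is a hypothetical helper, not code from the text.
forward_select <- function(data, response, candidates, alpha = 0.05) {
  chosen <- character(0)
  repeat {
    remaining <- setdiff(candidates, chosen)
    if (length(remaining) == 0) break
    # p-value of the new coefficient when each remaining candidate is added
    pvals <- sapply(remaining, function(x) {
      fit <- lm(reformulate(c(chosen, x), response), data)
      coef(summary(fit))[x, "Pr(>|t|)"]
    })
    if (min(pvals) >= alpha) break  # no remaining variable is significant
    chosen <- c(chosen, names(which.min(pvals)))
  }
  chosen
}

# Demonstrated on the built-in mtcars data as a stand-in for execSal
forward_select(mtcars, "mpg", c("cyl", "disp", "hp", "wt", "qsec"))
```

With the execSal data loaded, the call would be forward_select(execSal, "Y", paste0("X", 1:10)), assuming the columns are named X1 through X10.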

SLIDE 7

Backward stepwise elimination

Alternatively, begin with the model containing all the variables: the full first-order model (assuming you can fit it). Test the significance of each coefficient at some level α. If all are significant, use that model. Otherwise, eliminate the least significant variable (smallest |t|, smallest reduction in R2, ...; again, it doesn't matter which). Continue eliminating variables until all remaining variables are significant at level α.
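A matching sketch of backward elimination, with the same caveats (backward_eliminate is a hypothetical helper, not code from the text, demonstrated on mtcars):

```r
# Sketch of backward elimination by t-test p-values (alpha-to-remove = 0.05).
# backward_eliminate is a hypothetical helper, not code from the text.
backward_eliminate <- function(data, response, candidates, alpha = 0.05) {
  chosen <- candidates
  while (length(chosen) > 0) {
    fit <- lm(reformulate(chosen, response), data)
    pvals <- coef(summary(fit))[chosen, "Pr(>|t|)"]
    if (max(pvals) < alpha) break          # everything left is significant
    chosen <- setdiff(chosen, chosen[which.max(pvals)])
  }
  chosen
}

# Demonstrated on the built-in mtcars data as a stand-in for execSal
backward_eliminate(mtcars, "mpg", c("cyl", "disp", "hp", "wt", "qsec"))
```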

SLIDE 8

Either forward selection or backward elimination could be used to select a subset of variables for further study. Problem: forward selection and backward elimination may identify different subsets.

SLIDE 9

Bidirectional stepwise regression

A combination of forward selection and backward elimination. Choose a starting model; it could be:

  • no independent variables;
  • all independent variables;
  • some other subset of independent variables suggested a priori.

Look for a variable to add to the model, by adding each candidate, one at a time, and testing the significance of the coefficient. Then look for a variable to eliminate, by testing all coefficients. You could use a different α-to-enter and α-to-remove, with α_enter < α_remove.

SLIDE 10

Repeat both steps until no variable can be added or eliminated. The final model is one at which both forward selection and backward elimination would terminate. But it is still possible that you get different final models depending on the choice of initial model.

SLIDE 11

Criterion-based stepwise regression

In hypothesis-test-based subset selection, many tests are used. Each test, in isolation, has a specified error rate α. The per-test error rate α controls the choice of final subset, but in an indirect way.

SLIDE 12

Modern methods are instead based on improving a criterion such as:

  • the adjusted coefficient of determination, adjusted R2;
  • MSE, s2 (equivalent to adjusted R2);
  • Mallows's Cp criterion;
  • the PRESS criterion;
  • Akaike's information criterion, AIC.

PRESS and AIC are equivalent when n is large.
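For reference, standard forms of these criteria (not spelled out on the slide; here p is the number of predictors in the candidate subset and s² is the MSE of the full model):

```latex
R_a^2 = 1 - (1 - R^2)\,\frac{n-1}{n-(k+1)}, \qquad
C_p = \frac{SSE_p}{s^2} + 2(p+1) - n, \qquad
\mathrm{PRESS} = \sum_{i=1}^{n} \bigl( y_i - \hat{y}_{(i)} \bigr)^2
```

where ŷ_(i) is the prediction for observation i from the model fitted without observation i.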

SLIDE 13

In R, using AIC, starting with the empty model:

start <- lm(Y ~ 1, execSal)
all <- Y ~ X1 + X2 + X3 + X4 + X5 + X6 + X7 + X8 + X9 + X10
summary(step(start, scope = all))

Starting with the full model:

# no scope, so direction defaults to "backward":
summary(step(lm(all, execSal)))
summary(step(lm(all, execSal), direction = "both"))

SLIDE 14

Note:

AIC = n log(σ̂²) + 2(k + 1)
    = n log(SSE / n) + 2(k + 1)
    = n log(SSE) + 2(k + 1) [− n log n].

This works well when choosing from nested models. But in the example, the 5-variable model is the best of the C(10, 5) = 252 possible 5-variable models. Some statisticians prefer the Bayesian Information Criterion

BIC = n log(σ̂²) + (log n)(k + 1).
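The subset counts are easy to check directly in R:

```r
choose(10, 5)          # number of 5-variable subsets of 10 candidates: 252
sum(choose(10, 0:10))  # subsets of any size: 2^10 = 1024
```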

SLIDE 15

BIC imposes a higher penalty on the number of parameters in the model. In R:

summary(step(start, scope = all, k = log(nrow(execSal))))

The final model is the same in this case; it will never be larger than the AIC choice, but it may be smaller.

SLIDE 16

Best Subset Regression

When used with a criterion, stepwise regression terminates with a subset of variables that cannot be improved by adding or dropping a single variable. That is, it is locally optimal. But some other subset may have a better value of the criterion. In R, the bestglm package implements best subset regression for various criteria.
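A minimal sketch of such a search, assuming the bestglm package is installed and that the candidate columns of execSal are named X1 through X10 with response Y:

```r
# bestglm expects a data frame with the predictors first and the
# response in the LAST column (its "Xy" convention).
library(bestglm)

Xy <- execSal[, c(paste0("X", 1:10), "Y")]  # assumes these column names
fit <- bestglm(Xy, IC = "BIC")              # exhaustive search over 2^10 subsets
summary(fit$BestModel)                      # best subset as a fitted lm object
```

Unlike step(), this examines every subset, so it finds the global optimum of the criterion over first-order models.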

SLIDE 17

Concerning best subset methods, the text asserts that these techniques lack the objectivity of a stepwise regression procedure. I disagree: finding the subset of variables that optimizes some criterion is completely objective. In fact, because of the opaque way that the choice of α controls the procedure, I would argue that it is stepwise regression that lacks the transparency of best subset regression.

SLIDE 18

Why not use stepwise methods to build a complete model?

We need to try second-order terms such as products of independent variables (interactions) and squared terms (curvature). Some software tools do not know that an interaction should be included only if both main effects are also included, but step() does. Try the full second-order model:

all <- Y ~ ((X1 + X2 + X4 + X5 + X7 + X8 + X10)^2 +
            I(X1^2) + I(X2^2) + I(X4^2) + I(X5^2) +
            I(X7^2) + I(X8^2) + I(X10^2)) * X3 * X6 * X9
summary(step(start, scope = all, k = log(nrow(execSal))))

SLIDE 19

Call:
lm(formula = Y ~ X1 + X3 + I(X4^2) + X2 + X5 + I(X1^2) + I(X2^2) +
    X3:I(X4^2), data = execSal)

Residuals:
      Min        1Q    Median        3Q       Max
-0.165099 -0.048349  0.001662  0.052617  0.138882

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  9.232e+00  3.356e-01  27.509  < 2e-16 ***
X1           4.624e-02  3.747e-03  12.341  < 2e-16 ***
X3           1.491e-01  2.511e-02   5.938 5.19e-08 ***
I(X4^2)      4.733e-07  1.087e-07   4.354 3.50e-05 ***
X2           1.218e-01  4.247e-02   2.867 0.005151 **
X5           2.237e-03  4.387e-04   5.100 1.84e-06 ***
I(X1^2)     -7.276e-04  1.374e-04  -5.294 8.24e-07 ***
I(X2^2)     -2.927e-03  1.341e-03  -2.183 0.031627 *
X3:I(X4^2)   4.680e-07  1.307e-07   3.581 0.000552 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.06477 on 91 degrees of freedom
Multiple R-squared:  0.9429, Adjusted R-squared:  0.9379
F-statistic: 187.7 on 8 and 91 DF,  p-value: < 2.2e-16

SLIDE 20

Oops! The model includes I(X4^2) but not X4. step() is smart enough not to include an interaction without all of its main effects, but it does not know that I(X4^2) should not be included without X4. We can fix this manually, by forcing X4 into the model:

start <- lm(Y ~ X4, execSal)
summary(step(start, scope = list(lower = Y ~ X4, upper = all),
             k = log(nrow(execSal))))

SLIDE 21

Call:
lm(formula = Y ~ X4 + X1 + X3 + X2 + X5 + I(X1^2) + X4:X3,
    data = execSal)

Residuals:
      Min        1Q    Median        3Q       Max
-0.163466 -0.048971 -0.001111  0.041345  0.124534

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 9.862e+00  9.703e-02 101.634  < 2e-16 ***
X4          3.259e-04  7.850e-05   4.152 7.36e-05 ***
X1          4.364e-02  3.761e-03  11.604  < 2e-16 ***
X3          1.166e-01  3.696e-02   3.155  0.00217 **
X2          3.094e-02  2.950e-03  10.487  < 2e-16 ***
X5          2.391e-03  4.439e-04   5.386 5.49e-07 ***
I(X1^2)    -6.347e-04  1.384e-04  -4.588 1.41e-05 ***
X4:X3       3.020e-04  9.239e-05   3.269  0.00152 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.06596 on 92 degrees of freedom
Multiple R-squared:  0.9401, Adjusted R-squared:  0.9355
F-statistic: 206.3 on 7 and 92 DF,  p-value: < 2.2e-16

SLIDE 22

Footnote

The model that we arrived at using BIC-based stepwise regression is the same as was used in Example 4.10, where it was proposed with no discussion.

Testing for gender bias:

lmFull <- lm(Y ~ X4 + X1 + X3 + X2 + X5 + I(X1^2) + X4:X3, data = execSal)
lmReduced <- lm(Y ~ X4 + X1 + X2 + X5 + I(X1^2), data = execSal)
anova(lmReduced, lmFull)

Analysis of Variance Table

Model 1: Y ~ X4 + X1 + X2 + X5 + I(X1^2)
Model 2: Y ~ X4 + X1 + X3 + X2 + X5 + I(X1^2) + X4:X3
  Res.Df     RSS Df Sum of Sq      F    Pr(>F)
1     94 1.54018
2     92 0.40029  2    1.1399 130.99 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
