SLIDE 1

ST 430/514 Introduction to Regression Analysis/Statistics for Management and the Social Sciences II

Coded variables

Some variables can be represented on different scales, e.g., temperature in degrees Celsius or Fahrenheit. Suppose some response Y is modeled as a linear function of temperature: $E(Y) = \beta_0 + \beta_1 x$, with x = temperature in degrees Fahrenheit.

1 / 23 Principles of Model Building Coding Independent Variables

SLIDE 2

If $x^*$ = temperature in degrees Celsius, then $x = 32 + 1.8x^*$. So

$$E(Y) = \beta_0 + \beta_1 (32 + 1.8 x^*) = (\beta_0 + 32\beta_1) + (1.8\beta_1) x^* = \beta_0^* + \beta_1^* x^*,$$

where $\beta_0^* = \beta_0 + 32\beta_1$ and $\beta_1^* = 1.8\beta_1$.
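The relation between the two sets of coefficients can be checked numerically. The sketch below uses simulated data; the sample size, true coefficients, and noise level are illustrative assumptions, not values from the course.

```r
# Simulated data: the true coefficients and noise level are illustrative assumptions.
set.seed(1)
x_f <- runif(50, min = 60, max = 110)    # temperature in degrees Fahrenheit
y   <- 5 + 0.3 * x_f + rnorm(50)         # response, linear in temperature
x_c <- (x_f - 32) / 1.8                  # the same temperatures in degrees Celsius

b      <- coef(lm(y ~ x_f))   # (beta0, beta1)
b_star <- coef(lm(y ~ x_c))   # (beta0*, beta1*)

# Check the relations beta0* = beta0 + 32 * beta1 and beta1* = 1.8 * beta1.
all.equal(unname(b_star[1]), unname(b[1] + 32 * b[2]))
all.equal(unname(b_star[2]), unname(1.8 * b[2]))
```

Both comparisons should agree up to rounding error, because the two fits describe the same line on different scales.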

SLIDE 3

So if Y is linearly related to x, then it is also linearly related to $x^*$, with different coefficients $\beta_0^*$ and $\beta_1^*$.

We sometimes code variables to make an equation more easily interpreted. When a variable takes only two distinct values, we often code them as −1 and +1. E.g., if x is temperature with levels 80°F and 100°F, and $x^* = (x - 90)/10$, then $x^* = -1$ when x = 80, and $x^* = +1$ when x = 100.

SLIDE 4

A variable with three levels can similarly be coded as −1, 0, and +1, provided the three levels are equally spaced. The interpretation of the corresponding coefficient $\beta^*$ is, as always, the change in E(Y) when $x^*$ changes by 1, with all other variables fixed. But with a variable coded like this, a change of 1 in $x^*$ means moving, say, from the midpoint value to the high value. The corresponding change in E(Y) is often called the effect of the variable.
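As a rough illustration of the coding, the sketch below codes a three-level temperature variable and compares the coded slope with the per-degree slope. The design (four runs per level) and the simulated response are assumptions made only for this example.

```r
temp <- rep(c(80, 90, 100), each = 4)        # hypothetical design, 4 runs per level
temp_coded <- (temp - 90) / 10               # coded levels: -1, 0, +1
set.seed(2)
y <- 20 + 0.5 * temp + rnorm(length(temp))   # simulated response (assumed model)

fit_raw   <- lm(y ~ temp)
fit_coded <- lm(y ~ temp_coded)

# The coded slope is the change in E(Y) for a move of one level (10 degrees F),
# so it equals 10 times the per-degree slope.
coef(fit_coded)["temp_coded"]
10 * coef(fit_raw)["temp"]
```

The coded slope is the "effect" in the sense used above: the estimated change in E(Y) when moving from the midpoint level to the high level.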

SLIDE 5

When a variable takes more than two or three values, it is sometimes standardized:

$$x_i^* = u_i = \frac{x_i - \bar{x}}{s_x}.$$

All coefficients are then in the units of Y, so they can be compared numerically. If Y is also standardized, the coefficients are dimensionless. These are called standardized regression coefficients, and are widely used in some fields. Despite what the text says, with modern algorithms standardization has no effect on computational errors.
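A minimal sketch of standardization in R, assuming two made-up predictors on very different scales (the data and coefficient values are simulated for illustration only):

```r
set.seed(3)
n  <- 40
x1 <- rnorm(n, mean = 100, sd = 20)   # hypothetical predictor on a large scale
x2 <- rnorm(n, mean = 1, sd = 0.1)    # hypothetical predictor on a small scale
y  <- 3 + 0.05 * x1 + 8 * x2 + rnorm(n)

# u = (x - mean(x)) / sd(x); scale() centers and scales its argument this way.
u1 <- as.numeric(scale(x1))
u2 <- as.numeric(scale(x2))

fit_raw <- lm(y ~ x1 + x2)
fit_std <- lm(y ~ u1 + u2)

# Each standardized coefficient is the raw coefficient times the predictor's
# standard deviation, so it is expressed in the units of Y.
coef(fit_std)[c("u1", "u2")]
coef(fit_raw)[c("x1", "x2")] * c(sd(x1), sd(x2))

# Standardizing Y as well gives dimensionless coefficients.
coef(lm(as.numeric(scale(y)) ~ u1 + u2))
```

The fitted values are the same in both fits; only the units in which the coefficients are reported change.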

SLIDE 6

Models with One Qualitative Variable

Recall: a qualitative variable with l levels is represented by (l − 1) indicator (or dummy) variables.
  • For a chosen reference level, all the indicator variables are 0;
  • for each other level, the corresponding indicator variable is 1, and the others are 0.
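The indicator coding can be inspected directly with `model.matrix()`. The five-observation factor below is a made-up example; only the state names are borrowed from the example that follows.

```r
# Hypothetical qualitative variable with l = 3 levels.
state <- factor(c("Kansas", "Kentucky", "Texas", "Kansas", "Texas"))

# model.matrix() shows the intercept plus l - 1 = 2 indicator columns;
# the reference level (the first level, Kansas) has 0 in both.
X <- model.matrix(~ state)
X
```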

SLIDE 7

Example: per-user software maintenance cost, by state (sample of 10 users per state).

    path <- file.path("Text", "Exercises&Examples", "BIDMAINT.txt")
    # stringsAsFactors = TRUE keeps STATE a factor under R >= 4.0, so plot() draws boxplots
    maint <- read.table(path, header = TRUE, stringsAsFactors = TRUE)
    plot(COST ~ STATE, maint)
    summary(lm(COST ~ STATE, maint))

    Call:
    lm(formula = COST ~ STATE, data = maint)

    Residuals:
        Min      1Q  Median      3Q     Max
    -299.80  -95.83  -37.90  153.32  295.20

SLIDE 8

    Coefficients:
                  Estimate Std. Error t value Pr(>|t|)
    (Intercept)     279.60      53.43   5.233 1.63e-05 ***
    STATEKentucky    80.30      75.56   1.063   0.2973
    STATETexas      198.20      75.56   2.623   0.0141 *
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    Residual standard error: 168.9 on 27 degrees of freedom
    Multiple R-squared:  0.205, Adjusted R-squared:  0.1462
    F-statistic: 3.482 on 2 and 27 DF,  p-value: 0.04515

SLIDE 9

The fitted equation is $E(Y) = 279.6 + 80.3x_1 + 198.2x_2$, where $x_1$ = indicator variable for Kentucky and $x_2$ = indicator variable for Texas. For Kansas, $x_1 = x_2 = 0$, so $E(Y) = 279.6$. That is, the “intercept” is actually the expected value for the reference state, Kansas.

SLIDE 10

For Kentucky, $x_1 = 1$ and $x_2 = 0$, so $E(Y) = 279.6 + 80.3 = 359.9$. That is, the coefficient STATEKentucky is the difference between the expected value for Kentucky and the expected value for the reference state. Similarly, the coefficient STATETexas is the difference between the expected value for Texas and the expected value for the reference state.
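The same arithmetic can be done for all three states at once. The sketch below starts from the reported coefficient estimates rather than the data file; the object names are chosen only for this illustration.

```r
# Coefficient estimates copied from the reported summary() output.
b <- c(Intercept = 279.6, Kentucky = 80.3, Texas = 198.2)

# Indicator values (x1, x2) for each state; Kansas is the reference level.
x <- rbind(Kansas = c(0, 0), Kentucky = c(1, 0), Texas = c(0, 1))

# Estimated E(Y) = b0 + b1 * x1 + b2 * x2 for each state.
expected <- drop(cbind(1, x) %*% b)
expected
```

For Texas this gives 279.6 + 198.2 = 477.8, the intercept plus the STATETexas coefficient.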

SLIDE 11

In R, the default reference level is the first in alphabetic order. The default can be overridden using the factor() function. Often these differences themselves are of no special interest, and the focus is on testing whether there are any differences: $H_0: \beta_1 = \beta_2 = \cdots = \beta_{l-1} = 0$ (one coefficient for each of the l − 1 indicator variables). The value of the F-statistic is unaffected by the choice of reference level.
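A small sketch of changing the reference level, using simulated costs because the course data file is not reproduced here. It uses `relevel()`, a base R alternative to calling `factor()` with an explicit `levels` argument; the means and standard deviation are illustrative assumptions.

```r
set.seed(4)
# Simulated costs for three states, 10 users each (values are illustrative only).
state <- factor(rep(c("Kansas", "Kentucky", "Texas"), each = 10))
cost  <- c(280, 360, 480)[as.integer(state)] + rnorm(30, sd = 170)

fit_default <- lm(cost ~ state)                          # reference: Kansas
fit_texas   <- lm(cost ~ relevel(state, ref = "Texas"))  # reference: Texas

# The individual coefficients change with the reference level ...
coef(fit_default)
coef(fit_texas)

# ... but the overall F-statistic does not.
summary(fit_default)$fstatistic
summary(fit_texas)$fstatistic
```

An equivalent way to set the reference level would be `factor(state, levels = c("Texas", "Kansas", "Kentucky"))`.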

SLIDE 12

Two Qualitative Variables

E.g., two brands of diesel engine and three types of fuel.

    path <- file.path("Text", "Exercises&Examples", "DIESEL.txt")
    # stringsAsFactors = TRUE keeps FUEL and BRAND as factors under R >= 4.0
    diesel <- read.table(path, header = TRUE, stringsAsFactors = TRUE)
    par(mfrow = c(1, 2))
    plot(PERFORM ~ FUEL + BRAND, diesel)

Try main-effects model (additive, no interaction):

    summary(aov(PERFORM ~ FUEL + BRAND, diesel))

Alternative interaction model:

    summary(aov(PERFORM ~ FUEL * BRAND, diesel))

SLIDE 13

Graph the interactions:

    with(diesel, interaction.plot(FUEL, BRAND, PERFORM))
    with(diesel, interaction.plot(BRAND, FUEL, PERFORM))

Complicated story: for F1 and F2, effects are additive, with B1 performing better than B2; for F3, B2 performs better than B1.

SLIDE 14

Three or More Qualitative Variables

With a response y and independent variables a, b, c, ..., a model might contain:
  • main effects: y ~ a + b + c + ...;
  • two-way interactions: y ~ a + b + c + a:b + a:c + b:c + ...;
  • higher-order interactions: y ~ a + b + c + a:b + a:c + b:c + a:b:c + ...

Often only main effects and low-order interactions are significant.
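R's formula shorthand generates these term lists automatically. The sketch below only expands formulas with `terms()`, so it needs no data; the variable names y, a, b, and c are the placeholders used in the text.

```r
# terms() expands a formula without needing any data.
main_only <- attr(terms(y ~ a + b + c), "term.labels")
two_way   <- attr(terms(y ~ (a + b + c)^2), "term.labels")
three_way <- attr(terms(y ~ a * b * c), "term.labels")

main_only   # main effects only
two_way     # main effects plus all two-way interactions
three_way   # everything up to the three-way interaction
```

Here `(a + b + c)^2` is shorthand for all terms up to second order, and `a * b * c` for all terms up to third order.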

SLIDE 15

To estimate the highest-order interactions, we need observations for all possible combinations of levels: a factorial design. E.g., 2 × 3 = 6 for the diesel engines. With several variables, all with at least 2 levels, the number of combinations can be large. Sometimes a carefully chosen fraction of all possible combinations is used: a fractional factorial design.
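The combinations in a full factorial design can be listed with `expand.grid()`. The level labels below follow the diesel example; the table of run counts for two-level variables is an added illustration of how quickly the design grows.

```r
# Full factorial for the diesel example: 3 fuels x 2 brands.
design <- expand.grid(FUEL = c("F1", "F2", "F3"), BRAND = c("B1", "B2"))
design
nrow(design)   # number of combinations

# With k two-level variables the full factorial has 2^k runs.
k <- 1:10
data.frame(k = k, runs = 2^k)
```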

SLIDE 16

Models with Both Quantitative and Qualitative Variables

Example: diesel engine performance Y, as a function of:
  • engine speed, $x_1$;
  • fuel type, with levels F1, F2, and F3; take F1 as the reference level, and $x_2$ and $x_3$ as indicators for F2 and F3, respectively.

SLIDE 17

Simple model, ignoring fuel type: second-order model in $x_1$:

$$E(Y) = \beta_0 + \beta_1 x_1 + \beta_2 x_1^2.$$

Additive model: include main effects of fuel type:

$$E(Y) = \beta_0 + \beta_1 x_1 + \beta_2 x_1^2 + \beta_3 x_2 + \beta_4 x_3.$$

Switching fuel from F1 to F2 adds $\beta_3$ to the performance Y, independently of engine speed $x_1$.

Interaction model:

$$E(Y) = \beta_0 + \beta_1 x_1 + \beta_2 x_1^2 + \beta_3 x_2 + \beta_4 x_3 + \beta_5 x_1 x_2 + \beta_6 x_1 x_3 + \beta_7 x_1^2 x_2 + \beta_8 x_1^2 x_3.$$

SLIDE 18

The interaction model is the complete second-order model. It allows E(Y) to be a different quadratic function of $x_1$ at each level of fuel type.

These models form a nested hierarchy, and we could choose among them using F-tests, or using an information criterion like AIC (Akaike’s Information Criterion). An intermediate model like

$$E(Y) = \beta_0 + \beta_1 x_1 + \beta_2 x_1^2 + \beta_3 x_2 + \beta_4 x_3 + \beta_5 x_1 x_2 + \beta_6 x_1 x_3$$

(that is, the interaction model with $\beta_7 = \beta_8 = 0$) might also be of interest.
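One way the nested hierarchy might be compared in R is sketched below. The data are simulated (engine speeds, fuel effects, and noise level are all assumptions for illustration), and the variable names `speed`, `fuel`, and `perf` are hypothetical stand-ins for $x_1$, fuel type, and Y.

```r
set.seed(5)
n     <- 90
speed <- runif(n, 10, 50)                                  # x1, hypothetical units
fuel  <- factor(sample(c("F1", "F2", "F3"), n, replace = TRUE))
# Simulated response with a fuel-specific linear term (an assumption for illustration).
perf  <- 50 + 2 * speed - 0.03 * speed^2 +
  c(F1 = 0, F2 = 5, F3 = -3)[as.character(fuel)] +
  c(F1 = 0, F2 = 0.2, F3 = 0)[as.character(fuel)] * speed + rnorm(n, sd = 2)

simple       <- lm(perf ~ speed + I(speed^2))                       # 3 coefficients
additive     <- lm(perf ~ speed + I(speed^2) + fuel)                # 5 coefficients
intermediate <- lm(perf ~ speed + I(speed^2) + fuel + speed:fuel)   # 7 coefficients
complete     <- lm(perf ~ (speed + I(speed^2)) * fuel)              # 9 coefficients

anova(simple, additive, intermediate, complete)   # sequential F-tests
AIC(simple, additive, intermediate, complete)     # smaller is better
```

Each row of the `anova()` table tests the terms added relative to the previous model, which is only meaningful because each model is nested in the next.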

SLIDE 19

Model Validation

Regression models are usually fitted to infer something about the behavior of the expected response, E(Y), beyond the particular sample used for fitting. Often the specific goal is to estimate E(Y) or predict Y for some combination of independent variables not in the fitting data. A model that fits well (high $R^2$) may not predict well; adjusted $R^2$, $R_a^2$, is a step in the right direction, but only a step.

SLIDE 20

The best validation is to actually collect new data, and compare predictions with actual responses. Label the m new responses as $y_{n+1}, \ldots, y_{n+m}$:

$$R^2_{\text{prediction}} = 1 - \frac{\sum_{i=n+1}^{n+m} (y_i - \hat{y}_i)^2}{\sum_{i=n+1}^{n+m} (y_i - \bar{y})^2}$$

$$\text{MSE}_{\text{prediction}} = \frac{\sum_{i=n+1}^{n+m} (y_i - \hat{y}_i)^2}{m}$$

Note: the denominator in $\text{MSE}_{\text{prediction}}$ is m, not m − (k + 1).
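These two statistics are simple to compute once predictions for the new data are available. The helper functions below are hypothetical names written for this sketch, and the data are simulated. The slide does not say which mean $\bar{y}$ refers to, so the helper takes it as an explicit argument; using the mean of the fitting data in the call is an assumption.

```r
# Hypothetical helpers; `ybar` is passed explicitly because the choice of mean
# (fitting data or new data) is an assumption the caller has to make.
r2_prediction <- function(y_new, y_hat, ybar) {
  1 - sum((y_new - y_hat)^2) / sum((y_new - ybar)^2)
}
mse_prediction <- function(y_new, y_hat) {
  sum((y_new - y_hat)^2) / length(y_new)   # divide by m, not m - (k + 1)
}

set.seed(6)
old <- data.frame(x = runif(30, 0, 10))     # n = 30 fitting observations
old$y <- 1 + 2 * old$x + rnorm(30)
new <- data.frame(x = runif(10, 0, 10))     # m = 10 new observations
new$y <- 1 + 2 * new$x + rnorm(10)

fit   <- lm(y ~ x, data = old)
y_hat <- predict(fit, newdata = new)

r2_prediction(new$y, y_hat, ybar = mean(old$y))
mse_prediction(new$y, y_hat)
```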

SLIDE 21

Almost as good: divide the original data into one part used for model building and fitting, and another “hold-out” part for validation. But for a true validation, the hold-out data are completely unused until the model building and fitting are complete.

SLIDE 22

Cross Validation

If we do not have enough data to hold out part for a true validation, we can use cross validation:
  • First build (and fit) a model to the whole data set.
  • Then leave out part of the data (one half? one third? one fifth? a single observation?), refit the model to the remainder, and use the refitted model to predict the held-out data.
  • Repeat so that all parts of the data are held out in turn.
  • Calculate $R^2$ and MSE as for a true validation.

Cross validation is not true validation, because the whole data set is used to build the model and then re-used in the validation.
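The steps above can be sketched as a short loop. This is one possible implementation with five folds on simulated data; the fold count, the data, and the choice of the overall mean in the $R^2$ denominator are all assumptions for illustration.

```r
set.seed(7)
dat <- data.frame(x = runif(50, 0, 10))
dat$y <- 1 + 2 * dat$x + rnorm(50)          # simulated data (illustrative)

k    <- 5                                    # leave out one fifth at a time
fold <- sample(rep(1:k, length.out = nrow(dat)))
pred <- numeric(nrow(dat))

for (j in 1:k) {
  held_out <- fold == j
  refit <- lm(y ~ x, data = dat[!held_out, ])                  # refit without fold j
  pred[held_out] <- predict(refit, newdata = dat[held_out, ])  # predict fold j
}

# R^2 and MSE computed from the held-out predictions, as for a true validation.
cv_mse <- mean((dat$y - pred)^2)
cv_r2  <- 1 - sum((dat$y - pred)^2) / sum((dat$y - mean(dat$y))^2)
c(MSE = cv_mse, R2 = cv_r2)
```

Setting `k` equal to the number of observations gives leave-one-out cross validation.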

SLIDE 23

The Jackknife

The jackknife is closely related to leave-out-one cross validation. It uses deletion residuals:
  • Delete one observation, say $y_i$;
  • Refit the model, and use it to predict the deleted observation as $\hat{y}_{(i)}$;
  • The deleted residual (or prediction residual) is $d_i = y_i - \hat{y}_{(i)}$.

Two $R^2$-like statistics:

$$R^2_{\text{jackknife}} = 1 - \frac{\sum (y_i - \hat{y}_{(i)})^2}{\sum (y_i - \bar{y})^2}, \qquad P^2 = 1 - \frac{\sum (y_i - \hat{y}_{(i)})^2}{\sum (y_i - \bar{y}_{(i)})^2}.$$
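A sketch of the deleted residuals on simulated data follows. It computes them by brute-force refitting and, as a check, by the standard least-squares identity $d_i = e_i / (1 - h_{ii})$, where $e_i$ is the ordinary residual and $h_{ii}$ the leverage. That identity is not stated on the slide and is brought in here as a known result; $\bar{y}_{(i)}$ is taken to be the mean of the responses with $y_i$ deleted.

```r
set.seed(8)
dat <- data.frame(x = runif(25, 0, 10))
dat$y <- 1 + 2 * dat$x + rnorm(25)          # simulated data (illustrative)
n <- nrow(dat)

# Deleted residuals by brute force: drop observation i, refit, predict it.
d <- sapply(seq_len(n), function(i) {
  refit <- lm(y ~ x, data = dat[-i, ])
  dat$y[i] - predict(refit, newdata = dat[i, , drop = FALSE])
})

# Shortcut for least squares: d_i = e_i / (1 - h_ii), no refitting needed.
fit <- lm(y ~ x, data = dat)
d_shortcut <- residuals(fit) / (1 - hatvalues(fit))
all.equal(unname(d), unname(d_shortcut))

ybar_del <- (sum(dat$y) - dat$y) / (n - 1)  # mean of y with y_i deleted
r2_jackknife <- 1 - sum(d^2) / sum((dat$y - mean(dat$y))^2)
p2           <- 1 - sum(d^2) / sum((dat$y - ybar_del)^2)
c(R2_jackknife = r2_jackknife, P2 = p2)
```

Because $y_i - \bar{y}_{(i)} = \frac{n}{n-1}(y_i - \bar{y})$, the denominator of $P^2$ is the larger of the two, so $P^2$ is slightly larger than $R^2_{\text{jackknife}}$.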
