STAT 213 Interactions in Multiple Regression, Colin Reimer Dawson (PowerPoint PPT Presentation)




SLIDE 1

Outline Refresher: The Multiple Regression Model

STAT 213 Interactions in Multiple Regression

Colin Reimer Dawson

Oberlin College

29 March 2016

SLIDE 2

Outline

Refresher: The Multiple Regression Model
  • Defining the Model
  • R2 and Parsimony
  • CIs and PIs for MLR

SLIDE 3

Reading Quiz

An environmental expert is interested in modeling the concentration of various chemicals in well water over time. Identify the regression model that would be used to predict the amount of lead (Lead) in a well based on Year, with two different lines depending on whether or not the well has been cleaned (Iclean).
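One standard way to write such a two-line model (a sketch, not the quiz's official answer; it uses an interaction between Year and the indicator Iclean) is:

```latex
% Lead concentration over time, with separate lines by cleaning status.
% Iclean = 1 if the well has been cleaned, 0 otherwise.
Lead = \beta_0 + \beta_1\,Year + \beta_2\,Iclean + \beta_3\,(Year \times Iclean) + \varepsilon
% Uncleaned wells (Iclean = 0): intercept \beta_0,           slope \beta_1
% Cleaned wells   (Iclean = 1): intercept \beta_0 + \beta_2, slope \beta_1 + \beta_3
```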

SLIDE 4

For Thursday

  • Read: 4.4, 7.5
  • Write up (as a lab): 3.20, 3.30
  • Answer: 4.12, 7.30
SLIDE 5

Outline

Refresher: The Multiple Regression Model
  • Defining the Model
  • R2 and Parsimony
  • CIs and PIs for MLR

SLIDE 6

Outline

Refresher: The Multiple Regression Model
  • Defining the Model
  • R2 and Parsimony
  • CIs and PIs for MLR

SLIDE 7

The Multiple Regression Model

DATA = PATTERN + IDIOSYNCRASIES

The Multiple Regression Population Model

Y = f(X1, . . . , Xk) + ε

Y = β0 + β1X1 + · · · + βkXk + ε

One βj for each predictor Xj
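As a minimal numerical sketch of "one βj per predictor" (made-up data, not from the slides; ε is set to zero here so least squares recovers the coefficients exactly):

```python
import numpy as np

# Hypothetical data following Y = beta0 + beta1*X1 + beta2*X2 (no error term).
rng = np.random.default_rng(0)
n = 50
X1 = rng.normal(size=n)
X2 = rng.normal(size=n)
Y = 2.0 + 3.0 * X1 - 1.0 * X2          # beta0 = 2, beta1 = 3, beta2 = -1

# Design matrix: a column of ones for the intercept, then one column per predictor.
X = np.column_stack([np.ones(n), X1, X2])
beta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
print(beta_hat)  # approximately [2., 3., -1.]
```

With a real error term ε the estimates would only be close to (2, 3, −1), not exact.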

SLIDE 8

The Four-Step Process: Multiple Regression

  • 1. CHOOSE a form of the model
  • Select predictors
  • Choose any transformations of predictors
  • 2. FIT: Estimate
  • coefficients: β̂0, β̂1, . . . , β̂k
  • residual variance: σ̂²ε

  • 3. ASSESS the fit
  • Examine residuals
  • Test individual predictors (t-tests)
  • Test overall fit (ANOVA, R2)
  • 4. USE the model
  • Make predictions
  • Construct CIs and PIs
SLIDE 9

Checking Conditions

Same conditions as always apply:

  • 1. Linearity (mean of Y is given by some linear model)
  • 2. Independence (residuals are not correlated)
  • 3. Homoskedasticity (same variance at all combinations of X)
  • 4. Normality (residuals normally distributed)
SLIDE 10

Testing Individual Predictors (t-tests)

library(Stat2Data)
library(dplyr)  # mutate() comes from dplyr (not attached in the original snippet)
data("Pulse")
PulseWithBMI <- mutate(
  Pulse,
  BMI = Wgt / Hgt^2 * 703,   # BMI from weight (lb) and height (in)
  InvActive = 1 / Active,
  InvRest = 1 / Rest,
  Male = 1 - Gender)
active.model <- lm(InvActive ~ InvRest + Hgt + BMI, data = PulseWithBMI)

SLIDE 11

summary(active.model)

Call:
lm(formula = InvActive ~ InvRest + Hgt + BMI, data = PulseWithBMI)

Residuals:
       Min         1Q     Median         3Q        Max
-0.0053245 -0.0010301  0.0000241  0.0011322  0.0052298

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  3.333e-04  2.187e-03   0.152   0.8790
InvRest      6.506e-01  5.547e-02  11.728   <2e-16 ***
Hgt          5.125e-05  3.376e-05   1.518   0.1304
BMI         -9.052e-05  3.875e-05  -2.336   0.0204 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.001787 on 228 degrees of freedom
Multiple R-squared: 0.4026, Adjusted R-squared: 0.3947
F-statistic: 51.21 on 3 and 228 DF, p-value: < 2.2e-16
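As a quick sanity check on the output above, each t value in the coefficient table is just Estimate divided by Std. Error:

```python
# t value for InvRest, using the numbers printed in the summary() table.
estimate = 6.506e-01
std_error = 5.547e-02
t_value = estimate / std_error
print(round(t_value, 2))  # ~11.73, matching the 11.728 that R prints
```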

SLIDE 12

Controls

In the context of a multiple regression model, the t-test for a predictor tests for a linear association after controlling for the other predictors.
SLIDE 13

Testing the Overall Model

H0 : β1 = β2 = · · · = βk = 0
H1 : some βj ≠ 0

F = MSModel / MSError = [ Σ i=1..n (Ŷi − Ȳ)² / k ] / [ Σ i=1..n (Yi − Ŷi)² / (n − k − 1) ]

[Figure: density curve of the F distribution]
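The F statistic can also be computed from R² alone, since SSModel = R² · SSTotal: F = (R²/k) / ((1 − R²)/(n − k − 1)). Checking this against the Pulse model's summary output (R² = 0.4026, k = 3, and n = 232, since the error df was 228):

```python
# Recompute the overall F statistic from R^2 (numbers from the model output above).
r2 = 0.4026      # Multiple R-squared
k = 3            # number of predictors
n = 232          # 228 error df + 3 predictors + 1 intercept
f_stat = (r2 / k) / ((1 - r2) / (n - k - 1))
print(round(f_stat, 2))  # ~51.22, matching R's "F-statistic: 51.21"
```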

SLIDE 14

Adjusted R2

  • R2 can only go up as we add predictors: at worst, we can set the new coefficient β̂k+1 = 0 and get the same SSE. Usually we can pick coefficients that do somewhat better.
  • We would like to “penalize” unnecessary predictors.
SLIDE 15

Adjusted R2

R2adj = 1 − [ SSError/(n − k − 1) ] / [ SSTotal/(n − 1) ] = 1 − σ̂²ε / s²Y

Equivalently:

1 − R2adj = (1 − R2) / (dfError / dfTotal)
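Plugging the Pulse model's numbers (R2 = 0.4026, n = 232, k = 3) into the second form reproduces the adjusted R2 that summary() reported:

```python
# Adjusted R^2 via 1 - (1 - R^2) * (df_total / df_error).
r2 = 0.4026
n, k = 232, 3
df_total = n - 1          # 231
df_error = n - k - 1      # 228
r2_adj = 1 - (1 - r2) * df_total / df_error
print(round(r2_adj, 4))  # 0.3947, matching "Adjusted R-squared: 0.3947"
```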

SLIDE 16

Outline

Refresher: The Multiple Regression Model
  • Defining the Model
  • R2 and Parsimony
  • CIs and PIs for MLR

SLIDE 17

What happens to R2 as we add predictors?

Worksheet

SLIDE 18

What Makes a Good Model?

Fit                  Validity
High R2              Strong evidence for predictors
Small SSE            Simple (Parsimonious)
Large F              Generalizes outside sample

SLIDE 19

Why Does Parsimony Matter?

Don’t we just care about good predictions? Not exclusively...

  • We also use models to understand the world (harder with more complexity)

And even so...

  • We really care about making predictions for data we haven’t seen yet.

SLIDE 20

Outline

Refresher: The Multiple Regression Model
  • Defining the Model
  • R2 and Parsimony
  • CIs and PIs for MLR

SLIDE 21

CIs and PIs

Confidence and Prediction Intervals have the same interpretation as in the single-predictor case:

  • C% CI: a procedure that produces an interval at a particular (X1, . . . , Xk) which will contain the true mean response, f(X1, . . . , Xk), for C% of data sets.
  • C% PI: a procedure that produces an interval at a particular (X1, . . . , Xk) which will contain the Y value of a new case for C% of “data sets plus a case”.
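For reference, a sketch of the standard interval formulas (same form as in simple regression; here SE_fit denotes the standard error of the fitted mean at the chosen predictor values, a symbol not used on the slides):

```latex
% C% confidence interval for the mean response at (X_1, \ldots, X_k):
\hat{Y} \pm t^{*}_{n-k-1} \cdot SE_{\text{fit}}
% C% prediction interval for a new case at the same predictor values
% (adds the residual variance to the uncertainty in the fitted mean):
\hat{Y} \pm t^{*}_{n-k-1} \cdot \sqrt{SE_{\text{fit}}^{2} + \hat{\sigma}_{\varepsilon}^{2}}
```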