

SLIDE 1

Multivariate Regression

Marc H. Mehlman

marcmehlman@yahoo.com

University of New Haven

Marc Mehlman (University of New Haven) Multivariate Regression 1 / 21

SLIDE 2

Table of Contents

1 Multivariate Regression
2 Confidence Intervals and Significance Tests
3 ANOVA Tables for Multivariate Regression
4 Chapter #11 R Assignment


SLIDE 3

Multivariate Regression

SLIDE 4

Multivariate Regression

Given multivariate data,

$(x_1^{(1)}, x_2^{(1)}, \cdots, x_k^{(1)}, y_1),\ (x_1^{(2)}, x_2^{(2)}, \cdots, x_k^{(2)}, y_2),\ \cdots,\ (x_1^{(n)}, x_2^{(n)}, \cdots, x_k^{(n)}, y_n)$

where $(x_1^{(i)}, x_2^{(i)}, \cdots, x_k^{(i)})$ is a predictor of the response $y_i$, one explores the following possible model.

Definition (Statistical Model of Multivariate Linear Regression)
Given a $k$-dimensional multivariate predictor, $(x_1^{(i)}, x_2^{(i)}, \cdots, x_k^{(i)})$, the response, $y_i$, is

$y_i = \beta_0 + \beta_1 x_1^{(i)} + \cdots + \beta_k x_k^{(i)} + \epsilon_i,$

where $\beta_0 + \beta_1 x_1^{(i)} + \cdots + \beta_k x_k^{(i)}$ is the mean response. The noise terms, the $\epsilon_i$'s, are assumed to be independent of each other and to be randomly sampled from $N(0, \sigma)$. The parameters of the model are $\beta_0, \beta_1, \cdots, \beta_k$ and $\sigma$.
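A quick way to see the model in action is to simulate data from it and recover the parameters with `lm`. The sketch below is illustrative only: the sample size, predictor names, and parameter values are all invented.

```r
# Simulate n = 200 observations from the model with k = 2 predictors.
# (Names and parameter values here are made up for illustration.)
set.seed(1)
n <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
beta0 <- 1; beta1 <- 2; beta2 <- -3; sigma <- 0.5
eps <- rnorm(n, mean = 0, sd = sigma)        # noise sampled from N(0, sigma)
y <- beta0 + beta1 * x1 + beta2 * x2 + eps   # mean response plus noise
fit <- lm(y ~ x1 + x2)
round(coef(fit), 2)   # estimates land close to (1, 2, -3)
```

With this much data and little noise, the fitted coefficients sit within a few hundredths of the true parameters.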


SLIDE 5

Multivariate Regression

Definition
Given a multivariate normal sample,

$(x_1^{(1)}, \cdots, x_k^{(1)}, y_1),\ \cdots,\ (x_1^{(n)}, \cdots, x_k^{(n)}, y_n),$

the least-squares multiple regression equation, $\hat{y} = b_0 + b_1 x_1 + \cdots + b_k x_k$, is the linear equation that minimizes

$\sum_{j=1}^{n} (\hat{y}_j - y_j)^2$, where $\hat{y}_j \stackrel{\text{def}}{=} b_0 + b_1 x_1^{(j)} + \cdots + b_k x_k^{(j)}$.
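The minimizing property can be checked numerically: any perturbation of the coefficients that `lm` returns can only increase the sum of squared residuals. A small sketch on the built-in `mtcars` data (the perturbation sizes are arbitrary):

```r
# lm's coefficients minimize the sum of squared residuals:
# nudging them in any direction makes the sum larger.
fit <- lm(mpg ~ wt + hp, data = mtcars)
ss <- function(b) {
  yhat <- b[1] + b[2] * mtcars$wt + b[3] * mtcars$hp
  sum((yhat - mtcars$mpg)^2)
}
ss(coef(fit))                  # the minimal sum of squares
ss(coef(fit) + c(0.5, 0, 0))   # larger
ss(coef(fit) + c(0, -0.1, 0))  # larger still
```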


SLIDE 6

Multivariate Regression

There must be at least $k + 2$ data points to obtain the estimators $b_0$, the $b_j$'s and

$s^2 \stackrel{\text{def}}{=} \dfrac{\sum_{j=1}^{n} (y_j - \hat{y}_j)^2}{n - k - 1}$

of $\beta_0$, the $\beta_j$'s and $\sigma^2$, where

$b_0$, the $y$-intercept, is the unbiased least-squares estimator of $\beta_0$.
$b_j$, the coefficient of $x_j$, is the unbiased least-squares estimator of $\beta_j$.
$s^2$ is an unbiased estimator of $\sigma^2$ and $s$ is an estimator of $\sigma$.

Due to computational intensity, computers are used to obtain $b_0$, the $b_j$'s and $s^2$.
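For the `mtcars` model used later in these slides ($k = 4$ predictors, $n = 32$ cars), $s$ can be computed directly from this definition and compared with what R reports:

```r
# s^2 = SSE / (n - k - 1); for mpg ~ disp + hp + wt + qsec,
# k = 4 and n = 32, so s is reported on 27 degrees of freedom.
fit <- lm(mpg ~ disp + hp + wt + qsec, data = mtcars)
n <- nrow(mtcars); k <- 4
s2 <- sum(residuals(fit)^2) / (n - k - 1)
sqrt(s2)     # about 2.622, the "residual standard error" in summary(fit)
sigma(fit)   # R computes the same quantity
```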


SLIDE 7

Confidence Intervals and Significance Tests

SLIDE 8

Confidence Intervals and Significance Tests

Due to computational intensity, computer programs are used for multiple regression. In particular, computers are used to calculate the $SE_{b_j}$'s, the standard errors of the $b_j$'s.

Theorem
To test the hypothesis $H_0 : \beta_j = 0$, use the test statistic $t = \dfrac{b_j}{SE_{b_j}} \sim t(n - k - 1)$ under $H_0$. A level $(1 - \alpha)100\%$ confidence interval for $\beta_j$ is $b_j \pm t^*(n - k - 1)\, SE_{b_j}$.

Accepting $H_0 : \beta_j = 0$ is accepting that there is no linear association between $X_j$ and $Y$, i.e. that the correlation between $X_j$ and $Y$ is zero.
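The theorem can be replayed by hand against R's own output; here for the `wt` coefficient of the model fitted in the next example:

```r
# Reproduce the t statistic, p-value, and 95% CI for wt by hand.
fit <- lm(mpg ~ disp + hp + wt + qsec, data = mtcars)
est <- summary(fit)$coefficients
n <- nrow(mtcars); k <- 4
t_wt <- est["wt", "Estimate"] / est["wt", "Std. Error"]      # b_j / SE_{b_j}
p_wt <- 2 * pt(abs(t_wt), df = n - k - 1, lower.tail = FALSE)
tstar <- qt(0.975, df = n - k - 1)                           # t*(n - k - 1)
ci_wt <- est["wt", "Estimate"] + c(-1, 1) * tstar * est["wt", "Std. Error"]
ci_wt   # agrees with confint(fit)["wt", ]
```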


SLIDE 9

Confidence Intervals and Significance Tests

Example

> g.lm=lm(mpg~disp+hp+wt+qsec, data=mtcars)
> par(mfrow=c(2,2))
> plot(g.lm)
> par(mfrow=c(1,1))

Does the linear model fit?

[Figure: the four standard lm diagnostic plots — Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage with Cook's distance contours. The Chrysler Imperial, Fiat 128, Toyota Corolla and Maserati Bora are the labeled extreme points.]


SLIDE 10

Confidence Intervals and Significance Tests

Example (cont.)

> summary(g.lm)

Call:
lm(formula = mpg ~ disp + hp + wt + qsec, data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max
-3.8664 -1.5819 -0.3788  1.1712  5.6468

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 27.329638   8.639032   3.164  0.00383 **
disp         0.002666   0.010738   0.248  0.80576
hp          -0.018666   0.015613  -1.196  0.24227
wt          -4.609123   1.265851  -3.641  0.00113 **
qsec         0.544160   0.466493   1.166  0.25362
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.622 on 27 degrees of freedom
Multiple R-squared:  0.8351,    Adjusted R-squared:  0.8107
F-statistic: 34.19 on 4 and 27 DF,  p-value: 3.311e-10


SLIDE 11

Confidence Intervals and Significance Tests

Example (cont.)
And to find confidence intervals for the coefficients:

> confint(g.lm)
                  2.5 %      97.5 %
(Intercept)  9.60380809 45.05546784
disp        -0.01936545  0.02469831
hp          -0.05070153  0.01336912
wt          -7.20643496 -2.01181027
qsec        -0.41300458  1.50132521


SLIDE 12

ANOVA Tables for Multivariate Regression

SLIDE 13

ANOVA Tables for Multivariate Regression

Definition

$SSA \stackrel{\text{def}}{=}$ Sum of Squares of Model $= \sum_{j=1}^{n} (\hat{y}_j - \bar{y})^2$

$SSE \stackrel{\text{def}}{=}$ Sum of Squares of Error $= \sum_{j=1}^{n} (y_j - \hat{y}_j)^2$

$SSTOT \stackrel{\text{def}}{=}$ Sum of Squares of Total $= \sum_{j=1}^{n} (y_j - \bar{y})^2$

$MSA \stackrel{\text{def}}{=}$ Mean Square of Model $= \dfrac{SSA}{k}$

$MSE \stackrel{\text{def}}{=}$ Mean Square of Error $= \dfrac{SSE}{n - k - 1}$

Theorem
$SSTOT = SSA + SSE$.
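The decomposition in the theorem can be verified numerically on the running `mtcars` example:

```r
# Verify SSTOT = SSA + SSE for the model from the running example.
fit <- lm(mpg ~ disp + hp + wt + qsec, data = mtcars)
y <- mtcars$mpg
yhat <- fitted(fit)
SSA   <- sum((yhat - mean(y))^2)   # model sum of squares
SSE   <- sum((y - yhat)^2)         # error sum of squares
SSTOT <- sum((y - mean(y))^2)      # total sum of squares
all.equal(SSTOT, SSA + SSE)        # TRUE
```

The identity holds exactly (up to floating point) for any least-squares fit that includes an intercept.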


SLIDE 14

ANOVA Tables for Multivariate Regression

Theorem (ANOVA F Test for Multivariate Regression)
The test statistic for $H_0 : \beta_1 = \beta_2 = \cdots = \beta_k = 0$ versus $H_A$: not $H_0$ is $f = \dfrac{MSA}{MSE}$. The p-value of the test is $P(F \geq f)$, where $F \sim F(k, n - k - 1)$ under $H_0$.

Statistical software usually summarizes the calculations and conclusion above in an ANOVA table:

Definition (ANOVA Table)

Source   df          SS      MS       F         p-value
Model    k           SSA     MSA      MSA/MSE   P(F(k, n - k - 1) >= f)
Error    n - k - 1   SSE     MSE
Total    n - 1       SSTOT
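The F statistic and its p-value can be computed directly from the definitions and checked against what `summary` reports:

```r
# Compute f = MSA/MSE and its p-value by hand for the running example.
fit <- lm(mpg ~ disp + hp + wt + qsec, data = mtcars)
y <- mtcars$mpg; n <- nrow(mtcars); k <- 4
MSA <- sum((fitted(fit) - mean(y))^2) / k
MSE <- sum(residuals(fit)^2) / (n - k - 1)
f <- MSA / MSE                               # about 34.19
p <- pf(f, k, n - k - 1, lower.tail = FALSE) # about 3.3e-10
```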


SLIDE 15

ANOVA Tables for Multivariate Regression

Definition
The squared multiple correlation is given by $R^2 \stackrel{\text{def}}{=} \dfrac{SSA}{SSTOT}$. The multiple correlation coefficient is just $R = \sqrt{R^2}$.

SSA measures how much of the variation in the data is explained by the model. By taking the ratio of SSA to the total amount of variation, SSTOT, one obtains $R^2$, the proportion of the variation that is explained by the model. In fact, $R$ is just the correlation between the observations and the predicted values.

Inflation Problem: As $k$ increases, $R^2$ increases, but the increase in predictability is illusory.
Solution: It is best to use

Definition
The adjusted coefficient of determination is $R^2_{adj} = 1 - \dfrac{n - 1}{n - k - 1}(1 - R^2)$.


SLIDE 16

ANOVA Tables for Multivariate Regression

> g.lm=lm(mpg~disp+hp+wt+qsec, data=mtcars)
> summary(g.lm)

Call:
lm(formula = mpg ~ disp + hp + wt + qsec, data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max
-3.8664 -1.5819 -0.3788  1.1712  5.6468

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 27.329638   8.639032   3.164  0.00383 **
disp         0.002666   0.010738   0.248  0.80576
hp          -0.018666   0.015613  -1.196  0.24227
wt          -4.609123   1.265851  -3.641  0.00113 **
qsec         0.544160   0.466493   1.166  0.25362
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.622 on 27 degrees of freedom
Multiple R-squared:  0.8351,    Adjusted R-squared:  0.8107
F-statistic: 34.19 on 4 and 27 DF,  p-value: 3.311e-10

Over 80% of the variation is explained by the model, but it seems that only weight matters.


SLIDE 17

ANOVA Tables for Multivariate Regression

> h.lm=lm(mpg~wt, data=mtcars)
> summary(h.lm)

Call:
lm(formula = mpg ~ wt, data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max
-4.5432 -2.3647 -0.1252  1.4096  6.8727

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  37.2851     1.8776  19.858  < 2e-16 ***
wt           -5.3445     0.5591  -9.559 1.29e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.046 on 30 degrees of freedom
Multiple R-squared:  0.7528,    Adjusted R-squared:  0.7446
F-statistic: 91.38 on 1 and 30 DF,  p-value: 1.294e-10

Using only weight, about 75% of the variation in the model is accounted for. Displacement, horsepower and quarter-second times did not have much predictive worth.


SLIDE 18

ANOVA Tables for Multivariate Regression

> anova(g.lm)
Analysis of Variance Table

Response: mpg
          Df Sum Sq Mean Sq  F value    Pr(>F)
disp       1 808.89  808.89 117.6500 2.415e-11 ***
hp         1  33.67   33.67   4.8965  0.035553 *
wt         1  88.50   88.50  12.8724  0.001302 **
qsec       1   9.36    9.36   1.3607  0.253616
Residuals 27 185.64    6.88
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Since

SSA = 808.89 + 33.67 + 88.50 + 9.36 = 940.42
MSA = 940.42/4 = 235.105
MSE = 185.64/27 ≈ 6.88
f = MSA/MSE ≈ 34.19
P(F(4, 27) ≥ 34.19) = 3.315872e-10

one has

Source   df   Sum of Squares   Mean Square       F              p
Model     4           940.42       235.105   34.19   3.315872e-10
Error    27           185.64          6.88
Total    31         1,126.06


SLIDE 19

ANOVA Tables for Multivariate Regression

Factor Analysis: One strives for the best fit (largest $R^2$ and smallest p-value associated with the F statistic) with the fewest independent variables. Independent variables that are "mostly independent" of the dependent variable, or highly correlated with another independent variable, can be discarded. It is an art. Doing this mechanically (by machine) is called stepwise regression.
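Base R's `step` function is one mechanical version of this idea. Note it selects by AIC rather than by the $R^2$/p-value criteria described above, so the sketch below illustrates the flavor of stepwise regression rather than that exact procedure:

```r
# Backward stepwise selection by AIC on the running mtcars example.
full <- lm(mpg ~ disp + hp + wt + qsec, data = mtcars)
reduced <- step(full, direction = "backward", trace = 0)
formula(reduced)          # the retained explanatory variables
c(AIC(full), AIC(reduced))  # the reduced model's AIC is no worse
```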


SLIDE 20

Chapter #11 R Assignment

SLIDE 21

Chapter #11 R Assignment

First enter into R:

> class(state.x77)               # "lm" needs a data.frame, not a matrix
[1] "matrix"
> st = as.data.frame(state.x77)  # make state.x77 a data.frame
> class(st)                      # "st" is a data.frame
[1] "data.frame"
> colnames(st)[4] = "Life.Exp"   # no spaces in variable names
> colnames(st)[6] = "HS.Grad"    # no spaces in variable names

1. Do a multivariate regression with "Life.Exp" as the response variable and "Population", "Income", "Illiteracy", "Murder", "HS.Grad", "Frost" and "Area" as explanatory variables.
   (a) Show that the multivariate regression linear model fits this data.
   (b) What are $R^2$ and adjusted $R^2$?
   (c) Which explanatory variables are relevant at the 0.05 significance level?
   (d) Find 95% confidence intervals for the y-intercept and for each of the coefficients of the explanatory variables.

2. Do another multivariate regression, but with only the explanatory variables "Murder" and "HS.Grad".
   (a) Show that the multivariate regression linear model fits this data.
   (b) What are $R^2$ and adjusted $R^2$?
   (c) Find 95% confidence intervals for the y-intercept and for each of the coefficients of the explanatory variables.

3. Comparing the adjusted $R^2$ in the above two problems, what do you conclude?
