
Multivariate Regression, Marc H. Mehlman (marcmehlman@yahoo.com) - PowerPoint PPT presentation



  1. Multivariate Regression
     Marc H. Mehlman, marcmehlman@yahoo.com
     University of New Haven (slide 1 / 21)

  2. Table of Contents
     1. Multivariate Regression
     2. Confidence Intervals and Significance Tests
     3. ANOVA Tables for Multivariate Regression
     4. Chapter #11 R Assignment

  3. Multivariate Regression

  4. Multivariate Regression

     Given multivariate data,

         (x_1^{(1)}, x_2^{(1)}, \ldots, x_k^{(1)}, y_1),\ (x_1^{(2)}, x_2^{(2)}, \ldots, x_k^{(2)}, y_2),\ \ldots,\ (x_1^{(n)}, x_2^{(n)}, \ldots, x_k^{(n)}, y_n),

     where x_1^{(i)}, x_2^{(i)}, \ldots, x_k^{(i)} is a predictor of the response y_i, one explores the following possible model.

     Definition (Statistical Model of Multivariate Linear Regression)
     Given a k-dimensional multivariate predictor (x_1^{(i)}, x_2^{(i)}, \ldots, x_k^{(i)}), the response y_i is

         y_i = \beta_0 + \beta_1 x_1^{(i)} + \cdots + \beta_k x_k^{(i)} + \epsilon_i,

     where \beta_0 + \beta_1 x_1^{(i)} + \cdots + \beta_k x_k^{(i)} is the mean response. The noise terms, the \epsilon_i's, are assumed to be independent of each other and to be randomly sampled from N(0, \sigma). The parameters of the model are \beta_0, \beta_1, \ldots, \beta_k and \sigma.

  5. Multivariate Regression

     Definition
     Given a multivariate normal sample (x_1^{(1)}, \ldots, x_k^{(1)}, y_1), \ldots, (x_1^{(n)}, \ldots, x_k^{(n)}, y_n), the least-squares multiple regression equation,

         \hat{y} = b_0 + b_1 x_1 + \cdots + b_k x_k,

     is the linear equation that minimizes

         \sum_{j=1}^{n} (\hat{y}_j - y_j)^2,

     where \hat{y}_j \overset{\text{def}}{=} b_0 + b_1 x_1^{(j)} + \cdots + b_k x_k^{(j)}.

  6. Multivariate Regression

     There must be at least k + 2 data points to obtain the estimators b_0, the b_j's and

         s^2 \overset{\text{def}}{=} \frac{\sum_{j=1}^{n} (y_j - \hat{y}_j)^2}{n - k - 1}

     of \beta_0, the \beta_j's and \sigma^2, where

     - b_0, the y-intercept, is the unbiased, least-squares estimator of \beta_0.
     - b_j, the coefficient of x_j, is the unbiased, least-squares estimator of \beta_j.
     - s^2 is an unbiased estimator of \sigma^2, and s is an estimator of \sigma.

     Due to computational intensity, computers are used to obtain b_0, the b_j's and s^2.
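As an illustration (not from the deck, which does everything in R), the least-squares estimates and s^2 can be computed directly on simulated data. A minimal Python sketch with made-up parameters, solving the least-squares problem with NumPy:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up true parameters, purely for illustration
n, k, sigma = 200, 2, 2.0
beta = np.array([1.0, 2.0, -1.0])  # beta_0, beta_1, beta_2

# Design matrix: a leading column of 1s for the intercept, then k predictors
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
y = X @ beta + rng.normal(scale=sigma, size=n)

# b_0, b_1, ..., b_k: the least-squares estimates
b, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ b

# s^2 = sum_j (y_j - y_hat_j)^2 / (n - k - 1): unbiased estimator of sigma^2
s2 = np.sum((y - y_hat) ** 2) / (n - k - 1)
print(np.round(b, 2), round(float(np.sqrt(s2)), 2))
```

With n = 200 points the estimates land close to the true beta's and s close to sigma, as the slide's unbiasedness claims suggest.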

  7. Confidence Intervals and Significance Tests

  8. Confidence Intervals and Significance Tests

     Due to computational intensity, computer programs are used with multiple regression. In particular, computers are used to calculate the SE_{b_j}'s, the standard errors of the b_j's.

     Theorem
     To test the hypothesis H_0: \beta_j = 0, use the test statistic

         t = \frac{b_j}{SE_{b_j}} \sim t(n - k - 1) \text{ under } H_0.

     A level (1 - \alpha)100% confidence interval for \beta_j is

         b_j \pm t^*(n - k - 1)\, SE_{b_j}.

     Accepting H_0: \beta_j = 0 is accepting that there is no linear association between X_j and Y, i.e. that the correlation between X_j and Y is zero.
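The t statistic and its p-value can be reproduced by hand from the summary(g.lm) output shown in the mtcars example. A small Python check (SciPy assumed available) using the wt row, whose estimate is -4.609123 with standard error 1.265851:

```python
from scipy import stats

# Numbers read off the summary(g.lm) output for mtcars
b_wt, se_wt = -4.609123, 1.265851  # estimate and standard error for wt
n, k = 32, 4                       # 32 cars, 4 predictors

# t = b_j / SE_{b_j}, compared against t(n - k - 1)
t_stat = b_wt / se_wt
p_value = 2 * stats.t.sf(abs(t_stat), df=n - k - 1)

print(round(t_stat, 3), round(p_value, 5))
```

This recovers the t value -3.641 and p-value 0.00113 printed in the wt row of the summary.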

  9. Confidence Intervals and Significance Tests

     Example
     > g.lm=lm(mpg~disp+hp+wt+qsec, data=mtcars)
     > par(mfrow=c(2,2))
     > plot(g.lm)
     > par(mfrow=c(1,1))

     Does the linear model fit?

     [Figure: the four standard lm diagnostic plots for g.lm: Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage (with Cook's distance contours). Chrysler Imperial, Fiat 128, and Toyota Corolla are labeled as the largest residuals; Maserati Bora stands out for high leverage.]

  10. Confidence Intervals and Significance Tests

      Example (cont.)
      > summary(g.lm)

      Call:
      lm(formula = mpg ~ disp + hp + wt + qsec, data = mtcars)

      Residuals:
          Min      1Q  Median      3Q     Max
      -3.8664 -1.5819 -0.3788  1.1712  5.6468

      Coefficients:
                   Estimate Std. Error t value Pr(>|t|)
      (Intercept) 27.329638   8.639032   3.164  0.00383 **
      disp         0.002666   0.010738   0.248  0.80576
      hp          -0.018666   0.015613  -1.196  0.24227
      wt          -4.609123   1.265851  -3.641  0.00113 **
      qsec         0.544160   0.466493   1.166  0.25362
      ---
      Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

      Residual standard error: 2.622 on 27 degrees of freedom
      Multiple R-squared:  0.8351, Adjusted R-squared:  0.8107
      F-statistic: 34.19 on 4 and 27 DF,  p-value: 3.311e-10

  11. Confidence Intervals and Significance Tests

      Example (cont.)
      And to find confidence intervals for the coefficients:
      > confint(g.lm)
                        2.5 %      97.5 %
      (Intercept)  9.60380809 45.05546784
      disp        -0.01936545  0.02469831
      hp          -0.05070153  0.01336912
      wt          -7.20643496 -2.01181027
      qsec        -0.41300458  1.50132521
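The wt interval above can be reproduced from the formula b_j ± t*(n - k - 1) SE_{b_j}, using the estimate and standard error printed by summary(g.lm). A quick Python check (SciPy assumed available):

```python
from scipy import stats

# Coefficient and standard error for wt from the summary(g.lm) output
b_wt, se_wt = -4.609123, 1.265851
df = 27  # n - k - 1 = 32 - 4 - 1

t_star = stats.t.ppf(0.975, df)  # critical value for a 95% interval
lo, hi = b_wt - t_star * se_wt, b_wt + t_star * se_wt
print(round(lo, 4), round(hi, 4))
```

This matches the (-7.2064, -2.0118) row that confint(g.lm) reports for wt.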

  12. ANOVA Tables for Multivariate Regression

  13. ANOVA Tables for Multivariate Regression

      Definition

          SS_A     \overset{\text{def}}{=} Sum of Squares of Model = \sum_{j=1}^{n} (\hat{y}_j - \bar{y})^2
          SS_E     \overset{\text{def}}{=} Sum of Squares of Error = \sum_{j=1}^{n} (y_j - \hat{y}_j)^2
          SS_{TOT} \overset{\text{def}}{=} Sum of Squares of Total = \sum_{j=1}^{n} (y_j - \bar{y})^2
          MS_A     \overset{\text{def}}{=} Mean Square of Model = SS_A / k
          MS_E     \overset{\text{def}}{=} Mean Square of Error = SS_E / (n - k - 1)

      Theorem
      SS_{TOT} = SS_A + SS_E.
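The decomposition SS_TOT = SS_A + SS_E can be verified numerically. A small Python sketch on simulated data (made-up coefficients; note the identity relies on the model including an intercept):

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 30, 3

# Design matrix with an intercept column, plus a made-up response
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
y = X @ np.array([1.0, 0.5, -0.5, 2.0]) + rng.normal(size=n)

# Fitted values from the least-squares fit
b, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ b
y_bar = y.mean()

SSA = np.sum((y_hat - y_bar) ** 2)  # model sum of squares
SSE = np.sum((y - y_hat) ** 2)      # error sum of squares
SSTOT = np.sum((y - y_bar) ** 2)    # total sum of squares

print(SSTOT, SSA + SSE)  # agree up to floating-point rounding
```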

  14. ANOVA Tables for Multivariate Regression

      Theorem (ANOVA F Test for Multivariate Regression)
      The test statistic for H_0: \beta_1 = \beta_2 = \cdots = \beta_k = 0 versus H_A: not H_0 is

          f = \frac{MS_A}{MS_E}.

      The p-value of the above test is P(F \geq f), where F \sim F(k, n - k - 1) under H_0.

      Statistical software usually summarizes the calculations and conclusion above in an ANOVA table:

      Definition (ANOVA Table)

          Source   df         SS       MS    F          p-value
          Model    k          SS_A     MS_A  MS_A/MS_E  P(F(k, n-k-1) >= f)
          Error    n - k - 1  SS_E     MS_E
          Total    n - 1      SS_TOT
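Since SS_A = R^2 SS_TOT and SS_E = (1 - R^2) SS_TOT, the F statistic can also be recovered from R^2 as f = (R^2/k) / ((1 - R^2)/(n - k - 1)). A Python check against the mtcars summary values (SciPy assumed available; only approximate, because the printed R^2 is rounded):

```python
from scipy import stats

# From the mtcars fit: R^2 = 0.8351 with n = 32 observations, k = 4 predictors
R2, n, k = 0.8351, 32, 4

# f = MS_A / MS_E, rewritten in terms of R^2
f_stat = (R2 / k) / ((1 - R2) / (n - k - 1))
p_value = stats.f.sf(f_stat, k, n - k - 1)
print(round(f_stat, 2), p_value)
```

This lands essentially on the F-statistic of 34.19 on 4 and 27 DF, with p-value around 3.3e-10, that summary(g.lm) reports.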

  15. ANOVA Tables for Multivariate Regression

      Definition
      The squared multiple correlation is given by

          R^2 \overset{\text{def}}{=} \frac{SS_A}{SS_{TOT}}.

      The multiple correlation coefficient is just R = \sqrt{R^2}.

      SS_A measures how much of the variation in the data is explained by the model. By taking the ratio of SS_A to the total amount of variation, SS_{TOT}, one obtains R^2, the portion of the variation that is explained by the model. In fact, R is just the correlation between the observations and the predicted values.

      Inflation Problem: As k increases, R^2 increases, but the increase in predictability is illusory. Solution: it is best to use the following.

      Definition
      The adjusted coefficient of determination is

          R^2_{adj} = 1 - \frac{n - 1}{n - k - 1}(1 - R^2).
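The adjusted-R^2 formula can be checked against the mtcars fit, where summary(g.lm) reports R^2 = 0.8351 with n = 32 and k = 4:

```python
# Adjusted R^2 from the quantities reported in summary(g.lm) for mtcars
R2, n, k = 0.8351, 32, 4
R2_adj = 1 - (n - 1) / (n - k - 1) * (1 - R2)
print(round(R2_adj, 4))  # -> 0.8107, matching the printed Adjusted R-squared
```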

  16. ANOVA Tables for Multivariate Regression

      > g.lm=lm(mpg~disp+hp+wt+qsec, data=mtcars)
      > summary(g.lm)

      Call:
      lm(formula = mpg ~ disp + hp + wt + qsec, data = mtcars)

      Residuals:
          Min      1Q  Median      3Q     Max
      -3.8664 -1.5819 -0.3788  1.1712  5.6468

      Coefficients:
                   Estimate Std. Error t value Pr(>|t|)
      (Intercept) 27.329638   8.639032   3.164  0.00383 **
      disp         0.002666   0.010738   0.248  0.80576
      hp          -0.018666   0.015613  -1.196  0.24227
      wt          -4.609123   1.265851  -3.641  0.00113 **
      qsec         0.544160   0.466493   1.166  0.25362
      ---
      Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

      Residual standard error: 2.622 on 27 degrees of freedom
      Multiple R-squared:  0.8351, Adjusted R-squared:  0.8107
      F-statistic: 34.19 on 4 and 27 DF,  p-value: 3.311e-10

      Over 80% of the variation is explained by the model, but it seems that only weight matters.
