BUS41100 Applied Regression Analysis
Week 8: Model Building 1
Partial F Test, Multiple testing, Out of Sample Prediction
Max H. Farrell
The University of Chicago Booth School of Business

Model Building
How do we know which X variables to include?
◮ Are any important to our study?
◮ What variables does the subject-area knowledge demand?
◮ Can the data help us decide?

Next two classes address these questions. Today we start with a simple approach: F-testing.
◮ How does regression 1 compare to regression 2?
◮ Limitations make for important lessons.
  ◮ Multiple testing
  ◮ Always need human input!
Partial F Test
Pick up where we left off: how employee ratings of their supervisor relate to performance metrics.

The Data:
Y: Overall rating of supervisor
X1: Handles employee complaints
X2: Opportunity to learn new things
X3: Does not allow special privileges
X4: Raises based on performance
X5: Overly critical of performance
X6: Rate of advancing to better jobs
> attach(supervisor)
> bosslm <- lm(Y ~ X1 + X2 + X3 + X4 + X5 + X6)
> summary(bosslm)

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 10.78708   11.58926   0.931 0.361634
X1           0.61319    0.16098   3.809 0.000903 ***
X2           0.32033    0.16852   1.901 0.069925 .
X3          -0.07305    0.13572  -0.538 0.595594
X4           0.08173    0.22148   0.369 0.715480
X5           0.03838    0.14700   0.261 0.796334
X6          -0.21706    0.17821  -1.218 0.235577

Residual standard error: 7.068 on 23 degrees of freedom
Multiple R-squared: 0.7326, Adjusted R-squared: 0.6628
F-statistic: 10.5 on 6 and 23 DF, p-value: 1.24e-05
The F test says that the regression as a whole is worthwhile. But it looks (from the t-statistics and p-values) as though only X1 and X2 have a significant effect on Y.
◮ What about a reduced model with only these two X's?
> summary(bosslm2 <- lm(Y ~ X1 + X2))

Coefficients:  ## abbreviated output:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   9.8709     7.0612   1.398    0.174
X1            0.6435     0.1185   5.432 9.57e-06 ***
X2            0.2112     0.1344   1.571    0.128

Residual standard error: 6.817 on 27 degrees of freedom
Multiple R-squared: 0.708, Adjusted R-squared: 0.6864
F-statistic: 32.74 on 2 and 27 DF, p-value: 6.058e-08
The full model (6 covariates) has R²full = 0.733, while the base model (2 covariates) has R²base = 0.708.

Is this difference worth 4 extra covariates? The R² will always increase as more variables are added.
◮ If you have more b's to tune, you can get a smaller SSE.
◮ Least squares is content to fit “noise” in the data.
◮ This is known as overfitting.

More parameters will always result in a “better fit” to the sample data, but will not necessarily lead to better predictions.

. . . And remember the coefficient interpretation changes.
Partial F-test
At first, we were asking: “Is this regression worthwhile?”
Now, we’re asking: “Is it useful to add extra covariates to the regression?”

You always want to use the simplest model possible.
◮ Only add covariates if they are truly informative.
◮ I.e., only if the extra complexity is useful.
Consider the regression model

Y = β_0 + β_1 X_1 + · · · + β_{dbase} X_{dbase} + β_{dbase+1} X_{dbase+1} + · · · + β_{dfull} X_{dfull} + ε

where
◮ dbase is the # of covariates in the base (small) model, and
◮ dfull > dbase is the # in the full (larger) model.

The partial F-test is concerned with the hypotheses

H0 : β_{dbase+1} = β_{dbase+2} = · · · = β_{dfull} = 0
H1 : at least one β_j ≠ 0 for j > dbase.
New test statistic:

f_partial = [(R²full − R²base) / (dfull − dbase)] / [(1 − R²full) / (n − dfull − 1)]

◮ Big f means that R²full − R²base is statistically significant.
◮ Big f means that at least one of the added X's is useful.
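The statistic can also be computed directly from the two R² values (a minimal sketch, using bosslm and bosslm2 from above; n = 30 in the supervisor data):

> R2.full <- summary(bosslm)$r.squared   # 0.7326
> R2.base <- summary(bosslm2)$r.squared  # 0.708
> n <- 30; d.full <- 6; d.base <- 2
> f <- ((R2.full - R2.base)/(d.full - d.base)) /
+      ((1 - R2.full)/(n - d.full - 1))  # f is about 0.53
> pf(f, d.full - d.base, n - d.full - 1, lower.tail=FALSE)
+ # p-value about 0.72, matching the anova() output below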
As always, this is super easy to do in R!

> anova(bosslm2, bosslm)
Analysis of Variance Table
Model 1: Y ~ X1 + X2
Model 2: Y ~ X1 + X2 + X3 + X4 + X5 + X6
  Res.Df    RSS Df Sum of Sq      F Pr(>F)
1     27 1254.7
2     23 1149.0  4    105.65 0.5287 0.7158

A p-value of 0.71 is not significant, so we stick with the null hypothesis and assume the base (2 covariate) model.

Partial-F is a fine way to compare two different regressions. But what if we have more?
Case study in interaction
Use census data to explore the relationship between log wage rate (log(income/hours)) and age—a proxy for experience.
[Figure: Male Income Curve and Female Income Curve. Panels show log wage rate (1 to 6) against age (18 to 54).]
We look at people earning >$5000, working >500 hrs, and <60 years old.
There is a discrepancy between mean log(WR) for men and women.
◮ Female wages flatten at about 30, while men’s keep rising.
> men <- sex=="M"
> malemean <- tapply(log.WR[men], age[men], mean)
> femalemean <- tapply(log.WR[!men], age[!men], mean)
[Figure: mean log wage rate by age, separate M and F curves.]
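The plot can be recreated along these lines (a sketch; the colors and legend placement are guesses, and the tapply output names must be converted back to numbers):

> ages <- as.numeric(names(malemean))
> plot(ages, malemean, type="l", col="blue",
+      xlab="age", ylab="mean log wage rate")
> lines(as.numeric(names(femalemean)), femalemean, col="pink")
> legend("bottomright", c("M","F"), col=c("blue","pink"), lty=1)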
The simplest model has E[log(WR)] = 2 + 0.016 · age.
> wagereg1 <- lm(log.WR ~ age)
[Figure: predicted log wage rate vs. age, one fitted line.]
◮ You get one line for both men and women.
Add a sex effect with E[log(WR)] = 1.9 + 0.016 · age + 0.2 · 1[sex=M].
> wagereg2 <- lm(log.WR ~ age + sex)
[Figure: predicted log wage rate vs. age, parallel M and F lines.]
◮ The male wage line is shifted up from the female line.
With interactions,
E[log(WR)] = 2.1 + 0.011 · age + (−0.13 + 0.009 · age) · 1[sex=M].
> wagereg3 <- lm(log.WR ~ age*sex)
[Figure: predicted log wage rate vs. age, M and F lines with different slopes.]
◮ The interaction term gives us different slopes for each sex.
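The two slopes can be read straight off the coefficients (a sketch assuming wagereg3 as fit above):

> b <- coef(wagereg3)
> b["age"]                  # female slope: about 0.011
> b["age"] + b["age:sexM"]  # male slope: about 0.011 + 0.009 = 0.020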
. . . & quadratics

E[log(WR)] = 0.9 + 0.077 · age − 0.0008 · age² + (−0.13 + 0.009 · age) · 1[sex=M].

> wagereg4 <- lm(log.WR ~ age*sex + age2)
[Figure: predicted log wage rate vs. age, curved M and F wage profiles.]
◮ age2 allows us to capture a nonlinear wage curve.
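Here age2 is presumably just the squared age, built as its own column beforehand; an equivalent formula-only version uses I() (a sketch):

> age2 <- age^2                                 # quadratic term as its own variable
> wagereg4 <- lm(log.WR ~ age*sex + age2)       # as above
> wagereg4b <- lm(log.WR ~ age*sex + I(age^2))  # identical fit, no new variable needed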
Finally, add an interaction term on the curvature (age²):

E[log(WR)] = 1 + .07 · age − .0008 · age² + (.02 · age − .00015 · age² − .34) · 1[sex=M].

> wagereg5 <- lm(log.WR ~ age*sex + age2*sex)
[Figure: log wage rate vs. age, M and F fitted curves overlaid on M and F data means.]
◮ This model provides a generally decent looking fit.
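The fitted curves come from predicting over an age grid, once per sex (a sketch; the grid endpoints and plotting details are assumptions):

> grid <- data.frame(age=18:60, age2=(18:60)^2)
> fitM <- predict(wagereg5, newdata=data.frame(grid, sex="M"))
> fitF <- predict(wagereg5, newdata=data.frame(grid, sex="F"))
> plot(grid$age, fitM, type="l", col="blue",
+      xlab="age", ylab="log wagerate")
> lines(grid$age, fitF, col="pink")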
We could also consider a model that has an interaction between age and edu.
◮ reg <- lm(log.WR ~ edu*age)

Maybe we don’t need the age main effect?
◮ reg <- lm(log.WR ~ edu*age - age)

Or perhaps all of the extra edu effects are unnecessary?
◮ reg <- lm(log.WR ~ edu*age - edu)

Which of these is the best?
Model Selection
Our job this lecture and next is to decide which X variables should be in our model.
◮ A good model summarizes the data but does not overfit.
◮ A good model answers the question at issue.
◮ Better predictions don’t matter if the model doesn’t answer the question.
Last week: a good regression model meets the assumptions.
◮ Especially important when the goal is inference/relationships.

Next week we will also discuss when a regression is causal.
◮ A causal model is only good when it meets even more assumptions.
What is the goal?
1. Relationship-type questions and inference?
◮ Are women paid differently than men on average?
  > lm(log.WR ~ sex)
◮ Does age/experience affect men and women differently?
  > lm(log.WR ~ age*sex - sex)
◮ No other models matter.

2. Data summarization?
◮ Is matching the trends enough?
◮ In the census data, we matched the blue/pink curves well with a simple(?) model.

3. Prediction?
◮ Need an objective criterion.
We need a method for selecting a final regression specification. Why not include all variables and be done with it?
◮ Bad forecasts
◮ Impossible to interpret
◮ Bad decision making

Over-fit ⇒ less general model
Overfitting
We have already seen overfitting twice:

1. Week 3: R² ↑ as more variables went into MLR

> c(summary(trucklm1)$r.square, summary(trucklm3)$r.square,
+   summary(trucklm6)$r.square)
[1] 0.021 0.511 0.693

2. Week 4: Classification error ↓ as more variables went into the logit

  full history   empty
 0.214   0.283   0.300
Fitting the data at hand better and better . . . but getting worse at predicting the next observation. How can we use the data to pick the model without relying on the data too much?
We need a method that is disciplined before we see the data.

First solution: Partial F test
◮ Objective criteria
◮ Rigorously grounded
◮ Significance level pre-set

But we're going to see some big downsides.
◮ These shortfalls have important lessons.
◮ General messages to carry with you.
Example: Back to the census wage data. Matching the curves gave us

E[log(WR) | age, sex] = 1 + .07 · age − .0008 · age² + (.02 · age − .00015 · age² − .34) · 1[sex=M].

But there were other possible variables:
◮ Education: 9 levels from none to PhD.
◮ Marital status: married, divorced, separated, or single.
◮ Race: white, black, Asian, other.

Should we consider other main effects / interactions?
> summary(wagereg5)

Call:
lm(formula = log.WR ~ age * sex + age2 * sex)

Residuals:
    Min      1Q  Median      3Q     Max
-2.3907 -0.3747 -0.0040  0.3480  3.3820

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  1.019e+00  7.078e-02  14.395  < 2e-16 ***
age          6.975e-02  3.814e-03  18.290  < 2e-16 ***
sexM        -3.391e-01  9.464e-02  -3.584  0.00034 ***
age2        -7.542e-04  4.878e-05 -15.461  < 2e-16 ***
age:sexM     2.078e-02  5.101e-03   4.074 4.63e-05 ***
sexM:age2   -1.548e-04  6.526e-05  -2.373  0.01767 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.5868 on 25397 degrees of freedom
Multiple R-squared: 0.1315, Adjusted R-squared: 0.1313
F-statistic: 769.1 on 5 and 25397 DF, p-value: < 2.2e-16
> summary(wagereg6 <- lm(log.WR ~ age*sex + age2*sex + ., data=Wages))

Coefficients:  ## output abbreviated
                     Estimate Std. Error t value Pr(>|t|)
(Intercept)         1.196e+00  6.744e-02  17.737  < 2e-16 ***
age                 4.657e-02  3.549e-03  13.123  < 2e-16 ***
sexM               -2.133e-01  8.594e-02  -2.482  0.01306 *
age2               -4.832e-04  4.510e-05 -10.715  < 2e-16 ***
raceAsian           1.397e-02  1.860e-02   0.751  0.45267
raceBlack          -3.165e-02  1.134e-02  -2.791  0.00525 **
raceNativeAmerican -7.479e-02  3.824e-02  -1.956  0.05048 .
raceOther          -8.112e-02  1.338e-02  -6.063 1.36e-09 ***
maritalDivorced    -6.981e-02  1.066e-02  -6.549 5.91e-11 ***
maritalSeparated   -1.381e-01  1.612e-02  -8.563  < 2e-16 ***
maritalSingle      -1.065e-01  9.413e-03 -11.316  < 2e-16 ***
maritalWidow       -1.502e-01  3.213e-02  -4.674 2.98e-06 ***
hsTRUE              1.499e-01  1.157e-02  12.947  < 2e-16 ***
assocTRUE           3.111e-01  1.146e-02  27.157  < 2e-16 ***
collTRUE            6.082e-01  1.278e-02  47.602  < 2e-16 ***
gradTRUE            7.970e-01  1.498e-02  53.203  < 2e-16 ***
age:sexM            1.876e-02  4.631e-03   4.051 5.12e-05 ***
sexM:age2          -1.721e-04  5.927e-05  -2.903  0.00369 **
Is it worthwhile to add all the main effects?
> anova(wagereg5, wagereg6)
Analysis of Variance Table
Model 1: log.WR ~ age * sex + age2 * sex
Model 2: log.WR ~ age * sex + age2 * sex + (age + age2 + sex + race +
    marital + hs + assoc + coll + grad)
  Res.Df    RSS Df Sum of Sq      F    Pr(>F)
1  25397 8744.8
2  25385 7187.4 12    1557.4 458.37 < 2.2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
⇒ The new variables are significant!
Bring in interactions of age with race and education:
> wagereg7 <- lm(log.WR ~ age*sex + age2*sex + marital +
+                (hs+assoc+coll+grad)*age + race*age, data=Wages)
> anova(wagereg6, wagereg7)
Analysis of Variance Table
Model 1: log.WR ~ age * sex + age2 * sex + (age + age2 + sex + race +
    marital + hs + assoc + coll + grad)
Model 2: log.WR ~ age * sex + age2 * sex + marital + (hs + assoc +
    coll + grad) * age + race * age
  Res.Df    RSS Df Sum of Sq      F    Pr(>F)
1  25385 7187.4
2  25377 7163.7  8    23.656 10.475 8.891e-15 ***
⇒ The new variables are significant too!
Three-way interaction!
> wagereg8 <- lm(log.WR ~ race*age*sex + age2*sex + marital +
+                (hs+assoc+coll+grad)*age, data=Wages)
> anova(wagereg7, wagereg8)
Analysis of Variance Table
Model 1: log.WR ~ age * sex + age2 * sex + marital + (hs + assoc +
    coll + grad) * age + race * age
Model 2: log.WR ~ race * age * sex + age2 * sex + marital + (hs +
    assoc + coll + grad) * age
  Res.Df    RSS Df Sum of Sq      F    Pr(>F)
1  25377 7163.7
2  25369 7145.8  8    17.957 7.9688 8.804e-11 ***
⇒ These additions appear to be useful too!
Can we get away without the race main effects? (-race)
> wagereg9 <- lm(log.WR ~ race*age*sex - race + age2*sex +
+                marital + (hs+assoc+coll+grad)*age, data=Wages)
> anova(wagereg8, wagereg9)
Model 1: log.WR ~ race * age * sex + age2 * sex + marital + (hs +
    assoc + coll + grad) * age
Model 2: log.WR ~ race * age * sex - race + age2 * sex + marital +
    (hs + assoc + coll + grad) * age
  Res.Df    RSS Df Sum of Sq      F Pr(>F)
1  25369 7145.8
2  25373 7146.0 -4  -0.20565 0.1825 0.9476
⇒ Reduced model is best.
Limitations
Testing is a difficult and imperfect way to compare models.
◮ You need a good prior sense of what model you want.
◮ H0 vs H1 is not designed for model search.
◮ What “direction” do you search?
◮ A p-value doesn’t measure how much better a model is, only a yes/no answer.
◮ Multiple Testing: If you use α = 0.05 = 1/20 to judge significance, then you expect to reject a true null about once every 20 tests.
Multiple testing
A big problem with using tests (t or F) for comparing models is the false discovery rate associated with multiple testing:
◮ If you do 20 tests of a true H0 with α = .05, you expect to see 1 false positive (i.e., you expect to reject a true null).

Suppose you have 100 predictors, but only 10 are useful.
◮ You find all 10 of them significant . . . but what else?
◮ Reject H0 for 5% of the 90 useless variables ⇒ 0.05 × 90 = 4.5 false positives!
◮ Final model has 10 + 4.5 = 14.5 variables ⇒ 4.5/14.5 ≈ 1/3 are junk.
◮ What happens if you set α = 0.01?

In some online marketing data, <1% of variables are useful.
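This arithmetic is easy to check by simulation (a hedged sketch with made-up sizes: n = 200 observations, 90 pure-noise predictors, and a response unrelated to all of them):

> set.seed(8)
> n <- 200; p <- 90
> X <- matrix(rnorm(n*p), n, p)  # 90 useless covariates
> y <- rnorm(n)                  # response is pure noise
> pvals <- summary(lm(y ~ X))$coefficients[-1, 4]  # p-values, intercept dropped
> sum(pvals < 0.05)              # around 0.05*90 = 4.5 false positives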
Data-Driven Model Selection
We need to find a trade-off between data fit and simplicity. The partial F test did this, but had problems:
1. You need a good prior sense of what model you want.
2. H0 vs H1 is not designed for model search.
3. What “direction” do you search?
4. A p-value doesn’t measure how much better a model is.
5. Multiple Testing

Data-driven variable selection solves all 5 issues. Use your head! Nothing is automatic.

Two steps:
1. Select the “universe of variables”.
2. Choose the best model.
The universe of variables is HUGE!
◮ includes all possible covariates that you think might have a linear effect on the response
◮ . . . and all squared terms . . . and all interactions . . .

You decide on this universe through your experience and discipline-based knowledge (and data availability).
◮ Consult subject matter research and experts.
◮ Consider carefully what variables have explanatory power, and how they should be transformed.
◮ If you can avoid it, don't just throw everything in.

This step is very important! And also difficult. . . . and sadly, not much we can do today.
Data Mining
“Data Mining” refers to tools that seek to uncover a small number of influential variables within large, high-dimensional datasets.
As mentioned before, this is a very hard problem:
◮ Since very few variables are influential, testing is useless.
◮ You cannot consider all transformations and interactions.
◮ It is easy to overfit, which leads to bad predictions.

For industrial mining, more powerful tools are needed. There are two full classes in this area: 41201 & 41204.
Out-of-sample prediction
How do we evaluate a forecasting model?
◮ Make predictions!

Basic Idea: We want to use the model to forecast outcomes for observations we have not seen before.
◮ Use the data to create a prediction problem. (coming up)
◮ See how our candidate models perform.

We'll use most of the data for training the model, and the left-over part for validating/testing it.
In a validation scheme, you
◮ fit a bunch of models to most of the data (training set)
◮ choose the one performing best on the rest (testing set).

For each model:
◮ Obtain b0, . . . , bd via least squares on the training data.
◮ Use the model to obtain fitted Ŷj = x′j b values for all of the ntest testing data points.
◮ Calculate the Mean Square Error for these predictions:

MSE = (1/ntest) Σ_{j=1}^{ntest} (Yj − Ŷj)²
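In R this is a one-liner (a minimal sketch; ytest and yhat are stand-in names for the held-out outcomes and their predictions):

> mse <- function(ytest, yhat) mean((ytest - yhat)^2)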
Example: Back to census data. We aim to predict log(wage rate) using demographics. We tried many regressions:
> wagereg6 <- lm(log.WR ~ age*sex + age2*sex + ., data=Wages)
> wagereg7 <- lm(log.WR ~ age*sex + age2*sex + marital +
+                (hs+assoc+coll+grad)*age + race*age, data=Wages)
> wagereg8 <- lm(log.WR ~ race*age*sex + age2*sex + marital +
+                (hs+assoc+coll+grad)*age, data=Wages)
> wagereg9 <- lm(log.WR ~ race*age*sex - race + age2*sex +
+                marital + (hs+assoc+coll+grad)*age, data=Wages)
F-tests showed wagereg9 was the best.
Out of sample validation steps:
1) Split the data into testing/training samples.
> training.samples <- sample.int(nrow(Wages), 0.75*nrow(Wages))
> train <- Wages[training.samples,]
> test <- Wages[-training.samples,]
2) Fit models on the training data
> wagereg6 <- lm(log.WR ~ age*sex + age2*sex + ., data=train)
## (wagereg7, wagereg8, and wagereg9 are refit on train the same way)
3) Predict on the test data
> error6 <- predict(wagereg6, newdata=test) - test$log.WR
## (error7, error8, and error9 are computed analogously)
4) Compute MSE
> c(error6=mean(error6^2), error7=mean(error7^2),
+   error8=mean(error8^2), error9=mean(error9^2))
   error6    error7    error8    error9
0.2982959 0.2972645 0.2975347 0.2974996
wagereg7 is the winner, but . . .
◮ Only by a small amount. (MSE is directly interpretable!)
◮ F-testing favored wagereg9. So which is better?
◮ Remember, test/train sampling is random. Cross validation can help. What’s that? (A first sketch follows below; more next class.)

Why did we only compare these models?
◮ We followed our nose down a very particular F-testing path. Any reason for that?
◮ Lots of other variables/interactions/etc.

We want to use the training sample to help select models for later comparison.
Coming Up Next
Next class: Use the training data to select models.
◮ The data will build models from scratch.
◮ We will go over a few methods, classical and modern, all useful.

Also next class: Causal inference
◮ What X variables do we need in order to claim “X causes Y”?
◮ Totally different goal; requires a different approach.

Last lecture: Advanced GLMs

Then final and projects!
Glossary and Equations
F-test
◮ H0 : β_{dbase+1} = β_{dbase+2} = · · · = β_{dfull} = 0.
◮ H1 : at least one β_j ≠ 0 for j > dbase.
◮ Null hypothesis distributions:

Total: f = [R²/(p − 1)] / [(1 − R²)/(n − p)] ∼ F(p−1, n−p)

Partial: f = [(R²full − R²base)/(dfull − dbase)] / [(1 − R²full)/(n − dfull − 1)] ∼ F(dfull−dbase, n−dfull−1)

The partial F-test vs t-test
◮ You have covariates X1, . . . , Xd, and want to add only X_{d+1}.
◮ Both tests have the same null and alternative: H0 : β_{d+1} = 0 vs H1 : β_{d+1} ≠ 0.
◮ Different test statistics, same p-value (see the sketch below).
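For one added covariate the two tests agree exactly: the partial f equals the squared t statistic, so the p-values coincide. A quick check (a sketch reusing the supervisor data from earlier, assuming it is still attached):

> small <- lm(Y ~ X1 + X2)
> big <- lm(Y ~ X1 + X2 + X3)
> anova(small, big)                  # partial F test for adding X3
> summary(big)$coefficients["X3",]   # same p-value; here f = t^2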