SLIDE 1

Gov 2000: 9. Regression with Two Independent Variables

Matthew Blackwell

Fall 2016

SLIDE 2
  • 1. Why Add Variables to a Regression?
  • 2. Adding a Binary Covariate
  • 3. Adding a Continuous Covariate
  • 4. OLS Mechanics with Two Covariates
  • 5. OLS Assumptions with Two Covariates
  • 6. Omitted Variable Bias
  • 7. Goodness of Fit & Multicollinearity

SLIDE 3

Where are we? Where are we going?

[Figure: course progress along an axis labeled "Number of Covariates in Our Regressions" (2 to 10), with "Last Week" marked at the low end.]

SLIDE 4

Where are we? Where are we going?

[Figure: the same axis, "Number of Covariates in Our Regressions" (2 to 10), now marking "Last Week" and "This Week".]

SLIDE 5

Where are we? Where are we going?

[Figure: the same axis, "Number of Covariates in Our Regressions" (2 to 10), marking "Last Week", "This Week", and "Next Week".]

SLIDE 6

1/ Why Add Variables to a Regression?

SLIDE 7

SLIDE 8

Berkeley gender bias

  • Graduate admissions data from Berkeley, 1973
  • Acceptance rates:

    ▶ Men: 8442 applicants, 44% admission rate
    ▶ Women: 4321 applicants, 35% admission rate

  • Evidence of discrimination against women in admissions?
  • This is a marginal relationship.
  • What about the conditional relationship within departments?

SLIDE 9

Berkeley gender bias, II

  • Within departments:

               Men                  Women
    Dept   Applied  Admitted    Applied  Admitted
    A          825       62%        108       82%
    B          560       63%         25       68%
    C          325       37%        593       34%
    D          417       33%        375       35%
    E          191       28%        393       24%
    F          373        6%        341        7%

  • Within departments, women do somewhat better than men!
  • Women apply to more challenging departments.
  • Marginal relationship (admissions and gender) ≠ conditional relationship given a third variable (department).
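One can verify the reversal by recomputing both sets of rates from the table above. A minimal R sketch using only the numbers on this slide (the pooled rates cover just these six departments, so the women's pooled rate differs a bit from the campus-wide 35%):

men.app <- c(825, 560, 325, 417, 191, 373)
men.rate <- c(0.62, 0.63, 0.37, 0.33, 0.28, 0.06)
wom.app <- c(108, 25, 593, 375, 393, 341)
wom.rate <- c(0.82, 0.68, 0.34, 0.35, 0.24, 0.07)

## marginal (pooled) rates: weight each department by its applicant count
sum(men.app * men.rate) / sum(men.app)  ## about 0.45
sum(wom.app * wom.rate) / sum(wom.app)  ## about 0.30

## conditional rates: women do as well or better in most departments
rbind(men.rate, wom.rate)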

SLIDE 10

Simpson’s paradox

[Figure: scatterplot of Y against X with points in two groups, Z = 0 and Z = 1.]

  • Overall, a positive relationship between Y_i and X_i.

SLIDE 11

Simpson’s paradox

[Figure: the same scatterplot of Y against X, grouped by Z = 0 and Z = 1.]

  • Overall, a positive relationship between Y_i and X_i.
  • But within levels of Z_i, the opposite.
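To make the paradox concrete, here is a small simulated example (all numbers invented for illustration) in which the pooled slope is positive but the slope within each level of Z_i is negative:

set.seed(2138)
n <- 100
z <- rbinom(n, 1, 0.5)           ## binary group indicator
x <- rnorm(n, mean = 2 * z)      ## z = 1 shifts x to the right
y <- 3 * z - 0.5 * x + rnorm(n)  ## within groups, x is negatively related to y

coef(lm(y ~ x))      ## pooled slope: positive
coef(lm(y ~ x + z))  ## conditional on z, the slope is about -0.5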

SLIDE 12

Basic idea

  • Old goal: estimate the mean of Y as a function of some independent variable, X: E[Y_i | X_i].
  • For continuous X’s, we modeled the CEF/regression function with a line:

    Y_i = β_0 + β_1 X_i + u_i

  • New goal: estimate the relationship between two variables, Y_i and X_i, conditional on a third variable, Z_i:

    Y_i = β_0 + β_1 X_i + β_2 Z_i + u_i

  • The β’s are the population parameters we want to estimate.

SLIDE 13

Why control for another variable

  • Descriptive
    ▶ Get a sense for the relationships in the data.
    ▶ Conditional on the number of steps I’ve taken, do higher activity levels correlate with lower weight?
  • Predictive
    ▶ We can usually make better predictions about the dependent variable with more information on independent variables.
  • Causal
    ▶ Block potential confounding, which is when X doesn’t cause Y but only appears to because a third variable Z causally affects both of them.

SLIDE 14

Plan of attack

  • 1. Interpretation with a binary Z_i
  • 2. Interpretation with a continuous Z_i
  • 3. Mechanics of OLS with 2 covariates
  • 4. OLS assumptions with 2 covariates:
    ▶ Omitted variable bias
    ▶ Multicollinearity

SLIDE 15

What we won’t cover in lecture

  • 1. The OLS formulas for 2 covariates
  • 2. Proofs
  • 3. The second covariate being a function of the first: Z_i = X_i²
  • 4. Hypothesis tests/confidence intervals (almost exactly the same)

SLIDE 16

2/ Adding a Binary Covariate

SLIDE 17

Example

[Figure: Log GDP per capita (4-11) against Strength of Property Rights (2-10), with African and non-African countries marked separately.]

SLIDE 18

Basics

  • Ye olde model:

    E[Y_i | X_i] = α_0 + α_1 X_i

    ▶ (α_0, α_1) are the bivariate intercept/slope, e_i is the bivariate error.

  • Concern: AJR might be picking up an “African effect”:
    ▶ African countries might have low incomes and weak property rights.
  • Condition on a country being in Africa or not to remove this:

    E[Y_i | X_i, Z_i] = β_0 + β_1 X_i + β_2 Z_i

    ▶ Z_i = 1 indicates that i is an African country
    ▶ Z_i = 0 indicates that i is a non-African country
    ▶ Effects are now within Africa or within non-Africa, not between.

SLIDE 19

AJR model

ajr.mod <- lm(logpgp95 ~ avexpr + africa, data = ajr)
summary(ajr.mod)

##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   5.6556     0.3134   18.04   <2e-16 ***
## avexpr        0.4242     0.0397   10.68   <2e-16 ***
## africa       -0.8784     0.1471   -5.97    3e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.625 on 108 degrees of freedom
##   (52 observations deleted due to missingness)
## Multiple R-squared: 0.708, Adjusted R-squared: 0.702
## F-statistic: 131 on 2 and 108 DF, p-value: <2e-16

SLIDE 20

Two lines, one regression

  • How can we interpret this model?
  • Plug in the two possible values of Z_i and rearrange.
  • When Z_i = 0:

    Ŷ_i = β̂_0 + β̂_1 X_i + β̂_2 Z_i = β̂_0 + β̂_1 X_i + β̂_2 × 0 = β̂_0 + β̂_1 X_i

  • When Z_i = 1:

    Ŷ_i = β̂_0 + β̂_1 X_i + β̂_2 Z_i = β̂_0 + β̂_1 X_i + β̂_2 × 1 = (β̂_0 + β̂_2) + β̂_1 X_i

  • Two different intercepts, same slope. (A sketch recovering both lines follows.)
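The two implied lines can be read directly off the fitted model. A minimal sketch, reusing the ajr.mod object from the previous slide:

b <- coef(ajr.mod)

## non-African countries (africa = 0): intercept and slope
unname(c(b["(Intercept)"], b["avexpr"]))

## African countries (africa = 1): the intercept shifts by the africa
## coefficient; the slope is unchanged
unname(c(b["(Intercept)"] + b["africa"], b["avexpr"]))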

SLIDE 21

Interpretation of the coefficients

                                    Intercept for X_i    Slope for X_i
    Non-African country (Z_i = 0)   β̂_0                  β̂_1
    African country (Z_i = 1)       β̂_0 + β̂_2            β̂_1

  • In this example, we have:

    Ŷ_i = 5.656 + 0.424 × X_i − 0.878 × Z_i

  • β̂_0: the average log income for a non-African country (Z_i = 0) with property rights measured at 0 is 5.656.
  • β̂_1: a one-unit increase in property rights is associated with a 0.424 increase in average log income, comparing two African countries (or two non-African countries).
  • β̂_2: there is a −0.878 average difference in log income per capita between African and non-African countries, conditional on property rights.

SLIDE 22

General interpretation of the coefficients

    Ŷ_i = β̂_0 + β̂_1 X_i + β̂_2 Z_i

  • β̂_0: the average value of Y_i when both X_i and Z_i are equal to 0
  • β̂_1: a one-unit increase in X_i is associated with a β̂_1-unit change in Y_i for units with the same value of Z_i
  • β̂_2: the average difference in Y_i between the Z_i = 1 group and the Z_i = 0 group for units with the same value of X_i

SLIDE 23

Adding a binary variable, visually

[Figure: Log GDP per capita against Strength of Property Rights, with the fitted line for non-African countries; β̂_0 = 5.656, β̂_1 = 0.424.]

SLIDE 24

Adding a binary variable, visually

[Figure: the same plot with both fitted lines; the line for African countries is shifted down from β̂_0 to β̂_0 + β̂_2, with β̂_0 = 5.656, β̂_1 = 0.424, β̂_2 = −0.878.]

SLIDE 25

Marginal vs conditional

[Figure: Log GDP per capita against Strength of Property Rights, comparing the marginal fit with the conditional within-group fits.]

SLIDE 26

3/ Adding a Continuous Covariate

SLIDE 27

Adding a continuous variable

  • Ye olde model:

    E[Y_i | X_i] = α_0 + α_1 X_i

  • New concern: geography is confounding the effect:
    ▶ geography might affect political institutions
    ▶ geography might affect average incomes (through diseases like malaria)
  • Condition on Z_i, the mean temperature in country i (continuous):

    E[Y_i | X_i, Z_i] = β_0 + β_1 X_i + β_2 Z_i

SLIDE 28

AJR model, revisited

ajr.mod2 <- lm(logpgp95 ~ avexpr + meantemp, data = ajr)
summary(ajr.mod2)

##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   6.8063     0.7518    9.05  1.3e-12 ***
## avexpr        0.4057     0.0640    6.34  3.9e-08 ***
## meantemp     -0.0602     0.0194   -3.11    0.003 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.643 on 57 degrees of freedom
##   (103 observations deleted due to missingness)
## Multiple R-squared: 0.615, Adjusted R-squared: 0.602
## F-statistic: 45.6 on 2 and 57 DF, p-value: 1.48e-12

SLIDE 29

Interpretation with a continuous Z

    Z_i (mean temp.)   Intercept for X_i     Slope for X_i
    0 °C               β̂_0                   β̂_1
    21 °C              β̂_0 + β̂_2 × 21        β̂_1
    24 °C              β̂_0 + β̂_2 × 24        β̂_1
    26 °C              β̂_0 + β̂_2 × 26        β̂_1

  • In this example we have:

    Ŷ_i = 6.806 + 0.406 × X_i − 0.06 × Z_i

  • β̂_0: the average log income for a country with property rights measured at 0 and a mean temperature of 0 is 6.806.
  • β̂_1: a one-unit increase in property rights is associated with a 0.406 change in average log income, conditional on a country’s mean temperature.
  • β̂_2: a one-degree increase in mean temperature is associated with a −0.06 change in average log income, conditional on the strength of property rights.

SLIDE 30

General interpretation

    Ŷ_i = β̂_0 + β̂_1 X_i + β̂_2 Z_i

  • The coefficient β̂_1 measures how the predicted outcome varies in X_i for units with the same value of Z_i.
  • The coefficient β̂_2 measures how the predicted outcome varies in Z_i for units with the same value of X_i.
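One way to see the parallel-lines interpretation is to predict at several temperatures while holding property rights fixed. A sketch reusing ajr.mod2, with an arbitrarily chosen property-rights score of 7:

## predictions at four temperatures, avexpr held at 7
predict(ajr.mod2, newdata = data.frame(avexpr = 7, meantemp = c(0, 21, 24, 26)))
## each additional degree lowers the prediction by the same 0.06,
## whatever the value of avexpr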

SLIDE 31

4/ OLS Mechanics with Two Covariates

SLIDE 32

Fitted values and residuals

  • Where do we get our hats, β̂_0, β̂_1, β̂_2?
  • Fitted values for i = 1, …, n:

    Ŷ_i = Ê[Y_i | X_i, Z_i] = β̂_0 + β̂_1 X_i + β̂_2 Z_i

  • Residuals for i = 1, …, n:

    û_i = Y_i − Ŷ_i

  • Minimize the sum of the squared residuals, just like before:

    (β̂_0, β̂_1, β̂_2) = argmin_{b_0, b_1, b_2} Σ_{i=1}^{n} (Y_i − b_0 − b_1 X_i − b_2 Z_i)²

  • We’ll derive closed-form estimators with arbitrary covariates next week.
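Since the closed-form formulas wait until next week, we can check the definition directly by minimizing the sum of squared residuals numerically. A sketch, assuming the ajr data from the running example:

## drop rows with missing values so optim() and lm() use the same sample
dat <- na.omit(ajr[, c("logpgp95", "avexpr", "meantemp")])

ssr <- function(b) {
  sum((dat$logpgp95 - b[1] - b[2] * dat$avexpr - b[3] * dat$meantemp)^2)
}

optim(c(0, 0, 0), ssr)$par                          ## numerical minimizer
coef(lm(logpgp95 ~ avexpr + meantemp, data = dat))  ## matches up to optimizer tolerance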

SLIDE 33

OLS estimator recipe using two steps

  • No explicit OLS formulas this week, but a recipe instead.
  • The “partialling out” OLS recipe:
  • 1. Run a regression of X_i on Z_i:

    X̂_i = Ê[X_i | Z_i] = δ̂_0 + δ̂_1 Z_i

  • 2. Calculate the residuals from this regression:

    r̂_i = X_i − X̂_i

  • 3. Run a simple regression of Y_i on the residuals, r̂_i:

    Ŷ_i = α̂_0 + α̂_1 r̂_i

  • The estimate α̂_1 will be equivalent to β̂_1 from the “big” regression:

    Ŷ_i = β̂_0 + β̂_1 X_i + β̂_2 Z_i

SLIDE 34

First regression

  • Regress X_i on Z_i:

## when missing data exists, we need the na.action in order
## to place residuals or fitted values back into the data
ajr.first <- lm(avexpr ~ meantemp, data = ajr, na.action = na.exclude)
summary(ajr.first)

##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   9.9568     0.8202    12.1  < 2e-16 ***
## meantemp     -0.1490     0.0347    -4.3 0.000067 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.32 on 58 degrees of freedom
##   (103 observations deleted due to missingness)
## Multiple R-squared: 0.241, Adjusted R-squared: 0.228
## F-statistic: 18.4 on 1 and 58 DF, p-value: 0.0000673

SLIDE 35

Regression of log income on the residuals

  • Save the residuals:

## store the residuals
ajr$avexpr.res <- residuals(ajr.first)

  • Now we compare the estimated slopes:

coef(lm(logpgp95 ~ avexpr.res, data = ajr))

## (Intercept)  avexpr.res
##      8.0543      0.4057

coef(lm(logpgp95 ~ avexpr + meantemp, data = ajr))

## (Intercept)    avexpr  meantemp
##     6.80627   0.40568  -0.06025

SLIDE 36

Residual/partial regression plot

  • We can plot the conditional relationship between property rights and income given temperature:

[Figure: Log GDP per capita (6-10) plotted against the residuals from Property Rights ~ Mean Temperature (−3 to 3).]

SLIDE 37

5/ OLS Assumptions with Two Covariates

SLIDE 38

OLS assumptions for unbiasedness

  • The simple regression assumptions for unbiasedness/consistency of OLS:
  • 1. Linearity
  • 2. Random/iid sample
  • 3. Variation in X_i
  • 4. Zero conditional mean error: E[u_i | X_i] = 0

  • Small modifications to these with 2 covariates:
  • 1. Linearity:

    Y_i = β_0 + β_1 X_i + β_2 Z_i + u_i

  • 2. Random/iid sample
  • 3. No perfect collinearity
  • 4. Zero conditional mean error (both X_i and Z_i unrelated to u_i):

    E[u_i | X_i, Z_i] = 0

SLIDE 39

New assumption

Assumption 3: No perfect collinearity
(1) No independent variable is constant in the sample, and (2) there are no exact linear relationships among the independent variables.

  • Two components:
  • 1. Both X_i and Z_i have to vary.
  • 2. Z_i cannot be a deterministic, linear function of X_i.
  • Part 2 rules out anything of the form:

    Z_i = a + b X_i

  • What’s the correlation between Z_i and X_i in that case? Exactly 1 (or −1 if b < 0)!

SLIDE 40

Perfect collinearity example

  • Simple example:
    ▶ X_i = 1 if a country is not in Africa and 0 otherwise.
    ▶ Z_i = 1 if a country is in Africa and 0 otherwise.
  • But clearly we have the following:

    Z_i = 1 − X_i

  • These two variables are perfectly collinear.
  • What about the following?
    ▶ X_i = property rights
    ▶ Z_i = X_i²
  • Do we have to worry about collinearity here?
  • No! While Z_i is a deterministic function of X_i, it is a nonlinear function of X_i, so the assumption is not violated (see the check below).
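Indeed, R happily fits a nonlinear transformation alongside the original variable. A one-line check with the ajr data:

## avexpr^2 is a deterministic but nonlinear function of avexpr, so
## nothing gets dropped
coef(lm(logpgp95 ~ avexpr + I(avexpr^2), data = ajr))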

SLIDE 41

R and perfect collinearity

  • R, Stata, et al. will drop one of the variables if there is perfect collinearity:

ajr$nonafrica <- 1 - ajr$africa
summary(lm(logpgp95 ~ africa + nonafrica, data = ajr))

##
## Coefficients: (1 not defined because of singularities)
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   8.7164     0.0899   96.94  < 2e-16 ***
## africa       -1.3612     0.1631   -8.35  4.9e-14 ***
## nonafrica         NA         NA      NA       NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.913 on 146 degrees of freedom
##   (15 observations deleted due to missingness)
## Multiple R-squared: 0.323, Adjusted R-squared: 0.318
## F-statistic: 69.7 on 1 and 146 DF, p-value: 4.87e-14

SLIDE 42

6/ Omitted Variable Bias

SLIDE 43

Unbiasedness revisited

  • Long regression:

    Y_i = β_0 + β_1 X_i + β_2 Z_i + u_i

  • Assumptions 1-4 ⇒ OLS is unbiased for β_0, β_1, β_2.
  • What happens if we ignore Z_i and just run the simple linear regression on X_i alone?
  • Short regression:

    Y_i = α_0 + α_1 X_i + u*_i

  • OLS estimates from the short regression: (α̂_0, α̂_1)
  • Question: will E[α̂_1] = β_1? If not, what will be the difference?

SLIDE 44

Deriving the short regression

  • How can we relate α_1 to β_1?
    ▶ The short regression will be unbiased for the CEF of Y_i given just X_i.
  • Write the “short CEF” in terms of the “long” regression model:

    E[Y_i | X_i] = E[β_0 + β_1 X_i + β_2 Z_i + u_i | X_i]
                 = β_0 + β_1 X_i + β_2 E[Z_i | X_i] + E[u_i | X_i]

  • By assumption 4, X_i is unrelated to the long-regression error, so E[u_i | X_i] = 0:

    E[Y_i | X_i] = β_0 + β_1 X_i + β_2 E[Z_i | X_i]

SLIDE 45

Deriving the short regression

    E[Y_i | X_i] = β_0 + β_1 X_i + β_2 E[Z_i | X_i]

  • Let E[Z_i | X_i] = δ_0 + δ_1 X_i be the (population) CEF from a regression of Z_i on X_i.
  • Then we can write the short CEF as:

    E[Y_i | X_i] = β_0 + β_1 X_i + β_2 (δ_0 + δ_1 X_i)
                 = (β_0 + β_2 δ_0) + (β_1 + β_2 δ_1) X_i
                 = α_0 + α_1 X_i

  • Under these assumptions, short-regression OLS is unbiased for α_1:

    E[α̂_1] = α_1 = β_1 + β_2 δ_1

SLIDE 46

Omitted variable bias

  • Omitted variable bias: the bias in the short-regression slope, relative to the long-regression coefficient β_1, from omitting Z_i:

    Bias(α̂_1) = E[α̂_1] − β_1 = β_2 δ_1

  • In other words, omitted variable bias is:

    (“effect” of Z_i on Y_i) × (“effect” of X_i on Z_i)
    (omitted → outcome) × (included → omitted)
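In-sample, this decomposition holds exactly for the OLS estimates. A sketch verifying α̂_1 = β̂_1 + β̂_2 δ̂_1 on a common estimation sample, using the ajr data again:

dat <- na.omit(ajr[, c("logpgp95", "avexpr", "meantemp")])
short <- coef(lm(logpgp95 ~ avexpr, data = dat))             ## alpha-hats
long  <- coef(lm(logpgp95 ~ avexpr + meantemp, data = dat))  ## beta-hats
aux   <- coef(lm(meantemp ~ avexpr, data = dat))             ## delta-hats

short["avexpr"]                                    ## alpha_1-hat
long["avexpr"] + long["meantemp"] * aux["avexpr"]  ## identical, by construction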

SLIDE 47

Omitted variable bias, summary

  • Remember that, by OLS, the “effect” of X_i on Z_i is:

    δ_1 = cov(Z_i, X_i) / var(X_i)

  • We can summarize the direction of the bias like so:

               cov(X_i, Z_i) > 0   cov(X_i, Z_i) < 0   cov(X_i, Z_i) = 0
    β_2 > 0    Positive bias       Negative bias       No bias
    β_2 < 0    Negative bias       Positive bias       No bias
    β_2 = 0    No bias             No bias             No bias

  • Very relevant if Z_i is unobserved for some reason!

SLIDE 48

Including irrelevant variables

  • What if we do the opposite and include an irrelevant variable?
  • What would it mean for Z_i to be an irrelevant variable?

    Y_i = β_0 + β_1 X_i + 0 × Z_i + u_i

  • So in this case the true value of β_2 = 0. But under Assumptions 1-4, OLS is unbiased for all the parameters:

    E[β̂_0] = β_0,  E[β̂_1] = β_1,  E[β̂_2] = 0

  • Including an irrelevant variable will, however, increase the standard errors for β̂_1 when Z_i is correlated with X_i, as the simulation below illustrates.
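A small simulation (all numbers invented) showing that the estimate stays unbiased while its standard error grows:

set.seed(2138)
n <- 1000
x <- rnorm(n)
z <- x + rnorm(n)          ## correlated with x, but irrelevant for y
y <- 1 + 2 * x + rnorm(n)  ## the true coefficient on z is 0

summary(lm(y ~ x))$coef["x", c("Estimate", "Std. Error")]
summary(lm(y ~ x + z))$coef["x", c("Estimate", "Std. Error")]
## both estimates are near the true 2; the second standard error is larger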

SLIDE 49

7/ Goodness of Fit & Multicollinearity

SLIDE 50

Prediction error

  • How do we judge how well a regression fits the data?
  • How much does X_i help us predict Y_i?
  • Prediction errors without X_i:
    ▶ The best prediction is the mean, Ȳ.
    ▶ The prediction error is called the total sum of squares (SS_tot):

    SS_tot = Σ_{i=1}^{n} (Y_i − Ȳ)²

  • Prediction errors with X_i:
    ▶ The best predictions are the fitted values, Ŷ_i.
    ▶ The prediction error is the sum of the squared residuals (SS_res):

    SS_res = Σ_{i=1}^{n} (Y_i − Ŷ_i)²

SLIDE 51

Total SS vs. SSR

[Figure: total prediction errors, the vertical deviations of each Y_i from the mean Ȳ, on the Log GDP per capita vs. Strength of Property Rights scatterplot.]

SLIDE 52

Total SS vs. SSR

[Figure: residuals, the vertical deviations of each Y_i from its fitted value Ŷ_i, on the same scatterplot.]

SLIDE 53

R-squared

  • Regression will always improve the in-sample fit: SS_tot ≥ SS_res.
  • How much better does using X_i do? The coefficient of determination, or R²:

    R² = (SS_tot − SS_res) / SS_tot = 1 − SS_res / SS_tot

  • R² is the fraction of the total prediction error eliminated by conditioning on X_i.
  • Common interpretation: R² is the fraction of the variation in Y_i that is “explained by” X_i.
    ▶ R² = 0 means no linear relationship
    ▶ R² = 1 implies a perfect linear fit
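A sketch computing R² from its definition and checking it against what summary() reports, using ajr.mod2 from earlier:

y.used <- model.frame(ajr.mod2)$logpgp95  ## outcome on the estimation sample
ss.tot <- sum((y.used - mean(y.used))^2)
ss.res <- sum(residuals(ajr.mod2)^2)

1 - ss.res / ss.tot          ## by hand
summary(ajr.mod2)$r.squared  ## as reported: about 0.615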

SLIDE 54

Sampling variance for bivariate regression

  • Under simple linear regression and homoskedasticity, the sampling variance of the slope was:

    V[β̂_1 | X] = σ²_u / Σ_{i=1}^{n} (X_i − X̄)² = σ²_u / ((n − 1) s²_X)

  • Factors affecting the standard errors:
    ▶ The error variance σ²_u (a higher conditional variance of Y_i leads to bigger SEs)
    ▶ The sample variance of X_i, s²_X (lower variation in X_i leads to bigger SEs)
    ▶ The sample size n (a higher sample size leads to lower SEs)

SLIDE 55

Sampling variation with 2 covariates

  • Regression with an additional independent variable:

    V[β̂_1 | X_i, Z_i] = σ²_u / ((1 − R²_1)(n − 1) s²_X)

  • Here, R²_1 is the R² from the regression of X_i on Z_i:

    X̂_i = δ̂_0 + δ̂_1 Z_i

  • Factors now affecting the standard errors:
    ▶ The error variance: σ²_u
    ▶ The sample variance of X_i: s²_X
    ▶ The sample size n
    ▶ The strength of the (linear) relationship between X_i and Z_i (stronger relationships mean a higher R²_1 and thus bigger SEs)
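The inflation factor 1/(1 − R²_1) can be computed directly. A sketch for the running example, where Z_i is mean temperature:

r2.1 <- summary(lm(avexpr ~ meantemp, data = ajr))$r.squared
r2.1            ## about 0.24 on this sample
1 / (1 - r2.1)  ## the variance inflation factor for the avexpr slope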

SLIDE 56

Multicollinearity

Definition
Multicollinearity is defined to be high, but not perfect, correlation between two independent variables in a regression.

  • With multicollinearity, we’ll have R²_1 ≈ 1, but not exactly.
  • The stronger the relationship between X_i and Z_i, the closer R²_1 will be to 1, and the higher the SEs will be:

    V[β̂_1 | X_i, Z_i] = σ²_u / ((1 − R²_1)(n − 1) s²_X)

  • Given the symmetry, it will also increase var(β̂_2).

SLIDE 57

Intuition for multicollinearity

  • Remember the OLS recipe:
    ▶ r̂_i are the residuals from the regression of X_i on Z_i
    ▶ β̂_1 comes from the regression of Y_i on r̂_i
  • Estimated coefficient:

    β̂_1 = cov(r̂_i, Y_i) / var(r̂_i)   (sample covariance over sample variance)

  • When Z_i and X_i have a strong relationship, the residuals will have low variation.
  • We explain away a lot of the variation in X_i through Z_i.

SLIDE 58

Multicollinearity, visualized

[Figure: scatterplots of X against Z under a weak X-Z relationship and a strong X-Z relationship.]

SLIDE 59

Multicollinearity, visualized

[Figure: Y plotted against the residuals from X ~ Z, for the weak and strong X-Z cases; under the strong relationship the residuals have much less variation.]

SLIDE 60

Multicollinearity, visualized

[Figure: Y against the residuals from X ~ Z, weak vs. strong X-Z relationship, as on the previous slide.]

SLIDE 61

Effects of multicollinearity

  • No effect on the bias of OLS.
  • Only increases the standard errors.
  • Really just a sample size problem:
    ▶ If X_i and Z_i are extremely highly correlated, you’re going to need a much bigger sample to accurately differentiate between their effects.

SLIDE 62

Conclusion

  • In this brave new world with 2 independent variables:
  • 1. The β’s have slightly different interpretations.
  • 2. OLS still minimizes the sum of the squared residuals.
  • 3. Adding or omitting variables in a regression can affect the bias and the variance of OLS.
  • Remainder of class:
  • 1. Regression in its most general glory (matrices)
  • 2. How to diagnose and fix violations of the OLS assumptions