  1. Lecture 1: Introduction to Regression

  2. An Example: Explaining State Homicide Rates  What kinds of variables might we use to explain/predict state homicide rates?  Let's consider just one predictor for now: poverty (ignoring omitted variables and measurement error).  How might this be related to homicide rates?

  3. Poverty and Homicide  These data are located here: http://www.public.asu.edu/~gasweete/crj604/data/hom_pov.dta   Download these data and create a scatterplot in Stata.  Does there appear to be a relationship between poverty and homicide? What is the correlation?
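The correlation the slide asks for can be sketched in Python (the lecture itself uses Stata's `scatter` and `corr`). The poverty and homicide values below are made-up illustration numbers, not the actual hom_pov.dta data:

```python
# Hand-rolled Pearson correlation on hypothetical data (not hom_pov.dta).
import math

pov = [10.0, 12.5, 15.0, 17.5, 20.0]   # hypothetical poverty rates (%)
hom = [3.0, 4.1, 5.5, 6.2, 8.0]        # hypothetical homicides per 100,000

n = len(pov)
xbar, ybar = sum(pov) / n, sum(hom) / n
sxy = sum((x - xbar) * (y - ybar) for x, y in zip(pov, hom))
sxx = sum((x - xbar) ** 2 for x in pov)
syy = sum((y - ybar) ** 2 for y in hom)
r = sxy / math.sqrt(sxx * syy)         # Pearson correlation coefficient
print(round(r, 3))
```

With the real state data the correlation will of course differ; the point is only the mechanics of the calculation.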

  4. Scatterplots and correlations  Scatterplots with correlations of a) +1.00; b) −0.50; c) +0.85; and d) +0.15.

  5. Poverty and Homicide  There appears to be some relationship between poverty and homicide rates, but it's not perfect: there is a lot of "noise," which we will attribute to unobserved factors and random error.

  6. Poverty and Homicide, cont.  There is some nonzero expected homicide rate in the absence of poverty (β_0 > 0).  We expect homicide rates to increase as poverty rates increase (β_1 > 0).  Thus, y = β_0 + β_1 x + u.  This is the Population Regression Function.

  7. Poverty and Homicide, Sample Regression Function:  y_i = β̂_0 + β̂_1 x_i + û_i.  y_i is the dependent variable, the homicide rate, which we are trying to explain.  β̂_0 represents our estimate of what the homicide rate would be in the absence of poverty.*  β̂_1 is our estimate of the "effect" of a higher poverty rate on homicide.  û_i is a "noise" term reflecting other things that influence homicide rates.  *This is extrapolation outside the range of the data. Not recommended.

  8. Poverty and Homicide, cont.  y_i = β̂_0 + β̂_1 x_i + û_i.  Only y_i and x_i are directly observable in the equation above. The task of a regression analysis is to provide estimates of the slope and intercept terms.  The relationship is assumed to be linear: an increase in x is associated with an increase in y, with the same expected change in homicide going from 6% to 7% poverty as from 15% to 16%.
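The linearity assumption on this slide can be sketched numerically: under a linear model, the predicted change in y for any one-unit change in x is the slope, regardless of where you start. The intercept and slope below are arbitrary placeholders, not the lecture's estimates:

```python
# Linearity: the same one-unit change in x yields the same change in
# predicted y everywhere on the line. b0 and b1 are hypothetical values.
b0, b1 = 1.0, 0.5

def predict(x):
    return b0 + b1 * x

change_low = predict(7) - predict(6)     # going from 6% to 7% poverty
change_high = predict(16) - predict(15)  # going from 15% to 16% poverty
assert change_low == change_high == b1
```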

  9. . twoway (scatter homrate poverty) (lfit homrate poverty)  The fitted line gives β̂_0 = −.973 and β̂_1 = .475.

  10. Ordinary Least Squares  y_i = −.973 + .475 x_i + û_i.  Substantively, what do these estimates mean?  −.973 is the expected homicide rate if poverty rates were zero. This is never the case, except perhaps in the case of a zombie apocalypse, so it's not a meaningful estimate.  .475 is the effect of a 1-unit increase in the poverty rate on the homicide rate. You need to know how you are measuring poverty; in this case, a 1-unit increase is an increase of 1 percentage point.  So a 1-percentage-point increase (not "percent increase") in the poverty rate is associated with an increase of .475 homicides per 100,000 people in the state. In AZ, this would be ~31 homicides.
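The "~31 homicides in AZ" figure is a back-of-envelope conversion of the per-100,000 rate to a count. The Arizona population used below is an approximation (roughly 6.5 million around the time of the lecture's data), not a number from the slides:

```python
# Converting the slope (homicides per 100,000 per percentage point of
# poverty) into a statewide count, using an approximate AZ population.
slope = 0.475                 # from the fitted regression on the slide
az_population = 6_500_000     # approximate Arizona population (assumption)

extra_homicides = slope * az_population / 100_000
print(round(extra_homicides, 1))   # roughly 31
```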

  11. Ordinary Least Squares  y_i = −.973 + .475 x_i + û_i.  How did we arrive at this estimate? Why did we draw the line exactly where we did?  Minimize the sum of the "squared error," aka Ordinary Least Squares (OLS) estimation:  min Σ_{i=1}^{n} (Y_i − Ŷ_i)².  Why squared error?  Why vertical error (not perpendicular)?
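The minimization claim can be checked numerically: among candidate lines, the least-squares fit has the smallest sum of squared vertical errors. This sketch uses made-up data and the closed-form estimates the deck derives on the later Wooldridge slides:

```python
# OLS minimizes the sum of squared vertical errors: perturbing either
# coefficient away from the closed-form estimates only increases the SSE.
def sse(b0, b1, xs, ys):
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))

xs = [1.0, 2.0, 3.0, 4.0, 5.0]       # hypothetical predictor values
ys = [2.1, 3.9, 6.2, 7.8, 10.1]      # hypothetical outcomes

# Closed-form OLS estimates.
n = len(xs)
xbar, ybar = sum(xs) / n, sum(ys) / n
b1_hat = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
          / sum((x - xbar) ** 2 for x in xs))
b0_hat = ybar - b1_hat * xbar

best = sse(b0_hat, b1_hat, xs, ys)
assert best <= sse(b0_hat + 0.1, b1_hat, xs, ys)   # nudging the intercept hurts
assert best <= sse(b0_hat, b1_hat + 0.1, xs, ys)   # nudging the slope hurts
```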

  12. Ordinary Least Squares Estimates  min Σ_{i=1}^{n} (y_i − (β̂_0 + β̂_1 x_i))².  Solving for the minimum requires calculus (set the derivative with respect to each β to 0 and solve).  The book shows how we can go from some basic assumptions to estimates for β_0 and β_1 without using calculus.  I will go through two different ways to obtain these estimates: Wooldridge's and Khan's (khanacademy.org).

  13. Ordinary Least Squares: Estimating the intercept (Wooldridge's method)  Assuming that the average value of the error term is zero, E(u) = 0, it is a trivial matter to calculate β_0 once we know β_1:  u = y − β_0 − β_1 x;  E(y − β_0 − β_1 x) = 0;  ȳ − β̂_0 − β̂_1 x̄ = 0;  β̂_0 = ȳ − β̂_1 x̄.

  14. Ordinary Least Squares: Estimating the intercept (Wooldridge)  Incidentally, these last equations also imply that the regression line passes through the point corresponding to the mean of x and the mean of y, (x̄, ȳ):  β̂_0 = ȳ − β̂_1 x̄  ⇒  ȳ = β̂_0 + β̂_1 x̄.
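The point-of-means property follows directly from the intercept formula and is easy to verify numerically. Data below are made up for illustration:

```python
# The fitted OLS line passes through (xbar, ybar): evaluating the line at
# the mean of x returns the mean of y, since b0_hat = ybar - b1_hat * xbar.
xs = [2.0, 4.0, 6.0, 8.0]   # hypothetical predictor values
ys = [1.0, 2.5, 2.0, 4.5]   # hypothetical outcomes

n = len(xs)
xbar, ybar = sum(xs) / n, sum(ys) / n
b1_hat = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
          / sum((x - xbar) ** 2 for x in xs))
b0_hat = ybar - b1_hat * xbar

assert abs((b0_hat + b1_hat * xbar) - ybar) < 1e-12
```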

  15. Ordinary Least Squares: Estimating the slope (Wooldridge)  First, we use the fact that the expected value of the error term is zero, E(u) = 0, to generate a new equation equal to zero:  y_i = β̂_0 + β̂_1 x_i + û_i;  û_i = y_i − β̂_0 − β̂_1 x_i;  (1/n) Σ_{i=1}^{n} (y_i − β̂_0 − β̂_1 x_i) = 0.  We saw this before, but here I use the exact formula used in the book.

  16. Ordinary Least Squares: Estimating the slope (Wooldridge)  We can multiply this last equation by x_i, since the covariance between x and u is assumed to be zero, Cov(x, u) = E(xu) = 0, and the terms in the parentheses are equal to û:  (1/n) Σ_{i=1}^{n} x_i (y_i − β̂_0 − β̂_1 x_i) = 0.  Next, we plug in our formula for the intercept, β̂_0 = ȳ − β̂_1 x̄, and simplify:  (1/n) Σ_{i=1}^{n} x_i (y_i − (ȳ − β̂_1 x̄) − β̂_1 x_i) = 0;  Σ_{i=1}^{n} x_i ((y_i − ȳ) − β̂_1 (x_i − x̄)) = 0.

  17. Ordinary Least Squares: Estimating the slope (Wooldridge)  Re-arranging . . .  Σ x_i ((y_i − ȳ) − β̂_1 (x_i − x̄)) = 0;  Σ x_i (y_i − ȳ) − β̂_1 Σ x_i (x_i − x̄) = 0;  Σ x_i (y_i − ȳ) = β̂_1 Σ x_i (x_i − x̄).

  18. Ordinary Least Squares: Estimating the slope (Wooldridge)  Re-arranging . . . (since Σ x̄(y_i − ȳ) = 0 and Σ x̄(x_i − x̄) = 0, we can replace x_i with (x_i − x̄) in each sum):  Σ (x_i − x̄)(y_i − ȳ) = β̂_1 Σ (x_i − x̄)²;  β̂_1 = Σ (x_i − x̄)(y_i − ȳ) / Σ (x_i − x̄)² = cov(x, y) / var(x).  Interestingly, the final result leads us to the relationship between the covariance of x and y and the variance of x.

  19. Ordinary Least Squares: Estimates (Khan's method)  Khan starts with the actual points, and elaborates how these points are related to the squared error: the square of the distance between each point (x_n, y_n) and the line y = mx + b = β_1 x + β_0.

  20. Ordinary Least Squares: Estimates (Khan's method)  The vertical distance between any point (x_n, y_n) and the regression line y = β_1 x + β_0 is simply y_n − (β_1 x_n + β_0).  Total Error = (y_1 − (β_1 x_1 + β_0)) + (y_2 − (β_1 x_2 + β_0)) + … + (y_n − (β_1 x_n + β_0)).  It would be trivial to minimize the total error: we could set β_1 (the slope) equal to zero and β_0 equal to the mean of y, and then the total error would be zero.  Another approach is to minimize the absolute differences, but this actually creates thornier math problems than squaring the differences, and results in situations where there is not a unique solution.  In short, what we want is the sum of the squared error (SE), which means we have to square every term in that equation.
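The slide's point about unsquared "total error" being a bad criterion can be shown in a few lines: with slope zero and intercept equal to the mean of y, positive and negative errors cancel exactly. Data are made up:

```python
# Why total (unsquared) error is useless: deviations from the mean sum to
# zero, so the flat line through ybar gets a "perfect" total error of 0.
ys = [2.0, 5.0, 8.0]        # hypothetical outcomes
b0 = sum(ys) / len(ys)      # intercept = mean of y
b1 = 0.0                    # slope = 0

total_error = sum(y - (b1 * x + b0) for x, y in enumerate(ys))
assert abs(total_error) < 1e-12
```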

  21. Ordinary Least Squares: Estimates (Khan's method)  SE = (y_1 − (β_1 x_1 + β_0))² + (y_2 − (β_1 x_2 + β_0))² + … + (y_n − (β_1 x_n + β_0))².  We need to find the β_1 and β_0 that minimize the SE. Let's expand this out.  To be clear, the subscripts for the β estimates just refer to our two regression line estimates, whereas the subscripts for our x's and y's refer to the first observation, second observation, and so on.  SE = (y_1² − 2 y_1 (β_1 x_1 + β_0) + (β_1 x_1 + β_0)²) + … + (y_n² − 2 y_n (β_1 x_n + β_0) + (β_1 x_n + β_0)²)  = y_1² − 2 y_1 β_1 x_1 − 2 y_1 β_0 + β_1² x_1² + 2 β_1 x_1 β_0 + β_0² + … + y_n² − 2 y_n β_1 x_n − 2 y_n β_0 + β_1² x_n² + 2 β_1 x_n β_0 + β_0².
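The expansion on this slide is just algebra, so the compact and expanded forms of SE must agree for any candidate line. A quick numeric check on made-up data and an arbitrary (non-OLS) line:

```python
# The compact squared-error sum and Khan's term-by-term expansion give
# the same value for any choice of b0, b1. All numbers are hypothetical.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 2.5, 4.0]
b0, b1 = 0.5, 1.0   # arbitrary candidate line, not the OLS solution

compact = sum((y - (b1 * x + b0)) ** 2 for x, y in zip(xs, ys))
expanded = sum(y**2 - 2*y*b1*x - 2*y*b0 + (b1*x)**2 + 2*b1*x*b0 + b0**2
               for x, y in zip(xs, ys))
assert abs(compact - expanded) < 1e-9
```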
