Chapter 7: Introduction to linear regression

OpenIntro Statistics, 3rd Edition. Slides developed by Mine Çetinkaya-Rundel of OpenIntro. The slides may be copied, edited, and/or shared via the CC BY-SA license. Some images may be included


  1. Slope
     The slope of the regression line can be calculated as $b_1 = \frac{s_y}{s_x} R$.
     In context: $b_1 = \frac{3.1}{3.73} \times (-0.75) = -0.62$
     Interpretation: For each additional percentage point in HS graduate rate, we would expect the percentage living in poverty to be lower on average by 0.62 percentage points.
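The slope calculation can be checked numerically. The sketch below plugs in the three summary statistics quoted on the slide; the function and variable names are illustrative and do not come from the slides.

```python
# Summary statistics quoted on the slide (poverty vs. HS graduation rate).
s_x = 3.73   # standard deviation of % HS grad (explanatory variable)
s_y = 3.1    # standard deviation of % in poverty (response variable)
R = -0.75    # correlation between the two variables


def slope_from_summary(s_x: float, s_y: float, r: float) -> float:
    """Least squares slope computed as b1 = (s_y / s_x) * R."""
    return (s_y / s_x) * r


b1 = slope_from_summary(s_x, s_y, R)
print(round(b1, 2))  # -0.62
```

The sign of the slope always matches the sign of R, because both standard deviations are positive.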

  4. Intercept
     The intercept is where the regression line intersects the y-axis. The calculation of the intercept uses the fact that a regression line always passes through $(\bar{x}, \bar{y})$:
     $b_0 = \bar{y} - b_1 \bar{x}$
     In context: $b_0 = 11.35 - (-0.62) \times 86.01 = 64.68$
     [Figure: scatterplot of % in poverty vs. % HS grad, with both axes extended to 0 and the intercept marked]
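The same arithmetic can be sketched in code. The means and the rounded slope are the values shown on the slide; the names are illustrative.

```python
# Values quoted on the slide.
x_bar = 86.01   # mean % HS grad
y_bar = 11.35   # mean % in poverty
b1 = -0.62      # slope, rounded as on the slide


def intercept_from_means(x_bar: float, y_bar: float, b1: float) -> float:
    """Intercept of the least squares line: b0 = y_bar - b1 * x_bar."""
    return y_bar - b1 * x_bar


b0 = intercept_from_means(x_bar, y_bar, b1)
print(round(b0, 2))  # 64.68

# The fitted line evaluated at x_bar returns y_bar, up to floating point error.
print(abs((b0 + b1 * x_bar) - y_bar) < 1e-9)  # True
```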

  5. Which of the following is the correct interpretation of the intercept?
     (a) For each % point increase in HS graduate rate, % living in poverty is expected to increase on average by 64.68%.
     (b) For each % point decrease in HS graduate rate, % living in poverty is expected to increase on average by 64.68%.
     (c) Having no HS graduates leads to 64.68% of residents living below the poverty line.
     (d) States with no HS graduates are expected on average to have 64.68% of residents living below the poverty line.
     (e) In states with no HS graduates % living in poverty is expected to increase on average by 64.68%.

  7. More on the intercept
     Since there are no states in the dataset with no HS graduates, the intercept is of no interest, not very useful, and also not reliable since the predicted value of the intercept is so far from the bulk of the data.
     [Figure: scatterplot of % in poverty vs. % HS grad, with both axes extended to 0 and the intercept marked]

  8. Regression line
     $\widehat{\%\text{ in poverty}} = 64.68 - 0.62 \times \%\text{ HS grad}$
     [Figure: scatterplot of % in poverty vs. % HS grad with the regression line]

  9. Interpretation of slope and intercept
     • Intercept: When x = 0, y is expected to equal the intercept.
     • Slope: For each unit increase in x, y is expected to increase / decrease on average by the slope.
     Note: These statements are not causal, unless the study is a randomized controlled experiment.

  10. Prediction
      • Using the linear model to predict the value of the response variable for a given value of the explanatory variable is called prediction. It is done simply by plugging the value of x into the linear model equation.
      • There will be some uncertainty associated with the predicted value.
      [Figure: scatterplot of % in poverty vs. % HS grad with the regression line]
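Plugging a value into the fitted equation can be sketched as follows. The coefficients are the ones from the regression line slide; the input of 82% is a hypothetical value chosen only for illustration, and the point prediction says nothing about the uncertainty around it.

```python
# Coefficients from the regression line slide.
B0 = 64.68   # intercept
B1 = -0.62   # slope


def predict_poverty(hs_grad_pct: float) -> float:
    """Point prediction of % in poverty for a given % HS grad."""
    return B0 + B1 * hs_grad_pct


# Hypothetical input: a state with an 82% HS graduation rate.
predicted = predict_poverty(82)
print(round(predicted, 2))  # 13.84
```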

  11. Extrapolation
      • Applying a model estimate to values outside of the realm of the original data is called extrapolation.
      • Sometimes the intercept might be an extrapolation.
      [Figure: scatterplot of % in poverty vs. % HS grad, with both axes extended to 0 and the intercept marked]

  12. Examples of extrapolation
      [Figures: three image-only slides, not recovered]

  17. Conditions for the least squares line
      1. Linearity
      2. Nearly normal residuals
      3. Constant variability

  20. Conditions: (1) Linearity
      • The relationship between the explanatory and the response variable should be linear.
      • Methods for fitting a model to non-linear relationships exist, but are beyond the scope of this class. If this topic is of interest, an Online Extra is available on openintro.org covering new techniques.
      • Check using a scatterplot of the data, or a residuals plot.

  22. Anatomy of a residuals plot
      RI: % HS grad = 81, % in poverty = 10.3
      $\widehat{\%\text{ in poverty}} = 64.68 - 0.62 \times 81 = 14.46$
      $e = \%\text{ in poverty} - \widehat{\%\text{ in poverty}} = 10.3 - 14.46 = -4.16$
      DC: % HS grad = 86, % in poverty = 16.8
      $\widehat{\%\text{ in poverty}} = 64.68 - 0.62 \times 86 = 11.36$
      $e = \%\text{ in poverty} - \widehat{\%\text{ in poverty}} = 16.8 - 11.36 = 5.44$
      [Figure: scatterplot of % in poverty vs. % HS grad with the residuals plot beneath]
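The two residuals can be reproduced with a short sketch. The observations for RI and DC and the coefficients are taken from the slides; the helper names are illustrative.

```python
# Coefficients from the regression line slide.
B0 = 64.68
B1 = -0.62


def predict(hs_grad_pct: float) -> float:
    """Predicted % in poverty for a given % HS grad."""
    return B0 + B1 * hs_grad_pct


def residual(hs_grad_pct: float, poverty_pct: float) -> float:
    """Residual = observed value minus predicted value."""
    return poverty_pct - predict(hs_grad_pct)


# (% HS grad, % in poverty) for the two highlighted observations.
observations = {"RI": (81, 10.3), "DC": (86, 16.8)}

residuals = {
    state: round(residual(x, y), 2) for state, (x, y) in observations.items()
}
print(residuals)  # {'RI': -4.16, 'DC': 5.44}
```

A negative residual (RI) means the observation lies below the line, and a positive residual (DC) means it lies above.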

  25. Conditions: (2) Nearly normal residuals
      • The residuals should be nearly normal.
      • This condition may not be satisfied when there are unusual observations that don't follow the trend of the rest of the data.
      • Check using a histogram or normal probability plot of residuals.
      [Figure: histogram of residuals (frequency) and normal Q-Q plot (sample quantiles vs. theoretical quantiles)]

  29. Conditions: (3) Constant variability
      • The variability of points around the least squares line should be roughly constant.
      • This implies that the variability of residuals around the 0 line should be roughly constant as well.
      • Also called homoscedasticity.
      • Check using a residuals plot.
      [Figure: scatterplot of % in poverty vs. % HS grad with the residuals plot beneath]

  30. Checking conditions
      What condition is this linear model obviously violating?
      (a) Constant variability (b) Linear relationship (c) Normal residuals (d) No extreme outliers
      [Figure: plot for this question not recovered]

  32. Checking conditions
      What condition is this linear model obviously violating?
      (a) Constant variability (b) Linear relationship (c) Normal residuals (d) No extreme outliers
      [Figure: plot for this question not recovered]

  38. $R^2$
      • The strength of the fit of a linear model is most commonly evaluated using $R^2$.
      • $R^2$ is calculated as the square of the correlation coefficient.
      • It tells us what percent of variability in the response variable is explained by the model.
      • The remainder of the variability is explained by variables not included in the model or by inherent randomness in the data.
      • For the model we've been working with, $R^2 = (-0.62)^2 = 0.38$.
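The two descriptions of $R^2$ (square of the correlation, and share of variability explained) agree for a least squares line with one explanatory variable. The sketch below checks this on a small dataset that is invented purely for illustration; it is not the state-level data from the slides, and all names are hypothetical.

```python
from math import sqrt

# Hypothetical data, invented for illustration only.
hs_grad = [80.0, 82.0, 84.0, 86.0, 88.0, 90.0]
poverty = [15.2, 13.9, 13.5, 11.0, 10.8, 9.1]


def mean(values):
    return sum(values) / len(values)


def sums_of_squares(x, y):
    """Return (S_xx, S_yy, S_xy): centered sums of squares and cross products."""
    x_bar, y_bar = mean(x), mean(y)
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_yy = sum((yi - y_bar) ** 2 for yi in y)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return s_xx, s_yy, s_xy


def correlation(x, y):
    """Sample correlation coefficient R."""
    s_xx, s_yy, s_xy = sums_of_squares(x, y)
    return s_xy / sqrt(s_xx * s_yy)


def explained_share(x, y):
    """1 - SS_residual / SS_total for the least squares line."""
    s_xx, s_yy, s_xy = sums_of_squares(x, y)
    b1 = s_xy / s_xx
    b0 = mean(y) - b1 * mean(x)
    ss_res = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    return 1 - ss_res / s_yy


r = correlation(hs_grad, poverty)
r_squared = r ** 2
share = explained_share(hs_grad, poverty)

# The two quantities match up to floating point error.
print(abs(r_squared - share) < 1e-9)
```

Squaring removes the sign of R, so $R^2$ alone does not say whether the relationship is positive or negative.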

  39. Interpretation of $R^2$
      Which of the below is the correct interpretation of $R = -0.62$, $R^2 = 0.38$?
      (a) 38% of the variability in the % of HS graduates among the 51 states is explained by the model.
      (b) 38% of the variability in the % of residents living in poverty among the 51 states is explained by the model.
      (c) 38% of the time % HS graduates predict % living in poverty correctly.
      (d) 62% of the variability in the % of residents living in poverty among the 51 states is explained by the model.
      [Figure: scatterplot of % in poverty vs. % HS grad with the regression line]

  45. Poverty vs. region (east, west)
      $\widehat{\text{poverty}} = 11.17 + 0.38 \times \text{west}$
      • Explanatory variable: region, reference level: east
      • Intercept: The estimated average poverty percentage in eastern states is 11.17%.
        • This is the value we get if we plug in 0 for the explanatory variable.
      • Slope: The estimated average poverty percentage in western states is 0.38 percentage points higher than in eastern states.
        • Then, the estimated average poverty percentage in western states is 11.17 + 0.38 = 11.55%.
        • This is the value we get if we plug in 1 for the explanatory variable.
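Plugging 0 and 1 into the fitted equation can be sketched as below. The coefficients come from the slide; the function name and the 0/1 encoding helper are illustrative.

```python
# Coefficients from the slide: poverty-hat = 11.17 + 0.38 * west
INTERCEPT = 11.17
WEST_COEF = 0.38


def estimated_poverty(region: str) -> float:
    """Estimated average poverty % for 'east' (reference level) or 'west'."""
    if region not in ("east", "west"):
        raise ValueError("region must be 'east' or 'west'")
    west = 1 if region == "west" else 0  # indicator variable
    return INTERCEPT + WEST_COEF * west


print(round(estimated_poverty("east"), 2))  # 11.17
print(round(estimated_poverty("west"), 2))  # 11.55
```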

  46. Poverty vs. region (northeast, midwest, west, south)
      Which region (northeast, midwest, west, or south) is the reference level?

                       Estimate  Std. Error  t value  Pr(>|t|)
      (Intercept)          9.50        0.87    10.94      0.00
      region4midwest       0.03        1.15     0.02      0.98
      region4west          1.79        1.13     1.59      0.12
      region4south         4.16        1.07     3.87      0.00

      (a) northeast (b) midwest (c) west (d) south (e) cannot tell

  48. Poverty vs. region (northeast, midwest, west, south)
      Which region (northeast, midwest, west, or south) has the lowest poverty percentage?

                       Estimate  Std. Error  t value  Pr(>|t|)
      (Intercept)          9.50        0.87    10.94      0.00
      region4midwest       0.03        1.15     0.02      0.98
      region4west          1.79        1.13     1.59      0.12
      region4south         4.16        1.07     3.87      0.00

      (a) northeast (b) midwest (c) west (d) south (e) cannot tell
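One way to read the table is to turn the coefficients into an estimated average for each region. The sketch below assumes the usual indicator-variable coding, in which the level that has no row of its own (here northeast) is the reference level and the intercept is its estimated average. The estimates are copied from the table; the names are illustrative.

```python
# Estimates copied from the regression table.
coefficients = {
    "(Intercept)": 9.50,
    "region4midwest": 0.03,
    "region4west": 1.79,
    "region4south": 4.16,
}

# Assumption: the level without its own row is the reference level.
REFERENCE_LEVEL = "northeast"
PREFIX = "region4"


def estimated_means(coefs: dict, reference: str) -> dict:
    """Estimated average poverty % per region: intercept plus the region's offset."""
    base = coefs["(Intercept)"]
    means = {reference: base}
    for name, estimate in coefs.items():
        if name.startswith(PREFIX):
            means[name[len(PREFIX):]] = round(base + estimate, 2)
    return means


means = estimated_means(coefficients, REFERENCE_LEVEL)
print(means)
print(min(means, key=means.get))  # region with the lowest estimated average
```

Because every listed offset is positive, the reference level has the lowest estimated average in this table.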

  50. Types of outliers in linear regression

  51. Types of outliers
      How do outliers influence the least squares line in this plot?
      To answer this question think of where the regression line would be with and without the outlier(s). Without the outliers the regression line would be steeper, and lie closer to the larger group of observations. With the outliers the line is pulled up and away from some of the observations in the larger group.
      [Figure: scatterplot with the residuals plot beneath]

  53. Types of outliers
      How do outliers influence the least squares line in this plot?
      Without the outlier there is no evident relationship between x and y.
      [Figure: scatterplot with the residuals plot beneath]

  57. Some terminology
      • Outliers are points that lie away from the cloud of points.
      • Outliers that lie horizontally away from the center of the cloud are called high leverage points.
      • High leverage points that actually influence the slope of the regression line are called influential points.
      • In order to determine if a point is influential, visualize the regression line with and without the point. Does the slope of the line change considerably? If so, then the point is influential. If not, then it's not an influential point.
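The with-and-without comparison can also be done numerically. The sketch below fits a least squares slope twice on a small dataset invented for illustration (a flat cloud plus one hypothetical point far to the right); the data, names, and the point itself are assumptions and do not come from the slides.

```python
def least_squares_slope(points):
    """Slope of the least squares line through a list of (x, y) pairs."""
    n = len(points)
    x_bar = sum(x for x, _ in points) / n
    y_bar = sum(y for _, y in points) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in points)
    s_xx = sum((x - x_bar) ** 2 for x, _ in points)
    return s_xy / s_xx


# Hypothetical cloud with no evident trend.
cloud = [(1, 5.1), (2, 4.8), (3, 5.3), (4, 4.9),
         (5, 5.2), (6, 5.0), (7, 4.7), (8, 5.1)]

# Hypothetical high leverage point: far from the cloud horizontally.
candidate = (20, 15.0)

slope_without = least_squares_slope(cloud)
slope_with = least_squares_slope(cloud + [candidate])
change = slope_with - slope_without

print(round(slope_without, 3), round(slope_with, 3), round(change, 3))
```

How large a change counts as "considerable" is a judgment call that depends on the scale of the data; the slides leave it to visual inspection.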

  58. Influential points
      Data are available on the log of the surface temperature and the log of the light intensity of 47 stars in the star cluster CYG OB1.
      [Figure: scatterplot of log(light intensity) vs. log(temp), with least squares lines fit with and without the outliers]

  59. Types of outliers
      Which of the below best describes the outlier?
      (a) influential (b) high leverage (c) none of the above (d) there are no outliers
      [Figure: scatterplot with the residuals plot beneath]

  62. Types of outliers
      Does this outlier influence the slope of the regression line?
      Not much...
      [Figure: scatterplot with the residuals plot beneath]

  63. Recap
      Which of the following is true?
      (a) Influential points always change the intercept of the regression line.
      (b) Influential points always reduce $R^2$.
      (c) It is much more likely for a low leverage point to be influential than a high leverage point.
      (d) When the data set includes an influential point, the relationship between the explanatory variable and the response variable is always nonlinear.
      (e) None of the above.
